EDN logo


Design Feature: May 9, 1996

Content-addressable memories add processing power to embedded systems

Tom Weldon,
MUSIC Semiconductors

CAMs enable you to apply associative processing to designs that require list searches and data translation. The newest CAMs make it practical to embed these tasks in high-speed systems using simple controllers.

The latest generation of content-addressable memories (CAMs) makes it possible to handle list searches and data translation as embedded functions within systems. The combination of a CAM and a state machine creates an economical controller for real-time processes that need to perform look-ups, data translations, and entry maintenance in sparsely populated tables—ones with few entries compared to the address space required for direct look-up.

By simplifying many of the tasks required to deal with lists and tables of data, a CAM can offer you one of two possible benefits. You can simplify the controller by switching to a lower cost processor or a state machine. Or, you can dedicate the additional processing power that the CAM provides to additional features and functions. In either case, more processing power is available to an embedded controller at a lower cost.

A CAM is a hardware implementation of associative processing. Associative processing manipulates data based on matching, or associating, an input value with other values stored in an array. Associative-processing hardware incorporates a limited amount of computational capability at each memory location and allows you to examine an entire memory array at once. CAMs combine these functions with a control structure to perform associative processing.

This type of processing stands in contrast to today's predominant computational architecture: the Von Neumann engine. This architecture relies on address-based memories to store instructions and data. A few computational elements (most often one) sequentially process the instructions, operating on a single element of data at a time. Parallel-computing systems collect many of these processors, either operating together or independently, into large arrays.

Application: network bridge/router
Network bridges link two network segments while trying to reduce network traffic. They reduce the traffic by filtering packets that have their source and destination on the same segment. Routers link network segments that use different protocols, and routers also usually include bridging functions. These applications maintain station lists to track the location of different nodes on the network. Each port can have its own station list, or all the ports can share a station list.

The bridge/router reads the destination addresses from the packet header and checks the address against the list for the filtering decision. The bridge/router builds a station list by reading source addresses from packet headers and adding previously un- known addresses to the list.

An address in the station list may have information associated with it, such as a port ID or a time stamp for aging old, unused entries. The associated time stamp becomes associative data when it is time to search for old entries to purge from the list, a quirk that matches well with the capabilities of a CAM with a mask register.

This application requires 48 bits of associative data for the network ad-dress, giving 16 bits of associated data. Although you will also search on the time stamps that are stored with the associated data, you initially configure the memory as 48 bits of CAM. For time-stamp compares, you reconfigure the memory to 64 bits of CAM and use mask register 1 to limit the comparison to just the time-stamp bits.

The filter routine compares the destination address of an incoming packet to the station list. The match indication tells the bridge how to handle the packet (forward or filter). If a match is found, moving a previously stored time stamp in the RAM segment (segment 0) updates the time stamp for the matching entry. Reading the central station list's associated data finds the port ID. You need not perform any other CAM operations at this time; however, you must unlock the daisy chain before the next operation.

The learn routine checks the source address against the known addresses in the station list. The four data writes load the 48-bit source address, the current time stamp, and the port ID and then trigger a compare on the source address. If a match exists, the routine updates the time stamp and port ID by moving them through mask register 2. If no match exists, a nonoperating command unlocks the daisy chain before moving the station address, time stamp, and port ID to the next free location, which adds them to the station list.

When the station list is full (or earlier if you believe in preventive maintenance), the purge routine clears old, unused entries to make locations available for new entries. The control register configures the memory array to 64 bits of CAM using mask register 1 for compares, which limits matches to the time-stamp bits only. The routine loads the time stamp to be purged into segment 0 of the comparand register, triggering a compare. Setting the validity bits on all matching entries to empty clears all the old entries in a single cycle. Finally, the purge routine loads a new time stamp into segment 0 of the comparand register and restores the memory configuration to 48 bits of CAM with no mask register for the next destination-address comparison.


CAMs implement associative processing

Associative-processing theory has been around for a long time. However, the first large CAMs were expensive, and only a few specialized applications could effectively use small CAMs. Now, larger, general-purpose CAMs allow you to use associative processing to solve a wider variety of problems. Several vendors make devices that are part of this latest generation of CAMs. Although some differences exist among vendors in array organization, interface specifics, and product variety, many similarities among offerings also exist. The following application examples are based on MUSIC Semiconductors' MU9C1480 LANCAM.

A CAM compares an input value, the comparand, to all the associative data stored in the memory array at once. The output from a CAM can be either a flag that indicates one or more matches or associated data that is related in some way to the matched values.

The external connections to this type of memory device are simple. A data bus provides command inputs (Table 1) and data I/O. Output flags indicate the comparison results and indicate when the memory is full. Flag inputs allow you to connect multiple devices in a daisy chain without external logic, thereby expanding the number of words in the memory array. Table 2 lists the CAM memory and data locations.

Table 1—LANCAM Instruction Summary
MOV dest,src[mr][,V]Move data
VBC memdest,vbSet validity bits
CMP vbCompare on specified validity types
NOPNo operation
SPS srcSet persistent data source
SPD dest[mr][,vb]Set persistent data destination
TCO contrlregRead or write to configuration and control registers
SFFSet Full Flag for initialization of a daisy chain
src: data register or memory-array location
dest: data register or memory-array location
memdest: memory-array location
mr: Mask Register 1 or 2 (moves to and from
comparand or memory array)
V: Set Validity bits Valid (memory-array destinations only)
vb: any validity bit setting (V, E, S, R)
contrlreg: source or destination for command read or write

Table 2—CAM memory and data locations
Associative memory array locations:
HMHighest priority matching memory location
NFHighest priority empty (Next Free) memory location (destination only)
ALMAll matching memory locations (validity bits only)
RAM-array locations:
[AR]
aaaH
Memory at the location specified by the address register. Memory at a specified hex address
Data Registers: CRComparand Register
MR1Mask Register 1
MR2Mask Register 2

Note that the CAM logic symbol has no external address lines (Figure 1). In a CAM, most memory accesses are associative, that is, based on the contents of the memory, so external address lines are unnecessary for most operations. An internal address register, accessible through the command structure of the device, allows address-based random access of the CAM array when necessary. Two mask registers provide bit-level control of the comparand register for compare operations, allowing comparisons on a limited number of bits for special situations. You can also use the mask registers to update specific memory-array bits during moves from the comparand register.

Storing the associative data within the CAM device minimizes the component count in an embedded system. However, if the total number of associative and associated bits exceeds the width of the CAM, you can store the associated data in an external RAM (Figure 2).



Application: ATM switch
An asynchronous-transfer-mode (ATM) switch performs many functions. The switch must search internal tables that hold the necessary information for each connection that routes through the switch. The index to these tables is the virtual-path identifier (VPI) or the VPI/virtual-channel-identifier (VCI) combination from the header of an incoming data cell. The switch uses this information to look up the VPI and VCI for the outgoing link, the internal path through the switch to the correct output port, the billing rates for accounting purposes, traffic-flow parameters, and flags for any special functions incorporated into the switch.

A CAM configured for such an application requires at least 28 bits of associative data for matching VPI/VCI combinations and, possibly, more bits to include a port ID if several input ports share the CAM. Using a mask register for VPI-only look-ups lets you combine the channel and path lists, which, in turn, allows you to make trade-offs between speed and hardware simplicity.

The accompanying figure shows the VPI and VCI in separate segments, but the bit-level comparison control of the mask register allows you to distribute the identifiers across the two segments in any convenient fashion. Because of the amount of data that you can associate with each connection, this example uses external RAM for associated data storage; the CAM match address is the address of the associative data in the RAM.

The search routine loads the VPI and VCI into the comparand register with two data writes. A command read attempts to fetch the match address. If a match exists, the routine uses the match address to look up the VPI/VCI associated data.

If no match occurred, a TCO command to the control register switches to mask register 1 and triggers a second search of the database for a VPI-only match. A second command-read cycle again attempts to read the match address. The match address for this comparison looks up the associated data for the VPI. Another TCO CT command switches back to no mask register for the next VPI/VCI comparison.

The add routine runs when setting up a new connection. The new VPI or VPI/VCI combination loads into the comparand register and moves to the next free location in the CAM with a MOV NF,CR,V instruction. A forced compare matches the new entry, so that a command read of the status register recovers the entry's CAM address. The match address then serves as the external RAM address for storing the associated data.

The delete routine runs when tearing down a connection. The routine loads the entry to be removed into the comparand register. A VBC ALM,E command purges all matching entries from the list by changing the validity bits to empty (even though having more than one matching entry should be considered an error condition). The routine need not clear the associated data from the RAM; this data will not be accessed until the next time the corresponding CAM location is used.


CAMs fit into embedded systems

You commonly find CAMs in two types of applications: data filters and data translators. Data filters search a table or list and generate a flag that indicates whether the filter finds the input comparand. An example of a data filter is an address filter in network bridges, routers, and firewalls. Data translators search a table and output data associated with the comparand. Examples are switches (header translation), caches (main storage to cache location translation), and compression (symbol translation).

Various boxes accompanying this article show typical instruction routines for a number of these applications, including network bridges, ATM switches, disk caches, branch tables, and compression. Also see Table 3, which shows the corresponding CAM cycle keys. By studying these routines, you can see how a CAM's associative properties simplify not only the search sequence, but also the maintenance of a data list. The use of a CAM reduces these tasks to operations that simple state machines can easily handle.

Table 3—CAM cycle key
CycleType/CM/W/EC
C2RCommand readLOWHIGHHIGH
CWCommand writeLOWLOWHIGH
DRData readHIGHHIGHHIGH
DWData writeHIGHLOWHIGH
CRECommand read, /EC LOWLOWHIGHLOW
CWECommand write, /EC LOWLOWLOWLOW
DREData Read, /EC LOWHIGHHIGHLOW
DWEData Write, /EC LOWHIGHLOWLOW

As one example, the cycle sequence in Figure 3 shows how the CAM architecture simplifies the control of a 48-bit comparison, a read of the match address, and a read of 16 bits of associated data. (Assume that the CAM has a 16-bit external bus and a 64-bit-wide memory array. Four 16-bit segments, numbered 0 through 3, access the memory array.) When you initialize the device, the memory array comprises 48 bits (segments 1, 2, and 3) of CAM and 16 bits (segment 0) of RAM. Data writes go to segments 1, 2, and 3 of the comparand register; data reads come from segment 0 of the highest priority matching location.

In Figure 3, the first three cycles write the data to the three segments of the comparand register and trigger a comparison. A command-read cycle gets the match address from the status register, which is the default source for command-read cycles. A single read cycle fetches the associated data from segment 0 of the highest priority matching location. Because the device's configuration is programmable, all five cycles deliver and fetch useful data without any additional overhead. This efficiency is very important in time-critical processes.

Searches are only part of the challenge of working with lists of data, however. Many embedded applications must also maintain the list over time by adding or learning new entries and deleting or purging obsolete ones. The associative-processing power of a CAM is well-suited to this task. New entries in the data list typically load into the next empty location in the CAM's memory array. A related concept is learning: After a comparison that finds no match, a single command (MOV NF,CR,V) moves data from the comparand register to the next free address. Because a typical list comprises unique entries with equal priority, the CAM can store a new entry anywhere in the memory array. Thus, you can construct an efficient sequence that adds data to the list by taking advantage of a CAM's ability to associatively access the next empty location.

Sometimes, you may want to update an entry that exists in the CAM's memory array, such as a time stamp in a segment of the comparand. Either the CAM/RAM configuration or a mask register usually masks out the time-stamp segment from the normal compare operations. When the CAM finds a match for the unmasked portion of the comparand, a single instruction (MOV HM,CR,MR1) moves only the bits indicated by mask register 1 to the highest priority matching location from the comparand register, updating only the time stamp in the process.

When it is time to delete or purge one or more entries from the data list, the CAM can use associative locations Highest Priority Match or All Matching. To perform the purge, the comparand register loads the entry to be removed, which forces a compare. Then, a single instruction (VBC ALM,E) changes the validity bits at all the matching locations to mark them as empty, thus removing them from compare operations and making them available for new entries as the next free location.

Application: disk cache
Disk caches store recently used data in RAM for faster access the next time you use this or nearby data. When the operating system requests data from the disk, the challenge is to quickly discover if the data is already in the cache. The CAM acts as a tag buffer to correlate a disk index to a location in a virtual disk, that is, RAM used as a cache.

The CAM searches the wide tags that are the original positions on the disk (associative data) and outputs a smaller pointer that is the location in the cache (associated data). You can use a time stamp or other indicator as part of a replacement algorithm.

There are two ways to implement a disk cache. A hardware cache (see part (a) of accompanying figure) uses an embedded processor to intercept disk requests and search a table to see if and where a requested block is located in the RAM cache. A software cache (see part (b) of figure) is either a part of the operating system or a low-level driver. A software cache uses a section of the host's main RAM as the disk cache. The CAM sequences are the same for either approach. A write cache is similar, but you must choose between a write-through or write-back scheme.

An initialization routine preloads segment 0 in the CAM with the first address of each of the fixed-size block locations in the cache. Segment 1 holds the time stamps, which are all set to the same initial value. Segments 2 and 3 hold the disk-location indexes whose initialized values do not generate a match. The memory array's initial configuration makes segments 0 and 1 of RAM automatically mask out the associated time stamp and cache pointers. New disk indexes move from the comparand through mask register 1 to preserve the preloaded block-start addresses. Mask register 2 compares and updates the time stamps.

The tag-search routine preloads segment 1 of the comparand with a time stamp, so that only two data writes are necessary between a disk request and the CAM comparison. The routine writes to segments 2 and 3, triggering a comparison. If the requested data is found in the cache, the routine retrieves the cache address from segment 0 of the highest matching location and immediately starts the data transfer from the cache to the host processor. This portion of the routine is the most time-critical, because the speed of the system depends on how quickly you can find the data and deliver it to the host. As soon as possible after the data transfer starts, the time stamp of the matching entry is updated, and the time stamp for the next request preloads into the comparand.

If the routine doesn't find the data in the cache, the learn/replace routine must retrieve the data from the disk. The disk-access time gives this routine plenty of time to perform its functions. The routine reconfigures the CAM array to 48 bits using mask register 2 to ignore all the bits except the time stamp. The routine configures the destination counter for a single write to segment 1 to accelerate the search for a time stamp to purge. Then, the routine loads the oldest known time stamp, which triggers a compare. A data read attempts to fetch the cache-page address associated with that time stamp and unlocks the daisy chain. If no match exists, the routine loads the next oldest time stamp into the comparand, which triggers another compare. This process repeats until a match is found, indicating the entry that has been unused for the longest period of time.

The controller places the block of data arriving from the disk into the cache starting at the location that the last data read indicates. The destination segment counter resets to its original three-segment configuration to keep the load of the current time stamp back into segment 1 from triggering a compare. The current time stamp and the disk index then move through mask register 1 to the entry that matched the last time stamp compare. Finally, the routine sets the control register back to 32 CAM bits and no mask, resets the destination segment counter to segment 1, and loads the next time stamp into the compar-and to prepare for the next disk request.

Application: branch tables and rule-based systems
You can use CAMs to quickly search through branch tables. Directly looking up a service routine's start location in a RAM-stored branch table is fast, as long as the table is small enough.

However, a directly indexed table may start to get large and sparse if wide spaces and gaps separate the input codes. Also, selecting a single entry from a long list may make the execution time of an event loop too long, especially in embedded systems that have real-time requirements. A CAM may then be an appropriate choice for accelerating program execution.

The CAM stores the input codes and the addresses of the routines that handle each condition as associated data. A data write followed by a data read is all that's necessary to look up the execution point of the service routine (see fetch routine).

Rule-based systems are a special case of this application. The memory array stores the rules, and the CAM comparison tests all the rules at once. A sequence of masked compares can generate partial matches and closest matches to prioritized rule sets. In this way, associative processing can accelerate small expert systems to handle real-time events.


CAMs make hash of the alternatives

Choosing between CAMs and the alternatives requires making trade-offs between processing power and speed. Linear and binary searches are one alternative, but they require extensive shuffling of the data list when you add or delete entries. To add to a sorted data list, every entry from the end of the table to the location of the new entry must be read and then written into the next location. Removing an entry requires the same process in reverse. Also, search times depend on the size of the data list.

Another alternative is a hash table. Hash tables are the software implementation of associative processing and are fairly efficient at making the table smaller. However, hash tables require an increased level of computing power to quickly calculate the hash-table key. List maintenance, which comprises calculating the hash key for the new entry and placing the data at this location, is fairly straightforward until a new entry maps to an already-used location, causing a collision. Resolving collisions involves performing additional hash-key calculations until the data finds an open location in the hash table. Collisions cause nondeterministic hash-table search times. That is, the compare time may vary based on the number of collisions that occur when constructing the table.

These alternatives are cumbersome in embedded applications, requiring processing power and long sequences of instructions that are unnecessary with a CAM. In an embedded system, the simple control interface makes it easy for a state machine to control the CAM's operation (Figure 4). Also, a CAM's powerful associative operations reduce the number of instructions required for any application, making it possible for simple state machines to control otherwise-complex operations.

When should you, as a designer, consider using CAMs? A CAM costs more than SRAM, because a CAM cell requires more transistors to build. However, a CAM's advantages outweigh the additional cost compared to other alternatives for the following conditions:

Your application need not meet all of these conditions for you to choose a CAM. At times, only one of these items may justify the use of a CAM. However, most suitable applications can take advantage of several CAM and associative-processing attributes.

Application: compression
Most lossless compression schemes, such as an LZW algorithm, are dictionary-based. These schemes typically search a look-up table and output translated data, such as a symbol code or expanded data. The dictionary table builds from the data stream that is compressing, so the dictionary contains only a subset of all possible entries. In other words, the dictionary is a sparse table.

The dictionary assumes that the first 258 entries comprise the literal 8-bit codes plus a clear code and an end-of-file code, so the first 258 CAM locations are marked "S" or "R" to let you use the match addresses as the codes for pairs of symbols. Initially, the comparison is 17 bits wide but can grow as the dictionary fills with new symbol combinations. The comparand for a 4-kbyte dictionary can typically grow to as many as 20 bits—12 bits for the code and 8 bits for the input symbol. The compression adds entries to the dictionary and assigns a code to each one. The additional entries comprise combinations of two ASCII literals or an ASCII literal and the code for a previous dictionary entry.

The compression routine appends the first two input symbols to create a base string. Two data-write cycles load the base string into the comparand. If the base string is not in the dictionary (typical examples are shown at steps 1 and 6 of the example), the routine sends the first symbol in the base string to the output. The MOV NF,V command adds the base string to the dictionary, which causes the 9-bit code assigned to the symbol pair to correspond to the CAM memory address. The second symbol from the base string then forms a new base string with the next symbol from the input.

If the base string is found in the table (as shown at steps 5 and 7 of the example) the routine sends nothing to the output. The routine reads the code (the match address) for this base string from the status register (the default location for command reads). The routine creates the base string for the next look-up by appending the next input symbol to this code.

The algorithm repeats this process until the input ends or the table fills. Even with an index of only two symbols, this procedure can build long strings, producing a good compression ratio. If the dictionary fills, you can extend the table length by going to codes of 10, then 11, and then 12 bits for symbol pairs that are added to the dictionary. When the dictionary reaches maximum size, the table is cleared, and a new table starts.

The decompression table comprises contiguous entries, having been built from incremental codes (the next free location was always one address higher than the previous CAM location used for a new entry). Therefore, a direct RAM look-up scheme can efficiently perform the decompression.

Compression example
StepInputComparand
[Seg1],[Seg0]
Code
[Match Addr]
Table entryOutput

M



1iM,i-[258]=M+iM
2si,s-[259]=i+si
3ss,s-[260]=s+ss
4is,i-[261]=s+is
5si,s259——--
6s259,s-[262]=259+s259
7is,i261——--
8p261,p-[263]=261+p261/td>
9pp,p-[264]=p+pp
10ip,i-[265]=p+ip
11--i,_-[266]=i+_i


Author's biography

Tom Weldon is director of product marketing and applications at MUSIC Semiconductors (Colorado Springs, CO), where he has worked for six years. He holds a BSE from Harvey Mudd College (Claremont, CA).



| EDN Access | feedback | subscribe to EDN! |
| design features | out in front | design ideas | columnist | departments | products |


Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.