Design Feature: January 5, 1995
To derive the best performance and efficiency in using a complex PLD's (CPLD) resources, you need to understand the device. Fortunately, CPLDs are not as complex as their name suggests. The devices comprise a few basic elements: logic blocks, a programmable interconnect scheme to provide communication between the blocks, and I/O cells. The way that you structure these elements within a part and assign the elements when implementing a function can make all the difference in your design's performance.
Note that a CPLD is distinct in architecture from an FPGA (field-programmable gate array). In a sense, a CPLD is nothing more than a collection of PLDs, similar to the popular 22V10, on a single chip. The CPLD contains product-term array blocks, a product-term allocation function, macrocells, and I/O cells. The software connects the CPLD's elements to build functions. The logic blocks execute logic equations that are simple sum-of-products expressions and put the results into registers located in the macrocells. A central routing area, or matrix, interconnects and routes signals to and from the blocks. A signal must travel through the programmable interconnect matrix even to go from a given register back to the same logic block.
There are three major resources available in all CPLDs: the number of macrocells that the device assigns to each logic block, the number of inputs going into each logic block from the global interconnect, and the logic block's product terms (Fig 1).
The CPLD also contains other useful resources, such as polarity control. The way in which you use these resources when you fit a logic design into a part determines the efficiency and performance of the final design.
The number of macrocells refers to the basic registers available to build functions, such as counters and state machines. The number of macrocells limits the size of certain designs within a given device. A CPLD with only 32 macrocells cannot hold a 40-bit counter. (Note: Some CPLDs offer input registers, which are useful for pipelined designs and improved metastability. However, these registers are not replacements for the macrocell's register, which contains general logic in front of the register.)
The device's second resource is the number of available inputs going into the logic block from the matrix. If each logic block of 16 macrocells can have up to 36 inputs, the logic for those macrocells can accommodate up to 36 input signals. If the design requires additional signals, you have to divide the logic between multiple logic blocks.
The last resource is the number and allocation of product terms. The total number of product terms is the parameter that places the upper bound on the amount of logic that can fit in a CPLD. The number of product terms that the device can OR together, rather than the total number of product terms, is what counts in a given macrocell. Some CPLDs can OR up to 16 or 20 product terms. In others, the number of terms varies from spot to spot within the device. Sometimes the vendor offers a pool of shared or redirectable terms, called expanders, to bump up the number of terms the device can OR. However, these expanders add additional delay.
If a logic design requires more terms, you can program multiple passes through the array, but at the cost of speed. If your goal is achieving a part's maximum speed, you want to fit logic into a single pass. But if the speed does not matter, a reasonable trade-off is to use multiple passes.
You should know that the way the device allocates product terms can be more important than the absolute number of terms. Some CPLD architectures allow you to assign product terms to individual macrocells (product-term steering) and to allocate the same product term to multiple macrocells (product-term sharing). Depending on the device's architecture, the design-synthesis tool can allocate product terms appropriately, as is the case with the Cypress Flash370 family. Product-term steering and sharing are responsible for much of the efficiency in using chip resources, allowing you to use more of the device's macrocells.
A word of warning: As you go about the design task, remember that you can implement a given function with different performance and logic efficiency. Without proper guidance, the synthesis tool can make a choice that may not satisfy your speed and cost objectives. The final circuit may prosper or suffer depending on the way you program your design to take advantage of the CPLD's architectural features.
Once you've mastered the device's architectural benefits and available resources, you should consider how ac specifications or timing can affect design performance. There are four major timing parameters: the input-to-output propagation delay (tpd); the setup time, or the time from an input or I/O edge (change) to a clock edge (ts); the clock-to-output delay (tco); and the output-clock-to-output-clock period (tscs).
When designing functions such as counters or state machines, the most important ac parameter is the minimum clock period at which you can clock the state registers. For other designs, usually I/O-intensive designs that solve large multiplexing problems, the minimum clock period is unimportant. Little sleep is lost over clock-to-clock time because the design does not use registers. Instead, the focus is on the propagation-delay parameter, tpd.
Different factors cause tscs and tpd to increase from their baseline values. Tscs and tpd may end up slightly or much longer than the nominal value, depending on the CPLD you choose and the product terms it requires. In fact, the values may double. All CPLD architectures allow for multiple passes through the array in order to satisfy the logical OR requirements of a large number of product terms. However, some CPLDs apply speed penalties during every array pass due to increased internal fan-out, increased output loading, increased number of product terms or outputs switching, and the implementation of expanders.
The same effects can apply to the other time constants. For example, a setup time (ts) of 5 nsec can result from the data's going through the array once without expanders or additional product-term allocation. The need for expanders can certainly increase the setup time beyond its baseline value. Even tco could stretch if the logic implementation requires an additional logic-array pass to condition the output.
Note that tscs is the most important ac-timing parameter in determining the maximum frequency of a counter. For an adder, which usually requires several passes through the array, the issue is how tscs changes as a result of the multiple passes.
Many CPLD architectures use the so-called variable-timing model, in which simple functions can go fast, but complex ones slow down. This timing model leads to an unnecessarily complicated design process and to devices that boast 10-nsec performance, when 15, 20, or 25 nsec may be closer to reality. The fixed-timing model, characteristic of the Cypress Flash370 family, eases the design process because it provides a fixed propagation delay through the array for any path with up to 16 product terms. Understanding the timing of your device, the timing limitations, how to design within them, and when to exceed these limitations are all part of getting what you want out of your CPLD.
Whether a particular large counter fits into the selected architecture depends on a number of factors, including the available inputs into the logic block, the macrocell's flip-flops (D or T), the number of control signals needed, and whether the counter must be loadable. A simple reset counter calls for one set of resources; an up-down counter calls for another.
For a simple 32-bit counter and a 32-macrocell device, there are usually only 16 macrocells with which to work inside a logic block. A logical step is to break up the counter bits and put 16 bits into one logic block and 16 into the other. Clearly, the most significant bit (MSB) eats the most resources because its state depends upon the states of the other bits.
In a counter that uses D flip-flops, you need all 32 counter-state bits to determine the next state of the MSB. For a T flip-flop, you need only the least significant 31 counter-state bits as inputs into the logic block that contains the MSB. But, if the logic block has 30 inputs or fewer, or if loading the counter is required, the counter can't fit. Although some counters require only a reset, most designs require loading, which is the ability to accept an input value.
Consequently, the CPLD's architecture determines the counter size and nature. If your CPLD has at least five product terms for every macrocell and provides a T flip-flop, the product terms are usually not the limiting factor for a counter. The number of available logic-block inputs typically determines the fitting of a certain counter.
You can determine the counter's general input limitations by assuming an n-bit loadable counter with I/O pins D0 to Dn-1 and two control inputs, load (LD) and output enable (OE). Start by placing 16 counter bits in the first logic block and m counter bits in the second block, where n-m = 16. Logic block 2 then has a total of n+1+m inputs: the n-1 counter-state bits, the m data inputs for loading, the load-control input, and the output-enable-control input. Because n-m = 16, n+1+m = 2n-15. For a CPLD having 36 logic-block inputs, 2n-15 <= 36, which yields a maximum 25-bit counter. Note that logic block 1 has a total of 33 inputs (15 counter-state bits, 16 data inputs, and the two control inputs). If that number exceeded 36, the block would have to accommodate fewer counter bits, and n-m would no longer equal 16.
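The input count above can be checked with a few lines of Python. This is a minimal sketch that assumes T flip-flops and only the two control inputs named in the text. The function names are illustrative and do not come from any vendor tool.

```python
def block1_inputs(block1_bits: int = 16) -> int:
    """Inputs needed by the logic block that holds the low-order counter bits."""
    state_bits = block1_bits - 1   # T flip-flop: the block's top bit needs every lower bit
    data_bits = block1_bits        # one load input per counter bit in the block
    controls = 2                   # LD and OE
    return state_bits + data_bits + controls


def block2_inputs(n: int, block1_bits: int = 16) -> int:
    """Inputs needed by the logic block that holds the upper m = n - block1_bits bits."""
    m = n - block1_bits
    state_bits = n - 1             # the MSB needs all lower counter-state bits
    data_bits = m                  # load inputs for the bits in this block
    controls = 2                   # LD and OE
    return state_bits + data_bits + controls


def max_counter_bits(inputs_per_block: int, block1_bits: int = 16) -> int:
    """Largest n for which block 2 still fits (assumes block 1 already fits)."""
    n = block1_bits
    while block2_inputs(n + 1, block1_bits) <= inputs_per_block:
        n += 1
    return n


print(block1_inputs())        # 33
print(block2_inputs(25))      # 35, which is 2n - 15 for n = 25
print(max_counter_bits(36))   # 25
```

Changing `inputs_per_block` shows how strongly the counter size depends on the number of logic-block inputs rather than on product terms.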
Fig 2 shows how to arrange the Cypress Flash370's resources. The CPLD has 36 inputs to accommodate as large as a 31-bit loadable counter while working at full speed. The "secret" to getting around the logic-block limitations is using one macrocell in one logic block that links the carry information from the least significant portion of the counter to the most significant portion.
The key issues are how to generate the logic for the carry bit and how to register the carry information. During loading, the carry register receives a 1 only when the D (input) bits D14 to D0 are all 1. During counting, carry receives a 1 only when the R (register) bits R14 to R1 are 1 and the least significant bit (LSB), R0, is 0. In both cases, the carry register holds a 1 on exactly the clock cycle in which registers R14 to R0 all hold 1s. Because the carry is registered, every path makes only one pass through the array, and clocking occurs at a frequency of 1/tSCS.
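A small behavioral model makes the registered-carry rule easier to follow. The sketch below is not taken from Fig 2. It assumes a 31-bit counter split into a 15-bit low section (R14 to R0) that shares a logic block with the carry macrocell, and a 16-bit high section in the other block. The class and function names are hypothetical.

```python
LOW_BITS = 15                      # R14..R0, assumed to share a block with the carry macrocell
HIGH_BITS = 16                     # upper counter bits, assumed to sit in the second block
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1
FULL_MASK = (1 << (LOW_BITS + HIGH_BITS)) - 1


class SplitCounter:
    """Loadable up-counter whose two halves are linked by a registered carry."""

    def __init__(self):
        self.low = 0
        self.high = 0
        self.carry = 0             # registered: 1 exactly when low is all 1s

    def clock(self, load=False, data=0):
        if load:
            d_low = data & LOW_MASK
            next_low = d_low
            next_high = (data >> LOW_BITS) & HIGH_MASK
            next_carry = int(d_low == LOW_MASK)            # D14..D0 all 1
        else:
            next_low = (self.low + 1) & LOW_MASK
            next_high = (self.high + self.carry) & HIGH_MASK
            next_carry = int(self.low == LOW_MASK - 1)     # R14..R1 are 1, R0 is 0
        self.low, self.high, self.carry = next_low, next_high, next_carry

    @property
    def value(self):
        return (self.high << LOW_BITS) | self.low


def matches_plain_counter(start, cycles):
    """Load `start`, count for `cycles` clocks, and compare with start+1, start+2, ..."""
    counter = SplitCounter()
    counter.clock(load=True, data=start)
    expected = start & FULL_MASK
    for _ in range(cycles):
        counter.clock()
        expected = (expected + 1) & FULL_MASK
        if counter.value != expected:
            return False
    return True


print(matches_plain_counter(0x7FFC, 8))         # crosses the low-section rollover
print(matches_plain_counter(FULL_MASK - 2, 6))  # crosses the full 31-bit wraparound
```

The high section only ever looks at the one registered carry bit, never at the 15 low-order state bits, which is what keeps the second block within its input budget.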
To simplify the circuitry of an up/down counter, or one with other kinds of controls, you can make the carry macrocell in Fig 2 combinatorial. You can AND all the registered data and feed the results into the right-hand block for registration. However, this practice cuts the maximum frequency because of one pass through each of the two arrays. In the case of the Cypress CY7C371, performance drops from 143 to less than 100 MHz.
| CPLD performance: a cautionary tale |
|---|
|
The CPLD's performance specifications impress you, so you design a counter or state machine and complete the design. But, when you simulate the design or plug the part into the board, your design does not achieve the expected performance. How could you have foreseen this problem prior to completing the design? The answer rests in the details of the timing specifications for your CPLD.
The type of flip-flop the CPLD uses can impose a speed penalty. The penalty occurs if the device uses a T flip-flop instead of a D flip-flop (Fig A). This penalty clearly affects the performance of counters because counters are more efficient with T flip-flops than with D flip-flops. For some CPLDs, expanders add a speed penalty. The penalty applies to the delay through the logic array if the device allocates additional expander product terms beyond the nominal amount provided. You may encounter parallel expanders, shared expanders, or both, if your design is more complex than the simplest counter. This additional delay is shown in (b) of Fig A. This delay affects the majority of your designs, and you may have a problem predicting when the synthesis tool chooses one type of expander over another. The third issue, internal loading, applies a speed penalty to some CPLD signals based on internal fan-out. The specification for this penalty is typically a minimum and maximum performance for a given fan-out. Fig A shows this performance degradation along with an expander delay in part (c). Such a specification bounds the performance, but knowing the exact timing is admittedly difficult. And this penalty applies to every design. All of these issues may push the device's performance below the specified figure. You must be careful when comparing devices because vendors apply speed penalties for these features differently. Some vendors charge a speed penalty for using some features but not others, so keeping up is difficult but necessary. The task for you as a designer is to filter through the details of the CPLD vendor's data sheets to understand the "taxes" that they may or may not have added to the initial performance measure. You should also understand the appropriate "fees" you have to pay before the design is working in your system.
PREP benchmarks, which actually measure design performance to offer users relative comparison points, are an excellent resource to use in this somewhat confusing task. |
Resource availability also determines the logic arrangement of other important functions. For example, you can implement a 16-bit adder in several ways. Generally, you can trade off speed for density (the efficiency in circuit utilization) through the selection process. The carry look-ahead approach generally yields the fastest speed; the ripple adder yields the best density. For the best speed, you should arrange the logic to perform the function in the fewest passes. For a ripple-carry adder, grouping the bits reduces the number of passes. A carry look-ahead scheme cuts the number of passes even more. In most cases, however, the penalty for fewer passes is an increase in the use of device resources. (But, by going from 1- to 2-bit groups in a ripple-carry adder, you actually save macrocells because there are fewer carries.)
You should try to make a bit grouping as large as possible without incurring the delay penalties of expanders. Still, the final implementation varies with the CPLD's architecture. An adder may be a small part of an overall design going into a CPLD, perhaps containing state machines and other functions. In some cases, such as 16-bit addition, placing an adder into an FPGA, rather than a 32-macrocell CPLD, may make more sense. When the adder is only part of the job, you may want to use a larger four- or eight-block CPLD.
Working out the logic equations for a 16-bit adder shows the resources your design needs. Opting for a ripple-carry adder with 1-bit groups yields the following results: 31 macrocells, 16 for the sum and 15 for the carries; 108 product terms, three for bit zero and seven terms per bit for bits 1 through 15; and 47 logic block inputs, 16 for the addend, 16 for the augend, and 15 for the carries. You need 16 passes through the array to implement the adder.
By using 2-bit groups, you change the numbers to 23 macrocells, 171 product terms, and 39 logic-block inputs, with only eight passes through the array. Clearly, the device runs twice as fast and uses about 60 percent more product terms. Utilizing variable groupings of bits results in commensurate variance in resource usage and speed.
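The macrocell, input, and pass counts for the two ripple-carry options follow a simple pattern, sketched below. The sketch assumes no carry-in, no stored final carry-out, and a width that divides evenly into groups. The product-term figures are quoted from the text rather than derived.

```python
def ripple_adder_resources(width: int, group: int) -> dict:
    """Tally resources for a ripple-carry adder built from `group`-bit groups."""
    groups = width // group        # assumes width is a multiple of group
    carries = groups - 1           # one registered carry between adjacent groups
    return {
        "macrocells": width + carries,        # sum bits plus carries
        "block_inputs": 2 * width + carries,  # addend, augend, and carries
        "passes": groups,                     # one array pass per group
    }


# Product-term totals as quoted in the text (not computed here).
PRODUCT_TERMS = {1: 108, 2: 171}

for group in (1, 2):
    tally = ripple_adder_resources(16, group)
    tally["product_terms"] = PRODUCT_TERMS[group]
    print(group, tally)

increase = PRODUCT_TERMS[2] / PRODUCT_TERMS[1] - 1
print(f"extra product terms for 2-bit groups: {increase:.0%}")
```

Halving the pass count costs roughly 58 percent more product terms, which the text rounds to about 60 percent, while the macrocell count drops because fewer carries need a register.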
The best choice for ultimate speed in an adder is the carry look-ahead adder. Carry look-ahead is a well-documented technique in many logic textbooks (Ref 1). With a 2-bit grouping, the numbers for carry look-ahead work out to 35 macrocells: 16 for the sum, six for the group generates, six for the group propagates, and seven for the carries. The design uses 180 product terms: three for each group generate (six groups), two for each group propagate (six groups), 11 for the least significant group, 16 for adding each of the other groups (seven groups), and 27 for the carries. There are 51 logic-block inputs: 16 for the addend, 16 for the augend, six each for the group generates and propagates, and seven for the carries. The number of passes the design requires is only three.
The first pass calculates C2, and the group generates and propagates for groups one through six; the second pass calculates carries C4, C6, C8, C10, C12, and C14; and the third pass calculates all of the sum bits, based on the carries coming into each group. The number of product terms increases as the number of passes decreases. For example, parts with only 160 product terms, or 32 macrocells, cannot handle the second and third design choices. However, higher density devices can fit this extra logic with resources available for implementing additional logic.
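The carry look-ahead totals can be re-added from the itemized figures, and the three adder options can be compared against a small part. In this sketch the 32-macrocell and 160-product-term limits come from the example in the text. The `fits` helper is hypothetical and ignores logic-block input limits.

```python
cla_macrocells = {
    "sum": 16, "group generates": 6, "group propagates": 6, "carries": 7,
}
cla_product_terms = {
    "group generates": 3 * 6,
    "group propagates": 2 * 6,
    "least significant group": 11,
    "other groups": 16 * 7,
    "carries": 27,
}
cla_block_inputs = {
    "addend": 16, "augend": 16,
    "group generates": 6, "group propagates": 6, "carries": 7,
}

cla_totals = (
    sum(cla_macrocells.values()),
    sum(cla_product_terms.values()),
    sum(cla_block_inputs.values()),
)
print(cla_totals)   # (35, 180, 51)


def fits(macrocells, product_terms, max_macrocells=32, max_product_terms=160):
    """Rough fit check on macrocells and product terms only."""
    return macrocells <= max_macrocells and product_terms <= max_product_terms


options = {
    "ripple, 1-bit groups": (31, 108),
    "ripple, 2-bit groups": (23, 171),
    "carry look-ahead, 2-bit groups": cla_totals[:2],
}
for name, (macrocells, product_terms) in options.items():
    print(name, fits(macrocells, product_terms))
```

Only the 1-bit ripple option passes this check, which agrees with the statement that the second and third choices need a denser device.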
The least significant group is slightly easier to calculate than the others because the given examples have no carry-in. The case with the 108 product terms has three product terms for bit zero and seven product terms for each of the other bits. These counts arise because, for bit zero, the sum bit takes two product terms, and the carry is a simple AND operation. The next bit is a full-adder operation that requires the carry and two operand bits and produces four terms for the sum and three for the carry.
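The per-bit product-term counts can be confirmed by writing out the sum-of-products equations and checking them against ordinary addition. This is an illustrative sketch using the standard half-adder and full-adder equations. Each lambda stands for one product term.

```python
from itertools import product

# Bit zero (no carry-in): two terms for the sum, one for the carry.
HALF_SUM = [lambda a, b: a and not b, lambda a, b: not a and b]
HALF_CARRY = [lambda a, b: a and b]

# Any other bit: four terms for the sum, three for the carry.
FULL_SUM = [
    lambda a, b, c: a and not b and not c,
    lambda a, b, c: not a and b and not c,
    lambda a, b, c: not a and not b and c,
    lambda a, b, c: a and b and c,
]
FULL_CARRY = [
    lambda a, b, c: a and b,
    lambda a, b, c: a and c,
    lambda a, b, c: b and c,
]


def sop(terms, *inputs):
    """OR together the product terms, as a macrocell's OR gate would."""
    return any(term(*inputs) for term in terms)


def equations_are_correct():
    for a, b in product([False, True], repeat=2):
        total = int(a) + int(b)
        if sop(HALF_SUM, a, b) != bool(total & 1):
            return False
        if sop(HALF_CARRY, a, b) != bool(total >> 1):
            return False
    for a, b, c in product([False, True], repeat=3):
        total = int(a) + int(b) + int(c)
        if sop(FULL_SUM, a, b, c) != bool(total & 1):
            return False
        if sop(FULL_CARRY, a, b, c) != bool(total >> 1):
            return False
    return True


terms_bit0 = len(HALF_SUM) + len(HALF_CARRY)     # 3
terms_other = len(FULL_SUM) + len(FULL_CARRY)    # 7
print(equations_are_correct())
print(terms_bit0 + 15 * terms_other)             # 108, following the text's count
```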
The carry look-ahead approach works well in CPLDs because the generate/propagate terms and carry look-ahead equations only call for simple AND/OR gates. Although the design may need slightly more product terms and macrocells, the final circuit is more than twice as fast as the eight-pass implementation. You should select the number of groups with a particular CPLD in mind, depending on how much you can accomplish in a single pass and the delays you find acceptable.
Sometimes a single pass does not work. Multiplexing or comparison functions may be too wide to perform in one pass. Suppose you want to compare the identity of two 8-bit variables, one from a counter and the other from a bus. This comparison takes 16 product terms. You could decide to use the comparison results to select control functions such as what data to load into a counter or whether to let the counter keep counting or load from another data bus. Then, the first pass does the 8-bit comparison, takes the one bit of information, and goes back through the array a second time.
A smaller comparison, such as a 4-bit comparison, only takes eight product terms. Therefore, you can complete the whole function in one pass, making the comparison and figuring out how to control the counter (whether to load it, count up or down, and so on). Generally, whatever the function, the CPLD should make some of the initial calculations in the first pass. You should also tell the synthesis tool how to break up the logic for the multiple passes.
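The text does not say how the comparison reaches two product terms per bit. One construction consistent with those numbers is to build the "not equal" function, which needs two mismatch terms per bit, and invert the result, for example with the macrocell's polarity control. The sketch below assumes that construction and uses hypothetical function names.

```python
from itertools import product


def bit(value, i):
    """Return bit i of an integer as a bool."""
    return bool((value >> i) & 1)


def mismatch_terms(width):
    """Two product terms per bit: a_i AND NOT b_i, and NOT a_i AND b_i."""
    terms = []
    for i in range(width):
        terms.append(lambda a, b, i=i: bit(a, i) and not bit(b, i))
        terms.append(lambda a, b, i=i: not bit(a, i) and bit(b, i))
    return terms


def equal(a, b, width):
    not_equal = any(term(a, b) for term in mismatch_terms(width))
    return not not_equal           # the inversion stands in for polarity control


def comparator_is_correct(width):
    values = range(1 << width)
    return all(equal(a, b, width) == (a == b) for a, b in product(values, repeat=2))


print(len(mismatch_terms(8)))      # 16
print(len(mismatch_terms(4)))      # 8
print(comparator_is_correct(4))
```

Under this assumption the term count grows linearly with word width, which is why the 8-bit compare fills a 16-term macrocell by itself while the 4-bit compare leaves room for control logic in the same pass.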
For multiple passes, the device uses one macrocell to hold a partial result, such as a comparison. The second array pass then reuses that single result, which saves the product terms that generate the function. You only have to use those terms, or the logic producing them, once.
Consider an 8-bit counter and the comparison of two 4-bit words. Suppose the functions contain few other controls so each bit takes about 12 product terms. This scenario results in a total of 96 product terms (8×12), allowing the circuit to run at the maximum of 143 MHz. If you split off the compare function, perform it in a single pass with just eight product terms, and feed the result back into the eight bits, the logic in front of each bit may only require four or five product terms. Now you have 40 product terms for those eight bits and eight terms for the initial compare. Consequently, you require fewer than 50 product terms, as opposed to 96 in the previous configuration. You have used up one more macrocell, but saved about 50 product terms. Of course, now the circuit runs at only 77 MHz because it makes two passes through the array.
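The trade-off in this example reduces to a short tally. The sketch below only restates the figures given in the text, taking five terms per bit for the two-pass case. The dictionary layout and function name are illustrative.

```python
BITS = 8
COMPARE_TERMS = 8      # 4-bit compare performed in its own pass


def two_pass(terms_per_bit):
    """Resources when the compare result is fed back through one extra macrocell."""
    return {
        "product_terms": BITS * terms_per_bit + COMPARE_TERMS,
        "macrocells": BITS + 1,
        "max_mhz": 77,             # figure quoted in the text
    }


single_pass = {"product_terms": BITS * 12, "macrocells": BITS, "max_mhz": 143}
split = two_pass(terms_per_bit=5)

print(single_pass)
print(split)
print("product terms saved:", single_pass["product_terms"] - split["product_terms"])
```

The saving of 48 product terms costs one macrocell and nearly half the clock rate, so the split is attractive only when product terms, not speed, are the scarce resource.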
You could also perform a partial compare in one pass and the rest of the compare in the next pass. If you choose to go this route, pick a division point. Doing so can accrue advantages according to how you use the input-routing resources. The down side is that you have two intermediate results. The necessary functions may be more difficult to develop.
The total circuit arrangement in Fig 3a cannot fit into a single pass because it requires 32 product terms to do the compare and the multiplexing. You need to determine how best to handle the two passes. The partitions shown in Fig 3a and 3b are two choices. Fig 3a compares two 8-bit signals directly. Fig 3b breaks each signal into its least- and most-significant nibbles, or halves, and executes two 4-bit comparisons. The timing is the same in both cases because the circuit makes two passes. Signals A and B move through the array twice; C and D go through once. The clock-to-Q-output delay is the same in both cases. The difference is how you use the resources of a particular part.
Pay attention to synchronous resets. The resets are notorious producers of product terms. You can also apply other methods that trade off timing and resources in this arena. For example, you could use a counter that comprises D flip-flops that you can load or that performs functions beyond counting. The LSB requires two product terms. The next bit requires three terms, and so on. You end up with many product terms when you reach 16 bits. By contrast, a T flip-flop counter typically requires the same number of terms for all of the bits.
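The growth in D flip-flop terms can be made concrete with one possible set of equations for a loadable up-counter. This is a sketch under stated assumptions: the only control is a load input LD, the signal names are illustrative, and a real design with more controls needs more terms. Each dictionary is one product term that lists the level required on each signal.

```python
def d_terms(k):
    """Product terms for the D input of bit k: k + 2 terms."""
    terms = [{"LD": 1, f"D{k}": 1}]                                        # load a 1
    terms.append({"LD": 0, f"Q{k}": 0, **{f"Q{j}": 1 for j in range(k)}})  # count: 0 -> 1
    for j in range(k):
        terms.append({"LD": 0, f"Q{k}": 1, f"Q{j}": 0})                    # count: hold 1
    return terms


def t_terms(k):
    """Product terms for the T input of bit k: always three terms."""
    return [
        {"LD": 0, **{f"Q{j}": 1 for j in range(k)}},   # count: toggle when lower bits are all 1
        {"LD": 1, f"D{k}": 1, f"Q{k}": 0},             # load: toggle up to 1
        {"LD": 1, f"D{k}": 0, f"Q{k}": 1},             # load: toggle down to 0
    ]


def sop(terms, env):
    """Evaluate a sum of products against the signal levels in env."""
    return any(all(env[name] == level for name, level in term.items()) for term in terms)


def equations_are_correct(width=4):
    """Exhaustively compare both implementations with a reference counter."""
    mask = (1 << width) - 1
    for q in range(1 << width):
        for d in range(1 << width):
            for ld in (0, 1):
                env = {"LD": ld}
                for j in range(width):
                    env[f"Q{j}"] = (q >> j) & 1
                    env[f"D{j}"] = (d >> j) & 1
                expected = d if ld else (q + 1) & mask
                next_d = sum(int(sop(d_terms(k), env)) << k for k in range(width))
                next_t = sum((env[f"Q{k}"] ^ int(sop(t_terms(k), env))) << k
                             for k in range(width))
                if next_d != expected or next_t != expected:
                    return False
    return True


print(equations_are_correct())
print([len(d_terms(k)) for k in range(4)])        # [2, 3, 4, 5]
print(sum(len(d_terms(k)) for k in range(16)))    # 152 terms for 16 bits
print(sum(len(t_terms(k)) for k in range(16)))    # 48 terms for 16 bits
```

With these equations the top bit of a 16-bit D flip-flop counter needs 17 product terms, more than a 16-term macrocell can OR in one pass, while every T flip-flop bit needs three.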
Synchronous resets with T flip-flops usually add some product terms. However, you can avoid these terms when a latch is available. You can latch the reset signal using an active-low latch enable that the clock input drives. You then feed the output of the latch to the asynchronous reset, which is usually available in a macrocell. The reset timing changes because the device samples the reset input during the low portion of the clock instead of only at the rising edge of the clock.
Consider a 10-bit, loadable up-counter with count-enable and reset inputs. T flip-flops compose the counter. You can reduce the number of product terms by 25 percent by modifying the conventional synchronous reset arrangement. In Fig 4a, the conventional approach calls for 40 product terms. However, if the design can stand the slight change in the reset timing (Fig 4b), a slight modification slashes that to 30.
As you describe a design, guard against putting the same logic or fixed logic states into registers unnecessarily. Sometimes using a high-level language and 12- or 16-bit quantities hides these situations. You could have redundancies or unnecessary registers. In some cases, a constant zero or one can end up in a register. However, you could actually delete that register by tying the output low or high. The solution is to make an end run around the synthesizer by redescribing your design to prevent duplication.
You could have a situation where the same logic goes into multiple registers and those registers do not go to I/O pins. In this case, you should eliminate all but one of those registers. You need only one output to represent the information. By doing so, you save registers. Also, you can sometimes reduce the logic that accepts these register outputs as inputs.
You may have to invest the time to break the word lengths into smaller quantities and manipulate the smaller values during coding. The fix hinges on how well you describe your design to let your synthesis tool fully utilize your CPLD architecture.
When using any CPLD, the first step is understanding the basic resource limitations of the architecture. You should then focus on the critical ac-timing parameters. The timing parameters require that you understand the relationship of the timing to the architecture. You should then realize how typical functions utilize the available resources. Understanding this utilization can lead to determining how to get the most out of your CPLD.
Reference