EDN Access

 

May 22, 1997


Shattering the programmable-logic speed barrier

BRIAN DIPERT, TECHNICAL EDITOR

Upfront planning and careful design, plus an awareness of architecture and development-tool strengths and shortcomings, ensure that you squeeze maximum performance out of your programmable-logic device.

Read almost any magazine article written on programmable logic in the last several years, and you immediately sense the battle cry of the industry: "We're going after gate-array ASICs!" Today, this once-quixotic prediction is increasingly becoming reality. Ever-smaller process lithographies improve programmable-logic cost-effectiveness, resulting in die sizes comparable to gate-array counterparts, even at gate densities of several tens of thousands.

Ironically, these same lithography trends also enable gate arrays to squeeze more logic onto a die and more die onto a wafer. However, lithography shrinks are pushing ASIC cost-effective gate counts, minimum-order quantities, and lead times beyond the needs of a growing number of designs, creating opportunities for programmable-logic alternatives in the process. Finally, more and more companies are growing to depend on programmable logic's flexibility, which provides them with the fastest possible time to market for their products.

However, programmable logic currently doesn't come close to matching gate arrays and other ASIC technologies in one area: performance. Thanks to aggressive lithographies, direct metal-to-metal interconnections, and abundant signal-routing resources, ASICs' logic gate delays are measured in tens of picoseconds, with input-to-output delays for even complex multilevel logic paths only a few nanoseconds. Delays for comparably sized programmable-logic counterparts are easily an order of magnitude longer.

In the past, you could make an effective argument that most designs didn't use the performance potential that ASICs provided. However, times have changed dramatically, especially in applications such as networking and telecommunications. The migration from 10-Mbit Ethernet to 100 Mbits (and, soon, 1 Gbit), 155-Mbps asynchronous transfer mode (ATM), high-bandwidth digital-cellular protocols, the embedded-PCI push beyond 33 MHz, fast DSP functions, 50-MHz DMA, 100-MHz graphics controllers, and other examples highlight the growing "need for speed."

Given that you're sold on programmable logic for your next design (either for prototyping, initial production runs, or even through full production), how do you hit the required system-performance targets? Unfortunately, easy answers to this problem rarely exist. As is the case with most problems of this nature, you must first do a fair amount of research into silicon and tools, coupled with some creative "out-of-box" brainstorming, and finish by weighing various combinations of system-level trade-offs to determine the right answer for your situation. With programmable-logic devices increasingly resembling systems on a chip, many of the performance optimizations for logic modules within a device have much in common with techniques for optimizing device performance on a board.

Don't rely on your tools

The first time you hear a presentation from a PLD- or FPGA-device or -tool vendor, you might get the impression that you can implement your design as sloppily as you want, using any synthesis method you prefer, and the tools will take care of cleaning up the design for you (somehow reading your mind in the process to figure out exactly what you want). After subsequent evaluation, odds are you'll probably modify that opinion. Leaving the optimization up to the compiler and fitter is an admirable goal, and in relatively simple designs it may in fact be a reality. (For brevity's sake, the PLD term "fitter" also refers to an FPGA's mapper and automated place-and-route tools.) However, in most cases, remember the phrase, "garbage in, garbage out," and realize that you'll have to take a more active role in the optimization.

Until recently, this scenario was especially applicable when using third-party tools (that is, tools not developed by or proprietary to a specific silicon vendor), which tended to generate relatively generic netlists because of the tools' ASIC foundations. You're asking a bit much if you expect a silicon vendor's mapper to look at a collection of random gates and correctly infer that some of the gates represent a fast counter, others are a narrow-but-deep FIFO buffer, and the rest are slow decoding logic. Even if the tools have selections for high performance or low gate count, this option list is often too coarse to produce meaningful results. A more effective alternative is to follow some relatively straightforward steps in the upfront design to guide the downstream compilation and fitting. As Rocky Awalt, senior member of the technical staff at Memec Design Services, states, "Fixing [the design] does not necessarily mean placing and routing by hand; it means designing in a way that fits the architecture of the part and that suits the tools you are using." (See Reference 1.)

At a high level, programmable-logic-device performance is a function of several main variables:

  • Input-buffer/logic-gate delays,
  • Internal logic-gate delays,
  • Interconnect signal-propagation delays (a function of the interconnect load, the driver strength, and any signal noise),
  • The number of serial levels of both logic gates and interconnect between input and output, and
  • Output-buffer/logic-gate delays.

Note that logic and interconnects that function in parallel may decrease available resources, but they do not slow performance.

The simple- and complex-PLD (SPLD/CPLD) and FPGA structures fundamentally consist of some level of combinatorial logic, followed by an optional register (Figures 1 and 2). This fact becomes important when you consider various state-machine configurations. As a short review, remember that in a Mealy machine, logic outputs are a decoded function of both the respective state variables and the logic inputs. In a Moore machine, however, logic outputs are either a decoded or, ideally, a direct function of just the state variables (Figure 3). Moore state machines have several advantages over their Mealy counterparts, including easier design verification and greater noise immunity. Most important, Mealy state machines always force at minimum a two-stage logic-structure implementation--one for the state register and another for the postregister decode. Directly driving outputs from Moore states eliminates this second-stage effect, thereby boosting performance.
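
The direct-output Moore arrangement can be sketched in VHDL as follows. This is a minimal illustration, not a design from the figures: the entity name, the port names, and the four-state encoding are all hypothetical. The state codes are chosen so that two state-register bits serve as the outputs, leaving no logic after the register.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical example: a Moore machine whose state encoding is chosen
-- so that the state-register bits double as the outputs.
ENTITY moore_direct IS
  PORT (clk, rst, go : IN  std_logic;
        busy, done   : OUT std_logic);
END ENTITY moore_direct;

ARCHITECTURE rtl OF moore_direct IS
  -- Encoding: state(1) is the busy output, state(0) is the done output
  --   "00" = idle, "10" = run, "01" = finish, "11" = unused
  SIGNAL state : std_logic_vector(1 DOWNTO 0);
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF rst = '1' THEN
        state <= "00";
      ELSE
        CASE state IS
          WHEN "00" =>                      -- idle: wait for go
            IF go = '1' THEN
              state <= "10";
            END IF;
          WHEN "10"   => state <= "01";     -- run, then finish
          WHEN "01"   => state <= "00";     -- finish, then idle
          WHEN OTHERS => state <= "00";     -- recover from the unused code
        END CASE;
      END IF;
    END IF;
  END PROCESS;

  -- Outputs are plain wires from the state register: no postregister decode
  busy <= state(1);
  done <= state(0);
END ARCHITECTURE rtl;
```

A Mealy version of the same function would compute busy or done from both state and go, which places at least one more logic level between the register and the pin.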

Because FPGAs are register-rich but fan-in-poor, you might want to consider an interesting alternative, the One-Hot state machine, to the standard binary counter. This technique was developed for FPGAs as a variant of the state-per-bit approach in use since the days of paper-tape readers and front-panel computer-input switches. One-Hot uses a register for each device state (state-per-bit), with only one register active (or "hot") at a time. In other words, a 32-state machine might use as few as five registers in a standard binary-encoded scheme (more in a binary-coded decimal counter or a state machine with additional unused states) but would use at least 32 registers in a One-Hot configuration. Figure 4 shows a six-register, One-Hot state-machine example with the output an active decode of states 2 to 5.

At first glance, One-Hot state machines might seem resource-inefficient. Remember, though, that FPGAs are register-rich but have logical units with limited fan-in and that the design might therefore consume incremental FPGA-logic resources, both to generate wide fan-in functions and to route signals. By minimizing next-state decoding logic and interconnect delays, a One-Hot state machine often runs significantly faster in an FPGA than its binary-counter-based alternative does. However, note that the One-Hot technique does not apply to SPLDs and CPLDs; these devices, in contrast, are fan-in-rich but register-poor and thus are best suited for traditional counter-based state machines. Another caution: Because of the number of register elements involved, you must carefully define and control One-Hot state machines. They contain a large number of invalid states, defined as any combination of more than one active register. Alpha-particle effects, ground bounce, metastability, and asynchronous inputs all potentially cause unintended state transitions.
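
As a rough sketch of the One-Hot idea, the following VHDL describes a six-register machine whose output is active in states 2 to 5, loosely following the description of Figure 4. The entity name, the advance input, and the simple ring-style sequencing are assumptions made for illustration only.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical six-state One-Hot machine: one register per state.
ENTITY onehot6 IS
  PORT (clk, rst : IN  std_logic;
        advance  : IN  std_logic;   -- assumed input: step to the next state
        active   : OUT std_logic);
END ENTITY onehot6;

ARCHITECTURE rtl OF onehot6 IS
  SIGNAL s : std_logic_vector(5 DOWNTO 0);   -- s(n) = '1' means "in state n"
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF rst = '1' THEN
        s <= "000001";                       -- only state 0 is hot
      ELSIF advance = '1' THEN
        s <= s(4 DOWNTO 0) & s(5);           -- hand the hot bit to the next state
      END IF;
    END IF;
  END PROCESS;

  -- Output decode is an OR of state bits, not a decode of a binary count
  active <= s(2) OR s(3) OR s(4) OR s(5);
END ARCHITECTURE rtl;
```

Each register's next-state input here depends only on its neighbor, its own value, and advance, which is the narrow fan-in that suits FPGA logic cells. Note that this sketch does nothing to detect or recover from the invalid multiple-hot states mentioned above; only reset restores a legal state.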

Many tool sets, both silicon-vendor-supplied and third-party-developed, support not only Mealy and Moore but also One-Hot state-machine generation. Some tools automatically select an optimum state machine in response to your circuit implementation, guidance, or both (for example, prioritizing performance to the compiler), whereas other tools always default to one method unless you override them. Consult your development-tool documentation for more information.

In all FPGA applications that use counters (state machines or others), consider linear-feedback-shift registers (LFSRs) as alternatives to binary counters whenever possible. Figure 5 shows two alternative LFSR approaches (Reference 2). In deciding what to implement, ask yourself if your counter must increment or decrement in a specific pattern and if you use all the counter values or just the terminal and perhaps initial count.

Two examples illustrate possible LFSR applications (Reference 1). Often, when dividing something down, you need to know the number of division steps and the terminal count, but not the various counter values along the way. Also, if you are designing a RAM address generator and only your hardware will access the memory, the addressing pattern must be only consistent, not sequential. Even if you have to use a binary counter with an FPGA, minimize the counter's feature set to reduce logic and routing complexity and thereby boost performance. For example, if the counter can wait after loading before actually beginning to count (allowing time for sequential-logic initialization), you can significantly simplify its design.
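
The divide-down case can be illustrated with a small VHDL sketch, assuming a 4-bit shift register with XOR feedback from its two upper bits (the polynomial x^4+x^3+1, which cycles through all 15 nonzero values). The entity name, seed, and choice of terminal value are hypothetical; this is not one of the Figure 5 circuits.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical divide-by-15 built from a 4-bit LFSR instead of a binary counter.
ENTITY lfsr_div15 IS
  PORT (clk, rst : IN  std_logic;
        tc       : OUT std_logic);   -- high for one clock in every 15
END ENTITY lfsr_div15;

ARCHITECTURE rtl OF lfsr_div15 IS
  SIGNAL q : std_logic_vector(3 DOWNTO 0);
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF rst = '1' THEN
        q <= "0001";                 -- any nonzero seed; all zeros would lock up
      ELSE
        -- Shift left; the only next-state logic is one two-input XOR
        q <= q(2 DOWNTO 0) & (q(3) XOR q(2));
      END IF;
    END IF;
  END PROCESS;

  -- "1000" is the last value before the sequence returns to the seed
  tc <= '1' WHEN q = "1000" ELSE '0';
END ARCHITECTURE rtl;
```

The count values appear in a scrambled order (0001, 0010, 0100, 1001, and so on), which does not matter when only the terminal count is used.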

Sometimes, using a flip-flop other than the traditional D-type reduces logic and routing complexity in front of the register, minimizing propagation delay and boosting resultant circuit performance. SPLDs and CPLDs perhaps have an advantage here because they can usually implement T-type and other flip-flops without requiring additional combinatorial-logic layers (which would defeat the purpose of the exercise). Some more specialized FPGA structures include register enhancements that conceptually achieve the same result. Figure 6, an admittedly simple, 3-bit up-counter example (reset not shown), nonetheless illustrates the potential of this approach. Before implementing this technique, make sure that the device's flip-flops deliver equivalent performance when configured as either D- or T-type.
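
A 3-bit up-counter written in terms of toggle inputs might look like the sketch below. It is an assumption-laden illustration rather than a copy of Figure 6: the names are hypothetical, a synchronous reset has been added, and whether the fitter actually maps the description onto T-type flip-flops depends on the device and tools.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical 3-bit up-counter expressed with toggle (T) inputs.
ENTITY t_counter3 IS
  PORT (clk, rst : IN  std_logic;
        q        : OUT std_logic_vector(2 DOWNTO 0));
END ENTITY t_counter3;

ARCHITECTURE rtl OF t_counter3 IS
  SIGNAL cnt, t : std_logic_vector(2 DOWNTO 0);
BEGIN
  -- A bit toggles when all lower-order bits are '1'
  t(0) <= '1';
  t(1) <= cnt(0);
  t(2) <= cnt(0) AND cnt(1);

  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF rst = '1' THEN
        cnt <= "000";
      ELSE
        cnt <= cnt XOR t;            -- T-type behavior: flip where t is '1'
      END IF;
    END IF;
  END PROCESS;

  q <= cnt;
END ARCHITECTURE rtl;
```

The toggle equations are single AND terms, whereas the equivalent D-input equations for the upper bits need several product terms each.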

Another technique you can use to overcome long propagation delays involves inserting one or several synchronous pipelines within a logic-level chain. This approach increases the clock latency from first input to first valid-circuit output. However, this approach usually also allows the circuit to operate at a faster frequency by reducing the number of combinatorial-logic levels that a signal must traverse between successive clock edges. In many applications that use counters and data-flow circuits, you often find that sustainable performance or bandwidth is more critical than initial latency. Remember that when you determine the placement of the pipeline or pipelines, include the delays not only of the logic gates but also of the interconnects. (See box, "Andraka's top-10 list," rules 4 and 7.)
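
As a minimal sketch of pipelining, the following VHDL splits a four-operand addition into two registered stages, so that only one adder sits between any two clock edges. The entity name and the 8-bit operand width are assumptions for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Hypothetical two-stage pipelined adder: result appears two clocks after the inputs.
ENTITY add4_pipe IS
  PORT (clk        : IN  std_logic;
        a, b, c, d : IN  unsigned(7 DOWNTO 0);
        sum        : OUT unsigned(9 DOWNTO 0));
END ENTITY add4_pipe;

ARCHITECTURE rtl OF add4_pipe IS
  SIGNAL ab, cd : unsigned(8 DOWNTO 0);      -- pipeline registers
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      -- Stage 1: two adders work in parallel, results are registered
      ab <= resize(a, 9) + resize(b, 9);
      cd <= resize(c, 9) + resize(d, 9);
      -- Stage 2: one adder combines the previous cycle's partial sums
      sum <= resize(ab, 10) + resize(cd, 10);
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

A new set of operands can enter on every clock, so throughput is one result per cycle even though each result takes two cycles to emerge.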

Reducing logic complexity to boost speed is an important aspect of programmable-logic design. Ironically, you can also increase performance by increasing logic complexity, but only in specific situations. These situations involve configurations that execute in parallel functions that would otherwise happen in a slower serial fashion. One device example compares a multibit ripple-carry adder or subtracter with the adder or subtracter's carry-look-ahead counterpart. Although this circuit requires more gates to implement, it calculates the sum/difference and carry values for all bits in a parallel fashion, significantly reducing the delay path needed to calculate a valid result (Reference 3).
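
The carry-look-ahead idea can be written out for a 4-bit adder as in the sketch below, which uses the standard generate and propagate terms. The entity and signal names are hypothetical, and a synthesis tool is free to restructure these equations, so treat this as an illustration of the logic rather than a guaranteed implementation.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical 4-bit carry-look-ahead adder.
ENTITY cla4 IS
  PORT (a, b : IN  std_logic_vector(3 DOWNTO 0);
        cin  : IN  std_logic;
        sum  : OUT std_logic_vector(3 DOWNTO 0);
        cout : OUT std_logic);
END ENTITY cla4;

ARCHITECTURE rtl OF cla4 IS
  SIGNAL g, p : std_logic_vector(3 DOWNTO 0);  -- generate, propagate
  SIGNAL c    : std_logic_vector(4 DOWNTO 0);  -- carry into each bit
BEGIN
  g <= a AND b;          -- bit i generates a carry
  p <= a XOR b;          -- bit i propagates an incoming carry

  -- Every carry is a two-level function of g, p, and cin;
  -- none waits for the carry of the bit below it.
  c(0) <= cin;
  c(1) <= g(0) OR (p(0) AND cin);
  c(2) <= g(1) OR (p(1) AND g(0)) OR (p(1) AND p(0) AND cin);
  c(3) <= g(2) OR (p(2) AND g(1)) OR (p(2) AND p(1) AND g(0))
          OR (p(2) AND p(1) AND p(0) AND cin);
  c(4) <= g(3) OR (p(3) AND g(2)) OR (p(3) AND p(2) AND g(1))
          OR (p(3) AND p(2) AND p(1) AND g(0))
          OR (p(3) AND p(2) AND p(1) AND p(0) AND cin);

  sum  <= p XOR c(3 DOWNTO 0);
  cout <= c(4);
END ARCHITECTURE rtl;
```

The widest term has five inputs, which shows why this style tends to suit fan-in-rich devices better than cells with narrow look-up tables.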

Another example involves minimizing logic-output fan-out by splitting a signal and routing it through multiple buffers to its destinations or, in a more extreme case, also duplicating the logic that generates the signal. Notice that, although you've increased both logic and interconnect complexity, you've also reduced the loading that any of the parallel signals see, thereby speeding propagation. Be careful, though, that the greater interconnect complexity doesn't result in significantly longer routing paths, whose negative results would outweigh any loading improvements.
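
One way to picture logic duplication is the sketch below, in which a registered enable is created twice so that each copy drives only half of a 32-bit register. All names and widths are hypothetical. Be aware that many synthesis tools merge identical registers by default; the option or attribute that prevents this is tool-specific and is not shown here.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical example: duplicate a control register to halve its fan-out.
ENTITY dup_enable IS
  PORT (clk   : IN  std_logic;
        en_in : IN  std_logic;
        d     : IN  std_logic_vector(31 DOWNTO 0);
        q     : OUT std_logic_vector(31 DOWNTO 0));
END ENTITY dup_enable;

ARCHITECTURE rtl OF dup_enable IS
  SIGNAL en_a, en_b : std_logic;   -- two copies of the same enable
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      -- Both copies register the same source
      en_a <= en_in;
      en_b <= en_in;

      -- Each copy drives 16 loads instead of one signal driving 32
      IF en_a = '1' THEN
        q(15 DOWNTO 0) <= d(15 DOWNTO 0);
      END IF;
      IF en_b = '1' THEN
        q(31 DOWNTO 16) <= d(31 DOWNTO 16);
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```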

Probably the most challenging technique of adding and removing gates is that of breaking a large, complex state machine into several smaller, faster alternatives. Although several smaller, tighter state machines can independently outperform their unified, bigger, and more cumbersome counterpart, the degree of interaction between the smaller machines is often a roadblock to high performance. At best, several small state machines may take longer to conceptualize and implement, which would toss your development schedule out the window. At worst, the interaction between the state machines, the added logic complexity, and the routing delays that the added logic complexity causes negate any potential performance advantage. One suggested approach that often works well is to think of your design as an operating system with one main process that occasionally starts subprocesses in response to input stimuli. By compartmentalizing any single state machine's functions in this way, you can often minimize interaction and maximize performance (References 4 and 5).

Software developers are increasingly turning to high-level languages (HLLs) with object-oriented structures, such as C++, Visual Basic, and Java. HLLs offer some simple yet compelling benefits: They enable straightforward code-module integration and reuse, they are relatively isolated from, and therefore portable to, a variety of hardware platforms, and they relatively quickly generate a large amount of software. For engineers evaluating HDLs, such as VHDL and Verilog, the same selling points apply as compared with schematic entry and other more traditional approaches. You can easily reuse circuits implemented in HDLs later in other designs. HDL-based designs can compile to any programmable-logic architecture that provides sufficient resources. And, in this era of 100,000-gate CPLDs and FPGAs and shrinking development schedules, anything that can more rapidly generate a lot of logic than alternative methods garners lots of interest.

However, low-level schematic entry for programmable logic (just like low-level assembly language for software) offers one compelling advantage: It produces high-performance results. Although HDL tools today are more programmable-logic-optimized than in the past, a Boolean-equation- or schematic-generated design is sometimes smaller and runs faster than its HDL-synthesized alternative. So what do you do? If necessary, follow the approach of your software-engineer counterparts and design the performance-critical portions of your logic using schematic-entry tools and still harness HDL's advantages for less speed-critical circuits. Many HDL compilers enable you to insert, or instantiate, an EDIF-compliant netlist generated by schematic entry. In the future, as HDL compilers become more robust and silicon becomes more flexible, you'll undoubtedly implement a greater percentage of your designs in HDLs (just as more programmers now choose HLL-only software development than in the past).

In your HDL code, you can make many simple optimizations to boost speed. Often, they involve explicitly defining all possible values of variables and functions so that the compiler doesn't generate additional gates to cover the situations you've inadvertently omitted. Also, compilers are doing a better, though not perfect, job of compressing combinatorial-logic levels, and you can help compilers here. The examples that follow use VHDL (although the concepts apply equally to other HDLs) and represent some of the more common optimizations.

When coding VHDL, be careful with IF...THEN or CASE statements. Listing 1a shows a process that generates a latch, whereas Listing 1b's process does not. The difference between the two listings is the ELSE clause. In VHDL, when you do not assign a value to a signal under a condition (such as to q when a1='0'), the signal retains its previous value, thereby generating a latch. Listing 1b clearly defines what happens when a1 does not equal '1', so no latch is generated. If d in this example happened to be a 16-bit bus, Listing 1a would use almost twice the number of logic cells as Listing 1b.

  Listing 1a
PROCESS (a1,d)
BEGIN
  -- No ELSE branch: q must hold its value when a1='0', so a latch is inferred
  IF (a1='1') THEN q<=d;
  END IF;
END PROCESS;
 
  Listing 1b
PROCESS (a1,d)
BEGIN
  -- Every branch assigns q, so no latch is inferred
  IF (a1='1') THEN q<=d;
  ELSE q<='0';  -- for a bus, use (OTHERS=>'0')
  END IF;
END PROCESS;
 

CASE statements operate under the same rules. Listing 2a generates a latch, whereas Listing 2b does not. When you don't define the "01" and "10" states for sel, the synthesis tool generates latches to retain previous values.

  Listing 2a
PROCESS (sel,a,b)
BEGIN
  -- "01" and "10" are not covered, so q must hold its previous value
  CASE sel IS
    WHEN "00"=>q<=a;
    WHEN "11"=>q<=b;
  END CASE;
END PROCESS;

 
  Listing 2b
PROCESS (sel,a,b)
BEGIN
  -- WHEN OTHERS assigns q for every remaining value of sel
  CASE sel IS
    WHEN "00"=>q<=a;
    WHEN "11"=>q<=b;
    WHEN OTHERS=>q<='0';
  END CASE;
END PROCESS;

 

Careful use of parentheses can direct the compiler to implement a circuit that maximizes its performance (perhaps, however, at the expense of additional gate count). As an example, Listing 3a often generates a design that uses three cascaded adders. Listing 3b, by simple inclusion of two sets of parentheses, synthesizes to two parallel adders whose outputs the circuit then adds. The first design uses three combinatorial-logic levels, and the second design uses two.

  Listing 3a
PROCESS (a,b,c,d)
BEGIN
  -- Evaluated left to right: ((a+b)+c)+d
  sum<=a+b+c+d;
END PROCESS;
 
  Listing 3b
PROCESS (a,b,c,d)
BEGIN
  -- a+b and c+d can be computed in parallel, then added
  sum<=(a+b)+(c+d);
END PROCESS;
 

Also, CASE statements infer an equal-priority parallel comparison, whereas IF...THEN... statements imply a prioritized serial design. A CASE statement often significantly compresses synthesized logic levels and boosts speed, especially if it can fit within a single FPGA look-up table or multiplexer, for example.
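
The contrast can be seen by writing the same four-way selection both ways, as in the sketch below. The entity and port names are hypothetical. A good compiler may reduce both processes to the same multiplexer when the conditions are mutually exclusive, as they are here, so the difference in results depends on the tool.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical comparison of IF...ELSIF and CASE descriptions of one function.
ENTITY mux_styles IS
  PORT (sel          : IN  std_logic_vector(1 DOWNTO 0);
        a, b, c, d   : IN  std_logic;
        q_if, q_case : OUT std_logic);
END ENTITY mux_styles;

ARCHITECTURE rtl OF mux_styles IS
BEGIN
  -- IF...ELSIF describes a priority chain: each test applies
  -- only if all earlier tests have failed
  PROCESS (sel, a, b, c, d)
  BEGIN
    IF sel = "00" THEN
      q_if <= a;
    ELSIF sel = "01" THEN
      q_if <= b;
    ELSIF sel = "10" THEN
      q_if <= c;
    ELSE
      q_if <= d;
    END IF;
  END PROCESS;

  -- CASE describes one parallel selection with equal-priority branches
  PROCESS (sel, a, b, c, d)
  BEGIN
    CASE sel IS
      WHEN "00"   => q_case <= a;
      WHEN "01"   => q_case <= b;
      WHEN "10"   => q_case <= c;
      WHEN OTHERS => q_case <= d;
    END CASE;
  END PROCESS;
END ARCHITECTURE rtl;
```

The priority form becomes more costly when the IF conditions test different, unrelated signals, because the compiler then has no way to prove that they are mutually exclusive.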

A balancing act

Given enough time, knowledge, and effort, you can almost always compile and fit a design that is smaller and faster and that consumes less power than one an automated tool generates. Having said that, the 100% manual-fitting approach is almost never the option you should choose because of fundamental time-to-market reasons. Subdue the ever-present engineering temptation to "do it yourself" or "tweak it just a little bit more," and let your timing-driven automated tools help you complete your project on schedule and under budget. Just as with the HDL-compilation examples discussed earlier, you can influence your tools to produce the kinds of results you need. Also, the tools themselves are moving from their ASIC-centric foundations toward more programmable-logic-aware approaches. The tools now tend to compile directly to a vendor's specific architecture instead of to ASIC-like, two-input NAND gates, which would then rely on the vendor's mapper to reconstruct the coarse-grained programmable-logic cell.

Different tools use different techniques to specify design parameters. Two possible methods are to include the desired parameters along with the logic element in the schematic (Figure 7) or to define various options in the tool's configuration dialogue box before compilation or fitting (Figure 8). Probably the most common approach and the one that almost all tools support is the use of scripts, or initialization files. Regardless of the technique, you can influence variables such as the following, often on a component, logic-module, or even individual-gate basis (see vendor documentation for information specific to your tool suite):

  • The effort (for example, compilation time) that the tools should expend in optimization,
  • The number of compilation runs,
  • Coarse-granularity options, such as "optimize for speed" or "optimize for gate count,"
  • The logic-output drive capability and slew rate,
  • The logic-output maximum fan-out,
  • The maximum allowable signal-propagation delay,
  • I/O levels (CMOS/TTL) and voltages (3/5V),
  • Pullup- and pulldown-resistor values,
  • The maximum allowable percentage usage of available resources,
  • Logic-mapping control,
  • Logic pin-locking, and
  • Logic-relative and "hard" placement locking.

Modifying the design in this way can produce significant positive results, up to a point. Beyond this threshold, additional investment often produces diminishing returns. Your silicon and tool vendors can advise you on which options deserve your time and effort. If you standardize on common tools and devices, the experience of multiple design projects is also an effective teacher.

Sometimes, more drastic surgery is necessary. Considering the less-than-perfect abilities of today's tools to accurately infer your design intent, you might consider a partially manual-fitted approach. This option is especially attractive when your design contains performance-critical and structured datapath elements that minimally interact with other circuitry (or not at all). Floorplanning your design involves conceptualizing where you'd like the fitter to locate various logic blocks within the device--such as close to inputs to quickly respond to a stimulus or close to outputs to guarantee fastest possible clock to output (important in PCI designs, for example) (References 8 and 9). Sometimes, you can even bypass an FPGA's input registers and directly drive a signal through the buffer and into the core to save a few nanoseconds. In this case, make sure that the corresponding logic block is physically close to the input pin, or your hard work will be for naught.

Defining pins where necessary or where obvious from a floorplanning exercise and leaving other definitions to the compiler for flexibility often provide significant benefits. SPLD architectures usually force at least some of the input- and output-pin definitions. However, you can configure some pins as either inputs or outputs. With CPLDs and FPGAs, you have even more pin-definition flexibility. At this point, look at the internal device architecture. Some PLD macrocells have more product-term inputs than others, and some macrocell outputs can feed back to the interconnect, whereas others cannot. Some FPGA architectures are symmetrical, whereas others, via the layout of their registers and routing resources, naturally encourage a certain data flow (usually, from one side of the chip to another, with control signals making up the remainder of the device).

Use the expertise of others

As an engineer, your job is to understand the subsystem or system that you're designing, not necessarily every nuance of every component you use in it. Programmable-logic vendors, on the other hand, understand their silicon in abundant detail. Working with design-software vendors, they can produce amazing results. That is the point of the industry-standard Library of Parameterizable Modules (LPM). You specify generic elements, such as discrete-logic gates, counters, adders, shift registers, multiplexers, and memory blocks, in your design, and during compilation, the company's fitter replaces these elements with silicon-optimized circuits. Often, a vendor creates multiple versions of the same circuit, with one optimized for performance, another optimized for area, and a third taking a middle-of-the-road approach, for example. You control which version the fitter uses via the previously described method. Utilities, such as Actel's ActGen macro builder and Lucent's SCUBA (Synthesis Compiler for User-Programmable Arrays), automate the creation of higher-complexity logic functions from these lower-level elements, optimized for speed, area, or some prioritized combination, and you can instantiate the results into either schematics or HDL code.

For even more complex functions, such as PCI initiators or targets, USB interfaces, and even entire CPUs, consider cores provided by either your silicon vendor or a third party. Often referred to as "intellectual property" (IP) or a variety of other vendor-specific names, cores are completed and verified designs with interface "hooks" for your logic. Cores can be "soft" (implemented in HDLs or schematics), "hard" (already compiled and laid out and added to your design in floorplanning), or "firm" (an intermediate step implemented in netlist form).

Hard cores tend to deliver the highest performance but the least flexibility and ease of integration, soft cores are slower but easier to work with, and firm cores take a middle-ground position. However, to guarantee functionality and performance, core providers prefer that you standardize on one vendor's silicon and require this standardization with hard cores (References 10 and 11). Performance also depends on how symmetrically a vendor routes its logic array and, if asymmetrical, where you place the core.

Look beyond the chip

If you've followed all the suggestions so far and you're still a few nanoseconds short, it's time to lift your head out of the one-chip details and look for system-related optimizations. As noted earlier, you can partition one state machine into several smaller and faster alternatives. Sometimes, this approach works at the chip level as well. As a general rule, smaller and less complex chips perform faster than their bigger, more complex counterparts. One example when the smaller/faster approach works well is when you have several sections of parallel logic performing identical functions and only minimally interacting with each other, analogous to very-long-instruction-word (VLIW) processors. Note, however, that the timing delays if you go off-chip with signals, such as the carry-in and -out functions of parallel arithmetic units, can be significant.

Another scenario in which using more chips is better might be the partitioning of functions into optimized devices, such as putting a small, fast decode unit into an SPLD, a controlling state machine into a CPLD, and a data-flow engine into an FPGA. Balance the performance you might gain from this approach against factors such as power consumption, reliability, total cost, and board space before heading too far down this path.

Minimize fan-out to minimize output-transition times. For example, don't run a CPLD's decoded READ output to the inputs of 20 µP peripheral devices. Sometimes, you can minimize each output's fan-out within a programmable-logic device by routing an internal signal to multiple outputs. (Ensure that you duplicate sufficient internal circuitry to avoid excessive signal loading.) Otherwise, look for opportunities to run an output through several external parallel buffers to lower the loading on any one path. Device packages might also have slightly different thermal and bond-wire- or lead-frame impedance characteristics, which affect performance.

If necessary, consider alternatives to programmable-logic devices. Chip Express' (Santa Clara, CA) Quick Laser LPGAs (laser-programmable gate arrays) are true gate-array ASICs, with circuits formed via laser trimming of a generic, fully fused silicon array instead of a semiconductor process. You can't program Quick Laser LPGAs at your desktop, and the fastest possible turnaround is 24 hours (not minutes), but they do offer inherently faster performance than any programmable-logic device and are particularly interesting if you're eventually going to move to ASICs anyway.

Like any other semiconductor, a programmable-logic device's performance depends on variables such as voltage and temperature. Data sheets specify minimum or maximum values or both for various specifications across ranges; if you can tighten the tolerances, the parts usually run faster. However, some vendors do not guarantee your results at these altered ranges, thereby avoiding additional testing steps for the vendors that would impact device cost for all customers. This technique is a bit like overclocking a processor: It might work sometimes, but you'll probably eventually get caught. You can always characterize the parts yourself or have a company that provides this service do so for you, but there's often a significant yield loss that you still have to pay for. In most cases, the results do not justify the investment.

Real engineering, as opposed to pure R&D, naturally involves trading off criteria to achieve a balanced and optimized result. Specifically, focusing exclusively on performance often causes you to overlook other important design objectives. From a synthesis angle, the HDL approach becomes more appealing and often essential as gate counts rise and time-to-market pressures increase. A balance between schematics for performance-critical sections of a design and HDL for the remainder of the design, plus time invested in learning an HDL if you haven't yet got around to it, will pay dividends down the road. As tools grow more capable over time, the balance will increasingly tilt toward HDL-only design.

As a general rule, power consumption increases with corresponding increases in gate count and frequency. PLD and FPGA power is a complex and hard-to-predict combination of the number of transistors switching at a clock edge and the power burned in driving signals, especially clocks, down routing paths. Programmable-logic manufacturers, like their µP and memory counterparts, encourage and in some cases require their customers to transition to lower voltages to keep power consumption manageable. You can often configure advanced programmable-logic architectures to consume little power when in standby, but the usefulness of this feature depends on your application and sometimes involves a speed trade-off.

What happens when your design changes in response to simulation or prototype testing results or as a result of the invariable last-minute requests from marketing? If you've packed your programmable-logic device almost full, a recompilation might not change things at all. More realistically, though, rerouting the design causes negative results in performance or pinout change or both, which means headaches. Do you want to take the chance? To prepare for the inevitable, don't allow a design to consume a large percentage of a device's resources, and make sure that the design has enough performance headroom to tolerate some degradation. Saving pennies by selecting a smaller or slower part now might force more substantial redesigns, which cost more in time, effort, and money down the road. The amount of resource and performance headroom you require depends on a device's routing capability and efficiency, so let practice and the experience of those who have gone before you be your guide. If possible, you should also select devices that have larger gate-count family members available in identical packages and pinouts, just in case you need the additional resources in the future.

You need to walk the tightrope between using the sole-sourced capabilities of a vendor's architecture and choosing a more generic, conservative, multisourced design approach. In reality, you'll more often choose the former route. (If you don't, your competition probably will.) This situation is especially true with CPLDs and FPGAs, but blind reliance on each and every feature of one product in one vendor's portfolio may be unwise. HDL techniques help make your design portable to multiple silicon platforms, but your results might vary widely, depending on the degree to which each company's fitter uses unique product options. Also, every time you instantiate a vendor's module or core into your design, you're making your design less portable to others' devices.

Architecture optimizations

Identifying greater performance as an increasing need of their customers, programmable-logic suppliers have developed some interesting options. The stringent requirements of PCI were notable in driving improvements in performance, which came about thanks to silicon-design changes and process-lithography shrinks. For PLDs, the fundamental specification areas that define performance include input-to-output propagation delay (tPD), input setup to clock (tS), input hold after clock (tH), and clock to output (tCO). With FPGAs, performance is harder to decipher and predict from a data sheet; usually, you need to compile and simulate a design to see where you are and then tweak the device and design as required. Sometimes an FPGA's speed bin refers to the maximum toggle frequency, sometimes it refers to the propagation delay through a look-up table, and sometimes it doesn't directly correlate to any one device characteristic.

When interpreting device specifications, a little research into the assumptions involved can be instructive. Operating-frequency specifications are especially confusing, because the vendor often quotes toggle frequency using a circuit that is little more than a tightly coupled shift register or an internal flip-flop output feeding back into its input, an uncommon implementation in the real world. Even when the specifications include clock-to-output and input-setup times, the vendor often assumes a 0-nsec external trace-propagation delay. Specified input-to-output delays sometimes bypass the internal logic array. (This approach is valid in devices that include logic as part of the input or output buffer or both.) With PLDs, vendors often assume a best possible design with no product-term sharing or output feedback. Finally, look at the assumed operating-temperature and voltage ranges and the specified output capacitive loading. Timing specified with a 5-pF load is usually not applicable if your device's output is driving four inputs!

With appropriate test conditions understood, speed also varies with voltage, gate count, and complexity. Lattice Semiconductor's GAL 16V8D-3 offers a 3.5-nsec tPD, 3-nsec tCO, and 2.5-nsec tS, whereas other SPLDs commonly offer 5-nsec or slower tPD. Altera is shipping 6-nsec-tPD versions of its MAX7000 devices with as many as 1250 usable gates, and the entire family is no slower than 7.5 nsec throughout the maximum 5000-gate range. Altera's upcoming "A" versions of the MAX9000 family are targeting similar performance levels. Xilinx reports that its XC4000E-1 FPGA family is 25% faster than its preceding -2 family speed bin in many designs, with a maximum operating frequency specified at greater than 80 MHz.
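To see how much a data-sheet assumption can matter, the sketch below computes a device-to-device clock rate from clock-to-output, trace, and setup delays. The tCO and tS values are the GAL 16V8D-3 figures quoted above; the 2-nsec trace delay is a hypothetical board-level number chosen only for illustration, and the function name is likewise an assumption.

```python
def max_external_clock_mhz(t_co_ns, t_su_ns, trace_delay_ns=0.0):
    """Register-to-register rate between two devices: 1 / (tCO + trace + tS)."""
    period_ns = t_co_ns + trace_delay_ns + t_su_ns
    return 1000.0 / period_ns

# tCO = 3 nsec and tS = 2.5 nsec, as quoted for the GAL 16V8D-3.
ideal = max_external_clock_mhz(3.0, 2.5)  # 0-nsec trace, as data sheets often assume
realistic = max_external_clock_mhz(3.0, 2.5, trace_delay_ns=2.0)  # hypothetical 2-nsec trace

print(f"ideal: {ideal:.1f} MHz, with trace delay: {realistic:.1f} MHz")
```

Even this modest trace delay removes roughly a quarter of the headline frequency, before you account for temperature, voltage, or output loading.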

Philips has developed a fairly revolutionary architecture, called the extended PLA (XPLA), for the company's re-entry into the PLD market (Figure 9). XPLAs combine the standard PAL array with a flexible PLA, whose product terms you can add (OR) on an individual basis to any macrocell as needed. PLA terms add 2 nsec to the propagation delay (not cumulative when you include additional terms). However, this approach can sometimes be faster than the assumed alternative design that incorporates output feedback, product-term sharing, or both to achieve desired logic complexity. In the FPGA world, Atmel's devices are notable for their register richness at the expense of per-register product terms. Arithmetic-intensive designs with easily routed datapaths and minimal random logic are best suited for such an architecture. Many vendors' FPGAs also target traditional PLD strengths by incorporating dedicated wide and fast decode circuitry, either within the internal logic array or at the device periphery.

From a technological perspective, Actel's and QuickLogic's antifuse FPGAs tend to deliver a performance boost over SRAM- and flash-based alternatives built on a comparable lithography. Antifuse interconnects have lower impedance than pass transistors do, and the interconnects are smaller, allowing comparatively more routing resources at a given gate count. Because clock routing is so crucial to performance, programmable-logic vendors focus a great deal of effort here. In its upcoming ORCA 3C FPGA family, for example, Lucent Technologies will provide not only horizontal and vertical traces, but also diagonal interconnect to route clock signals throughout the device as fast as possible. Other specialized routing resources on some programmable-logic devices include dedicated carry and set-reset networks.

Higher end CPLDs and FPGAs are beginning to integrate PLLs. These circuits have several interesting applications for boosting performance. First, you can resynchronize and therefore minimize clock delay from the device input through the input buffer and interconnect and finally to an internal circuit's registers. This technique, assuming comparably fast routing of other input signals, relaxes system setup-and-hold requirements and boosts apparent device performance. You can also use PLLs to reduce skew either between an internal register's clock path and input datapath or between the clock paths of multiple internal registers (Figure 10). Finally, just as with µPs, you can use a PLL to internally multiply an externally applied clock. This technique might allow you, for example, to use a slow external oscillator for interfacing to other peripherals or to eliminate high-frequency noise-emission concerns and still operate a fast DSP core within the programmable-logic device. An internal multiplier might also allow you to more efficiently use available logic resources (a 16-bit circuit running twice as fast as its slower and higher gate-count, 32-bit alternative, for example).
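The 16-bit-versus-32-bit trade-off can be sketched in software. The model below is a behavioral illustration only, not any vendor's circuit: it reuses one 16-bit adder over two cycles of a doubled internal clock to produce a 32-bit sum, holding the carry in a register between cycles. The function name and the test values are assumptions.

```python
MASK16 = 0xFFFF

def add32_on_16bit_datapath(a, b):
    """Add two 32-bit values with a 16-bit adder used over two fast clock cycles."""
    # Fast cycle 1: add the low halves with no carry-in.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16  # carry bit saved in a register for the next cycle
    # Fast cycle 2: add the high halves plus the saved carry.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    # Reassemble the 32-bit result; any carry out of bit 31 is discarded.
    return ((high & MASK16) << 16) | (low & MASK16)

result = add32_on_16bit_datapath(0x0001FFFF, 0x00000001)
print(hex(result))
```

The narrower adder costs fewer gates, and the PLL-multiplied clock recovers the throughput, provided the 16-bit path can actually meet timing at the doubled rate.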

A variety of options exists to optimize programmable-logic performance. Admittedly, covering this breadth of options comes at the expense of depth in any one area. Each design is somewhat different, and each silicon architecture and tool is unique. To tailor the general concepts to your situation, first ignore the temptation to "dive in," and spend a little time upfront researching what resources are available.

When evaluating prospective silicon and tool vendors, look for implementation flexibility that you can easily and quickly use for your benefit. Many times, if you supply a vendor with a portion of your design, the vendor will compile the design in its architecture and let you know the performance and other results. Vendors also often provide high-level tools for rough speed, gate count, and power estimation before detailed implementation. Don't overlook the potential of evaluation boards, if necessary, to give you more in-depth information.

After selecting your vendors, look for design suggestions and sometimes even full-blown reference designs that can apply to your situation. Find out what design-tool settings produce the most significant improvement for the types of circuits you develop. Companies are substantially beefing up their Web-site contents for 24-hour access to documentation, technical-support-database contents, and other useful information. Use the work of others when you can to complete your design more quickly and to avoid making mistakes.

Sometimes, a broader perspective is also useful. The Programmable Logic Jump Station (pw1.netcom.com/~optmagic) has links to a variety of industry resources, including user-developed documentation and consultation services. Internet newsgroup comp.arch.fpga provides effective data-sharing and feedback from those who have already solved the problems you're facing.


References

  1. Awalt, Rockland K, "Achieving timing requirements in FPGAs," Memec Design Services, 1997 Design Supercon, On-Chip System Design Conference.
  2. Maxfield, Clive "Max," "The ouroboros of the digital consciousness: linear-feedback-shift registers," EDN, Jan 4, 1996, pg 135.
  3. Rangasayee, Krishna, "Complex PLDs let you produce efficient arithmetic designs," EDN, June 20, 1996, pg 109.
  4. Maxfield, Clive "Max," "Deus ex machina state machines: one-big or many-small?" EDN, Nov 9, 1995, pg 129.
  5. Wasson, Stephen L, "High-speed state machine design," Integrated System Design, July 1995.
  6. Jian, Jim, "QuickNote 51: Advanced VHDL design techniques," QuickLogic Corp, Sunnyvale, CA, 1996.
  7. "HDL synthesis in FPGAs," Xilinx Corp, San Jose, CA.
  8. Wasson, Stephen L, "Floorplanning Xilinx FPGA designs for high performance," HighGate Design Inc, 1995 Design Supercon, On-Chip System Design Conference.
  9. Von Herzen, Brian, "Digital signal processing at 250 MHz using high-performance FPGAs," Rapid Prototypes Inc, 1997 Design Supercon, High-Performance System Design Conference.
  10. Lipman, Jim, "The hard facts about soft cores," EDN, Sept 2, 1996, pg 38.
  11. Grosse, Debora, "FPGAs: a matter of cores," EDN, March 27, 1997, pg 61.

Acknowledgment

Stephen Wasson from HighGate Design Inc, a programmable-logic-design consultation company, was an invaluable source of both detailed information and extensive feedback during the development of this article.


  • Achieving programmable-logic-performance goals requires silicon and tools research, creative thinking, and evaluating trade-offs to achieve a balanced result.
  • Don't leave optimization solely to your tools; understand your device-architecture characteristics and tailor your design accordingly.
  • Boosting performance often involves removing excess logic, but sometimes it requires that you add gates.
  • Traditionally, schematic entry has produced the most efficient designs, but programmable-logic-aware HDL tools are quickly catching up.
  • Programmable-logic process improvements and architecture enhancements help you achieve your goals.
 

An acronym smorgasbord

Categorizing programmable-logic devices as simple PLDs (SPLDs), complex PLDs (CPLDs), FPGAs, or some other label can be a difficult task. Different people define different categories in different ways. The situation is analogous to the µP world; is a Pentium processor a RISC CPU, a CISC CPU, a little bit of both, or none of these? Like µPs, each vendor's programmable-logic architecture differs in slight or substantial ways from any other and yet simultaneously overlaps with other products both in the vendor's and in other companies' product lines. Although HDL-design synthesis is beginning to help isolate you and your design from the low-level architectural details, a basic understanding of the types of programmable logic can help you predict how design-implementation options compare in performance, power consumption, gate count, and other factors.

An SPLD is an integrated collection of macrocells (Figure 1). Each macrocell typically consists of a wide AND/OR structure (known as coarse-grained logic) followed by an optional flip-flop that you can configure to one of several types (D, T, etc). Most SPLDs also support programmable output polarity and internal-signal feedback to the same or other macrocells, which can widen the product-term implementation at the expense of speed. SPLDs also combine dedicated inputs, dedicated outputs, and programmable input or output pins to suit the needs of various designs. SPLDs are also known as programmable array logic (PAL) or generic array logic (GAL). They offer highly predictable timing as fast as 3.5 nsec on some devices, and most design engineers understand the development tools well. (You may even have used the Abel programming language in college.) Effective gate counts range as high as 500 or so, with as many as 28 pins.
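The macrocell's wide AND/OR structure is easy to model behaviorally. The following sketch is a simplified illustration rather than a description of any particular device: it evaluates a sum of product terms and applies programmable output polarity, and it leaves out the optional flip-flop and feedback paths. The signal names and the address-decode example are hypothetical.

```python
def macrocell(inputs, product_terms, invert_output=False):
    """Evaluate one macrocell's combinatorial output.

    inputs: dict mapping signal name -> bool
    product_terms: list of AND terms; each term is a list of
                   (signal name, active_high) literals
    """
    def and_term(term):
        # A product term is true only when every literal matches its polarity.
        return all(inputs[name] == active_high for name, active_high in term)

    sum_of_products = any(and_term(term) for term in product_terms)
    # XOR with the polarity bit models programmable output polarity.
    return sum_of_products != invert_output

# Hypothetical decode: select = (a15 AND a14 AND NOT a13) OR cs_override
terms = [[("a15", True), ("a14", True), ("a13", False)], [("cs_override", True)]]
address_hit = {"a15": True, "a14": True, "a13": False, "cs_override": False}
address_miss = {"a15": True, "a14": False, "a13": False, "cs_override": False}

hit = macrocell(address_hit, terms)
miss = macrocell(address_miss, terms)
active_low_hit = macrocell(address_hit, terms, invert_output=True)
print(hit, miss, active_low_hit)
```

Because every product term is evaluated in parallel in one pass through the array, a wide decode like this costs a single tPD, which is the source of the predictable timing noted above.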

Early PALs were bipolar, which offered speed at the expense of high power consumption, but most of today's devices are CMOS. The programmable elements can be PROM, EPROM, EEPROM, or flash memory. Lattice Semiconductor pioneered the concept of in-system programming (hence, the acronym switch from PAL to GAL), but today, almost all vendors offer some form of this capability. Perhaps the most popular SPLD is the ubiquitous 22V10, which many vendors supply either in its generic form or with various superset enhancements.

CPLDs (Figure A), simply described, are collections of SPLD structures, combined on one die and interconnected by a central multiplexer or switch matrix that routes signals to macrocells from device inputs and other macrocell outputs. At higher gate counts, most CPLDs do not provide 100% routing of all signals to all macrocells (to minimize interconnect complexity and the resultant impact on device performance and cost). Instead, CPLDs rely on statistical assumptions of device usage in implementing both the interconnect structure size for a given number of macrocells and inputs and the maximum number of signals that can enter each grouping of macrocells. CPLDs tend to provide less product compatibility among vendors than SPLDs do, although, in some cases, superset features proliferate and diverge from a common base functionality.

With the exception of Altera's FLEX (Flexible Logic Element Matrix) architecture, CPLDs typically provide densities measured in thousands or a few tens of thousands of gates, and they also provide fewer registers at a given gate count compared with FPGAs. CPLD timing is highly predictable, thanks to relatively low routing complexity, and the SPLD foundation provides a rather simple learning curve for engineers who cut their teeth on PALs. CPLD designs compile quickly compared with their FPGA counterparts. The macrocell structure of both SPLDs and CPLDs makes them most attractive in applications that involve relatively complex decoding logic, counter-based state machines, or both.

The common atomic element of all FPGAs is a multi-input look-up table (LUT) or multiplexer, followed by a flip-flop (usually a D-type) (Figure 2). This structure repeats itself throughout the FPGA array, connected to other LUT and register combinations via multilevel segmented routing traces. You can use antifuse links, SRAM cells, and flash cells to configure LUTs, registers, and interconnects. Antifuse links, offered by Actel and QuickLogic, consume the least silicon area of the three approaches on a given lithography, have the lowest impedance (boosting interconnect performance), and offer high protection from reverse-engineering and copying attempts. However, antifuses are one-time-configurable, require high programming voltages, and typically have 1 to 2% yield loss during programming, thus limiting them to offboard configuration.
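A behavioral model makes the LUT-plus-register cell concrete. The sketch below is a simplified, hypothetical logic cell, not any vendor's actual element: 16 configuration bits define an arbitrary four-input function, and a clock edge captures the LUT output in a D flip-flop. The class and method names are assumptions.

```python
class LogicCell:
    """Four-input LUT feeding a D flip-flop (simplified, hypothetical model)."""

    def __init__(self, truth_table):
        if len(truth_table) != 16:
            raise ValueError("a four-input LUT needs 16 configuration bits")
        self.truth_table = list(truth_table)
        self.q = 0  # flip-flop state

    def lut(self, a, b, c, d):
        """Combinatorial output: the inputs simply address the configuration bits."""
        index = (d << 3) | (c << 2) | (b << 1) | a
        return self.truth_table[index]

    def clock(self, a, b, c, d):
        """Rising clock edge: register the current LUT output."""
        self.q = self.lut(a, b, c, d)
        return self.q

# Configure the LUT as a four-input XOR (parity) by filling in its 16 bits.
parity_bits = [bin(i).count("1") & 1 for i in range(16)]
cell = LogicCell(parity_bits)
print(cell.lut(1, 0, 0, 0), cell.lut(1, 1, 0, 0))
```

Any function of four inputs costs the same single LUT delay, whereas a function of five or more inputs must span several cells and the routing between them, which is where FPGA timing becomes hard to predict.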

SRAM-based FPGAs have the longest vendor list. Xilinx was the first and is today the largest in this segment, with Actel, Atmel, Lucent Technologies, and Motorola joining and Cypress and Vantis soon to follow. SRAM cells take up the most silicon area of the three approaches and require an external source for the configuration information. However, their unlimited onboard reprogrammability is useful on the prototyping bench, in the manufacturing line, and during normal system operation in so-called reconfigurable-computing applications. Flash memory, offered by Gatefield, in many respects occupies the middle ground between antifuse and SRAM. Flash-memory cells are smaller than SRAM cells, but the significantly slower programming times and finite cycling limit flash memory's application base.

Because of the more complex routing architecture, FPGA performance tends to be less predictable than that of CPLDs, although an optimized FPGA design can outperform its CPLD counterpart in certain cases. The register-rich but comparatively logic-sparse (or fine-grained) FPGA structure is most appropriate in data-flow and -manipulation applications, such as DSP, in which you often find circuits such as FIFO buffers, memory blocks, accumulators, and multipliers. FPGA-development tools are usually flexible, but this flexibility can also be intimidating to the first-time PLD user. On the other hand, ASIC designers, used to an extremely flexible sea-of-gates architecture, may find FPGAs too restrictive at first inspection.

Recently, Actel began marketing the term "SPGA" (system-programmable gate array), and Lucent's counterpart is the "FPSC" (field-programmable system on a chip). SPGA can generically apply to any high-gate-count programmable logic, implying a device that enables system-on-a-chip designs. Specifically, SPGA refers to combining gate-array or standard-cell ASIC logic and FPGA circuitry on the same die. The SPGA is an intriguing evolutionary concept whose benefits include higher performance for circuitry implemented in the ASIC gates (such as a 66-MHz PCI initiator or target). Because ASIC circuitry takes up less silicon area than any FPGA counterpart does, SPGA devices potentially could cost less than FPGA alternatives. However, cost is a function of many variables: die size, manufacturing complexity, volume, etc. SPGAs might also be subject to the same minimum-order quantities and leadtimes of ASICs unless the industry standardizes a few common, identical cores.

Now, what about Altera FLEX? This architecture (Figure B) is in many ways a hybrid of CPLDs and FPGAs. Like CPLDs, FLEX devices have low-complexity, continuous-signal-routing structures, and, like FPGAs, the basic logic element is a four-element LUT followed by a register. Altera calls FLEX devices "CPLDs" to stress the predictable timing and quick design compilation that the routing arrangement provides. Many design engineers and consultants refer to FLEX devices as "FPGAs," reflective of their finer grained logic. Look at many industry-analyst reports, and you see the CPLD (and FPGA) market-size and vendor market-share numbers calculated twice, both with and without Altera's FLEX devices included. To use another µP analogy, Intel (Santa Clara, CA) selected portions of competitive RISC architectures as it evolved beyond the pure CISC Intel386 CPU. With FLEX, Altera seems to be following a similar development path, as is Lattice with its ispLSI 6000 series. A year or so from now, these two companies will probably not be the only ones blending CPLD and FPGA concepts.

Andraka's top-10 list

The One-Hot state machine uses an FPGA's abundant registers (courtesy Memec Design Services).

Ray Andraka is the chairman of the Andraka Consulting Group, which specializes in CPLD and FPGA designs. He has compiled a David Letterman-style top-10 list for FPGAs, which, from his experience, are more performance-challenging than their CPLD counterparts.

According to Andraka, "FPGAs can be made to run considerably faster than the CPLDs, provided everything is done right. CPLD timing is relatively fixed by the architecture, while the FPGA permits much more complicated designs and may be faster." Here are Andraka's top 10 things a designer should do for best performance from an FPGA.

  10. Minimize use of internal three-state buses. They tend to be slower than the logic and constrain placement.
  9. Use the timing constraints available in automatic place-and-route tools, even with floorplanned designs. The timing constraints help optimize the automatic route.
  8. Duplicate logic to reduce fan-outs and long routes, especially on critical signals.
  7. Pipeline wherever practical. Redefine the sequence in which things happen if you need to (for example, decode a state ahead or insert extra states).
  6. Floorplan the design. The human brain is still the best placement engine available.
  5. Arrange the logic in multilevel combinatorial stuff (circuits) so the critical signals pass through only one level.
  4. Level-compress. Keep the number of combinatorial levels (delay) between flip-flops to about half the clock period. The other half is for routing.
  3. Consider alternative design approaches. The most obvious approach is often not best. (Editor's note: This one is my favorite.)
  2. Don't synthesize logic in performance- or density-critical designs. Synthesized designs typically have about half the performance and density of a carefully handcrafted design.
  1. Tailor the design to the architecture. If, for example, the architecture elements are four or five input-logic blocks, design using four inputs ahead of each flop. When possible, use fewer than the maximum number of inputs to avoid route congestion. The more familiar you are with the architecture and the tools, the better you can do.
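Item 4 in the list above lends itself to a quick budget calculation. The sketch below assumes a hypothetical 2-nsec delay per logic level and applies the half-for-logic, half-for-routing rule of thumb; the function names and delay value are illustrative assumptions, not data-sheet numbers. It also estimates how many pipeline stages (item 7) a deeper path would need.

```python
import math

def max_logic_levels(clock_mhz, delay_per_level_ns, logic_fraction=0.5):
    """Levels of logic that fit in the share of the clock period reserved for logic."""
    period_ns = 1000.0 / clock_mhz
    logic_budget_ns = period_ns * logic_fraction  # the remainder is left for routing
    return int(logic_budget_ns // delay_per_level_ns)

def pipeline_stages_needed(path_levels, levels_per_stage):
    """Stages required to break a deep combinatorial path into legal pieces."""
    return math.ceil(path_levels / levels_per_stage)

# Hypothetical 2-nsec delay per logic level.
levels_at_50mhz = max_logic_levels(50, 2.0)    # 20-nsec period, 10 nsec for logic
levels_at_100mhz = max_logic_levels(100, 2.0)  # 10-nsec period, 5 nsec for logic
stages = pipeline_stages_needed(6, levels_at_100mhz)

print(levels_at_50mhz, levels_at_100mhz, stages)
```

Doubling the clock rate more than halves the allowable logic depth in this example, which is why level compression and pipelining tend to go hand in hand.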


What ever happened to PREP?

In 1991, a consortium of programmable-logic companies (Actel, Atmel, Cypress, Lattice, Lucent, QuickLogic, and Xilinx) and interested users formed the Programmable Electronics Performance Corp (PREP). The group's purpose was to develop a series of benchmarks to evaluate programmable-logic-device logic capacity (gate count) and performance. Usable programmable-logic resources depend not only on the architecture foundation and amount of available logic but also on the type of design implementation. For this reason, PREP standardized the following circuits that it believed represented low-level functions commonly found in programmable-logic designs (see the PREP Web site, www.prep.org, for more details):

  • 8-bit multiplexer/register/shift-register datapath,
  • 8-bit loadable up-counter,
  • Eight-state, 13-transition finite-state machine,
  • 16-state, 40-transition state machine,
  • 4-bit multiply/accumulator,
  • 16-bit resettable accumulator,
  • 16-bit loadable up-counter with asynchronous reset,
  • 16-bit fully synchronous loadable counter,
  • Memory-mapped address decoder with bus error.

Each vendor was free to implement each circuit as that vendor saw fit, as long as the circuit met the functional definitions in the PREP specification. To evaluate logic density, vendors "stepped and repeated" each circuit as many times as it would fit into a device. With each repetition, vendors also calculated the maximum internal and external (adding clock to output and input setup to clock) operating frequencies based on worst-case device specifications. The final performance benchmark reported the best, worst, and average of these values.

Sounds good, right? PREP was in fact a pretty good indicator, although not a definitive ruling, on a device's size and performance. (In fact, PREP was intended to be only a pretty good indicator.) But by late 1994, the PREP committee on programmable-logic benchmarks had essentially disbanded, and today, most vendors no longer report PREP results for their products. (Work on benchmarking EDA tools continues.) What happened?

The reasons for this falling-out are numerous and somewhat complex. Some vendors and users felt that the symmetrical step-and-repeat-circuit nature of the benchmarks did not accurately approximate real-life implementations. Real designs combine a variety of circuits along with assorted random logic, all of which tend to require additional interconnect resources, impacting performance, especially for FPGAs.

Some vendors, especially those that didn't score highly in the PREP benchmarks, also felt that the chosen circuits favored alternative architectures. A few vendors went so far as to accuse their counterparts of designing elements of their architectures to maximize PREP scores, as has also been rumored for graphics-chip-set and µP benchmarks. Given that PREP circuits represented those circuits you commonly find in designs, though, even if the accusations were true, it's unclear whether what the vendors were doing was wrong. On the other hand, some vendors that provided specialized features, such as fast decoders, internal three-state buffers, and dedicated RAM blocks, felt that the PREP benchmarks ignored these optimizations. The circuit specifications also contained loopholes that some vendors found and used to limit the amount of stepped-and-repeated logic.

Some engineers also remember (with no shortage of chuckles) that each vendor quickly took the opportunity to "spin" the initial PREP results in its favor. Unlike the µP SPECint and SPECfp benchmarks, for example, PREP took the form of a somewhat large and intimidating set of numbers. To simplify matters, some vendors communicated a subset of the PREP results (not coincidentally, often those results that showed the vendors' products in the best light), whereas others averaged the numbers. Unfortunately, there was no standardization in the averaging method. This lack of standardization proved to be more confusing than helpful for many engineers, who didn't have the time to sort through the incompatible number sets and extract meaningful comparisons that applied to their situations.
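The averaging problem is easy to demonstrate. The sketch below applies three common averaging methods to one set of made-up operating frequencies; the numbers are hypothetical and are not PREP results. It relies on Python's standard statistics module (geometric_mean requires Python 3.8 or later).

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-circuit operating frequencies (MHz) for one device; not real PREP data.
benchmark_mhz = [40.0, 50.0, 80.0, 100.0]

summary = {
    "arithmetic": mean(benchmark_mhz),
    "geometric": geometric_mean(benchmark_mhz),
    "harmonic": harmonic_mean(benchmark_mhz),
}

for method, value in summary.items():
    print(f"{method}: {value:.1f} MHz")
```

The same four results yield three different "average" frequencies, with the arithmetic mean the most flattering, so two vendors quoting averages computed differently are not quoting comparable numbers.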

Finally, the programmable-logic vendors, which are experts in their architectures and in optimizing circuits for those architectures, calculated the benchmarks. They also often chose different design methods, such as Boolean equations, schematics, or HDL synthesis, to optimize their results. It's debatable how closely an average engineer, who substitutes system-knowledge breadth for programmable-logic depth, could achieve the same optimizations, especially given limited system-development time.


Brian Dipert, Technical Editor

You can reach Brian Dipert at 1-916-454-5242, fax 1-916-454-5101, edndipert@worldnet.att.net.



Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.