"Will my design fit into this part?" This question is one of the fundamental issues confronting you when you evaluate programmable-logic suppliers. (Other pertinent questions include: "How much will it cost?" "How fast will it run?" "How much power will it burn?") Unfortunately, logic capacity is difficult to quantify, because of the diversity of architectures available and the incompatible marketing-driven logic-counting equations that vendors use.
When comparing various PLDs and FPGAs, your own design experience, or that of your peers, is the best benchmark. However, even if you are using an architecture for the first time, a basic understanding of the device you're considering and the assumptions behind its specifications can guide your selection. By matching a product's capabilities to the circuits in your designs, you can make optimal use of available silicon resources. Also make sure that the product family, ideally in compatible packages and pinouts, is broad enough to support your design, should your predesign logic estimates prove to be too conservative or too aggressive.
Compilers can also provide valuable information on the amount of on-chip resources needed to construct logic and memory that will be in your final design. However, evaluation circuits that you design generically may not make the best use of each architecture's features, limiting the evaluation results' usefulness. Routing impacts are also unknown until you run the compiled netlist through a vendor's proprietary back-end tools, which may not be robust enough to take advantage of the silicon flexibility an architecture provides them. You also need to balance your desire to pack a device full of logic with the need to retain pinout and timing through subsequent design iterations (see sidebar "The secrets of your success").
CPLDs
Complex-PLD (CPLD) vendors most frequently tout macrocells as their measures of logic capacity. A macrocell comprises a register and a dedicated product-term logic structure of varying size. Most CPLDs use a programmable-AND and fixed-OR architecture, and many vendors standardize on five dedicated OR product terms per flip-flop. Groups of macrocells, or logic blocks, are each conceptually equivalent to a simple PLD (SPLD), such as the popular 22V10.
Complex signals' logic equations may require more than one macrocell's directly allocated product terms. With first-generation devices, this requirement meant that the design consumed additional entire macrocells, with the output of one feeding the input of another, to implement the desired function. In a worst-case scenario, you might use an entire macrocell's resources to create just one additional product term for another. Aside from reducing the available logic resources within a chip, this multipass technique also detrimentally affected performance. For this reason, SPLDs often provide as many as 10 directly allocated product terms per macrocell, but increasing macrocell counts in CPLDs make this approach cost-prohibitive.
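The macrocell cost of the multipass technique can be sketched with a little arithmetic. The following Python fragment is only an illustration under a simplifying assumption: each cascaded macrocell spends one of its own product terms on the fed-back output of the previous macrocell. The function name and the default of five terms per macrocell are chosen here for illustration and do not come from any vendor's fitter.

```python
def macrocells_multipass(product_terms: int, terms_per_macrocell: int = 5) -> int:
    """Estimate macrocells needed to OR together `product_terms` terms.

    Assumed model (not a vendor equation): macrocells are cascaded, and
    every macrocell after the first gives up one product term to accept
    the previous macrocell's output, so it adds only
    (terms_per_macrocell - 1) net terms of capacity.
    """
    if product_terms < 1:
        raise ValueError("need at least one product term")
    if terms_per_macrocell < 2:
        raise ValueError("cascading needs at least two terms per macrocell")
    if product_terms <= terms_per_macrocell:
        return 1
    extra_terms = product_terms - terms_per_macrocell
    net_per_extra_cell = terms_per_macrocell - 1
    # Integer ceiling division avoids floating-point rounding.
    extra_cells = -(-extra_terms // net_per_extra_cell)
    return 1 + extra_cells


if __name__ == "__main__":
    for terms in (5, 6, 9, 10, 13):
        print(terms, "product terms ->", macrocells_multipass(terms), "macrocells")
```

Under this model, a six-term equation already costs a second macrocell, which is the worst case the text describes. Each extra macrocell in the chain also adds another pass through the logic array to the signal's delay.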
Today's CPLDs let you reallocate or share sets of multiple product terms within a logic block. This redistribution affects performance less than does the multipass approach. Partial-product-term "stealing" can also leave a portion of each macrocell's resources available to implement other functions, improving design efficiency. Carefully examine the product-term-redistribution techniques each vendor uses, with an eye toward balancing capabilities against cost for your design's complexity. For example, Xilinx's 9500 CPLD family supports bidirectional product-term redistribution from both nearest neighbor macrocells. Some devices from multiple vendors offer several redistribution options, trading off granularity of product-term sharing or stealing, degree of flexibility in rerouting, and speed.
As an alternative to most CPLDs' programmable-AND and fixed-OR, PAL-based structures, you might want to consider a programmable-logic-array (PLA)-based architecture. Vendors used PLAs in the earliest programmable-logic devices. PLAs offer per-macrocell user programmability of both the number of product terms ORed together to implement a function and the ANDed inputs within each product term (Figure 1). Dual-pass-transistor-based PLAs had one transistor for ANDs and the other for ORs. However, these devices were slower than the PAL alternatives with a hard-wired OR array or PROMs with a hard-wired AND structure. Dual programmable arrays also increased silicon-die-size requirements, extended per-device test time and cost, and decreased yield. One-time-programmable (OTP) fuse-based PLAs minimized many of these issues, but some users viewed them as too inflexible, limiting their success.
Thanks to the high speed, high yield, and small dimensions of today's submicron lithographies, a few companies are revisiting PLAs, whose advantages for implementing complex-logic functions, beyond increased flexibility, include more predictable timing and less cumulative performance degradation. International CMOS Technology (ICT) uses PLAs as the basis of its programmable electrically erasable logic (PEEL) Array CPLDs. ICT's PEEL Array product line comprises 5V PA7000 devices having 36 to 60 registers. ICT predicts that by press time, it will have available for sampling 5 and 3V versions of 40- and 60-register members of the follow-on PA71000 product family. Additional devices in the PA71000 and derivative product lines will become available throughout the rest of the year.
Philips includes a supplemental PLA array in each logic block of the company's CoolRunner XPLA and XPLA2 products. Regardless of the number of additional product terms added via the PLA, Philips devices incur a single "adder" in the signal's timing equations for input-to-output propagation delay and register input setup. Since the OR array of a PLA is user-programmable, multiple macrocells can share the same AND product-term resource if the design supports this capability. For example, a single product term can control the configuration of a multimacrocell counter circuit in response to an external event, such as a reset.
Other macrocell enhancements differentiating vendors include multiple-sourced register clocking, including optional latching; positive- and negative-edge-clock polarity; and numerous flip-flop set/reset options and sources. Having both active-high and -low macrocell outputs available allows the fitter to choose the logic equation with the fewest product terms or inputs, whichever is the limiting factor of single-macrocell usage. Some devices allow you to configure the macrocell's flip-flop as a D, a T, or--less commonly--a JK or SR type without consuming product terms, again with the goal of using the simplest logic equation to implement a function.
Several vendors embed XOR gates, which are useful for efficiently implementing circuits such as comparators, within their macrocells. Altera's Max 9000 family provides optional "register packing," which provides two outputs per macrocell: a combinatorial function of four product terms and a registered function of one product term plus a flip-flop. If you use the macrocell flip-flops to generate shift registers, for example, the remaining product terms are still available for asynchronous logic circuits.
In many designs, only a percentage of all macrocell outputs need to interface to the outside world. The outputs of flip-flops in synchronous state machines, for example, commonly control only other logic within the device. In cases such as these, consider buried-macrocell architectures, in which only a subset of all macrocell outputs connects to I/O pins. Even for macrocells that normally connect to outputs, you can usually reconfigure them as buried macrocells and use the corresponding pin as an input. Fewer pins mean a cheaper package, lower power consumption, reduced board space, improved manufacturing yield, and greater long-term reliability.
Moving signals around
Vendors take various approaches to feeding a macrocell output to the same or another logic block. The most flexible approach gives you the choice of using a registered or a combinatorial version of the signal either before or after the output-enable stage, independently of the signal version that may go to the output pin. Lattice's ispLSI and Vantis' Mach 4 devices include an output routing matrix, which lets you route each macrocell output to its dedicated I/O pin or, with a few-nanosecond speed penalty, to any other I/O pin associated with that macrocell's logic block. In researching various PLD options, you'll also uncover multiple degrees of per-pin output-enable control.
How does the chip get a signal from one macrocell and allocated I/O pin back to itself or to other macrocells in the same or another logic block? Look at each vendor's data-sheet diagrams, and you'll see one or several routing structures, which some manufacturers refer to as "programmable interconnect matrices," "global routing pools," "central-switch matrices," and "block and central interconnects." Early CPLDs used a fully populated crossbar switch. This concept meant that all device inputs and macrocell-feedback signals routed into the routing structure, and each logic block had access to all signals. The crossbar switch still finds use in some low-density CPLDs.
A crossbar switch, however, consumes an inordinate amount of silicon area as logic-block and macrocell counts rise. Each intersection of the input and output lines within the routing array requires either an OTP fuse/antifuse or a pass transistor. Thanks to advances in process lithography, vendors can more cost-effectively implement a given-sized crosspoint switch. However, the cumulative capacitance of all these switching elements can also have a detrimental effect on signal-propagation delays.
Fortunately, most designs don't require this degree of routing flexibility. As a result, vendors have migrated away from fully populated crossbar-switch structures to more limited but more cost-effective multiplexing arrangements. As with a crossbar switch, in multiplexing arrangements, all logic blocks' macrocell-feedback signals and I/O pins define the available routing lines. However, only a subset of these routing signals can selectively feed into each of the logic blocks' inputs. This compromise reduces the logic complexity of each input's multiplexer. The input switch matrix in Vantis' Mach 4 product line gives the fitter software more routing options by allowing each feedback and I/O signal to switch onto one of several routing lines.
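To get a feel for why vendors moved away from the fully populated crossbar, you can count programmable switching elements. The Python sketch below assumes one switching element per intersection of a routing line and a logic-block input, and it models the multiplexed alternative as a fixed-width multiplexer on each logic-block input. The device dimensions in the example (128 macrocells, 100 I/O pins, eight logic blocks, 36 inputs per block, 16-wide multiplexers) are hypothetical and do not describe any real part.

```python
def crossbar_switches(routing_lines: int, logic_blocks: int, inputs_per_block: int) -> int:
    """Switch count for a fully populated crossbar: every routing line
    can reach every input of every logic block."""
    return routing_lines * logic_blocks * inputs_per_block


def multiplexed_switches(logic_blocks: int, inputs_per_block: int, mux_width: int) -> int:
    """Switch count when each logic-block input selects from only
    `mux_width` of the routing lines (assumed uniform mux width)."""
    return logic_blocks * inputs_per_block * mux_width


if __name__ == "__main__":
    # Hypothetical device: macrocell feedbacks plus I/O pins form the routing lines.
    lines = 128 + 100
    full = crossbar_switches(lines, logic_blocks=8, inputs_per_block=36)
    partial = multiplexed_switches(logic_blocks=8, inputs_per_block=36, mux_width=16)
    print("fully populated crossbar:", full)
    print("16-wide multiplexers:   ", partial)
    print("ratio: %.1fx" % (full / partial))
```

The crossbar count grows with the product of routing lines and logic-block inputs, whereas the multiplexed count grows only with the number of inputs. The price is that the fitter can no longer connect any signal to any input.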
Multiplexing arrangements also offer fewer inputs into each logic block than the number of available routing-matrix signals, reducing the number of multiplexers allocated to each logic block. Multiplexer width and number both provide vendors with opportunities for innovation. Ultimately, these factors, along with the degree of product-term sharing within a macrocell and output-switching capability, directly affect your ability to pin- and performance-lock a device through multiple design revisions. This capability is especially crucial when considering in-system programming: The point of the technique is to enable just-in-time logic configuration without board revision during manufacturing in response to last-minute design changes or customer needs.
Atmel's ATF1508AS begins with Altera Max 7128-compatible package, pinout, and internal-logic structures, according to the company. In addition to a variety of improvements Atmel made within the macrocell, the company increased the number of inputs into each logic block from 36 to 40. Atmel also modified the multiplexer architecture feeding each logic block's inputs to give the fitter more flexibility in connecting routing-matrix signals to macrocells (Figure 2). Cypress Semiconductor markets its robust signal-interconnection scheme as the foundation of the in-system-reprogrammable (ISR) Flash370i and Ultra37000 CPLD product families, which also support as many as 16 product terms per macrocell without timing degradation. The FastConnect global switch matrix in Xilinx's 9500 architecture can combine multiple internal macrocell feedback signals into a single wired-AND connection before feeding the connection into a destination logic block. The approach increases both effective logic capacity and the virtual number of logic-block inputs.
Even one partially populated routing matrix becomes cost-prohibitive at high macrocell counts. In response, vendors such as Altera with its Max 9000 architecture, Philips with its XPLA2 line, and Vantis with its Mach 5 devices have migrated to multiple smaller routing matrices interconnecting adjacent logic blocks (Figure 3). This approach, "hierarchical interconnect," moves CPLD routing toward the segmented approach that many FPGAs use. However, the CPLD's product-term-based logic structures and coarse granularity remain key differentiators from fine-grained FPGAs.
The primary downside of hierarchical interconnect is that it makes CPLD timing less predictable. (Predictability has historically been a fundamental PLD selling point.) Further, logic-block-to-logic-block propagation delays depend on respective logic-block locations. When reviewing data sheets, you'll find fast timing specifications for signals within a local routing matrix and slower specifications for signals that span multiple matrices, using the global interconnect. Some designs may benefit from a multimatrix approach, however. In Altera's Max 9000 devices, for example, all macrocell outputs and I/O pins are available to all other macrocells within the same logic block, whereas in Max 7000 and other common CPLDs, the restricted number of logic-block inputs limits routing for all signals, even those originating within the same logic block.
FPGAs ain't ASICs
Gate count as a measure of logic capacity has its roots in ASICs, in which the technique is reasonably straightforward. A gate-array ASIC's logic structure comprises a sea of fine-grained, two-input NAND gates. ASICs also offer abundant signal routing, so you can generally trust the logic estimates of the front-end synthesis compiler without taking the additional time and expense of back-end placement and routing. In the early days of FPGAs, designers used them as integrators of multiple simple logic chips, for ASIC prototyping, and, perhaps, for initial production volumes if speed weren't an issue. These situations prompt the question of how many gate-array, two-input NAND-gate equivalents each vendor's FPGA provides. The fact that FPGAs aren't fundamentally based on discrete gates complicates the problem.
Whether constructed of look-up-table, multiplexer, or dedicated logic (as with DynaChip, for example), FPGAs use coarser-grained combinatorial logic structures with associated registers. The alternative approach, interconnecting NAND gates with a huge number of user-programmable antifuses or pass transistors, would be cost-, power-, and performance-prohibitive. Even Gatefield, with perhaps the finest-grained base logic cell available, uses a four-input, one-output element that can construct a variety of both combinatorial and clock-driven functions.
How do you equate a chip not made of NAND gates with a chip made of NAND gates? You benchmark. The FPGA vendors selected a group of what they claimed were commonly used logic circuits and implemented them in FPGAs and gate arrays, comparing required device sizes and the percentage of the device used in each case. The averages the vendors determined across all designs resulted in commonly cited conversion factors, such as: "A four-input look-up table and register equals 12 gate-array ASIC gates."
Each vendor gravitated toward the circuits that portrayed its devices in the most positive light. The number of circuits that constituted a valid sample set also varied. Finally, design techniques were undocumented; did the vendor use an inefficient HDL synthesis technique or--more likely--handcraft a finely tuned netlist? Because the manufacturers have tremendous knowledge of their architectures, their results would probably surpass those of a typical system engineer with far less silicon expertise and time and a smaller tools budget. Industry-standards bodies attempted to create order from this chaos, but their efforts ultimately fell apart because of vendor defections and questionable interpretation of unclear specifications (Reference 1).
As FPGA use becomes more widespread and extends beyond prototyping into full system production and as the number of FPGA vendors increases, manufacturers are drifting away from benchmarking their products against ASICs and have begun directly comparing their devices with those of other FPGA vendors. Unfortunately, this trend means that competitive pressures cause everyone's gate-count claims to expand to the level of the vendor with the most inflated numbers. This article could spend the next few paragraphs dissecting each company's gate-counting scheme. It won't, however, for a few basic reasons.
For one thing, none of the gate-counting schemes are inherently wrong. They all make reasonable estimates of silicon use in a subset of all possible designs. However, in some cases, the subset is so small as to be almost nonexistent. Also, each vendor documents the assumptions behind the calculations it uses in determining a gate-count range and, ultimately, the gate count it specifies for a part number. Data sheets and application notes contain this information, and, because the equations and numbers change with every product generation and response to competitive pressure, any equations this article provides would quickly become outdated. Instead, consider the following observations when comparing FPGA architectures and gate-count specifications.
The first issue in comparing gate counts concerns design efficiency. Some manufacturers optimistically assume that you can use 100% of the available logic in a part. A lower usage percentage is more realistic. The coarser the fundamental building-block logic element, the greater the variety of circuits that you can implement within it. For example, the most efficient thing you can build with a four-input look-up table would probably be a 2-bit×2-bit multiplier or comparator. The least efficient circuit, on the other hand, would be a single-input inverter. Efficiency varies across the structures within a design and across multiple designs and rarely comes close to 100%.
Even if your circuits could use all the logic capability within a chip, you might not want to go down that path. The higher the usage, the less likely it is that you can retain pinout and timing through design revisions, regardless of whether your FPGA uses a segmented or a continuous routing architecture or whether the vendor bases that architecture on pass-transistor or antifuse switching. Your design may run out of routing resources long before it uses all the available logic.
FPGAs contain a variety of other on-chip resources that a vendor might choose to include in a gate-count specification. Remember that the vendor decides which set of circuits to benchmark-test against an ASIC. Some companies include only the most commonly used resources: look-up tables, multiplexers, and flip-flops. What about registers and other logic structures in I/O buffers? To some companies, these resources exist only to boost I/O performance, not to enable fundamental design creation within the chip, so the vendors omit them from the specified gate counts.
Other manufacturers view I/O logic as crucial to implementing popular circuits, such as 33-MHz and faster PCI cores. Without the fast input setup-and-hold times and clock-to-output delay that I/O registers enable, such circuits would be impossible. These companies also point out that synchronization of signals as they enter or exit a chip is a good design practice, so they include I/O logic. Other vigorously debated resources include embedded AND gates and secondary look-up tables and multiplexers, PLLs, fast carry chains, supplemental fast-decoding resources, internal three-state buffers, boundary-scan and power-up configuration logic, clock drivers, global set/reset circuits, and dedicated CPU interfaces.
No integrated resource provokes more heated debate than embedded memory, because a vendor's assumptions about whether and how a typical customer will use on-chip SRAM have a tremendous impact on claimed device gate count. When comparing FPGAs, you have to consider two competing embedded-memory techniques: using "distributed-memory" architectures and embedding larger discrete SRAM arrays on a die. So-called distributed-memory architectures construct SRAM arrays from the small memory inside each logic block's look-up table. This approach has two fundamental advantages: You use only the memory density you need, and the memory is physically close to the logic, which can be a performance advantage in some designs.
Disadvantages of distributed memory include the fact that its use decreases the available logic resources on the chip. Also, as the memory becomes deeper, the routing needed to interconnect look-up tables incrementally slows performance. The alternative discrete-memory approach offers silicon efficiency and performance advantages over distributed memory but only if your density needs exactly match the available SRAM array sizes. Otherwise, you pay for extra RAM bits that you don't need. Also, although you can use distributed-memory FPGAs for both logic-only and logic-plus-memory designs, discrete-memory FPGAs serve a more limited range of possible applications.
Speaking of applications, here comes the next complication: What will you use the memory for? As each available memory resource grows larger, the number of logic functions that you should implement in the resource decreases. Large memory blocks can make very fast, silicon-efficient wide input comparators, multipliers, digital filters, and complex state machines--but little else--and only a small percentage of designs contain these functions.
The big gate-count boost, however, comes from using embedded RAM, in distributed or discrete form, to implement memory circuits. These on-chip memories usually take the form of FIFOs, although you can generally construct any type of asynchronous or synchronous, single-port or multiport RAM. The FPGA vendors begin with an assumption that each standard SRAM bit corresponds to at least four gates. For a four-input look-up table that along with the associated register might otherwise translate to 12 logic gates, using RAM as distributed memory allows the vendor to claim a minimum of 64 gates. A 2048-bit discrete-memory block similarly corresponds to 8192 or more gates.
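The arithmetic behind these figures is simple enough to check yourself. The short Python sketch below uses the four-gates-per-bit assumption quoted above; the function names are illustrative only, and a given vendor may use a different factor.

```python
def lut_bits(lut_inputs: int) -> int:
    """A k-input look-up table stores one bit per input combination."""
    return 2 ** lut_inputs


def memory_gate_claim(bits: int, gates_per_bit: int = 4) -> int:
    """Gate count a vendor could claim for `bits` of RAM, assuming a
    flat gates-per-bit conversion factor."""
    return bits * gates_per_bit


if __name__ == "__main__":
    print("four-input LUT as RAM:", memory_gate_claim(lut_bits(4)), "gates")
    print("2048-bit block:       ", memory_gate_claim(2048), "gates")
```

Counting one look-up table as memory rather than as 12 gates of logic therefore multiplies its contribution by more than five, which is why the assumed memory usage matters so much to the headline number.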
Next question: How versatile is the vendor's embedded memory? For example, Actel claims that its A42MX3 has a capacity of six gates per bit; the embedded memory can implement both single- and dual-port RAM without consuming other logic. Less flexible devices use generic logic to create complex memory structures. If your design has simpler memory needs, however, Actel's gate-counting method is probably too aggressive. A similar situation holds true for implementing FIFOs. The on-chip RAM is useful only for constructing the FIFO array; the required input and output data ports, full and empty flags, and counters all consume logic and routing resources.
Vendors selling FPGAs with discrete memory have an obvious motivation to include it in their gate-count specifications to justify the silicon area it consumes, but, unless you can use the memory, those extra gates are meaningless. Vendors with distributed-memory FPGAs balance their desire to present the highest gate count with the reality that if they counted all the on-chip look-up tables as memory, they'd be advocating an expensive, slow SRAM without any leftover resources to implement logic circuits. Adjusting the assumed percentage of look-up tables used to implement logic versus memory is key to Xilinx's higher "system-gates" specification on the Spartan family compared with the "logic-gate" number of the previous-generation XC4000E architecture.
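To see how sensitive a claimed gate count is to the assumed logic-versus-memory split, consider the Python sketch below. It is not any vendor's published equation. It assumes a flat 12 gates for a look-up table and register used as logic and 64 gates for the same table used as distributed RAM, and the 1000-table device is hypothetical.

```python
def claimed_gates(lut_count: int,
                  memory_fraction: float,
                  logic_gates_per_lut: int = 12,
                  memory_gates_per_lut: int = 64) -> float:
    """Blend logic and memory conversion factors.

    `memory_fraction` is the assumed share of look-up tables used as
    distributed RAM; the rest are counted as logic. All conversion
    factors are assumptions for illustration.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    memory_luts = lut_count * memory_fraction
    logic_luts = lut_count - memory_luts
    return logic_luts * logic_gates_per_lut + memory_luts * memory_gates_per_lut


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        gates = claimed_gates(1000, fraction)
        print("%3.0f%% of LUTs as RAM -> %6.0f gates" % (fraction * 100, gates))
```

Under these assumptions, moving the memory share from zero to just 25% roughly doubles the claimed capacity of the same silicon, so always check which split a data sheet assumes.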
Regardless of the type of embedded memory, some companies assume 100% usage in their gate counts, whereas others use a more realistic percentage. The whole FPGA-to-ASIC "gates-per-bit" conversion scheme is a bit like comparing apples to oranges, because, although a gate-array ASIC constructs RAM bits from NAND gates, an embedded-array, cell-based, or full-custom ASIC takes the more FPGA-compatible and silicon-efficient approach of using RAM on the die. Not all FPGA vendors lump memory into their gate counts; some use a strict interpretation of the term "logic gates" to refer only to on-chip resources used to implement logic circuits and separately specify embedded memory. Also, just because an FPGA uses an SRAM look-up table as its logic element does not mean that you can alternatively use that look-up table for distributed memory, and some vendors do not advocate using discrete memory to implement logic functions because of insufficient routing and other logic resources around the memory block.
Lattice Semiconductor, with its ispLSI 6000 family, is the only vendor to date to offer on-chip SRAM in a CPLD architecture. Embedded memory is similarly scarce in antifuse FPGAs, reflecting a vendor's weaker motivation to include SRAM when some other technology configures the logic blocks and signal routing. However, as silicon-integration capabilities increase and devices become more application-specific, you'll probably see more extensive inclusion of both embedded memory and tailored-function logic (see sidebar "The future will get only more complicated"). Waferscale Integration takes a different embedded-memory tack, integrating EPROM, EEPROM, and flash memory alongside various generic and special-purpose PLD logic blocks in its PSD devices.
Predicting your results
Several engineers interviewed for this article point out that, although their system designs require FIFOs and other RAM structures, they frequently use external memory components instead of relying on embedded SRAM. This approach offers less flexible granularity, increased power consumption and board space, and potentially lower performance. The primary upside, however, is much lower memory cost per bit, and, as both memory and programmable logic increase in performance, the speed impact of going off-chip diminishes.
The engineers (and, anonymously, even a few programmable-logic and design-software vendors) also comment that, because of the tremendous variability in gate-counting schemes, they ignore the data-sheet claims when selecting manufacturers and devices. Consultant Ray Andraka, president of the Andraka Consulting Group (http://users.ids.net/~randraka), begins instead by converting and comparing both his design and each vendor's logic structure to four-input look-up-table and flip-flop or other common elementary logic-structure equivalents. Although this technique ignores the impact of secondary logic resources, it provides a good ballpark estimate. However, this approach can still underestimate the logic that some circuit configurations need.
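To see how large the gap can get, consider what happens when some look-up-table inputs are not available to your logic, for example because a carry-in or a clock enable occupies one of them. The Python sketch below assumes the function decomposes into a simple tree, as a wide AND, OR, or XOR does, so each look-up table with k usable inputs reduces k signals to one. Less regular functions may need more tables, and the function name is purely illustrative.

```python
def luts_for_wide_function(fan_in: int, usable_inputs: int) -> int:
    """Look-up tables needed to build a `fan_in`-input function as a tree.

    Assumes an associative function (wide AND/OR/XOR), so each table
    removes (usable_inputs - 1) signals from the total still to combine.
    """
    if fan_in < 1:
        raise ValueError("fan_in must be at least 1")
    if usable_inputs < 2:
        raise ValueError("a tree needs at least two usable inputs per table")
    if fan_in <= usable_inputs:
        return 1
    # Integer ceiling of (fan_in - 1) / (usable_inputs - 1).
    return -(-(fan_in - 1) // (usable_inputs - 1))


if __name__ == "__main__":
    for fan_in in (4, 7, 16):
        naive = luts_for_wide_function(fan_in, usable_inputs=4)
        reduced = luts_for_wide_function(fan_in, usable_inputs=3)
        print("%2d inputs: %d tables with 4 usable inputs, %d with 3"
              % (fan_in, naive, reduced))
```

In this model, a four-input function that fits one table when all four inputs are free needs two tables once one input is reserved, so a count based on full four-input tables would capture only half of the resources actually required.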
"If you were to count four-input look-up tables for a design targeted at Altera's Flex 10K, for example, you could underestimate the required resources by as much as 50% if you are using carry chains or clock enables," says Andraka. "Using the carry splits the four-input look-up table into a pair of three-input look-up tables, one for the carry and one for the bit-logic function; the carry-in signal consumes one input to each. Similarly, the clock enable uses up one of the four-input look-up-table inputs, leaving you with a three-input look-up table for your logic."
Inspecting a device's data sheet lets you predict certain aspects of an architecture's compatibility with your design. For example, do you plan to use FIFOs or other memory structures, do you know what size these memory structures will be, and do you know how many of them you need? In this case, after also checking discrete-SRAM prices, you can decide whether a PLD with embedded memory makes sense and, if so, whether a distributed- or discrete-memory approach would work better.
Other questions are more difficult to answer. Odds are that you'll choose your devices early in the project before some logic is even designed and when the rest is in a preliminary form. How do you know whether you can use and should pay for a CPLD's product-term-sharing approach, degree of macrocell flexibility, or PLA? Which FPGA logic cell structure best suits your design, and do supplemental on-chip resources, such as fast decoders, I/O registers, and dedicated arithmetic logic gates make sense? If you select a device with embedded discrete memory, do a few large SRAM arrays or more small ones, such as with Atmel's AT40K architecture, work better for you?
One common approach involves creating sample circuits that you anticipate will be in your project, either in a schematic or HDL tool, targeting the design for multiple vendors and architectures and comparing the results (Figure 4). You should use the same front-end design software you plan to use for your project, not something that a PLD company provides. Results vary, depending on how the EDA vendor tailors its tools for vendors and circuits.
This technique has some merit, and it can often give you at least a ballpark estimate of logic capacity. Although the absolute numbers you achieve for each architecture may be inaccurate, the comparisons among architectures can provide useful relative information. However, keep a few caveats in mind. First, if you don't run the netlist through a back-end fitter or place-and-route tool, the approach provides only a rudimentary count of the number of elementary logic elements, such as look-up tables, multiplexers, macrocells, and registers, that you need to implement a given circuit.
A vendor's logic block, the next-coarsest logic structure, typically comprises several of these elements. For example, the Programmable Logic Cell (PLC) in Lucent's ORCA3 series of FPGAs has eight look-up tables and registers, plus combinatorial decoding logic and other circuits, within it. Just as you can't use every logic element to its full potential when implementing logic circuits, you can't use 100% of every logic block in a CPLD or FPGA because of input-routing constraints. Until you map your netlist to an FPGA or fit it to a CPLD, you won't have insight into this higher-level limitation or into its cause (for example, whether it results from insufficient register- or combinatorial-logic resources).
The finer grained the logic block, the more accurate the front-end estimates are. QuickLogic and Vantis have built variable-grain logic structures in their respective pASIC and VH1 FPGA families, in an attempt to handle the logic-flexibility issue. QuickLogic's logic cell (Figure 5) can support either one 16-input function or five independent, smaller input-term functions. Each Vantis variable-grain block (VGB) can also implement a range of functions, from all three-input equations to many 16-input functions. By combining multiple VGBs using local interconnection, synthesis can even construct 32-input logic structures.
To minimize evaluation time, you might choose to write your example-circuit HDL code in a generic, vendor-independent fashion. Realize, though, that by not instantiating functions from each vendor's library, you don't present each architecture under evaluation in its best possible light. There's no guarantee that the synthesis vendor's tools can infer functions from your code and map them to silicon resources. The inefficiencies of vendor-independent code are perhaps most noticeable when you create memory-based circuits (Reference 2). Aside from matching designs to devices, test cases are useful for determining the synthesizer's technology-mapping behavior.
When you pack multiple circuits into a single device, the interaction between those circuits causes the block allocation to change, and you begin to see the impact of interconnection. You might run out of acceptable routing resources before you've put a significant dent in the on-chip logic, especially if your design runs at a high clock rate or has critical setup, hold, and input-to-output timings. You'll probably at least encounter degraded route performance after including multiple blocks of logic in one chip, all competing for the same resources.
"A high degree of interconnect, such as might be found in an adder tree, may use up routing resources long before the design consumes the available logic," says Andraka. "In cases where there appears to be a lot of interconnect, you need to visualize the flow of the design and then look at the number of routing channels available in a row or column in the FPGA. Counting the routes per bit in the design and knowing how many routes per row or column tells you if you need to block out a column or row of logic cells to get enough routing resources. This scenario is especially true for heavily datapath designs."
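Andraka's counting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each datapath bit is assumed to occupy one row of logic cells, a blocked-out neighboring row is assumed to donate all of its routing channels, and the function names and the example figures (16 bits, 30 routes per bit, 24 channels per row) are hypothetical rather than taken from any vendor's data sheet.

```python
import math


def rows_per_bit(routes_per_bit: int, channels_per_row: int) -> int:
    """Rows one datapath bit must span so that all of its routes fit.

    Assumption (hypothetical model): one bit slice per row, and every
    blocked-out row contributes its full set of routing channels.
    """
    if routes_per_bit < 0 or channels_per_row <= 0:
        raise ValueError("routes must be >= 0 and channels must be > 0")
    return max(1, math.ceil(routes_per_bit / channels_per_row))


def blocked_rows(bit_width: int, routes_per_bit: int, channels_per_row: int) -> int:
    """Rows of logic cells left unused purely to free up routing."""
    extra_rows = rows_per_bit(routes_per_bit, channels_per_row) - 1
    return bit_width * extra_rows


if __name__ == "__main__":
    # Illustrative numbers only, not real device parameters.
    width, routes, channels = 16, 30, 24
    print("rows per bit:", rows_per_bit(routes, channels))
    print("rows blocked out:", blocked_rows(width, routes, channels))
```

If the blocked-row count comes out above zero, the logic you can actually place is smaller than the raw cell count suggests, which is the effect the quote describes.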
If you run your design through programmable-logic back-end tools and it doesn't hit the target clock frequency, one common response is to insert multistage pipelines inside combinatorial logic chains with long propagation delays. But, by boosting the number of registers your design contains, you may have just obsoleted your predesign logic-complexity estimate, and, if so, you may need to move to a bigger, slower FPGA. Make sure your reference circuits have plenty of performance head room in the device you select to avoid this last-minute headache.
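The effect of pipelining on a register estimate is simple arithmetic, sketched below in Python. The model assumes that each inserted pipeline stage adds one register per bit of the path it cuts; the function names, the 80% utilization ceiling, and all numbers are hypothetical and serve only to show how an estimate that once fit can stop fitting.

```python
def registers_after_pipelining(base_registers, inserted_stages):
    """Return the register count after adding pipeline stages.

    inserted_stages is an iterable of (stage_count, path_width_bits)
    pairs. Assumption: each stage costs one register per path bit.
    """
    added = sum(stages * width for stages, width in inserted_stages)
    return base_registers + added


def fits(required, available, max_utilization=0.8):
    """True if the required registers stay under a utilization ceiling.

    The 0.8 default is an illustrative head-room figure, not a vendor rule.
    """
    return required <= available * max_utilization


if __name__ == "__main__":
    base = 1150                    # predesign register estimate (made up)
    stages = [(2, 32), (1, 64)]    # two stages on a 32-bit path, one on a 64-bit path
    device_registers = 1536        # registers in the candidate device (made up)

    total = registers_after_pipelining(base, stages)
    print("before pipelining fits:", fits(base, device_registers))
    print("after pipelining:", total, "fits:", fits(total, device_registers))
```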
Instead of spending lots of money on a vendor's full-featured back-end tool set before deciding on the company's architecture, you might use the vendor's free or low-cost versions, which support only the smallest parts in the product line. However, this approach doesn't let you put all of your reference circuits in the same chip at once, and circuit interaction may produce different results in the final design.
If you'd also like to avoid the learning curve of using multiple manufacturers' software, contact your vendors' field-application engineers and have them run the back-end optimizations for you. Make sure, though, that you understand how these engineers configure each tool's options so that the comparisons are meaningful and so you can duplicate their results. Alternatively, QuickLogic offers the free Web-based QuickMap service (Figure 6). You upload your HDL code to QuickLogic's server, which compiles it and responds with a variety of information, including recommended device, amount and type of resources consumed, and anticipated performance.
Acknowledgments
Thanks to analyst Murray Disman from Information Associates (www.plnv.com) and consultants Debora Grosse from Agate Technology (dgrosse@pobox.com), Ray Andraka from the Andraka Consulting Group (users.ids.net/~randraka), Rocky and Suzanne Awalt from Memec Design Services (www.mds.memec.com), and Mark Santoro from Santoro Systems Engineering (santoro@mindspring.com) for balancing vendor-supplied information with their real-life experiences and, in some cases, for also supplying valuable feedback on early article drafts.
The secrets of your success
Congratulations. You've picked a programmable-logic vendor, an architecture, and a target device. Now, how do you ensure that you'll achieve a high degree of logic usage--not just with those test circuits you used for the evaluation stage, but with your real design?
First, rely on your vendor's knowledge of its devices. Your expertise is with the system, not with every feature of each chip within it. Instead of diving into the design headfirst, spend a few days really reading the data sheet, the appropriate application notes, and the user manuals for both front- and back-end design software. You'll no doubt learn a few tricks that will measurably improve your results.
Second, use predesigned circuits if they make sense. Although using a vendor's macrofunctions and cores may feel like a blow to your talented engineering ego and instantiation makes your design less portable to other vendors' programmable-logic devices and ASICs, the people who created the macrofunctions understand their chips a lot better than you do. They can probably create a circuit that is more reliable, runs faster, uses fewer gates, and burns less power than yours.
Third, learn from the expertise of your peers, both at your company and elsewhere, who have used the device. Internet newsgroup comp.arch.fpga is a tremendous source of useful information and--usually--intelligent exchange of ideas on a variety of programmable-logic topics. I am amazed not only by how many vendor representatives participate in the discussions, but also by how many consultants log in and how much good, free advice they share.
Speaking of consultants, my fourth recommendation is to consider tapping into the knowledge base of one of these individuals, and to do it before the 11th hour. Using an outside expert can be cheaper than attempting the design in-house, especially when you consider the cost of acquiring tools, the learning curve to become proficient enough to drive them, and the iterations needed to get the design right. Consultants use a variety of devices from multiple manufacturers and have seen lots of designs, including the most challenging ones you can imagine.
A little money invested up-front on device and design advice may reap significant returns down the road. However, insist on a confidentiality clause in the contract, and make sure the consultant provides all the necessary files and clearly documents the design, so that you can understand and maintain it in the future.
Finally, don't work too hard. Over the last several years, programmable-logic devices have significantly increased in logic capacity and decreased in cost per gate; beyond a certain point, your continued efforts at squeezing more logic into a part reap diminishing returns. Bite the bullet, switch to a bigger device, wrap up your design, and get on with your life. Just make sure you can still hit your timing with the larger device. One secondary benefit of this approach is that if marketing comes in with last-minute feature-set changes that increase design complexity, you'll have the gate-count head room to ensure performance and pin locking.
So, you think quantifying logic capacity is difficult now? Trust me, you haven't seen anything yet. Next-generation architectures will offer significant advances in performance and capacity, but comparing them with each other is becoming ever more complex. Altera plans to combine complex-PLD (CPLD)-like product-term logic, FPGA-like look-up-table structures, and discrete embedded-memory blocks on one device with the upcoming Raphael architecture, which should appear in the first half of next year. Xilinx will provide three memory options with Virtex, which the company hopes will be available for sampling by the end of this year. In Virtex, Xilinx will for the first time offer discrete embedded memory alongside the company's traditional SelectRAM distributed memory, plus fast I/O buffers for accessing external RAM.
Even more logic capacity for a given silicon area is possible when the vendor constructs some of the gates from ASIC technology (Figure A). Lucent's OR3TP12, which the manufacturer claims will be available for sampling by press time, uses ASIC technology for an on-chip, 66-MHz, 64-bit PCI core and offers a high proportion of FPGA-to-ASIC silicon area. Other vendors that intend to pursue mixed-logic devices include Actel, Atmel (which, like Lucent, has in-house ASIC expertise and access to a variety of proprietary intellectual property), and the Gatefield/Siemens and DynaChip/Fujitsu partnerships.
Reconfigurable computing, the concept of using a fixed number of programmable gates to implement a variety of functions in a time-shared fashion, is a valid, if somewhat esoteric, way to boost the effective gate count of a device. At February's FPGA '98 conference in Monterey, CA, Xilinx discussed the results of an in-house experimental silicon platform that combines an array of modified XC4000E logic blocks with eight separate SRAM configuration planes. Registers within each logic block store function results between configurations, and reconfiguring the device takes 30 nsec or less. Intriguing silicon such as this, however, continues to wait for easy-to-use design tools that translate software instructions to hardware logic gates and thereby enable more widespread adoption of reconfigurable techniques.
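A rough way to see the trade-off in time-shared logic is to compute the best-case effective gate count and the share of time lost to reconfiguration. The Python sketch below uses the eight planes and 30-nsec reconfiguration time mentioned above; the 10,000-gate array size, the 270 nsec of useful work per configuration, and the function names are hypothetical. It also assumes every plane is fully used, so the gate figure is an upper bound rather than a prediction.

```python
def effective_gates(physical_gates: int, planes: int) -> int:
    """Best-case gate count when each configuration plane is fully used."""
    if physical_gates < 0 or planes < 1:
        raise ValueError("need physical_gates >= 0 and planes >= 1")
    return physical_gates * planes


def reconfiguration_overhead(reconfig_ns: float, compute_ns_per_plane: float) -> float:
    """Fraction of wall-clock time spent switching planes, not computing.

    Assumption: one reconfiguration precedes each plane's compute interval.
    """
    if reconfig_ns < 0 or compute_ns_per_plane <= 0:
        raise ValueError("need reconfig_ns >= 0 and compute_ns_per_plane > 0")
    return reconfig_ns / (reconfig_ns + compute_ns_per_plane)


if __name__ == "__main__":
    planes = 8            # from the experimental platform described above
    reconfig_ns = 30.0    # upper figure quoted above
    physical = 10_000     # hypothetical array size
    compute_ns = 270.0    # hypothetical useful work per configuration

    print("effective gates (upper bound):", effective_gates(physical, planes))
    print("time lost to reconfiguration:",
          round(reconfiguration_overhead(reconfig_ns, compute_ns), 3))
```

The shorter the useful work per configuration, the larger the overhead fraction becomes, which is one reason the technique depends so heavily on tools that can schedule functions across planes.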
For more information:
Actel Corp, Sunnyvale, CA; 1-408-739-1010; fax 1-408-739-1540; www.actel.com
Altera Corp, San Jose, CA; 1-408-544-7000; fax 1-408-544-6410; www.altera.com
Atmel Corp, San Jose, CA; 1-408-441-0311; fax 1-408-436-4300; www.atmel.com
Cypress Semiconductor, San Jose, CA; 1-408-943-2600; fax 1-408-943-2741; www.cypress.com
DynaChip Corp, Sunnyvale, CA; 1-408-481-3100; fax 1-408-481-3136; www.dyna.com
Gatefield Corp, Fremont, CA; 1-510-623-4400; fax 1-510-226-0147; www.gatefield.com
International CMOS Technology Inc, San Jose, CA; 1-408-434-0678; fax 1-408-434-0688; www.ictpld.com
Lattice Semiconductor Corp, Hillsboro, OR; 1-503-681-0118; fax 1-503-681-3037; www.lattice.com
Lucent Technologies, Allentown, PA; 1-610-712-4331; fax 1-610-712-4209; www.lucent.com
Philips Semiconductors, Albuquerque, NM; 1-505-822-7629; fax 1-505-822-7804; www.philips.com
QuickLogic Corp, Sunnyvale, CA; 1-408-990-4000; fax 1-408-990-4040; www.quicklogic.com
Vantis Corp, Sunnyvale, CA; 1-408-732-0555; fax 1-408-774-8461; www.vantis.com
WaferScale Integration, Fremont, CA; 1-510-656-5400; fax 1-510-657-5916; www.wsipsd.com
Xilinx Inc, San Jose, CA; 1-408-559-7778; fax 1-408-879-4780; www.xilinx.com
Brian Dipert, Technical Editor
You can reach Technical Editor Brian Dipert at 1-916-454-5242, fax 1-916-454-5101, e-mail edndipert@worldnet.att.net, URL http://members.aol.com/bdipert.
Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.