May 7, 1998

EDN Hands-On Project: Getting a handle on HDLs

Brian Dipert, Technical Editor

Programmable-logic chips and designs are growing more complex. As a result, you'll sooner or later need to add HDL expertise to your skills if you want to keep hitting those project deadlines. For this Hands-On Project, I learn VHDL, complete a mixed logic and embedded-memory design in an FPGA, and share observations along the way.

After hearing innumerable hardware-description-language (HDL) pitches from silicon and software vendors and watching the schematic-versus-synthesis debates at industry conferences and on newsgroup comp.arch.fpga, I decided to take the plunge, learn an HDL, and form my own opinions (see box "Synthesis strengths and shortcomings"). The resulting project, which I cover in a two-part article series beginning here and concluding in EDN's Sept 11, 1998, issue, comprises several steps:
I had also hoped to compare power consumption for the design alternatives. After further research, however, I realized that today's tools provide only rough estimates and that I'd need to bench-test each design with an oscilloscope if I wanted to obtain numbers that were accurate and, therefore, meaningful. Realizing that this effort was beyond the scope of the available time and not wanting to base conclusions on questionable approximations, I shelved this part of the project (see box "Current status, future plans"). My first task was selecting an HDL. I decided on VHDL, because it seems to be the most common language that PLD, FPGA, and synthesis vendors advocate, especially in their low-cost, entry-level tool suites. Therefore, I hoped that this choice would be directly applicable to more readers. The choice between Verilog and VHDL usually comes down to personal coding preference, although each approach has its secondary merits. Now for the silicon platform. I targeted a design having at least 10,000 gates. Industry pundits call this gate count the schematic "pain threshold" at which HDLs begin to become attractive, also depending on desired design efficiency and performance. This gate-count complexity rules out CPLDs. I settled on Xilinx's XC4000XL FPGA architecture, because I hoped that its pass-transistor-based interconnection and mix of short and long routing networks would expose any design-compilation differences between schematic- and synthesis-design versions, as well as among synthesis vendors' implementations. Xilinx's XC4000XL FPGAs also provided a range of gate counts in case my design ended up larger or smaller than I had initially estimated. If I had extra time on my hands, I could even port my design to Xilinx's XC4000E or Spartan FPGAs, which differ from the XC4000XL in the amount of available routing. This experiment would let me modify the routing variable and hold all other factors essentially constant. 
I'm a newcomer to HDLs, excluding Abel and Palasm work, and my programmable-logic experience is restricted to much less complicated devices and designs. Realizing, therefore, that I represented only one end of the spectrum of EDN readers' expertise, I asked the programmable-logic consulting company HighGate Design to join me in the project. Principal Engineer Stephen Wasson has completed hundreds of FPGA designs, many using Xilinx devices and extending back to that company's earliest days. Stephen is also the initial developer of Xilinx's PCI core. His primary motivation in working on this project was to satisfy his intellectual curiosity, because he'd encountered the same synthesis-versus-schematic debates I had. The project output would also provide him a testbench for future EDA-tool and PLD benchmarking at his company.

Stephen and I began discussing this project soon after the January 1997 DesignCon conference. He agreed that, as I had planned, HighGate Design would target development of both schematic and synthesis designs. He also accepted my challenge to create our common architecture specification. Stephen and I both wanted the design to contain a mix of functions--decoding logic, counters and state machines, arithmetic units, and memory blocks--to level the playing field. Because Stephen doesn't use HDLs, that design task would fall to fellow HighGate Design Engineer David Holmes. By making the VHDL designs modular, we could also compile design subsets and identify logic-dependent differences among vendors.

Next, we considered tools. I'd used Viewlogic's WorkView Office in past ASIC designs, and Stephen also prefers it. Moreover, Viewlogic wanted to participate in my project, so that part of the decision was easy. We used version 7.40, along with software updates we obtained via the company's Web update feature. Synthesis was a tougher challenge.
Paranoia about having the results published drove one large vendor to immediately respond, "Thanks, but no thanks." Another company bounced back and forth several times over a multiweek period before definitively declining. Synopsys, however, was enthusiastic about the project, so we decided to use the company's FPGA Express 2.01. FPGA Express contains an HDL text editor--sort of. If compilation creates errors and warnings, you can click on each message to bring up a window that displays the offending line and enables you to edit the file. However, you can't create an HDL file from scratch. Instead, I used EditPad from Jan Goyvaerts (www.tornado.be/;johnfg), an enhanced alternative to Windows 95's Notepad. EditPad worked well for both my coding and my article-writing needs. The only feature I missed, which other synthesis vendors' editors support, is color-coded VHDL keywords. This capability would have minimized the keystrokes and the typographical errors. HDL editors also commonly provide generic entity and architecture templates, as well as preconstructed libraries of functions. Third-party HDL editors include HDL Turbo Writer from Saros Technology, ED-4W-HDL from Silicon System Solutions, and EALE/HDL from Translogic. The back-end place-and-route software in all cases was Xilinx's Alliance version 1.4. Because I'd already done ASIC-design work using WorkView Office and hoped that a brief refresher would get me up to speed, I started with the synthesis-based design. Ideally, I'd develop a solid VHDL foundation before complicating the picture by using VHDL to implement circuits for this project. However, like many of you, I didn't have the time to read a book cover to cover without distraction or attend a multiday off-site class. Last summer, I began to occasionally review a few pages or a chapter from Reference 1 when I had a spare few hours or to compile an example or two on the accompanying Cypress Semiconductor (www.cypress.com) Warp2 software. 
(I recommend both of these resources, along with References 2 through 4, for engineers learning VHDL.) Version 4.1 of Warp2Sim, so named because it includes a postfit waveform simulator, is especially useful because it covers both Cypress' CPLDs and now-defunct antifuse FPGAs. This support means that I could run the same source code through a PLD fitter and an FPGA place and router for comparison. Last fall, I attended part of a seminar sponsored by Lattice Semiconductor (www.lattice.com) and taught by Mark Santoro of Santoro System Engineering (Encino, CA). I also attended QuickLogic's (www.quicklogic.com) session with Synplicity (www.synplicity.com) and beginner and intermediate VHDL sessions taught by Escalade, both at January's DesignCon '98. I've also heard great things about sessions by Esperan and TM Associates, and I encourage you to take one if you have the time. I also encourage you not to limit your VHDL training to elementary designs, such as two-input NAND gates and simple, clocked logic circuits with asynchronous and synchronous resets, as I did. Fortunately, research and review of previous programmable-logic articles has made me aware that coding style has a measurable impact on final results, so this knowledge found its way into my design from the beginning (References 5 through 8). My final step before starting was to use the Quick Tour and tutorial that comes with FPGA Express. Although both the Quick Tour and tutorial improved my user interface and overall design-flow understanding, the prewritten source files meant that my overall VHDL expertise didn't grow as a result. Stephen Wasson lives and works just south of Silicon Valley, whereas I'm roughly three hours north in Sacramento. My preferred work schedule starts at around 5 am and, most days, wraps up by 6 pm or so. Stephen, on the other hand, follows a more traditional engineering regimen, beginning in the late morning or early afternoon hours and continuing till 3 or 4 the next morning. 
As a result, Stephen and I met face to face only once or twice through the course of the project and spoke on the phone only a few more times. Most of our interaction was over e-mail, which had both its good and its bad points. Design teams that span multiple geographical locations and time zones can learn a lesson from our experience.

E-mail positives included the ability to polish the wording of questions and answers before sending them on and to combine multiple comments into one communication instead of relying on a dozen or so phone calls each day. E-mail also bridged our work schedules; for example, I could send a message to him at the end of my day and would find a reply waiting for me (often composed past midnight) when I stumbled into the office the next morning. E-mail's conciseness sometimes caused confusion and delay, however. I particularly remember one multiday, multiple-e-mail debate about which synchronous-DRAM (SDRAM) density and architecture we should target, which ended only when we realized we were advocating the same part but just using different terminology to describe it.

Stephen wanted to design a multi-FPGA neural network "fabric," or parallel-processing scheme. Although the idea is technically intriguing, I was unsure about how much practical application the architecture would have for EDN's readers. As I continued negotiating with software vendors, he went back to the drawing board and in early December 1997 delivered a Revision 0.3 specification for a data-transform controller design that includes two system buses, a DRAM controller, and various internal arithmetic functions. Feedback and negotiation resulted in an almost-complete Revision 0.4 in early January.

Design details
The design includes status, ID, and control registers, accessible via host-driven I/O reads and writes. Memory reads and writes can access the 64-Mbit SDRAM or the back-end bus. Back-end reads and writes always flow through the X2 and X1 transforms, respectively. By manipulating the transform-control inputs, you can make the design pass unchanged data from the back-end bus through the transforms, or you can modify the data with SDRAM contents and arithmetic and logical functions. The host bus approximates but doesn't functionally match a 32-bit multiplexed PCI bus. The initiator and target both drive the open-collector-ready signal, a conceptual combination of PCI's initiator- and target-ready signals. Our design does not support burst suspend or early termination and does not use request or grant lines. The FPGA decodes the upper 3 address bits to determine host selection, allowing for as many as eight targets on one bus. SDRAM and back-end 32-bit data bursts have a fixed length of 16, which an address/control cycle precedes. The target has as many as 16, originally eight, host clocks to respond to an initiator read request with first data. All I/O- and memory-read transactions include a single-clock turnaround cycle between an address and the first data. The DRAM controller supports 64-Mbit, self-refresh SDRAM with a 16-bit data bus and four-bank internal array. The design uses explicit bank precharging at the end of each 32-access-long read or write burst, which also terminates the SDRAM internal address burst counter. Automatic FPGA-driven writes on power-up and reset configure the SDRAM for column-access-strobe-to-read data latencies of two clocks and full-page burst length. The back-end bus provides separate 20-bit address and 16-bit data buses. The FPGA specifies the desired operation by driving the ZRD# or ZRW# outputs along with the address and data transfers. Data bursts have a fixed length of 32 accesses. 
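As a rough illustration of the host-selection decode described above, the following sketch compares the upper 3 bits of the 32-bit multiplexed address with a per-device ID during the address cycle. It is not taken from the project's source code: the entity name, the port names (ad, addr_valid, selected), the active-high reset, and the TARGET_ID generic are all hypothetical, chosen only to show the idea.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY host_select_sketch IS                -- hypothetical name
  GENERIC (
    -- Assumed per-device ID; 3 bits allow as many as eight targets.
    TARGET_ID : std_logic_vector(2 DOWNTO 0) := "101"
  );
  PORT (
    clk        : IN  std_logic;
    reset      : IN  std_logic;   -- assumed active-high, asynchronous
    addr_valid : IN  std_logic;   -- assumed: high during the address/control cycle
    ad         : IN  std_logic_vector(31 DOWNTO 0);
    selected   : OUT std_logic
  );
END host_select_sketch;

ARCHITECTURE rtl OF host_select_sketch IS
BEGIN
  PROCESS (clk, reset)
  BEGIN
    IF reset = '1' THEN
      selected <= '0';
    ELSIF clk'EVENT AND clk = '1' THEN
      -- Latch the decode result only while an address is on the bus.
      IF addr_valid = '1' THEN
        IF ad(31 DOWNTO 29) = TARGET_ID THEN
          selected <= '1';
        ELSE
          selected <= '0';
        END IF;
      END IF;
    END IF;
  END PROCESS;
END rtl;
```

Registering the decode result keeps the comparison out of the path to later state-machine logic, at the cost of one clock of latency; a purely combinational decode would be an equally valid choice.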
Although the design specification describes an optional asynchronous test port to manipulate data within the FPGA, such as the output of each stage of the X1 and X2 transforms, neither Stephen nor I have as yet implemented this feature. However, we provide four control register outputs suitable for monitoring with an oscilloscope or logic analyzer, or to drive debug LEDs. A 17-bit host-configurable length counter decrements for each host-bus burst cycle. The SDRAM address counter increments for each SDRAM read or write burst cycle, including back-end reads and writes, which use SDRAM data in the transforms. The back-end address counter increments for each back-end read or write cycle. Although we could have alternatively implemented the FIFOs as several-stage pipelines or made them narrower, we stuck with the original 16×32- and 32×16-bit definitions. This depth allows for flexibility in operating frequency for the logic-core, host, and back-end buses. It also lets you and us add host-bus hold-off and pipelining capabilities in the future.

Although I had intended to immediately begin the design, I didn't execute this plan well. Other work obligations, plus unanticipated business travel and a few ill-timed computer crashes (unrelated to the FPGA-design software), distracted me. As a result, I wrote my first line of VHDL on Sunday afternoon, February 22.

The following discussion assumes that you have some familiarity with VHDL command syntax and entity, architecture, and component structures. If not, you might want to begin by examining any of the materials listed in References 1 through 4 and 8.

Two engineers: two approaches

From the beginning of the design, I strove to test one of the key selling points of synthesis: vendor independence, which means not using libraries of parameterizable modules (LPMs), direct instantiation, or EDA vendor-proprietary routines. Stephen, however, would instantiate whenever possible in his schematic and synthesis designs.
He also planned to insert timing and layout constraints to guide compilation and the back-end tools, neither of which I intended to do. If necessary, he would predefine the pinout to guide the data flow through the part, whereas I let the tools pick the pins they thought best. My approach would also theoretically let a designer migrate my design to an ASIC without significant rewrites. Unintentionally, however, I broke my vendor-independency vow. I declared all signals of type std_logic and std_logic_vector, whose possible states extend beyond '1' and '0' to include simulation-friendly 'Z,' 'X,' 'U,' and other values. This approach also comes in handy when you create tri-states and bidirectional buffers. Direct manipulation of std_logic signals worked fine for logical functions, such as ANDs and ORs, shifts, and rotates. However, FPGA Express flagged as an error every attempt to add or subtract a number and signal or two signals. VHDL is a strongly typed language. Indiscriminately mixing signals or variables of different type or length causes error messages, which is both good and bad. Type enforcement keeps you from making mistakes that would be difficult to catch and debug during simulation, but this enforcement can be frustrating when you know what you want to do and are looking for a syntax that the synthesis tool accepts.
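The listing that accompanied this part of the article is not reproduced here. As a minimal sketch of the workaround, assuming two 8-bit inputs and a clocked process, the Synopsys-supplied std_logic_unsigned package lets you add std_logic_vector signals directly and interpret them as numbers. Only the names data_ina, data_inb, and data_out and the USE clause come from the article; the entity name, the data_sum output, and the use of CONV_INTEGER for the comparison are assumptions made for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
-- Synopsys package, not an IEEE standard: treats std_logic_vector as unsigned.
-- Swapping in ieee.std_logic_signed.ALL treats the same bits as two's complement.
USE ieee.std_logic_unsigned.ALL;

ENTITY compare_sketch IS                    -- hypothetical name
  PORT (
    clk      : IN  std_logic;
    data_ina : IN  std_logic_vector(7 DOWNTO 0);
    data_inb : IN  std_logic_vector(7 DOWNTO 0);
    data_sum : OUT std_logic_vector(7 DOWNTO 0);  -- assumed extra output
    data_out : OUT std_logic
  );
END compare_sketch;

ARCHITECTURE rtl OF compare_sketch IS
BEGIN
  PROCESS (clk)
  BEGIN
    IF clk'EVENT AND clk = '1' THEN
      -- "+" on std_logic_vector comes from the package above;
      -- without it, this line is a type error.
      data_sum <= data_ina + data_inb;

      -- CONV_INTEGER also comes from the package. Comparing integers
      -- sidesteps the operator-visibility clash some strict tools report
      -- for "<" applied directly to std_logic_vector.
      IF CONV_INTEGER(data_ina) < CONV_INTEGER(data_inb) THEN
        data_out <= '1';
      ELSE
        data_out <= '0';
      END IF;
    END IF;
  END PROCESS;
END rtl;
```

With the unsigned package, "10000000" reads as 128; with the signed package, the same bits read as -128, so the comparison result flips. Some simulators need an explicit option to make the Synopsys packages visible in the ieee library.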
Assuming that data_ina has value "10000000" and data_inb has value "00000000" when a rising-edge clock transition causes the process to run, data_out would be '0' after the process conclusion. If, however, you changed the line USE ieee.std_logic_unsigned.ALL; to USE ieee.std_logic_signed.ALL;, you'd end up with the opposite result.

Using the signed and unsigned packages solved the mixed-logic and arithmetic-function problems, but I later found that their "IEEE" label was deceptive. Synopsys developed the ieee.std_logic_unsigned and ieee.std_logic_signed packages and put them into the IEEE directory. Synopsys assured me that simulation vendors support these packages, and other synthesis companies have developed their own versions, although the files may not exist in the same place in the directory structure or have the same name. I'll revisit this topic after I've had a chance to run my code through multiple vendors' synthesis software and will report my findings.

How could I have avoided this type-incompatibility problem? According to David Pellerin from Accolade Design Automation (www.acc-eda.com), an alternative technique uses the VHDL-93 ieee.numeric_std package. With this approach, I'd convert the std_logic and std_logic_vector signals or variables to type signed or unsigned for arithmetic functions and then back to std_logic and std_logic_vector before performing additional logical operations or passing them to other entities. FPGA Express, however, doesn't support VHDL-93 syntax, which is not a big problem in my design. Although commands such as XNOR, SLL (shift left logical), and SRL (shift right logical) would have been nice to have, I developed workarounds without much effort.

All my research before coding indicated that HDLs are conceptually similar to high-level software languages. Because I consider myself a decent C programmer, I felt reasonably confident that developing VHDL expertise meant just learning another language syntax.
In retrospect, I see that this attitude was somewhat naive, and I'm not sure whether software experience helps or hinders an HDL novice. The simple explanation of the difference is that software executes sequentially, whereas most HDL statements, which create hardware, execute concurrently. All concurrent statements between the BEGIN and END commands of all architectures in a design, as well as all processes initiated by transitions of signals in their sensitivity lists, evaluate in parallel. This fact also means that the order of statements and processes within the file is irrelevant. Within a process, statements execute sequentially, but signals don't update until after the process terminates. As long as I kept these facts in mind, I was OK.

Looking at this function with software-biased eyes, you might think that the process runs each time a bit in the 8-bit a_bus signal changes. You'd be right. You might also think that the process terminates with x='1' if a_bus="11111111". Here, unfortunately, you'd be wrong. Remember that VHDL updates all signals only at the conclusion of the process.

Both Listings 2 and 3 use a variable, i, for looping, whereas Listing 3 adds variable tmp. Listing 3 assigns a value to signal x only at the end of the process. Variables differ from signals in that the simulator--and, therefore, the synthesis compiler--immediately updates variables. However, Reference 1 cautions against indiscriminate use of variables because of nonstandard support among simulator and synthesis tools, so I avoided them whenever possible. A good rule of thumb: In any process, the last statement that assigns a value to a signal is the one that takes precedence. You can also selectively make assignments using IF-THEN and CASE statements.

Coding style influences results

Placing multiple design entities in separate source files is another habit of software engineers who are used to writing modular code and numerous subroutines. A modular, hierarchical approach provides many benefits over one large design entity: easier reuse of portions of the architecture in future designs; step-by-step simulation as you develop each module; and a simpler-to-understand documentation style, which is analogous to multiple chapters in a book. In our case, this technique will also let us test vendors' synthesis tools not only on the entire design, but also on circuits within it. However, modularity has one significant downside: Because the tools sequentially compile each module and merge the netlists afterward, these tools cannot identify resource-sharing opportunities that span multiple modules.
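To illustrate the resource-sharing point, here is a minimal sketch in which two mutually exclusive additions sit in one process, so a synthesis tool is free to build a single adder and multiplex its operands. Placing each addition in its own separately compiled entity would hide that opportunity from the tool. The entity and port names are hypothetical, the sketch relies on the Synopsys ieee.std_logic_unsigned package for "+", and whether a given tool actually shares the adder depends on the tool and its settings.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned.ALL;  -- Synopsys package, supplies "+"

ENTITY share_sketch IS                      -- hypothetical name
  PORT (
    sel        : IN  std_logic;
    a, b, c, d : IN  std_logic_vector(7 DOWNTO 0);
    sum        : OUT std_logic_vector(7 DOWNTO 0)
  );
END share_sketch;

ARCHITECTURE rtl OF share_sketch IS
BEGIN
  PROCESS (sel, a, b, c, d)
  BEGIN
    -- Only one of the two sums is ever needed at a time, so one adder
    -- with multiplexed inputs can serve both branches.
    IF sel = '1' THEN
      sum <= a + b;
    ELSE
      -- ELSE covers '0' and every other std_logic value of sel,
      -- so sum is assigned on every path through the process.
      sum <= c + d;
    END IF;
  END PROCESS;
END rtl;
```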
Synthesis attempts to fit additional logic functions into the available portions of partially used look-up tables, but success depends on your design and isn't guaranteed. Notice too that Listings 4 and 5 define an outcome for all possible input combinations. This technique has two benefits. First, if you don't define all possible input conditions, synthesis infers and inserts a latch to hold signals at previous values when these undefined conditions occur. This result is acceptable if an inferred latch is what you intend, but otherwise you end up at least with unnecessary logic or, worse, a circuit that functions incorrectly. Second, because I used the std_logic type, my IF-THEN and CASE statements cover 'X,' 'Z,' and other signal and variable states.

Even the manner in which you use parentheses can impact the results. According to Xilinx's documentation, many synthesis tools implement the expression A+B+C+D as three series adders. By including two sets of parentheses, resulting in the expression (A+B)+(C+D), you end up with two parallel adders feeding a third adder, a higher performance configuration.

Standing on others' shoulders

As I dug deeper into the design, I often used Reference 1 and my seminar notes. Instead of creating my own FIFOs from scratch, I quickly modified an example in Reference 1. This task involved adding internal logic that simplified the external interface to read, write, clock, and reset inputs and adding empty and full flags. Modifications were straightforward, and I completed them in just over an hour. The host-bus FIFOs were a challenge, because I was interfacing between 32- and 16-bit ports, but resolving this issue was relatively easy. I also came across several other useful resources (References 9 through 11). My experiences with the other "core" went less smoothly.
Christian Green from MoSys (www.mosys.com) had developed a simple synchronous-graphics-RAM (SGRAM) controller (Reference 12), and I thought that migrating it to my SDRAM design would be simple. Wrong. My difficulty had little to do with the design itself, which was commented and made efficient use of silicon. However, I had to tear apart and redo more of the states and state transitions than I had anticipated. Reference 12's design targeted 8-Mbit, low-latency SGRAM, with a burst length of four accesses and autoprecharge. Mine, on the other hand, interfaced with 64-Mbit standard SDRAM, with a full page burst terminated after 32 accesses and no autoprecharge. The original design also used a modified Mealy state-machine design technique, which was a challenge to decipher. I also blame the fact that I dived into the conversion with an incomplete understanding of either state-machine coding in VHDL or SDRAM operating details.

After a week of work, averaging three hours per day on this project, I'd created the VHDL code for four unidirectional FIFOs, both transform-logic blocks, and the SDRAM controller. The transform functions were extremely fast and simple to code, and I imagined that both creating and modifying them in VHDL was much easier than what my partner was experiencing with his schematic designs. I'd also sketched out the remainder of the state machines, driven by host-bus input transitions. At this point, I decided to step back and see how well I was addressing chip- and system-level concerns. I'd moved ahead of Stephen's progress, and I encountered some minor spec discrepancies and missing details that we jointly needed to resolve before it made sense to proceed. I found several problems.

Performance, size issues

When I began modifying Reference 12's DRAM controller, I remembered that SDRAM access requests and in-progress row-refresh operations could occasionally collide.
Because refresh takes priority, the result would be a delay in the first read- or write-data transfer. Adding up clock cycles, I realized that my design would be unable to satisfy the original maximum eight-clock delay from host-read request to first-data output. Then, Stephen reminded me that, although the host bus used 32-bit data, everything else was 16 bits.

The answer to the performance problem was to run most of the chip at twice the clock frequency of the host-bus logic. If the XC4000XL had internal PLLs, I could have doubled an external host-bus clock. I could also halve a fast input clock with a T-flip-flop circuit, but then the design might be out of phase with the half-frequency clock that other host-bus devices create and use. Driving this derived clock to the other peripherals would solve the phase problem but cause unwanted timing skew. The concept of a target device's creating the host-bus clock also seemed awkward. The technique I chose instead employed two clock inputs from an external precision PLL. The inputs selectively drive portions of the FPGA.
Out of fairness to Synopsys, I'm not sure whether any other synthesis product can implement generic array definitions into RAM, either. Creating FIFO arrays from look-up tables or flip-flops means employing different logic usage (Table 1). The table omits the logic around the array and any logic blocks for signal routing. Unlike Xilinx's XC4000XL family, other PLDs and ASICs that provide no onboard RAM usable as embedded memory must use flip-flops to create FIFO elements.

I had to do something to make the design smaller. One option would be to reduce the FIFO depth, but this approach would limit future design flexibility. I also could have used separate FIFO buffers outside the FPGA. However, I'd probably also have to partition the host, back-end, and other design logic into multiple PLDs with unknown cost, power, performance, and board-space impacts. The design specification allowed for no simultaneous reads and writes for either the host or the back-end bus. So, I next converted the four 16-bit×32 unidirectional FIFO buffers into two bidirectional alternatives--one for the host bus and the other for the back-end bus.

This step partially solved the array-size problem, but a few issues remained. The back-end FIFO buffer had separate clka and clkb inputs, corresponding to the two data I/O ports. However, FPGA Express warned that I had defined a single FIFO-array-write process with both clocks in the sensitivity list. Breaking the FIFO writes into two processes solved this problem but synthesized a circuit with two 512-element arrays, one for each clock. Equally disturbing, FPGA Express' warning list reported that the outputs of each clka and clkb array flip-flop pair were driving a common node. Redesigning the back-end FIFO buffer for one clock input and write process got me to the desired 512 elements and eliminated the internal short circuits.
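The single-clock, single-write-process arrangement can be sketched as follows. This is not the project's FIFO: it shows only a 32-word by 16-bit storage array (512 elements) in which one process, clocked by one clock, is the only writer, so each element has exactly one driver. All names, the externally supplied write and read addresses, and the asynchronous reset are assumptions for illustration, and CONV_INTEGER comes from the Synopsys ieee.std_logic_unsigned package. How a tool maps the array (flip-flops, look-up-table RAM, or something else) depends on the tool.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned.ALL;  -- Synopsys package, supplies CONV_INTEGER

ENTITY fifo_array_sketch IS                 -- hypothetical name
  PORT (
    clk     : IN  std_logic;
    reset   : IN  std_logic;                -- assumed active-high, asynchronous
    wr_en   : IN  std_logic;
    wr_addr : IN  std_logic_vector(4 DOWNTO 0);
    rd_addr : IN  std_logic_vector(4 DOWNTO 0);
    din     : IN  std_logic_vector(15 DOWNTO 0);
    dout    : OUT std_logic_vector(15 DOWNTO 0)
  );
END fifo_array_sketch;

ARCHITECTURE rtl OF fifo_array_sketch IS
  -- 32 words x 16 bits = 512 storage elements
  TYPE mem_t IS ARRAY (0 TO 31) OF std_logic_vector(15 DOWNTO 0);
  SIGNAL mem : mem_t;
BEGIN
  -- The only process that assigns mem: one clock, one driver per element.
  write_proc : PROCESS (clk, reset)
  BEGIN
    IF reset = '1' THEN
      FOR i IN 0 TO 31 LOOP
        mem(i) <= (OTHERS => '0');
      END LOOP;
    ELSIF clk'EVENT AND clk = '1' THEN
      IF wr_en = '1' THEN
        mem(CONV_INTEGER(wr_addr)) <= din;
      END IF;
    END IF;
  END PROCESS write_proc;

  -- Combinational read of the addressed word.
  dout <= mem(CONV_INTEGER(rd_addr));
END rtl;
```

Writing the same array from a second process on a different clock would give every element two drivers, which is the kind of conflict the warnings described above point to.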
Because the host FIFO buffer's two ports needed to run at different frequencies for performance reasons, I employed two 512-element unidirectional FIFO buffers in this case. Other improvements for both the back-end and host included explicitly initializing array flip-flops on reset, which allowed FPGA Express to use the XC4000XL global reset network. I could have also multiplexed both ports for each FIFO buffer onto a common set of I/O signals, which would have reduced the routing requirements. However, for performance reasons, I wanted to allow the state machines feeding data into one end of the FIFO buffer and out the other to run in parallel.

The lesson? When your design can take advantage of available embedded RAM, the amount of work to retarget the HDL code to a different FPGA or an ASIC may be an acceptable trade-off for significantly improved silicon efficiency.

The reference design implements considerable data movement throughout the FPGA with the resulting potential for bus contention (Figure 2). For example, the host FIFO drives the SDRAM's data buffers during writes, and SDRAM reads can drive the X1 and X2 transform inputs as well as contend with the X2 transform output for the host FIFO buffer. I first controlled which source would drive a node using internal tristate buffers. However, rereading various VHDL and programmable-logic literature, I realized that unless I closely controlled the internal timing and logic placement, I might easily end up with transitional bus contention, increasing power consumption and reducing device reliability. The alternative strategy, which I implemented wherever possible, used multiplexers. This approach consumes additional logic and routing resources. However, my chosen strategy also meant I wouldn't have to create layout and timing constraints to ensure that two data sources weren't simultaneously driving a common internal node.
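A minimal sketch of the multiplexer strategy: a single selector picks exactly one source for an internal node, so two sources cannot drive the node at once, whatever the timing. The source names loosely follow the data paths described above, but the entity, the 2-bit src_sel encoding, the 16-bit width, and the all-zeros default are assumptions rather than details of the actual design.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY node_mux_sketch IS                   -- hypothetical name
  PORT (
    src_sel    : IN  std_logic_vector(1 DOWNTO 0);  -- assumed encoding
    host_fifo  : IN  std_logic_vector(15 DOWNTO 0);
    sdram_data : IN  std_logic_vector(15 DOWNTO 0);
    x2_out     : IN  std_logic_vector(15 DOWNTO 0);
    node       : OUT std_logic_vector(15 DOWNTO 0)
  );
END node_mux_sketch;

ARCHITECTURE rtl OF node_mux_sketch IS
BEGIN
  PROCESS (src_sel, host_fifo, sdram_data, x2_out)
  BEGIN
    CASE src_sel IS
      WHEN "00"   => node <= host_fifo;
      WHEN "01"   => node <= sdram_data;
      WHEN "10"   => node <= x2_out;
      -- OTHERS covers "11" and any non-'0'/'1' selector values,
      -- so node always has a defined source and no latch is inferred.
      WHEN OTHERS => node <= (OTHERS => '0');
    END CASE;
  END PROCESS;
END rtl;
```

The tristate alternative would instead give each source its own conditional assignment that drives 'Z' when disabled, and it would rely on the enable signals never overlapping, even briefly.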
The performance impacts of my multiplexer choice were unclear; I'd heard conflicting data in the past from engineers about whether internal tristates resulted in faster or slower designs. One performance improvement that I decided to omit involves placing one or several pipeline-register sets within the X1 and X2 transforms. Although this approach would increase first-data latency through the transforms, it might also boost the maximum operating frequency. However, this technique would also increase the on-chip logic and complicate the state machines. Stephen plans to investigate adding pipelines to his designs, and I'll be interested to see whether they significantly impact performance or size.
Acknowledgments

It's been a pleasure working with Stephen Wasson from HighGate Design on this project; the reference-design concept and specification he developed are comprehensive in breadth and depth. I'd also like to acknowledge the technical assistance of David Pellerin of Accolade Design Automation, Kevin Skahill of Cypress Semiconductor, Christian Green of MoSys, Ron Plyer and Ramine Roane of Synopsys, Tom Barber of Viewlogic, and Loren Lacy of Xilinx.
Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.