Optimizing high-performance CPUs, GPUs and DSPs? Use logic and memory IP—Part I
In this two-part article, we describe the available logic library and memory compiler IP and a typical EDA flow for hardening processor cores. Part I then presents techniques that use those logic libraries and memory compilers within the design flow to optimize processor area. Part II describes methods that use the same elements to optimize processor performance and power consumption. The article closes with a preview of how FinFET technology will affect logic and memory IP and its use in hardening optimal CPU, GPU and DSP cores.
Why Different PPA Goals for CPU, GPU and DSP Cores?
CPU, GPU and DSP cores co-exist in an SoC and are typically optimized to different points along the performance, power and area (PPA) axes.
For example, CPUs are typically tuned first for high performance at the lowest possible power, while GPUs, because of the relatively large amount of silicon area they occupy, are usually optimized for small area and low power. GPUs can exploit parallel algorithms that lower the required operating frequency, but that parallelism increases silicon area, and a GPU can account for up to 40 percent of the logic on an SoC. Depending on the application, a DSP core may be optimized for performance, as in base station applications that process many signals, or for area and power, as in handset applications.
Logic Libraries for High-Performance and Low-Power Core Optimization
With synthesizable CPU, GPU and DSP cores, today’s high-performance standard cell libraries and EDA tools can achieve an optimal solution without requiring a new library for every processor implementation. To optimally harden a high-performance core, designers need the following components in a standard cell library:
- Sequential cells
- Combinational cells
- Clock cells
- Power minimization libraries and power optimization cells for non-critical paths
The setup time plus the clock-to-output delay of a flip-flop is sometimes referred to as the “dead” or “black hole” time. Like clock uncertainty, this time eats into every clock cycle that could otherwise be spent on useful computational work. Multiple sets of high-performance flip-flops are required to manage this dead time optimally. Delay-optimized flops rapidly launch signals into critical path logic clusters, and setup-optimized flops serve as capture registers to extend the available clock cycle. Synthesis and routing optimization tools can be effectively constrained to use these multiple flip-flop sets for maximum speed, resulting in a 15-20 percent performance improvement.
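The dead-time budget described above can be sketched as simple arithmetic: the time left for logic on a register-to-register path is the clock period minus the launch flop's clock-to-output delay, the capture flop's setup time, and the clock uncertainty. In the minimal Python sketch below, the `FlipFlop` class, the cell names and every picosecond value are hypothetical and chosen only for illustration; they are not taken from any real library and are not meant to reproduce the improvement figure quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlop:
    """Hypothetical flip-flop timing model (all values in picoseconds)."""
    name: str
    clk_to_q_ps: float  # delay from clock edge to valid output (launch side)
    setup_ps: float     # time data must be stable before the clock edge (capture side)


def useful_time_ps(period_ps: float, launch: FlipFlop, capture: FlipFlop,
                   uncertainty_ps: float) -> float:
    """Time left for combinational logic between a launch and a capture flop."""
    dead_time = launch.clk_to_q_ps + capture.setup_ps
    return period_ps - dead_time - uncertainty_ps


# Illustrative numbers only (assumptions, not library data).
GENERAL = FlipFlop("general_purpose", clk_to_q_ps=60.0, setup_ps=40.0)
DELAY_OPT = FlipFlop("delay_optimized", clk_to_q_ps=40.0, setup_ps=50.0)
SETUP_OPT = FlipFlop("setup_optimized", clk_to_q_ps=70.0, setup_ps=20.0)

PERIOD_PS = 500.0       # a 2 GHz clock
UNCERTAINTY_PS = 50.0   # assumed clock uncertainty

# Same general-purpose flop on both ends of the path.
baseline = useful_time_ps(PERIOD_PS, GENERAL, GENERAL, UNCERTAINTY_PS)
# Delay-optimized launch flop paired with a setup-optimized capture flop.
tuned = useful_time_ps(PERIOD_PS, DELAY_OPT, SETUP_OPT, UNCERTAINTY_PS)

print(f"baseline useful time: {baseline} ps")
print(f"tuned useful time:    {tuned} ps")
```

The point of the sketch is only that each flop variant is weak on the side it is not used for: the delay-optimized cell gives up setup time and the setup-optimized cell gives up clock-to-output delay, so the gain comes from pairing them correctly on a critical path.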
Optimizing register-to-register paths requires a rich standard cell library that includes the appropriate functions, drive strengths, and implementation variants. Even though any logic function can in principle be constructed from NAND gates alone, a rich set of optimized functions (NAND, NOR, AND, OR, inverters, buffers, XOR, XNOR, MUX, adders, compressors, etc.) is necessary for synthesis to create high-performance implementations. Advanced synthesis and place-and-route tools can take advantage of a rich set of drive strengths to optimally handle the different fanouts and loads created by the design topology and the physical distances between cells.
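To make the NAND-only point concrete, the short Python sketch below builds NOT, AND, OR and XOR purely from a two-input NAND and checks each against its truth table. The function names and the `NAND_COUNT` table are illustrative assumptions; the counts apply to these particular textbook constructions and say nothing about how a real library implements its cells.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    """Two-input NAND, the only primitive used below."""
    return not (a and b)


def not_(a: bool) -> bool:
    return nand(a, a)                      # 1 NAND


def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))                # 2 NANDs


def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))          # 3 NANDs


def xor_(a: bool, b: bool) -> bool:
    n = nand(a, b)                         # shared intermediate node
    return nand(nand(a, n), nand(b, n))    # 4 NANDs in total


def matches(fn, ref) -> bool:
    """Check a two-input function against a reference over all input pairs."""
    return all(fn(a, b) == ref(a, b)
               for a, b in product([False, True], repeat=2))


# NAND gates used by each construction above.
NAND_COUNT = {"NOT": 1, "AND": 2, "OR": 3, "XOR": 4}

print(matches(and_, lambda a, b: a and b))
print(matches(or_, lambda a, b: a or b))
print(matches(xor_, lambda a, b: a != b))
```

Even this small case shows the cost: an XOR built this way passes through three NAND levels, which is why a dedicated, optimized XOR cell is attractive on a critical path.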
Cells with multiple threshold voltages (Vt) and channel lengths give the tools additional options, as do variants of these cell functions, such as tapered cells that are optimized for minimal delay in typical processor critical paths. Having these critical-path-efficient cells and computationally efficient cells, such as AOIs and OAIs, available from the standard cell library provider is critical, but so is having a design flow tuned to take advantage of these enhanced cell options. Additionally, high drive-strength variants of these cells must be designed with special layout considerations to manage electromigration effectively when operating at gigahertz speeds.
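One way a tool might use multiple Vt options is sketched below in Python: for each cell, pick the lowest-leakage variant whose delay still fits the timing budget, and fall back to the fastest variant when nothing fits. This is a simplified illustration, not the method of any particular EDA tool. The variant names, the delay and leakage scale factors, and the `pick_vt` helper are all assumptions; the only fact relied on is that lower-Vt cells are faster but leak more.

```python
# Hypothetical Vt variants, ordered from lowest to highest leakage.
# Each entry: (name, delay scale relative to standard Vt, relative leakage).
# All numbers are illustrative assumptions.
VT_VARIANTS = [
    ("HVT", 1.30, 0.1),   # high Vt: slowest, least leakage
    ("SVT", 1.00, 1.0),   # standard Vt: reference point
    ("LVT", 0.80, 10.0),  # low Vt: fastest, most leakage
]


def pick_vt(base_delay_ps: float, budget_ps: float) -> str:
    """Return the lowest-leakage variant whose scaled delay meets the budget.

    If no variant meets the budget, return the fastest one so the remaining
    timing violation is as small as possible.
    """
    for name, delay_scale, _leakage in VT_VARIANTS:
        if base_delay_ps * delay_scale <= budget_ps:
            return name
    return VT_VARIANTS[-1][0]


# A cell on a non-critical path has slack to spare, so it can use high Vt.
print(pick_vt(base_delay_ps=100.0, budget_ps=140.0))
# A cell on a critical path needs the fast, leakier variant.
print(pick_vt(base_delay_ps=100.0, budget_ps=85.0))
```

Channel-length variants could be handled the same way by adding more entries to the table, since they also trade delay against leakage.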