Discovering the last unrealized power reduction
Power-optimized architectures help engineers designing chips with blocks that can power down or operate at reduced frequencies and voltages.
Jay Chiang, Synopsys -- EDN, September 9, 2010
Power has become one of the most important design criteria for almost all design projects, and the industry, in response, has invested a lot of effort to address this challenge. Consequently, we have seen a plethora of low-power design techniques and new technologies emerge. Some of these techniques are relatively easy to adopt. For example, clock gating and multiple-threshold-voltage cells have become mainstream design practices because they are effective. In addition, EDA tools can automate their implementation. Some techniques, on the other hand, require more planning. For example, design engineers can group SOC (system-on-chip) circuits into multiple blocks so that they can power down some blocks or operate them at reduced frequencies or voltages when operating conditions allow it. Although these more advanced techniques take more deliberate effort to implement, design engineers are increasingly employing them to meet the more stringent power requirements in next-generation chips.
When applying low-power design techniques, design engineers typically concentrate on only the few modules, such as embedded processors and on-chip memories, that consume more power than the other blocks. Although this focus is necessary, it is incomplete. Engineers may often overlook the fact that many low-power-consuming blocks frequently have a greater impact on energy consumption than their power-consumption number suggests. If you correctly plan a chip’s power-management strategy, the power-consumption profile and energy-consumption profile should not correlate closely. You should keep the active period of the high-power-consuming modules as short as possible. The modules that remain powered for a long time should not consume too much power. Even though these modules consume less power than other blocks, they consume a higher proportion of energy once you factor in their extended active time.
Consider a hypothetical cellular-phone design. Under typical usage, the cellular phone is mostly in standby mode. During standby, most circuits, except the wireless receiver or receivers, are off. Although standby mode consumes only a fraction of the power that the other modes consume, it still consumes 36% of the total energy, after factoring in the active period. In other words, it pays dividends to aggressively reduce power for circuits that are active in the standby mode because it can lead to significant savings in battery life (Table 1).

Such opportunities for energy reduction exist in most SOCs. In general, if the chip has multiple power domains, it has multiple power modes. If you identify the power modes that are most active, you can isolate the circuits that have higher impact on the chip’s energy consumption, and you can more aggressively pursue power reduction in these focused areas to reduce the overall energy footprint of the chip.
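The difference between a power profile and an energy profile can be made concrete with a short calculation. The sketch below is only an illustration: the mode names, power figures, and time fractions are invented assumptions (they are not the Table 1 values), and each mode's energy is simply its average power multiplied by the time spent in it.

```python
# Hypothetical power modes: (average power in mW, fraction of time in the mode).
# These numbers are illustrative assumptions, not data from the article.
MODES = {
    "standby": (5.0, 0.90),
    "talk": (300.0, 0.06),
    "data": (500.0, 0.04),
}


def energy_shares(modes):
    """Return each mode's share of total energy (energy = power x time)."""
    energy = {name: power * time for name, (power, time) in modes.items()}
    total = sum(energy.values())
    return {name: value / total for name, value in energy.items()}


if __name__ == "__main__":
    for name, share in energy_shares(MODES).items():
        power, time = MODES[name]
        print(f"{name:8s} power={power:6.1f} mW  time={time:5.1%}  energy share={share:5.1%}")
```

With these assumed numbers, standby draws 1% of the power of the busiest mode yet accounts for roughly a tenth of the total energy, which is the kind of mismatch the power-mode analysis is meant to expose.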
Analysis of these circuits in further detail uncovers some interesting characteristics. These modules must remain on for extended periods because they perform essential functions for the chip in that operating mode. They are often continuously calculating data or processing signals. In addition to the cellular-phone example, other circuits, such as audio or video processors in playback or talk mode and signal-processing blocks, such as equalizer, modulation, or cryptology units, in wireless and networking applications, have more datapath content than control logic and can benefit considerably from low-power techniques.
If you consider the technology horizon, a new generation of connected devices aiming to deliver better user experiences and higher data rates is driving many new design starts. Consequently, these new projects will demand higher audio quality, higher video resolution, more pixel support, more complex signal processing, faster data rates, and so forth. Increases in the size and complexity of the signal-processing blocks in turn lead to a higher energy footprint in the new designs. The impact of this design complexity requires design engineers to more closely manage the power consumption for these blocks.
Low-power datapaths
Power gating isn’t feasible for circuits that must continuously remain active, so the only choice is to make the circuit intrinsically low power. The first step is to lower the voltage, the operating frequency, or both without missing the performance target. However, slower clock frequencies allow deeper logic levels between registers, and these circuits usually include more datapath logic than control logic. Datapath logic is notoriously prone to glitches (unwanted transitions that settle before the next clock edge) and excess switching because any spurious transition propagates downstream and ripples throughout the entire datapath tree. Although glitches pose no functional issues, these transitions still consume power.
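As a rough guide to what lowering voltage and frequency buys, the sketch below uses the standard first-order CMOS approximation that dynamic power is proportional to switched capacitance times voltage squared times frequency. The function name and the operating points are illustrative assumptions, and the model ignores leakage and glitch power.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """First-order CMOS model: dynamic power is proportional to C * V^2 * f.

    Switched capacitance and switching activity are assumed unchanged, so
    only the voltage and frequency scale factors appear in the result.
    """
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical operating points, expressed relative to nominal.
    for v, f in [(1.0, 1.0), (1.0, 0.5), (0.8, 0.5)]:
        scale = relative_dynamic_power(v, f)
        print(f"V x{v:.1f}, f x{f:.1f} -> dynamic power x{scale:.2f}")
```

In this model, halving the frequency alone halves power but leaves the energy per operation unchanged; the quadratic voltage term is what lowers energy per operation.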
It is critical to avoid increasing power in other areas while reducing it in one area. Making this power-reduction approach more effective requires more balanced, shallower architectures that can limit the propagation of the transitions. Although most EDA tools do an adequate job producing timing- and area-optimized architectures that designers later optimize for power at the gate level, they are less effective in considering the power consequence of architectural selections upfront.
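The effect of balance on glitching can be seen in a toy unit-delay simulation. The sketch below is not a model of any real synthesis result: it assumes every XOR gate has the same one-step delay and all inputs change at once, then compares a 4-input XOR built as a serial chain with the same function built as a balanced tree, counting transitions that do not contribute to the final value. All function names are hypothetical.

```python
from itertools import product


def xor_chain(n):
    """Serial chain of 2-input XORs. Signals 0..n-1 are primary inputs;
    gate g drives signal n + g."""
    gates = [(0, 1)]
    for k in range(2, n):
        gates.append((n + k - 2, k))  # previous gate output XOR next input
    return gates


def xor_tree(n):
    """Balanced tree of 2-input XORs; n must be a power of two."""
    gates, level, next_signal = [], list(range(n)), n
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            gates.append((level[i], level[i + 1]))
            nxt.append(next_signal)
            next_signal += 1
        level = nxt
    return gates


def settle(n, gates, inputs):
    """Steady-state signal values (gates are listed in topological order)."""
    signals = list(inputs) + [0] * len(gates)
    for g, (a, b) in enumerate(gates):
        signals[n + g] = signals[a] ^ signals[b]
    return signals


def transitions(n, gates, old, new):
    """Return (total, useful) gate-output transitions for one input change."""
    start = settle(n, gates, old)
    signals = list(start)
    signals[:n] = new
    total = 0
    while True:
        # Every gate evaluates the previous step's values: unit delay.
        updated = signals[:n] + [signals[a] ^ signals[b] for a, b in gates]
        changes = sum(u != s for u, s in zip(updated, signals))
        if changes == 0:
            useful = sum(s != f for s, f in zip(start[n:], signals[n:]))
            return total, useful
        total += changes
        signals = updated


def total_glitches(n, gates):
    """Sum of non-useful transitions over every pair of input vectors."""
    glitches = 0
    for old in product((0, 1), repeat=n):
        for new in product((0, 1), repeat=n):
            total, useful = transitions(n, gates, old, new)
            glitches += total - useful
    return glitches


if __name__ == "__main__":
    print("chain glitches:", total_glitches(4, xor_chain(4)))
    print("tree glitches: ", total_glitches(4, xor_tree(4)))
```

Under these assumptions the balanced tree produces no glitches at all, because every path to a gate has the same length, while the chain does. Real gate and wire delays are never perfectly matched, so actual savings are smaller than this idealized comparison suggests.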
Some design engineers try various means of writing power-optimized architectures into RTL (register-transfer-level) code to save power. However, most low-power architectural-RTL coding focuses on reducing area, based on the assumption that using fewer cells equates to less power consumption. For example, some design engineers in networking and multimedia applications truncate the LSBs (least-significant bits) of the data when precision is not critical.
Although this technique is useful, you must understand the details of how to implement it. Datapaths differ from other logic circuits in that they perform computer arithmetic that generates carries and sums, requiring carry-propagating adders to add together the carry and sum to produce a binary number. For RTL coded at a high level, EDA tools usually can generate datapath architectures, keeping all the numbers in redundant format (annotating the value of the number with both carry and sum) until the last level of the output.
If you code the datapath at a lower level, you might turn to coding practices that divide a larger datapath block into several small ones, forcing the RTL-synthesis tools to insert carry-propagation adders into the final stage of every smaller block (Figure 1a), hence increasing area and delay. The resulting increased area sometimes offsets the entire power gain from the LSB truncation. For optimal results, you must consider RTL-coding practices that allow the merging of datapath blocks to avoid unnecessary binary conversions (Figure 1b).
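The redundant carry-and-sum format can be illustrated with a small behavioral model. The sketch below is an assumption-laden illustration rather than what any synthesis tool generates: it models a 3:2 carry-save compressor on non-negative Python integers, treats every `+` as one carry-propagate adder, and hard-codes the adder counts for a merged and a split four-operand sum. The function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a (sum, carry) pair.

    Each bit position is computed independently, so no carry ripples
    across the word.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def merged_sum(a, b, c, d):
    """Merged datapath: stay in carry-save form until the final output."""
    s, carry = carry_save_add(a, b, c)
    s, carry = carry_save_add(s, carry, d)
    result = s + carry  # the only carry-propagate addition
    return result, 1    # (value, carry-propagate adders used in this model)


def split_sum(a, b, c, d):
    """Split datapath: each sub-block converts to binary at its output."""
    left = a + b           # carry-propagate adder 1
    right = c + d          # carry-propagate adder 2
    result = left + right  # carry-propagate adder 3
    return result, 3


if __name__ == "__main__":
    print(merged_sum(5, 6, 7, 8))
    print(split_sum(5, 6, 7, 8))
```

Both versions compute the same value; the difference in this model is only how many times the redundant form is converted back to binary.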
Some design engineers also try to code isolation logic in front of the datapath logic so that they can suppress the switching and transition of the datapath tree until there is valid data. Depending on the input-data profile and how frequently the data is valid, this approach could save significant dynamic power. The concept, operand isolation, is similar to clock gating, except that it takes place on the datapath instead of the clock paths (Figure 2). The concept, also known as data gating or datapath gating, is appealing, but it is sometimes difficult to implement in practice. Unlike clock gating, adding isolation logic to datapaths increases the path delay. This timing overhead can make it tricky to close timing. Some RTL-synthesis tools can automatically insert the isolation logic; however, engineers do not widely use the feature because it degrades timing.
An alternative approach
Datapath generators traditionally produce the most area-economic architectures that still meet the timing constraints. Engineers then optimize the generated designs for power at the gate level. At this level, the scale of optimization involves only a few gates. The flows don’t provide power-optimized architectures, so some designers manually code them in low-level RTL, which can hinder datapath optimization and degrade the quality of results.
To improve this situation, the first step is to understand what kind of datapath architectures consume less power so that you can use the knowledge to create more low-power architectures. Second, you should characterize the power costs of the datapath structures at a high level so that you can fully consider the power consequences when making architectural decisions.
Examples include the power-stingy architectures of the Synopsys DesignWare minPower components. These low-power datapath architectures are flatter, shallower, and more balanced than traditional architectures to produce fewer spurious transitions. When these unwanted transitions occur, datapath structures with smarter cell selections can limit their propagation. For example, instead of using common XOR-based datapath cells, such as full adders or XOR-based Booth encoders, the minPower components employ architectures that favor more AND or NAND cells so that fewer transitions ripple throughout the datapath tree.
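Why AND or NAND cells pass on fewer transitions than XOR cells follows from their truth tables. The sketch below is a simplified illustration that assumes two-input gates with uniformly distributed, independent inputs; it enumerates every input state and counts how often a toggle on one input reaches the output. The names are hypothetical.

```python
from itertools import product

GATES = {
    "XOR": lambda a, b: a ^ b,
    "AND": lambda a, b: a & b,
    "NAND": lambda a, b: 1 - (a & b),
}


def propagation_probability(gate):
    """Fraction of input states in which toggling input `a` toggles the output."""
    cases = list(product((0, 1), repeat=2))
    propagated = sum(gate(a, b) != gate(1 - a, b) for a, b in cases)
    return propagated / len(cases)


if __name__ == "__main__":
    for name, gate in GATES.items():
        print(f"{name:4s} passes an input toggle in {propagation_probability(gate):.0%} of states")
```

An XOR gate passes every input toggle to its output, whereas an AND or NAND gate passes it only when the other input is 1, which is half of the states under the uniform-input assumption.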
Integrating these power-friendly architectures yields some advantages. Aside from being easier to use, these architectures allow designers to capture power-saving opportunities that are hard to realize with a manual approach. Because power consumption depends on operating conditions, it is not enough to consider the circuit architecture outside the design’s context or independently of circuit switching.
To achieve the best result, you must reevaluate the architecture using a logic-synthesis tool, such as Design Compiler, employing the timing model and switching profile.
For example, consider a two-input multiplier with uneven switching activities on the operands. Although the multiplication is a commutative operation, the dynamic-power consequence is not. If you use the high-activity input for partial-product generation, the multiplier will consume more dynamic power due to a higher level of switching activities that propagate through the rest of the multiplier. If you switch the high-activity input to the input of the partial-product selector, you can lower the switching activity in the partial-product generator as well as the overall multiplier (Figure 3). This kind of optimization is hard to plan in the RTL-coding stage and is more suitable to perform during synthesis.
Applying this concept on a larger scale enables you to achieve more power savings. In general, irregularity in the data or a circuit provides a power-saving opportunity. For example, multimedia data usually has uneven activities among the data bits. It usually has lower activities in MSBs (most-significant bits) and higher activities in LSBs. If you are aware of this phenomenon, you can design datapath architectures so that the LSBs feed into the datapath tree downstream, hence reducing the dynamic power for audio- or video-signal processing. Likewise, you can use the circuit’s irregularity to lower internal power and leakage power. For example, you can substitute regular cells with slower high-threshold voltage or low-drive cells whenever there is timing slack.
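The uneven bit activity mentioned above is easy to measure on sample data. The sketch below uses a triangle wave as a hypothetical stand-in for a slowly varying signal (real audio or video data will look different) and counts how often each bit position toggles between consecutive samples.

```python
def bit_toggle_counts(samples, width):
    """Count how often each bit position toggles between consecutive samples."""
    counts = [0] * width
    for prev, cur in zip(samples, samples[1:]):
        diff = prev ^ cur
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


def triangle_wave(width, periods):
    """Hypothetical slowly varying signal: moves by one step per sample."""
    top = (1 << width) - 1
    one_period = list(range(0, top)) + list(range(top, 0, -1))
    return one_period * periods


if __name__ == "__main__":
    samples = triangle_wave(width=8, periods=2)
    for bit, count in enumerate(bit_toggle_counts(samples, 8)):
        print(f"bit {bit}: {count} toggles")
```

For this signal the LSB toggles on every sample and the MSB only when the value crosses mid-scale. A switching profile gathered this way from representative data is the kind of information that lets a tool decide which bits to place late in the datapath tree.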
You can configure the DesignWare minPower architecture to create more timing slack, which leaves more room for substituting high-threshold-voltage or low-drive cells. However, manually exploiting the circuit’s irregularity is difficult because it is imperative to balance the power cost against the area cost to avoid any adverse effect from over-aggressive power optimization. The tool must automatically consider timing and design needs during the architecture selection to realize the power savings with minimal area overhead.
The biggest advantage of power-friendly architectures is that they do not disrupt design flows. The power savings come from making power-smart choices when implementing the microarchitectures for the RTL code. This approach requires no changes to the higher-level software, system, or RTL design. After you select the architectures, the netlists go through the same gate-, physical-, or process-level optimization in the back end. You need not change design flows or design-database formats except for adding a new knowledge base, a synthetic-library database (.sldb file), to the RTL-synthesis stage. On top of the design project’s original power strategy, this approach can deliver as much as 42% additional power reduction at the block level and as much as 24% at the chip level.
This architecture-level power-optimization approach does have some limitations, however. To get the power benefit, you must integrate in-house or third-party IP (intellectual property) into the design at the RTL because the optimization takes place at the RTL. The automatic IP insertion relies on a logic-synthesis tool, such as Design Compiler, to extract the datapath architecture from the RTL; therefore, the code must be in a style that the synthesis tool recognizes. In other words, if the datapath is in low-level RTL that already prescribes the architectures, the synthesis tool cannot alter the design’s intent.
To enable architectural-level power optimization, designers should start from high-level RTL code using as much operator inference as possible. To allow extraction of larger datapath blocks, you should consider using automatic retiming instead of manually inserting a pipeline. Whenever possible, use a realistic representative switching profile, which usually improves the result, especially for applications that have unevenly distributed activities on the input.
Because power is a physical-domain characteristic, your standard-cell library can affect the power-optimization result. A standard-cell library with a collection of datapath cells that have good drive strength and threshold-voltage variations allows wider architecture selections.
Some libraries support special datapath cells but have few or no drive-strength variations or have them only with standard threshold-voltage implementations. Such cells often go unselected during synthesis, which limits the number of available architectures. To improve results, use a standard-cell library with more drive-strength and threshold-voltage variations that have accurately characterized power numbers.
You can’t optimize what you can’t observe. To lower the energy consumption of your next SOC project, you must first identify which portions of the SOC are consuming the most energy. It is worth distinguishing power consumption from energy consumption. To get a more energy-efficient design, you must pay attention to the circuits that remain on for a long time. Therefore, you must carefully analyze the power modes to identify the best energy-saving opportunities.
When working on these modules, decide early in the chip-planning stage to run these circuits at low clock frequencies and low voltage. The additional power-saving opportunities must come from designing more power-friendly circuits that require less switching activity and are built with a higher percentage of low-leakage and low-drive cells. The least expensive way of achieving this goal is by using power-friendly architectures that require no costly design-flow re-engineering efforts.