ISSCC Day-2: Analog beats Digital, 28nm OMAP, and SandyBridge

-February 25, 2011

The second day of ISSCC began with the Multimedia & Mobile session. Of the 8 papers presented, only 2 came from industry: Texas Instruments (in collaboration with MIT) and MediaTek. National Taiwan University authored 3 of the papers, while KAIST (the Korea Advanced Institute of Science and Technology) had 2.

The KAIST paper, “A 57mW Embedded Mixed-Mode Neuro-Fuzzy Accelerator for Intelligent Multi-core Processor” (photo courtesy ISSCC ©2011 IEEE), was noteworthy in a surprising way. The IRIS (intelligent reconfigurable integrated system) design targets AI (artificial intelligence) applications such as object detection and recognition in smartphones, portable game consoles, and robotics. AI functions are often performed with neural-network software, but for mobile applications the designers determined that power and speed limitations called for a hardware solution. They came up with an unusual one: the analog PEC (processing element cluster), or APEC, visible in the lower-left corner of the die photo. The APEC is a 32 x 32 array capable of performing 1024 simultaneous multiply-accumulate operations for neuro-fuzzy inference in the analog domain. The analog design contributed to 54% lower power and 83% less processing delay compared with an equivalent multicore processor implementation.
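To make the 1024 figure concrete, here is a minimal digital sketch of what a 32 x 32 multiply-accumulate array computes in one pass. It assumes each cell performs one multiply and each row accumulates its 32 products; the paper's actual APEC organization and its analog circuits are not described here, and the function and variable names are hypothetical.

```python
# Digital stand-in for one pass of a 32 x 32 multiply-accumulate array.
# Assumption: one multiply per cell, one accumulator per row.
N = 32


def mac_array(weights, inputs):
    """Return (row_sums, mac_count) for an N x N array of MAC cells."""
    row_sums = []
    mac_count = 0
    for row in weights:
        acc = 0.0
        for w, x in zip(row, inputs):
            acc += w * x      # one multiply-accumulate per cell
            mac_count += 1
        row_sums.append(acc)
    return row_sums, mac_count


# Identity weights simply pass the inputs through, which makes the
# result easy to check by eye.
weights = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
inputs = [float(i) for i in range(N)]
outputs, mac_count = mac_array(weights, inputs)
print(mac_count)
```

A sequential processor has to step through these operations in a loop, whereas the analog array described in the paper evaluates all of them at once.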

Texas Instruments & MIT: A 28nm 0.6V Low-Power DSP for Mobile Applications

This joint research paper described a new design methodology, using iterative stochastic optimization with static timing analysis, that TI developed to port a derivative of the TMS320C64x+ OMAP to a 28nm low-power process. The DSP SoC integrates more than 600k instances of custom low-voltage logic cells and 43 instances (1.6 Mb) of 6T SRAM.

In developing the design, the TI-MIT team focused on the impact of on-die variation in the 28nm process. Local variation resulted in low-voltage delays that varied by 3x. Traditional Gaussian simulation of non-linear delay cells underestimated the actual delay distribution by 10 to 70%. The new iterative methodology is called NLOPALV, for nonlinear operating point analysis for local variations.
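Why a Gaussian assumption underestimates a non-linear delay can be seen with a toy model. The sketch below assumes, purely for illustration, that cell delay is an exponential function of a normally distributed parameter (delay = exp(k·z), z ~ N(0,1)), which makes the delay lognormal. This is not the paper's delay model, and the sensitivity k is a hypothetical value. It compares the usual "mean + 3·std" estimate with the delay actually reached at z = +3.

```python
import math


def gaussian_3sigma_estimate(k):
    """mean + 3*std of delay = exp(k*z), using the lognormal moments."""
    mean = math.exp(k * k / 2.0)
    std = math.sqrt((math.exp(k * k) - 1.0) * math.exp(k * k))
    return mean + 3.0 * std


def true_3sigma_delay(k):
    """Delay at z = +3; exp is monotonic, so this is the same percentile."""
    return math.exp(3.0 * k)


def underestimate_fraction(k):
    """Fraction by which the Gaussian estimate falls short of the true value."""
    return 1.0 - gaussian_3sigma_estimate(k) / true_3sigma_delay(k)


for k in (0.1, 0.3, 0.5):   # hypothetical sensitivities
    print(k, round(underestimate_fraction(k), 3))
```

In this toy model the shortfall grows with the sensitivity k, which is at least consistent with variation mattering most at low supply voltages.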

The goal of the NLOPALV methodology is to determine the stochastic delay probability. The “most probable” cell delay characteristics define an “operating point”. Characterization starts with all parameters at nominal, then a SPICE simulation is performed with a delta of +/- 0.1. The 3-sigma delays are calculated and written to a standard .lib file.

In NLOPALV, analysis progresses from cell library characterization to the path level, and finally to full-chip timing closure. Accuracy of ~5% was achieved by running STA (static timing analysis) twice about the operating point, interpolating, and iterating. Chip timing analysis is performed in four passes of successively increasing accuracy. In the first pass, non-critical paths are identified and discarded using standard STA with 3σ cell delays; 92% of setup paths and 95% of hold paths were eliminated in this pass. In the second pass only the capture clock tree is analyzed, followed in the third pass by the capture and launch clock trees, and in the fourth by the entire timing path including the datapath. Final timing closure determined chip setup/hold with 3σ certainty.
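The first pass can be pictured as a cheap, pessimistic filter. The following sketch shows the idea for setup checks only: if a path still meets timing when every cell is given its 3σ delay, it is discarded; only the remainder moves on to the more accurate passes. The `Path` type, the field names, and the example numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    required_ns: float      # time available before the capture edge
    delay_3sigma_ns: float  # sum of per-cell 3-sigma delays (pessimistic)


def first_pass_filter(paths):
    """Keep only paths that fail setup under pessimistic 3-sigma cell delays."""
    return [p for p in paths if p.required_ns - p.delay_3sigma_ns < 0.0]


# Hypothetical example data.
paths = [
    Path("alu_to_reg", required_ns=2.0, delay_3sigma_ns=1.4),    # passes, dropped
    Path("mul_to_acc", required_ns=2.0, delay_3sigma_ns=2.3),    # fails, kept
    Path("fetch_to_dec", required_ns=2.0, delay_3sigma_ns=1.9),  # passes, dropped
]
survivors = first_pass_filter(paths)
print([p.name for p in survivors])
```

Stacking every cell's 3σ delay is more pessimistic than the true path-level 3σ delay, so a path that passes this check can be dropped safely, while a path that fails it may still turn out to be fine in a later pass.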

In the end, the NLOPALV process narrowed the list to 87 paths that needed to be fixed. Had they not been discovered, the device would have failed. Following a traditional 3σ approach would have caused “over-fixing”. The DSP SoC design was fabricated and tested, with high-performance operation achieved at 587MHz and 1.0V (113mW), down to low-voltage operation at 3.6MHz with a 0.34V supply (720μW) when operating from external memory (caches disabled).
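One quick way to read the two reported operating points is to convert each into average energy per clock cycle (power divided by frequency). The sketch below only does that arithmetic on the figures quoted above; the function name is hypothetical. The two points are not strictly comparable, since the low-voltage figure was measured with caches disabled.

```python
def energy_per_cycle_pj(power_w, freq_hz):
    """Average energy per clock cycle in picojoules."""
    return power_w / freq_hz * 1e12


high_perf = energy_per_cycle_pj(113e-3, 587e6)   # 1.0V, 587MHz, 113mW
low_volt = energy_per_cycle_pj(720e-6, 3.6e6)    # 0.34V, 3.6MHz, 720uW

print(round(high_perf, 1), round(low_volt, 1))
```

Both points work out to roughly 200 pJ per cycle. Total power falls by about 157x between them, while clock frequency falls by about 163x.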

A Fully Integrated Multi-CPU, GPU and Memory Controller 32nm Processor, aka SandyBridge

A team from Intel in Haifa, Israel presented an overview of the SandyBridge processor, which “integrates up to 4 high performance Intel Architecture (IA) cores, a power/performance optimized graphic processing unit (GPU) and memory and PCIe controllers in the same die” (photo courtesy ISSCC ©2011 IEEE).


The SandyBridge SoC, fabricated in Intel’s high-k 32nm process, contains 1.16B transistors and, at 216 mm², was described as the largest die the company is currently manufacturing. The chip can be configured for 2 or 4 cores, with various sizes (8, 4 or 3MB) of L3 cache, and 12 or 6 execution units in the GPU. The L3 cache is organized in slices, but is fully shared between the CPU cores and the graphics. A high-performance on-die connectivity fabric (dubbed “the ring”) connects the processor to the L3 cache.

The core and L3 share a power plane. The PCU (power control unit) can shut off CPUs independently. Processor graphics runs off a separate supply. The SA (system agent) unit contains a dual-channel DDR3 memory controller, a 20-lane PCIe Gen2 controller, a display engine with two parallel pipes, the PCU, and the testability logic. The SA I/O logic and DDR3 I/O are in separate voltage domains. The chip operates with 0.65V for the core, L3 and GPU, 1.05V for I/O, and 1.5V for the DDR3.

The mobile quad-core version of SandyBridge is the i7-2820QM. By removing 2 slices, Intel creates the dual-core i7-2620M. By trimming further (to a 3MB cache), they get the i3-2100 desktop processor.

The clocking scheme relies on 13 PLLs (phase-locked loops). One PLL per slice is said to reduce power. A low-jitter PLL design enables minimal latency. A fixed-frequency PLL is connected to a finite state machine that selects frequency bands. An on-die low-noise voltage regulator is used to compensate the ring oscillators. Two external reference clocks are required: 120MHz for display and a 100MHz common reference for other parts of the chip. Only the PCIe PLL requires an LC (inductor-capacitor) tank for the VCO (voltage-controlled oscillator). Five overlapping frequency bands are required, and better than 4% open-loop frequency stability was achieved for a 120-degree change in temperature.
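The role of overlapping bands can be illustrated with a small state-machine sketch. The talk summary gives no band edges or details of the selection logic, so the five ranges below and the step-up/step-down rule are entirely hypothetical; the point is only that overlap lets a selector settle on a band without oscillating between neighbours.

```python
# Hypothetical band edges in MHz; each band overlaps its neighbours.
BANDS_MHZ = [(800, 1400), (1200, 2000), (1800, 2800), (2600, 3600), (3400, 4400)]


def select_band(target_mhz, start_band=0):
    """Step one band at a time until the target falls inside the current band."""
    band = start_band
    while True:
        lo, hi = BANDS_MHZ[band]
        if target_mhz < lo:
            if band == 0:
                raise ValueError("target below lowest band")
            band -= 1                      # step down one band
        elif target_mhz > hi:
            if band == len(BANDS_MHZ) - 1:
                raise ValueError("target above highest band")
            band += 1                      # step up one band
        else:
            return band                    # target is inside this band


print(select_band(3000))                   # searching upward from band 0
print(select_band(1300, start_band=4))     # searching downward from band 4
```

Because adjacent bands overlap, a target in the overlap region is accepted by whichever band the search reaches first, so the result depends on the starting band and the search never bounces between two neighbours.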

Temperature compensation and frequency compensation are performed by monitoring control varactors, one per core. Sandy Bridge also has a miniaturized CMOS-based thermal sensor with substantially reduced area compared to the diode sensors, but with a more limited temperature range of 80 to 100°C. Because of its smaller size, this sensor is placed at several known core hot spots to provide a more accurate picture of the operating temperature profile.

The Sandy Bridge TDP (thermal design power) ranges from 17W to 45W for the 2-core and 4-core mobile parts, up to 95W for a high-end desktop part.
