Tuesday, June 17, 2008

At VLSI Symposium NEC shows novel thermal monitor for SoCs


If your travel budget is like ours, the VLSI Symposium is taking place in Honolulu without you again this year. But the stream of interesting papers goes on, and we do our best to get a glimpse.

One of the interesting spots is not a paper about mega-speed or nano-power, but about a chip management technique from NEC Corporate Labs. The company reported today on an embeddable thermal sensor for use in dynamically controlling the operating temperature within functional blocks on an SoC.

The problem NEC is addressing is one of local heating. In modern SoCs, an individual processor, accelerator, or memory instance may find itself operating at a much higher data rate than the surrounding circuitry. At today's frequencies, the dynamic power dissipated by this activity can cause intense local heating. The rising temperature, in turn, substantially increases leakage currents in the transistors, further heating the area. The worst-case result is physical failure of the device, and even in the best case the overall device power is unnecessarily high due to the increased leakage.
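The feedback loop described above (heat raises leakage, leakage adds heat) can be illustrated with a toy model. This is only a sketch: the lumped thermal resistance, the exponential leakage law, and every numeric parameter are illustrative assumptions, not figures from NEC.

```python
import math

def settle_temperature(p_dynamic_w, t_ambient_c=45.0, r_thermal_c_per_w=2.0,
                       p_leak_ref_w=1.0, t_ref_c=25.0, k_per_c=0.02,
                       max_steps=200, runaway_c=150.0):
    """Iterate block temperature until it settles, or return None on runaway.

    All defaults are hypothetical values chosen only for illustration.
    """
    temp = t_ambient_c
    for _ in range(max_steps):
        # Assumed model: leakage power grows exponentially with temperature.
        p_leak = p_leak_ref_w * math.exp(k_per_c * (temp - t_ref_c))
        # Assumed model: temperature rise is thermal resistance times total power.
        new_temp = t_ambient_c + r_thermal_c_per_w * (p_dynamic_w + p_leak)
        if new_temp > runaway_c:
            return None  # treated here as a thermal failure
        if abs(new_temp - temp) < 0.01:
            return new_temp
        temp = new_temp
    return temp

light_load = settle_temperature(5.0)
heavy_load = settle_temperature(20.0)
overload = settle_temperature(60.0)
```

In this toy model the heavily loaded block settles well above the lightly loaded one, and part of that extra temperature comes from the added leakage rather than from the dynamic power itself.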

So NEC, like other chip design houses, is developing fine-grained ways to measure temperature at critical points on the die: processor pipelines, primary caches, and the like. The company uses the temperature information, interpreted by software, to control an external chip that dynamically shifts the voltage and clock frequencies of blocks on the SoC to manage local heating. This approach, which combines local temperature sensing with dynamic voltage-frequency scaling (DVFS), has reportedly been heavily used in Intel processors as well as in media SoCs from a number of companies.

Normally, the temperature-sensing element is a thermal diode. Current through the diode is quite sensitive to junction temperature. But NEC senior manager of device platform research Masayuki Mizuno points out that in order to be relatively accurate, the thermal diode must be quite large, and the designers must take considerable care in getting the analog signal—the junction current—to a location where it can be converted to digital for use in the monitoring equipment. An alternative is a ring oscillator circuit, whose frequency will of course be temperature-dependent. The oscillator produces a larger, more easily converted signal. But the frequency is not solely dependent on temperature. It's also strongly sensitive to voltage variations.

Instead, NEC settled on an elegantly simple circuit: a largish single transistor, Source shorted to Gate. The device generates only leakage current, which as mentioned is quite temperature-sensitive. The Drain is connected through a capacitor to ground, and a shorting switch is connected in parallel with the capacitor to provide a reset.

Once Reset is released, the capacitor integrates the leakage current, under the watchful eye of a simple comparator circuit. If you count up the time required for the comparator to fire, you have an accurate digital measure of the leakage current. But to get from current back to temperature, you have to calibrate the device.
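The arithmetic behind this time-to-digital conversion can be sketched as follows. A capacitor charged by a roughly constant current I reaches the comparator threshold V after a time t = C·V / I, so counting clock cycles until the comparator fires yields the current. The function name, capacitance, threshold voltage, and counter clock below are illustrative assumptions, not values reported by NEC.

```python
def leakage_current_from_count(count, clock_hz, cap_farads, v_threshold):
    """Estimate the average leakage current from the comparator trip time.

    count       -- clock cycles counted between reset release and comparator firing
    clock_hz    -- counter clock frequency (assumed)
    cap_farads  -- integrating capacitance (assumed)
    v_threshold -- comparator threshold voltage (assumed)
    """
    if count <= 0:
        raise ValueError("count must be positive")
    trip_time_s = count / clock_hz
    # Constant-current charging: I = C * V / t
    return cap_farads * v_threshold / trip_time_s

# Hypothetical numbers: 1 pF capacitor, 0.5 V threshold, 1 MHz counter clock.
current = leakage_current_from_count(count=1000, clock_hz=1e6,
                                     cap_farads=1e-12, v_threshold=0.5)
```

With these assumed values the comparator fires after 1 ms, which corresponds to about 0.5 nA. A hotter transistor leaks more, so the count gets smaller as temperature rises.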

That's where elegantly simple comes in again. NEC researchers discovered that the slope of the leakage current-vs-temperature curve is closely linked to the actual current at a specific temperature. So by measuring the current at 20°C, the researchers could predict the current-temperature relationship over the whole range of interest accurately, to within 3°C.
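One way such a single-point calibration could work is sketched below. The report does not give the functional form, so everything here is an assumption: the leakage is modeled as exponential in temperature, and the exponential slope is taken to be a linear function of the logarithm of the 20°C reference current, with two hypothetical constants characterized once for the process.

```python
import math

T_REF_C = 20.0  # calibration temperature mentioned in the article

# Hypothetical process constants linking the slope to ln(reference current).
SLOPE_A = 0.05
SLOPE_B = -0.001

def slope_from_reference(i_ref):
    """Assumed relation: slope of ln(I) versus T, derived from the 20 C current."""
    return SLOPE_A + SLOPE_B * math.log(i_ref)

def temperature_from_current(i_meas, i_ref):
    """Invert the assumed model ln(I) = ln(I_ref) + slope * (T - T_REF_C)."""
    slope = slope_from_reference(i_ref)
    return T_REF_C + math.log(i_meas / i_ref) / slope

# Hypothetical sensor whose leakage at 20 C was measured as 0.1 nA.
i_ref = 1e-10
reading = temperature_from_current(i_meas=5e-10, i_ref=i_ref)
```

The point of the sketch is that each sensor needs only one stored number, its own reference current, rather than a multi-point calibration curve.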

The result is a small, 35-by-35-micron temperature sensor patch with inherently digital output, requiring only one bit to be routed off to a thermal sensor controller block. NEC's data indicate that the sensor subsystem has a settling time of about 3.5 ms, fast enough to keep up with temperature shifts from quite fine-grained control during DVFS operation.

As a testbed, NEC has instrumented the multicore processor chip for its SX-9 supercomputer with an array of sensors, and has developed the necessary DVFS controller chip and software to do fine-grained thermal management of the die. By asking the software to shift loads among the CPUs, and by scaling down the voltage and frequency of each CPU core to the lowest possible level for its load, NEC is able to keep the temperatures relatively constant. This corrects not only for local heating due to load differences, but also for heating due to nonuniformity in the system cooling. The result is lower peak temperatures on the processor dice, leading to both higher predicted reliability and, since leakage current is much more than linear in temperature, lower overall leakage power at the system level.
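The per-core policy described here, running each core at the lowest voltage and frequency its load allows and backing off when a sensor reads hot, can be sketched as below. This is not NEC's controller: the operating-point table, the temperature limit, and the function name are hypothetical.

```python
# Hypothetical operating points: (frequency in MHz, supply voltage in V), lowest first.
OPERATING_POINTS = [(800, 0.80), (1600, 0.90), (2400, 1.00), (3200, 1.10)]

def pick_operating_point(required_mhz, temp_c, temp_limit_c=85.0):
    """Choose the lowest point that meets the load, stepping down if the core is hot."""
    # Default to the fastest point if the demand exceeds every table entry.
    index = len(OPERATING_POINTS) - 1
    for i, (freq_mhz, _voltage) in enumerate(OPERATING_POINTS):
        if freq_mhz >= required_mhz:
            index = i
            break
    # If the local sensor reports a temperature above the limit, drop one level.
    if temp_c > temp_limit_c and index > 0:
        index -= 1
    return OPERATING_POINTS[index]

cool_core = pick_operating_point(required_mhz=1500, temp_c=60.0)
hot_core = pick_operating_point(required_mhz=1500, temp_c=90.0)
```

A real controller would also migrate work between cores, as the article notes, so that a throttled core does not become a bottleneck.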


© Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.