NOC, NOC, NOCing on heaven's door: Beyond MPSOCs

A report from the seventh-annual International Symposium on System-on-Chip design.

By Steve Leibson, Tensilica -- EDN, 12/8/2005

In November, the frigid city of Tampere, up near the Arctic Circle in Finland, again played host to the SOC R&D community at the seventh annual International Symposium on System-on-Chip. As always, the conference was organized by Professor Jari Nurmi of Tampere University of Technology (in Finnish: Tampereen teknillinen yliopisto).

Although the three-day conference included papers and presentations covering a broad swath of SOC-related topics, several invited papers underscored the theme of this year's conference: NOCs, or networks on chip. Last year's conference concentrated on MPSOCs (multiple-processor SOCs). Once you accept that complex SOC designs will incorporate large numbers of processors, the next logical step is to think about efficiently connecting all those on-chip processors together; hence this year's focus on NOCs.

Professor Nurmi presented the first invited NOC paper and wryly commented that this was the first time in seven years that he'd been invited to present a paper at his own conference. (Nurmi edited a textbook, Interconnect-Centric Design for Advanced SOC and NOC, which was published in 2004.) In his presentation, Nurmi provided an overview that served as an introduction to and foundation for the other invited NOC papers.

NOC designs and concepts vary widely, so Nurmi first summarized common characteristics:

  1. They are more than a single, shared medium (like a bus). They are truly networks.
  2. They provide point-to-point connections between any two hosts attached to the network either by employing true, point-to-point crossbar switches or through virtual point-to-point connections.
  3. They provide high aggregate bandwidth through parallelism.
  4. They clearly separate communication from computation.
  5. They take a layered approach to communications, such as that used in other macro networking schemes, although they may not employ as many network layers because such complexity is prohibitively expensive in terms of on-chip area.
  6. They have implicit pipelining and provide intermediate storage points for the data as it moves from sender to receiver.

NOC taxonomy

NOCs can be based on either circuit switching or packet switching. In addition, they can be classified by network topology.

Mesh networks arrange on-chip sender and receiver blocks in a regular grid. Each sender or receiver within the mesh is associated with a network switch, and each network switch is connected to its four nearest neighbors (north, south, east, and west). The 2D mesh is an intuitive network topology. It allows any sender to talk to any receiver and, judging from the papers presented at the conference, it seems to have won the popularity contest among academic NOC researchers.

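Variations of the mesh network include the torus and the folded torus. A torus adds wraparound links that connect each switch on a mesh edge to the corresponding switch on the opposite edge. A folded torus has the same connectivity, but it interleaves the switches in the physical layout so that no wraparound link has to span the whole chip. The wraparound links reduce the number of switches a message must traverse (the number of "hops").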
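As a rough illustration of why wraparound links help, the following Python sketch counts minimal hops between switches in a k × k mesh and in a k × k torus. It assumes that switches are addressed by (x, y) coordinates and that messages always take a shortest path. It ignores the folded torus's physical layout, because folding changes wire lengths, not connectivity.

```python
def mesh_hops(src, dst):
    """Minimal hops between (x, y) switches in a 2D mesh: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, k):
    """Minimal hops in a k x k torus, where each dimension can wrap around."""
    total = 0
    for a, b in zip(src, dst):
        d = abs(a - b)
        total += min(d, k - d)  # go the direct way or through the wraparound link
    return total


def average_hops(k, hop_fn):
    """Average hop count over all ordered pairs of distinct switches in a k x k array."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(hop_fn(s, d) for s, d in pairs) / len(pairs)


k = 4
print(mesh_hops((0, 0), (3, 3)))                                    # 6
print(torus_hops((0, 0), (3, 3), k))                                # 2
print(round(average_hops(k, mesh_hops), 2))                         # 2.67
print(round(average_hops(k, lambda s, d: torus_hops(s, d, k)), 2))  # 2.13
```

For a 4 × 4 array, corner-to-corner traffic drops from 6 hops in the mesh to 2 in the torus. The average over all switch pairs falls from about 2.67 hops to about 2.13.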

The ring network is another familiar networking topology. Rings can be unidirectional or bidirectional. However, published research suggests that the ring is the least efficient NOC topology, and no conference papers discussed this topology, outside of Nurmi's overview.

When evaluating NOC performance, engineers look at two key figures of merit: throughput and latency. Throughput is simply the maximum rate at which senders can pump data through the network. However, complex network topologies can make the true throughput of a network difficult to measure. One approach is to measure aggregate throughput, which is the total data rate delivered by all senders combined. Another approach is to measure bisection throughput, which is the rate at which data can cross an imaginary line drawn to divide the network into two equal halves.
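Here is a sketch of the bisection idea for the regular topologies described earlier. The functions below enumerate the links of a k × k mesh or torus and count the links that cross a vertical cut through the middle of the array. Multiplying that count by a per-link data rate gives an idealized bisection throughput. The link rate is a hypothetical placeholder, each bidirectional link is counted once, and k is assumed to be even.

```python
def grid_links(k, wrap=False):
    """Return the set of undirected links in a k x k mesh (or torus if wrap=True)."""
    links = set()
    for x in range(k):
        for y in range(k):
            for dx, dy in ((1, 0), (0, 1)):  # east and north neighbors
                nx, ny = x + dx, y + dy
                if wrap:
                    nx, ny = nx % k, ny % k  # wraparound link to the opposite edge
                elif nx >= k or ny >= k:
                    continue  # mesh edge: no neighbor in this direction
                if (nx, ny) != (x, y):
                    links.add(frozenset({(x, y), (nx, ny)}))
    return links


def bisection_links(k, wrap=False):
    """Count links that cross a vertical cut between columns k/2 - 1 and k/2."""
    half = k // 2
    return sum(
        1
        for link in grid_links(k, wrap)
        if len({node[0] < half for node in link}) == 2  # endpoints on opposite sides
    )


def bisection_throughput(k, link_gbps, wrap=False):
    """Idealized bisection throughput: crossing links times per-link rate."""
    return bisection_links(k, wrap) * link_gbps


print(bisection_throughput(8, 10))             # 80  (8 links)
print(bisection_throughput(8, 10, wrap=True))  # 160 (16 links)
```

With a hypothetical 10-Gbit/s link rate (`link_gbps` is an illustrative parameter), the 8 × 8 mesh has 8 crossing links. The wraparound links of the torus double that count to 16.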

Latency measures the time needed for data to traverse the NOC from sender to receiver. Latency depends on the number of network switches the message must travel through, the amount of buffering in each network switch, and the amount of NOC traffic. Throughput and latency are closely intertwined, and both depend on network loading. From the papers presented, it's clear that network congestion becomes a concern once the NOC messaging load rises above about 35% of network capacity.
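A common first-order model splits zero-load latency into a per-hop term and a serialization term: T0 = H × (t_router + t_link) + ceil(L / W). Here H is the hop count, L is the packet size in bits, and W is the channel width in bits per cycle. This is a textbook simplification, not a model taken from the conference papers. It deliberately leaves out the queueing delay that dominates as load climbs, so it describes only the lightly loaded end of the latency curve. All parameter values below are hypothetical.

```python
def zero_load_latency(hops, router_cycles, link_cycles, packet_bits, channel_bits):
    """Zero-load (contention-free) packet latency in clock cycles.

    hops          -- number of switch-to-switch hops on the path
    router_cycles -- pipeline delay through one switch
    link_cycles   -- wire delay of one inter-switch link
    packet_bits   -- packet size
    channel_bits  -- channel width (bits transferred per cycle)
    """
    head_latency = hops * (router_cycles + link_cycles)  # time for the header to arrive
    flits = -(-packet_bits // channel_bits)               # ceiling division: cycles to stream the packet
    return head_latency + flits


# Hypothetical example: 6 hops, 2-cycle switches, 1-cycle links, 256-bit packet, 32-bit channel.
print(zero_load_latency(6, 2, 1, 256, 32))  # 26
```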

GALS addresses the timing challenge

Nurmi finished with an introduction to a critically important NOC topic: GALS, or globally asynchronous, locally synchronous NOCs. The GALS concept addresses a growing problem in SOC design. As SOCs employ smaller and smaller geometries, as they become exponentially more complex, and as on-chip clock rates continue to climb, it's becoming very difficult to run the entire chip from one synchronous clock. Increasingly exotic clock-tree synthesis attempts to deliver low-skew clocks to all corners of the chip. However, achieving timing closure for such designs is simply becoming so difficult that designers must rethink the entire approach to overall SOC clocking.

GALS is one such new approach to NOC design. It recognizes the difficulty of maintaining near-constant clock skews across a complex SOC and abandons that effort entirely. Complex SOCs are already partitioned into many self-contained blocks, and the GALS approach allows each of those blocks to be internally synchronous. The blocks can even run at different clock rates. Inter-block communication, however, is asynchronous, which eliminates the need for a global, low-skew reference clock.

Several of the conference papers dealt with approaches to designing GALS-based NOCs. In fact, an invited paper by Professor Alain Greiner of LIP6 (Laboratoire d'informatique de Paris 6) dealt explicitly with the concept. Greiner discussed the DSPIN network, a packet-switched NOC with a mesh topology. The unique characteristic of the DSPIN design is that the network switch (called a router in the DSPIN design) that's associated with each sender or receiver block (called a cluster) is embedded and diffused into each sender or receiver block instead of existing as an adjacent but separate NOC block on the SOC.

The DSPIN approach distributes pieces of the NOC router to the four edges of each rectangular cluster. At each edge, there's an input FIFO and an output multiplexer. The input FIFO receives packets from the immediately adjacent cluster on the SOC. It then distributes those packets either to the processing circuits within the cluster or to one of the three output multiplexers on the other edges of the cluster. Similarly, each output multiplexer can accept and transmit packets from one of the three input FIFOs along the other edges of the cluster or from within the cluster.
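The distributed router can be modeled in a few lines. The sketch below assumes dimension-ordered (XY) routing, where north means increasing y. This routing choice keeps the decision local to each input FIFO: the FIFO either delivers a packet to its own cluster or hands the packet to one of the other three edge multiplexers. The routing algorithm and port names are assumptions for illustration. The article does not say how DSPIN picks an output.

```python
EDGES = ("NORTH", "SOUTH", "EAST", "WEST")


def xy_output(here, dest):
    """Choose the output for a packet at cluster `here` bound for `dest`.

    XY routing (assumed): correct the x coordinate first, then y.
    """
    x, y = here
    dest_x, dest_y = dest
    if dest_x > x:
        return "EAST"
    if dest_x < x:
        return "WEST"
    if dest_y > y:
        return "NORTH"
    if dest_y < y:
        return "SOUTH"
    return "LOCAL"  # packet has arrived: deliver to the cluster's processing circuits


def forward(input_edge, here, dest):
    """Route a packet that arrived in the input FIFO on `input_edge`.

    The FIFO can feed the local cluster or the output multiplexer on one of the
    other three edges, but never the multiplexer on its own edge.
    """
    if input_edge not in EDGES:
        raise ValueError(f"unknown edge {input_edge!r}")
    out = xy_output(here, dest)
    if out == input_edge:
        raise ValueError("packet would be sent back toward the cluster it came from")
    return out


print(forward("WEST", (1, 1), (3, 1)))   # EAST: pass straight through the cluster
print(forward("WEST", (3, 1), (3, 3)))   # NORTH: turn from the x to the y dimension
print(forward("SOUTH", (2, 2), (2, 2)))  # LOCAL: deliver inside this cluster
```

A packet entering the west FIFO and leaving through the east multiplexer never touches wiring outside the cluster, except the short point-to-point bundle to the next cluster.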

This arrangement means that almost all of the complex NOC wire routing occurs within the computing cluster. Intercluster wire routing consists of simple point-to-point multiwire bundles that directly connect one output multiplexer to a corresponding input FIFO on an adjacent cluster. NOC wires between clusters are therefore relatively short and have low capacitance. Intracluster wires are longer, but their length is bounded by the size of the cluster.

The GALS nature of the DSPIN design resides entirely in the input FIFOs, which are bisynchronous. The FIFOs have separate input and output clocks, which means that two clusters communicating through one of these FIFOs can run with different clock skews or at different clock rates. In the current DSPIN design, these FIFOs are not synthesizable; they are hand-designed. However, during Q&A, Greiner indicated that it might be possible to develop a synthesizable bisynchronous FIFO.
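Greiner's FIFOs are hand-designed, and the article does not describe their circuits. In bisynchronous FIFOs generally, a widely used technique is to pass the read and write pointers across the clock boundary in Gray code. Consecutive Gray-code values differ in exactly one bit, so if the other clock domain samples a pointer mid-transition, it reads either the old value or the new one, never a corrupted mix. The sketch below shows the encoding and checks that single-bit property, including the wraparound from the last pointer value back to zero. It is not a description of the DSPIN FIFO.

```python
def bin_to_gray(n: int) -> int:
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Gray code back to binary: XOR of all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def single_bit_steps(width: int) -> bool:
    """True if every increment of a `width`-bit pointer, including wraparound,
    changes exactly one bit of its Gray-coded form."""
    size = 1 << width
    for i in range(size):
        diff = bin_to_gray(i) ^ bin_to_gray((i + 1) % size)
        if bin(diff).count("1") != 1:
            return False
    return True


print([format(bin_to_gray(i), "03b") for i in range(8)])
print(single_bit_steps(4))  # True
```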

Best-effort versus guaranteed service

The DSPIN paper also introduced another NOC design problem: Service requirements vary for different types of network communications. Mixing guaranteed-service and best-effort traffic on the same NOC causes no end of scheduling headaches and can result in resource deadlocks. The DSPIN NOC solves this problem by implementing parallel but separate best-effort and guaranteed-service networks on the same SOC.
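The benefit of separate networks shows up even in a toy model. The sketch below is an illustration, not DSPIN's actual arbitration. It assumes that all packets are queued at cycle 0 and that each network forwards one packet per cycle. On a shared channel, a guaranteed-service packet that arrives behind a burst of best-effort traffic must wait for all of that traffic to drain. On its own network, the guaranteed-service packet leaves immediately. The packet names and the "BE"/"GS" class labels are hypothetical.

```python
def departure_cycles(arrivals, separate_networks):
    """Return {packet_id: cycle at which the packet leaves its queue}.

    arrivals -- list of (packet_id, traffic_class) in arrival order,
                all present at cycle 0
    Each network drains its FIFO at one packet per cycle.
    """
    queues = {}
    for packet_id, traffic_class in arrivals:
        # With separate networks, each traffic class gets its own FIFO and channel.
        key = traffic_class if separate_networks else "shared"
        queues.setdefault(key, []).append(packet_id)
    departures = {}
    for queue in queues.values():
        for cycle, packet_id in enumerate(queue):
            departures[packet_id] = cycle
    return departures


burst = [("be0", "BE"), ("be1", "BE"), ("be2", "BE"), ("gs0", "GS")]
print(departure_cycles(burst, separate_networks=False)["gs0"])  # 3
print(departure_cycles(burst, separate_networks=True)["gs0"])   # 0
```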

Professor Steve Furber of the University of Manchester presented an entirely different approach to NOC design. Furber quickly reviewed the evolution of on-chip communications for complex SOCs from single buses to a hierarchy of buses to NOCs. Although most of today's SOC designs employ synchronous, multiple-bus communications architectures, there's no escaping the following conclusion laid down by the ITRS (International Technology Roadmap for Semiconductors): "As it becomes impossible to move signals across a large die within one clock cycle or in a power-effective manner, or to run control and dataflow processes at the same rate [best effort versus guaranteed service], the likely result is a shift to [an] asynchronous (or, globally asynchronous and locally synchronous (GALS)) design style." This statement melds two problems: maintaining a constant clock skew across a large chip and efficiently conveying best-effort and guaranteed-service traffic among many blocks in a complex SOC.

Furber's group at the University of Manchester has been developing SOCs using asynchronous-circuit chip design for a decade. John Bainbridge earned his PhD by developing the asynchronous Marble bus during one of these development projects. The Marble bus taught Bainbridge everything he needed to know about how not to design asynchronous on-chip communications. From this education, he produced a second-generation effort called CHAIN, which is a self-timed, fully asynchronous GALS interconnect that's now being commercialized by British startup Silistix Limited.

CHAIN and mesh

Significantly, CHAIN interconnect can be used either as an asynchronous, point-to-point connection between blocks or woven into a GALS NOC that shares network wiring when system bandwidth and latency requirements permit. CHAIN employs a four-wire, one-hot transmission protocol, and a CHAIN link uses only six wires while achieving transmission speeds of 166 MHz/wire when fabricated in 180-nm ASIC technology. Multiple CHAIN bundles can be ganged in parallel if more bandwidth is needed, and a simple six-gate asynchronous repeater permits long-distance CHAIN connections.
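The article does not spell out the encoding details. A four-wire, one-hot code usually means a 1-of-4 code, in which each group of two data bits raises exactly one of four wires. The sketch below assumes that reading. It also assumes return-to-zero signaling, with an all-low spacer between symbols, and that CHAIN's remaining two wires carry end-of-packet and acknowledge signals. The sketch does not model those two extra wires.

```python
SPACER = (0, 0, 0, 0)  # all wires low between symbols (assumed return-to-zero signaling)


def encode_1of4(data: bytes):
    """Encode bytes as 1-of-4 symbols, least-significant bit pair first.

    Each 2-bit value v (0-3) raises only wire v; a spacer follows every symbol.
    Frames are 4-tuples of wire states.
    """
    frames = []
    for byte in data:
        for shift in (0, 2, 4, 6):
            value = (byte >> shift) & 0b11
            frames.append(tuple(1 if wire == value else 0 for wire in range(4)))
            frames.append(SPACER)
    return frames


def decode_1of4(frames):
    """Decode 4-tuple frames back to bytes, rejecting any multi-hot symbol."""
    symbols = [f for f in frames if f != SPACER]
    if any(sum(f) != 1 for f in symbols):
        raise ValueError("invalid symbol: more than one wire high")
    if len(symbols) % 4:
        raise ValueError("incomplete byte: need four symbols per byte")
    out = bytearray()
    for i in range(0, len(symbols), 4):
        byte = 0
        for j, symbol in enumerate(symbols[i:i + 4]):
            byte |= symbol.index(1) << (2 * j)  # the high wire's index is the 2-bit value
        out.append(byte)
    return bytes(out)


print(encode_1of4(b"\x1b")[:2])           # [(0, 0, 0, 1), (0, 0, 0, 0)]
print(decode_1of4(encode_1of4(b"NOC")))   # b'NOC'
```

Because every symbol raises exactly one data wire, the receiver can detect arrival without a clock: a symbol is complete as soon as any one wire goes high.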

Custom adapters are required to attach each synchronous block on the SOC to the asynchronous CHAIN network, and these adapters currently must be designed by hand. In addition, the CHAIN interconnect doesn't employ a mesh, ring, or any other fixed topology. Instead, it has a fluid, amorphous, application-specific topology, which is to say it has no explicit topology at all. An EDA tool from Silistix custom-tailors CHAIN networks for each SOC design based on system requirements.

The capstone to the NOC discussion at the Tampere conference has to have been the paper presented by Se-Joong Lee of KAIST (Korea Advanced Institute of Science and Technology). Lee's written paper discussed three generations of NOC research at KAIST on a NOC called BONE (Basic On-chip NEtwork), but his oral presentation was a well-considered plea for simplicity in NOC design.

With quite a bit of injected humor, Lee emphatically stated that the goal of on-chip network research should not be implementing state-of-the-art networks on silicon. The processing units or clusters on the chip are the network users and perform the actual work of the SOC. The on-chip network should only be sufficiently complex to serve the needs of these processing units. Otherwise, the increasing complexity of the on-chip network will devour the available silicon area, leaving less for the processing elements. For this reason, Lee directed the audience to keep on-chip networks as simple as possible, which means forgetting conventions developed for legacy macro networks such as Ethernet and the Internet.

Three NOC temptations

Lee then pressed on, discussing what he calls the three temptations that lead on-chip network researchers astray.

The first such temptation is the siren song of the mesh network. With their elegant, regular architectures and good scalability, mesh networks have become the darlings of academic NOC research. However, Lee argues, mesh networks aren't that attractive under close scrutiny. They are inefficient for global (broadcast) transactions. When actually laid out on silicon, they are not nearly as regular and orderly as they appear in PowerPoint presentations, which complicates physical layout. And they require far too many network switches (each NOC switch can often be as large as the 32-bit RISC processor it serves). Instead, says Lee, two-level star or tree networks can provide equivalent bandwidth and better latency while consuming far less silicon and much less energy.
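A back-of-the-envelope count shows where Lee's switch-count argument comes from. The sketch compares an n-node mesh, which has one switch per node, with a two-level star. In the star, groups of `cluster_size` nodes share a local switch, and all local switches connect to one central switch. The sketch counts switches and the worst-case number of switches a message passes through. It ignores the fact that star switches need more ports than mesh switches, so it is a rough comparison, not an area estimate. The grouping parameter is a hypothetical choice.

```python
import math


def mesh_cost(n):
    """Switch count and worst-case switches traversed for a square n-node mesh."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("mesh size must be a perfect square")
    # Corner to corner: 2(k - 1) hops, passing through 2(k - 1) + 1 switches.
    return {"switches": n, "worst_case_switches": 2 * (k - 1) + 1}


def two_level_star_cost(n, cluster_size):
    """Same figures for a two-level star: local switches plus one central switch."""
    if n % cluster_size:
        raise ValueError("n must be a multiple of cluster_size")
    clusters = n // cluster_size
    if clusters == 1:
        return {"switches": 1, "worst_case_switches": 1}
    # Worst case: source local switch -> central switch -> destination local switch.
    return {"switches": clusters + 1, "worst_case_switches": 3}


print(mesh_cost(16))               # {'switches': 16, 'worst_case_switches': 7}
print(two_level_star_cost(16, 4))  # {'switches': 5, 'worst_case_switches': 3}
```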

Mesh networks also provide redundant paths, which leads to the second temptation: adaptive routing. Again, at first blush, adaptive routing looks promising for NOCs. It allows intelligent message routing that can bypass congested communication hot spots and defective switch nodes or broken links. However, adaptive routing introduces a packet-ordering problem. If each packet can take a different path, with different latencies, then packets can arrive out of order. Consequently, each receiver must have a buffer for reordering these packets. Lee's recurring message, adapted from one of President George W. Bush's favorite phrases, is this: Network buffers are the "Axis of the Devil." They needlessly consume silicon and power, and they increase network latency.
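A minimal receiver-side reorder buffer shows why reordering costs storage. The sketch assumes that the sender assigns each packet a sequence number; the article does not describe a packet format, so this is an assumption. The buffer releases packets only in order and holds early arrivals until the gaps fill. The `peak` attribute tracks the largest number of packets held at once, which is the buffer space Lee objects to.

```python
class ReorderBuffer:
    """Receiver-side buffer that releases packets strictly in sequence order."""

    def __init__(self):
        self.next_seq = 0   # sequence number the receiver expects next
        self.pending = {}   # early arrivals: seq -> payload
        self.peak = 0       # most packets ever held waiting for a gap to fill

    def receive(self, seq, payload):
        """Accept one packet; return the list of payloads now deliverable in order."""
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        self.peak = max(self.peak, len(self.pending))
        return released


rob = ReorderBuffer()
for seq in (3, 2, 1, 0):  # worst case: packets arrive in reverse order
    print(seq, rob.receive(seq, f"p{seq}"))
print("peak occupancy:", rob.peak)  # 3
```

With in-order delivery, which is the case with a single fixed path, `peak` stays at zero and the buffer would never be needed.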

Instead, Lee says, NOCs that employ the recommended star topology need not and cannot support adaptive routing because there is only one path through the network between each sender and receiver. Because the routes are pre-determined, no time is required to search for an "optimal" route for each packet. There is only one route possible. Lee immediately dealt with an obvious objection to this recommendation. The objection: "Without adaptive routing, there's no fault tolerance and no way to circumvent hot spots." Lee's response: "If there's a fault, the chip is broken. If there's a hot spot, then that hot spot determines the maximum available network bandwidth, so design your system accordingly and don't spend silicon or power needlessly."
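With a single path for each sender-receiver pair, route computation reduces to arithmetic on addresses. In the sketch below, the node numbering, the contiguous assignment of nodes to clusters, and the switch names are all hypothetical. The point is that the function has exactly one answer and performs no search.

```python
def star_route(src, dst, cluster_size):
    """Return the only switch path from node src to node dst in a two-level star.

    Nodes are numbered 0, 1, 2, ...; nodes c * cluster_size through
    (c + 1) * cluster_size - 1 share local switch c (hypothetical numbering).
    """
    src_cluster = src // cluster_size
    dst_cluster = dst // cluster_size
    if src_cluster == dst_cluster:
        return [f"local{src_cluster}"]  # same cluster: one switch, no central hop
    return [f"local{src_cluster}", "central", f"local{dst_cluster}"]


print(star_route(1, 2, 4))  # ['local0']
print(star_route(1, 9, 4))  # ['local0', 'central', 'local2']
```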

Lee's third temptation is packet switching, which permits multiplexing of multiple messages onto one segmented communications channel. However, channel segmentation adds packet buffers to the network and, as Lee reiterated, packet buffers are the "Axis of the Devil." Instead, Lee advocates circuit-switched NOCs.
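The buffer argument behind circuit switching can be sketched as a link-reservation table. A source first reserves every link on its path. If any link is already taken, setup fails and the source retries later, so nothing waits in switch buffers. Once the circuit is up, data streams through without per-packet routing or storage. The link names and the retry policy are assumptions for illustration, not a description of any specific NOC.

```python
class CircuitTable:
    """Tracks which circuit owns each link in a circuit-switched NOC (sketch)."""

    def __init__(self):
        self.owner = {}  # link -> circuit id

    def setup(self, circuit_id, path_links):
        """Reserve every link on the path, or nothing at all."""
        if any(link in self.owner for link in path_links):
            return False  # blocked: the source retries later; no data is buffered in the network
        for link in path_links:
            self.owner[link] = circuit_id
        return True

    def teardown(self, circuit_id):
        """Release all links held by the circuit."""
        self.owner = {link: cid for link, cid in self.owner.items() if cid != circuit_id}


noc = CircuitTable()
print(noc.setup("A", [("local0", "central"), ("central", "local1")]))  # True
print(noc.setup("B", [("local2", "central"), ("central", "local1")]))  # False: link busy
noc.teardown("A")
print(noc.setup("B", [("local2", "central"), ("central", "local1")]))  # True
```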

Several of the other NOC presenters concurred with Lee's conclusion, which leaves only one question: If this year's NOCs followed last year's MPSOCs, then what follows NOCs next year? Perhaps this question will be answered next November, somewhere near the Arctic Circle.


