Part 2: OC-48, OC-192, and beyond
Figuring out how to partition functionality at OC-48 and OC-192 speeds is only part of the battle of building a robust design that can later scale to faster speeds and increased services. At some point, however, all those high-speed ports have to come together, and that only compounds your problems.
Nicolas Cravotta, Technical Editor -- EDN, 11/9/2000
The next consideration is choosing the fabric. Interfacing various ports with each other is the domain of the switch fabric. Crossbar, or cut-through, architectures solve the problem of connecting all the channels through a fabric of OR-wired switches. A network processor first processes traffic, which then passes to the fabric over either a proprietary interface or a standard interface, such as CSIX (Common Switch Interface; see sidebar "CSIX: a least common denominator"). For example, a 16×16 switch fabric has the 16 ingress ports (x-axis) crossed with the 16 egress ports (y-axis) like threads in cloth. A clear channel exists between each ingress/egress combination. The downside is that you need a traffic manager or an arbiter. After all, some packets are more equal than others. At high speeds, switches require large buffers and sophisticated algorithms to handle QoS (quality of service).
Another important consideration is how large packets pass through a fabric. For efficiency, a fabric may pass cells; that is, fixed-size units. Such fabric can still support packets, but you have to include SAR (segmentation-and-reassembly) functions to break large packets into fixed-sized chunks to feed into the fabric. Some fabrics reassemble the packets, and some don't. If the fabric doesn't handle this job, then the problem falls to the possibly already-overburdened network processor.
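As a rough illustration of the SAR function described above, the following Python sketch breaks a packet into fixed-size cells and stitches it back together. The 48-byte payload size, the `Cell` fields, and the function names are assumptions chosen for the example; they do not describe any particular fabric or network processor.

```python
from dataclasses import dataclass

CELL_PAYLOAD = 48  # bytes of payload per cell; an assumed value for illustration


@dataclass
class Cell:
    packet_id: int   # which packet this cell belongs to
    seq: int         # position of the cell within the packet
    last: bool       # True for the final cell of the packet
    payload: bytes   # fixed-size payload, zero-padded in the final cell
    length: int      # number of valid payload bytes


def segment(packet_id: int, data: bytes) -> list[Cell]:
    """Break a variable-length packet into fixed-size cells."""
    chunks = [data[i:i + CELL_PAYLOAD]
              for i in range(0, len(data), CELL_PAYLOAD)] or [b""]
    return [
        Cell(packet_id, seq, seq == len(chunks) - 1,
             chunk.ljust(CELL_PAYLOAD, b"\x00"), len(chunk))
        for seq, chunk in enumerate(chunks)
    ]


def reassemble(cells: list[Cell]) -> bytes:
    """Stitch the cells of one packet back together in sequence order."""
    ordered = sorted(cells, key=lambda c: c.seq)
    if not ordered or not ordered[-1].last or len(ordered) != ordered[-1].seq + 1:
        raise ValueError("packet incomplete: keep buffering")
    return b"".join(c.payload[:c.length] for c in ordered)
```

The `ValueError` path is where buffering comes in: whichever device owns reassembly has to hold every cell of a packet until the last one arrives.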
For the most efficient throughput on the fabric side, it makes sense to handle the SAR function elsewhere and spoon-feed cells to the fabric. Such fabrics are known as dumb fabrics. Decisions, such as what traffic to send, managing back pressure (telling switches back up the line to back off traffic), and so on are made before cells reach the fabric, so the fabric can act with maximum efficiency of throughput. However, no matter where you put the SAR function, you need sufficient buffering to store cells as they arrive so that you can stitch the entire packet back together. If the fabric supports only FIFO queuing, then the network processor has to take over the SAR function.
You can describe a fabric as simply a high-speed pipe driven by a scheduling algorithm that tries to keep pipes both full and available. Fabrics, however, have various levels of intelligence (that is, the amount of the scheduling algorithm they support), and you need to match the fabric and the network processor. When you begin to couple all the functions, you can run into a breakdown or degradation of throughput. If the fabric can't handle QoS or queuing, then the network processor has to. In general, with OC-3 and OC-12, traffic management is part of the fabric's function; with OC-48 and OC-192, traffic management is part of the network processor's job.
The market is split on how intelligent fabrics should be (see sidebar "The mythical petabit"). The issue is granularity (see sidebar "The importance of being granular"). The more granularity you need—that is, how far into a packet you need to look—the larger the queues you need and the more decisions you have to make. For example, an OC-192 flow may comprise four OC-48 streams, which have been previously multiplexed together. Each of these OC-48 streams comprises a variety of OC-3 and OC-12 streams, which comprise still smaller streams. To service a packet, you would need to drop the smaller stream out of the larger stream (a potentially challenging prospect) and then determine the actual format of the data. Such processing reduces the efficiency of the fabric. Thus, the fabric may have a companion chip that handles traffic shaping and manages the myriad flow buffers. Think of the companion chip as "chewing" the data before spoon-feeding it to the fabric. This procedure allows the fabric to focus completely on moving data as quickly and efficiently as possible.
Here's an analogy. A large truck can move a lot of boxes from one place to another. However, because it is difficult to remove the boxes that are piled far from the door of the truck, making several stops is probably an inefficient strategy. It might make more sense for the truck to stop at a central depot, where lots of little trucks make the deliveries. However, you can still use the big truck if you've packed it intelligently beforehand, with the boxes you'll need to remove first packed closest to the door.
What about building your own fabric? These days, no one designs his or her own framer. And, given the complexity of doing so, soon no one will design his or her own network processor and switch fabric. Designing your own ASIC not only adds time to the overall design process but also requires extra engineers at a time when companies are having trouble finding good people. And in more than a few acquisitions, the IP (intellectual property) purchased was far less valuable than the engineers gained.
Interestingly enough, picking your fabric first may be a more limiting choice than picking your network processor first. So how do you pick one fabric over another? The first step is to determine the kind of application that the fabric was designed for. For example, at the core, raw performance is more important than either price or power. At the edge, however, you want more functions in less space. Each application requires you to focus differently on the internal architecture of the fabric.
Can today's fabrics handle OC-768? No one knows when OC-768 will hit. If such capacity is built into a fabric, it will probably be simply as an insurance policy, rather than a response to demand. Will high-speed fabrics go optical? Again, it depends on the application. Simple routing at the core, where little processing is needed, makes room for optical switches. For intelligent routing, however, storing data and performing deep-packet processing are difficult in the optical domain, so such processing will continue to take place in the electrical domain.
Buffering or bust
Buffering plays an important role in switches and routers. Buffering data can solve many types of problems. For example, buffering smoothes speed mismatches between ports. You don't want the faster port to slow to the speed of the slower port, so you route the faster traffic into a buffer that feeds the slow port so the faster port can move on. The biggest buffering issue, however, comes from port contention, which occurs when more than one ingress port wants to send data out the same egress port. Given that all three ports support the same data rate, only one ingress port can pass its traffic through the egress port. The traffic from the other ingress port has to wait for its turn in a buffer. At some level, a traffic manager determines which port has the higher priority traffic and lets that traffic through first. Contention can lead to several undesirable and potentially serious conditions, such as head-of-line blocking. Head-of-line blocking occurs when low-priority traffic in one port prevents high-priority traffic behind it from passing through (Figure 1).
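A toy Python model can make the blocking effect concrete. It simplifies the scenario above: instead of priorities, it shows a head cell that is waiting on a busy egress port holding up a cell behind it that is bound for an idle port. The single FIFO per ingress port, the fixed lowest-port-first arbitration, and the function name are all assumptions made for the sketch.

```python
from collections import deque


def fifo_cycle(ingress_queues):
    """Run one switch cycle with a single FIFO per ingress port.

    Each queue holds the egress-port number of every waiting cell.
    Only the head of each queue may be served, and each egress port
    accepts one cell per cycle. Lower-numbered ingress ports win ties
    (an arbitrary choice for this sketch).
    Returns the (ingress, egress) transfers made this cycle.
    """
    claimed = set()
    transfers = []
    for ingress, queue in enumerate(ingress_queues):
        if queue and queue[0] not in claimed:
            egress = queue.popleft()
            claimed.add(egress)
            transfers.append((ingress, egress))
    return transfers


# Both ingress ports have a head cell for egress 0.
queues = [deque([0, 1]), deque([0, 2])]
print(fifo_cycle(queues))  # [(0, 0)]: ingress 1 is stuck although egress 2 is idle
print(fifo_cycle(queues))  # [(0, 1), (1, 0)]
print(fifo_cycle(queues))  # [(1, 2)]
```

The cell for egress 2 needs three cycles to leave even though its port was free the whole time.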
Some vendors claim to have nonblocking products. When two incoming cells vie for the same egress port, one passes through while the other waits its turn in a buffer. Nonblocking, however, is a theoretical quality. Vendors typically use Bernoulli traffic, which has no statistical correlation, to show that blocking does not occur. With such traffic, a port rarely sees several collisions in a row; idle cycles, in which no new cell targets the port, usually arrive first and let the buffer drain the cells that are waiting their turn. However, real-world traffic does have correlation, so port contention occurs at frequencies that the uncorrelated model does not predict. The challenge, then, is to determine the buffer size required to prevent dropped cells (see sidebar "The aggregation/smoothing paradox").
One way to reduce the need for (and thus the size of) the ingress buffers used to relieve head-of-line blocking is to increase the speed of the fabric. A double-speed fabric could take two contending cells in and pass them to the appropriate output. Of course, you would then need output buffers; even though you can pass on two cells, the egress port can handle only one cell at a time. The extra cell passes through on the next round, with the next incoming cell waiting in the buffer. Thus, the output buffer holds one cell until no cell targets the egress port. If another contention occurs first, however, the output buffer stores two cells.
One way to organize ingress buffers is by using virtual output queues. Each ingress port has a queue for each egress port. In hierarchical queuing, there may be several levels of service, so a virtual queue exists for each port at each level of service (number of virtual queues = ports × levels of service). The fabric then mediates the passage of traffic. It polls all queues to determine the queue with the highest priority traffic waiting for a particular egress port. Likewise, you can improve QoS on the output side by employing virtual output queues (Figure 2).
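The queue arrangement described above can be sketched in a few lines of Python. The class name, the choice of level 0 as the highest priority, and the strict-priority polling are assumptions for illustration; real traffic managers use more elaborate scheduling.

```python
from collections import deque


class VirtualOutputQueues:
    """Queues for one ingress port: one per (egress port, service level)."""

    def __init__(self, num_egress: int, num_levels: int):
        self.num_levels = num_levels
        # Number of virtual queues = ports x levels of service.
        self.queues = {
            (egress, level): deque()
            for egress in range(num_egress)
            for level in range(num_levels)
        }

    def enqueue(self, egress: int, level: int, cell) -> None:
        self.queues[(egress, level)].append(cell)

    def dequeue_for(self, egress: int):
        """Return the highest-priority cell waiting for one egress port.

        Level 0 is treated as the highest priority (an assumption).
        Returns None when nothing is waiting for that port.
        """
        for level in range(self.num_levels):
            queue = self.queues[(egress, level)]
            if queue:
                return queue.popleft()
        return None
```

Because each egress port has its own queues, a cell bound for a busy port never sits in front of a cell bound for an idle one.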
Some switches come with locked-in buffer sizes. If you understand the dynamics of traffic for your application, however, a switch with flexible buffers will let you use that knowledge to more efficiently handle traffic and give your product a key point of differentiation.
Buffering potentially adds latency to traffic, but it can also reduce dropped traffic. Keep in mind that a system can have several hundred megabytes of buffering. You should take each layer into account as you design your overall architecture; no buffer is an island. For example, some chips have only so much on-chip buffering and do not support external buffering, which is expensive and slow, so they count on other buffers within the system to prevent traffic from causing overruns. It's important to model the behavior of such chips within your prospective architecture to make sure the rest of the system can accommodate these buffers' emergencies. In some cases, you can take a holistic approach and group memory buffers using a shared bus to reduce total memory usage and save board space without losing performance.
Another challenge is memory access. Traffic-management architectures that pass data among chips or blocks are often based on a shared-memory scheme. Before the emergence of crossbar technology, fabrics used shared memory to move data from one channel to another. The problem is that memory takes up a lot of board space, and, at some point, you can't achieve wire speed with all the stores and forwards to memory. As you increase bandwidth, you increase memory-bus width and frequency. Today, the memory bottleneck is 10 to 15 Gbps. This bottleneck directly affects the scalability of a shared-memory fabric.
As designs achieve 10-Gbps rates, memory will have to support an effective aggregate access rate of 20 Gbps (store and load); this requirement means that the actual access bandwidth will need to be much more than 20 Gbps to avoid collisions and to accommodate overhead. Dropping packets should be a last resort, so a robust congestion mechanism is necessary. Unfortunately, memory, such as SRAM, that is fast enough to serve this need is generally expensive and power-hungry, and it requires a lot of board space. DRAM has inherent latencies that translate into degraded performance and brings with it the issue of bus contention. For example, with 10 Gbps coming in and 10 Gbps going out, you need 20 Gbps of bandwidth, but with congestion degradation, you might need 30 to 40 Gbps to avoid contention problems.
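The bandwidth arithmetic above can be written down as a small Python helper. The `headroom` multiplier is an assumption introduced here to stand for collisions, overhead, and congestion degradation; values of 1.5 to 2.0 reproduce the 30- to 40-Gbps range quoted in the text.

```python
def required_memory_bandwidth_gbps(line_rate_gbps: float, headroom: float = 1.0) -> float:
    """Estimate shared-memory bandwidth for a store-and-forward design.

    Every bit is written once and read once, so the floor is twice the
    line rate. `headroom` is a hypothetical multiplier for contention
    and overhead.
    """
    return 2 * line_rate_gbps * headroom


print(required_memory_bandwidth_gbps(10))        # 20.0 (store plus load)
print(required_memory_bandwidth_gbps(10, 1.5))   # 30.0
print(required_memory_bandwidth_gbps(10, 2.0))   # 40.0
```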
For integrated chips requiring little memory, embedded SRAM can do the trick. However, an architecture that requires a significant amount of memory means going to external (and expensive) memory. Shared memory is still a viable strategy at OC-192 rates, but it is probably a breaking point for OC-768 (see sidebar "The bleak future of OC-768"). Again, one reason that today's fabrics can probably serve OC-768 designs is that the fabric itself doesn't buffer traffic, and, thus, it doesn't need a lot of memory.
Given the complexity of building an entire system, several vendors have already emerged with modules that "solve" parts of the OC-48/OC-192 problem. The fact that engineers building modules tend to partition functions differently from engineers building boards or even entire boxes further segments the overall market.
Modularity is also a method of achieving scalability. Your system can grow if you design it in pieces that customers can bolt together to increase capacity. Modularity also enables best-of-class designs. For example, several companies are focusing on building a piece of the OC-192 puzzle, such as an optics-and-modulation module, that you could bolt into a system. You get the advantage of concentrated expertise, faster time to market, high density, and "free" upgrades every time the vendor releases a new version of the module. Modules also bring flexibility. For example, a quad OC-48 module could provision an OC-192 box, quad OC-48, or even a Gigabit Ethernet design.
Some module vendors claim to have plug-and-play modules, although it is important to question the meaning of "plug and play" when you have a 300-pin connector with each pin running at 2.5 Gbps, complex high-speed signals, and lots of software and drivers to port. Plug and play also implies, to some degree, an ability to plug any of a number of options into the same slot. Thus, you may at some point have the option of replacing a single-channel module with a dual-channel or multichannel module. If you design your single-channel board with this option in mind, you'll be able to move to a dual-channel design as painlessly as possible. Additionally, modules are already pulling the SERDES (serializer/deserializer), clock-multiplier, clock-recovery, and optical-monitoring (for enclosure management) functions on-board. Is pulling the framer on-board a line that a module won't cross, or will you at some point be able to move this function off the board and onto the module, thereby simplifying your design or allowing for higher channel density? Make sure that the connector technology has room to grow. For example, you should be able to drive lines faster in the future, or your design should have plenty of reserved pins. You can fit only so much on a module if you limit the I/O plane.
In any case, modules need to give you access to both the data and characteristics, such as incoming signal integrity. For example, if you receive consistently bad frames, you may need to switch from the service line to the protection or redundant line. If the module doesn't tell you the quality of signals you're getting, you may be unable to properly monitor the line and avoid trouble before it happens. Additionally, if you do switch to the protection line, you'll need to be able to change the overhead signaling in the egress overhead bytes to reflect the change.
Although you are trying to get a product out the door today, your goal is to release other similar products over time. Design reuse can reduce time to market for subsequent products, but only if you design reuse into your system and don't just leave it as theory. For example, you can structure your design to push most design changes to either hardware or software. However, given that software has become the most time-consuming aspect of design, it may pay to develop your system to focus on software reuse with changes reflected in hardware. However, if you need to keep products on the market up to date, you have to change the software portion of your design. Modules may provide a hybrid approach, letting you reuse most system hardware and software by replacing only a portion of hardware with new software and enhanced functions (see sidebar "A system-level perspective").
Buying modules offers the advantage that someone else has to struggle with the latest bleeding-edge ICs and power issues while you get to stay on top of the performance curve. Of course, you pay for this privilege. Certainly, costs can drop because of a vendor's aggregate volumes across several markets, but to truly pull off high performance, modules need to target specific markets, which limits the markets in which you can sell the modules. One way vendors keep modules flexible for multiple markets is by designing them in a modular sense, spinning off variants. For example, one application might require a variant with better airflow and a jitter filter.
Supplying power to a module also takes some serious thinking. Noisy power and ground lines can interfere with high-speed signals on the module. When thinking about a module, consider the type of power supply the vendor has specified, as well as how much ripple the specification allows. In the lab, everyone has clean power, but the real world gives you anything but. Ways to protect the power and ground planes from noise include shielding power supplies throughout the system, using PWM, decoupling the various components of the system from each other, providing clean references, and making sure the various power supplies are quiet.
If you believe the switch-fabric companies, moving a switch fabric from OC-48 to OC-192 is relatively straightforward if not easy. Switch fabrics need to run at overspeed rates to avoid blocking issues, and cascading fabrics to increase throughput is a problem companies are already trying to solve. Additionally, fabrics can expand in a third dimension, the z-axis, to accommodate additional traffic and avoid pad-limited situations in which a single-chip package simply can't supply enough pins to cross-connect all the desired ports. Instead of trying to cram all the interface lines and data from a port onto a single chip, you can break the stream into parallel pieces across several chips. Thus, several fabrics may be routing pieces of the same traffic between the same ports; once the fabric makes a connection between an ingress port and an egress port, it makes the same connection across all z-layers of the fabric. You can also use the z-plane to overspeed channels. For example, crossbar switches work with fixed cell sizes to reduce overhead and simplify traffic management. Passing data of variable size, however, results in underusage of cells. Overspeeding, that is, running the fabric faster than the data it carries, makes up for underusage and other fabric overhead. By adding a level to the z-plane, the fabric more quickly passes more data. For example, four OC-48 fabrics could create an OC-192 fabric.
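Two ideas from the paragraph above lend themselves to a short Python sketch: how much overspeed fixed-size cells demand, and how a cell might be split across z-plane fabrics. The 64-byte payload, 8-byte header, and byte-wise striping are assumptions made for illustration; actual fabrics choose their own cell formats and slicing schemes.

```python
import math


def overspeed_factor(packet_bytes: int, cell_payload: int, cell_header: int) -> float:
    """Fabric bytes sent per byte of packet data when packets ride in fixed cells."""
    cells = math.ceil(packet_bytes / cell_payload)
    return cells * (cell_payload + cell_header) / packet_bytes


def stripe(cell: bytes, planes: int) -> list[bytes]:
    """Split one cell byte-wise across parallel z-plane fabrics."""
    return [cell[i::planes] for i in range(planes)]


def unstripe(slices: list[bytes]) -> bytes:
    """Interleave the per-plane slices back into the original cell."""
    out = bytearray(sum(len(s) for s in slices))
    for i, piece in enumerate(slices):
        out[i::len(slices)] = piece
    return bytes(out)


# A packet that exactly fills one cell pays only for the header.
print(overspeed_factor(64, 64, 8))   # 1.125
# One extra byte forces a second, almost empty cell.
print(overspeed_factor(65, 64, 8) > 2)   # True
```

The second case is the underusage that overspeed has to absorb. With `stripe(cell, 4)`, each of four planes carries a quarter of every cell between the same pair of ports, which mirrors the idea of four OC-48 fabrics acting as one OC-192 fabric.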
Migration from OC-48 to OC-192 becomes difficult at the processing blocks, such as the classification and traffic-management blocks. Consider the concept of a breaking point. At a breaking point, a task becomes too difficult to execute. Flow decisions at OC-48 rates, for example, are possible but extremely difficult even given today's powerful network processors.
Also, working, off-the-shelf components for OC-192 are limited, and the OC-48 acquisition craze has already spilled over into OC-192. Several large companies are building up the pieces you'll need to design a system. Chances are, you're going to at some point use ASICs, unless you're clever enough to get an FPGA running with enough speed or with enough parallelism. Additionally, little public information is available about upcoming OC-192 parts. The only way you'll even know a part exists, never mind see the spec, is if you develop a strong enough relationship with the appropriate vendor. Partnerships are unavoidable if you want to stay on the bleeding edge, as are NDAs (nondisclosure agreements).
Every networking application is different. And with OC-48 just starting to blossom in access devices, the industry is just discovering all of the ways it can take advantage of this speed. Some applications focus on performance; others focus on service. But even the high-performance applications will soon want more services. Designing a box that can support new services in the future is critical to product survival. And, as DWDM (dense-wavelength-division multiplexing) continues to pack in more channels, the need to lower power becomes more critical. Head room and flexibility are expensive, but they are also insurance.
The standards are still evolving. That evolution is par for the course for bleeding-edge applications. And there are usually more than a few standards to choose from, most of them proprietary (despite vendors' claims of "open" standards). Picking a standard that operates among multiple vendors, never mind picking the winner, is hard enough. And, with the vendors themselves assembling "complete solutions," you must dig deeply to determine whether their mixed collection of ICs provides the right pieces to give your application the performance edge it needs (see sidebar "The obvious and the not so obvious" in Part 1 of this article). Also, watch for variation in the parts that you can use to add variation to your products and expand into different markets. Above all, keep an eye on the reality of a vendor's scalability road map. If your vendor can't grow, then neither can you. Monitor the health of any standards you decide to go with. Also, keep track of the standards that your competitors use. If enough designs support the standard (that is, engineers like you are willing to pay for it), then chip vendors will probably support it (that is, chip vendors will be willing to sell it to you).
Unless you've got the sales volume of a company such as Cisco Systems, you probably don't want to be the only one counting on that standard to be around next year.
| CSIX: a least common denominator
Two years ago, no standard existed for switch fabrics. Then came CSIX (Common Switch Interface) to define the interface between fabric and network processors. Now, the CSIX spec can run as fast as OC-48 with a 32-bit bus running at 100 MHz. The CSIX Forum is currently working on the OC-192 spec with a proposed 64-bit bus running at 250 MHz (the extra megahertz are for overhead), targeted for next year. (Note that the 250-MHz interface may also operate at 200 MHz and allow lower speed components to connect, but this option is currently not fully compliant.) Quadrupling the speed of the bus, by just doubling the width and speed, is not as straightforward as it sounds. The 64-bit bus is not a done deal; some members are instead proposing a serial interface. The more pins required to represent a channel, the faster a switch supporting multiple channels becomes "pad-limited," meaning that you can't fit any more pins on the physical packaging. A high-speed serial interface, however, places a different burden on the network processor, requiring a mixed-signal/analog design approach with all of those signals running around internally. It also requires a SERDES (serializer/deserializer) on the network processor, which is a separate problem. CSIX is, at one level, trying to specify a common fabric interface between the network processor and the switch fabric. Such a spec has two parts: the electrical interface (data) and the flow-control mechanism (that determines what to do with the data). Flow control differs from the electrical interface in that the flow is logical and runs on top of the electrical interface (in-band) or on another bus (out-of-band). One goal of CSIX is to bring the fabric market to a common ground. The current standards are proprietary, and they vary. Thus, certain fabrics connect only to certain network processors. Note that CSIX is not about allowing you to change fabric or network-processor vendors.
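As a quick check on the bus figures quoted earlier in this sidebar, the following Python sketch computes raw bus throughput and compares it with the SONET line rates. It assumes one transfer per clock cycle on a parallel bus, which is a simplification; the function name is hypothetical.

```python
OC48_GBPS = 2.48832    # SONET OC-48 line rate
OC192_GBPS = 9.95328   # SONET OC-192 line rate


def bus_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw throughput of a parallel bus, assuming one transfer per clock."""
    return width_bits * clock_mhz / 1000.0


current = bus_gbps(32, 100)       # 3.2 Gbps for the OC-48 interface
proposed = bus_gbps(64, 250)      # 16.0 Gbps for the proposed OC-192 interface
relaxed = bus_gbps(64, 200)       # 12.8 Gbps at the optional 200-MHz clock

print(current / OC48_GBPS)        # about 1.29 times the OC-48 line rate
print(proposed / OC192_GBPS)      # about 1.61 times the OC-192 line rate
print(relaxed / current)          # 4.0: doubled width and doubled clock
```

Under this assumption, the 200-MHz option is exactly four times the current bus, and the 250-MHz clock adds further margin on top of that.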
The ramifications of changing network-processor or fabric vendors are much more than just allowing swapping of chips that support the same interface. CSIX, rather, is more about reducing the time it takes to bring together compliant fabrics and network processors. One problem with proprietary interfaces is that some vendors are not forthcoming with details of the interface. In other words, they want to lock you into buying their fabric and network-processor combinations. Such a forced choice, although supposedly optimized for the chips in question, may not be in your best interest. There is the issue of not having a second source, but, realistically, few, if any, second sources are at the bleeding edge. Of course, you can probably put together an FPGA shim, but the idea is for your design to require fewer chips and consume less power. Regarding interfaces, possibly the greatest problem is that CSIX is still "under construction," resulting in a further fragmented market with limited choices. Parts serving the same function are not homogeneous, and the final standards will depend upon who the winners are and who can hold out with their proprietary interfaces longest. (Right now, it's too early to tell.) To some degree, electrical interfaces are easy to define and approve. Getting everyone to agree on a similar flow-control mechanism will be difficult, given that some fabrics take on more functions than others. Flow control also determines the functions that the network processor can take over; if the network processor can't communicate a certain kind of request, then your design cannot support that function. Note that CSIX is not the only standard under development for switch fabrics. The OIF (Optical Internetworking Forum) is working on FPI-4 level 1, which specifies an interface at both the electrical and flow-control levels, running 64 bits at 200 MHz.
It looks electrically similar to CSIX, and on some network processors, the I/O pads you use might even be the same, thus supporting both FPI and CSIX. In a strict sense, FPI-4 is not exactly competing with CSIX. CSIX generally fits between the fabric and the network processor. FPI-4 is a chip-to-chip interface, not limited to network processors but capable of connecting network processors to fabric. It also offers only limited flow control. If you want in-band flow control, you must define tagging to run over the electrical interface. One advantage of in-band flow control is that no pin limitation affects the granularity of flow control. For example, it takes 8000 virtual queues to support 1000 ports each with eight levels of quality of service. To support flow control at a granularity of 8000, you need 13 pins (2^13 = 8192) to represent all the possibilities. In-band tags don't face such scalability limitations. Of course, you can use a hybrid scheme to mix in-band with out-of-band flow control, but then such a scheme becomes proprietary. Adapting CSIX for certain fabric architectures is difficult and costly. To meet the needs of a range of applications, CSIX is somewhat of a least-common-denominator interface, which is standard operating procedure for emerging standards. This requirement means that you may be unable to take advantage of all the features on the network processor or fabric if you want to stay true to CSIX. For example, the granularity of CSIX may be too coarse for your needs; you might need to control every cell passing through a fabric. If you lose granularity, you might lose your competitive edge. For example, at the edge, a box has many slow interfaces aggregated with a few fast interfaces. Such a box could conceivably require hundreds of thousands of queues so that a carrier could provide gold or bronze service levels for customers. Without granularity, you can't offer such features. CSIX is still not the panacea of fabrics.
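The pin-count arithmetic in the flow-control example above generalizes easily. This Python sketch assumes that an out-of-band bus names one queue at a time with a plain binary address; the function names are hypothetical.

```python
def virtual_queues(ports: int, service_levels: int) -> int:
    """One virtual queue per port per level of service."""
    return ports * service_levels


def out_of_band_pins(queue_count: int) -> int:
    """Pins needed to address any one of queue_count queues in binary."""
    return (queue_count - 1).bit_length()


queues = virtual_queues(1000, 8)
print(queues)                     # 8000
print(out_of_band_pins(queues))   # 13, because 2**13 = 8192 >= 8000
```

An edge box with hundreds of thousands of queues would need 18 or more such pins per interface, which is one reason in-band tags scale more gracefully.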
Some complain that CSIX isn't the most robust interface and that flow control is inefficient. Others are nervous because the specification has some undefined bits; future allocation of these bits could render existing interfaces obsolete, depending upon how CSIX allocates them. Another criticism of CSIX is that it is not a tight spec. It targets support of a range of applications, but to stay broad, it has to include certain inefficiencies. Thus, the spec does the job, but it isn't as robust or efficient when you compare it with a proprietary interface targeting a specific market. At the bleeding edge, you need every advantage you can squeeze from your parts. In this respect, CSIX may serve more as an interface for the future when the market slips down past the leading edge. CSIX is worth keeping an eye on. The decision to use it is not easy to make. In this sense, you might ask whether CSIX really has a place. CSIX may be trying to be the perfect answer to too many problems. People want to use the fabric to do different things, depending upon whether they are short-hauling or long-hauling, transporting or routing, at the core, or at the edge. Standards tend to work for functions that you can clearly delineate. The fabric interface is anything but clearly delineated. All this said, CSIX is still good technology. Chances are, CSIX will come to be the most commonly supported interface among fabrics. A fabric could still support a proprietary standard to offer better performance over CSIX, but you have more compatibility options because of CSIX. Some vendors are offering knock-off versions of CSIX that have private "extensions." These extensions are not part of the standard, but they are necessary to support the unique features that vendors want to offer. At some point, a standard such as CSIX will serve the market, and everyone will win. The problem is that you've got to build your device today while the network-processor and fabric vendors hash it all out. 
For at least the next generation or so of products, many vendors consider CSIX support insufficient. Thus, fabrics and network processors that support several interfaces are similar on the electrical level. Some of the proprietary interfaces are similar to the CSIX electrical interface and can work with CSIX interfaces with some glue logic. However, each IC can reasonably support only so many interfaces, so alliances and partnerships come into play. And, although CSIX may be a |
| The mythical petabit
A lot of hype surrounds switching capabilities. Several companies claim to support terabit rates, and at least one company, Hyperchip, claims scalability to petabits (1000 Tbits). Just how much capacity is necessary? Today's bleeding-edge designs may bring 16 to 32 OC-192 pipes together. How soon will the market demand more than this? Getting to multiple terabits or even a mythical petabit requires a multistage fabric. One way to manage the complexity of switching across multiple stages is with a separate control chip. The problem is that the efficiency of throughput drops as you add stages. According to Hyperchip, a one-stage fabric might achieve 95% efficiency. This efficiency drops to 80% at two stages, 40% at three stages, and a why-bother 10% with four stages. As you add ports, one chip can no longer handle the traffic. And, as you add multiple control stages, the efficiency of port allocation and management decreases sharply. Additionally, you need to keep the signal clean between stages. You wouldn't need to add a CDR (clock-and-data-recovery) device between every stage, but you would need them in a few strategic places, such as after the last stage. Hyperchip approaches the problem from a different perspective: giving control to the fabric itself through a distributed architecture. The crossbar switches schedule themselves, have a layer of buffering, and use parallel serial-communications links to communicate with each other. Between each layer, however, you must supply external smart memory (memory controller, scheduler, and traffic manager). Hyperchip claims that multistage efficiency should theoretically maintain a linear drop from a single-stage efficiency of 97% to 94% (two-stage), 91% (three-stage), 88% (four-stage), and 85% (five-stage). The company's road map reaches four times the OC-768 rate, although no one has run the parts at this speed or density. Additionally, the parts currently take the form of FPGAs.
By spring of 2001, Hyperchip hopes to crunch 16 of these FPGAs into an ASIC that runs 10 times faster. Now |
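To put the two sets of efficiency figures side by side, here is a minimal Python sketch. The percentages are the ones quoted above; the function names and the 1-Tbps raw capacity are hypothetical, chosen only for illustration, and the linear formula simply reproduces the claimed 97%, 94%, 91%, 88%, 85% sequence rather than describing how any real fabric behaves.

```python
# Efficiency figures quoted in the text, in percent of raw fabric throughput,
# for a multistage fabric managed by a separate control chip.
SEPARATE_CONTROL_EFFICIENCY_PCT = {1: 95, 2: 80, 3: 40, 4: 10}


def distributed_efficiency_pct(stages: int) -> int:
    """Claimed linear model: 97% at one stage, minus 3 points per added stage."""
    if stages < 1:
        raise ValueError("a fabric needs at least one stage")
    return 97 - 3 * (stages - 1)


def usable_gbps(raw_gbps: float, efficiency_pct: int) -> float:
    """Throughput left after applying an efficiency percentage."""
    return raw_gbps * efficiency_pct / 100


RAW_GBPS = 1000.0  # hypothetical raw capacity, for illustration only

for stages in sorted(SEPARATE_CONTROL_EFFICIENCY_PCT):
    central = usable_gbps(RAW_GBPS, SEPARATE_CONTROL_EFFICIENCY_PCT[stages])
    distributed = usable_gbps(RAW_GBPS, distributed_efficiency_pct(stages))
    print(f"{stages} stage(s): control chip {central:.0f} Gbps, "
          f"distributed claim {distributed:.0f} Gbps")
```

The gap is small at one stage and widens quickly, which is the heart of the scalability argument. Keep in mind that the linear figures are theoretical claims, not measured results.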
| The importance of being granular
The location at which aggregation, done with a multiplexer, takes place is critical. Once you aggregate streams, you lose the ability to differentiate between them. High-speed pipes generally do not represent a single stream of data but rather a cornucopia of rates and transports and mixed synchronous and asynchronous data. Channelizing the pipe allows you to hand off lower rate streams to access points or perform specialized processing. Aggregation occurs in many places throughout the stream, from building an OC-3 into an OC-12 into an OC-48. From a service perspective, maintaining signal integrity at the OC-3 level makes a lot of sense. OC-3 is the first good SONET (synchronous-optical-network) transport level, and you can neatly build up OC-3 from T1s and T3s. Today, it's also a convenient size for handling services. Deciding what to feed into an OC-192 box requires a thorough understanding of what carriers want. The logical assumption is to aggregate four OC-48s into each OC-192 port. However, for ATM (asynchronous-transfer-mode) environments, for example, you may want to aggregate lines as low as OC-3 or OC-12 to eliminate the need for an extra OC-48 box. This issue arises more at the edge; the network core doesn't necessarily require this much granularity. Granularity becomes important as you support more services. As OC-48 and OC-192 begin to serve more at the network edge, services such as QoS (quality of service) or statistical tracking for billing gain importance. OC-48 is no longer limited to undersea connections and is expanding into business-to-business connections. Initially, you use high-speed links for point-to-point connections; that is, to get data from one place to another. And this task requires little intelligence or traffic processing. 
However, as the technology progresses, those point-to-point connections become multiple points to multiple points, and processing and switching become necessary. To provide various service levels, you may need to step down or drop channels out of the overall traffic stream. Why do you need to step down a stream? Unless your transport stream is 100% pure, it's a mix of "x over SONET." So, to offer more services, you need to break an aggregated stream into its individual substreams. For example, you may need to drop a channel within a stream (using a demultiplexer) to give that stream the proper QoS. With OC-12, dropping down to a 64-kbps DS0 line is possible. At OC-192, however, dropping that DS0 line requires more steps. If you need this granularity, you might be looking at a system architecture that has to cross several boards. Perhaps the network will someday be monolithic IP (Internet Protocol) and avoid this problem altogether, but that reality is simply too far away to spend much time dreaming about it. Rather than reaching far down into traffic to provide services, you can employ a wrapper technology, such as MPLS (multiprotocol label switching). The wrapper simplifies the way you handle traffic. For example, MPLS wraps a label around an IP packet. The system can now switch and provide services based on the label instead of digging into the IP packet (or ATM cell or SONET frame). |
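The cost of granularity can be made concrete by counting tributaries and step-down stages. The Python sketch below assumes every step is a fixed 4:1 multiplexing stage, matching the OC-3 to OC-12 to OC-48 to OC-192 progression; the function names are hypothetical, and the model stops at OC-3, so it does not cover the further steps down to T3, T1, or DS0.

```python
def tributaries(high: int, low: int) -> int:
    """How many OC-<low> streams fit in one OC-<high> pipe."""
    if low < 1 or high % low:
        raise ValueError(f"OC-{high} is not a whole multiple of OC-{low}")
    return high // low


def demux_stages(high: int, low: int, fan_out: int = 4) -> int:
    """Count the 1:fan_out demultiplexing steps from OC-<high> down to OC-<low>.

    Assumes every stage splits a stream into exactly fan_out equal parts.
    """
    ratio = tributaries(high, low)
    stages = 0
    while ratio > 1:
        if ratio % fan_out:
            raise ValueError("rates are not separated by whole fan_out steps")
        ratio //= fan_out
        stages += 1
    return stages


for pipe in (12, 48, 192):
    print(f"OC-{pipe}: {tributaries(pipe, 3)} x OC-3, "
          f"{demux_stages(pipe, 3)} step(s) down to OC-3")
```

Under these assumptions, each jump in line rate adds one more stage of demultiplexing before you reach a stream small enough to service individually, which is why fine granularity at OC-192 can spill across several boards.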
| The aggregation/smoothing paradox
Steve Schwartz, a hardware architect at Ironbridge, shared some of the challenges he and his team encountered while building an OC-192 switch (Figure A). One area that required concentrated thinking was buffering, which must be based on an efficient memory subsystem. "Building small, fast memory systems is not hard," says Schwartz. "But building large, fast memory systems, that's hard." The problem is that memory components are not increasing in speed at the same rate as fiber. Sophisticated memory systems, then, are necessary to handle the complex flow of high-speed traffic. However, the jury is still out on the size of buffers necessary for OC-192, given that few OC-192 systems are available to test and profile what's happening to traffic at this speed. Ironbridge employs theoretical models to determine the effects of different-sized buffers on different kinds of traffic. For example, a 2-Mbyte buffer running at 60% usage under certain conditions experiences a 4×10⁻² packet drop rate. Boosting the buffer to 64 Mbytes brings the drop rate down to 3×10⁻⁸. (The drop rate gives the frequency at which packets drop because bursts have pushed usage to more than 100% and overrun the available buffers.) These numbers are important because at layer one, traffic patterns change on the order of nanoseconds or microseconds. Traffic controls, such as back pressure and random early discard, affect a system on a much larger time scale, on the order of microseconds or seconds. Thus, by the time back pressure relieves a heavy burst, too much traffic has already been lost. A key factor in understanding the effects of different levels of usage, according to Schwartz, is aggregation. A high-speed pipe such as OC-192 is an aggregation of many smaller flows. Now, conventional wisdom says that "burstiness" does not decrease when you aggregate lines. In other words, a monolithic flow should have less burstiness than an aggregated flow. 
Schwartz suggests that the opposite is true for OC-192. The conventional wisdom arises from the fact that most aggregation theory is based on Ethernet traffic, where you can aggregate only two to four flows. Additionally, because Ethernet always runs at full rate, bursts in the Ethernet world refer to longer, not faster, clumps of data. So why would massive aggregation reduce burstiness in an aggregated OC-192 flow? Some of the smaller flows into OC-192 switches come from OC-3 and OC-12 routers that work on data in batches. Thus, they tend to clump data and, over time, make it bursty rather than smooth. The "mathspeak" for burstiness is an AC ("autocorrelation") artifact. Software routers that multitask a flow also introduce AC into the flow. Efficient traffic shaping can reduce the effects of AC without damaging traffic but at the expense of increased latency. How does AC affect a system? Say a system has bursts that are five times greater than the average traffic rate: Running at 15% usage means your system will experience bursts at 75%. (Keep in mind that, at 80% usage, response time in the network significantly increases.) Say you have 100 flows feeding an OC-192 pipe. Now, although some flows are simultaneously bursting, most are running at the average rate. With 100 flows, therefore, you get a smoothing effect; that is, the average flows buffer the bursting flows, smoothing the overall traffic. This smoothing means that the five-times-greater burst rate has much less impact on the overall OC-192 usage. Thus, the switch can operate at a rate higher than 15% (increased usage) without exceeding the 80% threshold. A note on the 80% threshold. Where does the figure 80% come from? If you look at the variation of response time across different priorities of traffic, you'll see no difference until around 80% usage. At this point, the response time for high-priority traffic remains flat; after all, this traffic goes right through the switch or router. 
Low-priority traffic, however, sees increased latencies. You can think of 80% as the gridlock threshold, or the knee at which performance takes a sharp dive. For example, drivers can move quickly on an empty highway. As more traffic enters the road, the overall flow of traffic doesn't change much. At rush hour, however, there's a complete breakdown for everyone but drivers in the car-pool lane, although if traffic gets bad enough (beyond 80%), even the car-pool lane locks up. To avoid this traffic crunch, many network administrators upgrade systems when they hit 70%. An interesting side effect, then, is that QoS (quality of service) plays no role at these high speeds. Below the knee, latencies are in the microseconds. In other words, they don't matter. And, just when QoS starts to make a difference, administrators will probably upgrade the equipment. Thus, implementing QoS in an OC-192 might have no appreciable effect other than an increase in cost. Will aggregation actually smooth traffic and reduce burstiness? Again, there has been no opportunity to test the theory either way in the real world. But that situation is soon going to change. |
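One way to get a feel for the smoothing argument is a back-of-the-envelope model. The Python sketch below is not Ironbridge's model; it rests on several loudly stated assumptions: every flow is independent of the others, each flow either bursts at five times its average rate or runs at a lower rate that keeps its mean fixed, and the aggregate must stay below the 80% knee with three standard deviations of headroom. The 10% burst probability and the three-sigma margin are hypothetical values, chosen so that a single flow reproduces the five-times burst factor from the example above.

```python
import math


def relative_std(burst_factor: float, burst_prob: float, flows: int) -> float:
    """Standard deviation of the aggregate rate divided by its mean.

    Each flow sends at burst_factor x its average rate with probability
    burst_prob and at a lower rate otherwise, so its mean stays fixed.
    Flows are assumed independent, which real traffic may not be.
    """
    if flows < 1:
        raise ValueError("need at least one flow")
    if not 0 < burst_prob < 1 or burst_prob * burst_factor > 1:
        raise ValueError("need 0 < burst_prob < 1 and burst_prob * burst_factor <= 1")
    single_flow = (burst_factor - 1) * math.sqrt(burst_prob / (1 - burst_prob))
    # Independent variances add, so relative spread shrinks as 1/sqrt(flows).
    return single_flow / math.sqrt(flows)


def effective_burst_factor(burst_factor: float, burst_prob: float, flows: int,
                           headroom_sigmas: float = 3.0) -> float:
    """Peak-to-average ratio of the aggregate, taking peak = mean + k sigma."""
    return 1 + headroom_sigmas * relative_std(burst_factor, burst_prob, flows)


def max_average_usage(burst_factor: float, burst_prob: float, flows: int,
                      threshold: float = 0.80, headroom_sigmas: float = 3.0) -> float:
    """Highest average usage that keeps aggregate bursts under the threshold."""
    return threshold / effective_burst_factor(
        burst_factor, burst_prob, flows, headroom_sigmas)


for n in (1, 4, 100):
    factor = effective_burst_factor(5, 0.1, n)
    usage = max_average_usage(5, 0.1, n)
    print(f"{n:>3} flows: burst factor {factor:.2f}, max average usage {usage:.0%}")
```

With these assumed parameters, a single flow must stay near 16% average usage to keep its bursts under 80%, whereas 100 independent flows can average roughly 57%. Autocorrelation between flows, which the sketch ignores, would erode that gain, so treat the numbers as illustrative only.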
| It pays to occasionally step back and look at the system level of a design. With OC-48 and OC-192 systems, however, taking this step back is vital. Given the complexity of such systems, there's a justified tendency to break the overall problem into modular pieces and independently solve each of these smaller problems. One side effect of breaking down the problem in this way, however, is that you can lose significant system efficiency. Your design strategy needs to encompass the entire design. Blocking out the design fails to take into account the extreme dependence of each block on each of the others. The question to ask is whether you're treating the system as a monolithic entity or as a collection of smaller entities. For example, full connectivity is an important switch characteristic; that is, all ports can simultaneously "speak" to all other ports at full rate. If the subsystems are weakly connected, you'll be unable to pull together an any-to-any fabric. Any traffic entering a subsystem must also leave that subsystem. Pulling off an architecture in which traffic from any subsystem can leave any other subsystem requires forward thinking but yields better efficiency and functions. When you subdivide a problem, you can easily lose information. For example, you can consider a packet's priority from a subsystem perspective. However, when each subsystem acts independently, lower priority traffic can actually block higher priority traffic (Figure A). In principle, priority is not a problem that you can subdivide; the system must have a single notion of priority. You need to consider all the available information before you make a decision. You also need to consider how your product interacts with other components in the overall end-customer design. For example, most carriers double-provision links, running each at 40 to 50% usage. Your system, however, must be able to run at high usage even if your customers plan to run it at low usage. 
This ability is necessary, because if a backhoe destroys one of the links, the backup link must bear the double load. Thus, your design must be able to step up to the task of managing more than its share of the load. In any case, even systems that typically run at low usage occasionally need to run robustly at high usage. |
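The double-provisioning arithmetic is simple enough to check in a few lines. This Python sketch assumes that traffic from a failed link spreads evenly across the surviving links; the function name and parameters are hypothetical.

```python
def usage_after_failure(usage_per_link: float, links: int = 2, failed: int = 1) -> float:
    """Per-link usage once the survivors absorb the failed links' traffic.

    Assumes all links carry equal load and the load redistributes evenly.
    """
    survivors = links - failed
    if links < 1 or failed < 0 or survivors < 1:
        raise ValueError("at least one link must survive")
    return usage_per_link * links / survivors


for usage in (0.40, 0.50):
    after = usage_after_failure(usage)
    print(f"{usage:.0%} per link before failure -> {after:.0%} on the survivor")
```

A pair of links provisioned at 40 to 50% therefore lands at 80 to 100% on the survivor, which is why a box that customers normally run lightly loaded still has to hold up near full rate.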
| Fortunately, communications generally evolves in simple multiples, such as four. This characteristic makes it possible to design an OC-192 device that demultiplexes the signal down to four OC-48 lines and handles the problem with OC-48 technology. The possibilities are the same for OC-768. The switch fabric might run at 40 Gbps, but the guts leading up to the fabric might be four OC-192 modules. However, whether OC-768 can scale as 16 OC-48 lines remains to be seen. OC-768 represents a quantum leap for communications. Not all backplanes can support data at these rates, and moving to OC-768 will more than likely require a "fork-lift upgrade." OC-768 will also require severe discipline when maintaining signal integrity. Standard semiconductor processes may be unable to handle these speeds when the market is ready for 40 Gbps. Processes that can handle such speeds may require special packaging or multichip modules. At OC-768 it is much harder, if not impossible, to get away with faking a design. Will the market soon demand OC-768? Given the technical leaps required to make OC-768 happen, it's safe to say that OC-768 won't follow on the heels of OC-192 as quickly as OC-192 followed OC-48. The question is whether the extra cost of OC-768 will be worth the effort, given the increasing capacity of fiber. The cost differential to save 3λ by going to OC-768 rates (instead of using four OC-192 modules) will probably be less attractive than simply increasing the number of lambdas available on a line of fiber. |
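The trade-off between a faster line rate and more wavelengths comes down to a little arithmetic. The Python sketch below uses the standard SONET relationship that an OC-n line runs at n times 51.84 Mbps; the function names are hypothetical, and the sketch says nothing about relative cost, which is the real question.

```python
STS1_MBPS = 51.84  # SONET base rate; an OC-n line runs at n times this


def oc_rate_gbps(n: int) -> float:
    """Line rate of an OC-n signal in Gbps."""
    if n < 1:
        raise ValueError("OC level must be positive")
    return n * STS1_MBPS / 1000


def lambdas_needed(payload_oc: int, per_lambda_oc: int) -> int:
    """Wavelengths needed to carry an OC-<payload_oc> load at OC-<per_lambda_oc> each."""
    if payload_oc < 1 or per_lambda_oc < 1:
        raise ValueError("OC levels must be positive")
    return -(-payload_oc // per_lambda_oc)  # ceiling division


for per_lambda in (768, 192, 48):
    count = lambdas_needed(768, per_lambda)
    print(f"OC-768 payload over OC-{per_lambda} wavelengths: {count} lambda(s) "
          f"at {oc_rate_gbps(per_lambda):.2f} Gbps each")
```

Carrying the load as four OC-192 wavelengths costs three more lambdas than a single OC-768 wavelength, so the comparison hinges on whether those three extra lambdas are cheaper than the leap to 40-Gbps electronics.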
| For more information... | ||
| For information on subjects discussed in this article, use EDN's information-request service. When you contact any of the following manufacturers directly, please let them know you read about their products in EDN. | ||
| Agilent www.agilent.com Enter No. 301 | Allayer 1-408-570-0888 www.allayer.com Enter No. 302 | AMCC 1-858-450-9333 www.amcc.com Enter No. 303 |
| Bay Microsystems 1-408-653-2181, ext 121 www.baymicrosystems.com Enter No. 304 | Broadcom 1-949-450-8700 www.broadcom.com Enter No. 305 | CSIX (Common Switch Interface) Consortium www.csix.org Enter No. 306 |
| Conexant 1-800-854-8099 www.conexant.com Enter No. 307 | ConnectCom MicroSystems 1-949-789-1188 www.connectcommicro.com Enter No. 308 | Cypress 1-408-943-2600 www.cypress.com Enter No. 309 |
| Exar 1-510-668-7000 www.exar.com Enter No. 310 | Hyperchip 1-514-931-5335 www.hyperchip.com Enter No. 311 | IBM 1-914-499-1900 www.ibm.com Enter No. 312 |
| Infineon Technologies 1-408-501-6000 www.infineon.com Enter No. 313 | Ironbridge Networks 1-781-372-8000 www.ironbridgenetworks.com Enter No. 314 | Marvell 1-408-222-2500 www.marvell.com Enter No. 315 |
| MMC Networks (merged with AMCC) 1-858-450-9333 www.mmcnetworks.com Enter No. 316 | NEC 1-800-366-9782 www.necel.com Enter No. 317 | Network Elements 1-503-644-7666 www.networkelements.com Enter No. 318 |
| NewPort Communications 1-949-450-1080 www.newportcom.com Enter No. 319 | OIF (Optical Internetworking Forum) www.oiforum.com Enter No. 320 | PMC-Sierra 1-650-567-9170 www.pmc-sierra.com Enter No. 321 |
| PowerX 1-408-456-2500, ext 16 www.powerxnetworks.com Enter No. 322 | Silicon Laboratories 1-877-444-3032 www.silabs.com Enter No. 323 | StarGen 1-508-786-9950 www.stargen.com Enter No. 324 |
| TranSwitch Corp 1-203-929-8810 www.transwitch.com Enter No. 500 | Velio Communications 1-408-434-9280 www.velio.com Enter No. 325 | Vitesse 1-805-388-3700 www.vitesse.com Enter No. 326 |
| Zettacom 1-408-514-6100 www.zettacom.com Enter No. 327 | ||
Author Information
You can reach Technical Editor Nicholas Cravotta at 1-510-558-8906, fax 1-510-558-8914, e-mail ednnick@pacbell.net.
ACKNOWLEDGMENT
Special thanks to Carl Bloom and Steve Schwartz from Ironbridge Networks for their contributions to this article.
This article ran on page 71 of the November 9, 2000 issue of EDN.