Congestion management clears a path through 10 GbE
With the new IEEE standards for consolidation and the ability to build highly scalable high-speed Ethernet networks, Ethernet will become the only fabric you need for converged data centers, and it will lower the cost of supercomputer networks.
By Zhi-Hern Loh, Fulcrum Microsystems -- EDN, January 7, 2010
Ethernet first emerged in the 1970s, and the IEEE standardized it in 1985. Over the years, the technology has undergone numerous updates. In today’s data centers, the newest trend is network convergence using Ethernet as a consolidated fabric replacing traditionally separate special-purpose fabrics, such as FC (Fibre Channel) for storage, InfiniBand or Myrinet for computer-cluster IPC (interprocess communication), and Ethernet for LAN (local-area-network) traffic. With the introduction of 10 GbE (10-Gbps Ethernet), link speed has become competitive with these other technologies, but speed alone is not sufficient for using Ethernet as a converged fabric because Ethernet does not usually supply other necessary data-center features. For example, FC provides a lossless fabric for the SAN (storage-area network), and InfiniBand provides advanced congestion management, neither of which Ethernet supports.
The IEEE DCB (Data Center Bridging) Task Group is developing three new standards—802.1Qaz ETS (enhanced transmission selection), 802.1Qbb PFC (priority flow control), and 802.1Qau QCN (quantized congestion notification)—to address network-convergence issues (Reference 1). Although Ethernet switches have historically supported 802.1Q traffic classes, the IEEE has not standardized the administration of the traffic classes, so vendors have implemented proprietary methods of defining scheduling policies for each traffic class.
ETS strives to provide one policy that the administrative domain will consistently apply. ETS enables an administrator to define per-traffic-class bandwidth allocation and strict prioritization. An expected usage of ETS is for consolidating IPC, SAN, and LAN traffic classes. The IPC traffic is latency-sensitive, so the administrator would configure it as strict high priority. You can limit the bandwidth of bursty SAN traffic to prevent starvation of other traffic classes. LAN traffic, which is typically best-effort traffic, would use strict low priority to minimize the latency of other types of traffic.
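The three-class policy in the preceding paragraph can be sketched as a toy allocation function. This is only an illustration under stated assumptions: the function name, the class keys, and the 6-Gbps SAN limit are hypothetical, and a real 802.1Qaz scheduler works on per-class bandwidth percentages rather than on this simplified serve-in-order model.

```python
def allocate_bandwidth(link_gbps, offered_gbps, san_limit_gbps):
    """Toy three-class policy: IPC strict high, SAN capped, LAN strict low."""
    # Strict high priority: IPC is served first, up to the full link.
    ipc = min(offered_gbps["ipc"], link_gbps)
    remaining = link_gbps - ipc
    # SAN is bandwidth-limited so that its bursts cannot starve other classes.
    san = min(offered_gbps["san"], san_limit_gbps, remaining)
    remaining -= san
    # Strict low priority: LAN receives only what is left over.
    lan = min(offered_gbps["lan"], remaining)
    return {"ipc": ipc, "san": san, "lan": lan}


# Hypothetical offered loads on a 10-Gbps link with a 6-Gbps SAN limit.
shares = allocate_bandwidth(10, {"ipc": 2, "san": 9, "lan": 5}, san_limit_gbps=6)
```

With these made-up loads, IPC receives everything it asks for, SAN is held to its limit, and LAN takes the remainder.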
Ethernet has for years supported lossless operation; however, the Ethernet Pause standard lacks the per-priority flow-control ability that today’s data centers require. Whereas all protocols running over Ethernet must handle drops, some protocols have high drop-recovery penalties and need lossless operation for performance. For example, FCOE (Fibre Channel Over Ethernet) carries a protocol that its designers intended for a lossless FC fabric, and it thus lacks the ability to retransmit single frames. Instead, on drops, FCOE retries an entire transaction, which could be megabits long. In a data center, cluster computing also suffers from high performance penalties because of dropped frames. For example, if the system drops interprocess-synchronization frames, the whole cluster stalls while the parallel processes wait for a time-out and retransmission of the dropped message. Moreover, traditional computer networks, such as InfiniBand and Myrinet, ensure reliable delivery, so most computing applications do not handle dropped frames. With the new PFC proposal, Ethernet can selectively provide flow control to the traffic classes serving these applications.
Issues arise with flow-control operation when you enable it, whether you use Pause or PFC; this discussion uses the two terms interchangeably. Pause prevents buffer overflows by stopping the flow of traffic on a port’s incoming link, whether on a switch or a NIC (network interface card). On a switch, you typically implement Pause with thresholds on queue occupancy per ingress port. If an ingress port uses too much memory, a Pause-on frame will wait in a queue for transmission to the link partner after the current frame on the wire. After a short delay, the link partner will receive the Pause-on command and stop sending new frames. When the link partner stops, the frames in the switch memory eventually drain, and the queue occupancy for the paused ingress port drops below the Pause-off threshold, causing the transmission of a Pause-off frame to the link partner to unpause. On average, the Pause-on/Pause-off cycles result in the matching of the ingress rate and the maximum egress rate and thus make the switch lossless.
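The two-threshold behavior described above is a form of hysteresis, and a minimal sketch of it follows. The class name, the threshold values, and the occupancy units are assumptions for illustration; they do not come from any particular switch.

```python
class PauseController:
    """Per-ingress-port Pause logic with separate on and off thresholds."""

    def __init__(self, pause_on, pause_off):
        if pause_off >= pause_on:
            raise ValueError("pause_off must be lower than pause_on")
        self.pause_on = pause_on
        self.pause_off = pause_off
        self.paused = False

    def update(self, occupancy):
        """Return the frame to send to the link partner, or None."""
        if not self.paused and occupancy >= self.pause_on:
            self.paused = True
            return "pause-on"
        if self.paused and occupancy < self.pause_off:
            self.paused = False
            return "pause-off"
        return None


# Hypothetical occupancy samples: the queue fills, then drains.
ctrl = PauseController(pause_on=80, pause_off=20)
events = [ctrl.update(q) for q in [10, 50, 80, 90, 60, 19, 10]]
```

The gap between the two thresholds keeps the port from toggling on every frame; between them, the controller simply holds its current state.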
When you enable Ethernet flow control, you can slow the entire network to the speed of its slowest link. A common misconception is that this degradation means that Pause is not working; however, it is actually due to congestion spreading and the penalization of innocent flows.
Congestion spreading
Congestion spreading can cause reduced performance in data-center networks (Figure 1). Consider a case in which two flows are on each source: A to C, A to D and B to C, B to D. Initially, each source sends data at 10 Gbps, or 5 Gbps per flow, and the sinks drain data at 10 Gbps. Clearly, there is no need for flow control because C and D can both fully drain the arriving traffic. Now, replace D with a 1-Gbps NIC. The A-to-D and B-to-D flows then must slow down to 0.5 Gbps each, that is, to one-twentieth of the 10-Gbps link capacity. Because Pause operates on the entire ingress port, however, the innocent A-to-C and B-to-C flows also slow down to 0.5 Gbps, even though they do not have to.
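The arithmetic of this example can be reproduced with a small model. It assumes that each source interleaves its frames evenly across its flows, so a port-wide Pause throttles every flow on the source to the rate of its slowest flow; the function and parameter names are hypothetical.

```python
def flow_rates(sink_gbps, sources, link_gbps=10.0, port_wide_pause=True):
    """Return per-flow rates in Gbps, keyed by (source, sink)."""
    offered = link_gbps / len(sink_gbps)  # each source splits its link evenly
    ideal = {}
    for src in sources:
        for sink, capacity in sink_gbps.items():
            # A flow can use at most its fair share of the sink's drain rate.
            ideal[(src, sink)] = min(offered, capacity / len(sources))
    if not port_wide_pause:
        return ideal
    paused = {}
    for src in sources:
        # Pause stops the whole port, so every flow follows the slowest one.
        slowest = min(ideal[(src, sink)] for sink in sink_gbps)
        for sink in sink_gbps:
            paused[(src, sink)] = slowest
    return paused


with_pause = flow_rates({"C": 10, "D": 1}, ["A", "B"])
per_flow = flow_rates({"C": 10, "D": 1}, ["A", "B"], port_wide_pause=False)
```

Under these assumptions, `with_pause` puts all four flows at 0.5 Gbps, whereas `per_flow` leaves the flows to C at 5 Gbps, which is the gap that finer-grained congestion control tries to recover.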
If the network includes multiple switch hops instead of one hop, such as replacing A and B in Figure 1 with more switches, a congestion tree could form (Figure 2). Each of the upstream switches will see congestion on the downstream switches and propagate backward flow control on ingress links that talk to congested egress links. The flow control thus causes a congestion tree to appear, with Port D of S being the root of the tree. Eventually, all the flows passing through the congested links (the solid arrows in the figure) would slow down, even if some of them do not go through the root.
Usually, flow control handles short bursts of traffic with transient congestion patterns. When the bursts are short, a congestion tree does not have time to propagate. In the previous example, the upstream link’s capacity decreases due to the long-lived congestion that occurs when you plug in the low-speed NIC in a misuse of Pause. However, legitimate cases exist in which multiple sources send to one sink for a long time. If this scenario were to happen in a 1 million-node data center, a congestion collapse could occur when 10-Gbps links decrease their speed to kilobits per second due to the possibility of high fan-ins to the congested port (Reference 2). Besides using PFC to limit the congestion spreading to select traffic classes, a CN (congestion-notification) mechanism could further limit the spread of congestion.
The DCB Task Group has introduced the IEEE 802.1Qau QCN project as a more precise flow-control mechanism. Cisco first presented QCN as BCN (backward congestion notification). In 2007, a number of proposals emerged, which the InfiniBand congestion-control architecture influenced (Reference 3). The latest version, QCN, is a hybrid of the previous ideas (Reference 4).
CN background
CN is a congestion-control mechanism, like TCP (Transmission Control Protocol). Because TCP and CN are similar in principle, you may wonder why it is not sufficient to run TCP. One reason is that TCP without a hardware-offload engine requires high CPU usage to maintain maximum throughput and minimize latency (Reference 5). TCP’s developers intended it for congestion control in the Internet, in which the network’s diameter is orders of magnitude larger than that of a data center. With the longer round-trip time and lower bandwidth of the Internet, keeping the link running at capacity requires less CPU usage. Another difference between TCP and CN is that TCP guarantees reliable and in-order delivery of frames. In the presence of congestion, this requirement causes large latency jitter due to retransmissions and time-outs. Reliable and in-order delivery is not always necessary; some data-center applications, such as streaming audio, video, and multiplayer games, favor timely delivery of datagrams instead. A competing proposal from the IETF (Internet Engineering Task Force) that satisfies this purpose is the DCCP (Datagram Congestion Control Protocol).
With respect to congestion signaling, CN has advantages over TCP and DCCP. In TCP, the senders back off by sensing drops due to overflows. However, back-offs are not immediate due to the time lag for either the receiver to notify the sender of missed sequence numbers or the sender to detect a drop due to a retransmission time-out. CN also has congestion predictors, similar in spirit to the RED (random early detection) queue management that networks carrying TCP use. CN relies on the congestion point’s queue length and rate of change as predictors of future congestion. Also, CN’s multilevel feedback from congestion points allows more precise congestion response than does the 1-bit ECN (explicit-congestion-notification) marking that TCP and DCCP use. In addition, some important data-center protocols, such as FCOE, can benefit from CN because they usually run without a congestion-control algorithm, causing poor network performance due to congestion spreading.
Basic operation
For CN, each source has a rate limiter, which is usually a token bucket; sensing the congestion in the network controls the rates of these sources. Congestion points detect congestion by monitoring the queue length and report congestion by sending CN messages back to the sources that have the rate-limiting reaction points. Sampling the frames going through the congestion point generates the CN messages. By using the source address in the sample, you can send a CN message back to a reaction point that is contributing to congestion. The sampling interval at the congestion point is also a function of the congestion. For example, in times of congestion, the sampling occurs more frequently to increase the back-off signaling.
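A token bucket of the kind mentioned above can be sketched in a few lines. The class name, the bit-based units, and the explicit `now` timestamp are assumptions made to keep the example deterministic; a hardware rate limiter would differ in detail.

```python
class TokenBucket:
    """Rate limiter: tokens accrue at rate_bps, up to burst_bits."""

    def __init__(self, rate_bps, burst_bits):
        self.rate_bps = rate_bps
        self.burst_bits = burst_bits
        self.tokens = burst_bits  # start with a full bucket
        self.last = 0.0

    def try_send(self, frame_bits, now):
        """Return True and spend tokens if the frame may be sent at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        if self.tokens >= frame_bits:
            self.tokens -= frame_bits
            return True
        return False


# Hypothetical numbers: 1000 bits/s with a 1000-bit burst allowance.
bucket = TokenBucket(rate_bps=1000, burst_bits=1000)
sent = [bucket.try_send(1000, t) for t in (0.0, 0.5, 1.0)]
```

A reaction point would lower or raise `rate_bps` in response to CN messages, so the bucket enforces whatever rate the control loop currently dictates.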
Ideally, you implement reaction points in NICs, with each flow having its own reaction point with a dedicated congestion-control state. For rate decreases, you compute the decrease amount from the feedback in the CN message that the congestion point returns. QCN uses self-increases to autonomously compute rate increases.
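A reaction point's two behaviors, feedback-driven decrease and autonomous increase, can be sketched as follows. The multiplicative-decrease form, the gain of 1/128, and the averaging step toward a remembered target rate follow common descriptions of QCN but are assumptions here; the standard defines further phases and timers that this sketch omits.

```python
class ReactionPoint:
    """Simplified per-flow rate state at the source."""

    def __init__(self, line_rate_bps, gd=1 / 128):
        self.gd = gd                      # decrease gain (assumed value)
        self.current_bps = line_rate_bps  # rate the limiter enforces now
        self.target_bps = line_rate_bps   # rate to recover toward

    def on_cn_message(self, fb_magnitude):
        """Cut the rate in proportion to the feedback in a CN message."""
        self.target_bps = self.current_bps
        self.current_bps *= 1 - self.gd * fb_magnitude

    def self_increase(self):
        """With no further feedback, move halfway back to the target."""
        self.current_bps = (self.current_bps + self.target_bps) / 2


rp = ReactionPoint(line_rate_bps=10e9)
rp.on_cn_message(fb_magnitude=32)  # hypothetical feedback value
after_decrease = rp.current_bps
rp.self_increase()
after_increase = rp.current_bps
```

In this sketch, a feedback magnitude of 32 cuts a 10-Gbps flow to 7.5 Gbps, and one self-increase step brings it back to 8.75 Gbps.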
CN is essentially a control loop that maintains maximum throughput at the congestion point. As in control theory, the CN control loop uses a proportional-integral controller in which the control feedback is proportional to the difference in the arrival and departure rates of the congestion point’s queue and the integral of the rate difference.
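One way to express such a feedback value is as a weighted sum of the queue's offset from its setpoint and its growth since the last sample. The names `q_eq` and `w`, and the weight of 2, follow common descriptions of QCN and are assumptions; the article does not give the formula.

```python
def congestion_feedback(q_len, q_old, q_eq, w=2.0):
    """Return the feedback value; negative values indicate congestion."""
    q_off = q_len - q_eq     # queue above the setpoint: integral of the rate difference
    q_delta = q_len - q_old  # queue growth since the last sample: the rate difference
    return -(q_off + w * q_delta)


def should_notify(fb):
    """Send a CN message only when the feedback signals congestion."""
    return fb < 0


# Hypothetical queue lengths, in frames, around a setpoint of 26.
growing = congestion_feedback(q_len=30, q_old=26, q_eq=26)
draining = congestion_feedback(q_len=20, q_old=26, q_eq=26)
```

A queue that sits above the setpoint and is still growing yields a strongly negative value, whereas a draining queue yields a positive value and no message.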
In control theory, a control loop aims to keep the process variable (here, the queue length at the congestion point, which reflects the traffic rate) at the setpoint. In normal operation, the congestion point’s queue length never actually stays on the setpoint but oscillates about it. In QCN, these oscillations arise because of a cycle in which the reaction points autonomously perform self-increases, increasing their transmission rates until the congestion point detects the onset of congestion and signals the reaction points to slow down.
Due to the CN message packet propagation delay, a lag occurs before the congestion point observes a new rate from feedback events. Increasing delay increases the magnitude of the oscillations. To compensate, you can tune the gain parameters of the CN’s control loop. For example, many gain parameters in QCN control the amplitude of the reaction to congestion. The feedback calculation at the congestion point uses some of these parameters, and the rate-change calculations at the reaction point use others. By tuning the gain, you can optimize the control loop’s response.
As in any control loop, instability can occur. The amplitudes of the oscillations grow instead of attenuating, eventually resulting in loss of throughput due to underflows and overflows. Decreasing the gain may improve stability but at the cost of reduced responsiveness. The IEEE’s CN simulations show that instability can occur (Reference 6). However, they also show that, within their data-center parameters, you can use one set of static-gain parameters to stabilize the CN loop (Reference 7 and Reference 8).
Mixed-speed networks are another possible cause of instability. Optimal parameters for 1 and 10 Gbps differ, so you must make trade-offs. With the upcoming 40- and 100-Gbps networks, interoperating with 10-Gbps CN devices represents a big unknown. Simulations have yet to show the effect of these higher-speed links on CN performance. CN also has yet to show that it will not negatively affect TCP flows. TCP friendliness is a good metric for measuring new CN protocols. DCCP explicitly states it as a goal. CN has also made this feature a goal, but it has yet to be proved in real-world applications. The issue is that TCP requires drops for sensing congestion and applying the back-off mechanism. Some TCP flavors, such as Vegas, can use other metrics, such as round-trip time, for sensing congestion and thus interact better with CN. You can also use ECN marking to notify TCP of congestion. Some IEEE simulations have illustrated these scenarios.
Implications for switches
Traditionally, switches have offloaded control-plane-processing operations to a multipurpose CPU. Unfortunately for CN, increasing stability margins requires CN messages to provide quick feedback; thus, you must accelerate frame generation as much as possible with hardware. Switches implement the line-rate data-forwarding logic in silicon, whereas software implements the routing-table maintenance functions, which have a lower event rate but are more functionally complex. The flexibility of software lowers the cost of maintenance and problem corrections. One issue with implementing CN with fast fixed logic is that the IEEE specification is new and likely to change. Dealing with these specification instabilities requires a hybrid approach of mixed firmware and hardware support to allow flexibility for handling specification changes and the ability to operate necessary features at line rate. For example, in most IEEE specifications, the frame formats are usually the last to stabilize. By offloading some of the congestion point’s functions to a programmable device, you can maximize compliance with the standard. Moreover, the increased flexibility allows experimentation with the algorithm for improving performance.
Although the 802.1Qau QCN standard is new, it has been in progress for several years, leading to many prestandard implementations. Interoperability between vendors’ implementations will be a big hurdle in the near future. Before the IEEE ratifies the standard, using programmable devices will improve interoperability. An example was a recent IEEE QCN demonstration of an NEC NIC with a prestandard Fulcrum Microsystems switch. Although the switch implemented much of the congestion point’s logic in silicon, the FPGA-based NIC provided enough performance and flexibility to implement QCN.
Another step toward deployment involves a switch-only QCN network. With more head-of-line blocking than a NIC-based reaction point, switches could implement the reaction point on-chip and convert the flow control into 802.1Qbb PFC signaling. A small data center could be a good place to evaluate this function because it would be easy to limit the switches to a single vendor.
The DCBX (Data Center Bridging Exchange) Protocol is yet another feature that will aid deployment. With DCBX, a network can automatically negotiate advanced capabilities, such as Qau for CN, Qbb for PFC, and Qaz for ETS. Each link would use the DCBX Protocol to discover its neighbor’s capabilities and automatically turn on the common features. When you replace switches and NICs in the cloud with Qau-capable devices, the CN domain would automatically expand.
Deployment in HPC networks
Another type of likely deployment outside the scope of the standard is for the switches to enable congestion points but not to enable the reaction points in the NICs. HPC (high-performance-computing) users have indicated that they would welcome the congestion point’s feedback for use in their applications. One use of the feedback would be for adaptive routing, enabling load balancing over multiple paths to avoid dynamic hot spots and for application tuning to avoid static hot spots. Adaptive routing can improve performance, but Ethernet does not natively support it. By default, Ethernet runs the spanning-tree protocol to cut loops in the connectivity graph. Loops can cause frame duplication when flooding occurs. While cutting loops, however, spanning trees also cut off alternative paths between sources and sinks, reducing the usable network bandwidth. Adaptive routing can recover these additional paths.
In a random network, loop prevention can be difficult because loops can form over multiple hops. However, with the fat-tree topology, or Clos network, that you commonly find in large data centers and supercomputers, loop prevention is simpler because of the fat tree’s regularity. The IBM MareNostrum uses a multistage Clos built on special-purpose Myrinet switches. With the recent enhancements of Ethernet and the availability of low-latency Ethernet switches, however, new supercomputers can deploy Ethernet instead.
An important property of the fat tree is the constant bisectional bandwidth over each stage. Ideal fat trees are nonblocking, meaning that you can connect any free input port to any free output port without affecting existing connections. However, due to the packet nature of Ethernet and flow-ordering constraints, an Ethernet fat tree is not exactly nonblocking. Figure 3 illustrates a three-stage fat tree using switches of degree N. A stage is one switch hop. The fat tree comprises line and spine switches. The line switches are externally facing; the spines handle internal connections. When traffic ingresses onto a line switch, it transmits onto one of the N/2 uplinks to a spine switch. Although all of the spine switches connect to the egress line switch, picking one to send the traffic is not trivial. Poor spine-selection algorithms can lead to load imbalance and create congestion on the uplinks in the line switches.
You can easily scale fat trees by recursively using smaller fat trees as the spine switches and adding line switches on the bottom externally facing layer. The number of ports grows exponentially with the number of stages. For example, using a 24-port switch, you can build a 288-port, three-stage; 3456-port, five-stage; or 41,472-port, seven-stage fat tree.
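The port counts in this example follow from a simple rule: a single N-port switch offers N ports, and each additional pair of stages multiplies the count by N/2. The sketch below encodes that rule; the function name is hypothetical, and it assumes that every switch has the same even port count and splits its ports evenly between uplinks and downlinks.

```python
def fat_tree_ports(switch_ports, stages):
    """External ports of a fat tree with an odd number of stages."""
    if stages < 1 or stages % 2 == 0:
        raise ValueError("stages must be a positive odd number")
    if switch_ports % 2 != 0:
        raise ValueError("switch_ports must be even")
    levels = (stages - 1) // 2  # number of times the tree has been nested
    return switch_ports * (switch_ports // 2) ** levels


sizes = {stages: fat_tree_ports(24, stages) for stages in (3, 5, 7)}
```

For a 24-port switch, this rule reproduces the 288-, 3456-, and 41,472-port figures quoted above.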
The spanning-tree protocol does not scale well in fat trees because it turns off many uplinks, motivating a need for a different uplink-routing algorithm. The routing protocol must preserve frame ordering, though, because reordering can cause accuracy problems in applications that assume ordered frames. Even in applications that do not assume in-order delivery, such as those that run over the IP (Internet Protocol) and TCP, reordering can confuse the congestion-control algorithm, resulting in loss of performance (Reference 9).
The simplest form of such an algorithm that also performs load balancing is a hash function. A flow identifier hashes to an uplink port, thus ensuring that all frames of a flow follow the same path. Multiple flows typically use all uplinks, improving load balancing. A common flow identifier is the five-tuple: source IP address and port, destination IP address and port, and protocol number.
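A minimal sketch of hash-based uplink selection follows. It uses the CRC-32 function from the Python standard library as a stand-in for whatever hash a switch implements in hardware; the function name, the string encoding of the flow identifier, and the addresses are assumptions for illustration.

```python
import zlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, protocol, n_uplinks):
    """Map a flow identifier to an uplink index in range(n_uplinks)."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    # Every frame of a flow carries the same identifier, so it gets the same uplink.
    return zlib.crc32(key) % n_uplinks


# Hypothetical flows that differ only in their source port.
first = pick_uplink("10.0.0.1", 49152, "10.0.1.9", 3260, 6, n_uplinks=12)
again = pick_uplink("10.0.0.1", 49152, "10.0.1.9", 3260, 6, n_uplinks=12)
used = {pick_uplink("10.0.0.1", port, "10.0.1.9", 3260, 6, 12)
        for port in range(49152, 50152)}
```

Because the choice depends only on the identifier, frame order within a flow is preserved, while different flows tend to land on different uplinks.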
Adaptive multipath networks
Although hashing may seem like a good approach to building fat trees, the technique involves some subtleties. For example, the various ingress stages to the spine in a multistage fat tree must use independent hash functions to avoid hash polarization (Reference 10). Hash polarization occurs when the hash function produces the same results on multiple stages, such as when Stage S1 sends to its uplink P1 to a switch on Stage S2, which in turn must use only one uplink, P1, because the hash function is the same. As a result, the second stage uses only one of the N/2 possible uplinks. For a (2M+1)-stage network, this approach reduces the effective bandwidth by a factor of (N/2)^M. Even if subsequent stages use different hash functions, correlations would also cause some polarization.
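Hash polarization is easy to reproduce in a small experiment. The sketch below uses a salted BLAKE2b hash from the Python standard library as a convenient way to obtain independent hash functions; this choice, the salts, and the flow counts are assumptions for illustration and not what a switch would implement.

```python
import hashlib


def uplink(flow_id, n_uplinks, salt):
    """Hash a numeric flow identifier to an uplink, using a per-stage salt."""
    digest = hashlib.blake2b(flow_id.to_bytes(4, "big"),
                             digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "big") % n_uplinks


def stage2_uplinks_used(n_flows, n_uplinks, salt1, salt2):
    """Uplinks a stage-2 switch uses for flows that stage 1 sent on uplink 0."""
    arriving = [f for f in range(n_flows) if uplink(f, n_uplinks, salt1) == 0]
    return {uplink(f, n_uplinks, salt2) for f in arriving}


# Same hash on both stages versus an independently salted hash on stage 2.
polarized = stage2_uplinks_used(10000, 4, b"stage1", b"stage1")
independent = stage2_uplinks_used(10000, 4, b"stage1", b"stage2")
```

With the same function on both stages, every arriving flow lands on one uplink; with an independent function, the same flows spread over all four.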
To provide the hash-function performance necessary for load balancing within the fat tree, the latest Fulcrum Microsystems switch uses a combination of CRC (cyclic-redundancy-check) functions and Pearson’s hash to yield nearly perfect load balancing for large numbers of both randomly and consecutively numbered flow identifications. CRC hashing distributes a random identification well, but Pearson’s hashing better suits consecutive identifications. Although most data-center applications have large numbers of flows and thus hash evenly, some applications, such as HPC, have few high-bandwidth flows. With fewer flows to hash, the hash function is unlikely to perfectly distribute the flows. If the arrival rate is high, persistent congestion will occur. This issue has led some vendors to add adaptive-routing capability, such as the Fortinet VScale technology, which monitors fabric latency and assigns flows to paths in a manner that minimizes the switching latency through the fat tree.
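The reason that Pearson's hashing suits consecutive identifiers can be seen in a short sketch of the textbook 8-bit version. This is not the switch's implementation, which the article does not describe; the table seed, the two-byte identifier encoding, and the four-uplink count are assumptions for illustration.

```python
import random
from collections import Counter


def make_table(seed=1):
    """Build a Pearson lookup table: a fixed permutation of 0..255."""
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return table


def pearson8(data, table):
    """Textbook 8-bit Pearson hash of a byte string."""
    h = 0
    for byte in data:
        h = table[h ^ byte]
    return h


table = make_table()
# 256 consecutive flow identifiers, encoded as two bytes each.
hashes = [pearson8(i.to_bytes(2, "big"), table) for i in range(256)]
per_uplink = Counter(h % 4 for h in hashes)
```

Because the table is a permutation, identifiers that differ only in their last byte can never collide, so these 256 consecutive identifiers produce 256 distinct hash values and divide evenly across the four uplinks.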
Adaptive routing can also act as a control loop, meaning that it is possible to observe oscillations in the system. When flows move from congested uplinks to uncongested ones, the load-balancing algorithm can overcompensate and overload the previously uncongested uplink, leading to oscillations in the queue length and frame latency, which increase jitter. Another difficulty with adaptive routing is that routing cycles may form when you use it in conjunction with link-level flow control, causing deadlock situations in the network.
Large fat trees have posed a challenge to CN, as some IBM-Zurich lab simulations have shown (Reference 8). The congestion-point-to-reaction-point distance varies from one to five hops, resulting in a range of delays and making it hard to choose a set of gain parameters to stabilize the system. Additionally, a fat tree’s fan-in to the hot spot grows exponentially, further exacerbating the instability.
Going to 40- and 100-Gbps ports for uplinks could improve the efficiency and alleviate hashing issues. For example, instead of hashing to one of four 10-Gbps ports, you could replace the four uplinks with one 40-Gbps port. The flows using that port would enjoy better bandwidth sharing than they would if you statically allocated them to four discrete 10-Gbps ports.
References

















Zhi-Hern Loh is a senior IC-design engineer at Fulcrum Microsystems, where he has worked for five years. He architects and designs Ethernet system-on-chip ASICs and develops the next generation of tightly integrated high-performance routers, specializing in implementing the latest data-center protocols and standards. Loh has bachelor’s degrees in both electrical and computer engineering and computer science and a master’s degree in engineering from Cornell University (Ithaca, NY). In his spare time, he enjoys cycling with his wife in the East Coast Park of sunny Singapore.

