Virtualization and multicore x86 CPUs
August 8, 2008 will be a significant date in the history of computer architecture. This is the date on which Intel intends to discontinue the Pentium 4 CPU line, and with it the NetBurst microarchitecture on which it is based, marking the final passing of the mainstream single-core x86 CPU. A closely related and equally important date is March 2006, when Intel announced the Core microarchitecture and, in doing so, made virtualization a necessity for the continued dominance of the x86 architecture within the data center.
Recall the hot and heady days of the megahertz CPU wars, from the late '90s through the first half of the 2000s. Intel had invested heavily in propagating the idea that CPU performance was directly proportional to clock speed. The culmination of this investment was the Pentium 4 architecture, whose very long, multistage pipeline was expected to reach clock speeds of up to 10 GHz over the lifetime of the design.
While very successful from a marketing perspective, the campaign began to take fire from competitors. In particular, Apple coined the phrase "Megahertz Myth" and made pointed comparisons between the performance of the PowerPC G4 and Pentium 4 architectures. Intel's problems grew once the Pentium 4's performance relative to both Intel's own subsequent Pentium M and competitors' offerings became known. The ultimate demise of the architecture, however, came from the unexpectedly high leakage currents which emerged as Intel pushed the Pentium 4 to higher clock speeds and smaller process geometries. The Pentium 4 architecture never surpassed 4 GHz and suffered the ignominy of being used as a data point to project a semiconductor power curve which quickly extrapolated to temperatures hotter than the Sun.
It was against this backdrop, and a need to change the trajectory of the power-efficiency curve of the CPU roadmap, that Intel ushered the Core microarchitecture onto the stage in March 2006. The new architecture featured a much shorter and simpler execution pipeline, enabling a single die to contain multiple complete x86 CPU cores sharing a common L2 cache. Along with better performance at lower clock speeds and better cache efficiency, the Core architecture introduced a number of energy-efficiency optimizations which have proven valuable in keeping the power requirements of successive generations of Core-based CPUs under control.
However, one dark cloud remained on the horizon, obscuring the continuance of the Moore's Law extrapolation on which the semiconductor industry so depends. The Core architecture shifted the balance in x86 CPU design firmly away from a focus on the performance of a single instruction stream to the aggregate performance of multiple streams or threads. This shift left one big barrier in the way of the long-term adoption of the new architecture within the data center. Namely, where were the multiple threads going to come from?
Microprocessor architectures and instruction-level parallelism
Modern computer architectures in general still look remarkably similar to the computer models invented by early pioneers as far back as the 1930s. In particular, the 1945 report on the architecture of the EDVAC computer, attributed (not without contest) to John von Neumann, separated processing from the memory units containing instructions and data (Figure 1). Computation took place via the processing unit's fetching and executing a sequence of instructions from memory, with the instructions operating on and modifying the data.
Fundamentally, this principle is still at the heart of the modern CPU. Instructions move data between registers within the CPU and memory, perform arithmetical operations on register contents, and branch to other points in the instruction stream based on conditional operators. The instruction streams may be expressed in human-readable assembler mnemonics, which have a literal translation into a binary representation in the digital logic of a particular CPU architecture.
A computer program written in assembly language will be expressed exactly in terms of the computational model implemented by the CPU. This expression will differ between CPU architectures, and it also requires the programmer to think and express the program at a very low level of detail. However, since the '70s (and arguably earlier), almost all computer programming has taken place using languages which offer a higher semantic level for greater programmer productivity. These languages are mapped by programs called compilers or interpreters onto an instruction stream which the CPU then executes.
While a very large number of programming languages have been developed and used over the past half-century, most have followed a computation model which is broadly very similar to an underlying Von Neumann machine. Languages such as FORTRAN, COBOL, Pascal, C, C++, Java, and C# are all defined as being imperative, in that they express computations as algorithms which manipulate stored states.
Other features of the underlying hardware model are also directly present in these languages. For example, all of these languages are also procedural, in that they allow for algorithmic abstraction through the use of procedural constructs (i.e. subroutines, function calls, and method invocations). The translation of these constructs onto the stack, jump, and return instructions available in all contemporary CPU architectures is quite direct. Because the most commonly used high-level languages are thus based on a model of computation similar to that of a single-execution-core CPU, most applications naturally compile or are interpreted as a single instruction stream (i.e. thread) for a single-core processor to execute.
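As a concrete illustration, the small C sketch below marks in comments how its procedural and branching constructs conventionally map onto call, return, stack, and conditional-jump instructions. The mapping shown is a typical shape, not a guarantee; exact code generation varies with compiler, optimization level, and ABI.

    #include <stdio.h>

    /* A typical compiler lowers a call to this function to a 'call'
     * instruction that pushes a return address onto the stack, and
     * lowers 'return' to a 'ret' that pops it. */
    static int square(int x)
    {
        return x * x;
    }

    int main(void)
    {
        int n = 7;
        if (n > 0)                      /* conditional branch: compare + jump */
            printf("%d\n", square(n)); /* procedure call: call/ret + stack frame */
        return 0;
    }

The point is how little distance separates the language construct from the instruction stream: the translation is close to one-to-one, and it produces exactly one stream.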
The expression of computations using any given language will tend to produce a particular distribution of instructions for any given CPU architecture. CPU designers take this distribution into account as they attempt to optimize a design's performance. Simple examples of this phenomenon include determining the appropriate number of registers for a processor, or including logic which attempts to predict the course of instruction-stream branches. These optimizations can be extremely sophisticated. For example, a superscalar architecture aims to issue more than one instruction per clock tick (on average) from a single instruction stream, by extracting the latent parallelism of the stream in order to feed multiple execution units simultaneously.
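A minimal sketch of that idea: in the first loop below, the four updates are mutually independent, so a superscalar core is free to issue several of them in the same cycle; the second loop carries a dependency from each iteration into the next, which bounds how much parallelism any amount of hardware can extract. (Compile without aggressive optimization so the loops are not transformed away.)

    #include <stdio.h>

    int main(void)
    {
        double a = 1, b = 1, c = 1, d = 1, chain = 1;

        /* independent operations: candidates for parallel issue */
        for (int i = 0; i < 10000000; i++) {
            a += 1.0;
            b += 2.0;
            c += 3.0;
            d += 4.0;
        }

        /* dependent chain: each iteration needs the previous result */
        for (int i = 0; i < 10000000; i++)
            chain = chain * 1.0000001 + 1.0;

        printf("%f %f\n", a + b + c + d, chain);
        return 0;
    }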
Fundamentally, however, the distribution of instructions generated by high-level language encoding of an application places an upper bound on the degree of parallelism which a CPU may extract from the instruction stream at run time. As that bound is approached, the CPU complexity required to extract further parallelism grows rapidly. If binary compatibility with the x86 instruction set is not a requirement, it is possible to broaden the computational model exposed by the processor and allow explicit programming of the parallel hardware units. In so doing, the task of extracting parallelism shifts toward the compiler, which has the benefit of being able to analyze the whole program. This shift can result in CPU designs where relatively more logic is dedicated to execution units and cache.
Such a technique is used with some success in the Itanium processor architecture, although the extracted parallelism is still fundamentally bound by the computational model of the high-level language(s) used. Sadly, changing the instruction-set architecture is not an option for the broad x86 CPU market, where decades' worth of binary compatibility have cemented expectations. Therefore, in order to generate the multiple instruction streams (threads) required to use a multicore x86 CPU efficiently, broad adoption of either instruction-set extensions or a different programming model is required.
Language-level techniques which deliver multiple instruction streams exist and have been used extensively, particularly in environments which employ parallel, cluster, or multiprocessor machine architectures. Traditional imperative, procedural languages lead the programmer to express a computation algorithmically, and algorithms describe how a computation should take place relative to some idealized Von Neumann-like hardware model; they therefore tend to describe a computation in a single-threaded manner. One technique is to move away from such languages altogether.
Declarative programming languages express a computation in terms of the desired output. Such programming languages are inherently decoupled from the machine model and so are more amenable to tools which may generate parallel (or multiple) execution streams for the computation. This technique has been used, for example, to translate SQL database queries into an intermediate functional programming language whose execution is distributed over an array of execution elements.
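The contrast can be made concrete with the hedged sketch below, in which the table and column names are purely illustrative. The SQL string states only the desired result, leaving the execution strategy (including any parallel decomposition) to a query planner, whereas the C loop prescribes one fixed, sequential order of operations.

    #include <stdio.h>

    /* declarative: states *what* is wanted, implies no evaluation order */
    static const char *declarative = "SELECT SUM(price) FROM orders;";

    int main(void)
    {
        double prices[] = { 9.99, 4.50, 12.00 };
        double sum = 0.0;

        /* imperative: prescribes *how*, one step at a time */
        for (int i = 0; i < 3; i++)
            sum += prices[i];

        printf("%s -> %.2f\n", declarative, sum);
        return 0;
    }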
Unfortunately, programmers don't easily switch programming models, and the science of producing parallel instruction streams from declarative programming languages is still in its infancy. Therefore, the most common language-based approach is to extend an existing imperative, procedural language to encompass some idealized hardware model which does include parallelism. This approach has had the most success on two fronts.
In the scientific community, MPI (the message-passing interface) provides an API and a model of cooperating processes which communicate through message passing. MPI enables programmers using languages such as Fortran, C, and C++ to naturally express a single program as a computation over many processes, each executing on a portion of the whole data set and periodically exchanging information in order to iterate toward a global solution. This model works well when the problem can be decomposed in such a manner (for example, in a weather-modeling application), and it maps transparently onto a wide range of parallel machine architectures and topologies.
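A minimal MPI sketch of this model in C: each process sums its own slice of an index space, and the partial results are combined with a collective operation. The decomposition here is deliberately trivial; a real weather-style code would exchange boundary data between iterations. Build with mpicc and run under mpirun.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each process sums its own strided slice of the index space */
        long long local = 0, total = 0;
        for (long long i = rank; i < N; i += size)
            local += i;

        /* exchange partial results; every rank receives the global sum */
        MPI_Allreduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %lld\n", total);
        MPI_Finalize();
        return 0;
    }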
In the commercial application community, meanwhile, the thread model in common use expresses a program in terms of cooperating threads of execution which communicate through shared memory. For these applications, particularly where an application contains a GUI, threads are a more natural approach for the programmer than explicit message passing. Because multiple threads share data structures, language-level synchronization primitives are required to carefully control access to the data. Both approaches, however, place significant demands on the programmer's skills, and such code requires considerably more time and effort to debug and optimize for performance. Legacy applications must also be rewritten before language-level solutions can make them exploit multiple cores.
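A minimal shared-memory sketch using POSIX threads: two threads update one shared counter, with a mutex as the synchronization primitive that keeps the updates from interleaving destructively. Remove the lock and the final count becomes unpredictable, which hints at why such code is hard to debug. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* guard the shared state */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 200000 with the mutex held */
        return 0;
    }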
Many system-level solutions exist that deliver parallelism. The one most commonly adopted in the commercial space, and therefore familiar to most readers, is the SMP (symmetric multiprocessing) architecture. Here, a single operating system image manages multiple CPUs, all sharing access to a common pool of memory. The OS schedules multiple simultaneous applications across the CPUs as an extension of time-slicing multiple applications on a single CPU. In an SMP system, different CPU cores may access memory through different local caches. To maintain efficiency and coherency between the various caches and memory, the caches must therefore coordinate with each other.
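The cost of that coordination is easy to demonstrate with the hedged sketch below, which assumes 64-byte cache lines. Each thread increments only its own counter; with the padding in place the counters live on separate lines, while deleting the pad field puts them on one line and forces the coherency protocol to bounce that line between cores on every update, typically slowing the run severalfold.

    #include <pthread.h>
    #include <stdio.h>

    /* one counter per thread, padded so each occupies its own cache line */
    struct padded { volatile long n; char pad[64 - sizeof(long)]; };
    static struct padded counters[2];

    static void *spin(void *arg)
    {
        struct padded *c = arg;
        for (long i = 0; i < 10000000; i++)
            c->n++;    /* private data, but coherency acts on whole lines */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, spin, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].n, counters[1].n);
        return 0;
    }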
SMP is a mature technology and does provide parallelism for data-center applications without requiring the applications to be written in a parallel or threaded programming model. It is well suited to a multicore x86 CPU architecture, where each x86 CPU core is considered a symmetric processor. However, scaling the number of active CPU cores in an SMP system is a challenge. For enterprise-class systems, eight to 16 CPUs have long been viewed as the generally accepted upper limit. And even when using advanced (aka business-critical) operating systems and specialized hardware, it has been difficult to scale beyond 32 CPUs. These limitations do not fit well with an x86 CPU roadmap which predicts more than 32 cores in common deployments within the next few years.
One reason for this scalability limitation is that the cache-coherency algorithms required to maintain a single uniform access model to system memory do not scale well. Another is that the size of operating systems, in terms of SLOC (source lines of code), continues to grow significantly over time (Figure 2). Given that in an SMP system the operating system is a single point of failure, this growth in code size offsets the reliability gains delivered by improved software engineering and validation techniques. If the operating system is not significantly improving in reliability, an empirical limit exists to the number of concurrent applications which system administrators are comfortable running on a single operating system instance.
In particular, these limitations mean that a typical enterprise-class SMP system would never be expected to execute more than a handful of simultaneous applications, resulting in a correspondingly low degree of CPU-core utilization. Fortunately, these issues of scaling out SMP architectures on x86 CPUs have already been dealt with elsewhere in the computer industry. The solution, virtualization, was accepted by the mainframe community, for example, decades ago.
Virtualization provides a scalable solution
Virtualization adds a domain (often called a hypervisor) to the system architecture which is more privileged than the operating system itself and which takes over from the operating system the responsibility for managing the machine's physical resources. The hypervisor is implemented using a mix of hardware and software mechanisms, and it provides an isolated computing environment in which a number of operating system instances may execute concurrently. The property of isolation means that no single operating system instance running on the system is able to degrade the operation of the machine or of any other operating system instance. For example, one operating system instance might completely crash without affecting the others. The use of a hypervisor improves the scalability of the SMP architecture for data-center applications in two ways.
As long as the code size of the hypervisor is manageable, the reliability of a virtualized system is decoupled from the number of operating system instances being executed. Empirical evidence suggests that many system administrators are today comfortable running 16 virtualized operating system instances on a modern enterprise-class x86-based server, and that they see no barrier to increasing this number further. These observations stand in contrast to the unwillingness to execute anywhere near this many applications on the same machine under a single operating system instance. Managing the code size of the hypervisor over time will, of course, cause tension within the industry, especially as virtualization vendors feel the pressure to increase the feature set of their products. However, this tension is similar to that which motivated the development of micro-kernel architectures in the late '80s and '90s, and contemporary hypervisors share much in common with (or have roots in) those projects.
Virtualization also improves the cache locality of the system, thereby enabling greater hardware scalability. Because each operating system instance is isolated from the others, there is relatively little sharing of memory between the running instances. Inter-virtual-machine communication takes place through dedicated messaging channels rather than via a global shared memory address space, so the degree to which a single virtualized operating system is constrained to a subset of the available CPU cores places an upper bound on how far the inter-core cache-coherency algorithms need to scale.
Microprocessor manufacturers have noticed this trend, and the future x86 CPUs shown on roadmaps increasingly look like NUMA (non-uniform memory access) machines. System memory is attached closely to the CPU package (with each package containing a large number of cores), and inter-package communication occurs via a switch-fabric-like interconnect. Viewed from the top, this checkerboard architecture lends itself to scheduling the virtual operating system instances to take maximum advantage of available cache and memory locality (Figure 3). Expect to see significant power savings from this architecture: if the scheduling of the virtual operating systems takes account of memory footprints, complete CPU cores with their associated caches and system memory can be powered up or down independently as the workload fluctuates.
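As a hedged sketch of what locality-aware placement looks like from software (assuming Linux, the libnuma library linked with -lnuma, and that node 0 exists), the fragment below pins the calling thread to one node and allocates its working set from that node's memory, so memory references stay within the local package.

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        numa_run_on_node(0);                    /* schedule on node 0's cores */

        size_t len = 64 * 1024 * 1024;
        char *buf = numa_alloc_onnode(len, 0);  /* back buffer with node-0 memory */
        if (buf == NULL)
            return 1;
        memset(buf, 0, len);                    /* touch pages so they are placed */

        numa_free(buf, len);
        return 0;
    }

A hypervisor scheduler applies the same principle on behalf of whole guests, keeping each virtual machine's cores and memory within one package wherever possible.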
Virtualization and I/O support
Increasing the density of CPU cores, replacing the traditional inter-CPU bus architecture with a switch interconnect, and executing larger numbers of consolidated applications on a single machine all have the effect of increasing the aggregate I/O requirements of the system. This trend is a strong motivation for system architects to aggressively upgrade the I/O capabilities of the next generation of servers, particularly with regard to networking, given the availability of mature 10-Gbit Ethernet products.
Another major challenge for the next generation of virtualized servers will be to maintain effective cache locality for the data being transferred by I/O. Virtualization-aware I/O hardware has been on the agendas of software providers for some time now. Architecturally, the problem can be considered an extension of the I/O bottleneck which manifested in non-virtualized systems at the start of the decade, and which technologies such as Infiniband and receive-side scaling were invented to address. Those technologies are conceptually very similar to the techniques required to improve virtualized I/O performance.
Receive-side scaling is a mature technique used by operating systems to associate network flows with the CPU cores on which the protocol and application processing for those flows will take place. An I/O device (in this case, a network adaptor) receives from the operating system a mapping from network flows to receive channels, and subsequently directs incoming packets onto the appropriate receive channel (Figure 4). Each channel is usually also associated with an MSI-X interrupt, which can be directed to the appropriate core. When receive-side scaling is properly tuned, cache locality for network processing is improved and the processing load is spread over multiple cores. The technique is simple enough to be implemented in virtualized operating systems in a manner almost identical to the non-virtualized case, and it has been deployed in production virtualized environments.
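A simplified sketch of the mechanism follows, with all names and the hash function invented for illustration (real adaptors typically use a Toeplitz hash over the flow's address/port 4-tuple): the operating system programs an indirection table, and the adaptor hashes each incoming packet's flow identifiers to select a receive queue, and hence a core.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_QUEUES 4
    static uint8_t indirection[128];    /* flow-hash bucket -> receive queue */

    /* illustrative mixing function, standing in for a real Toeplitz hash */
    static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport)
    {
        uint32_t h = saddr * 2654435761u;
        h ^= daddr * 2246822519u;
        h ^= ((uint32_t)sport << 16) | dport;
        return h;
    }

    int main(void)
    {
        /* the OS programs the table; here, round-robin over the queues */
        for (int i = 0; i < 128; i++)
            indirection[i] = i % NUM_QUEUES;

        /* the adaptor's per-packet decision: hash the flow, index the table */
        uint32_t h = flow_hash(0x0a000001, 0x0a000002, 49152, 80);
        printf("flow -> queue %u\n", indirection[h % 128]);
        return 0;
    }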
Infiniband, when used as a network interconnect, exhibits the architectural property that protected communication channels can be established directly between the network adaptor and the application library which is party to the communication. This approach is very different from the classical kernel-stack model of networking, where all network communication must pass through the operating system. Bypassing the operating system in this way can significantly reduce the per-message and per-packet overheads associated with networking, as well as improve cache locality. Supporting this mode of operation requires the incorporation of significant additional complexity into the network adaptor, particularly to ensure isolation between communication channels (Figure 5). The registers corresponding to each communication channel must be virtualized so that different applications are unable to interfere with each other's communications. The adaptor hardware 'behind' the registers also needs to be fully virtualized in order to handle simultaneous communication requests from multiple channels.
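What such a virtualized register file might look like from software is sketched below; every field name is invented for illustration and no real adaptor is described. The essential property is that each channel's registers occupy their own page-sized window, so the adaptor (or hypervisor) can map exactly one window into each application or guest, and isolation falls out of ordinary memory protection.

    #include <stdint.h>

    /* Hypothetical per-channel register window, padded to one 4-KB page
     * so that mapping a single page grants access to exactly one channel. */
    struct channel_regs {
        volatile uint64_t doorbell;   /* owner writes here to post new work */
        volatile uint32_t send_head;  /* producer index into the send ring */
        volatile uint32_t recv_tail;  /* consumer index into the receive ring */
        uint8_t pad[4096 - 16];       /* pad the window out to a full page */
    };

    int main(void)
    {
        /* sanity check: one channel window == one page */
        return sizeof(struct channel_regs) == 4096 ? 0 : 1;
    }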
This same technique can be used in a virtualized environment, wherein each virtualized operating system instance is given direct access to a virtualized hardware communication channel, bypassing the hypervisor for common I/O operations. Significant performance gains have been observed for hypervisors such as Citrix/Xen which have adopted this architecture (Figure 6). In this case the communication channel implementation can be as simple as passing Ethernet frames which are multiplexed by MAC address. Much of the complexity of a protocol like Infiniband can be eliminated, which explains the enthusiasm of 10-Gbit Ethernet adaptor vendors for the PCI-SIG SR-IOV standard, which aims to standardize the bus interface of virtualized adaptor cards.
However, the PCI-SIG standards only define the mechanism used to virtualize the register sets of the adaptor. They (rightly) do not attempt to standardize the semantics of the registers used by software to drive the communication channel. Herein lies the remaining issue to be solved before SR-IOV-based adaptors can be deployed: the semantics of the virtualized communication channel need to be defined and understood between the following three parties (a sketch of such a shared contract follows the list):
The hardware which is implementing one side of the channel,
The software (executing in the virtualized operating system) which is implementing the other side of the channel, and
The hypervisor which is in overall control of the communication channel.
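To make the three-way contract concrete, here is a hedged sketch of one hypothetical receive descriptor; every field is invented for illustration and drawn from no real adaptor. The guest driver posts the descriptor, the device DMAs into the buffer and writes back status, and the hypervisor (or an IOMMU under its control) must validate and translate the guest-supplied address, so all three parties need an identical view of the layout.

    #include <stdint.h>

    /* Hypothetical receive descriptor shared by guest driver, hypervisor,
     * and device hardware. */
    struct rx_desc {
        uint64_t buf_addr; /* written by guest; validated/translated by hypervisor */
        uint32_t buf_len;  /* written by guest: capacity of the posted buffer */
        uint32_t status;   /* written back by hardware: byte count, done/error bits */
    };

    int main(void)
    {
        /* all three parties must agree on the exact 16-byte layout */
        return sizeof(struct rx_desc) == 16 ? 0 : 1;
    }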
One implementation possibility would be for the hardware device semantics to be exactly specified and standardized. This approach would enable a generic software driver in the virtualized operating system to support all such devices. A hardware-only solution simplifies the overall software architecture, but it also suffers from a number of problems, the main one being that device semantics are generally specified by operating system vendors at a high level rather than at the level of detailed register operations. Historically, this flexibility in device implementation is precisely what has enabled hardware vendors to innovate; exact standardization would stymie it.
If instead the semantics of virtualized device operation are expressed in terms of APIs within both the virtualized operating system and the hypervisor, components of the hardware vendor's driver must exist both in the virtualized operating system and in the hypervisor. This requirement for three-way cooperation between hardware, operating system, and hypervisor vendors has been recognized, and all signs point to the respective parties moving forward from an implementation standpoint, thereby enabling virtual adaptors built on the SR-IOV standard to be accommodated into the virtualization ecosystem.