Compute farms: the new data centers
Today's CPUs are not powerful enough to support the development of their successors. Dynamically allocated computing power keeps projects on time and fosters quality.
By Gabe Moretti, Technical Editor -- EDN, July 19, 2001
Increasing design sizes demand not only better EDA tools, but also significantly higher computing capacity. Physical-synthesis, place-and-route, and especially design-verification applications require so much computing power that they are often bottlenecks in product development. In the last 20 years, the industry has grown accustomed to depending on personal workstations, and engineers have used them to perform the functions required for the design and development of products. For much of this period, engineers have been artisans, using personal tools in the advancement of their craft. When designers worked alone on a project, it was easier for managers to predict the required hardware and software configuration for each desktop. Now, engineers work on various tasks with differing degrees of complexity and use a number of EDA tools. It is therefore impractical to dedicate a specific hardware configuration and particular EDA-tools licenses to individual engineers. The CPU usage in a workstation typically ranges from 5 to 20%. Most of the time, a workstation is performing I/O operations or is idle waiting for data. As a design grows, it becomes necessary to collaborate with co-workers to conquer the complexity, as does borrowing computing power from the network to finish the job in time. Companies have a difficult task procuring and managing the correct mix of workstations, because it is hard to predict the amount of computing power that each designer will require over the life of the equipment.
EDA vendors reacted to this new requirement by inventing the "floating license." With it, a tool is not tied to a specific CPU; you can move it from one workstation to another as requirements change.
The farm comes to town
The enterprise intranet is the modern-day equivalent of the data center. It allows engineers to use otherwise-idle machines to execute design functions that they cannot run on their personal workstations. In practice, it is difficult to schedule a job on someone else's machine, because tool-license, length-of-execution, and even hardware-configuration issues can result in aborted jobs and wasted time instead of productivity gain. Managing the distribution of jobs within a network is a full-time job, and, realistically, neither engineers nor system administrators can economically do it. When engineers schedule a job to run remotely, they need to be sure that the target workstation has enough resources to ensure that, unless an application error occurs, the job will run to completion.
Competition drives microprocessor companies to adhere to Moore's Law and provide faster and better CPUs every 18 months or so. With the increase in computing power, managers are forced to find uses for workstations that have not yet fully depreciated but that must be replaced for the company to keep up with the competition's productivity increase. After all, you can transfer only so many workstations to accounting or marketing! You could use the slower workstations to run batch jobs if only there were a practical way of doing so.
Workstation manufacturers are showing that they can be as flexible as the EDA vendors have been and are providing a new option: the "compute farm." A compute farm is a collection of independent computing nodes, storage arrays, networking components, and software that appears to end users as one computing resource. The components of a compute farm can be products specifically designed for and sold as a part of a compute-farm product or general-purpose hardware, such as workstations, disks, network switches, and software, connected both physically and functionally to appear as one computing resource.
The major advantage to designers is the availability of a computing resource that they can use 24 hours a day, seven days a week, with enough capacity to ensure that it does not abort the most complex applications processing the largest designs for lack of resources. Managers can use hardware resources that they would otherwise have written off as obsolete capital equipment as nodes in the compute farm and can take advantage of the flexibility in job assignment that a central and shareable resource provides. Design teams are also increasingly decentralized and dispersed around the world. Having a collection of centrally located compute farms available via either a private or a public Internet connection diminishes the capital expenditure and the number of IT professionals required while maintaining computing power at a sufficient level to avoid project bottlenecks due to a lack of hardware resources.
The Sun experience
In 1991, Sun Microsystems was developing a new-generation SPARC processor using workstations based on the then-available SPARC. Engineers were running out of processing power and memory on their desktop machines, and it was uneconomical to upgrade all of the engineers with more powerful workstations. So, Sun invented the compute farm, aiming to reduce design cycles and achieve functionally correct silicon on the first iteration. The design required the use of more than 250 types of EDA and MCAD/MCAE applications covering all phases of the design: architecture, logic design and verification, circuit design and verification, and layout design and verification (Figure 1).
Since then, design complexity has increased significantly. The last generation of SPARC processors had 5 million transistors on one chip, the current SPARC has 23 million transistors on one chip, and system architects estimate that the next-generation device will pack as many as 200 million transistors. A full verification run of a modern microprocessor requires more than 150 million simulated cycles, which, using the latest commercially available workstation, could require as many as 35 CPU years to complete. Because it is likely that you'll have to execute a full verification run more than once a week, the problem immediately becomes evident. Currently, Sun has five compute farms fully dedicated to EDA tasks. The Microelectronics farm has more than 600 multiprocessor workstations, representing more than 4000 CPUs, 3 Tbytes of memory, and 100 Tbytes of disk storage. The CPU usage in this farm is more than 97% around the clock. No wonder California has a power shortage!
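Some back-of-the-envelope arithmetic shows what these figures imply. The sketch below assumes, purely for illustration, that a verification run divides evenly across CPUs with no scheduling, license, or I/O overhead, and that every farm CPU matches the "latest commercially available workstation." Neither assumption comes from Sun, and the function name is hypothetical.

```python
# Idealized estimate only: perfect parallelism is an assumption, not a Sun figure.
HOURS_PER_YEAR = 365 * 24


def wall_clock_hours(cpu_years, cpus, utilization=1.0):
    """Elapsed hours for a workload spread evenly over `cpus` processors."""
    return cpu_years * HOURS_PER_YEAR / (cpus * utilization)


if __name__ == "__main__":
    single = wall_clock_hours(35, 1)
    farm = wall_clock_hours(35, 4000, 0.97)
    print(f"one CPU: {single / HOURS_PER_YEAR:.0f} years")
    print(f"4000 CPUs at 97% usage: {farm:.0f} hours ({farm / 24:.1f} days)")
```

Under these idealized assumptions, the 35-CPU-year run would occupy the entire 4000-CPU farm for roughly 79 hours, a little more than three days, which leaves little slack for running it more than once a week.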
Learning from its own requirements, Sun developed a family of products. A TCF (Turnkey Compute Farm) is a preconfigured product that Sun designed, tested, and supports. The base box offers a file server; four compute engines, each containing four processors and associated memory; a disk array; a 24-port switch; and two access servers in one rack. Cisco Systems distributes the switch and the access servers. The SGE (Sun Grid Engine) utility manages the hardware and the job queues in a TCF or collection of TCFs. Conceptually, the SGE is like a smart batch-queue manager that determines where and when jobs run based on the hardware resources that each job requires. Any software that runs on the Sun Solaris OS will run unchanged with the SGE. If a compute engine fails, the SGE restarts the job using different TCF resources, as they become available.
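The idea of a batch-queue manager that places jobs according to the resources they need can be sketched in a few lines. The example below is not SGE code and does not use any SGE interface; the class names, fields, and first-fit placement policy are assumptions chosen only to illustrate the concept, including the requeuing of jobs when a compute engine fails.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cpus: int
    mem_gb: int


@dataclass
class Node:
    name: str
    cpus: int
    mem_gb: int
    up: bool = True
    running: list = field(default_factory=list)

    def free(self):
        """Return (free CPUs, free memory) after subtracting running jobs."""
        used_cpus = sum(j.cpus for j in self.running)
        used_mem = sum(j.mem_gb for j in self.running)
        return self.cpus - used_cpus, self.mem_gb - used_mem


def dispatch(queue, nodes):
    """Place each queued job on the first live node that can hold it.

    Returns the jobs that must keep waiting for resources.
    """
    waiting = []
    for job in queue:
        for node in nodes:
            free_cpus, free_mem = node.free()
            if node.up and job.cpus <= free_cpus and job.mem_gb <= free_mem:
                node.running.append(job)
                break
        else:
            waiting.append(job)  # no node has room right now
    return waiting


def fail(node, queue):
    """Mark a node down and put its jobs back at the head of the queue."""
    node.up = False
    requeued, node.running = node.running, []
    return requeued + queue
```

A typical use would call `dispatch` whenever a job arrives or finishes, and call `fail` followed by `dispatch` when a node drops out, so that interrupted jobs restart on other nodes as capacity becomes available.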
HP enters the market
Microprocessors and complex electronic systems are some of the many products that Hewlett-Packard sells. The problems inherent in the development of complex systems with a geographically dispersed engineering team have confronted HP management as well. For example, to design the latest-generation CPU, the design-and-verification team used 300 desktop HP-UX workstations to develop the block-level design and perform interactive tasks. It also used 200 HP-UX compute farms supported by 20 workgroup servers providing as much as 10 Tbytes of storage to verify the design. The chip-assembly team used 20 more compute farms to perform its functions. Of course, because HP is in the workstation business, all of the compute farms used HP equipment and the HP-UX operating system. As part of the project, HP formed a technology partnership with Cadence Design Systems (www.cadence.com) to jointly develop IC-design options. The initial focus of the partnership is design verification. The partnership has realized improvements in the efficiency of running the Cadence NC-Sim products on the HP-UX operating system. Additional improvements in the configuration of HP compute farms have reduced regression testing from eight hours to one and three-quarter hours. Another result of the cooperation between the two companies has been an optimization in the integration setup of HP-UX compute platforms within a Solaris environment.
Engineers involved in large and complex projects have often experienced unexpected side effects, both positive and negative. A positive outcome can improve efficiency, give ideas for a new product, or result in better training of team members. In the case of the Itanium-processor-design project, HP developed the Sim Launcher utility to enable a directed random-test methodology (Figure 2). Throughout the regression-testing phase, the simulation farm efficiency remained at more than 80%, but this benefit is only one of many that Sim Launcher provides. The tool also allows engineers to make changes to local files and test those changes against release files. It also provides the ability to specify a group of tests with unique random-number seeds. Sim Launcher supports NC-Verilog and Verilog-XL and runs on HP-UX servers and HP-UX or Linux clients. No one should be surprised to find out that HP has entered the compute-farm business with its J6000 product. The system has dual PA-8600 processors running at 552 MHz, as much as 16 Gbytes of memory, and 72 Gbytes of storage. HP offers various choices of operating-system version and interconnect capabilities for this product. You can integrate as many as 20 J6000 systems into one rack, resulting in a computing node that offers 88-GFLOPS peak performance and almost 1.5 Tbytes of storage. HP provides the Sim Launcher utility free with every J6000 unit.
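The rack-level figures follow from the per-system numbers, as a short calculation shows. The constants below are the ones quoted above; the only derived quantity not stated in the text is the number of floating-point operations per clock cycle that the 88-GFLOPS peak implies, and decimal Tbytes (1000 Gbytes) are assumed.

```python
# Figures quoted in the article for one J6000 and for a full rack.
SYSTEMS_PER_RACK = 20
CPUS_PER_SYSTEM = 2
CLOCK_GHZ = 0.552
DISK_GB_PER_SYSTEM = 72
RACK_PEAK_GFLOPS = 88

# Total rack storage, assuming 1 Tbyte = 1000 Gbytes.
rack_disk_tb = SYSTEMS_PER_RACK * DISK_GB_PER_SYSTEM / 1000

# Peak rate per processor, and the operations per cycle that rate implies.
gflops_per_cpu = RACK_PEAK_GFLOPS / (SYSTEMS_PER_RACK * CPUS_PER_SYSTEM)
flops_per_cycle = gflops_per_cpu / CLOCK_GHZ

print(f"rack storage: {rack_disk_tb:.2f} Tbytes")
print(f"peak per CPU: {gflops_per_cpu:.1f} GFLOPS")
print(f"implied operations per cycle: {flops_per_cycle:.2f}")
```

Twenty systems at 72 Gbytes each give 1.44 Tbytes, which matches "almost 1.5 Tbytes," and the peak figure works out to 2.2 GFLOPS per processor, or about four floating-point operations per clock cycle.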
SGI supports Linux
SGI (Silicon Graphics) has redefined its corporate mission and is broadening its product offerings. The company has recently entered the compute-farm market with hardware products that support the Linux operating system as well as Microsoft's NT and Windows 2000. Although many EDA vendors have de-emphasized their support for Microsoft's environments, a significant number of EDA products have been ported to Linux. The EDA Linux market is showing real growth, due to both the stability of the operating system and the generally lower price of the required hardware, which uses an Intel or equivalent CPU. SGI has introduced three compute-farm products, all based on the Intel Pentium III processor with various speed and configuration options: the 1100, 1200, and 1450 servers. Clock speed ranges from 550 MHz to 1 GHz, and system memory ranges from 128 Mbytes to 4 Gbytes. The 1100 and 1200 use one or two CPUs, and the 1450 can accommodate four. Reflecting the still somewhat-fragmented Linux-support market, the SGI data sheets for its compute farms are a bit confusing when it comes to operating-system support. The data sheet for the 1100 compute node specifies support for Linux Version 6.2, and the data sheet for the 1200 claims support for Linux Red Hat 6.1 or Linux SuSE 6.3. In addition, the 1450 server data sheet lists support for Red Hat 6.2 and Linux SuSE 6.2. The good news is that SGI does offer its customers first-line support for the operating system. At the DATE (Design and Test Exposition) 2001 conference in Munich, Germany, SGI demonstrated NC-Sim from Cadence; ModelSim and IC Station from Mentor Graphics (www.mentor.com); and VCS, Scirocco, and Design Compiler from Synopsys (www.synopsys.com).
A common denominator
Although Sun's compute-farm products come with SGE software to manage the workload for the compute nodes, products from other vendors do not offer such a utility. Platform Computing Corp is a nine-year-old Canadian company addressing the distributed-resource-management market. One of its products, LSF (Load Sharing Facility), has established itself as the de facto standard in compute-farm scheduling and management. Most system vendors, including Sun, use LSF and co-market it to their customers. The LSF suite encompasses distributed load sharing and job scheduling for Solaris, HP-UX, Linux, and NT environments, as well as other versions of Unix. Multiprocessor workstations and compute farms dramatically increase the processing power available to engineering teams. LSF harnesses all available computing resources, including the cumulative processing power of all workstations, or even workstation clusters, present on an intranet to efficiently process jobs. If a compute element fails and the application that was running at the time of failure allows an interrupted run to be restarted, LSF reboots the machine. Otherwise, it moves the application to another available machine to rerun it. LSF uses a master configuration, which keeps the data required to control the execution environment, and a number of daemons and agents that monitor the network for the status of active jobs and the availability of compute engines. LSF keeps a fail-safe mode, so that even if the server on which the master configuration resides fails, the data is not lost, and execution can continue.
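The failure-handling policy described above amounts to a simple decision rule. The sketch below is one reading of that description, not LSF code or an LSF interface; the function name, the dictionary shapes, and the action labels are all hypothetical.

```python
def recover(job, failed_host, hosts):
    """Decide how to recover `job` after `failed_host` goes down.

    job:   dict with 'name' and 'restartable' keys (hypothetical shape)
    hosts: dict mapping host name to 'idle', 'busy', or 'down'
    Returns an (action, host) tuple and updates `hosts` in place.
    """
    hosts[failed_host] = "down"
    if job["restartable"]:
        # The application can resume an interrupted run, so bring the
        # same machine back and continue there.
        hosts[failed_host] = "busy"
        return ("reboot_and_resume", failed_host)
    # Otherwise rerun the job from the beginning on another free machine.
    for name, state in hosts.items():
        if state == "idle":
            hosts[name] = "busy"
            return ("rerun_from_start", name)
    # No machine is free: the job waits in the queue.
    return ("requeue", None)
```

The `requeue` branch is an added assumption for the case the text does not cover, in which no other machine is available when the failure occurs.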
LSF comprises a collection of modules that Platform Computing has integrated into product packs to serve a number of industries and applications. The product packs you commonly find in EDA applications are LSF Professional Edition, LSF Standard Edition, LSF Parallel, and Platform CADStarter. Platform CADStarter is appropriate for small installations, as the name implies. The utility boots up applications based on user-defined parameters and provides the administrator with a GUI to define, view, edit, and monitor the work sessions. LSF Parallel manages parallel applications, such as Plato's NanoRoute product. LSF Standard Edition features load sharing and batch scheduling in distributed Unix and NT computing environments. This package is the most popular among EDA users. Organizations that must routinely manage large numbers of computing nodes, such as Sun and HP in the two previous examples, turn to LSF Professional Edition. This package is built on the Standard Edition but adds reporting capabilities about the status and history of the networked system, capacity planning, charge-back accounting, and performance-improvement reporting. Some EDA vendors, such as Cadence, Synopsys, and Avanti (www.avanticorp.com), have worked with Platform Computing to implement license pre-emption in their products. In this way, if a high-priority job is scheduled and no licenses are available, a lower-priority job is suspended, and the license is assigned to the higher-priority job. This scheduling is dynamic, so that all jobs are eventually executed according to their relative priority.
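License pre-emption is easy to picture with a toy model. The sketch below is not how LSF or any EDA vendor implements the feature; the `LicensePool` class, its method names, and the rule that larger numbers mean higher priority are assumptions made for illustration.

```python
import heapq


class LicensePool:
    """Toy pool of floating licenses with priority pre-emption.

    Larger priority numbers are more important (an assumption here).
    """

    def __init__(self, total):
        self.free = total
        self.running = []  # min-heap of (priority, name)
        self.waiting = []  # max-heap via negated priority: (-priority, name)

    def submit(self, name, priority):
        if self.free > 0:
            self.free -= 1
            heapq.heappush(self.running, (priority, name))
            return "running"
        if self.running and self.running[0][0] < priority:
            # Suspend the lowest-priority holder and hand over its license.
            low_prio, low_name = heapq.heapreplace(self.running, (priority, name))
            heapq.heappush(self.waiting, (-low_prio, low_name))
            return "running"
        heapq.heappush(self.waiting, (-priority, name))
        return "waiting"

    def finish(self, name):
        """Release the license held by `name`; resume the best waiting job."""
        if name not in self.running_names():
            raise KeyError(name)
        self.running = [(p, n) for p, n in self.running if n != name]
        heapq.heapify(self.running)
        if self.waiting:
            neg_prio, resumed = heapq.heappop(self.waiting)
            heapq.heappush(self.running, (-neg_prio, resumed))
        else:
            self.free += 1

    def running_names(self):
        return sorted(n for _, n in self.running)
```

With one license, submitting a priority-1 regression and then a priority-9 tape-out job suspends the regression. When the tape-out job finishes, the highest-priority waiting job takes the license, so every job eventually runs in order of relative priority.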
The collection of products from Platform Computing allows IT departments to build compute farms using hardware from any vendor. Companies find administrative and possibly financial advantages in procuring systems from Sun, HP, or SGI that address the compute-farm market. Their customers can also rest assured that the vendor will provide the correct balance of computing power, memory, storage, and connectivity. In addition, at least in the case of Sun and HP, you can extend the configurations without practical limits to grow with the increasing requirements of growing organizations. But if an EDA user can take advantage of a tool that runs on either NT or Linux, he or she can configure a compute farm from scratch. You purchase the appropriate LSF package and use any collection of Intel-compatible machines to configure a compute farm, given a reasonable knowledge of the hardware-configuration requirements.
Examples of applications
Both EDA vendors and customers have shown an appreciation for the boost in productivity that compute farms offer. The use of compute farms has increased significantly in the last two years. Platform Computing counts 18 of the world's 20 largest semiconductor companies as its customers, and most systems companies are using compute farms for verification and regression testing.
Synopsys uses compute farms to perform regression testing of its products. Most of the hardware is from Sun Microsystems, although a number of Intel-compatible workstations are also used for the Linux version of the product. All of the compute farms use LSF, because most Synopsys customers use LSF. Thus, the software is tested in an execution environment similar to the one it will encounter once it is released. The design-verification division of Synopsys tested VCS with LSF and made a few changes to the Verilog simulator to optimize its execution on compute farms. The Vera test tool now allows the distributed verification of a design using concurrent testbenches, a feature that is most effective when using a compute farm. Compute farms have also changed the way product licenses are marketed and sold. Synopsys offers bundles of licenses, which allow customers to optimize both license price and hardware use. Finally, CoverMeter can add the results from the various compute-farm nodes as well as measure incremental coverage. The result is that, to a design or verification engineer, a compute farm looks just like one computer.
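Combining coverage results from several nodes and measuring incremental coverage can be illustrated with plain sets. This sketch does not reflect CoverMeter's data format or interface; representing coverage points as strings and merging nodes in dictionary order are assumptions made for illustration.

```python
def merge_coverage(per_node):
    """Union the coverage points hit on each node.

    per_node: dict mapping node name to an iterable of coverage points.
    Returns (merged, incremental), where incremental[node] holds the
    points that node added beyond the nodes merged before it.
    """
    merged = set()
    incremental = {}
    for node, hits in per_node.items():
        new_points = set(hits) - merged
        incremental[node] = new_points
        merged |= new_points
    return merged, incremental


if __name__ == "__main__":
    results = {
        "node1": ["alu.add", "alu.sub"],
        "node2": ["alu.sub", "fpu.mul"],
    }
    merged, incremental = merge_coverage(results)
    print(len(merged), "points covered in total")
    print("node2 added:", sorted(incremental["node2"]))
```

A node whose incremental set is empty contributed nothing new, which is the kind of information that helps decide whether a test is worth keeping in the regression suite.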
Model Technology, a division of Mentor Graphics, sees an increase in the use of compute farms by its customers. In general, the number of simulator licenses is greater than the number of engineers at a customer installation. The company has modified its simulator products to enable checkpoint/restore under LSF. Model Technology has also modified its licensing scheme to acknowledge the new execution environment that compute farms offer. A compute-farm license allows a user to run the VHDL, Verilog, or mixed-language simulator (at slightly more than the price for one language). In addition to modifying the product to improve performance under LSF, Model Technology has also worked with both Sun and HP to optimize its products' performance on compute farms.
Mentor Graphics has found that its customers are also using compute farms with its Calibre physical-verification product since it released a multithreaded version. The major advantage is the optimization of memory use, because Calibre keeps the design in memory instead of on disk to optimize execution. Mentor also changed the debugging interface and the GUI of Calibre to decouple these two modules from the engine. Doing so allows the batch execution that compute farms require. The company has also modified the license scheme: For compute-farm environments, a Calibre license is valid for three CPUs to take advantage of the multithreading architecture.
Intrinsix Corp (www.intrinsix.com) is a worldwide consulting company specializing in electronic systems design. It uses both Sun and HP compute farms running LSF for verification and regression testing of its designs. Compute farms allow Intrinsix to maximize the use of its tools licenses by sharing them among the designers, even remotely. Currently, Intrinsix is not using any Linux compute farms in production, but it is evaluating the use of Linux because both Synopsys and Model Technology products run on the operating system, and because Intel-compatible hardware typically costs less than more traditional workstations.
Simutech, a company that provides virtual-component-evaluation services, has deployed a different version of a compute farm. It provides remote access to a farm of its Rave emulation engines, which allow secure and flexible evaluation for third-party virtual components. Transaction processing and job-spooling software serves as the front end of the farm and distributes a stream of tasks among the available Rave engines.
The pendulum is swinging back to the model of powerful, centralized-computing capabilities after years of totally distributed computing on engineers' desktops. The difference is that a number of compute nodes connect to provide the required increase in computing power.
For more information...
When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.
Hewlett-Packard Corp
1-650-857-1501
www.hp.com
Platform Computing Corp
1-877-528-3676
www.platform.com
Silicon Graphics Inc
1-650-960-1980
www.sgi.com
Simutech
1-503-293-9595
www.simutech.com
Sun Microsystems Inc
1-800-555-9786
www.sun.com

After more than 25 years in the EDA business, Gabe Moretti moved to Colorado to find good land to start his own compute farm. He has confirmed that computer waste is not biodegradable, and it contributes to productivity pollution. You can reach Technical Editor Gabe Moretti at 1-303-652-0480, fax 1-303-652-0479, e-mail 
