
ARM launches AMBA AXI coherent extensions

Hardware-based memory coherency protocol extends ARM's bus into the worlds of servers and virtualized embedded systems.

By Ron Wilson, Editor at large -- EDN, June 12, 2011

The best index to the evolution of the ARM architecture is, perhaps, the AMBA bus. In the beginning there was a simple microprocessor bus, perfect for connecting a discrete MCU to memory. Then ARM shifted its focus to IP cores rather than chips, and the MPU bus became a set of pins on a core. As SOCs took on their initial form as miniaturized board-level computers, these pins defined the bus interconnect between the CPU core, on-chip controller blocks, and on-chip memory instances. The physical implementation was in multiplexers, but logically there were still address, data, and control buses.

As SOC architectures became more complex, so did AMBA, becoming a high-bandwidth multi-master bus friendly to direct-memory-access peripherals and slave processors. Gradually the emphasis shifted from supporting communication between a central CPU core and its peripherals to supporting traffic between a number of computing or high-speed I/O controller blocks and their memory. By the announcement of AMBA 4 last year, what was once a bus had become AXI4: a multi-channel switched network capable of bursts, out-of-order operations, and streaming connections.

Now comes the next step. Three factors have pushed AMBA 4 to a new level, announced at DAC last week as ACE: the AXI Coherency Extensions. One factor is ARM's push, through the Cortex A-15, into the server space, where symmetric multiprocessing is a given. A second is the increasing use of Cortex A-series processors in multicore configurations. A third factor is the criticality of energy efficiency in ARM's home market, the mobile space.

In each of these areas there is a need for hardware memory coherency. The need is most obvious in the first instance, where server software simply assumes the existence of a single, coherent virtual memory space. The need is just as real, though, in embedded multicore applications. We have already taken the low-hanging fruit of multicore partitioning: RTOS on one core, critical application on a second, background applications on a third, for example. Increasingly, tasks must share data structures across CPU cores. Equally increasingly, systems architects want that sharing managed by the hardware, not by explicit memory management software. And the use of virtualization, just beginning in embedded computing, will make the case for hardware support stronger still.

The third point is perhaps less obvious. Michael Dimelow, director of marketing in ARM's processor division, pointed out the strong relationship between energy use and memory coherency. "A cycle to external DRAM can take ten times the power of a cache snoop," he said. "If you can frequently prevent a DRAM cycle by transferring the data to or from another cache on the chip, you can save a lot of energy." In designs where several processors, accelerators, and peripheral controllers have direct access to memory, a hardware coherency scheme that lets on-chip caches source and hold shared data can save a lot of DRAM cycles, and thereby a lot of energy.
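Dimelow's ten-to-one figure can be turned into a rough back-of-the-envelope model. The sketch below is illustrative only: the 10:1 ratio comes from the quote, while the function names, the arbitrary energy unit, the request counts, and the assumption that every request pays for a snoop before falling through to DRAM are all hypothetical simplifications.

```python
# Illustrative energy model; all names and numbers here are assumptions,
# except the 10:1 DRAM-to-snoop ratio, which comes from the quote above.
SNOOP_COST = 1.0   # energy of one cache snoop (arbitrary unit)
DRAM_COST = 10.0   # energy of one external DRAM cycle, per the 10x figure


def energy_with_snooping(requests: int, snoop_hits: int) -> float:
    """Energy for `requests` shared-data accesses when snooping is available.

    Every request pays for a snoop; only those that miss in the other
    on-chip caches go on to pay for a DRAM cycle as well.
    """
    if not 0 <= snoop_hits <= requests:
        raise ValueError("snoop_hits must be between 0 and requests")
    snoop_misses = requests - snoop_hits
    return requests * SNOOP_COST + snoop_misses * DRAM_COST


def energy_without_snooping(requests: int) -> float:
    """Energy when every shared-data access goes straight to DRAM."""
    return requests * DRAM_COST


if __name__ == "__main__":
    requests = 1000
    baseline = energy_without_snooping(requests)
    for hits in (100, 600, 900):
        coherent = energy_with_snooping(requests, hits)
        saving = 100.0 * (baseline - coherent) / baseline
        print(f"{hits} snoop hits: {coherent:.0f} vs {baseline:.0f} units "
              f"({saving:.0f}% saved)")
```

Under these assumptions the snoop overhead only pays for itself once more than one request in ten is satisfied from another on-chip cache; above that break-even point the savings grow quickly.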

ACE addresses these needs with two new pieces of silicon intellectual property, currently intended for use with the Cortex A-15, and a protocol standard that uses them. The first IP block is a new interconnect matrix, the CoreLink CCI-400 Cache-Coherent Interconnect. Like AXI4, the CCI-400 provides a 128-bit path at one-half the A-15 core frequency. But it adds coherent ports for A-15 processor-cache clusters, co-processors, and peripheral concentrators.

There are two kinds of ports available on the CCI-400: ACE and ACE-Lite. The ACE port attaches A-15 processor clusters, and is fully coherent: that is, through the ACE port an A-15 cache can share data and snoop other A-15 caches, and any other device on the interconnect can snoop it. To support these operations without tying up main-channel bandwidth, the ACE port adds three new channels: coherent address into the A-15, and coherent response and data back into the interconnect.

The second type of port, ACE-Lite, does not include the three new channels. A device, such as the Mali-T604 graphics engine, the DMC-400 DRAM controller, or the NIC-400 peripheral concentrator, can initiate a memory request that will snoop the A-15 caches. But there is no way for another device to snoop memory that is local to an ACE-Lite device. Thus, to speak precisely, the current ACE architecture is semi-coherent: it supports full coherency between A-15 CPU clusters, and allows other types of devices access to data from the A-15 caches. There are additional provisions for semi-coherent operation of Cortex A-5 and A-9 caches, as well.
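The asymmetry between the two port types is easier to see as a small model. The following is a minimal sketch, not ARM's specification: the class and function names are hypothetical, and the only rule it encodes is the one described above, that a device can be snooped only if it sits on a full ACE port with the three extra channels.

```python
from dataclasses import dataclass
from enum import Enum


class PortType(Enum):
    ACE = "ACE"            # has the three snoop channels; can be snooped
    ACE_LITE = "ACE-Lite"  # no snoop channels; can snoop others only


@dataclass(frozen=True)
class Device:
    """A block attached to the interconnect (names are illustrative)."""
    name: str
    port: PortType


def can_snoop(initiator: Device, target: Device) -> bool:
    """True if a request from `initiator` can snoop `target`'s cache.

    Either port type can initiate a coherent request, but only a target
    on a full ACE port has the channels needed to receive a snoop.
    """
    return initiator != target and target.port is PortType.ACE


if __name__ == "__main__":
    devices = [
        Device("A-15 cluster 0", PortType.ACE),
        Device("A-15 cluster 1", PortType.ACE),
        Device("Mali-T604", PortType.ACE_LITE),
    ]
    for src in devices:
        for dst in devices:
            if src != dst:
                verdict = "can" if can_snoop(src, dst) else "cannot"
                print(f"{src.name} {verdict} snoop {dst.name}")
```

In this model the two processor clusters can snoop each other and the graphics engine can snoop both clusters, but nothing can snoop the graphics engine, which is the semi-coherent arrangement the article describes.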

The second important piece of IP is the MMU-400. One could parallel the arguments above to assert that as virtualization becomes ubiquitous in servers and increasingly common in embedded systems, hardware MMUs to support virtual addressing are increasingly necessary. In AMBA 4, ARM has chosen to introduce a distributed architecture, Distributed Virtual Memory (DVM), in which coherent devices operate in their own virtual address spaces. Each of these devices attaches to the CCI-400 interconnect through its own instance of the MMU-400, which provides address translation, translation look-aside buffers, and the ability to update all the TLBs in the system by broadcast when a hypervisor decides to remap virtual addresses. The DVM protocol cleverly reuses signals already present for coherent support, reducing wiring overhead.
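The broadcast TLB update is the heart of the scheme, and a toy model shows why it matters. This is a sketch under stated assumptions rather than a description of the MMU-400: the `Mmu`, `Interconnect`, and `remap` names are hypothetical, pages are plain integers, and a single shared page table stands in for whatever translation tables a real hypervisor would manage.

```python
class Mmu:
    """Toy per-device MMU with a TLB caching page-table lookups."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # shared dict: virtual page -> physical page
        self.tlb = {}                 # locally cached translations

    def translate(self, vpage):
        # On a TLB miss, walk the page table and cache the result.
        if vpage not in self.tlb:
            self.tlb[vpage] = self.page_table[vpage]
        return self.tlb[vpage]

    def invalidate(self, vpage):
        # Drop a cached translation, if present.
        self.tlb.pop(vpage, None)


class Interconnect:
    """Toy interconnect that can broadcast an invalidate to every MMU."""

    def __init__(self):
        self.mmus = []

    def attach(self, mmu):
        self.mmus.append(mmu)

    def broadcast_invalidate(self, vpage):
        for mmu in self.mmus:
            mmu.invalidate(vpage)


def remap(page_table, interconnect, vpage, new_ppage):
    """Hypervisor-style remap: update the table, then flush stale TLB entries."""
    page_table[vpage] = new_ppage
    interconnect.broadcast_invalidate(vpage)


if __name__ == "__main__":
    table = {0x10: 0xA0}
    bus = Interconnect()
    gpu_mmu, io_mmu = Mmu("gpu", table), Mmu("io", table)
    bus.attach(gpu_mmu)
    bus.attach(io_mmu)

    gpu_mmu.translate(0x10)   # both TLBs now cache 0x10 -> 0xA0
    io_mmu.translate(0x10)
    remap(table, bus, 0x10, 0xB0)
    print(hex(gpu_mmu.translate(0x10)), hex(io_mmu.translate(0x10)))
```

Without the broadcast step, each MMU would keep returning the old physical page from its TLB after the remap, which is exactly the stale-translation hazard a system-wide invalidate is meant to close.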

Taken together, the coherence and virtualization IP announcements signal a new direction for the ARM architecture: into the server space, and simultaneously into an unknown territory where embedded systems are built on virtualized multiprocessing platforms. Equally, these architectures represent a departure for SOC design teams. Hardware coherency protocols are complex, and cover an enormous state space. Verifying them is one of the grand challenges of hardware engineering.

To meet this challenge, ARM has teamed with Jasper Design Automation and with Cadence. The former company has built a Proof Kit of verification IP for the ACE protocol. Jasper vice president of marketing and business development Oz Levia said that ARM and Jasper teamed to develop a set of assertions that they believe form a sufficient condition for correct implementation of the ACE protocol.

Levia said the process started with an English-language specification, from which the team manually derived a set of formal rules and verification tests. These, in turn, the team reduced to SystemVerilog. From there, verification engineers can go in several directions. The SystemVerilog is synthesizable, so users can build it into an RTL test bench or FPGA prototype. It can be used directly in SystemVerilog simulations. And, Levia said, because of the speed and capacity of JasperGold, the assertions can be used to formally verify an ACE implementation. This last option may prove vital, because coherency networks are legendary for concealing bugs despite endless simulation.

The picture, then, is a complex one. New markets and changing requirements have led ARM to develop coherency and distributed virtual memory protocols. To support the protocols the company has revealed new capabilities of the Cortex A-15 core and additional IP cores. And to make the resulting systems verifiable, ARM has worked with Jasper and Cadence on verification IP.

Dimelow warned that all these steps still do not retire all the challenges. In fact, he said, all this is groundwork for a more profound change: from task-centric to communication-centric architectural design. Dimelow envisioned a flow that begins by identifying system-level data transfers, and points at which processors will share data structures. From this map, architects can determine which blocks must be coherent, and which can remain simple. At this point, Dimelow said, systems designers have enough information to start trading processor throughput, memory parameters, and interconnect bandwidths to optimize their system. It is a long way from that microprocessor bus.

 
© 2012 UBM Electronics. All rights reserved.