Exploring the Xilinx Zynq: software platform, or complex FPGA?
One of the enduring challenges of FPGAs with embedded CPUs has been the connection between the processor and the programmable fabric. A conversation at Hot Chips earlier this month with Xilinx vice president Vidya Rajagopalan suggested that Xilinx’s forthcoming entry in this rather exclusive derby, the Zynq 7000, is to be no exception.
There have been two dominant approaches to the interconnect problem, each reflecting how the vendor perceived the product. If the vendor saw the programmable fabric as a blank slate onto which customers could write whatever they pleased, the interface tended to be wide and general-purpose, offering enormous potential bandwidth at the cost of considerable complexity. An example might be Altera’s Excalibur product. Alternatively, if the vendor saw the fabric as simply a place to assemble controllers or bus bridges in building-block fashion, the interface tended to be a series of standard interfaces (AMBA or Wishbone, for instance) ending at stubs in the programmable routing. An example might be the QuickLogic QuickMIPS.
Ostensibly, Zynq is something different: a blank slate on which users write software, some of which will execute on the chip’s two hard ARM Cortex-A9 cores, and some of which will be implemented on accelerators in the chip’s logic fabric. This approach must have presented a bit of a puzzle to the Xilinx architects. Do you create a fully general interface between the A9 cluster and the fabric? If so, at what level: the A9’s coprocessor port? The L2 controller? The coherency engine? Or do you implement a more restrictive but familiar interface, and if so, which one?
Part of the complexity of the problem lies in the level of abstraction Xilinx has in mind. The main purpose of the fabric, aside from the odd protocol controller, is to implement accelerators inferred from the software. But the software contains little information from which anyone could infer an interface structure to go between the A9 cluster and the accelerator.
Presumably in an attempt to bound the problem and render it familiar to SoC architects, Xilinx chose to implement the interface using a standard AMBA 3 bus structure. The exact structure is described in various ways in different places, but it appears that there are two switch matrices: one AMBA 3 switch for general peripherals and one AXI 3 switch grouped around the DRAM controller. The peripheral-side switch appears to have five ports for the CPU cluster, a port for each of the eight hard I/O controller blocks, one for a hard static memory controller, and four that end as stubs in the programmable fabric. The second switch has two CPU ports, two ports for the hard DRAM controller, and five ports ending in the fabric.
One of these five ports serves as the Accelerator Coherency Port (ACP). This port provides additional protocol allowing an accelerator to snoop the processor cluster’s L1 and L2 caches, but not, apparently, the cluster’s On-Chip Memory. The intent is for a CPU task to leave a control and data block in cache, and for an accelerator in the programmable fabric to read the block directly from cache, avoiding a write-back to DRAM. The protocol is not symmetric, though: CPU reads and writes do not snoop memory in the fabric. So accelerators are not fully coherent with the system memory.
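The hand-off described above can be sketched from the CPU side in C. This is only an illustration under stated assumptions: the `accel_job` layout, the `prepare_job` and `ring_doorbell` helpers, and the `fake_doorbell` variable are all hypothetical, and a plain variable stands in for what would be a memory-mapped register in the fabric. The point it tries to show is that the CPU fills the block with ordinary cached stores and performs no cache clean before notifying the accelerator, since the ACP read is expected to be served from cache.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define JOB_PAYLOAD_BYTES 64u

/* Hypothetical control-and-data block; not a Xilinx-defined format. */
struct accel_job {
    uint32_t opcode;                     /* operation the accelerator should run */
    uint32_t length;                     /* number of valid payload bytes */
    uint8_t  payload[JOB_PAYLOAD_BYTES]; /* input data, left in cache */
};

/* Stand-in for a doorbell register in the fabric (assumption: on real
 * hardware this would be a fixed address taken from the design's memory map). */
volatile uint32_t fake_doorbell;

/* Fill the block with ordinary cached stores.
 * Returns 0 on success, -1 if the data does not fit. */
int prepare_job(struct accel_job *job, uint32_t opcode,
                const void *data, uint32_t length)
{
    if (length > JOB_PAYLOAD_BYTES)
        return -1;
    job->opcode = opcode;
    job->length = length;
    memcpy(job->payload, data, length);
    return 0;
}

/* Hand the accelerator the block's address. There is deliberately no
 * cache clean here; a non-coherent port would need one at this point. */
void ring_doorbell(const struct accel_job *job)
{
    /* Order the descriptor stores before the doorbell write. */
    atomic_thread_fence(memory_order_release);
    /* Addresses are 32 bits wide on a Cortex-A9; the cast truncates on a 64-bit host. */
    fake_doorbell = (uint32_t)(uintptr_t)job;
}
```

Because the protocol is not symmetric, the return path would look different: a result that the accelerator keeps in fabric memory is not snooped by the CPU, so the CPU would have to read it explicitly through one of the other ports.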
With this variety of ports, including general-purpose AMBA I/O ports, AXI ports with access to the DRAM controller, and one dedicated ACP, Xilinx has covered quite a range of possible structures in the programmable fabric. The company plans to provide AMBA interface IP, presumably including an ACP client, to ease implementation of connections to the AMBA structure.
That still leaves a gap between the vision of a C-driven virtual processing platform and the reality of architecting a complex AXI-based set of accelerators. Rajagopalan said that Xilinx’s AutoESL tool could infer some interfaces and buffers. But much of the task of isolating hot code segments, architecting accelerators, and fitting the accelerator’s control and data flows into Zynq’s AMBA/AXI structure will require skills in system architecture and a solid understanding of AXI 3. It’s not all about the software yet.
John Bass - DMS Design commented:
And there is a third choice rich with opportunity ... OpenMP ported to the fabric where the C/C++ language becomes a rich language for expressing hundreds/thousands of tight threads ... where the FPGA with embedded cores and C to hard logic netlists execution is just one of hundreds of viable execution platforms over time.
No language extensions necessary ... or WANTED! C/C++ with OpenMP is strictly all that is needed ... from there the programmer just needs to be aware that dynamic memory pools are a shared resource and bottlenecks ... and to carefully extract bandwidth by using local memories/register/logic with good C subset programming.
See openmp.org
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer.
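As a rough illustration of the style this comment has in mind, here is a small C kernel annotated with a standard OpenMP directive. The function name `sad_u8` and the choice of kernel (a sum of absolute differences) are assumptions made for the example; nothing here claims that any existing tool maps such code onto the Zynq fabric. Built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two byte blocks.
 * Each iteration is independent, so the loop can be split across
 * threads; the reduction clause combines the per-thread partial sums. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}
```

The loop touches only its two input arrays and a local accumulator, which is the kind of restrained C subset the comment recommends for keeping traffic off shared memory pools.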
ron commented:
Meredith:
I think that's an excellent point. The gap between what programming languages can express naturally and what an FPGA synthesis tool would need to know is fertile ground for language extensions--or, who knows, new languages.
ron
Meredith Poor commented:
From a programmer’s standpoint, there are two ends of the spectrum. One end is the ‘custom opcode’, which would be an instruction with various implicit or explicit addresses in-line to the instruction or in registers. Had the application loaded a buffer full of binary information and run a ‘compress’ function, leaving the result in another buffer, the opcode analogy works nicely.
The other is the ‘volatile’ region, in which inputs on various pins percolate through the fabric, and are either visible in some transformed state at certain memory addresses or trigger interrupts when one or more criteria are met. This would be ‘listener’, and usually the interrupt would present some value at a port address and lock a ‘region of interest’. The interrupt handler would transfer the contents of the locked region and then release it, allowing the fabric to toggle bits in the buffer.
In the first example the CPU takes the initiative, in the second the CPU is the passive recipient of a trigger. In either case traditional C language constructions would probably need some hints that would map variables to registers or counters, and operators (custom equivalents to ‘+’, ‘-’, ‘*’, etc.) to banks of ALU elements. This would suggest that the C language is going to take a step beyond ‘++’, and support structures and operators never envisioned by language designers.
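The two ends of the spectrum described in this comment can be sketched in plain C as they might look from the CPU side. Everything here is hypothetical: the `fabric_regs` layout, the register names, and the `run_opcode` and `listener_isr` functions are invented for illustration, and the sketch assumes the fabric clears `done` whenever `start` is written.

```c
#include <stddef.h>
#include <stdint.h>

#define REGION_BYTES 16u

/* Hypothetical register view of a fabric block. */
struct fabric_regs {
    volatile uint32_t start;   /* CPU writes 1 to launch an operation */
    volatile uint32_t done;    /* fabric sets 1 when the result is ready */
    volatile uint32_t locked;  /* fabric sets 1 while the region is frozen */
    volatile uint8_t  region[REGION_BYTES]; /* the 'region of interest' */
};

/* 'Custom opcode' end: the CPU takes the initiative, launches the
 * operation, and polls until the fabric reports completion.
 * Returns 0 on completion, -1 if max_polls reads saw no completion. */
int run_opcode(struct fabric_regs *regs, unsigned max_polls)
{
    regs->start = 1;
    while (max_polls-- > 0) {
        if (regs->done)
            return 0;
    }
    return -1;
}

/* 'Listener' end: called from the interrupt path. Copies the frozen
 * region out, then releases the lock so the fabric may resume updating
 * the buffer. Returns the number of bytes copied, or 0 if not locked. */
size_t listener_isr(struct fabric_regs *regs, uint8_t *out)
{
    if (!regs->locked)
        return 0;
    /* Byte-wise copy: memcpy is not appropriate for volatile storage. */
    for (size_t i = 0; i < REGION_BYTES; i++)
        out[i] = regs->region[i];
    regs->locked = 0;
    return REGION_BYTES;
}
```

Note that `volatile` is the only hint standard C offers here. It says nothing about mapping variables to counters or operators to ALU banks, which is the gap the comment points to.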















