Analyst Loring Wirbel covers programmable logic from an application perspective, providing a sneak peek at the vertical applications that help drive FPGA complexity, performance, and density. The blog will feature videos allowing engineers to spotlight their latest designs, along with news of products and corporate trends at FPGA vendors and the developers of third-party tools for programmable logic.

Friday, October 16, 2009

Fermi-FPGA fights?

Oct 16 2009 4:04PM


I point readers to a new blog item at HPCwire, in which Michael Feldman suggests that Nvidia’s new Fermi graphics architecture could directly displace FPGAs in high-end simulation and rendering engines. I’ve seen instances where previous Nvidia CUDA generations, on an add-in graphics card, have displaced multiple parallel DSP processors. I’ve also seen instances where specialized DSP-intensive architectures from Xilinx or Altera have displaced DSPs.

But an Nvidia/FPGA battle for sockets is a notion I had not considered. My gut instinct tells me few FPGAs have proven themselves well enough in graphics algorithms to do direct battle with a GPU giant like Nvidia. But maybe that view is behind the times. Readers, is Feldman on to something here?

Reader Comments



at 10/29/2009 3:39:38 PM, CUDA-VIS said:
There are many instances where CUDA-enabled GPUs have replaced ASICs and FPGAs. Is this really a well-kept secret? The benefits for the designs have been huge. I know of at least 3 instances where more than 10k lines of RTL have been replaced with 30 to 100 lines of C for CUDA code (not graphics either!). How can that be? Well, NVIDIA did the architecture, RTL, verification, backend, and debugging for you – you just need to write a parallel C implementation. One design we did took 2 weeks to replace an FPGA system that took 4 man-years to build (RTL and board design) and had a 10x improvement in performance. NVIDIA has >1000 engineers dedicated to just hardware. I don’t know of any FPGA team that can put over a billion dollars per year of R&D into developing HW. NVIDIA has guys that only optimize 100 lines of RTL their whole lives! A 100x reduction in development effort: there haven’t been many times in history that this has happened.

What about performance though? All of the designs have seen >10x gains in performance. With the new Fermi chip having >1000 Gflops single precision, >500 Gflops double precision, >1000 Gops, >1000 Gbps of bandwidth to external memory, and >100 Tbps of internal bandwidth, is this surprising? With 3.6 billion transistors and 1000s of engineers working to make it the fastest compute system in the world, it will be very hard for any team to beat. What do 3.6 billion transistors equate to in gates? A very quick estimate is 6 transistors per gate, or 600 million gates – you would need a 6 billion gate FPGA to duplicate it, assuming roughly 10 FPGA gates per ASIC gate (more, since it runs at >1.5 GHz vs 400 MHz). Well, you can say there is lots of space that is not used for compute? I've heard about 1/4 is used for non-compute – that still takes an FPGA of over 4 billion gates to duplicate the compute side. Well, the price must suck? Well, this is based on a consumer product that ships 150 million (yes, 150 million!) units per year in a very competitive environment. What is the price range for Virtex-5 on Avnet? $250 to $19k, vs $59 to $1200 for full subsystem GPUs (memory, power, and chip). The external bandwidth is excellent with PCIe 2.0 x16 and two copy engines – that is 128 Gbps of bandwidth, counting both directions.
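The back-of-envelope numbers in this comment can be reproduced with a few lines of Python. This is only a sketch of the commenter's arithmetic: the 10x FPGA-to-ASIC gate factor is an assumption inferred from the jump from 600 million to 6 billion gates, and the PCIe figure assumes PCIe 2.0 signaling at 5 GT/s per lane with 8b/10b encoding. The variable names are made up for illustration.

```python
# Rough gate-count estimate, using the figures quoted in the comment.
TRANSISTORS = 3_600_000_000
TRANSISTORS_PER_GATE = 6            # the commenter's "very quick estimate"
FPGA_GATES_PER_ASIC_GATE = 10       # assumption: implied by 600M -> 6B gates

asic_gates = TRANSISTORS // TRANSISTORS_PER_GATE
fpga_gates = asic_gates * FPGA_GATES_PER_ASIC_GATE

# About 1/4 of the chip is said to be non-compute, so 3/4 remains.
compute_fpga_gates = fpga_gates * 3 // 4

# PCIe 2.0 x16: 16 lanes, 5 GT/s per lane, 8b/10b encoding (8 data bits per 10).
LANES = 16
GIGATRANSFERS_PER_LANE = 5
per_direction_gbps = LANES * GIGATRANSFERS_PER_LANE * 8 // 10
both_directions_gbps = 2 * per_direction_gbps

print(f"ASIC-equivalent gates:  {asic_gates:,}")
print(f"FPGA gates (whole die): {fpga_gates:,}")
print(f"FPGA gates (compute):   {compute_fpga_gates:,}")
print(f"PCIe 2.0 x16: {per_direction_gbps} Gbps each way, "
      f"{both_directions_gbps} Gbps total")
```

The clock-rate gap (>1.5 GHz vs 400 MHz) is left out of the sketch; folding it in would push the FPGA estimate higher still.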

What is the downside? Product life (it just takes a bit of $ to cover, and CUDA is always backwards compatible); no RTOS support (INtime can solve most of these issues, and I'm sure NVIDIA would take NRE to add one if needed (hint)); power (there are many performance levels of GPUs to choose from); temp range (external companies can take care of that); IO (a Spartan FPGA with a PCIe interface is more than enough for most); and the need for a CPU.

I believe many of these shortcomings will be taken care of in the future or are non-issues for most designs.

Search "CUDA ZONE NVIDIA" on Google and see all the applications that have been ported; speedup factors of 10x to 100x over a Core i7 are common.




at 10/31/2009 7:59:13 PM, www.ConcurrentEDA.com said:
As for design complexity, FPGAs are far more complex to design than writing software.

**IF** a GPU can be used (size, power, heat) and if the application can work on a GPU, then by all means, use a GPU or use an N-Core processor.

**IF** an FPGA can be used and if you can find an FPGA core or a team of FPGA designers, then it may be best.

But can a GPU always be used? No.
Can an FPGA always be used? No.

It all depends on the type of parallelism that you can exploit.

If you have pure data parallelism and a high compute/memory ratio, then both an FPGA and a GPU can be used. The GPU will take the kernel and spread it across multiple GPU processors. If you have an FPGA, then a pipeline of computations is created by the hardware designer and data is streamed through the hardware in a computational assembly line.
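The two styles can be contrasted with a small Python sketch. It only imitates the structure in software: the arithmetic in `kernel` and the three stage functions are hypothetical, chosen so that both versions compute the same thing. The first version applies one kernel to every element independently, as a GPU would with one thread per element. The second splits the same work into stages and streams the data through them, the way a hardware pipeline would.

```python
def kernel(x):
    # The whole computation for one element (hypothetical arithmetic).
    return (x * 3 + 1) ** 2

def gpu_style(data):
    # Every element is independent, so each call could run on its own thread.
    return [kernel(x) for x in data]

# The same computation split into three pipeline stages.
def stage_scale(stream):
    for x in stream:
        yield x * 3

def stage_offset(stream):
    for x in stream:
        yield x + 1

def stage_square(stream):
    for x in stream:
        yield x ** 2

def fpga_style(data):
    # Data flows through the chained stages like an assembly line.
    return list(stage_square(stage_offset(stage_scale(data))))

print(gpu_style([0, 1, 2]))
print(fpga_style([0, 1, 2]))
```

In real hardware all three stages would be busy at once on different data items; the generators here only show how the work is divided.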

If there is a large amount of code that is executed frequently, then the FPGA area may be large or too large. End of story, start coding the GPU.

If you have two loops, a data parallel outer loop with a sequential inner loop, then a GPU will work well. An FPGA will also work well by replicating the logic that implements the inner loop.
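As a rough illustration of this loop shape, the sketch below uses a made-up recurrence for the inner loop. Each inner step depends on the previous one, so the inner loop must run in order, but the outer iterations are independent and can be farmed out to parallel workers. A thread pool stands in for the GPU threads or replicated FPGA logic; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def inner_sequential(seed, steps):
    # Loop-carried dependency: step k needs the value from step k-1.
    value = seed
    for _ in range(steps):
        value = (value * 5 + 3) % 17
    return value

def outer_serial(seeds, steps):
    # Reference version: one outer iteration after another.
    return [inner_sequential(s, steps) for s in seeds]

def outer_parallel(seeds, steps):
    # Outer iterations share no state, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: inner_sequential(s, steps), seeds))

seeds = [1, 2, 3, 4]
print(outer_serial(seeds, 4) == outer_parallel(seeds, 4))
```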

If you have two loops, a non-data parallel outer loop with a data parallel inner loop, then the parallelism is limited for GPU acceleration. Unless the inner loop iterates a large number of times, peak performance will not be achieved. However, an FPGA will perform very well, as its parallelism is dependent on creating a pipeline. In this case the pipe can be created and the parallelism is related to the depth of the pipe.
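A small sketch of this case, under simplifying assumptions. The update rule in `step` is made up; what matters is that each outer step needs the previous step's result, while the elements inside one step are independent. The two cycle-count functions are an idealized textbook model (one item enters the pipe per cycle, no stalls), not a measurement of any real device.

```python
def step(state):
    # Inner loop: every element is updated independently (data parallel).
    return [(x * 2 + 1) % 11 for x in state]

def run(state, outer_steps):
    # Outer loop: step k consumes the output of step k-1 (not data parallel).
    for _ in range(outer_steps):
        state = step(state)
    return state

def pipeline_cycles(items, depth):
    # Idealized pipeline: 'depth' cycles to fill, then one result per cycle.
    return depth + items - 1

def unpipelined_cycles(items, depth):
    # Without pipelining, each item occupies all 'depth' cycles by itself.
    return items * depth

print(run([0, 1], 2))
print(unpipelined_cycles(1000, 8) / pipeline_cycles(1000, 8))
```

In this model the speedup approaches the pipeline depth as the number of items grows, which is one way to read the remark that the parallelism is related to the depth of the pipe.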

Have GPUs stolen the thunder of parallelism from FPGAs? In some cases, heck yes. In others, not a chance. For some, this is a Mac vs. PC argument that won't change anyone's view. But it all boils down to the code. But are there now two players? Yes, and more are joining the dance.

Search "Concurrent Analytics" to see how code can be analyzed to determine which is best.



©1997-2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.