fpga-cpu · FPGA CPU and SoC discussion list

Group Information

  • Members: 897
  • Category: Microprocessors
  • Founded: Feb 2, 2000
  • Language: English
Message #1411 of 3364
RE: [fpga-cpu] philosophical musing -- P4 vs. FPGA MPSoC

> A CPU programmed in a FPGA is always going to be handicapped
> in clock speed relative to a conventional microprocessor.
> Whats the best we can do currently? 50MHz or so ? Pretty
> dismal against 2GHz for a current high end pentium.

The high-end Pentium 4 approaches 3 GHz now. The ALUs are double-pumped,
so each one can do up to 6 Gops. And there are multiple ALUs.
In practice, you won't see anything like that. For a single cache miss
that goes all the way out to main memory and is an open-page-miss in the
DRAM, the latency could easily be 100 ns. That's 100000 ps / 333 ps =
300 clock cycles or nearly a thousand potential issue slots. They don't
call it the "memory wall" for nothing.
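A rough sketch of that arithmetic in Python. The function name is made up for illustration, and the issue width of 3 per cycle is an assumption chosen to match "nearly a thousand potential issue slots"; the post does not state a width.

```python
def miss_cost(latency_ns, clock_ghz, issue_width):
    """Return (cycles, issue_slots) that pass while one memory access is outstanding."""
    # A clock of N GHz ticks N times per nanosecond.
    cycles = latency_ns * clock_ghz
    # Each cycle could have issued up to issue_width operations (assumed value).
    return cycles, cycles * issue_width


cycles, slots = miss_cost(latency_ns=100, clock_ghz=3.0, issue_width=3)
print(f"{cycles:.0f} cycles, {slots:.0f} potential issue slots")
```

Changing latency_ns to 50 or 150 shows how strongly the DRAM access time dominates the result.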

The high-end FPGA CPU is only ~150 MHz. But you can instantiate
many of them. I have an unfinished 16-bit design in 4x8 V-II CLBs
that does about 167 MHz and includes a pipelined single-cycle
multiply-accumulate. You can put 40 of them in a 2V1000 for a peak
16-bit computation rate (never to exceed) of 333 Mops * 40 = ~13 Gops.
In a monster 2VP100 or 2VP125 you're looking at up to 10X that -- over
50 Gmacs (100 Gops). (Whether your problem can exploit that degree of
parallelism, or whether the part can handle the power dissipation of
such a design, I just don't know.)
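The peak-rate estimate can be checked with a few lines of Python. This is only a sketch: the helper name is hypothetical, a multiply-accumulate is counted as 2 ops per cycle, and the figure of 400 cores for the larger parts is an assumption read off "up to 10X", not a measured fit.

```python
def peak_gops(clock_mhz, ops_per_cycle, cores):
    """Peak (never to exceed) throughput in Gops for identical cores."""
    return clock_mhz * ops_per_cycle * cores / 1000.0


# One multiply-accumulate per cycle counted as 2 ops (multiply + add).
small = peak_gops(clock_mhz=167, ops_per_cycle=2, cores=40)
# Assumption: 10x the core count in the largest parts.
large = peak_gops(clock_mhz=167, ops_per_cycle=2, cores=400)
print(f"40 cores: {small:.1f} Gops, 400 cores: {large:.1f} Gops")
```

These are upper bounds only; they say nothing about whether a workload can keep that many cores busy.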

When the Pentium 4 goes to main memory, it takes 50-150 ns. When the
FPGA CPU multiprocessor goes to main memory, it also takes 50-150 ns.
If the problem doesn't fit in cache, the P4 does not look so good.

Each P4 offers (with the help of a northbridge chipset) external
bandwidth of 3.2 GB/s (64 bits at 100 MHz, quad-pumped). Each 2V1000
offers external bandwidth of at least 8 GB/s (e.g. go configure yourself
four 133 MHz, 64-bit (~105-pin) DDR DRAM channels).
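A minimal sketch of the bandwidth figures, assuming the P4 bus is a 64-bit bus clocked at 100 MHz with four transfers per clock, and that each DDR channel makes two transfers per clock. The function and parameter names are illustrative.

```python
def bandwidth_gb_s(width_bits, clock_mhz, transfers_per_clock, channels=1):
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * channels / 1000.0


p4_bus = bandwidth_gb_s(width_bits=64, clock_mhz=100, transfers_per_clock=4)
fpga_ddr = bandwidth_gb_s(width_bits=64, clock_mhz=133, transfers_per_clock=2, channels=4)
print(f"P4 bus: {p4_bus:.1f} GB/s, four DDR channels: {fpga_ddr:.1f} GB/s")
```

At roughly 105 pins per channel, the four-channel configuration would use on the order of 420 I/O pins.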

When the Pentium 4 mispredicts a branch, it takes many, many (up to ~20)
cycles to recover. When the FPGA CPU core takes a branch (or not), it
wastes 0 or 1 cycles. If you are spending cycles parsing text, the
random nature of the data can eliminate many of the benefits of a
deeeeeeeeeeeeeeeeeep pipeline.
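To see how branch behavior eats into a deep pipeline, here is a simple cycles-per-instruction model. Every input below (base CPI, branch frequency, misprediction rate) is a hypothetical number picked for illustration; only the penalties of ~20 cycles and 0 or 1 cycles come from the text above.

```python
def cycles_per_instruction(base_cpi, branch_fraction, penalized_fraction, penalty_cycles):
    """Average CPI once branch penalties are added to the ideal CPI.

    penalized_fraction is the share of branches that pay the penalty:
    the misprediction rate for a predicting pipeline, or the share of
    branches that cost the extra cycle in a short pipeline.
    """
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


# Hypothetical deep pipeline on hard-to-predict data: 30% of branches mispredicted.
deep = cycles_per_instruction(base_cpi=0.5, branch_fraction=0.2,
                              penalized_fraction=0.3, penalty_cycles=20)
# Hypothetical short pipeline: half of all branches waste one cycle.
short = cycles_per_instruction(base_cpi=1.0, branch_fraction=0.2,
                               penalized_fraction=0.5, penalty_cycles=1)
print(f"deep pipeline CPI: {deep:.2f}, short pipeline CPI: {short:.2f}")
```

This compares cycles only and ignores the large clock-rate difference; it just shows how much of the deep pipeline's time can go to recovery when the data is unpredictable.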

If I had to run Office, I'd rather have a P4.

If I had to classify XML data on the wire at wire speed, I'd rather have
an FPGA MPSoC or a mesh of same.

I think most of you will enjoy this lecture:
http://abp.lcs.mit.edu/6.823/lectures/lecture21.pdf.



> But what if instead of compiling to a pre-determined machine
> language, you generate a custom processor targeted at a
> single application? If the logic for large chunks of C code
> became the instructions of this single-use processor, the
> competitive tables might be turned.

This doesn't strictly answer the question, but a long time ago when we were
all writing and tuning p-code interpreters, the question of instruction
set compression came up. Hey, why not tune the p-code instruction set
for the application? If this application uses particular constants a
lot, or a lot of this kind of function call, or even
multi-syllable-instructions like push0-push0-call, then you could encode
those sorts of things more efficiently in a single one-byte opcode.
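The idea of folding a frequent sequence into one opcode can be sketched as follows. This is not any particular interpreter's tooling: the opcode names, the tiny program, and both helper functions are hypothetical, and operands are ignored so that each opcode counts as one byte.

```python
from collections import Counter


def most_common_sequence(code, length):
    """Return (sequence, count) for the most frequent run of `length` opcodes."""
    windows = (tuple(code[i:i + length]) for i in range(len(code) - length + 1))
    counts = Counter(windows)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


def fuse(code, sequence, fused_name):
    """Replace each non-overlapping occurrence of `sequence` with one fused opcode."""
    out, i, n = [], 0, len(sequence)
    while i < len(code):
        if tuple(code[i:i + n]) == tuple(sequence):
            out.append(fused_name)
            i += n
        else:
            out.append(code[i])
            i += 1
    return out


program = ["push0", "push0", "call", "add", "push0", "push0", "call", "ret"]
seq, count = most_common_sequence(program, 3)
fused = fuse(program, seq, "push0_push0_call")
print(seq, count, len(program), "->", len(fused))
```

Each fused opcode shortens the interpreted image but needs its own handler in the interpreter, so the saving has to be weighed against that growth.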

Back in those days memory was king, believe me. If you didn't fit into
the 60K or 100K or 200K budget, you were toast. Ever swapped in
overlays from floppy disks?

So anyway, you would have to look at total memory footprint. To the
extent you optimized the instruction set to get the interpreted image
down, you might unintentionally grow the p-code interpreter itself. At
the very extreme would be a p-code interpreter optimized for running
(your favorite application) with one 0-bit instruction -- "run app".
However, in that case the interpreter would just be the application
written in native code...
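The trade-off can be put as a small accounting check. The byte counts below are invented for illustration, and the helper names are hypothetical.

```python
def total_footprint(image_bytes, interpreter_bytes):
    """Memory needed for the interpreted image plus the interpreter that hosts it."""
    return image_bytes + interpreter_bytes


def worth_adding(uses, bytes_saved_per_use, handler_bytes):
    """True if a new opcode shrinks the image by more than it grows the interpreter."""
    return uses * bytes_saved_per_use > handler_bytes


# Made-up numbers: a fused opcode saves 2 bytes per use and costs a 40-byte handler.
before = total_footprint(image_bytes=60_000, interpreter_bytes=8_000)
after = total_footprint(image_bytes=60_000 - 50 * 2, interpreter_bytes=8_000 + 40)
print(before, after, worth_adding(50, 2, 40), worth_adding(10, 2, 40))
```

With 50 uses the opcode pays for itself; with only 10 uses the handler costs more than it saves.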

So the moral is to consider the total footprint of the thing to be
hosted PLUS the footprint of the host itself. If speed is an issue, I think
a scalar RISC is a good match for an FPGA. If speed is not an issue, a
stack machine backed by BRAM is also a good match for an FPGA. If power
is an issue ... .

Jan Gray, Gray Research LLC





Fri Oct 25, 2002 10:30 pm

gray_researc...

Hi. One way to look at RISC is that if you can compile to microcode, your program will run faster. So if the tools hide the intricacy of programming at that...
Campbell, John (john.campbell@...), Oct 25, 2002, 6:48 pm

... See Below. ... Funny I thought that was what CISC computers were about. The fact that RISC machines seem faster is because serial access of memory is faster...
ben franchuk (woodelf1), Oct 25, 2002, 9:52 pm

... In a way you're right, in the sense of more powerful instructions provided by CISC. ... Partly. Also because it takes more engineering time to optimize a...
Campbell, John (john.campbell@...), Oct 25, 2002, 11:31 pm

... Indeed... I've been thinking lately about how people see async logic as a frightening thing, but potentially useful for power consumption, reduced pin ...
Jason Watkins (jason_watkins98), Oct 26, 2002, 5:53 pm

... I want big hungry power use :) Where is a TUBE logic FPGA when you need it. If I was designing a system I want constant power use rather than power...
Ben Franchuk (woodelf1), Oct 26, 2002, 6:25 pm

... Not even close. Some of them draw 30 amps or more *continuously*....
Eric Smith (jdripper), Oct 28, 2002, 8:22 pm

My ISP is changing their setup so my xr16 in JHDL (xr16vx) page will change. After Dec. 1 it's only at: http://users.easystreet.com/mbutts/xr16vx_jhdl.html ...
Mike Butts (reconfigurab...), Oct 29, 2002, 4:26 am

... Ok I am off a BIG amount. I better stop looking at Z80 and 6502 and 6809 data sheets and go with something newer. :)...
ben franchuk (woodelf1), Oct 29, 2002, 7:02 am

... But has anybody designed an architecture that is easy to build user instructions from? I am thinking of something like plumbing where you have a bunch of ...
Ben Franchuk (woodelf1), Jan 29, 2003, 7:05 pm
