Search the web
Sign In
New User? Sign Up
fpga-cpu · FPGA CPU and SoC discussion list
? Already a member? Sign in to Yahoo!

Yahoo! Groups Tips

Did you know...
Want to share photos of your group with the world? Add a group photo to Flickr.

Best of Y! Groups

   Check them out and nominate your group.
Having problems with message search? Fill out this form to ensure your group is one of the first to be migrated to the new message search system.

Messages

  Messages Help
Advanced
CISCifying RISC calls and returns   Message List  
Reply | Forward Message #746 of 3302 |
RE: [fpga-cpu] Re: CISCifying RISC calls and returns

Various ideas and comments on fast context switches in FPGA CPUs:

(Please treat this entire article with a big "If I Recall Correctly"
prefix -- I wrote this quickly and I haven't taken the time to check my
facts below.)


First, let us tip our caps to the designers of

* the Xerox Alto [Chuck Thacker et al] -- microtask switch as often as
each cycle, doing I/O tasks in the same gates as the CPU;

* the Inmos Transputer [who?]: fast task switch -- stack architecture
with limited state to switch);

* the Denelcor HEP and the Cray (nee Tera) MTA [Burton Smith et al]:
(multiple thread contexts with thread switch each cycle);

* and the i960CA [Steve McGeady et al] (fast reg file save/reload via
wide buses).


1. Thinking about FPGA CPUs, one can, of course, use block RAMs (BRAMs
on Xilinx, ESBs on Altera) as really deep register files, vector
register files, windowed register files, or multi-context register
files. (See The Myriad Uses of Block RAM,
http://www.fpgacpu.org/usenet/bb.html.)

Block RAM register files tend to be slower and less-ported than register
files built with LUT RAM. On Altera, which lacks LUT RAMs, you might as
well pursue one of these avenues, and that's just what Altera Nios does
-- register windows. (See also
http://www.fpgacpu.org/usenet/altera_cpus.html and
http://www.fpgacpu.org/usenet/altera_cpus_dual_port_EABs.html: "EABs
could certainly excel at building LARGE register files (e.g. for vector
registers or multiple thread contexts or register windows) ... " (1997)

There is a duality between a windowed reg file and a multi-context reg
file. If I am not mistaken, you can do limited fast context switches on
a SPARC with a window rotation or two (8 global registers
notwithstanding). Perhaps the same idea would work on Altera Nios.


2. One can also build a simple barrel processor (say 4 threads (slots) x
32 regs = 128 entries of 32-bits = 2 16-bit ports on a single 256x16
BRAM, tripled cycled, or two BRAMs double cycled) and switch threads on
each cycle. Then you can have a 4-deep pipeline without need for any
result forwarding muxes (by the time you read an operand on thread[i],
you have already retired that threads' previous result to the register
file).

This seems to me to be a perfectly simple and practical basis to issue
instructions faster than the ALU + result forwarding mux + operand
register recurrence critical path. Unfortunately single-thread
performance is not so hot but in workloads such as a "network
processing", who cares?

This idea was taken to sublime levels in the ill-fated 20-stage
pipelined 5-threaded 1 GHz MicroUnity MediaProcessor (which would have
needed *some* result forwarding, but not 18 stages worth).


3. You can do the same thing (multiple context register files) with LUT
RAM, of course. In fact, it is quite trivial to make the xr16 (with its
*PC/DMA* register *file* LUT FRAM) multi-threaded, so long as you divvy
up the available 16 general purpose registers to the available threads
(or make the general purpose register file larger). Just don't switch
threads on interlocked instructions, such as immediate prefix, which are
stateful between instructions. (Or, of course, make the imm prefix
register a register file too -- not a good use of LUTs though.)

(Of course, in the context of the XSOC system, the xr16 uses this
facility to do invisible, cycle-stealing DMA transfers using the xr16's
PC reg file and PC incrementer.)


4. The old superscalar i960CA achieved a fast procedure call/ret by
having a wide (128 bit) and fast bus between the 6 ported reg file and
the internal RAM and register file cache. On a CALL it could save the 16
local registers in just 4 cycles. This is entirely feasible in a
pedal-to-the-metal multithreaded FPGA CPU using (say) 4 BRAMs configured
each as 2x128x16 (or else 2 BRAMs at 2x256x32 in Virtex-II).


5. Like the good old Transputer, you can build a stack machine backed by
BRAM, so that a context switch is simply saving/restoring the stack
pointer register and perhaps a very few other task/process related
registers.


6. Finally, I have a (new?) wacky idea for doing fast context switches
using a LUT RAM register file backed by BRAM. (I don't like this idea
enough to actually try it, but you may find it interesting anyway.)

Assume we can't or won't use a purely BRAM-based multi-context register
file because it is not as fast or as multiported as we want (esp. if we
are doing a 2-issue super or an LIW -- BTW I sketched a simple 2-issue
6R2W-register file LIW 7 years back -- see the latter half of
http://www.fpgacpu.org/usenet/homebrew.html). No, we must use a
single-thread-context LUT RAM register file. In that case, on a context
switch, we would like to save the current reg file from LUT RAM to BRAM
and reload the new threads' reg file from BRAM into LUT RAM.

First I must note the idea (that follows) is a win only if the new
thread only reads a subset of its registers before another context
switch occurs. But that's fine. If you aren't going to switch threads
very often, the amortized cost of the context switch is insignificant.
If you are going to switch threads as often as 20-100 instructions, then
this idea might pay off for you.

Here's the idea. For concreteness, assume 8 threads of 32 registers,
with a 32x32 LUT RAM reg file and a 256x32 BRAM-based 8-thread-context
backing store. Build 32 "valid register" flip-flops. On a context
switch, these can be reset in one cycle. For each read port into the
reg file, build a 32-1 mux to fetch that port's register's valid bit.
For each write port into the reg file, allocate a corresponding write
port into the BRAM. As each instruction result is retired into the LUT
RAM, it is also retired into the BRAM.

After a context switch, all valid register flops are cleared. Then on
an instruction like "add r3,r1,r2", we'll find that r1 and r2 are not
yet valid (present) in LUT RAM and stall and fetch them from the
BRAM-based multi-context reg file backing store. This may well take a
cycle or two per register "read miss" (perhaps fewer if you do heroic
things with double-cycled LUT RAMs and multiple BRAM ports).

(Again, remembering the duality of thread context switch and function
call, I note that this same mechanism can be used to do very fast
function call/return -- on each CALL or RET update the block RAM
register window address counter and clear all the register file valid
bits. This provides all the fast call/return benefits of deep register
windows plus the benefits of a fast small register file. You never need
to save registers in a function prolog (because they're always
concurrently retired into the backing store), and you never need to
reload registers in an epilog (they'll be reloaded on demand in the
return site continuation). Again, in typical C/C++/Java code, with a lot
of function calls, "much of the time" (hand waving) you typically don't
read more than a quarter of the registers in the register file before
making another call or return.)

This idea is an example of the hybrid LUT RAM + BRAM idea I mention in
the aforementioned Myriad Uses of BRAM article/disclosure: "hybrid uses
of large embedded RAM blocks together with smaller distributed RAM
blocks to achieve large storage capacity with highly multiported access
to a subset of that storage".

(By the way: the above valid bit per entry discrete FF + mux trick can
also be used to flash invalidate a small (e.g. LUT-RAM-based) cache tag
array.)


Jan Gray
Gray Research LLC




Fri Nov 23, 2001 6:01 am

jsgray@...
Send Email Send Email

Forward
Message #746 of 3302 |
Expand Messages Author Sort by Date

... Seems like a good idea to me. You wouldn't want this memory cached so it would have to be in a non-cacheable memory area. I've thought of having a memory...
robfinch@...
rtfinch35
Offline Send Email
Nov 23, 2001
3:09 am

Various ideas and comments on fast context switches in FPGA CPUs: (Please treat this entire article with a big "If I Recall Correctly" prefix -- I wrote this...
Jan Gray
jsgray@...
Send Email
Nov 23, 2001
6:04 am

< Rob tips his hat to the designers that have come before, incl. members of this newsgroup. > ... (slots) x ... threads on ... any ... [i], ... register ... ...
robfinch@...
rtfinch35
Offline Send Email
Nov 23, 2001
10:05 am

... The Alto is quite a neat machine, but it does NOT have the potential for doing microtask switches in any cycle. It uses cooperative multitasking, and will...
Eric Smith
jdripper
Offline Send Email
Nov 23, 2001
10:09 pm

... You're right. My mistake. I thought NTASK management was automatically done in hardware (e.g. as higher priority tasks need service), but it is not: ...
Jan Gray
jsgray@...
Send Email
Nov 23, 2001
10:59 pm

... In case it isn't apparent from the literature, it is *important* that the Alto microcode not change tasks at arbitrary times. The memory control requires...
Eric Smith
jdripper
Offline Send Email
Nov 24, 2001
12:46 am

Hi ... I'm told that credit goes to David May. -jc...
Campbell, John
john.campbell@...
Send Email
Nov 26, 2001
5:12 pm
 First  |  |  Next > Last 
Advanced

Copyright © 2009 Yahoo! Inc. All rights reserved.
Privacy Policy - Terms of Service - Guidelines - Help