Various ideas and comments on fast context switches in FPGA CPUs:
(Please treat this entire article with a big "If I Recall Correctly"
prefix -- I wrote this quickly and I haven't taken the time to check my
facts below.)
First, let us tip our caps to the designers of
* the Xerox Alto [Chuck Thacker et al] -- microtask switch as often as
each cycle, doing I/O tasks in the same gates as the CPU;
* the Inmos Transputer [who?]: fast task switch -- stack architecture
with limited state to switch);
* the Denelcor HEP and the Cray (nee Tera) MTA [Burton Smith et al]:
(multiple thread contexts with thread switch each cycle);
* and the i960CA [Steve McGeady et al] (fast reg file save/reload via
wide buses).
1. Thinking about FPGA CPUs, one can, of course, use block RAMs (BRAMs
on Xilinx, ESBs on Altera) as really deep register files, vector
register files, windowed register files, or multi-context register
files. (See The Myriad Uses of Block RAM,
http://www.fpgacpu.org/usenet/bb.html.)
Block RAM register files tend to be slower and less-ported than register
files built with LUT RAM. On Altera, which lacks LUT RAMs, you might as
well pursue one of these avenues, and that's just what Altera Nios does
-- register windows. (See also
http://www.fpgacpu.org/usenet/altera_cpus.html and
http://www.fpgacpu.org/usenet/altera_cpus_dual_port_EABs.html: "EABs
could certainly excel at building LARGE register files (e.g. for vector
registers or multiple thread contexts or register windows) ... " (1997)
There is a duality between a windowed reg file and a multi-context reg
file. If I am not mistaken, you can do limited fast context switches on
a SPARC with a window rotation or two (8 global registers
notwithstanding). Perhaps the same idea would work on Altera Nios.
2. One can also build a simple barrel processor (say 4 threads (slots) x
32 regs = 128 entries of 32-bits = 2 16-bit ports on a single 256x16
BRAM, tripled cycled, or two BRAMs double cycled) and switch threads on
each cycle. Then you can have a 4-deep pipeline without need for any
result forwarding muxes (by the time you read an operand on thread[i],
you have already retired that threads' previous result to the register
file).
This seems to me to be a perfectly simple and practical basis to issue
instructions faster than the ALU + result forwarding mux + operand
register recurrence critical path. Unfortunately single-thread
performance is not so hot but in workloads such as a "network
processing", who cares?
This idea was taken to sublime levels in the ill-fated 20-stage
pipelined 5-threaded 1 GHz MicroUnity MediaProcessor (which would have
needed *some* result forwarding, but not 18 stages worth).
3. You can do the same thing (multiple context register files) with LUT
RAM, of course. In fact, it is quite trivial to make the xr16 (with its
*PC/DMA* register *file* LUT FRAM) multi-threaded, so long as you divvy
up the available 16 general purpose registers to the available threads
(or make the general purpose register file larger). Just don't switch
threads on interlocked instructions, such as immediate prefix, which are
stateful between instructions. (Or, of course, make the imm prefix
register a register file too -- not a good use of LUTs though.)
(Of course, in the context of the XSOC system, the xr16 uses this
facility to do invisible, cycle-stealing DMA transfers using the xr16's
PC reg file and PC incrementer.)
4. The old superscalar i960CA achieved a fast procedure call/ret by
having a wide (128 bit) and fast bus between the 6 ported reg file and
the internal RAM and register file cache. On a CALL it could save the 16
local registers in just 4 cycles. This is entirely feasible in a
pedal-to-the-metal multithreaded FPGA CPU using (say) 4 BRAMs configured
each as 2x128x16 (or else 2 BRAMs at 2x256x32 in Virtex-II).
5. Like the good old Transputer, you can build a stack machine backed by
BRAM, so that a context switch is simply saving/restoring the stack
pointer register and perhaps a very few other task/process related
registers.
6. Finally, I have a (new?) wacky idea for doing fast context switches
using a LUT RAM register file backed by BRAM. (I don't like this idea
enough to actually try it, but you may find it interesting anyway.)
Assume we can't or won't use a purely BRAM-based multi-context register
file because it is not as fast or as multiported as we want (esp. if we
are doing a 2-issue super or an LIW -- BTW I sketched a simple 2-issue
6R2W-register file LIW 7 years back -- see the latter half of
http://www.fpgacpu.org/usenet/homebrew.html). No, we must use a
single-thread-context LUT RAM register file. In that case, on a context
switch, we would like to save the current reg file from LUT RAM to BRAM
and reload the new threads' reg file from BRAM into LUT RAM.
First I must note the idea (that follows) is a win only if the new
thread only reads a subset of its registers before another context
switch occurs. But that's fine. If you aren't going to switch threads
very often, the amortized cost of the context switch is insignificant.
If you are going to switch threads as often as 20-100 instructions, then
this idea might pay off for you.
Here's the idea. For concreteness, assume 8 threads of 32 registers,
with a 32x32 LUT RAM reg file and a 256x32 BRAM-based 8-thread-context
backing store. Build 32 "valid register" flip-flops. On a context
switch, these can be reset in one cycle. For each read port into the
reg file, build a 32-1 mux to fetch that port's register's valid bit.
For each write port into the reg file, allocate a corresponding write
port into the BRAM. As each instruction result is retired into the LUT
RAM, it is also retired into the BRAM.
After a context switch, all valid register flops are cleared. Then on
an instruction like "add r3,r1,r2", we'll find that r1 and r2 are not
yet valid (present) in LUT RAM and stall and fetch them from the
BRAM-based multi-context reg file backing store. This may well take a
cycle or two per register "read miss" (perhaps fewer if you do heroic
things with double-cycled LUT RAMs and multiple BRAM ports).
(Again, remembering the duality of thread context switch and function
call, I note that this same mechanism can be used to do very fast
function call/return -- on each CALL or RET update the block RAM
register window address counter and clear all the register file valid
bits. This provides all the fast call/return benefits of deep register
windows plus the benefits of a fast small register file. You never need
to save registers in a function prolog (because they're always
concurrently retired into the backing store), and you never need to
reload registers in an epilog (they'll be reloaded on demand in the
return site continuation). Again, in typical C/C++/Java code, with a lot
of function calls, "much of the time" (hand waving) you typically don't
read more than a quarter of the registers in the register file before
making another call or return.)
This idea is an example of the hybrid LUT RAM + BRAM idea I mention in
the aforementioned Myriad Uses of BRAM article/disclosure: "hybrid uses
of large embedded RAM blocks together with smaller distributed RAM
blocks to achieve large storage capacity with highly multiported access
to a subset of that storage".
(By the way: the above valid bit per entry discrete FF + mux trick can
also be used to flash invalidate a small (e.g. LUT-RAM-based) cache tag
array.)
Jan Gray
Gray Research LLC