> Small is often good but is the smallest the BEST?
If "smallest" delivers on requirements (e.g. fast enough, C programmable,
has interrupt handling, or what have you), probably yes.
"A small cat is better than a large cat because it eats less, poops less,
and sheds less." "So it follows that the ideal cat is a cat of zero
length?"
As with so many things, the first few resource units provide the essentials.
The rest are luxuries. As you climb the luxury curve, each resource spent
provides less and less additional value. Sometimes supposed luxuries (like
deeper pipelines) make things worse.
If you add up the number of 4-LUTs in a minimal "bare necessities" n-bit
processor datapath, for example,
  Cost    What
  n       1-port 16-entry register file
  n       adder/subtractor
  n       logic unit
  0       TBUF-based immediate mux
  0       TBUF-based operand mux
  ---
  3n
you can build a simple streamlined RISC datapath in only 3n logic cells.
Maybe even 2n if your ALU operation is "add/nand". If you're willing to
multi-cycle it (take k cycles per word) then it's 3n/k or 2n/k.
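To put numbers on those formulas, here is a small Python sketch of the arithmetic. The function name and the rounding up when k does not divide n are my own assumptions for illustration; the text above only gives 3n, 2n, 3n/k and 2n/k.

```python
import math

def minimal_datapath_luts(n, k=1, add_nand_alu=False):
    """Estimate 4-LUTs for the minimal n-bit datapath.

    n            -- word width in bits
    k            -- cycles taken per word (k=1 is the full-width datapath)
    add_nand_alu -- True folds the adder and logic unit into one add/nand ALU
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    # 3 n-wide units (register file, adder/subtractor, logic unit),
    # or 2 units when the ALU only does add/nand.
    units = 2 if add_nand_alu else 3
    # Assumption: a k-cycle datapath is ceil(n/k) bits wide.
    return units * math.ceil(n / k)

for n in (16, 32):
    print(n,
          minimal_datapath_luts(n),
          minimal_datapath_luts(n, add_nand_alu=True),
          minimal_datapath_luts(n, k=4))
```

These counts cover the datapath only, not the control unit.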
But it takes a few cycles to execute even one "RISC instruction" like add
r3,r1,r2:
(Assume r[0] = 0; rPC = 1, i.e. the PC lives in r[1]; r[2] = 2, the PC
increment; bus is the 3-state bus; t is a temp reg; mar is the memory
address register; ir is the instruction register.)
  ; increment PC and fetch insn
  t = bus <- r[2]
  r[rPC] = mar = bus <- r[rPC] + t
  ir = mem[mar]

  ; add instruction
  t = bus <- r[ir.ra]
  t = bus <- r[ir.rb] + t
  r[ir.rd] = t
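The sequence above can be traced in a few lines of Python. This is only a sketch under stated assumptions: each register-transfer line costs one cycle, words are 16 bits, and an instruction is a plain dict with rd/ra/rb fields. None of those details are specified above. The example uses r3, r4 and r5 so as not to collide with the r[1] (PC) and r[2] (constant 2) conventions.

```python
MASK = 0xFFFF  # assumption: 16-bit words
R_PC = 1       # the PC lives in r[1], as in the text

def run_add(regs, mem):
    """Execute one fetch + add sequence; return the number of cycles used."""
    cycles = 0
    # increment PC and fetch insn
    t = regs[2]; cycles += 1                   # t = bus <- r[2]
    mar = (regs[R_PC] + t) & MASK              # r[rPC] = mar = bus <- r[rPC] + t
    regs[R_PC] = mar; cycles += 1
    ir = mem[mar]; cycles += 1                 # ir = mem[mar]
    # add instruction
    t = regs[ir["ra"]]; cycles += 1            # t = bus <- r[ir.ra]
    t = (regs[ir["rb"]] + t) & MASK; cycles += 1   # t = bus <- r[ir.rb] + t
    regs[ir["rd"]] = t; cycles += 1            # r[ir.rd] = t
    return cycles

regs = [0] * 16
regs[2] = 2                   # constant 2, the PC increment
regs[3], regs[4] = 7, 5       # operands
mem = {2: {"rd": 5, "ra": 3, "rb": 4}}   # hypothetical encoding of: add r5,r3,r4
cycles = run_add(regs, mem)
print(cycles, regs[5], regs[R_PC])
```

Note that the PC is incremented before the fetch, so with the PC starting at 0 the instruction is read from address 2.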
If you're only building a toaster SoC, or a toaster channel processor, where
100 kHz frequency would be quite adequate, you might as well build the 3n or
3n/k datapath.
But if that's not fast enough, if you need closer to one instruction per
cycle, you must add resources. The first thing you add is a dedicated PC
register, PC adder/incrementor, and PC mux. Next you add a second read port
to the register file, and perhaps a concurrent write port too. And you add
a result multiplexor to select among the various results (add, logic,
shifts, load-data-in, return address, etc.):
  Cost    What
  2n-4n   2r1w 16-entry register file
  n       adder/subtractor
  n       logic unit
  0-6n    result multiplexer
  n       PC
  n       PC incrementer
  n       PC mux
  ---
  7n-15n
This is a lot more costly, but it now executes approximately one
instruction per cycle.
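As a quick check on that total, the following Python sketch sums the per-unit coefficients from the table and scales them by n. The table contents are from above; the names FAST_DATAPATH and lut_range are made up for this illustration.

```python
# (unit, low coefficient, high coefficient), all in multiples of n 4-LUTs
FAST_DATAPATH = [
    ("2r1w 16-entry register file", 2, 4),
    ("adder/subtractor",            1, 1),
    ("logic unit",                  1, 1),
    ("result multiplexer",          0, 6),
    ("PC",                          1, 1),
    ("PC incrementer",              1, 1),
    ("PC mux",                      1, 1),
]

def lut_range(n):
    """Return (low, high) 4-LUT estimates for the ~1 insn/cycle datapath."""
    low = sum(lo for _, lo, _ in FAST_DATAPATH) * n
    high = sum(hi for _, _, hi in FAST_DATAPATH) * n
    return low, high

for n in (16, 32):
    low, high = lut_range(n)
    print(f"n={n}: {low}-{high} LUTs, versus {3 * n} for the minimal datapath")
```

So the one-instruction-per-cycle datapath costs somewhere between a bit over 2x and 5x the minimal 3n design, before any control logic.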
If you still need more speed, you'll add pipelining to reduce the cycle
time. (But add 2n (or more) for result forwarding muxes for each stage.)
Each new pipeline stage you add will reduce the cycle time until the
diminishing returns set in, possibly due to the extra interconnect delay
incurred by signalling across many result forwarding multiplexers.
If you still need more speed, you'll think about multiple issue,
out-of-order, LIW, custom function units, or perhaps multiple processors on
chip.
Including control unit overhead, etc., xr16 is about 300 logic cells / 16
bits = ~20n overall; xr32 is ~14n overall.
Jan Gray
Gray Research LLC