http://www.cs.ubc.ca/local/reading/proceedings/spe91-
95/spe/vol21/issue8/spe046cm.pdf
Apparently what I've considered static binary translation is
properly called Compiled Instruction Set Simulation, if it
goes via a high level language as an intermediate code rather
than straight to binary...
By the way the work on the z80/ms pacman is coming along nicely.
I have a couple of days comp time due off work soon and I plan
to take it to the stage where it runs over those two days.
The biggest job so far has been doing a disassembly (i.e. identifying
the data bytes in the middle of the code) which would not have been
a problem if I had a working emulator to begin with that I could
have tweaked to record the run paths, but I didn't.
Here's a sample of what the code looks like now:
L008d: // push af ; [008d]
F5
mem[--SP]=RA; mem[--SP]=RF; /* check order */
// ld ($50c0),a ; [008e]
32 C0 50
mem[0x50c0] = RA;
// xor a ; [0091]
AF
RA=0;RF=(AF=Z_FLAG|V_FLAG);
// ld ($5000),a ; [0092]
32 00 50
mem[0x5000] = RA;
// di ; [0095]
F3
IFF1=IFF2=0;
// push bc ; [0096]
C5
mem[--SP]=RB; mem[--SP]=RC; /* check order */
// push de ; [0097]
D5
mem[--SP]=RD; mem[--SP]=RE; /* check order */
// push hl ; [0098]
E5
i=HL; mem[--SP]=i>>8; mem[--SP]=i&255;
// push ix ; [0099]
DD E5
i=IX; mem[--SP]=i>>8; mem[--SP]=i&255;
// push iy ; [009b]
FD E5
i=IY; mem[--SP]=i>>8; mem[--SP]=i&255;
// ld hl,$4e8c ; [009d]
21 8C 4E
HL=0x4e8c;
// ld de,$5050 ; [00a0]
11 50 50
RD=0x50;RE=0x50; /* Order? */
// ld bc,$0010 ; [00a3]
01 10 00
RB=0x00;RC=0x10; /* Order? */
// ldir ; [00a6]
ED B0
BC = (RB<<8)+RC;
DE = (RD<<8)+RE;
do { /* Might later optimise to a memmove() call? */
mem[DE++] = mem[HL++];
} while (--BC);
RF&=0xE9;
RC=BC;RB=BC>>8;
RE=DE;RD=DE>>8;
// ld a,($4ecc) ; [00a8]
3A CC 4E
RA=mem[0x4ecc];
// and a ; [00ab]
A7
RF=ZSPHTable[RA];
// ld a,($4ecf) ; [00ac]
3A CF 4E
RA=mem[0x4ecf];
// jr nz,$00b4 ; [00af]
20 03
if ((!(RF&Z_FLAG))) {goto L00b4;}
Although my Cinematronics recompiler was rather messy, and took a few
weeks to write, I got a itch to scratch tonight about doing a z80
recompiler (primarily for ms pacman, for my gp32) and started on
a translator for it. I found a fairly clean z80 emulator and started
from there... *two hours* later it is starting to look like something
that will work! Made somewhat easier by the Z80 not gratuitously setting
flags too often.
This one is being written following exactly the recipe in my howto doc.
Here's the current state of the output...
(from http://gtoal.com/sbt/z80/mspac.z80, which looks like this...)
L00016E: // ld ($4c2a),bc ; ED 43 2A 4C
M_WRMEM_WORD(0x2a43,R.BC.D);
L000172: // ld ($4c3a),de ; ED 53 3A 4C
M_WRMEM_WORD(0x3a53,R.DE.D);
L000176: // ld hl,$4c22 ; 21 22 4C
R.HL.D=0x4c22;
L000179: // ld de,$4ff2 ; 11 F2 4F
R.DE.D=0x4ff2;
L00017C: // ld bc,$000c ; 01 0C 00
R.BC.D=0x000c;
L00017F: // ldir ; ED B0
M_WRMEM(R.DE.D,M_RDMEM(R.HL.D));
++R.DE.W.l;
++R.HL.W.l;
--R.BC.W.l;
R.AF.B.l=(R.AF.B.l&0xE9)|(R.BC.D? V_FLAG:0);
if (R.BC.D) { R.PC.W.l-=2; }
L000181: // ld hl,$4c32 ; 21 32 4C
R.HL.D=0x4c32;
L000184: // ld de,$5062 ; 11 62 50
R.DE.D=0x5062;
L000187: // ld bc,$000c ; 01 0C 00
R.BC.D=0x000c;
L00018A: // ldir ; ED B0
M_WRMEM(R.DE.D,M_RDMEM(R.HL.D));
++R.DE.W.l;
++R.HL.W.l;
--R.BC.W.l;
R.AF.B.l=(R.AF.B.l&0xE9)|(R.BC.D? V_FLAG:0);
if (R.BC.D) { R.PC.W.l-=2; }
Clearly there are bits requiring some work, but considering how quickly
I was able to get this far, I'm confident it's not a huge amount of work.
Short summary... using this method it should be possible to get a basic
binary translator working over a weekend...
G
At 12:01 AM 1/13/2003 -0600, you wrote:
>The HOWTO has grown from 12 pages to 21 :-)
Very cool! I think you've taken a very good approach to this subject in
your document. It's much more accessible than the academic texts (as you
mention) and goes into detail with plenty of real code samples.
>http://www.gtoal.com/sbt/
This is a must-have addition to my page's links :)
----
Bart
>This info will be stored until a call to flush_basic_block() at which
>point only the minimum necessary statements will be output.
You append statements to a list which allows you to perform some analysis
before flushing it out, right?
In my SH-2 recompiler, I had 2 different lists: Input SH-2 code and output
X86 code.
Redundant flag elimination was performed by scanning the SH-2 list. Then,
the X86 backend would create a list of X86 ops. Some crude peephole
optimizations and redundant load/store elimination were performed and the
result was dumped out as executable X86 code. Each list element
corresponded to an instruction.
Kind of the same idea, I suppose. It seems pretty obvious in retrospect,
but the simplest dynarecs don't do it (they generate code and never store
lists of anything) and it could be tempting to approach static recompilers
the same way -- but whats the point of static recompilation if you're not
going to spend some of that time you've got to perform some sort of
optimizations? You only have to worry about doing it once in a static
recompiler :)
----
Bart
Wow I really appreciate your work, especially to bring this knowledge
to people new to the subject (like me). Thanks :)
--- In staticrecompilers@yahoogroups.com, "Graham Toal <gtoal@g...>"
<gtoal@g...> wrote:
> I have written a first draft of a static binary translation HOWTO.
>
> This is just my personal method; it isn't the state of the art,
> and I had to force myself to keep it to the size that it is as I
> really would have liked to add a lot more detail in several areas!
>
> Despite that, it is rather long, but I do believe that if you read
> it all you really will be able to write a simple translator.
>
> Here is is: http://www.gtoal.com/sbt/
>
> Comments welcome.
>
> G
I have written a first draft of a static binary translation HOWTO.
This is just my personal method; it isn't the state of the art,
and I had to force myself to keep it to the size that it is as I
really would have liked to add a lot more detail in several areas!
Despite that, it is rather long, but I do believe that if you read
it all you really will be able to write a simple translator.
Here is is: http://www.gtoal.com/sbt/
Comments welcome.
G
--- In staticrecompilers@yahoogroups.com, Graham Toal <gtoal@g...>
wrote:
> I hate giving myself a challenge :-) Here's the code ...
I wasn't happy with some aspects of that code and have now cleaned
it up considerably. In fact it now makes quite a nice demonstrator
of the concept, and I'll be freezing it now, and adding it to my web
page.
Here's the final code:
http://www.gtoal.com/athome/utils/deadcode/
G
> dump(unconditionally, 0, 0, " // INC B\n"); /* No code, just
comments */
> dump( conditionally, PC, 0, "PC = 0xA003;\n");
> dump( conditionally, TMP,B, "tmp = B+1;\n");
> dump( conditionally, CC, TMP, "CC = (tmp>>8)&1;\n");
> dump( conditionally, B, TMP, "B = tmp&255;\n");
By the way this is quite typical of real code. I don't think I've posted
it here before, so here: http://www.gtoal.com/athome/tailgunner/macros.h.html
is the guts of my static translator for the Cinematronics CPU. You can
see how easy it would be to add the scheme above to this code, which is
basically just a bunch of printf's that are invoked by something akin
to a smart disassembler...
Here's an excerpt from the LSL instruction:
fprintf(codefile, "%s%s", cur_tabs, "CINEWORD temp_word = 0xFFF;\n");
fprintf(codefile, "%s%s", cur_tabs, "\n");
fprintf(codefile, "%s%s", cur_tabs, "cmp_new = temp_word;\n");
fprintf(codefile, "%s%s", cur_tabs, "acc_a0 = register_A;\n");
fprintf(codefile, "%s%s", cur_tabs, "cmp_old = register_B;\n");
fprintf(codefile, "%s%s", cur_tabs, "\n");
fprintf(codefile, "%s%s", cur_tabs, "flag_C = (temp_word + register_B);\n");
fprintf(codefile, "%s%s", cur_tabs, "register_B <<= 1;\n");
fprintf(codefile, "%s%s", cur_tabs, "register_B &= 0x0FFF;\n");
It's pretty obvious how to modify that to add the optimisations...
> I've started a new job this week and don't have as much time to hack code
> as I usually do; and Bart is off on vacation for a couple of weeks, so don't
> expect to see this code for a month or so - unless someone else feels like
> walking up to the plate and giving it a go...
I hate giving myself a challenge :-) Here's the code ...
G
#define DEBUG 1
// A test program to eliminate redundant code; does *not* do common
// subexpression elimination or constant folding, although in a real
// program the CSE and dead code stuff would best be done by the same program
// We might assume (perhaps wrongly) that we are translating hand-written
// assembly code and that no obvious peephole optimisation is possible.
/* Output:
04 lookfor: 04 Item 2 - def 04
24 lookfor: 0e Item 21 - def 04
24 lookfor: 0a Item 18 - def 02
17 lookfor: 04 Item 15 - def 04
14 lookfor: 10 Item 13 - def 10
12 lookfor: 04 Item 10 - def 04
09 lookfor: 10 Item 8 - def 10
07 lookfor: 04 Item 2 - def 04
24 lookfor: 08 Item 14 - def 08
13 lookfor: 10 Item 13 - def 10
12 lookfor: 04 Item 10 - def 04
09 lookfor: 10 Item 8 - def 10
07 lookfor: 04 Item 2 - def 04
000 [def:00] [use:00] // LDB #42
001 [def:01] [use:00] // PC = 0xA000;
002 [def:04] [use:00] B = 42;
003 [def:00] [use:00] // INP A,B
004 [def:01] [use:00] // PC = 0xA002;
005 [def:02] [use:04] A = fninput(B);
006 [def:00] [use:00] // INC B
007 [def:01] [use:00] // PC = 0xA003;
008 [def:10] [use:04] tmp = B+1;
009 [def:08] [use:10] // CC = (tmp>>8)&1;
010 [def:04] [use:10] B = tmp&255;
011 [def:00] [use:00] // INC B
012 [def:01] [use:00] // PC = 0xA004;
013 [def:10] [use:04] tmp = B+1;
014 [def:08] [use:10] CC = (tmp>>8)&1;
015 [def:04] [use:10] B = tmp&255;
016 [def:00] [use:00] // ASR A,B
017 [def:01] [use:00] // PC = 0xA005;
018 [def:02] [use:04] A = B>>1;
019 [def:00] [use:00] // LDB #1
020 [def:01] [use:00] // PC = 0xA006;
021 [def:04] [use:00] B = 1;
022 [def:00] [use:00] // JMP 0xFC08 ; end of basic block ...
023 [def:01] [use:00] // PC = 0xA008;
024 [def:00] [use:00]
025 [def:0e] [use:00] goto *lab[PC = 0xFC08];
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef FALSE
#define FALSE (0!=0)
#define TRUE (0==0)
#endif
#define MAX 1024
int force[MAX];
long defs[MAX], uses[MAX];
char *codes[MAX];
int nextfree = 0;
#define conditionally 1
#define unconditionally 2
#define flush 3
#define PC 1
#define A 2
#define B 4
#define CC 8
#define TMP 16
/* See http://groups.yahoo.com/group/compilers101/ for better techniques
to eliminate redundant code. This cheap hack uses a bitmask and a
scanning loop. Similar code in a real compiler would not be as
restricted and would not require a BFI scanning loop. However for
use in static translators of simple micros with only a few registers,
this is an acceptable algorithm. */
void backchain(int idx, int regs)
{
int i;
long common_regs;
// idx is the index of this instruction.
// regs are the registers that need to be dumped.
for (i = idx; i >= 0; --i) {
if (regs == 0) break; /* small optimisation - early exit when done */
if (common_regs = (regs & defs[i])) {
/* if bits in 'defs' then this needs to be dumped. */
#ifdef DEBUG
fprintf(stdout,
"%02d lookfor: %02x Item %d - def %02x\n",
idx, regs, i, defs[i]);
#endif
regs &= ~common_regs; /* don't need to look for these ones any more */
// or regs -= common_regs would actually work here, if sloppily...
force[i] = TRUE;
backchain(i-1, uses[i]); /* tag the predecessors of this instr in turn */
}
}
}
void dump(int optional, long def, long use, char *code)
{
int i;
force[nextfree] = FALSE;
defs[nextfree] = def; uses[nextfree] = use; codes[nextfree] = code;
if (optional == unconditionally) {
force[nextfree] = TRUE;
backchain(nextfree-1, use);
}
if (optional != flush) nextfree += 1; else {
// this is the same as a call to "flush_basic_block()"
force[nextfree] = TRUE; /* last instruction non-optional */
backchain(nextfree-1, def); // 2nd param holds the registers we want dumped
for (i = 0; i <= nextfree; i++) {
#if DEBUG
fprintf(stdout, "%03d [def:%02x] [use:%02x] ", i, defs[i], uses[i]);
#endif
fprintf(stdout, "%s %s", (force[i] ? " " : "// "), codes[i]);
}
nextfree = 0;
}
}
int main(int argc, char **argv) {
int i;
/* for (i = 0; i < MAXBITS; i++) count[i] = 0; */
// Test program for a non-existent machine. Several details omitted.
//
// $ORG A000
// LDB #42
// INP A,B
// INC B
// INC B
// ASR A,B
// LDB #1
// JMP 0xFC08 ; end of basic block ...
dump(unconditionally, 0, 0, " // LDB #42\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA000;\n");
dump( conditionally, B, 0, "B = 42;\n");
dump(unconditionally, 0, 0, " // INP A,B\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA002;\n");
dump(unconditionally, A, B, "A = fninput(B);\n"); /* has side effects */
dump(unconditionally, 0, 0, " // INC B\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA003;\n");
dump( conditionally, TMP,B, "tmp = B+1;\n");
dump( conditionally, CC, TMP, "CC = (tmp>>8)&1;\n");
dump( conditionally, B, TMP, "B = tmp&255;\n");
dump(unconditionally, 0, 0, " // INC B\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA004;\n");
dump( conditionally, TMP,B, "tmp = B+1;\n");
dump( conditionally, CC, TMP, "CC = (tmp>>8)&1;\n");
dump( conditionally, B, TMP, "B = tmp&255;\n");
dump(unconditionally, 0, 0, " // ASR A,B\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA005;\n");
dump( conditionally, A, B, "A = B>>1;\n");
dump(unconditionally, 0, 0, " // LDB #1\n"); /* No code, just
comments */
dump( conditionally, PC, 0, "PC = 0xA006;\n");
dump( conditionally, B, 0, "B = 1;\n");
dump(unconditionally, 0, 0, " // JMP 0xFC08 ; end of basic
block ...\n"); /* No code, just comments */
dump( conditionally, PC, 0, "PC = 0xA008;\n");
dump(unconditionally, 0, 0, "\n"); /* No code, just comments */
dump( flush, A | B | CC, /* List of registers to force out */
/* Note we never want TMP */
/* and in this case we can ignore PC */
0, "goto *lab[PC = 0xFC08];\n");
exit(0);
return(0);
}
We've already looked at how we can discard assignments to flag registers
if they are not used before a subsequent assignment, and indeed we can handle
long chains of such assignments/discards before an actual use of the variable
forces us to generate code to write its value. However we also have to write
the value to memory whenever we hit the end of a basic block, at least in a
simple scheme without advanced dataflow analysis.
For instance with the sequence lda fred; adc jim; sta fred; we can ignore
the negative and overflow flags because we don't use them, and we only
calculate the carry flag when we actually need it. We assume the flags are
all up to date and consistent when we enter via 'start:'. When we get
to the "random:" label however, control could get there from anywhere, so
we have to make sure all our flags are saved so that the entry conditions
to the next block are correct. (We'd also have to do that on a "RTS" or
any goto such as "JMP [indirect]")
start:
/* lda fred */
RegA = fred;
/* adc jim */
{register int newcarry;
newcarry = adc_carry[RegA][jim][RegCarry];
RegA = adc[RegA][jim][RegCarry];
RegCarry = newcarry;
}
/* sta fred */
fred = RegA;
RegNegative = 0; RegOverflow = 0;
random:
So a label such as this pretty much forces us to dump those pending stores.
But what about where the end of the basic block is not caused by the program's
flow control but by something in the generated code? eg:
/* sex - sign extend B.A registers (aka D register on 6809) */
if (RegA & 0x80 != 0) RegB = 0x00; else RegB = 0xff;
This should not cause the same degree of panic to dump all flags back to
variables that something like a label would cause. We know this instruction
doesn't affect the carry flag, for example.
So we should be able to get away with a lightweight "context merging" here,
in compiler terms. If we're doing the redundant code elimination ourselves
at code-generation time rather than letting the compiler do it, it should
be easy; all we do is call our standard code-dumping procedure, and then
call the optimisation code and tell it that we have modified whichever
variables are written to inside the if/then/else sequences.
This suggests then that we need a code-dumping procedure something like this:
dump(used_regs, written_to_regs, statement, args)
which indeed is very like GCC's internal assembler's conventions for telling
the compiler what the consequences of a piece of inline assembler are.
So let's start with a simple instruction like ADC which uses the carry
flag, and sets carry, negative and overflow flags.
The generated code in the worst case would be something like this:
{
register int newcarry;
newcarry = adc_carry[RegA][operand][RegCarry];
RegOverflow = adc_overflow[RegA][operand][RegCarry];
RegNegative = adc_negative[RegA][operand][RegCarry];
RegA = adc[RegA][operand][RegCarry];
RegCarry = newcarry;
}
(the hack with the carry is because we don't want to change it before we
use it in the calculation to update RegA, but we need the old RegA in
order to update the carry. Call it a software 2-phase lock :-) )
When our static translator generates the code above, we don't write it
out immediately. We save it to a data structure which can later be used
by the optimiser, and code is only dumped when a basic block is ended.
By that time it can have decided to remove some of the statements depending
on the use/reference chains. (see our discussions on compilers101 for
details)
So, the translator source to generate the above would be something like this:
opd = decode_opd(&PC);
dump(Always, Always, " {\n");
dump(Carry, A | Carry, " register int newcarry;\n");
dump(Carry, A | Carry, " newcarry = adc_carry[RegA][%s][RegCarry];\n",
opd);
dump(Overflow, A | Carry, " RegOverflow =
adc_overflow[RegA][%s][RegCarry];\n", opd);
dump(Negative, A | Carry, " RegNegative =
adc_negative[RegA][%s][RegCarry];\n", opd);
dump(A, A | Carry, " RegA = adc[RegA][%s][RegCarry];\n", opd);
dump(Carry, Carry, " RegCarry = newcarry;\n");
dump(Always, Always, " }\n");
So... the first parameter is a set detailing which registers are written to.
The second parameter details which registers are read from.
The third paramater (and following varargs) are the code to be generated.
This info will be stored until a call to flush_basic_block() at which
point only the minimum necessary statements will be output.
With this framework then, generated code containing a conditional like this
is not a problem:
dump(B, A, " if (RegA & 0x80 != 0) RegB = 0x00; else RegB = 0xff;\n");
it simply says that it uses RegB and that it generates RegA. It can
be treated by the internal optimiser as if it were a simple unconditional
instruction, like the alternative coding:
dump(B, A, " RegB = sex[RegA];\n");
One reason I previously preferred the use of lookup tables like the
above was so that we had no conditionals, thus allowing the compiler to
optimise more aggressively. But if we handle that level of optimisation
within our generator rather than leaving it to the compiler or a
compiler-independant second pass optimiser, then the fact that one
implementation of the instruction uses an if/then/else is completely
irrelevant, and we can avoid these large lookup tables... You never
know, on a machine with a suitable instruction, the whole sequence may
end up getting compiled by the original SEX instruction or something
close to it! (The PDP15 had a single-opcode skip instruction which was
great for short conditionals like this; the ARM has something similar
which can work over multiple instructions as long as you don't change
the flags inside those instructions)
The following is only slightly more complex; or looking at it another
way, generates code only slightly less efficient, but is still better
than the alternative of ending a basic block prematurely:
// I can't think of a real example off the top of my head so I'll
// invent a hypothetical opcode "FOO arg" which if arg is even will
// add it to the A register otherwise will add it to the B register!
Generated code is:
/* FOO */
if (operand&1 == 0) A = adc[A][operand][RegCarry]; else B =
adc[A][operand][RegCarry];
The generator for this could therefore be:
dump(A | B, A | B | Carry, " if (%s&1 == 0) A = adc[A][%s][RegCarry]; else B =
adc[A][%s][RegCarry];\n", opd, opd, opd);
Now the drawback of this is that the optimiser will see that it needs both
A and B as inputs, and ensure that they are dumped before this statement
is generated, resulting in an extra store to one or other of the operands
no matter which branch is taken. Ideally the branch to calculate A should
cause B to be stored in that branch first, and vice-versa. This way both
A and B are updated after the statement but only once each.
But the fact that we have not been forced to flush the whole basic block
by virtue of the presence of an if/then/else is a big win, and worth doing.
The great thing about this scheme is that it only needs to be written once,
and it will be usable in any static translator. And it's not restricted
to one which generates C; the code in the dump() call could just as easily
be ARM or 68000 assembler...
I've started a new job this week and don't have as much time to hack code
as I usually do; and Bart is off on vacation for a couple of weeks, so don't
expect to see this code for a month or so - unless someone else feels like
walking up to the plate and giving it a go...
G
Although my personal concept of static recompilation is biased towards
translating specific programs for a purpose - which means that I don't
mind too much if I have to hand-tweak a couple of awkward spots after
the gross translation is done by a program - it does kind of rankle
that I can't see an easy way to handle awkward code like the example
below:
*BL _ Hit
{*** All errors come here - very unsafe ***}
*LDR _ R4, [Pc, #16] {pick up address of BUFFER} {001}
*LDR _ R3, [R4, #4] {error number} {001}
*TSTS _ R3, #16_80000000 {check for top bit} {001}
*BNE _ Err {fatal error} {001}
*LDR _ R14, [R4, #0] {address of inst. after SWI}{001}
*ORRS _ Pc, R14, #16_1000 0000 {return setting the V bit}
!?????????????????
*SWI _ 0 {DUMMY} {a hole for the address of BUFFER}
!?????????????????
Hit: *MOV _ R0, R14 {trap routine}
*LDR _ R1, A {address of 256-byte buffer?}
*STR _ R1, [Pc, #-20] {plug in dummy SWI above}
*MOV _ R2, #0
*MOV _ R3, #0
*SWI _ Brasil Control
%return
This is the source of the error trapping in the Acorn Pascal runtime
library. It does a BL to get its own PC in a register, and then
plugs it into the code above. (The code is written in Imp, with
embedded assembler. The Pascal compiler was written in this language
as well)
So much to my surprise, not only does some code produced by a compiler
have self-modifying code, *all* programs produced by this compiler
have self-modifying code! Since one of my projects at the moment
is to get the old Acorn Archimedes compilers working again in order
to compile for new ARM targets (such as the GBA and GP32) the problem
above is definitely of interest to me.
So I can see that through emulation it should be possible to locate
addresses in the code segment that are poked, but the simple scheme
I have for identifying memory usage won't tell me what code is poked
into that area or how its used. Although I could special-case this
particular example I'm more interested in solving the general case.
So far I've come up with something like this:
for (;;) {
switch (PC) {
...
case 0xe1005c: if (untouched) {
SWI(0);
} else {
interpret(PC);
break;
}
case 0xe10060: ... next instruction ...
...
}
}
So basically my solution is to pass control to an old-fashioned
emulator for any addresses which contain updated code. The emulator
will have to work out when it's safe to return.
I'm not sure this is a very good solution and I can see all sorts of
side issues. As far as I can see the only sensible way to implement
this involves having an array as large as the code with flags to
say if the word is the original (in which case use the static
translation) or has been poked (in which case emulate).
One problem is that it screws bigtime with your basic blocks. Any
optimisations that are done by optimising sequential instructions
will be broken if modified code jumps into the middle of a basic
block, so as well as interpreting the modified code we may also have
to interpret the non-modified code if we jump to an address that
we didn't previously realise was a jump target.
We either need a traditional emulator, *or* a second translation which
is done on an instruction by instruction basis with no basic-block
optimisations.
It's all very messy.
Back to the subject of translating a program from native opcodes to C:
I had mentioned that a significant source of overhead comes from things
like the C/V/Z/N flags which every opcode calculates, but which aren't
always (or indeed often) used, before they're re-calculated by a
subsequent opcode. I suggested that a good compiler would entirely
remove these redundant expressions, *but* that there weren't many
compilers that good... So Bart knocked up a proof-of-concept demo to
show that it's not too hard to do this in a compiler, and I'm giving
serious thought to hacking up a C-to-C source optimiser some day that
does it in a portable way. Meantime, this can be done by adding compiler-like
code to the program that generates the translation. (Not restricted to
C - translating to other architectures directly should be the same if you're
not able to map the flags in a 1:1 manner but have to duplicate them
explicitly)
However... short of doing global dataflow analysis, you'll always have
to write those flags back to variables (or at least a slaved register)
at the end of every basic block. I was trying to think of a way around
this, and have half of a solution. Imagine this code for the 6502 (from
Tom Walker's b-em):
/*ROR abs*/
addr=getword();
temp=readmem(addr);
tempi=c;
c=temp&1;
temp>>=1;
if (tempi) temp|=0x80;
p&=~(2 | 0x80);
p|=((temp==0)<<1) | (temp & 128);
writemem(addr,temp);
cycles-=6;
Assuming it's the last statement in a basic block there's nothing we can
do about the machine registers (in this case 'p') *but* there is something
we can do about 'temp' and 'tempi': we *know* these are only used by us,
so there's no need to remember them between instructions, so...
{register int temp, tempi;
... code from above ...
}
It's just a small tweak, but it lets the compiler know that temp and
tempi are not going to be used again after this sequence, so they don't
need to be remembered.
How much difference this would make over a program I don't know, but
it certainly shouldn't make it worse.
Here's some more code from that emulator:
INLINE unsigned char readmem(unsigned short address)
{
if ((address > 0xfc00) && (address < 0xff00))
{
if (address>0xFE0F&&address<0xFE20)
return readserial(address);
if ((address&~0x1F)==0xFEC0)
return readadc(address);
if ((address & ~0xf)==0xfe40 || (address & ~0xf)==0xfe50)
return SysVIARead(address & 0xf);
if (address==0xFE60)
return rand()&255;
if (address==0xFE70)
return rand()&255;
if ((address & ~0xf)==0xfe60 || (address & ~0xf)==0xfe70)
return UVIARead(address & 0xf);
if ((address & ~15)==0xfe00)
return CRTCRead(address & 0x1);
if ((address & ~0xf)==0xfe20)
return ULARead(address & 0xf);
if ((address&~0xF)==0xFE30)
return currom;
if ((address&~0x1F)==0xFE80)
return read8271(address);
return 0xfe;
}
return ram[address];
}
What we have here is an architecture where 99.99% of the accesses will be to
ram or rom, and only a few to special registers or peripherals. Inlining the
procedure is reasonable in a regular emulator where there's only one
instantiation of it, but in a static translation there are as many copies
as there are occurances of the opcode in the program. So it really would
be overkill to implement it this way. What's more, because of all the
branches and returns, inline code is going to have almost as much overhead
as a real procedure would minus just one stacking operation of the PC.
I would implement this in translation differently: first I'd expand the ram[]
array from 8 bits to 16 bits per byte (100% wasteful in space, but it's only
a 64K micro) and I'd use the top byte as a flag that it wasn't regular memory;
then I'd keep something like the same procedure as above, but make it a
real procedure, while moving the most common test out into in-line code
as follows:
/* 0 in top bit of 16-bit ram data means normal */
if ((temp = ram[addr]) < 0) temp = readmem(addr);
...
if (ram[addr] >= 0) ram[addr] = (unsigned char)temp;
else writemem(addr,temp);
the rest of the flag byte can also be usefully used to tag the memory
more precisely, making readmem much more efficient:
unsigned char readmem(unsigned short address)
switch ((unsigned short)ram[addr] >> (unsigned short)8) { /*Ansi silliness*/
/* consecutive values should be efficiently implemented by a jump table */
case 0x81:
return readserial(address);
case 0x82:
return readadc(address);
case 0x83:
return SysVIARead(address & 0xf);
case 0x84:
return rand()&255;
case 0x85:
return rand()&255;
case 0x86:
return UVIARead(address & 0xf);
case 0x87:
return CRTCRead(address & 0x1);
case 0x88:
return ULARead(address & 0xf);
case 0x89:
return currom;
case 0x8a:
return read8271(address);
default:
return 0xfe;
}
}
(similarly writemem, which needs to add "| 0x8?00" etc to each return.)
So although the method of creating a static translation which I outlined
earlier (starting with an emulator and outputting the emulator source
for each instruction) is guaranteed to work, it won't automatically
give you the best code, even if the compiler is above average; you
still have to think about ways of helping the compiler, even at the
expense of using more ram or lookup tables:
/* ASLA */
{
unsigned char olda=a;
a<<=1;
/* Two lines below expanded from:
setczn((olda & 0x80)>0, a==0, a & 128);
*/
p&=~(3 | 0x80);
p|=(((olda & 0x80)>0)!=0) | (((a==0)!=0)<<1) | (((a & 128)!=0)<<7);
Cycles+=2;
}
OK, the "a<<=1;" is probably not improved by "a = asl[a];" (although
more complex arithmetic ops are probably better served with a lookup
table than with inline code, eg a = adc[a][opd];) but what about the
flag calculation:
/* Two lines below expanded from:
setczn((olda & 0x80)>0, a==0, a & 128);
*/
p&=~(3 | 0x80);
p|=(((olda & 0x80)>0)!=0) | (((a==0)!=0)<<1) | (((a & 128)!=0)<<7);
never mind the bogosity of the macro expansion which will generate a lot
of unnecessary normalization (one reason why compilers need aggressive
optimisation even for constructs that a program would not ever write if
he realised he was doing it) but what are we calculating here? a carry
bit that is a function only of 'a'. (Apparently of 'olda', but olda is
itself a function of a.)
So why not do this as
p = (p & ~0x83) | aslflags[a];
Now if we put the two together, we get
p = (p & ~0x83) | aslflags[a]; a = asl[a];
Similar constructs would work for any opcode such as ADC, MUL etc.
Note that a good compiler will create an internal common sub-expression
temporary for the indexing operation so accessing two arrays with the
same subscript if often very cheap. probably cheaper than smooshing the
two arrays into a 16-bit array and shifting/masking to get the two results
back out of them.
But that's just ordinary emulator design decisions; what I was trying to show
was how to help the compiler. I said that explicitly scoping the temporaries
around each instruction was half of the solution; well, I think this is
the other half:
/* RRA addr */
oldval=val=fastread(addr);
fastwrite(addr,val);
val>>=1;
val&=127;
if (p & 1)
val|=128;
fastwrite(addr,val);
ADC(addr);
What's the problem here? Well, val is set from within a conditional
so unless your compiler is fairly smart (and we suspect that few of
them are) then it will end the basic block at that 'if' statement, with
the knock-on effects which we know that causes.
If we recast this as "val |= topbit[p]" and initialise that array
appropriately, there is no more conditional, and the optimiser can
go to work on the larger seaquence.
("val |= (p&1)<<7" would work too; I happen to lean on the side
of overkill for array lookups in emulator code! Very target-architecture
dependent)
The final part of the original example was: cycles += 2; - an optimiser
such as we've been discussing on compilers101 can save these up over
several instructions and conflate them to say cycles += 50. You can
still do timing-sensitive operations; any point that you would look
at the cycle count to make a decision, it would cause the compiler to
update the variable. I'm assuming that nothing is so timing-critical
that an emulator would check the cycle count on every instruction. If
you're just trying to slow down to real-time speed, or simulate CRT
flyback timing, once every hundred or so instructions should be more
than enough.
You may have noticed in the above that I conveniently avoided the
issue of procedure calls having side-effects, which also cause the
compiler to be forced to slave registers back to store. This can be
ameliorated by declaring all your virtual registers with the "register"
qualifier. Then you cannot take the address of them, and so the compiler
knows you cannot write to them via pointer indirections, and will not
slave them back to variables even if you call a procedure with side-effects.
However the catch is that you can't have global register ints, they can
only be declared within a procedure. So you're forced to use GCC's extensions
to allow nested procedures, if they're to be in a scope which can see the
register int declarations. This is messy. I don't yet have a good
solution. Mapping virtual registers to real registers is harder in C
than it is if translating from machine code to machine code (eg 6502 to ARM)
but as I found from CodeWarrior for the 68000, it's not impossible.
Note: typing a variable as 'register' does not force it to be put in a
register. The compiler can ignore the qualifier and do whatever it thinks is
the best thing for the code, leaving those variables on the stack, and
putting other non-register variables in registers for the duration of a
function or at least some basic blocks. What it does do consistently is
tell the compiler that you can't write to these variables any way other
than by name, which in the context above might just be all that's needed
to allow the compiler to more aggressively remove redundant code.
G
--- In staticrecompilers@yahoogroups.com, "Graham Toal <gtoal@g...>"
<gtoal@g...> wrote:
> --- In staticrecompilers@yahoogroups.com, "davesbit
<davesbit@y...>"
> <davesbit@y...> wrote:
> > GCC would spot even more stuff probably
>
> Ho ho! You've never done a dump of GCC's output, have you? It's
> not all they'd like you to think it is. For all the fancy
> techniques it tries to do, a lot of the compilers I used in the
70's
> produced better code that GCC does now.
True GCC has been known to be a bit naff - MVSC is pretty good though
from what I've seen, with x86 at least - has a few problems with ARM
in eVC though!
A passing thought... let's say we have a reasonably good compiler
which eliminates dead stores and redundant loads, but only within
basic blocks. In that case, when we're crafting the code which each
statement is to be translated to, it will make the basic blocks
larger if all the code is unconditional rather than containing
if/then/else sequences. For instance the 6809 SEX instruction which
Sign EXtends an 8-bit value to 16 bits *could* be written as:
if (register_B & 0x80 != 0) register_D = 0xff00 | register_B; else
register_D = register_B;
however that would screw with the basic block structure.
If instead we were to do:
register_D = sex_array[register_B];
and have a 256 element array containing -128:127 as 16-bit ints,
then all the optimisations we discussed earlier could still be
carried out.
Of course a better compiler would do global dataflow analysis and
context merging around if/then/else sequences, which would preserve
information about cached variables in registers, but to be able to
do that even for a simple 'if' like above requires a whole bunch of
added baggage, which you need for the general case, so my experience
has been that not too many compilers do it.
--- In staticrecompilers@yahoogroups.com, "davesbit <davesbit@y...>"
<davesbit@y...> wrote:
> GCC would spot even more stuff probably
Ho ho! You've never done a dump of GCC's output, have you? It's
not all they'd like you to think it is. For all the fancy
techniques it tries to do, a lot of the compilers I used in the 70's
produced better code that GCC does now.
Why don't you take that code sequence I gave as an example and
report back to us what various compilers do with it? It'll be
instructive...
Try it with and without the optimisation flags...
flag_C = acc_a0 = register_A;
cmp_old = register_B;
register_B = cmp_new = 0;
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
G
At 01:52 AM 12/24/2002 -0000, you wrote:
>
>Thing is, this is all compiler theory - wouldn't GCC spot all of
>this, making any 6502->C (elimate dead registers) C->x86 method
>pointless?
>
>Thing is, even so, I think it's still worth it, because GCC would
>spot even more stuff probably, but it would be curious to check the -
>S output just to see how well it was doing!
I'd be interested in seeing the kind of code GCC generates without any
fancy optimizations done ahead of time and with the optimizations Graham
was talking about.
You can't put too much trust in a compiler. They're good sometimes, but
sometimes they just plain suck.
You could also generate machine code directly when statically translating
something, and in that case, you're on your own when it comes to
optimizations. A lot of these techniques are applicable to dynarecs, too.
----
Bart
Thing is, this is all compiler theory - wouldn't GCC spot all of
this, making any 6502->C (elimate dead registers) C->x86 method
pointless?
Thing is, even so, I think it's still worth it, because GCC would
spot even more stuff probably, but it would be curious to check the -
S output just to see how well it was doing!
--- In staticrecompilers@yahoogroups.com, Graham Toal <gtoal@g...>
wrote:
> > when rewritten with multiple assignments agglomerated =>
> >
> > register_B = 0
> > cmp_old = register_A = 0xf00
> > cmp_new = 1
> > register_A = flag_C = acc_a0 = 0xf01
> >
> > and frankly, it just doesn't get any better than this! That
would generate
>
> unfortunately it does, I was careless; the first assignment to
register_A
> was redundant...
>
> register_B = 0
> cmp_old = 0xf00
> cmp_new = 1
> register_A = flag_C = acc_a0 = 0xf01
> when rewritten with multiple assignments agglomerated =>
>
> register_B = 0
> cmp_old = register_A = 0xf00
> cmp_new = 1
> register_A = flag_C = acc_a0 = 0xf01
>
> and frankly, it just doesn't get any better than this! That would generate
unfortunately it does, I was careless; the first assignment to register_A
was redundant...
register_B = 0
cmp_old = 0xf00
cmp_new = 1
register_A = flag_C = acc_a0 = 0xf01
We're discussing the implementation of this over in 'compilers101' but
I wanted to show an example here so you understand why it's so relevant
to static binary translation (and also why it's easier in a static
translator than in a dynarec - the overhead is high, but if you only
do it once...)
> > >Unfortunately none of the compilers I've used have eliminated
> > >redundant loads and stores; however Bart has just finished writing
> > >a source to source optimiser which I hope to steal and adapt for
> > >this project so that I can eliminate those within basic blocks.
> >
> > But it doesn't really eliminate redundant loads/stores at all. At least not
> > the kind I'm thinking of. The code generator simply generates ugly
> > load/stores to temporary variables (which, granted, could easily be
> > optimized away by some post-analysis.)
> >
> > Let me know if it helps you in your project, I'd be interested to hear if
> > it works out.
Here is the sort of code that I want to simplify; it's generated by an
automatic translator, and has a few minor optimisations already performed,
for instance the original code
; 0000: 57 USB
; 0001: 00 CLR
; 0002: 0F LDA #$F00
; 0003: 20 01 ADD #$01
Does a load and an add in order to generate a literal that's outside it's
range. The ideal translation of this of course would be:
register_B = 0;
register_A = 0xf01;
but in a wider context, we need to keep track of all the flags that are
set by these operations. So the semi-naive translation that I've done
so far generates this:
L0000:
/* Invariants: register_P = 0x0 register_I = 0x00 register_A = 0x00 */;
flag_C = acc_a0 = register_A;
cmp_old = register_B; /* step back cmp flag */
register_B = cmp_new = 0;
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
I'll translate it to one-letter names, run it through Bart's CSE, and
translate it back:
c = x = a;
p = b;
b = n = 0;
p = c = x = a;
a = n = 0xf00;
p = a; n = 1;
a = c = x = 0xf01;
=>
x = a
c = x
p = b
n = 0
b = n
x = a
c = x
p = c
n = 0xf00
a = n
p = a
n = 1
x = 0xf01
c = x
a = c
now, the set of distinct variables here is : a b c n p x, so if I manually
apply the rules from an earlier post about only generating reachable code,
we get this. Starting places are marked '*' and reachable items are marked '?'
x = a
c = x
p = b
n = 0 ?
b = n *
x = a
c = x
p = c
n = 0xf00 ?
a = n ?
p = a *
n = 1 *
x = 0xf01 *
c = x *
a = c *
which gives us:
n = 0 ?
b = n *
n = 0xf00 ?
a = n ?
p = a *
n = 1 *
x = 0xf01 *
c = x *
a = c *
which back in the original namespace is:
cmp_new = 0
register_B = cmp_new
cmp_new = 0xf00
register_A = cmp_new
cmp_old = register_A
cmp_new = 1
acc_a0 = 0xf01
flag_C = acc_a0
register_A = flag_C
That's not at all bad, but we can do better if we remember assignments, as
follows:
cmp_new = 0
register_B = 0
cmp_new = 0xf00
register_A = 0xf00
cmp_old = 0xf00
cmp_new = 1
acc_a0 = 0xf01
flag_C = 0xf01
register_A = flag_C
Having made that simple mod, we can get rid of *two* assignments to
cmp_new, so the final code is:
register_B = 0
register_A = 0xf00
cmp_old = 0xf00
cmp_new = 1
acc_a0 = 0xf01
flag_C = 0xf01
register_A = 0xf01
when rewritten with multiple assignments agglomerated =>
register_B = 0
cmp_old = register_A = 0xf00
cmp_new = 1
register_A = flag_C = acc_a0 = 0xf01
and frankly, it just doesn't get any better than this! That would generate
perfect code on any architecture. By stacking the assignments up like that
we leave the final code generator to decide whether to move a constant or save
a common register.
If you're interested, the best code generator that I've fed the sequence
through so far is CodeWarrior for the 68000. It managed to slave most of
the variables to registers and minimise saves to store, but even it didn't
generate code that could compare to the above:
register_B = 0;
00000064: 426E FDF6 clr.w -522(a6)
cmp_old = flag_C = acc_a0 = register_A;
00000068: 3D43 FDFC move.w d3,-516(a6)
0000006C: 3A03 move.w d3,d5
0000006E: 3C03 move.w d3,d6
register_A = cmp_new = 0x0f00;
00000070: 3E3C 0F00 move.w #3840,d7
00000074: 3607 move.w d7,d3
cmp_old = register_A;
00000076: 3C03 move.w d3,d6
cmp_new = 0x01;
00000078: 7E01 moveq #1,d7
register_A = flag_C = acc_a0 = 0x0f01;
0000007A: 3D7C 0F01 FDFC move.w #3841,-516(a6)
00000080: 3A3C 0F01 move.w #3841,d5
00000084: 3605 move.w d5,d3
If I had an optimiser like the above, plus a good compiler like CodeWarrior,
the translation of the original Cinematronics sequence to 68000 code would
be close to perfect.
The longer the basic block sequence that you can feed to the optimiser,
the fewer assignments to 'housekeeping' flag registers such as cmp_old
and cmp_new will need to be made. They will be saved only when used
(eg flag_C will be set immediately before a branch test), or at the
end of the basic block, in case they're needed in the next block.
Without doing global optimisations, that's as good as you'll get, and
frankly the 90/10 rule applies here - this has already given you 90% of
the benefit at only 10% of the effort that a global dataflow analysis
would take.
Anyway, this is why it's important in your translation to knock out as
many labels that aren't needed as possible, and not just label every
instruction. The fewer the labels, the larger the basic block, and the
more optimisation the compiler can do for you.
Do you reckon you could you do a switch like this?:
switch (pc&0xff0000)
{
case 0x000000:
switch (pc&0xffff00)
{
case 0x000200:
switch (pc)
{
case 0x000206: // blah blah
}
}
}
and then it would be ANSI C? (Because ANSI see allows 257 case
statements I think)
When you have a branch instruction, is the idea that you set
pc=target address, quit out of the switch and then go straight back
in?
Or is there a way to use a C goto statement?
--- In staticrecompilers@yahoogroups.com, Bart <trzynadl@u...> wrote:
> >Unfortunately none of the compilers I've used have eliminated
> >redundant loads and stores; however Bart has just finished writing
> >a source to source optimiser which I hope to steal and adapt for
> >this project so that I can eliminate those within basic blocks.
>
> But it doesn't really eliminate redundant loads/stores at all. At
least not
> the kind I'm thinking of. The code generator simply generates ugly
> load/stores to temporary variables (which, granted, could easily be
> optimized away by some post-analysis.)
>
> Let me know if it helps you in your project, I'd be interested to
hear if
> it works out.
>
> ----
> Bart
>Unfortunately none of the compilers I've used have eliminated
>redundant loads and stores; however Bart has just finished writing
>a source to source optimiser which I hope to steal and adapt for
>this project so that I can eliminate those within basic blocks.
But it doesn't really eliminate redundant loads/stores at all. At least not
the kind I'm thinking of. The code generator simply generates ugly
load/stores to temporary variables (which, granted, could easily be
optimized away by some post-analysis.)
Let me know if it helps you in your project, I'd be interested to hear if
it works out.
----
Bart
> I'm really curious as to how this static recompilation method works -
> is the entire program converted into one function/switch statement?
Yes and no!
Firstly, "yes":
> Is it something like this?
>
> // 206 move.b $a10001.l, D0 [10 39 00 a1 00 01 ]
> // 20c move.b D0, D7 [1e 00 ]
> // 20e andi.b #$f, D0 [02 00 00 0f ]
>
> goes to:
>
> switch (a)
> {
> case 0x206:
> // code to emulate move.b $a10001.l, D0
>
> case 0x20c:
> // code to emulate move.b D0, D7
>
> case 0x20e:
> // code to emulate andi.b #$f, D0
> }
it's like that, but initially with each instruction being emulated
and then returning, i.e:
> switch (a)
> {
> case 0x206:
> // code to emulate move.b $a10001.l, D0
> return;
>
> case 0x20c:
> // code to emulate move.b D0, D7
> return;
>
> case 0x20e:
> // code to emulate andi.b #$f, D0
> return;
> }
> Does the compiler even manage to compile that for a whole program
> address space? Will it accept a switch that big? Does GCC do a good
> job on it?
GCC does. I wasn't able to get CodeWarrior for the Palm to swallow
it however. I ended up with 8 equally sized switch statements and
one controlling loop rather than one big switch:
void cineExecuteFrame(void) {
bNewFrame = 0;
for (;;) {
switch ((register_PC >> 10)) {
case 0:
cineExecute0000(); break;
case 1:
cineExecute0400(); break;
case 2:
cineExecute0800(); break;
case 3:
cineExecute0c00(); break;
case 4:
cineExecute1000(); break;
case 5:
cineExecute1400(); break;
case 6:
cineExecute1800(); break;
case 7:
cineExecute1c00(); break;
}
if (bNewFrame == 1) return;
}
}
OK, now the "No" answer...
What you show above is exactly what I did while debugging the emulation,
so that I could run both cores in step, one instruction at a time.
However once it was debugged, I then removed the "return" between
each statement, *and* the "case" label, so that to get from one
instruction to the next, it just drops through. This then is the
entire point of the exercise: there is *no* flow control overhead
when the instructions drop through. Of course you have to do a lot
of work to know which labels are never used and can safely be
removed.
Also, most jumps are simple jumps and can jump directly to a static
label. (This is why the label version is better than the case version)
Only computed gotos will need to return and be redispatched by
to a case statement by code address.
Incidentally, since the translation is generated from a program, it's
very easy to have an option to either generate a large case statement,
*or* to use GCC's extension to allow arrays of labels.
> Does it even compile out redundant flag calculation in the
> original code??! I'm fascinated by this!
Unfortunately none of the compilers I've used have eliminated
redundant loads and stores; however Bart has just finished writing
a source to source optimiser which I hope to steal and adapt for
this project so that I can eliminate those within basic blocks.
(We're talking about that on the "compilers101" yahoo group)
> You could even do cycle accurate raster effects by putting the odd if
> (cycles<0) break; in! - god I'm so excited about all this!!
Yes, I did something like that to emulate the "WAI" (wait for vsync)
instruction in the Cinematronics. It exited the interpreter main loop,
waited for a real vsync, then went back to the dispatch loop.
Here are some of the different versions of code I've had at various
times; should make the structure clearer:
--------------------------------------------------------------------
** first, one instruction at a time emulation:
goto *lab[register_PC];
return;
L0000:
/* Invariants: register_P = 0x0 register_I = 0x00 register_A = 0x00 */;
/* opNOP_A_B (57) */
return;
L0001:
/* opLDAimm_B_AA (00) */
flag_C = acc_a0 = register_A;
cmp_old = register_B; /* step back cmp flag */
register_B = cmp_new = 0;
return;
L0002:
/* opLDAimm_A_AA (0f) */
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
return;
L0003:
/* opADDimmX_A_AA (20) */
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
return;
L0005:
/* opLDJimm_A_A (48) */
register_J = 0x0008;
return;
--------------------------------------------------------------------
** next, same thing but redundant labels removed:
goto *lab[register_PC];
L0000:
/* Invariants: register_P = 0x0 register_I = 0x00 register_A = 0x00 */;
flag_C = acc_a0 = register_A;
cmp_old = register_B; /* step back cmp flag */
register_B = cmp_new = 0;
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
L0008:
/* Invariants: register_P = 0x0 */;
ram[register_I = 0x00] = register_A; /* store acc to RAM */
register_I = ram[0x00]&0xff; /* set new register_I (8 bits) */
ram[register_I] = register_B; /* store acc */
register_A = (flag_C = ((cmp_old = acc_a0 = register_A) + (cmp_new =
0x1))) & 0xFFF; /* add values, save carry */
if (!(flag_C & CARRYBIT)) goto L0008;
--------------------------------------------------------------------
** and a version using case statements (and a few more comments);
note the jump code "{register_PC = 0x0904; break;};" where it jumps
to a new address, as opposed to merely "goto L0904;"...
switch (register_PC) {
case 0x0000:
/* Invariants: register_P = 0x0 register_I = 0x00 register_A = 0x00 */;
/* opNOP_A_B (57) */
/* opLDAimm_B_AA (00) */
flag_C = acc_a0 = register_A;
cmp_old = register_B; /* step back cmp flag */
register_B = cmp_new = 0;
/* opLDAimm_A_AA (0f) */
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
/* opADDimmX_A_AA (20) */
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
/* opLDJimm_A_A (48) */
register_J = 0x0008;
/* opLDPimm_A_A (80) */
case 0x0008:
/* Invariants: register_P = 0x0 */;
/* opSTAdir_A_A (d0) */
ram[register_I = 0x00] = register_A; /* store acc to RAM */
/* opLDIdir_A_A (c0) */
register_I = ram[0x00]&0xff; /* set new register_I (8 bits) */
/* opNOP_A_B (57) */
/* opSTAirg_B_BB (e6) */
ram[register_I] = register_B; /* store acc */
/* opADDimm_A_AA (21) */
register_A = (flag_C = ((cmp_old = acc_a0 = register_A) + (cmp_new =
0x1))) & 0xFFF; /* add values, save carry */
/* opJNC_A_AA (5d) */
if (!(flag_C & CARRYBIT)) {register_PC = 0x0008; break;};
/* opSTAdir_A_A (d0) */
ram[register_I = 0x00] = register_A; /* store acc to RAM */
/* opLDAimm_A_AA (00) */
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0;
/* opADDimm_A_AA (21) */
cmp_new = 0x1; cmp_old = acc_a0 = register_A; register_A = flag_C =
0x0001;
/* opOUTbi_A_A (95) */
/* opOUTsnd_A (95) */
reset_coin_counter(1);
/* opLDJimm_A_A (44) */
register_J = 0x0904;
/* opJMP_A_A (58) */
{register_PC = 0x0904; break;};
case 0x0026:
/* Invariants: register_P = 0x8 register_I = 0x82 register_A = 0x01 */;
/* opLDAimm_A_AA (09) */
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0900;
/* opADDimmX_A_AA (20) */
cmp_old = register_A; cmp_new = 0x1c;
--------------------------------------------------------------------
> That's 90% of the work done now. You should have a rough
translation
> of the original program to C now. Now, replace one of the two
> emulator cores in your modified emulator with the new one which
> you've written: the only difference in structure between the two
> should be this: to execute a single instruction, the 'traditional'
> emulator fetches an opcode and dispatches to an appropriate
> procedure to emulate it; the 'translator' version however just
> dispatches on the basis of the address of the instruction, where it
> will find the translated code for that instruction, which it
executes
> and then returns.
>
> Finally it's all in place. Run the dual-cpu emulator, play your
> game, and any time it halts with a difference between the original
> emulator and your new one, examine the last instruction emulated
> and see why they differ. This is an *extremely* fast way to find
> the bugs in your translation. I took mine from first step to
> fully debugged in under a week. As long as you trust the original
> emulator, success is pretty much guaranteed.
>
> The very last step is to remove the original emulator code, and
> change your translation so that instead of returning after executing
> each opcode, it just drops through naturally to the next opcode;
> remove any switch labels that cannot be jump targets - and presto,
> your translation is complete. A good compiler will optimise the
> hell out of your basic blocks now.
I'm really curious as to how this static recompilation method works -
is the entire program converted into one function/switch statement?
Is it something like this?
// 206 move.b $a10001.l, D0 [10 39 00 a1 00
01 ]
// 20c move.b D0, D7 [1e 00 ]
// 20e andi.b #$f, D0 [02 00 00 0f ]
goes to:
switch (a)
{
case 0x206:
// code to emulate move.b $a10001.l, D0
case 0x20c:
// code to emulate move.b D0, D7
case 0x20e:
// code to emulate andi.b #$f, D0
}
Does the compiler even manage to compile that for a whole program
address space? Will it accept a switch that big? Does GCC do a good
job on it? Does it even compile out redundant flag calculation in the
original code??! I'm fascinated by this!
You could even do cycle accurate raster effects by putting the odd if
(cycles<0) break; in! - god I'm so excited about all this!!
>When you do find differing cpu contexts, what would the
>cause most likely be?
That's what should be revealed.
When I wrote my 68K interpreter, I tested it by having it check against
Starscream (which isn't by any means a sufficiently accurate 68K emulator
itself.) Whenever a difference was found, execution was halted and the
contexts for both CPUs were dumped (registers.)
I implemented this with a simple line oriented debugger, so all I had to do
was re-execute the instruction with different data in the registers (which
I could modify by hand) to deduce what was wrong in my emulator (or
Starscream) by checking against the 68K manuals.
Usually the errors were something related to flags or data being written
incorrectly. When a difference is found, checking the manual will help
identify what may be wrong in your (or the other) CPU core.
It really is a nifty way of debugging :)
----
Bart
--- In staticrecompilers@yahoogroups.com, "imikeyiz <staticbt@h...>"
<staticbt@h...> wrote:
> There are a couple of issues I think I miss, though.
>
> > That's 90% of the work done now. You should have a rough
> translation
> > of the original program to C now.
>
> When does the rough translation actually take place? (Sorry if I'm
> asking the obvious)
The translator is the code which was based on a disassembler.
Instead of printing out asm statements, it prints out the source
code stolen from the original emulator which would have been used to
interpret that opcode if it were to be run. for instance:
CINESTATE opLDAdir_A_AA (int opcode)
{
internal_test(opcode);
if (disp_opcodes) fprintf(codefile, "%s/* opLDAdir_A_AA (%02x) */
\n", cur_tabs, opcode);
fprintf(codefile, "%s%s", cur_tabs, "cmp_old = flag_C = acc_a0 =
register_A; /* store old acc */\n");
if (register_P != 0xDEADBEEFUL) {
fprintf(codefile, "%sregister_A = cmp_new = ram[register_I = 0x%
02x]; /* set I register */\n", cur_tabs, register_I = (register_P <<
4) + (opcode & 0x0f));
} else {
fprintf(codefile, "%sregister_A = cmp_new = ram[register_I =
(register_P << 4) + 0x%02x]; /* set I register */\n", cur_tabs,
(opcode & 0x0f));
register_I = 0xDEADBEEFUL;
}
register_A = 0xDEADBEEFUL;
fprintf(codefile, "%s%s", cur_tabs, "\n");
if (require_note_state(register_PC)) fprintf(codefile, "%s%s",
cur_tabs, "state = state_AA;\n");
return state = state_AA;
}
> >Now, replace one of the two
> > emulator cores in your modified emulator with the new one which
> > you've written: the only difference in structure between the two
> I have never heard of this idea and it definately sounds
interesting!
"independant reinvention". When I mentioned it on dynarec everyone
said "well, duh, of course we all do that already!" :-)
> myself. Ofcourse, hand-checking all the intricacies of the program
> must be time consuming, isn't it? Especially when dealing with
large
> programs? When you do find differing cpu contexts, what would the
> cause most likely be?
No, it's not time-consuming at all. You just look in your trace
buffer[*] to see which instruction was executed last and look at the
source for just that instruction. Takes you right to it and the
cause is usually obvious. You usually have the 'before and after'
registers to compare with, plus the correct answer from the other
emulator.
I found most often the cause came from me making changes to the
original emulator, trying to tighten up the code or make
optimisations; next most common cause was from differences in the
new dual emulator vs original emulator harness code.
* Trace buffer: you want to play-test your emulator in real time,
so if you print out a run-time trace to a file, the overhead of
one line of file I/O per instruction emulated may slug performance
way too much. Try creating a cyclic buffer in ram and sprintf your
trace info to the buffer. Install an atexit() handler which dumps
the buffer to a file on exit. Much much less runtime overhead.
G
Wow, thank you very much for that lengthy description. I'll be
printing that out and storing it in my filing cabinet :)
There are a couple of issues I think I miss, though.
> That's 90% of the work done now. You should have a rough
translation
> of the original program to C now.
When does the rough translation actually take place? (Sorry if I'm
asking the obvious)
>Now, replace one of the two
> emulator cores in your modified emulator with the new one which
> you've written: the only difference in structure between the two
> should be this: to execute a single instruction, the 'traditional'
> emulator fetches an opcode and dispatches to an appropriate
> procedure to emulate it; the 'translator' version however just
> dispatches on the basis of the address of the instruction, where it
> will find the translated code for that instruction, which it
executes
> and then returns.
>
> Finally it's all in place. Run the dual-cpu emulator, play your
> game, and any time it halts with a difference between the original
> emulator and your new one, examine the last instruction emulated
> and see why they differ. This is an *extremely* fast way to find
> the bugs in your translation. I took mine from first step to
> fully debugged in under a week. As long as you trust the original
> emulator, success is pretty much guaranteed.
I have never heard of this idea and it definately sounds interesting!
I can't wait to get some time to try to implement something like this
myself. Ofcourse, hand-checking all the intricacies of the program
must be time consuming, isn't it? Especially when dealing with large
programs? When you do find differing cpu contexts, what would the
cause most likely be?
>
> So you can see now why there shouldn't be any structural reason why
> things like difficult addressing modes would affect the translation;
> the details of each opcode just don't matter - you're stealing
> them verbatim from an existing emulator!
Thank you very much!
Michael
--- In staticrecompilers@yahoogroups.com, "imikeyiz <staticbt@h...>"
<staticbt@h...> wrote:
> Hi!
>
> I've been interested in static recompilation for a while. I first
got
> the thought of writing one when I wanted to play Super Metroid on
my
> Gamboy Advance, and so set about writing a stat rec to translate
the
> code. I was in _way_ over my head. Nevertheless I persevered. I
first
> tried porting to the native assembly language of the GBA (a RISC
cpu I
> might add) but then decided to convert to C code. Like you said,
this
> is for greater portability.
>
> Unfortunately, I never got through the machine code translation
stage.
> (Obviously a straight translation is not enough, as video accesses
etc
> have to be re-aligned etc.) Three reasons for this were my uni
work;
> my part-time work; and some fundamental concepts I did not
understand
> how to solve.
>
> For example, some types of indexed addressing meant that I couldnt
> translate bits of code because I wouldn't know what they would be
> until runtime.
Maybe we can help you with this. The fact that you would get stuck
on a detail such as addressing modes tells me you must have written
your translator in a different way from me, because the method I used
didn't care at all about the details of how any of the opcodes were
implemented, so give us a overview of how your translator worked
and maybe we can help.
For comparison, here roughly is how I wrote mine:
first, get a working emulator for the game you want to translate.
In the emulator, declare some arrays that are the exact same size
as the ROM. You're going to use these to remember info about each
byte in the ROM. This is a memory pig but who cares, machines are
big now and these arrays will be gone by the time you run the
final translation.
Add code to the emulator so that it when it executes a jump, it
tags the jump destination with a flag. You might use a different
flag for a computed goto as opposed to a direct jump. Depending
on your architecture that info may be useful later.
When you fetch an instruction, tag the address as being an
instruction; when you fetch an operand, tag it as an operand.
Save this data after each run, and reload it at the start of the
next, so that it is cumulative information over a large number
of runs. You're trying to exercise every path through the
program if you can (though if you miss a few a later technique
will catch those mistakes).
The reason for doing this is to help the disassembler that you're
going to write. A good disassembler will do a tree-walk of the
code, following jump destinations to identify code, but it will
often miss some code and treat it as data. Catching those missing
opcodes at runtime is almost essential. The alternative scheme
of just assuming that everything is code is not a good idea; unless
you're very lucky you'll tag some data as code by accident and
for this kind of translation that is a *much* worse mistake to make.
So, at the end of this, you'll know where the opcodes are, and
which of them are jump destinations. If you're VERY lucky your
code will have no computed gotos at all and you'll be able to
simplify the generated code immensely.
(If you're very unlucky, you'll find some bytes that are tagged
as both opcode or data, or opcodes bytes that are written to -
some games have self-modifying code. This can be a bummer.
There are ways around it but the easiest one is to have your
translated code retain a full emulator inside it for use only
in those areas of dynamically generated code)
Next you're going to steal some code from your original working
emulator and add it to the disassembler you've probably already
written by now. For any arbitrary opcode, take the code that
would be emulated for that opcode and enclose it in printf()'s,
and have the disassembler print that code out instead of printing
the assembly source.
Before continuing, you should now do a major overhaul of your
emulator code. Add a completely duplicated set of registers
and RAM, and a second emulator core (they should be identical, just
copy the code of the main dispatch routine), one of which accesses
the duplicate register set and the other accesses the original.
Change the main loop so that it executes each instruction in both
cores in lockstep. Write code to compare all the registers in both
copies and stop and print the registers if they differ. You can if
you like also compare both rams on each instruction, but that's
hellishly expensive and it turns out that unless you have a really
glaring bug in the simple load and store code, most emulator errors
turn up first as differences in registers. So you can skip that or
parameterise it so that you only invoke it if you have a subtle bug
that needs extra debugging help.
That's 90% of the work done now. You should have a rough translation
of the original program to C now. Now, replace one of the two
emulator cores in your modified emulator with the new one which
you've written: the only difference in structure between the two
should be this: to execute a single instruction, the 'traditional'
emulator fetches an opcode and dispatches to an appropriate
procedure to emulate it; the 'translator' version however just
dispatches on the basis of the address of the instruction, where it
will find the translated code for that instruction, which it executes
and then returns.
Finally it's all in place. Run the dual-cpu emulator, play your
game, and any time it halts with a difference between the original
emulator and your new one, examine the last instruction emulated
and see why they differ. This is an *extremely* fast way to find
the bugs in your translation. I took mine from first step to
fully debugged in under a week. As long as you trust the original
emulator, success is pretty much guaranteed.
The very last step is to remove the original emulator code, and
change your translation so that instead of returning after executing
each opcode, it just drops through naturally to the next opcode;
remove any switch labels that cannot be jump targets - and presto,
your translation is complete. A good compiler will optimise the
hell out of your basic blocks now.
So you can see now why there shouldn't be any structural reason why
things like difficult addressing modes would affect the translation;
the details of each opcode just don't matter - you're stealing
them verbatim from an existing emulator!
I'll talk about optimising the generated code some other time, but
chances are you won't need any optimisation. I've found that the
number of instructions executed for any opcode tend to be about the
same or fewer than the opcode-fetch loop overhead; the static
translation should get you at least a 100% increase in speed over
your regular emulator, and possibly many times that.
G
Hi!
I've been interested in static recompilation for a while. I first got
the thought of writing one when I wanted to play Super Metroid on my
Gamboy Advance, and so set about writing a stat rec to translate the
code. I was in _way_ over my head. Nevertheless I persevered. I first
tried porting to the native assembly language of the GBA (a RISC cpu I
might add) but then decided to convert to C code. Like you said, this
is for greater portability.
Unfortunately, I never got through the machine code translation stage.
(Obviously a straight translation is not enough, as video accesses etc
have to be re-aligned etc.) Three reasons for this were my uni work;
my part-time work; and some fundamental concepts I did not understand
how to solve.
For example, some types of indexed addressing meant that I couldnt
translate bits of code because I wouldn't know what they would be
until runtime. As you say:
> My favourite optimiation was to identify basic blocks, and note
> at run-time the values of all the registers on entry. Over many many
> runs of the code (ie playtesting) that info converged into the
> invariants that never changed, effectively converting variables
> int constants which I could then optimise away.
gives me a hint at some methods available, but proves that static
recompilation is more an art than a science.
Anyway, I'm keen to learn as much as I can on the subject, but I don't
think I'll be able to contribute too much at this stage :)
regards
Michael
--- In staticrecompilers@yahoogroups.com, Graham Toal <gtoal@g...> wrote:
> Well folks, we're at that period in the formation of a mailing list
> where we now have enough people to keep a conversation going, so I'll
> kick things off by describing my own rather simple effort at static
> translation.
>
> For many years I've really enjoyed the old vector game tailgunner, but
> I'd always wanted one of my own, and back in the seventies I couldn't
> afford a cabinet of my own, and in the 80's I didn't have the space for
> one, and by the 90's I couldn't find one! Then by the turn of the
> milennium Zonn Moore had written a great emulator for PCs and I
> could play my favourite game again, which I did, often, on PCs
>
> But soon I started getting itchy fingers and thought "wouldn't it
> be nice if I could play tailgunner on a handheld?" so I tried
> porting the emulator to my Palm Pilot. It ran at about 2 frames
> per second! It was clear that no amount of small tweaking would
> ever speed it up significantly, so I came up with the idea of
> removing the interpretation overhead by translating each instruction
> to the same instructions that would be executed by an emulator, but
> placing them inline without a main loop. Basically I'd reinvented
> static binary translation without actually knowing that it already
> existed :-) Though of course I soon discovered about what had been
> done before and started reading up on it. My translation was to C,
> and it's not as daft as it sounds; some folks objected that it wouldn't
> be as efficient as going to machine code, but in fact the 68000
> binary generated was excellent; the compiler forced every virtual
> register into a real register, and optimised away redundand stores
> of flags etc that are saved on every virtual instruction. But the
> big win was that I could retarget to other machines very quickly,
> which I did: I then tried a port to the GBA. Again, too slow, but
> getting there.
>
> Finally I treated myself to a superb Korean handheld called a GP32
> and finally I was able to run the translated tailgunner at full speed.
> What's more, the GP32 has a variable clock, and unlike some of the
> emulators on that system, my tailgunner ran at the slowest clock
> speed, which mean the longest battery life! (something like 8 hours
> of continuous play!) And the gameplay is superb, helped by the
> fact that the GP32 has an 8-way switched joystick which maps
> perfectly to the original tailgunner cabinet.
>
> So I feel pretty good about static translation, even using C as
> an intermediate code. If you have an interpreter to begin with,
> converting it to a static translator is relatively mechanical,
> and you can do the trick of executing both the real interpreted
> emulator plus your compiled binary code in parallel, comparing
> the register state after each instruction, and you can very
> rapidly iron out any bugs in your interpretation. In fact I
> added some rather hairy optimisations to my translation which
> I would never dare have done without the dual-cpu testbed.
>
> Of course this wasn't a general purpose translator; by the time
> it was done it was customised to handling only the tailgunner
> roms, but the tweaks could easily be removed to translate some
> other program for the same CPU, such as Rip Off.
>
> My favourite optimiation was to identify basic blocks, and note
> at run-time the values of all the registers on entry. Over many many
> runs of the code (ie playtesting) that info converged into the
> invariants that never changed, effectively converting variables
> int constants which I could then optimise away.
>
> This is from http://www.gtoal.com/athome/tailgunner/tailgunr-ops.c.html
>
> *lab[register_PC];
>
> L0000:
>
> /* Invariants: register_P = 0x0 register_I = 0x00 register_A =
0x00 */;
> flag_C = acc_a0 = register_A;
> cmp_old = register_B; /* step back cmp flag */
> register_B = cmp_new = 0;
> cmp_old = flag_C = acc_a0 = register_A;
> register_A = cmp_new = 0x0f00;
> cmp_old = register_A; cmp_new = 0x01;
> register_A = flag_C = acc_a0 = 0x0f01;
>
> L0008:
>
> /* Invariants: register_P = 0x0 */;
> ram[register_I = 0x00] = register_A; /* store acc to RAM */
> register_I = ram[0x00]&0xff; /* set new register_I (8 bits) */
> ram[register_I] = register_B; /* store acc */
> register_A = (flag_C = ((cmp_old = acc_a0 = register_A) +
(cmp_new = 0x1))) & 0xFFF; /* add values, save carry */
>
> if (!(flag_C & CARRYBIT)) goto L0008;
> ram[register_I = 0x00] = register_A; /* store acc to RAM */
> cmp_old = flag_C = acc_a0 = register_A;
> register_A = cmp_new = 0;
> cmp_new = 0x1; cmp_old = acc_a0 = register_A; register_A =
flag_C = 0x0001;
>
> reset_coin_counter(1);
> goto L0904;
>
> L0026:
>
> /* Invariants: register_P = 0x8 register_I = 0x82 register_A =
0x01 */;
> cmp_old = flag_C = acc_a0 = register_A;
> register_A = cmp_new = 0x0900;
> cmp_old = register_A; cmp_new = 0x1c;
> register_A = flag_C = acc_a0 = 0x091c;
> cmp_old = acc_a0 = register_A;
> register_A = (flag_C = (register_A + (cmp_new = ram[0x82]))) &
0xFFF;
> set_watchdog();
> cmp_old = register_A; register_A = cmp_new = rom[register_A];
/* new acc value */
> ram[register_I] = register_A; /* store acc */
>
> etc...
>
> Note that not every instruction has a label in front of it. This is
> because I used both a static source analysis and runtime tracing to
> identify jump targets, and eliminated any labels from addresses that
> were never jumped to. This gives the C compiler a great deal more
> information it can use to optimise the generated code - for instance,
> see how many times above cmp_old is set and never accessed before it
> is set again. All of those saves, and often the whole calculation of
> the expression on the RHS, can be thrown away completely. This is
> very had to do in a dynarec in comparison and impossible in a
> traditional emulator.
>
> G
Well folks, we're at that period in the formation of a mailing list
where we now have enough people to keep a conversation going, so I'll
kick things off by describing my own rather simple effort at static
translation.
For many years I've really enjoyed the old vector game tailgunner, but
I'd always wanted one of my own, and back in the seventies I couldn't
afford a cabinet of my own, and in the 80's I didn't have the space for
one, and by the 90's I couldn't find one! Then by the turn of the
milennium Zonn Moore had written a great emulator for PCs and I
could play my favourite game again, which I did, often, on PCs
But soon I started getting itchy fingers and thought "wouldn't it
be nice if I could play tailgunner on a handheld?" so I tried
porting the emulator to my Palm Pilot. It ran at about 2 frames
per second! It was clear that no amount of small tweaking would
ever speed it up significantly, so I came up with the idea of
removing the interpretation overhead by translating each instruction
to the same instructions that would be executed by an emulator, but
placing them inline without a main loop. Basically I'd reinvented
static binary translation without actually knowing that it already
existed :-) Though of course I soon discovered about what had been
done before and started reading up on it. My translation was to C,
and it's not as daft as it sounds; some folks objected that it wouldn't
be as efficient as going to machine code, but in fact the 68000
binary generated was excellent; the compiler forced every virtual
register into a real register, and optimised away redundand stores
of flags etc that are saved on every virtual instruction. But the
big win was that I could retarget to other machines very quickly,
which I did: I then tried a port to the GBA. Again, too slow, but
getting there.
Finally I treated myself to a superb Korean handheld called a GP32
and finally I was able to run the translated tailgunner at full speed.
What's more, the GP32 has a variable clock, and unlike some of the
emulators on that system, my tailgunner ran at the slowest clock
speed, which mean the longest battery life! (something like 8 hours
of continuous play!) And the gameplay is superb, helped by the
fact that the GP32 has an 8-way switched joystick which maps
perfectly to the original tailgunner cabinet.
So I feel pretty good about static translation, even using C as
an intermediate code. If you have an interpreter to begin with,
converting it to a static translator is relatively mechanical,
and you can do the trick of executing both the real interpreted
emulator plus your compiled binary code in parallel, comparing
the register state after each instruction, and you can very
rapidly iron out any bugs in your interpretation. In fact I
added some rather hairy optimisations to my translation which
I would never dare have done without the dual-cpu testbed.
Of course this wasn't a general purpose translator; by the time
it was done it was customised to handling only the tailgunner
roms, but the tweaks could easily be removed to translate some
other program for the same CPU, such as Rip Off.
My favourite optimiation was to identify basic blocks, and note
at run-time the values of all the registers on entry. Over many many
runs of the code (ie playtesting) that info converged into the
invariants that never changed, effectively converting variables
int constants which I could then optimise away.
This is from http://www.gtoal.com/athome/tailgunner/tailgunr-ops.c.html
*lab[register_PC];
L0000:
/* Invariants: register_P = 0x0 register_I = 0x00 register_A = 0x00 */;
flag_C = acc_a0 = register_A;
cmp_old = register_B; /* step back cmp flag */
register_B = cmp_new = 0;
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0f00;
cmp_old = register_A; cmp_new = 0x01;
register_A = flag_C = acc_a0 = 0x0f01;
L0008:
/* Invariants: register_P = 0x0 */;
ram[register_I = 0x00] = register_A; /* store acc to RAM */
register_I = ram[0x00]&0xff; /* set new register_I (8 bits) */
ram[register_I] = register_B; /* store acc */
register_A = (flag_C = ((cmp_old = acc_a0 = register_A) + (cmp_new =
0x1))) & 0xFFF; /* add values, save carry */
if (!(flag_C & CARRYBIT)) goto L0008;
ram[register_I = 0x00] = register_A; /* store acc to RAM */
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0;
cmp_new = 0x1; cmp_old = acc_a0 = register_A; register_A = flag_C =
0x0001;
reset_coin_counter(1);
goto L0904;
L0026:
/* Invariants: register_P = 0x8 register_I = 0x82 register_A = 0x01 */;
cmp_old = flag_C = acc_a0 = register_A;
register_A = cmp_new = 0x0900;
cmp_old = register_A; cmp_new = 0x1c;
register_A = flag_C = acc_a0 = 0x091c;
cmp_old = acc_a0 = register_A;
register_A = (flag_C = (register_A + (cmp_new = ram[0x82]))) & 0xFFF;
set_watchdog();
cmp_old = register_A; register_A = cmp_new = rom[register_A]; /* new acc
value */
ram[register_I] = register_A; /* store acc */
etc...
Note that not every instruction has a label in front of it. This is
because I used both a static source analysis and runtime tracing to
identify jump targets, and eliminated any labels from addresses that
were never jumped to. This gives the C compiler a great deal more
information it can use to optimise the generated code - for instance,
see how many times above cmp_old is set and never accessed before it
is set again. All of those saves, and often the whole calculation of
the expression on the RHS, can be thrown away completely. This is
very had to do in a dynarec in comparison and impossible in a
traditional emulator.
G