Search the web
Sign In
New User? Sign Up
staticrecompilers · Static Binary Translation
? Already a member? Sign in to Yahoo!

Yahoo! Groups Tips

Did you know...
Real people. Real stories. See how Yahoo! Groups impacts members worldwide.

Best of Y! Groups

   Check them out and nominate your group.
Having problems with message search? Fill out this form to ensure your group is one of the first to be migrated to the new message search system.

Messages

  Messages Help
Advanced
Messages 268 - 297 of 326   Newest  |  < Newer  |  Older >  |  Oldest
Messages: Show Message Summaries   (Group by Topic) Sort by Date v  
#297 From: "jankaspermartinsen" <jankaspermartinsen@...>
Date: Mon Nov 6, 2006 8:16 am
Subject: static compiled for j2me (java for mobile)
jankaspermar...
Offline Offline
Send Email Send Email
 
You might be interested, I made a static compiled version of the
phoenix arcade game (the only emulator I have ever written). It can be
run on j2me enabled devices, which is compatible with nokia s60. It
still slow (on old devices), but I belive this has something to do
with graphic routines on the phone(but im not sure). More details, see:

http://kaspermartinsen.googlepages.com/sbt

-jkm

#296 From: "Graham Toal" <gtoal@...>
Date: Thu Sep 15, 2005 3:29 pm
Subject: C to Java using MIPS as an intermediate code!
graham_toal
Offline Offline
Send Email Send Email
 
Found the weirdest static translator today, thought y'all might
be interested...

http://www.xwt.org/mips2java/
http://www.thisiscool.com/mips2java.htm

G

#295 From: Graham Toal <gtoal@...>
Date: Tue Mar 8, 2005 2:37 am
Subject: 1991 paper on SBT ...
graham_toal
Offline Offline
Send Email Send Email
 
Look what I found today...  written in 1991...  this outlines the same basic
method we've been following (though it doesn't have any of the advanced
tweaks we've been talking about that optimise beyond the first big win
you get from static translating at all).

   
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol21/issue8/spe046c\
m.pdf

G

#294 From: Neil Bradley <nb@...>
Date: Tue Aug 24, 2004 6:02 pm
Subject: Re: More Orion progress
nb@...
Send Email Send Email
 
> There is very little left to do and not much that should
> have a significant effect on run-time speed.  The only things
> I see are a few trivial math ops (like a sequence of 4
> consecutive shifts which could be coalesced) plus the
> hack I described earlier about putting a {} block around
> the use of temp variables, to let the compiler know that
> they don't need to be remembered after use.

That's a good idea. Yes, I've noticed the:

	 add a, a
	 add a, a

Can easily be:

	 a <<= 2;

And this:
	 rlca
	 rlca
	 rlca
	 rlca

Can easily be:

	 a = (a >> 4) | (a << 4);

The above is compiled in 3 x86 instructions, whereas the rlca above is 16.

Plus, There are sequences like:

	 ld a, (addr)
	 and a, a
	 jp z, label

It generates quite a bit of code that could be represented as:

	 if (mspacman_memory[addr] == 0)
		 goto label;

Right now the sequence recompiles into something like 12 instructions.
This would drop it from those 12 to 3, and would also allow me to
eliminate a data store, too.

There are definitely more bones buried in this one. Now I need to just
figure out what would give me the best benefit and try it out.

I've also seen some boneheaded things that the code is currently doing,
like:

       sg_sZ80Context.z80pc = sg_u8mspacman_memory[sg_sZ80Context.z80SP++];
       u8Temp = sg_u8mspacman_memory[sg_sZ80Context.z80SP++];
       sg_sZ80Context.z80pc |= ((UINT32) u8Temp << 8);

Should be:

       sg_sZ80Context.z80pc = sg_u8mspacman_memory[sg_sZ80Context.z80SP++];
       sg_sZ80Context.z80pc |= ((UINT32)
sg_u8mspacman_memory[sg_sZ80Context.z80SP++]);

It reduces it a couple of instructions. Plus:

       sg_u8mspacman_memory[0x4c80] = sg_sZ80Context.z80L;
       sg_u8mspacman_memory[0x4c81] = sg_sZ80Context.z80H;

Can be changed to (provided it's a little endian machine):

       *((UINT16) &sg_u8mspacman_memory[0x4c80]) = sg_sZ80Context.z80HL;

> The only significant (possibly) win remaining that I can see
> is probably not relevant on a PC platform, which is forcing
> the variables into registers.  In order to enable this
> unfortunately you need to generate code to force H and L
> into HL and to break them apart again, for all pairs of
> registers.

Yeah, I intend on revisiting that issue at the tail end of the
optimization experiment. The idea is to get as much cruft out of the way
as possible so that the compiler will most likely generate better code in
the first place.

> I believe the way the z80 regs are used, it'll be one way for
> some (ag A and F) and the other way for others (eg HL).

I'm not so sure about that. In the case of HL, H and L are used separately
quite a lot, and HL is used together a lot, too.

> It might be possible to use the dead-code optimising structure
> to do this in a really easy way - just always plant the code
> to do the merging or splitting, and set a dependency such as
> "HL on output depends on H and L as input", then most of the
> time the code won't actually get as far as the output file.

The code already does a "dead store" analysis that are based on the
registers. For example, H and L are represented as individual registers in
the bitmask, but HL is an OR of H and L.

> For the X86 architecture I think it's as good as it can get -
> if you spend some months on small tweaks you might get
> another 5% at best.  Diminishing returns now.

I think I can do better. ;-)

> The flag data was interesting.  Such a huge win from the
> timing coalescing, and a relatively small (but still worthwhile)
> win from the flag removal.  It will be interesting to see
> how that compares when the 6502 is added.  I think the flag
> benefit will be more significant, if the opcode covers are
> breaking out the individual flag bits.

That's exactly how it's coded. Each opcode cover handles its own flag
handling.

> (Talking of which - Neil, now that you have good control over
> what flag bits are used, have you given any thought to whether
> the combined flag register is still the best architecture?  It
> probably is, but the underlying assumptions have changed so
> it is worth re-examining the conclusions just in case.

In short, yes, having a combined flag architecture is better in all
aspects than individual separated registers for the following reasons:

1) The effort/time to compute even a single flag bit is 3 instructions
tops, and in most cases where they're used (inc, or, and, etc...) it can
set *ALL* of them by a 3 instruction lookup. Not even a single flag
computation can be faster than a lookup. My lookup can beat up your flag
computation. ;-)

2) The extra stores required (even temporary) causes register allocation
shifts that of course are not good on few register architectures like the
x86

Then there are the pointless, negligible things like less memory used and
less cache thrash, however insignificant. ;-)

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#293 From: Graham Toal <gtoal@...>
Date: Tue Aug 24, 2004 2:26 pm
Subject: Re: More Orion progress
graham_toal
Offline Offline
Send Email Send Email
 
Great work, Neil - it's looking very nice.  As usual I've
unpacked it for folks who have problems with M$ files:

  http://www.gtoal.com/sbt/z80/nb/mspacman.c.optimized.html

There is very little left to do and not much that should
have a significant effect on run-time speed.  The only things
I see are a few trivial math ops (like a sequence of 4
consecutive shifts which could be coalesced) plus the
hack I described earlier about putting a {} block around
the use of temp variables, to let the compiler know that
they don't need to be remembered after use.

The only significant (possibly) win remaining that I can see
is probably not relevant on a PC platform, which is forcing
the variables into registers.  In order to enable this
unfortunately you need to generate code to force H and L
into HL and to break them apart again, for all pairs of
registers.

Also to do it well you have to profile the code and determine
if it is better to keep a pair as the 16bit variable by default
and split when necessary then recombine, or vice-versa.

I believe the way the z80 regs are used, it'll be one way for
some (ag A and F) and the other way for others (eg HL).

It might be possible to use the dead-code optimising structure
to do this in a really easy way - just always plant the code
to do the merging or splitting, and set a dependency such as
"HL on output depends on H and L as input", then most of the
time the code won't actually get as far as the output file.

But personally I wouldn't take it that far now unless I was
specifically targetting a slow ARM processor such as the GBA.

For the X86 architecture I think it's as good as it can get -
if you spend some months on small tweaks you might get
another 5% at best.  Diminishing returns now.


The flag data was interesting.  Such a huge win from the
timing coalescing, and a relatively small (but still worthwhile)
win from the flag removal.  It will be interesting to see
how that compares when the 6502 is added.  I think the flag
benefit will be more significant, if the opcode covers are
breaking out the individual flag bits.

(Talking of which - Neil, now that you have good control over
what flag bits are used, have you given any thought to whether
the combined flag register is still the best architecture?  It
probably is, but the underlying assumptions have changed so
it is worth re-examining the conclusions just in case.  The
main assumption being that since you need to calculate all
the flags, you might as well do them all in parallel.  Except
now you don't, you may only need one or two of the flag bits
from any given calculation...)

G

#292 From: Neil Bradley <nb@...>
Date: Tue Aug 24, 2004 9:14 am
Subject: More Orion progress
nb@...
Send Email Send Email
 
I finally got all of the flag elimination on a per instruction basis
finished up and debugged. Here's the results with full compiler
optimization turned on:

http://www.synthcom.com/~nb/mspacmandiffs.zip

Source code for all 3 versions are contained within the ZIP for your
inspection/perusal. Diff them to get the deltas. With no recompiler
optimizations it's a 1.9MB image, with timing coalescing (.noflags) it's
1.3MB, and with full optimization turned on it's 1.14MB.  All compiler
tests were done under MSVC 6.0 and are optimize for speed, Pentium Pro
output. Some tests with -nogfx:

Fully optimized version runs @ 24500FPS (compiles in ~2 minutes)
With no flag optimizations it runs @ 23000FPS (compiles in ~15 minutes)
With no coalescing or flag optimizations @ 19500FPS  (compiles in ~15 minutes)

"fully optimized" Means timing coalescing and unused flag calculation
removal optimizations enabled

"no flag optimizations" Means timing coalescing recompiler optimization
only

"no coalescing and no flag optimizations" Means all recompiler
optimizations disabled.

Interesting to note the drastic improvement in compiler speed with and
without the flag optimizations. Note the flag optimization routines are
completely core, so any other target that's written for Orion will inherit
the capability.

With full optimization turned on, the profiler is telling me that it's
spending 45% of the execution time (with -nogfx) in the memory region
handlers. That's really, really stinkin' tight code! Not much more blood
to be taken out of this turnip.

I think it's probably time to get a release out. I have the MSVC version
ready to go, but I don't have any SDL experience under UNIX. Anyone want
to take a stab at it?

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#291 From: Neil Bradley <nb@...>
Date: Sat Aug 7, 2004 12:05 am
Subject: Re: Orion v0.10 released
nb@...
Send Email Send Email
 
> Grrr.... I had 10 windows open in Gnu Screen and didn't spot that I
> hadn't hit send on this mail from last week!  I got back from my trip
> last night and didn't have time while I was gone to do anything with
> mspac.  Meanwhile, might as well send this now as delete it...

Hehehe... what speed of machine is this one?

> program into two parts - tcc barfs on source files that are too
> large; in fact even removing comments sometimes fixes this problem
> with tcc)

Whoops. ;-(

Have you managed to get any of the SDL porting done?

BTW, I'm leaving for Taipei tomorrow - won't be back until Wednesday at
the earliest...

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#290 From: Graham Toal <gtoal@...>
Date: Fri Aug 6, 2004 11:48 pm
Subject: Re: Orion v0.10 released
graham_toal
Offline Offline
Send Email Send Email
 
Grrr.... I had 10 windows open in Gnu Screen and didn't spot that I
hadn't hit send on this mail from last week!  I got back from my trip
last night and didn't have time while I was gone to do anything with
mspac.  Meanwhile, might as well send this now as delete it...

G

------
Just FYI, tcc compiles mspac in 1/3sec vs nearly 3 minutes for gcc - though
obviously gcc generates much faster code.  But definitely tempting for
rapid turnaround development.  (However... I did have to split the
program into two parts - tcc barfs on source files that are too
large; in fact even removing comments sometimes fixes this problem
with tcc)

I tell you what though, I *guarantee* that TCC is fast enough to use
as a dynarec with C as a pseudo-RTL representation.  Even if that
1/3rd sec all came at once, it would barely be noticable.  Can you
imagine how much overhead it would really contribute if spread over
the duration of a game (with most of it being done via tree-walk
exploration before the game starts)?


time tcc -o mspac mspac.c mspac2.c MSPac.c z80stub.c
0.330u 0.031s 0:00.35 102.8%    0+0k 0+0io 136pf+0w
pizzabox nb/recompV0.10> ./mspac
99 Frames
632 Frames
658 Frames
942 Frames
913 Frames
485 Frames
416 Frames
775 Frames

time gcc -O1 -o mspac mspac.c mspac2.c MSPac.c z80stub.c
134.700u 1.423s 2:40.32 84.9%   0+0k 0+0io 4757pf+0w
pizzabox nb/recompV0.10> ./mspac
2976 Frames
12133 Frames
12129 Frames
12132 Frames
12112 Frames
12137 Frames

#289 From: Neil Bradley <nb@...>
Date: Sat Jul 31, 2004 8:20 pm
Subject: Orion v0.10 released
nb@...
Send Email Send Email
 
It now runs Ms. Pacman! Graham, feel free to work your magic. ;-)

http://www.synthcom.com/~nb/orion/CodeDrops/recompV0.10.tar.zip
http://www.synthcom.com/~nb/orion/CodeDrops/recompV0.10.zip

Release notes:

V0.10:

* Moved sg_u8INTTiming[] to z80stub.c/.h to reduce code duplication
* Added timing check to Space Invaders code (to ensure the emulation and
recompilation are lock stepped)
* Removed label emission on entry points (not required)
* Added RST/DJNZ opcodes in the Z80 to prevent "unused label" warnings
* Added most of the remainder of Z80 opcodes - it now runs MS Pacman!
* Updates to the mspacman script to cover entry points that weren't
covered before

To make and run mspacman under UNIX:

make mspacman
./mspacman

I'm now off to work on a graphics engine so we can actually see what it's
doing! I do know that it's running the game, as I've trapped some of the
writes to the video section (as well as the sprite section) and it seems
to be working as expected.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#288 From: "ldesnogu" <laurent.desnogues@...>
Date: Fri Jul 30, 2004 10:13 am
Subject: Re: MS Pacman recompiled!
ldesnogu
Offline Offline
Send Email Send Email
 
--- In staticrecompilers@yahoogroups.com, Graham Toal <gtoal@g...> wrote:
>
> One trivial tweak I'm not going to mention here ;-) is that when you use
> temp variables in an opcode cover (and I think in some of the cases you
> may not even need them at all), declare them as 'register' and in a
> local {} block, so that the compiler knows at the end of the block that
> the last value in it doesn't need to be written back, that'll allow the
> compiler to keep it in a register entirely or do away with it altogether
> if it can.

Well this is a very compiler specific tweak I guess.
Most compilers just ignore the register keyword when
high-level of optimizations are used.

I guess this needs testing...  Graham I guess you
already tested it, so contradict me :-)


Laurent

#287 From: Neil Bradley <nb@...>
Date: Fri Jul 30, 2004 8:50 am
Subject: Ms Pacman lives!
nb@...
Send Email Send Email
 
I finally got all of the niggling bugs out of the recompiled output (and
the emulator, too!):

http://www.synthcom.com/~nb/mspacman.zip

Performance is as follows on my 2.4Ghz 512K cache Pentium 4, 512MB of RAM,
under Windows XP:

C Emulation 	 4900FPS
x86 Emulation 	 15000FPS
Unoptimized recomp  18000FPS
Optimized recompilation  28000FPS

As an interesting aside, I've also compiled it under gcc on my 2.0Ghz 512K
cache Pentium 4 box running FreeBSD, and I'm getting 16000FPS on the
optimized (O3) recompilation.

I'm running it through the ARM compiler now (ARM SDT 2.51). We'll see how
well it optimizes. ;-)

Next up is the graphics engine. Then we'll be able to see the fruits of
the labor...

I'll do an Orion release in the next day or two, after I have a chance to
clean everything up.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#286 From: Neil Bradley <nb@...>
Date: Wed Jul 28, 2004 6:04 pm
Subject: Re: Re: MS Pacman recompiled!
nb@...
Send Email Send Email
 
> > with the PC @ 0x125. Without a case 0x125, how will the xxxx_exec()
> > procedure be able to return to that instruction?
> Sure, I understand that - this is what you need *at the moment*
> where you are dumping timing tests on every opcode.  But when you
> conflate several cycle subtractions into one, you're implicitly
> acknowleging that you won't allow interrupts at those instructions
> which don't do the test.  You were saying that removing the unused
> case statements would somehow mess up your change to merge the
> cycle counts, and I was saying that I don't see why!

No, it was an incomplete statement. What I meant to say is that I can't
remove any of the case statements until I have the timing coalescing in
place.

> You haven't found tcc yet?  Boy are you in for a treat!
>   http://fabrice.bellard.free.fr/tcc/
> Try it out, and also note the dynamic code generation library
> available.  (Time to run up your unix system.  Maybe it works
> under cygwin but I've never tried it myself)

I'll check it out. I'm a FreeBSD guy, so I should be able to run it with
no problems.

> > Java? That's sick, man.
> Yeah, well barf on this :-) ...
>           http://web.utanet.at/nkehrer/tailgunner/Tailgunner.html

Already did. ;-) That's as cool as it is sick!

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#285 From: "Graham Toal" <gtoal@...>
Date: Wed Jul 28, 2004 3:14 pm
Subject: Re: MS Pacman recompiled!
graham_toal
Offline Offline
Send Email Send Email
 
--- In staticrecompilers@yahoogroups.com, Neil Bradley <nb@s...>
wrote:
> > I'm stumped!  I can't guess any reason why the timing calculation
> > coalescing would stop you removing unused case statement labels.
>
> Let's say there are 3 case statements that all contain sequential
> nonbranching instructions:
>
>  case 0x123:
> 	 ....
>  case 0x124:
> 	 ....
>  case 0x125:
> 	 ....
>
> Let's also say that on 0x124, my time slice runs out and returns
back to
> the caller. Now the caller reenters the execution (switch/case
statement)
> with the PC @ 0x125. Without a case 0x125, how will the xxxx_exec()
> procedure be able to return to that instruction?

Sure, I understand that - this is what you need *at the moment*
where you are dumping timing tests on every opcode.  But when you
conflate several cycle subtractions into one, you're implicitly
acknowleging that you won't allow interrupts at those instructions
which don't do the test.  You were saying that removing the unused
case statements would somehow mess up your change to merge the
cycle counts, and I was saying that I don't see why!


> tcc? You mean the ARM SDT 2.51's compiler?

You haven't found tcc yet?  Boy are you in for a treat!

   http://fabrice.bellard.free.fr/tcc/

Try it out, and also note the dynamic code generation library
available.  (Time to run up your unix system.  Maybe it works
under cygwin but I've never tried it myself)

> > (when it's not so late in the evening, remind me to talk to you
about
> > automatically splitting the mail procedure into byte-sized
chunks in
> > order to allow compilation either for the 68K or under Java...)
>
> Java? That's sick, man.

Yeah, well barf on this :-) ...

           http://web.utanet.at/nkehrer/tailgunner/Tailgunner.html

G

#284 From: Neil Bradley <nb@...>
Date: Wed Jul 28, 2004 7:09 am
Subject: Re: MS Pacman recompiled!
nb@...
Send Email Send Email
 
> > To be able to do this, I'll first have to have the timing optimization in
> > place, otherwise it won't be able to jump to
> Are you aware that in almost every post you make, there is at least
> one line that is truncated in mid-sentence?

Yup.

> I suspect editor/mailer
> problems.

Nope. Just very, very distracted by:

1. Wife who picks inopportune times to tell me about things
2. My 3 year old wanders in
3. My 3 year old hurts herself in the other room
4. My 5 month old starts crying and needs consolation
5. The phone rings (it's always for my wife but I always have to answer
it)
6. My cell phone (usually in the other room) rings
7. The cats get in to something they shouldn't (eating packing peanuts)
8. Digitally Imported streams a crappy trance track
9. My propensity to try to handle 5 things at once

So you see, when I stop mid sentence,

>  Usually I can intuit what you would have said but this time
> I'm stumped!  I can't guess any reason why the timing calculation
> coalescing would stop you removing unused case statement labels.

Let's say there are 3 case statements that all contain sequential
nonbranching instructions:

	 case 0x123:
		 ....
	 case 0x124:
		 ....
	 case 0x125:
		 ....

Let's also say that on 0x124, my time slice runs out and returns back to
the caller. Now the caller reenters the execution (switch/case statement)
with the PC @ 0x125. Without a case 0x125, how will the xxxx_exec()
procedure be able to return to that instruction?

> > Yeah. There are a lot of tweaks I'll be making. I'm just trying to get the
> > thing to work first... At least I have all the interrupt/halt handling
> > working properly.
> I'll look forward to checking that out :-)  The 'ei' is a bastard as
> it forces execution of the next instruction before it takes the
> pending interrupt.  So if you in-line it, you have to be very careful
> in the case that it is a jump/jsr/any sort of flow control op.  But
> you know that of course.

Yeah. I can't recall, though, if the Z80 actually latches interrupts. My C
Z80 emulator doesn't, but the x86 version does. ;-| Interestingly enough,
the x86 code won't run some of the MCR games as it sits, but the C code
runs everything!

> > I'm kinda ignoring it since you have it covered, but I do have the ARM SDT
> > 2.51 and MSVC compilers. Between us both, we should have it well covered.
> One of these days I'l install the Intel C compiler and see if it
> lives up to the hype.  Another one I would *love* to see stats on for
> this code (once we've done all the source optimising we can) is "tcc"
> which I still have in mind as the basis for a dynarec version of this
> stuff!

tcc? You mean the ARM SDT 2.51's compiler?

> > After I get Ms Pacman running in code, I'm going to focus some time on a
> > generic sprite/tiling graphics, controller, and sound library so we can
> > actually see the fruits of our labor. BSD License again, of course.
> Something portable please, like SDL or Allegro!

SDL, definitely. I wouldn't consider Allegro. There is a shim, however,
that needs to go between SDL and the emulation code to decode the graphics
and handle sprites and whatnot in an intelligent fashion.

> The harness really
> doesn't need to be part of the project.  In fact even the surrounding
> logic is overkill, the translator is probably better served just
> generating the main emulation function and nothing else, with the
> rest being handled as include files?

No need, really. In fact, it'd just get in the way when it's more than one
CPU with bizarre interaction (Galaga, for example, turns on and off
periodic NMIs to the slave CPUs). I'm planning on the "write your own
execution loop" approach. It's not like they're lengthy or difficult.

> (when it's not so late in the evening, remind me to talk to you about
> automatically splitting the mail procedure into byte-sized chunks in
> order to allow compilation either for the 68K or under Java...)

Java? That's sick, man.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#283 From: Graham Toal <gtoal@...>
Date: Wed Jul 28, 2004 7:38 am
Subject: Re: MS Pacman recompiled!
graham_toal
Offline Offline
Send Email Send Email
 
> > And if not, you'll *definitely* get better register usage
> > due to larger basic blocks if you remove redundant case
> > labels.
>
> To be able to do this, I'll first have to have the timing optimization in
> place, otherwise it won't be able to jump to

Are you aware that in almost every post you make, there is at least
one line that is truncated in mid-sentence?  I suspect editor/mailer
problems.  Usually I can intuit what you would have said but this time
I'm stumped!  I can't guess any reason why the timing calculation
coalescing would stop you removing unused case statement labels.

> Yeah. There are a lot of tweaks I'll be making. I'm just trying to get the
> thing to work first... At least I have all the interrupt/halt handling
> working properly.

I'll look forward to checking that out :-)  The 'ei' is a bastard as
it forces execution of the next instruction before it takes the
pending interrupt.  So if you in-line it, you have to be very careful
in the case that it is a jump/jsr/any sort of flow control op.  But
you know that of course.

> I've got an ARM CPU based board (it's arcade related but I can't divulge
> information on its specifics) that I'll be porting some of the beefier
> games to. It'll be good to test. It's good that you're working with GCC.
> I'm kinda ignoring it since you have it covered, but I do have the ARM SDT
> 2.51 and MSVC compilers. Between us both, we should have it well covered.

One of these days I'l install the Intel C compiler and see if it
lives up to the hype.  Another one I would *love* to see stats on for
this code (once we've done all the source optimising we can) is "tcc"
which I still have in mind as the basis for a dynarec version of this
stuff!

> After I get Ms Pacman running in code, I'm going to focus some time on a
> generic sprite/tiling graphics, controller, and sound library so we can
> actually see the fruits of our labor. BSD License again, of course.

Something portable please, like SDL or Allegro!  The harness really
doesn't need to be part of the project.  In fact even the surrounding
logic is overkill, the translator is probably better served just
generating the main emulation function and nothing else, with the
rest being handled as include files?

(when it's not so late in the evening, remind me to talk to you about
automatically splitting the mail procedure into byte-sized chunks in
order to allow compilation either for the 68K or under Java...)

G

#282 From: Neil Bradley <nb@...>
Date: Wed Jul 28, 2004 4:33 am
Subject: Re: MS Pacman recompiled!
nb@...
Send Email Send Email
 
> You mentioned in an earlier post that it takes 45 minutes to
> compile.

With max optimization on (optimize for speed and Pentium Pro), yeah. It's
a bit over an hour now. ;-(

> Try removing the case labels and see how long it
> takes.  I have a suspicion it may be doing an unnecessary
> amount of hard work trying to determine flow control.  (Unnecessary
> in the sense that you've already determined the flow control
> in your translator and will have made all the significant
> optimisations already at the source code level (eventually))

Yeah, still lots of things left to try. Good point.

> And if not, you'll *definitely* get better register usage
> due to larger basic blocks if you remove redundant case
> labels.

To be able to do this, I'll first have to have the timing optimization in
place, otherwise it won't be able to jump to

> > Almost 2 megabytes of source. ;-( Lots of cleanup I need to do, too. I'm
> > currently hooking up a platform extension to debug it. Should be, uh,
> > fun... ;-(
> Neil is going to hate this I'm sure, but I zapped his source through
> a couple of quick edit macros to remove the case labels and the
> timing, and shorten a couple of the names just so I could read it
> more easily ;-)

Oh, no problem at all. I kinda dig the webized look. Though, for some
reason the script moved the disassembled parts to the wrong set of
instructions (the ones right before it).

> I'll mail you offline with any small tweaks I spot, no point in cluttering
> up the list with trivial details, especially since there's a good chance
> that you'll have already fixed them before I even mail you :)

Yeah. There are a lot of tweaks I'll be making. I'm just trying to get the
thing to work first... At least I have all the interrupt/halt handling
working properly.

> One trivial tweak I'm not going to mention here ;-) is that when you use
> temp variables in an opcode cover (and I think in some of the cases you
> may not even need them at all), declare them as 'register' and in a
> local {} block, so that the compiler knows at the end of the block that
> the last value in it doesn't need to be written back, that'll allow the
> compiler to keep it in a register entirely or do away with it altogether
> if it can.

Good advice! I'll add that to the ever growing todo list.

> My only question about the code is that you do make a major simplifcation
> by aliasing the pairs of one byte registers with their two-byte equivalents,
> but I'm worried that the aliasing will break the compiler's register
> optimisation strategy.

I set them up as #defines in the recompiler for this exact reason. I
observed that MSVC is doing a really good job figuring out 16 bit vs. 8
bit halfs during its optimizations.

> You certainly will *not* be able to force the
> variables into real registers throughout the entire procedure using this
> scheme, which doesn't hurt on the x86 but may be painful when you compile
> under 68000 or ARM or other machines with enough real registers to map
> all the target registers.

I've got an ARM CPU based board (it's arcade related but I can't divulge
information on its specifics) that I'll be porting some of the beefier
games to. It'll be good to test. It's good that you're working with GCC.
I'm kinda ignoring it since you have it covered, but I do have the ARM SDT
2.51 and MSVC compilers. Between us both, we should have it well covered.

After I get Ms Pacman running in code, I'm going to focus some time on a
generic sprite/tiling graphics, controller, and sound library so we can
actually see the fruits of our labor. BSD License again, of course.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#281 From: Graham Toal <gtoal@...>
Date: Wed Jul 28, 2004 4:25 am
Subject: Re: MS Pacman recompiled!
graham_toal
Offline Offline
Send Email Send Email
 
You mentioned in an earlier post that it takes 45 minutes to
compile.  Try removing the case labels and see how long it
takes.  I have a suspicion it may be doing an unnecessary
amount of hard work trying to determine flow control.  (Unnecessary
in the sense that you've already determined the flow control
in your translator and will have made all the significant
optimisations already at the source code level (eventually))

And if not, you'll *definitely* get better register usage
due to larger basic blocks if you remove redundant case
labels.  If you give the compiler some help in this way,
you may be able to get just as good code from -O1 as
from -O3, in a fraction of the time (and with the added
benefit of being able to see the interspersed source/asm
listing, at least under GCC)

> recompile. The "final" version is here:
>
> http://www.synthcom.com/~nb/mspacman.zip
>
> Almost 2 megabytes of source. ;-( Lots of cleanup I need to do, too. I'm
> currently hooking up a platform extension to debug it. Should be, uh,
> fun... ;-(

Neil is going to hate this I'm sure, but I zapped his source through
a couple of quick edit macros to remove the case labels and the
timing, and shorten a couple of the names just so I could read it
more easily ;-)  I know this has broken it, but it makes it easier
to eyeball the code on a web page.  Here it is if anyone wants
to have a quick look at what he's done and doesn't want to pull
down the zip file first:

    http://www.gtoal.com/sbt/z80/nb/mspacman.c.html

(The unmangled version is here: http://www.gtoal.com/sbt/z80/nb/mspacman.c )

> Looks like the Z80 emitter could've been better written in a few places.
> After I get Ms Pacman going, I'll go back and clean the Z80 emitter up
> (and post another version of Orion).

I'll mail you offline with any small tweaks I spot, no point in cluttering
up the list with trivial details, especially since there's a good chance
that you'll have already fixed them before I even mail you :)

One trivial tweak I'm not going to mention here ;-) is that when you use
temp variables in an opcode cover (and I think in some of the cases you
may not even need them at all), declare them as 'register' and in a
local {} block, so that the compiler knows at the end of the block that
the last value in it doesn't need to be written back, that'll allow the
compiler to keep it in a register entirely or do away with it altogether
if it can.

eg change:
L0010:
			 u8Temp = sg_sZ80Context.z80L;
			 u32Temp = sg_sZ80Context.z80A + u8Temp;
			 sg_sZ80Context.z80F = g_u8SignZeroTable[(UINT8)
u32Temp] | ((u32Temp >> 8) & Z80_FLAG_C) | ((sg_sZ80Context.z80A ^ u32Temp ^
u8Temp) & Z80_FLAG_H) | (((u8Temp ^ sg_sZ80Context.z80A ^ 0x80) & (u8Temp ^
u32Temp) & 0x80) >> 5);
			 sg_sZ80Context.z80A = (UINT8) u32Temp;
			 if ((s32CyclesRemaining -= 4) < 0) { goto Exit0011;}

to

L0010:
                         {
			 register UINT8 u8Temp;
			 register UINT32 u32Temp;
			 u8Temp = sg_sZ80Context.z80L;
			 u32Temp = sg_sZ80Context.z80A + u8Temp;
			 sg_sZ80Context.z80F = g_u8SignZeroTable[(UINT8)
u32Temp] | ((u32Temp >> 8) & Z80_FLAG_C) | ((sg_sZ80Context.z80A ^ u32Temp ^
u8Temp) & Z80_FLAG_H) | (((u8Temp ^ sg_sZ80Context.z80A ^ 0x80) & (u8Temp ^
u32Temp) & 0x80) >> 5);
			 sg_sZ80Context.z80A = (UINT8) u32Temp;
			 }
			 if ((s32CyclesRemaining -= 4) < 0) { goto Exit0011;}

My only question about the code is that you do make a major simplifcation
by aliasing the pairs of one byte registers with their two-byte equivalents,
but I'm worried that the aliasing will break the compiler's register
optimisation strategy.  You certainly will *not* be able to force the
variables into real registers throughout the entire procedure using this
scheme, which doesn't hurt on the x86 but may be painful when you compile
under 68000 or ARM or other machines with enough real registers to map
all the target registers.

G

#280 From: Neil Bradley <nb@...>
Date: Wed Jul 28, 2004 12:34 am
Subject: MS Pacman recompiled!
nb@...
Send Email Send Email
 
Finally filled out the Z80 instructions enough so that Ms Pacman will
recompile. The "final" version is here:

http://www.synthcom.com/~nb/mspacman.zip

Almost 2 megabytes of source. ;-( Lots of cleanup I need to do, too. I'm
currently hooking up a platform extension to debug it. Should be, uh,
fun... ;-(

Looks like the Z80 emitter could've been better written in a few places.
After I get Ms Pacman going, I'll go back and clean the Z80 emitter up
(and post another version of Orion).

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#279 From: Neil Bradley <nb@...>
Date: Wed Jul 28, 2004 12:03 am
Subject: Re: Re: Space Invaders lives!
nb@...
Send Email Send Email
 
I'm detecting a sarcastic, perhaps condescending attitude coming from you.
Was that your intent?

> > > Dropping from 25 instructions to 3 instruction is, uh, a big win,
> > and all
> > > flags are correctly calculated.
> No they aren't.

Yes, the flags *ARE* natively calculated on the x86 ;-) Completely. The
point is the C code has no concept of flags (assembly does). You are right
that the overflow flag is in another flag, but even with the addition of
the few assembly instructions to move overflow into proper positions is
STILL a big win over a C equivalent.

But I do agree that flag calculation elimination will be far more of a win
than optimizing that particular instruction.

> In fact, I'd bet that the average add/adc doesn't
> require any flag calculation at all so the lack of
> lahf/sahf/pushf/popf would make it even smaller than the "prized"
> assembler code.

Gee, based on the fact you quoted "prized" assembly, I guess mentioning
the fact that good hand coded assembly can stop the tar out of C must've
hurt you somehow. Sorry 'bout that.

> the flags register manipulated by lahf and sahf.  If you are going to
> argue that z80 games never test overflow then I will also argue that
> the half/auxillary flag is also infrequently used.

They rarely use the overflow flag (they check carry most of the time).
Half/aux carry is used more often in practice.

> > I called "playing with the flag register") you may
> > very well experience stalls and/or kill the out of
> > order execution unit.

But not to the tune of 25 instructions worth... ;-) Obviously doing lookup
tables would be even better in the 8 bit realm, then no flags would have
to be calculated at all - just looked up (and that is the fastest approach
I've found).

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#278 From: mperry@...
Date: Tue Jul 27, 2004 11:54 pm
Subject: Re: Space Invaders lives!
riff6809
Offline Offline
Send Email Send Email
 
--- In staticrecompilers@yahoogroups.com, "ldesnogu"
<laurent.desnogues@w...> wrote:
> --- In staticrecompilers@yahoogroups.com, Neil Bradley <nb@s...>
wrote:
>
> > Another thing to note, the "add" instruction output in C is really
> pretty
> > bad. It's something like 25 instructions with max optimizations
> turned on.
> > But when I do it in assembly, it looks more like this:
> >
> >  sahf
> >  add al, 35h
> >  lahf
> >
> > Dropping from 25 instructions to 3 instruction is, uh, a big win,
> and all
> > flags are correctly calculated.


No they aren't.  The z80 certainly updates the overflow flag.  The
x86 overflow flag is unfortunately absent from the lower 16-bits of
the flags register manipulated by lahf and sahf.  If you are going to
argue that z80 games never test overflow then I will also argue that
the half/auxillary flag is also infrequently used.  Its not fair to
make the C code calculate both of those flags and then compare it to
an assembly routine that doesnt calculate them (in the sense that it
doesn't save them).  Only the overflow and half/auxillary flags are
prohibitively expensive to compute.  The negative and zero flags are
dirt cheap to compute, even if you insist on storing the flags in a
native packed format.  The carry flag isn't all that expensive either.
And as even you admitted, if you do dead-code elimination, most of
the flag calculations go away and the average length of the C code
emitted will surely be much smaller than the worst-case 25
instructions.  In fact, I'd bet that the average add/adc doesn't
require any flag calculation at all so the lack of
lahf/sahf/pushf/popf would make it even smaller than the "prized"
assembler code.

> No this is not necessarily a win because of pipeline
> hazards (reminder:  I don't know the x86 pipeline).
> If the flag register is explictly used (this is what
> I called "playing with the flag register") you may
> very well experience stalls and/or kill the out of
> order execution unit.

There is a stage or two dedicated to flags on the P4.  I'm not sure
what is performed during this stage, but I know that it is one of the
later stages in the way-too-deep pipeline.  Its quite possible that
lahf and sahf get processed in this stage as well as "standard"
instruction based flag updating.  This would allow the lahf/sahf
instructions to avoid stalling the pipeline until a subsequent
instruction requires the new flag values (or register into which they
were loaded).  Remember, the processor is already checking for data
dependencies.  The flags are dependencies like everything else.  Any
data dependency is going to force in-order execution.  Also be aware
of the fact that modern PCs translate x86 instructions into micro-ops
and that the P4 in particular uses a trace cache of micro-ops similar
to an optimized basic block.  I highly doubt that flags are coupled
with uops and I imagine that flags are treated as individual
registers with pushfd/popfd and the likes being microcoded or
requiring several uops.

#277 From: "ldesnogu" <laurent.desnogues@...>
Date: Tue Jul 27, 2004 10:39 am
Subject: Re: Space Invaders lives!
ldesnogu
Offline Offline
Send Email Send Email
 
--- In staticrecompilers@yahoogroups.com, Neil Bradley <nb@s...> wrote:

> Another thing to note, the "add" instruction output in C is really
pretty
> bad. It's something like 25 instructions with max optimizations
turned on.
> But when I do it in assembly, it looks more like this:
>
>  sahf
>  add al, 35h
>  lahf
>
> Dropping from 25 instructions to 3 instruction is, uh, a big win,
and all
> flags are correctly calculated.

No this is not necessarily a win because of pipeline
hazards (reminder:  I don't know the x86 pipeline).
If the flag register is explictly used (this is what
I called "playing with the flag register") you may
very well experience stalls and/or kill the out of
order execution unit.

Perhaps you could check with performance counters
if what I say is true on your P4...


Laurent

#276 From: Laurent Desnogues <laurent.desnogues@...>
Date: Tue Jul 27, 2004 8:37 am
Subject: Re: Orion and timing/optimization fun
ldesnogu
Offline Offline
Send Email Send Email
 
Neil Bradley wrote:

> Everything I've heard from *EVERYONE* talks about a "basic block", and
> that at the point of a branch, all optimization and cycle counting ends.

You are right about that but not totally:  all optimizing
compilers use BB even when doing global and interprocedural
optimizations!

I only see BB as a convenient data structure that can
prevent having to go through every instruction to know
where labels are put or where control flow changes occur.

>>http://citeseer.ist.psu.edu/cis?q=ssa&submit=Search+Documents&cs=1
>>If you follow that path an explicit IR may be needed.
>
> I'm not convinced that dynamic register allocation will actually buy
> speedups or improvements overall.

SSA is not related to dynamic register allocation.  It is
a kind of IR that helps making data flow analysis without
having to bother about control flow.

> Yeah. ;-) Without even trying, Orion is running Space Invaders @ the speed
> of a 3.4Ghz Z80. It's on a 2.4Ghz Pentium 4. ;-) Not bad! But we can do
> better....

Oh yes, I am sure that once you have implemented cycle
counting as you want to, the figure will be even more
impressive :-)


		 Laurent

#275 From: Neil Bradley <nb@...>
Date: Tue Jul 27, 2004 8:24 am
Subject: Re: Orion and timing/optimization fun
nb@...
Send Email Send Email
 
> > Actually you've not retired it, you've implemented it :-)  It's just a
> > disagreement over terminology, you're both talking about the same thing.
> I am glad someone shares my point of view ;-)

Well, I think my point should further be explained.

Everything I've heard from *EVERYONE* talks about a "basic block", and
that at the point of a branch, all optimization and cycle counting ends.
Since Orion transcends that, I of course shun away from the term to avoid
implying such things.

> http://citeseer.ist.psu.edu/cis?q=ssa&submit=Search+Documents&cs=1
> If you follow that path an explicit IR may be needed.

I'm not convinced that dynamic register allocation will actually buy
speedups or improvements overall.

> I don't think we should stop just because emulation is fast
> enough, just carefully decide what to implement first
> taking into account all we have in mind for later
> optimizations (doh this looks pretty obvious :-).

Yeah. ;-) Without even trying, Orion is running Space Invaders @ the speed
of a 3.4Ghz Z80. It's on a 2.4Ghz Pentium 4. ;-) Not bad! But we can do
better....

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#274 From: Laurent Desnogues <laurent.desnogues@...>
Date: Tue Jul 27, 2004 8:06 am
Subject: Re: Orion and timing/optimization fun
ldesnogu
Offline Offline
Send Email Send Email
 
Graham Toal wrote:
>>I've personally retired the concept of "Basic block" since Orion has code
>>flow analysis capabilities that obviate it. ;-)
>
>
> Actually you've not retired it, you've implemented it :-)  It's just a
> disagreement over terminology, you're both talking about the same thing.

I am glad someone shares my point of view ;-)

> The other BB thing you'll need (whatever you eventually call it ;-) ) is
> the concept of merging contexts, or as I think they started calling it more
> recently, "phi nodes".
>
> Actually it's not quite the same thing when you're generating C and have
> variables that map directly to each target register; but if you ever create
> an asm translation where you're mapping target registers to host registers
> only over a live range, you will need that concept.

You know you are describing SSA or static single assignment?
Look at the first article output by:

http://citeseer.ist.psu.edu/cis?q=ssa&submit=Search+Documents&cs=1

If you follow that path an explicit IR may be needed.

> Ah well, complexity for another day, and not much benefit from it.  Even the
> current unoptimised code generation is already good enough for the targets
> we're likely to use.

I don't think we should stop just because emulation is fast
enough, just carefully decide what to implement first
taking into account all we have in mind for later
optimizations (doh this looks pretty obvious :-).


		 Laurent

#273 From: Graham Toal <gtoal@...>
Date: Tue Jul 27, 2004 2:25 am
Subject: Re: Orion and timing/optimization fun
graham_toal
Offline Offline
Send Email Send Email
 
> > > * At any return or conditional return with more than one return point
> > Hum you are more or less defining a basic block
> > except that you optimize some cases.  This looks
> > promising :-)
>
> I've personally retired the concept of "Basic block" since Orion has code
> flow analysis capabilities that obviate it. ;-)

Actually you've not retired it, you've implemented it :-)  It's just a
disagreement over terminology, you're both talking about the same thing.

The other BB thing you'll need (whatever you eventually call it ;-) ) is
the concept of merging contexts, or as I think they started calling it more
recently, "phi nodes".

Actually it's not quite the same thing when you're generating C and have
variables that map directly to each target register; but if you ever create
an asm translation where you're mapping target registers to host registers
only over a live range, you will need that concept.

A slightly watered-down version of it will be useful in C; that is where
you note whether or not a flag is calculated, or a register has a known
value, then you have to merge the information from the multiple jump sources
whenever you have a jump target.  It gets tricky when you have recursive
procedures, and need to work out closures of the graph.

Ah well, complexity for another day, and not much benefit from it.  Even the
current unoptimised code generation is already good enough for the targets
we're likely to use.

Graham

#272 From: Graham Toal <gtoal@...>
Date: Tue Jul 27, 2004 2:14 am
Subject: Re: Orion and timing/optimization fun
graham_toal
Offline Offline
Send Email Send Email
 
> For example, if I have:
>
>  ...code...
>  ...code...
>  ...code...
>  jmp routine
>
>  .....
>
> routine:
>  ...code...
>  ...code...
>  ...code...
>
> And I know that "routine" only has one reference, I can coalesce the
> timing across the jumps. Same deal with the calls AND rets. So I don't
> plan on just dropping timing in at the point of a return or branch - I'd
> like to be a bit more sophisticated about it.

True enough, though my preference would be to exbed the code in which case
it becomes an inline sequence again.  Am I getting too fancy here? ;-)

[Same deal if you are expanding procedure/function calls inline.  And if you
do that, make sure you do so *before* constant operation folding/register
remembering
because you might be able to effectivly remove some function calls
altogether, if they had constant arguments!]

> > Secondly, you should force out a timing update when the accumulated sum
> > exceeds some threshhold.
>
> Unless the code is 1000 inline instructions with absolutely no branches at
> all, I don't think this could ever happen.

No, that's what I'm talking about, you *need* to force it out in inline code
because there won't have been any branches to do it for you.  Maybe 1000
instructions isn't very likely but what if your 'force' parameter is set much
lower, eg to 1 in order to force it on every instruction?  Better to do it
that way than have special-case code!

Actually I agree with your assessment of long straight-line sequences being
unlikely; that's why my version of this only dumped timing on *backwards*
branches, not on forward branches :-)

> I plan on adding in a "stricttiming" directive which will direct the
> recompiler to emit tighter timing granularity to help in the situation
> where interCPU communication (for example) is required/paramount.

That's the special-case code I was talking about :-) Just tweak the threshhold
value, and it all drops out in the wash.

G

#271 From: Neil Bradley <nb@...>
Date: Tue Jul 27, 2004 12:36 am
Subject: Yucky add/sub flag calc code
nb@...
Send Email Send Email
 
To underscore my point about the sucky flag calculation code, check this

lines out from si.c:

case 0x0030: // cp a, 99h
	 u8Temp = 0x99;
	 u32Temp = sg_sZ80Context.z80A - u8Temp;
	 sg_sZ80Context.z80F = g_u8SignZeroNegTable[(UINT8) u32Temp] |
((u32Temp >> 8) & Z80_FLAG_C) | ((u32Temp ^ sg_sZ80Context.z80A ^ u8Temp)
& Z80_FLAG_H) | (((u8Temp ^ sg_sZ80Context.z80A) & (sg_sZ80Context.z80A ^
u32Temp) & 0x80) >> 5);

Turns in to:

$L1757:
; Line 963
	 xor edx, edx
	 mov dl, al
; Line 964
	 mov bl, al
	 shr bl, 5
	 and bl, 4
	 sub edx, 153 		 ; 00000099H
	 mov ecx, edx
	 shr ecx, 5
	 and cl, 4
	 xor cl, bl
	 mov bl, al
	 not bl
	 shr bl, 5
	 and cl, bl
	 mov bl, al
	 not bl
	 xor bl, dl
	 and bl, 16 			 ; 00000010H
	 or cl, bl
	 mov ebx, edx
	 and ebx, 255 		 ; 000000ffH
	 or cl, BYTE PTR _g_u8SignZeroNegTable[ebx]
	 shr edx, 8
	 and dl, 1
	 or cl, dl
; Line 965
	 sub esi, 7
	 mov BYTE PTR _sg_sZ80Context+44, cl
	 js $Exit0032$1762

The compiler did do a good job for what it was given, but calculating
flags on a subtract is just horrible! That's why I think flag calculation
removal will have a much bigger impact (at least on the x86). I don't
think separately storing each flag out will do a better job than not
having the code to calculate flags in there in the first place (in other
words don't rely on the compiler to do the right thing). Especially on a
lower register count processor like the x86.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#270 From: Neil Bradley <nb@...>
Date: Tue Jul 27, 2004 12:26 am
Subject: Re: Space Invaders lives!
nb@...
Send Email Send Email
 
> > C Z80 Emulation 	 7672 FPS
> > x86 Z80 Emulation 48035 FPS
> > Recompiled (debug mode) 32745 FPS
> > Recompiled  96497 FPS
> Hum couldn't this hint that using the x86 flags
> for recompilation is not needed at all?

How do you arrive at that conclusion? The x86 Z80 core is really at its
maximum, since every instruction must endure the 5 following opcodes at a
MINIMUM:

	 sub edi, timing
	 js exit
	 mov dl, [esi]
	 inc esi
	 jmp opcodeDispatchTable[edx*4]

The recompiled output only has the overhead of *TWO* of those instructions
on a per instruction basis when per instruction timing is used.

Another thing to note, the "add" instruction output in C is really pretty
bad. It's something like 25 instructions with max optimizations turned on.
But when I do it in assembly, it looks more like this:

	 sahf
	 add al, 35h
	 lahf

Dropping from 25 instructions to 3 instruction is, uh, a big win, and all
flags are correctly calculated.

> I don't
> know much about the P6 or later pipelines but I
> bet that playing with the flag register is not
> very good unless carefully scheduled which may
> be very difficult for a processor playing a lot
> with flags (68k comes to mind).

What do you mean by "playing with the flag register"?

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#269 From: Neil Bradley <nb@...>
Date: Tue Jul 27, 2004 12:20 am
Subject: Re: Orion and timing/optimization fun
nb@...
Send Email Send Email
 
> > * At a CPU "halt" instruction or equivalent
> > * At any "jump to self" instructions (yes, these exist!)
> > * At any return or conditional return with more than one return point
> Hum you are more or less defining a basic block
> except that you optimize some cases.  This looks
> promising :-)

I've personally retired the concept of "Basic block" since Orion has code
flow analysis capabilities that obviate it. ;-)

> Or perhaps your question was not in terms of "when
> should I implement this?" but rather "when should
> this optim be run?" ?

The latter. ;-)

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

#268 From: Neil Bradley <nb@...>
Date: Tue Jul 27, 2004 12:19 am
Subject: Re: Orion and timing/optimization fun
nb@...
Send Email Send Email
 
> > * At a fixed jump point that has a target that is jumped to by multiple
> > places
> Assuming the above two are jump targets, then no - just at the jumps
> themselves. Proof by induction...

What if it jumps to an address that isn't referenced by anyone else? It
can continue the timing coalescing and not just stop at a jump.

> > * At any "jump to self" instructions (yes, these exist!)
> > * At any return or conditional return with more than one return point
> All the jumps are basically the same thing.  jmp/jsr/rts (+ points where a
> genuine external interrupt may be invoked? eg something from a real-time
> timer?)

They're quite different, actually, at least with Orion since the call tree
analysis is both forward and reverse referenced.

For example, if I have:

	 ...code...
	 ...code...
	 ...code...
	 jmp routine

	 .....

routine:
	 ...code...
	 ...code...
	 ...code...

And I know that "routine" only has one reference, I can coalesce the
timing across the jumps. Same deal with the calls AND rets. So I don't
plan on just dropping timing in at the point of a return or branch - I'd
like to be a bit more sophisticated about it.

> When you have a conditional jump, you should output the timing *so far*
> __inside__ the condition, while still remembering it for later.  That way
> it is only executed if the jump is taken.

Ah yes, that's a very, very good point! I hadn't thought of that - will be
sure to take it in to account.

> Secondly, you should force out a timing update when the accumulated sum
> exceeds some threshhold.

Unless the code is 1000 inline instructions with absolutely no branches at
all, I don't think this could ever happen.

> accuracy that you require for the particular game, eg if it is OK to allow
> the clock to wobble by 1% and there are a million clocks per second, then
> you would force it out whenever it exceeds 10,000.

I plan on adding in a "stricttiming" directive which will direct the
recompiler to emit tighter timing granularity to help in the situation
where interCPU communication (for example) is required/paramount.

-->Neil

-------------------------------------------------------------------------------
Neil Bradley             "Your mistletoe is no match for my T.O.W. missile!"
Synthcom Systems, Inc.   - Santabot - Futurama
ICQ #29402898

Messages 268 - 297 of 326   Newest  |  < Newer  |  Older >  |  Oldest
Advanced
Add to My Yahoo!      XML What's This?

Copyright © 2009 Yahoo! Inc. All rights reserved.
Privacy Policy - Terms of Service - Guidelines - Help