I'm always told that SSE2 routines beat the FPU's hard-coded routines, but I've
never actually seen this quantified. Yes, SIMD is good in theory, but can it
really beat scalar routines? It uses more memory (more instructions streamed),
and it's not hard-coded. I've seen some C code trying to quantify
this, but the speed gain presented has been inconsistent (I assume the fault is
in failing to account for the task scheduler), and the gain seems to be less
than 10% every time for the functions that I feel matter the most (fsincos and
atan). Moreover, I've found some code that is theoretically faster than a plain
Taylor series, and it seemed pretty accurate when I checked the function on my
TI-Nspire. I was wondering if anyone has any code lying around that they use to
benchmark procedures while taking the scheduler out of the equation. It might
also help others in the community who do not have a benchmarker handy and want
to try other things.
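
For illustration, here is a minimal sketch of one common way to reduce scheduler noise: time the same batch of calls many times and keep the fastest trial, on the assumption that a preempted trial can only come out slower, never faster. It assumes a POSIX system with `clock_gettime`. The names `now_ns`, `bench_min_ns`, `kernel_fn` and `taylor_sin` are made up for this example, and `taylor_sin` is only a stand-in kernel, not the routine mentioned in the post.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical signature for the routine under test. */
typedef double (*kernel_fn)(double);

/* Volatile sink so the compiler cannot discard the timed work. */
static volatile double bench_sink;

/*
 * Run `trials` timed batches of `calls` invocations each and return the
 * fastest batch, expressed as nanoseconds per call. A batch interrupted by
 * the scheduler takes longer, so taking the minimum filters it out.
 */
static double bench_min_ns(kernel_fn f, int calls, int trials)
{
    uint64_t best = UINT64_MAX;
    for (int t = 0; t < trials; ++t) {
        double acc = 0.0;
        uint64_t t0 = now_ns();
        for (int i = 0; i < calls; ++i)
            acc += f(0.001 * (double)i);   /* vary the input per call */
        uint64_t t1 = now_ns();
        bench_sink = acc;                  /* keep the result alive */
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return (double)best / (double)calls;
}

/* Stand-in kernel: degree-7 Taylor polynomial for sin(x), Horner form. */
static double taylor_sin(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
              + x2 * (1.0 / 120.0
              + x2 * (-1.0 / 5040.0))));
}
```

Calling `bench_min_ns(taylor_sin, 1000, 200)` gives a per-call estimate for the stand-in. To compare candidates, wrap each one (an `fsincos` wrapper, an SSE2 routine, a libm call) in a function matching `kernel_fn` and run them with the same `calls` and `trials`. Keeping each batch short makes it more likely that at least one trial completes without being preempted. Pinning the process to one core, where the OS allows it, should cut the noise further.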