In two cases, my old code was using a temporary register but not saving
it properly; it basically worked by accident (an otherwise useless
FlushLock was causing CallerSavedRegistersInUse to think it was in use
by the GPR cache, even though it was actually a temporary).
I'm going to modify this in the next commit to use RDX, but I didn't
want to leave a broken revision in the middle.

The register is RBP, which was previously part of the GPR allocation
order. The next commit will investigate whether there are now too few
allocatable GPRs (or whether there already were), but for now there is
no replacement.

Previously, ppcState was accessed RIP-relatively; with RBP pointing at
it, anything in the first 0x100 bytes of ppcState (including all the
GPRs) can be reached with three fewer bytes per access. Code to access
ppcState is generated constantly (mostly for register saves and loads),
so in principle this should shrink the instruction cache footprint
considerably, and in practice it does seem to make a significant
performance difference.
The vast majority of this commit is mechanically replacing
M(&PowerPC::ppcState.x) with a new macro PPCSTATE(x).
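
Conceptually, the new macro just turns a struct member into an
RBP-relative memory operand. A rough sketch, assuming RBP is biased
0x80 bytes into ppcState (the exact definition may differ):

// Sketch only: with the bias, a signed 8-bit displacement covers the
// first 0x100 bytes of ppcState.
#define PPCSTATE(x) MDisp(RBP, (int)((char*)&PowerPC::ppcState.x - (char*)&PowerPC::ppcState) - 0x80)

// At a call site the change is mechanical (byte counts are approximate):
MOV(64, R(RAX), M(&PowerPC::ppcState.gpr[0]));  // old: RIP-relative, 4-byte displacement
MOV(64, R(RAX), PPCSTATE(gpr[0]));              // new: RBP + 8-bit displacement, 3 bytes smaller
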
Version 2: gets most of the cases which were using the register access
macros.

GetOpInfo was returning null pointers for invalid ops in subtables
instead of asserting an error. This was causing segfaults when the JIT
tried to compile invalid code.
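
A minimal sketch of the intended behaviour (the subtable lookup is
folded into a hypothetical LookupSubtable() helper here; the assert
macro and table names are taken loosely from the surrounding code):

GekkoOPInfo* GetOpInfo(UGeckoInstruction inst)
{
    GekkoOPInfo* info = m_infoTable[inst.OPCD];
    if (info->type == OPTYPE_SUBTABLE)
    {
        // Previously the subtable entry was returned unchecked, so an
        // invalid encoding handed a null pointer straight to the JIT.
        GekkoOPInfo* sub = LookupSubtable(inst);  // hypothetical helper
        _assert_msg_(POWERPC, sub != nullptr,
                     "GetOpInfo - invalid subtable op %08x", inst.hex);
        return sub;
    }
    return info;
}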

Prior to this change, it was possible to cause an infinite loop by
making the string to be replaced and the replacement string the same
thing, e.g.:
std::string some_str = "test";
ReplaceAll(some_str, "test", "test");

This also changes the replacement logic so that it no longer restarts
from the beginning of the string on each replacement iteration.
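
A sketch of the fixed loop, assuming the in-place signature implied by
the example above (the real function may take or return a copy instead):

#include <string>

void ReplaceAll(std::string& result, const std::string& src, const std::string& dest)
{
    // An empty search string would match at every position; bail out early.
    if (src.empty())
        return;

    size_t pos = 0;
    while ((pos = result.find(src, pos)) != std::string::npos)
    {
        result.replace(pos, src.length(), dest);
        // Resume the search just past the inserted text instead of from the
        // start of the string, so src == dest can no longer loop forever.
        pos += dest.length();
    }
}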

For a long time, we've had ugly and inconsistent helper function names
here, names like "decodebytesRGB5A3rgba" which are absolutely
incomprehensible. Fix this by introducing a new, consistent naming
scheme, where the above function becomes "DecodeBytes_RGB5A3".

Instead of having three separate functions and checking the tlutfmt in
a variety of places, just do it once in a helper method. This code is
only used on the slow path, either in our Generic decoder or in our
Software renderer, so it doesn't matter that it's a bit slower.
x64 will continue using the separate functions for speed.
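
The shape of that helper is roughly the following (the function and
enum names here are illustrative, and byte-swapping details are
elided):

// Sketch only: check the TLUT format once in one helper instead of
// keeping three near-identical decode paths.
static u32 DecodePixel_Paletted(u16 pixel, TlutFormat tlutfmt)
{
    switch (tlutfmt)
    {
    case GX_TL_IA8:
        return DecodePixel_IA8(pixel);
    case GX_TL_RGB565:
        return DecodePixel_RGB565(pixel);
    case GX_TL_RGB5A3:
        return DecodePixel_RGB5A3(pixel);
    default:
        return 0;
    }
}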