I haven't observed this breaking any game, but it didn't match
the behavior of the interpreter as far as I could tell from
reading the code, in that denormals weren't being flushed.
If we can prove that FCVT will provide a correct conversion,
we can use FCVT. This makes the common case a bit faster
and the less likely cases (unfortunately including zero,
which FCVT actually can convert correctly) a bit slower.
Preparation for following commits.
This commit intentionally doesn't touch paired stores,
since paired stores are supposed to flush to zero.
(Consistent with Jit64.)
This simplifies some of the following commits. It does require
an extra register, but hey, we have 32 of them.
Something I think would be nice to add to the register cache
in the future is the ability to keep both the single and double
version of a guest register in two different host registers
when that is useful. That way, the extra register we write to
here can be read by a later instruction, saving us from
having to perform the same conversion again.
When the interpreter writes to a discarded register, its type
must be changed so that it is no longer considered discarded.
Fixes a 62ce1c7 regression.
Repeated erase() + iteration on a std::multimap is extremely slow.
Slow enough that it causes a 7 second long stutter during some
transitions in F-Zero X (a N64 VC game that triggers many, many icache
invalidations).
And slow enough that JitBaseBlockCache::DestroyBlock shows up on a
flame graph as taking >50% of total CPU time on the CPU-GPU thread:
https://i.imgur.com/vvqiFL6.png
This commit optimises those block link queries by replacing the
std::multimap (which is typically implemented with red-black trees)
with hash tables.
Master: https://i.imgur.com/vvqiFL6.png / 7s stutters
(starting from 5.0-2021 and with branch following disabled)
This commit: https://i.imgur.com/hAO74fy.png / ~0.7s stutters, which
is pretty close to 5.0 stable. (5.0-2021 introduced the performance
regression and it is especially noticeable when branch following
is disabled, which is the case for all N64 VC games since 5.0-8377.)
Specifically, 'Scooby-Doo! Mystery Mayhem', 'Scooby-Doo! Unmasked', 'Ed, Edd n Eddy: The Mis-Edventures', and the Wii version of 'Happy Feet'.
The JIT cache causes problems with emulated icache invalidation in these games, resulting in areas failing to load.
The swaps are confusing and don't accomplish much.
It was originally written like this:
u32 pte = bswap(*(u32*)&base_mem[pteg_addr]);
then bswap was changed to Common::swap32, and then the array access
was replaced with Memory::Read_U32, leading to the useless swaps.
While 6xx_pem.pdf §7.6.1.1 mentions that the number of trailing
zeros in HTABORG must be equal to the number of trailing ones
in the mask (i.e. HTABORG must be properly aligned), this is actually
not a hard requirement. Real hardware will just OR the base address
anyway. Ignoring SDR changes would lead to incorrect emulation.
Logging a warning instead of dropping the SDR update silently is a
saner behaviour.
JitArm64::DoJit contains a check where it prints a warning and tries
to pause emulation if instructed to compile code at address 0. I'm
assuming this was done in order to provide a nicer error behavior
in cases where PC was accidentally set to null. Unfortunately, it
has started causing us problems recently, as 688bd61 writes and runs
some code at address 0 to simulate the PPC being held in reset.
What makes this worse is that calling Core::SetState from the CPU
thread is actually not allowed and will cause a deadlock instead of
the intended behavior. I don't believe there is anything on a real
console that would stop you from executing code at address 0 (as
long as the MMU has been set up to allow it), and Jit64::DoJit
doesn't contain any check like this, so let's remove the check.
This commit adds a new "discarded" state for registers.
Discarding a register is like flushing it, but without
actually writing its value back to memory. We can discard
a register only when it is guaranteed that no instruction
will read from the register before it is next written to.
Discarding reduces the register pressure a little, and can
also let us skip a few flushes on interpreter fallbacks.
The output of instructions like fabsx and ps_sel is store-safe
if and only if the relevant inputs are. The old code was always
marking the output as store-safe if the output was a single,
and never otherwise.
Also, the old code was treating the output of psq_l/psq_lu as
store-safe, which seems incorrect (if dequantization is disabled).
If we know at compile time that the PPC carry flag definitely
has a certain value, we can bake that value into the emitted code
and skip having to read from PPCState.
At a first glance it may look like a part of the code I added to
srawx in efeda3b has a bug when a == s. The code actually happens
to work correctly, but in the interest of making the code easier
to reason about, I'd like to change the way it's implemented. This
change should improve the pipelining a little in the a == s case too.
...and let's optimize a divisor of 2 ever so slightly for good measure.
I wouldn't have bothered, but most GameCube games seem to hit this on
launch.
- Division by 2
Before:
41 BE 02 00 00 00 mov r14d,2
41 8B C2 mov eax,r10d
45 85 F6 test r14d,r14d
74 0D je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0E jne normal_path
41 83 FE FF cmp r14d,0FFFFFFFFh
75 08 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
44 8B F0 mov r14d,eax
EB 07 jmp done
normal_path:
99 cdq
41 F7 FE idiv eax,r14d
44 8B F0 mov r14d,eax
done:
After:
45 8B F2 mov r14d,r10d
41 C1 EE 1F shr r14d,1Fh
45 03 F2 add r14d,r10d
41 D1 FE sar r14d,1
Add a function to calculate the magic constants required to optimize
signed 32-bit division.
Since this optimization is not exclusive to any particular architecture,
JitCommon seemed like a good place to put this.
Zero divided by any number is still zero. For whatever reason, this case
shows up frequently too.
Before:
B8 00 00 00 00 mov eax,0
85 F6 test esi,esi
74 0C je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0C jne normal_path
83 FE FF cmp esi,0FFFFFFFFh
75 07 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
8B F8 mov edi,eax
EB 05 jmp done
normal_path:
99 cdq
F7 FE idiv eax,esi
8B F8 mov edi,eax
done:
After:
Nothing!
When the dividend is known at compile time, we can eliminate some of the
branching and precompute the result for the overflow case.
Before:
B8 54 D3 E6 02 mov eax,2E6D354h
85 FF test edi,edi
74 0C je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0C jne normal_path
83 FF FF cmp edi,0FFFFFFFFh
75 07 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
8B F8 mov edi,eax
EB 05 jmp done
normal_path:
99 cdq
F7 FF idiv eax,edi
8B F8 mov edi,eax
done:
After:
85 FF test edi,edi
75 04 jne normal_path
33 FF xor edi,edi
EB 0A jmp done
normal_path:
B8 54 D3 E6 02 mov eax,2E6D354h
99 cdq
F7 FF idiv eax,edi
8B F8 mov edi,eax
done:
Fairly common with constant dividend of zero. Non-zero values occur
frequently in Ocarina of Time Master Quest.