Commit Graph

2414 Commits

Author SHA1 Message Date
51bf2dca21 Merge pull request #9675 from JosJuice/jit64-div-80000000
Jit64: Fix UB/infinite loop when compiling division by 0x80000000
2021-04-26 23:50:27 +02:00
7d4b87e7ae Jit64: Fix UB/infinite loop when compiling division by 0x80000000 2021-04-26 23:42:03 +02:00
ac679eb24d Merge pull request #9666 from leoetlino/jit-block-hashtable
Jit: Optimize block link queries by using hash tables
2021-04-25 18:45:41 +02:00
69c14d6ec3 JitArm64: Fix frspx with single precision source
I haven't observed this breaking any game, but it didn't match
the behavior of the interpreter as far as I could tell from
reading the code, in that denormals weren't being flushed.
2021-04-25 15:56:59 +02:00
54451ac731 JitArm64: Use ConvertSingleToDoubleLower in RW when faster 2021-04-25 15:56:59 +02:00
9d6263f306 JitArm64: Add unit tests for single/double conversion 2021-04-25 15:56:58 +02:00
2a9d88739c JitArm64: Skip accurate single/double conversion if store-safe 2021-04-25 15:56:58 +02:00
1d106ceaf5 JitArm64: Optimize ConvertSingleToDouble, part 2
If we can prove that FCVT will provide a correct conversion,
we can use FCVT. This makes the common case a bit faster
and the less likely cases (unfortunately including zero,
which FCVT actually can convert correctly) a bit slower.
2021-04-25 15:56:19 +02:00
018e247624 JitArm64: Optimize ConvertSingleToDouble, part 1 2021-04-25 15:56:19 +02:00
28e4869c43 JitArm64: Optimize ConvertDoubleToSingle 2021-04-25 15:56:19 +02:00
6e0a5876ef JitArm64: Use accurate single/double conversions
Our old conversion approach became a lot more inaccurate when
enabling flush-to-zero, to the point of obviously breaking games.
2021-04-25 15:56:19 +02:00
39eccf6603 JitArm64: Call RW before FCMPE in fselx
Needed because the next commit will make RW clobber flags.
2021-04-25 15:56:19 +02:00
949686bbe7 JitArm64: Factor out single/double conversion code to functions
Preparation for following commits.

This commit intentionally doesn't touch paired stores,
since paired stores are supposed to flush to zero.
(Consistent with Jit64.)
2021-04-25 15:56:19 +02:00
fdf7744a53 JitArm64: Move float conversion code out of EmitBackpatchRoutine
This simplifies some of the following commits. It does require
an extra register, but hey, we have 32 of them.

Something I think would be nice to add to the register cache
in the future is the ability to keep both the single and double
version of a guest register in two different host registers
when that is useful. That way, the extra register we write to
here can be read by a later instruction, saving us from
having to perform the same conversion again.
2021-04-25 15:56:19 +02:00
aa3a96f048 Merge pull request #9644 from JosJuice/jit-fallback-discard
Jits: Fix interpreter fallback handling of discarded registers
2021-04-25 13:20:41 +02:00
b3b5016f54 Jits: Fix interpreter fallback handling of discarded registers
When the interpreter writes to a discarded register, its type
must be changed so that it is no longer considered discarded.

Fixes a 62ce1c7 regression.
2021-04-25 13:01:40 +02:00
c812ab6a63 Jit: Optimize block link queries by using hash tables
Repeated erase() + iteration on a std::multimap is extremely slow.

Slow enough that it causes a 7 second long stutter during some
transitions in F-Zero X (a N64 VC game that triggers many, many icache
invalidations).

And slow enough that JitBaseBlockCache::DestroyBlock shows up on a
flame graph as taking >50% of total CPU time on the CPU-GPU thread:
https://i.imgur.com/vvqiFL6.png

This commit optimises those block link queries by replacing the
std::multimap (which is typically implemented with red-black trees)
with hash tables.

Master: https://i.imgur.com/vvqiFL6.png / 7s stutters
(starting from 5.0-2021 and with branch following disabled)

This commit: https://i.imgur.com/hAO74fy.png / ~0.7s stutters, which
is pretty close to 5.0 stable. (5.0-2021 introduced the performance
regression and it is especially noticeable when branch following
is disabled, which is the case for all N64 VC games since 5.0-8377.)
2021-04-24 17:20:59 +02:00
adebc499f9 Jit64: Indicate explicit [[fallthrough]] within load helper 2021-04-19 17:37:44 -04:00
5322256065 Merge pull request #9625 from leoetlino/mmu-sdr-update
MMU: Fix SDR updates being silently dropped in some cases
2021-04-06 20:23:13 -04:00
dad309d365 Disable ICache emulation for some games
Specifically, 'Scooby-Doo! Mystery Mayhem', 'Scooby-Doo! Unmasked', 'Ed, Edd n Eddy: The Mis-Edventures', and the Wii version of 'Happy Feet'.

The JIT cache causes problems with emulated icache invalidation in these games, resulting in areas failing to load.
2021-04-06 12:44:10 -07:00
49edd5f482 MMU: Remove a bunch of useless swaps
The swaps are confusing and don't accomplish much.

It was originally written like this:

u32 pte = bswap(*(u32*)&base_mem[pteg_addr]);

then bswap was changed to Common::swap32, and then the array access
was replaced with Memory::Read_U32, leading to the useless swaps.
2021-04-06 18:25:29 +02:00
960d957f4f MMU: Fix SDR updates being silently dropped in some cases
While 6xx_pem.pdf §7.6.1.1 mentions that the number of trailing
zeros in HTABORG must be equal to the number of trailing ones
in the mask (i.e. HTABORG must be properly aligned), this is actually
not a hard requirement. Real hardware will just OR the base address
anyway. Ignoring SDR changes would lead to incorrect emulation.

Logging a warning instead of dropping the SDR update silently is a
saner behaviour.
2021-04-06 18:25:09 +02:00
5222a4b7e5 Merge pull request #9585 from JosJuice/jitarm64-skip-carry
JitArm64: Skip calculating carry flag when not needed
2021-04-06 04:41:16 -04:00
99d43362e6 Merge pull request #9351 from JosJuice/discard-registers
Jits: Discard registers which we know will be overwritten
2021-04-06 04:40:26 -04:00
004dfd1586 Replace uses of cassert with Common/Assert.h 2021-04-02 10:18:18 -07:00
b3f71f7cdc JitArm64: Allow DoJit at address 0 (fix launching Wii titles)
JitArm64::DoJit contains a check where it prints a warning and tries
to pause emulation if instructed to compile code at address 0. I'm
assuming this was done in order to provide a nicer error behavior
in cases where PC was accidentally set to null. Unfortunately, it
has started causing us problems recently, as 688bd61 writes and runs
some code at address 0 to simulate the PPC being held in reset.
What makes this worse is that calling Core::SetState from the CPU
thread is actually not allowed and will cause a deadlock instead of
the intended behavior. I don't believe there is anything on a real
console that would stop you from executing code at address 0 (as
long as the MMU has been set up to allow it), and Jit64::DoJit
doesn't contain any check like this, so let's remove the check.
2021-04-01 11:28:53 +02:00
c915b780cf Merge pull request #9596 from Minty-Meeo/apply-moar-RunAsCPUThread
Apply More Core::RunAsCPUThread
2021-03-27 01:11:34 +01:00
62ce1c7653 Jits: Discard registers which we know will be overwritten
This commit adds a new "discarded" state for registers.
Discarding a register is like flushing it, but without
actually writing its value back to memory. We can discard
a register only when it is guaranteed that no instruction
will read from the register before it is next written to.

Discarding reduces the register pressure a little, and can
also let us skip a few flushes on interpreter fallbacks.
2021-03-24 20:48:44 +01:00
901170e299 PPCTables: Use u64 for instruction flags
We've run out of space :(
2021-03-24 20:48:36 +01:00
1845c5948d PPCAnalyst: Rework the store-safe logic
The output of instructions like fabsx and ps_sel is store-safe
if and only if the relevant inputs are. The old code was always
marking the output as store-safe if the output was a single,
and never otherwise.

Also, the old code was treating the output of psq_l/psq_lu as
store-safe, which seems incorrect (if dequantization is disabled).
2021-03-24 12:02:09 +01:00
3bd920638d JitArm64: Use STP for pc/npc, part 2
I missed one place in dd8e504.
2021-03-23 21:27:07 +01:00
LC
15ebb1d9e4 Merge pull request #9566 from Sintendo/jit64divwx
Jit64: Optimize divwx
2021-03-22 14:40:02 -04:00
baecddd262 JitArm64: Skip calculating carry flag when not needed 2021-03-19 23:02:24 +01:00
bcd572a820 Merge pull request #9593 from JosJuice/jitarm64-constant-carry
JitArm64: Constant carry flag optimizations
2021-03-19 22:58:17 +01:00
4c2cdb61df JitArm64: Constant carry flag optimizations
If we know at compile time that the PPC carry flag definitely
has a certain value, we can bake that value into the emitted code
and skip having to read from PPCState.
2021-03-19 22:40:19 +01:00
c5abcba77a JitArm64: Fix broken format strings in Arm64RegCache 2021-03-19 16:14:20 +01:00
db7f3f8f25 Apply More Core::RunAsCPUThread
In places where applicable, Core::RunAsCPUThread has replaced Core::SetState workarounds to pause and resume emulation for thread-sensitive operations.
 - void Core::SaveScreenShot()
 - void Core::SaveScreenShot(std::string_view name)
 - void JitInterface::GetProfileResults(Profiler::ProfileStats *prof_stats)
 - void MainWindow::OnExportRecording()
2021-03-18 22:31:28 -05:00
621b5b8e1a JitArm64: Optimize general case of srawx
Same approach as Jit64. A lot simpler, don't you think? :)
2021-03-17 00:15:23 +01:00
a45a0a2066 Merge pull request #9494 from Dentomologist/convert_arm64reg_to_enum_class
Arm64Gen: Convert ARM64Reg to enum class
2021-03-17 00:05:23 +01:00
c0f840525f JitArm64: Improve srawx special case carry calculation
At a first glance it may look like a part of the code I added to
srawx in efeda3b has a bug when a == s. The code actually happens
to work correctly, but in the interest of making the code easier
to reason about, I'd like to change the way it's implemented. This
change should improve the pipelining a little in the a == s case too.
2021-03-14 18:55:42 +01:00
f0f206714f Arm64Gen: Convert ARM64Reg to enum class
Most changes are just adding ARM64Reg:: in front of the constants.
2021-03-13 10:10:59 -08:00
defe7162f5 Jit64: divwx - Simplify divisor == -1 case
Suggested by @MerryMage. Thanks!

Co-authored-by: merry <MerryMage@users.noreply.github.com>
2021-03-07 18:29:12 +01:00
83f38388a1 Jit64: divwx - Micro-optimize default case
Both the normal path and the overflow path end with the same
instruction, so their tails can be merged.

Before:
41 8B C7             mov         eax,r15d
45 85 C0             test        r8d,r8d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 F8 FF          cmp         r8d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B F0             mov         r14d,eax
EB 07                jmp         done
normal_path:
99                   cdq
41 F7 F8             idiv        eax,r8d
44 8B F0             mov         r14d,eax
done:

After:
41 8B C7             mov         eax,r15d
45 85 C0             test        r8d,r8d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0B                jne         normal_path
41 83 F8 FF          cmp         r8d,0FFFFFFFFh
75 05                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
EB 04                jmp         done
normal_path:
99                   cdq
41 F7 F8             idiv        eax,r8d
done:
44 8B F0             mov         r14d,eax
2021-03-07 18:29:12 +01:00
1865035798 Jit64: divwx - Optimize division by 2
...and let's optimize a divisor of 2 ever so slightly for good measure.
I wouldn't have bothered, but most GameCube games seem to hit this on
launch.

- Division by 2
Before:
41 BE 02 00 00 00    mov         r14d,2
41 8B C2             mov         eax,r10d
45 85 F6             test        r14d,r14d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 FE FF          cmp         r14d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B F0             mov         r14d,eax
EB 07                jmp         done
normal_path:
99                   cdq
41 F7 FE             idiv        eax,r14d
44 8B F0             mov         r14d,eax
done:

After:
45 8B F2             mov         r14d,r10d
41 C1 EE 1F          shr         r14d,1Fh
45 03 F2             add         r14d,r10d
41 D1 FE             sar         r14d,1
2021-03-07 18:29:12 +01:00
0637a7ec59 Jit64: divwx - Optimize power-of-two divisors
Power-of-two divisors can be done more elegantly, so handle them
separately.

- Division by 4
Before:
41 BD 04 00 00 00    mov         r13d,4
41 8B C0             mov         eax,r8d
45 85 ED             test        r13d,r13d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 FD FF          cmp         r13d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B E8             mov         r13d,eax
EB 07                jmp         done
normal_path:
99                   cdq
41 F7 FD             idiv        eax,r13d
44 8B E8             mov         r13d,eax
done:

After:
45 85 C0             test        r8d,r8d
45 8D 68 03          lea         r13d,[r8+3]
45 0F 49 E8          cmovns      r13d,r8d
41 C1 FD 02          sar         r13d,2
2021-03-07 18:29:12 +01:00
530475dce8 Jit64: divwx - Micro-optimize certain divisors
When the multiplier is positive (which is the most common case), we can
generate slightly better code.

- Division by 30307
Before:
49 63 C5             movsxd      rax,r13d
48 69 C0 65 6B 32 45 imul        rax,rax,45326B65h
4C 8B C0             mov         r8,rax
48 C1 E8 3F          shr         rax,3Fh
49 C1 F8 2D          sar         r8,2Dh
44 03 C0             add         r8d,eax

After:
49 63 C5             movsxd      rax,r13d
4C 69 C0 65 6B 32 45 imul        r8,rax,45326B65h
C1 E8 1F             shr         eax,1Fh
49 C1 F8 2D          sar         r8,2Dh
44 03 C0             add         r8d,eax
2021-03-07 18:29:12 +01:00
95698c5ae1 Jit64: divwx - Optimize constant divisor
Optimize division by a constant into multiplication. This method is also
used by GCC and LLVM.

We also add optimized paths for divisors 0, 1, and -1, because they
don't work using this method. They don't occur very often, but are
necessary for correctness.

- Division by 1
Before:
41 BF 01 00 00 00    mov         r15d,1
41 8B C5             mov         eax,r13d
45 85 FF             test        r15d,r15d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 FF FF          cmp         r15d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B F8             mov         r15d,eax
EB 07                jmp         done
normal_path:
99                   cdq
41 F7 FF             idiv        eax,r15d
44 8B F8             mov         r15d,eax
done:

After:
45 8B FD             mov         r15d,r13d

- Division by 30307
Before:
41 BA 63 76 00 00    mov         r10d,7663h
41 8B C5             mov         eax,r13d
45 85 D2             test        r10d,r10d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 FA FF          cmp         r10d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B C0             mov         r8d,eax
EB 07                jmp         done
normal_path:
99                   cdq
41 F7 FA             idiv        eax,r10d
44 8B C0             mov         r8d,eax
done:

After:
49 63 C5             movsxd      rax,r13d
48 69 C0 65 6B 32 45 imul        rax,rax,45326B65h
4C 8B C0             mov         r8,rax
48 C1 E8 3F          shr         rax,3Fh
49 C1 F8 2D          sar         r8,2Dh
44 03 C0             add         r8d,eax

- Division by 30323
Before:
41 BA 73 76 00 00    mov         r10d,7673h
41 8B C5             mov         eax,r13d
45 85 D2             test        r10d,r10d
74 0D                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0E                jne         normal_path
41 83 FA FF          cmp         r10d,0FFFFFFFFh
75 08                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
44 8B C0             mov         r8d,eax
EB 07                jmp         00000000161737E7
normal_path:
99                   cdq
41 F7 FA             idiv        eax,r10d
44 8B C0             mov         r8d,eax
done:

After:
49 63 C5             movsxd      rax,r13d
4C 69 C0 19 25 52 8A imul        r8,rax,0FFFFFFFF8A522519h
49 C1 E8 20          shr         r8,20h
44 03 C0             add         r8d,eax
C1 E8 1F             shr         eax,1Fh
41 C1 F8 0E          sar         r8d,0Eh
44 03 C0             add         r8d,eax
2021-03-07 18:29:01 +01:00
5bb8798df6 JitCommon: Signed 32-bit division magic constants
Add a function to calculate the magic constants required to optimize
signed 32-bit division.

Since this optimization is not exclusive to any particular architecture,
JitCommon seemed like a good place to put this.
2021-03-07 18:27:36 +01:00
c9adc60d73 Jit64: divwx - Special case dividend == 0
Zero divided by any number is still zero. For whatever reason, this case
shows up frequently too.

Before:
B8 00 00 00 00       mov         eax,0
85 F6                test        esi,esi
74 0C                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0C                jne         normal_path
83 FE FF             cmp         esi,0FFFFFFFFh
75 07                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
8B F8                mov         edi,eax
EB 05                jmp         done
normal_path:
99                   cdq
F7 FE                idiv        eax,esi
8B F8                mov         edi,eax
done:

After:
Nothing!
2021-03-07 18:27:30 +01:00
c081e3f2b3 Jit64: divwx - Optimize constant dividend
When the dividend is known at compile time, we can eliminate some of the
branching and precompute the result for the overflow case.

Before:
B8 54 D3 E6 02       mov         eax,2E6D354h
85 FF                test        edi,edi
74 0C                je          overflow
3D 00 00 00 80       cmp         eax,80000000h
75 0C                jne         normal_path
83 FF FF             cmp         edi,0FFFFFFFFh
75 07                jne         normal_path
overflow:
C1 F8 1F             sar         eax,1Fh
8B F8                mov         edi,eax
EB 05                jmp         done
normal_path:
99                   cdq
F7 FF                idiv        eax,edi
8B F8                mov         edi,eax
done:

After:
85 FF                test        edi,edi
75 04                jne         normal_path
33 FF                xor         edi,edi
EB 0A                jmp         done
normal_path:
B8 54 D3 E6 02       mov         eax,2E6D354h
99                   cdq
F7 FF                idiv        eax,edi
8B F8                mov         edi,eax
done:

Fairly common with constant dividend of zero. Non-zero values occur
frequently in Ocarina of Time Master Quest.
2021-03-07 18:25:08 +01:00