dolphin/Source/Core
Ryan Houdek 8d61706440 [AArch64] Optimize lmw.
This instruction is used fairly heavily by Ikaruga to load a run of registers from the stack.
In particular, at the start of the second stage there is a block that takes up ~20% of CPU time and uses lmw to load half of the guest registers.
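For reference, lmw rD, d(rA) loads consecutive big-endian words, starting at the effective address (rA|0) + d, into rD through r31. Below is a minimal sketch of those semantics in portable C++. GuestState, ReadBE32 and Lmw are hypothetical names for illustration, not Dolphin's interpreter, and the sketch ignores the invalid forms where rA falls inside the loaded range.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical guest state for illustration only: 32 GPRs plus a flat
// big-endian memory buffer indexed directly by guest address.
struct GuestState {
  std::array<uint32_t, 32> gpr{};
  std::vector<uint8_t> mem;
};

// Read one 32-bit word the way the big-endian PowerPC guest sees it.
uint32_t ReadBE32(const std::vector<uint8_t>& mem, uint32_t addr) {
  return (uint32_t(mem[addr]) << 24) | (uint32_t(mem[addr + 1]) << 16) |
         (uint32_t(mem[addr + 2]) << 8) | uint32_t(mem[addr + 3]);
}

// lmw rD, d(rA): EA = (rA == 0 ? 0 : gpr[rA]) + sign-extended d, then
// load one word per register for rD, rD+1, ..., r31 at EA, EA+4, ...
void Lmw(GuestState& s, int rD, int rA, int16_t d) {
  uint32_t ea = (rA == 0 ? 0u : s.gpr[rA]) + static_cast<uint32_t>(int32_t(d));
  for (int r = rD; r < 32; ++r, ea += 4)
    s.gpr[r] = ReadBE32(s.mem, ea);
}
```

Each register is one 32-bit load in this model, which is exactly the one-LDR-per-register pattern the JIT change below tries to avoid.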

The basic optimization here is replacing one 32-bit LDR per guest register with, potentially, a single 128-bit load.
A single 32-bit LDR is fairly slow, so we can optimize in a few ways:
If we have four or more registers to load, do a 64-bit LDP into two host registers, byteswap, and then move the high 32 bits of the host registers into the correct mapped guest register locations.
If we have two registers to load, do a 32-bit LDP, which loads two guest registers in a single instruction.
Finally, if only one register is left to load, load it with a 32-bit LDR as before.
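The three cases above can be modelled in portable C++ to show the data movement and how the registers are split up. This is not Dolphin's JitArm64 code: LoadLE64, ByteSwap64, SplitGuestPair, LoadOp and PlanLmw are hypothetical names. The sketch assumes the 64-bit case uses a full 64-bit byte reverse (what AArch64 REV does on an X register), which is what leaves the first guest word in the high 32 bits. PlanLmw only decides which load covers which registers; it emits no machine code and ignores any extra constraints the real JIT may have.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Model of a 64-bit little-endian host load of guest bytes p[0..7]
// (what one X register of an LDP sees on AArch64).
uint64_t LoadLE64(const uint8_t* p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// Model of a full 64-bit byte reverse: byte 0 becomes byte 7, and so on.
uint64_t ByteSwap64(uint64_t v) {
  uint64_t r = 0;
  for (int i = 0; i < 8; ++i) {
    r = (r << 8) | (v & 0xFF);
    v >>= 8;
  }
  return r;
}

// After the swap, the high 32 bits hold the first big-endian guest word
// (for rN) and the low 32 bits hold the second (for rN+1).
std::pair<uint32_t, uint32_t> SplitGuestPair(const uint8_t* p) {
  uint64_t swapped = ByteSwap64(LoadLE64(p));
  return {static_cast<uint32_t>(swapped >> 32), static_cast<uint32_t>(swapped)};
}

// Hypothetical description of one host load covering part of an lmw.
struct LoadOp {
  std::string kind;  // "LDP64" (4 guest regs), "LDP32" (2), "LDR32" (1)
  int first_reg;     // first guest register covered by this load
  int offset;        // byte offset from the effective address
};

// Split lmw rD..r31 widest-first, following the commit message:
// 64-bit LDPs while >= 4 registers remain, then one 32-bit LDP if
// >= 2 remain, then a single 32-bit LDR for a final odd register.
std::vector<LoadOp> PlanLmw(int rD) {
  std::vector<LoadOp> ops;
  int reg = rD, offset = 0, remaining = 32 - rD;
  while (remaining >= 4) {
    ops.push_back({"LDP64", reg, offset});
    reg += 4; offset += 16; remaining -= 4;
  }
  if (remaining >= 2) {
    ops.push_back({"LDP32", reg, offset});
    reg += 2; offset += 8; remaining -= 2;
  }
  if (remaining == 1) ops.push_back({"LDR32", reg, offset});
  return ops;
}
```

For an lmw starting at r16 (half of the 32 guest GPRs), PlanLmw yields four 64-bit LDPs where one 32-bit LDR per register would need sixteen loads.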

This saves quite a few cycles, since on the Cortex-A57 and A72 each LDR instruction takes several cycles.

Each 32-bit LDR has 4 cycles of latency, plus 1 cycle for the post-index (which typically happens in parallel).
The 32-bit and 64-bit LDP both have the same latency.

So we are improving latencies and reducing code bloat here.
2015-08-28 14:40:30 -05:00
AudioCommon Merge pull request #2854 from Tilka/valgrind 2015-08-15 20:52:12 +02:00
Common Merge pull request #2914 from JosJuice/fix-volumedirectory 2015-08-26 22:12:23 +02:00
Core [AArch64] Optimize lmw. 2015-08-28 14:40:30 -05:00
DiscIO Implemented .elf and .dol support in gamelist 2015-08-28 11:10:03 -07:00
DolphinQt Fix DoFileSearch returning the passed-in directories themselves. 2015-06-25 15:17:52 +02:00
DolphinWX Implemented .elf and .dol support in gamelist 2015-08-28 11:10:03 -07:00
InputCommon evdev: don't pass null path to the kernel 2015-08-15 12:51:34 +02:00
UICommon Have the disassembler show the PC next to host instructions. 2015-08-07 02:43:54 -05:00
VideoBackends Vec3: Simplify operator== code 2015-08-28 14:46:40 -04:00
VideoCommon Merge pull request #2918 from lioncash/memcpy 2015-08-28 20:45:15 +02:00
CMakeLists.txt