We were generating a texture without ever setting the data to a known value.
This happened on the old code as well, just that PP shaders are receiving some love and people are using it and noticing some of its issues.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
The only possible functionality change is that s_efbAccessRequested and
s_swapRequested are no longer reset at init and shutdown of the OGL
backend (only; this is the only interaction any files other than
MainBase.cpp have with them). I am fairly certain this was entirely
vestigial.
Possible performance implications: efbAccessReady now uses an Event
rather than spinning, which might be slightly slower, but considering
the slow loop the flags are being checked in from the GPU thread, I
doubt it's noticeable.
Also, this uses sequentially consistent rather than release/acquire
memory order, which might be slightly slower, especially on ARM...
something to improve in Event/Flag, really.
This shouldn't affect functionality. I'm not sure if the breakpoint
distinction is actually necessary (my commit messages from the old
dc-netplay last year claim that breakpoints are broken anyway, but I
don't remember why), but I don't actually need to change this part of
the code (yet), so I'll stick with the trimmings change for now.
While we're at it, support a bunch of float load/store variants that weren't
implemented in the JIT. Might not have a big speed impact on typical games but
they're used at least a bit in povray and luabench.
694 -> 644 seconds on povray.
In particular, even in code that only runs on x86-64, you can't use
PRIx64 for size_t because, on OS X, one is unsigned long and the other
is unsigned long long and clang whines about the difference. I guess
you could make a size_t specifier macro, but those are horribly ugly, so
I just used casting.
Anyone want to make a nice (and slow) template-based printf?
Now without bare 'unsigned'.
Thanks to magumagu's softfp experiments, we know a lot more about the Wii's
strange floating point unit than we used to. In particular, when doing a
single-precision floating point multiply (fmulsx), it rounds the right hand
side's mantissa so as to lose the low 28 bits (of the 53-bit mantissa).
Emulating this behavior in Dolphin fixes a bunch of issues with games that
require extremely precise emulation of floating point hardware, especially
game replays. Fortunately, we can do this with rather little CPU cost; just ~5
extra instructions per multiply, instead of the vast load of a pure-software
float implementation.
This doesn't make floating-point behavior at all perfect. I still suspect
fmadd rounding might not be quite right, since the Wii uses fused instructions
and Dolphin doesn't, and NaN/infinity/exception handling is probably off in
various ways... but it's definitely way better than before.
This appears to fix replays in Mario Kart Wii, Mario Kart Double Dash, and
Super Smash Brothers Brawl. I wouldn't be surprised if it fixes a bunch of
other stuff too.
The changes to instructions other than fmulsx may not be strictly necessary,
but I included them for completeness, since it feels wrong to fix some
instructions but not others, since some games we didn't test might rely on
them.
(1) Rename ABI_ALL_CALLEE_SAVED to ABI_ALL_CALLER_SAVED, because that's
what it was actually defined as (and used as). Derp.
(2) RegistersInUse is always used for the purpose of saving registers
before calling a C++ function in the middle of a JIT block (without
flushing). There is no need to save callee-saved registers in this
case. Change the name to CallerSavedRegistersInUse and mask with
ABI_ALL_CALLER_SAVED.
Nothing obvious broke when starting up a Melee game. (I added a test
for anything actually being masked out; it happens, but in this
particular case seemed to occur at most a few dozen times per second, so
the actual performance benefit is probably negligible.)
The PPC_FP conversion code can be made a lot simpler with the observation
that the only values that need to be sent through the slow x87 path are
denormals.
A whole bunch faster: 708->678 seconds on POV-RAY.
strictStrings is not supported by debug libraries, and indeed breaks the build.
Drop wbemidl.h (incompatible with strictStrings) dependency by using SDL-style search for XInput GUIDs.
The concept of a "title bar" / "status bar" shouldn't be a core concept,
so remove the Host_UpdateStatusBar function, and move the code handles
whether to update the status bar or titlebar into DolphinWX.
This is effectively unused, as the window handles that we pass to the
GLInterface are window handles for the frame which isn't ever a real
toplevel window. Host_UpdateTitle is what actually sets the proper title
on the render window.
Now that MainNoGUI is properly architected and GLX doesn't need to
sometimes craft its own windows sometimes which we have to thread back
into MainNoGUI, we don't need to thread the window handle that GLX
creates at all.
This removes the reference to pass back here, and the g_pWindowHandle
always be the same as the window returned by Host_GetRenderHandle().
A future cleanup could remove g_pWindowHandle entirely.
We now have two cases: the GLX window is parented into a frame, or it's
parented into the MainNoGUI host. In both cases, the GLX window should
be locked to the size of the parent, so just sync it up based on that.