Add N64 recompiler block hashes & inline 64-bit ops #1640

Draft
wants to merge 2 commits into
base: master
from

@ghost ghost commented Sep 8, 2024

Reworking the recompiler's JIT block accesses to be more flexible so that we can bake more runtime info into the blocks. This way runtime checks can be avoided in the generated code, which makes inlining the implementations more attractive since we don't need to generate calls to e.g. kernel mode checks for every 64-bit op. This is expected to increase the recompiled code's performance eventually.

I expect these changes to have a small negative performance impact but haven't measured it.

Commit messages:

We'd like the recompiler to take the execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but all the information used at
compile time. This is done by computing a 32-bit key and using that as
a block's identifier instead of the last six physical address bits like
was done before.

Since we now have 32-bit instead of 6-bit keys, the block() function
hashes the keys before mapping them to one of the 64 pool rows. The hash
function was chosen arbitrarily to be better than a simple multiplicative
hash and is likely not the best choice for this exact task.

  • Pass JITContext down to leaf emit functions.
  • Emit inline implementations of basic 64-bit operations.
  • Use block compile-time information to elide kernel mode checks of
    the now inlined operations.
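
To make the keying scheme above concrete, here is a minimal sketch. The `JitFlags` struct, its bit layout, and the helper names are assumptions made for illustration only; the multiplicative mixer is a placeholder, not the hash function this PR uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical execution-context flags baked into a block at compile time.
struct JitFlags {
  bool kernelMode;
  bool is64bit;
  bool cop1Enabled;
};

// Pack the flags into bits above the low six key bits (illustrative layout).
constexpr uint32_t packFlags(JitFlags f) {
  return (uint32_t(f.kernelMode) << 6) | (uint32_t(f.is64bit) << 7) |
         (uint32_t(f.cop1Enabled) << 8);
}

// Low six bits come from the address (word index), upper bits from the context.
constexpr uint32_t blockKey(uint32_t address, uint32_t flagBits) {
  return ((address >> 2) & 0x3f) | (flagBits & ~0x3fu);
}

// Placeholder mixer: a plain multiplicative hash reduced to one of 64 rows.
constexpr uint32_t poolRow(uint32_t key) {
  return (key * 0x9E3779B1u) >> 26;  // keep the top 6 bits
}
```

With a key like this, the same address compiled in kernel mode and in user mode yields two distinct keys, so each variant can hold code specialized for its mode.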

@ghost ghost marked this pull request as draft September 8, 2024 15:08
@LukeUsher LukeUsher requested review from invertego and rasky September 9, 2024 09:00
.cop1Enabled = scc.status.enable.coprocessor1 > 0,
.floatingPointMode = scc.status.floatingPointMode > 0,
.is64bit = context.bits == 64,
});
Collaborator

I'm a bit concerned that we compose the context structure every block we run. I believe it might be better if we do this on changes instead. I understand it's harder to get right and can bring more bugs, but I believe it's going to be faster.

So let's say, have a function that recalculates the current context and its hash key and stores them somewhere. At runtime, we just use the precalculated context and the precalculated hash key.

Then, you need to call the function to recalculate the context in any codepath that can change one of the variables that affect it, so for instance mtc0 of the status register is a prime suspect, and there will be others of course, but maybe not dozens of them.

Author

Thank you for the quick review. I moved JITContext inside the existing Context struct; it's now Context::JIT, and its contents + bit vector representation are computed only when Context::setMode() is called. This is more often than strictly needed (for example, exceptions change the mcc vector state and call setMode()), but it should be better than before.

I added separate update and toBits member functions to Context::JIT for later debugging; they can be used to check for staleness bugs in debug builds.
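
A rough sketch of how such a staleness check could look in a debug build. The `CpuState` and `JitContext` types and the bit positions here are simplified stand-ins invented for the example, not the actual ares structures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the CPU state the JIT context is derived from.
struct CpuState {
  bool kernelMode = true;
  bool is64bit = false;
};

// Simplified model of a context with separate update and toBits steps.
struct JitContext {
  bool kernelMode = false;
  bool is64bit = false;

  void update(const CpuState& s) {
    kernelMode = s.kernelMode;
    is64bit = s.is64bit;
  }
  uint32_t toBits() const {
    return (uint32_t(kernelMode) << 6) | (uint32_t(is64bit) << 7);
  }
};

// Recompute from scratch and compare against the cached bit vector.
inline bool cacheIsFresh(const CpuState& s, uint32_t cachedBits) {
  JitContext fresh;
  fresh.update(s);
  return fresh.toBits() == cachedBits;
}
```

A debug build could then run `assert(cacheIsFresh(state, jitBits))` just before each block lookup, so that a missed update shows up at the first stale lookup rather than as miscompiled code later.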

We'd like the recompiler to take the execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but all the information used at
compile time. This is done by computing a 32-bit key and using that as
a block's identifier instead of the last six physical address bits like
was done before.

The execution state and its representation as bit vector are recomputed
only when needed, in this case each time Context::setMode() is called,
which happens on powerup, in both MTC0 and MFC0 instructions, and on
exceptions.

Since we now have 32-bit instead of 6-bit keys, the block() function
hashes the keys before mapping them to one of the 64 pool rows. The hash
function was chosen arbitrarily to be better than a simple multiplicative
hash and is likely not the best choice for this exact task.

  • Pass JITContext down to leaf emit functions.
  • Emit inline implementations of basic 64-bit operations.
  • Use block compile-time information to elide kernel mode checks of
    the now inlined operations.
@ghost ghost force-pushed the add-block-hashes branch from 382c633 to ba4504a Compare September 10, 2024 19:02
@ghost
Author

ghost commented Sep 10, 2024

The GDB::server.hasBreakpoints() check is a bit problematic because right now it gets updated only when Context::setMode is called, even though breakpoints might be added or removed at other times as well. I suppose it could be polled for every block perhaps? I considered it part of the execution context, but before it wasn't taken into account at all when looking up blocks. To me it seems like breakpoints only worked after all Pool instances were flushed.

@invertego
Contributor

Sorry for the delay. I will try to review this in the next few days.

Contributor

@invertego invertego left a comment

Looks good aside from the issues mentioned. How much more are you planning to do with this before taking it out of draft status?

return (address >> 2 & 0x3f) | (jitBits & ~0x3f);
}

auto CPU::Recompiler::computePoolRow(u32 key) -> u32 {
Contributor

The n64 core already pulls in xxhash.h, so XXH32_avalanche might work here as well.
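
For reference, a standalone sketch of the mixing steps that the 32-bit xxHash finalizer performs, reduced to a 64-row index. This is a re-implementation for illustration (the function names `avalanche32` and `rowFor` are made up here); real code would call into xxhash.h rather than copy the constants.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPrime2 = 0x85EBCA77u;  // value of XXH_PRIME32_2
constexpr uint32_t kPrime3 = 0xC2B2AE3Du;  // value of XXH_PRIME32_3

// Xorshift and odd-multiply rounds, as in the XXH32 finalization step.
constexpr uint32_t avalanche32(uint32_t h) {
  h ^= h >> 15;
  h *= kPrime2;
  h ^= h >> 13;
  h *= kPrime3;
  h ^= h >> 16;
  return h;
}

// Map a 32-bit block key to one of 64 pool rows.
constexpr uint32_t rowFor(uint32_t key) {
  return avalanche32(key) & 0x3f;
}
```

Every step is invertible, so distinct keys never collide before the final 6-bit reduction; collisions can only come from the masking itself.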

@@ -18,6 +18,9 @@ auto CPU::Context::setMode() -> void {
break;
}

jit.update(*this, self);
jitBits = jit.toBits();
Contributor

Should update() be responsible for recalculating jitBits?

@@ -18,6 +18,9 @@ auto CPU::Context::setMode() -> void {
break;
}

jit.update(*this, self);
Contributor

We also need to make sure this is updated after a save state load.

Contributor

Also when the breakpoint count changes.
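
One way to picture the refresh points mentioned in this thread (mode changes, save state loads, breakpoint count changes) is a small cache object with one hook per event. Everything below is a hypothetical sketch; none of these names come from the ares code base.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cache of the context bits used in block keys.
struct JitKeyCache {
  uint32_t bits = 0;
  bool kernelMode = true;
  bool hasBreakpoints = false;

  // Rebuild the cached bit vector from the current inputs.
  void refresh() {
    bits = (uint32_t(kernelMode) << 6) | (uint32_t(hasBreakpoints) << 7);
  }

  // Each event that can change an input refreshes the cache.
  void onModeChange(bool kernel) { kernelMode = kernel; refresh(); }
  void onStateLoaded(bool kernel) { kernelMode = kernel; refresh(); }
  void onBreakpointCountChanged(unsigned count) {
    hasBreakpoints = count > 0;
    refresh();
  }
};
```

Funnelling every input change through a hook like this keeps the per-block path down to reading `bits`, at the cost of having to find every such codepath.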

@@ -106,6 +118,8 @@ struct CPU : Thread {
u32 mode;
u32 bits;
u32 segment[8]; //512_MiB chunks
u32 jitBits;
Contributor

The rest of the recompiler state is in the recompiler object. Is there a reason you want to keep this state separate?

call(&CPU::DSUBU);
emitZeroClear(Rdn);
if (!checkDualAllowed(ctx)) return 1;
sub64(reg(0), mem(Rs), mem(Rt), set_o);
Contributor

set_o not needed

Comment on lines +132 to +133
mov32_f(temp, flag_o);
auto didntOverflow = cmp32_jump(temp, imm(0), flag_eq);
Contributor

Suggested change
- mov32_f(temp, flag_o);
- auto didntOverflow = cmp32_jump(temp, imm(0), flag_eq);
+ auto didntOverflow = jump(flag_no);
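
For readers less familiar with the flag-based check, the decision the emitted code makes (branch away when the signed add did not overflow) corresponds roughly to the plain C++ below. The helper name `addOverflows64` is invented for this sketch and is not part of the recompiler.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when a + b overflows as a signed 64-bit addition.
// The wrapped sum is written to result either way.
inline bool addOverflows64(int64_t a, int64_t b, int64_t& result) {
  uint64_t ua = uint64_t(a);
  uint64_t ub = uint64_t(b);
  uint64_t sum = ua + ub;  // unsigned add wraps without undefined behavior
  result = int64_t(sum);   // two's complement reinterpretation
  // Overflow happened if both operands differ in sign from the sum.
  return (((ua ^ sum) & (ub ^ sum)) >> 63) != 0;
}
```

On the host CPU the overflow flag carries exactly this bit after the add, which is why a single conditional jump on "no overflow" is enough.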

// If overflow flag set: throw an exception, skip the instruction via the 'end' label.
mov32_f(temp, flag_o);
auto didntOverflow = cmp32_jump(temp, imm(0), flag_eq);
call(&CPU::Exception::arithmeticOverflow, &cpu.exception);
Contributor

Emitting calls this way causes all the parameters to be emitted as immediates. In general it's cheaper (in terms of code footprint) to calculate addresses that are passed as arguments (see instances of lea in this file). Although, for cold code paths like this, it would be even better to call a trampoline in CPU that takes no arguments (aside from the implicit this pointer).
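
A minimal sketch of the trampoline idea, using a simplified `Cpu` type. The member name `raiseArithmeticOverflow` and the `raised` counter are assumptions for the example only.

```cpp
#include <cassert>

// Simplified stand-in for the CPU object; not the actual ares class.
struct Cpu {
  struct Exception {
    int raised = 0;
    void arithmeticOverflow() { ++raised; }
  } exception;

  // Trampoline: takes only the implicit this pointer, so generated code
  // does not have to materialize the address of the exception member.
  void raiseArithmeticOverflow() { exception.arithmeticOverflow(); }
};
```

The generated code would then pass a single pointer it most likely already has in a register, and the extra indirection is paid only on the cold overflow path.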

emitZeroClear(Rtn);
if (!checkDualAllowed(ctx)) return 1;
add64(reg(0), mem(Rs), imm(i16), set_o);
if(Rtn > 0) mov64(mem(Rt), reg(0));
Contributor

The add64 can also be skipped if Rtn is zero.

emitZeroClear(Rdn);
if (!checkDualAllowed(ctx)) return 1;
sub64(reg(0), mem(Rs), mem(Rt), set_o);
if(Rdn > 0) mov64(mem(Rd), reg(0));
Contributor

The sub64 can be skipped if Rdn is zero.

Block* block;
u32 tag;
};
Row rows[1 << 6];
Contributor

How did you arrive at 64 rows? Were other numbers tried?

@ghost
Author

ghost commented Jan 19, 2025

Noting here that I've transferred to a new GitHub account: @pekkavaa
I'd like to continue ares JIT performance work at some point but maybe with a solid benchmarking setup first. Even small changes should get reliably measured.

Edit: And thanks for the review, invertego!
