
Further hash function cleanup. #456

Merged
merged 6 commits into main from sv/improve-hash-performance-in-determinisation--part-3 on Jan 8, 2024

Conversation

@silentbicycle (Collaborator) commented Jan 3, 2024

  • hash_id should take a uint64_t argument, rather than unsigned.

  • Instead of adding the IDs or hashing them separately and combining, pack both into the uint64_t argument for hash_id, since each is a 32-bit ID. Further experimentation supports that this has better collision behavior.

  • Add hash_ids to hash.h, so hashing a series of IDs is handled in one place. Chain the hashing, so differences from individual bits propagate better.

  • Remove fnv1a and FSM_PHI_*, now unused.

  • Add #defines for logging and asserting probe counts.

- hash_id should take a uint64_t argument, rather than unsigned.

- Instead of adding them or hashing them separately and combining,
  pack both into the uint64_t argument for hash_id, since each is
  a 32-bit ID. Further experimentation supports that this has
  better collision behavior.
- Add a function to hash a series of IDs, rather than doing it
  with a loop combining the intermediate hashes in inconsistent
  ways (the packing and chained hashing are sketched after this
  list).

- Add a #define to log the probe count and a limit to assert
  against, and rework the hashing code's control flow to track
  and assert on the probe count (also sketched after this list).

- Remove `hash_fnv1a_64` from a logging statement (normally not
  compiled); since that was its only remaining use, remove the
  FNV1a functions from hash.h.

- Remove FSM_PHI_32 and FSM_PHI_64, now unused.

- parser.act still uses 32-bit FNV1a, but that's for character-by-
  character hashing, so that still makes sense, and it has its
  own copy of fnv1a.
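
Roughly, the shape of the packing and chained hashing described above. This is not the code from this PR: the mixer below is a splitmix64-style stand-in for whatever hash_id actually does, and hash_id_pair is a hypothetical name for the call site's packing.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (splitmix64 finalizer), used here only to
 * represent hash_id's mixing; the real implementation differs. */
static uint64_t
hash_id(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Pack the two 32-bit IDs into the single uint64_t argument rather
 * than adding them: a + b is commutative, so distinct pairs collide,
 * while packing keeps (a, b) and (b, a) apart. */
static uint64_t
hash_id_pair(uint32_t a, uint32_t b)
{
	const uint64_t ab = ((uint64_t)a << 32) | (uint64_t)b;
	return hash_id(ab);
}

/* Hash a series of IDs by chaining: feed the running hash back into
 * the next step so a difference in any one ID's bits propagates
 * through the final result, instead of combining per-ID hashes
 * ad hoc. */
static uint64_t
hash_ids(size_t count, const uint32_t *ids)
{
	uint64_t h = 0;
	for (size_t i = 0; i < count; i++) {
		h = hash_id(h ^ (uint64_t)ids[i]);
	}
	return h;
}
```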
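And a sketch of the probe-count instrumentation idea; the macro names and the limit below are hypothetical, chosen only to show logging a probe count and asserting against a limit.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical knobs: when LOG_PROBE_COUNT is set, every lookup's
 * probe count is logged; MAX_PROBE_LIMIT bounds how long a probe
 * sequence may get before the assert fires. Both cost nothing in
 * normal builds. */
#ifndef LOG_PROBE_COUNT
#define LOG_PROBE_COUNT 0
#endif
#ifndef MAX_PROBE_LIMIT
#define MAX_PROBE_LIMIT 64
#endif

static void
note_probe_count(unsigned probes)
{
#if LOG_PROBE_COUNT
	fprintf(stderr, "probe count: %u\n", probes);
#endif
	assert(probes <= MAX_PROBE_LIMIT);
	(void)probes;
}
```

The lookup loop would then count its probes and call something like note_probe_count once it finds a slot or gives up.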
@silentbicycle marked this pull request as ready for review on January 4, 2024 at 20:09
@katef (Owner) commented Jan 4, 2024

This gets us:

; ./words.sh 40000 100 | guff -b
    x: [0 - 19]    y: [0 - 4352]
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀⠂
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠐⠀⠁⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠄⠀⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠐⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠁⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⡇⠀⠀⠀⠀⠐⠀⠐⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣇⣀⣄⣈⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀

* runs of collisions appearing in the tables. */
const uint64_t res = hash_id(a + b);
assert(a != b);
const uint64_t ab = ((uint64_t)a << 32) | (uint64_t)b;
@silentbicycle (Collaborator, Author)

a + b is commutative, but this shifting isn't; I'm going to double-check that the ordering is consistent here.
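
To illustrate the concern with hypothetical values (not taken from the diff):

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t a = 1, b = 2;

	/* Addition can't tell the two orderings apart... */
	assert((uint64_t)a + b == (uint64_t)b + a);

	/* ...packing can, so callers must agree on which ID goes in
	 * the high half for the hash to be consistent. */
	uint64_t ab = ((uint64_t)a << 32) | (uint64_t)b;
	uint64_t ba = ((uint64_t)b << 32) | (uint64_t)a;
	assert(ab != ba);

	return 0;
}
```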

@silentbicycle (Collaborator, Author)

Fixed in 9cba560. Thanks @dgryski for catching this.

@silentbicycle (Collaborator, Author)

I added an assertion that hash_ids's array of IDs is in ascending order. The assert itself shouldn't be expensive, but hash_ids is called very often in determinisation's inner loops: when I tested it, the assert made a determinisation-heavy workload nearly 10% slower overall, so I put #if EXPENSIVE_CHECKS around it. All tests passed locally with it enabled.
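
A minimal sketch of that gating; only EXPENSIVE_CHECKS and the ascending-order requirement come from the comment above, the function shape and the placeholder hash are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t
hash_ids(size_t count, const uint32_t *ids)
{
#if EXPENSIVE_CHECKS
	/* The IDs are required to be in ascending order. The loop is
	 * cheap per call, but hash_ids runs in determinisation's inner
	 * loops, so it is compiled in only for checked builds. */
	for (size_t i = 1; i < count; i++) {
		assert(ids[i - 1] < ids[i]);
	}
#endif

	/* Placeholder for the real chained hashing of ids[0..count-1]. */
	uint64_t h = 0;
	for (size_t i = 0; i < count; i++) {
		h = (h << 1) ^ ids[i];
	}
	return h;
}
```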

@katef (Owner) commented Jan 8, 2024

examples/words still shows the same outliers. Ignoring those, the maximum on my desktop is ~3400ms.

We talked through this a bit; these outliers are due to something during determinisation which isn't the fault of this hash function per se, just that it shows up here 'cos it gets called so many times. So I'm happy to go ahead and merge here, and I hope we can address that situation in the caller instead.

@katef merged commit 78081b8 into main on Jan 8, 2024
322 checks passed
@katef deleted the sv/improve-hash-performance-in-determinisation--part-3 branch on January 8, 2024 at 22:08