Further hash function cleanup. #456
- hash_id should take a uint64_t argument, rather than unsigned.
- Instead of adding the IDs or hashing them separately and combining, pack both into the uint64_t argument for hash_id, since each is a 32-bit ID. Further experimentation supports that this has better collision behavior.
- Add a function to hash a series of IDs, rather than doing it with a loop combining the intermediate hashes in inconsistent ways.
- Add a #define to log a probe count and a limit to assert, and rework the hashing code control flow for tracking/asserting on the probe count.
- Remove `hash_fnv1a_64` from a logging statement (normally not compiled). Since that was the only remaining use, remove the FNV1a functions from hash.h.
- Remove FSM_PHI_32 and FSM_PHI_64, now unused.
- parser.act still uses 32-bit FNV1a, but that's for character-by-character hashing, so that still makes sense, and it has its own copy of fnv1a.
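As a rough illustration of the first few points, here is a minimal sketch of packing two 32-bit IDs into one 64-bit key and of chaining a hash over a series of IDs. All `sketch_*` names are hypothetical, and the mixer is the splitmix64 step used purely as a stand-in; the real `hash_id` and the ID-series function in hash.h may look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for hash_id: one splitmix64 step.
 * This is an assumption for illustration, not libfsm's mixer. */
static uint64_t
sketch_hash_id(uint64_t id)
{
	id += 0x9e3779b97f4a7c15ULL;
	id = (id ^ (id >> 30)) * 0xbf58476d1ce4e5b9ULL;
	id = (id ^ (id >> 27)) * 0x94d049bb133111ebULL;
	return id ^ (id >> 31);
}

/* Pack two 32-bit IDs into a single 64-bit key before mixing.
 * Distinct (a, b) pairs always produce distinct keys, unlike a + b. */
static uint64_t
sketch_hash_pair(uint32_t a, uint32_t b)
{
	const uint64_t ab = ((uint64_t)a << 32) | (uint64_t)b;
	return sketch_hash_id(ab);
}

/* Hash a series of IDs in one place by chaining: each step mixes
 * the previous result with the next ID, so the combination rule is
 * the same for every caller. */
static uint64_t
sketch_hash_ids(size_t count, const uint32_t *ids)
{
	uint64_t h = 0;
	for (size_t i = 0; i < count; i++) {
		h = sketch_hash_id(h ^ (uint64_t)ids[i]);
	}
	return h;
}
```

Because the stand-in mixer is a bijection on 64-bit values, two different packed pairs can never hash to the same value here; collisions only appear once the result is reduced to a bucket index.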
This gets us:
src/libfsm/determinise.c
Outdated
* runs of collisions appearing in the tables. */
const uint64_t res = hash_id(a + b);
assert(a != b);
const uint64_t ab = ((uint64_t)a << 32) | (uint64_t)b;
`a + b` is commutative, but this shifting isn't. I'm going to double-check that the ordering is consistent here.
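To make the concern concrete, here is a small sketch contrasting an order-sensitive pack with one that sorts the two IDs first. The `sketch_*` names are hypothetical, and whether the caller in determinise.c needs the sorted form or already passes the IDs in a fixed order is exactly what is being double-checked; this only shows the difference.

```c
#include <assert.h>
#include <stdint.h>

/* Order-sensitive: (a, b) and (b, a) produce different keys. */
static uint64_t
sketch_pack_ordered(uint32_t a, uint32_t b)
{
	return ((uint64_t)a << 32) | (uint64_t)b;
}

/* Order-insensitive: sort the IDs first, so either argument order
 * yields the same key, matching the commutativity of a + b. */
static uint64_t
sketch_pack_unordered(uint32_t a, uint32_t b)
{
	assert(a != b);
	const uint32_t lo = a < b ? a : b;
	const uint32_t hi = a < b ? b : a;
	return ((uint64_t)lo << 32) | (uint64_t)hi;
}
```

If every call site already passes the pair in a consistent order, the plain ordered pack is enough and the sort is unnecessary.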
I added an assertion for |
examples/words still shows the same outliers. Ignoring those, the maximum on my desktop is ~3400ms.

We talked through this a bit; these outliers are due to something during determinisation which isn't the fault of this hash function per se, just that it shows up here 'cos it gets called so many times. So I'm happy to go ahead and merge here, and I hope we can address that situation in the caller instead.
- hash_id should take a `uint64_t` argument, rather than `unsigned`.
- Instead of adding the IDs or hashing them separately and combining, pack both into the uint64_t argument for hash_id, since each is a 32-bit ID. Further experimentation supports that this has better collision behavior.
- Add `hash_ids` to `hash.h`, so hashing a series of IDs is handled in one place. Chain the hashing, so differences from individual bits propagate better.
- Remove fnv1a and `FSM_PHI_*`, now unused.
- Add #defines for logging and asserting probe counts.
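The probe-count tracking in the last point could look roughly like the sketch below: a linear-probing lookup that counts probes, optionally logs the count, and asserts it stays under a limit. The macro names, the limit value, the empty-bucket marker, and the table layout are all assumptions for illustration, not the definitions used in libfsm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_LOG_PROBES 0      /* hypothetical: set to 1 to log counts */
#define SKETCH_PROBE_LIMIT 100   /* hypothetical upper bound to assert on */
#define SKETCH_NO_ID UINT32_MAX  /* hypothetical marker for an empty bucket */

/* Find the bucket holding id, or the first empty bucket, in a table
 * whose size is a power of two. Returns bucket_count if the table is
 * full and id is absent. */
static size_t
sketch_find_bucket(const uint32_t *buckets, size_t bucket_count,
    uint64_t hash, uint32_t id)
{
	const size_t mask = bucket_count - 1;
	size_t res = bucket_count;
	size_t probes;

	for (probes = 0; probes < bucket_count; probes++) {
		const size_t b = (size_t)((hash + probes) & mask);
		if (buckets[b] == id || buckets[b] == SKETCH_NO_ID) {
			res = b;
			break;
		}
	}

#if SKETCH_LOG_PROBES
	fprintf(stderr, "find_bucket: %zu extra probes\n", probes);
#endif
	/* A long run of probes suggests clustering in the hash output. */
	assert(probes < SKETCH_PROBE_LIMIT);
	return res;
}
```

Keeping the count in a single variable that is checked once after the loop means the logging and the assertion see the same value on every exit path.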