Initial implementation for Hybrid Hash Functions #91
base: main
Conversation
Hi @oluiscabral, I'm really glad to see this PR!
In general, I've recently discovered the class of software we actually target with our framework: DAM, which could be used for categorizing various assets like photos, videos, and 3D models. So ideally it would be great to cover all file sizes. That doesn't mean the framework will be used only for DAM, but it's a pretty good reference because it requires meticulous work with every single file and its metadata like tags, scores, attributes etc.
```rust
const THRESHOLD: u64 = 1024 * 1024 * 1024; // 1 GiB
```
A wild idea: is it difficult to make this constant a type parameter? Then we could instantiate the same class using different thresholds. It would be really great to have benchmarks of the optimized "skip-chunks" hash function for different sizes. The goal of such benchmarks is not only to see the speed improvement, but also to see the collision ratio.
Nope, it is not difficult. I just haven't done it yet, because I wanted to keep this implementation as similar as possible to the other implementations (Blake3 and CRC32) in this PoC.
```rust
if size < THRESHOLD {
    // Use Blake3 for small files
    log::debug!("Computing BLAKE3 hash for bytes");

    let mut hasher = Blake3Hasher::new();
    hasher.update(bytes);
    let hash = hasher.finalize();
    Ok(Hybrid(encode(hash.as_bytes())))
} else {
    // Use FNV hashing for large files
    log::debug!("Computing simple hash for bytes");

    let hash = fnv_hash_bytes(bytes);
    Ok(Hybrid(format!("{}_{}", size, hash)))
}
```
- The original idea is the opposite: use Blake3 for small and medium files, and use a faster function for large files, where the content is large enough to keep the collision ratio low.
- FNV hashing can be added separately as a dedicated hash function, same as the "skip-chunk" hash function.
- A wild idea: can we parameterize this hybrid hash function with other hash functions? Then we could compose two "dedicated" hash functions into a threshold-based hash function.
- Yes, any file whose size is below the `THRESHOLD` is already hashed by Blake3.
- 100%
- Yes, totally. I'm not sure whether there are higher-priority things to do first, but we could even create a fully parameterized implementation that allows an indefinite number of pairs, each composed of a hash function and its related threshold. I've done something similar to this in JavaScript once.
Hello!
This pull request introduces a new implementation of the `ResourceId` trait using two different hash approaches (Blake3 for files below a size threshold, and FNV for larger files). The `dev-hash/benches/hybrid.rs` file is a modification of `dev-hash/benches/blake3.rs`, with the new `Hybrid` struct used instead of the `Blake3` struct. This allows us to compare and analyze the performance differences between the two approaches.
Note: I have not implemented tests for files larger than the threshold yet. This will be added in a future update. Please let me know if you have any suggestions or concerns regarding this approach.