Add blob split and splice API #282
I wonder if we should design this `SpliceBlobRequest` as part of `BatchUpdateBlobsRequest`. The problem is that we are not pushing the chunk blobs in this request; we are only sending the CAS server metadata about combining some blobs into a larger blob. There could be a delay between the BatchUpdateBlobs RPC call and the SpliceBlob RPC call. That delay could be a few milliseconds, or it could be weeks or months, after some of the uploaded chunks have expired from the CAS server. There is no transaction guarantee between the two RPCs.
So a way to have some form of transaction guarantee would be to send this as part of the same RPC that uploads all the blobs.
But that is the overarching principle behind the whole protocol: we upload blobs and later refer to them by their digests. All that works because the CAS promises to keep all blobs fresh (i.e., not forget about them for a reasonable amount of time) whenever its answer implies it knows about them (that could be the answer to a `FindMissingBlobs` request or a success statement for a blob upload request). The typical workflow for a client using this request would anyway be to split the blob locally, use `FindMissingBlobs` to find out which blobs are not yet known to the CAS, then (batch) upload only the ones not yet known to the CAS (that's where the savings in traffic come from), and then request a splice of all of them. All this works because of the promise of the CAS to keep referenced objects alive.
To give a prominent example where the protocol already relies on that guarantee to keep objects alive, consider the request to execute an action. That request does not upload any blobs, yet still expects them to be there, because a recent interaction with the CAS showed they are in the CAS already. In a sense, that example also shows that blob splicing is nothing fundamentally new, but just an optimisation: the client could already now request an action calling `cat` be executed, in a request that is independent of the blob upload. However, making it explicitly an operation on the CAS leaves huge room for optimisation: no need to spawn an action-execution environment, the CAS knows ahead of time the hash and size of the blob that is to be stored as a result of that request, and if it knows the blob in question it does not even have to do anything (apart from keeping that blob fresh), etc.
This document is behind a paywall. Any chance we can link to a spec that is freely accessible?
Is it this paper? http://cui.unige.ch/tcs/cours/algoweb/2002/articles/art_marculescu_andrei_1.pdf
Even though that paper provides a fairly thorough mathematical definition of how Rabin fingerprints work, it's not entirely obvious to me how it translates to an actual algorithm for us to use. Any chance we can include some pseudocode, or link to a publication that provides it?
Exactly, that is the paper; I can update the link. I really spent some time trying to find a nice algorithmic description of the Rabin fingerprint method, but failed. I could only find some real implementations on GitHub, but I assume that is not something we would like to link here. There is also the original paper on the Rabin method, http://www.xmailserver.org/rabin.pdf, but that one doesn't seem to help either. The paper above gives a reasonable introduction to how Rabin fingerprinting works and even some thoughts about how to implement it, so I thought it is the best source to link here.
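For readers looking for the pseudocode requested above, here is a minimal sketch of how a rolling fingerprint is typically turned into a chunker. This uses a generic polynomial rolling hash with illustrative constants, not the actual Rabin polynomial or any reference implementation from this PR; the boundary test `(fp & mask) == 0` and the size parameters are the ones under discussion in this thread.

```python
def chunk_boundaries(data, window=64, mask=(1 << 19) - 1,
                     min_size=128 * 1024, max_size=2 * 1024 * 1024,
                     base=257, mod=(1 << 61) - 1):
    """Return end offsets of chunks in `data`. A cut is made where the
    polynomial hash of the trailing `window` bytes has all masked bits
    zero, subject to the minimum and maximum chunk sizes."""
    pow_w = pow(base, window - 1, mod)  # coefficient of the oldest byte
    boundaries = []
    start = 0      # offset where the current chunk begins
    fp = 0         # hash of the last `filled` bytes
    filled = 0
    for i, b in enumerate(data):
        if filled == window:
            # Drop the byte leaving the window in O(1).
            fp = (fp - data[i - window] * pow_w) % mod
        else:
            filled += 1
        fp = (fp * base + b) % mod
        size = i - start + 1
        if size >= max_size or (size >= min_size and (fp & mask) == 0):
            boundaries.append(i + 1)
            start, fp, filled = i + 1, 0, 0
    if start < len(data):
        boundaries.append(len(data))
    return boundaries
```

Note that the minimum-size check simply skips the boundary test for the first `min_size` bytes of each chunk, which is the behaviour debated further down in this thread.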
Actually, the paper about FastCDC https://ieeexplore.ieee.org/document/9055082 also contains an algorithmic description of the Rabin fingerprinting technique (Algorithm 3, RabinCDC8KB). The only thing missing there is a precise description of the precomputed U and T arrays. Even though they provide links to other papers where these arrays are supposedly defined, I could not find any definition in those papers.
Does this mean that there is a...
... probability that a chunk actually contains a cutoff point that wasn't taken because it would violate the minimum chunk size? That probability sounds a lot higher than I'd imagine to be acceptable.
Consider the case where you inject some arbitrary data close to the beginning of a chunk that had an unrespected cutoff point within its first 128 KB. If the injected data pushes that cutoff point above the 128 KB boundary, the cutoff point becomes respected. This in turn could cause successive chunks to use different cutoff points as well.
Maybe it makes more sense to pick 16 KB or 32 KB here?
Assuming the idea is that you process input byte by byte, compute a fingerprint over a sliding window, and create a chunk if the last 19 bits of the hash are all zeroes or all ones (which is it?), are you sure this will give an average chunk size of 512 KB? That would only hold if there were no minimum size, right? So shouldn't the average chunk size be 128 + 512 = 640 KB?
All of the 19 lowest bits of the fingerprint need to be zero for the expression (fp & mask) to become 0, in which case you have found a chunk boundary. Regarding your question about the average chunk size, you are right: the actual average chunk size = expected chunk size (512 KB) + minimum chunk size (128 KB). I will state this more clearly in the comments, thanks for pointing it out!
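The arithmetic here can be checked directly: past the minimum size, the distance to the next boundary is geometrically distributed, so its mean is the reciprocal of the per-position match probability. A quick sketch using the sizes from this thread:

```python
MIN_SIZE = 128 * 1024          # 131072 bytes
mask = (1 << 19) - 1           # 19 low bits set
p = 1.0 / (1 << 19)            # chance a fingerprint satisfies (fp & mask) == 0
expected_gap = 1 / p           # mean of the geometric distribution: 512 KB
avg_chunk_size = MIN_SIZE + expected_gap
print(int(avg_chunk_size))     # 655360 bytes, i.e. 640 KB
```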
I just did some measurements, and it looks like using `(fp & mask) == 0` is actually a pretty bad choice. The reason is that it's pretty easy to craft windows for which `fp == 0`, namely a sequence consisting exclusively of 0-bytes. This has also been observed here: https://ieeexplore.ieee.org/document/9006560
I noticed this especially when creating chunks of a Linux kernel source tarball (decompressed), for the same reason as stated in the article.
I didn't observe the 4-byte chunks, because I used a minimum size as documented.
Results look a lot better if I use `(fp & mask) == mask`.
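The all-zero-window failure mode is easy to reproduce with a toy polynomial rolling hash. This is not the exact Rabin fingerprint from this PR, but any hash of this shape behaves the same way on a window of 0-bytes:

```python
def window_hash(window, base=257, mod=(1 << 61) - 1):
    # Plain polynomial hash of the window contents.
    h = 0
    for b in window:
        h = (h * base + b) % mod
    return h

mask = (1 << 19) - 1
fp = window_hash(bytes(64))    # a 64-byte window consisting only of 0-bytes
print(fp == 0)                 # True: an all-zero window hashes to 0
print((fp & mask) == 0)        # True: a boundary at every offset inside a zero run
print((fp & mask) == mask)     # False: no spurious boundaries with this test
```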
Assuming my understanding of the algorithm is correct, doesn't that mean that there is a...
... probability that chunks end up reaching the maximum size? This sounds relatively high, but is arguably unavoidable.
It's a bit odd that these kinds of algorithms don't attempt to account for this, for example by repeatedly rerunning the algorithm with a smaller bit mask (18, 17, 16, [...] bits) until a match is found. That way you get a more even spread of data across such chunks, and need to upload fewer chunks in case data is injected into/removed from neighbouring chunks that both reach the 2 MB limit.
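For what it's worth, that probability can be estimated with a back-of-the-envelope calculation (my assumption, not from the proposal): treat each byte position between the minimum and maximum size as an independent boundary trial with probability 2^-19.

```python
import math

p = 2.0 ** -19                       # per-position boundary probability (19-bit mask)
span = 2 * 1024 * 1024 - 128 * 1024  # positions at which a boundary may occur
p_reach_max = (1.0 - p) ** span      # no boundary anywhere in the span
print(round(p_reach_max, 4))         # ~0.0235
print(round(math.exp(-p * span), 4)) # geometric tail, ~exp(-span / avg_gap)
```

So roughly 2–3% of chunks would be forced to the 2 MB maximum under these parameters, which matches the "relatively high, but arguably unavoidable" intuition above.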
[ NOTE: I'm absolutely not an expert on content defined chunking algorithms! ]
Is a window size of 64 bits intentional? If I look at stuff like https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf, it seems they are using 48 bytes, while in their case they are aiming for min=2KB, avg=8KB, max=64KB. Shouldn't this be scaled proportionally?
To me it's also not fully clear why an algorithm like this makes a distinction between a minimum chunk size and a window size. Why wouldn't one simply pick a 128 KB window and slide over that?
You are right, that was a mistake on my side; what was meant is a 64-byte window size. I took this value from an existing implementation. Thanks for catching this!
The window size is a general parameter of a rolling hash function: the hash value (or fingerprint) for a specific byte position is calculated over that window of bytes. Then you move forward by one byte and calculate the hash value for the new window of bytes. Thanks to the rolling-hash property, this can be done very efficiently. So the window size influences the hash value at a specific byte position and thus the locations of the actual chunk boundaries. Theoretically, we could use a window size equal to the minimum chunk size of 128 KB, but such a large window is not common in the implementations of content-defined chunking algorithms I have seen so far.
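To illustrate the rolling property mentioned here: with a polynomial hash, sliding the window by one byte is an O(1) update rather than a full rehash of all 64 bytes. A sketch with illustrative constants (not the actual Rabin polynomial):

```python
WINDOW = 64                       # the 64-byte window discussed above
BASE = 257
MOD = (1 << 61) - 1
POW = pow(BASE, WINDOW - 1, MOD)  # coefficient of the byte leaving the window

def full_hash(window):
    # O(WINDOW) hash of a complete window.
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h

def roll(h, old_byte, new_byte):
    # Remove the oldest byte, then shift in the new one: O(1) per step.
    h = (h - old_byte * POW) % MOD
    return (h * BASE + new_byte) % MOD

data = bytes(range(200))
h = full_hash(data[:WINDOW])
for i in range(WINDOW, len(data)):
    h = roll(h, data[i - WINDOW], data[i])
    assert h == full_hash(data[i - WINDOW + 1 : i + 1])  # same as rehashing
```

This is also why a 128 KB window would be workable in principle: only the initial hash costs O(window), while each subsequent position is still O(1).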
Looking at algorithm 2 in that paper, I see that I can indeed plug these values into MinSize, MaxSize, and NormalSize. But what values should I use for MaskS and MaskL now? (There's also MaskA, but that seems unused by the stock algorithm.)
Good question. As far as I have understood the algorithm, it is the normalized chunking technique they use which allows keeping these mask values as they are while changing the min/max/average chunk sizes according to your needs.
Looking at algorithm 2 in that paper, which attempts to compute 8 KB chunks, I think the probability of a random chunk having size at most `x` should be as follows (link):
Now if I change that graph to use the minimum/average/maximum chunk sizes that you propose while leaving the bitmasks unaltered, I see this (link):
If I change it so that `MaskS` has 21 bits and `MaskL` has 17, then the graph starts to resemble the original one (link):
So I do think the masks need to be adjusted as well.
Hmm... Making the graphs above was worthwhile. I think it gives good insight into why FastCDC was designed the way it is. They are essentially trying to mimic a normal distribution (link):
Using a literal normal distribution would not be desirable, because it would mean that the probability of a chunk being created at a given point depends not only on the data within the window, but also on the size of the current chunk, which is exactly what CDC tries to prevent. So that's why they emulate it using three partial functions.
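A sketch of what that three-piece decision looks like in code, following the shape of Algorithm 2 from the FastCDC paper, with the sizes from this proposal and the 21/17-bit masks suggested above. The mask constants are illustrative, not the paper's empirically derived ones:

```python
MIN_SIZE = 128 * 1024
NORMAL_SIZE = 512 * 1024
MAX_SIZE = 2 * 1024 * 1024
MASK_S = (1 << 21) - 1   # stricter: boundaries are rarer below NORMAL_SIZE
MASK_L = (1 << 17) - 1   # looser: boundaries are likelier above NORMAL_SIZE

def find_cut_point(fp_at):
    """fp_at[i] is the rolling fingerprint at byte offset i of the current
    chunk; returns the offset to cut at."""
    end = min(len(fp_at), MAX_SIZE)
    for i in range(MIN_SIZE, min(end, NORMAL_SIZE)):
        if (fp_at[i] & MASK_S) == 0:   # hard-to-match mask below normal size
            return i
    for i in range(NORMAL_SIZE, end):
        if (fp_at[i] & MASK_L) == 0:   # easy-to-match mask above normal size
            return i
    return end                         # fall back to max size / end of data
```

The stricter mask before `NORMAL_SIZE` and looser mask after it is what pulls the size distribution toward the normal-like shape in the graphs above without making the cut decision depend on the current chunk size directly.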
Yeah, we should make sure to set `MaskS` and `MaskL` accordingly.
Wow, thanks @EdSchouten for this great analysis. I have to admit I did not look into these mask values in that much detail, because they were not really explained in the paper, but your concerns are absolutely right: we have to set the mask values accordingly when the min/max/average chunk sizes are changed. I am just asking myself, since the paper's authors mentioned they derived the mask values empirically, how should we adapt them?
@luluz66, @sluongng: since you mentioned you were working with the FastCDC algorithm with chunk sizes of 500 KB, I am wondering how you handled the adaptation of the mask values, or whether you did it at all. Can you share your experience here? Thank you so much.
We have our fork of the CDC implementation here: https://github.com/buildbuddy-io/fastcdc-go/blob/47805a2ecd550cb875f1b797a47a1a648a1feed1/fastcdc.go#L162-L172
This work is very much in progress and not final. We hope that the current API will come with configurable knobs so that downstream implementations can choose what's best for their use cases and data.
I think that the authors of the paper just used the following heuristics:
(1) sounds like a good idea, but (2) can likely be ignored. So just pick 21 out of the top 48 bits, and then another 17 bits that are a subset of the former.
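If it helps, that heuristic is straightforward to realize mechanically. The following is purely illustrative (arbitrary fixed seed, not the paper's actual constants): pick 21 of the top 48 bits of a 64-bit fingerprint for `MaskS`, then take 17 of those same bits for `MaskL`, so `MaskL` is a subset of `MaskS`.

```python
import random

rng = random.Random(0)            # fixed seed, just for reproducibility
top48 = list(range(16, 64))       # bit positions 16..63 of a 64-bit fingerprint
s_bits = rng.sample(top48, 21)    # 21 bits for MaskS
l_bits = rng.sample(s_bits, 17)   # 17 of those same bits for MaskL
mask_s = sum(1 << b for b in s_bits)
mask_l = sum(1 << b for b in l_bits)
assert mask_l & mask_s == mask_l  # MaskL is a subset of MaskS by construction
print(bin(mask_s).count("1"), bin(mask_l).count("1"))  # 21 17
```

Making `MaskL` a subset of `MaskS` means every offset that would have matched the strict mask also matches the loose one, so the cut-point decision degrades gracefully past the normal size.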