V3 idea: Drop support for offset reads & writes. Decompose large CAS objects into shallow Merkle trees #178
Comments
You can do that today in the server, as long as you can support read and write offsets. I'm not sure about exposing this complexity to the client - it's not required in most cases, but it makes every client significantly more complex.
But there are cases in which it is desired, and stating that it isn't required in most cases is not helping. Any protocol out there contains features that aren't needed in most cases, yet they aren't removed because people depend on them. If we removed all features from HTTP that aren't needed in most cases, we'd have Gopher. I want to make use of a feature similar to Builds without the Bytes, except that I want a FUSE file system on the client side that still allows me to access any output files on demand. Some of these output files may be large. Right now reads against these files will hang until they have been downloaded entirely over the Bytestream protocol, which makes ad hoc exploration hard, because there is no way to do partial file validation.
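To make the partial-validation point concrete, here is a minimal Go sketch of how a FUSE-style read at an arbitrary offset could be served if large files were stored as a manifest of fixed-size chunk digests. The names (`Manifest`, `FetchChunk`, `ReadAt`) are hypothetical and not part of any existing API; the point is that only the chunks overlapping the requested range need to be fetched and verified, so a read never waits for the whole file.

```go
// Hypothetical sketch: serving a partial read of a large CAS object that has
// been decomposed into fixed-size chunks listed in a manifest.
package chunkread

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Manifest is an assumed structure: the chunk size used to decompose the
// blob, plus the ordered list of chunk digests.
type Manifest struct {
	ChunkSize    int64
	ChunkDigests []string // hex-encoded SHA-256 of each chunk, in order
}

// FetchChunk stands in for a CAS read (e.g. a ByteStream Read of one chunk).
type FetchChunk func(digest string) ([]byte, error)

// ReadAt returns bytes [off, off+length) of the logical blob, fetching and
// verifying only the chunks that overlap the requested range.
func ReadAt(m Manifest, fetch FetchChunk, off, length int64) ([]byte, error) {
	first := off / m.ChunkSize
	last := (off + length - 1) / m.ChunkSize
	var buf []byte
	for i := first; i <= last; i++ {
		data, err := fetch(m.ChunkDigests[i])
		if err != nil {
			return nil, err
		}
		// Per-chunk integrity check: no need to hash the whole file.
		sum := sha256.Sum256(data)
		if hex.EncodeToString(sum[:]) != m.ChunkDigests[i] {
			return nil, fmt.Errorf("chunk %d failed verification", i)
		}
		buf = append(buf, data...)
	}
	start := off - first*m.ChunkSize
	return buf[start : start+length], nil
}
```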
I have collected some initial data to check whether the deduplication caused by chunking would help reduce the size of the data. I focused on data residing in the cache; checking in-flight data may be a next step. Basic results:
The stats below show the contents of a bazel-remote cache CAS directory when entries are chunked, only unique chunks are preserved, and all chunks are compressed using zstd. It would be nice to get some additional results, since these may not be representative for others. The tool itself, with more details, is here: https://github.com/glukasiknuro/bazel-cache-chunking CC @mostynb
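For illustration only (this is not the linked tool, and the fixed chunk size is an assumption), a minimal Go sketch of the measurement idea: split every file in a CAS directory into fixed-size chunks, keep only unique chunks, and zstd-compress them to estimate the combined savings.

```go
// Minimal sketch of the measurement: chunk all files under a CAS directory,
// deduplicate identical chunks by SHA-256, and zstd-compress unique chunks.
package main

import (
	"crypto/sha256"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"

	"github.com/klauspost/compress/zstd"
)

const chunkSize = 256 * 1024 // assumed chunk size; the real study may differ

func main() {
	root := os.Args[1]
	enc, _ := zstd.NewWriter(nil) // encoder used only via EncodeAll
	seen := map[[32]byte]bool{}
	var raw, unique, compressed int64

	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for off := 0; off < len(data); off += chunkSize {
			end := off + chunkSize
			if end > len(data) {
				end = len(data)
			}
			chunk := data[off:end]
			raw += int64(len(chunk))
			sum := sha256.Sum256(chunk)
			if seen[sum] {
				continue // identical chunk already counted once
			}
			seen[sum] = true
			unique += int64(len(chunk))
			compressed += int64(len(enc.EncodeAll(chunk, nil)))
		}
		return nil
	})
	fmt.Printf("raw=%d unique=%d unique+zstd=%d\n", raw, unique, compressed)
}
```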
Nice! Thanks for doing these measurements. It is important to mention that deduplication of identical chunks is not one of the main reasons I created this proposal. It is all about ensuring that large blobs simply don't exist. Large blobs are annoying to work with, especially if you want partial access, resumption of transfers, compression, etc., while still having guaranteed data integrity.
IIRC @Erikma and @johnterickson reported that transferring large files in parallel using chunks was significantly faster with Azure Blob Storage than a single upload/download stream (and I imagine S3 and GCS would be similar).
Also, from a somewhat related item on the Bazel side: remote/performance: support rsync compression
Those results may be from before Builds without the Bytes was introduced, though.
For GCS we are using composite uploads and parallel sliced downloads, which greatly improve speed for gigabyte-sized files (an article mentioning this: https://jbrojbrojbro.medium.com/slice-up-your-life-large-file-download-optimization-50ee623b708c).
Regarding breaking down a large blob into multiple smaller chunks: here is some related art that cloud vendors apply to distribute container images:
A quick summary:
Out of the 3 points above, I think [1] is what is proposed in this issue. [2] and [3] could be client/server-specific implementations, especially around a FUSE client, but it would be nice to have P2P considerations for V3?
It's worth noting that the large-blob performance issue is quite real: rules_docker used to have to disable remote caching for a bunch of intermediate artifacts during a container image build (bazelbuild/rules_docker#2043). If this data could be broken down into reusable chunks/blocks, the cache hit rate should be much higher, which would eliminate the need to disable the remote cache for intermediate artifacts.
Returning to this from the V3 doc: another idea is to acknowledge the existence of bigger blobs and set a size boundary above which a blob gets special treatment (i.e. up to what size the server will accept a blob directly; all files bigger than that need to go through special treatment). Similar to how Yandex's Arc VCS does it (https://habr.com/ru/companies/yandex/articles/482926/), we could introduce a metadata type
To add some thoughts to this discussion from the perspective of justbuild development: we like the idea of avoiding large blobs in the remote-execution CAS, for the several reasons that have already been discussed. In bazelbuild/remote-apis#282, we proposed a content-defined blob-splitting scheme to save network traffic and, in the future, also storage. Currently, we are happy with verifying blobs by reading through their entire data, which is why we focused on traffic and storage reduction rather than on optimized verification of split large blobs. If there were a solution to model the Merkle tree not over fixed sizes but over content-defined boundaries, this would not only avoid large blobs but also save traffic and storage. This requires a fixed content-defined splitting algorithm with all randomness being known, e.g., in terms of the seed and the random number generator. We don't yet have a complete solution, but we wanted to put this consideration into the discussion, not only for this issue but also for REv3 development, since we are highly interested in having such network-traffic and storage-reduction concepts available in REv3.
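As an illustration of the "all randomness being known" requirement, here is a small Go sketch of a deterministic content-defined chunker: the gear table driving the rolling hash is derived from a fixed seed, so any client or server sharing that seed splits a blob identically. This is a generic gear-hash CDC sketch under assumed parameters, not the scheme from bazelbuild/remote-apis#282.

```go
// Sketch of a deterministic content-defined chunker: boundaries are chosen
// from the data itself via a rolling "gear" hash whose lookup table is
// derived from a fixed seed, so both sides split a blob identically.
package cdc

const (
	minChunk = 64 * 1024
	maxChunk = 1024 * 1024
	mask     = (1 << 18) - 1 // targets roughly 256 KiB average chunks
)

// gearTable is fully determined by a fixed seed, so every implementation
// that shares the seed produces identical chunk boundaries.
var gearTable = buildGearTable(0x2b5ec0de)

func buildGearTable(seed uint64) [256]uint64 {
	var t [256]uint64
	state := seed
	for i := range t {
		// Deterministic splitmix64-style generator; the exact choice only
		// matters insofar as both sides agree on it.
		state += 0x9e3779b97f4a7c15
		z := state
		z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
		z = (z ^ (z >> 27)) * 0x94d049bb133111eb
		t[i] = z ^ (z >> 31)
	}
	return t
}

// Split returns the chunks of data; equal regions across different blobs
// tend to produce equal chunks, enabling deduplication in the CAS.
func Split(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := cutPoint(data)
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func cutPoint(data []byte) int {
	if len(data) <= minChunk {
		return len(data)
	}
	end := len(data)
	if end > maxChunk {
		end = maxChunk
	}
	var h uint64
	for i := minChunk; i < end; i++ {
		h = (h << 1) + gearTable[data[i]]
		if h&mask == 0 {
			return i + 1
		}
	}
	return end
}
```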
The size of objects stored in the CAS has a pretty large spread. We can have tiny Directory objects that are less than 100 bytes in size, but we can also have output files that are gigabytes in size. Dealing with these large files is annoying:
But if you take a look at the amount of data read in bytes, again a 1-hour average, it's a lot less balanced. We sometimes see that the busiest shard receives 50% more traffic than the least busy one. And that's measured across a full hour.
My suggestion is that REv3 drops support for offset reads & writes entirely. Instead, we should model large files as shallow (1-level deep) Merkle trees of chunks with some fixed size. Instead of sending a single Bytestream request to download such a large file, a client must (or may?) first read a manifest object from the CAS containing a list of digests of chunks whose contents need to be concatenated. When uploading a file, FindMissingBlobs() should be called against the digest of the manifest object and each of the chunk digests. This allows a client to skip uploading parts that are already in the CAS. This both speeds up resumption of partially uploaded files and adds (a rough version of) deduplication of identical regions across multiple files. Because large objects don't occur very frequently (read: almost all objects tend to be small), this indirection doesn't really impact performance for most workloads. Compression can be added by simply compressing individual chunks.
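A rough Go sketch of that upload flow, assuming a hypothetical manifest format and a simplified CAS client interface (neither `CASClient` nor `Manifest`-as-concatenated-digests is part of any existing API): the file is cut into fixed-size chunks, a FindMissingBlobs-style query determines which chunks (and whether the manifest) are missing, and only those get uploaded.

```go
// Hypothetical sketch of the proposed upload flow for a large blob.
package upload

import (
	"crypto/sha256"
	"encoding/hex"
)

const chunkSize = 2 * 1024 * 1024 // assumed fixed decomposition size

type Digest struct {
	Hash string
	Size int64
}

// CASClient is a stand-in for the remote-execution CAS interface.
type CASClient interface {
	FindMissingBlobs(digests []Digest) ([]Digest, error)
	Upload(d Digest, data []byte) error
}

// UploadLargeBlob splits data into chunks, uploads only the chunks the
// server reports as missing, and finally uploads the manifest itself.
// Callers would then reference the manifest digest instead of the digest
// of the raw file.
func UploadLargeBlob(cas CASClient, data []byte) (Digest, error) {
	var chunkDigests []Digest
	chunks := map[Digest][]byte{}
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		d := digestOf(data[off:end])
		chunkDigests = append(chunkDigests, d)
		chunks[d] = data[off:end]
	}

	// Illustrative manifest encoding: the concatenated chunk digests.
	var manifest []byte
	for _, d := range chunkDigests {
		manifest = append(manifest, []byte(d.Hash)...)
	}
	manifestDigest := digestOf(manifest)

	missing, err := cas.FindMissingBlobs(append(chunkDigests, manifestDigest))
	if err != nil {
		return Digest{}, err
	}
	for _, d := range missing {
		payload := chunks[d]
		if d == manifestDigest {
			payload = manifest
		}
		if err := cas.Upload(d, payload); err != nil {
			return Digest{}, err
		}
	}
	return manifestDigest, nil
}

func digestOf(b []byte) Digest {
	sum := sha256.Sum256(b)
	return Digest{Hash: hex.EncodeToString(sum[:]), Size: int64(len(b))}
}
```

Resuming an interrupted upload then falls out naturally: rerunning UploadLargeBlob only transfers the chunks FindMissingBlobs still reports as absent.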
Earlier this year I experimented with this concept for Buildbarn (ADR#3), where the decomposition size can be any power of 2 that is 2 KiB or larger. It used a slightly simplified version of BLAKE3 to hash objects. Unfortunately, I never fully completed it or merged it into master. It also didn't take the desire to do compression into account. I think that both decomposition and compression should be considered, and they likely cannot be discussed separately. My hope is that for REv3 we find the time to solve this properly.