-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remote_exec.proto: support ZSTD with dictionary #276
base: main
Are you sure you want to change the base?
Conversation
0831f11
to
7924245
Compare
This PR is/will be a bit stale because I am a bit occupied with other tasks for the time being. I could revisit this if there are more demands for such capability, or if somebody else wants to pick up the flag, please feel free to. |
Since we might have some use case for this via bazelbuild/bazel#18997, I have pushed a new version of this PR according to review feedbacks. As for @EdSchouten suggestion about enhancing |
Why can't we just do it right as part of this PR? What's preventing us from doing that? If we don't add full negotiation to service ContentAddressableStorage {
rpc GetCompressionDictionaries(GetCompressionDictionariesRequest)
returns (GetCompressionDictionariesResponse);
}
message GetCompressionDictionariesRequest {
string instance_name = 1;
DigestFunction.Value digest_function = 2;
Compressor.Value compressor = 3;
}
message GetCompressionDictionariesResponse {
message Dictionary {
uint32 id = 1;
Digest digest = 2;
}
// The dictionary ID that clients SHOULD use when uploading
// objects into the Content Addressable Storage. An entry for
// this dictionary ID MUST be present in `dictionaries`.
uint32 default_dictionary_id = 1;
// Dictionaries that may be used by the server when returning
// objects stored in the Content Addressable Storage. The key
// corresponds to a dictionary ID, as described in RFC 8878,
// section 5, while the value refers to a dictionary
// as described in RFC 8878, chapter 6.
repeated Dictionary dictionaries = 2;
} |
Initially, I thought it could make it harder to get this merge. But I looked at it and found it to be less disruptive than I initially thought. So I pushed a commit that does just that, the language could use some help though 🙏 .
I think it would still require us to pass a minimum boolean value in So instead of sending the boolean value in ServerCapabilties, let's just send the digest and use the existing ByteStream.Read RPC instead. The caveat is ofc, the server would have to store that CAS value somewhere, which could potentially limit how fast the server could swap out the dictionaries. Realistically, I don't expect the dictionary list to change too frequently for most use cases as it will change the delta calculation, leading to more frequent cache invalidation. Lastly, I think we could reserve the dedicated RPC for when the dictionary negotiation gets more complicated. But for now, this would do. |
No, it doesn't. The presence of ZSTD_DICT can be used by the client to determine whether that additional RPC is supported. |
Discussed with @EdSchouten offline. As some server/client/proxy are caching the Sent a change to move the Dictionary fetching to a dedicated RPC in hope that this change has minimal impact on existing implementations. I still want to make this rpc ZSTD specific, but if there is a strong opinion on making it generic for future compressors, suggestions are welcome. |
Seems to me that compression dictionaries are simple enough where they wouldn't vary much across compressor types, so if it were compressor-specific then there would eventually wind up being a bunch of different RPCs with the same Request/Response signature? (I don't feel super strongly either way, just something to consider) |
Co-authored-by: Brandon Duffany <[email protected]>
I see your point. The difficult part, for me, is not being able to plan a head for what the other compressors' dictionary support is going to look like. Without investing too much time into Brotli compression, I could only make guesses. For example, I tried to look into Brotli shared dictionary and I don't think there is a dictionary ID attached. Quickly reading through Chrome Documentation and Cloudflare Blog, I think the dictionary is referred to by a combination of the media type that it's used for, and it's CAS SHA256 ID. If that's indeed the case, then I think we could make the RPC generic with something like @EdSchouten suggested, with slight modification.
That way |
A second related problem to this PR I have been thinking about is how to enable dictionary selection on client-initiated uploads. For example: let's say I want to do a Initially, I wanted to modify the dictionary message to include metadata like this message ZstdDictionary {
uint32 id = 1;
Digest digest = 2;
+ repeated string prefer_for = 3;
} where But now I am thinking that we should make a standalone map for this message GetCompressionDictionariesResponse {
message ZstdDictionary {
uint32 id = 1;
Digest digest = 2;
}
uint32 default_zstd_dictionary_id = 1;
repeated ZstdDictionary zstd_dictionaries = 2;
+ message DictionaryHint {
+ string hint = 1;
+ Digest dictionary_digest = 2;
+ }
+ repeated DictionaryHint dictionary_hints = 3;
} for faster lookup 🤔. |
|
||
message GetZstdDictionariesResponse { | ||
message Dictionary { | ||
uint32 id = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we could make dictionaries reasonably generic by replacing this uint32 with an Any. That wouldn't account for session-specific dictionaries, but it should allow for ~any static dictionary. The unpacking of the Any, and understanding what blobs it applies to, is very much compressor-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an "id" specific field, I would say we should use string
instead of Any
at the minimum. Any
delegates the responsibility down to implementations to validate the data format which could be a pain to work with.
If generic is a requirement here, I think something like this could be a good alternative
message GetDictionariesResponse {
message DefaultDictionaryId {
oneof id {
uint32 zstd_dictionary_id = 1;
}
}
// The dictionary ID that clients SHOULD use when uploading
// objects into the Content Addressable Storage. An entry for
// this dictionary ID MUST be present in `dictionaries`.
DefaultDictionaryId default_dictionary_id = 1;
message ZstdDictionary {
uint32 id = 1;
Digest digest = 2;
}
message ZstdDictionaries {
repeated ZstdDictionaries zstd_dictionaries = 1;
}
message Dictionaries {
oneof dictionaries {
ZstdDictionaries zstd_dictionaries = 1;
}
}
// Dictionaries that may be used by the server when returning
// objects stored in the Content Addressable Storage. The key
// corresponds to a dictionary ID, as described in RFC 8878,
// section 5, while the value refers to a dictionary
// as described in RFC 8878, chapter 6.
Dictionaries dictionaries = 2;
}
This would enable future Dictionary types (if there were any) could be added incrementally. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also be fine with a string, but given the conversation yesterday regarding people wanting more structured types to avoid the need to shove json into string fields, thought that Any might be appropriate. Using a string for an ID field would be more consistent with the rest of the API. In any case, I don't think that using string vs Any absolves the client from the need to validate the input. In this case of zstd, for example, the client will need to parse the string into an int ID to match with the ID encoded in the compressed blob.
At some level what I'm trying to avoid is needing to modify this API with each dictionary-compression function that we add, which means trying to remove anything zstd-specific. It's not realistic to design an API that will work in all cases without investing significantly in understanding, but there are only so many realistic ways to get dictionary-based compression to work, and I have to believe that "encode a dictionary ID in the compressed blob" is reasonably common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an "id" specific field, I would say we should use string instead of Any at the minimum. Any delegates the responsibility down to implementations to validate the data format which could be a pain to work with.
But doesn't a string also require that an implementation validates the data format? There are also strings that are not valid integers.
Closes #272