Compression support for inlined data (v2-compatible) #201
Heya Mostyn,
We'd be interested in compressed inlined blobs too, yep! We have
encountered some builds where it's relevant ourselves, with the batch APIs
moving a non-trivial percent of the total bytes. (Builds particularly heavy
in small files).
For implementation details, what's the benefit of introducing a Blob
message to wrap these fields vs just adding compressor where necessary?
Specifically, I could see three main variants:
1. Just stick a `compressor` field beside `bytes` where relevant.
(BatchUpdateBlobsRequest, BatchReadBlobsResponse, OutputFile, ActionResult
x2 for std*_raw).
2. Use `Blob` to wrap `bytes` and `compressor`, but *not* the Digest
fields.
3. Encapsulate Blob as a replacement for Digest, as you've described.
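To make variant (1) concrete, here is a rough sketch of what one of the batch messages might look like with a compressor field beside the existing bytes field — field names and numbers below are placeholders for illustration, not a settled proposal:

```proto
// Sketch of option (1): a compressor field sits directly beside each
// inlined bytes field. The new field number is hypothetical.
message BatchReadBlobsResponse {
  message Response {
    Digest digest = 1;
    bytes data = 2;
    google.rpc.Status status = 3;
    // New: how `data` is encoded. IDENTITY (the enum default) means
    // uncompressed, so messages from old senders keep their meaning.
    Compressor.Value compressor = 4;
  }
  repeated Response responses = 1;
}
```

Because the enum default is the identity encoding, parties that never set the field continue to interoperate unchanged.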
Looking at the API as it currently stands, I think I'd be in favour of (1) for
V2 - that way we don't introduce any deprecated fields or branch points,
and "clean" support simply looks like supporting the compressor value for
each bytes field, even if the opposite party will never set it to anything
non-passthrough. The use of Capabilities + request fields should I think
make it easy to do backwards compatibility safely (so no `bytes` fields
risk being misinterpreted).
(2) or (3) seems nicest for V3, so the API can clearly express that anywhere
inlined bytes are present, they might be compressed. I guess (3) has the
benefit of expressing an "optionally inlined blob" in a single field,
rather than needing two fields for Digest + optional Blob? I could get behind
that, though I think then I'd want to see Digest nested *inside* Blob instead
of `hash` + `size_bytes`, so there are no places in the API where a raw `hash`
field may show up except the Digest message itself.
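As a hypothetical sketch of that nesting idea (names and numbers illustrative only, not part of any proposal), a V3 shape could look like:

```proto
// Sketch of the nested variant: Blob carries a Digest rather than
// duplicating hash/size_bytes, so a raw `hash` field only ever
// appears inside the Digest message itself.
message Blob {
  Digest digest = 1;                // Reference to the content.
  bytes data = 2;                   // Optional inlined (possibly compressed) data.
  Compressor.Value compressor = 3;  // Encoding of `data`, if set.
}
```

The trade-off versus the field-number-overlap approach is that this shape is not wire-compatible with a bare Digest, so it only works where the API can change freely.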
On Mon, Jul 12, 2021 at 3:50 PM Mostyn Bramley-Moore wrote:
We added support for compressed blobs via the bytestream API a while back
(#147), and this has been working well so far. Given that we don't seem to be
gearing up for a non-backwards compatible REAPIv3 release, I think we
should start investigating adding backwards-compatible support for
compression of inlined data, aiming to minimize roundtrips while still
using compressed data.
One idea that I have been thinking about is to create a new Blob message
which has the same fields as Digest, but with two additional optional
fields: bytes data to hold the inlined data and Compressor.Value
compressor to describe the encoding. Then, we would replace each usage of
Digest in a message that currently also has a separate field for inlined
(uncompressed) data with a Blob field, and deprecate the previous data
field.
For example:
message Blob {
  // The first two field types + numbers must match those in Digest.
  string hash = 1;
  int64 size_bytes = 2; // This refers to the uncompressed size.
  reserved 3, 4; // Leave some room in case we want to extend Digest later(?)
  bytes data = 5; // Possibly compressed data, if set.
  Compressor.Value compressor = 6; // Encoding used, if the data field is set.
}
For "request" messages that can be responded to with inlined data, we
would add a repeated field that specifies the compression types that the
client accepts. The server would pick one of those encodings (or
identity/uncompressed) for each Blob that it chooses to inline data for. We
would also add a new field in the Capabilities API for servers to advertise
support for this.
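As a rough sketch of that negotiation — every new field name and number here is a placeholder, not a concrete proposal:

```proto
// The client lists the encodings it can decode for inlined data.
message BatchReadBlobsRequest {
  string instance_name = 1;
  repeated Digest digests = 2;
  // New: the server may return each inlined blob in any of these
  // encodings, or fall back to identity (uncompressed) if the list
  // is empty or nothing in it is supported.
  repeated Compressor.Value acceptable_compressors = 3;
}

// The server advertises support via the Capabilities API.
message CacheCapabilities {
  // ... existing fields elided ...
  // New: encodings the server may use for inlined data. The field
  // number is a placeholder.
  repeated Compressor.Value supported_inlined_compressors = 10;
}
```

An empty list on either side degrades gracefully to today's uncompressed behaviour, which is what makes the scheme backwards-compatible.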
I believe this would be backwards-compatible: old clients would not
request compressed inlined data, and would receive Blobs that are
decodable as Digests in response, and servers would have a clear signal
for when this feature can be used in responses.
Is anyone else interested in this feature?
I have mostly been considering this from the "cleaner v3 design, retrofitted to v2" angle (3), rather than the "minimal required changes to v2" point of view (1). (2) doesn't sound like a great compromise to me, but I'd need to read up more on protobuf v3 to understand the implications well. There are benefits to each of (1) and (3), of course, and I'd be happy to sketch out solutions to both, or either one if there's a clear preference in the group.
For the path to adding it to V2, I think (1) stands to be cleaner and not
lean on reinterpreting messages as other messages, so I'd be interested in
getting your take on it.
I think that can be independent of what exact pattern we'd pick for V3, so
I'd suggest treating them as independent decisions. I don't feel terribly
strongly about what pattern V3 *should* use, beyond being skeptical of
inlining Digest fields into anything :).
No need to evaluate (2) if nothing points to it being best for either.
Thanks Mostyn!
We'd be interested in this too; we definitely observe a lot of builds with plenty of small but compressible artifacts (Go tends to be pretty heavy on this). Agreed that both (1) and (3) have benefits; worth mentioning (2) for completeness, but I agree that it's unlikely to be appealing as a solution. I think (1) is my favourite; (3) is a little weird from the perspective of the generated code, given that you have a thing that is like a Digest but is not actually one (so it'd be fine in languages like Python but not in, say, Go or Java). I like the idea in general of encapsulating all blobs/digests with something like this Blob message as a V3 change; that seems more elegant than lots of individual sites re-inventing the same thing. But that doesn't seem practical for V2, and the embedded Request/Response messages seem like reasonable things to extend with the compressor field.
Created a couple of draft PRs to gather opinions:
- This is a small API change which allows for inlined data to be compressed. Refers to bazelbuild#201.
- By defining a Blob message that overlaps with Digest, we can repurpose Digest fields to allow data inlining with optional compression in more places. The Digest message is then only used in cases where the receiver wants a reference rather than the data itself. In this example I have changed most of the Digest fields to Blobs, which may lead to over-inlining of data. Each case needs to be considered separately before this change is ready for consideration. Refers to bazelbuild#201.
This is a small API change which allows inlined data to be sent in compressed form in BatchReadBlobs and BatchUpdateBlobs calls. Refers to bazelbuild#201.
* Add support for inlined compressed data in batch CAS operations. This is a small API change which allows inlined data to be sent in compressed form in BatchReadBlobs and BatchUpdateBlobs calls. Refers to #201.
* Remove some stray parentheses