Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for inlined compressed data (larger refactoring) #203

Conversation

mostynb
Copy link
Contributor

@mostynb mostynb commented Jul 13, 2021

By defining a Blob message that overlaps with Digest, we can repurpose Digest fields to allow data inlining with optional compression in more places. The Digest message is then only used in cases where the receiver wants a reference rather than data.

In this example I have changed most of the Digest fields to Blobs, which may lead to over-inlining of data. Each case needs to be considered separately before this change is ready for consideration.

Refers to #201.

By defining a Blob message that overlaps with Digest, we can repurpose
Digest fields to allow data inlining with optional compression in more
places.  The Digest message is then only used in cases where the
receiver wants a reference rather than data.

In this example I have changed most of the Digest fields to Blobs,
which may lead to over-inlining of data.  Each case needs to be
considered separately before this change is ready for consideration.

Refers to bazelbuild#201.
@google-cla google-cla bot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Jul 13, 2021
@mostynb mostynb marked this pull request as draft July 13, 2021 12:03
// to run, which MUST be present in the
// [ContentAddressableStorage][build.bazel.remote.execution.v2.ContentAddressableStorage].
Digest command_digest = 1;
Blob command = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly object to this change.

With the platform properties being promoted from Command to Action, scheduler processes may nowadays be implemented in such a way that they don't need to load the Command message from the CAS. This effectively reverts that change.

The Command message can get big really quickly, especially in the case of projects where people use (strip_)include_prefix too much.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add to that: is it even likely that inlining Command into Action is advantageous? You only upload Action and Command into the CAS in case of remote execution; not remote caching. Most remote execution services hopefully have their workers located close to storage. Do we really care about reducing latency/roundtrips there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable- this is the "over-inlining" problem I mentioned in the description (this is only intended as an early draft for discussion).

// [Directory][build.bazel.remote.execution.v2.Directory] object
// represented. See [Digest][build.bazel.remote.execution.v2.Digest]
// for information about how to take the digest of a proto message.
Digest digest = 2;
Blob blob = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it permitted to also use this construct inside of Tree objects? If not, should it be stated? If so, any guidelines on canonicalisation?

string hash = 1;
int64 size_bytes = 2; // This refers to the uncompressed size

reserved 3, 4; // Leave some room in case we want to extend Digest later(?)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add more fields to Digest, then clients/servers will already update to a newer version of the .proto. In that case they will already switch from Digest to Blob.

In other words, I don't see how reserving numbers here helps us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be right. If we update Digest later, we could reserve fields in Digest to "jump over" the extra fields in Blob.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we update Digest later, it's completely safe to use any numbers that overlap with Blob.

@mostynb
Copy link
Contributor Author

mostynb commented Jul 16, 2021

I'm going to close this draft PR since there is more interest in #202.

@mostynb mostynb closed this Jul 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Pull requests whose authors are covered by a CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants