forked from apache/iceberg-rust
refactor: add argument to `ManifestWriter` to specify where data is written
#2
Closed
Is there a difference between this new `manifest_path` and what is contained within the `location` for an `OutputFile` already?

In my mind the manifest path "metadata" and where the serialisation actually occurs (e.g. in S3) will always be the same. I.e. we write to `s3://metadata/.../v1.table-metadata.json`, which is both the place where it is serialised to a file and a metadata pointer of "where is this file".

So - yes. As part of the spec that is a requirement; and `location` is both where the data gets serialized to and is included in the returned `ManifestFile`.

The problem here is that `iceberg-rust` enforces that by passing in the `OutputFile` to a new `ManifestWriter`, which the writer then puts in the `ManifestFile` returned by `writer.write(manifest)`. This is the coupling of serialization and building. Because of this, the user cannot build a metadata object (create the Rust type of that object) with `location = /tmp/foo.avro` and serialize it (write that Rust type as bytes) to `/tmp/bar.avro`. The `iceberg-rust` crate handles that for you. So if I pass in an `OutputFile` with `location` `/tmp/foo.avro` to a `ManifestWriter`, that's where the `ManifestFile` is being serialized to whether I like it or not.

This is why we can't use `ObjectStore` at the moment. If we were to build/serialize the bytes on a node `A`, then write those bytes to node `B` and query them from node `B`, the queries will fail because the metadata that has been written to node `B` will point to metadata on node `A`, since that's where the building/serialization took place. And those take place together because the `iceberg-rust` crate has coupled them together.

IMO, building and serialization need to be made completely separate in the `iceberg-rust` crate. As an example, for `ManifestFile` you would ideally have some method, say `ManifestFile::to_bytes(self) -> Bytes`, that consumes the `ManifestFile` and serializes it. Where those bytes go is entirely up to the user, _but_ the user is required to uphold the invariant that where those bytes get written to matches the `manifest_path` for the `ManifestFile` that got serialized. However, the maintainers seem keen on preserving that coupling, but want to change the `FileIO` type to be a trait so that users can provide their own object_store/storage implementations.

Part of me wonders if we should try to convince them to separate things out better; but I'm not sure how worthwhile that is given that a `FileIO` trait would allow us to implement that trait for `Arc<dyn ObjectStore>`, which should meet the design goal of using `ObjectStore`.

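To make the separation concrete, here is a rough, dependency-free sketch of what decoupled building and serialization could look like. None of this is the actual `iceberg-rust` API: the `ManifestFile` below is a one-field stand-in, `to_bytes` is the hypothetical method mentioned above (returning `Vec<u8>` rather than `Bytes`, and using a plain string encoding rather than Avro), and `Storage`, `InMemoryStore` and `write_manifest` are made-up names standing in for a `FileIO` trait or an `ObjectStore`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `ManifestFile`; only the field relevant to
/// this discussion is modelled.
#[derive(Debug, Clone, PartialEq)]
struct ManifestFile {
    manifest_path: String,
}

impl ManifestFile {
    /// Hypothetical: consume the value and serialize it. Real manifests are
    /// Avro; a plain string encoding keeps the sketch self-contained.
    fn to_bytes(self) -> Vec<u8> {
        format!("manifest_path={}", self.manifest_path).into_bytes()
    }
}

/// Hypothetical storage abstraction (assumption: something like a `FileIO`
/// trait or `ObjectStore` would play this role).
trait Storage {
    fn put(&mut self, location: &str, bytes: Vec<u8>);
    fn get(&self, location: &str) -> Option<&[u8]>;
}

/// In-memory store used only to make the sketch runnable.
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl Storage for InMemoryStore {
    fn put(&mut self, location: &str, bytes: Vec<u8>) {
        self.objects.insert(location.to_string(), bytes);
    }

    fn get(&self, location: &str) -> Option<&[u8]> {
        self.objects.get(location).map(|b| b.as_slice())
    }
}

/// Writes the serialized manifest to the location it records, so the
/// "bytes live at `manifest_path`" invariant holds by construction.
fn write_manifest(store: &mut impl Storage, manifest: ManifestFile) -> String {
    let location = manifest.manifest_path.clone();
    store.put(&location, manifest.to_bytes());
    location
}

fn main() {
    // Build the metadata object on one node...
    let manifest = ManifestFile {
        manifest_path: "/tmp/foo.avro".to_string(),
    };

    // ...and write its bytes to whichever store the caller chooses.
    let mut node_b = InMemoryStore::default();
    let location = write_manifest(&mut node_b, manifest);

    assert!(node_b.get(&location).is_some());
    println!("wrote manifest to {location}");
}
```

The point of a helper like `write_manifest` is that the caller picks the store, while the path recorded in the metadata and the path written to cannot drift apart. A user calling `to_bytes` directly would have to uphold that invariant themselves.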
Note: This PR removes that enforcement - hence why I think having a discussion about it first on this PR is important.