-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: add argument to ManifestWriter
to specify where data is written
#2
Conversation
This commit changes a couple function signatures that effectively allow the user to differentiate where metadata bytes get written to and what's contained in the metadata type. For more info on why see: <apache#778>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the unblock, one question
@@ -133,9 +133,9 @@ pub struct ManifestWriter { | |||
|
|||
impl ManifestWriter { | |||
/// Create a new manifest writer. | |||
pub fn new(output: OutputFile, snapshot_id: i64, key_metadata: Vec<u8>) -> Self { | |||
pub fn new(manifest_path: String, snapshot_id: i64, key_metadata: Vec<u8>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between this new manifest_path
and what is contained within the location
for an OutputFile
already?
In my mind the manifest path "metadata" and where the serialisation actually occurs (e.g. in S3) will always be the same. I.e. we write to s3://metadata/.../v1.table-metadata.json
, which is both the place where it is serialised to a file and a metadata pointer of "where is this file"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind the manifest path "metadata" and where the serialisation actually occurs (e.g. in S3) will always be the same. I.e. we write to s3://metadata/.../v1.table-metadata.json, which is both the place where it is serialised to a file and a metadata pointer of "where is this file"
So - yes. As part of the spec that is a requirement; and location
is both where the data gets serialized to and is included in the returned ManifestFile
.
The problem here is that iceberg-rust
enforces that by passing in the OutputFile
to a new ManifestWriter
; which the writer then puts in the ManifestFile
returned by writer.write(manifest)
this is the coupling of serialization and building. Because of this, the user cannot build a metadata object (create the Rust type of that object) with location = /tmp/foo.avro
and serialize it (write that Rust type as bytes) to /tmp/bar.avro
. The iceberg-rust
crate handles that for you. So if I pass in an OutputFile
with location
/tmp/foo.avro
to a ManifestWriter
that's where the ManifestFile
is being serialized to whether I like it or not.
This is why we can't use ObjectStore
at the moment. If we were to build/serialize the bytes on a node A
and then write those bytes to node B
, and query them from node B
; the queries will fail because the metadata that has been written to node B
will point to metadata on node A
. Because that's where the building/serialization took place. And those take place together because the iceberg-rust
crate has coupled them together.
IMO, Building/serialization needs to be be made completely separate in the iceberg-rust
crate. As an example, for ManifestFile
you would ideally have some method, say ManifestFile::to_bytes(self) -> Bytes
that consumes the ManifestFile
and serializes it. Where those bytes go is entirely up to the user, _but the user is required to uphold the invariant that where those bytes get written to match the manifest_path
for the ManifestFile
that got serialized. However the maintainers seem keen on preserving that coupling; but want to change the FileIO
type to be a trait so that users can provide their own object_store/storage implementations.
Part of me wonders if we should try to convince them to separate things out better; but I'm not sure how worthy that is given that a FileIO
trait would allow us to implement that trait for Arc<dyn ObjectStore>
, which should meet the design goal of using ObjectStore
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that iceberg-rust enforces that by passing in the OutputFile to a new ManifestWriter
Note: This PR removes that enforcement - hence why I think having a discussion about it first on this PR is important.
After a bit of conversing with @jdockerty - we've decided that this is not behavior we want to implement. The rationale is that:
|
Part of #13046
I kept the changes minimal because we would really like to minimize drift from the upstream iceberg crate. A ticket has been filed upstream so that hopefully the issue can be addressed properly. This is just a temporary hack unfortunately 😞