refine: refine interface of ManifestWriter #738

Open
wants to merge 1 commit into
base: main

Contributor

@ZENOTME ZENOTME commented Nov 28, 2024

This PR refines the write interface of ManifestWriter, following the ManifestWriter in pyiceberg. It adds three methods, add, delete, and existing, each of which rewrites some metadata of the manifest entry, e.g. snapshot id, sequence number, and file sequence number.

These refined interfaces will benefit MergeAppend (I'm working on it now).

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @ZENOTME for this PR!

/// Write a manifest.
pub async fn write(mut self, manifest: Manifest) -> Result<ManifestFile> {
/// Add a new manifest entry.
pub fn add(&mut self, mut entry: ManifestEntry) -> Result<()> {
Contributor

It's a bit odd to mutate the argument inside the method. How about taking a DataFile as the argument instead?

Contributor

The same applies to the other APIs.

Contributor Author

I think the reason to use ManifestEntry here is that in some cases we add an entry taken from another manifest, and we need some information from the original ManifestEntry. For example, when we add a delete manifest entry, we change the snapshot id but keep the original sequence number.

/// Add a delete manifest entry.
    pub fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> {
        entry.status = ManifestStatus::Deleted;
        entry.snapshot_id = Some(self.snapshot_id);
        self.add_entry(entry)?;
        Ok(())
    }
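
To make the field-by-field behaviour under discussion concrete, here is a minimal sketch of how the three methods might treat an incoming entry. It is not the PR's code: the types are simplified, hypothetical stand-ins, errors are omitted, and the exact handling in `add` and `existing` (clearing the sequence number so it can be inherited, keeping the original snapshot id) is an assumption made for illustration. Only the `delete` behaviour is taken from the snippet above.

```rust
// Hypothetical, simplified stand-ins for the real iceberg-rust types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestStatus {
    Existing,
    Added,
    Deleted,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestEntry {
    pub status: ManifestStatus,
    pub snapshot_id: Option<i64>,
    pub sequence_number: Option<i64>,
    pub file_path: String,
}

pub struct ManifestWriter {
    snapshot_id: i64,
    entries: Vec<ManifestEntry>,
}

impl ManifestWriter {
    pub fn new(snapshot_id: i64) -> Self {
        Self { snapshot_id, entries: Vec::new() }
    }

    /// Overwrites: status, snapshot id. Clears: sequence number
    /// (assumption: it is left unset so it can be inherited later).
    pub fn add(&mut self, mut entry: ManifestEntry) {
        entry.status = ManifestStatus::Added;
        entry.snapshot_id = Some(self.snapshot_id);
        entry.sequence_number = None;
        self.entries.push(entry);
    }

    /// Overwrites: status, snapshot id. Keeps: sequence number.
    pub fn delete(&mut self, mut entry: ManifestEntry) {
        entry.status = ManifestStatus::Deleted;
        entry.snapshot_id = Some(self.snapshot_id);
        self.entries.push(entry);
    }

    /// Overwrites: status only (assumption). Keeps: snapshot id, sequence number.
    pub fn existing(&mut self, mut entry: ManifestEntry) {
        entry.status = ManifestStatus::Existing;
        self.entries.push(entry);
    }

    pub fn entries(&self) -> &[ManifestEntry] {
        &self.entries
    }
}
```

Writing the "overwrites / keeps" contract into each doc comment is one way to address the concern that callers cannot tell which parts of a ManifestEntry are honoured.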

Contributor

I'm not convinced. If we ask the user to provide a ManifestEntry, it will be unclear to them which fields are used and which are not. I think the Java style is clearer from a user's point of view. If we go with the ManifestEntry approach, we must clearly document how each field is treated, e.g. which fields are ignored and which are kept.

Contributor Author

If we go with the ManifestEntry approach, we must clearly document how each field is treated, e.g. which fields are ignored and which are kept.

I agree. In that case, I think these functions can be pub(crate) so that external users cannot call them. For now I don't see any demand for users to call this API directly. 🤔

}

/// Write manifest file and return it.
pub async fn to_manifest_file(mut self, metadata: ManifestMetadata) -> Result<ManifestFile> {
Contributor

I have concerns about this API, since it's error-prone. According to the Iceberg spec, each manifest file should contain only one type of file: data or deletes. It's quite possible for the user to pass entries of different kinds to the previous methods, so that the entries no longer match the metadata passed here. My suggestion is to follow the Java/Python approach:

  1. Factory methods like
pub fn new_v1_writer(...) {}
pub fn new_v2_writer(...) {}
pub fn new_v2_delete_writer(...) {}
  2. We could use something like a trait or an enum to abstract out the common parts of the different writers.
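
As a rough sketch of the first suggestion: the format version and content type can be fixed when the writer is created, so the caller never supplies them separately at write time. The function names come from the list above; everything else (making them associated functions, the `snapshot_id` parameter, the enum and field names) is a hypothetical placeholder for the elided arguments, not the real signature.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FormatVersion {
    V1,
    V2,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestContentType {
    Data,
    Deletes,
}

#[derive(Debug)]
pub struct ManifestWriter {
    format_version: FormatVersion,
    content: ManifestContentType,
    snapshot_id: i64, // hypothetical parameter standing in for the elided arguments
}

impl ManifestWriter {
    // Private constructor: outside this module, only the factory methods
    // below can create a writer, so only valid combinations exist.
    fn new(format_version: FormatVersion, content: ManifestContentType, snapshot_id: i64) -> Self {
        Self { format_version, content, snapshot_id }
    }

    /// v1 manifests only hold data files.
    pub fn new_v1_writer(snapshot_id: i64) -> Self {
        Self::new(FormatVersion::V1, ManifestContentType::Data, snapshot_id)
    }

    /// v2 manifest holding data files.
    pub fn new_v2_writer(snapshot_id: i64) -> Self {
        Self::new(FormatVersion::V2, ManifestContentType::Data, snapshot_id)
    }

    /// v2 manifest holding delete files.
    pub fn new_v2_delete_writer(snapshot_id: i64) -> Self {
        Self::new(FormatVersion::V2, ManifestContentType::Deletes, snapshot_id)
    }

    pub fn format_version(&self) -> FormatVersion {
        self.format_version
    }

    pub fn content(&self) -> ManifestContentType {
        self.content
    }

    pub fn snapshot_id(&self) -> i64 {
        self.snapshot_id
    }
}
```

With this shape there is no way to ask for a v1 delete writer, which is the kind of mismatch the factory methods are meant to rule out.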

Contributor Author

We could use something like a trait or an enum to abstract out the common parts of the different writers.

The differences between the v1, v2, and delete writers are:

  1. the metadata of the Avro file
  2. the Avro schema
  3. the content type
  4. the check in add_entry that ensures entry.content_type == writer.content_type
  5. how the ManifestEntry is serialized

I think all of these differences, except for serializing the ManifestEntry, can be implemented by storing different data in the writer when we create it through the factory method. So do we really need to abstract out the common parts of the different writers now? 🤔
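
For illustration, the check in add_entry could be driven purely by a content type stored in the writer at creation time, roughly as below. This is a sketch with hypothetical, simplified types, and the error is a plain `String` rather than the crate's real error type. The mapping of position and equality delete files onto a "deletes" manifest is how this sketch interprets the `entry.content_type == writer.content_type` comparison.

```rust
/// Content type of a single file referenced by an entry (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Content type of the manifest as a whole (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestContentType {
    Data,
    Deletes,
}

pub struct ManifestEntry {
    pub content_type: DataContentType,
    pub file_path: String,
}

pub struct ManifestWriter {
    content: ManifestContentType, // fixed by the factory method
    entries: Vec<ManifestEntry>,
}

impl ManifestWriter {
    pub fn new_v2_writer() -> Self {
        Self { content: ManifestContentType::Data, entries: Vec::new() }
    }

    pub fn new_v2_delete_writer() -> Self {
        Self { content: ManifestContentType::Deletes, entries: Vec::new() }
    }

    /// Rejects entries whose file kind does not match the manifest's content type.
    pub fn add_entry(&mut self, entry: ManifestEntry) -> Result<(), String> {
        let accepted = matches!(
            (self.content, entry.content_type),
            (ManifestContentType::Data, DataContentType::Data)
                | (ManifestContentType::Deletes, DataContentType::PositionDeletes)
                | (ManifestContentType::Deletes, DataContentType::EqualityDeletes)
        );
        if !accepted {
            return Err(format!(
                "entry {} has content type {:?}, but this writer accepts {:?}",
                entry.file_path, entry.content_type, self.content
            ));
        }
        self.entries.push(entry);
        Ok(())
    }

    pub fn entry_count(&self) -> usize {
        self.entries.len()
    }
}
```

Because the writer already knows its content type, no separate metadata argument is needed when the manifest file is finally written.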

Contributor

I'm fine without a trait or enum; the point is to use factory methods to ensure API safety.

1. adopt factory method to build different type manifest writer
2. provide add, exist, delete method