Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(puffin): Parse Puffin FileMetadata #765

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

fqaiser94
Copy link
Contributor

@fqaiser94 fqaiser94 commented Dec 7, 2024

Part of #744

Summary

Add support for parsing Puffin FileMetadata

Context

This is the third of a number of PRs to add support for Iceberg Puffin file format.
Feel free to ask clarifying questions if anything looks confusing/weird.
Also, it might be helpful to refer to the overarching PR from which these changes were split to understand better how these changes will fit in to the larger picture.

Useful links

Java Puffin binary files

Note the Java Puffin binary files are copied from https://github.com/apache/iceberg/tree/main/core/src/test/resources/org/apache/iceberg/puffin/v1

@@ -23,3 +23,8 @@

mod compression;
pub use compression::CompressionCodec;

mod metadata;
Copy link
Contributor Author

@fqaiser94 fqaiser94 Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I have deliberately chosen to not expose anything at the moment.
I want to create a separate, intentional PR to add pub modifiers after the reader and writer implementations have been added.
This way, we can develop inside this crate without having to worry about breaking changes until we're ready.
If it helps though, I intend to expose the following outside of the crate eventually:

pub use metadata::{BlobMetadata, FileMetadata, CREATED_BY_PROPERTY};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great!

@fqaiser94 fqaiser94 marked this pull request as ready for review December 7, 2024 02:07
@fqaiser94 fqaiser94 force-pushed the puffin_metadata branch 4 times, most recently from 4d3dd9f to 011945e Compare December 13, 2024 12:22
@c-thiel
Copy link
Collaborator

c-thiel commented Dec 13, 2024

@fqaiser94, just added the higher level statistic files in #799 FYI. I would guess you would end up building those soon too.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @fqaiser94 for working on this! Here are some minor suggestions.

crates/iceberg/src/puffin/compression.rs Outdated Show resolved Hide resolved
}

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub(crate) enum Flag {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we have other flags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the Puffin spec only supports one flag (as documented here) but allows for the possibility that there may be more flags in the future.

crates/iceberg/src/puffin/metadata.rs Show resolved Hide resolved
file_read: &dyn FileRead,
input_file_length: u64,
) -> Result<u32> {
let start = input_file_length - u64::from(FileMetadata::FOOTER_STRUCT_LENGTH);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use FileMetadata::FOOTER_STRUCT_LENGTH as u64 for simple

Copy link
Contributor Author

@fqaiser94 fqaiser94 Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this change if you insist but my understanding is that as can be dangerous and for that reason I would prefer to avoid. The majority of the from calls in here are safe/loss-less conversions but there is one that is not (where I use try_from) and I like being able to see that clearly in the code. Also, the from calls provide compile-time safety which is useful in case anyone decides to change the type of any of these constants in the future which could result in an unsafe/lossy cast if we use as. Agree that is not as nice to read but personally, I'll take compile-time safety every time 😄

crates/iceberg/src/puffin/metadata.rs Show resolved Hide resolved
let mut flags = HashSet::new();
for byte_number in 0..FileMetadata::FOOTER_STRUCT_FLAGS_LENGTH {
let byte_offset = footer_bytes.len()
- usize::from(FileMetadata::MAGIC_LENGTH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, we can use FileMetadata::MAGIC_LENGTH as usize

crates/iceberg/src/puffin/metadata.rs Outdated Show resolved Hide resolved
crates/iceberg/src/puffin/metadata.rs Outdated Show resolved Hide resolved
crates/iceberg/src/puffin/metadata.rs Outdated Show resolved Hide resolved
crates/iceberg/src/puffin/metadata.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants