Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support metadata table "snapshots" #822

Merged
merged 8 commits into from
Dec 30, 2024

Conversation

xxchan
Copy link
Contributor

@xxchan xxchan commented Dec 18, 2024

#823

In Iceberg Spark/Flink there are metadata tables that provide information around the table: https://iceberg.apache.org/docs/latest/spark-queries/#inspecting-tables

Supporting this in iceberg-rust allows other engines (like RisingWave) to support these "standard" metadata tables.

In this PR, I added a trait for metadata tables and the first implementation for the snapshots metadata table. If it looks good, we can add more later.

reference implementation:
- https://py.iceberg.apache.org/api/#inspecting-tables
- https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/inspect.py#L58-L90

@xxchan xxchan force-pushed the xxchan/proper-gamefowl branch 2 times, most recently from 186f075 to e137fbb Compare December 18, 2024 13:16
@xxchan xxchan force-pushed the xxchan/proper-gamefowl branch from e137fbb to b3cf119 Compare December 18, 2024 13:20
Comment on lines 56 to 65
/// Returns the schema of the metadata table.
fn schema() -> Schema;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to make this separate and public, so engines can get the schema first without fetching the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the metadata tables (e.g. entires) have schemas that depend on the table schema. E.g. readable_metrics is a struct with field per column. Might be worth having this take a scan as well or do so in another PR.

@xxchan
Copy link
Contributor Author

xxchan commented Dec 18, 2024

cc @liurenjie1024 @Xuanwo @Fokko @sdd

xxchan added a commit to xxchan/iceberg-rust that referenced this pull request Dec 20, 2024
@xxchan xxchan force-pushed the xxchan/proper-gamefowl branch from 8ca0ab6 to 973129e Compare December 25, 2024 14:04
Comment on lines 37 to 39
pub struct MetadataScan {
metadata_ref: TableMetadataRef,
}
Copy link
Contributor

@rshkv rshkv Dec 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering what you think about

Suggested change
pub struct MetadataScan {
metadata_ref: TableMetadataRef,
}
pub struct MetadataScan<'a> {
table: &'a Table,
}

so we can get at FileIO and other table properties that aren't in the metadata.

Signed-off-by: xxchan <[email protected]>
Signed-off-by: xxchan <[email protected]>
Signed-off-by: xxchan <[email protected]>
@xxchan xxchan force-pushed the xxchan/proper-gamefowl branch from 973129e to f40aa87 Compare December 28, 2024 04:06
fn schema() -> Schema;

/// Scans the metadata table.
fn scan(scan: &MetadataScan) -> Result<RecordBatch>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on making async because some of the metadata scans will do async reads of metadata files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this, the manifests table requires an async call load_manifest_list() with a dependency on FileIO.

@xxchan
Copy link
Contributor Author

xxchan commented Dec 29, 2024

@rshkv Thanks for the review and the follow-up work. I just want to wait for the maintainers to make sure the framework looks good before applying any suggestions :p

BTW I think we can also change the details like signatures and fields on demand later.

@rshkv
Copy link
Contributor

rshkv commented Dec 29, 2024

Yeah, totally!

BTW I think we can also change the details like signatures and fields on demand later.

Yeah that's what I was worried about. Sg. 👍🏻

Signed-off-by: xxchan <[email protected]>
@xxchan xxchan requested a review from Xuanwo December 30, 2024 12:52
Signed-off-by: xxchan <[email protected]>
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, it looks much better now. Only some minor questions.

crates/iceberg/src/metadata_scan.rs Outdated Show resolved Hide resolved
crates/iceberg/src/metadata_scan.rs Outdated Show resolved Hide resolved
Signed-off-by: xxchan <[email protected]>
@xxchan xxchan requested a review from Xuanwo December 30, 2024 14:42
@xxchan
Copy link
Contributor Author

xxchan commented Dec 30, 2024

Thanks for the review. Updated. Please take a another look.

Signed-off-by: xxchan <[email protected]>
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is already in good shape. Let's merge it to unblock other pending PRs. Suggested changes can be made in a separate PR.

@@ -73,6 +73,7 @@ mod avro;
pub mod io;
pub mod spec;

pub mod metadata_scan;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be changed in to metadata_table.

@@ -200,6 +201,12 @@ impl Table {
TableScanBuilder::new(self)
}

/// Creates a metadata table which provides table-like APIs for inspecting metadata.
/// See [`MetadataTable`] for more details.
pub fn metadata_table(self) -> MetadataTable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use &self here to align with other APIs.

@Xuanwo Xuanwo merged commit 328e18e into apache:main Dec 30, 2024
17 checks passed
rshkv added a commit to rshkv/iceberg-rust that referenced this pull request Dec 30, 2024
rshkv added a commit to rshkv/iceberg-rust that referenced this pull request Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants