Support Snapshot Expiration Operation #516

sungwy · 2024-03-11T21:15:54Z

Feature Request / Improvement

Support Maintenance operations on PyIceberg: https://iceberg.apache.org/docs/1.4.0/maintenance

All operations except data file compaction are metadata-only or file system operations, so supporting them in PyIceberg may be a small lift.

Fokko · 2024-03-12T07:29:12Z

Metadata rewrites are already discussed in #270

I think snapshot expiration is one that we should also focus on. I think #511 is a prerequisite to ensure we have a structured way of listing all the metadata so we know which files we can remove. This would then just be operations on a PyArrow Table 🎉

sungwy · 2024-03-12T19:27:50Z

Noted. Adjusting the Issue title as suggested 👍

salexln · 2024-05-07T07:37:43Z

@Fokko
Just to make sure: is there currently no support for snapshot expiration in PyIceberg?
If so, do you have any suggestions for how to remove old data? (I'm writing data to AWS Glue using PyIceberg.)

ndrluis · 2024-08-15T12:52:00Z

I started a discussion on the mailing list about deleting orphan files, and in the meantime I'm studying snapshot expiration.

sungwy · 2024-08-15T21:18:26Z

Thanks @ndrluis - would you like me to assign this issue to you?

ndrluis · 2024-09-10T18:25:05Z

Hello everyone, I need some help with this.

During the implementation process, I noticed that we lack some features that exist in the Java implementation. The first one I want to discuss is the TableOperations class. In the RemoveSnapshots class, the io, current, refresh, and commit methods are used.

Currently, I can use the io from our Table implementation, and I can get the same behavior through Table to get and refresh the TableMetadata. However, I haven't found an implementation to commit TableMetadata.

The problem is that we require an implementation for each catalog. Currently, the Java implementation uses the TableOperations interface for the REST catalog, but the other catalogs (Snowflake / JDBC / Glue) use a different interface with the same methods.

My question is: how should we handle the commit process? I don't have a solid opinion on how we can solve this.

Just to clarify, this commit method is only for updating the table metadata.

cc/ @Fokko @HonahX @kevinjqliu @sungwy

ndrluis · 2024-09-10T19:59:04Z

After reading my question again, I'm not sure if I was clear enough. I want to discuss the design of the implementation in more detail.

I'm currently creating a class (ExpireSnapshot) to handle the expiration process (similar to RemoveSnapshots in Java) and also classes that will manage file deletion based on the updated metadata (ReachableFileCleanup and IncrementalFileCleanup). I believe this design makes sense, but my question is about where the TableMetadata commit should live. Does it make more sense to create a TableOperations class (with an implementation for each catalog), or should we add a method to each catalog that handles the commit process?
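A minimal sketch of the split described above, assuming a heavily simplified model in which each snapshot already carries the set of files reachable from it. `Snapshot`, `select_expired`, and `reachable_file_cleanup` are hypothetical names used for illustration only; they are not PyIceberg or Java Iceberg APIs, and a real implementation would also have to honor retention settings and walk manifests to find reachable files.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified stand-in for an Iceberg snapshot.
    snapshot_id: int
    timestamp_ms: int
    files: FrozenSet[str]  # assumed: every file reachable from this snapshot


def select_expired(
    snapshots: List[Snapshot], older_than_ms: int, protected_ids: Set[int]
) -> List[Snapshot]:
    """Pick snapshots older than the cutoff that no branch or tag points at."""
    return [
        s
        for s in snapshots
        if s.timestamp_ms < older_than_ms and s.snapshot_id not in protected_ids
    ]


def reachable_file_cleanup(
    snapshots: List[Snapshot], expired: List[Snapshot]
) -> Set[str]:
    """Return files referenced only by expired snapshots.

    A file is safe to delete when it is reachable from an expired snapshot
    and from none of the retained snapshots.
    """
    expired_ids = {s.snapshot_id for s in expired}
    retained_files: Set[str] = set()
    for s in snapshots:
        if s.snapshot_id not in expired_ids:
            retained_files |= s.files
    expired_files: Set[str] = set()
    for s in expired:
        expired_files |= s.files
    return expired_files - retained_files


snapshots = [
    Snapshot(1, 1_000, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, 2_000, frozenset({"b.parquet", "c.parquet"})),
    Snapshot(3, 3_000, frozenset({"c.parquet", "d.parquet"})),
]
# Snapshot 3 is the head of the main branch in this example, so it is protected.
expired = select_expired(snapshots, older_than_ms=2_500, protected_ids={3})
to_delete = reachable_file_cleanup(snapshots, expired)
```

In this example `c.parquet` survives because the retained snapshot 3 still references it, which is the reason file deletion has to be computed from the updated metadata rather than from the expired snapshots alone.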

kevinjqliu · 2024-09-11T17:11:55Z

> how should we handle the commit process?

The current way to commit is through the public commit_table method, which each catalog implements (e.g. the SQL catalog). It takes requirements and updates, similar to the RESTTableOperations::commit function.
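To illustrate the requirements-plus-updates idea, here is a minimal, self-contained sketch. Every name in it (`TableState`, `RequireRefSnapshotId`, `RemoveSnapshot`, `InMemoryCatalog`, `CommitFailed`) is a hypothetical stand-in, and the toy `commit_table` signature is an assumption made for the example; it is not the actual PyIceberg signature. The point is only the flow: requirements are validated against the current metadata, then updates are applied and the result is stored.

```python
from dataclasses import dataclass
from typing import Dict, List


class CommitFailed(Exception):
    """Raised when a requirement no longer holds for the current metadata."""


@dataclass(frozen=True)
class TableState:
    # Hypothetical, pared-down table metadata: snapshot ids plus named refs.
    snapshot_ids: tuple
    refs: Dict[str, int]


@dataclass(frozen=True)
class RequireRefSnapshotId:
    """Requirement: the named ref must still point at the expected snapshot."""
    ref: str
    snapshot_id: int

    def validate(self, state: TableState) -> None:
        if state.refs.get(self.ref) != self.snapshot_id:
            raise CommitFailed(f"ref {self.ref!r} changed; refresh and retry")


@dataclass(frozen=True)
class RemoveSnapshot:
    """Update: drop one snapshot from the metadata."""
    snapshot_id: int

    def apply(self, state: TableState) -> TableState:
        if self.snapshot_id in state.refs.values():
            raise ValueError(f"snapshot {self.snapshot_id} is still referenced")
        kept = tuple(s for s in state.snapshot_ids if s != self.snapshot_id)
        return TableState(kept, dict(state.refs))


class InMemoryCatalog:
    """Toy catalog: one commit entry point shared by every kind of change."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableState] = {}

    def register(self, name: str, state: TableState) -> None:
        self._tables[name] = state

    def commit_table(self, name: str, requirements: List, updates: List) -> TableState:
        current = self._tables[name]
        # Validate first so that a stale caller cannot overwrite newer metadata.
        for requirement in requirements:
            requirement.validate(current)
        new_state = current
        for update in updates:
            new_state = update.apply(new_state)
        self._tables[name] = new_state
        return new_state


catalog = InMemoryCatalog()
catalog.register("db.events", TableState((1, 2, 3), {"main": 3}))
result = catalog.commit_table(
    "db.events",
    requirements=[RequireRefSnapshotId("main", 3)],
    updates=[RemoveSnapshot(1)],
)
```

Under this model, snapshot expiration would not need a separate commit path per catalog: it would only have to produce the right list of updates and requirements.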

> I want to discuss the design of the implementation in more detail.

Do you mind starting a doc on what you've found so far? It would be helpful to figure out what is needed for snapshot expiration before diving into how to implement it.

ndrluis · 2024-09-12T12:20:03Z

@kevinjqliu Thank you, I will start writing a document next week detailing what I found and explaining the differences in the commit operation and what we don't have in Python.

ndrluis · 2024-09-19T00:13:08Z

@kevinjqliu I believe that I now understand the differences in how we perform TableMetadata updates in Python versus how it's done in Java. I think that the set of classes describing the changes, along with the use of single dispatch, makes it a bit harder to understand compared to the Java builder strategy. I will continue working on the document to discuss other points.
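For readers unfamiliar with the pattern, the following is a minimal sketch of "update classes plus single dispatch" using `functools.singledispatch` from the standard library. `Metadata`, `RemoveSnapshot`, `SetProperty`, `apply_update`, and `update_metadata` are hypothetical names chosen for the example and are not taken from the PyIceberg code base; the sketch only shows the general shape of the approach.

```python
from dataclasses import dataclass, field, replace
from functools import singledispatch
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Metadata:
    # Hypothetical, minimal table metadata.
    snapshot_ids: Tuple[int, ...]
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class RemoveSnapshot:
    snapshot_id: int


@dataclass(frozen=True)
class SetProperty:
    key: str
    value: str


@singledispatch
def apply_update(update, metadata: Metadata) -> Metadata:
    """Fallback for update types that have no registered handler."""
    raise NotImplementedError(f"Unsupported update: {type(update).__name__}")


@apply_update.register(RemoveSnapshot)
def _(update: RemoveSnapshot, metadata: Metadata) -> Metadata:
    kept = tuple(s for s in metadata.snapshot_ids if s != update.snapshot_id)
    return replace(metadata, snapshot_ids=kept)


@apply_update.register(SetProperty)
def _(update: SetProperty, metadata: Metadata) -> Metadata:
    return replace(metadata, properties={**metadata.properties, update.key: update.value})


def update_metadata(base: Metadata, updates: Iterable) -> Metadata:
    """Fold a sequence of updates over the base metadata, one handler per type."""
    current = base
    for update in updates:
        current = apply_update(update, current)
    return current


base = Metadata(snapshot_ids=(1, 2, 3))
new = update_metadata(base, [RemoveSnapshot(1), SetProperty("owner", "etl")])
```

Each change is a small data class and the handler is chosen by the type of the first argument, so the logic is spread across registered functions instead of living in one builder object, which may be part of why it is harder to follow at first.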

sungwy changed the title ~~Support Maintenance Operations on PyIceberg~~ Support Snapshot Expiration Operation Mar 12, 2024

Gowthami03B mentioned this issue Mar 14, 2024

Add metadata tables #511

Closed

8 tasks

ndrluis mentioned this issue Aug 15, 2024

[feat] Table maintenance tasks #1065

Open

5 tasks

ndrluis self-assigned this Aug 15, 2024
