Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support to append delete type data file #798

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Dec 13, 2024

This PR support to support to append delete type data file

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 13, 2024

cc @liurenjie1024 @Xuanwo @Fokko @sdd

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 14, 2024

Sorry, I think I have some misunderstanding here since the action which support to append data file and delete file(position delete file, equality delete file) is RowDelta.🤔 So I guess what we need is right:

  1. MergingSnapshotProducer: https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L61
  2. RowDelta: https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/api/src/main/java/org/apache/iceberg/RowDelta.java#L32

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 14, 2024

I noticed that MergingSnapshotProducer work like the MergeAppend, it will merge manifest when commit. Is that necessary for RowDelta?🤔 Can we implement RowDelta works like FastAppend but it can append data file and delete type file first here? I trying a demo using this way and it can query using spark. @Fokko

@liurenjie1024
Copy link
Contributor

Thanks @ZENOTME for this pr! Adding support for deletion is a complicated process, which involves conflict detection, commit retries. I would suggest to have sth like you mention built first like ManifestMergerManager, ManifestFilterManager, MergingSnapshot appender first.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 21, 2024

Thanks @ZENOTME for this pr! Adding support for deletion is a complicated process, which involves conflict detection, commit retries. I would suggest to have sth like you mention built first like ManifestMergerManager, ManifestFilterManager, MergingSnapshot appender first.

I want to make sure whether fast append, merge append, append data file, append delete file is orthogonal. For now, in iceberg-python and iceberg-java:

  • FastAppend action is for fast append and append data file.
  • MergeAppend action is for merge append and append data file.
  • RowDelta action is for merge append and append delete files and data files.

Is it possible to have action like combing fast append and append delete file and data file? In fact, I try this in practice and it can be read by spark. And for this action, I think we don't need the ManifestMergerManager.🤔

@liurenjie1024
Copy link
Contributor

Is it possible to have action like combing fast append and append delete file and data file? In fact, I try this in practice and it can be read by spark. And for this action, I think we don't need the ManifestMergerManager.

The reason we can have a fast append action in iceberg is that append only data file always have no conflict with other actions. Remember that iceberg supports concurrent write, so for update/deletion actions, we need to detect conflction. It works in simple case doesn't mean it's right.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 23, 2024

Is it possible to have action like combing fast append and append delete file and data file? In fact, I try this in practice and it can be read by spark. And for this action, I think we don't need the ManifestMergerManager.

The reason we can have a fast append action in iceberg is that append only data file always have no conflict with other actions. Remember that iceberg supports concurrent write, so for update/deletion actions, we need to detect conflction. It works in simple case doesn't mean it's right.

Do you mean that for appending only the data file, we can always retry by using the original data file, but for delete data file, the related data may be deleted already, so we should use something like ManifestMergerManager to merge and discard the useless delete file.

I think for now we have conflicting detect using uuid, but we don't have the retry.

self.tx.append_requirements(vec![
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants