Support merge manifests on writes (MergeAppend) #363

HonahX · 2024-02-04T08:53:06Z

Add MergeAppendFiles. This PR will enable the following configurations:

commit.manifest-merge.enabled: Controls whether to automatically merge manifests on writes.
commit.manifest.min-count-to-merge: Minimum number of manifests to accumulate before merging.
commit.manifest.target-size-bytes: Target size when merging manifest files.

Since commit.manifest-merge.enabled is default to True, we need to make MergeAppend as the default way to append data to align with the property definition and java implementation

Fokko

Great start @HonahX Maybe we want to see if there are any things we can split out, such as the rolling manifest writer.

pyiceberg/table/__init__.py

Fokko · 2024-02-05T19:58:43Z

pyiceberg/table/__init__.py

+        # TODO: need to re-consider the name here: manifest containing positional deletes and manifest containing deleted entries
+        unmerged_deletes_manifests = [manifest for manifest in existing_manifests if manifest.content == ManifestContent.DELETES]
+
+        data_manifest_merge_manager = ManifestMergeManager(


We're changing the append operation from a fast-append to a regular append when it hits a threshold. I would be more comfortable with keeping the compaction separate. This way we know that an append/overwrite is always fast and in constant time. For example, if you have a process that appends data, you know how fast it will run (actually it is a function of the number of manifests).

Thanks for the explanation! Totally agree! I was thinking it might be a good time to bring FastAppend and MergeAppend to pyiceberg, making them inherit from a _SnapshotProducer

Fokko · 2024-02-08T12:03:14Z

pyiceberg/table/__init__.py

@@ -944,7 +949,8 @@ def append(self, df: pa.Table) -> None:
        if len(self.spec().fields) > 0:
            raise ValueError("Cannot write to partitioned tables")

-        merge = _MergingSnapshotProducer(operation=Operation.APPEND, table=self)
+        # TODO: need to consider how to support both _MergeAppend and _FastAppend


Do we really want to support both? This part of the Java code has been a major source of (hard to debug) problems. Splitting out the commit and compaction path completely would simplify that quite a bit.

I think it is a good idea to have a separate API in UpdateSnapshot in #446 to compact manifests only. However, I believe retaining MergeAppend is also necessary due to the commit.manifest-merge.enabled setting. This setting, when enabled (which is the default), leads users to expect automatic merging of manifests when they append/overwrite data, rather than having to compact manifest by another API. What do you think?

tests/conftest.py

Fokko

Hey @HonahX thanks for working on this and sorry for the late reply. I wanted to take the time to test this properly.

It looks like either the snapshot inheritance is not working properly, or something is off with the writer. I converted the Avro manifest files to JSON using avro-tools, and noticed the following:

{
    "status": 1,
    "snapshot_id": {
        "long": 6972473597951752000
    },
    "data_sequence_number": {
        "long": -1
    },
    "file_sequence_number": {
        "long": -1
    },
...
}
{
    "status": 0,
    "snapshot_id": {
        "long": 3438738529910612500
    },
    "data_sequence_number": {
        "long": -1
    },
    "file_sequence_number": {
        "long": -1
    },
...
}
{
    "status": 0,
    "snapshot_id": {
        "long": 1638533332780464400
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
....
}

Looks like either the snapshot inheritance is not working properly when rewriting the manifests.

Fokko · 2024-03-13T08:43:40Z

tests/integration/test_writes.py

@@ -355,6 +355,44 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
    assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


+@pytest.mark.integration


Can you parameterize the test for both V1 and V2 tables?

We want to assert the manifest-entries as well (only for the merge-appended one).

pyiceberg/table/__init__.py

sungwy

Thank you very much for adding this @HonahX . Just one small nit, and otherwise looks good to me!

sungwy · 2024-03-22T20:18:59Z

pyiceberg/table/__init__.py

@@ -1091,7 +1111,7 @@ def append(self, df: pa.Table) -> None:
        _check_schema(self.schema(), other_schema=df.schema)

        with self.transaction() as txn:
-            with txn.update_snapshot().fast_append() as update_snapshot:
+            with txn.update_snapshot().merge_append() as update_snapshot:


Could we update the new add_files method to also use merge_append?

That seems to be the default choice of snapshot producer in Java

@syun64 Could you elaborate on the motivation to pick merge-append over a fast-append? For Java, it is for historical reasons since the fast-append was added later. The fast-append creates more metadata but also has:

Takes less time to commit, since it doesn't rewrite any existing manifests. This reduces the chances of having a conflict.

The time it takes to commit is more predictable and fairly constant to the number of data files that are written.

When you static-overwrite partitions as you do in your typical ETL, it will speed up the deletes since it can just drop a whole manifest that the previous fast-append has produced.

The main downside is when you do full-table scans that you need to evaluate more metadata.

That's a good argument @Fokko . Especially in a world where we are potentially moving the work of doing table scans into the Rest Catalog, compacting manifests on write isn't important for this function that already looks to prioritize commit speed over anything else.

I think it makes sense to leave the function to use fast_append and let the users rely on other means of optimizing their table scans.

HonahX · 2024-06-03T07:37:19Z

Sorry for the long wait. I've fixed the sequence number inheritance issue. Previously some manifest entry incorrectly persist the -1 sequence number inherited from a newly constructed ManifestFile. I added a wrapper in ManifestWriter to ensure the sequence number None when unassigned.

I will add tests and update the doc soon

HonahX

Tests and doc are pushed! @Fokko @syun64 Could you please review this again when you have a chance?

pyiceberg/table/__init__.py

sungwy

Just a few nits, otherwise looks good @HonahX

pyiceberg/table/__init__.py

sungwy

This looks good to me @HonahX 👍

Fokko · 2024-06-30T19:46:26Z

I'm seeing some odd behavior:

from pyiceberg.catalog.sql import SqlCatalog
from datetime import datetime, timezone, date
import uuid
import pyarrow as pa

pa_schema = pa.schema([
    ("bool", pa.bool_()),
    ("string", pa.large_string()),
    ("string_long", pa.large_string()),
    ("int", pa.int32()),
    ("long", pa.int64()),
    ("float", pa.float32()),
    ("double", pa.float64()),
    # Not supported by Spark
    # ("time", pa.time64('us')),
    ("timestamp", pa.timestamp(unit="us")),
    ("timestamptz", pa.timestamp(unit="us", tz="UTC")),
    ("date", pa.date32()),
    # Not supported by Spark
    # ("time", pa.time64("us")),
    # Not natively supported by Arrow
    # ("uuid", pa.fixed(16)),
    ("binary", pa.large_binary()),
    ("fixed", pa.binary(16)),
])


TEST_DATA_WITH_NULL = {
    "bool": [False, None, True],
    "string": ["a", None, "z"],
    # Go over the 16 bytes to kick in truncation
    "string_long": ["a" * 22, None, "z" * 22],
    "int": [1, None, 9],
    "long": [1, None, 9],
    "float": [0.0, None, 0.9],
    "double": [0.0, None, 0.9],
    # 'time': [1_000_000, None, 3_000_000],  # Example times: 1s, none, and 3s past midnight #Spark does not support time fields
    "timestamp": [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
    "timestamptz": [
        datetime(2023, 1, 1, 19, 25, 00, tzinfo=timezone.utc),
        None,
        datetime(2023, 3, 1, 19, 25, 00, tzinfo=timezone.utc),
    ],
    "date": [date(2023, 1, 1), None, date(2023, 3, 1)],
    # Not supported by Spark
    # 'time': [time(1, 22, 0), None, time(19, 25, 0)],
    # Not natively supported by Arrow
    # 'uuid': [uuid.UUID('00000000-0000-0000-0000-000000000000').bytes, None, uuid.UUID('11111111-1111-1111-1111-111111111111').bytes],
    "binary": [b"\01", None, b"\22"],
    "fixed": [
        uuid.UUID("00000000-0000-0000-0000-000000000000").bytes,
        None,
        uuid.UUID("11111111-1111-1111-1111-111111111111").bytes,
    ],
}

catalog = SqlCatalog("test_sql_catalog", uri="sqlite:///:memory:", warehouse=f"/tmp/")

pa_table = pa.Table.from_pydict(TEST_DATA_WITH_NULL, schema=pa_schema)

catalog.create_namespace(('some',))

tbl = catalog.create_table(identifier="some.table", schema=pa_schema, properties={
    "commit.manifest.min-count-to-merge": "2"
})

for num in range(5):
    print(f"Appended: {num}")
    tbl.merge_append(pa_table)

It tries to read a corrupt file (or a bug in our reader):

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In[2], line 71
     69 for num in range(5):
     70     print(f"Appended: {num}")
---> 71     tbl.merge_append(pa_table)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1424, in Table.merge_append(self, df, snapshot_properties)
   1411 """
   1412 Shorthand API for appending a PyArrow table to a table transaction and merging manifests on write.
   1413 
   (...)
   1421     snapshot_properties: Custom properties to be added to the snapshot summary
   1422 """
   1423 with self.transaction() as tx:
-> 1424     tx.merge_append(df=df, snapshot_properties=snapshot_properties)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:472, in Transaction.merge_append(self, df, snapshot_properties)
    468 data_files = _dataframe_to_data_files(
    469     table_metadata=self._table.metadata, write_uuid=update_snapshot.commit_uuid, df=df, io=self._table.io
    470 )
    471 for data_file in data_files:
--> 472     update_snapshot.append_data_file(data_file)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1899, in UpdateTableMetadata.__exit__(self, _, value, traceback)
   1897 def __exit__(self, _: Any, value: Any, traceback: Any) -> None:
   1898     """Close and commit the change."""
-> 1899     self.commit()

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1895, in UpdateTableMetadata.commit(self)
   1894 def commit(self) -> None:
-> 1895     self._transaction._apply(*self._commit())

File ~/work/iceberg-python/pyiceberg/table/__init__.py:2966, in _SnapshotProducer._commit(self)
   2965 def _commit(self) -> UpdatesAndRequirements:
-> 2966     new_manifests = self._manifests()
   2967     next_sequence_number = self._transaction.table_metadata.next_sequence_number()
   2969     summary = self._summary(self.snapshot_properties)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:2935, in _SnapshotProducer._manifests(self)
   2932 delete_manifests = executor.submit(_write_delete_manifest)
   2933 existing_manifests = executor.submit(self._existing_manifests)
-> 2935 return self._process_manifests(added_manifests.result() + delete_manifests.result() + existing_manifests.result())

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3111, in MergeAppendFiles._process_manifests(self, manifests)
   3102 unmerged_deletes_manifests = [manifest for manifest in manifests if manifest.content == ManifestContent.DELETES]
   3104 data_manifest_merge_manager = _ManifestMergeManager(
   3105     target_size_bytes=self._target_size_bytes,
   3106     min_count_to_merge=self._min_count_to_merge,
   3107     merge_enabled=self._merge_enabled,
   3108     snapshot_producer=self,
   3109 )
-> 3111 return data_manifest_merge_manager.merge_manifests(unmerged_data_manifests) + unmerged_deletes_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3987, in _ManifestMergeManager.merge_manifests(self, manifests)
   3985 merged_manifests = []
   3986 for spec_id in reversed(groups.keys()):
-> 3987     merged_manifests.extend(self._merge_group(first_manifest, spec_id, groups[spec_id]))
   3989 return merged_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3974, in _ManifestMergeManager._merge_group(self, first_manifest, spec_id, manifests)
   3963     return output_manifests
   3965 # executor = ExecutorFactory.get_or_create()
   3966 # futures = [executor.submit(merge_bin, b) for b in bins]
   3967 
   (...)
   3971 # for future in concurrent.futures.as_completed(futures):
   3972 #     completed_futures.add(future)
-> 3974 bin_results: List[List[ManifestFile]] = [merge_bin(b) for b in bins]
   3976 return [manifest for bin_result in bin_results for manifest in bin_result]

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3974, in <listcomp>(.0)
   3963     return output_manifests
   3965 # executor = ExecutorFactory.get_or_create()
   3966 # futures = [executor.submit(merge_bin, b) for b in bins]
   3967 
   (...)
   3971 # for future in concurrent.futures.as_completed(futures):
   3972 #     completed_futures.add(future)
-> 3974 bin_results: List[List[ManifestFile]] = [merge_bin(b) for b in bins]
   3976 return [manifest for bin_result in bin_results for manifest in bin_result]

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3961, in _ManifestMergeManager._merge_group.<locals>.merge_bin(manifest_bin)
   3959     output_manifests.extend(manifest_bin)
   3960 else:
-> 3961     output_manifests.append(self._create_manifest(spec_id, manifest_bin))
   3963 return output_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3934, in _ManifestMergeManager._create_manifest(self, spec_id, manifest_bin)
   3932 with self._snapshot_producer.new_manifest_writer(spec=self._snapshot_producer.spec(spec_id)) as writer:
   3933     for manifest in manifest_bin:
-> 3934         for entry in self._snapshot_producer.fetch_manifest_entry(manifest=manifest, discard_deleted=False):
   3935             if entry.status == ManifestEntryStatus.DELETED and entry.snapshot_id == self._snapshot_producer.snapshot_id:
   3936                 #  only files deleted by this snapshot should be added to the new manifest
   3937                 writer.delete(entry)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3034, in _SnapshotProducer.fetch_manifest_entry(self, manifest, discard_deleted)
   3033 def fetch_manifest_entry(self, manifest: ManifestFile, discard_deleted: bool = True) -> List[ManifestEntry]:
-> 3034     return manifest.fetch_manifest_entry(io=self._io, discard_deleted=discard_deleted)

File ~/work/iceberg-python/pyiceberg/manifest.py:611, in ManifestFile.fetch_manifest_entry(self, io, discard_deleted)
    609 print(f"MANIFEST: {self.manifest_path}")
    610 input_file = io.new_input(self.manifest_path)
--> 611 with AvroFile[ManifestEntry](
    612     input_file,
    613     MANIFEST_ENTRY_SCHEMAS[DEFAULT_READ_VERSION],
    614     read_types={-1: ManifestEntry, 2: DataFile},
    615     read_enums={0: ManifestEntryStatus, 101: FileFormat, 134: DataFileContent},
    616 ) as reader:
    617     return [
    618         _inherit_from_manifest(entry, self)
    619         for entry in reader
    620         if not discard_deleted or entry.status != ManifestEntryStatus.DELETED
    621     ]

File ~/work/iceberg-python/pyiceberg/avro/file.py:172, in AvroFile.__enter__(self)
    170 with self.input_file.open() as f:
    171     self.decoder = new_decoder(f.read())
--> 172 self.header = self._read_header()
    173 self.schema = self.header.get_schema()
    174 if not self.read_schema:

File ~/work/iceberg-python/pyiceberg/avro/file.py:220, in AvroFile._read_header(self)
    219 def _read_header(self) -> AvroFileHeader:
--> 220     return construct_reader(META_SCHEMA, {-1: AvroFileHeader}).read(self.decoder)

File ~/work/iceberg-python/pyiceberg/avro/reader.py:333, in StructReader.read(self, decoder)
    331 for pos, field_reader in self._field_reader_functions:
    332     if pos is not None:
--> 333         struct[pos] = field_reader(decoder)  # later: pass reuse in here
    334     else:
    335         field_reader(decoder)

File ~/work/iceberg-python/pyiceberg/avro/reader.py:469, in MapReader.read(self, decoder)
    467         block_count = decoder.read_int()
    468 else:
--> 469     block_count = decoder.read_int()
    470     while block_count != 0:
    471         if block_count < 0:

File ~/work/iceberg-python/pyiceberg/avro/decoder_fast.pyx:85, in pyiceberg.avro.decoder_fast.CythonBinaryDecoder.read_int()

File ~/work/iceberg-python/pyiceberg/avro/decoder_fast.pyx:92, in pyiceberg.avro.decoder_fast.CythonBinaryDecoder.read_int()

EOFError: EOF: read 1 bytes

It tries to read this file, which turns out to be empty?

avro-tools tojson /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:44:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

avro-tools getmeta /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:45:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
schema	{"type":"struct","fields":[{"id":1,"name":"bool","type":"boolean","required":false},{"id":2,"name":"string","type":"string","required":false},{"id":3,"name":"string_long","type":"string","required":false},{"id":4,"name":"int","type":"int","required":false},{"id":5,"name":"long","type":"long","required":false},{"id":6,"name":"float","type":"float","required":false},{"id":7,"name":"double","type":"double","required":false},{"id":8,"name":"timestamp","type":"timestamp","required":false},{"id":9,"name":"timestamptz","type":"timestamptz","required":false},{"id":10,"name":"date","type":"date","required":false},{"id":11,"name":"binary","type":"binary","required":false},{"id":12,"name":"fixed","type":"fixed[16]","required":false}],"schema-id":0,"identifier-field-ids":[]}
partition-spec	{"spec-id":0,"fields":[]}
partition-spec-id	0
format-version	2
content	data
avro.schema	{"type": "record", "fields": [{"name": "status", "field-id": 0, "type": "int"}, {"name": "snapshot_id", "field-id": 1, "type": ["null", "long"], "default": null}, {"name": "data_sequence_number", "field-id": 3, "type": ["null", "long"], "default": null}, {"name": "file_sequence_number", "field-id": 4, "type": ["null", "long"], "default": null}, {"name": "data_file", "field-id": 2, "type": {"type": "record", "fields": [{"name": "content", "field-id": 134, "type": "int", "doc": "File format name: avro, orc, or parquet"}, {"name": "file_path", "field-id": 100, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "file_format", "field-id": 101, "type": "string", "doc": "File format name: avro, orc, or parquet"}, {"name": "partition", "field-id": 102, "type": {"type": "record", "fields": [], "name": "r102"}, "doc": "Partition data tuple, schema based on the partition spec"}, {"name": "record_count", "field-id": 103, "type": "long", "doc": "Number of records in the file"}, {"name": "file_size_in_bytes", "field-id": 104, "type": "long", "doc": "Total file size in bytes"}, {"name": "column_sizes", "field-id": 108, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k117_v118", "fields": [{"name": "key", "type": "int", "field-id": 117}, {"name": "value", "type": "long", "field-id": 118}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total size on disk"}, {"name": "value_counts", "field-id": 109, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k119_v120", "fields": [{"name": "key", "type": "int", "field-id": 119}, {"name": "value", "type": "long", "field-id": 120}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total count, including null and NaN"}, {"name": "null_value_counts", "field-id": 110, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k121_v122", "fields": [{"name": "key", "type": "int", "field-id": 121}, {"name": "value", "type": "long", "field-id": 122}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to null value count"}, {"name": "nan_value_counts", "field-id": 137, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k138_v139", "fields": [{"name": "key", "type": "int", "field-id": 138}, {"name": "value", "type": "long", "field-id": 139}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to number of NaN values in the column"}, {"name": "lower_bounds", "field-id": 125, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k126_v127", "fields": [{"name": "key", "type": "int", "field-id": 126}, {"name": "value", "type": "bytes", "field-id": 127}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to lower bound"}, {"name": "upper_bounds", "field-id": 128, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k129_v130", "fields": [{"name": "key", "type": "int", "field-id": 129}, {"name": "value", "type": "bytes", "field-id": 130}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to upper bound"}, {"name": "key_metadata", "field-id": 131, "type": ["null", "bytes"], "default": null, "doc": "Encryption key metadata blob"}, {"name": "split_offsets", "field-id": 132, "type": ["null", {"type": "array", "element-id": 133, "items": "long"}], "default": null, "doc": "Splittable offsets"}, {"name": "equality_ids", "field-id": 135, "type": ["null", {"type": "array", "element-id": 136, "items": "long"}], "default": null, "doc": "Field ids used to determine row equality in equality delete files."}, {"name": "sort_order_id", "field-id": 140, "type": ["null", "int"], "default": null, "doc": "ID representing sort order for this file"}], "name": "r2"}}], "name": "manifest_entry"}
avro.codec	null

Looks like we're writing empty files: #876

Fokko

Looking good @HonahX ! 🙌

Fokko · 2024-06-30T17:50:19Z

mkdocs/docs/api.md

@@ -273,6 +273,10 @@ tbl.append(df)

 # or

+tbl.merge_append(df)


I'm reluctant to expose this to the public API for a couple of reasons:

Unsure if folks know what the impact is between choosing fast- or merge appends.

It might also be that we do appends as part of the operation (upserts as an obvious one).

Another method to the public API :)

How about having something similar as in Java, to control this using a table property: https://iceberg.apache.org/docs/1.5.2/configuration/#table-behavior-properties

Sounds great! I am also +1 on let it controlled by the config. I made merge_append a separate API to mirror the Java side implementation, which has newAppend and newFastAppend APIs. But it seems better to just make the commit.manifest-merge.enabled default to False on python side.

I will still keep FastAppend and MergeAppend as separate class, and keep merge_append in UpdateSnapshot class to ensure clarity, although the current MergeAppend is purely FastAppend + manifest merge.

Just curious, why not Java side newAppend return an FastAppend impl when commit.manifest-merge.enabled is False. Is it due to some backward compatibiilty issue?

Thanks! I think the use-case of the Java library is slightly different, since that's mostly used in query engines.

Is it due to some backward compatibiilty issue?

I think it is for historical reasons, since the fast-append was added later on :)

btw, I like how you split it out in classes, it is much cleaner now 👍

mkdocs/docs/configuration.md

HonahX · 2024-07-01T08:45:39Z

pyiceberg/table/__init__.py

-                output_file_location = _new_manifest_path(
-                    location=self._transaction.table_metadata.location, num=0, commit_uuid=self.commit_uuid
-                )
                with write_manifest(
                    format_version=self._transaction.table_metadata.format_version,
                    spec=self._transaction.table_metadata.spec(),
                    schema=self._transaction.table_metadata.schema(),
-                    output_file=self._io.new_output(output_file_location),
+                    output_file=self.new_manifest_output(),


@Fokko Thanks for the detailed code example and stacktrace! With the help of them and #876, I found the root cause of the bug: the collision of the names of manifest files within a commit. I've modified the code to avoid that.

It is hard to find because if the file is in the object storage, when FileIO opens a new OutputFile on the same location, the existing file is still readable until the OutputFile "commit". So for integration test that use minio, everything works fine. We won't find any issue until we rollback to some previous snapshot.

For the in-memory SqlCatalog test, since the file is in the local filesystem, the existing file become empty/corrupted immediately after we open a new OutputFile on the same location. This behavior causes the ManifestMergeManager write some empty file and the issue emerges.

I've included a temporary test in test_sql.py to ensure correctness of the current change. I will try to formalize that tommorrow

Thanks for digging into this and fixing it 🙌

Fokko · 2024-07-04T18:55:52Z

Doing some testing with avro-tools, asserting the state after 5 append operations with "commit.manifest.min-count-to-merge": "2"

V1 Table

Manifest-list

5th manifest-list

{
    "manifest_path": "/tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro",
    "manifest_length": 6878,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 5,
    "min_sequence_number": 1,
    "added_snapshot_id": 6508090689697406000,
    "added_files_count": 1,
    "existing_files_count": 4,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 12,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

4th manifest-list

{
    "manifest_path": "/tmp/some.db/table/metadata/88807344-0e23-413c-827e-2a9ec63c6233-m1.avro",
    "manifest_length": 6436,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 4,
    "min_sequence_number": 1,
    "added_snapshot_id": 3455109142449701000,
    "added_files_count": 1,
    "existing_files_count": 3,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 9,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

Manifests

We have 5 manifests as expected:

avro-tools tojson /tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro | wc -l 
       5

Last one:

{
    "status": 1,
    "snapshot_id": {
        "long": 6508090689697406000
    },
    "data_sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

First one:

{
    "status": 0,
    "snapshot_id": {
        "long": 6508090689697406000
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

This looks good, except for one thing: the snapshot_id is off, as from the spec:

Snapshot id where the file was added, or deleted if status is 2. Inherited when null.

This should be the ID of the first append operation.

V2 Table

Manifest list

5th manifest-list

{
    "manifest_path": "/tmp/some.db/tablev2/metadata/93717a88-1cea-4e3d-a69a-00ce3d087822-m1.avro",
    "manifest_length": 6883,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 5,
    "min_sequence_number": 1,
    "added_snapshot_id": 898025966831056900,
    "added_files_count": 1,
    "existing_files_count": 4,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 12,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

4th manifest-list

{
    "manifest_path": "/tmp/some.db/tablev2/metadata/5c64a07c-4b8a-4be1-a751-d4fd339560e2-m0.avro",
    "manifest_length": 5127,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 1343032504684197000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

Manifests

last manifest file in manifest-list

{
    "status": 1,
    "snapshot_id": {
        "long": 898025966831056900
    },
    "data_sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

First manifest in manifest-list

{
    "status": 0,
    "snapshot_id": {
        "long": 898025966831056900
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

Except for the snapshot-id and #893 this looks great! 🥳

Fokko · 2024-07-04T19:05:29Z

Another test with commit.manifest.min-count-to-merge set to 100, and doing 500 append operations:

avro-tools tojson /tmp/some.db/woooo/metadata/snap-3952911087333379496-0-27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5.avro        
{"manifest_path":"/tmp/some.db/woooo/metadata/27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5-m0.avro","manifest_length":5125,"partition_spec_id":0,"content":0,"sequence_number":500,"min_sequence_number":500,"added_snapshot_id":3952911087333379496,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/dac5af38-f01b-4a59-9e4c-b14a26706e75-m0.avro","manifest_length":5126,"partition_spec_id":0,"content":0,"sequence_number":499,"min_sequence_number":499,"added_snapshot_id":8943105647176444976,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/ed164af5-dda7-4e3e-9b67-fcb2fd78771b-m0.avro","manifest_length":5125,"partition_spec_id":0,"content":0,"sequence_number":498,"min_sequence_number":498,"added_snapshot_id":723002263384967579,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/e2d3e14e-8caf-4ca0-9515-c0a19c2a5658-m0.avro","manifest_length":5126,"partition_spec_id":0,"content":0,"sequence_number":497,"min_sequence_number":497,"added_snapshot_id":6977509396340474362,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/3cc77cfe-b68c-4071-9f70-41cc3933f0af-m1.avro","manifest_length":222800,"partition_spec_id":0,"content":0,"sequence_number":496,"min_sequence_number":1,"added_snapshot_id":7132518699806947299,"added_files_count":1,"existing_files_count":495,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":1485,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}

I don't think it merges the manifests as it should:

➜  iceberg-python git:(manifest_compaction) avro-tools tojson /tmp/some.db/woooo/metadata/3cc77cfe-b68c-4071-9f70-41cc3933f0af-m1.avro | wc -l                 
24/07/04 21:04:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
     496
➜  iceberg-python git:(manifest_compaction) avro-tools tojson /tmp/some.db/woooo/metadata/27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5-m0.avro | wc -l
24/07/04 21:04:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
       1

I would expect the manifest-entries to be distributed more evenly over the manifests to ensure maximum parallelization.

# Conflicts: # pyiceberg/table/__init__.py # tests/integration/test_writes/test_writes.py

HonahX · 2024-07-10T07:59:12Z

Another test with commit.manifest.min-count-to-merge set to 100, and doing 500 append operations:

I think the observed behavior aligns with Java's merge_append. Each time we do one append, we add one manifest. At 100th append, when the number of manifest reach 100, the merge manager merge all of them to a new manifest file because they are all in the same "bin". This happens whenever the number of manifest reach 100, thus leaving us with a large manifest and 4 small ones.

I use spark to do the similar thing and get a similar result

@pytest.mark.integration
def test_spark_ref_behavior(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
    identifier = "default.test_spark_ref_behavior"
    tbl = _create_table(session_catalog, identifier,
                        {"commit.manifest-merge.enabled": "true", "commit.manifest.min-count-to-merge": "10", "format-version": 2}, [])
    spark_df = spark.createDataFrame(arrow_table_with_null.to_pandas())

    for i in range(50):
        spark_df.writeTo(f"integration.{identifier}").append()
    tbl = session_catalog.load_table(identifier)
    tbl_a_manifests = tbl.current_snapshot().manifests(tbl.io)
    for manifest in tbl_a_manifests:
        print(
            f"Manifest: added: {manifest.added_files_count}, existing: {manifest.existing_files_count}, deleted: {manifest.deleted_files_count}")
=====
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 135, deleted: 0

To distribute manifest entries more evenly, I think we need to adjust the commit.manifest.target-size-bytes accordingly since this property controls the size of the bin.

I think this also reveal the value of the fast_append + compaction model, which make things more explicit

HonahX · 2024-07-10T08:02:16Z

tests/integration/test_writes/test_writes.py

+        assert tbl_a_data_file["file_path"].startswith("s3://warehouse/default/merge_manifest_a/data/")
+        if tbl_a_data_file["file_path"] == first_data_file_path:
+            # verify that the snapshot id recorded should be the one where the file was added
+            assert tbl_a_entries["snapshot_id"][i] == first_snapshot_id


added a test to verify the snapshot_id issue

Fokko · 2024-07-10T10:55:37Z

To distribute manifest entries more evenly, I think we need to adjust the commit.manifest.target-size-bytes accordingly since this property controls the size of the bin.

Thanks, that makes actually a lot of sense 👍

Fokko · 2024-07-10T10:56:31Z

Whoo 🥳 Thanks @HonahX for working on this, and thanks @syun64 for the review 🙌

@HonahX

commit 1ed3abd Author: Sung Yun <[email protected]> Date: Wed Jul 17 02:04:52 2024 -0400 Allow writing `pa.Table` that are either a subset of table schema or in arbitrary order, and support type promotion on write (apache#921) * merge * thanks @HonahX :) Co-authored-by: Honah J. <[email protected]> * support promote * revert promote * use a visitor * support promotion on write * fix * Thank you @Fokko ! Co-authored-by: Fokko Driesprong <[email protected]> * revert * add-files promotiontest * support promote for add_files * add tests for uuid * add_files subset schema test --------- Co-authored-by: Honah J. <[email protected]> Co-authored-by: Fokko Driesprong <[email protected]> commit 0f2e19e Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jul 15 23:25:08 2024 -0700 Bump zstandard from 0.22.0 to 0.23.0 (apache#934) Bumps [zstandard](https://github.com/indygreg/python-zstandard) from 0.22.0 to 0.23.0. - [Release notes](https://github.com/indygreg/python-zstandard/releases) - [Changelog](https://github.com/indygreg/python-zstandard/blob/main/docs/news.rst) - [Commits](indygreg/python-zstandard@0.22.0...0.23.0) --- updated-dependencies: - dependency-name: zstandard dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit ec73d97 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jul 15 23:24:47 2024 -0700 Bump griffe from 0.47.0 to 0.48.0 (apache#933) Bumps [griffe](https://github.com/mkdocstrings/griffe) from 0.47.0 to 0.48.0. - [Release notes](https://github.com/mkdocstrings/griffe/releases) - [Changelog](https://github.com/mkdocstrings/griffe/blob/main/CHANGELOG.md) - [Commits](mkdocstrings/griffe@0.47.0...0.48.0) --- updated-dependencies: - dependency-name: griffe dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit d05a423 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jul 15 23:24:16 2024 -0700 Bump mkdocs-material from 9.5.28 to 9.5.29 (apache#932) Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.28 to 9.5.29. - [Release notes](https://github.com/squidfunk/mkdocs-material/releases) - [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG) - [Commits](squidfunk/mkdocs-material@9.5.28...9.5.29) --- updated-dependencies: - dependency-name: mkdocs-material dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit e27cd90 Author: Yair Halevi (Spock) <[email protected]> Date: Sun Jul 14 22:11:04 2024 +0300 Allow empty `names` in mapped field of Name Mapping (apache#927) * Remove check_at_least_one field validator Iceberg spec permits an emtpy list of names in the default name mapping. check_at_least_one is therefore unnecessary. * Remove irrelevant test case * Fixing pydantic model No longer requiring minimum length of names list to be 1. * Added test case for empty names in name mapping * Fixed formatting error commit 3f44dfe Author: Soumya Ghosh <[email protected]> Date: Sun Jul 14 00:35:38 2024 +0530 Lowercase bool values in table properties (apache#924) commit b11cdb5 Author: Sung Yun <[email protected]> Date: Fri Jul 12 16:45:04 2024 -0400 Deprecate to_requested_schema (apache#918) * deprecate to_requested_schema * prep for release commit a3dd531 Author: Honah J <[email protected]> Date: Fri Jul 12 13:14:40 2024 -0700 Glue endpoint config variable, continue apache#530 (apache#920) Co-authored-by: Seb Pretzer <[email protected]> commit 32e8f88 Author: Sung Yun <[email protected]> Date: Fri Jul 12 15:26:00 2024 -0400 support PyArrow timestamptz with Etc/UTC (apache#910) Co-authored-by: Fokko Driesprong <[email protected]> commit f6d56e9 Author: Sung Yun <[email protected]> Date: Fri Jul 12 05:31:06 2024 -0400 fix invalidation logic (apache#911) commit 6488ad8 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Jul 11 22:56:48 2024 -0700 Bump coverage from 7.5.4 to 7.6.0 (apache#917) Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.4 to 7.6.0. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.5.4...7.6.0) --- updated-dependencies: - dependency-name: coverage dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit dceedfa Author: Sung Yun <[email protected]> Date: Thu Jul 11 20:32:14 2024 -0400 Check if schema is compatible in `add_files` API (apache#907) Co-authored-by: Fokko Driesprong <[email protected]> commit aceed2a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Jul 11 15:52:06 2024 +0200 Bump mypy-boto3-glue from 1.34.136 to 1.34.143 (apache#912) Bumps [mypy-boto3-glue](https://github.com/youtype/mypy_boto3_builder) from 1.34.136 to 1.34.143. - [Release notes](https://github.com/youtype/mypy_boto3_builder/releases) - [Commits](https://github.com/youtype/mypy_boto3_builder/commits) --- updated-dependencies: - dependency-name: mypy-boto3-glue dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 1b9b884 Author: Fokko Driesprong <[email protected]> Date: Thu Jul 11 12:45:20 2024 +0200 PyArrow: Don't enforce the schema when reading/writing (apache#902) * PyArrow: Don't enforce the schema PyIceberg struggled with the different type of arrow, such as the `string` and `large_string`. They represent the same, but are different under the hood. My take is that we should hide these kind of details from the user as much as possible. Now we went down the road of passing in the Iceberg schema into Arrow, but when doing this, Iceberg has to decide if it is a large or non-large type. This PR removes passing down the schema in order to let Arrow decide unless: - The type should be evolved - In case of re-ordering, we reorder the original types * WIP * Reuse Table schema * Make linter happy * Squash some bugs * Thanks Sung! Co-authored-by: Sung Yun <[email protected]> * Moar code moar bugs * Remove the variables wrt file sizes * Linting * Go with large ones for now * Missed one there! --------- Co-authored-by: Sung Yun <[email protected]> commit 8f47dfd Author: Soumya Ghosh <[email protected]> Date: Thu Jul 11 11:52:55 2024 +0530 Move determine_partitions and helper methods to io.pyarrow (apache#906) commit 5aa451d Author: Soumya Ghosh <[email protected]> Date: Thu Jul 11 07:57:05 2024 +0530 Rename data_sequence_number to sequence_number in ManifestEntry (apache#900) commit 77a07c9 Author: Honah J <[email protected]> Date: Wed Jul 10 03:56:13 2024 -0700 Support MergeAppend operations (apache#363) * add ListPacker + tests * add merge append * add merge_append * fix snapshot inheritance * test manifest file and entries * add doc * fix lint * change test name * address review comments * rename _MergingSnapshotProducer to _SnapshotProducer * fix a serious bug * update the doc * remove merge_append as public API * make default to false * add test description * fix merge conflict * fix snapshot_id issue commit 66b92ff Author: Fokko Driesprong <[email protected]> Date: Wed Jul 10 10:09:20 2024 +0200 GCS: Fix incorrect token description (apache#909) commit c25e080 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue Jul 9 20:50:29 2024 -0700 Bump zipp from 3.17.0 to 3.19.1 (apache#905) Bumps [zipp](https://github.com/jaraco/zipp) from 3.17.0 to 3.19.1. - [Release notes](https://github.com/jaraco/zipp/releases) - [Changelog](https://github.com/jaraco/zipp/blob/main/NEWS.rst) - [Commits](jaraco/zipp@v3.17.0...v3.19.1) --- updated-dependencies: - dependency-name: zipp dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 301e336 Author: Sung Yun <[email protected]> Date: Tue Jul 9 23:35:11 2024 -0400 Cast 's', 'ms' and 'ns' PyArrow timestamp to 'us' precision on write (apache#848) commit 3f574d3 Author: Fokko Driesprong <[email protected]> Date: Tue Jul 9 11:36:43 2024 +0200 Support partial deletes (apache#569) * Add option to delete datafiles This is done through the Iceberg metadata, resulting in efficient deletes if the data is partitioned correctly * Pull in main * WIP * Change DataScan to accept Metadata and io For the partial deletes I want to do a scan on in memory metadata. Changing this API allows this. * fix name-mapping issue * WIP * WIP * Moar tests * Oops * Cleanup * WIP * WIP * Fix summary generation * Last few bits * Fix the requirement * Make ruff happy * Comments, thanks Kevin! * Comments * Append rather than truncate * Fix merge conflicts * Make the tests pass * Add another test * Conflicts * Add docs (apache#33) * docs * docs * Add a partitioned overwrite test * Fix comment * Skip empty manifests --------- Co-authored-by: HonahX <[email protected]> Co-authored-by: Sung Yun <[email protected]> commit cdc3e54 Author: Fokko Driesprong <[email protected]> Date: Tue Jul 9 08:28:27 2024 +0200 Disallow writing empty Manifest files (apache#876) * Disallow writing empty Avro files/blocks Raising an exception when doing this might look extreme, but there is no real good reason to allow this. * Relax the constaints a bit commit b68e109 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jul 8 22:16:23 2024 -0700 Bump fastavro from 1.9.4 to 1.9.5 (apache#904) Bumps [fastavro](https://github.com/fastavro/fastavro) from 1.9.4 to 1.9.5. - [Release notes](https://github.com/fastavro/fastavro/releases) - [Changelog](https://github.com/fastavro/fastavro/blob/master/ChangeLog) - [Commits](fastavro/fastavro@1.9.4...1.9.5) --- updated-dependencies: - dependency-name: fastavro dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 90547bb Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jul 8 22:15:39 2024 -0700 Bump moto from 5.0.10 to 5.0.11 (apache#903) Bumps [moto](https://github.com/getmoto/moto) from 5.0.10 to 5.0.11. - [Release notes](https://github.com/getmoto/moto/releases) - [Changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md) - [Commits](getmoto/moto@5.0.10...5.0.11) --- updated-dependencies: - dependency-name: moto dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 7dff359 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Jul 7 07:50:19 2024 +0200 Bump tenacity from 8.4.2 to 8.5.0 (apache#898) commit 4aa469e Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sat Jul 6 22:30:59 2024 +0200 Bump certifi from 2024.2.2 to 2024.7.4 (apache#899) Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4. - [Commits](certifi/python-certifi@2024.02.02...2024.07.04) --- updated-dependencies: - dependency-name: certifi dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit aa7ad78 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sat Jul 6 20:37:51 2024 +0200 Bump deptry from 0.16.1 to 0.16.2 (apache#897) Bumps [deptry](https://github.com/fpgmaas/deptry) from 0.16.1 to 0.16.2. - [Release notes](https://github.com/fpgmaas/deptry/releases) - [Changelog](https://github.com/fpgmaas/deptry/blob/main/CHANGELOG.md) - [Commits](fpgmaas/deptry@0.16.1...0.16.2) --- updated-dependencies: - dependency-name: deptry dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

HonahX mentioned this pull request Feb 4, 2024

Implement Centralized Management of Table Properties #365

Closed

Fokko reviewed Feb 5, 2024

View reviewed changes

Fokko added this to the PyIceberg 0.7.0 release milestone Feb 7, 2024

Fokko reviewed Feb 8, 2024

View reviewed changes

HonahX changed the title ~~Support merge manifests on writes~~ Support merge manifests on writes (MergeAppend) Feb 23, 2024

HonahX marked this pull request as ready for review February 26, 2024 10:51

HonahX mentioned this pull request Feb 27, 2024

Support metadata compaction #270

Open

HonahX commented Mar 3, 2024

View reviewed changes

tests/conftest.py Outdated Show resolved Hide resolved

Fokko requested changes Mar 13, 2024

View reviewed changes

sungwy reviewed Mar 22, 2024

View reviewed changes

Fokko mentioned this pull request Mar 27, 2024

Add entries metadata table #551

Merged

jqin61 mentioned this pull request Apr 3, 2024

Support partial deletes #569

Merged

Fokko mentioned this pull request Apr 10, 2024

Implement rolling manifest-writers #596

Closed

HonahX added 4 commits June 2, 2024 18:39

add ListPacker + tests

6803eba

add merge append

f0fc260

add merge_append

cbb8cec

fix snapshot inheritance

bf63c03

HonahX force-pushed the manifest_compaction branch from 57eba6a to bf63c03 Compare June 3, 2024 07:29

HonahX added 4 commits June 3, 2024 22:16

test manifest file and entries

9dd69af

add doc

4921a7f

fix lint

984ca41

change test name

8510f71

HonahX requested review from Fokko and sungwy June 4, 2024 06:53

HonahX commented Jun 4, 2024

View reviewed changes

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved

sungwy reviewed Jun 4, 2024

View reviewed changes

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved

HonahX added 3 commits June 9, 2024 22:49

Merge branch 'main' into manifest_compaction

1ec5edd

address review comments

a7da318

rename _MergingSnapshotProducer to _SnapshotProducer

c4feda5

sungwy approved these changes Jun 15, 2024

View reviewed changes

Merge branch 'main' into manifest_compaction

7e6e1d4

Fokko reviewed Jun 30, 2024

View reviewed changes

Fokko mentioned this pull request Jun 30, 2024

Disallow writing empty Manifest files #876

Merged

HonahX added 2 commits July 1, 2024 01:14

Merge branch 'main' into manifest_compaction

123f5d3

fix a serious bug

9777e9b

HonahX commented Jul 1, 2024

View reviewed changes

HonahX added 5 commits July 1, 2024 21:37

Merge branch 'main' into manifest_compaction

3393757

update the doc

66dddbe

remove merge_append as public API

aff1bea

make default to false

7625857

add test description

3e3a1b4

HonahX added 3 commits July 9, 2024 22:39

Merge branch 'main' into manifest_compaction

914d6ef

# Conflicts: # pyiceberg/table/__init__.py # tests/integration/test_writes/test_writes.py

fix merge conflict

71a5fe0

fix snapshot_id issue

c7e4095

HonahX commented Jul 10, 2024

View reviewed changes

Fokko approved these changes Jul 10, 2024

View reviewed changes

Fokko merged commit 77a07c9 into apache:main Jul 10, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support merge manifests on writes (MergeAppend) #363

Support merge manifests on writes (MergeAppend) #363

HonahX commented Feb 4, 2024 •

edited

Loading

Fokko left a comment

Fokko Feb 5, 2024

HonahX Feb 6, 2024

Fokko Feb 8, 2024

HonahX Feb 20, 2024

Fokko left a comment •

edited

Loading

Fokko Mar 13, 2024

Fokko Mar 13, 2024

sungwy left a comment

sungwy Mar 22, 2024

Fokko Mar 26, 2024

sungwy Mar 26, 2024

HonahX commented Jun 3, 2024

HonahX left a comment

sungwy left a comment

sungwy left a comment

Fokko commented Jun 30, 2024 •

edited

Loading

Fokko left a comment

Fokko Jun 30, 2024

HonahX Jul 2, 2024

Fokko Jul 4, 2024

Fokko Jul 4, 2024

HonahX Jul 1, 2024 •

edited

Loading

Fokko Jul 4, 2024

Fokko commented Jul 4, 2024

Fokko commented Jul 4, 2024

HonahX commented Jul 10, 2024 •

edited

Loading

HonahX Jul 10, 2024 •

edited

Loading

Fokko commented Jul 10, 2024

Fokko commented Jul 10, 2024

		@@ -355,6 +355,44 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
		assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


		@pytest.mark.integration

Support merge manifests on writes (MergeAppend) #363

Support merge manifests on writes (MergeAppend) #363

Conversation

HonahX commented Feb 4, 2024 • edited Loading

Fokko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fokko left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sungwy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HonahX commented Jun 3, 2024

HonahX left a comment

Choose a reason for hiding this comment

sungwy left a comment

Choose a reason for hiding this comment

sungwy left a comment

Choose a reason for hiding this comment

Fokko commented Jun 30, 2024 • edited Loading

Fokko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HonahX Jul 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fokko commented Jul 4, 2024

V1 Table

Manifest-list

5th manifest-list

4th manifest-list

Manifests

Last one:

First one:

V2 Table

Manifest list

5th manifest-list

4th manifest-list

Manifests

last manifest file in manifest-list

First manifest in manifest-list

Fokko commented Jul 4, 2024

HonahX commented Jul 10, 2024 • edited Loading

HonahX Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

Fokko commented Jul 10, 2024

Fokko commented Jul 10, 2024

HonahX commented Feb 4, 2024 •

edited

Loading

Fokko left a comment •

edited

Loading

Fokko commented Jun 30, 2024 •

edited

Loading

HonahX Jul 1, 2024 •

edited

Loading

HonahX commented Jul 10, 2024 •

edited

Loading

HonahX Jul 10, 2024 •

edited

Loading