Add an e2e test for writing an Iceberg table with pyiceberg and reading it with DataFusion #825

Open
gruuya wants to merge 3 commits into main from datafusion-integration-test

Conversation

@gruuya (Contributor) commented Dec 18, 2024

Besides serving as an e2e test, this also exercises quite a bit of the type conversion logic.

In addition, it demonstrates the issue described in #813.

@gruuya force-pushed the datafusion-integration-test branch from 1b03c42 to c261ddf on December 18, 2024 at 14:56
from datafusion import SessionContext
import pyarrow.parquet as pq

# Generate a table with various types in memory and dump to a Parquet file
ctx = SessionContext()
Contributor

I'm reluctant to let Datafusion generate these files, for two reasons:

  • These files are imported into an Iceberg table and are missing things like Field-IDs etc.
  • I'd rather depend purely on Spark, which uses the Java SDK underneath and is more complete than PyIceberg. The Java implementation is considered the reference implementation.

Contributor Author

missing things like Field-IDs etc.

If by that you mean PARQUET:field_id, note that pyiceberg actually decorates the fields with those during table creation; I've just pushed an addendum to the tests that demonstrates this.
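
A minimal sketch of that mechanism, assuming a locally running REST catalog at http://localhost:8181 (the endpoint, namespace and column names here are illustrative, not taken from this PR):

# Sketch: pyiceberg assigns field IDs when the table schema is created, and
# those IDs surface as PARQUET:field_id metadata on the Arrow fields it writes.
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import schema_to_pyarrow

catalog = load_catalog("rest", uri="http://localhost:8181")  # assumed endpoint

pa_schema = pa.schema([
    ("cboolean", pa.bool_()),
    ("cint16", pa.int16()),
    ("cint32", pa.int32()),
])

# Creating the table converts the Arrow schema to an Iceberg schema and
# assigns increasing field IDs.
tbl = catalog.create_table("default.types_test_demo", schema=pa_schema)
print(tbl.schema())

# Converting the Iceberg schema back to Arrow shows the PARQUET:field_id
# metadata that ends up on every field of the data files pyiceberg writes.
for field in schema_to_pyarrow(tbl.schema()):
    print(field.name, field.metadata)  # e.g. {b'PARQUET:field_id': b'2'}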

Granted, I'm not sure about the state of pyiceberg's compliance with the protocol.

Would you be more open to accepting this test if I switched to Java SDK for generating the Parquet files and creating the table?

Contributor

I also see value in this; let's hear what others think!


Generally I like the idea of using pyiceberg for testing, maybe also alongside Spark. This can help facilitate cross-platform interoperability,
but I'm not sure when we would use one versus the other.

Contributor Author

Alright, just an update: I've tried doing the same with pyspark, but it seems to be very conservative, either rejecting some types outright (e.g. Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))) or coercing them on its own.

For instance, if I just supply it with a Parquet file like types_test.parquet (the output of load_types_table.py) with type hints

message arrow_schema {
  optional boolean cboolean;
  optional int32 cint8 (INTEGER(8,true));
  optional int32 cint16 (INTEGER(16,true));
  optional int32 cint32;
  optional int64 cint64;
...

and do

parquet_df = spark.read.parquet("types_test.parquet")
parquet_df.writeTo("rest.default.types_test").using("iceberg").create()

the resulting Parquet data files have the incompatible type hints removed

message table {
  optional boolean cboolean = 1;
  optional int32 cint8 = 2;
  optional int32 cint16 = 3;
  optional int32 cint32 = 4;
  optional int64 cint64 = 5;
...

TLDR: I can't seem to replicate the issue in #814 with pyspark, though it could still be used to test out the non-corner cases.
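
For reference, a rough sketch of how a Parquet file with those problematic hints (narrow integer annotations plus a nanosecond timestamp) can be produced with pyarrow; this is only illustrative and not the PR's actual load_types_table.py:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "cboolean": pa.array([True, None], type=pa.bool_()),
    "cint8": pa.array([1, None], type=pa.int8()),    # stored as INT32 with an INTEGER(8,true) hint
    "cint16": pa.array([2, None], type=pa.int16()),  # stored as INT32 with an INTEGER(16,true) hint
    "cint32": pa.array([3, None], type=pa.int32()),
    "cint64": pa.array([4, None], type=pa.int64()),
    "ctimestamp_ns": pa.array([0, None], type=pa.timestamp("ns")),  # TIMESTAMP(NANOS,false)
})

# Parquet format version 2.6 keeps nanosecond timestamps instead of coercing
# them down to microseconds.
pq.write_table(table, "types_test.parquet", version="2.6")
print(pq.read_metadata("types_test.parquet").schema)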

Contributor Author

but I'm not sure when we would use one versus the other.

I think there's definitely merit in having an integration types test like the one proposed here. The basic one can be done with pyspark, though it might be a bonus (or not) that apparently pyiceberg can hit some corner cases which pyspark can't.

(fwiw, this is something that we ran into in our dev work, so it's not an end-user-reported issue.)


apparently pyiceberg can hit some corner cases which pyspark can't.

Interesting! Curious to hear about what you guys ran into.

I think pyspark is good for integration tests through the Spark API. PyIceberg allows more granular test cases, since you can work with Iceberg constructs.

Contributor Author

curious to hear about what you guys ran into.

Actually, the only one so far seems to be the Iceberg-from-Parquet table creation discussed above: if the Parquet file has, e.g., an Int32 field with an Int16 type hint, Spark will automatically coerce it to Int32 in the resulting data file, whereas pyiceberg will leave it as is (and so during the scan the actual record batches contain Int16, while the schema says it's Int32).
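
A small self-contained sketch of that mismatch (illustrative only, using pyarrow directly rather than the PR's test setup):

import pyarrow as pa
import pyarrow.parquet as pq

# An int16 column is written as physical INT32 with an INTEGER(16,true)
# logical hint, because Parquet has no 16-bit integer type.
pq.write_table(pa.table({"cint16": pa.array([1, 2, 3], type=pa.int16())}),
               "int16_hint.parquet")

# Reading the file back honours the hint and yields Arrow int16 batches,
# while the Iceberg table schema (which has no 16-bit type) declares the
# column as a 32-bit int, so a reader that trusts the Iceberg schema must
# promote the batches during the scan.
print(pq.read_table("int16_hint.parquet").schema)  # cint16: int16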

crates/integration_tests/testdata/pyiceberg/Dockerfile (outdated review thread, resolved)