fix: gurantee the deserialize order of struct is same as the struct type #795

ZENOTME · 2024-12-13T16:15:48Z

We should deserialize struct fields in the order defined by the struct type rather than the order in which they appear in the deserialized value.

ZENOTME · 2024-12-13T16:16:12Z

Xuanwo · 2024-12-14T06:23:02Z

crates/iceberg/src/spec/values.rs

            ])),
            &Type::Struct(StructType::new(vec![
                NestedField::required(2, "id", Type::Primitive(PrimitiveType::Int)).into(),
                NestedField::optional(3, "name", Type::Primitive(PrimitiveType::String)).into(),
                NestedField::optional(4, "address", Type::Primitive(PrimitiveType::String)).into(),
+                NestedField::required(5, "extra", Type::Primitive(PrimitiveType::Int)).into(),


Hi, would you like to add a new test that covers the mis-ordered cases?

Yes, but I found that such a test also passes with the original code. I'm trying to find a test case that fails without this fix.

Thank you, that will be really meaningful.

I found the reason why the test also passes with the original code: the Avro writer always writes record fields in the order given by the schema: https://github.com/apache/avro-rs/blob/390a150bfc5999eb852c9c0ef40335612f1407b5/avro/src/encode.rs#L247.

However, if we serialize into other formats, e.g. json, the order can't be guaranteed.
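To illustrate the reordering idea, here is a minimal sketch that puts `(name, value)` pairs arriving in arbitrary order back into the order of the struct type. It is not the code in this PR: `Field`, `Value`, `example_schema`, and `reorder_by_schema` are hypothetical stand-ins rather than iceberg-rust types, and the sketch assumes a missing optional field should become null.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a struct field (not iceberg-rust's `NestedField`).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: &'static str,
    required: bool,
}

/// Hypothetical stand-in for a literal value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
    Null,
}

/// The struct type used in the test above: id, name, address, extra.
fn example_schema() -> Vec<Field> {
    vec![
        Field { id: 2, name: "id", required: true },
        Field { id: 3, name: "name", required: false },
        Field { id: 4, name: "address", required: false },
        Field { id: 5, name: "extra", required: true },
    ]
}

/// Return the record's values in schema order, regardless of input order.
/// A missing optional field becomes `Value::Null` (an assumption of this sketch);
/// a missing required field is an error.
fn reorder_by_schema(
    schema: &[Field],
    record: Vec<(String, Value)>,
) -> Result<Vec<Value>, String> {
    let mut by_name: HashMap<String, Value> = record.into_iter().collect();
    schema
        .iter()
        .map(|field| match by_name.remove(field.name) {
            Some(value) => Ok(value),
            None if !field.required => Ok(Value::Null),
            None => Err(format!(
                "missing required field `{}` (id {})",
                field.name, field.id
            )),
        })
        .collect()
}
```

Feeding it the pairs in the order `id`, `extra`, `name`, `address` yields the values in the order `id`, `name`, `address`, `extra`, which is the order of the struct type.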

Fokko · 2024-12-16T14:30:16Z

crates/iceberg/src/spec/values.rs

+        let deserialized: RawLiteral = serde_json::from_str(&serialized).unwrap();
+        let deserialized_literal = deserialized.try_into(&fields).unwrap().unwrap();
+
+        assert_eq!(expected_literal, deserialized_literal);


It looks like the whole serialization is off; this should be done by ID instead of by name:

I checked out the branch, and it is currently by name:

{ "id": 1, "extra": 1000, "name": "bar", "address": null }

While it should be:

{ "2": 1, "3": "bar", "4": null, "5": 1000 }

This is not the same as JSON single-value serialization. JSON single-value serialization has its own specific implementation:

iceberg-rust/crates/iceberg/src/spec/values.rs

Line 1993 in 813c2b5

(Literal::Struct(s), Type::Struct(struct_type)) => {

This test case only exercises the normal Serialize implementation, which internally is mainly used for the Avro format. See https://docs.rs/avro-rs/latest/avro_rs/types/struct.Record.html; that's why the record here stores names and values.

I serialize it to JSON here only to test the reordering case.

Hi, I'm a bit confused about why we need to care about this. [De]serialization is a very format-specific task, and it's really challenging to ensure our implementations meet all format requirements. I'm a bit concerned about the additional cost we incur to achieve this. Doesn't it seem fine as long as it works well with Avro?

Ah, missed that. Let me unblock this for now 👍

Marking this as unresolved because of my newly added comment: #795 (comment)

Sorry @Fokko 😄

Ah, around the same time! I share your concern, and I would like to check later on if we do the field-ID projection properly, but I didn't want to block the release 👍

For Avro it is very simple, it will always be decoded in the same order as the schema (otherwise it will just break). That said, we can rely on the order for V1, but use field-ID-based projection for V2 tables.
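As a rough sketch of what field-ID-based projection could look like, the snippet below matches the fields of a table schema to their positions in a file schema by ID, so a renamed or reordered field still resolves. `FieldRef`, `project_by_id`, and `apply_projection` are hypothetical names for illustration and do not describe iceberg-rust's actual projection code.

```rust
use std::collections::HashMap;

/// Hypothetical (field ID, name) pair; not an iceberg-rust type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct FieldRef {
    id: i32,
    name: &'static str,
}

/// For each table-schema field, find its position in the file schema by field ID.
/// `None` means the file does not contain that field.
fn project_by_id(table: &[FieldRef], file: &[FieldRef]) -> Vec<Option<usize>> {
    let positions: HashMap<i32, usize> = file
        .iter()
        .enumerate()
        .map(|(pos, field)| (field.id, pos))
        .collect();
    table
        .iter()
        .map(|field| positions.get(&field.id).copied())
        .collect()
}

/// Rearrange one row from file order into table-schema order.
/// Fields absent from the file come back as `None`.
fn apply_projection<T: Clone>(projection: &[Option<usize>], row: &[T]) -> Vec<Option<T>> {
    projection
        .iter()
        .map(|&pos| pos.and_then(|p| row.get(p).cloned()))
        .collect()
}
```

Because the lookup uses only the ID, the names in the two schemas may differ, and the order of fields in the file does not matter.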

Xuanwo reviewed Dec 14, 2024

gurantee the deserialize order of struct is same as the struct type

813c2b5

ZENOTME force-pushed the fix_raw_record branch from 744024d to 813c2b5 Compare December 14, 2024 07:15

ZENOTME mentioned this pull request Dec 16, 2024

Tracking issues of iceberg rust v0.4.0 Release #739

Closed

15 tasks

Fokko requested changes Dec 16, 2024

sungwy added this to the 0.4.0 Release milestone Dec 16, 2024

Fokko approved these changes Dec 16, 2024

sungwy modified the milestones: 0.4.0 Release , 0.5.0 Release Dec 16, 2024

Fokko self-requested a review December 16, 2024 20:49

Xuanwo removed this from the 0.5.0 Release milestone Dec 17, 2024
