Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: gurantee the deserialize order of struct is same as the struct type #795

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Dec 13, 2024

We should deserialize according to order of struct type rather than the deserialize value.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 13, 2024

cc @liurenjie1024 @Fokko @Xuanwo @sdd

])),
&Type::Struct(StructType::new(vec![
NestedField::required(2, "id", Type::Primitive(PrimitiveType::Int)).into(),
NestedField::optional(3, "name", Type::Primitive(PrimitiveType::String)).into(),
NestedField::optional(4, "address", Type::Primitive(PrimitiveType::String)).into(),
NestedField::required(5, "extra", Type::Primitive(PrimitiveType::Int)).into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, would you like to add a new test that cover the mis-order cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I find that it can also pass originally. I'm trying to find the test case that can't pass originally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I find that it can also pass originally. I'm trying to find the test case that can't pass originally.

Thank you, that will be really meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found the reason why this can pass originally: the avro writer will ensure the record order according to the schema: https://github.com/apache/avro-rs/blob/390a150bfc5999eb852c9c0ef40335612f1407b5/avro/src/encode.rs#L247.

However, if we serialize into other formats, e.g. json, the order can't be guaranteed.

let deserialized: RawLiteral = serde_json::from_str(&serialized).unwrap();
let deserialized_literal = deserialized.try_into(&fields).unwrap().unwrap();

assert_eq!(expected_literal, deserialized_literal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the whole serialization is off, this should be done by ID instead of name:

I checked out the branch, and it is currently by name:

{
    "id": 1,
    "extra": 1000,
    "name": "bar",
    "address": null
}

While it should be:

{
    "2": 1,
    "3": "bar",
    "4": null,
    "5": 1000
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not same as JSON single-value serialization, th JSON single-value serialization has the specific implementation

(Literal::Struct(s), Type::Struct(struct_type)) => {
.
This test case is just to test the normal Serialize implementation, internally it mainly used in avro format. See https://docs.rs/avro-rs/latest/avro_rs/types/struct.Record.html, that's why here record store name and value.

Here I serialize it into json type is to test the reorder case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I'm a bit confused about why we need to care about this. [De]serialization is a very format-specific task, and it's really challenging to ensure our implementations meet all format requirements. I'm a bit concerned about the additional cost we incur to achieve this. Doesn't it seem fine as long as it works well with Avro?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, missed that. Let me unblock this for now 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark unresolve for my newly added comment: #795 (comment)

Sorry @Fokko 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, around the same time! I share your concern, and I would like to check later on if we do the field-ID projection properly, but I didn't want to block the release 👍

For Avro it is very simple, it will always be decoded in the same order as the schema (otherwise it will just break). That said, we can rely on the order for V1, but use field-ID-based projection for V2 tables.

@sungwy sungwy added this to the 0.4.0 Release milestone Dec 16, 2024
@sungwy sungwy modified the milestones: 0.4.0 Release , 0.5.0 Release Dec 16, 2024
@Fokko Fokko self-requested a review December 16, 2024 20:49
@Xuanwo Xuanwo removed this from the 0.5.0 Release milestone Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants