Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bincode::serialize calls serialized_size, doubling runtime for nontrivial Serialize impls #401

Open
saethlin opened this issue Jul 31, 2021 · 5 comments

Comments

@saethlin
Copy link
Contributor

I'm reporting this because @gbaranski noticed that JSON was faster than bincode for serializing a structure. I don't think bincode::serialize needs to be particularly fast, but this is quite silly. The root cause of this is bincode::serialized_size, which (as best I can tell) fully serializes any type that isn't very simple.

I'm not certain if this is the intended behavior of SizeChecker; on a type like chrono::DateTime caliling bincode::serialize will serialize the object twice. Is this itself a bug?

Whether or not the speed of serialized_size can/should be improved, does bincode::serialize even need to call it? It would be nice to know the size we need ahead of time, but I really wonder if it's always faster to just call bincode::serialize_into and maybe call Vec::shrink_to_fit after.

Here's a program that demonstrates the situation:

#!/usr/bin/env rust-script
//! ```cargo
//! [dependencies]
//! bincode = "=1.3.3"
//! serde = { version = "=1.0.126", features = ["derive"] }
//! chrono = { version = "=0.4.19", features = ["serde"] }
//! ```

fn main() {
    let mut data = Vec::new();
    for _ in 0..100 {
        data.push(chrono::Utc::now());
    }
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        ser(&data);
    }
    println!("bincode::serialize {:?}", start.elapsed());

    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        ser_into(&data);
    }
    println!("bincode::serialize_into {:?}", start.elapsed());
}

#[inline(never)]
fn ser(data: &[chrono::DateTime<chrono::Utc>]) -> Vec<u8> {
    bincode::serialize(data).unwrap()
}

#[inline(never)]
fn ser_into(data: &[chrono::DateTime<chrono::Utc>]) -> Vec<u8> {
    let mut out = Vec::new();
    bincode::serialize_into(&mut out, data).unwrap();
    out
}
@ZoeyR
Copy link
Collaborator

ZoeyR commented Aug 12, 2021

Interesting that the serialize_size isn't optimized away when there is no size limit defined. I'll look into that more. The reason we have to call serialize_size is to make sure we don't accidentally cause an OOM from maliciously structured data. It might be more performant if we calculate the size as the serialization is happening. I'll try to look into it.

@ZoeyR
Copy link
Collaborator

ZoeyR commented Aug 12, 2021

Ah! I recall now, the reason that serialized_size was called is that in structs that are simpler to serialize, there was a noticeable performance boost from pre-allocating the vector to the appropriate size. If might make sense to stop doing that if it turns out we were wrong about which case was more common, simple serilization or complex. And for people that need the performance boost they can just call serialized_size and do the pre-allocation themselves. @saethlin do you have any data on which case would be more common?

@saethlin
Copy link
Contributor Author

Unfortunately I have no idea, because I'm not a user of bincode::serialize; I've always used this crate with a Write impl.

My only prior experience with serialized_size was when I was implementing a system that sent Vecs of objects serialized with bincode over UDP. I wanted to keep my UDP frames below the MTU, so I was manually calling serialized_size after every push. When I eventually looked at a profile the program spent all of its cycles in serialized_size.

I don't know what exactly causes serialized_size to collapse from the efficient SizeChecker methods that just increment an integer, to actually serializing the object. I think the question is how common that code path is.

@stale stale bot added the stale label Oct 12, 2021
@bincode-org bincode-org deleted a comment from stale bot Oct 12, 2021
@ZoeyR ZoeyR added not-stale and removed stale labels Oct 12, 2021
@gregates
Copy link

gregates commented Aug 27, 2023

I ran into this issue as well attempting to implement a custom serialize function for a type holding potentially gigabytes of data. We want to compress prior to serializing, and possibly chunk the data up, hence the custom serialize impl. But that's an expensive operation, so I can't afford to call serialize twice.

FWIW before I found this issue, I created a minimal repro to demonstrate that serialize is called twice by bincode 1.3.3, you can find it here: https://github.com/gregates/bincode_double_ser_repro/tree/mainline. Although it sounds like the reason for this behavior is already known.

In my case, I can just switch to serialize_into instead of serialize. But I was caught by surprise by this behavior and it took me a while to understand how to work around it.

@lrobidou
Copy link

lrobidou commented Jul 16, 2024

Hello,
I encountered the same problem. I was trying to dump a non-trivial structure (this structure can hold some GBs of data). I was surprised when I noticed the double call, I thought I messed up somewhere. Using serialize_into solved my issue, thank you @gregates !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants