Fix WARC headers parsing when record has `Content-Length: 0` and record after it. #42

bocharov · 2024-11-07T19:39:32Z

Validated against Python implementation: https://github.com/webrecorder/warcio

Also fixed the flaky test `record::verify_display` by sorting header names.
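For readers unfamiliar with the failure shape, here is a minimal sketch of the kind of stream this PR is about: a record with `Content-Length: 0` immediately followed by another record. It assumes the standard WARC framing (version line, header lines, blank line, body, then a terminating CRLF CRLF). `find`, `split_records`, and `sample_stream` are hypothetical helpers written only for this illustration; they are not the crate's parser and do not reproduce the code changed in this PR.

```rust
// Hypothetical, simplified record splitter. NOT the warc crate's parser.
// Assumed framing: header block, blank line, `Content-Length` body bytes,
// then a CRLF CRLF record terminator.

/// Returns the index of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Splits a byte stream into (header block, body) pairs.
fn split_records(mut input: &[u8]) -> Result<Vec<(String, Vec<u8>)>, String> {
    let mut records = Vec::new();
    while !input.is_empty() {
        // The header block ends at the first blank line.
        let header_end = find(input, b"\r\n\r\n").ok_or("unterminated header block")?;
        let header = String::from_utf8_lossy(&input[..header_end]).into_owned();
        input = &input[header_end + 4..];

        // Look up Content-Length; header names are matched case-insensitively.
        let len = header
            .lines()
            .filter_map(|line| line.split_once(':'))
            .find(|(name, _)| name.trim().eq_ignore_ascii_case("content-length"))
            .ok_or("missing Content-Length")?
            .1
            .trim()
            .parse::<usize>()
            .map_err(|_| "invalid Content-Length")?;

        // Take exactly `len` body bytes, which may be zero bytes.
        if input.len() < len {
            return Err("truncated body".into());
        }
        let body = input[..len].to_vec();
        input = &input[len..];

        // The terminator is consumed for every record, empty body or not,
        // so the next record starts at its own version line.
        if !input.starts_with(b"\r\n\r\n") {
            return Err("missing record terminator".into());
        }
        input = &input[4..];

        records.push((header, body));
    }
    Ok(records)
}

/// Two records: the first has an empty body, the second carries "hello".
fn sample_stream() -> Vec<u8> {
    let mut stream = Vec::new();
    stream.extend_from_slice(b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n");
    stream.extend_from_slice(b"\r\n\r\n"); // terminator after the empty body
    stream.extend_from_slice(b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\n");
    stream.extend_from_slice(b"hello\r\n\r\n");
    stream
}

#[test]
fn zero_length_record_followed_by_another() {
    let records = split_records(&sample_stream()).expect("both records should parse");
    assert_eq!(records.len(), 2);
    assert!(records[0].1.is_empty());
    assert_eq!(records[1].1, b"hello".to_vec());
}
```

The point of the sketch is that the empty-body record goes through the same "read body, then consume terminator" path as any other record, so the second record's headers are not misread.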

bocharov · 2024-11-11T20:42:07Z

Hi @jedireza! I'd appreciate your review on this PR, thanks!

jedireza

Thanks for creating a PR.

> Fix WARC headers parsing when record has `Content-Length: 0` and record after it.

So is there a bug you're trying to fix? If so, could we first add a failing test case?

> Also fixed flaky test record::verify_display by sorting header names.

I didn't know this was a flaky test, but I was able to confirm it. I'm not sure we should incur the cost of a sort on every display.

Here is another way to make `verify_display` stable:

    fn verify_display() {
        let header_entries = vec![
            (WarcHeader::WarcType, b"dunno".to_vec()),
            (WarcHeader::Date, b"2024-01-01T00:00:00Z".to_vec()),
        ];

        let headers = RawRecordHeader {
            version: "1.0".to_owned(),
            headers: header_entries.into_iter().collect(),
        };

        let output = headers.to_string();

        let expected_lines = vec![
            "WARC/1.0",
            "warc-type: dunno",
            "warc-date: 2024-01-01T00:00:00Z",
            "",
        ];
        let actual_lines: Vec<_> = output.lines().collect();

        let mut expected_headers: Vec<_> = expected_lines[1..expected_lines.len() - 1].to_vec();
        expected_headers.sort();

        let mut actual_headers: Vec<_> = actual_lines[1..actual_lines.len() - 1].to_vec();
        actual_headers.sort();

        // verify parts
        assert_eq!(actual_lines[0], expected_lines[0]); // WARC version
        assert_eq!(actual_headers, expected_headers); // headers (sorted)
        assert_eq!(actual_lines.last(), expected_lines.last()); // empty line
    }

bocharov · 2024-11-12T21:46:50Z

@jedireza Thanks for coming back on this!

The test case `zero_and_nonzero_content_length` demonstrates the problem: if you roll the `next` method back to its original logic, `let mut found_body = expected_body_len == 0;`, the test fails.

Regarding `verify_display`, I don't feel strongly about the specific way to fix the flaky test, so I went with your suggestion of sorting the headers in the test.

jedireza

Thanks for the changes! This makes sense to me now. Had to reacquaint myself with this code and warc a bit.

bocharov · 2024-11-13T18:12:32Z

Thanks for the quick response and the PR approval! I'd appreciate it if you could merge it and release a new version of the crate, so that I can update the dependencies in my projects instead of using a fork.

jedireza · 2024-11-14T04:34:29Z

Going to merge this and make it a patch release.

jedireza · 2024-11-14T04:41:40Z

Published as 0.3.3

https://crates.io/crates/warc

jedireza reviewed Nov 12, 2024

jedireza mentioned this pull request Nov 12, 2024

Upgrade nom dependency to 7.1.3 #41

Open

bocharov force-pushed the alex/warc-parser-fixes branch from 3e4bbd1 to 7f662f8 on November 12, 2024 21:45

Fix WARC headers parsing when record has Content-Length: 0 and reco…

7d7a2d4

…rd after it. Validated against Python implementation: https://github.com/webrecorder/warcio Also fixed flaky test `record::verify_display` by sorting header names in the test.

bocharov force-pushed the alex/warc-parser-fixes branch from 7f662f8 to 7d7a2d4 on November 12, 2024 21:51

jedireza approved these changes Nov 13, 2024

jhwgh1968 approved these changes Nov 14, 2024

jedireza merged commit 0aac9af into jedireza:master Nov 14, 2024
8 checks passed
