Reading deflate file throw "Unexpected EOF" #837
Comments
A zip file is not simply a deflate-compressed file, but an archiving format with file tables etc.
The code should not be interpreted as "load a zip archive". The example is simplified. The
Okay, I see. How was the file created? If there are no headers or metadata about the deflate stream, it can be hard to debug why the file cannot be read, and it might be related to some unsupported feature in our deflate implementation.
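One quick way to see whether a data file carries any header at all is to test its first two bytes against the zlib container rules (RFC 1950). The sketch below uses plain Python rather than SharpZipLib, and the function name `looks_like_zlib_header` is made up for illustration. It only tells you that the bytes could be a zlib header, not that the rest of the stream is valid.

```python
def looks_like_zlib_header(data: bytes) -> bool:
    """Return True if the first two bytes form a valid zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method_is_deflate = (cmf & 0x0F) == 8        # CM = 8 means DEFLATE
    window_ok = (cmf >> 4) <= 7                  # CINFO above 7 is not allowed
    check_ok = ((cmf << 8) | flg) % 31 == 0      # FCHECK makes the pair a multiple of 31
    return method_is_deflate and window_ok and check_ok


# 0x78 0x9C is a very common zlib header; "PK" is how a zip archive starts.
print(looks_like_zlib_header(bytes([0x78, 0x9C])))
print(looks_like_zlib_header(b"PK\x03\x04"))
```

If the check passes, the stream is expected to end with a 4-byte Adler-32 checksum after the DEFLATE data, which is worth keeping in mind when the decoder complains about missing input at the very end.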
I tried reading from your data file, a single byte per read, and it seems like the deflate stream just ends after reading 208671 byte(s):
I also tried running it through zlib's example program
Thank you for testing. The exact same exception is thrown here. I don't know which byte is the problem. The file is the deflate data of a PDF page content stream. If I take the uncompressed data and convert the byte[] to a UTF-8 string, the correct data comes back. So it seems that the single byte that causes the error is the problem.
Is there a way to fix it here?
If zlib gives the exact same result, it's the data (input file) that is the problem.
Removing bytes one at a time seems like performance overkill. Is it possible to get a more specific exception that reports the position at which the problem occurs? A naive way:
    using System;
    using System.IO;
    using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

    int length = 4096;
    using (var input = new FileStream(@"data.bin", FileMode.Open))
    {
        using (var output = new MemoryStream(65536))
        {
            using (var _inflater = new InflaterInputStream(input))
            {
                while (true)
                {
                    var data = new byte[length];
                    try
                    {
                        var size = _inflater.Read(data, 0, length);
                        if (size > 0)
                        {
                            output.Write(data, 0, size);
                        }
                        else
                        {
                            break; // nothing more was read
                        }
                    }
                    catch (ICSharpCode.SharpZipLib.SharpZipBaseException e) when (e.Message.Equals("Unexpected EOF", StringComparison.OrdinalIgnoreCase))
                    {
                        // Ask for one byte less on the next attempt.
                        length -= 1;
                    }
                }
            }
            var strg = System.Text.Encoding.UTF8.GetString(output.ToArray());
        }
    }
Yes, something like that is what I meant, but not as the final solution, just to find out which parts of the file shouldn't be passed to INFLATE. I assume it would be the same for all files in this format. Perhaps there is an additional CRC or something appended to the end? Or perhaps multiple streams are appended together in the original file and so the last deflate record has its "isLastRecord" bit set to false?
The PDF specification allows the content to be a single deflated stream or an array of streams. But as I understand it, the concatenation into one single content file happens after deflating. So in this case the data is produced as a closed container which is deflated. What is a CRC?
Any idea how other libraries in the PDF world, with their own implementations of deflate, can work with the data? I don't know why it ends at this point, because there is more data after it.
This project is focused on zip and tar.gz/bz2, so I have no insight into PDF, sorry. Plain DEFLATE is not that common in files, so I would probably take a look at the producer of those files to see if it includes either too much or too little data. You could also try debugging your program and stepping back in the stack trace to see why more data is required (you would need to have a basic understanding of how the DEFLATE format works though).
I am getting the same error trying to inflate the data contained in this file
It looks like this is actually a bug in Adobe's PDF generation engine. It is leaving off the last byte of the Adler-32 checksum if the last byte is 0x00. In the case of the file I provided, the computed checksum is 0x60F7D300, but the last 4 bytes of the data in the encoded stream are 0x00, 0x60, 0xF7, and 0xD3. In the case of the file @lutz provided, the computed checksum is 0x79DFAE00, but the last 4 bytes of data in the encoded stream are 0x00, 0x79, 0xDF, and 0xAE. I have confirmed that adding a byte with value 0x00 to the end of each of these files causes them to process correctly. It would seem that Acrobat Reader must be ignoring the header and checksum fields and is just processing the raw DEFLATE data.
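The effect described here can be reproduced without the PDF files. The sketch below uses Python's standard `zlib` module as a stand-in for a strict decoder (it is not SharpZipLib, and `strict_inflate` is a made-up helper name). The payload of 95 `a` bytes is chosen only because its Adler-32 value happens to end in 0x00, so cutting the last byte off the stream mimics the files in this thread.

```python
import zlib


def strict_inflate(stream: bytes):
    """Inflate a zlib stream and return (output, finished).

    finished is False when the decoder ran out of input before it could
    read and verify the complete 4-byte Adler-32 trailer.
    """
    decoder = zlib.decompressobj()
    output = decoder.decompress(stream)
    return output, decoder.eof


# 95 bytes of "a" give an Adler-32 value whose last byte is 0x00.
payload = b"a" * 95
assert zlib.adler32(payload) & 0xFF == 0

complete = zlib.compress(payload)
truncated = complete[:-1]  # simulate the dropped trailing 0x00 byte

output, finished = strict_inflate(truncated)
print(output == payload, finished)   # data is recovered, but the stream is incomplete

output, finished = strict_inflate(truncated + b"\x00")
print(output == payload, finished)   # padding with 0x00 completes the stream
```

Note that all of the payload is already decoded by the time the decoder asks for the missing trailer byte, which matches the observation that there is "more data" and yet the read fails at the very end.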
@asyncritus great detective work! It could also be the case that the way they are reading/writing the checksum allows for truncating trailing null bytes. In the case of SharpZipLib it should be fairly easy to try to fill any missing bytes in the CRC with 0 bytes if it reaches EOF...
...or perhaps it's the tool that extracts the PDF streams that strips the trailing null bytes? How did you produce the file?
@asyncritus Great work. Your result matches what I suspected about the Adobe PDF engine.
@piksel We don't trim this data when we read. It seems that the Adobe PDF engine started doing this with a specific update. We could identify that the behaviour changed with Adobe's InDesign 18.5 update (Windows and Mac). Before that update it works, and after it does not.
After some further investigation with more examples, I've found that it is not just leaving off trailing 0x00 bytes, but as soon as it encounters a 0x00 byte in the checksum, it stops writing data. For example, in one situation the checksum is 0x001E9C82, and none of those bytes are present. In another case, the checksum is 0x6C00878A, and only 0x6C was present. Our customer who is having these issues is using InDesign 19.0. We are trying to obtain the original InDesign documents so that we can test with an earlier version. @lutz Have you contacted Adobe about this issue?
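The "stops at the first 0x00" rule can be written down and checked against a stream. This is a sketch in Python with the standard `zlib` module, not SharpZipLib; `trailer_as_written` and `inflate_and_check` are hypothetical names, and the truncation rule is only a model of what was observed in this thread. It assumes a 2-byte zlib header with no preset dictionary.

```python
import zlib


def trailer_as_written(checksum: int) -> bytes:
    """Model of the observed behaviour: the big-endian checksum bytes
    are written only up to (not including) the first 0x00 byte."""
    return checksum.to_bytes(4, "big").split(b"\x00", 1)[0]


def inflate_and_check(stream: bytes):
    """Inflate the DEFLATE data and report whether the bytes that follow it
    are either a full Adler-32 trailer or one truncated as modelled above."""
    decoder = zlib.decompressobj(-15)          # raw DEFLATE, no header/trailer handling
    output = decoder.decompress(stream[2:])    # skip the 2-byte zlib header
    leftover = decoder.unused_data             # bytes after the final DEFLATE block
    expected = zlib.adler32(output)
    consistent = decoder.eof and leftover in (
        expected.to_bytes(4, "big"),
        trailer_as_written(expected),
    )
    return output, consistent


# The three checksums quoted in this thread:
print(trailer_as_written(0x60F7D300).hex())  # 60f7d3
print(trailer_as_written(0x6C00878A).hex())  # 6c
print(trailer_as_written(0x001E9C82).hex())  # empty
```

Checking the leftover bytes this way still validates the data against whatever part of the checksum survived, which is stricter than skipping the checksum entirely.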
We could reproduce the behaviour back to version 18.5. One of our customers was able to check multiple InDesign versions, and v18.5 seems to be the first affected one. v17 should definitely work. We have no contact with Adobe. The problem is that most PDF viewers we checked work with the files (Adobe Acrobat/Reader, PDF-XChange, SumatraPDF, browsers and so on). It could be that most of them behave identically, ignoring the checksum and interpreting the raw data. So we do not have enough of an argument. The PDF specification is clear that deflate should be used, and the deflate spec is strict about its format (checksum and so on). It is not the first time that Adobe, as the inventor of the PDF format, interprets PDF files loosely rather than strictly.
It seems like the only thing we can do is to add a way to ignore the CRC (in the library, that is). It should be a useful option to have in any case...
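As an illustration of what "ignoring the checksum" means at the format level, here is a minimal sketch in Python using the standard `zlib` module. It is not the SharpZipLib change itself, and `inflate_ignoring_checksum` is a hypothetical name. It assumes the stream starts with a 2-byte zlib header and no preset dictionary.

```python
import zlib


def inflate_ignoring_checksum(stream: bytes) -> bytes:
    """Decode the DEFLATE data of a zlib stream without reading the trailer."""
    decoder = zlib.decompressobj(-15)        # negative wbits selects raw DEFLATE
    output = decoder.decompress(stream[2:])  # skip the 2-byte zlib header
    if not decoder.eof:
        # The final DEFLATE block never arrived, so real data is missing.
        raise ValueError("DEFLATE data ended before the final block")
    return output


sample = zlib.compress(b"a" * 95)
print(inflate_ignoring_checksum(sample[:-4]) == b"a" * 95)                # trailer missing
print(inflate_ignoring_checksum(sample[:-4] + b"\xff" * 4) == b"a" * 95)  # trailer wrong
```

The trade-off is that corruption in the payload goes unnoticed, so an option like this is probably best left off by default.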
[SharpZipLib] - Fixes: icsharpcode/SharpZipLib#837
@piksel I'm late, but I come with a fix. GitHubProUser67/MultiServer3@90354fc
Describe the bug
Hello community,
I am not really familiar with deflate-compressed files, but with the attached file (data.zip) I get an "Unexpected EOF" exception from the InflaterInputStream class. It is reproducible with the following code at the line var size = _inflater.Read(data, 0, data.Length);
Best regards
Daniel
Reproduction Code
No response
Steps to reproduce
Create a console app, add the latest release of the SharpZipLib NuGet package, and run the above code.
Expected behavior
It should not throw an exception.
Operating System
Windows
Framework Version
No response
Tags
No response
Additional context
No response