If a token parsed by a `PSParser` subclass straddles the boundary between two buffers, a keyword token is incorrectly split into two separate tokens, which produces the wrong keyword and derails all subsequent parsing. `BUFSIZ` is only 4096 bytes, so any data stream longer than 4096 bytes can potentially suffer from this issue.
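To make the failure mode concrete, here is a minimal, self-contained sketch. It is not pdfminer.six's actual code: `naive_keywords` and `KEYWORD` are hypothetical names, and the tokenizer is deliberately simplified to recognize only alphabetic keywords. It assumes each buffer is tokenized on its own, so a keyword is treated as finished wherever the buffer ends.

```python
import io
import re

# Hypothetical, simplified keyword pattern (alphabetic runs only).
KEYWORD = re.compile(rb"[A-Za-z]+")


def naive_keywords(stream, bufsize):
    """Tokenize each buffer independently (simplified model of the bug)."""
    tokens = []
    while True:
        buf = stream.read(bufsize)
        if not buf:
            break
        # A keyword is assumed to end wherever the current buffer ends,
        # even if the next buffer starts with the rest of it.
        tokens.extend(KEYWORD.findall(buf))
    return tokens


data = b"<01> beginbfchar"

# With an 8-byte buffer the boundary falls after "beg".
split_tokens = naive_keywords(io.BytesIO(data), bufsize=8)

# With a buffer larger than the data, the keyword stays intact.
whole_tokens = naive_keywords(io.BytesIO(data), bufsize=4096)
```

With the small buffer the keyword comes out as two tokens, while the large buffer merely hides the problem until the input grows past the buffer size.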
A simple solution is to increase the buffer size to a much larger value (gigabytes); in practice the impact on performance will be negligible, since most PDFs fit within available RAM anyway. Alternatively, (it seems to me) all `_parse*` functions would need to be adjusted to handle the case where we hit the end of the buffer and the subsequent bytes (in the next buffer) might contain the rest of a token.
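The second approach can be sketched as follows. This is only an illustration under simplifying assumptions, not the change that was made in pdfminer.six: `keywords_across_buffers`, `KEYWORD`, and `TRAILING_KEYWORD` are hypothetical names, and only alphabetic keywords are recognized. The idea is that a keyword touching the end of a buffer is held back and joined with the next buffer, and is emitted only once a delimiter or the real end of the stream is seen.

```python
import io
import re

# Hypothetical, simplified patterns (alphabetic keywords only).
KEYWORD = re.compile(rb"[A-Za-z]+")
TRAILING_KEYWORD = re.compile(rb"[A-Za-z]+\Z")


def keywords_across_buffers(stream, bufsize):
    """Tokenize a stream in chunks without splitting keywords at boundaries."""
    tokens = []
    pending = b""  # possibly incomplete keyword carried over from the last buffer
    while True:
        buf = stream.read(bufsize)
        if not buf:
            # End of stream (not just end of buffer): whatever is pending
            # is now known to be complete.
            tokens.extend(KEYWORD.findall(pending))
            break
        data = pending + buf
        pending = b""
        tail = TRAILING_KEYWORD.search(data)
        if tail:
            # The keyword touches the end of the buffer, so it may continue
            # in the next one: hold it back instead of emitting it.
            pending = tail.group()
            data = data[: tail.start()]
        tokens.extend(KEYWORD.findall(data))
    return tokens


data = b"<01> beginbfchar <02> endbfchar"
small = keywords_across_buffers(io.BytesIO(data), bufsize=8)
tiny = keywords_across_buffers(io.BytesIO(data), bufsize=1)
```

The design choice here is to keep the carry-over in the read loop rather than in each token handler. In a character-driven parser with many `_parse*` states, the equivalent would be for each state to keep its partial token and remain in the same state when the buffer runs out.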
The attached PDF demonstrates the issue when trying to parse its cmaps, some of which are longer than 4096 bytes.
Here is log output from parsing the attached PDF's fonts. Note that the token `beginbfchar` is incorrectly split into two tokens, `beg` and `inbfchar`, because it happens to straddle the end of the buffer. This causes all subsequent tokens to be interpreted incorrectly. Increasing `BUFSIZ` mitigates the issue.
The change in #885 is the culprit, as it doesn't distinguish between the end of the stream and the end of the buffer.
Ideally the PSParser code should be replaced with a lexer based on a more robust and well-tested codebase, but for the moment we can simply fix that fix, which I'll do in a second.
dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue on Aug 1, 2024.
Attachment: 1361.pdf
Originating issue: ocrmypdf/OCRmyPDF#1361