extract_text misses spaces between words #606
Comments
Did you try modifying the x_tolerance parameter, like this? text = page.extract_text(x_tolerance=1)
@xelaos Thanks, that did work for me (at least on this example).
@jtjohnston Typically, you'll need to specify/adjust x_tolerance. PDFs don't themselves have a concept of "words", and many PDFs don't include whitespace characters explicitly but rather depend on letter-spacing to visually represent that whitespace. So this library provides the x_tolerance parameter.
This library has generally shied away from "magic", i.e., auto-tuning parameters. But there are likely some heuristics you could use to auto-guess the appropriate x_tolerance.
Closing this issue for now, but feel free to continue the discussion.
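To make the letter-spacing point concrete, here is a minimal sketch of gap-based word splitting. It is a simplified illustration, not the library's actual implementation: `join_chars` and `make_line` are hypothetical helpers, and the character dicts only mimic the `text`, `x0`, and `x1` fields that entries in `page.chars` carry.

```python
def join_chars(chars, x_tolerance):
    """Join the characters of one line, inserting a space wherever the
    horizontal gap between neighbours is larger than x_tolerance."""
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for char in ordered:
        if prev is not None and char["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(char["text"])
        prev = char
    return "".join(pieces)


def make_line(words, char_width=5.0, word_gap=2.0):
    """Build synthetic characters (assumed layout): letters touch each other
    inside a word, and words are separated only by a gap of word_gap."""
    chars, x = [], 0.0
    for word in words:
        for letter in word:
            chars.append({"text": letter, "x0": x, "x1": x + char_width})
            x += char_width
        x += word_gap
    return chars


line = make_line(["missing", "spaces"])
print(join_chars(line, x_tolerance=3))  # gap of 2 is within tolerance: words merge
print(join_chars(line, x_tolerance=1))  # gap of 2 exceeds tolerance: space inserted
```

With a word gap of 2 units, a tolerance of 3 treats the gap as ordinary letter-spacing, while a tolerance of 1 treats it as a word boundary. That is the same effect as lowering x_tolerance on a tightly set PDF.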
@jsvine I have the same problem. Let me know if I should start a new issue for this, but your reply above is very relevant. My initial thought is that ideally we would be able to set the tolerance as a fraction of the font size, since both the word spacing and the line spacing usually change proportionally with the font size.
hey -- I'm working on this, do you have a PDF with crammed letters I can use for testing?
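One way the font-size idea could be sketched is below. The helper `x_tolerance_from_font_size` and the `ratio=0.15` default are assumptions made for illustration, not part of the library and not a tuned value. The only thing relied on from the library is that each entry in `page.chars` has a `size` field, and that 3 is the default x_tolerance.

```python
from statistics import median


def x_tolerance_from_font_size(chars, ratio=0.15, fallback=3):
    """Guess an x_tolerance as a fraction of the median font size.

    `ratio` is an illustrative assumption; `fallback` is returned when no
    character reports a usable size (3 matches the library default).
    """
    sizes = [c["size"] for c in chars if c.get("size")]
    if not sizes:
        return fallback
    return ratio * median(sizes)


# Synthetic characters standing in for page.chars (only "size" is used here).
body_text = [{"text": "a", "size": 10.0} for _ in range(20)]
print(x_tolerance_from_font_size(body_text))

# With a real page, the result could be passed straight through:
# text = page.extract_text(x_tolerance=x_tolerance_from_font_size(page.chars))
```

The median is used rather than the mean so that a few large heading characters do not inflate the tolerance for the body text. A page that mixes very different font sizes would probably need a per-line estimate instead.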
Describe the bug
Extracting text frequently misses spaces between words, resulting in many words being concatenated.
This particular example is a PDF that was (likely) generated using a LaTeX compiler (e.g. pdflatex).
Code to reproduce the problem
PDF file
An example pdf can be downloaded (as of 2/16/2022) with:
wget https://proceedings.neurips.cc/paper/2021/file/000c076c390a4c357313fca29e390ece-Paper.pdf
Expected behavior
The first line of the actual behavior (below) should be:
Actual behavior
(sub-sampled output):
Environment