extract_text misses spaces between words #606

Closed
jtjohnston opened this issue Feb 16, 2022 · 6 comments
@jtjohnston

Describe the bug

Extracting text frequently misses spaces between words, so many words end up concatenated.
This particular example is a PDF that was (likely) generated with a LaTeX compiler (e.g. pdflatex).

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("/path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

PDF file

An example PDF can be downloaded (as of 2/16/2022) with: wget https://proceedings.neurips.cc/paper/2021/file/000c076c390a4c357313fca29e390ece-Paper.pdf

Expected behavior

The first line of the actual behavior (below) should be:

We provide improved gap-dependent regret bounds for reinforcement learning in

Actual behavior

(sub-sampled output):

Weprovideimprovedgap-dependentregretboundsforreinforcementlearningin
finiteepisodicMarkovdecisionprocesses. Comparedtopriorwork,ourbounds
dependonalternativedefinitionsofgaps. Thesedefinitionsarebasedontheinsight
that,inordertoachieveafavorableregret,analgorithmdoesnotneedtolearnhow
tobehaveoptimallyinstatesthatarenotreachedbyanoptimalpolicy. Weprove
tighterupperregretboundsforoptimisticalgorithmsandaccompanythemwith
newinformation-theoreticlowerboundsforalargeclassofMDPs. Ourresults
showthatoptimisticalgorithmscannotachievetheinformation-theoreticlower
boundsevenindeterministicMDPsunlessthereisauniqueoptimalpolicy.

Environment

  • pdfplumber version: 0.6.0
  • Python version: 3.9.7
  • OS: Linux (Ubuntu 18, running in a WSL2 shell)
@jtjohnston jtjohnston added the bug label Feb 16, 2022
@xelaos

xelaos commented Feb 21, 2022

Did you try to modify the x_tolerance parameter like this?

text = page.extract_text(x_tolerance=1)

@jtjohnston
Author

@xelaos Thanks, that did work for me (at least on this example).
I guess the question now is: how do I know when (or whether) I have to use that, e.g. if I'm extracting text automatically from lots of PDFs? How often is this needed? Why sometimes and not others?

@jsvine
Owner

jsvine commented Mar 3, 2022

@jtjohnston Typically, you'll need to specify/adjust x_tolerance whenever you have typography that crams letters together very closely or spaces them apart very widely.

PDFs don't themselves have a concept of "words" and many PDFs don't include whitespace characters explicitly but rather depend on letter-spacing to visually represent that whitespace. So this library provides the x_tolerance parameter to let the user specify the minimum distance between letters that should be considered a word separator.
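
To make the gap rule concrete, here is a minimal sketch of splitting one line of positioned letters into words using a horizontal tolerance. It is a simplified illustration, not pdfplumber's actual implementation; the function names and the `layout` helper are hypothetical, and the character dicts only mimic the `text`, `x0`, and `x1` keys that pdfplumber exposes on its char objects.

```python
def layout(text, char_width=5.0, space_width=2.5):
    """Hypothetical helper: place letters left to right, turning each space
    in `text` into a horizontal gap instead of a whitespace character."""
    chars, x = [], 0.0
    for c in text:
        if c == " ":
            x += space_width  # no char is emitted, only a gap
            continue
        chars.append({"text": c, "x0": x, "x1": x + char_width})
        x += char_width
    return chars


def group_chars_into_words(chars, x_tolerance=3.0):
    """Start a new word whenever the gap to the previous letter
    exceeds x_tolerance. Assumes chars are sorted left to right."""
    words, current, prev_x1 = [], "", None
    for ch in chars:
        if prev_x1 is not None and ch["x0"] - prev_x1 > x_tolerance:
            words.append(current)
            current = ""
        current += ch["text"]
        prev_x1 = ch["x1"]
    if current:
        words.append(current)
    return words


line = layout("We provide improved")
print(group_chars_into_words(line, x_tolerance=3.0))
print(group_chars_into_words(line, x_tolerance=1.0))
```

In this made-up layout the inter-word gap is 2.5 units, so a tolerance of 3 merges everything into one token while a tolerance of 1 recovers the three words, which mirrors the behavior reported above.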

This library has generally shied away from "magic" (i.e., auto-tuning parameters). But there are likely some heuristics you could use to auto-guess the appropriate x_tolerance, especially if you have some general expectations about the types of PDFs you'll be processing (e.g., they'll all have a big chunk of text on the first page).
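
One possible heuristic along these lines, sketched under stated assumptions: try a few candidate tolerances from largest to smallest and keep the first one whose output has a plausible mean word length. The function names, the candidate list, and the threshold of 10 characters are all assumptions chosen for illustration, not anything pdfplumber provides. The `extract` argument is any callable that maps a tolerance to extracted text.

```python
def mean_word_length(text):
    """Average length of whitespace-separated tokens (0.0 if there are none)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def pick_x_tolerance(extract, candidates=(3, 2, 1.5, 1, 0.5), max_mean_len=10.0):
    """Return the largest candidate tolerance whose extracted text does not
    look like concatenated words, or None if no candidate qualifies.

    Larger tolerances are tried first because a tolerance that is too small
    can split words into individual letters.
    """
    for tol in sorted(candidates, reverse=True):
        text = extract(tol)
        if text.strip() and mean_word_length(text) <= max_mean_len:
            return tol
    return None


# Stand-in for a real page: pretend spaces only appear below a tolerance of 2.
def fake_extract(tol):
    if tol >= 2:
        return "Weprovideimprovedgap-dependentregretbounds"
    return "We provide improved gap-dependent regret bounds"


print(pick_x_tolerance(fake_extract))
```

With pdfplumber, the callable could be something like `lambda tol: page.extract_text(x_tolerance=tol)`. The mean-length threshold would need tuning for the language and document type at hand.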

Closing this issue for now, but feel free to continue the discussion.

@jsvine jsvine closed this as completed Mar 3, 2022
@Sarke

Sarke commented Sep 1, 2023

@jsvine I have the same problem. Let me know if I should start a new issue for this, but your above reply is very relevant.

My initial thought is that ideally we would be able to set the tolerance as a fraction of the font size, since both the word spacing and the line spacing usually change proportionally with the font size.
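
A rough sketch of that idea, assuming each character dict carries a `size` key (as pdfplumber's char objects do): derive a single tolerance from the median font size. The function name and the ratio of 0.15 are assumptions for illustration only, and a real implementation would probably scale the tolerance per character rather than once for the whole page.

```python
from statistics import median


def tolerance_from_font_size(chars, ratio=0.15):
    """Return ratio * median font size, or None if no sizes are available.

    `ratio` is an illustrative guess, not a recommended value.
    """
    sizes = [c["size"] for c in chars if c.get("size")]
    if not sizes:
        return None
    return ratio * median(sizes)


# Made-up characters: body text at 10pt and the same text at 20pt.
small = [{"text": "a", "size": 10}, {"text": "b", "size": 10}, {"text": "c", "size": 10}]
large = [{"text": "a", "size": 20}, {"text": "b", "size": 20}, {"text": "c", "size": 20}]

print(tolerance_from_font_size(small))
print(tolerance_from_font_size(large))
```

With pdfplumber, the result could be passed on as `page.extract_text(x_tolerance=tolerance_from_font_size(page.chars))`, so that doubling the font size doubles the gap required to split two words.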

@jsvine
Owner

jsvine commented Sep 11, 2023

Thanks @Sarke, I think that's a nice idea and have opened a feature request issue here: #987

@afriedman412
Contributor

> @jsvine I have the same problem. Let me know if I should start a new issue for this, but your above reply is very relevant.
>
> My initial thought is that ideally we would be able to set the tolerance as a fraction of the font size, since both the word spacing and the line spacing usually change proportionally with the font size.

Hey, I'm working on this. Do you have a PDF with crammed letters I can use for testing?
