extract_text misses spaces between words #606

jtjohnston · 2022-02-16T17:55:41Z

Describe the bug

Extracting text frequently misses spaces between words, resulting in many words being concatenated.
This particular example is a PDF that was (likely) generated using a LaTeX compiler (e.g. pdflatex).

Code to reproduce the problem

import pdfplumber

# Open the PDF and print the text extracted from its first page.
with pdfplumber.open("/path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

PDF file

An example PDF can be downloaded (as of 2/16/2022) with:

wget https://proceedings.neurips.cc/paper/2021/file/000c076c390a4c357313fca29e390ece-Paper.pdf

Expected behavior

The first line of the actual behavior (below) should be:

We provide improved gap-dependent regret bounds for reinforcement learning in

Actual behavior

(sub-sampled output):

Weprovideimprovedgap-dependentregretboundsforreinforcementlearningin
ﬁniteepisodicMarkovdecisionprocesses. Comparedtopriorwork,ourbounds
dependonalternativedeﬁnitionsofgaps. Thesedeﬁnitionsarebasedontheinsight
that,inordertoachieveafavorableregret,analgorithmdoesnotneedtolearnhow
tobehaveoptimallyinstatesthatarenotreachedbyanoptimalpolicy. Weprove
tighterupperregretboundsforoptimisticalgorithmsandaccompanythemwith
newinformation-theoreticlowerboundsforalargeclassofMDPs. Ourresults
showthatoptimisticalgorithmscannotachievetheinformation-theoreticlower
boundsevenindeterministicMDPsunlessthereisauniqueoptimalpolicy.

Environment

pdfplumber version: 0.6.0
Python version: 3.9.7
OS: Linux (Ubuntu 18, running in a WSL2 shell)

xelaos · 2022-02-21T14:15:49Z

Did you try to modify the x_tolerance parameter like this?

text = page.extract_text(x_tolerance=1)
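
To see how much the setting matters for a given file, one option is to run the extraction with several x_tolerance values and compare the first line of each result. The sketch below is only an illustration: the helper names first_lines_by_tolerance and compare_on_file are made up for this example, and the values tried (3, which is assumed to be the library default, then 2 and 1) are arbitrary.

```python
def first_lines_by_tolerance(page, tolerances=(3, 2, 1)):
    """Map each x_tolerance value to the first line of text extracted with it.

    `page` is any object with an extract_text(x_tolerance=...) method,
    such as a pdfplumber page.
    """
    results = {}
    for tolerance in tolerances:
        # extract_text may return an empty result for pages without text.
        text = page.extract_text(x_tolerance=tolerance) or ""
        lines = text.splitlines()
        results[tolerance] = lines[0] if lines else ""
    return results


def compare_on_file(path, page_number=0):
    """Print the first extracted line of one page for several tolerances."""
    import pdfplumber  # imported lazily so the helper above has no dependency

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        for tolerance, line in first_lines_by_tolerance(page).items():
            print(f"x_tolerance={tolerance}: {line}")
```

Calling compare_on_file("/path/to/file.pdf") on the PDF from the report should make it easy to spot the value at which the words stop running together.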

jtjohnston · 2022-02-25T15:50:05Z

@xelaos Thanks, that did work for me (at least on this example).
I guess the question now is: how do I know when (or if) I need to use it, e.g. when I'm extracting text automatically from lots of PDFs? How often is it needed, and why for some PDFs and not others?

jsvine · 2022-03-03T02:22:03Z

@jtjohnston Typically, you'll need to specify/adjust x_tolerance whenever you have typography that crams letters together very closely or spaces them apart very widely.

PDFs don't themselves have a concept of "words" and many PDFs don't include whitespace characters explicitly but rather depend on letter-spacing to visually represent that whitespace. So this library provides the x_tolerance parameter to let the user specify the minimum distance between letters that should be considered a word separator.
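
The idea described above can be sketched in a few lines of plain Python. This is a simplified illustration for a single line of text, not the library's actual implementation: the function name group_chars_into_words is made up, and the characters are assumed to be dicts with "text", "x0" and "x1" keys (the left and right edges of each glyph).

```python
def group_chars_into_words(chars, x_tolerance=3):
    """Split one line of characters into words based on horizontal gaps.

    A new word starts whenever the gap between the right edge of the
    previous character and the left edge of the next one exceeds
    x_tolerance.
    """
    words = []
    current = ""
    last_x1 = None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if last_x1 is not None and ch["x0"] - last_x1 > x_tolerance:
            # Gap is wider than the tolerance: treat it as a word separator.
            words.append(current)
            current = ""
        current += ch["text"]
        last_x1 = ch["x1"]
    if current:
        words.append(current)
    return words
```

With tightly set type, the gap between words can be smaller than the tolerance, in which case every character on the line ends up in one "word". Lowering x_tolerance makes the same gap count as a separator.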

This library has generally shied away from "magic" (i.e., auto-tuning parameters). But there are likely some heuristics you could use to auto-guess an appropriate x_tolerance, especially if you have some general expectations about the types of PDFs you'll be processing (e.g., that they'll all have a big chunk of text on the first page).
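
One possible heuristic of this kind, offered only as a rough sketch: collect the horizontal gaps between neighbouring characters, sort them, and put the threshold in the middle of the largest jump, on the assumption that gaps inside words and gaps between words form two separate clusters. The function name guess_x_tolerance, the fallback of 3.0, and the rule that gaps wider than the font size are ignored are all assumptions made for this example. Characters are assumed to be dicts with "x0", "x1", "top" and "size" keys.

```python
from collections import defaultdict


def guess_x_tolerance(chars, default=3.0):
    """Guess a word-separating gap threshold from character positions."""
    # Group characters into rough lines by their rounded top coordinate.
    lines = defaultdict(list)
    for ch in chars:
        lines[round(ch["top"])].append(ch)

    gaps = []
    for line in lines.values():
        line.sort(key=lambda c: c["x0"])
        for prev, ch in zip(line, line[1:]):
            # Overlapping (kerned) glyphs count as a gap of zero.
            gap = max(0.0, ch["x0"] - prev["x1"])
            # Assumption: gaps wider than the font size are column or
            # tab gaps rather than word spaces, so they are skipped.
            if gap <= prev["size"]:
                gaps.append(gap)

    gaps.sort()
    if len(gaps) < 2:
        return default

    # Find the largest jump between consecutive sorted gaps.
    jump, index = max((gaps[i + 1] - gaps[i], i) for i in range(len(gaps) - 1))
    if jump <= 0:
        return default
    # Place the threshold halfway across that jump.
    return (gaps[index] + gaps[index + 1]) / 2
```

With pdfplumber this could be tried as page.extract_text(x_tolerance=guess_x_tolerance(page.chars)). It will likely misfire on pages with very little text, justified text with widely varying word spacing, or mixed font sizes, so the result is worth checking by eye before using it in bulk.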

Closing this issue for now, but feel free to continue the discussion.

Sarke · 2023-09-01T23:13:24Z

@jsvine I have the same problem. Let me know if I should start a new issue for this, but your above reply is very relevant.

My initial thought is that ideally we would be able to set the tolerance as a fraction of the font size, since both the word spacing and the line spacing usually change proportionally with the font size.
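
A minimal sketch of what a font-size-relative tolerance could look like, for a single line of text. The function name group_chars_by_font_ratio, the parameter name x_tolerance_ratio and its value of 0.15 are assumptions chosen for illustration, not an existing API. Characters are assumed to be dicts with "text", "x0", "x1" and "size" keys.

```python
def group_chars_by_font_ratio(chars, x_tolerance_ratio=0.15):
    """Split one line of characters into words using a relative tolerance.

    The allowed gap scales with the font size of the preceding character,
    so the same ratio works for small and large type.
    """
    words = []
    current = ""
    prev = None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if prev is not None:
            tolerance = x_tolerance_ratio * prev["size"]
            if ch["x0"] - prev["x1"] > tolerance:
                # Gap exceeds the scaled tolerance: start a new word.
                words.append(current)
                current = ""
        current += ch["text"]
        prev = ch
    if current:
        words.append(current)
    return words
```

The point of the design is that a line set at 10 pt and the same line set at 20 pt should split into the same words, which a fixed tolerance in absolute units cannot guarantee.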

jsvine · 2023-09-11T16:39:33Z

Thanks @Sarke, I think that's a nice idea and have opened a feature request issue here: #987

afriedman412 · 2023-10-19T22:05:57Z

Hey, I'm working on this. Do you have a PDF with crammed letters I can use for testing?

jtjohnston added the bug label Feb 16, 2022

jsvine closed this as completed Mar 3, 2022

jsvine mentioned this issue Sep 11, 2023

For text extraction, add fractional versions of x/y_tolerance arguments #987
