For text extraction, add fractional versions of `x/y_tolerance` arguments #987

jsvine · 2023-09-11T16:38:35Z

Currently, x_tolerance and y_tolerance are treated as numeric constants. But, as @Sarke points on in #606 (comment), it could be useful to provide a "fractional" version of these arguments:

[...] ideally we would be able to set the tolerance as a fraction of the font-size, since both the words spacing and the line spacing usually change proportionally with the change in font-size.

Implementing this correctly might be tricky, as x/y_tolerance are passed across a few methods, but it should be doable. Some other things to sort out:

Let's say the new x-related parameter is x_tolerance_fraction. Should the value be a number (indicating desired fractional threshold) or a boolean (indicating that x_tolerance should be interpreted as a fraction)?
What should the precise calculation be? Something like (in pseudocode) x_tolerance = current_character["size"] * x_tolerance_fraction?

Any other questions or complications I may be overlooking?

The text was updated successfully, but these errors were encountered:

afriedman412 · 2023-10-17T23:43:02Z

is anyone working on this? I'd like to take a crack at it if not!

jsvine · 2023-10-18T18:12:40Z

@afriedman412 I'm not aware of anyone actively working on this, thanks for checking — and thanks for offering! Would be wonderful if you took a crack at it.

afriedman412 · 2023-10-19T22:06:57Z

great -- do you have a pdf with extra-tight letters I can use for testing?

(the pdf in the original issue worked fine with x_toleraance=1 sooo....)

jsvine · 2023-10-25T21:56:17Z

@afriedman412 How about something like this?: issue-987-test.pdf

import pdfplumber

pdf = pdfplumber.open("issue-987-test.pdf")
page = pdf.pages[0]

for x in [ 0, 3, 10 ]:
    print(f"--- x_tolerance = {x} ---")
    print(page.extract_text(x_tolerance=x))
    print("")

... outputs this:

--- x_tolerance = 0 ---
Big Te xt
Small Te xt

--- x_tolerance = 3 ---
Big Te xt
Small Text

--- x_tolerance = 10 ---
Big Text
SmallText

afriedman412 · 2023-10-26T17:08:31Z

sorry im confused -- what do we want it to output?

jsvine · 2023-10-26T17:31:49Z

Ah, my apologies for not being more explicit. Ideally, the proportional tolerance feature would make it possible to get this back:

Big Text
Small Text

The examples above show (or try to show) that non-proportional tolerances either under-condense the big text or over-condense the small text.

afriedman412 · 2023-10-26T18:00:20Z

is there a less dumb way to get text size than int(text['bottom'] - text['top'])?

jsvine · 2023-10-26T19:18:42Z

Are you looking at char objects, or something else (which I might infer from the variable being named text)? If char objects:

char["height"] already gets you that calculation (without the floor-ing that int(...) introduces)
... but it might be more idiomatic/expected to use char["size"] as the base

I'd have to check more carefully, but I believe those two values should typically be the same.

afriedman412 · 2023-10-27T01:00:37Z

Honestly I'm lazy and couldn't find an easy way to extract char objects from text, so now it works with anything with a top and bottom param. The int isn't necessary -- I guess I assumed font sizes are always whole numbers?

def get_char_tolerance(t, x_tolerance):
    """Scales x_tolerance to font size (height of text)"""
    if "bottom" in t.keys() and "top" in t.keys():
        return int(t['bottom'] - t['top'])*x_tolerance
    else:
        raise KeyError("Couldn't get height of text for x_tolerance scaling.")

Anyways, fractional x_tolerance mostly works now! The default value is 0.15, because "issue-987-test.pdf" rendered both text sizes correctly between 0.06 and 0.2, and it passed the unit test, and I like round numbers.

Some questions:

Where are the table-extraction values set? Would it make sense to define all the text defaults in one place? (DEFAULT_X_TOLERANCE is currently just at the top of text.py.)
How flexible does this param need to be? I was thinking about scaling it so the default value is 1, but I'm not sure what the range would be
I also kind of feel like making the number a decimal implies scaling, whereas a whole number implies total units
That said, do we want to leave the option in for explicit tolerance?

jsvine · 2023-10-27T20:31:23Z

Thanks, @afriedman412! I'll address your specific questions below, but first this seems like a good opportunity for me to sketch out a bit more about how I see this working:

The current default settings will continue to apply — i.e., by default, .extract_text(...) and related methods will continue to use explicit (rather than proportional) tolerances.
Those methods will gain two new parameters: x_tolerance_ratio and y_tolerance_ratio, which will both default to None. ("_ratio" seems a bit better name than my original proposal, "_fraction".)
If x_tolerance_ratio is set to a number (integer or float), it will override any explicit x_tolerance (and likewise y_tolerance if y_tolerance_ratio is set).
When determining whether the horizontal distance between Character A and Character B exceeds the x_tolerance_ratio, we calculate x_tolerance to be A["size"] * x_tolerance_ratio (and something analogous for vertical distances).

As a matter of actual implementation, things get tricky, as these tolerances are used in several parts of pdfplumber:

Separating lines from one another, via cluster_objects:

For layout=True:

pdfplumber/pdfplumber/utils/text.py

Lines 466 to 468 in d9561d1

    
           subclusters = cluster_objects( 
        
               upright_cluster, itemgetter(cluster_key), self.y_tolerance 
        
           )

... and:

pdfplumber/pdfplumber/utils/text.py

Lines 227 to 232 in d9561d1

    
           cluster_objects( 
        
               words_sorted_doctop, 
        
               lambda x: float(x[0]["doctop"]), 
        
               y_tolerance, 
        
               preserve_order=presorted or use_text_flow, 
        
           )

For layout=False:

pdfplumber/pdfplumber/utils/text.py

Line 538 in d9561d1

lines = cluster_objects(words, itemgetter("doctop"), y_tolerance)

Separating words from one another in the same line:

pdfplumber/pdfplumber/utils/text.py

Lines 347 to 351 in d9561d1

    
           def char_begins_new_word( 
        
               self, 
        
               prev_char: T_obj, 
        
               curr_char: T_obj, 
        
           ) -> bool:

... and possibly a few other places, not to mention where these utility functions are integrated into the Page.extract_... methods. This is not to discourage you from implementing this, but rather just a note that there are some tricky bits.

Now on to the specific questions:

Where are the table-extraction values set?

I think here you're asking about the default parameter values for the table-extraction methods? If so, you can find the table-specific ones at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py

... while the general text extraction defaults are set at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py

Would it make sense to define all the text defaults in one place? (DEFAULT_X_TOLERANCE is currently just at the top of text.py.)

I'd prefer to keep as-is.

How flexible does this param need to be? I was thinking about scaling it so the default value is 1, but I'm not sure what the range would be

See above, which I believe should answer that question, but let me know if not.

I also kind of feel like making the number a decimal implies scaling, whereas a whole number implies total units

I think we should give users the option to specify these ratios as any number, to give them as much precision as they'd like.

That said, do we want to leave the option in for explicit tolerance?

See above; I think the explicit tolerance should remain the default, both for backward-compatibility's sake and for predictability's sake (the explicit value does not depend on character order, which I believe the ratio version will).

Happy to clarify any of the above and to answer any follow-up Qs! And thanks again!

afriedman412 · 2023-10-30T23:19:05Z

thanks for all this

my big question is do we want to make calculating tolerances dynamic?

like right now my approach is basically just using the size of first character available to calculate the tolerance. if we are iterating through the lines on a page with cluster_objects we could theoretically adjust ratios on the fly...

jsvine · 2023-10-31T14:25:29Z

my big question is do we want to make calculating tolerances dynamic?
like right now my approach is basically just using the size of first character available to calculate the tolerance.

Good question. At the very least, we want the tolerances to be dynamic between lines.

I think the answer is less clear within lines. The argument in favor of calculating tolerances on a per-character (i.e., within line) basis would be more flexibility for lines containing text of different sizes. Not necessarily the most common occurrence, but a possibility. The argument against would be greater complexity and a (small, probably negligible) performance hit. I'd say let's start experimenting / prototyping without that, and then see how much of a hassle it'd be to add it.

afriedman412 · 2023-11-01T17:06:59Z

are we sure we need y_tolerance_ratio? off top it feels like line spacing is much less dependent on font size...

I'm going to implement x_tolerance first and we can go from there

jsvine · 2023-11-02T15:16:42Z

I think that's a reasonable (and smartly constrained) place to start. I think y_tolerance_ratio would still be helpful, but x_tolerance_ratio would certainly still be useful on its own.

jsvine added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Sep 11, 2023

jsvine self-assigned this Sep 11, 2023

jsvine mentioned this issue Sep 11, 2023

extract_text misses spaces between words #606

Closed

afriedman412 mentioned this issue Nov 1, 2023

Add x_tolerance_ratio param to extract_text and similar functions #1037

Closed

afriedman412 mentioned this issue Nov 4, 2023

Add x_tolerance_ratio param to extract_text and similar functions (now properly linted!) #1041

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For text extraction, add fractional versions of `x/y_tolerance` arguments #987

For text extraction, add fractional versions of `x/y_tolerance` arguments #987

jsvine commented Sep 11, 2023

afriedman412 commented Oct 17, 2023

jsvine commented Oct 18, 2023

afriedman412 commented Oct 19, 2023

jsvine commented Oct 25, 2023

afriedman412 commented Oct 26, 2023

jsvine commented Oct 26, 2023

afriedman412 commented Oct 26, 2023

jsvine commented Oct 26, 2023

afriedman412 commented Oct 27, 2023

jsvine commented Oct 27, 2023 •

edited

Loading

afriedman412 commented Oct 30, 2023

jsvine commented Oct 31, 2023

afriedman412 commented Nov 1, 2023

jsvine commented Nov 2, 2023

For text extraction, add fractional versions of x/y_tolerance arguments #987

For text extraction, add fractional versions of x/y_tolerance arguments #987

Comments

jsvine commented Sep 11, 2023

afriedman412 commented Oct 17, 2023

jsvine commented Oct 18, 2023

afriedman412 commented Oct 19, 2023

jsvine commented Oct 25, 2023

afriedman412 commented Oct 26, 2023

jsvine commented Oct 26, 2023

afriedman412 commented Oct 26, 2023

jsvine commented Oct 26, 2023

afriedman412 commented Oct 27, 2023

jsvine commented Oct 27, 2023 • edited Loading

afriedman412 commented Oct 30, 2023

jsvine commented Oct 31, 2023

afriedman412 commented Nov 1, 2023

jsvine commented Nov 2, 2023

For text extraction, add fractional versions of `x/y_tolerance` arguments #987

For text extraction, add fractional versions of `x/y_tolerance` arguments #987

jsvine commented Oct 27, 2023 •

edited

Loading