For text extraction, add fractional versions of x/y_tolerance arguments #987
Comments
is anyone working on this? I'd like to take a crack at it if not! |
@afriedman412 I'm not aware of anyone actively working on this, thanks for checking — and thanks for offering! Would be wonderful if you took a crack at it. |
great -- do you have a pdf with extra-tight letters I can use for testing? (the pdf in the original issue worked fine with |
@afriedman412 How about something like this?: issue-987-test.pdf
import pdfplumber
pdf = pdfplumber.open("issue-987-test.pdf")
page = pdf.pages[0]
for x in [0, 3, 10]:
    print(f"--- x_tolerance = {x} ---")
    print(page.extract_text(x_tolerance=x))
    print("")
... outputs this:
|
sorry im confused -- what do we want it to output? |
Ah, my apologies for not being more explicit. Ideally, the proportional tolerance feature would make it possible to get this back:
The examples above show (or try to show) that non-proportional tolerances either under-condense the big text or over-condense the small text. |
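To make that trade-off concrete without the test PDF, here is a small sketch on synthetic characters. It is a simplified model, not pdfplumber's actual word-grouping code: `split_words` and `make_chars` are hypothetical helpers, the glyph width of half the font size is an assumption, and only the `x0`, `x1`, `size` and `text` keys mirror what pdfplumber's character dicts carry.

```python
def split_words(chars, tolerance_for):
    """Group characters into words, starting a new word whenever the gap
    to the previous character exceeds tolerance_for(previous_char)."""
    words, current = [], []
    for char in chars:
        if current and char["x0"] - current[-1]["x1"] > tolerance_for(current[-1]):
            words.append("".join(c["text"] for c in current))
            current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


def make_chars(text, size, gap, word_gap):
    """Build synthetic characters; each glyph is size / 2 wide (an assumption)."""
    chars, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += word_gap - gap  # widen the gap in front of the next glyph
            continue
        chars.append({"text": ch, "x0": x, "x1": x + size / 2, "size": size})
        x += size / 2 + gap
    return chars


# Big text: letters 4pt apart, words 16pt apart.
big = make_chars("ab cd", size=40, gap=4, word_gap=16)
# Small text: letters 0.5pt apart, words 2pt apart.
small = make_chars("ab cd", size=6, gap=0.5, word_gap=2)

fixed = lambda prev: 3                          # one constant for every size
proportional = lambda prev: 0.2 * prev["size"]  # scales with the font size

print(split_words(big, fixed))           # big text is under-condensed
print(split_words(small, fixed))         # small text is over-condensed
print(split_words(big, proportional))
print(split_words(small, proportional))
```

With these made-up numbers, the fixed tolerance of 3 splits every big letter apart and merges the two small words, while the proportional version recovers both words at both sizes.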
is there a less dumb way to get text size than |
Are you looking at
I'd have to check more carefully, but I believe those two values should typically be the same. |
Honestly I'm lazy and couldn't find an easy way to extract
Anyways, fractional
Some questions:
|
Thanks, @afriedman412! I'll address your specific questions below, but first this seems like a good opportunity for me to sketch out a bit more about how I see this working:
As a matter of actual implementation, things get tricky, as these tolerances are used in several parts of
... and possibly a few other places, not to mention where these utility functions are integrated into the
Now on to the specific questions:
I think here you're asking about the default parameter values for the table-extraction methods? If so, you can find the table-specific ones at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py ... while the general text extraction defaults are set at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py
I'd prefer to keep as-is.
See above, which I believe should answer that question, but let me know if not.
I think we should give users the option to specify these ratios as any number, to give them as much precision as they'd like.
See above; I think the explicit tolerance should remain the default, both for backward-compatibility's sake and for predictability's sake (the explicit value does not depend on character order, which I believe the ratio version will). Happy to clarify any of the above and to answer any follow-up Qs! And thanks again! |
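A minimal sketch of that precedence, in case it helps the discussion. The keyword name `x_tolerance_ratio` and the helper `resolve_x_tolerance` are hypothetical, chosen only for illustration, and the default of 3 is assumed here rather than read from the library.

```python
DEFAULT_X_TOLERANCE = 3  # assumed default for this sketch


def resolve_x_tolerance(prev_char, x_tolerance=DEFAULT_X_TOLERANCE,
                        x_tolerance_ratio=None):
    """Return the tolerance to apply after prev_char.

    The explicit x_tolerance is used unless a ratio is given, so existing
    calls keep their current behaviour. The ratio may be any number.
    """
    if x_tolerance_ratio is None:
        return x_tolerance
    return x_tolerance_ratio * prev_char["size"]


# Explicit value: independent of the character, hence of character order.
print(resolve_x_tolerance({"size": 12}))
# Ratio: depends on the size of the character that precedes the gap.
print(resolve_x_tolerance({"size": 12}, x_tolerance_ratio=0.25))
print(resolve_x_tolerance({"size": 40}, x_tolerance_ratio=0.25))
```

Keeping the ratio as `None` by default means nothing changes for callers who never pass it, which is the backward-compatibility point above.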
thanks for all this my big question is do we want to make calculating tolerances dynamic? like right now my approach is basically just using the size of first character available to calculate the tolerance. if we are iterating through the lines on a page with |
Good question. At the very least, we want the tolerances to be dynamic between lines. I think the answer is less clear within lines. The argument in favor of calculating tolerances on a per-character (i.e., within line) basis would be more flexibility for lines containing text of different sizes. Not necessarily the most common occurrence, but a possibility. The argument against would be greater complexity and a (small, probably negligible) performance hit. I'd say let's start experimenting / prototyping without that, and then see how much of a hassle it'd be to add it. |
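For comparison, the two options might look roughly like this. Both functions are hypothetical sketches over a list of character dicts for a single line, assuming each dict has a `size` key. They only compute the tolerance that would apply at each gap between neighbouring characters.

```python
def tolerances_per_line(line_chars, ratio):
    """One tolerance for the whole line, taken from its first character."""
    if not line_chars:
        return []
    tolerance = ratio * line_chars[0]["size"]
    return [tolerance for _ in line_chars[1:]]


def tolerances_per_char(line_chars, ratio):
    """One tolerance per gap, taken from the character before that gap."""
    return [ratio * prev["size"] for prev in line_chars[:-1]]


# A line that starts with two 24pt characters and ends with two 8pt ones.
mixed_line = [{"size": 24}, {"size": 24}, {"size": 8}, {"size": 8}]

print(tolerances_per_line(mixed_line, ratio=0.25))  # [6.0, 6.0, 6.0]
print(tolerances_per_char(mixed_line, ratio=0.25))  # [6.0, 6.0, 2.0]
```

The two only differ on lines with mixed sizes: the per-line version keeps applying the 24pt tolerance to the 8pt tail, while the per-character version tightens it at the cost of a multiplication per gap.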
are we sure we need I'm going to implement x_tolerance first and we can go from there |
I think that's a reasonable (and smartly constrained) place to start. I think |
Currently, x_tolerance and y_tolerance are treated as numeric constants. But, as @Sarke points out in #606 (comment), it could be useful to provide a "fractional" version of these arguments:
Implementing this correctly might be tricky, as x/y_tolerance are passed across a few methods, but it should be doable. Some other things to sort out:
x_tolerance_fraction. Should the value be a number (indicating desired fractional threshold) or a boolean (indicating that x_tolerance should be interpreted as a fraction)?
x_tolerance = current_character["size"] * x_tolerance_fraction?
Any other questions or complications I may be overlooking?