How to crop each character for each line? #2701

nissansz · 2023-09-29T23:33:48Z

nissansz
Sep 29, 2023

For the first line, I put a table row or a textbox below the image,
but when I use the code below to extract characters, I only get an image of the whole line. How can I split it into cells?

'''
import fitz

from PIL import Image

import cv2
import numpy

path = r'L:/jp5/Presentation1.pdf'
path = r'L:/jp5/textbox.pdf'

doc = fitz.open(path)

j = 0
for i in range(0, doc.page_count):
    page = doc[i]
    for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
        for line in block["lines"]:
            j = j + 1
            bbox = line["bbox"]  # the line bbox
            text = " ".join([span["text"] for span in line["spans"]])  # text in line
            if len(text.strip()) > 2:
                print(bbox)
                print(text)
                pix = page.get_pixmap(clip=bbox, dpi=300)  # pixmap of line bbox
                # pix.save(...)"page%s-%s.png" % (page, xref)
                # pix.save(r'F:\0mupdflines\' + "page-%i_%i.png" % (page.number, j))  # store image as a PNG

doc.close()

'''

JorjMcKie · 2023-09-30T05:26:41Z

JorjMcKie
Sep 30, 2023
Maintainer

You may want to use the table finder:

In [1]: import fitz
In [2]: doc=fitz.open("Presentation1.pdf")
In [3]: page = doc[0]
In [4]: tabs = page.find_tables()  # detect the tables
In [5]: tab = tabs[0]  # get first table on page
In [6]: tab.extract()  # read all cell content
Out[6]:
[['丕', '石', '右', '布', '夯', '龙', '戊'],
 ['丕', '石', '右', '布', '夯', '龙', '戊'],
 ['', None, None, None, None, None, None]]
In [7]: # also works as a pandas DataFrame:
In [8]: df = tab.to_pandas()
In [9]: df
Out[9]:
   丕     石     右     布     夯     龙     戊
0  丕     石     右     布     夯     龙     戊
1     None  None  None  None  None  None
In [10]: # you can also extract row and cell coordinates
In [11]: row = tab.rows[0]
In [12]: row.bbox
Out[12]: (0.0, 0.0, 369.3599853515625, 50.4000244140625)

In [14]: row.cells[0]
Out[14]: (0.0, 0.0, 52.768001556396484, 50.4000244140625)

For background on using PyMuPDF's table finder look at blog1 and blog2.

Hint: to make your code snippets appear as nice as mine, consider using 3 backticks (`) or use this to invoke a guide:

0 replies

nissansz · 2023-09-30T05:38:01Z

nissansz
Sep 30, 2023
Author

Thank you.
How about parsing a single textbox?

1 reply

JorjMcKie Sep 30, 2023
Maintainer

You can take each table cell and extract its content by e.g. get_textbox(cell).
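A minimal sketch of that suggestion, assuming `page` is a PyMuPDF page and `tab` is one of the tables returned by `page.find_tables()`. The helper name `cell_texts` is made up for this illustration; it only relies on `tab.rows`, each row's `cells` list, and `page.get_textbox()`.

```python
def cell_texts(page, tab):
    """Return the text of every cell, row by row.

    Cells reported as None (for example, parts of joined cells)
    are kept as None so the row / column positions stay aligned.
    """
    result = []
    for row in tab.rows:
        texts = []
        for cell in row.cells:
            if cell is None:
                texts.append(None)
            else:
                # get_textbox returns the text inside the given rectangle
                texts.append(page.get_textbox(cell).strip())
        result.append(texts)
    return result
```

With a real document you would call it as `cell_texts(page, page.find_tables()[0])`.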

nissansz · 2023-09-30T05:41:21Z

nissansz
Sep 30, 2023
Author

There is an error. Also, how do I crop the image of a cell?

File "F:/pycharm2020.2/RapidStructure-0.0.0/exportLine_txt_png_tables.py", line 20, in
for tab in page.find_tables() :
AttributeError: 'Page' object has no attribute 'find_tables'
/
for i in range(0, doc.page_count):
    page = doc[i]
    # for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
    for tab in page.find_tables():
        a = tab.extract()  # read all cell content
        print(a)
/

2 replies

JorjMcKie Sep 30, 2023
Maintainer

Install a later version of PyMuPDF.

JorjMcKie Sep 30, 2023
Maintainer

Use page.get_pixmap(clip=cell).
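As a rough sketch of how this could look for all cells of a table: the helper name `save_cell_images`, the file name pattern and the `dpi=300` default are assumptions made for this example. Only `tab.cells`, `page.get_pixmap(clip=..., dpi=...)` and `pix.save()` come from PyMuPDF.

```python
def save_cell_images(page, tab, prefix="cell", dpi=300):
    """Render every visible table cell to its own PNG file.

    Returns the list of file names that were written.
    """
    names = []
    for n, cell in enumerate(tab.cells):
        # clip restricts rendering to the cell rectangle
        pix = page.get_pixmap(clip=cell, dpi=dpi)
        name = f"{prefix}-p{page.number}-{n}.png"
        pix.save(name)
        names.append(name)
    return names
```

Called as `save_cell_images(page, tab)`, this should leave one PNG per cell in the current folder.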

nissansz · 2023-09-30T06:17:09Z

nissansz
Sep 30, 2023
Author

Cannot see cell bbox property

1 reply

JorjMcKie Sep 30, 2023
Maintainer

Cannot see cell bbox property

What does that mean?

nissansz · 2023-09-30T06:20:03Z

nissansz
Sep 30, 2023
Author

I want to crop the bbox for each character.

1 reply

JorjMcKie Sep 30, 2023
Maintainer

sure, what's wrong with the cell bboxes?

nissansz · 2023-09-30T06:22:21Z

nissansz
Sep 30, 2023
Author

I cannot get bboxes for the cells

1 reply

JorjMcKie Sep 30, 2023
Maintainer

This

for cell in tab.cells:
    page.draw_rect(cell, color=(1,0,0))

gives you this:

What is missing?

nissansz · 2023-09-30T06:35:10Z

nissansz
Sep 30, 2023
Author

Thank you.
I found a problem: there should be 14 cells, but I don't know why the whole image is treated as a cell too.

1 reply

JorjMcKie Sep 30, 2023
Maintainer

The table finder tries to make sense of the vector graphics with its standard detection strategy.
Apart from the single lines wrapping the cells, there is also a large white rectangle covering the full page.
This is responsible for the "problem": the rest of the page (under the first 2 rows) is taken as an additional row with only one cell.
So you currently have to use your own code to cope with this. In a future version, we may add logic that prevents this sort of thing.
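One possible way to write such "own code" is sketched below. It is only a heuristic and not something the table finder offers: it assumes the real cells are of similar size and drops any cell whose area is much larger than the median cell area. The function name `drop_oversized_cells` and the `factor` threshold are invented for this example.

```python
from statistics import median


def drop_oversized_cells(cells, factor=4.0):
    """Return the cells whose area is at most `factor` times the median area.

    cells: list of (x0, y0, x1, y1) tuples, e.g. tab.cells
    """
    def area(cell):
        x0, y0, x1, y1 = cell
        return abs(x1 - x0) * abs(y1 - y0)

    if not cells:
        return []
    limit = factor * median(area(c) for c in cells)
    return [c for c in cells if area(c) <= limit]
```

For the table discussed here, `drop_oversized_cells(tab.cells)` would be expected to keep the 14 character cells and discard the big one, provided the extra cell really is several times larger than the others.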

nissansz · 2023-09-30T07:00:29Z

nissansz
Sep 30, 2023
Author

Ok. Thank you.

0 replies

nissansz · 2023-10-10T00:22:36Z

nissansz
Oct 10, 2023
Author

How can I read table cells by rows and columns?
I get an error at: cell = table.get_table_cell(row, col)
AttributeError: 'Table' object has no attribute 'get_table_cell'

'''
html = ""
cells = []
# col_widths = table.get_table_col_widths()
col_widths = table.row_count

for row in range(table.row_count):
    html += "<tr>"
    # for col in range(len(col_widths)):
    for col in range (0,col_widths):
        cell = table.get_table_cell(row, col)
        tokens = cell.get_text().split()
        bbox = cell.rect
        cells.append({
            'tokens': tokens,
            'bbox': [bbox.x0, bbox.y0, bbox.x1, bbox.y1]
        })
        html += f"<td>{' '.join(tokens)}</td>"
    html += "</tr>"

'''

8 replies

nissansz Oct 10, 2023
Author

Thank you. But the above method can only get text, not cell coordinates. Is there any method to get coordinates by rows and columns?

JorjMcKie Oct 10, 2023
Maintainer

You also have tab.cells, and each row in tab.rows has a cells attribute. These are lists of rectangles.
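A sketch of how the loop from the question might be rewritten with these attributes. It assumes that `tab.extract()` and `tab.rows` line up row by row and column by column; `table_to_html` is a made-up helper name, and the dictionary layout simply mirrors the one in the question.

```python
from html import escape


def table_to_html(tab):
    """Build an HTML table plus a list of cell dictionaries with coordinates."""
    texts = tab.extract()  # list of rows, each a list of str or None
    cells = []
    html_rows = []
    for r, row in enumerate(tab.rows):
        tds = []
        for c, bbox in enumerate(row.cells):
            tokens = (texts[r][c] or "").split()
            if bbox is not None:  # None marks a slot of a joined cell
                cells.append(
                    {"row": r, "col": c, "tokens": tokens, "bbox": list(bbox)}
                )
            tds.append(f"<td>{escape(' '.join(tokens))}</td>")
        html_rows.append("<tr>" + "".join(tds) + "</tr>")
    return "<table>" + "".join(html_rows) + "</table>", cells
```

Usage would be `html, cells = table_to_html(page.find_tables()[0])`.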

nissansz Oct 10, 2023
Author

border4.pdf

For example, for the PDF above, I want to get a result like the one below.

。作者			.....

nissansz Oct 10, 2023
Author

'''

。作者			.....

'''

nissansz Oct 10, 2023
Author

JorjMcKie · 2023-10-10T06:25:14Z

JorjMcKie
Oct 10, 2023
Maintainer

Try

tabs=page.find_tables()
tab=tabs[0]
for e in tab.extract():
    print(e)

    
['。作者', None, None, '件事”。“只要', '看20多份报']
['pulseawry', '认为，', '。冰壶运动需要', None, None]
['，还有师父咸鱼', '564.77', '8506.00', '', '32.20']
['Ssubs', '', '', '', '4862.66']
[None, '7', '119.57', '满了湿润和温', '68']
tabs=page.find_tables()

tab.row_count, tab.col_count
(5, 5)

For each of these cell content items you will find the corresponding cell coordinates like this

for row in tab.rows:
    print(row.cells)  # a list of rectangle tuples

When encountering joined cells, the table finder will dissolve them and always return the maximum values for rows / columns. When doing this, it will use None cells in all the necessary places, both in terms of cell text content and cell coordinates.

for row in tab.rows:
    print([tuple(fitz.IRect(cell)) if cell is not None else None for cell in row.cells])

    
[(66, 243, 562, 271), None, None, (562, 243, 728, 300), (728, 243, 894, 300)]
[(66, 271, 231, 300), (231, 271, 397, 300), (397, 271, 562, 300), None, None]
[(66, 300, 231, 329), (231, 300, 397, 329), (397, 300, 562, 329), (562, 300, 728, 329), (728, 300, 894, 329)]
[(66, 329, 231, 387), (231, 329, 397, 358), (397, 329, 562, 358), (562, 329, 728, 358), (728, 329, 894, 358)]
[None, (231, 358, 397, 387), (397, 358, 562, 387), (562, 358, 728, 387), (728, 358, 894, 387)]

If you compare the text and the cell output you will see the logic used to dissolve the joined cells.

4 replies

nissansz Oct 10, 2023
Author

Thank you.
It seems difficult to tell from a None entry in which direction the cell was merged: to the left or upward? Sometimes it may not be clear.

JorjMcKie Oct 10, 2023
Maintainer

Sure. But that's all you can have.
In your example however, it is clear.
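One way to make this explicit is to look at the coordinates rather than at the None entries. The sketch below is not part of the table finder; it assumes that every grid column and row has at least one cell starting on its left / top edge, and the name `cell_spans` is invented. It counts how many column starts and row starts fall inside each cell rectangle.

```python
def cell_spans(rows, ndigits=0):
    """Map (row, col) -> (rowspan, colspan) for every non-None cell.

    rows: list of rows, each a list of (x0, y0, x1, y1) tuples or None,
          e.g. [row.cells for row in tab.rows]
    """
    def key(value):
        # rounding absorbs tiny float differences between neighbouring cells
        return round(value, ndigits)

    cells = [cell for row in rows for cell in row if cell is not None]
    col_starts = sorted({key(cell[0]) for cell in cells})
    row_starts = sorted({key(cell[1]) for cell in cells})

    spans = {}
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell is None:
                continue
            x0, y0, x1, y1 = (key(v) for v in cell)
            colspan = sum(1 for x in col_starts if x0 <= x < x1)
            rowspan = sum(1 for y in row_starts if y0 <= y < y1)
            spans[(r, c)] = (rowspan, colspan)
    return spans
```

Applied to the integer rectangles printed above, the first cell of the first row comes out as spanning 3 columns, while the cells at (0, 3), (0, 4) and (3, 0) each span 2 rows.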

nissansz Oct 10, 2023
Author

Do you know any method to determine the number of merged rows/columns in pywin32 for PowerPoint?

JorjMcKie Oct 10, 2023
Maintainer

No, sorry.
Maybe the table finder can give some help though:
row_count * col_count is the count of all cells, including the None ones. That number also comes out when looking at the cells of each row.
But table.cells does not contain the None cells:

In [4]: page=doc[0]
In [5]: tab=page.find_tables()[0]
In [6]: tab.row_count * tab.col_count
Out[6]: 25
In [7]: len(tab.cells)
Out[7]: 20
In [8]:

tab.cells are exactly the visible cells. So comparing the two lists should at least be some help.
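A small sketch of such a comparison, assuming the per-row cell lists have been collected as `[row.cells for row in tab.rows]`. The helper name `hidden_slots` is made up; it just lists the grid positions that are None, i.e. the slots absorbed by joined cells.

```python
def hidden_slots(rows):
    """Return the (row, col) positions whose cell is None."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, cell in enumerate(row)
        if cell is None
    ]
```

The length of the returned list should equal `tab.row_count * tab.col_count - len(tab.cells)`, which is 25 - 20 = 5 for the table in this thread.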
