fix: Title text partially missing issue in recovery_to_markdown.py #14216

Open
wants to merge 2 commits into
base: main

@Coobiw Coobiw commented Nov 13, 2024

When I run the quick tour code as follows:

import os
import cv2
from PIL import Image
from pathlib import Path
from paddleocr import PPStructure,save_structure_res, draw_structure_result
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown

# Chinese test image
# table_engine = PPStructure(recovery=True)
# English test image
table_engine = PPStructure(recovery=True, lang='en')

save_folder = './paddleocr_markdown_restore_new'
img_path = './pics/20241113-091849.jpeg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    # line.pop('img')
    print(line)

im_show = draw_structure_result(img, result, font_path='/cpfs/data/user/zhiqi/mm_ocr/got/fonts/simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save(f'{save_folder}/{Path(img_path).stem}_vis_paddle.jpg')
h, w, _ = img.shape
res = sorted_layout_boxes(result, w)
convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])

The output Markdown file has some minor errors in the title text. The result is shown below.

Test PDF screenshot:
img_v3_02gj_5e6fbf19-198e-47e5-a13f-2c455dd5901g

Part of the original recovered Markdown:

# 3

# 3.1
...

# 3.1.1
...

The title text is incomplete. The source code appends only the first part of the detected title text:

elif region["type"].lower() == "title":
            markdown_string.append(f"""# {region["res"][0]["text"]}""")

So I modified recovery_to_markdown.py. After the change, the title text is complete, as follows:

# 3 Contribucion de la JERS al marco de politicas

# 3.1 Sector bancario
...

# 3.1.1 Dictamenes relativos al articulo 458 del R eglamento sobre Requisitos de Capital
...
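
For reference, the idea behind the change can be sketched as below. This is only an illustration, not the exact diff in this PR: the helper name `title_to_markdown` is hypothetical, and it assumes that `region["res"]` is a list of dicts that each carry a `"text"` key (as in the excerpt above) and that joining the parts with a single space is acceptable.

```python
def title_to_markdown(region):
    """Build one Markdown heading from all text parts of a title region.

    Hypothetical helper for illustration; not part of recovery_to_markdown.py.
    """
    # Assumption: every entry in region["res"] is a dict with a "text" key.
    stripped = (line.get("text", "").strip() for line in region["res"])
    # Drop empty fragments so the joined heading has no doubled spaces.
    parts = [text for text in stripped if text]
    return "# " + " ".join(parts)


# A title region whose text was detected as two separate parts.
region = {
    "type": "title",
    "res": [{"text": "3.1"}, {"text": "Sector bancario"}],
}
heading = title_to_markdown(region)
print(heading)
```

With the original code only `res[0]` ("3.1") would reach the heading; joining every entry keeps the rest of the title.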

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@GreatV
Collaborator

GreatV commented Nov 15, 2024

Please fix the code style and sign the CLA.
