Missing Content in PDF to Markdown Conversion #50
Comments
@shawn8888, try the prompt mentioned in #48 and see if that improves the result.
Sorry, I am not familiar with GitHub. How do I use this updated prompt?
@shawn8888, pass it like this:
custom_system_prompt = """
Convert the following PDF page to markdown.
Return only the markdown with no explanation text. Do not include deliminators like ```markdown.
RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Charts & infographics must be interpreted to a markdown format
- Non text based images must be replaced with [Description of image](image.png)
"""
zerox(file_path, output_file_path,
      custom_system_prompt=custom_system_prompt)
The result .md file is the same. Am I doing it wrong? The .py file is attached.
@shawn8888, it looks like the cause is the vision model (gpt-4o-mini), not zerox. Try gpt-4o and see whether it does the same. You could also give the page image from your document to GPT manually and check whether that changes anything.
@pradhyumna85 gpt-4o returns much more OCR content than gpt-4o-mini, but it is also much more expensive. I will use gpt-4o when gpt-4o-mini's result is not satisfactory. Thanks a lot!
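The "use gpt-4o only when gpt-4o-mini falls short" approach described above could be sketched roughly as follows. This is only an illustration, not part of zerox: `convert_with_fallback` is a hypothetical helper, `convert` stands for whatever function you write to call zerox with a given model and return the Markdown as a string, and the character-count threshold is an assumed, fairly crude proxy for "the result looks complete".

```python
from typing import Callable, Optional, Sequence, Tuple


def convert_with_fallback(
    convert: Callable[[str, str], str],
    file_path: str,
    models: Sequence[str] = ("gpt-4o-mini", "gpt-4o"),
    min_chars: int = 200,
) -> Tuple[str, str]:
    """Try each model in order (cheapest first).

    `convert(file_path, model)` is a hypothetical wrapper around your own
    zerox call; it must return the generated Markdown as a string.
    Returns (model, markdown) for the first result with at least
    `min_chars` non-whitespace-trimmed characters. If no result passes,
    the longest result seen is returned instead.
    """
    best: Optional[Tuple[str, str]] = None
    for model in models:
        markdown = convert(file_path, model)
        # Remember the longest output in case every model falls short.
        if best is None or len(markdown) > len(best[1]):
            best = (model, markdown)
        # Assumed heuristic: a very short output suggests missing content.
        if len(markdown.strip()) >= min_chars:
            return model, markdown
    if best is None:
        raise ValueError("models must not be empty")
    return best
```

In practice the length check would probably be replaced by something closer to your documents, for example comparing against the text layer of the PDF or looking for expected headings, since a long output can still miss content.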
I'm using the Zerox project to convert PDFs to Markdown with OpenAI's gpt-4o-mini, but I've noticed that not all content is converted and some useful information is missing. This issue persists even with documents that shouldn't be difficult to OCR.
Could you provide insights into what might be causing this? Understanding the limitations or reasons behind the incomplete conversions would be helpful for users trying to get the most out of the tool.
Thank you.