Missing Content in PDF to Markdown Conversion #50

Closed
shawn8888 opened this issue Oct 8, 2024 · 8 comments

Comments

@shawn8888

I'm using the Zerox project to convert PDFs to Markdown with OpenAI's gpt-4o-mini model via the API, but I've noticed that not all content is being converted and some useful information is missing. This happens even with documents that shouldn't be difficult to OCR.

Could you provide insights into what might be causing this? Understanding the limitations or reasons behind the incomplete conversions would be helpful for users trying to get the most out of the tool.

Thank you.

@pradhyumna85
Contributor

@shawn8888, try the prompt mentioned in #48 and see if that improves the output.

@shawn8888
Author

Sorry, I am not familiar with GitHub. How do I use this updated prompt?

@pradhyumna85
Contributor

@shawn8888 pass it like this:

custom_system_prompt = """
    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text. Do not include deliminators like ```markdown.

    RULES:
    - You must include all information on the page. Do not exclude headers, footers, or subtext.
    - Charts & infographics must be interpreted to a markdown format
    - Non text based images must be replaced with [Description of image](image.png)
"""
zerox(file_path, output_file_path,
      custom_system_prompt=custom_system_prompt)
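
For readers who want a fuller picture of the call above, here is a minimal sketch. It assumes the Python package is imported as `pyzerox`, that `zerox` is an async function, and that it accepts `file_path`, `output_dir`, and `model` keyword arguments. Only `custom_system_prompt` is confirmed in this thread, so check the other names against the version you have installed. The helpers `build_zerox_kwargs` and `convert` are hypothetical, and the prompt is a shortened copy of the one above.

```python
# Shortened version of the prompt from this thread.
CUSTOM_SYSTEM_PROMPT = """
Convert the following PDF page to markdown.
Return only the markdown with no explanation text.

RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Charts & infographics must be interpreted to a markdown format
- Non text based images must be replaced with [Description of image](image.png)
"""


def build_zerox_kwargs(file_path, output_dir, model="gpt-4o-mini"):
    """Collect keyword arguments for zerox.

    Assumption: every name except custom_system_prompt is taken from memory
    of the pyzerox README, not from this thread.
    """
    return {
        "file_path": file_path,
        "output_dir": output_dir,
        "model": model,
        "custom_system_prompt": CUSTOM_SYSTEM_PROMPT,
    }


async def convert(file_path, output_dir):
    """Run the conversion with the custom prompt (assumes zerox is awaitable)."""
    # Imported lazily so the helper above works without pyzerox installed.
    from pyzerox import zerox

    return await zerox(**build_zerox_kwargs(file_path, output_dir))
```

To try it, run `asyncio.run(convert("example.pdf", "./output"))` with an OpenAI API key set in the environment. The file names here are placeholders.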

@shawn8888
Author

The result .md file is the same. Am I doing it wrong? The .py file is attached.

C:\Backup\Projects\python>python hello_zerox2.py
C:\Python312\Lib\site-packages\pyzerox\models\modellitellm.py:52: UserWarning:
    Custom system prompt was provided which overrides the default system prompt. We assume that you know what you are doing.
    . Default prompt for zerox is:

    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text.
    Do not exclude any content from the page.

  warnings.warn(f"{Messages.CUSTOM_SYSTEM_PROMPT_WARNING}. Default prompt for zerox is:\n {DEFAULT_SYSTEM_PROMPT}")
ZeroxOutput(completion_time

hello_zerox3.zip

@pradhyumna85
Contributor

@shawn8888, it looks like the issue is with the vision model (gpt-4o-mini), not Zerox. Try gpt-4o and see whether it behaves the same way. You could also test your document image manually with GPT to see if that changes anything.

@shawn8888
Author

@pradhyumna85

  1. The result on the console only shows the "default system prompt". How do I know the custom system prompt is working?
  2. The 4o model does return more text than 4o-mini, which surprised me. I also found an obvious mistake in the 4o output that is not in the 4o-mini output. AI hallucination, I guess?

@pradhyumna85
Contributor

pradhyumna85 commented Oct 12, 2024

@shawn8888,
For 2: gpt-4o is a larger model than gpt-4o-mini, so that could be the reason.
For 1: in the custom prompt, try instructing the model to return JSON instead of Markdown and see whether the output changes.
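
The check suggested for point 1 could be sketched roughly as follows: pass a prompt that asks for JSON, then test whether the returned text parses as JSON. If it does, the custom prompt is being applied, since the default prompt asks for Markdown. `PROBE_PROMPT` and `looks_like_json` are hypothetical names for illustration, not part of Zerox.

```python
import json

# Hypothetical probe prompt: asks for JSON, which the default prompt never would.
PROBE_PROMPT = (
    "Convert the following PDF page to JSON. "
    'Return only a JSON object of the form {"content": "<page text>"}.'
)


def looks_like_json(text):
    """Return True if text parses as a JSON object or array."""
    try:
        parsed = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        # Not valid JSON, or not a string at all.
        return False
    return isinstance(parsed, (dict, list))
```

Pass `PROBE_PROMPT` as `custom_system_prompt` and run `looks_like_json` on the page content that comes back. Note that a model may wrap its JSON in a code fence, in which case this simple check reports `False` even though the prompt took effect, so it is worth looking at the raw output as well.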

@shawn8888
Author

shawn8888 commented Oct 12, 2024

@pradhyumna85
Good idea! I modified the custom_system_prompt to return JSON, and it successfully did so, confirming its functionality. However, I would still appreciate it if Zerox could return both the default and custom prompts.

4o returns much more OCR content than 4o-mini, but it is also much more expensive. I will use 4o when 4o-mini's result is not satisfactory.

Thanks a lot!
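
The "use 4o only when 4o-mini falls short" approach described above could be sketched like this. Everything here is hypothetical: the converters are passed in as plain callables (for example, thin wrappers around a Zerox call with different `model` values), and the length-based quality check is only a crude stand-in for whatever "satisfactory" means for your documents.

```python
def convert_with_fallback(page, cheap_convert, strong_convert, is_satisfactory):
    """Try the cheaper converter first; fall back to the stronger one if needed.

    Returns a (markdown, used_fallback) tuple.
    """
    markdown = cheap_convert(page)
    if is_satisfactory(markdown):
        return markdown, False
    return strong_convert(page), True


def min_length_check(min_chars):
    """Build a crude quality check: at least min_chars non-whitespace characters."""
    def check(markdown):
        return len("".join(markdown.split())) >= min_chars
    return check
```

A length threshold will not catch hallucinated or wrong text, so a manual spot check of the output is still advisable.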
