Missing Content in PDF to Markdown Conversion #50

shawn8888 · 2024-10-08T16:12:07Z

I'm using the Zerox project to convert PDFs to Markdown using OpenAI API 4o-mini, but I've noticed that not all content is being converted, and some useful information is missing. This issue persists even with documents that shouldn't be difficult to OCR.

Could you provide insights into what might be causing this? Understanding the limitations or reasons behind the incomplete conversions would be helpful for users trying to get the most out of the tool.

Thank you.

pradhyumna85 · 2024-10-11T13:08:17Z

@shawn8888, try the prompt mentioned in #48 and see if that improves.

shawn8888 · 2024-10-12T04:01:55Z

@shawn8888, try the prompt mentioned in #48 and see if that improves.

Sorry, I am not familiar with github. How to use this updated prompt?

pradhyumna85 · 2024-10-12T12:16:21Z

@shawn8888 pass it like this:

custom_system_prompt = """
    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text. Do not include deliminators like ```markdown.

    RULES:
    - You must include all information on the page. Do not exclude headers, footers, or subtext.
    - Charts & infographics must be interpreted to a markdown format
    - Non text based images must be replaced with [Description of image](image.png)
"""
zerox(file_path, output_file_path,
                      custom_system_prompt = custom_system_prompt,)

shawn8888 · 2024-10-12T12:42:48Z

The result .md file is the same. Am I doing it wrong? The .py file is attached.

C:\Backup\Projects\python>python hello_zerox2.py
C:\Python312\Lib\site-packages\pyzerox\models\modellitellm.py:52: UserWarning:
    Custom system prompt was provided which overrides the default system prompt. We assume that you know what you are doing.
    . Default prompt for zerox is:

    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text.
    Do not exclude any content from the page.

  warnings.warn(f"{Messages.CUSTOM_SYSTEM_PROMPT_WARNING}. Default prompt for zerox is:\n {DEFAULT_SYSTEM_PROMPT}")
ZeroxOutput(completion_time

hello_zerox3.zip

pradhyumna85 · 2024-10-12T13:59:09Z

@shawn8888, looks like its the vision model (gpt 4o mini), and not zerox. Try with gpt 4o and see if it does the same, also may be check with your document image manually with gpt to see if that changes anything.

shawn8888 · 2024-10-12T14:59:30Z

@pradhyumna85

The result on the console only shows the "default system prompt". How do I know the custom system prompt is working?
4o model does return more text than 4o-mini, which surprised me. I also found an obvious mistake in 4o but not in 4o-mini. AI illusion I guess?

pradhyumna85 · 2024-10-12T17:49:42Z

@pradhyumna85

The result on the console only shows the "default system prompt". How do I know the custom system prompt is working?

4o model does return more text than 4o-mini, which surprised me. I also found an obvious mistake in 4o but not in 4o-mini. AI illusion I guess?

@shawn8888,
For 2. I think gpt 4o is technically a bigger model than the mini so that could be a reason.
For 1. In the custom prompt try to instruct the model to return a JSON instead of markdown and see if that is changing the output.

shawn8888 · 2024-10-12T18:01:33Z

@pradhyumna85
Good idea! I modified the custom_system_prompt to return JSON, and it successfully did so, confirming its functionality. However, I would still appreciate it if Zerox could return both the default and custom prompts.

4o returns much more OCR content than 4o-mini, but also much more expensive. I would use 4o when 4o-mini's result is not satisfied.

Thanks a lot!

shawn8888 closed this as completed Oct 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing Content in PDF to Markdown Conversion #50

Missing Content in PDF to Markdown Conversion #50

shawn8888 commented Oct 8, 2024

pradhyumna85 commented Oct 11, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024 •

edited

Loading

shawn8888 commented Oct 12, 2024 •

edited

Loading

Missing Content in PDF to Markdown Conversion #50

Missing Content in PDF to Markdown Conversion #50

Comments

shawn8888 commented Oct 8, 2024

pradhyumna85 commented Oct 11, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024

shawn8888 commented Oct 12, 2024

pradhyumna85 commented Oct 12, 2024 • edited Loading

shawn8888 commented Oct 12, 2024 • edited Loading

pradhyumna85 commented Oct 12, 2024 •

edited

Loading

shawn8888 commented Oct 12, 2024 •

edited

Loading