Llamaparse error when parsing docx file #1363

Open
jsmusgrave opened this issue Oct 22, 2024 · 12 comments · Fixed by #1364
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@jsmusgrave
Contributor

LlamaParse parsing for docx doesn't work in 0.7.3. It does work via the web UI, which appears to use the public API.
I had hoped 1340 would address this, but it has not.

Demonstration code. (Change the file name and the api key env.)

import { LlamaParseReader } from "llamaindex";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";
import fs from "fs";

type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/somedoc.docx"
    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any images and tables.  This is a document for a course and the contents of the images are important.",
        fastMode: false,
        gpt4oMode: true,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        // vendorMultimodalApiKey?: string | undefined;
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

main().catch(console.error);
@himself65 himself65 added the bug Something isn't working label Oct 22, 2024
@himself65 himself65 self-assigned this Oct 22, 2024
@himself65
Member

Checking this

@himself65
Member

Fixed in #1364. As a workaround, you have to pass the filename as the second parameter so that LlamaParse knows the correct file type.
For now, I don't want to detect the file type on the llamaindex-ts side because of performance/bundler considerations; you can use loadData(filePath) for the best compatibility.
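
A minimal sketch of the workaround described above, assuming `loadDataAsContent` accepts an optional filename as its second argument, as this comment states. `ContentReader` and `parseBuffer` are hypothetical names used only for illustration; the interface is a structural stand-in so the example runs without the library and is not the library's own type.

```typescript
// Structural stand-in for a reader such as LlamaParseReader.
// Assumption: the second argument is an optional filename hint.
interface ContentReader<TDoc> {
  loadDataAsContent(content: Uint8Array, filename?: string): Promise<TDoc[]>;
}

// Hypothetical helper: always forward a filename alongside the raw bytes,
// so the parsing service can infer the file type from the extension.
async function parseBuffer<TDoc>(
  reader: ContentReader<TDoc>,
  bytes: Uint8Array,
  filename: string,
): Promise<TDoc[]> {
  return reader.loadDataAsContent(bytes, filename);
}
```

With a real reader this would be called as `parseBuffer(reader, new Uint8Array(buffer), "somedoc.docx")`. The filename does not need to match a file on disk; only the extension matters for type detection under the assumption above.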

@jsmusgrave
Contributor Author

My use case is loading data from a bucket, so I have a buffer (unlike my simplified example above). That's why I'm using loadDataAsContent.

Ex:

const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

I don't believe there's a way to pass the filename to this call or to the LlamaParseReader params.

@himself65 himself65 reopened this Oct 22, 2024
@himself65
Member

I may have to hand this ticket over to the LlamaParse side; I cannot do anything more here.

@himself65
Member

himself65 commented Oct 22, 2024

/cc @hexapode

@himself65
Member

I think I should close this. I double-checked on StackBlitz that it should be working now.

LlamaParse had an internal upgrade that fixes this.

@himself65
Member

himself65 commented Oct 22, 2024

Please try this. If you have any more issues, please let me know.

https://stackblitz.com/edit/stackblitz-starters-k137wi?file=index.js

@jsmusgrave
Contributor Author

Awesome. Thank you! With default config it works.

The Multi-modal version fails:

Got Error Code: ERROR_DURING_PROCESSING and Error Message: An unknown error occurred during processing. Job id: fee42930-5832-45d1-a9a5-4e0ba126cf9d

@himself65
Member

Could you give me the parameters and maybe some sample data?

@jsmusgrave
Contributor Author

import { LlamaParseReader } from "llamaindex";
import fs from "fs";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";

export type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/sample.docx";

    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const vendorMultimodalApiKey = process.env.LI_ANTHROPIC_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any details of images and tables.  This is a document for a course and a very detailed description of the contents of the images is important.",
        fastMode: false,
        gpt4oMode: false,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        vendorMultimodalApiKey: vendorMultimodalApiKey,
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

// Log failures only; the original .catch(...).then(...) chain also logged "error undefined" on success.
main().catch((e) => {
    console.error("error", e);
});

Using this file:
https://ieeeaccess.ieee.org/wp-content/uploads/2022/01/Access-Template.docx

@himself65 himself65 reopened this Oct 24, 2024
@himself65
Member

I think this is the same issue: the docx is being parsed as a PDF.

@himself65 himself65 changed the title from "Llamaparse parsing for docx and odt doesn't work in 0.7.3" to "Llamaparse error when parsing docx file" Oct 24, 2024
@himself65
Member

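For now, there's a workaround:

// DOCX files are ZIP archives, so they start with the bytes "PK\x03\x04".
// Other ZIP-based formats (xlsx, pptx, odt) share this signature.
const magic = [80, 75, 3, 4];
const isZip = magic.every((byte, i) => buffer[i] === byte);
// Pass a filename so LlamaParse can tell the file type from the extension.
const documents = isZip
  ? await reader.loadDataAsContent(new Uint8Array(buffer), 'filename.docx')
  : await reader.loadDataAsContent(new Uint8Array(buffer));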
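
The magic-byte check in the workaround above could be factored into a small reusable helper. This is only a sketch: `sniffType` and `filenameHint` are hypothetical names, not part of llamaindex, and the placeholder filenames are assumptions. Because every ZIP-based format starts with the same four bytes, the caller still has to know that a ZIP payload is a docx rather than an xlsx, pptx, or odt file.

```typescript
type SniffedType = "zip" | "pdf" | "unknown";

const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46]; // "%PDF"

// True when the buffer begins with the given signature bytes.
function startsWith(bytes: Uint8Array, magic: readonly number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Classify a buffer by its leading bytes only.
function sniffType(bytes: Uint8Array): SniffedType {
  if (startsWith(bytes, ZIP_MAGIC)) return "zip";
  if (startsWith(bytes, PDF_MAGIC)) return "pdf";
  return "unknown";
}

// Hypothetical helper: choose a filename to pass as the second argument.
// zipName is caller-supplied because the ZIP signature alone cannot
// distinguish docx from other ZIP-based formats.
function filenameHint(bytes: Uint8Array, zipName = "document.docx"): string | undefined {
  switch (sniffType(bytes)) {
    case "zip":
      return zipName;
    case "pdf":
      return "document.pdf";
    default:
      return undefined;
  }
}
```

The result could then be forwarded as `reader.loadDataAsContent(bytes, filenameHint(bytes))`, assuming the optional second parameter shown in the workaround.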

@himself65 himself65 added the help wanted Extra attention is needed label Oct 30, 2024