Llamaparse error when parsing docx file #1363

jsmusgrave · 2024-10-22T03:02:48Z

LlamaParse parsing for docx doesn't work in 0.7.3. The same file parses fine via the web UI, which appears to use the public API.
I had hoped 1340 would address this, but it has not.

Demonstration code. (Change the file name and the api key env.)

import { LlamaParseReader } from "llamaindex";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";
import fs from "fs";

type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/somedoc.docx"
    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any images and tables. This is a document for a course and the contents of the images are important.",
        fastMode: false,
        gpt4oMode: true,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        // vendorMultimodalApiKey?: string | undefined;
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

main().catch(console.error);

himself65 · 2024-10-22T04:43:47Z

Checking this

himself65 · 2024-10-22T07:39:54Z

Fixed in #1364. As a workaround, pass the filename as the second parameter so that LlamaParse knows the correct file type.
For now, I don't want to detect the file type on the llamaindex-ts side, for performance and bundler considerations. You can use loadData(filePath) for the best compatibility.
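
A minimal sketch of that workaround for callers who only hold bytes in memory. It assumes the two-argument form `loadDataAsContent(content, filename)` introduced by #1364; `ContentReader`, `parseWithFilename`, and `stubReader` are hypothetical names, used here so the snippet runs without an API key.

```typescript
// Structural stand-in for the single LlamaParseReader method used here.
// Assumption: after #1364 the method accepts an optional filename as its second argument.
interface ContentReader<T> {
  loadDataAsContent(content: Uint8Array, filename?: string): Promise<T[]>;
}

// Hypothetical helper: forwards the bytes together with a filename so the
// service can infer the file type from the extension.
async function parseWithFilename<T>(
  reader: ContentReader<T>,
  content: Uint8Array,
  filename: string,
): Promise<T[]> {
  if (!/\.[A-Za-z0-9]+$/.test(filename)) {
    throw new Error(`filename "${filename}" has no extension to infer a type from`);
  }
  return reader.loadDataAsContent(content, filename);
}

// Stub reader for local experimentation; it only echoes what it received.
const stubReader: ContentReader<{ text: string }> = {
  async loadDataAsContent(content, filename) {
    return [{ text: `${filename ?? "(no filename)"}: ${content.length} bytes` }];
  },
};
```

With a real reader the call would presumably look like `await parseWithFilename(new LlamaParseReader(params), new Uint8Array(buffer), "somedoc.docx")`, provided the installed version includes the change from #1364.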

jsmusgrave · 2024-10-22T23:24:34Z

My use case is loading data from a bucket, so I have a buffer (unlike my simplified example above). That's why I'm using loadDataAsContent.

Ex:

const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

I don't believe there's a way to pass the filename to this call or to the LlamaParseReader params.

himself65 · 2024-10-22T23:33:03Z

Should I hand this ticket over to the LlamaParse side? There is nothing more I can do here.

himself65 · 2024-10-22T23:33:30Z

/cc @hexapode

himself65 · 2024-10-22T23:49:42Z

I think I should close this. I double-checked on StackBlitz and it should be working now.

LlamaParse had an internal upgrade that fixes this.

himself65 · 2024-10-22T23:50:13Z

Please try this. If you have any more issues, please let me know.

https://stackblitz.com/edit/stackblitz-starters-k137wi?file=index.js

jsmusgrave · 2024-10-23T00:26:27Z

Awesome. Thank you! With default config it works.

The Multi-modal version fails:

Got Error Code: ERROR_DURING_PROCESSING and Error Message: An unknown error occurred during processing. Job id: fee42930-5832-45d1-a9a5-4e0ba126cf9d

himself65 · 2024-10-23T00:32:04Z

Could you give me the parameters and maybe some sample data?

jsmusgrave · 2024-10-23T14:06:14Z

import { LlamaParseReader } from "llamaindex";
import fs from "fs";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";

export type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/sample.docx";

    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const vendorMultimodalApiKey = process.env.LI_ANTHROPIC_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any details of images and tables.  This is a document for a course and a very detailed description of the contents of the images is important.",
        fastMode: false,
        gpt4oMode: false,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        vendorMultimodalApiKey: vendorMultimodalApiKey,
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

// Log any failure once; a trailing .then() would also run on success and log "error undefined".
main().catch((e) => {
    console.error("error", e);
});

Using this file:
https://ieeeaccess.ieee.org/wp-content/uploads/2022/01/Access-Template.docx

himself65 · 2024-10-24T17:31:22Z

I think this is the same issue: the docx is being parsed as a PDF.

himself65 · 2024-10-24T17:37:47Z

For now there's a workaround:

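// "PK\x03\x04": ZIP signature; a .docx file is a ZIP container.
const magic = [80, 75, 3, 4];
let documents;
if (magic.every((byte, i) => buffer[i] === byte)) {
  // Pass a .docx filename so the file is not treated as a PDF.
  documents = await reader.loadDataAsContent(new Uint8Array(buffer), 'filename.docx');
} else {
  documents = await reader.loadDataAsContent(new Uint8Array(buffer));
}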
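
The magic-number check above can be factored into a small helper, sketched here under a few assumptions. The bytes 80, 75, 3, 4 are the ZIP local-file-header signature `PK\x03\x04`; docx is a ZIP container, but so are xlsx, pptx, and odt, so the signature alone cannot prove a file is docx. `looksLikeZip`, `filenameHint`, and the default name `document.docx` are hypothetical.

```typescript
// ZIP local-file-header signature: "PK\x03\x04" (the [80, 75, 3, 4] from the workaround).
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] as const;

// True when the buffer starts with the ZIP signature.
function looksLikeZip(bytes: Uint8Array): boolean {
  return bytes.length >= ZIP_MAGIC.length && ZIP_MAGIC.every((b, i) => bytes[i] === b);
}

// Hypothetical helper: choose the filename hint to send along with the bytes.
// A known name (for example, from a bucket object key) wins; otherwise
// ZIP-shaped data is assumed to be docx, which is only a guess.
function filenameHint(bytes: Uint8Array, knownName?: string): string | undefined {
  if (knownName) return knownName;
  return looksLikeZip(bytes) ? "document.docx" : undefined;
}
```

The result could then be passed as the second argument, as in `reader.loadDataAsContent(bytes, filenameHint(bytes, objectKeyName))`, where `objectKeyName` stands for whatever name the storage layer already knows. Preferring the known name avoids mislabeling other ZIP-based formats as docx.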

himself65 added the bug Something isn't working label Oct 22, 2024

himself65 self-assigned this Oct 22, 2024

himself65 mentioned this issue Oct 22, 2024

fix(cloud): allow filename in llama parse #1364

Merged

himself65 closed this as completed in #1364 Oct 22, 2024

himself65 reopened this Oct 22, 2024

himself65 closed this as completed Oct 22, 2024

himself65 reopened this Oct 24, 2024

himself65 changed the title ~~Llamaparse parsing for docx and odt doesn't work in 0.7.3~~ Llamaparse error when parsing docx file Oct 24, 2024

himself65 added the help wanted Extra attention is needed label Oct 30, 2024
