Replies: 1 comment
-
Here's how I did it: use the pre-processor to split the input document (the text file) into multiple documents (the documents to store). I store the name of the original input file in the metadata of each split document, and that metadata is returned in the pipeline's output, so you can pull the value out of the JSON and display it to the user. Note that the pipeline will truncate the input content if it encounters a document longer than the model's token limit.
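The idea above can be sketched in a few lines of plain Python. This is not Haystack's actual `PreProcessor` API — the function name, chunking-by-word-count, and the `meta` dict layout are all illustrative assumptions — but it shows the core trick: every chunk carries the source document's name, so retrieval results can be mapped back to the original file.

```python
def split_document(name, text, max_tokens=200):
    """Split one input document into sub-documents, keeping the source
    name in each chunk's metadata.

    Hypothetical helper for illustration only; in Haystack you would use
    the PreProcessor node instead. Splitting here is by whitespace word
    count, a rough stand-in for real tokenization.
    """
    words = text.split()
    chunks = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
    return [
        {
            "content": " ".join(chunk),
            # Provenance metadata: which file this chunk came from,
            # and its position within that file.
            "meta": {"source": name, "chunk": i},
        }
        for i, chunk in enumerate(chunks)
    ]
```

For example, a 450-word file split with `max_tokens=200` yields three sub-documents, each with `meta["source"]` set to the original filename, so the pipeline output can surface that value to the user.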
-
I am using an EmbeddingRetriever (ER) and storing embeddings in a FAISS store. The ER is associated with a sentence-transformer model that has a limit on input token length. Given this set-up, what happens when large documents are encountered? Ideally, a large document should be broken down into sub-documents, an embedding should be stored for each sub-document, and at retrieval time the results should be aware of which sub-document belongs to which document. Comments, anyone?