Replies: 1 comment
-
Here's how I did it: use the pre-processor to split the input document (the text file) into multiple documents (the documents to store). I store the name of the original input file in the metadata of each split document, and that metadata is returned in the pipeline's output, so you can pull the value out of the JSON and display it to the user. Note that the pipeline will truncate the input content if it encounters a document longer than the model's token limit.
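The idea above can be sketched in a few lines of plain Python. This is not Haystack's actual `PreProcessor` API — the function name, chunking-by-word-count, and the `meta` dict layout are all illustrative assumptions — but it shows the core trick: every chunk carries the source document's name, so retrieval results can be mapped back to the original file.

```python
def split_document(name, text, max_tokens=200):
    """Split one input document into sub-documents, keeping the source
    name in each chunk's metadata.

    Hypothetical helper for illustration only; in Haystack you would use
    the PreProcessor node instead. Splitting here is by whitespace word
    count, a rough stand-in for real tokenization.
    """
    words = text.split()
    chunks = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
    return [
        {
            "content": " ".join(chunk),
            # Provenance metadata: which file this chunk came from,
            # and its position within that file.
            "meta": {"source": name, "chunk": i},
        }
        for i, chunk in enumerate(chunks)
    ]
```

For example, a 450-word file split with `max_tokens=200` yields three sub-documents, each with `meta["source"]` set to the original filename, so the pipeline output can surface that value to the user.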
-
I am using an EmbeddingRetriever (ER) and storing embeddings in a FAISS store. The ER is associated with a sentence-transformer model that has a limit on input token length. Given this set-up, what happens when large documents are encountered? Ideally, a large document should be broken down into sub-documents, an embedding should be stored for each sub-document, and at retrieval time the results should be aware of which sub-document belongs to which document. Comments, anyone?