Best practices for limiting responses to a specific source document #95
In order to enrich this discussion, it would be beneficial to be able to include keywords that can aid in "directing" the response. Keywords: Cat, Cat Book, Meals.
Yes. Several options:
I'm unsure how the db is stored, but I don't think it's ordered by metadata, so if your db is big, filtering by your document may take a long time.
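To illustrate the concern in pure Python: without an index, restricting results to one document means scanning every stored chunk, whereas an inverted index built once at ingest time makes the lookup direct (Qdrant offers payload indexes for exactly this purpose). This is only a sketch; the `source_document` field name and the sample data are assumptions, not the repo's actual schema.

```python
# Hypothetical in-memory stand-in for the vector db's stored chunks.
points = [
    {"id": 1, "source_document": "cats.pdf", "text": "Cats purr."},
    {"id": 2, "source_document": "dogs.pdf", "text": "Dogs bark."},
    {"id": 3, "source_document": "cats.pdf", "text": "Cats nap."},
]

# Linear scan: touches every point, which is slow when the db is big.
scan_hits = [p for p in points if p["source_document"] == "cats.pdf"]

# Inverted index: build once at ingest time, then lookup is direct.
index = {}
for p in points:
    index.setdefault(p["source_document"], []).append(p)
indexed_hits = index.get("cats.pdf", [])

# Both strategies return the same chunks; only the cost differs.
assert scan_hits == indexed_hits
```

A payload index on the filter field in the real store gives the same effect without scanning the whole collection.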
Yes, that sounds like a good idea.
Some delightful ideas I would have to think about, thanks. For starters, I see no reason why we could not just pipe in:

```python
if selected_document is not None:
    # Restrict retrieval to the selected document via a metadata filter.
    # Note: LangChain's as_retriever() takes these options through
    # search_kwargs rather than as direct keyword arguments.
    retriever = self.qdrant_langchain.as_retriever(
        search_type="mmr",
        search_kwargs={
            "filter": {"source_document": selected_document},
            "k": n_forward_documents,
            "fetch_k": n_retrieve_documents,
        },
    )
else:
    retriever = self.qdrant_langchain.as_retriever(
        search_type="mmr",
        search_kwargs={"k": n_forward_documents, "fetch_k": n_retrieve_documents},
    )
```

This should not perform worse than any other approach (MMR compatibility still has to be checked, though). I think it should be possible to fetch a list of all available documents and set the filter on selection.
Nice, thanks for the quick suggestions. I'd probably go with the easiest for now (#95 (comment)). Another idea is to modify the … The challenge here is that many users may prefer to search across all their documents, which is not my specific use case.

Regardless of the approach, we will likely need a mechanism to keep track of which names have been ingested. This could potentially be achieved by creating another table in Qdrant for storing global metadata.

I'd love to collaborate, but my plate is pretty full these days; I usually sneak in some time for AI projects at night. Really appreciate what you're doing here!
Hi, thanks for the contribution.
I have been using your repository to train a model on a collection of books. My goal is to generate answers that are specific to a single source document, essentially using the model as an assistant that draws information from one selected book at a time (such as "cats.pdf").
Initially, I attempted to implement this by modifying the prompts, but the results were inconsistent, and the model sometimes used information from other sources. Here's an example of how I structured the prompts:
Seems like ingest.py adds the source path to the doc metadata. However, when a question is asked, the model retrieves the most relevant documents based on the semantic similarity between the query's embedding and the documents' embeddings, not a specific document identifier. The model does not consider the document's metadata (like its source path) during retrieval, which means it can't be instructed to refer to a specific document just by mentioning the document's name or identifier in the prompt (?).
Considering this, I'm evaluating the option of creating a dropdown menu that lists all the books I've trained the model on. When a book is selected from this menu, I would swap the databases to only include documents from the selected book when a query is made.
With that context, I have a few questions:
Thanks for your time & I'd appreciate your insights.
PS: Adding this under docs, because it might be a result of my lack of understanding of how everything works together.