
Contextual Retrieval #17367

Open

cklapperich wants to merge 22 commits into main
Conversation

@cklapperich commented Dec 25, 2024

Description

Document Context Retrieval

This feature adds a llama_index implementation of "contextual retrieval".

It adds a new Extractor class, DocumentContextExtractor, which can be used in an ingestion pipeline. It requires a document store and an LLM to provide the context, and it requires that you keep the document store up to date.

Motivation

Contextual retrieval is a cool technique to boost RAG accuracy, but no production-ready code exists that handles all the gnarly edge cases.

Anthropic made a 'cookbook' notebook. llama_index also made a demo of it here.

Both are cool, but neither scales to hundreds of documents because of:

  • rate limits
  • cost
  • no way to use it in a pipeline
  • documents too large for the context window
  • prompt caching that doesn't work via the llama_index interface (this may have been fixed very recently)
  • missing error handling and warnings
  • chunk + context can be too big for the embedding model, so you need some control over that (see the sketch after this list)
  • and more!
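
To illustrate that last sizing point, here is a minimal sketch (not code from this PR; the tokenizer choice and the 8192-token budget are assumptions):

import tiktoken

def fit_chunk_plus_context(chunk: str, context: str, max_tokens: int = 8192) -> str:
    """Truncate the generated context so chunk + context fits an embedding model's limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = max_tokens - len(enc.encode(chunk))
    context_tokens = enc.encode(context)
    if len(context_tokens) > budget:
        # Trim the generated context, never the chunk itself
        context_tokens = context_tokens[: max(budget, 0)]
        context = enc.decode(context_tokens)
    return f"{context}\n\n{chunk}" if context else chunk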

Usage

# Imports assume the current llama_index package layout; adjust to your install.
# vector_store, index_store, embed_model, text_splitter, and directory are
# assumed to be defined elsewhere.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.extractors import DocumentContextExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.llms.openrouter import OpenRouter

docstore = SimpleDocumentStore()

llm = OpenRouter(model="openai/gpt-4o-mini")

# initialize the extractor
extractor = DocumentContextExtractor(docstore, llm)

storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    docstore=docstore,
    index_store=index_store,
)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
    storage_context=storage_context,
    transformations=[text_splitter, extractor],
)

reader = SimpleDirectoryReader(directory)
documents = reader.load_data()

# The docstore has to be kept updated for the DocumentContextExtractor to function.
storage_context.docstore.add_documents(documents)
for doc in documents:
    index.insert(doc)
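
Once ingestion has run, the "context" key lands in each node's metadata, so querying works like any other llama_index index (standard usage, nothing specific to this PR; the question string is made up):

query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What were the key findings?")
print(response)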

Custom keys and prompts

By default, the extractor adds a key called "context" to each node, using a reasonable default prompt taken from the blog post cookbook, but you can pass in your own key and prompt like so:

extractor = DocumentContextExtractor(docstore, llm, key="context", prompt="Give the document context")

Type of Change


  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?


core/tests/extractors/test_document_context.py

  • I added new unit tests to cover this change

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@cklapperich marked this pull request as draft December 25, 2024
@cklapperich marked this pull request as ready for review January 1, 2025
@logan-markewich self-assigned this Jan 2, 2025
Review thread on:

    continue

    # Count as failure for all other exceptions
    self.failed_nodes.add(node.node_id)
@logan-markewich (Collaborator): By storing this on self, we make the extractor stateful -- it might be better to avoid this? Or find a way to make multiple runs on the same extractor object possible?

@cklapperich (Author): Yeah, true. I'm not sure how else to do this, though. Ideas?

@logan-markewich (Collaborator): Can't we pass some list between functions instead of having it on self?

@cklapperich (Author): So I thought about this. The failed-nodes list just seemed like a useful convenience. The function aextract truly is stateless: it doesn't use the failed_nodes list in any way. It's sort of just a logging thing, I guess? I could remove it.

@cklapperich (Author), Jan 5, 2025:

> Can't we pass some list between functions instead of having it on self?

Yeah, we could, but then how do we give it back to the user? Although I'm also not sure how useful this even is as a feature. I might just kill it?

@cklapperich (Author): No, but I do use it to check whether context was successfully generated for all nodes. There are probably other ways to do that too, but I'd still need to store state! I guess the user could also just write a function that iterates over all their nodes and checks whether they all have a 'context' metadata key... hmm.

My notebook does use the extractor.is_job_complete() function, which depends on self._failed_nodes.

@cklapperich (Author): Thoughts on this one, @logan-markewich?
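
For what it's worth, one way to keep aextract stateless while still surfacing failures is to collect them locally and hand them back to the caller. A hypothetical sketch (generate_context and the return shape are illustrative, not this PR's API):

from typing import Dict, List, Sequence, Tuple

# Hypothetical sketch -- imagined as a method on a DocumentContextExtractor-like class:
async def aextract(self, nodes: Sequence) -> Tuple[List[Dict], List[str]]:
    """Return per-node metadata plus the ids of nodes whose context generation failed."""
    metadata_list: List[Dict] = []
    failed_node_ids: List[str] = []  # local to this call, not stored on self
    for node in nodes:
        try:
            context = await self.generate_context(node)  # hypothetical helper
            metadata_list.append({self.key: context})
        except Exception:
            # Count as failure for all other exceptions, but record it locally
            failed_node_ids.append(node.node_id)
            metadata_list.append({})
    return metadata_list, failed_node_ids

The caller gets the failure list back directly and the extractor stays reusable across runs; the tradeoff is that the return shape no longer matches the base extractor's plain List[Dict] contract.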

Review thread on:

    if self.key in node.metadata:
        continue

    doc: Optional[Union[Node, TextNode]] = await self._get_document(
@logan-markewich (Collaborator), Jan 2, 2025: There are potentially many nodes for one document. Doesn't this for-loop mean that we are potentially getting the same document many times? (I might be wrong!)

@cklapperich (Author): Yes, but there's an LRU cache. But then I changed it so the list is always sorted after running sorting benchmarks, so this could probably be better.

@cklapperich (Author): Okay, so self._get_document() has an LRU cache of size 10, which might be too big. There are some very rare edge cases where two different documents can be interleaved. Remember, this is fully async!! Every node is its own task, BUT nodes are grouped by document, and asyncio will MOSTLY run things in the order the jobs were submitted.

You can theoretically, sometimes, get Doc A -> Doc B -> Doc A. We could probably set the cache size to 1 or 2 and be fine, but I think this code is otherwise good how it is!
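
For illustration, the caching idea under discussion might look like this small LRU wrapper (a sketch, not the PR's implementation; the class name and cache size are assumptions, and docstore.aget_document is the async lookup assumed here):

from collections import OrderedDict

class CachedDocstoreLookup:
    """Tiny LRU cache in front of an async docstore lookup (illustrative sketch)."""

    def __init__(self, docstore, maxsize: int = 2):
        self.docstore = docstore
        self.maxsize = maxsize
        self._cache = OrderedDict()

    async def get_document(self, doc_id: str):
        if doc_id in self._cache:
            self._cache.move_to_end(doc_id)  # cache hit: mark as most recently used
            return self._cache[doc_id]
        doc = await self.docstore.aget_document(doc_id)  # cache miss: fetch from docstore
        self._cache[doc_id] = doc
        if len(self._cache) > self.maxsize:
            self._cache.popitem(last=False)  # evict the least recently used document
        return doc

With maxsize=2, the Doc A -> Doc B -> Doc A interleaving stays entirely in cache, since both documents fit at once.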
