
Contextual Retrieval #17367

Open

cklapperich wants to merge 22 commits into main
Conversation

@cklapperich commented Dec 25, 2024

Description

Document Context Retrieval

This feature adds a llama_index implementation of "contextual retrieval".

It adds a new Extractor class, DocumentContextExtractor, which can be used in an ingestion pipeline. It requires a document store and an LLM to provide the context, and it requires that you keep the document store up to date.

Motivation

Contextual retrieval is a cool technique to boost RAG accuracy, but no production-ready code exists that handles all the gnarly edge cases.

Anthropic made a 'cookbook' notebook. llama_index also made a demo of it here.

Both are cool, but neither scales to hundreds of documents because of:

  • rate limits
  • cost
  • no way to use it in a pipeline
  • documents too large for the context window
  • prompt caching that doesn't work via the llama_index interface (this may have been fixed very recently)
  • missing error handling and warnings
  • chunk + context can be too big for the embedding model, so you need some control over that (see the sketch after this list)
  • and more!
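
To illustrate that last sizing point, here is a minimal sketch (not code from this PR; the tokenizer choice and the 8192-token budget are assumptions):

import tiktoken

def fit_chunk_plus_context(chunk: str, context: str, max_tokens: int = 8192) -> str:
    """Truncate the generated context so chunk + context fits an embedding model's limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = max_tokens - len(enc.encode(chunk))
    context_tokens = enc.encode(context)
    if len(context_tokens) > budget:
        # Trim the generated context, never the chunk itself
        context_tokens = context_tokens[: max(budget, 0)]
        context = enc.decode(context_tokens)
    return f"{context}\n\n{chunk}" if context else chunk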

Usage

# Imports assume the current llama_index package layout; adjust to your install.
# vector_store, index_store, embed_model, text_splitter, and directory are
# assumed to be defined elsewhere.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.extractors import DocumentContextExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.llms.openrouter import OpenRouter

docstore = SimpleDocumentStore()

llm = OpenRouter(model="openai/gpt-4o-mini")

# initialize the extractor
extractor = DocumentContextExtractor(docstore, llm)

storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    docstore=docstore,
    index_store=index_store,
)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
    storage_context=storage_context,
    transformations=[text_splitter, extractor],
)

reader = SimpleDirectoryReader(directory)
documents = reader.load_data()

# The docstore has to be kept updated for the DocumentContextExtractor to function.
storage_context.docstore.add_documents(documents)
for doc in documents:
    index.insert(doc)
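
Once ingestion has run, the "context" key lands in each node's metadata, so querying works like any other llama_index index (standard usage, nothing specific to this PR; the question string is made up):

query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What were the key findings?")
print(response)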

Custom keys and prompts

By default, the extractor adds a key called "context" to each node, using a reasonable default prompt taken from the blog post cookbook, but you can pass in your own key and prompt like so:

extractor = DocumentContextExtractor(docstore, llm, key="context", prompt="Give the document context")

Type of Change


  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?


core/tests/extractors/test_document_context.py

  • I added new unit tests to cover this change

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@cklapperich marked this pull request as draft December 25, 2024
@cklapperich marked this pull request as ready for review January 1, 2025
@logan-markewich self-assigned this Jan 2, 2025
Review thread on:

    continue

    # Count as failure for all other exceptions
    self.failed_nodes.add(node.node_id)
@logan-markewich (Collaborator): By storing this on self, we make the extractor stateful -- it might be better to avoid this? Or find a way to make multiple runs on the same extractor object possible?

@cklapperich (Author): Yeah, true. I'm not sure how else to do this, though. Ideas?

@logan-markewich (Collaborator): Can't we pass some list between functions instead of having it on self?

@cklapperich (Author): So I thought about this. The failed-nodes list just seemed like a useful convenience. The function aextract truly is stateless: it doesn't use the failed_nodes list in any way. It's sort of just a logging thing, I guess? I could remove it.

@cklapperich (Author), Jan 5, 2025:

> Can't we pass some list between functions instead of having it on self?

Yeah, we could, but then how do we give it back to the user? Although I'm also not sure how useful this even is as a feature. I might just kill it?

@cklapperich (Author): No, but I do use it to check whether context was successfully generated for all nodes. There are probably other ways to do that too, but I'd still need to store state! I guess the user could also just write a function that iterates over all their nodes and checks whether they all have a 'context' metadata key... hmm.

My notebook does use the extractor.is_job_complete() function, which depends on self._failed_nodes.

@cklapperich (Author): Thoughts on this one, @logan-markewich?
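
For what it's worth, one way to keep aextract stateless while still surfacing failures is to collect them locally and hand them back to the caller. A hypothetical sketch (generate_context and the return shape are illustrative, not this PR's API):

from typing import Dict, List, Sequence, Tuple

# Hypothetical sketch -- imagined as a method on a DocumentContextExtractor-like class:
async def aextract(self, nodes: Sequence) -> Tuple[List[Dict], List[str]]:
    """Return per-node metadata plus the ids of nodes whose context generation failed."""
    metadata_list: List[Dict] = []
    failed_node_ids: List[str] = []  # local to this call, not stored on self
    for node in nodes:
        try:
            context = await self.generate_context(node)  # hypothetical helper
            metadata_list.append({self.key: context})
        except Exception:
            # Count as failure for all other exceptions, but record it locally
            failed_node_ids.append(node.node_id)
            metadata_list.append({})
    return metadata_list, failed_node_ids

The caller gets the failure list back directly and the extractor stays reusable across runs; the tradeoff is that the return shape no longer matches the base extractor's plain List[Dict] contract.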

Review thread on:

    if self.key in node.metadata:
        continue

    doc: Optional[Union[Node, TextNode]] = await self._get_document(
@logan-markewich (Collaborator), Jan 2, 2025: There are potentially many nodes for one document. Doesn't this for-loop mean that we are potentially getting the same document many times? (I might be wrong!)

@cklapperich (Author): Yes, but there's an LRU cache. But then I changed it so the list is always sorted after running sorting benchmarks, so this could probably be better.

@cklapperich (Author): Okay, so self._get_document() has an LRU cache of size 10, which might be too big. There are some very rare edge cases where two different documents can be interleaved. Remember, this is fully async!! Every node is its own task, BUT nodes are grouped by document, and asyncio will MOSTLY run things in the order the jobs were submitted.

You can theoretically, sometimes, get Doc A -> Doc B -> Doc A. We could probably set the cache size to 1 or 2 and be fine, but I think this code is otherwise good how it is!
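
For illustration, the caching idea under discussion might look like this small LRU wrapper (a sketch, not the PR's implementation; the class name and cache size are assumptions, and docstore.aget_document is the async lookup assumed here):

from collections import OrderedDict

class CachedDocstoreLookup:
    """Tiny LRU cache in front of an async docstore lookup (illustrative sketch)."""

    def __init__(self, docstore, maxsize: int = 2):
        self.docstore = docstore
        self.maxsize = maxsize
        self._cache = OrderedDict()

    async def get_document(self, doc_id: str):
        if doc_id in self._cache:
            self._cache.move_to_end(doc_id)  # cache hit: mark as most recently used
            return self._cache[doc_id]
        doc = await self.docstore.aget_document(doc_id)  # cache miss: fetch from docstore
        self._cache[doc_id] = doc
        if len(self._cache) > self.maxsize:
            self._cache.popitem(last=False)  # evict the least recently used document
        return doc

With maxsize=2, the Doc A -> Doc B -> Doc A interleaving stays entirely in cache, since both documents fit at once.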
