-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Contextual Retrieval #17367
base: main
Are you sure you want to change the base?
Contextual Retrieval #17367
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
docs/docs/examples/metadata_extraction/DocumentContextExtractor.ipynb
Outdated
Show resolved
Hide resolved
llama-index-core/llama_index/core/extractors/document_context.py
Outdated
Show resolved
Hide resolved
llama-index-core/llama_index/core/extractors/document_context.py
Outdated
Show resolved
Hide resolved
continue | ||
|
||
# Count as failure for all other exceptions | ||
self.failed_nodes.add(node.node_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By storing this on self
, we make the extractor stateful -- It might be better to avoid this? Or find a way to make multiple runs on the same extractor object possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah true. im not sure how else to do this though. ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass some list between functions instead of having it on self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I thought about this. The failed nodes just seemed like a useful convenience. The function aextract truly is stateless: it doesn't use the failed_nodes list in any way. its sort of just a logging thing I guess?
I could remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass some list between functions instead of having it on self ?
yeah we could but then how do we give it back to the user? Although im also not sure how useful this even is as a feature. I might just kill it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but I do use it to check if it successfully generated context for all nodes. but there's probably other ways to do that too. but I'd still need to store state! I guess the user could also just write a function to iterate over all their nodes and check if they all have a 'context' metadata key... hmm.
My notebook does use the extractor.is_job_complete() function which depends on self._failed_nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on this one @logan-markewich ?
if self.key in node.metadata: | ||
continue | ||
|
||
doc: Optional[Union[Node, TextNode]] = await self._get_document( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is potentially many nodes for one document. Doesn't this for-loop mean that we are potentially getting the same document many times? (I might be wrong!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but there's an lru cache. but then I changed it so the list is always sorted after running sorting benchmarks. so this could probably be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so the self._get_document() has an lru cache of size 10. which might be too big. there are some very rare edge cases where 2 different documents can be interleaved,
remember this is fully async!! every node is its own task. BUT nodes are grouped by document, and asyncio will MOSTLY run things in the order the jobs were submitted.
You can theoretically, sometimes, get Doc A -> Doc B -> Doc A
we could probably set the cache size to 1 or 2 and be fine. but I think this code is otherwise good how it is!
…/llama_index into contextual_retrieval
Description
Document Context Retrieval
This feature adds a llama_index implementation of "contextual retrieval"
It adds a new Extractor class, DocumentContextExtractor, which can be used in a pipeline. It requires a Document Store and an LLM to provide the context. It also requires you keep the documentstore up to date.
motivation
Contextual retrieval is a cool technique to boost RAG accuracy, but no production-ready code exists that handles all the gnarly nasty edge cases.
Anthropic made a 'cookbook' notebook. llama_index also made a demo of it here
Both are cool, neither scale to 100s of documents because:
rate limits
cost
cant use in pipeline
documents too large for context window
prompt caching doesn't work via llama_index interface (this may have been very recently fixed)
error handling and warnings
chunk + context can be too big for the embedding model so you need some control over that
and more!
Usage
custom keys and prompts
by default, the extractor adds a key called "context" to each node, using a reasonable default prompt taken from the blog post cookbook, but you can pass in a list of keys and prompts like so:
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
core/tests/extractors/test_document_context.py
Suggested Checklist:
make format; make lint
to appease the lint gods