Add dense proposals (#14719)
Indexing strategy based on decomposing documents into candidate propositions at
indexing time.
hinthornw authored Dec 14, 2023
1 parent bc3ec78 commit 79ae6c2
Showing 14 changed files with 3,371 additions and 0 deletions.
3 changes: 3 additions & 0 deletions templates/rag-chroma-dense-retrieval/.gitignore
@@ -0,0 +1,3 @@
docs/img_*.jpg
chroma_db_proposals
multi_vector_retriever_metadata
21 changes: 21 additions & 0 deletions templates/rag-chroma-dense-retrieval/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
81 changes: 81 additions & 0 deletions templates/rag-chroma-dense-retrieval/README.md
@@ -0,0 +1,81 @@
# rag-chroma-dense-retrieval

This template demonstrates the multi-vector indexing strategy proposed in Chen et al.'s [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/abs/2312.06648). The prompt, which you can [try out on the hub](https://smith.langchain.com/hub/wfh/proposal-indexing), directs an LLM to generate de-contextualized "propositions" that can be vectorized to increase retrieval accuracy. You can see the full definition in `proposal_chain.py`.

![Retriever Diagram](./_images/retriever_diagram.png)
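
The proposition chain is exported from the package, so you can experiment with it on its own. Below is a minimal sketch (the passage is illustrative; it assumes, as `ingest.py` does, that the chain takes `{"input": <passage>}` and returns a list of proposition strings):

```python
from rag_chroma_dense_retrieval.proposal_chain import proposition_chain

passage = (
    "The Transformer relies entirely on attention mechanisms, "
    "dispensing with recurrence and convolutions."
)
# ingest.py calls the chain with {"input": ...}; each element of the result
# is expected to be a standalone, de-contextualized proposition.
propositions = proposition_chain.invoke({"input": passage})
for prop in propositions:
    print(prop)
```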

## Storage

For this demo, we index a simple academic paper using the `RecursiveUrlLoader` and store all retriever information locally, using Chroma as the vector store and a byte store on the local filesystem. You can modify the storage layer in `storage.py`.
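
`storage.py` itself is not rendered in this view. As a hedged sketch only (not the template's actual implementation), a compatible `get_multi_vector_retriever` could look roughly like the following, assuming a persistent Chroma collection and a local file store whose paths match the `.gitignore` entries above; the collection name is an assumption:

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import LocalFileStore
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


def get_multi_vector_retriever(docstore_id_key: str) -> MultiVectorRetriever:
    """Hypothetical sketch of a local Chroma + file-store retriever."""
    vectorstore = Chroma(
        collection_name="proposals",  # assumed name
        persist_directory="chroma_db_proposals",
        embedding_function=OpenAIEmbeddings(),
    )
    # Parent documents are kept in a byte store on the local filesystem.
    byte_store = LocalFileStore("multi_vector_retriever_metadata")
    return MultiVectorRetriever(
        vectorstore=vectorstore,
        byte_store=byte_store,
        id_key=docstore_id_key,
    )
```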

## Environment Setup

Set the `OPENAI_API_KEY` environment variable to access the OpenAI chat models and embeddings used by this template.

## Indexing

Create the index by running the following:

```shell
poetry install
poetry run python rag_chroma_dense_retrieval/ingest.py
```
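
Under the hood, `ingest.py` loads a paper, splits it, generates propositions with `proposition_chain`, and writes everything to the local stores via `create_index`. If you want to index your own documents programmatically, a sketch using those same helpers (the example `Document` is hypothetical; note that `add_documents` derives IDs from `metadata["source"]`, so that field must be set):

```python
from langchain_core.documents import Document

from rag_chroma_dense_retrieval.constants import DOCSTORE_ID_KEY
from rag_chroma_dense_retrieval.ingest import create_index
from rag_chroma_dense_retrieval.proposal_chain import proposition_chain

# Each document needs a "source" in its metadata; it is used to derive the
# stable doc ID that links propositions back to their parent document.
docs = [
    Document(
        page_content="Propositions are minimal, self-contained statements...",
        metadata={"source": "my-notes-001"},
    )
]
retriever = create_index(docs, proposition_chain, DOCSTORE_ID_KEY)
```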

## Usage

To use this package, you should first have the LangChain CLI installed:

```shell
pip install -U langchain-cli
```

To create a new LangChain project and install this as the only package, you can do:

```shell
langchain app new my-app --package rag-chroma-dense-retrieval
```

If you want to add this to an existing project, you can just run:

```shell
langchain app add rag-chroma-dense-retrieval
```

And add the following code to your `server.py` file:

```python
from rag_chroma_dense_retrieval import chain

add_routes(app, chain, path="/rag-chroma-dense-retrieval")
```

(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor, and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```

If you are inside this directory, then you can spin up a LangServe instance directly by running:

```shell
langchain serve
```

This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).

We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
We can access the playground at [http://127.0.0.1:8000/rag-chroma-dense-retrieval/playground](http://127.0.0.1:8000/rag-chroma-dense-retrieval/playground).

We can access the template from code with:

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/rag-chroma-dense-retrieval")
```
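
You can then send it a question; the query below is the example used in the template's notebook:

```python
runnable.invoke("How are transformers related to convolutional neural networks?")
```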
2,859 changes: 2,859 additions & 0 deletions templates/rag-chroma-dense-retrieval/poetry.lock

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions templates/rag-chroma-dense-retrieval/pyproject.toml
@@ -0,0 +1,35 @@
[tool.poetry]
name = "rag-chroma-dense-retrieval"
version = "0.1.0"
description = "Dense retrieval using vectorized propositions."
authors = [
"William Fu-Hinthorn <[email protected]>",
]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.350"
openai = "<2"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
bs4 = "^0.0.1"

[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"

[tool.langserve]
export_module = "rag_chroma_dense_retrieval"
export_attr = "chain"

[tool.templates-hub]
use-case = "rag"
author = "LangChain"
integrations = ["OpenAI", "Chroma"]
tags = ["vectordbs"]

[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,68 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Run Template\n",
"\n",
"In `server.py`, set -\n",
"```\n",
"from fastapi import FastAPI\n",
"from langserve import add_routes\n",
"from rag_chroma_dense_retrieval import chain\n",
"\n",
"app = FastAPI(\n",
" title=\"LangChain Server\",\n",
" version=\"1.0\",\n",
" description=\"Retriever and Generator for RAG Chroma Dense Retrieval\",\n",
")\n",
"\n",
"add_routes(app, chain, path=\"/rag-chroma-dense-retrieval\")\n",
"\n",
"if __name__ == \"__main__\":\n",
" import uvicorn\n",
"\n",
" uvicorn.run(app, host=\"localhost\", port=8000)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"rag_app = RemoteRunnable(\"http://localhost:8001/rag-chroma-dense-retrieval\")\n",
"rag_app.invoke(\"How are transformers related to convolutional neural networks?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
4 changes: 4 additions & 0 deletions templates/rag-chroma-dense-retrieval/rag_chroma_dense_retrieval/__init__.py
@@ -0,0 +1,4 @@
from rag_chroma_dense_retrieval.chain import chain
from rag_chroma_dense_retrieval.proposal_chain import proposition_chain

__all__ = ["chain", "proposition_chain"]
67 changes: 67 additions & 0 deletions templates/rag-chroma-dense-retrieval/rag_chroma_dense_retrieval/chain.py
@@ -0,0 +1,67 @@
from langchain_community.chat_models import ChatOpenAI
from langchain_core.load import load
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnablePassthrough

from rag_chroma_dense_retrieval.constants import DOCSTORE_ID_KEY
from rag_chroma_dense_retrieval.storage import get_multi_vector_retriever


def format_docs(docs: list) -> str:
    loaded_docs = [load(doc) for doc in docs]
    return "\n".join(
        [
            f"<Document id={i}>\n{doc.page_content}\n</Document>"
            for i, doc in enumerate(loaded_docs)
        ]
    )


def rag_chain(retriever):
    """
    The RAG chain.

    :param retriever: A retriever that fetches the necessary context for the model.
    :return: A runnable representing the RAG process.
    """
    model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview", max_tokens=1024)
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an AI assistant. Answer based on the retrieved documents:"
                "\n<Documents>\n{context}\n</Documents>",
            ),
            ("user", "{question}?"),
        ]
    )

    # Define the RAG pipeline
    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | model
        | StrOutputParser()
    )

    return chain


# Create the multi-vector retriever
retriever = get_multi_vector_retriever(DOCSTORE_ID_KEY)

# Create RAG chain
chain = rag_chain(retriever)


# Add typing for input
class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)
1 change: 1 addition & 0 deletions templates/rag-chroma-dense-retrieval/rag_chroma_dense_retrieval/constants.py
@@ -0,0 +1 @@
DOCSTORE_ID_KEY = "doc_id"
87 changes: 87 additions & 0 deletions templates/rag-chroma-dense-retrieval/rag_chroma_dense_retrieval/ingest.py
@@ -0,0 +1,87 @@
import logging
import uuid
from typing import Sequence

from langchain_core.documents import Document
from langchain_core.runnables import Runnable

from rag_chroma_dense_retrieval.constants import DOCSTORE_ID_KEY
from rag_chroma_dense_retrieval.proposal_chain import proposition_chain
from rag_chroma_dense_retrieval.storage import get_multi_vector_retriever

logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)


def add_documents(
    retriever,
    propositions: Sequence[Sequence[str]],
    docs: Sequence[Document],
    id_key: str = DOCSTORE_ID_KEY,
):
    doc_ids = [
        str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.metadata["source"])) for doc in docs
    ]
    prop_docs = [
        Document(page_content=prop, metadata={id_key: doc_ids[i]})
        for i, props in enumerate(propositions)
        for prop in props
        if prop
    ]
    retriever.vectorstore.add_documents(prop_docs)
    retriever.docstore.mset(list(zip(doc_ids, docs)))


def create_index(
    docs: Sequence[Document],
    indexer: Runnable,
    docstore_id_key: str = DOCSTORE_ID_KEY,
):
    """
    Create a retriever that indexes docs and their propositions.

    :param docs: Documents to index
    :param indexer: Runnable that generates the propositions for each doc
    :param docstore_id_key: Metadata key under which to store the docstore ID
    :return: The populated multi-vector retriever
    """
    logger.info("Creating multi-vector retriever")
    retriever = get_multi_vector_retriever(docstore_id_key)
    propositions = indexer.batch([{"input": doc.page_content} for doc in docs])

    add_documents(
        retriever,
        propositions,
        docs,
        id_key=docstore_id_key,
    )

    return retriever


if __name__ == "__main__":
    # For our example, we'll load docs from the web
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # noqa
    from langchain_community.document_loaders.recursive_url_loader import (
        RecursiveUrlLoader,
    )  # noqa

    # The "Attention Is All You Need" paper
    # Could add more parsing here, as the loaded HTML is very raw.
    loader = RecursiveUrlLoader("https://ar5iv.labs.arxiv.org/html/1706.03762")
    data = loader.load()
    logger.info(f"Loaded {len(data)} documents")

    # Split the documents into chunks before generating propositions
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=0)
    all_splits = text_splitter.split_documents(data)
    logger.info(f"Split into {len(all_splits)} documents")

    # Create the multi-vector retriever and index the splits
    retriever_multi_vector = create_index(
        all_splits,
        proposition_chain,
        DOCSTORE_ID_KEY,
    )
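
A quick sanity check one might append to the `__main__` block after indexing (the query string is illustrative and not part of the template):

```python
    # Retrieve the parent documents for a proposition-level query.
    found = retriever_multi_vector.get_relevant_documents(
        "What is multi-head attention?"
    )
    logger.info(f"Retrieved {len(found)} parent documents")
```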
