-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Question]: PrivateGPT only uses cosine similarity with PgVector even with HNSW index added #17311
Comments
Hey @muzammilvd! 👋 I'm here to assist you with bug fixes, answer questions, and even guide you on how to become a contributor. Let's tackle this together! To switch the default operator to Euclidean distance and ensure HNSW indexing is used for similarity searches using PgVector, you need to modify the hnsw_dist_method = self.hnsw_kwargs.pop("hnsw_dist_method", "vector_l2_ops")
index_name = f"{self._table_class.__tablename__}_embedding_idx"
with self._session() as session, session.begin():
statement = sqlalchemy.text(
f"CREATE INDEX IF NOT EXISTS {index_name} ON {self.schema_name}.{self._table_class.__tablename__} USING hnsw (embedding {hnsw_dist_method}) WITH (m = {hnsw_m}, ef_construction = {hnsw_ef_construction})"
)
session.execute(statement)
session.commit() Make sure your To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
Hi, thanks for your response and I have a question that as I am using VectorIndexRetriever() to create my retriever so is there any option in VectorIndexRetriever() to use HNSW as by default it is using cosine. here is the retriever def get_retriever(
self,
index: VectorStoreIndex,
context_filter: ContextFilter | None = None,
similarity_top_k: int = 2,
) -> VectorIndexRetriever:
# This way we support qdrant (using doc_ids) and the rest (using filters)
return VectorIndexRetriever(
index=index,
similarity_top_k=similarity_top_k,
doc_ids=context_filter.docs_ids if context_filter else None,
filters=(
_doc_id_metadata_filter(context_filter)
if self.settings.vectorstore.database != "qdrant"
else None
),
) |
The Here's a general approach to using HNSW with a
Here's a simplified example: from llama_index.vector_stores.hnswlib import HnswlibVectorStore
from llama_index.core import VectorStoreIndex
# Configure HNSW vector store
hnsw_vector_store = HnswlibVectorStore.from_params(
space="ip", # or "l2" for Euclidean distance
dimension=1536,
max_elements=1000,
)
# Create the index
index = VectorStoreIndex(vector_store=hnsw_vector_store)
# Initialize the retriever
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=2,
# other parameters as needed
) Ensure that your vector store is set up to use the desired distance metric (e.g., Euclidean distance) by configuring it appropriately when you create the vector store instance. To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
So you suggest that I need to change the vectore_store instance and use HnswlibVectorStore but I am using Pgvector database where i have data with states that I mentioned in my question. Here is the current vector_store that using PGVectorStore self.vector_store = typing.cast(
BasePydanticVectorStore,
PGVectorStore.from_params(
**settings.postgres.model_dump(exclude_none=True),
table_name="embeddings",
embed_dim=settings.embedding.embed_dim,
),
) So what should I do and as my I understanding you are suggesting to change the db vector_store method so how it will use the pgvector db and its data? |
To use HNSW indexing with Euclidean distance in your from sqlalchemy import make_url
# Assuming you have a connection string and settings
url = make_url(settings.postgres.connection_string)
self.vector_store = PGVectorStore.from_params(
database=settings.postgres.db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="embeddings",
embed_dim=settings.embedding.embed_dim,
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_l2_ops", # Use Euclidean distance
},
) This configuration sets up the To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
I implemented above code that you shared and I testing but it still using cosine as I examined the DB logs where I found a query that is like SELECT private_gpt.data_embeddings.id, private_gpt.data_embeddings.node_id, private_gpt.data_embeddings.text, private_gpt.data_embeddings.metadata_, private_gpt.data_embeddings.embedding <=> '[-0.46123987436294556,-0.6934341192245483,-3.45989727973938,-0.1854298710823059,0.6372467279434204,0.9199901819229126,0.4315108358860016,0.08280982077121735,0.428688645362854,...........] AS distance
FROM private_gpt.data_embeddings ORDER BY distance asc
LIMIT 2 After analysing DB logs and I shared above some you can see |
To ensure that the Here are a few steps to troubleshoot and ensure the correct setup:
If these steps do not resolve the issue, it might be necessary to review the implementation details of the To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
Here the updated self.vector_store = typing.cast(
BasePydanticVectorStore,
PGVectorStore.from_params(
**settings.postgres.model_dump(exclude_none=True),
table_name="embeddings",
embed_dim=settings.embedding.embed_dim,
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_l2_ops", # Use Euclidean distance
},
),
) As I provided you my retriever and it responsible to run sql query to do similarity search in vector so my question is that can we run sql query manually by code like I want to run raw sql query where I will do query with |
Yes, you can manually run a raw SQL query using the SELECT * FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 5; This query orders the results by their similarity to a given vector, represented by To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
Yes but my question was that can i run raw query by code and get expected object that PGVectorStore or VectorStoreIndex or my retriever. I want like all my process remain same just want to execute sql query by code for Euclidean and rest of the process should remain same and I provide you all code as well. Please provide me step by step guide and code for it. |
To execute a raw SQL query using Euclidean distance and integrate the results back into your existing process with
Here's a step-by-step guide with code: Step 1: Establish a Database Connectionimport psycopg2
# Connect to your PostgreSQL database
connection = psycopg2.connect(
host="your_host",
database="your_database",
user="your_user",
password="your_password"
) Step 2: Execute the Raw SQL Querydef execute_euclidean_query(connection, query_vector):
with connection.cursor() as cursor:
# Prepare the SQL query using the <-> operator for Euclidean distance
sql_query = """
SELECT id, node_id, text, metadata_, embedding
FROM private_gpt.data_embeddings
ORDER BY embedding <-> %s
LIMIT 2;
"""
cursor.execute(sql_query, (query_vector,))
results = cursor.fetchall()
return results Step 3: Process the Query ResultsAssuming your retriever or index expects a specific format, you need to convert the raw results into that format. Here's a generic example: def process_results(raw_results):
# Convert raw SQL results into the expected format
processed_results = []
for row in raw_results:
result = {
"id": row[0],
"node_id": row[1],
"text": row[2],
"metadata": row[3],
"embedding": row[4]
}
processed_results.append(result)
return processed_results Integration with Your Existing ProcessYou can now integrate these steps into your existing process. For example, you can replace the part of your code that retrieves results with the query_vector = [-0.46123987436294556, -0.6934341192245483, ...] # Your query vector
raw_results = execute_euclidean_query(connection, query_vector)
processed_results = process_results(raw_results)
# Use processed_results in your retriever or index as needed This approach allows you to manually execute a raw SQL query using Euclidean distance and integrate the results back into your existing workflow. Adjust the code as necessary to fit the specific requirements of your retriever or index. To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
This is below my retriever function def get_retriever(
self,
index: VectorStoreIndex,
context_filter: ContextFilter | None = None,
similarity_top_k: int = 2,
) -> VectorIndexRetriever:
# This way we support qdrant (using doc_ids) and the rest (using filters)
return VectorIndexRetriever(
index=index,
similarity_top_k=similarity_top_k,
doc_ids=context_filter.docs_ids if context_filter else None,
filters=(
_doc_id_metadata_filter(context_filter)
if self.settings.vectorstore.database != "qdrant"
else None
),
) I want to edit the sql query that is generate to do similarity search in my pgvector db. How can I do it? |
To edit the SQL query generated for similarity search in your pgvector database within the class PGVectorSQLParser(BaseSQLParser):
"""PGVector SQL Parser."""
def __init__(
self,
embed_model: BaseEmbedding,
) -> None:
"""Initialize params."""
self._embed_model = embed_model
def parse_response_to_sql(self, response: str, query_bundle: QueryBundle) -> str:
"""Parse response to SQL."""
sql_query_start = response.find("SQLQuery:")
if sql_query_start != -1:
response = response[sql_query_start:]
if response.startswith("SQLQuery:"):
response = response[len("SQLQuery:") :]
sql_result_start = response.find("SQLResult:")
if sql_result_start != -1:
response = response[:sql_result_start]
# this gets you the sql string with [query_vector] placeholders
raw_sql_str = response.strip().strip("```sql").strip("```").strip()
query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
query_embedding_str = str(query_embedding)
return raw_sql_str.replace("[query_vector]", query_embedding_str) To customize the SQL query, you can modify the logic within this method to adjust how the SQL string is constructed or how the To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
I dont have any PGVectorSQLParser class or method. I share the retriever function who is responsible to do query in db and i would to edit the sql query that do similarity search in db so can you make change in my retriever function. here is my retriever below def get_retriever(
self,
index: VectorStoreIndex,
context_filter: ContextFilter | None = None,
similarity_top_k: int = 2,
) -> VectorIndexRetriever:
# This way we support qdrant (using doc_ids) and the rest (using filters)
return VectorIndexRetriever(
index=index,
similarity_top_k=similarity_top_k,
doc_ids=context_filter.docs_ids if context_filter else None,
filters=(
_doc_id_metadata_filter(context_filter)
if self.settings.vectorstore.database != "qdrant"
else None
),
) |
To modify the SQL query used for similarity searches in your Here's a general approach to achieve this:
Here's a simplified example of how you might implement this: from llama_index.core.indices.vector_store.retrievers.retriever import VectorIndexRetriever
class CustomVectorIndexRetriever(VectorIndexRetriever):
def _execute_query(self, query_vector):
# Custom SQL query using the <-> operator for Euclidean distance
sql_query = f"""
SELECT id, node_id, text, metadata_, embedding
FROM your_table_name
ORDER BY embedding <-> '{query_vector}'
LIMIT {self.similarity_top_k};
"""
# Execute the SQL query using your database connection
# This is a placeholder for actual database interaction code
results = self._database_connection.execute(sql_query)
return results
# Use the custom retriever in your get_retriever function
def get_retriever(
self,
index: VectorStoreIndex,
context_filter: ContextFilter | None = None,
similarity_top_k: int = 2,
) -> CustomVectorIndexRetriever:
return CustomVectorIndexRetriever(
index=index,
similarity_top_k=similarity_top_k,
doc_ids=context_filter.docs_ids if context_filter else None,
filters=(
_doc_id_metadata_filter(context_filter)
if self.settings.vectorstore.database != "qdrant"
else None
),
) This example assumes you have a method to execute raw SQL queries against your database. You'll need to replace the placeholder code with actual database interaction logic, such as using a library like To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other |
Question Validation
Question
Description:
I am using HNSW vector indexing with pgvector in my database. Upon inspecting the SQL query logs, I noticed that by default, the
<=>
operator (cosine similarity) is being used for similarity searches. However, I would prefer to use the<->
operator (Euclidean distance) for similarity searching, as it reduces the query time significantly in our case where we haveTotal records: 20,787,206 (private_gpt.data_embeddings)
Index size: > 69 GB
Table size: > 500GB
Ingested files: >108,000
Despite this, I have been unable to find any resources in the documentation regarding how to utilize HNSW indexing, as it doesn't seem to be used by default in my case.
Could you please advise on how to switch the default operator to Euclidean distance and ensure HNSW indexing is used for similarity searches using PgVector?
The text was updated successfully, but these errors were encountered: