Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question]: PrivateGPT only uses cosine similarity with PgVector even with HNSW index added #17311

Open
1 task done
muzammilvd opened this issue Dec 18, 2024 · 15 comments
Open
1 task done
Labels
question Further information is requested

Comments

@muzammilvd
Copy link

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Description:

I am using HNSW vector indexing with pgvector in my database. Upon inspecting the SQL query logs, I noticed that by default, the <=> operator (cosine similarity) is being used for similarity searches. However, I would prefer to use the <-> operator (Euclidean distance) for similarity searching, as it reduces the query time significantly in our case where we have

Total records: 20,787,206 (private_gpt.data_embeddings)
Index size: > 69 GB
Table size: > 500GB
Ingested files: >108,000

Despite this, I have been unable to find any resources in the documentation regarding how to utilize HNSW indexing, as it doesn't seem to be used by default in my case.

Could you please advise on how to switch the default operator to Euclidean distance and ensure HNSW indexing is used for similarity searches using PgVector?

@muzammilvd muzammilvd added the question Further information is requested label Dec 18, 2024
Copy link

dosubot bot commented Dec 18, 2024

Hey @muzammilvd! 👋 I'm here to assist you with bug fixes, answer questions, and even guide you on how to become a contributor. Let's tackle this together!

To switch the default operator to Euclidean distance and ensure HNSW indexing is used for similarity searches using PgVector, you need to modify the hnsw_dist_method parameter when creating the HNSW index. Specifically, you should set hnsw_dist_method to "vector_l2_ops" instead of the default "vector_cosine_ops". Here's a code snippet that demonstrates how to create the index with the desired settings:

hnsw_dist_method = self.hnsw_kwargs.pop("hnsw_dist_method", "vector_l2_ops")

index_name = f"{self._table_class.__tablename__}_embedding_idx"

with self._session() as session, session.begin():
    statement = sqlalchemy.text(
        f"CREATE INDEX IF NOT EXISTS {index_name} ON {self.schema_name}.{self._table_class.__tablename__} USING hnsw (embedding {hnsw_dist_method}) WITH (m = {hnsw_m}, ef_construction = {hnsw_ef_construction})"
    )
    session.execute(statement)
    session.commit()

Make sure your hnsw_kwargs dictionary includes this setting when you create the index [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

muzammilvd commented Dec 19, 2024

Hey @muzammilvd! 👋 I'm here to assist you with bug fixes, answer questions, and even guide you on how to become a contributor. Let's tackle this together!

To switch the default operator to Euclidean distance and ensure HNSW indexing is used for similarity searches using PgVector, you need to modify the hnsw_dist_method parameter when creating the HNSW index. Specifically, you should set hnsw_dist_method to "vector_l2_ops" instead of the default "vector_cosine_ops". Here's a code snippet that demonstrates how to create the index with the desired settings:

hnsw_dist_method = self.hnsw_kwargs.pop("hnsw_dist_method", "vector_l2_ops")

index_name = f"{self._table_class.__tablename__}_embedding_idx"

with self._session() as session, session.begin():
    statement = sqlalchemy.text(
        f"CREATE INDEX IF NOT EXISTS {index_name} ON {self.schema_name}.{self._table_class.__tablename__} USING hnsw (embedding {hnsw_dist_method}) WITH (m = {hnsw_m}, ef_construction = {hnsw_ef_construction})"
    )
    session.execute(statement)
    session.commit()

Make sure your hnsw_kwargs dictionary includes this setting when you create the index [1].

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

Hi, thanks for your response and I have a question that as I am using VectorIndexRetriever() to create my retriever so is there any option in VectorIndexRetriever() to use HNSW as by default it is using cosine.

here is the retriever

def get_retriever(
        self,
        index: VectorStoreIndex,
        context_filter: ContextFilter | None = None,
        similarity_top_k: int = 2,
    ) -> VectorIndexRetriever:
        # This way we support qdrant (using doc_ids) and the rest (using filters)
        return VectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
            doc_ids=context_filter.docs_ids if context_filter else None,
            filters=(
                _doc_id_metadata_filter(context_filter)
                if self.settings.vectorstore.database != "qdrant"
                else None
            ),
        )

Copy link

dosubot bot commented Dec 19, 2024

The VectorIndexRetriever class itself does not directly specify the use of HNSW indexing. Instead, HNSW indexing is typically configured at the level of the vector store you are using. For example, if you are using a vector store that supports HNSW, such as HnswlibVectorStore, you would configure the HNSW settings when you create the vector store instance.

Here's a general approach to using HNSW with a VectorIndexRetriever:

  1. Create an HNSW Vector Store: Use a vector store that supports HNSW indexing, like HnswlibVectorStore or DocArrayHnswVectorStore. Configure it with the desired parameters for HNSW.

  2. Initialize the VectorIndexRetriever: Pass the configured vector store to the VectorIndexRetriever.

Here's a simplified example:

from llama_index.vector_stores.hnswlib import HnswlibVectorStore
from llama_index.core import VectorStoreIndex

# Configure HNSW vector store
hnsw_vector_store = HnswlibVectorStore.from_params(
    space="ip",  # or "l2" for Euclidean distance
    dimension=1536,
    max_elements=1000,
)

# Create the index
index = VectorStoreIndex(vector_store=hnsw_vector_store)

# Initialize the retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
    # other parameters as needed
)

Ensure that your vector store is set up to use the desired distance metric (e.g., Euclidean distance) by configuring it appropriately when you create the vector store instance.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

hnsw_vector_store

The VectorIndexRetriever class itself does not directly specify the use of HNSW indexing. Instead, HNSW indexing is typically configured at the level of the vector store you are using. For example, if you are using a vector store that supports HNSW, such as HnswlibVectorStore, you would configure the HNSW settings when you create the vector store instance.

Here's a general approach to using HNSW with a VectorIndexRetriever:

  1. Create an HNSW Vector Store: Use a vector store that supports HNSW indexing, like HnswlibVectorStore or DocArrayHnswVectorStore. Configure it with the desired parameters for HNSW.
  2. Initialize the VectorIndexRetriever: Pass the configured vector store to the VectorIndexRetriever.

Here's a simplified example:

from llama_index.vector_stores.hnswlib import HnswlibVectorStore
from llama_index.core import VectorStoreIndex

# Configure HNSW vector store
hnsw_vector_store = HnswlibVectorStore.from_params(
    space="ip",  # or "l2" for Euclidean distance
    dimension=1536,
    max_elements=1000,
)

# Create the index
index = VectorStoreIndex(vector_store=hnsw_vector_store)

# Initialize the retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
    # other parameters as needed
)

Ensure that your vector store is set up to use the desired distance metric (e.g., Euclidean distance) by configuring it appropriately when you create the vector store instance.

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

So you suggest that I need to change the vectore_store instance and use HnswlibVectorStore but I am using Pgvector database where i have data with states that I mentioned in my question.

Here is the current vector_store that using PGVectorStore

self.vector_store = typing.cast(
                    BasePydanticVectorStore,
                    PGVectorStore.from_params(
                        **settings.postgres.model_dump(exclude_none=True),
                        table_name="embeddings",
                        embed_dim=settings.embedding.embed_dim,
                    ),
                )

So what should I do and as my I understanding you are suggesting to change the db vector_store method so how it will use the pgvector db and its data?

Copy link

dosubot bot commented Dec 19, 2024

To use HNSW indexing with Euclidean distance in your PGVectorStore setup, you can configure the hnsw_kwargs parameter when creating the PGVectorStore instance. Here's how you can adjust your current setup to use HNSW indexing with Euclidean distance:

from sqlalchemy import make_url

# Assuming you have a connection string and settings
url = make_url(settings.postgres.connection_string)

self.vector_store = PGVectorStore.from_params(
    database=settings.postgres.db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="embeddings",
    embed_dim=settings.embedding.embed_dim,
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_l2_ops",  # Use Euclidean distance
    },
)

This configuration sets up the PGVectorStore to use HNSW indexing with Euclidean distance for similarity searches. Make sure that your settings object contains the necessary database connection details and embedding dimensions [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

To use HNSW indexing with Euclidean distance in your PGVectorStore setup, you can configure the hnsw_kwargs parameter when creating the PGVectorStore instance. Here's how you can adjust your current setup to use HNSW indexing with Euclidean distance:

from sqlalchemy import make_url

# Assuming you have a connection string and settings
url = make_url(settings.postgres.connection_string)

self.vector_store = PGVectorStore.from_params(
    database=settings.postgres.db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="embeddings",
    embed_dim=settings.embedding.embed_dim,
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_l2_ops",  # Use Euclidean distance
    },
)

This configuration sets up the PGVectorStore to use HNSW indexing with Euclidean distance for similarity searches. Make sure that your settings object contains the necessary database connection details and embedding dimensions [1].

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

I implemented above code that you shared and I testing but it still using cosine as I examined the DB logs where I found a query that is like

SELECT private_gpt.data_embeddings.id, private_gpt.data_embeddings.node_id, private_gpt.data_embeddings.text, private_gpt.data_embeddings.metadata_, private_gpt.data_embeddings.embedding <=> '[-0.46123987436294556,-0.6934341192245483,-3.45989727973938,-0.1854298710823059,0.6372467279434204,0.9199901819229126,0.4315108358860016,0.08280982077121735,0.428688645362854,...........] AS distance 
	FROM private_gpt.data_embeddings ORDER BY distance asc 
	 LIMIT 2

After analysing DB logs and I shared above some you can see <=> operating which is use by cosine and <-> use by Euclidean. So the issue is till there.

Copy link

dosubot bot commented Dec 19, 2024

To ensure that the PGVectorStore uses Euclidean distance instead of cosine similarity, you need to verify that the hnsw_dist_method is correctly set to "vector_l2_ops" in the hnsw_kwargs parameter. If the configuration is correct but the SQL logs still show the <=> operator (cosine similarity), it might indicate that the configuration is not being applied as expected.

Here are a few steps to troubleshoot and ensure the correct setup:

  1. Double-check the Configuration: Ensure that the hnsw_kwargs dictionary is correctly passed to the PGVectorStore.from_params method with "hnsw_dist_method": "vector_l2_ops".

  2. Inspect the Code: Verify that the _create_hnsw_index method in the PGVectorStore class is correctly using the hnsw_dist_method from the hnsw_kwargs. The method should apply this setting when creating the index.

  3. Recreate the Index: If the index was created before the configuration change, you might need to drop and recreate the index to apply the new settings.

  4. Check for Overrides: Ensure there are no other parts of the code that might override the hnsw_dist_method setting after the PGVectorStore is initialized.

If these steps do not resolve the issue, it might be necessary to review the implementation details of the PGVectorStore to ensure that the distance method is being correctly applied in the SQL query construction.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

To ensure that the PGVectorStore uses Euclidean distance instead of cosine similarity, you need to verify that the hnsw_dist_method is correctly set to "vector_l2_ops" in the hnsw_kwargs parameter. If the configuration is correct but the SQL logs still show the <=> operator (cosine similarity), it might indicate that the configuration is not being applied as expected.

Here are a few steps to troubleshoot and ensure the correct setup:

  1. Double-check the Configuration: Ensure that the hnsw_kwargs dictionary is correctly passed to the PGVectorStore.from_params method with "hnsw_dist_method": "vector_l2_ops".
  2. Inspect the Code: Verify that the _create_hnsw_index method in the PGVectorStore class is correctly using the hnsw_dist_method from the hnsw_kwargs. The method should apply this setting when creating the index.
  3. Recreate the Index: If the index was created before the configuration change, you might need to drop and recreate the index to apply the new settings.
  4. Check for Overrides: Ensure there are no other parts of the code that might override the hnsw_dist_method setting after the PGVectorStore is initialized.

If these steps do not resolve the issue, it might be necessary to review the implementation details of the PGVectorStore to ensure that the distance method is being correctly applied in the SQL query construction.

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

Here the updated PGVectorStore class

self.vector_store = typing.cast(
                    BasePydanticVectorStore,
                    PGVectorStore.from_params(
                        **settings.postgres.model_dump(exclude_none=True),
                        table_name="embeddings",
                        embed_dim=settings.embedding.embed_dim,
                        hnsw_kwargs={
                            "hnsw_m": 16,
                            "hnsw_ef_construction": 64,
                            "hnsw_ef_search": 40,
                            "hnsw_dist_method": "vector_l2_ops",  # Use Euclidean distance
                        },
                    ),
                )

As I provided you my retriever and it responsible to run sql query to do similarity search in vector so my question is that can we run sql query manually by code like I want to run raw sql query where I will do query with <-> to use Euclidean. So is there any way to do it

Copy link

dosubot bot commented Dec 19, 2024

Yes, you can manually run a raw SQL query using the <-> operator to perform a similarity search with Euclidean distance in the PGVectorStore setup. Here's an example of how you can execute such a query:

SELECT * FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 5;

This query orders the results by their similarity to a given vector, represented by [query_vector], using the <-> operator on the embedding column [1]. You can execute this query using your database connection in Python, typically with a library like SQLAlchemy or psycopg2, depending on your setup.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

Yes, you can manually run a raw SQL query using the <-> operator to perform a similarity search with Euclidean distance in the PGVectorStore setup. Here's an example of how you can execute such a query:

SELECT * FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 5;

This query orders the results by their similarity to a given vector, represented by [query_vector], using the <-> operator on the embedding column [1]. You can execute this query using your database connection in Python, typically with a library like SQLAlchemy or psycopg2, depending on your setup.

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

Yes but my question was that can i run raw query by code and get expected object that PGVectorStore or VectorStoreIndex or my retriever. I want like all my process remain same just want to execute sql query by code for Euclidean and rest of the process should remain same and I provide you all code as well. Please provide me step by step guide and code for it.

Copy link

dosubot bot commented Dec 19, 2024

To execute a raw SQL query using Euclidean distance and integrate the results back into your existing process with PGVectorStore or VectorStoreIndex, you can follow these steps:

  1. Establish a Database Connection: Use a library like psycopg2 or SQLAlchemy to connect to your PostgreSQL database.

  2. Execute the Raw SQL Query: Use the <-> operator in your SQL query to perform the similarity search with Euclidean distance.

  3. Process the Query Results: Convert the raw query results into the format expected by your retriever or index.

Here's a step-by-step guide with code:

Step 1: Establish a Database Connection

import psycopg2

# Connect to your PostgreSQL database
connection = psycopg2.connect(
    host="your_host",
    database="your_database",
    user="your_user",
    password="your_password"
)

Step 2: Execute the Raw SQL Query

def execute_euclidean_query(connection, query_vector):
    with connection.cursor() as cursor:
        # Prepare the SQL query using the <-> operator for Euclidean distance
        sql_query = """
        SELECT id, node_id, text, metadata_, embedding
        FROM private_gpt.data_embeddings
        ORDER BY embedding <-> %s
        LIMIT 2;
        """
        cursor.execute(sql_query, (query_vector,))
        results = cursor.fetchall()
    return results

Step 3: Process the Query Results

Assuming your retriever or index expects a specific format, you need to convert the raw results into that format. Here's a generic example:

def process_results(raw_results):
    # Convert raw SQL results into the expected format
    processed_results = []
    for row in raw_results:
        result = {
            "id": row[0],
            "node_id": row[1],
            "text": row[2],
            "metadata": row[3],
            "embedding": row[4]
        }
        processed_results.append(result)
    return processed_results

Integration with Your Existing Process

You can now integrate these steps into your existing process. For example, you can replace the part of your code that retrieves results with the execute_euclidean_query function and then process the results with process_results.

query_vector = [-0.46123987436294556, -0.6934341192245483, ...]  # Your query vector
raw_results = execute_euclidean_query(connection, query_vector)
processed_results = process_results(raw_results)

# Use processed_results in your retriever or index as needed

This approach allows you to manually execute a raw SQL query using Euclidean distance and integrate the results back into your existing workflow. Adjust the code as necessary to fit the specific requirements of your retriever or index.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

This is below my retriever function

def get_retriever(
        self,
        index: VectorStoreIndex,
        context_filter: ContextFilter | None = None,
        similarity_top_k: int = 2,
    ) -> VectorIndexRetriever:
        # This way we support qdrant (using doc_ids) and the rest (using filters)
        return VectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
            doc_ids=context_filter.docs_ids if context_filter else None,
            filters=(
                _doc_id_metadata_filter(context_filter)
                if self.settings.vectorstore.database != "qdrant"
                else None
            ),
        )

I want to edit the sql query that is generate to do similarity search in my pgvector db. How can I do it?

Copy link

dosubot bot commented Dec 20, 2024

To edit the SQL query generated for similarity search in your pgvector database within the get_retriever function, you need to modify the PGVectorSQLParser class, specifically the parse_response_to_sql method. This method is responsible for parsing the response and generating the SQL query string. Here's the relevant part of the code:

class PGVectorSQLParser(BaseSQLParser):
    """PGVector SQL Parser."""

    def __init__(
        self,
        embed_model: BaseEmbedding,
    ) -> None:
        """Initialize params."""
        self._embed_model = embed_model

    def parse_response_to_sql(self, response: str, query_bundle: QueryBundle) -> str:
        """Parse response to SQL."""
        sql_query_start = response.find("SQLQuery:")
        if sql_query_start != -1:
            response = response[sql_query_start:]
            if response.startswith("SQLQuery:"):
                response = response[len("SQLQuery:") :]
        sql_result_start = response.find("SQLResult:")
        if sql_result_start != -1:
            response = response[:sql_result_start]

        # this gets you the sql string with [query_vector] placeholders
        raw_sql_str = response.strip().strip("```sql").strip("```").strip()
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        query_embedding_str = str(query_embedding)
        return raw_sql_str.replace("[query_vector]", query_embedding_str)

To customize the SQL query, you can modify the logic within this method to adjust how the SQL string is constructed or how the [query_vector] placeholder is replaced with the actual query embedding [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@muzammilvd
Copy link
Author

PGVectorSQLParser

To edit the SQL query generated for similarity search in your pgvector database within the get_retriever function, you need to modify the PGVectorSQLParser class, specifically the parse_response_to_sql method. This method is responsible for parsing the response and generating the SQL query string. Here's the relevant part of the code:

class PGVectorSQLParser(BaseSQLParser):
    """PGVector SQL Parser."""

    def __init__(
        self,
        embed_model: BaseEmbedding,
    ) -> None:
        """Initialize params."""
        self._embed_model = embed_model

    def parse_response_to_sql(self, response: str, query_bundle: QueryBundle) -> str:
        """Parse response to SQL."""
        sql_query_start = response.find("SQLQuery:")
        if sql_query_start != -1:
            response = response[sql_query_start:]
            if response.startswith("SQLQuery:"):
                response = response[len("SQLQuery:") :]
        sql_result_start = response.find("SQLResult:")
        if sql_result_start != -1:
            response = response[:sql_result_start]

        # this gets you the sql string with [query_vector] placeholders
        raw_sql_str = response.strip().strip("```sql").strip("```").strip()
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        query_embedding_str = str(query_embedding)
        return raw_sql_str.replace("[query_vector]", query_embedding_str)

To customize the SQL query, you can modify the logic within this method to adjust how the SQL string is constructed or how the [query_vector] placeholder is replaced with the actual query embedding [1].

To continue talking to Dosu, mention @dosu.

Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

I dont have any PGVectorSQLParser class or method. I share the retriever function who is responsible to do query in db and i would to edit the sql query that do similarity search in db so can you make change in my retriever function.

here is my retriever below

def get_retriever(
        self,
        index: VectorStoreIndex,
        context_filter: ContextFilter | None = None,
        similarity_top_k: int = 2,
    ) -> VectorIndexRetriever:
        # This way we support qdrant (using doc_ids) and the rest (using filters)
        return VectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
            doc_ids=context_filter.docs_ids if context_filter else None,
            filters=(
                _doc_id_metadata_filter(context_filter)
                if self.settings.vectorstore.database != "qdrant"
                else None
            ),
        )

Copy link

dosubot bot commented Dec 20, 2024

To modify the SQL query used for similarity searches in your get_retriever function, you need to customize the VectorIndexRetriever class or the underlying vector store to use the desired SQL syntax. Since the VectorIndexRetriever itself doesn't directly handle SQL query construction, you'll need to adjust the vector store or the query execution logic it relies on.

Here's a general approach to achieve this:

  1. Subclass the VectorIndexRetriever: Create a subclass of VectorIndexRetriever that overrides the method responsible for executing the query. This method should construct the SQL query using the <-> operator for Euclidean distance.

  2. Override the Query Execution: In your subclass, override the method that interacts with the database to ensure it uses the correct SQL syntax.

Here's a simplified example of how you might implement this:

from llama_index.core.indices.vector_store.retrievers.retriever import VectorIndexRetriever

class CustomVectorIndexRetriever(VectorIndexRetriever):
    def _execute_query(self, query_vector):
        # Custom SQL query using the <-> operator for Euclidean distance
        sql_query = f"""
        SELECT id, node_id, text, metadata_, embedding
        FROM your_table_name
        ORDER BY embedding <-> '{query_vector}'
        LIMIT {self.similarity_top_k};
        """
        # Execute the SQL query using your database connection
        # This is a placeholder for actual database interaction code
        results = self._database_connection.execute(sql_query)
        return results

# Use the custom retriever in your get_retriever function
def get_retriever(
        self,
        index: VectorStoreIndex,
        context_filter: ContextFilter | None = None,
        similarity_top_k: int = 2,
    ) -> CustomVectorIndexRetriever:
        return CustomVectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
            doc_ids=context_filter.docs_ids if context_filter else None,
            filters=(
                _doc_id_metadata_filter(context_filter)
                if self.settings.vectorstore.database != "qdrant"
                else None
            ),
        )

This example assumes you have a method to execute raw SQL queries against your database. You'll need to replace the placeholder code with actual database interaction logic, such as using a library like psycopg2 or SQLAlchemy to execute the query and fetch results. Adjust the table name and other details to fit your specific setup.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant