Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extension does not seem to do hybrid queries between HNSW index and typical duckdb scalar where clauses. #35

Open
malcolm-smith-mck opened this issue Oct 28, 2024 · 0 comments

Comments

@malcolm-smith-mck
Copy link

For this example (like the basic readme example but I added a varchar column so that I could test if the extension could filter by a scalar column and also process results with array distance and/or using HNSW indexes. I noted that the usearch library has support for push down filters but the vss extensions does not seem to support this operation.

These tests failed, the extension does not seem to support hybrid scalar and vector filters. This feature is very important to my key use cases which require role based security access to GenAI RAG vector stores - e.g. As a user in business unit 1 I can see only RAG chunks that are part of business unit 1 processed documents.

`DROP TABLE IF EXISTS my_vector_table;
CREATE TABLE my_vector_table (vec FLOAT[3],cluster_name varchar);

INSERT INTO my_vector_table (vec) SELECT array_value(a,b,c) FROM range(0,100) ra(a), range(0,100) rb(b), range(0,100) rc(c); -- 1M records
update my_vector_table set cluster_name = (random()*10+100)::int; -- assigns fake cluster names 101, 102,103 with ~100K record each

CREATE INDEX my_hnsw_index ON my_vector_table USING HNSW (vec);
CREATE INDEX my_cluster_name_idx on my_vector_table(cluster_name);

SELECT * FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 3; -- does not use HNSW index only uses scalar index -- PROBLEM 1

select * from (SELECT * FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 1000) as tt limit 3; -- this uses the HNSW index but requires the inner limit to be set high enough to happen to include some rows with cluster_name = '101' -- PROBLEM 2
`

PROBLEM 1 - the query fails to any results, does use the HNSW index but seems to apply the plan incorrectly
┌─────────────────────────────┐ │┌───────────────────────────┐│ ││ Physical Plan ││ │└───────────────────────────┘│ └─────────────────────────────┘ ┌───────────────────────────┐ │ PROJECTION │ │ ──────────────────── │ │ #0 │ │ #1 │ │ │ │ ~3 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PROJECTION │ │ ──────────────────── │ │ vec │ │ cluster_name │ │ array_distance(vec, [1.0, │ │ 2.0, 3.0]) │ │ │ │ ~200000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ FILTER │ │ ──────────────────── │ │ ((#0 = '101') AND (#0 IS │ │ NOT NULL)) │ │ │ │ ~1000000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ HNSW_INDEX_SCAN │ │ ──────────────────── │ │ my_vector_table (HNSW │ │ INDEX SCAN : │ │ my_hnsw_index) │ │ │ │ Projections: │ │ cluster_name │ │ vec │ │ │ │ ~1000000 Rows │ └───────────────────────────┘

PROBLEM 2 - no combination of temporary tables, CTE's or other SQL tactics will cause the engine to first create a list of cluster_name = '101' outputs which are then provided to the array_distance limit process - this sql is a hack to try to get some number of results from the HNSW index and then filtering them back to a smaller matching list. This does return a result but is not reliable because results are not really filtering on both restrictions. Explain plan also does not look quite right to me, in terms of sequence of operations.

`
explain select * from (SELECT * FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 1000) as tt limit 3;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ──────────────────── │
│ #0 │
#1
│ │
│ ~1000 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ STREAMING_LIMIT │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ vec │
│ cluster_name │
│ array_distance(vec, [1.0, │
│ 2.0, 3.0]) │
│ │
│ ~200000 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ((#0 = '101') AND (#0 IS │
│ NOT NULL)) │
│ │
│ ~1000000 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HNSW_INDEX_SCAN │
│ ──────────────────── │
│ my_vector_table (HNSW │
│ INDEX SCAN : │
│ my_hnsw_index) │
│ │
│ Projections: │
│ cluster_name │
│ vec │
│ │
│ ~1000000 Rows │
└───────────────────────────┘
`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant