Tracking issue: metadata filtering #26

Open
asg017 opened this issue Jun 21, 2024 · 3 comments


asg017 commented Jun 21, 2024

sqlite-vec doesn't have good metadata filtering as of v0.1.0. Only vector columns can be declared in the vec0 constructor. You can do pre-filtering with vec_column IN (...) queries, but that's slow and inconvenient.

I'm thinking:

create virtual table vec_movies using vec0(
  movie_id text primary key,
  genre text,
  release_date date,
  rating text,
  is_3d boolean,
  synopsis_embedding float[768]
);

genre, release_date, rating, and is_3d would all be "metadata" columns. You could do queries like:

select
  rowid,
  distance
from vec_movies
where synopsis_embedding match embed('comedic american summer camp')
  and k = 20
  and is_3d
  and release_date between '2010-01-01' and '2015-12-31'
  and rating = 'PG';

We could capture all of the WHERE constraints to ensure that the top 20 returned vectors all match those criteria.
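To show why the constraints would need to be applied during the KNN search rather than after it, here is a minimal brute-force sketch in Python. It is not how sqlite-vec works internally; the `knn` and `wanted` helpers, the toy 2-d vectors, and the sample rows are all invented for illustration.

```python
import math

# Toy rows: movie_id -> (rating, is_3d, embedding). All values are invented.
MOVIES = {
    "a": ("PG", True,  [0.0, 0.1]),
    "b": ("R",  True,  [0.0, 0.2]),
    "c": ("PG", False, [0.0, 0.3]),
    "d": ("PG", True,  [0.0, 0.9]),
    "e": ("PG", True,  [1.0, 1.0]),
}

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def wanted(movie_id):
    """Stand-in for the metadata constraints: rating = 'PG' and is_3d."""
    rating, is_3d, _ = MOVIES[movie_id]
    return rating == "PG" and is_3d

def knn(query, k, predicate=lambda movie_id: True):
    """Brute-force KNN that only ranks rows passing `predicate`."""
    candidates = [m for m in MOVIES if predicate(m)]
    candidates.sort(key=lambda m: l2(query, MOVIES[m][2]))
    return candidates[:k]

query = [0.0, 0.0]

# Constraints applied inside the search: k matching rows come back.
filtered_in_search = knn(query, k=2, predicate=wanted)

# Constraints applied after the search: fewer than k rows may survive.
filtered_after = [m for m in knn(query, k=2) if wanted(m)]
```

With these toy rows, filtering inside the search returns two matching movies, while filtering afterwards leaves only one, because one of the two nearest neighbours fails the constraints.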

A few open questions:

How do we store metadata values?

We could store them OLTP-style with the _rowids shadow tables, but that may be slow. We could store them in a column-oriented fashion to match the vector column formats, but I'm unsure how much faster that would be.
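As a rough illustration of the two layouts, the sketch below holds the same metadata once as rows and once as per-column arrays, then evaluates `rating = 'PG' and is_3d` against each. The names and data are hypothetical, and the way sqlite-vec actually lays out vector columns is not modelled. It only shows that a column layout lets a filter scan one array per constraint and produce a positional mask that could line up with vector positions.

```python
from array import array

# Row-oriented: one record per row, every column stored together.
rows = [
    {"rowid": 1, "rating": "PG", "is_3d": 1},
    {"rowid": 2, "rating": "R",  "is_3d": 0},
    {"rowid": 3, "rating": "PG", "is_3d": 1},
]

# Column-oriented: one array per column; position i belongs to row i.
columns = {
    "rowid": array("q", [1, 2, 3]),
    "rating": ["PG", "R", "PG"],
    "is_3d": array("b", [1, 0, 1]),
}

def filter_rows(rows):
    """Touch every field of every record to test the two constraints."""
    return [r["rowid"] for r in rows if r["rating"] == "PG" and r["is_3d"]]

def filter_columns(columns):
    """Scan one column at a time and combine the results into a mask."""
    mask = [rating == "PG" for rating in columns["rating"]]
    mask = [keep and bool(flag) for keep, flag in zip(mask, columns["is_3d"])]
    return [rowid for rowid, keep in zip(columns["rowid"], mask) if keep]

row_hits = filter_rows(rows)
column_hits = filter_columns(columns)
```

Both functions return the same rowids; whether the column layout is meaningfully faster in practice is exactly the open question above and would need benchmarking.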

How would this work with ANN indexes?

🤷

What datatypes to support?

Ideally everything, ideally STRICT. But if we do column-oriented storage we'd need a strict subset, like:

  • TEXT
  • INT
  • DOUBLE
  • BLOB
  • BOOLEAN
  • DATE/DATETIME

Maybe we could do dictionary encoding for text values? Maybe that's a column option, like genre text encoding=dictionary or something. Maybe ENUMs? NULL/NOT NULL?
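For reference, dictionary encoding of a text column can be sketched as follows. This is a generic illustration in Python, not a proposed sqlite-vec implementation; `dictionary_encode` and the sample genres are hypothetical.

```python
from array import array

def dictionary_encode(values):
    """Return (dictionary, codes): the distinct strings plus one small int per row."""
    dictionary = []      # code -> string
    index = {}           # string -> code
    codes = array("H")   # unsigned short codes; assumes fewer than 65536 distinct values
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes

genres = ["comedy", "horror", "comedy", "drama", "comedy"]
dictionary, codes = dictionary_encode(genres)

# Equality filter: resolve the string to its code once, then compare integers.
target = dictionary.index("comedy")
matches = [i for i, code in enumerate(codes) if code == target]
```

The appeal for low-cardinality columns such as genre or rating is that each row stores a fixed-width integer, and an equality filter becomes an integer comparison over a compact array.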

@asg017 asg017 pinned this issue Jun 23, 2024
@asg017 asg017 mentioned this issue Jul 24, 2024
@forrestbao

I really hope this feature becomes available soon.


ajram23 commented Sep 24, 2024

@asg017 +1. Looks like langchain expects the metadata to be available as a dictionary. I have tried the integration and this is the last remaining piece to migrate fully out of ChromaDB.


lojik-ng commented Oct 1, 2024

+1
