Benchmark against FAISS & nmslib? #4

Open
jaanli opened this issue Apr 14, 2024 · 2 comments


jaanli commented Apr 14, 2024

Such a benchmark would be super helpful for deciding which in-browser use cases are feasible :)

https://github.com/nmslib/hnswlib

For example, I have a few databases ready to go:

20 years of census data - https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
15 million hospital claims - https://onefact.github.io/synthetic-healthcare-data/
All of NYC real estate - https://jaanli.github.io/new-york-real-estate/

And I really want to visualize the 30,000+ Mandarin characters by their phono-semantic specificity/etymological origins on a map.

All of these require high-dimensional similarity search, but at very different scales. So the UI/UX interactions (e.g. very early ones from 2017 here: https://jaan.io/food2vec-augmented-cooking-machine-intelligence/) will be constrained by the queries per second that this DuckDB extension supports.
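To make the queries-per-second constraint concrete, here is a minimal sketch of how throughput could be measured for any search backend. The names `measure_qps` and `linear_scan` are hypothetical, and the linear scan is only a stand-in for a real index (the DuckDB extension, FAISS, or hnswlib would be plugged in as `search_fn`); the numbers it prints say nothing about those libraries.

```python
import random
import time
from typing import Callable, List, Sequence


def measure_qps(search_fn: Callable[[Sequence[float]], object],
                queries: List[Sequence[float]],
                warmup: int = 1) -> float:
    """Return queries per second when search_fn is called once per query."""
    # Run a few untimed queries first so one-off setup costs are excluded.
    for q in queries[:warmup]:
        search_fn(q)
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed if elapsed > 0 else float("inf")


# Stand-in data and backend (assumption: a real benchmark would load
# actual embeddings and call the index under test instead).
random.seed(0)
DIM = 8
points = [[random.random() for _ in range(DIM)] for _ in range(1000)]


def linear_scan(query: Sequence[float]) -> int:
    """Exact nearest neighbour by squared Euclidean distance."""
    return min(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query)),
    )


queries = [[random.random() for _ in range(DIM)] for _ in range(20)]
qps = measure_qps(linear_scan, queries)
print(f"{qps:.1f} queries/second")
```

For interactive UIs, the same harness could be run at each dataset scale (tens of thousands of characters versus millions of claims) to see where the per-query latency stops feeling responsive.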

Hope that makes sense, and happy to help! 🙏 super exciting that this is now feasible!!


jaanli commented Apr 19, 2024

In case further motivation is needed, here are the types of algorithms I need to benchmark: https://github.com/google-deepmind/xtr - the FAISS parts are here: https://github.com/google-deepmind/xtr/blob/main/xtr_evaluation_on_beir_miracl.ipynb

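    # Excerpt from the XTR evaluation notebook. It runs inside a method of a
    # larger class, so `self`, `all_token_embeds` and `num_tokens` are defined
    # elsewhere in that notebook.
    import faiss

    ds = 128            # embedding dimension
    num_clusters = 50   # number of IVF lists
    code_size = 64      # number of PQ sub-quantizers
    quantizer = faiss.IndexFlatIP(ds)
    # OPQ rotation applied to the vectors before product quantization.
    opq_matrix = faiss.OPQMatrix(ds, code_size)
    opq_matrix.niter = 10
    # IVF-PQ index with 4 bits per sub-quantizer code, inner-product metric.
    sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 4, faiss.METRIC_INNER_PRODUCT)
    index = faiss.IndexPreTransform(opq_matrix, sub_index)
    index.train(all_token_embeds[:num_tokens])
    index.add(all_token_embeds[:num_tokens])

    class FaissSearcher(object):
        def __init__(self, index):
            self.index = index

        def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
            scores, top_ids = self.index.search(query_embeds, final_num_neighbors)
            return top_ids, scores

    self.searcher = FaissSearcher(index)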
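Because an IVF-PQ index like the one above is approximate, a speed comparison is only meaningful alongside a recall number. The following is a rough sketch of how recall@k could be computed against an exact inner-product baseline; `exact_top_k` and `recall_at_k` are hypothetical helper names, not part of FAISS, hnswlib, or the DuckDB extension, and the `approx` ids are made up for illustration.

```python
from typing import List, Sequence


def exact_top_k(query: Sequence[float],
                vectors: List[Sequence[float]],
                k: int) -> List[int]:
    """Ids of the k vectors with the largest inner product with query."""
    scores = [sum(a * b for a, b in zip(query, v)) for v in vectors]
    # sorted() is stable, so ties keep their original id order.
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k ids that the approximate result found."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data; a real benchmark would use the token embeddings instead.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]

truth = exact_top_k(query, vectors, k=2)
approx = [2, 1]  # pretend this came from the approximate index under test
print(recall_at_k(approx, truth))
```

Reporting recall@k next to queries per second keeps an index from looking fast simply because it returns worse neighbours.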


JAicewizard commented Sep 9, 2024

Hello, I ran some benchmarks comparing VSS against the FAISS extension and posted them here: https://github.com/arjenpdevries/faiss/blob/main/README.md
This URL will die soon, but once it is merged it will be in the general README. TL;DR: VSS is about 2-3 times slower than FAISS when running a single query on 8.8M data points of dimension 1536.
