
Use more modern vectors in nightly bench #237

Open
msokolov opened this issue Oct 27, 2023 · 1 comment

Comments

@msokolov
Collaborator

We have support for testing with 384-dim MiniLM and 768-dim MPNET vectors baked into some of the python utilities in luceneutil, and we also have tools for generating 8-bit versions of these. Let's produce some datasets for wikipedia and use them in place of the GloVe-100d vectors we use today. That GloVe model is pretty old and exhibits some unfortunate numerical instability; I think it's not really SOTA any more. Higher-dim vectors should also help showcase some of the vectorization work we've seen, which I think is more effective at higher dimensions. Also, 100d is a particularly bad case for vectorization because it is not a multiple of a power of two.

We can use src/python/infer_token_vectors.py to generate token-vectors from the Wikipedia corpus; there's an example command line in its comments for the 384-dim MiniLM model:

python src/python/infer_token_vectors.py ../data/enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt \
    ../data/enwiki-20120502.all-MiniLM-L6-v2.tok \
    ../data/enwiki-20120502.all-MiniLM-L6-v2.vec

This generates a dictionary of token->vector for all the tokens in our wikipedia data.
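For anyone poking at the output, here is a rough sketch of loading such a token→vector dictionary in Python. The file layout is an assumption (one token per line in the .tok file, with the matching whitespace-separated floats on the same line of the .vec file), and `load_token_vectors` is an illustrative name, not luceneutil's actual API:

```python
def load_token_vectors(tok_path, vec_path):
    # Assumed layout: line i of tok_path holds token i, line i of
    # vec_path holds its vector as whitespace-separated floats.
    vectors = {}
    with open(tok_path) as tok_f, open(vec_path) as vec_f:
        for token, line in zip(tok_f, vec_f):
            vectors[token.strip()] = [float(x) for x in line.split()]
    return vectors
```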

Then we can generate vectors for the wikipedia line documents and for the vector search tasks using the build targets in build.xml: vectors-minilm-tasks and vectors-minilm-docs. The way we do this is kind of cheesy; we just sum up the token vectors for the tokens in each document. This isn't really SOTA, but it's probably fine for performance testing.
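The "cheesy" summing step can be sketched like this (a minimal illustration of the idea, not luceneutil's actual code; the function name and handling of out-of-vocabulary tokens are assumptions):

```python
def embed_document(tokens, token_vectors, dim):
    # Sum the per-token vectors to get one vector per document.
    doc_vec = [0.0] * dim
    for tok in tokens:
        vec = token_vectors.get(tok)
        if vec is None:
            continue  # assumed: silently skip out-of-vocabulary tokens
        for i, x in enumerate(vec):
            doc_vec[i] += x
    return doc_vec
```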

For 8-bit vectors we need to choose a scale factor and run another job (there's an example for GloVe in build.xml). OTOH, rather than precomputing these, perhaps we can wait until Lucene quantization lands and use that? I'll open a separate issue for this one.
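For the precomputed 8-bit case, the scale-factor idea amounts to something like the following sketch (hedged: this mirrors the general technique, not the exact job in build.xml or Lucene's own quantization; `quantize_8bit` is a hypothetical name):

```python
def quantize_8bit(vec, scale):
    # Multiply each float component by the chosen scale factor and
    # clamp the rounded result to the signed-byte range [-128, 127].
    out = []
    for x in vec:
        q = round(x * scale)
        out.append(max(-128, min(127, q)))
    return out
```

Values that overflow the byte range after scaling saturate, which is why picking a good scale factor for the dataset matters.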

Moving from 100-d to 384-d vectors is going to slow things down a bit. We might also want to coordinate this with adopting concurrent vector merges (apache/lucene#12660); I'll open a separate issue for that.

@msokolov
Collaborator Author

msokolov commented Oct 27, 2023

I uploaded enwiki-20120502-lines-1k-mpnet.vec to home.apache.org:~sokolov, which has vectors for 33M docs. I also uploaded the token dictionary (enwiki-20120502-minilm.tok / enwiki-20120502-minilm.vec) and an 8-bit vectors file (enwiki-20120502-lines-1k-minilm-8bit.vec), and I pushed task files to main. So I think we have all the data we need to switch the nightly benchmarks over. But @mikemccand, you must do the final step, since you will need to download the data to the beast before we can cut over.
