
Use more modern vectors in nightly bench #237

Open
msokolov opened this issue Oct 27, 2023 · 1 comment

Comments

@msokolov
Collaborator

We have support for testing with 384-dim MiniLM and 768-dim MPNET vectors baked into some of the python utilities in luceneutil, and we also have tools for generating 8-bit versions of these. Let's produce some datasets for wikipedia and use them in place of the GloVe-100d vectors we use today. That GloVe model is pretty old and exhibits some unfortunate numerical instability; I think it's not really SOTA any more. Higher-dim vectors should also help showcase some of the vectorization work we've seen, which I think is more effective at higher dimensions. Also, 100d is a particularly bad case for vectorization because it is not a multiple of a power of two.

We can use src/python/infer_token_vectors.py to generate token-vectors from the Wikipedia corpus; there's an example command line in its comments for the 384-dim MiniLM model:

python src/python/infer_token_vectors.py ../data/enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt \
    ../data/enwiki-20120502.all-MiniLM-L6-v2.tok \
    ../data/enwiki-20120502.all-MiniLM-L6-v2.vec

This generates a dictionary of token->vector for all the tokens in our wikipedia data.
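For anyone poking at the output, here is a rough sketch of loading such a token→vector dictionary in Python. The file layout is an assumption (one token per line in the .tok file, with the matching whitespace-separated floats on the same line of the .vec file), and `load_token_vectors` is an illustrative name, not luceneutil's actual API:

```python
def load_token_vectors(tok_path, vec_path):
    # Assumed layout: line i of tok_path holds token i, line i of
    # vec_path holds its vector as whitespace-separated floats.
    vectors = {}
    with open(tok_path) as tok_f, open(vec_path) as vec_f:
        for token, line in zip(tok_f, vec_f):
            vectors[token.strip()] = [float(x) for x in line.split()]
    return vectors
```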

Then we can generate vectors for the wikipedia line documents and for the vector search tasks using the build targets in build.xml: vectors-minilm-tasks and vectors-minilm-docs. The way we do this is kind of cheesy; we just sum up the token vectors for the tokens in each document. This isn't really SOTA, but it's probably fine for performance testing.
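The "cheesy" summing step can be sketched like this (a minimal illustration of the idea, not luceneutil's actual code; the function name and handling of out-of-vocabulary tokens are assumptions):

```python
def embed_document(tokens, token_vectors, dim):
    # Sum the per-token vectors to get one vector per document.
    doc_vec = [0.0] * dim
    for tok in tokens:
        vec = token_vectors.get(tok)
        if vec is None:
            continue  # assumed: silently skip out-of-vocabulary tokens
        for i, x in enumerate(vec):
            doc_vec[i] += x
    return doc_vec
```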

For 8-bit vectors we need to choose a scale factor and run another job (there's an example for GloVe in build.xml). OTOH, rather than precomputing these, perhaps we can wait until Lucene quantization lands and use that? I'll open a separate issue for this one.
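For the precomputed 8-bit case, the scale-factor idea amounts to something like the following sketch (hedged: this mirrors the general technique, not the exact job in build.xml or Lucene's own quantization; `quantize_8bit` is a hypothetical name):

```python
def quantize_8bit(vec, scale):
    # Multiply each float component by the chosen scale factor and
    # clamp the rounded result to the signed-byte range [-128, 127].
    out = []
    for x in vec:
        q = round(x * scale)
        out.append(max(-128, min(127, q)))
    return out
```

Values that overflow the byte range after scaling saturate, which is why picking a good scale factor for the dataset matters.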

Moving from 100-d to 384-d vectors is going to slow things down a bit. We might also want to coordinate this with adopting concurrent vector merges (apache/lucene#12660); I'll open a separate issue for that.

@msokolov
Collaborator Author

msokolov commented Oct 27, 2023

I uploaded enwiki-20120502-lines-1k-mpnet.vec to home.apache.org:~sokolov, which has vectors for 33M docs. I also uploaded the token dictionary (enwiki-20120502-minilm.tok / enwiki-20120502-minilm.vec) and an 8-bit vectors file (enwiki-20120502-lines-1k-minilm-8bit.vec), and I pushed task files to main. So I think we have all the data we need to switch the nightly benchmarks over. But @mikemccand, you must do the final step, since you will need to download the data to the beast before we can cut over.
