
Switch nightly benchy to more realistic Cohere/wikipedia-22-12-en-embeddings vectors #256

Open
mikemccand opened this issue Mar 5, 2024 · 22 comments

Comments

@mikemccand
Owner

#255 added realistic Cohere/wikipedia-22-12-en-embeddings 768 dim vectors to luceneutil -- let's switch over nightlies to use these vectors instead.

@mikemccand
Owner Author

I attempted to follow the README instructions to generate nightly benchy vectors, using this command:

python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000

(Note that the nightly benchy only does indexing, so I really only need the first file)

But this apparently consumes gobs of RAM, and the Linux OOM killer killed it!

Is this expected? I can run this on a beefier machine if need be (current machine has "only" 256 GB and no swap) for this one-time generation of vectors ...

Maybe datasets.load_dataset can load just the N vectors I need, not everything in the train split?
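Two things that might bound the RAM here (a rough sketch, not what infer_token_vectors_cohere.py does today; the dataset name is the real one, the rest is illustrative): load only a slice of the train split, or stream the rows and append them to the output file as they arrive:

import datasets
import numpy as np

num_docs = 27625038

# Option 1: split slicing -- only materialize the first num_docs rows:
# ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
#                            split=f"train[:{num_docs}]")

# Option 2: streaming -- never hold the whole split in RAM; iterate and append:
ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                           split="train", streaming=True)
with open("../data/cohere-wikipedia-768.vec", "wb") as out_f:
  for i, row in enumerate(ds):
    if i >= num_docs:
      break
    np.asarray(row["emb"], dtype=np.float32).tofile(out_f)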

@mikemccand
Owner Author

Oooh, this load_dataset method takes a keep_in_memory parameter! I'll poke around.

@mikemccand
Owner Author

OK, well, that keep_in_memory=False parameter seemed to do nothing -- the OOM killer still fired at 256 GB RAM.

With this change to do chunking into 1M blocks of vectors when writing the index-time inferred vectors, I was able to run the tool!

Full output below:

beast3:util.nightly[master]$ python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000
Resolving data files: 100%|████████████████████████████████████████████| 253/253 [00:01<00:00, 250.70it/s]
Loading dataset shards: 100%|█████████████████████████████████████████| 252/252 [00:00<00:00, 1121.66it/s]
total number of rows: 35167920
embeddings dims: 768
saving docs[0:1000000 of shape: (1000000, 768) to file
saving docs[1000000:2000000 of shape: (1000000, 768) to file
saving docs[2000000:3000000 of shape: (1000000, 768) to file
saving docs[3000000:4000000 of shape: (1000000, 768) to file
saving docs[4000000:5000000 of shape: (1000000, 768) to file
saving docs[5000000:6000000 of shape: (1000000, 768) to file
saving docs[6000000:7000000 of shape: (1000000, 768) to file
saving docs[7000000:8000000 of shape: (1000000, 768) to file
saving docs[8000000:9000000 of shape: (1000000, 768) to file
saving docs[9000000:10000000 of shape: (1000000, 768) to file
saving docs[10000000:11000000 of shape: (1000000, 768) to file
saving docs[11000000:12000000 of shape: (1000000, 768) to file
saving docs[12000000:13000000 of shape: (1000000, 768) to file
saving docs[13000000:14000000 of shape: (1000000, 768) to file
saving docs[14000000:15000000 of shape: (1000000, 768) to file
saving docs[15000000:16000000 of shape: (1000000, 768) to file
saving docs[16000000:17000000 of shape: (1000000, 768) to file
saving docs[17000000:18000000 of shape: (1000000, 768) to file
saving docs[18000000:19000000 of shape: (1000000, 768) to file
saving docs[19000000:20000000 of shape: (1000000, 768) to file
saving docs[20000000:21000000 of shape: (1000000, 768) to file
saving docs[21000000:22000000 of shape: (1000000, 768) to file
saving docs[22000000:23000000 of shape: (1000000, 768) to file
saving docs[23000000:24000000 of shape: (1000000, 768) to file
saving docs[24000000:25000000 of shape: (1000000, 768) to file
saving docs[25000000:26000000 of shape: (1000000, 768) to file
saving docs[26000000:27000000 of shape: (1000000, 768) to file
saving docs[27000000:27625038 of shape: (625038, 768) to file
saving queries of shape: (10000, 768) to file
reading docs of shape: (27625038, 768)
reading queries shape: (10000, 768)

It produced a large .vec file:

beast3:util.nightly[master]$ ls -lh ../data/cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 159G Mar 10 22:13 ../data/cohere-wikipedia-768.vec

Next I'll try switching to this source for nightly benchy. I'll also publish this on home.apache.org.

@mikemccand
Owner Author

Hmm, except, that file is too large?

beast3:util.nightly[master]$ python3
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 27000000 * 768 * 4 / 1024 / 1024 / 1024
77.24761962890625

It's 159 GB but should be ~77 GB?

Maybe my "chunking" is buggy :)

@mikemccand
Owner Author

OK I think these are float64 typed vectors, in which case the file size makes sense. But I think nightly benchy wants float32?
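Redoing the arithmetic with the full 27,625,038 rows, float64 matches what's on disk, and float32 is about half that:

27625038 * 768 * 8 bytes = ~158 GB  (float64 -- matches the 159 GB file)
27625038 * 768 * 4 bytes = ~79 GB   (float32 -- what nightly benchy wants)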

@mikemccand
Owner Author

mikemccand commented Mar 11, 2024

And I think knnPerfTest.py/KnnGraphTester.java also wants float32? I'm confused how they are working now on the generated file ...

@mikemccand
Owner Author

Oooh this Dataset.cast method looks promising! I'll explore...

@mikemccand
Owner Author

OK I made this change and kicked off infer_token_vectors_cohere.py again, and it at least looks to be running:

diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..5027eb2 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,19 @@ for name in (filename, filename_queries):

 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")

 # ds = ds[:num_docs]

+# we just want the vector embeddings:
+for feature_name in ds.features.keys():
+  if feature_name != 'emb':
+    ds = ds.remove_columns(feature_name)
+
+ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0

@mikemccand
Owner Author

OK hmm scratch that, I see from the already loaded features that Dataset thinks these emb vectors are already float32:

features: {'id': Value(dtype='int32', id=None), 'title': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'wiki_id': Value(dtype='int32', id=None), 'views': Value(dtype='float32', id=None), 'paragraph_id': Value(dtype='int32', id=None), 'langs': Value(dtype='int32', id=None), 'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}

@mikemccand
Copy link
Owner Author

OK! Now I think the issue is in np.array -- I think we have to give it an explicit data type, else it seems to cast the Dataset's float32 up to float64.

So, now I'm testing this:

diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..4cc305e 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,20 @@ for name in (filename, filename_queries):

 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")

 # ds = ds[:num_docs]

+if False:
+  # we just want the vector embeddings:
+  for feature_name in ds.features.keys():
+    if feature_name != 'emb':
+      ds = ds.remove_columns(feature_name)
+
+  ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0
@@ -40,7 +49,7 @@ window_num_docs = 1000000
 while doc_upto < num_docs:
   next_doc_upto = min(doc_upto + window_num_docs, num_docs)
   ds_embs = ds[doc_upto:next_doc_upto]['emb']
-  embs = np.array(ds_embs)
+  embs = np.array(ds_embs, dtype=np.single)
   print(f"saving docs[{doc_upto}:{next_doc_upto} of shape: {embs.shape} to file")
   with open(filename, "ab") as out_f:
       embs.tofile(out_f)
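Quick sanity check of that np.array default (a standalone snippet, not part of the script -- numpy upcasts plain Python floats to float64 unless told otherwise):

import numpy as np

print(np.array([[0.1, 0.2]]).dtype)                   # float64 -- numpy's default
print(np.array([[0.1, 0.2]], dtype=np.single).dtype)  # float32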

mikemccand added a commit that referenced this issue Apr 29, 2024
@mikemccand
Owner Author

OK the above change seemed to have worked (I just pushed it)! I now see these vector files:

-rw-r--r-- 1 mike mike  80G Mar 28 12:57 cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 586M Mar 28 12:57 cohere-wikipedia-queries-768.vec

Now I will try to confirm their recall seems sane, and then switch nightly to them.

@mikemccand
Owner Author

OK I think the next wrinkle here is ... to fix SearchPerfTest to use the pre-computed Cohere query vectors from cohere-wikipedia-queries-768.vec, instead of attempting to do inference based on the lexical tokens of each incoming query. I guess we could just incrementally pull the vectors from the query vectors file and assign them sequentially to each vector query we see? @msokolov does that sound reasonable?
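Roughly this, on the Python side just to illustrate (the real change would be in SearchPerfTest; the file path and dimension are the ones above, the function is made up):

import numpy as np

DIM = 768
queries = np.memmap('/lucenedata/enwiki/cohere-wikipedia-queries-768.vec',
                    dtype=np.float32, mode='r').reshape(-1, DIM)

query_upto = 0

def next_query_vector():
  # hand out the pre-computed query vectors in file order, wrapping around
  # if the tasks file has more vector queries than we have vectors
  global query_upto
  vec = queries[query_upto % len(queries)]
  query_upto += 1
  return vec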

@msokolov
Collaborator

I think we can modify VectorDictionary to accept a --no-tokenize option and then look up the vector using the full query text? We would need to generate a text file with the queries, one per line, to correspond with the binary vector file.
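Something like this, just to sketch the lookup (the --no-tokenize option and the queries text file don't exist yet; the file names here are hypothetical):

import numpy as np

DIM = 768
# binary query vectors plus a parallel text file, one query per line, same order
vectors = np.memmap('cohere-wikipedia-queries-768.vec',
                    dtype=np.float32, mode='r').reshape(-1, DIM)
with open('cohere-wikipedia-queries.txt', encoding='utf-8') as f:
  texts = [line.rstrip('\n') for line in f]

assert len(texts) == len(vectors)

# with --no-tokenize, resolve the vector from the full (untokenized) query text:
query_to_vector = {text: vectors[i] for i, text in enumerate(texts)}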

@msokolov
Collaborator

Otherwise you could simply select some random vector every time you see a vector-type query task? But I would expect some vectors to behave differently from others? Not sure.

@mikemccand
Owner Author

I was finally able to index/search using these Cohere vectors, and the profiler output is sort of strange:

This is CPU:

PROFILE SUMMARY from 44698 events (total: 44698)
  tests.profile.mode=cpu
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
10.98%        4907          jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
6.05%         2702          jdk.incubator.vector.FloatVector#reduceLanesTemplate()
4.06%         1813          org.apache.lucene.store.MemorySegmentIndexInput#readByte()
3.63%         1622          perf.PKLookupTask#go()
2.89%         1292          org.apache.lucene.store.DataInput#readVInt()
2.84%         1269          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
2.58%         1153          org.apache.lucene.util.fst.FST#findTargetArc()
2.45%         1093          jdk.incubator.vector.FloatVector#fromArray0Template()
2.28%         1018          org.apache.lucene.util.LongHeap#downHeap()
2.25%         1007          org.apache.lucene.util.SparseFixedBitSet#insertLong()
2.06%         921           org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
2.05%         916           jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
1.96%         875           jdk.incubator.vector.FloatVector#lanewiseTemplate()
1.91%         852           jdk.internal.util.ArraysSupport#mismatch()
1.40%         627           org.apache.lucene.util.compress.LZ4#decompress()
1.31%         586           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()
1.18%         526           org.apache.lucene.util.SparseFixedBitSet#getAndSet()
1.15%         516           org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
0.99%         444           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset()
0.99%         441           org.apache.lucene.util.BytesRef#compareTo()
0.94%         418           org.apache.lucene.util.fst.FST#readArcByDirectAddressing()
0.92%         413           org.apache.lucene.search.TopKnnCollector#topDocs()
0.91%         408           org.apache.lucene.index.VectorSimilarityFunction$2#compare()
0.90%         401           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
0.80%         359           org.apache.lucene.index.SegmentInfo#maxDoc()
0.79%         353           java.util.Arrays#fill()
0.75%         334           org.apache.lucene.codecs.lucene99.Lucene99PostingsReader#decodeTerm()
0.73%         326           java.util.Arrays#compareUnsigned()
0.72%         324           org.apache.lucene.search.ReferenceManager#acquire()
0.71%         317           org.apache.lucene.store.DataInput#readVLong()

and this is HEAP:

PROFILE SUMMARY from 748 events (total: 38182M)
  tests.profile.mode=heap
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       HEAP SAMPLES  STACK
21.88%        8355M         java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
13.50%        5154M         org.apache.lucene.util.ArrayUtil#growNoCopy()
9.31%         3556M         org.apache.lucene.util.SparseFixedBitSet#insertLong()
9.00%         3436M         perf.StatisticsHelper#startStatistics()
9.00%         3436M         java.util.ArrayList#iterator()
5.76%         2199M         org.apache.lucene.util.fst.ByteSequenceOutputs#read()
3.60%         1374M         org.apache.lucene.util.BytesRef#<init>()
3.56%         1357M         org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#<init>()
3.52%         1345M         org.apache.lucene.util.ArrayUtil#growExact()
2.65%         1013M         org.apache.lucene.search.TopKnnCollector#topDocs()
2.50%         956M          java.util.concurrent.locks.AbstractQueuedSynchronizer#tryInitializeHead()
2.34%         893M          org.apache.lucene.util.SparseFixedBitSet#insertBlock()
1.98%         755M          org.apache.lucene.util.LongHeap#<init>()
1.51%         578M          java.util.logging.LogManager#reset()
1.51%         578M          java.util.concurrent.FutureTask#runAndReset()
1.51%         578M          jdk.jfr.internal.ShutdownHook#run()
1.21%         463M          jdk.internal.foreign.MappedMemorySegmentImpl#dup()
0.90%         343M          java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#newConditionNode()
0.83%         315M          org.apache.lucene.util.SparseFixedBitSet#<init>()
0.60%         229M          org.apache.lucene.util.hnsw.FloatHeap#<init>()
0.52%         200M          org.apache.lucene.util.hnsw.FloatHeap#getHeap()
0.45%         171M          jdk.internal.misc.Unsafe#allocateUninitializedArray()
0.45%         171M          org.apache.lucene.util.packed.DirectMonotonicReader#getInstance()
0.45%         171M          org.apache.lucene.store.DataInput#readString()
0.22%         85M           org.apache.lucene.search.knn.TopKnnCollectorManager#newCollector()
0.22%         85M           org.apache.lucene.search.knn.MultiLeafKnnCollector#<init>()
0.15%         57M           org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
0.08%         28M           perf.TaskParser$TaskBuilder#parseVectorQuery()
0.07%         28M           java.util.regex.Pattern#matcher()
0.07%         28M           org.apache.lucene.search.TaskExecutor$TaskGroup#createTask()

Why are we reading individual bytes so intensively? And why is lock acquisition the top HEAP object creator!?

@mikemccand
Owner Author

Here's the perf.py I ran (just A/A):

import sys
sys.path.insert(0, '/l/util/src/python')

import competition

if __name__ == '__main__':
  sourceData = competition.sourceData('wikimediumall')

  sourceData.tasksFile = '/l/util/just-vector-search.tasks'
  comp = competition.Competition(taskRepeatCount=200)
  #comp.addTaskPattern('HighTerm$')                                                                                                                                                                    

  checkout = 'trunk'

  index = comp.newIndex(checkout, sourceData, numThreads=36, addDVFields=True,
                        grouping=False, useCMS=True,
                        #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp',                              
                        ramBufferMB=256,
                        analyzer = 'StandardAnalyzerNoStopWords',
                        vectorFile = '/lucenedata/enwiki/cohere-wikipedia-768.vec',
                        vectorDimension = 768,
                        hnswThreadsPerMerge = 4,
                        hnswThreadPoolCount = 16,
                        vectorEncoding = 'FLOAT32',
                        verbose = True,
                        name = 'mikes-vector-test',
                        facets = (('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                  ('sortedset:Date', 'Date'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear'),
                                  ('sortedset:RandomLabel.sortedset', 'RandomLabel')))

  comp.competitor('base', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                  #javacCommand='/opt/jdk-18-ea-28/bin/javac',                                                                                                                                         
                  #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')                                    
                  )
  comp.competitor('comp', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                  #javacCommand='/opt/jdk-18-ea-28/bin/javac',                                                                                                                                         
                  #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')                                    
                  )
  comp.benchmark('atoa')

@mikemccand
Owner Author

More thread context for the CPU profiling:

PROFILE SUMMARY from 10264 events (total: 10264)
  tests.profile.mode=cpu
  tests.profile.count=50
  tests.profile.stacksize=8
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
12.59%        1292          jdk.incubator.vector.FloatVector#reduceLanesTemplate()
                              at jdk.incubator.vector.Float256Vector#reduceLanes()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
                              at org.apache.lucene.util.VectorUtil#dotProduct()
                              at org.apache.lucene.index.VectorSimilarityFunction$2#compare()
                              at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
7.16%         735           org.apache.lucene.store.DataInput#readVInt()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#graphSeek()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
4.25%         436           jdk.incubator.vector.FloatVector#lanewiseTemplate()
                              at jdk.incubator.vector.Float256Vector#lanewise()
                              at jdk.incubator.vector.Float256Vector#lanewise()
                              at jdk.incubator.vector.FloatVector#fma()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#fma()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
                              at org.apache.lucene.util.VectorUtil#dotProduct()
4.21%         432           jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
                              at jdk.internal.misc.ScopedMemoryAccess#getByte()
                              at java.lang.invoke.VarHandleSegmentAsBytes#get()
                              at java.lang.invoke.VarHandleGuards#guard_LJ_I()
                              at java.lang.foreign.MemorySegment#get()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readByte()
                              at org.apache.lucene.store.DataInput#readVInt()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
4.08%         419           org.apache.lucene.util.SparseFixedBitSet#insertLong()
                              at org.apache.lucene.util.SparseFixedBitSet#getAndSet()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.84%         292           org.apache.lucene.util.LongHeap#downHeap()
                              at org.apache.lucene.util.LongHeap#pop()
                              at org.apache.lucene.util.hnsw.NeighborQueue#pop()
                              at org.apache.lucene.search.TopKnnCollector#topDocs()
                              at org.apache.lucene.search.knn.MultiLeafKnnCollector#topDocs()
                              at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
                              at org.apache.lucene.search.AbstractKnnVectorQuery#getLeafResults()
                              at org.apache.lucene.search.AbstractKnnVectorQuery#searchLeaf()
2.74%         281           org.apache.lucene.index.VectorSimilarityFunction$2#compare()
                              at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.58%         265           org.apache.lucene.util.compress.LZ4#decompress()
                              at org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#serializedDocument()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#document()
                              at org.apache.lucene.index.CodecReader$1#document()
                              at org.apache.lucene.index.BaseCompositeReader$2#document()
                              at org.apache.lucene.index.StoredFields#document()
2.48%         255           org.apache.lucene.util.SparseFixedBitSet#getAndSet()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
                              at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()

Curious that readVInt, when seeking to load a vector (?), is the 2nd hotspot?

@msokolov
Collaborator

msokolov commented Jun 10, 2024 via email

@benwtrent
Collaborator

I might be missing it, but where is the similarity defined for using the Cohere vectors? They are designed for max inner product; if we use euclidean, I would expect graph building and indexing to be poor, as we might get stuck in local minima.

@msokolov
Collaborator

The benchmark tools are hard-coded to use DOT_PRODUCT; see https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/LineFileDocs.java#L454

Maybe this is why we get such poor results w/Cohere?

@benwtrent
Collaborator

@msokolov using dot_product likely doesn't work with the 768-dim Cohere vectors unless they are manually normalized. If they aren't normalized, we will get some wacky scores, and we likely lose a bunch of information by snapping scores to be greater than 0.

I could maybe see cosine working.

But I would suggest we switch to max-inner-product for Cohere 768 for a true test with those vectors as they were designed to be used.
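A quick way to check on the doc vectors file Mike generated (if the norms are all ~1.0, dot product and cosine agree; if they vary, max-inner-product is the right setting):

import numpy as np

DIM = 768
docs = np.memmap('/lucenedata/enwiki/cohere-wikipedia-768.vec',
                 dtype=np.float32, mode='r').reshape(-1, DIM)
norms = np.linalg.norm(docs[:10000], axis=1)  # sample the first 10k vectors
print(norms.min(), norms.mean(), norms.max())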

@msokolov
Collaborator

I ran a test comparing mip and angular over Cohere Wikipedia vectors (what KnnGraphTester calls MAXIMUM_INNER_PRODUCT and DOT_PRODUCT) and the results were surprising:

mainline, Cohere, angular

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.631         0.496  1500000    10       6       32         50         no   330.94         213.55             1          4436.82
 0.617         0.439  1500000    10       6       32         50     7 bits   352.52         217.64             1          5543.35
 0.408         0.422  1500000    10       6       32         50     4 bits   340.32         151.22             1          5544.56

mainline, Cohere, mip

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.593         0.475  1500000    10       6       32         50         no   325.19         210.78             1          4436.81
 0.601         0.454  1500000    10       6       32         50     7 bits   346.48         218.88             1          5543.35
 0.405         0.307  1500000    10       6       32         50     4 bits   345.31         144.83             1          5544.56
