Can we store only quantized vectors to reduce disk footprint? #14007
Comments
Very feasible. However, it gets tricky on the accuracy front, and the quantization methods will require updating to handle this. But all the current (and hopefully future) quantization methodologies can easily "re-hydrate" vectors so that iterating floats is still possible (folks give us floats; we should be able to return floats if they want them).
For higher fidelity quantization methods, I expect accuracy to be OK. I am not 100% sure, with how things are now, whether we would ever want to quantize below int7 for storage. Int4 as it stands (when using the dynamic confidence interval) would suffer greatly without some more significant improvement to the algorithm.
Excuse my ignorance, but I was wondering...
Could you elaborate on the computational costs associated with this? If users never need to retrieve the original floats, is it feasible to skip this rehydration step and use the quantized vectors directly for distance calculations?
So, would int7 be considered a higher fidelity quantization method? Based on your experience and insights, how would you rate the fidelity of int7, int4, and binary quantization methods? Where do they stand in terms of maintaining accuracy while optimizing storage efficiency? Has the Lucene community already planned or discussed the implementation of a dedicated KnnVectorsFormat for handling only quantized vectors? Are there quick support mechanisms for users who are willing to compromise on accuracy for significant savings in disk space and do not require the original vectors?
Quantized vectors are used directly for distance calculations today. We keep the full precision values around for recalculating quantiles during merging. It could be useful to enable saving disk space at the cost of slower retrieval of float full precision vectors. Some use-cases might have uniform data distributions anyway.
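(To make "used directly" concrete, here is a minimal sketch of scoring over int8 codes without any rehydration. This illustrates the idea only; Lucene's real scalar quantization also carries per-vector corrective constants that are omitted here.)

```java
// Sketch only: similarity computed straight over the quantized codes.
// Real scalar quantization also applies per-vector corrective terms.
static int dotProduct(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i]; // widened to int; amenable to SIMD
  }
  return sum;
}
```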
Note that we would also need the original float32 vectors for re-ranking in case it's needed (e.g. with int4; see #13564) -- a sketch of that pattern follows below. But if someone prefers less storage and is okay with lower recall, then dropping the original vectors on the searching side (if indexing and searching are separated) could be fine. I think we still need them for indexing and merging, per vigyasharma@'s comment. There are also several related ideas.
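(For readers unfamiliar with that re-ranking pattern, a hedged sketch: over-fetch candidates from the quantized index, then re-score them with the original float32 vectors. All names here are hypothetical, not Lucene APIs.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical two-phase search: quantized ANN supplies candidates,
// original float32 vectors decide the final ordering.
record Hit(int doc, float score) {}

static List<Hit> rerank(int[] candidateDocs, float[] query, float[][] float32Vectors, int k) {
  List<Hit> hits = new ArrayList<>();
  for (int doc : candidateDocs) {
    float dot = 0;
    float[] v = float32Vectors[doc];
    for (int i = 0; i < v.length; i++) dot += query[i] * v[i];
    hits.add(new Hit(doc, dot));
  }
  hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
  return hits.subList(0, Math.min(k, hits.size()));
}
```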
I don't know if it's strictly necessary to keep the raw vectors for merging. Once a certain limit is reached, especially for vectors that play well with quantization, you can likely throw the vectors away even on the indexing side. It would just require work, and I wouldn't want it to be the default behavior. I will answer the wall of questions from @Rassyan, but most of these would be discoverable by exploring the quantization methodology for yourself.
It's just the inverse of the quantization method: for every vector, you iterate its components, inverting the quantization step.
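(A minimal sketch of that inverse step for a scalar, int7-style quantizer; `minQuantile` and `scale` are the hypothetical parameters saved at quantization time.)

```java
// Illustrative "re-hydration": one multiply-add per component,
// using the parameters stored when the vector was quantized.
static float[] dequantize(byte[] q, float minQuantile, float scale) {
  float[] out = new float[q.length];
  for (int i = 0; i < q.length; i++) {
    out[i] = q[i] * scale + minQuantile;
  }
  return out;
}
```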
I would think so, unless you had a different one in mind?
For pure score correlations, what we have seen:
There are vectors that are aggressively AGAINST quantization (e.g. GIST & Glove) and perform way worse than all other vectors. But modern transformer architectures perform way better.
Just do the math ;)
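(For readers who'd rather see the math spelled out: raw per-vector storage for a d-dimensional vector, ignoring per-vector correction constants and graph overhead. The dimension is just an example.)

```java
int d = 768;                     // example: a common transformer embedding size
int float32Bytes = 4 * d;        // 3072 bytes per vector
int int7Bytes    = d;            // 768 bytes  (~4x smaller)
int int4Bytes    = (d + 1) / 2;  // 384 bytes  (~8x smaller, packed two per byte)
int bitBytes     = (d + 7) / 8;  // 96 bytes   (~32x smaller)
```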
There have been discussions around removing the floating point vectors for "search segments" while keeping them for "index segments".
I don't even know what this means.
In my opinion, we should not have lossy codecs. This creates situations where errors could compound in unpredictable ways over time, e.g. when you switch file formats. I'd rather like it to be done on top of the codec API, e.g. computing a good scalar quantization for a given model offline, and then using it on the way in to index vectors directly as byte[].
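(A sketch of what that offline flow might look like, assuming a simple quantile-based scalar quantizer. All names are hypothetical; Lucene's real quantizer is more involved.)

```java
import java.util.Arrays;

// Offline: derive quantization parameters once from a sample of the model's
// vectors, then apply the same parameters to every vector indexed as byte[].
final class OfflineQuantizer {
  final float min, scale;

  // sampleValues: flattened components from a sample of the model's vectors.
  OfflineQuantizer(float[] sampleValues, double lowerQ, double upperQ) {
    float[] sorted = sampleValues.clone();
    Arrays.sort(sorted);
    min = sorted[(int) (lowerQ * (sorted.length - 1))];
    float max = sorted[(int) (upperQ * (sorted.length - 1))];
    scale = Math.max((max - min) / 127f, Float.MIN_NORMAL); // int7 code range 0..127
  }

  byte[] quantize(float[] v) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int q = Math.round((v[i] - min) / scale);
      out[i] = (byte) Math.max(0, Math.min(127, q)); // clamp to the int7 range
    }
    return out;
  }
}
```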
Lucene's HNSW KNN already supports this -- users can provide either float[] or byte[] vectors. So a user could pre-quantize their vectors down to byte[] themselves and index those directly.
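(If I understand correctly, that pre-quantized path works with stock Lucene APIs today; a minimal sketch follows. The field name and similarity choice are arbitrary, and the query vector must be quantized with the same parameters as the indexed ones.)

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnByteVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.KnnByteVectorQuery;

// Index vectors the user quantized themselves; Lucene stores only the byte[] codes.
static void indexQuantized(IndexWriter writer, byte[] quantized) throws IOException {
  Document doc = new Document();
  doc.add(new KnnByteVectorField("vec", quantized, VectorSimilarityFunction.DOT_PRODUCT));
  writer.addDocument(doc);
}

// Search with a query vector quantized using the same parameters.
static KnnByteVectorQuery queryFor(byte[] quantizedQuery, int k) {
  return new KnnByteVectorQuery("vec", quantizedQuery, k);
}
```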
Also, note that, at least for the current scalar quantization, the original float32 vectors are still needed at indexing/merge time (for recomputing quantiles). For apps using near-real-time segment replication (the best/preferred replication approach IMO), it may be possible eventually to strip out the original float32 vectors on the searching side, since only the indexing side needs them.
I do think we should consider adding support for half-floats. Regardless of whether they can utilize hardware-accelerated comparisons or not, it would greatly reduce the disk footprint. The contract with the user for half-floats would be pretty straightforward, as they specify that they want to store them as such. The segment-based replication solution is a very interesting one. Though for higher compression ratios (e.g. 1 or 2 bits), you generally need the rescoring. Though, conceivably, the codec could quantize the vectors twice (int7, then to bit) and allow you to rerank bit quantized values with int7 quantized values...quantization all the way down
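(A sketch of what the half-float contract could look like, assuming JDK 20+, which added `Float.floatToFloat16` / `Float.float16ToFloat`: two bytes per dimension instead of four, with the precision loss opted into explicitly by the user.)

```java
// Half-float storage sketch (JDK 20+ provides these conversions).
// The lossy round-trip happens once, at the user's explicit request.
static short[] toHalf(float[] v) {
  short[] out = new short[v.length];
  for (int i = 0; i < v.length; i++) out[i] = Float.floatToFloat16(v[i]);
  return out;
}

static float[] fromHalf(short[] h) {
  float[] out = new float[h.length];
  for (int i = 0; i < h.length; i++) out[i] = Float.float16ToFloat(h[i]);
  return out;
}
```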
I like this idea; it allows a more granular trade-off between memory usage and recall. Although I wonder whether it means we need to store all 3 levels (float32/int7/bit) on disk, or sacrifice accuracy during merging by relying on just the intermediate int7.
+1
Do we have a separate issue open already for half-floats? It seems like it deserves its own spinoff issue ...
Description
To optimize disk usage for KNN vector searches in Lucene, I propose considering a new KnnVectorsFormat class that handles only quantized vectors, eliminating the need to store the original float32 vectors. This approach could significantly reduce disk usage, with potential reductions similar to the memory savings seen in int8 quantization scenarios, where usage can drop to about 25% of the original. This figure is illustrative; actual savings will vary with the quantization method and storage configuration.
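As a worked example of that figure: one million 768-dimensional float32 vectors occupy about 3.07 GB of raw vector data (1,000,000 × 768 × 4 bytes), while the same vectors at one byte per dimension occupy about 0.77 GB, i.e. roughly 25%, before graph and metadata overhead.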
I seek community feedback on the feasibility and design of such a format. Your insights will help determine the viability of this approach for enhancing Lucene's vector search capabilities.