Consider contributing to the search benchmark game #150
Tantivy actually looks pretty good. Rust allows it to be closer to hardware than Lucene and Tantivy seems to have some specialization for certain common cases like pure disjunctions of term queries vs. disjunctions of arbitrary queries. It used to lack significant features but it's catching up impressively quickly, e.g. it supports doc values and block-max WAND nowadays. We're very much looking forward to Panama and the vector API, hopefully this will allow us to reduce the difference between Tantivy and Lucene. However I don't think that io_uring is something that we can leverage, I fear it would make our (already complicated) search abstractions even more complicated (except maybe for our simpler and I/O heavy APIs like stored fields?).
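(For readers less familiar with Lucene: the "pure disjunction of term queries" case mentioned above looks roughly like the sketch below. This assumes Lucene's stock TermQuery/BooleanQuery API; the field name and terms are made up. On a plain top-k search, Lucene can apply block-max WAND and skip blocks of documents that cannot become competitive.)

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class DisjunctionExample {
  // Pure disjunction of term queries over a hypothetical "body" field.
  static TopDocs topTen(IndexSearcher searcher) throws IOException {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String term : new String[] {"search", "benchmark", "game"}) {
      builder.add(new TermQuery(new Term("body", term)), BooleanClause.Occur.SHOULD);
    }
    // A plain top-10 search (exact hit count not required) lets the scorer
    // skip blocks whose maximum impact cannot reach the current top 10.
    return searcher.search(builder.build(), 10);
  }
}
```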
Thank you for all the enthusiasm @LifeIsStrange! I could not agree more that making benchmarking easier, more standardized, etc. is an excellent goal to allow true comparisons of the world's search engines. Tantivy (and Rust) look awesome! I love that Rust allows direct access to vectorized SIMD instructions, whereas in javaland we have to play silly games by carefully writing our Java code and hoping the JIT compiler auto-vectorizes it.
Indeed, Tantivy mentions using SIMD integer compression; see a blog post by the Tantivy founder talking about this optimization.
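To illustrate what that compression does, here is a rough scalar sketch (plain Java of my own, not the actual Tantivy or Lucene code) of packing small non-negative integers, e.g. doc-id deltas, into a fixed number of bits each. The real codecs do this block-wise and, in Tantivy's case, with SIMD instructions:

```java
class BitPackingSketch {
  // Packs each delta into `bits` bits. Assumes every delta is non-negative
  // and fits in `bits` bits; real codecs work on fixed-size blocks and
  // vectorize this loop.
  static long[] pack(int[] deltas, int bits) {
    long[] out = new long[(deltas.length * bits + 63) / 64];
    int bitPos = 0;
    for (int delta : deltas) {
      int word = bitPos >>> 6;    // which 64-bit word
      int offset = bitPos & 63;   // bit offset inside that word
      out[word] |= ((long) delta) << offset;
      if (offset + bits > 64) {   // value spills into the next word
        out[word + 1] |= ((long) delta) >>> (64 - offset);
      }
      bitPos += bits;
    }
    return out;
  }
}
```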
But in fact, Java has supported a clean, explicit vector API since OpenJDK 16! I have played with it a bit and the API 1) looks much simpler than classical low-level SIMD, 2) is superior in that it allows cross-platform SIMD instructions (it works on ARM out of the box), and 3) since the JVM has a runtime, it can select modern instructions (AVX vs SSE, SVE vs NEON) and a higher SIMD lane length at run time (I'm not sure about the last one, but maybe it can downscale AVX-512 to 256-bit AVX automatically?)
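For reference, a small sketch of what that incubating jdk.incubator.vector API looks like (a toy example of my own, needing --add-modules jdk.incubator.vector to compile and run), computing a float dot product. The preferred species is chosen at run time, so the same code maps onto AVX-512, AVX2 or NEON:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class VectorApiSketch {
  static float dotProduct(float[] a, float[] b) {
    // SPECIES_PREFERRED picks the widest vector shape the CPU supports.
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    FloatVector acc = FloatVector.zero(species);
    int i = 0;
    int bound = species.loopBound(a.length);
    for (; i < bound; i += species.length()) {
      FloatVector va = FloatVector.fromArray(species, a, i);
      FloatVector vb = FloatVector.fromArray(species, b, i);
      acc = va.fma(vb, acc);        // lane-wise acc += a[i..] * b[i..]
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {     // scalar tail
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```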
Besides SIMD integer compression in Tantivy, another interesting software optimization is its take on the Levenshtein automaton. Back in 2015, someone on Hacker News wrote a blog post with a much simpler implementation of the automaton, although it might be less performant. Then in 2019, the Tantivy founder wrote a blog post about Levenshtein automata.
So maybe Lucene could prune the states and try to be smarter about it. Independently of that, this repository mentions a significant performance advantage over the Lucene DFA.
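For context, on the Lucene side fuzzy matching already goes through a Levenshtein automaton: FuzzyQuery compiles the query term into such an automaton and intersects it with the terms dictionary, so the automaton's construction cost and state count are exactly what the pruning above would help with. A small sketch with a made-up field and term:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

class FuzzySketch {
  // Matches terms within edit distance 2 of "benchmark" in the "body" field;
  // internally this is driven by a Levenshtein automaton intersected with
  // the terms dictionary.
  static FuzzyQuery fuzzyQuery() {
    return new FuzzyQuery(new Term("body", "benchmark"), 2);
  }

  // The underlying automaton can also be built directly: it accepts every
  // string within edit distance 2 of "benchmark" (transpositions not counted).
  static Automaton automaton() {
    return new LevenshteinAutomata("benchmark", false).toAutomaton(2);
  }
}
```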
Finally, this recent paper might be interesting. So to conclude, it seems there are interesting optimizations to explore regarding Lucene's fuzzy-search performance :) Digression: you might find this blog post about finite state transducers and their various use cases interesting, although, as always, you were the first to introduce those notions ;) @fulmicoton friendly tag since this comment is about some of your contributions
Hi @mikemccand !
I recently discovered your blog, and I have to say I am a huge fan.
I do not (yet) have decent knowledge of Lucene (in fact I have never used it), but it's a technology I use indirectly (through ELK/OpenSearch) and, like you, I have a great interest in the idea of progress, and what better way to quantify progress than performance optimizations and benchmarking?
The (probably) most famous microbenchmark for comparing programming languages is the benchmark game
In homage to its name, people have developed a Search benchmark game.
I recently stumbled upon luceneutil and it seems to be a great set of benchmarks, useful for catching Lucene regressions and improvements.
However, the Search benchmark game has complementary value despite being (currently) simpler in the kinds of queries and datasets it tests, as it allows comparing Lucene's performance with its competitors'. Among those, the Rust library Tantivy stands out as being on average twice as fast as Lucene!
I'm sure that Lucene has some specific, advanced optimizations that would make it faster for certain kinds of queries,
and of course performance is only one criterion (among user-friendliness, correctness and feature completeness) for the choice of a search library.
But Tantivy seems to be significantly and consistently faster for generic queries (at least on this dataset), and of course being Rust-based is an advantage over Java, but it's possible that its advantage also comes from software/algorithmic optimizations that Lucene could take inspiration from (and conversely!)
I do not have the qualifications nor the time, but maybe you could try to contribute to the benchmark in two ways:
Those are just suggestions from an enthusiast, feel free to ignore this issue! :)