cli (tjs idea) #32
base: main
Conversation
It turns out that C is pretty fast (that's single-threaded).

yeah, 65% faster if I read the output correctly 🤣
Even if it's 65% faster, it might still be worth trying to improve. One fairly simple thing would be to change the scoring calculation from int16_t to float. Floats are almost the same speed as int16_t for single operations: sometimes a bit faster, sometimes a bit slower depending on the operation and processor, but very comparable. The difference is that the compiler can do much more optimization for floats using SIMD extensions like SSE and AVX with the right compiler options, although there is some integer support in modern SIMD instruction sets.

Additionally, under strict aliasing the floating point type is distinct from the char and int types used for the other arrays, so the compiler doesn't always have to assume aliasing is going on, which would force values to be re-read from memory instead of using cached results in registers. There is still a lot of potential aliasing, though, and I'm not sure how well the compiler can analyse what's going on, so some manual annotation might improve things further.

Using floats will of course cause a bit more memory pressure and therefore potentially more cache misses, which are slow. But the slab is re-used, and unless you match really long strings, everything fits easily into the cache, so that should not be an issue. Of course there could be other things going on that would make floats slower instead of faster, so benchmarking is needed. For example, it's possible that the integer version already uses whatever SIMD extensions are applicable, or that the opportunity for optimization does not exist the way the code is currently written. But since the algorithm is essentially doing matrix calculations, it might still be possible to rewrite it to take proper advantage of SIMD.
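As a rough illustration of the int16_t-to-float point, here is a hypothetical scoring-style kernel (not fzy's actual code): with `float` data and `restrict`-qualified pointers telling the compiler the arrays don't alias, gcc and clang can typically auto-vectorize the loop with SSE/AVX at `-O2`/`-O3`.

```c
#include <stddef.h>

/* Hypothetical scoring kernel, sketching the int16_t -> float idea.
 * `restrict` promises the rows don't alias, so the compiler may keep
 * values in registers and vectorize the loop with SSE/AVX. */
void score_row(float *restrict out, const float *restrict prev,
               const float *restrict bonus, size_t n, float gap_penalty)
{
    for (size_t i = 0; i < n; i++) {
        float with_gap   = prev[i] - gap_penalty;     /* extend a gap */
        float with_match = prev[i] + bonus[i];        /* take the match bonus */
        out[i] = with_match > with_gap ? with_match : with_gap;
    }
}
```

Whether this actually wins over the int16_t version depends on the processor and compiler, so it would have to be benchmarked, as noted above.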
All of this sounds good. I think you are way more qualified to do this than I am. 😆 But we are not stopping making telescope faster. The idea here was having a pipe that allows us to move the score calculation away from the neovim thread. Another idea we have (I talked about this yesterday with TJ) was bundling fzf-native in telescope core and requiring people to do

What I was thinking is having the store actually in C as a (score, index) tuple (we can't store the actual table reference in C afaik, but we can store the index to the table ref that lies at i in the results table, I think). We could make a heap in C then and maybe sort the first 500~1000 correctly without a performance penalty, because we have seen a factor 10 performance improvement between the C part and Lua (fzy-native vs the fzy Lua version). I just need to mention it. I am not sure that a heap is optimal here; I just know that it ended up being faster than the linked list approach I tried first, and I had built a heap before, so I tried that. And it ended up bringing the time down from
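The (score, index) tuple plus heap idea above could look something like this minimal sketch (hypothetical names, not telescope's actual code; the PR uses a max-heap, while this variant keeps the N best entries in a fixed-capacity min-heap whose root is the worst retained score):

```c
#include <stddef.h>

/* C side stores only a numeric score plus the index into the Lua results
 * table, not the table reference itself. */
typedef struct { float score; int index; } entry_t;

typedef struct {
    entry_t *items;
    size_t   len;
    size_t   cap;   /* e.g. 500-1000, the number of displayed results */
} topk_t;

static void sift_down(topk_t *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->len && h->items[l].score < h->items[m].score) m = l;
        if (r < h->len && h->items[r].score < h->items[m].score) m = r;
        if (m == i) return;
        entry_t tmp = h->items[i]; h->items[i] = h->items[m]; h->items[m] = tmp;
        i = m;
    }
}

static void sift_up(topk_t *h, size_t i)
{
    while (i > 0) {
        size_t p = (i - 1) / 2;
        if (h->items[p].score <= h->items[i].score) return;
        entry_t tmp = h->items[i]; h->items[i] = h->items[p]; h->items[p] = tmp;
        i = p;
    }
}

/* Insert if the heap isn't full, otherwise replace the worst retained
 * entry when the new score beats it. */
void topk_push(topk_t *h, float score, int index)
{
    if (h->len < h->cap) {
        h->items[h->len] = (entry_t){ score, index };
        sift_up(h, h->len++);
    } else if (score > h->items[0].score) {
        h->items[0] = (entry_t){ score, index };
        sift_down(h, 0);
    }
}
```

Each push is O(log N) with N the display count, independent of the total number of candidates, which is the property that made the heap beat the linked-list approach mentioned above.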
"I just need to mention it. I am not sure that a heap is optimal here" Optimal is a preallocated heap with offsets and length to objects xor padded objects. The latter looks slower though, since line length can differ alot. The first could be tried, but would be wasteful. We dont know the count of hits, so any estimation might be wrong. Second-optimal are arena/region-allocators, where you just bump the capacity (pointer) to add more stuff inside instead having the malloc overhead on every call. If the memory page is full, the arena allocator takes the next one. The index stores the offset to text chunks. The control structure stores pointer to the indexes. So something like control_block
| |
| |->index0: | offset0 | offset 1 | ... |
|------>index1: | offset0 | offset 1 | ... |
and offset0 -------------> text chunk0 (assume this is a continues memory chunk from region allocator)
offset1 -------------> text chunk1
... .... Then the problem of cache locality boils down to having the lookup the text defined by offsets. I am not familiar with what is stored and how sorting should work, so I cant tell how the control_block would look exactly. Note, that looks like a fundamental redesign. So it should be done later and not in this PR as to limit the scope. Not sure, if adding a arena/region allocator is worth it though. https://github.com/cgaebel/arena_alloc looks good enough for that. |
I was talking about the heap data structure, like I implemented here in this PR (a max-heap, to be more specific). I wasn't talking about allocation. fzf-native actually only calculates the score; telescope sorts it in a data structure (currently a linked list, and we only sort the first n elements, the displayed ones). I was thinking maybe we could improve that and other core elements in telescope. That's all. But that doesn't affect this repository either. This PR is just me playing around with an idea that TJ mentioned. I'm not trying to get this merged anytime soon (I could just merge it), but I haven't figured out what I want to do with it. Still, thanks for your comments :)
assuming you have a file called files, doing fd --hidden -I > files for example.

fzf.h is the "prompt term" here

single threaded

simple multi threaded attempt (add to list/sorting still missing)