Better Heuristics for Substring Search #72

JobLeonard · 2024-01-24T15:49:17Z

So I recently had to implement a fast text filter for work (in browser JavaScript, so no use for your library (yet) I'm afraid), and used the same-ish kind of insight but inverted: given that nearby characters are strongly correlated, then the first and last characters in a substring should be the least correlated. Or put in other words: suppose two (eventually) different words match their first n characters so far. Of all remaining characters, character n+1 is most likely to be the same, so actually the worst character to test next.

Statistically speaking, comparing the first and last characters has a higher chance to filter out mismatching strings early, avoiding false positives (of course, I'm well aware that linear memory access patterns might have a much bigger impact than this statistic).

This was not my original insight - Wojciech Muła also does this in one of his SIMD string search algorithms: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd (without justification though, he left that as an exercise for the reader)

Looking through the code-base, I see that you first match on a four-character needle, then finish comparing on the suffixes with this sz_equal function:

StringZilla/stringzilla/stringzilla.h

Lines 94 to 98 in 14e7a78

    
           inline static sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) { 
        
               sz_string_start_t const a_end = a + length; 
        
               while (a != a_end && *a == *b) a++, b++; 
        
               return a_end == a; 
        
           }

My idea could be (relatively) quickly benchmarked by modifying this function. I haven't written C++ in a while, but I think this would be a correct "first test the last character, then test the remaining string characters" variation:

sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t const a_end = a + length - 1;
    if (a <= a_end && *a_end != b[length - 1]) return false;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}

Or a "compare back-to-front" variation:

sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t a_end = a + length - 1;
    sz_string_start_t b_end = b + length - 1;
    if (a <= a_end && *a_end != b[length - 1]) return false;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}

I am aware that simple forward linear memory access is faster than random or backward memory access, so I would not be surprised if both of these functions are slower in practice, regardless of what statistics says. But we won't know until we try, right?

For completeness sake here is my JS data structure, although I don't think it is particularly relevant as it solves a slightly different (but adjacent) problem than StringZilla

https://observablehq.com/@jobleonard/a-data-structure-for-table-row-filtering

The use-case is real-time filtering a data table based on user-input, leaving only rows that contain a substring match in any data cell with the input string. So it's more optimized for "latency". It pre-processes the indices of all trigrams of the strings in the table, and uses these as needles. For an input string it filters on the first trigram, then filters backwards from the last trigram in the input substring. Because trigrams are looked up from a hasmap the forward or backward access pattern concern does not apply here. It also uses previous results to narrow down the results faster (e.g. if a user typed the, then types a y to produce they, the filter only searches the existing results of the instead of starting over)

The text was updated successfully, but these errors were encountered:

ashvardanian · 2024-01-29T07:09:44Z

Hi @JobLeonard! Sorry for a late response, didn’t see the issue.

That’s a good suggestion for the serial version! I haven’t spent much time optimizing it.

On most modern CPUs the forward and backward passes over memory are equally fast, AFAIK. It might be a good idea to also add a version that uses 64 bit integers, if misaligned reads are allowed. Would you be interested in trying a couple of such approaches and submitting a PR?

In case you do, the CONTRIBUTING.md file references several datasets for benchmarks 🤗

JobLeonard · 2024-01-29T13:54:11Z

I'll take a shot at it, with the following caveats:

I only have one laptop to benchmark it with (a six years old Lenovo P51 with an Intel® Core™ i7-7820HQ Processor running KDE Neon (Ubuntu LTS 22.04 based))
don't have too much spare time so the GCC 12 I already have installed will have to do.
"read string as 64 bit integers" sounds like it's great when everything aligns neatly, but wouldn't that require a lot of special casing? Checks for whether sz_size_t length is less than eight characters, and for whether sz_string_start_t a and/or sz_string_start_t b start misaligned, or end misaligned. Start or end within the same 8-byte word or not. That's a lot of variations to consider (unless there's an obvious bitmasking + aligned reads trick that I'm too tired to work out in my head right now). So I think I'll skip that for now.

So my benchmark nrs will be limited to a few simple variations of sz_equal on x86-64, with AVX2 thrown in as a point of comparison too, and therefore only useful as a first sanity check for whether this idea is worth investigating further. Is that ok?

ashvardanian · 2024-01-29T15:55:06Z

Sure @JobLeonard, you can start and I can join a bit later.

For benchmarking I often take cloud instances (c7g and r7iz) to gain access to more hardware. Might be useful to you too 🤗

One thing to watch out for - on very short strings (well under 64 bytes) we are optimizing the for-loop. On longer strings, if we take the first and the last character - we end up fetching 2 cache lines from each string, instead of just one.

ashvardanian · 2024-02-27T22:32:11Z

Hi @JobLeonard,

I have a big update! I've generalized the substring search methods to be able to match different characters within the needle. The method that infers the best targets is called _sz_locate_needle_anomalies. For long needles and small alphabets, updating it may have a noticeable impact. Here is the code:

StringZilla/include/stringzilla/stringzilla.h

Lines 1586 to 1657 in cbdae26

    
           /** 
        
            *  @brief  Chooses the offsets of the most interesting characters in a search needle. 
        
            * 
        
            *  Search throughput can significantly deteriorate if we are matching the wrong characters. 
        
            *  Say the needle is "aXaYa", and we are comparing the first, second, and last character. 
        
            *  If we use SIMD and compare many offsets at a time, comparing against "a" in every register is a waste. 
        
            * 
        
            *  Similarly, dealing with UTF8 inputs, we know that the lower bits of each character code carry more information. 
        
            *  Cyrillic alphabet, for example, falls into [0x0410, 0x042F] code range for uppercase [А, Я], and 
        
            *  into [0x0430, 0x044F] for lowercase [а, я]. Scanning through a text written in Russian, half of the 
        
            *  bytes will carry absolutely no value and will be equal to 0x04. 
        
            */ 
        
           SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, // 
        
                                                        sz_size_t *first, sz_size_t *second, sz_size_t *third) { 
        
               *first = 0; 
        
               *second = length / 2; 
        
               *third = length - 1; 
        
               // 
        
               int has_duplicates =                   // 
        
                   start[*first] == start[*second] || // 
        
                   start[*first] == start[*third] ||  // 
        
                   start[*second] == start[*third]; 
        
               // Loop through letters to find non-colliding variants. 
        
               if (length > 3 && has_duplicates) { 
        
                   // Pivot the middle point right, until we find a character different from the first one. 
        
                   for (; start[*second] == start[*first] && *second + 1 < *third; ++(*second)) {} 
        
                   // Pivot the third (last) point left, until we find a different character. 
        
                   for (; (start[*third] == start[*second] || start[*third] == start[*first]) && *third > (*second + 1); 
        
                        --(*third)) {} 
        
               } 
        
               // TODO: Investigate alternative strategies for long needles. 
        
               // On very long needles we have the luxury to choose! 
        
               // Often dealing with UTF8, we will likely benfit from shifting the first and second characters 
        
               // further to the right, to achieve not only uniqness within the needle, but also avoid common 
        
               // rune prefixes of 2-, 3-, and 4-byte codes. 
        
               if (length > 8) { 
        
                   // Pivot the first and second points right, until we find a character, that: 
        
                   // > is different from others. 
        
                   // > doesn't start with 0b'110x'xxxx - only 5 bits of relevant info. 
        
                   // > doesn't start with 0b'1110'xxxx - only 4 bits of relevant info. 
        
                   // > doesn't start with 0b'1111'0xxx - only 3 bits of relevant info. 
        
                   // 
        
                   // So we are practically searching for byte values that start with 0b0xxx'xxxx or 0b'10xx'xxxx. 
        
                   // Meaning they fall in the range [0, 127] and [128, 191], in other words any unsigned int up to 191. 
        
                   sz_u8_t const *start_u8 = (sz_u8_t const *)start; 
        
                   sz_size_t vibrant_first = *first, vibrant_second = *second, vibrant_third = *third; 
        
                   // Let's begin with the seccond character, as the termination criterea there is more obvious 
        
                   // and we may end up with more variants to check for the first candidate. 
        
                   for (; (start_u8[vibrant_second] > 191 || start_u8[vibrant_second] == start_u8[vibrant_third]) && 
        
                          (vibrant_second + 1 < vibrant_third); 
        
                        ++vibrant_second) {} 
        
                   // Now check if we've indeed found a good candidate or should revert the `vibrant_second` to `second`. 
        
                   if (start_u8[vibrant_second] < 191) { *second = vibrant_second; } 
        
                   else { vibrant_second = *second; } 
        
                   // Now check the first character. 
        
                   for (; (start_u8[vibrant_first] > 191 || start_u8[vibrant_first] == start_u8[vibrant_second] || 
        
                           start_u8[vibrant_first] == start_u8[vibrant_third]) && 
        
                          (vibrant_first + 1 < vibrant_second); 
        
                        ++vibrant_first) {} 
        
                   // Now check if we've indeed found a good candidate or should revert the `vibrant_first` to `first`. 
        
                   // We don't need to shift the third one when dealing with texts as the last byte of the text is 
        
                   // also the last byte of a rune and contains the most information. 
        
                   if (start_u8[vibrant_first] < 191) { *first = vibrant_first; } 
        
               } 
        
           }

Not sure about what would be the best dataset for such benchmarks, seems like this is related to #91.

JobLeonard · 2024-02-28T15:21:33Z

Thanks for the heads-up, looks like a worthwhile change for the examples given in the doc comment.

I was about to write that I'm still interested in giving this a go but have been very busy at work in the last month. Not entirely sure when I manage to free up some time again but just wanted to re-assure you that I haven't forgotten about it!

ashvardanian self-assigned this Jan 29, 2024

ashvardanian changed the title ~~Have you tried the "inverse" heuristic? ("the last and first byte of a substring are the least correlated")~~ Better Heuristics for Substring Search Feb 27, 2024

ashvardanian added performance Performance related discussion or suggestion core Work on the algorithm design and implementation labels Feb 27, 2024

vasily-v-ryabov mentioned this issue May 30, 2024

Bug: sz_find incorrectly finds the substring with length=5 #154

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better Heuristics for Substring Search #72

Better Heuristics for Substring Search #72

JobLeonard commented Jan 24, 2024 •

edited

Loading

ashvardanian commented Jan 29, 2024 •

edited

Loading

JobLeonard commented Jan 29, 2024

ashvardanian commented Jan 29, 2024

ashvardanian commented Feb 27, 2024

JobLeonard commented Feb 28, 2024

Better Heuristics for Substring Search #72

Better Heuristics for Substring Search #72

Comments

JobLeonard commented Jan 24, 2024 • edited Loading

ashvardanian commented Jan 29, 2024 • edited Loading

JobLeonard commented Jan 29, 2024

ashvardanian commented Jan 29, 2024

ashvardanian commented Feb 27, 2024

JobLeonard commented Feb 28, 2024

JobLeonard commented Jan 24, 2024 •

edited

Loading

ashvardanian commented Jan 29, 2024 •

edited

Loading