Trigram assisted wildcard query and general index efficiency #1794

CMajeri · 2023-02-25T02:33:46Z

Hello,
I'm new to bleve and trying to use it to perform simple substring search, i.e. the equivalent of the sql query LIKE '%<some_word>%'.
A typical way to achieve this is through the use of trigrams, where we match all entries that contain all trigrams, and then follow that up with a second filter operation.
I tried to replicate this in the beer-search context, and came up with this:

	q := query.NewMatchQuery("pale")
	q.SetField("name_tri")
	q.SetOperator(query.MatchQueryOperatorAnd)
	req := bleve.NewSearchRequest(query.NewConjunctionQuery([]query.Query{
		q,
		query.NewWildcardQuery("*pale*"),
	}))
	req.Fields = []string{"name"}
	req.SortBy([]string{"name"})
	req.Size = 1000
	res, err := beerIndex.Search(req)
	if err != nil {
		panic(err)
	}
	for _, r := range res.Hits {
		if !strings.Contains(strings.ToLower(r.Fields["name"].(string)), "pale") {
			fmt.Println(r.Fields["name"])
		}
	}

where "name_tri" is a text mapping, using a trigram as a token filter.
This works perfectly, and doesn't print any rows.
For comparison, without the wildcard query, this prints:

Cow Palace Scotch Ale
Cow Palace Scotch Ale 1998
Cow Palace Scotch Ale 2000
Cow Palace Scotch Ale 2001
Lone Palm Ale
Palm Speciale

which is expected.

However, I'm unfamiliar with how bleve performs its indexing, and was wondering how well it combines those filters.
Will the presence of the wildcard completely negate the benefits of filtering through trigrams?
In general, how does bleve combine filters, is there any documentation on the subject besides the code?
I'd also be particularly interested in knowing how sorting works, espectially in the context of paginating results (i.e. running the same query many times with different limits and different offsets).

Thanks.

The text was updated successfully, but these errors were encountered:

abhinavdangeti · 2023-03-07T16:29:24Z

@CMajeri here're a few details that should help answering your questions ..

Like with any other information retrieval library, you'd benefit by moving the bulk of work to index time over query time. This'd obviously come with the trade-off that your indexing time will likely increase and possibly it's footprint too.
A wildcard query works in 2 steps ..
1. Determine the candidate terms - by running a regexp search against what is indexed against the field.
2. A disjunction search will then follow for all the established candidate terms.
So a wildcard query could potentially run sub par owing to the number of candidates that become eligible, and the disjunction query itself.
Here's a recommendation for your situation - configure a "custom" analyzer where you ..
- tokenize your english text say on unicode
- apply the to_lower token filter and also ngram token filter with min max equal to say 3
- then you'd be indexing the following tokens for your text ..

"Cow Palace Scotch Ale" => {"cow", "pal", "ala", "lac", "ace", "sco", "cot", "otc", "tch", "ale"}

Now, your search request would just be a term query, which would perform a lot faster than the wildcard query.

{"query": {"field": "name_tri", "term": "ale"}}

Here's code that'll build this analyzer for you ..

func buildIndexMapping() (*bleve.IndexMapping, error) {
	indexMapping := bleve.NewIndexMapping()

	var err error
	err = indexMapping.AddCustomTokenFilter("ngram_min_3_max_3",
		map[string]interface{}{
			"min": 3,
			"max": 3,
			"type": `ngram`,
		})
	if err != nil {
		return nil, err
	}

	err = indexMapping.AddCustomAnalyzer("custom_ngram",
		map[string]interface{}{
			"type": `custom`,
			"char_filters": []interface{}{},
			"tokenizer": `unicode`,
			"token_filters": []interface{}{
				`to_lower`,
				`ngram_min_3_max_3`,
			},
		})
	if err != nil {
		return nil, err
	}

	return indexMapping, nil
}

Here's a blog highlighting how text analysis works within bleve ..
https://www.couchbase.com/blog/full-text_search_text_analysis/
Sorting and pagination is applied once all results are collected, over the heap.
- You can sort over multiple keys, an example in the godocs here.
- You can paginate using the size (limit) and from (offset) attributes in your search request. This will return predictive results only if the results are sorted.
- The above settings may perform sub par with deeper pagination for which I'd recommend you look into the search_after and search_before capability we support.
Although bleve indexes are disk-bound, fetched indexed content is mmap-ed, which'll support faster access on subsequent queries to the same content.

Akilan1999 mentioned this issue May 23, 2023

Improved search uadmin uadmin/uadmin#139

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trigram assisted wildcard query and general index efficiency #1794

Trigram assisted wildcard query and general index efficiency #1794

CMajeri commented Feb 25, 2023

abhinavdangeti commented Mar 7, 2023

Trigram assisted wildcard query and general index efficiency #1794

Trigram assisted wildcard query and general index efficiency #1794

Comments

CMajeri commented Feb 25, 2023

abhinavdangeti commented Mar 7, 2023