Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trigram assisted wildcard query and general index efficiency #1794

Open
CMajeri opened this issue Feb 25, 2023 · 1 comment
Open

Trigram assisted wildcard query and general index efficiency #1794

CMajeri opened this issue Feb 25, 2023 · 1 comment

Comments

@CMajeri
Copy link

CMajeri commented Feb 25, 2023

Hello,
I'm new to bleve and trying to use it to perform simple substring search, i.e. the equivalent of the sql query LIKE '%<some_word>%'.
A typical way to achieve this is through the use of trigrams, where we match all entries that contain all trigrams, and then follow that up with a second filter operation.
I tried to replicate this in the beer-search context, and came up with this:

	q := query.NewMatchQuery("pale")
	q.SetField("name_tri")
	q.SetOperator(query.MatchQueryOperatorAnd)
	req := bleve.NewSearchRequest(query.NewConjunctionQuery([]query.Query{
		q,
		query.NewWildcardQuery("*pale*"),
	}))
	req.Fields = []string{"name"}
	req.SortBy([]string{"name"})
	req.Size = 1000
	res, err := beerIndex.Search(req)
	if err != nil {
		panic(err)
	}
	for _, r := range res.Hits {
		if !strings.Contains(strings.ToLower(r.Fields["name"].(string)), "pale") {
			fmt.Println(r.Fields["name"])
		}
	}

where "name_tri" is a text mapping, using a trigram as a token filter.
This works perfectly, and doesn't print any rows.
For comparison, without the wildcard query, this prints:

Cow Palace Scotch Ale
Cow Palace Scotch Ale 1998
Cow Palace Scotch Ale 2000
Cow Palace Scotch Ale 2001
Lone Palm Ale
Palm Speciale

which is expected.

However, I'm unfamiliar with how bleve performs its indexing, and was wondering how well it combines those filters.
Will the presence of the wildcard completely negate the benefits of filtering through trigrams?
In general, how does bleve combine filters, is there any documentation on the subject besides the code?
I'd also be particularly interested in knowing how sorting works, espectially in the context of paginating results (i.e. running the same query many times with different limits and different offsets).

Thanks.

@abhinavdangeti
Copy link
Member

@CMajeri here're a few details that should help answering your questions ..

  • Like with any other information retrieval library, you'd benefit by moving the bulk of work to index time over query time. This'd obviously come with the trade-off that your indexing time will likely increase and possibly it's footprint too.
  • A wildcard query works in 2 steps ..
    1. Determine the candidate terms - by running a regexp search against what is indexed against the field.
    2. A disjunction search will then follow for all the established candidate terms.
  • So a wildcard query could potentially run sub par owing to the number of candidates that become eligible, and the disjunction query itself.
  • Here's a recommendation for your situation - configure a "custom" analyzer where you ..
    • tokenize your english text say on unicode
    • apply the to_lower token filter and also ngram token filter with min max equal to say 3
    • then you'd be indexing the following tokens for your text ..
"Cow Palace Scotch Ale" => {"cow", "pal", "ala", "lac", "ace", "sco", "cot", "otc", "tch", "ale"}
  • Now, your search request would just be a term query, which would perform a lot faster than the wildcard query.
{"query": {"field": "name_tri", "term": "ale"}}
  • Here's code that'll build this analyzer for you ..
func buildIndexMapping() (*bleve.IndexMapping, error) {
	indexMapping := bleve.NewIndexMapping()

	var err error
	err = indexMapping.AddCustomTokenFilter("ngram_min_3_max_3",
		map[string]interface{}{
			"min": 3,
			"max": 3,
			"type": `ngram`,
		})
	if err != nil {
		return nil, err
	}

	err = indexMapping.AddCustomAnalyzer("custom_ngram",
		map[string]interface{}{
			"type": `custom`,
			"char_filters": []interface{}{},
			"tokenizer": `unicode`,
			"token_filters": []interface{}{
				`to_lower`,
				`ngram_min_3_max_3`,
			},
		})
	if err != nil {
		return nil, err
	}

	return indexMapping, nil
}
  • Here's a blog highlighting how text analysis works within bleve ..
    https://www.couchbase.com/blog/full-text_search_text_analysis/

  • Sorting and pagination is applied once all results are collected, over the heap.

    • You can sort over multiple keys, an example in the godocs here.
    • You can paginate using the size (limit) and from (offset) attributes in your search request. This will return predictive results only if the results are sorted.
    • The above settings may perform sub par with deeper pagination for which I'd recommend you look into the search_after and search_before capability we support.
  • Although bleve indexes are disk-bound, fetched indexed content is mmap-ed, which'll support faster access on subsequent queries to the same content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants