
HashSplitter analysis plugin for ElasticSearch

The HashSplitter plugin is an N-gram tokenizer that generates non-overlapping, prefixed tokens.

In order to install the plugin, simply run: bin/plugin -install yakaz/elasticsearch-analysis-hashsplitter/0.2.0.

-------------------------------------------------
| HashSplitter Analysis Plugin | ElasticSearch  |
-------------------------------------------------
| master                       | 0.19 -> master |
-------------------------------------------------
| 0.2.0                        | 0.19 -> master |
-------------------------------------------------
| 0.1.0                        | 0.19 -> master |
-------------------------------------------------

It supports a wide variety of requests such as:

  • exact match
  • query by analyzed (prefixed) terms
  • wildcard query
  • range query
  • prefix query

Here's a concrete example of the analysis performed:

chunk_length: 4
prefixes: ABCDEFGH
input: d41d8cd98f00b204e9800998ecf8427e
output:
 - Ad41d
 - B8cd9
 - C8f00
 - Db204
 - Ee980
 - F0998
 - Gecf8
 - H427e

It is aimed at making hashes (or any fixed-length value that can be split into equally sized chunks) efficiently searchable by parts, without having a plain wildcard query enumerate tons of terms. It can also help reduce the index size.

However, depending on your configuration, if you do not need wildcard searches you may experience slightly decreased performance. See http://elasticsearch-users.115913.n3.nabble.com/Advices-indexing-MD5-or-same-kind-of-data-td2867646.html for more information.

Features

The plugin provides:

  • hashsplitter field type
  • hashsplitter analyzer
  • hashsplitter tokenizer
  • hashsplitter token filter
  • hashsplitter_term query/filter (same syntax as the regular term query/filter)
  • hashsplitter_wildcard query/filter (same syntax as the regular wildcard query/filter)

The plugin also provides correct support of the hashsplitter field type for the standard:

  • field query/filter (used by the term query/filter)
  • prefix query/filter
  • range query/filter

The plugin does not support:

  • fuzzy query/filter

The plugin cannot currently support (as of ElasticSearch 0.19.0):

  • term query/filter: This gets mapped to a field query by ElasticSearch. Use the hashsplitter_term query instead.

Note that a query_string query automatically uses the field, prefix, range and fuzzy capabilities of the hashsplitter field. But make sure you actually use the hashsplitter field type and direct the query to that field (and not the _all field, for example).
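
For instance, here is a minimal sketch of a query_string query directed at the hashsplitter field via the standard default_field option (your_hash_field refers to the field mapped with the hashsplitter type, as shown in the Configuration section below):

{
  "query_string" : {
    "default_field" : "your_hash_field",
    "query" : "d41d8cd98f00b204e9800998ecf8427e"
  }
}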

Configuration

It is recommended that you use the hashsplitter field type, as it easily enables custom querying. It is also the only way of using the field, prefix and range queries/filters. The alternative is to apply the hashsplitter analysis to the field and pay extra attention to the way you query it.

The hashsplitter field type

Here is a sample mapping (in config/mapping/your_index/your_mapping_type.json):

{
  "your_mapping_type" : {
    "properties" : {
      [...]
      "your_hash_field" : {
        "type" : "hashsplitter",
        "settings" : {
          "chunk_length" : 4,
          "prefixes" : "abcd",
          "size" : 16,
          "wildcard_one" : "?",
          "wildcard_any" : "*"
        }
      },
      [...]
    }
  }
}

This will define the your_hash_field field within the your_mapping_type as having the hashsplitter type. Notice the unusual settings section. It will be parsed by the plugin in order to configure the tokenization according to your needs.

Parameters:

  • chunk_length: The length of the chunks generated by the analysis. The input "0123456789" with a chunk_length of 2 will be split into [01, 23, 45, 67, 89]; with a chunk_length of 3 it will be split into [012, 345, 678, 9]. Note that the last chunk can be shorter than chunk_length characters.
  • prefixes: The positional prefixes to prepend to each chunk. Each individual character in the given string is used, in turn. The chunks [000, 111, 222, 333] with prefixes "abc" will generate the following terms: [a000, b111, c222, a333]. Note how it wraps around when there are not enough prefix characters available. You want to avoid this, as it makes a000 and a333 indistinguishable.
  • size: How long the input hashes are supposed to be, as an integer, or "variable". This does not prevent bad values from being analyzed at all. This information is solely used by the wildcard query/filter in order to expand *s properly.
  • wildcard_one: Which character to use as the single-character wildcard, given as a single-character string. This may help you if the default ? is a genuine input character. It is solely used in the wildcard query/filter.
  • wildcard_any: Which character to use as the any-string wildcard, given as a single-character string. This may help you if the default * is a genuine input character. It is solely used in the wildcard query/filter.
Default values:

All parameters are optional, and so is the settings section.

  • chunk_length: 1
  • prefixes: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,."
  • size: "variable"
  • wildcard_one: "?"
  • wildcard_any: "*"

The hashsplitter analyzer, tokenizer and token filter

These analysis components merely split the input into fixed-size chunks and prefix them. Each of them has two parameters that you will want to define in the configuration.

Here is a sample configuration (in config/elasticsearch.yml):

index:
  analysis:
    analyzer:
      your_hash_analyzer:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH
    tokenizer:
      your_hash_tokenizer:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH
    filter:
      your_hash_tokenfilter:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH

This will configure an analyzer, a tokenizer and a token filter (all of them separate). You can then create your own custom analyzer using the newly configured tokenizer and/or token filter, as sketched below; note that this custom analyzer will have type: custom.

Parameters:

  • chunk_length
  • prefixes

See hashsplitter field type parameters for more information.
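
For instance, here is a minimal sketch of two custom analyzers built on top of the components configured above (the analyzer names are hypothetical; the keyword tokenizer is a standard ElasticSearch tokenizer that emits the whole input as a single token):

index:
  analysis:
    analyzer:
      # Reuses the hashsplitter tokenizer defined above.
      your_custom_hash_analyzer:
        type: custom
        tokenizer: your_hash_tokenizer
      # Emits the whole input as one token, then chunks it with the token filter.
      your_other_hash_analyzer:
        type: custom
        tokenizer: keyword
        filter: [your_hash_tokenfilter]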

Usage: querying

Term query

{
  term : {
    your_hash_field: "d41d8cd98f00b204e9800998ecf8427e"
  }
}

Note: The length is not checked. However, if your field values are always of the same fixed length and your query value is of that same length too, then you're safe.

Understanding how this query works clarifies this warning a bit. The same analysis is performed when indexing the field and when processing this query. The searched value gets split into terms, which are merely AND-ed together. Hence, additional terms (from a longer field value) won't prevent the match. However, if the last term chunk is not of the correct size, no match will occur! (E.g. "d41d8" would generate the query +Ad41d +B8, and B8 will never match.)
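
For instance, with the analysis shown in the introduction (chunk_length 4, prefixes ABCDEFGH), the term query above gets expanded into the following AND-ed terms:

+Ad41d +B8cd9 +C8f00 +Db204 +Ee980 +F0998 +Gecf8 +H427e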

Positive side-effect: If the field length is not a multiple of the chunk length, then the match will only include same-length hashes, as a longer hash would have a longer (hence different, non-matching) last term.

Do not use the default term query to match single generated terms: contrary to what the documentation states, the provided term is analyzed, hence the provided value gets chunked, prefixed and AND-ed.

Chunk query

{
  hashsplitter_term : {
    your_hash_field: "H427e"
  }
}

This query allows you to match the generated terms exactly. No analysis is performed; a pure TermQuery is generated with the given field and term. This query is the only way to specify yourself the prefix along with the chunk value to be queried.

Note that in the default term query, contrary to what the documentation states, the provided term is analyzed, hence the provided value gets chunked and prefixed, and the pieces are AND-ed together.
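
For instance, here is a minimal sketch combining several chunk queries in a standard bool query, requiring two known chunks at known positions (assuming the analysis from the introduction, chunk_length 4 and prefixes ABCDEFGH):

{
  "bool" : {
    "must" : [
      { "hashsplitter_term" : { "your_hash_field" : "Ad41d" } },
      { "hashsplitter_term" : { "your_hash_field" : "H427e" } }
    ]
  }
}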

Prefix query

{
  prefix : {
    your_hash_field: "d41d8"
  }
}

Assuming chunk_length = 4, this will generate the query +Ad41d +PREFIX, where PREFIX is the prefix query B8*, filtered to only return terms whose size is between 2 and 5 (or equal to the remaining size, if size is fixed).

Range query

{
  range : {
    your_hash_field : {
      from: "d4000000000000000000000000000000",
      include_lower: true,
      to:   "d4200000000000000000000000000000",
      include_upper: false
    }
  }
}

The generated range queries will be optimized to only query the terms at the required level, like Lucene's NumericRangeQuery does. (With the difference that in Lucene the whole term up to the cut level is included, whereas we only include a middle chunk without the previous ones.)

The lexicographical ordering of terms will be used. The prefixes used won't have any influence, but the length of the terms will. For instance, the range [d400 TO d42] (both inclusive) will match d400 0000 0000 ... but not d420 0000 0000 ... (spaces added to visualise the generated chunks), because d42 sorts before d420, hence the latter is not included within the range.

Wildcard query

{
  hashsplitter_wildcard : {
    your_hash_field : "d41?8*27e"
  }
}

Note: The ? and * wildcards must match the ones configured in the field type mapping (these are the default values).

The * wildcard is restricted to one use per pattern, and text may appear after it if and only if the field type mapping uses a fixed size. Using * at the end is always possible and equates to a prefix query, possibly combined with ? wildcards.

This restriction arises from the fact that prefixes are used to “locate” chunks, hence every character in the pattern must be located precisely. Using more than one * makes it impossible to perform this localisation deterministically. A simple fallback is used nonetheless: the particular case where every * matches a zero-length string. But this is likely to be of no help...

Sophisticated queries

As long as you query against your_hash_field (the field of type hashsplitter), the generated queries should behave like the ones described above. Sophisticated queries often create several of the above queries, as their lexical analysis expresses combinations of multiple queries (eg. "+ANDed_token ORed_token -NOT_token [from_token TO to_token]").
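
For instance, here is a minimal sketch of a query_string query combining a prefix query and a range query over the hashsplitter field, using standard Lucene query syntax (the values are only illustrative):

{
  "query_string" : {
    "default_field" : "your_hash_field",
    "query" : "d41d8* OR [d4000000000000000000000000000000 TO d4200000000000000000000000000000]"
  }
}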

Note that the default wildcard query won't work as intended. Do not use it, not even indirectly through sophisticated queries via analyze_wildcard.
