Skip to content

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction

License

Notifications You must be signed in to change notification settings

proycon/analiticcl

Repository files navigation

Crate Docs GitHub build GitHub release Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Analiticcl

Introduction

Analiticcl is an approximate string matching or fuzzy-matching system that can be used for spelling correction or text normalisation (such as post-OCR correction or post-HTR correction). Texts can be checked against a validated or corpus-derived lexicon (with or without frequency information) and spelling variants will be returned.

The distinguishing feature of the system is the usage of anagram hashing to drastically reduce the search space and make quick lookups possible even over larger edit distances. The underlying idea is largely derived from prior work TICCL (Reynaert 2010; Reynaert 2004), which was implemented in ticcltools. This analiticcl implementation attempts to re-implement the core of these ideas from scratch, but also introduces some novelties, such as the introduction of prime factors for improved anagram hashing. We aim at a high-performant implementation written in Rust.

Aside from reading this documentation, you can also view an in-depth presentation video that was presented at the KNAW Humanities Cluster in January 2022.

If you plan to use analiticcl from Python, then we recommend you to also follow this tutorial in the form of a jupyter notebook.

Demo

Analiticcl demo

Features

  • Quick retrieval of spelling variants given an input word due to smart anagram hashing lookup. This is the main feature that drastically reduces the search spaces.
  • Works against a lexicon, which can either be a validated lexicon (preferred), or a lexicon derived from a corpus.
  • Uses a user-provided alphabet file for anagram hashing, in which multiple characters may be mapped to a single alphabet entry if so desired (e.g. for casing or for more phonetic-like lookup behaviour like soundex)
  • Can take into account frequency information from the lexicon
  • Matching against final candidates using a variety of possible distance metrics. Scoring and ranking is implemented as a weighted linear combination including the following components:
    • Damerau-Levenshtein
    • Longest common substring
    • Longest common prefix/suffix
    • Casing difference (boolean) An exact match always has distance score 1.0.
  • Additionally, frequency information can be used to influence ranking.
  • A confusable list with known confusable patterns and weights can be provided. This is used to favour or penalize certain confusables in the ranking stage (this weight is applied to the whole score).
  • Rather than look up words in spelling-correction style, users may also output the entire hashed anagram index, or output a reverse index of all variants found the supplied input data for each item in the lexicon.
  • Also supports ingesting explicit variant lists/error lists.
  • Support for language models to consider context information.
  • Multi-threading support

The current implementation is still a work in progress and should be considered experimental. Especially the search mode and the use of language models is still being evaluated and improved.

Installation

You can build and install the latest stable analiticcl release using Rust's package manager:

cargo install analiticcl

or if you want the development version after cloning this repository:

cargo install --path .

No cargo/rust on your system yet? Do sudo apt install cargo on Debian/ubuntu based systems, brew install rust on mac, or use rustup.

Note that 32-bit architectures are not supported.

Usage

Analiticcl is typically used through its command line interface or through the Python binding. Full syntax help for the command line tool is always available through analiticcl --help.

Analiticcl can be run in several modes, each is invoked through a subcommand, each subcommand also takes its own --help parameter for detailed usage information.

  • Query mode - analiticcl query - Queries the model for variants for the provided input item (one per line)
  • Search mode - analiticcl search - Searches for variants in running text. This encompasses detection and correction whereas the above query mode only handles correction.
  • Learn mode - analiticcl learn - Learns variants from the input for each item in the lexicon and outputs a weighted variant list.
  • Index mode - analiticcl index - Computes and outputs the anagram index, takes no further input

In all modes, the performance of the system depends to a large depree on the quality of the lexicons, including the background lexicon, the importance of which can not be understated so we dedicate a special section to it later, and the chosen parameters.

Query Mode

The query mode takes one input item per line and outputs all variants and their scores found for the given input. Default output is TSV (tab separated fields) in which the first column contains the input and the variants and scores are tab delimited fields in the columns thereafter.

You need to pass at least an alphabet file and a lexicon file against which matches are made.

Example:

$ analiticcl query --interactive --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv
Initializing model...
Loading lexicons...
Building model...
Computing anagram values for all items in the lexicon...
 - Found 119773 instances
Adding all instances to the index...
 - Found 108802 anagrams
Creating sorted secondary index...
Sorting secondary index...
 ...
Querying the model...
(accepting standard input; enter input to match, one per line)

The program is now taking standard input, enter a word to query and press ENTER to get the variants and the scores. Specify the --interactive parameter otherwise output may not be returned immediately but will be buffered for parallellisation:

seperate
seperate        separate        0.734375                operate 0.6875          desperate       0.6875          temperate       0.6875          serrate 0.65625         separates       0.609375                separated       0.609375

Rather than running it interactively, you can use your shell's standard redirection facilities to provide input and output, multiple variants will be processed in parallel.

$ analiticcl query --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv < input.tsv >
output.tsv

The --lexicon argument can be specified multiple times for multiple lexicons. Lexicons may contain absolute frequency information, but frequencies between multiple lexicons must be balanced! In case you are using multiple lexicons, you can get analiticcl to output information on which lexicon a match was found in by setting. --output-lexmatch. The order of the lexicons (and variant lists) matters if there is associated frequency information. If an entry occurs in multiple lexicons, they will all be returned.

If you want JSON output rather than TSV, use the --json flag. The JSON output includes more details than the TSV output. Most notable, you will see the distance score (aka similarity score) and the frequency scores seperated, whereas the TSV mode only outputs the combined score.

$ analiticcl query --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv --output-lexmatch
--json < input.tsv > output.json
[
    { "input": "seperate", "variants": [
        { "text": "separate", "score": 0.734375, "dist_score": 0.734375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "desperate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "operate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "temperate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "serrate", "score": 0.65625, "dist_score": 0.65625, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "separated", "score": 0.609375, "dist_score": 0.609375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
        { "text": "separates", "score": 0.609375, "dist_score": 0.609375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] }
    ] }
]

Learn Mode

In learn mode, analiticcl takes a lexicon, and collects variants from the input for each item in the lexicon.

It takes input similar like search mode or query mode (add an extra --strict flag if your input is a list/lexicon rather than running text). Instead of outputting the results directly, it collects all varaints and associates them with the items in the lexicon, effectively updating the model with this information. The output this mode provides is the inverse of what search or query does; for each item in the lexicon, all variants that were found (and their scores are listed). This output constitutes a weighted variant list which can be loaded in again using --variants.

The learned variants are used as intermediate words to guide the system towards a desired solution. Assume for instance that our lexicon contains the word separate, and we found the variant seperate in the data during learning. This variant is now associated with the right reference, and on subsequent runs matches against seperate will count towards matches on separate. This mechanism allows the system to bridge larger edit distances even when it is contrained to smaller ones. For example: seperete will match against seperate but not separate when the edit/anagram distance is constrained to 1.

Learn mode may do multiple iterations over the same data (set --iterations). As iterations grow, larger edit distances can be covered, but this is also a source for extra noise so accuracy will go down too.

When using learn mode, make sure to choose tight constraints (e.g. --max-matches 1 and a high --score-threshold). Learning on a list/lexicon using --strict rather than on running text, generally leads to better results.

Search Mode

In query mode you provide an exact input string and ask Analiticcl to correct it as a single unit. Query mode effectively implements the correction part of a spelling-correction system, but does not really handle the detection aspect. This is where search mode comes in. In search mode you can provide running text as input and the system will automatically attempt to detect the parts of your input that can corrected, and give the suggestions for correction.

In the output, Analiticcl will return UTF-8 byte offsets for fragments in your data that it finds variants for. You can set --unicode-offsets if you want unicode codepoint offsets instead. Both types of offsets are zero-indexed and the end offset is always non-inclusive.

Your input does not have to be tokenised, because tokenisation errors in the input may in itself account for variation which the system will attempt to resolve. Search mode can look at n-grams to this end, which effectively makes Analiticcl context-aware. You can use the --max-ngram-order parameter to set the maximum n-gram order you want to consider. Any setting above 1 enables a language modelling component in Analiticcl, which requires a frequency list of n-grams as input (using --lm).

Index Mode

The index mode simply outputs the anagram index, it takes no further input.

$ analiticcl index --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv

It may be insightful to sort on the number of anagrams and show the top 20 , with a bit of awk scripting and some piping:

$ analiticcl index --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv | awk -F'\t' '{ print NF-1"\t"$0 }' | sort -rn | head -n 20
[...]
8       1227306 least   slate   Stael   stale   steal   tales   teals   Tesla
7       98028906        elan's  lane's  Lane's  lean's  Lean's  Lena's  Neal's
7       55133630        actors  castor  Castor  Castro  costar  Croats  scrota
7       485214  abets   baste   bates   Bates   beast   beats   betas
7       416874  bares   baser   bears   braes   saber   sabre   Sabre
7       411761163       luster  lustre  result  rustle  sutler  ulster  Ulster
7       409102  alts    last    lats    LSAT    salt    SALT    slat
7       3781815 notes   onset   Seton   steno   stone   Stone   tones
7       33080178        carets  caster  caters  crates  reacts  recast  traces
7       2951915777547   luster's        lustre's        result's        rustle's        sutler's        ulster's        Ulster's
7       286404699       merits  mister  Mister  miters  mitres  remits  timers
7       28542   east    East    eats    etas    sate    seat    teas
7       28365   ergo    goer    gore    Gore    ogre    Oreg    Roeg
7       27489162        capers  crapes  pacers  parsec  recaps  scrape  spacer
7       1741062 aster   rates   resat   stare   tares   tears   treas
7       17286   ales    Elsa    lase    leas    Lesa    sale    seal
7       1446798 pares   parse   pears   rapes   reaps   spare   spear
7       1403315 opts    post    Post    pots    spot    stop    tops
7       13674   elan    lane    Lane    lean    Lean    Lena    Neal
6       96935466        parses  passer  spares  sparse  spears  Spears

The large number is the anagram value of the anagram.

Background Lexicon

We can not understate the importance of the background lexicon to reduce false positives. Analiticcl will eagerly attempt to match your test input to whatever lexicons you provide. This demands a certain degree of completeness in your lexicons. If your lexicon contains a relatively rare word like "boulder" and not a more common word like "builder", then analiticcl will happily suggest all instantes of "builder" to be "boulder". The risk for this increases as the allowed edit distances increase.

Such background lexicons should also contain morphological variants and not just lemma. Ideally it is derived automatically from a fully spell-checked corpus.

Analiticcl will not work for you if you just feed it some small lexicons and no complete enough background lexicons, unless you are sure your test texts have a very constrained limited vocabulary.

Scores and ranking

In query mode, analiticcl will return a similarity/distance score between your input and any matching variants. This score is expressed on a scale of 1.0 (exact match) to 0.0. The score takes the length of the input into account, so a levenshtein difference of 2 on a word weighs less than a levenshtein distance of 2 on a shorter word. The distance score itself consists of multiple components, each with a configurable weight:

  • Damerau-Levenshtein
  • Longest common substring
  • Longest common prefix
  • Longest common suffix
  • Casing difference (boolean)

A frequency score on a scale of 1.0 (most frequent variant) to 0.0 is returned separately (not shown in TSV output). By default, the ranking of variants is based primarily on the distance score, the frequency score is only used as a secondary key in case there is a tie (multiple items with the same distance score).

If you do want frequency information to play a larger role in the ranking of variants, you can use the --freq-ranking parameter, the value of which is a weight to attribute to frequency ranking in relation to the distance component and should be in the range 0.0 to 1.0, where a smaller value around 0.25 is recommended. This is used to compute a ranking score as follows:

ranking_score = (distance_score + freq_weight * freq_score) / (1 + freq_weight)

This ranking score is subsequently used to rank the results. This may result in a variant with less similarity to the input being preferred over a variant with more similarity to the input, if that first variant is far more frequent.

Data Formats

All input for analiticcl must be UTF-8 encoded and use unix-style line endings, NFC unicode normalisation is strongly recommended.

Alphabet File

The alphabet file is a TSV file (tab separated fields) containing all characters of the alphabet. Each line describes a single alphabet 'character'. An alphabet file may for example start as follows:

a
b
c

Multiple values on a line may be tab separated and are used to denote equivalents. A single line representing a single character could for example look like:

a	A	á	à	ä	Á	À	Ä

This means that these are all encoded the same way and are considered identical for all anagram hashing and distance metrics. A common situation is that all numerals are encoded indiscriminately, which you can accomplish with an alphabet entry like:

0	1	2	3	4	5	6	7	8	9

It is recommended to order the lines in the alphabet file based on the frequency of the character, as this will lead to the most optimal performance (i.e. generally smaller anagram values), but this is not a hard requirement by any means.

Entries in the alphabet file are not constrained to a single character but may also correspond to multiple characters, for instance:

ae	æ

Encoding always proceeds according to a greedy matching algorithm in the exact order entries are defined in the alphabet file.

Lexicon File

The lexicon is a TSV file (tab separated fields) containing either validated or corpus-derived words or phrases, one lexicon entry per line. The first column typically (this is configurable) contains the word and the optional second column contains the absolute frequency count. If no frequency information is available, all items in the lexicon carry the exact same weight.

Multiple lexicons may be passed and analiticcl will remember which lexicon was matched against, so you could use this information for some simple tagging.

Variant List

A variant list explicitly relates spelling variants to preferred forms, and in doing so go a step further than a simple lexicon which only specifies the validated or corpus-derived form.

A variant list is directed and weighted, it specifies a normalised/preferred form first, and then specifies variants and variant scores. Take the following example (all fields are tab separated):

huis	huys	1.0	huijs	1.0

This states that the preferred word huis has two variants (historical spelling in this case), and both have a score (0-1) that expresses how likely the variant maps to the preferred word. When loaded into analiticcl with --variants, both the preferred form and the variants will be valid results in normalization (as if you loaded a lexicon with all three words in it). Any matches on the variants will automatically expand to also match on the preferred form.

What you might be more interested in, is a special flavour of the variant list called an error list, loaded into analiticcl using --errors. Consider the following example:

separate	seperate	1.0	seperete 1.0

This states that the preferred word seperate has two variants that are considered errors. In this case, analiticcl considers these variants transparent, it will still match against the variants but but they will never be returned as a solution; the preferred variant will be returned as a solution instead. This mechanism helps bridge larger edit distances. In the JSON output, the "via" property conveys that a transparent variant was used in matching.

A variant list may also contain an extra column of absolute frequencies, provided that it's consistently provided for all references and variants:

separate	531	seperate	1.0	4	seperete 1.0	1

Here the reference occurs 531 times, the first misspelling 4 times, and the last variant only 1 time.

Analiticcl can also output variant lists, given input lexicons and a text to train on, this occurs when you run it in learn mode.

Confusable List

The confusable list is a TSV file (tab separated fields) containing known confusable patterns and weights to assign to these patterns when they are found. The file contains one confusable pattern per line. The patterns are expressed in the edit script language of sesdiff. Consider the following example:

-[y]+[i]	1.1

This pattern expressed a deletion of the letter y followed by insertion of i, which comes down to substitution of y for i. Edits that match against this confusable pattern receive the weight 1.1, meaning such an edit is given preference over edits with other confusable patterns, which by definition have weight 1.0. Weights greater than 1.0 are being given preference in the score weighting, weights smaller than 1.0 imply a penalty. When multiple confusable patterns match, the products of their weights is taken. The final weight is applied to the whole candidate score, so weights should be values fairly close to 1.0 in order not to introduce too large bonuses/penalties.

The edit script language from sesdiff also allows for matching on immediate context, consider the following variant of the above which only matches the substituion when it comes after a c or a k:

=[c|k]-[y]+[i]	1.1

To force matches on the beginning or end, start or end the pattern with respectively a ^ or a $. A further description of the edit script language can be found in the sesdiff documentation.

Language Model

In order to consider context information, analiticcl can construct and apply a simple n-gram language model. The input for this language model is an n-gram frequency list, provided through the --lm parameter. It is used in analiticcl's search mode.

This should be a corpus-derived list of unigrams and bigrams, optionally also trigrams (and even all up to quintgrams if needed, higher-order ngrams are not supported though). This is a TSV file containing the the ngram in the first column (space character acts as token separator), and the absolute frequency count in the second column. It is also recommended it contains the special tokens <bos> (begin of sentence) and <eos> end of sentence. The items in this list are NOT used for variant matching, use --lexicon instead if you want to also match against these items. It is fine to have an entry in both the language model and lexicon, analiticcl will store it only once internally.

Context Rules

Another way to consider context information is through context rules. The context rules define certain patterns that are to be either favoured or penalized. The context rules are expressed in a tab separated file which can be passed to analiticcl using --contextrules. The first column contains a sequence separated by semicolons, and the second a score close to 1.0 (lower scores penalize the pattern, higher scores favour it):

hello ; world	1.1

This means that if the words "hello world" appear as a solution a text/sentence, its total context score will be boosted (proportial to the length of the match), effectively preferring this solution over others. This context score is an independent component in the final score function and its weight can be set using --contextrules-weight.

Note that the words also need to be in a lexicon you provide for a rule to work. You can express disjunctions using the pipe character (|), as follows:

hello|hi ; world|planet	1.1

This will match all four possible combinations. Rather than match the text, you can match specific lexicons you loaded using the @ prefix. This makes sense mainly if you use different lexicons and could be used as a form of elementary tagging:

@greetings.tsv ; world	1.1

Here too you can create disjunctions using the pipe character:

@greetings.tsv|@curses.tsv ; world	1.1

If you want to negate a match, just add ! as a prefix. This also works in combination with @, allowing you to match anything except the words from a particular the lexicon. If you want to negate an entire disjunction, use parenthesis like !(a|b|c|).

There are two standalone characters you may use in matching:

  • ? - Matches anything
  • ^ - Matches anything that does not match with any lexicon (i.e. out of vocabulary words)

Note that in all cases, you'll still need to explicitly load the lexicons (or variants lists) using --lexicon, --variants, etc...

The rules are applied in the exact order you specify them. Note that a certain words in a text may only match against one pattern (the first that is found). When defining context rules, you'll generally want to specify longer rules before shorter ones, as otherwise the longer rules might never be considered. For example, in the following example, the second pattern would never apply because the first one already matches:

hello	1.1
hello ; world	1.1

Entity Tagging

Analiticcl can be used as a simple entity tagger using its context rules. Make sure you understand the above section before you continue reading.

You may pass two additional tab-separated columns to the context rules file, the third column specifies a tag to assign to any matches, and an optional fourth column specifies an offset for tagging (more about this later). For example:

hello ; world	1.1	greeting

Any instances of "hello world" will be assigned the tag "greeting", more specifically "hello" will be assigned the tag "greeting" and gets sequence number 0, "world" gets the same tag and sequence number 1.

If you want to tag only a subset and leave certain left or right context untagged, then you can do so by specifying an offset (in matches aka words, not characters). Such an offset takes the form offset:length. For example:

hello ; world	1.1	greeting	1:1

In this case only the word "world" will get the tag greeting (and sequence number 0).

It is also possible to assign multiple (even overlapping) tags with a single context rule. Use a semicolon to separate multiple tags and multiplet tag offsets (must be equal amount). However, it is not possible to apply multiple context rules once one has matched:

@firstname.tsv ; @lastname.tsv	1.0	person;firstname;lastname 0:2;0:1;1:2

This mechanism can also be used to assign tags based on lexicons whilst allowing some form of lexicon weighting, even if no further context is included:

@greetings.tsv	1.0	greeting
in|to|from ; @city.tsv	1.1	location	1:1
@firstname.tsv ; @lastname.tsv	1.0	person

Theoretical Background

A naive approach to find variants would be to compute the edit distance between the input string and all n items in the lexicon. This, however, is prohibitively expensive (O(mn)) when m input items need to be compared. Anagram hashing (Reynaert 2010; Reynaert 2004) aims to drastically reduce the variant search space. For all items in the lexicon, an order-independent anagram value is computed over all characters that make up the item. All words with the same set of characters (allowing for duplicates) obtain an identical anagram value. This value is subsequently used as a hash in a hash table that maps each anagram value to all variant instances. This is effectively what is outputted when running analiticcl index.

Unlike earlier work, Analiticcl uses prime factors for computation of anagram values. Each character in the alphabet gets assigned a prime number (e.g. a=2, b=3, c=5, d=7, e=11) and the product of these forms the anagram value. This provides the following useful properties:

  • We can multiply any two anagram values to get an anagram that represents the union set of all characters in both (including duplicates): av(A) ∙ av(B) = av(AB)
  • If anavalue A can be divided by anavalue B (av(A) % av(B) = 0), then the set of characters represented by B is fully contained within A.
    • av(A) / av(B) = av(A-B) contains the set difference (aka relative complement). It consists of the set of all characters in A that are not in B.

The caveat of this approach is that it results in fairly large anagram values that quickly exceed a 64-bit register, the analiticcl implementation therefore uses a big-number implementation to deal with arbitrarily large integers.

The properties of the anagram values facilitate a much quicker lookup, when given an input word to seek variants for (e.g. using analiticcl query), we take the following steps:

  • we compute the anagram value for the input
  • we look up this anagram value in the index (if it exists) and gather the variant candidates associated with the anagram value
  • we compute all deletions within a certain distance (e.g. by removing any 2 characters). This is a division operation on the anagram values. The maximum distance is set using the -k parameter.
  • for all of the anagram values resulting from these deletions, we look which anagram values in our index match or contain (av(A) % av(B) = 0) the value under consideration. We again gather the candidates that result from all matches.
    • To facilitate this lookup, we make use of a secondary index, the secondary index is grouped by the number of characters. For each length it enumerates, in sorted order, all anagram values that exist for that particular length. This means we can apply a binary search to find the anagrams that we should check our anagram value against (i.e. to check whether it is a subset of the anagram), rather than needing to exhaustively try all anagram values in our index.
  • Via the anagram index, we have collected all possibly relevant variant instances, which is a considerably smaller than the entire set we'd get if we didn't have the anagram heuristic. Now the set is reduced we apply more conventional measures:
    • We compute several metrics between the input and the possible variants:
      • Damerau-Levenshtein
      • Longest common substring
      • Longest common prefix/suffix
      • Casing difference (binary, different case or not)
    • A score is computed that is an expression of a weighted linear combination of the above items (the actual weights are configurable). An exact match always has score 1.0.
    • A cut-off value prunes the list of candidates that score too low (the parameter -n expresses how many variants we want)
    • Optionally, if a confusable list was provided, we compute the edit script between the input and each variant, and rescore when there are known confusables that are either favoured or penalized.

Licence

Analiticcl is open-source software licenced under the GNU Public Licence v3.

References

Acknowledgement

Analiticcl was developed at the KNAW Humanities Cluster. Its development was funded in the scope of the Golden Agents project.