FastSS is an efficient indexing data structure for string similarity search, invented by researchers at Zurich University in 2007.
TinyFastSS is a simple Python implementation of FastSS, written in less than 300 LoC.
- Create a FastSS index on disk.
- Perform very fast fuzzy searches using the index file.
- Python 2/3 compatible (tested with Python 2.7 / 3.4)
- No external modules required (only dependent on built-in modules)
You can install via pip:
$ pip install TinyFastSS
(1) Create an index file.
import fastss
with fastss.open('fastss.dat') as index:
for word in open('dictonary.txt'):
index.add(word.strip())
(2) Perform a fuzzy string search.
import fastss
with fastss.open('fastss.dat') as index:
# return a dict like: {0: ['test'], 1: ['text', 'west'], 2: ['taft']}
print(index.query('test'))
You can also use fastss.py from the command line. Here is a bare minimum example:
$ cat dictionary.txt | head -n 3
aardvark
abacus
aerial
$ python -m fastss -c index.dat dictionary.txt
$ python -m fastss -q index.dat adaptive
{"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}
Invoke "python -m fastss -h" for more details.
A simple speed testing was done to grasp the overall performance of TinyFastSS.
The machine used in this test had Intel Core i3-4010U (1.70GHz) processor with 4GB memory.
The dictionary used in this test was one derived from SCOWL v2015-08-24 (english-50), which contained 98,986 English words (909 KB in disk size).
1. Index creation performance
- Roughly it took 3 minutes to complete the index creation.
- The size of the resulting index file was 161 MB.
$ time python -m fastss -c fastss.dat dictonary.txt
3m0.71s real 2m44.35s user 0m16.43s system
$ stat --format=%s fastss.dat
168214528
2. Query performance
- With randomly chosen words, it took around 5ms to perform a single search on average.
- The actual time varied between 1.16ms (with "nirvana") and 11.7ms (with "burn").
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("sterner")'
100 loops, best of 3: 7.67 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("spotlighted")'
100 loops, best of 3: 2.43 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("burn")'
100 loops, best of 3: 11.7 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("nirvana")'
1000 loops, best of 3: 1.16 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("conveyor")'
100 loops, best of 3: 1.99 msec per loop
TinyFastSS uses built-in module dbm (anydbm) to store the index data.
A "index file" is basically a (disk-based) hash table which maps a key (read the original paper to know what a 'key' means) to a list of words.
As dbm can only store bytes, both keys and words are encoded in UTF-8, and each word-list is serialized into a byte string delimited by null bytes (b'\x00'). Here is an example of a (key, value) pair:
(b'almond', b'almond\x00almonds\x00almondy')