This Java library implements dictionaries that are stored in finite state automata. Dictomaton has the following features:
- Finite state dictionaries that implement the Java Set interface.
- Perfect hash dictionaries, which provide a unique hash code for each character sequence in the dictionary. They can be used in two directions: (1) obtaining the hash code for a character sequence and (2) obtaining the character sequence for a hash code.
- Levenshtein automata, which efficiently find all sequences in the dictionary that are within a given edit distance of a query sequence.
- String to primitive type mappings, where the keys are stored in a perfect hashing automaton and the values in an (unboxed) array.
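To make the perfect hashing and String-to-primitive features more concrete, here is a minimal sketch of the underlying idea. It does not use Dictomaton's API: `PerfectHashSketch` and its methods are hypothetical names, and it uses a plain, unminimized trie where the library uses a finite state automaton. Each state records how many words can still be accepted from it (the size of its right language), and a word's hash is its 1-based rank in sorted order, computed by summing the counts that are skipped along the path.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: an unminimized trie, not Dictomaton's automaton or API. */
final class PerfectHashSketch {
    private static final class Node {
        final TreeMap<Character, Node> next = new TreeMap<>(); // sorted transitions
        boolean isFinal; // true if a word ends in this state
        int words;       // number of words accepted from this state onwards
    }

    private final Node root = new Node();

    PerfectHashSketch(Iterable<String> words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.isFinal = true;
        }
        count(root);
    }

    // Fills in the right-language size of every state, bottom-up.
    private static int count(Node n) {
        int total = n.isFinal ? 1 : 0;
        for (Node child : n.next.values()) {
            total += count(child);
        }
        n.words = total;
        return total;
    }

    /** Returns the 1-based rank of the word in sorted order, or -1 if absent. */
    int hash(String word) {
        Node n = root;
        int rank = 0;
        for (char c : word.toCharArray()) {
            if (n.isFinal) {
                rank++; // a shorter word ending here sorts before this one
            }
            Node target = n.next.get(c);
            if (target == null) {
                return -1;
            }
            // Skip every word reachable through a smaller transition.
            for (Node sibling : n.next.headMap(c).values()) {
                rank += sibling.words;
            }
            n = target;
        }
        return n.isFinal ? rank + 1 : -1;
    }

    /** Inverse direction: the word with the given 1-based hash, or null. */
    String word(int hash) {
        if (hash < 1 || hash > root.words) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        Node n = root;
        int remaining = hash;
        while (true) {
            if (n.isFinal) {
                if (remaining == 1) {
                    return sb.toString();
                }
                remaining--;
            }
            for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                if (remaining <= e.getValue().words) {
                    sb.append(e.getKey());
                    n = e.getValue();
                    break; // descend into this transition
                }
                remaining -= e.getValue().words;
            }
        }
    }

    public static void main(String[] args) {
        PerfectHashSketch keys = new PerfectHashSketch(List.of("a", "ab", "b"));
        int[] values = new int[3];          // unboxed values, indexed by hash - 1
        values[keys.hash("ab") - 1] = 42;   // a String-to-int mapping without boxing
        System.out.println(keys.hash("ab")); // prints 2
        System.out.println(keys.word(3));    // prints b
    }
}
```

Because the hash codes are dense (1 to the number of keys), they can index directly into a primitive array, which is the idea behind keeping the values unboxed. The 1-based numbering is a choice made for this sketch and says nothing about the numbering the library itself uses.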
Dictomaton is in the Maven Central Repository:
<dependency>
  <groupId>eu.danieldk.dictomaton</groupId>
  <artifactId>dictomaton</artifactId>
  <version>1.1.1</version>
</dependency>
SBT:
libraryDependencies += "eu.danieldk.dictomaton" % "dictomaton" % "1.1.1"
Grails:
compile 'eu.danieldk.dictomaton:dictomaton:1.1.1'
The following table compares the object-graph size of this library's Dictionary type with that of TreeSet and HashSet. The figures were obtained by storing all the words in the web2 and web2a dictionaries and measuring the resulting object graphs with memory-measurer.
Data type | Objects | References | char | int | boolean | float |
---|---|---|---|---|---|---|
TreeSet | 936277 | 1872555 | 3193749 | 624184 | 312091 | 0 |
HashSet | 936277 | 1772657 | 3193749 | 936277 | 1 | 1 |
Dictionary | 41188 | 94546 | 424169 | 397033 | 1 | 1 |
Benchmarks are in a separate test group from the normal unit tests. To run them with Maven, select the Benchmarks group:
mvn test -Djunit.groups=eu.danieldk.dictomaton.categories.Benchmarks
- Added immutable mapping from String to a generic type.
- Added a key-ordered builder for immutable mappings. This builder is more efficient because it constructs the key automaton on the fly.
- Added Levenshtein automata for looking up sequences in a Dictionary that are within a certain edit distance of a sequence.
- Provide a variant of perfect hash automata that stores right language cardinalities in transitions rather than in states. This makes hashing and hash code lookups faster at the cost of some memory.
- Added String to String mapping (ImmutableStringStringMap).
- Generic object values.
- Fix an off-by-one error in integer width of the state table.
- Rename the project from fsadict-java to dictomaton.
- Store the state and transition tables as packed int arrays, resulting in drastically smaller automata.
- 1.0.0: first stable release.
- 1.1.0: generic object values.
Plans for 1.2.0: Perhaps an explicit, fast, and compact data storage format as an alternative to Java serialization. C or C++ version.