Based on Rajaraman et al., "Mining of Massive Datasets", Section 3.4
A parallel implementation of locality-sensitive hashing (LSH) for clustering high-dimensional data.
- Documents or sets are represented by a MinHash signature.
- LSH is used to map similar signatures to the same bins with high probability.
- Items that map to the same bin are considered candidate pairs for clustering.
- A constraint function (currently Levenshtein distance) is applied to each candidate pair.
- Items that satisfy the constraint function are clustered via UnionFind (see the sketch below).
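The following is a minimal, self-contained sketch of this pipeline, not the library's actual API: the helper names (`shingles`, `minhash_signature`, `lsh_bins`, `cluster`), the parameters (`num_hashes`, `bands`, `rows`, `max_edit_distance`) and their values are illustrative assumptions.

```python
# Illustrative sketch only; names and parameter values are assumptions,
# not this library's API.
import random
from collections import defaultdict

import pyhash        # murmur3, as used for signatures and LSH
import Levenshtein   # constraint function on candidate pairs

murmur3 = pyhash.murmur3_32()
PRIME = (1 << 61) - 1  # large prime for the affine "permutations"

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def make_coefficients(num_hashes, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(text, coeffs):
    """MinHash signature: minimum of each affine transform of the murmur3 shingle hashes."""
    hashes = [murmur3(s.encode('utf-8')) for s in shingles(text)]
    return tuple(min((a * h + b) % PRIME for h in hashes) for a, b in coeffs)

def lsh_bins(signatures, bands=10, rows=10):
    """Band the signatures; items sharing an identical band become candidate pairs."""
    bins = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            bins[(band, sig[band * rows:(band + 1) * rows])].append(doc_id)
    return bins

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(docs, num_hashes=100, max_edit_distance=10):
    """docs: dict of doc_id -> text. Returns a list of clusters (sets of doc ids)."""
    coeffs = make_coefficients(num_hashes)
    signatures = {i: minhash_signature(d, coeffs) for i, d in docs.items()}
    uf, seen = UnionFind(), set()
    for bucket in lsh_bins(signatures).values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                pair = (bucket[i], bucket[j])
                if pair in seen:
                    continue
                seen.add(pair)
                # constraint check on the actual documents, not the signatures
                if Levenshtein.distance(docs[pair[0]], docs[pair[1]]) <= max_edit_distance:
                    uf.union(*pair)
    clusters = defaultdict(set)
    for doc_id in docs:
        clusters[uf.find(doc_id)].add(doc_id)
    return list(clusters.values())
```

With the defaults above (100 hashes in 10 bands of 10 rows), the banding scheme favours pairs whose Jaccard similarity is above roughly (1/b)^(1/r) ≈ 0.79, as discussed in Section 3.4 of the book.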
Signatures can be pre-computed (in parallel) and stored using the MinHasher; clusters should then be built from these MinHash signatures. Constraint checking currently uses the Levenshtein distance between the actual documents, which are stored in a LevelDB database and accessed via a JSON interface.
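As an illustration of that constraint step, a candidate pair might be checked roughly as follows; the key scheme (`doc-<id>`), the JSON field name (`text`) and the `max_distance` threshold are assumptions for this sketch, not the library's actual storage format.

```python
# Sketch of a LevelDB-backed constraint check; key scheme and JSON layout are assumed.
import json

import leveldb
import Levenshtein

db = leveldb.LevelDB('./documents.ldb')

def store_document(doc_id, text):
    db.Put(f'doc-{doc_id}'.encode(), json.dumps({'text': text}).encode())

def fetch_document(doc_id):
    raw = bytes(db.Get(f'doc-{doc_id}'.encode()))
    return json.loads(raw.decode('utf-8'))['text']

def satisfies_constraint(doc_id_a, doc_id_b, max_distance=10):
    """Accept a candidate pair only if the raw documents are within max_distance edits."""
    a, b = fetch_document(doc_id_a), fetch_document(doc_id_b)
    return Levenshtein.distance(a, b) <= max_distance
```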
Summary of changes:
- Updated to use murmur3 hashing (for signatures and LSH)
- Unicode support
- Decoupled clustering from signature creation so that signatures can be pre-computed in parallel (see the sketch after this list)
- Ability to dump/load signer state to disk
- Constraint function checking for candidate pairs
- Native parallel processing for constraint checks
- Methods to help serialize cluster state to disk
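As a rough illustration of pre-computing signatures in parallel and persisting them to disk, a generic approach could look like the sketch below; it reuses the illustrative `minhash_signature` and `make_coefficients` helpers from the earlier sketch and does not show the library's own dump/load methods.

```python
# Generic sketch: parallel signature pre-computation + pickle-based persistence.
# File path, process count and helper names are assumptions.
import pickle
from multiprocessing import Pool

def precompute_signatures(docs, num_hashes=100, processes=4, path='signatures.pkl'):
    """docs: dict of doc_id -> text. Computes signatures in parallel and saves them."""
    coeffs = make_coefficients(num_hashes)
    with Pool(processes) as pool:
        sigs = pool.starmap(minhash_signature,
                            [(text, coeffs) for text in docs.values()])
    signatures = dict(zip(docs.keys(), sigs))
    with open(path, 'wb') as fh:
        pickle.dump({'coeffs': coeffs, 'signatures': signatures}, fh)
    return signatures

def load_signatures(path='signatures.pkl'):
    """Reload previously computed coefficients and signatures."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)
```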
Requires the C/C++ based pyhash, python-Levenshtein, and leveldb libraries. These can be installed via pip:
pip install pyhash python-Levenshtein leveldb
TODO: Remove LevelDB dependency, improve generality of constraint checking, update tests.