Based on Rajaraman et al., "Mining of Massive Datasets", Section 3.4
A parallel implementation of locality-sensitive hashing (LSH) for clustering high-dimensional data.
- Documents or sets are represented by a MinHash signature.
- LSH is used to map similar signatures to the same bins with high probability.
- Items that map to the same bin are considered candidate pairs for clustering.
- A constraint function (currently Levenshtein distance) is applied to each candidate pair.
- Items that satisfy the constraint function are clustered via UnionFind (see the sketch below).
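The following is a minimal, self-contained sketch of this pipeline, not the library's actual API: the helper names (`shingles`, `minhash_signature`, `lsh_bins`, `cluster`), the parameters (`num_hashes`, `bands`, `rows`, `max_edit_distance`) and their values are illustrative assumptions.

```python
# Illustrative sketch only; names and parameter values are assumptions,
# not this library's API.
import random
from collections import defaultdict

import pyhash        # murmur3, as used for signatures and LSH
import Levenshtein   # constraint function on candidate pairs

murmur3 = pyhash.murmur3_32()
PRIME = (1 << 61) - 1  # large prime for the affine "permutations"

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def make_coefficients(num_hashes, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(text, coeffs):
    """MinHash signature: minimum of each affine transform of the murmur3 shingle hashes."""
    hashes = [murmur3(s.encode('utf-8')) for s in shingles(text)]
    return tuple(min((a * h + b) % PRIME for h in hashes) for a, b in coeffs)

def lsh_bins(signatures, bands=10, rows=10):
    """Band the signatures; items sharing an identical band become candidate pairs."""
    bins = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            bins[(band, sig[band * rows:(band + 1) * rows])].append(doc_id)
    return bins

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(docs, num_hashes=100, max_edit_distance=10):
    """docs: dict of doc_id -> text. Returns a list of clusters (sets of doc ids)."""
    coeffs = make_coefficients(num_hashes)
    signatures = {i: minhash_signature(d, coeffs) for i, d in docs.items()}
    uf, seen = UnionFind(), set()
    for bucket in lsh_bins(signatures).values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                pair = (bucket[i], bucket[j])
                if pair in seen:
                    continue
                seen.add(pair)
                # constraint check on the actual documents, not the signatures
                if Levenshtein.distance(docs[pair[0]], docs[pair[1]]) <= max_edit_distance:
                    uf.union(*pair)
    clusters = defaultdict(set)
    for doc_id in docs:
        clusters[uf.find(doc_id)].add(doc_id)
    return list(clusters.values())
```

With the defaults above (100 hashes in 10 bands of 10 rows), the banding scheme favours pairs whose Jaccard similarity is above roughly (1/b)^(1/r) ≈ 0.79, as discussed in Section 3.4 of the book.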
Signatures can be pre-computed (in parallel) and stored using the MinHasher; clusters should then be built from these MinHash signatures. Constraint checking currently uses the Levenshtein distance between the actual documents, which are stored in a LevelDB database and accessed via a JSON interface.
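As an illustration of that constraint step, a candidate pair might be checked roughly as follows; the key scheme (`doc-<id>`), the JSON field name (`text`) and the `max_distance` threshold are assumptions for this sketch, not the library's actual storage format.

```python
# Sketch of a LevelDB-backed constraint check; key scheme and JSON layout are assumed.
import json

import leveldb
import Levenshtein

db = leveldb.LevelDB('./documents.ldb')

def store_document(doc_id, text):
    db.Put(f'doc-{doc_id}'.encode(), json.dumps({'text': text}).encode())

def fetch_document(doc_id):
    raw = bytes(db.Get(f'doc-{doc_id}'.encode()))
    return json.loads(raw.decode('utf-8'))['text']

def satisfies_constraint(doc_id_a, doc_id_b, max_distance=10):
    """Accept a candidate pair only if the raw documents are within max_distance edits."""
    a, b = fetch_document(doc_id_a), fetch_document(doc_id_b)
    return Levenshtein.distance(a, b) <= max_distance
```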
Summary of changes:
- Updated to use murmur3 hashing (for signatures and LSH)
- Unicode support
- Decoupled clustering from signature creation so that signatures can be pre-computed in parallel (see the sketch after this list)
- Ability to dump/load signer state to disk
- Constraint function checking for candidate pairs
- Native parallel processing for constraint checks
- Methods to help serialize cluster state to disk
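As a rough illustration of pre-computing signatures in parallel and persisting them to disk, a generic approach could look like the sketch below; it reuses the illustrative `minhash_signature` and `make_coefficients` helpers from the earlier sketch and does not show the library's own dump/load methods.

```python
# Generic sketch: parallel signature pre-computation + pickle-based persistence.
# File path, process count and helper names are assumptions.
import pickle
from multiprocessing import Pool

def precompute_signatures(docs, num_hashes=100, processes=4, path='signatures.pkl'):
    """docs: dict of doc_id -> text. Computes signatures in parallel and saves them."""
    coeffs = make_coefficients(num_hashes)
    with Pool(processes) as pool:
        sigs = pool.starmap(minhash_signature,
                            [(text, coeffs) for text in docs.values()])
    signatures = dict(zip(docs.keys(), sigs))
    with open(path, 'wb') as fh:
        pickle.dump({'coeffs': coeffs, 'signatures': signatures}, fh)
    return signatures

def load_signatures(path='signatures.pkl'):
    """Reload previously computed coefficients and signatures."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)
```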
Requires the C/C++ based pyhash, python-Levenshtein, and leveldb libraries. These can be installed via pip:
pip install pyhash python-Levenshtein leveldb
TODO: Remove LevelDB dependency, improve generality of constraint checking, update tests.