Metric Tree for blocking #210
Hey! Sorry, got super behind on e-mail these last couple weeks. Sure, I'm On Wed, Mar 5, 2014 at 9:23 PM, Forest Gregg [email protected]:
The basic idea is to make canopies (http://en.wikipedia.org/wiki/Canopy_clustering_algorithm). Start by looking at this paper. The next step is to write code that creates a metric tree object with a 'within' method that returns the ids of the records that are within some distance of a target string.
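A minimal sketch of what such an object could look like, assuming a BK-tree (one common kind of metric tree) over Levenshtein distance. The thread does not say which tree variant was intended, so the class name `MetricTree`, the `add` method, and the `levenshtein` helper are hypothetical; only the `within` method comes from the description above.

```python
from typing import Callable, Dict, Hashable, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the usual two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


class _Node:
    """One distinct string, the records that carry it, and child subtrees."""

    def __init__(self, value: str) -> None:
        self.value = value
        self.record_ids: List[Hashable] = []
        self.children: Dict[int, "_Node"] = {}  # keyed by distance to value


class MetricTree:
    """Hypothetical BK-tree; assumes `distance` is a true integer metric."""

    def __init__(self, distance: Callable[[str, str], int] = levenshtein) -> None:
        self._distance = distance
        self._root: Optional[_Node] = None

    def add(self, record_id: Hashable, value: str) -> None:
        """Insert a record, walking down the edge that matches its distance."""
        if self._root is None:
            self._root = _Node(value)
            self._root.record_ids.append(record_id)
            return
        node = self._root
        while True:
            d = self._distance(value, node.value)
            if d == 0:  # same string already stored: just attach the id
                node.record_ids.append(record_id)
                return
            child = node.children.get(d)
            if child is None:
                child = _Node(value)
                child.record_ids.append(record_id)
                node.children[d] = child
                return
            node = child

    def within(self, target: str, threshold: int) -> List[Hashable]:
        """Return ids of records whose string is within `threshold` of `target`."""
        found: List[Hashable] = []
        stack = [self._root] if self._root is not None else []
        while stack:
            node = stack.pop()
            d = self._distance(target, node.value)
            if d <= threshold:
                found.extend(node.record_ids)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - threshold, d + threshold] can contain a match.
            for edge, child in node.children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return found
```

To use it, call `tree.add(record_id, value)` once per record, then `tree.within("smith", 1)` to get the ids of every record whose string is at most one edit away from "smith". The triangle-inequality pruning in `within` is what lets the tree skip most comparisons, which is the property that makes it a candidate for blocking.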
Cool. I'm off work this week, so I'll see if I can find some time to give On Mon, Mar 10, 2014 at 5:18 PM, Forest Gregg [email protected]:
Let's set up GitHub repos for these. Also relevant: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
moman repo here: https://github.com/datamade/moman
@cjdd3b, I finally got around to refactoring blocking. The next step for this would be to write an Index class that has the same public methods as TfIdfIndex: https://github.com/datamade/dedupe/blob/master/dedupe/tfidf.py
Starting to rough it out here: https://github.com/datamade/moman/blob/master/example.py CC @cjdd3b
Working on this here: #352
Got it working pretty well for small data sets, but fine-night is just too slow and takes too much memory for larger ones. This could be an implementation issue, but I don't know of another, better implementation of Levenshtein distance trees. I'll keep an eye on universal-automata/liblevenshtein#9
@cjdd3b, interested in collaborating on using metric trees for blocking in dedupe?