Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric Tree for blocking #210

Closed
fgregg opened this issue Mar 6, 2014 · 10 comments
Closed

Metric Tree for blocking #210

fgregg opened this issue Mar 6, 2014 · 10 comments

Comments

@fgregg
Copy link
Contributor

fgregg commented Mar 6, 2014

@cjdd3b, interested in collaborating on using metric trees for blocking in dedupe?

@cjdd3b
Copy link

cjdd3b commented Mar 10, 2014

Hey! Sorry, got super behind on e-mail these last couple weeks. Sure, I'm
in! I'm definitely no expert in metric trees, but I'm excited to learn
more. What's the next step?

On Wed, Mar 5, 2014 at 9:23 PM, Forest Gregg [email protected]:

@cjdd3b https://github.com/cjdd3b, interested in collaborating on using
metric trees for blocking in dedupe?

Reply to this email directly or view it on GitHubhttps://github.com//issues/210
.

@fgregg
Copy link
Contributor Author

fgregg commented Mar 10, 2014

The basic idea is to make canopies http://en.wikipedia.org/wiki/Canopy_clustering_algorithm start by looking at this paper.

Then the next step is to write code that creates a metric tree object where this object has a 'within' method that returns the record ids of records that are within some distance from a target string.

> data = {1: "foo",
              2: "bar",
              3: "baz"}
> tree = MetricTree(data)
> record_ids = tree.within('baz', threshold = 2)
> print record_ids
[2,3]

@cjdd3b
Copy link

cjdd3b commented Mar 10, 2014

Cool. I'm off work this week, so I'll see if I can find some time to give
that a look. In the mean time, some of the best implemented work on metric
trees I've found is this totally obscure library by Jochen Spieker:
http://well-adjusted.de/~jrspieker/mspace/. Code is extremely well
documented too.

On Mon, Mar 10, 2014 at 5:18 PM, Forest Gregg [email protected]:

The basic idea is to make canopies
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm start by looking
at this paper.

Then the next step is to write code that creates a metric tree object
where this object has a 'within' method that returns the record ids of
records that are within some distance from a target string.

data = {1: "foo",
2: "bar",
3: "baz"}
tree = MetricTree(data)
record_ids = tree.within('baz', threshold = 2)
print record_ids
[2,3]

Reply to this email directly or view it on GitHubhttps://github.com//issues/210#issuecomment-37235591
.

@fgregg fgregg added this to the Refactor blocking.py milestone Apr 29, 2014
@fgregg fgregg removed this from the 0.5.3 Refactor blocking.py milestone May 11, 2014
@fgregg
Copy link
Contributor Author

fgregg commented Aug 27, 2014

Let's setup github repos for these.

Also relevant http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

@fgregg
Copy link
Contributor Author

fgregg commented Aug 29, 2014

moman repo here: https://github.com/datamade/moman

@fgregg
Copy link
Contributor Author

fgregg commented Aug 29, 2014

@cjdd3b, I finally got around to refactoring blocking. The next step for this would be to an Index class that had the same public methods as TfIdfIndex: https://github.com/datamade/dedupe/blob/master/dedupe/tfidf.py

  • index
  • search
  • unindex (this could wait)
  • canopy (this could wait)

@fgregg
Copy link
Contributor Author

fgregg commented Jan 30, 2015

Starting to rough it out here: https://github.com/datamade/moman/blob/master/example.py

CC @cjdd3b

@fgregg
Copy link
Contributor Author

fgregg commented Feb 10, 2015

Working on this here: #352

@fgregg
Copy link
Contributor Author

fgregg commented Feb 12, 2015

Got it working pretty well for small sized data, but fine-night just is too slow and takes too much memory for larger sets. This could be an implementation issue, but I don't know of another, better, implementation of Levenshtein distance trees. I'll keep an eye on universal-automata/liblevenshtein#9

@fgregg fgregg closed this as completed Feb 12, 2015
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants