Given two documents, measure how similar they are. The problem does not involve extracting any semantic meaning from the documents; it only looks at whether they contain the same words.
In this project, I implemented the MinHash algorithm to calculate the Jaccard similarity of given documents and compared it with a brute-force approach. All the explanations and results are provided here.
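For reference, the brute-force baseline can be sketched as below. This is only a minimal illustration, not the project's actual code: the helper names `word_set` and `jaccard`, the regex-based tokenization, and the convention for two empty documents are all assumptions made for this example.

```python
import re


def word_set(text):
    """Return the set of distinct lowercase words in `text` (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(doc_a, doc_b):
    """Exact Jaccard similarity: |A intersect B| / |A union B| over the word sets."""
    a, b = word_set(doc_a), word_set(doc_b)
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


# 2 shared words ("the", "brown") out of 6 distinct words overall.
similarity = jaccard("the quick brown fox", "the lazy brown dog")
print(similarity)
```

Every comparison here has to build and intersect the full word sets, which is the cost MinHash tries to avoid.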
- Accuracy increases as sketch_size increases.
- The larger the sketch size, the more time is spent preprocessing the sketches.
- Once all the sketches are built and cached, MinHash was 99% faster than the brute-force approach in my experiments.
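
The trade-off in the points above can be seen in a small MinHash sketch like the following. It is a hedged illustration rather than the implementation used in this project: the function names, the default `sketch_size=128`, and the use of salted `hashlib.blake2b` as the family of hash functions are all assumptions.

```python
import hashlib
import re


def word_set(text):
    """Return the set of distinct lowercase words in `text` (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def seeded_hash(word, seed):
    """64-bit hash of `word`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        word.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(text, sketch_size=128):
    """Preprocessing step: keep the minimum hash value per hash function."""
    words = word_set(text)
    if not words:
        raise ValueError("cannot sketch an empty document")
    # Cost grows with sketch_size * number of distinct words.
    return [min(seeded_hash(w, seed) for w in words) for seed in range(sketch_size)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same size")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Sketches are built once and can be cached; each comparison then only
# touches sketch_size integers, regardless of document length.
sketch_a = minhash_sketch("the quick brown fox", sketch_size=512)
sketch_b = minhash_sketch("the lazy brown dog", sketch_size=512)
print(estimate_jaccard(sketch_a, sketch_b))
```

A larger `sketch_size` means more hash evaluations during preprocessing but a lower-variance estimate, since each sketch position matches with probability equal to the true Jaccard similarity.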
For more details, check out this!