GitHub - pallavi-garg/documentsimilarity: Document Similarity

Implement Jaccard Similarity using Minhash technique

Problem statement

Given two documents, find the similarity between them. The problem does not involve extracting any semantic meaning from the documents; it only checks whether they contain the same words.
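To make the problem concrete, here is a minimal sketch of the exact (brute-force) Jaccard similarity, which is the size of the intersection of the two word sets divided by the size of their union. It is written in Python purely for illustration; the helper names `tokenize` and `jaccard` and the tokenization rule (lowercase, alphanumeric runs) are assumptions and are not taken from this repository.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        # Convention chosen here: two empty documents count as identical.
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox"
doc2 = "the lazy brown dog"

# Shared words: {"the", "brown"}; union has 6 distinct words, so 2 / 6.
print(jaccard(tokenize(doc1), tokenize(doc2)))
```

Both sets have to be held in full and compared word by word, which is the cost MinHash is meant to avoid.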

Solution Approach

In this project, I implemented the MinHash algorithm to calculate the Jaccard similarity of the given documents and compared it with a brute-force approach. All the explanations and results are provided here.
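The idea behind MinHash can be sketched as follows: for each of `sketch_size` hash functions, keep only the minimum hash value over a document's words, then estimate the Jaccard similarity as the fraction of positions where two sketches agree. This is an illustrative Python version, not the repository's code; the function names and the use of seeded `hashlib.blake2b` as the hash family are assumptions.

```python
import hashlib


def word_hash(word: str, seed: int) -> int:
    """64-bit hash of a word; the seed selects one member of the hash family."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(words: set[str], sketch_size: int) -> list[int]:
    """Return one minimum hash value per hash function."""
    if not words:
        raise ValueError("cannot sketch an empty document")
    return [min(word_hash(w, seed) for w in words) for seed in range(sketch_size)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of sketch positions where the two minimum values match."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same size")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


words_a = set("the quick brown fox jumps over the lazy dog".split())
words_b = set("the quick brown cat sleeps near the lazy dog".split())

sketch_a = minhash_sketch(words_a, sketch_size=64)
sketch_b = minhash_sketch(words_b, sketch_size=64)

# The estimate approximates the exact Jaccard similarity of the two word sets.
print(estimate_jaccard(sketch_a, sketch_b))
```

For a single hash function, the probability that both documents share the same minimum equals their Jaccard similarity, which is why averaging over many hash functions gives an estimate of it.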

General Observation

As the sketch_size increases, the accuracy increases.
The larger the sketch size, the more time it takes to preprocess the sketches.
Once all the sketches are built and cached, MinHash is 99% faster than the brute-force approach, as per my experiments.
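The trade-off in these observations can be illustrated with a small cache: each document is sketched once (the preprocessing cost grows with the sketch size), and every later comparison only scans two fixed-length lists instead of the full word sets. The `SketchCache` class, its methods, and the sample documents below are hypothetical and written in Python for illustration only; they repeat the hashing helper so the block runs on its own. Under the usual idealization of independent hash functions, the standard error of the estimate is roughly sqrt(J(1 - J) / sketch_size), which is consistent with accuracy improving as the sketch grows.

```python
import hashlib
from itertools import combinations


def word_hash(word: str, seed: int) -> int:
    """64-bit hash of a word; the seed selects one member of the hash family."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class SketchCache:
    """Builds each document's MinHash sketch once and reuses it for comparisons."""

    def __init__(self, sketch_size: int) -> None:
        if sketch_size <= 0:
            raise ValueError("sketch_size must be positive")
        self.sketch_size = sketch_size
        self._sketches: dict[str, list[int]] = {}

    def add(self, name: str, text: str) -> None:
        """Preprocessing step: cost is proportional to words * sketch_size."""
        words = set(text.lower().split())
        if not words:
            raise ValueError("document has no words")
        self._sketches[name] = [
            min(word_hash(w, seed) for w in words)
            for seed in range(self.sketch_size)
        ]

    def similarity(self, a: str, b: str) -> float:
        """Comparison step: cost depends only on sketch_size, not document length."""
        s1, s2 = self._sketches[a], self._sketches[b]
        return sum(x == y for x, y in zip(s1, s2)) / self.sketch_size


# Hypothetical sample documents.
documents = {
    "doc1": "minhash estimates jaccard similarity between documents",
    "doc2": "minhash estimates similarity between large documents quickly",
    "doc3": "brute force compares every word in both documents",
}

cache = SketchCache(sketch_size=128)
for name, text in documents.items():
    cache.add(name, text)

# All pairwise comparisons reuse the cached sketches.
for a, b in combinations(documents, 2):
    print(a, b, round(cache.similarity(a, b), 3))
```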

For more details, check out this!
