GitHub - AakashTyagi11/Inverted-index-using-Apache-Spark: Implemented Apache Spark – mapper and reducer functions to process the text from 20,000 web pages.

The project exhibits the functionality of the page-rank algorithm on 20,000 HTML input files using Apache spark (Map-reduce) and by computing cosine similarity of each file.

inverted-index-pyspark-post.ipynb :

Performs mapper and reducer tasks for approx 20,000 HTML files (input files).
generates inv_idx.txt and mag_doc.txt files as the final output of this python file.
These text files hold the values required for computing cosine similarity of each file.

CSC_502_project_DB_+_Interface.ipynb :

Database: Contains code for generating tables out of text files generated in the above step (using sqlite3).

Interface : Contains HTML components like a text box to enter a search query, a search button, and a text area to display the results.

Onclick Functionality: a function that computes the cosine similarity on the click of a search button that takes the enter search query as input and returns the most relevant document(among the input HTML files) to the search query.

Code_execution

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
Code (report file)		Code (report file)
Data Files (input_output files)		Data Files (input_output files)
README.md		README.md
Search & result.JPG		Search & result.JPG
project_inv_idx.pdf		project_inv_idx.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

AakashTyagi11/Inverted-index-using-Apache-Spark

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages