This is the final project for the Advanced Information Retrieval Course at Sharif University of Science and Technology.
This project is for Retrieving information from public Github repositories based on their README.MD
and requirements.txt
files.
It consists of different phases:
- Public repositories in GitHub are crawled and preprocessed.(Code available in
Part1
directory) Then different representations are used including:
- Boolean
- TF-IDF
- A transformer-based module
- Fasttext
- A query is given to the project and most related repositories to that query are retrieved using the above representations.(Code available in
Part1
directory) - Clustering and classification methods are applied to the repositories based on their
requirements.txt
files. (Code available inPart2
directory) - Link analysis between repositories is applied. In this part, two famous algorithms,
Page Rank
andHITS
are used to rank outputs of the search based on their relation with the query and each other.