User dataset (Stack Overflow 2018 Developer Survey): https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey#survey_results_public.csv
Job-postings dataset (US technology jobs on Dice.com): https://www.kaggle.com/PromptCloudHQ/us-technology-jobs-on-dicecom
Feature Extraction and preprocessing
To run the files, download the Stack Overflow dataset from the link above and place it in the /data/user_preprocessing folder. Feature extraction and preprocessing of the user profiles are done by feature_extraction_user_a.ipynb and feature_extraction_user_b.ipynb. The extracted features are already included in the /data/user_preprocessing folder.
Collaborative filtering
To run the files, download the Stack Overflow dataset and the job-postings dataset from the links above and place them in the /data/collaborative filtering folder. Run collaborative filtering.ipynb to see the CF recommendations, which are built on top of the content-based recommendations.
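As a rough illustration of how CF can be layered on content-based scores, here is a minimal sketch. It is not the code from collaborative filtering.ipynb: the function names (`cosine`, `cf_scores`, `top_n`), the toy users, and the toy scores are all hypothetical, and it assumes non-negative content-based scores with user-user cosine similarity as the weighting.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no scores is similar to nobody
    return dot / (norm_u * norm_v)


def cf_scores(scores, target):
    """Predict job scores for `target` as a similarity-weighted
    average of the other users' content-based scores."""
    others = [u for u in scores if u != target]
    sims = {u: cosine(scores[target], scores[u]) for u in others}
    total = sum(sims.values())
    n_jobs = len(scores[target])
    if total == 0:
        return [0.0] * n_jobs
    return [
        sum(sims[u] * scores[u][j] for u in others) / total
        for j in range(n_jobs)
    ]


def top_n(predicted, job_ids, n=10):
    """Return the n job ids with the highest predicted score."""
    ranked = sorted(zip(job_ids, predicted), key=lambda p: p[1], reverse=True)
    return [job for job, _ in ranked[:n]]


# Hypothetical toy data: one row per user, one content-based score per job.
scores = {
    "user_a": [0.9, 0.1, 0.0, 0.4],
    "user_b": [0.8, 0.2, 0.1, 0.9],
    "user_c": [0.0, 0.9, 0.8, 0.1],
}
job_ids = ["j0", "j1", "j2", "j3"]

predicted = cf_scores(scores, "user_a")
print(top_n(predicted, job_ids, n=2))
```

In this toy case user_a is most similar to user_b, so the job that user_b scores highly (j3) rises in user_a's ranking even though user_a's own content-based score for it is moderate.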
Steps to run the content-based filtering model:
- The following modules need to be installed: spacy, nltk, sklearn, scipy
- Download the two datasets mentioned above and place them in the data folder with the following names: dice_com-job_us_sample.csv (Job Postings Dataset) and survey_results_public.csv (Stack Overflow Developer Survey 2018)
- The core code for content-based filtering is in Job Postings Preprocessing.ipynb. Recommendations can be obtained by running the second cell. The entire code is organized in a class called job_postings.
- The model depends on all files in the data folder. The CSV files in the data folder contain the final user and job profiles.
- The CSV files in ./data/job_profile and ./data/user_profile contain the independent job and user profiles.
- recommendations.csv contains the top 10 recommendations for a sample (the first 200 users) of the Stack Overflow dataset.
- Cells 3 onwards contain code snippets attempted during the preprocessing stages.
NOTE: To get your own recommendations, pass 1 as the third parameter; you will then be prompted to enter your details.
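Before opening the notebook, the setup steps above can be checked with a short script. This is only a sketch: the helper names `missing_modules` and `missing_files` are hypothetical, the file names assume the two CSVs are called dice_com-job_us_sample.csv and survey_results_public.csv, and the script assumes it is run from the repository root so that `data` resolves to the data folder. Note that the sklearn module is installed through the pip package scikit-learn.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names of the modules listed in the steps above.
REQUIRED_MODULES = ["spacy", "nltk", "sklearn", "scipy"]

# Assumed file names for the two downloaded datasets.
EXPECTED_FILES = ["dice_com-job_us_sample.csv", "survey_results_public.csv"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def missing_files(data_dir, filenames):
    """Return the file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in filenames if not (folder / name).is_file()]


if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
    print("Missing files:", missing_files("data", EXPECTED_FILES))
```

Two empty lists mean the modules and datasets are in place. The script does not check the other files in the data folder that the model depends on.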
About Inferences.ipynb
- This notebook contains the code used to make inferences about the dataset.