Score_Prediction_using_Pyspark

  • In this repo, a score prediction system is built using PySpark.

Content

Prerequisites

  • Ubuntu >= 18.04
  • Python > 3.8
  • Anaconda3

Installation

  • Set up and activate a conda environment with the needed libraries.
  • We will call our environment pyspark; you can use any desired name.
  • To do this, follow these steps:
conda create -n pyspark python=3.9 -y
conda activate pyspark
conda install openjdk pyspark scikit-learn scipy matplotlib seaborn -y
sudo apt-get update -y
sudo apt-get install sqlite3 -y

Structure

  • train.py: Loads a given SQL database, trains a logistic regression classifier on it, then evaluates the model. Its parameters are:
    sql_database_file: the path to the SQL database file.
    csv_file: the path to dump the CSV version of the database in.
    model_path: the directory to export the trained model to.

  • test.py: Loads a given SQL database and predicts the score of each review. Its parameters are:
    sql_database_file: the path to the SQL database file.
    csv_file: the path to dump the CSV version of the database in.
    model_path: the directory to load the trained model from.
    output_csv_file: the path where the output CSV file with the model predictions will be written.
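Parameter handling like the above is typically done with argparse. A hypothetical sketch of how train.py might parse its three parameters is shown below; the actual flag names in the repo's scripts may differ:

```python
import argparse

# Hypothetical argument parser mirroring the parameters described above;
# the real flag names in train.py may differ.
parser = argparse.ArgumentParser(description="Train a score-prediction model.")
parser.add_argument("--sql_database_file", help="path to the SQL database file")
parser.add_argument("--csv_file", help="path to dump the CSV version of the database in")
parser.add_argument("--model_path", help="directory to export the trained model to")

# Parse an example command line (in the script this would be parse_args()).
args = parser.parse_args([
    "--sql_database_file", "data/database.sqlite",
    "--csv_file", "data/reviews.csv",
    "--model_path", "models/lr",
])
```

test.py would add a fourth `--output_csv_file` argument in the same style.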

Preprocessing

  • The Amazon review SQL database file was used for training and evaluation.
  • Before training the model, any unwanted features were excluded.
  • Rows containing nulls were removed.
  • All columns were cast to either string or float types based on their content.
  • Rows with HelpfulnessNumerator less than or equal to zero were excluded.
  • The desired features were Summary and Text.
  • They were concatenated into one column, tokenized, cleaned of stop words, then vectorized.
  • All these steps, with their in-code documentation, can be found in train.py.
  • The dataset was split into a 75% training set to train the model and a 25% test set to evaluate it on.

Training

  • A logistic regression model was trained on the preprocessed training data.
  • This model predicts the review score from its input features.

Evaluation

  • Evaluated on the held-out test set, the model achieved the following results:
|              | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| Class 1      | 93%       | 93%    | 93%      | 7828    |
| Class 2      | 88%       | 87%    | 87%      | 3593    |
| Class 3      | 88%       | 87%    | 87%      | 4745    |
| Class 4      | 85%       | 88%    | 86%      | 8489    |
| Class 5      | 97%       | 96%    | 96%      | 40831   |
| Accuracy     | 94%       | 94%    | 94%      | 65486   |
| Micro-avg    | 94%       | 94%    | 94%      | 65486   |
| Macro-avg    | 90%       | 90%    | 90%      | 65486   |
| Weighted-avg | 94%       | 94%    | 94%      | 65486   |
  • Confusion matrix: (see the confusion-matrix image in the repository)
  • Due to the random split, the test set is imbalanced, so we mainly look at the macro-average F1-score, which is not affected by class imbalance.
  • As shown in the results table, the macro-average F1-score is 90%, which is still a good and acceptable result.
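The results table above has the shape of scikit-learn's `classification_report`, and the macro-average F1-score can be computed directly with `f1_score`. A small illustration with toy labels (not the repo's actual data):

```python
from sklearn.metrics import classification_report, f1_score

# Toy true/predicted labels; the printed report has the same layout
# as the results table above.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

print(classification_report(y_true, y_pred))

# Macro average: the per-class F1-scores averaged with equal weight,
# so minority classes count as much as the majority class.
macro_f1 = f1_score(y_true, y_pred, average="macro")
```

This equal weighting is exactly why the macro average is the right headline number for an imbalanced test set.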

Prediction

  • To predict new scores given a database file, use test.py; it reads the SQL database file and writes the predictions, with their row indices, to a CSV file.
  • Any new test set is cleaned with the same preprocessing, except that no rows are removed based on HelpfulnessNumerator.
  • Due to these preprocessing steps, some rows may be dropped; for example:
    • If we use our available training SQL database as a test set, 10 rows are dropped from the 568455 available rows.
  • Every prediction in the output file carries its row index; if an index is missing, that row was dropped because it contained a wrong datatype or nulls.
