Predicting Grades using a Forest

The project uses a machine learning technique, a “Random” Forest, to predict grades in the Data Structures course using previous academic data. The following concepts are used in predicting the grades:

  1. Dataset preprocessing
  2. Model training (generating a single tree, and a forest of trees)
  3. Modal value finding
  4. Tree traversals to display the trees
  5. Model testing (predicting a single grade, and grades of several students in Data Structures)

1. Preprocessing of the dataset

Consider the following dataset:

| Sr.No | Semester | CourseCode | CourseTitle | CreditHours | Grade | GradePoint | SGPA | CGPA | Warning |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Fall 2016 | MT104 | Linear Algebra | 3 | B | 3 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | MT101 | Calculus-I | 3 | B+ | 3.33 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | CS101 | ITC | 3 | A | 4 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | CL101 | ITC Lab | 1 | A+ | 4 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | EE182 | Basic Elective | 3 | C- | 1.67 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | SL101 | English Language | 1 | A- | 3.67 | 3.27 | 3.27 | 0 |
| 1 | Fall 2016 | SS101 | English Language Lab | 3 | A+ | 4 | 3.27 | 3.27 | 0 |
| 1 | Fall 2017 | CS201 | Data Structures | 3 | A | 4 | 3.75 | 3.57 | 0 |
| 1 | Spring 2017 | EE227 | Digital Logic and Design | 3 | A+ | 4 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | SS122 | English Composition | 3 | A | 4 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | MT115 | Calculus-II | 3 | A- | 3.67 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | SS111 | Islamic and Religious Studies | 3 | B | 3 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | CS103 | Computer Programming | 3 | A | 4 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | EL227 | Digital Logic and Design Lab | 1 | B | 3 | 3.71 | 3.49 | 0 |
| 1 | Spring 2017 | CL103 | Computer Programming Lab | 1 | A+ | 4 | 3.71 | 3.49 | 0 |

1.1. Instructions for preprocessing

First, the data in the rawDataset.csv file is read and restructured by arranging each student's records in one line, as shown below.

| Sr.No | MT104 | MT119 | CS118 | CL118 | EE182 | SL101 | SS101 | EE227 | SS122 | MT224 | SS111 | CS217 | EL227 | CL217 | CGPA | Warning | CS201 (Data Structures) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 3.33 | 4 | 4 | 1.67 | 3.67 | 4 | 4 | 4 | 3.67 | 3 | 4 | 3 | 4 | 3.49 | 0 | 4 |
| 2 | 4 | 4 | 3.67 | 4 | 3.33 | 3 | 4 | 4 | 4 | 4 | 3.33 | 4 | 2.33 | 4 | 3.66 | 0 | 4 |
| 3 | 3.33 | 2.67 | 2.67 | 3.33 | 2.33 | 3.33 | 3 | 3 | 4 | 3 | 3.67 | 3.67 | 3 | 4 | 3.21 | 0 | 3.33 |

Each student record (feature vector) consists of the student's scores in all subjects of the first two semesters, the CGPA, and the Warning value, with CS201 (Data Structures) treated as the label or target class. The course codes are sorted alphabetically to match the trainDataset.csv file. From the entire dataset, only those students who have studied Data Structures are kept; all others are removed. After this preprocessing, the selected feature vectors are used to train/test the forest.

The final step of preprocessing is to split the dataset into trainDataset and testDataset. The first 40 vectors are used for training, and the remaining 20 vectors are used for testing.
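As a rough illustration, the split could look like the following C++ sketch (the StudentRecord type and function name are hypothetical, not taken from the project's code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: per-course values plus CGPA/Warning as features,
// with the CS201 (Data Structures) grade as the target.
struct StudentRecord {
    std::vector<std::string> features;
    std::string targetGrade;
};

// The first 40 vectors go to the training set, the rest to the test set.
void splitDataset(const std::vector<StudentRecord>& all,
                  std::vector<StudentRecord>& train,
                  std::vector<StudentRecord>& test) {
    for (std::size_t i = 0; i < all.size(); ++i)
        (i < 40 ? train : test).push_back(all[i]);
}
```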

1.2. Converting Scores/Grades into Labels

The grades of the students are converted into labels for use in the training and testing of the forest. The label for each grade is given in the following table:

| Grade | Label |
|---|---|
| A+, A, A- | Excellent |
| B+, B, B- | Good |
| C+, C, C- | Average |
| D+, D | Bad |
| FA, F | Fail |
| All others | Unknown |

Please note that the first letter of each label must be capitalized.
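A minimal sketch of this mapping (the function name is an assumption; the project's actual implementation may differ):

```cpp
#include <string>

// Map a letter grade to its label, per the table above.
std::string gradeToLabel(const std::string& grade) {
    if (grade == "A+" || grade == "A" || grade == "A-") return "Excellent";
    if (grade == "B+" || grade == "B" || grade == "B-") return "Good";
    if (grade == "C+" || grade == "C" || grade == "C-") return "Average";
    if (grade == "D+" || grade == "D") return "Bad";
    if (grade == "FA" || grade == "F") return "Fail";
    return "Unknown"; // all other values
}
```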

2. Model Training

In this phase, the dataset is loaded into a linked list to create a forest of multiple decision-trees. The inputs to the forest are the following parameters:

  1. NumOfTrees: the number of trees required in the forest
  2. WindowSize: the number of columns (features) considered when splitting the root node of a decision-tree
  3. StepSize: decides the first feature of the window of features for the root node (used in combination with WindowSize)

To make each tree different from the others, the forest uses a special feature-selection technique (explained below). This technique differs from the well-known Random Forest algorithm, where the feature set used to create the root node of a decision-tree is selected randomly; in this simple technique, no randomness is involved.

2.1. Creating a Single Decision-Tree

Each decision-tree gives a set of rules to predict grades based on a single feature or a combination of features. At each node of the tree, the feature that best splits the dataset is chosen. The best split divides the dataset into multiple partitions (linked-lists), each hosting observations that are more similar to one another and different from the ones in the other partitions. Therefore, the importance of each feature is derived from how “pure” each partition is. The measure used to calculate the impurity of a dataset is Entropy (and the corresponding Information Gain).
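As a reference, here is a minimal C++ sketch of both measures over a list of grade labels (function names and container choices are assumptions, not the project's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Entropy of a set of grade labels: H = -sum(p_i * log2(p_i)).
double entropy(const std::vector<std::string>& labels) {
    if (labels.empty()) return 0.0;
    std::map<std::string, int> counts;
    for (const std::string& l : labels) ++counts[l];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information Gain of splitting the target column by one feature column:
// IG = H(target) - sum over feature values of (|subset| / |target|) * H(subset).
double informationGain(const std::vector<std::string>& feature,
                       const std::vector<std::string>& target) {
    std::map<std::string, std::vector<std::string>> partitions;
    for (std::size_t i = 0; i < target.size(); ++i)
        partitions[feature[i]].push_back(target[i]);
    double weighted = 0.0;
    for (const auto& kv : partitions)
        weighted += (static_cast<double>(kv.second.size()) / target.size())
                    * entropy(kv.second);
    return entropy(target) - weighted;
}
```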

Following is a step-by-step process to build a decision-tree:

Step 1: The forest passes WindowSize and FirstFeature to each decision-tree. For the first tree, FirstFeature is 0. For the subsequent trees, it is calculated by adding StepSize to the FirstFeature of the previous tree. Hence, the FirstFeature for the second tree is 1*StepSize (and for the ith tree, it is (i-1)*StepSize). The window of WindowSize features starting at FirstFeature is used in the root node of the tree to split the dataset.

Step 2: If WindowSize is 5, then the first 5 features (indexed 0 – 4) out of the 16 features are selected for the first tree. If StepSize is 2, then features 2 – 6 are used for the second tree, and so on.

Step 3: The entropy of the entire dataset is calculated using the CS201 column (the last column). If the entropy is non-zero, the dataset is partitioned so as to increase the purity of the partitioned subsets. To partition the dataset, the Entropy and Information Gain of each feature selected in Step 2 are calculated.

Step 4: The feature with the highest Information Gain is used to split the root node and divide the dataset among its branches by grade label (each node can have 1 – 6 children, one for each grade label). Once the branches are created, each partition of the dataset is passed to the respective branch, i.e. for the branch labelled Excellent, all records having Excellent as the value of the selected feature go to that branch.

Step 5: For all nodes below the root level, the complete feature list is used to compete on the Information Gain value. The feature with the highest Information Gain is used to split that node.

Step 6: A node whose entropy is zero becomes a leaf node. Such a node contains a subset of the dataset in which all records have the same grade for the Data Structures course. When asked to predict, this leaf node returns that grade.

Step 7: A branch with entropy greater than 0 is split further using Step 5. This process continues recursively on all non-leaf nodes until every branch terminates in a leaf node.
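For illustration, the partitioning of records into child branches by the chosen feature's label (Steps 4 – 7) might look like this sketch (Record and partitionByFeature are hypothetical names):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: grade labels per feature plus the CS201 target label.
struct Record {
    std::vector<std::string> features;
    std::string targetGrade;
};

// Group records by the label of the chosen feature; each group becomes the
// dataset of one child branch (Excellent, Good, Average, Bad, Fail, Unknown).
std::map<std::string, std::vector<Record>>
partitionByFeature(const std::vector<Record>& records, std::size_t featureIndex) {
    std::map<std::string, std::vector<Record>> children;
    for (const Record& r : records)
        children[r.features[featureIndex]].push_back(r);
    return children;
}
```

A child whose records all share the same CS201 label has zero entropy and becomes a leaf (Step 6); otherwise it is split again (Steps 5 and 7).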

2.2. Creating a Forest

Multiple decision-trees are generated, and each tree predicts the Data Structures grade for a given input. The forest collects the predictions from every decision-tree and returns the mode (majority vote) of the predictions as the final Data Structures grade. If two or more grades tie for the mode, all of the modal grades are output as predictions.

The same feature-selection process is repeated for each tree in an attempt to generate a unique tree. The above-mentioned technique does not guarantee that all trees are distinct; however, to limit duplicate trees to some extent, the window is shifted by the StepSize (2 in the example below) in a round-robin fashion.

Example: features (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10). If WindowSize = 5 and NumOfTrees = 5, then:

Subset1 for root-selection of tree1 = {s1,s2,s3,s4,s5}; the root node must be chosen from these 5 features according to the Gain value.

Subset2 for root-selection of tree2 = {s3,s4,s5,s6,s7}; the root node must be chosen from these 5 features according to the Gain value.

Subset3 for root-selection of tree3 = {s5,s6,s7,s8,s9}; the root node must be chosen from these 5 features according to the Gain value.

Subset4 for root-selection of tree4 = {s7,s8,s9,s10,s1}; the root node must be chosen from these 5 features according to the Gain value.

Subset5 for root-selection of tree5 = {s9,s10,s1,s2,s3}; the root node must be chosen from these 5 features according to the Gain value.
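A small sketch of this round-robin window selection, consistent with the subsets listed above (the function name is an assumption):

```cpp
#include <cstddef>
#include <vector>

// Feature indices considered at the root of tree `treeIndex` (0-based):
// start at treeIndex * stepSize and take windowSize indices, wrapping around.
std::vector<std::size_t> rootWindow(std::size_t treeIndex, std::size_t stepSize,
                                    std::size_t windowSize, std::size_t numFeatures) {
    std::vector<std::size_t> window;
    std::size_t first = (treeIndex * stepSize) % numFeatures;
    for (std::size_t k = 0; k < windowSize; ++k)
        window.push_back((first + k) % numFeatures);
    return window;
}
```

For the example above (10 features, WindowSize 5, StepSize 2), the fourth tree gets indices {6, 7, 8, 9, 0}, i.e. {s7, s8, s9, s10, s1}.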

3. Modal Value Finding

To predict a grade, the predictions from every decision-tree are collected, and the mode (majority vote) of the predictions is the final Data Structures grade. If there is more than one modal value, the prediction is the set of modal labels. If a tree does not find a branch for a given case, it returns the Unknown grade.
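A minimal sketch of the modal-value step (the function name is an assumption); it returns every label that ties for the highest count:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Collect each tree's prediction and return all labels that tie for the
// highest count (one label if the mode is unique, several if it is not).
std::vector<std::string> findModalGrades(const std::vector<std::string>& predictions) {
    std::map<std::string, int> counts;
    int best = 0;
    for (const std::string& p : predictions)
        best = std::max(best, ++counts[p]);
    std::vector<std::string> modes;
    for (const auto& kv : counts)
        if (kv.second == best) modes.push_back(kv.first);
    return modes;
}
```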

4. Tree Traversals

The following depth-first traversal methods are implemented in the project:

  1. In-order traversal
  2. Pre-order traversal
  3. Post-order traversal

When a node is visited, the feature number that split that node is printed.
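A sketch of the three traversals over a multi-child tree node, printing the splitting feature index at each visit (the Node layout and the in-order convention for nodes with more than two children are assumptions):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical node layout: the index of the feature that split this node,
// plus up to six children (one per grade label).
struct Node {
    int splitFeature = -1;
    std::vector<Node*> children;
};

// Pre-order: the node first, then its children left to right.
void preOrder(const Node* n) {
    if (n == nullptr) return;
    std::cout << n->splitFeature << ' ';
    for (const Node* c : n->children) preOrder(c);
}

// Post-order: the children first, then the node.
void postOrder(const Node* n) {
    if (n == nullptr) return;
    for (const Node* c : n->children) postOrder(c);
    std::cout << n->splitFeature << ' ';
}

// In-order (one convention for multi-child nodes): first child, then the
// node itself, then the remaining children.
void inOrder(const Node* n) {
    if (n == nullptr) return;
    if (!n->children.empty()) inOrder(n->children.front());
    std::cout << n->splitFeature << ' ';
    for (std::size_t i = 1; i < n->children.size(); ++i) inOrder(n->children[i]);
}
```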

5. Model Testing (Prediction)

The performance of a machine learning model can't be evaluated on the trainDataset. Therefore, the testDataset is used to measure accuracy. The predicted class labels returned by all the trees are used to calculate the prediction accuracy.
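Assuming accuracy is the fraction of test records whose Data Structures grade is predicted correctly, the calculation could look like this sketch (names are hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Accuracy in percent: correctly predicted grades / total test records * 100.
double accuracy(const std::vector<std::string>& predicted,
                const std::vector<std::string>& actual) {
    if (actual.empty()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted.size() && i < actual.size(); ++i)
        if (predicted[i] == actual[i]) ++correct;
    return 100.0 * static_cast<double>(correct) / actual.size();
}
```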

Note: The code can be used to predict grades in other courses by changing the target class. The 6 labels can also be changed to correspond to different grades.
