The project uses a machine learning technique, a “Random” Forest, to predict grades in the Data Structures course using previous academic data. The following concepts are used in predicting the grades:
- Dataset preprocessing
- Model training (generating a single tree, and a forest of trees)
- Modal value finding
- Tree traversals to display the trees
- Model testing (predicting a single grade, and grades of several students in Data Structures)
Consider the following dataset:
Sr.No | Semester | CourseCode | CourseTitle | CreditHours | Grade | GradePoint | SGPA | CGPA | Warning |
---|---|---|---|---|---|---|---|---|---|
1 | Fall 2016 | MT104 | Linear Algebra | 3 | B | 3 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | MT101 | Calculus-I | 3 | B+ | 3.33 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | CS101 | ITC | 3 | A | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | CL101 | ITC Lab | 1 | A+ | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | EE182 | Basic Elective | 3 | C- | 1.67 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | SL101 | English Language | 1 | A- | 3.67 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | SS101 | English Language Lab | 3 | A+ | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2017 | CS201 | Data Structures | 3 | A | 4 | 3.75 | 3.57 | 0 |
1 | Spring 2017 | EE227 | Digital Logic and Design | 3 | A+ | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | SS122 | English Composition | 3 | A | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | MT115 | Calculus-II | 3 | A- | 3.67 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | SS111 | Islamic and Religious Studies | 3 | B | 3 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | CS103 | Computer Programming | 3 | A | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | EL227 | Digital Logic and Design Lab | 1 | B | 3 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | CL103 | Computer Programming Lab | 1 | A+ | 4 | 3.71 | 3.49 | 0 |
First, the data in the rawDataset.csv file is read and restructured so that each student's records are arranged in one line, as shown below.
Sr.No | MT104 | MT119 | CS118 | CL118 | EE182 | SL101 | SS101 | EE227 | SS122 | MT224 | SS111 | CS217 | EL227 | CL217 | CGPA | Warning | CS201 (Data Structures) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 3.33 | 4 | 4 | 1.67 | 3.67 | 4 | 4 | 4 | 3.67 | 3 | 4 | 3 | 4 | 3.49 | 0 | 4 |
2 | 4 | 4 | 3.67 | 4 | 3.33 | 3 | 4 | 4 | 4 | 4 | 3.33 | 4 | 2.33 | 4 | 3.66 | 0 | 4 |
3 | 3.33 | 2.67 | 2.67 | 3.33 | 2.33 | 3.33 | 3 | 3 | 4 | 3 | 3.67 | 3.67 | 3 | 4 | 3.21 | 0 | 3.33 |
Each student record (feature vector) consists of the student's grade points for all subjects of the first two semesters, plus CGPA and Warning, while CS201 (Data Structures) is treated as the label or target class. The course codes are sorted alphabetically to match the trainDataset.csv file. From the entire dataset, only those students who have studied Data Structures are kept; all others are removed. After this preprocessing, the selected feature vectors are used to train and test the forest.
The final step of preprocessing is to split the dataset into trainDataset and testDataset. The first 40 vectors are used for training, and the remaining 20 vectors are used for testing.
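A minimal sketch of how a restructured record and the 40/20 split could be represented is given below; the StudentRecord struct, its field names, and the use of std::vector instead of the project's linked lists are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Illustrative layout of one restructured line of the dataset.
// Field names are assumptions; the project may store records differently.
struct StudentRecord {
    int srNo;                        // Sr.No
    std::vector<double> gradePoints; // grade points of the first two semesters' courses, sorted by course code
    double cgpa;                     // CGPA after the second semester
    int warning;                     // Warning count
    double cs201;                    // target: CS201 (Data Structures) grade point
};

// Split the preprocessed records: the first 40 go to trainDataset, the remaining 20 to testDataset.
void splitDataset(const std::vector<StudentRecord>& all,
                  std::vector<StudentRecord>& train,
                  std::vector<StudentRecord>& test) {
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (i < 40) train.push_back(all[i]);
        else        test.push_back(all[i]);
    }
}
```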
The students' grades are converted into labels for the training and testing of the forest. The label for each grade is given in the following table:
Grade | Label |
---|---|
A+, A, A- | Excellent |
B+, B, B- | Good |
C+, C, C- | Average |
D+, D | Bad |
FA, F | Fail |
All others | Unknown |
Please note that the first letter of each label must be capitalized.
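A small helper implementing this mapping could look like the sketch below; the function name is an assumption, but the mapping follows the table above exactly and keeps the first letter of every label capitalized.

```cpp
#include <string>

// Map a letter grade to its class label as defined in the table above.
std::string gradeToLabel(const std::string& grade) {
    if (grade == "A+" || grade == "A" || grade == "A-") return "Excellent";
    if (grade == "B+" || grade == "B" || grade == "B-") return "Good";
    if (grade == "C+" || grade == "C" || grade == "C-") return "Average";
    if (grade == "D+" || grade == "D")                  return "Bad";
    if (grade == "FA" || grade == "F")                  return "Fail";
    return "Unknown";                                   // all other grades
}
```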
In this phase, the dataset is loaded into a linked list to create a forest of multiple decision-trees. The inputs to the forest are the following parameters (a minimal struct grouping them is sketched after the list):
- NumOfTrees: the number of trees required in the forest
- WindowSize: the number of columns (features) used to split the root node of a decision-tree
- StepSize: decides the first feature of the root node's feature window (used in combination with the WindowSize parameter)
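The struct below is only an illustration of how these parameters could be grouped; the name and fields are assumptions, not part of the project specification.

```cpp
// Forest-level configuration; the fields mirror the parameters listed above.
struct ForestConfig {
    int numOfTrees;  // NumOfTrees: number of decision-trees in the forest
    int windowSize;  // WindowSize: number of features considered for each root split
    int stepSize;    // StepSize: shift of the feature window from one tree to the next
};
```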
To make each tree different from the others, a special feature-selection technique is used (explained below). This technique differs from the well-known Random Forest algorithm, in which the feature set for the root node of a decision-tree is selected randomly; in this simple technique, no randomness is involved.
Each decision-tree gives a set of rules to predict grades based on a single feature or a combination of features. At each node of the tree, a feature is chosen that best splits the dataset. The best split divides the dataset into multiple partitions (linked lists), each hosting observations that are more similar to one another and different from those in the other partitions. The importance of each feature is therefore derived from how pure each partition is. The measure used to calculate the impurity of a dataset is Entropy, together with the corresponding Information Gain.
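For reference, the standard definitions of Entropy and Information Gain for a partition $S$ and a feature $f$ are given below (the project description does not restate them, so the usual ID3 form is assumed):

$$H(S) = -\sum_{c \in \text{labels}} p_c \log_2 p_c \qquad IG(S, f) = H(S) - \sum_{v \in \text{values}(f)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $p_c$ is the fraction of records in $S$ whose target label is $c$, and $S_v$ is the subset of $S$ whose value of feature $f$ is $v$.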
Following is a step-by-step process to build a decision-tree:
Step 1: The forest passes WindowSize and FirstFeature to each decision-tree. For the first tree, the FirstFeature value is 0. For each subsequent tree, it is calculated by adding StepSize to the FirstFeature of the previous tree; hence, the FirstFeature for the second tree is 1*StepSize, and for the ith tree it is (i-1)*StepSize. The set of WindowSize features starting at FirstFeature is used in the root node of the tree to split the dataset.
Step 2: If WindowSize is 5, the first 5 features (indexed 0 – 4) out of the 16 features are selected for the first tree. If StepSize is 2, features 2 – 6 are used for the second tree, and so on.
Step 3: The entropy of the entire dataset is calculated using the CS201 column (the last column). If the entropy is non-zero, the dataset is partitioned so as to increase the purity of the resulting subsets. To partition the dataset, calculate the Entropy and Information Gain of each feature selected in Step 2.
Step 4: The feature with the highest Information Gain is used to split the root node, dividing the dataset among its branches by grade label (each node can have 1 – 6 children, one for each grade label). Once the branches are created, the corresponding partition of the dataset is passed to each branch, i.e. for the branch labelled Excellent, all records having Excellent as the value of the selected feature go to that branch.
Step 5: For all nodes below the root level, the complete feature list competes on the basis of the Information Gain value. The feature with the highest Information Gain is used to split that node.
Step 6: A node with the entropy value equal to zero becomes a leaf node. Such a node contains the dataset where all the records have the same grade for the Data Structures course. When asked to predict, this leaf node returns this grade for the Data Structures course.
Step 7: A branch with entropy greater than 0 is split further using Step 5. This process continues recursively on all non-leaf nodes until every branch terminates in a leaf node.
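A compact sketch of this recursive build is given below. It is a sketch only: the Record and Node layouts, the helper names, the use of std::vector instead of the project's linked lists, and the guard for a degenerate split (a non-pure partition that the chosen feature cannot divide further, a case the steps above do not spell out) are all assumptions made for brevity. The feature values are assumed to have already been converted to grade labels, so a node's children are keyed by label, as in Step 4.

```cpp
#include <cmath>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One preprocessed record: feature values (already converted to labels) plus the CS201 target label.
struct Record {
    std::vector<std::string> features;
    std::string target;
};

struct Node {
    int splitFeature = -1;                              // feature used to split this node (-1 for leaves)
    std::string leafLabel;                              // predicted grade label when this is a leaf
    std::map<std::string, std::unique_ptr<Node>> child; // one branch per grade label of the split feature
};

// Entropy of a partition with respect to the CS201 target label (Step 3).
double entropy(const std::vector<Record>& part) {
    std::map<std::string, int> counts;
    for (const auto& r : part) ++counts[r.target];
    double h = 0.0;
    for (const auto& [label, n] : counts) {
        double p = static_cast<double>(n) / part.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information Gain obtained by splitting `part` on feature `f`.
double infoGain(const std::vector<Record>& part, int f) {
    std::map<std::string, std::vector<Record>> split;
    for (const auto& r : part) split[r.features[f]].push_back(r);
    double weighted = 0.0;
    for (const auto& [value, sub] : split)
        weighted += (static_cast<double>(sub.size()) / part.size()) * entropy(sub);
    return entropy(part) - weighted;
}

// Recursive build: `candidates` is the root's feature window on the first call (Steps 1-4)
// and the complete feature list on every deeper call (Steps 5-7).
std::unique_ptr<Node> build(const std::vector<Record>& part,
                            const std::vector<int>& candidates,
                            int numFeatures) {
    auto node = std::make_unique<Node>();
    if (entropy(part) == 0.0) {                         // Step 6: a pure partition becomes a leaf
        node->leafLabel = part.front().target;
        return node;
    }
    int best = candidates.front();                      // Steps 3-5: pick the highest-gain feature
    for (int f : candidates)
        if (infoGain(part, f) > infoGain(part, best)) best = f;

    std::map<std::string, std::vector<Record>> split;   // Step 4: one partition per grade label
    for (const auto& r : part) split[r.features[best]].push_back(r);
    if (split.size() < 2) {                             // degenerate split: stop with the most frequent label
        int bestCount = 0;
        std::map<std::string, int> counts;
        for (const auto& r : part) ++counts[r.target];
        for (const auto& [label, n] : counts)
            if (n > bestCount) { bestCount = n; node->leafLabel = label; }
        return node;
    }
    node->splitFeature = best;

    std::vector<int> all(numFeatures);                  // below the root, every feature competes (Step 5)
    for (int i = 0; i < numFeatures; ++i) all[i] = i;
    for (auto& [value, sub] : split)
        node->child[value] = build(sub, all, numFeatures);
    return node;
}
```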
Multiple decision-trees are generated, and each tree predicts the Data Structures grade for a given input. The forest collects the predictions from every decision-tree and returns the mode (majority vote) of the predictions as the final Data Structures grade. If two or more grades tie for the mode, all of the modal grades are output as predictions.
The same feature-selection process is repeated for each tree in an attempt to generate unique trees. The technique described above does not guarantee that all trees are distinct; however, to limit duplicate trees to some extent, the window is shifted by StepSize (2 in the example below) in round-robin fashion.
Example: features (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10). If WindowSize = 5, StepSize = 2, and NumOfTrees = 5, then
Subset1 for root selection of tree1 = {s1,s2,s3,s4,s5}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset2 for root selection of tree2 = {s3,s4,s5,s6,s7}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset3 for root selection of tree3 = {s5,s6,s7,s8,s9}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset4 for root selection of tree4 = {s7,s8,s9,s10,s1}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset5 for root selection of tree5 = {s9,s10,s1,s2,s3}; the root node must be chosen from these 5 features according to the Information Gain value.
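A small helper reproducing this round-robin window selection (the function name and 0-based indexing are assumptions) might look like this:

```cpp
#include <vector>

// Indices of the features considered for the root node of tree `treeIndex` (0-based).
// The window starts at treeIndex * stepSize and wraps around the feature list
// in round-robin fashion, as in the example above.
std::vector<int> rootWindow(int treeIndex, int windowSize, int stepSize, int numFeatures) {
    std::vector<int> window;
    int first = (treeIndex * stepSize) % numFeatures;
    for (int j = 0; j < windowSize; ++j)
        window.push_back((first + j) % numFeatures);
    return window;
}
```

For the example above (10 features, WindowSize = 5, StepSize = 2), rootWindow(3, 5, 2, 10) returns {6, 7, 8, 9, 0}, i.e. {s7, s8, s9, s10, s1}.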
To predict a grade, the predictions from every decision-tree are collected and the mode (majority vote) of the predictions is the final Data Structures grade. If there is more than one modal value, the prediction is a set of labels. If a tree does not find a branch for a given case, it returns the Unknown grade.
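A minimal sketch of this majority vote, returning every modal label when there is a tie (the function name is an assumption):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Collect the predictions of all trees and return every modal label.
// The usual case is a single label; ties produce several labels.
std::vector<std::string> majorityVote(const std::vector<std::string>& predictions) {
    std::map<std::string, int> counts;
    int best = 0;
    for (const auto& p : predictions) best = std::max(best, ++counts[p]);
    std::vector<std::string> modes;
    for (const auto& [label, n] : counts)
        if (n == best) modes.push_back(label);
    return modes;
}
```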
The following depth-first traversal methods are implemented in the project:
- In-order traversal
- Pre-order traversal
- Post-order traversal
When a node is visited, the number of the feature that split that node is printed (see the sketch below).
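A sketch of the three traversals over an n-ary node of the kind built above is given below. The TreeNode layout is assumed, and in-order traversal of an n-ary tree requires a convention; the one used here (first child, then the node, then the remaining children) is only one reasonable choice.

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Minimal n-ary tree node, shaped like the build sketch above.
struct TreeNode {
    int splitFeature = -1;                                   // feature that split this node (-1 for leaves)
    std::map<std::string, std::unique_ptr<TreeNode>> child;  // one branch per grade label
};

void preOrder(const TreeNode* n) {
    if (!n) return;
    std::cout << n->splitFeature << ' ';                     // visit, then children
    for (const auto& [label, c] : n->child) preOrder(c.get());
}

void postOrder(const TreeNode* n) {
    if (!n) return;
    for (const auto& [label, c] : n->child) postOrder(c.get());
    std::cout << n->splitFeature << ' ';                     // children, then visit
}

void inOrder(const TreeNode* n) {
    if (!n) return;
    if (n->child.empty()) {                                  // leaf: just visit
        std::cout << n->splitFeature << ' ';
        return;
    }
    bool visited = false;
    for (const auto& [label, c] : n->child) {
        inOrder(c.get());
        if (!visited) {                                      // visit the node after its first child
            std::cout << n->splitFeature << ' ';
            visited = true;
        }
    }
}
```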
The performance of a machine learning model cannot be evaluated on the trainDataset; therefore, the testDataset is used for measuring accuracy. The predicted class labels returned by the forest for the test records are compared with their actual labels to calculate the prediction accuracy.
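The exact formula is not restated in this section; the usual definition of prediction accuracy is:

$$\text{Accuracy} = \frac{\text{number of correctly predicted test records}}{\text{total number of test records}} \times 100\%$$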
Note: The code can be used to predict other grades by changing the target class. The 6 labels can also be changed to correspond to different grades.