The project uses a machine learning technique, a “Random” Forest, to predict grades in the Data Structures course using previous academic data. The following concepts are used in predicting the grades:
- Dataset preprocessing
- Model training (generating a single tree, and a forest of trees)
- Modal value finding
- Tree traversals to display the trees
- Model testing (predicting a single grade, and grades of several students in Data Structures)
Consider the following dataset:
Sr.No | Semester | CourseCode | CourseTitle | CreditHours | Grade | GradePoint | SGPA | CGPA | Warning |
---|---|---|---|---|---|---|---|---|---|
1 | Fall 2016 | MT104 | Linear Algebra | 3 | B | 3 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | MT101 | Calculus-I | 3 | B+ | 3.33 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | CS101 | ITC | 3 | A | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | CL101 | ITC Lab | 1 | A+ | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | EE182 | Basic Elective | 3 | C- | 1.67 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | SL101 | English Language | 1 | A- | 3.67 | 3.27 | 3.27 | 0 |
1 | Fall 2016 | SS101 | English Language Lab | 3 | A+ | 4 | 3.27 | 3.27 | 0 |
1 | Fall 2017 | CS201 | Data Structures | 3 | A | 4 | 3.75 | 3.57 | 0 |
1 | Spring 2017 | EE227 | Digital Logic and Design | 3 | A+ | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | SS122 | English Composition | 3 | A | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | MT115 | Calculus-II | 3 | A- | 3.67 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | SS111 | Islamic and Religious Studies | 3 | B | 3 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | CS103 | Computer Programming | 3 | A | 4 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | EL227 | Digital Logic and Design Lab | 1 | B | 3 | 3.71 | 3.49 | 0 |
1 | Spring 2017 | CL103 | Computer Programming Lab | 1 | A+ | 4 | 3.71 | 3.49 | 0 |
First, the data in the rawDataset.csv file is read and restructured so that each student's records are arranged in one line, as shown below.
Sr.No | MT104 | MT119 | CS118 | CL118 | EE182 | SL101 | SS101 | EE227 | SS122 | MT224 | SS111 | CS217 | EL227 | CL217 | CGPA | Warning | CS201 (Data Structures) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 3.33 | 4 | 4 | 1.67 | 3.67 | 4 | 4 | 4 | 3.67 | 3 | 4 | 3 | 4 | 3.49 | 0 | 4 |
2 | 4 | 4 | 3.67 | 4 | 3.33 | 3 | 4 | 4 | 4 | 4 | 3.33 | 4 | 2.33 | 4 | 3.66 | 0 | 4 |
3 | 3.33 | 2.67 | 2.67 | 3.33 | 2.33 | 3.33 | 3 | 3 | 4 | 3 | 3.67 | 3.67 | 3 | 4 | 3.21 | 0 | 3.33 |
Each student record (feature vector) consists of the student's grade points for all subjects of the first two semesters, plus CGPA and Warning, while CS201 (Data Structures) is treated as the label or target class. The course codes are sorted alphabetically to match the trainDataset.csv file. From the entire dataset, only those students who have studied Data Structures are kept; all others are removed. After this preprocessing, the selected feature vectors are used to train and test the forest.
The final step of preprocessing is to split the dataset into trainDataset and testDataset. The first 40 vectors are used for training, and the remaining 20 vectors are used for testing.
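A minimal sketch of how a restructured record and the 40/20 split could be represented is given below; the StudentRecord struct, its field names, and the use of std::vector instead of the project's linked lists are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Illustrative layout of one restructured line of the dataset.
// Field names are assumptions; the project may store records differently.
struct StudentRecord {
    int srNo;                        // Sr.No
    std::vector<double> gradePoints; // grade points of the first two semesters' courses, sorted by course code
    double cgpa;                     // CGPA after the second semester
    int warning;                     // Warning count
    double cs201;                    // target: CS201 (Data Structures) grade point
};

// Split the preprocessed records: the first 40 go to trainDataset, the remaining 20 to testDataset.
void splitDataset(const std::vector<StudentRecord>& all,
                  std::vector<StudentRecord>& train,
                  std::vector<StudentRecord>& test) {
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (i < 40) train.push_back(all[i]);
        else        test.push_back(all[i]);
    }
}
```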
The students' grades are converted into labels for the training and testing of the forest. The label for each grade is given in the following table:
Grade | Label |
---|---|
A+, A, A- | Excellent |
B+, B, B- | Good |
C+, C, C- | Average |
D+, D | Bad |
FA, F | Fail |
All others | Unknown |
Please note that the first letter of each label must be capitalized.
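A small helper implementing this mapping could look like the sketch below; the function name is an assumption, but the mapping follows the table above exactly and keeps the first letter of every label capitalized.

```cpp
#include <string>

// Map a letter grade to its class label as defined in the table above.
std::string gradeToLabel(const std::string& grade) {
    if (grade == "A+" || grade == "A" || grade == "A-") return "Excellent";
    if (grade == "B+" || grade == "B" || grade == "B-") return "Good";
    if (grade == "C+" || grade == "C" || grade == "C-") return "Average";
    if (grade == "D+" || grade == "D")                  return "Bad";
    if (grade == "FA" || grade == "F")                  return "Fail";
    return "Unknown";                                   // all other grades
}
```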
In this phase, the dataset is loaded into a linked list to create a forest of multiple decision-trees. The inputs to the forest are the following parameters (a minimal struct grouping them is sketched after the list):
- NumOfTrees: the number of trees required in the forest
- WindowSize: the number of columns (features) used to split the root node of a decision-tree
- StepSize: decides the first feature of the root node's feature window (used in combination with the WindowSize parameter)
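The struct below is only an illustration of how these parameters could be grouped; the name and fields are assumptions, not part of the project specification.

```cpp
// Forest-level configuration; the fields mirror the parameters listed above.
struct ForestConfig {
    int numOfTrees;  // NumOfTrees: number of decision-trees in the forest
    int windowSize;  // WindowSize: number of features considered for each root split
    int stepSize;    // StepSize: shift of the feature window from one tree to the next
};
```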
To make each tree different from the others, a special feature-selection technique is used (explained below). This technique differs from the well-known Random Forest algorithm, in which the feature set for the root node of a decision-tree is selected randomly; in this simple technique, no randomness is involved.
Each decision-tree gives a set of rules to predict grades based on a single feature or a combination of features. At each node of the tree, a feature is chosen that best splits the dataset. The best split divides the dataset into multiple partitions (linked lists), each hosting observations that are more similar to one another and different from those in the other partitions. The importance of each feature is therefore derived from how pure each partition is. The measure used to calculate the impurity of a dataset is Entropy, together with the corresponding Information Gain.
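For reference, the standard definitions of Entropy and Information Gain for a partition $S$ and a feature $f$ are given below (the project description does not restate them, so the usual ID3 form is assumed):

$$H(S) = -\sum_{c \in \text{labels}} p_c \log_2 p_c \qquad IG(S, f) = H(S) - \sum_{v \in \text{values}(f)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $p_c$ is the fraction of records in $S$ whose target label is $c$, and $S_v$ is the subset of $S$ whose value of feature $f$ is $v$.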
Following is a step-by-step process to build a decision-tree:
Step 1: The forest passes WindowSize and FirstFeature to each decision-tree. For the first tree, the FirstFeature value is 0. For each subsequent tree, it is calculated by adding StepSize to the FirstFeature of the previous tree; hence, the FirstFeature for the second tree is 1*StepSize, and for the ith tree it is (i-1)*StepSize. The set of WindowSize features starting at FirstFeature is used in the root node of the tree to split the dataset.
Step 2: If WindowSize is 5, the first 5 features (indexed 0 – 4) out of the 16 features are selected for the first tree. If StepSize is 2, features 2 – 6 are used for the second tree, and so on.
Step 3: The entropy of the entire dataset is calculated using the CS201 column (the last column). If the entropy is non-zero, the dataset is partitioned so as to increase the purity of the resulting subsets. To partition the dataset, calculate the Entropy and Information Gain of each feature selected in Step 2.
Step 4: The feature with the highest Information Gain is used to split the root node, dividing the dataset among its branches by grade label (each node can have 1 – 6 children, one for each grade label). Once the branches are created, the corresponding partition of the dataset is passed to each branch, i.e. for the branch labelled Excellent, all records having Excellent as the value of the selected feature go to that branch.
Step 5: For all nodes below the root level, the complete feature list competes on the basis of the Information Gain value. The feature with the highest Information Gain is used to split that node.
Step 6: A node with the entropy value equal to zero becomes a leaf node. Such a node contains the dataset where all the records have the same grade for the Data Structures course. When asked to predict, this leaf node returns this grade for the Data Structures course.
Step 7: A branch with entropy greater than 0 is split further using Step 5. This process continues recursively on all non-leaf nodes until every branch terminates in a leaf node.
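A compact sketch of this recursive build is given below. It is a sketch only: the Record and Node layouts, the helper names, the use of std::vector instead of the project's linked lists, and the guard for a degenerate split (a non-pure partition that the chosen feature cannot divide further, a case the steps above do not spell out) are all assumptions made for brevity. The feature values are assumed to have already been converted to grade labels, so a node's children are keyed by label, as in Step 4.

```cpp
#include <cmath>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One preprocessed record: feature values (already converted to labels) plus the CS201 target label.
struct Record {
    std::vector<std::string> features;
    std::string target;
};

struct Node {
    int splitFeature = -1;                              // feature used to split this node (-1 for leaves)
    std::string leafLabel;                              // predicted grade label when this is a leaf
    std::map<std::string, std::unique_ptr<Node>> child; // one branch per grade label of the split feature
};

// Entropy of a partition with respect to the CS201 target label (Step 3).
double entropy(const std::vector<Record>& part) {
    std::map<std::string, int> counts;
    for (const auto& r : part) ++counts[r.target];
    double h = 0.0;
    for (const auto& [label, n] : counts) {
        double p = static_cast<double>(n) / part.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information Gain obtained by splitting `part` on feature `f`.
double infoGain(const std::vector<Record>& part, int f) {
    std::map<std::string, std::vector<Record>> split;
    for (const auto& r : part) split[r.features[f]].push_back(r);
    double weighted = 0.0;
    for (const auto& [value, sub] : split)
        weighted += (static_cast<double>(sub.size()) / part.size()) * entropy(sub);
    return entropy(part) - weighted;
}

// Recursive build: `candidates` is the root's feature window on the first call (Steps 1-4)
// and the complete feature list on every deeper call (Steps 5-7).
std::unique_ptr<Node> build(const std::vector<Record>& part,
                            const std::vector<int>& candidates,
                            int numFeatures) {
    auto node = std::make_unique<Node>();
    if (entropy(part) == 0.0) {                         // Step 6: a pure partition becomes a leaf
        node->leafLabel = part.front().target;
        return node;
    }
    int best = candidates.front();                      // Steps 3-5: pick the highest-gain feature
    for (int f : candidates)
        if (infoGain(part, f) > infoGain(part, best)) best = f;

    std::map<std::string, std::vector<Record>> split;   // Step 4: one partition per grade label
    for (const auto& r : part) split[r.features[best]].push_back(r);
    if (split.size() < 2) {                             // degenerate split: stop with the most frequent label
        int bestCount = 0;
        std::map<std::string, int> counts;
        for (const auto& r : part) ++counts[r.target];
        for (const auto& [label, n] : counts)
            if (n > bestCount) { bestCount = n; node->leafLabel = label; }
        return node;
    }
    node->splitFeature = best;

    std::vector<int> all(numFeatures);                  // below the root, every feature competes (Step 5)
    for (int i = 0; i < numFeatures; ++i) all[i] = i;
    for (auto& [value, sub] : split)
        node->child[value] = build(sub, all, numFeatures);
    return node;
}
```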
Multiple decision-trees are generated, and each tree predicts the Data Structures grade for a given input. The forest collects the predictions from every decision-tree and returns the mode (majority vote) of the predictions as the final Data Structures grade. If two or more grades tie for the mode, all of the modal grades are output as predictions.
The same feature-selection process is repeated for each tree in an attempt to generate unique trees. The technique described above does not guarantee that all trees are distinct; however, to limit duplicate trees to some extent, the window is shifted by StepSize (2 in the example below) in round-robin fashion.
Example: features (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10). If WindowSize = 5, StepSize = 2, and NumOfTrees = 5, then
Subset1 for root selection of tree1 = {s1,s2,s3,s4,s5}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset2 for root selection of tree2 = {s3,s4,s5,s6,s7}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset3 for root selection of tree3 = {s5,s6,s7,s8,s9}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset4 for root selection of tree4 = {s7,s8,s9,s10,s1}; the root node must be chosen from these 5 features according to the Information Gain value.
Subset5 for root selection of tree5 = {s9,s10,s1,s2,s3}; the root node must be chosen from these 5 features according to the Information Gain value.
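A small helper reproducing this round-robin window selection (the function name and 0-based indexing are assumptions) might look like this:

```cpp
#include <vector>

// Indices of the features considered for the root node of tree `treeIndex` (0-based).
// The window starts at treeIndex * stepSize and wraps around the feature list
// in round-robin fashion, as in the example above.
std::vector<int> rootWindow(int treeIndex, int windowSize, int stepSize, int numFeatures) {
    std::vector<int> window;
    int first = (treeIndex * stepSize) % numFeatures;
    for (int j = 0; j < windowSize; ++j)
        window.push_back((first + j) % numFeatures);
    return window;
}
```

For the example above (10 features, WindowSize = 5, StepSize = 2), rootWindow(3, 5, 2, 10) returns {6, 7, 8, 9, 0}, i.e. {s7, s8, s9, s10, s1}.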
To predict a grade, the predictions from every decision-tree are collected and the mode (majority vote) of the predictions is the final Data Structures grade. If there is more than one modal value, the prediction is a set of labels. If a tree does not find a branch for a given case, it returns the Unknown grade.
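A minimal sketch of this majority vote, returning every modal label when there is a tie (the function name is an assumption):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Collect the predictions of all trees and return every modal label.
// The usual case is a single label; ties produce several labels.
std::vector<std::string> majorityVote(const std::vector<std::string>& predictions) {
    std::map<std::string, int> counts;
    int best = 0;
    for (const auto& p : predictions) best = std::max(best, ++counts[p]);
    std::vector<std::string> modes;
    for (const auto& [label, n] : counts)
        if (n == best) modes.push_back(label);
    return modes;
}
```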
The following depth-first traversal methods are implemented in the project:
- In-order traversal
- Pre-order traversal
- Post-order traversal
When a node is visited, the number of the feature that split that node is printed (see the sketch below).
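A sketch of the three traversals over an n-ary node of the kind built above is given below. The TreeNode layout is assumed, and in-order traversal of an n-ary tree requires a convention; the one used here (first child, then the node, then the remaining children) is only one reasonable choice.

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Minimal n-ary tree node, shaped like the build sketch above.
struct TreeNode {
    int splitFeature = -1;                                   // feature that split this node (-1 for leaves)
    std::map<std::string, std::unique_ptr<TreeNode>> child;  // one branch per grade label
};

void preOrder(const TreeNode* n) {
    if (!n) return;
    std::cout << n->splitFeature << ' ';                     // visit, then children
    for (const auto& [label, c] : n->child) preOrder(c.get());
}

void postOrder(const TreeNode* n) {
    if (!n) return;
    for (const auto& [label, c] : n->child) postOrder(c.get());
    std::cout << n->splitFeature << ' ';                     // children, then visit
}

void inOrder(const TreeNode* n) {
    if (!n) return;
    if (n->child.empty()) {                                  // leaf: just visit
        std::cout << n->splitFeature << ' ';
        return;
    }
    bool visited = false;
    for (const auto& [label, c] : n->child) {
        inOrder(c.get());
        if (!visited) {                                      // visit the node after its first child
            std::cout << n->splitFeature << ' ';
            visited = true;
        }
    }
}
```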
The performance of a machine learning model cannot be evaluated on the trainDataset; therefore, the testDataset is used for measuring accuracy. The predicted class labels returned by the forest for the test records are compared with their actual labels to calculate the prediction accuracy.
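The exact formula is not restated in this section; the usual definition of prediction accuracy is:

$$\text{Accuracy} = \frac{\text{number of correctly predicted test records}}{\text{total number of test records}} \times 100\%$$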
Note: The code can be used to predict other grades by changing the target class. The 6 labels can also be changed to correspond to different grades.