# Experiments

Four main experiments were conducted this semester: a performance comparison between the original model and the Cluster-GCN-based model, a scalability comparison between the original and Cluster-GCN models, a performance comparison between the original model and its 2-step/3-step neighbor variants, and an analysis of the distribution of features among nodes that were classified correctly and incorrectly. "Cluster-GCN model" refers to a variant of the original model that uses a partitioning method to convert the initial graph into a subgraph before it is fed into the autoencoder (a design motivated by scalability), and "n-step neighbor" refers to a variant of the original model that converts the original adjacency matrix into its n-step neighbor version before feeding it into the autoencoder.
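For reference, the n-step neighbor conversion can be sketched with powers of the adjacency matrix. The exact definition is not spelled out here, so the sketch below assumes "n-step neighbor" means two distinct nodes are connected whenever one is reachable from the other in at most n hops:

```python
import numpy as np

def n_step_neighbors(adj: np.ndarray, n: int) -> np.ndarray:
    """Return the n-step neighbor version of a 0/1 adjacency matrix.

    Assumes two distinct nodes are "n-step neighbors" if one is reachable
    from the other in at most n hops; the actual variant may instead use
    exactly-n-hop reachability.
    """
    a = (adj > 0).astype(np.int64)
    power = a.copy()  # nonzero where a walk of the current length exists
    reach = a.copy()  # union of all walk lengths seen so far
    for _ in range(n - 1):
        power = ((power @ a) > 0).astype(np.int64)
        reach |= power
    np.fill_diagonal(reach, 0)  # exclude self-loops
    return reach
```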

## Original and Cluster-GCN Performance

The original and Cluster-GCN models were trained and tested on the DBLP dataset, then evaluated using several classification metrics. Several variants of the Cluster-GCN model were tested with different hyperparameters, including variants where Cluster-GCN partitioning was used in the training phase but not the testing phase (denoted "Original Test").
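The partitioning step can be sketched as follows. Cluster-GCN proper partitions the graph with METIS; a random partition stands in here to keep the sketch dependency-free, and reading p as the number of partitions and q as the number of clusters merged per subgraph is an assumption about how the hyperparameters in the table map onto that scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subgraph(adj: np.ndarray, p: int, q: int):
    """Partition the nodes into p clusters and return the subgraph induced
    by q randomly chosen clusters.

    Assumptions: p = number of partitions, q = clusters merged per
    subgraph; Cluster-GCN itself assigns clusters with METIS rather than
    the uniform random labels used here.
    """
    n = adj.shape[0]
    labels = rng.integers(0, p, size=n)            # stand-in for METIS labels
    chosen = rng.choice(p, size=q, replace=False)  # clusters forming this subgraph
    nodes = np.flatnonzero(np.isin(labels, chosen))
    return adj[np.ix_(nodes, nodes)], nodes        # induced adjacency + node ids
```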

| Model | Accuracy | Recall | Precision | F1 | TNR |
| --- | --- | --- | --- | --- | --- |
| Original model | 0.965 | 0.5 | 0.021 | 0.041 | 0.965 |
| p=50, q=5, Original Test | 0.965 | 0.483 | 0.021 | 0.04 | 0.965 |
| p=50, q=5, k=10 | 0.993 | 0.143 | 0.02 | 0.074 | 0.994 |
| p=50, q=5, k=30 | 0.98 | 0 | 0 | 0 | 0.982 |
| p=50, q=5, k=100 | 0.942 | 0 | 0 | 0 | 0.944 |
| p=50, q=1, Original Test | 0.964 | 0.416 | 0.017 | 0.034 | 0.965 |

At a glance, it seems that using Cluster-GCN partitioning in the training phase alone leads to performance similar to the original method, while also using partitioning in the testing phase leads to much lower recall. However, additional tests are needed before drawing firm conclusions. In particular, these tests should be replicated on additional datasets, as the DBLP testing set may have had too few anomalies to measure performance accurately. In addition, more combinations of Cluster-GCN hyperparameters could be tested.
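For reproducibility, the metrics reported in these tables can be derived directly from the test-set confusion counts; since TNR is not a standard library score, the sketch below computes everything by hand, assuming binary labels with anomalies as the positive class:

```python
import numpy as np

def report_metrics(y_true, y_pred):
    """Derive the metrics used in the tables above from binary labels
    (1 = anomaly). TNR = TN / (TN + FP) is computed by hand since it is
    not a standard built-in score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / len(y_true),
        "Recall": recall,
        "Precision": precision,
        "F1": f1,
        "TNR": tn / (tn + fp) if tn + fp else 0.0,
    }
```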

## Original and Cluster-GCN Scalability

The original and Cluster-GCN models were each trained on zero matrices of increasing sizes in order to determine how the training time of each would scale to very large graphs. Test time was not considered. Additionally, the Cluster-GCN model was evaluated with random matrices to determine differences in scalability between sparse and dense graphs.
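A sketch of the timing harness is below; `train_fn` is a hypothetical stand-in for either model's training entry point, and the sizes are illustrative:

```python
import time
import numpy as np

def time_training(train_fn, sizes=(500, 1000, 2000, 4000, 8000)):
    """Time a training function on all-zero adjacency matrices of
    increasing size, as in the scalability experiment. `train_fn` is a
    hypothetical stand-in for either model's training entry point."""
    timings = []
    for n in sizes:
        adj = np.zeros((n, n))  # empty (all-zero) n x n graph
        start = time.perf_counter()
        train_fn(adj)
        timings.append((n, time.perf_counter() - start))
    return timings
```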

The runtime of the original model increases rapidly with larger graphs (nearly O(n^2.8) for n × n matrices), while the Cluster-GCN model scales much better (about O(n^2.2)). The Cluster-GCN model does take longer on dense matrices, but the rate of growth in runtime appears similar to the sparse case.
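The quoted exponents can be estimated by fitting a line to the measured timings on log-log axes; a minimal sketch with numpy:

```python
import numpy as np

def runtime_exponent(timings):
    """Estimate k in runtime ~ c * n**k from (n, seconds) pairs via a
    least-squares line fit on log-log axes."""
    ns, secs = zip(*timings)
    slope, _ = np.polyfit(np.log(ns), np.log(secs), 1)
    return slope  # ~2.8 observed for the original model, ~2.2 for Cluster-GCN
```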

## n-Step Neighbor Performance Comparison

The original model and its 2-step and 3-step neighbor variants were trained and tested on the UNSW dataset, then evaluated using several classification metrics. In this experiment, metrics were reported separately for the graphs at times t, t-1, and t-2 rather than in aggregate as was done previously.

| Model | Accuracy | Recall | Precision | F1 | TNR |
| --- | --- | --- | --- | --- | --- |
| Original model (time t-2) | 0.914 | 0.25 | 0.5 | 0.333 | 0.976 |
| Original model (time t-1) | 0.883 | 0.125 | 0.2 | 0.153 | 0.953 |
| Original model (time t) | 0.862 | 0.25 | 0.222 | 0.235 | 0.919 |
| 2-step neighbor (time t-2) | 0.894 | 0.25 | 0.333 | 0.286 | 0.953 |
| 2-step neighbor (time t-1) | 0.894 | 0.125 | 0.25 | 0.167 | 0.965 |
| 2-step neighbor (time t) | 0.851 | 0.125 | 0.125 | 0.125 | 0.918 |
| 3-step neighbor (time t-2) | 0.926 | 0.25 | 0.667 | 0.364 | 0.988 |
| 3-step neighbor (time t-1) | 0.862 | 0.125 | 0.143 | 0.133 | 0.930 |
| 3-step neighbor (time t) | 0.872 | 0.25 | 0.25 | 0.25 | 0.930 |

Overall, using 2-step or 3-step neighbor variants does not appear to improve performance over the original model, so other methods for improving the original model's classification will need to be explored.

## Feature Distribution Analysis

The original model was trained and tested on the UNSW dataset; the nodes in the testing set were then grouped by classification outcome and their features averaged within each group. In addition to the standard features used for the UNSW dataset, an additional feature, "num_adj", was included, representing the number of connections from each node to other nodes.
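The grouping can be sketched with pandas; `features`, `y_true`, and `y_pred` are hypothetical names for the per-node test-set feature table and its true and predicted binary labels:

```python
import numpy as np
import pandas as pd

def feature_means_by_outcome(features: pd.DataFrame, y_true, y_pred):
    """Average each feature within the classification-outcome groups used
    in the table below. `features` has one row per test node; y_true and
    y_pred are binary labels with 1 = anomaly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    groups = {
        "Total Positive": y_true == 1,
        "True Positive": (y_true == 1) & (y_pred == 1),
        "False Negative": (y_true == 1) & (y_pred == 0),
        "Total Negative": y_true == 0,
        "True Negative": (y_true == 0) & (y_pred == 0),
        "False Positive": (y_true == 0) & (y_pred == 1),
    }
    # One row per outcome group, one column per feature
    return pd.DataFrame({name: features[mask].mean() for name, mask in groups.items()}).T
```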

| Group | ct_dst_sport_ltm | tcprtt | dwin | ct_src_dport_ltm | ct_dst_src_ltm | ct_dst_ltm | smeansz | num_adj |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total Positive | 3.71 | 0.14 | 53.13 | 2.29 | 4.13 | 3.71 | 307.17 | 10.54 |
| True Positive | 1.4 | 0.152 | 0 | 1 | 2 | 1.4 | 1132 | 10.8 |
| False Negative | 4.32 | 0.131 | 67.105 | 2.63 | 4.68 | 4.32 | 90.12 | 10.47 |
| Total Negative | 1.32 | 0.025 | 118.60 | 1.37 | 1.69 | 1.64 | 110.81 | 2.90 |
| True Negative | 1.33 | 0.025 | 113.45 | 1.39 | 1.69 | 1.65 | 52.31 | 2.89 |
| False Positive | 1.08 | 0.032 | 215.77 | 1 | 1.62 | 1.46 | 1213.39 | 3.08 |

Some immediate observations can be made from the table. For example, anomalous nodes exhibit higher values for "tcprtt", "ct_dst_src_ltm", and "num_adj", but lower values for "dwin". Certain features also seem more responsible for misclassification than others. For example, false negatives had much lower values for "smeansz" than true positives, and false positives had much higher values for "smeansz" than true negatives. This implies the network learned to classify nodes with high "smeansz" values as anomalous and nodes with low "smeansz" values as non-anomalous, even when other features like "tcprtt" or "dwin" may have suggested the opposite. Experiments could be run using different combinations of the initial feature set to verify this hypothesis.
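One way to run that check is a leave-one-out feature ablation, sketched below; `train_and_eval` is a hypothetical helper that retrains the model on a given feature subset and returns its metrics:

```python
def ablation_study(feature_names, train_and_eval):
    """Leave-one-out ablation: retrain with each feature dropped in turn
    to see which features drive misclassification. `train_and_eval` is a
    hypothetical helper returning the metrics for a feature subset."""
    results = {}
    for dropped in feature_names:
        subset = [f for f in feature_names if f != dropped]
        results[f"without {dropped}"] = train_and_eval(subset)
    return results
```

If the hypothesis holds, dropping "smeansz" should noticeably shift the false positive and false negative rates relative to the full feature set.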