Problem statement: https://www.kaggle.com/c/malware-classification
1. Download the data from: https://www.kaggle.com/c/malware-classification/data
2. Extract the data. Install p7zip with the command below:
!sudo apt install p7zip-full
Then run the command below in the Jupyter notebook to extract the files:
!7z x train.7z
- Understand the problem statement, the evaluation metric, and the sample data for both the byte and ASM files.
- Examine the distribution of class labels in the train and test data.
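The class-label check above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the column name `Class` and the in-memory sample rows are assumptions made for the example, so adjust them to the labels file you downloaded.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_file, label_column="Class"):
    """Return {label: (count, fraction)} for one column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Made-up rows standing in for the real labels file (assumed columns: Id, Class).
sample = io.StringIO("Id,Class\na,1\nb,1\nc,3\nd,2\n")
dist = class_distribution(sample)
for label, (n, frac) in dist.items():
    print(f"class {label}: {n} files ({frac:.1%})")
```

To run it on the real data, pass an open file handle for the labels CSV instead of the `io.StringIO` sample.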
2. Extract unigram features using a custom vectorizer.
3. Extract bigram features using a custom vectorizer.
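A custom n-gram vectorizer for byte files might look like the sketch below. It assumes each line holds an address followed by space-separated hex bytes, with `??` marking unknown bytes. The function names and the sample lines are hypothetical and are not taken from the project's code.

```python
from collections import Counter


def byte_tokens(lines):
    """Yield lower-cased byte tokens, skipping the leading address on each line."""
    for line in lines:
        for token in line.split()[1:]:
            yield token.lower()


def ngram_counts(lines, n=1):
    """Count n-grams of consecutive byte tokens across the whole file."""
    tokens = list(byte_tokens(lines))
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def vectorize(counts, vocabulary):
    """Turn a Counter into a fixed-length list ordered by the vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


# Made-up lines in the assumed byte-file layout.
sample_lines = [
    "00401000 56 8D 44 24",
    "00401004 56 8D ?? 00",
]
unigrams = ngram_counts(sample_lines, n=1)
bigrams = ngram_counts(sample_lines, n=2)

# Unigram vocabulary: 256 hex values plus the unknown-byte marker.
vocabulary = [f"{i:02x}" for i in range(256)] + ["??"]
vector = vectorize(unigrams, vocabulary)
```

The full bigram vocabulary has 257 × 257 entries, so in practice the bigram counts are usually kept sparse or pruned to the most informative terms.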
1. Count the number of prefixes, opcodes, keywords, and registers in each file using multiprocessing.
2. Analyze the size of the ASM files.
3. Extract graph features using multiprocessing.
4. Extract image features using multiprocessing (as suggested by the 'say no to overfitting' team in their video: take the first 800 pixel intensity values).
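The per-file counting step can be parallelized roughly as follows. This is a sketch under stated assumptions: the `OPCODES`, `REGISTERS`, and `PREFIXES` lists are small illustrative subsets, the helper names are hypothetical, and the tokenizer is deliberately crude. The project's real lists and parsing are not shown in this outline.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Illustrative subsets only (assumptions, not the project's actual lists).
OPCODES = ["mov", "push", "pop", "call", "jmp", "add"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi"]
PREFIXES = [".text:", ".data:", ".rdata:"]

# Words that start with a letter, so hex addresses such as 00ADD000 are skipped.
TOKEN_RE = re.compile(r"\b[a-z_][a-z0-9_]*\b")


def count_asm_features(text):
    """Count section prefixes, opcodes, and registers in one ASM listing."""
    lowered = text.lower()
    words = Counter(TOKEN_RE.findall(lowered))
    lines = lowered.splitlines()
    features = {p: sum(line.startswith(p) for line in lines) for p in PREFIXES}
    for name in OPCODES + REGISTERS:
        features[name] = words[name]
    return features


def count_file(path):
    """Read one ASM file and return (path, feature dict)."""
    # latin-1 decodes any byte sequence, which sidesteps odd encodings.
    with open(path, encoding="latin-1") as handle:
        return path, count_asm_features(handle.read())


def count_files(paths, processes=4):
    """Map count_file over many files with a pool of worker processes."""
    with Pool(processes) as pool:
        return dict(pool.map(count_file, paths))


# Made-up listing in the assumed ASM layout.
sample = ".text:00401000 push esi\n.text:00401001 mov eax, ebx\n.data:00402000 dd 0\n"
features = count_asm_features(sample)
```

In a script, call `count_files(list_of_paths)` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The same pool pattern applies to the graph and image feature steps, with a different per-file function.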
Multivariate analysis of the ASM features.
1. Multivariate analysis of the final features.
2. Train an XGBoost model on the final features.
3. Further possible improvements.