This repository contains a case study on the Microsoft malware detection challenge.

GowthamChowta/malware_casestudy

Microsoft Malware detection

Problem statement: https://www.kaggle.com/c/malware-classification

Steps to start with this case study:

1. Download the data from https://www.kaggle.com/c/malware-classification/data
2. Extract the data. If p7zip is not installed, install it by running the command below in a Jupyter notebook cell:
!sudo apt install p7zip-full

Then run the command below in the notebook to extract the archive:
!7z x train.7z

EDA notebook -

  1. Understand the problem statement, the evaluation metric, and the sample data for the byte and ASM files.
  2. Examine the distribution of class labels in the train and test data.
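
The two EDA steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code. It assumes the labels file is a CSV with `Id` and `Class` columns and that the metric is multi-class log loss, as described on the Kaggle competition page. The function names and the sample rows are hypothetical.

```python
import csv
import io
import math
from collections import Counter


def class_distribution(csv_file):
    """Return {class_label: fraction_of_rows} from a CSV with a 'Class' column."""
    counts = Counter(row["Class"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of 0-based integer class indices.
    y_prob: list of per-class probability lists, one list per sample.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Made-up rows standing in for the real labels file (assumed layout: Id,Class).
sample = io.StringIO("Id,Class\na,1\nb,1\nc,2\nd,3\n")
dist = class_distribution(sample)
print(dist)

# Two samples, two classes: the true classes receive probabilities 0.5 and 0.75.
loss = multiclass_log_loss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
print(loss)
```

A strongly skewed distribution here would be a reason to use a stratified train/test split.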

Byte files feature generation notebook -

1. Analyze the size of the byte files.
2. Extract unigram features using a custom vectorizer.
3. Extract bigram features using a custom vectorizer.
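
The unigram and bigram steps above can be sketched as follows. The notebook's custom vectorizer is not shown in this README, so this is only one possible approach. It assumes each line of a byte file starts with an address followed by space-separated hex byte tokens, with `??` marking unknown bytes. All names (`VOCAB`, `unigram_counts`, `bigram_counts`) are hypothetical.

```python
from collections import Counter

# Vocabulary: the 256 byte values in upper-case hex plus the "??" placeholder.
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]


def tokens_from_bytes_text(text):
    """Yield byte tokens from a byte-file dump, skipping the leading address."""
    for line in text.splitlines():
        parts = line.split()
        for tok in parts[1:]:  # parts[0] is assumed to be the address column
            yield tok.upper()


def unigram_counts(text):
    """Return a 257-long count vector ordered like VOCAB."""
    counts = Counter(tokens_from_bytes_text(text))
    return [counts.get(tok, 0) for tok in VOCAB]


def bigram_counts(text):
    """Return a Counter over consecutive token pairs, e.g. ('56', '8D').

    Pairs are formed across line boundaries, treating the file as one stream.
    """
    toks = list(tokens_from_bytes_text(text))
    return Counter(zip(toks, toks[1:]))


# Made-up two-line sample in the assumed format.
sample = "00401000 56 8D 44 24\n00401004 56 8D ?? 00\n"
uni = unigram_counts(sample)
bi = bigram_counts(sample)
print(uni[VOCAB.index("56")], uni[VOCAB.index("??")], bi[("56", "8D")])
```

The full bigram space has 257 × 257 entries, so a sparse structure such as the `Counter` used here keeps memory manageable on large files.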

ASM files feature generation notebook -

1. Count the number of prefixes, opcodes, keywords, and registers in each file using multi-processing.
2. Analyze the size of the ASM files.
3. Extract graph features using multi-processing.
4. Extract image features using multi-processing, taking the first 800 pixel values (as suggested by 'say no to overfitting' in their video).
5. Multi-variate analysis of the ASM features.
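
As a rough sketch of the counting and image steps above: the token lists below are short illustrative samples, not the notebook's actual lists, and only opcodes and registers are counted. Reading the "first 800 values" as the first 800 bytes of the file is also an assumption. All function names are hypothetical.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Illustrative token lists only; the notebook's lists are not given in the README.
OPCODES = ["mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "retn"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]

WORD = re.compile(r"[A-Za-z_]+")


def count_tokens(asm_text):
    """Count whole-word occurrences of each opcode and register in the text."""
    counts = Counter(w.lower() for w in WORD.findall(asm_text))
    return [counts[t] for t in OPCODES + REGISTERS]


def count_file(path):
    """Read one ASM file and return its count vector."""
    with open(path, encoding="latin-1") as fh:
        return count_tokens(fh.read())


def count_many(paths, workers=4):
    """Process files in parallel, one file per task."""
    with Pool(workers) as pool:
        return pool.map(count_file, paths)


def image_features(raw, n=800):
    """First n byte values of the file as intensities (0-255), zero-padded."""
    values = list(raw[:n])
    return values + [0] * (n - len(values))


# Made-up two-line sample in the style of a disassembly listing.
sample = ".text:00401000 push ebp\n.text:00401001 mov ebp, esp\n"
vec = count_tokens(sample)
print(dict(zip(OPCODES + REGISTERS, vec)))
print(image_features(sample.encode("latin-1"), n=4))
```

`count_many` is defined but not called here. On platforms that spawn worker processes, call it from under an `if __name__ == "__main__":` guard.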

Modelling notebook -

1. Multi-variate analysis of the final features.
2. Train an XGBoost model on the final features.
3. Further possible improvements.