Data Science for Cybersecurity - Final Project

Topic

Malware Detection with Static Analysis & Model Ensemble

Introduction

Malware detection can be implemented in many ways, and static analysis is one of them.

Static analysis classifies a program as malicious or benign without executing it, using data extracted from its PE (Portable Executable) file, such as:

- PE Section Headers
- PE Imports
- PE as Image

Different models can be trained on different parts of the PE data to make predictions.

Therefore, in this project, I want to know whether a model ensemble can achieve better accuracy on malware detection with static analysis.

In other words, does a model ensemble perform better than each individual model?

Literature Review

1. Ensembling ConvNets using Keras

Reference Link

2. Keras: Multiple Inputs and Mixed Data

Reference Link

Dataset

[1] Angelo Oliveira, "Malware Analysis Datasets: PE Section Headers", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/2czh-es14. Accessed: Jun. 13, 2020.

[2] Angelo Oliveira, "Malware Analysis Datasets: Top-1000 PE Imports", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/004e-v304. Accessed: Jun. 13, 2020.

[3] Angelo Oliveira, "Malware Analysis Datasets: Raw PE as Image", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/8brp-j220. Accessed: Jun. 13, 2020.

Data Preprocessing

- Merge the DataFrames on the hash value and drop duplicated observations.
- Add a new column calculated from the original columns.
- Perform exploratory data analysis (EDA).
- Because the data is imbalanced, resample it. I use ADASYN for oversampling.

Individual Models Training Result

First Model: PE Section Headers
- Standardization
- Build DNN model
  - Dense(32) + Dense(32) + Dense(64) + Dropout(0.2) + Dense(1)
  - Use Adam optimizer with learning rate = 0.0003 and Early stopping
- Result: Training Accuracy: 89.71%, Validation Accuracy: 58.46%
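
A sketch of how the first model could be written in Keras, assuming the layer sizes, dropout rate, and learning rate listed above. Several details are not given in this README and are assumptions: the ReLU and sigmoid activations, the binary cross-entropy loss, the early-stopping `patience`, the number of input features (`n_features`), and the random toy data.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
n_features = 10  # hypothetical number of section-header features
X_train = rng.normal(loc=5.0, scale=3.0, size=(64, n_features)).astype("float32")
y_train = rng.integers(0, 2, size=64).astype("float32")

# Standardization: zero mean and unit variance per feature,
# using statistics computed on the training data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8
X_train_std = (X_train - mean) / std

section_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),     # activations are assumptions
    layers.Dense(32, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),   # one output for malicious vs. benign
])
section_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0003),
    loss="binary_crossentropy",              # assumed loss
    metrics=["accuracy"],
)

# Early stopping on validation loss; the patience value is hypothetical.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
history = section_model.fit(
    X_train_std, y_train,
    validation_split=0.25,
    epochs=2,            # kept tiny for the sketch; real training would run longer
    batch_size=16,
    callbacks=[early_stop],
    verbose=0,
)
```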
Second Model: Top-1000 PE Imports (with PCA)
- Build DNN model
  - Dense(64) + Dense(64) + Dropout(0.4) + Dense(32) + Dense(32) + Dropout(0.2) + Dense(1)
  - Use Adam optimizer with learning rate = 0.0001 and Early stopping
- Result: Training Accuracy: 97.97%, Validation Accuracy: 94.26%
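
The PCA step for the import features might look like the sketch below, using scikit-learn. The number of components is not stated in this README, so `n_components=50` is purely hypothetical, and the random 0/1 matrix only stands in for the 1000 import columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-in for the import features: 1000 columns of 0/1 values.
X_train = rng.integers(0, 2, size=(200, 1000)).astype("float32")
X_val = rng.integers(0, 2, size=(50, 1000)).astype("float32")

# Fit PCA on the training data only, then reuse the same projection
# for the validation data.
pca = PCA(n_components=50, random_state=0)  # component count is hypothetical
X_train_pca = pca.fit_transform(X_train)
X_val_pca = pca.transform(X_val)

print(X_train_pca.shape, X_val_pca.shape)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```

The reduced matrices would then be the input to the DNN described above.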
Third Model: Raw PE as Image
- Min-Max Normalization (From [0, 255] to [0, 1])
- Reshape to (32, 32, 1)
- Build CNN model
  - Input + Conv2D(32, 4x4) + Conv2D(64, 4x4) + MaxPooling2D(2x2) + Conv2D(128, 4x4) + Conv2D(128, 4x4) + MaxPooling2D(2x2) + Flatten + Dense(256) + Dropout(0.4) + Dense(1)
  - Use Adam optimizer with learning rate = 0.000003 and Early stopping
- Result: Training Accuracy: 95.11%, Validation Accuracy: 85.1%
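
The image pipeline above can be sketched as follows. It reads the kernel and pooling sizes as 4x4 and 2x2, which is an assumption about the layer list. The activations, the default "valid" padding, the loss, and the random toy pixels are also assumptions rather than details from the notebook.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
# Toy stand-in: 8 samples of 1024 byte values in [0, 255].
pixels = rng.integers(0, 256, size=(8, 1024))

# Min-max normalization from [0, 255] to [0, 1], then reshape to (32, 32, 1).
images = (pixels.astype("float32") / 255.0).reshape(-1, 32, 32, 1)

image_model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),
    layers.Conv2D(32, (4, 4), activation="relu"),   # activations are assumptions
    layers.Conv2D(64, (4, 4), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (4, 4), activation="relu"),
    layers.Conv2D(128, (4, 4), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])
image_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.000003),
    loss="binary_crossentropy",                     # assumed loss
    metrics=["accuracy"],
)

# Forward pass only; training would add an EarlyStopping callback to fit().
predictions = image_model.predict(images, verbose=0)
print(images.shape, predictions.shape)
```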

Model Ensemble

First, combine the three models, then add a Dense (fully connected) layer with 16 neurons, followed by a Dense layer with 1 neuron as the final output.

The result of the ensemble model:

- Use Adam optimizer with learning rate = 0.0003 and Early stopping
- Result: Training Accuracy: 98.58%, Validation Accuracy: 95.99%
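
One way to build such an ensemble with the Keras functional API is sketched below. It assumes the three branch outputs are joined by concatenation, which this README does not state. The tiny `small_branch` helper is a hypothetical stand-in for the three models described earlier, and the input sizes, activations, and loss are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def small_branch(input_shape, name):
    """Hypothetical stand-in for one of the three individual models."""
    inputs = keras.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Flatten()(inputs)
    x = layers.Dense(8, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name=name)


branches = [
    small_branch((10,), "section_headers"),  # feature counts are hypothetical
    small_branch((50,), "imports"),
    small_branch((32, 32, 1), "image"),
]

# Join the branch outputs, then add Dense(16) and the final Dense(1).
merged = layers.Concatenate()([b.output for b in branches])
x = layers.Dense(16, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(x)

ensemble = keras.Model(
    inputs=[b.input for b in branches], outputs=output, name="ensemble"
)
ensemble.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0003),
    loss="binary_crossentropy",              # assumed loss
    metrics=["accuracy"],
)

# Each sample needs all three views of the same file, in the same order.
rng = np.random.default_rng(0)
batch = [
    rng.normal(size=(4, 10)).astype("float32"),
    rng.normal(size=(4, 50)).astype("float32"),
    rng.random(size=(4, 32, 32, 1)).astype("float32"),
]
ensemble_pred = ensemble.predict(batch, verbose=0)
print(ensemble_pred.shape)
```

Whether the branches are pre-trained and frozen or trained jointly with the new layers is a design choice that this sketch leaves open.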

| Model | Training Accuracy | Validation Accuracy |
| --- | --- | --- |
| PE Section Headers with DNN | 89.71% | 58.46% |
| Top-1000 PE Imports with DNN | 97.97% | 94.26% |
| Raw PE as Image with CNN | 95.11% | 85.1% |
| Ensemble Model | 98.58% | 95.99% |

Conclusion

The training results suggest that model ensembling helps improve the accuracy of malware detection.

Compared with the individual models, the ensemble has the highest accuracy on both the training data (98.58%) and the validation data (95.99%). On validation, it is 1.73 percentage points above the best individual model (Top-1000 PE Imports, 94.26%).
