There are many ways to implement malware detection, and static analysis is one of them.
Static analysis classifies a program as malicious or benign from data extracted from its PE (Portable Executable) file, such as:
- PE Section Headers
- PE Imports
- PE as Image
Different models can be trained on different parts of the PE data to make predictions.
In this project, I want to know whether a model ensemble can achieve better accuracy on malware detection with static analysis.
In other words, will a model ensemble perform better than any individual model?
1. Ensembling ConvNets using Keras
Reference Link
2. Keras: Multiple Inputs and Mixed Data
Reference Link
[1] Angelo Oliveira, "Malware Analysis Datasets: PE Section Headers", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/2czh-es14. Accessed: Jun. 13, 2020.
[2] Angelo Oliveira, "Malware Analysis Datasets: Top-1000 PE Imports", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/004e-v304. Accessed: Jun. 13, 2020.
[3] Angelo Oliveira, "Malware Analysis Datasets: Raw PE as Image", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/8brp-j220. Accessed: Jun. 13, 2020.
- Merge the DataFrames by hash value and drop duplicated observations
- Add a new column calculated from the original columns
- Do some EDA (Exploratory Data Analysis)
- Because the data is imbalanced, it needs resampling. I use ADASYN for oversampling.
- First Model: PE Section Headers
  - Standardization
  - Build DNN model
    - Dense(32) + Dense(32) + Dense(64) + Dropout(0.2) + Dense(1)
    - Use Adam optimizer with learning rate = 0.0003 and early stopping
    - Result: Training Accuracy: 89.71%, Validation Accuracy: 58.46%
- Second Model: Top-1000 PE Imports (with PCA)
  - Build DNN model
    - Dense(64) + Dense(64) + Dropout(0.4) + Dense(32) + Dense(32) + Dropout(0.2) + Dense(1)
    - Use Adam optimizer with learning rate = 0.0001 and early stopping
    - Result: Training Accuracy: 97.97%, Validation Accuracy: 94.26%
- Third Model: Raw PE as Image
  - Min-Max Normalization (from [0, 255] to [0, 1])
  - Reshape to (32, 32, 1)
  - Build CNN model
    - Input + Conv2D(32, 4x4) + Conv2D(64, 4x4) + MaxPooling2D(2x2) + Conv2D(128, 4x4) + Conv2D(128, 4x4) + MaxPooling2D(2x2) + Flatten + Dense(256) + Dropout(0.4) + Dense(1)
    - Use Adam optimizer with learning rate = 0.000003 and early stopping
    - Result: Training Accuracy: 95.11%, Validation Accuracy: 85.1%
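The data preparation steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the key column name `hash`, the inner join, and the helper names `merge_on_hash`, `standardize`, and `bytes_to_images` are all assumptions made for the example. Oversampling with ADASYN and the PCA step are not shown.

```python
import numpy as np
import pandas as pd


def merge_on_hash(headers, imports, images, key="hash"):
    """Join the three tables on the sample hash and drop duplicated samples.

    The column name `hash` and the inner join are assumptions.
    """
    merged = headers.merge(imports, on=key, suffixes=("", "_imports"))
    merged = merged.merge(images, on=key, suffixes=("", "_image"))
    return merged.drop_duplicates(subset=key).reset_index(drop=True)


def standardize(features):
    """Scale each column to zero mean and unit variance."""
    features = np.asarray(features, dtype="float64")
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid dividing by zero
    return (features - mean) / std


def bytes_to_images(pixels):
    """Map byte values from [0, 255] to [0, 1] and reshape to (n, 32, 32, 1)."""
    scaled = np.asarray(pixels, dtype="float32") / 255.0
    return scaled.reshape(-1, 32, 32, 1)
```

In practice the mean and standard deviation would be computed on the training split only and reused for the validation split, so that no validation statistics leak into training.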
First, ensemble these three models, then add a Dense (fully connected) layer with 16 neurons and a Dense layer with 1 neuron as the final output.
The result of Ensemble Model:
- Use Adam optimizer with learning rate = 0.0003 and Early stopping
- Result : Training Accuracy: 98.58% , Validation Accuracy: 95.99%
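One way to wire up such an ensemble with the Keras functional API is sketched below. It is an illustration under several assumptions, not the exact model of this project: the ReLU and sigmoid activations, the default "valid" padding, the feature counts passed to the hypothetical `build_ensemble` function, and the early-stopping settings are all made up for the example. The sketch also assumes that the three branches are merged at their last hidden layer (before each branch's own Dense(1) output) and trained jointly.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_ensemble(n_header_features, n_import_features):
    # Branch 1: DNN over standardized PE section header features.
    headers_in = keras.Input(shape=(n_header_features,), name="section_headers")
    x1 = layers.Dense(32, activation="relu")(headers_in)
    x1 = layers.Dense(32, activation="relu")(x1)
    x1 = layers.Dense(64, activation="relu")(x1)
    x1 = layers.Dropout(0.2)(x1)

    # Branch 2: DNN over the PCA-reduced Top-1000 PE imports.
    imports_in = keras.Input(shape=(n_import_features,), name="pe_imports")
    x2 = layers.Dense(64, activation="relu")(imports_in)
    x2 = layers.Dense(64, activation="relu")(x2)
    x2 = layers.Dropout(0.4)(x2)
    x2 = layers.Dense(32, activation="relu")(x2)
    x2 = layers.Dense(32, activation="relu")(x2)
    x2 = layers.Dropout(0.2)(x2)

    # Branch 3: CNN over the 32x32 single-channel PE image.
    image_in = keras.Input(shape=(32, 32, 1), name="pe_image")
    x3 = layers.Conv2D(32, (4, 4), activation="relu")(image_in)
    x3 = layers.Conv2D(64, (4, 4), activation="relu")(x3)
    x3 = layers.MaxPooling2D((2, 2))(x3)
    x3 = layers.Conv2D(128, (4, 4), activation="relu")(x3)
    x3 = layers.Conv2D(128, (4, 4), activation="relu")(x3)
    x3 = layers.MaxPooling2D((2, 2))(x3)
    x3 = layers.Flatten()(x3)
    x3 = layers.Dense(256, activation="relu")(x3)
    x3 = layers.Dropout(0.4)(x3)

    # Ensemble head: concatenate the branches, then Dense(16) + Dense(1).
    merged = layers.Concatenate()([x1, x2, x3])
    hidden = layers.Dense(16, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid")(hidden)

    model = keras.Model(inputs=[headers_in, imports_in, image_in], outputs=output)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.0003),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Early stopping; the monitored metric and patience are assumptions.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
```

Training would then pass the three inputs as a list, for example `model.fit([X_headers, X_imports, X_images], y, validation_data=..., callbacks=[early_stop])`, where the array names are placeholders. An alternative design is to concatenate the three branches' Dense(1) predictions instead of their hidden features, which gives a much smaller head.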
| Model | Training Accuracy | Validation Accuracy |
|---|---|---|
| PE Section Headers with DNN | 89.71% | 58.46% |
| Top-1000 PE Imports with DNN | 97.97% | 94.26% |
| Raw PE as Image with CNN | 95.11% | 85.1% |
| Ensemble Model | 98.58% | 95.99% |
The training results show that the model ensemble does improve the accuracy of malware detection.
Compared with the individual models, it has the highest accuracy on both the training data (98.58%) and the validation data (95.99%, versus 94.26% for the best individual model, Top-1000 PE Imports with DNN).