Malware detection is an important part of modern computing that helps protect systems from infection. The goal of any project, program, or system that aims to detect malware is to prevent malicious software from running on a user's computer. With this project, we aim to assist in the battle against malicious software by creating a model that detects and labels programs as either malware or benign software. For this project, we used a Deep Neural Network (DNN) model.
The architecture of our model, shown above, is built from blocks consisting of a dense layer with ReLU activation, a batch normalization layer, and finally a dropout layer. As shown in the diagram, we stack 10 of these blocks. This project takes inspiration from the paper “Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability” by Benjamin Marais, Tony Quertier, and Christophe Chesneau.
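As a rough Keras sketch of this architecture (the layer width of 128, the dropout rate of 0.2, and the two-unit output below are placeholders rather than the exact values used in src/ModelClass.py):

import tensorflow as tf

def build_model_sketch(input_dim=2381, num_blocks=10):
    """Sketch only: 10 blocks of Dense + BatchNorm + Dropout, then a 2-class output."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(input_dim,)))  # EMBER 2018 (v2) feature vectors are 2381-dimensional
    for _ in range(num_blocks):
        model.add(tf.keras.layers.Dense(128, activation="relu"))  # placeholder width
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Dropout(0.2))                   # placeholder rate
    model.add(tf.keras.layers.Dense(2))  # logits for benign vs. malware
    return model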
Colab notebook: https://colab.research.google.com/drive/134uvYzJ9QGpv0qjxN85GuS5xXco3R2zv?usp=sharing
├── src
│   ├── ModelClass.py
│   ├── feature_vectorization.py
│   ├── features.py
│   └── train.py
├── test
│   ├── environment_test.py
│   └── sanity_test.py
├── .github
│   └── workflows
│       └── run_all_tests.yaml
├── requirements.txt
├── test-requirements.txt
├── README.md
└── .gitignore
- src/ModelClass.py: script that builds the sequential DNN model used by the training script
- src/feature_vectorization.py: script that creates feature vectors (essentially arrays) for all files in the dataset, along with each sample's id, hash, date, label, class, and subset
- src/features.py: contains the classes used to extract and organize features from the files in the dataset (see the sketch after this list)
- src/train.py: training script for the model that takes in X_train, y_train, X_test, and y_test
- test/environment_test.py: test file that checks whether the environment is set up correctly
- test/sanity_test.py: test file that checks nothing is broken
- .github/workflows/run_all_tests.yaml: workflow that runs all the tests on every commit to check that everything still passes
- requirements.txt: lists the packages required to run the project
- test-requirements.txt: lists the additional packages required for testing
- README.md: overview of repository
- .gitignore: lists the files and directories that Git should ignore
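For reference, the kind of per-file feature extraction that src/features.py is concerned with can be sketched with EMBER's own extractor; this is only an illustration (the sample path is a placeholder), not how features.py itself is implemented:

import numpy as np
from ember.features import PEFeatureExtractor

# Placeholder path to a PE file to analyze
with open("sample.exe", "rb") as f:
    file_data = f.read()

# Extract the fixed-length EMBER feature vector (feature version 2) for a single file
extractor = PEFeatureExtractor(2)
feature_vector = np.array(extractor.feature_vector(file_data), dtype=np.float32)
print(feature_vector.shape)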
pip install git+https://github.com/elastic/ember.git
pip install -r requirements.txt
pip install -r test-requirements.txt
pip install opendatasets
import opendatasets as od
import tarfile
import ember
import os

# Download the EMBER 2018 archive and extract it into the working directory
od.download("https://ember.elastic.co/ember_dataset_2018_2.tar.bz2")
tar = tarfile.open("./ember_dataset_2018_2.tar.bz2", "r:bz2")
tar.extractall()
tar.close()
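Our src/feature_vectorization.py handles turning the extracted JSONL feature files into training arrays; the ember package also ships helpers for the same step, roughly as follows (the data directory is assumed here to be ./ember2018, i.e. wherever the archive extracted to):

import ember

data_dir = "./ember2018"  # assumed extraction directory

# One-time (slow) conversion of the raw JSONL feature files into vectorized arrays
ember.create_vectorized_features(data_dir)

# Load the vectorized features and labels for each subset
X_train, y_train = ember.read_vectorized_features(data_dir, subset="train")
X_test, y_test = ember.read_vectorized_features(data_dir, subset="test")

# The train split also contains unlabeled samples (label == -1), which are
# typically filtered out before training
X_train, y_train = X_train[y_train != -1], y_train[y_train != -1]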
import tensorflow as tf
import matplotlib.pyplot as plt

# build_model, vectorization, loss, and grad are helpers defined earlier in the
# notebook (see the scripts in src/)
model = build_model()

# Vectorize the EMBER dataset (this path is specific to the machine used here)
X_train, y_train, X_test, y_test, comldf = vectorization(
    'C:\\Users\\amant\\Documents\\Anaconda_Envs\\coml_final\\ember2018\\')

# Build a batched tf.data pipeline from the vectorized training arrays
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)

# Sanity-check the untrained model's loss on a single batch
features, labels = next(iter(train_dataset))
l = loss(model, features, labels, training=False)
print("Loss test: {}".format(l))

# Set up optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Calculate a single optimization step
loss_value, grads = grad(model, features, labels)
print("Step: {}, Initial Loss: {}".format(
    optimizer.iterations.numpy(), loss_value.numpy()))
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print("Step: {}, Loss: {}".format(optimizer.iterations.numpy(),
                                  loss(model, features, labels, training=True).numpy()))

# Train Model
# Note: Rerunning this cell uses the same model variables

# Keep results for plotting
train_loss_results = []
train_accuracy_results = []

num_epochs = 1

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model on the current batch
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())

    if epoch % 50 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(
            epoch, epoch_loss_avg.result(), epoch_accuracy.result()))

# Plot the per-epoch loss and accuracy
fig, axes = plt.subplots(2, sharex=True, figsize=(12, 8))
fig.suptitle('Training Metrics')
axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(train_loss_results)
axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Epoch", fontsize=14)
axes[1].plot(train_accuracy_results)
plt.show()
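After training, the learned weights could be saved so that a separate evaluation script can reuse them; the file name below is only a placeholder:

# Persist the trained weights (placeholder file name)
model.save_weights("malware_dnn.weights.h5")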
The test script is incomplete; if it were finished, it would load our model weights and use them to predict whether a given piece of software is malicious or benign.
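As an illustration only (not the actual test script), evaluation from the saved weights could look like the following, reusing the placeholder weight file and the X_test/y_test arrays from above:

# Rebuild the architecture and load the previously saved weights (placeholder file name)
model = build_model()
model.load_weights("malware_dnn.weights.h5")

# Predict in batches and compare against the ground-truth labels (0 = benign, 1 = malware)
logits = model.predict(X_test, batch_size=512)
predictions = logits.argmax(axis=1)
print("Test accuracy: {:.3%}".format((predictions == y_test).mean()))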
- B. Marais, T. Quertier, and C. Chesneau, “Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability,” Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference, pp. 43–55, 2021.
- H. S. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” arXiv, 2018.