CTU, MCFP, and malware-traffic-analysis.net
[1]Z. Niu, J. Xue, D. Qu, Y. Wang, J. Zheng, and H. Zhu, “A novel approach based on adaptive online analysis of encrypted traffic for identifying Malware in IIoT,” Information Sciences, vol. 601, pp. 162–174, Jul. 2022, doi: https://doi.org/10.1016/j.ins.2022.04.018.
This repository presents the implementation of the algorithm described in paper [1]. The process involves the following steps:
- Downloading data from multiple sources.
- Extracting TLS features from pcap files with tools such as Zui, Zed, and Brimcap, producing CSV files (conn.csv, ssl.csv, and x509.csv) that correspond to the log files described in the research article.
- Preprocessing the data, including merging CSV files based on their correlation and extracting text features. The data is then prepared for subsequent model processing.
- Implementing the algorithm from the paper using scikit-multiflow.
- Experimenting with the extracted data using the implemented IARF (Incremental Adaptive Random Forest) algorithm from step 4.
- Concluding the results obtained from the entire workflow.
The paper mentions four datasets: malware-traffic-analysis.net, CTU-13, MCFP, and Lastline, Inc. However, information about the last dataset is not available.
Because the data is spread across a large number of links that are impractical to download manually, a Python script (malware-download.py) is provided for automated downloading. Use the following command to download the data:
usage: malware-download.py [-h] [--password PASSWORD] [--url URL] [--file FILE] [--debug] [--folder FOLDER] [-sy SY] [-ey EY]
options:
  -h, --help            show this help message and exit
  --password PASSWORD, -pwd PASSWORD
                        Password used to extract zip files
  --url URL, -u URL     URL of the web page containing the download links
  --file FILE, -f FILE  File containing the download links
  --debug               Enable debug mode
  --folder FOLDER, -fd FOLDER
                        Folder to save downloaded files
  -sy SY                Start year
  -ey EY                End year
python3 malware-download.py --password infected --folder downloads
This command will download all files from the start year (default 2013) to the end year (default 2024) into the "downloads" folder in the current working directory.
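The year-range options suggest the downloader walks one index page per year on malware-traffic-analysis.net. A minimal sketch of that loop, assuming the site's `/<year>/index.html` layout; the function name and URL pattern are illustrative, not taken from malware-download.py itself:

```python
def year_index_urls(start_year=2013, end_year=2024,
                    base="https://www.malware-traffic-analysis.net"):
    """Build one index-page URL per year in [start_year, end_year].

    Hypothetical helper mirroring the -sy/-ey flags; the real script
    may construct its URLs differently.
    """
    return [f"{base}/{year}/index.html"
            for year in range(start_year, end_year + 1)]
```

Each index page would then be scraped for per-post archive links, which are extracted with the `--password` value (the site's archives are conventionally protected with the password "infected").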
Similar to malware-traffic-analysis.net, you can download CTU-13 and MCFP with the following commands:
usage: Download help for CTU13 and MCFP [-h] [--folder FOLDER] [--url URL] [--substr SUBSTR] [--suffix SUFFIX] [--file FILE] [--limit LIMIT] [--cmd CMD]
options:
  -h, --help       show this help message and exit
  --folder FOLDER  Download folder name
  --url URL        Base URL
  --substr SUBSTR  Substring the download URL must contain
  --suffix SUFFIX  File suffix, e.g. .pcap, .txt
  --file FILE      File containing the URLs to download
  --limit LIMIT    Enable a size limit
  --cmd CMD        List and file have suffix
python3 src/python3/CTU13-download.py --folder CTU-13 --url https://www.stratosphereips.org/datasets-ctu13
python3 src/python3/CTU13-download.py --folder MCFP/Normal --url https://www.stratosphereips.org/datasets-malware
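The `--substr`/`--suffix` flags suggest the script scrapes the dataset page and keeps only links matching those filters. A stdlib-only sketch of that filtering step (class and function names are illustrative, not from CTU13-download.py):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def filter_links(html, substr="", suffix=""):
    """Return links that contain `substr` and end with `suffix`."""
    parser = LinkCollector()
    parser.feed(html)
    return [u for u in parser.links if substr in u and u.endswith(suffix)]
```

For example, `filter_links(page, substr="CTU-Malware", suffix=".pcap")` would keep only the pcap capture links.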
Use the provided script (pcap_to_zng.sh) to analyze pcap files with Brimcap and convert them to the Zui data format (ZNG):
./src/sh/pcap_to_zng.sh <folder of pcap files> <folder to store zng files>
Example: ./src/sh/pcap_to_zng.sh MCFP/Normal ZNG/MCFP/Normal
ZNG is Zui's data format. Import the ZNG files into a pool in the Zed lake so they can be exported to CSV:
./src/ps1/load_zng_to_pool.ps1 <directory of ZNG files> <name of malware> <pool name>
Example: ./src/ps1/load_zng_to_pool.ps1 malware-traffic-analysis.net Dridex Dridex
First, click a pool (Dridex, for example) and choose "Query Tool" in the top-right corner.
Then right-click the conn path (or ssl, or x509) and select "filter == value"; the result will look similar to the image below.
In the top-left corner, select File -> Export Result As -> CSV to export the CSV, and save it in whichever folder you use for the later steps.
Use the Python script (merge_csv.py) to merge CSV files:
python3 preprocessing/merge_csv.py <conn> <ssl> <x509> <output>
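Zeek-style conn and ssl records share a connection `uid`, while x509 records are typically linked through a certificate-fingerprint column; the exact join keys are defined in merge_csv.py. A minimal stdlib sketch of the conn/ssl half of such a merge, with illustrative names:

```python
def merge_logs(conn_rows, ssl_rows, join_key="uid"):
    """Left-join ssl records onto conn records by connection uid.

    Illustrative only: merge_csv.py may use different keys and also
    join x509 records via a certificate fingerprint.
    """
    ssl_by_uid = {r[join_key]: r for r in ssl_rows}
    merged = []
    for c in conn_rows:
        row = dict(c)                          # keep all conn columns
        row.update(ssl_by_uid.get(c[join_key], {}))  # add ssl columns if present
        merged.append(row)
    return merged
```

Connections without a TLS handshake simply keep their conn columns, which mirrors how a left join leaves the ssl fields empty.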
Apply label encoding and a mean-ASCII transform to convert text features into numerical features:
python3 preprocessing/label_encoding.py <src_csv> <dst_csv>
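"Mean ASCII" presumably maps a string to the average of its characters' code points, and label encoding maps each distinct category to an integer. A sketch of both transforms under that assumption (label_encoding.py may handle edge cases differently):

```python
def mean_ascii(text):
    """Map a string to the mean of its characters' code points.

    Empty strings map to 0.0 (an assumed convention).
    """
    if not text:
        return 0.0
    return sum(ord(ch) for ch in text) / len(text)


def label_encode(values):
    """Map each distinct value to an integer index (sorted order),
    as sklearn's LabelEncoder does."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```

For example, a TLS cipher-suite string becomes a single float via `mean_ascii`, while a categorical column such as the server name becomes integer codes via `label_encode`.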
Use preprocessing/label.py to label the features.
Compared with the Adaptive Random Forest (ARF), the Improved Adaptive Random Forest (IARF) adds a validation step that checks whether the model predicts each incoming sample correctly. When a sample is mispredicted, it is added to the training set for further model refinement.
from skmultiflow.drift_detection import ADWIN
from skmultiflow.meta import AdaptiveRandomForestClassifier as ARF
from skmultiflow.utils import check_weights, get_dimensions


class IARF(ARF):
    def __init__(self, n=10, m='auto', a=0.01, b=0.001):
        '''
        - m: maximum features per split
        - n: number of base trees
        - a: warning threshold
        - b: drift threshold
        '''
        self.max_features = m
        self.warning_detector = ADWIN(a)
        self.drift_detector = ADWIN(b)
        self.n_models = n
        super().__init__(n_estimators=self.n_models,
                         max_features=self.max_features,
                         warning_detection_method=self.warning_detector,
                         drift_detection_method=self.drift_detector)

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Partially (incrementally) fit the model.

        Parameters
        ----------
        X : numpy.ndarray of shape (n_samples, n_features)
            The features to train the model.
        y : numpy.ndarray of shape (n_samples)
            An array-like with the class labels of all samples in X.
        classes : numpy.ndarray, list, optional (default=None)
            Array with all possible/known class labels. Optional, except
            for the first partial_fit call, where it is compulsory.
        sample_weight : numpy.ndarray of shape (n_samples), optional (default=None)
            Sample weights. If not provided, uniform weights are assumed.

        Returns
        -------
        self
        """
        if self.classes is None and classes is not None:
            self.classes = classes
        if sample_weight is None:
            weight = 1.0
        else:
            weight = sample_weight
        if y is not None:
            row_cnt, _ = get_dimensions(X)
            weight = check_weights(weight, expand_length=row_cnt)
            for i in range(row_cnt):
                # Validation step: skip samples the model already predicts correctly.
                if self.predict([X[i]]) == y[i]:
                    continue
                if weight[i] != 0.0:
                    self._train_weight_seen_by_model += weight[i]
                    self._partial_fit(X[i], y[i], self.classes, weight[i])
        return self
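The test-then-train filter inside `partial_fit` can be illustrated without skmultiflow. A toy sketch with a stand-in incremental model (the class and function names here are hypothetical) shows that only mispredicted samples ever reach training:

```python
class MajorityClass:
    """Toy incremental model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}
        self.trained_on = 0  # samples that actually reached training

    def predict(self, x):
        if not self.counts:
            return None  # untrained model: always "wrong"
        return max(self.counts, key=self.counts.get)

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
        self.trained_on += 1


def iarf_style_fit(model, stream):
    """Train only on samples the model currently gets wrong,
    mirroring the validation step in IARF.partial_fit."""
    for x, y in stream:
        if model.predict(x) == y:
            continue  # correct prediction: skip training
        model.train(x, y)
    return model
```

On a stream of three 'a' samples followed by one 'b', the model trains on the first 'a' (it is untrained) and on the 'b' (a misprediction), while the remaining 'a' samples are skipped: two training updates instead of four.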
For detailed experimentation procedures and results, refer to the notebook files in the modeling folder.
While the experimental outcomes do not surpass those reported in the original paper, they represent a significant personal achievement and contribute to the broader knowledge in the field. The results are summarized in the precision/recall/F1 table below.
              precision    recall  f1-score   support

      Normal       0.95      0.98      0.96      2021
      Dridex       1.00      1.00      1.00      1933
    Trickbot       0.99      0.99      0.99      1943
     Vawtrak       0.99      0.99      0.99      1659
      Tiuref       0.99      0.98      0.98      1045
        Zeus       0.91      0.97      0.94       388
    Tancitor       0.94      0.87      0.91       684
  Zeus-panda       0.96      0.79      0.87       120
    Treambot       1.00      1.00      1.00        82
     Gootkit       1.00      0.95      0.97        61

    accuracy                           0.98      9936
   macro avg       0.97      0.95      0.96      9936
weighted avg       0.98      0.98      0.98      9936