Source

CTU and MCFP malware-analysis-net

References

[1] Z. Niu, J. Xue, D. Qu, Y. Wang, J. Zheng, and H. Zhu, “A novel approach based on adaptive online analysis of encrypted traffic for identifying Malware in IIoT,” Information Sciences, vol. 601, pp. 162–174, Jul. 2022, doi: https://doi.org/10.1016/j.ins.2022.04.018.

Introduction

This repository presents the implementation of the algorithm described in paper [1]. The process involves the following steps:

  1. Downloading data from multiple sources.
  2. Extracting TLS features from pcap files using tools such as Zui, Zed, and Brimcap. The output is three CSV files (conn.csv, ssl.csv, and x509.csv) that correspond to the log files of the same names in the research article.
  3. Preprocessing the data, including merging CSV files based on their correlation and extracting text features. The data is then prepared for subsequent model processing.
  4. Implementing the algorithm from the paper using scikit-multiflow.
  5. Experimenting with the extracted data using the IARF (Improved Adaptive Random Forests) algorithm implemented in step 4.
  6. Summarizing the results obtained from the entire workflow.

Guide

1. Download the datasets

The paper mentions four datasets: malware-traffic-analysis, CTU-13, MCFP, and Lastline.Inc. However, information about the last dataset is not available.

1.1 Download malware-traffic-analysis

The captures are spread across many separate links and are impractical to download by hand, so a Python script (malware-download.py) automates the download. Its usage and an example command follow:

usage: malware-download.py [-h] [--password PASSWORD] [--url URL] [--file FILE] [--debug] [--folder FOLDER] [-sy SY] [-ey EY]

options:
  -h, --help            show this help message and exit
  --password PASSWORD, -pwd PASSWORD
                        Password is used to extract zip file
  --url URL, -u URL     Url of web contains download links
  --file FILE, -f FILE  File contains links
  --debug               Enable debug mode
  --folder FOLDER, -fd FOLDER
                        Folder to save downloaded files
  -sy SY                Start year
  -ey EY                End year

python3 malware-download.py --password infected --folder downloads

This command will download all files from the start year (default 2013) to the end year (default 2024) into the "downloads" folder in the current working directory.
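The --password option exists because the downloaded archives are password-protected zip files. As a minimal sketch of that extraction step (not the code of malware-download.py itself), the hypothetical helper extract_archive below uses only the standard zipfile module and assumes the archives use the legacy zip encryption that zipfile can decrypt:

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir, password="infected"):
    """Extract one archive into dest_dir and return the extracted paths.

    Hypothetical helper for illustration; the default password mirrors the
    example command in this guide.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile can only decrypt the legacy ZipCrypto scheme; AES-encrypted
        # archives would need an external tool instead.
        archive.extractall(path=dest, pwd=password.encode("utf-8"))
        return [dest / name for name in archive.namelist()]
```

Calling extract_archive("downloads/sample.zip", "downloads/sample") would unpack one archive next to it; both paths are placeholders.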

1.2 Download CTU-13 and MCFP datasets.

Similar to Malware-Traffic-Analysis, you can use the following commands:

usage: Download help for CTU13 and MCFP [-h] [--folder FOLDER] [--url URL] [--substr SUBSTR] [--suffix SUFFIX] [--file FILE] [--limit LIMIT] [--cmd CMD]

options:
  -h, --help       show this help message and exit
  --folder FOLDER  Download folder name
  --url URL        Base url
  --substr SUBSTR  Substr in url
  --suffix SUFFIX  Suffix of file, ex: .pcap, .txt
  --file FILE      File contains urls to download
  --limit LIMIT    Enable limit size
  --cmd CMD        List and file have suffix

python3 src/python3/CTU13-download.py --folder CTU-13 --url https://www.stratosphereips.org/datasets-ctu13
python3 src/python3/CTU13-download.py --folder MCFP/Normal --url https://www.stratosphereips.org/datasets-malware
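The --substr and --suffix options suggest that the script narrows the links found on the page before downloading. A rough sketch of such a filter is shown here; filter_links is a hypothetical name, and the real script may apply its options differently:

```python
def filter_links(links, substr=None, suffix=None):
    """Keep links that contain substr (if given) and end with suffix (if given).

    Hypothetical helper for illustration only.
    """
    selected = []
    for link in links:
        if substr and substr not in link:
            continue  # link does not mention the wanted dataset
        if suffix and not link.endswith(suffix):
            continue  # link points at a different file type
        selected.append(link)
    return selected
```

With suffix=".pcap", for example, only capture files would remain and text or binary artifacts on the same page would be skipped.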

2. Extract TLS features.

2.1. Use Brimcap to analyze pcap files

Utilize the provided script (pcap_to_zng.sh) to analyze pcap files using Brimcap and convert them to Zui data format (ZNG):

./src/sh/pcap_to_zng.sh <folder of pcap files> <folder to store zng files>

Example: ./src/sh/pcap_to_zng.sh MCFP/Normal ZNG/MCFP/Normal
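For readers who prefer Python to the shell script, the folder-to-folder conversion can be planned as below. This is a sketch under assumptions, not a port of pcap_to_zng.sh: plan_conversions is a hypothetical name, and the command line assumes that `brimcap analyze <pcap>` writes ZNG to standard output.

```python
from pathlib import Path


def plan_conversions(pcap_dir, zng_dir):
    """Pair every .pcap under pcap_dir with a .zng path under zng_dir.

    Returns a list of (command, target_path) tuples; nothing is executed.
    """
    src_root, dst_root = Path(pcap_dir), Path(zng_dir)
    jobs = []
    for pcap in sorted(src_root.rglob("*.pcap")):
        # Mirror the input folder layout in the output folder.
        target = dst_root / pcap.relative_to(src_root).with_suffix(".zng")
        # Assumed invocation: `brimcap analyze <pcap>` emits ZNG on stdout.
        jobs.append((["brimcap", "analyze", str(pcap)], target))
    return jobs
```

Each job could then be run with subprocess.run(command, stdout=...) while writing standard output to the target path, after creating the target's parent folder.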

2.2. Load ZNG to Zed lake

ZNG files are the data format of Zui. Import them into a pool in the lake to make them exportable to CSV:

./src/ps1/load_zng_to_pool.ps1 <directory of ZNG files> <name of malware> <pool name>

Example: ./src/ps1/load_zng_to_pool.ps1 malware-traffic-analysis.net Dridex Dridex

2.3. Export conn, ssl, x509.

First, click on a pool (Dridex, for example) and choose "Query Tool" in the top-right corner.

Next, right-click on the path value conn (or ssl, or x509) and choose "filter == value". The result should look similar to the "Pool Query" screenshot.

Finally, in the top-left corner, select File -> Export Result As -> CSV and save the CSV file in a folder of your choice for the following steps.

3. Preprocessing data

3.1 Merge data

Use the Python script (merge_csv.py) to merge CSV files:

python3 preprocessing/merge_csv.py <conn> <ssl> <x509> <output>
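The idea behind the merge can be sketched in plain Python. The column names used here (uid, cert_chain_fuids, id) are assumptions based on common Zeek log fields and may not match the exported CSV files or the logic in merge_csv.py; merge_logs is a hypothetical name.

```python
def merge_logs(conn_rows, ssl_rows, x509_rows):
    """Join conn -> ssl on uid, then attach the first certificate of each chain.

    Each argument is a list of dicts, e.g. as produced by csv.DictReader.
    """
    conn_by_uid = {row["uid"]: row for row in conn_rows}
    cert_by_id = {row["id"]: row for row in x509_rows}
    merged = []
    for ssl in ssl_rows:
        conn = conn_by_uid.get(ssl["uid"])
        if conn is None:
            continue  # TLS record without a matching connection
        # Assumption: cert_chain_fuids is a comma-separated list of file ids.
        chain = [f for f in ssl.get("cert_chain_fuids", "").split(",") if f]
        cert = cert_by_id.get(chain[0], {}) if chain else {}
        row = dict(conn)
        # Prefix the joined columns so that names from different logs cannot clash.
        row.update({f"ssl_{k}": v for k, v in ssl.items() if k != "uid"})
        row.update({f"x509_{k}": v for k, v in cert.items() if k != "id"})
        merged.append(row)
    return merged
```

Only connections that carried TLS appear in the output, which fits a feature set built around encrypted traffic.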

3.2 Text feature extraction

Apply Label Encoder and Mean ASCII to convert text features into numerical features:

python3 preprocessing/label_encoding.py <src_csv> <dst_csv>
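Both conversions are simple enough to illustrate without any library. The following sketch assumes that "Mean ASCII" means the average character code of a string and that label encoding assigns integers in sorted order of the distinct values; mean_ascii and label_encode are hypothetical names, and label_encoding.py may differ in detail.

```python
def mean_ascii(text):
    """Average character code of text; empty strings map to 0.0."""
    if not text:
        return 0.0
    return sum(ord(ch) for ch in text) / len(text)


def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values.

    Returns the encoded list together with the value-to-integer mapping.
    """
    mapping = {value: index for index, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping
```

Label encoding suits low-cardinality fields such as a TLS version, while a mean character code gives a fixed-size number for free-form strings such as a certificate subject.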

Use preprocessing/label.py to label the features.

4. Implement IARF.

Improved Adaptive Random Forests (IARF) differs from Adaptive Random Forest (ARF) by adding a validation step: before training on a sample, the model first predicts its label. Correctly predicted samples are skipped, and only mispredicted samples are used to update the model.

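# Assumed imports from scikit-multiflow; ARF is taken to be AdaptiveRandomForestClassifier.
from skmultiflow.drift_detection import ADWIN
from skmultiflow.meta import AdaptiveRandomForestClassifier as ARF
from skmultiflow.utils import check_weights, get_dimensions


class IARF(ARF):
    def __init__(self, n=10, m='auto', a=0.01, b=0.001):
        '''
        - n: number of base trees
        - m: maximum number of features considered per split
        - a: warning threshold (passed to ADWIN)
        - b: drift threshold (passed to ADWIN)
        '''

        self.max_features = m
        self.warning_detector = ADWIN(a)
        self.drift_detector = ADWIN(b)
        self.n_models = n
        super().__init__(n_estimators=self.n_models, max_features=self.max_features,
                         warning_detection_method=self.warning_detector,
                         drift_detection_method=self.drift_detector)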


    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Partially (incrementally) fit the model.

        Parameters
        ----------
        X : numpy.ndarray of shape (n_samples, n_features)
            The features to train the model.

        y: numpy.ndarray of shape (n_samples)
            An array-like with the class labels of all samples in X.

        classes: numpy.ndarray, list, optional (default=None)
            Array with all possible/known class labels. This is an optional parameter, except
            for the first partial_fit call where it is compulsory.

        sample_weight: numpy.ndarray of shape (n_samples), optional (default=None)
            Samples weight. If not provided, uniform weights are assumed.

        Returns
        -------
        self

        """
        if self.classes is None and classes is not None:
            self.classes = classes

        if sample_weight is None:
            weight = 1.0
        else:
            weight = sample_weight

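        if y is not None:
            row_cnt, _ = get_dimensions(X)
            weight = check_weights(weight, expand_length=row_cnt)
            for i in range(row_cnt):
                # Validation step: skip samples the current model already predicts correctly.
                if self.predict([X[i]])[0] == y[i]:
                    continue
                # Train only on mispredicted samples that carry a non-zero weight.
                if weight[i] != 0.0:
                    self._train_weight_seen_by_model += weight[i]
                    self._partial_fit(X[i], y[i], self.classes, weight[i])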

        return self
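The validation step can be seen in isolation with a toy learner. The sketch below is not the scikit-multiflow implementation: MajorityByKey, predict_one, learn_one, and fit_on_errors are hypothetical names chosen for illustration, and the learner simply remembers the most frequent label per input value.

```python
from collections import Counter, defaultdict


class MajorityByKey:
    """Toy incremental learner: predicts the most frequent label seen per key."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict_one(self, x):
        seen = self.counts.get(x)  # .get() avoids creating an empty entry
        return seen.most_common(1)[0][0] if seen else None

    def learn_one(self, x, y):
        self.counts[x][y] += 1


def fit_on_errors(model, X, y):
    """Update the model only with samples it currently mispredicts."""
    updates = 0
    for xi, yi in zip(X, y):
        if model.predict_one(xi) == yi:
            continue  # already predicted correctly, so skip the update
        model.learn_one(xi, yi)
        updates += 1
    return updates
```

On a stream where most samples are already classified correctly, the number of updates stays well below the number of samples, which is the saving this loop structure is meant to provide.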

5. Experiment.

For detailed experimentation procedures and results, refer to the notebook files in the modeling folder.

6. Conclusion.

The results do not surpass those reported in the original paper [1], but the overall accuracy reaches 0.98 on 9,936 samples. Per-class precision, recall, and F1-score are summarized in the table below.

              precision    recall  f1-score   support

      Normal       0.95      0.98      0.96      2021
      Dridex       1.00      1.00      1.00      1933
    Trickbot       0.99      0.99      0.99      1943
     Vawtrak       0.99      0.99      0.99      1659
      Tiuref       0.99      0.98      0.98      1045
        Zeus       0.91      0.97      0.94       388
    Tancitor       0.94      0.87      0.91       684
  Zeus-panda       0.96      0.79      0.87       120
    Treambot       1.00      1.00      1.00        82
     Gootkit       1.00      0.95      0.97        61

    accuracy                           0.98      9936
   macro avg       0.97      0.95      0.96      9936
weighted avg       0.98      0.98      0.98      9936
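The two average rows can be checked by hand from the per-class rows. As a short worked example for the recall column, using only the numbers printed in the table above: the macro average is the plain mean over classes, while the weighted average weights each class by its support.

```python
# (recall, support) per class, copied from the table above.
report = {
    "Normal": (0.98, 2021),
    "Dridex": (1.00, 1933),
    "Trickbot": (0.99, 1943),
    "Vawtrak": (0.99, 1659),
    "Tiuref": (0.98, 1045),
    "Zeus": (0.97, 388),
    "Tancitor": (0.87, 684),
    "Zeus-panda": (0.79, 120),
    "Treambot": (1.00, 82),
    "Gootkit": (0.95, 61),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(f"support={total_support} macro={macro_recall:.2f} weighted={weighted_recall:.2f}")
```

The macro value (0.95) is lower than the weighted value (0.98) because the weakest recalls belong to small classes such as Zeus-panda and Tancitor, which carry little weight in the support-weighted average.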