This is the artifact for our ICSE '22 paper "Practical Automated Detection of Malicious npm Packages", which presents an approach to automatically detecting malicious npm packages based on a combination of three components: machine-learning classifiers trained on known samples of malicious and benign npm packages; a reproducer for identifying packages that can be rebuilt from source and hence are unlikely to be malicious; and a clone detector for finding copies of known malicious packages.
We would like to claim an Artifact Available badge, and hence make this data publicly available at https://github.com/githubnext/amalfi-artifact. No specific technology skills are required to use this data. There are no external dependencies, and no setup is required.
The artifact contains the code for training the classifiers, reproducing packages from source and detecting clones; a description of the samples used for initial training; as well as input data and results for the two experiments reported in the paper: classifying and retraining on newly published packages over the course of one week (Section 4.1), and classifying manually labeled packages (Section 4.2). We further explain where to find this data in the repository below.
The artifact does not contain the feature-extraction code, the contents and features of the training samples, the trained classifiers, and the contents and features of the samples considered in our experiments. We further explain why these could not be included below.
The classifier-training code is implemented as a Python script code/training/train_classifier.py. Invoking the script with the --help option prints an explanation of the supported command-line flags. Note that this code is for reference purposes only and cannot be used to replicate our results, since it requires as input the features of the samples comprising the training set, which are not included in the artifact, as explained below.
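For orientation, here is a minimal sketch of what training such a classifier could look like, assuming the features were available in a hypothetical features.csv file with one numeric feature vector per row followed by a malicious/benign label; the actual script's input format and options differ, and the feature representation is not part of the artifact.

```python
# Hypothetical sketch of classifier training; the real train_classifier.py
# uses its own input format and feature extraction, which are not included.
import csv

from sklearn.tree import DecisionTreeClassifier

X, y = [], []
with open("features.csv") as f:          # hypothetical feature file
    for row in csv.reader(f):
        *features, label = row           # last column: "malicious"/"benign"
        X.append([float(v) for v in features])
        y.append(1 if label == "malicious" else 0)

clf = DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict(X[:5]))                # sanity check on the training data
```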
The reproducer is implemented as a shell script code/reproducer/reproduce-package.sh that, given a package name and a version, uses an auxiliary script code/reproducer/build-package.sh to rebuild the package from source and then compares the result to the published package.
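Conceptually, the comparison step checks whether the rebuilt tarball and the published tarball contain the same files with the same contents. The sketch below illustrates this in Python, assuming both tarballs have already been fetched and built (the file names published.tgz and rebuilt.tgz are placeholders); the actual shell scripts also take care of obtaining the sources and running the build.

```python
# Hypothetical sketch of the tarball comparison performed by the reproducer;
# the real shell scripts also fetch the sources and run the build.
import hashlib
import tarfile

def tarball_digests(path):
    """Map each file in a .tgz to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests

published = tarball_digests("published.tgz")   # placeholder file names
rebuilt = tarball_digests("rebuilt.tgz")
if published == rebuilt:
    print("package is reproducible from source")
else:
    diff = {f for f in set(published) | set(rebuilt)
            if published.get(f) != rebuilt.get(f)}
    print("files that differ:", sorted(diff))
```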
The clone detector is implemented as a Python script code/clone-detector/hash_package.py that computes an MD5 hash for a package.
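The sketch below shows one plausible way such a package-level hash could be computed, by hashing file names and contents of a tarball in a fixed order; the exact scheme used by hash_package.py is the one described in Section 3.4 and may differ.

```python
# Hypothetical sketch of computing a package-level MD5 hash; the precise
# hashing scheme used by hash_package.py is described in Section 3.4.
import hashlib
import tarfile

def package_hash(tarball_path):
    """MD5 over file names and contents, processed in sorted order."""
    md5 = hashlib.md5()
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                md5.update(member.name.encode())
                md5.update(tar.extractfile(member).read())
    return md5.hexdigest()

print(package_hash("some-package-1.0.0.tgz"))  # placeholder tarball name
```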
The CSV file data/basic-corpus.csv lists information about the samples constituting the basic corpus our classifiers were trained on (Section 3.3). For each sample, it contains the name and version number of the corresponding npm package, the hash of the sample (computed as described in Section 3.4), and an analysis label indicating whether the sample is malicious or benign.
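To illustrate how this file can be consumed, the snippet below tallies the labels; it assumes the four columns appear in the order listed above and that there is no header row, which may not match the actual layout.

```python
# Hypothetical reader for data/basic-corpus.csv; column order and the
# absence of a header row are assumptions, not guaranteed by the artifact.
import csv
from collections import Counter

labels = Counter()
with open("data/basic-corpus.csv") as f:
    for name, version, sample_hash, label in csv.reader(f):
        labels[label] += 1
print(labels)   # e.g. counts of "malicious" vs. "benign" samples
```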
The CSV files data/july-29.csv to data/august-4.csv list information about the samples considered in Experiment 1, each file covering all new package versions (excluding private packages) published to the npm registry on the corresponding day. The format is the same as for the training set, except that samples that were not manually reviewed are labeled as "not triaged".
Taken together, these files total about 8MB of data.
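A per-day overview can be obtained by iterating over the seven files, for example counting how many samples fall under each label; the intermediate file names are assumed to follow the same naming pattern as the first and last one.

```python
# Hypothetical sketch iterating over the daily CSV files from Experiment 1;
# the intermediate file names are assumed, and no header row is expected.
import csv
from collections import Counter

days = ["july-29", "july-30", "july-31", "august-1",
        "august-2", "august-3", "august-4"]
for day in days:
    labels = Counter()
    with open(f"data/{day}.csv") as f:
        for _name, _version, _hash, label in csv.reader(f):
            labels[label] += 1
    print(day, dict(labels))
```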
The directory results/slide-window contains the results of Experiment 1, again in a series of CSV files named july-29.csv to august-4.csv. For each day, it lists every sample that was labeled as malicious by at least one classifier or by the clone detector. For each such sample, we again list the package name, version, and hash as above, whether the sample was reproducible from source by the reproducer, whether it was found to be malicious by manual analysis, and whether each of the classifiers (decision-tree, naive-bayes, svm, hash) labeled it as malicious.
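As an example of how these result files might be processed, the snippet below counts how often each classifier flagged a sample on one day; it assumes a header row with one true/false column per classifier, which is an assumption about the layout rather than a guarantee.

```python
# Hypothetical sketch summarizing results/slide-window/july-29.csv; the exact
# column names and value encoding are assumptions based on the description above.
import csv
from collections import Counter

flags = Counter()
with open("results/slide-window/july-29.csv") as f:
    for row in csv.DictReader(f):
        for clf in ("decision-tree", "naive-bayes", "svm", "hash"):
            if row.get(clf) == "true":
                flags[clf] += 1
print(flags)   # how often each classifier flagged a sample on that day
```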
The directory results/cross-validation contains the results of the 10-fold cross-validation on our basic corpus performed as part of Experiment 2, with one subdirectory per fold. For each fold, there are three TSV files, one per classifier, each with three columns: package name, package version, and the label assigned by the classifier.
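The per-fold TSV files can be read along the following lines; the fold subdirectory and per-classifier file names used here are placeholders, not necessarily the actual names in the artifact. The same pattern applies to the per-classifier TSV files in results/maloss described below.

```python
# Hypothetical sketch reading the per-classifier TSV files for one fold;
# the subdirectory and file names used here are placeholders.
import csv
from collections import Counter

fold_dir = "results/cross-validation/fold-1"        # placeholder name
for clf_file in ("decision-tree.tsv", "naive-bayes.tsv", "svm.tsv"):
    labels = Counter()
    with open(f"{fold_dir}/{clf_file}") as f:
        for _name, _version, label in csv.reader(f, delimiter="\t"):
            labels[label] += 1
    print(clf_file, dict(labels))
```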
Finally, the directory results/maloss contains the results of running our classifiers on the MalOSS dataset from Duan et al.'s paper "Towards Measuring Supply Chain Attacks on Package Managers for Interpreted Languages". As for the cross-validation experiment, there is one TSV file per classifier, with the same three columns as above.
Taken together, these files total less than 1MB of data.
The directory results/timing contains logs of the time taken by the different stages of Experiment 1. The files results/timing/extract_features_time.csv and results/timing/extract_diffs_time.csv list the timings for extracting the features and the feature differences between versions, respectively, for ~500 randomly chosen packages. Each subdirectory contains the times for training (directory training) and prediction (directory prediction) for each classifier.
The files amount to about 6MB.
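As a sketch of how the timing data might be summarized, the snippet below averages the feature-extraction times; it assumes the last column of each row is a time in seconds, which is an assumption about the file layout rather than something the artifact guarantees.

```python
# Hypothetical sketch averaging the feature-extraction timings; the column
# layout (package identifier followed by a time in seconds) is an assumption.
import csv

times = []
with open("results/timing/extract_features_time.csv") as f:
    for row in csv.reader(f):
        try:
            times.append(float(row[-1]))   # assume the last column is a time in seconds
        except ValueError:
            continue                       # skip a header row, if present
print(f"mean extraction time: {sum(times) / len(times):.2f}s over {len(times)} packages")
```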
We were not able to include the contents of the samples in our basic corpus or the samples considered in Experiment 1, since some of them contain malicious and harmful code.
We were not able to include the features extracted from the samples either. Our approach might be deployed in production at some future date, and we do not want to give a prospective attacker any help in reverse-engineering our technique in order to evade detection.
For the same reason, we were not able to include the feature-extraction code.
Finally, the classifiers trained on the basic corpus and as part of Experiment 1 unfortunately also cannot be made public, again due to concerns about abuse by malicious parties.