Small datasets for testing training scripts, inference scripts, and pipelines.
This is a labeled dataset of 1241 nvidia-smi logs generated once per minute from a single Tesla V100 in our lab environment running either GPU malware or benign workflows. The same data is provided in both CSV and jsonlines formats.
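For example, either file can be loaded into an equivalent DataFrame with pandas; the filenames below are placeholders for the dataset's CSV/jsonlines pair:

```python
import pandas as pd

# Placeholder filenames; point these at the dataset's CSV / jsonlines pair.
df_csv = pd.read_csv("nvsmi-labeled.csv")
df_jsonl = pd.read_json("nvsmi-labeled.jsonlines", lines=True)

# Both files carry the same records, so the row counts should match.
assert len(df_csv) == len(df_jsonl)
```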
This is a synthetic dataset of Azure AD logs with activities of 20 accounts (85 applications involved, 3567 records in total). The activities are split into a training set and an inference set. An anomaly is included in the inference set for model validation. The data was generated using the Python faker package. If there is any resemblance to real individuals, it is purely coincidental.
Training data:
- 3239 records in total
- Time range: 2022/08/01 - 2022/08/29
- Users' log distribution:
  - 5 high volume (>= 300) users
  - 15 medium volume (~100) users
  - 5 light volume (~10) users
Data for the pipeline validation contains an anomalous activity for a single user:
- Account: [email protected]
- Time: 2022/08/31
- Description:
  - Anomalously high log volume (100+)
  - New IP for the account
  - New location for the account (new country, state, city, latitude, longitude)
  - New browser
  - New app access (80 new apps accessed by the account on the day)
This dataset is stored in our S3 bucket. It can be downloaded using a script.
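As a rough illustration of how such synthetic sign-in records can be produced with faker, here is a minimal sketch; the field names and output file are illustrative assumptions, not the dataset's actual schema:

```python
import json
import random
from datetime import datetime, timedelta

from faker import Faker

fake = Faker()
Faker.seed(42)
random.seed(42)

APPS = [fake.company() for _ in range(85)]      # pool of 85 application names
USERS = [fake.user_name() for _ in range(20)]   # 20 synthetic accounts
START = datetime(2022, 8, 1)

def make_record():
    """Build one synthetic Azure AD-style sign-in record (illustrative fields only)."""
    return {
        "time": (START + timedelta(minutes=random.randint(0, 29 * 24 * 60))).isoformat(),
        "userPrincipalName": random.choice(USERS),
        "appDisplayName": random.choice(APPS),
        "ipAddress": fake.ipv4_public(),
        "location": {"city": fake.city(), "countryOrRegion": fake.country_code()},
        "browser": random.choice(["Chrome 103", "Edge 104", "Firefox 102"]),
    }

# Write a small batch of records as jsonlines.
with open("synthetic-azure-signin.jsonl", "w") as f:
    for _ in range(100):
        f.write(json.dumps(make_record()) + "\n")
```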
This is a synthetic dataset of AWS CloudTrail log events with activities from two entities/users, provided in separate files.
Files for user-123 include a single CSV and split JSON versions of the same data:
- dfp-cloudtrail-user123-training-data.csv
- hammah-user123-training-part2.json
- hammah-user123-training-part3.json
- hammah-user123-training-part4.json
Files for role-g include a single CSV and split JSON versions of the same data:
- dfp-cloudtrail-role-g-training-data.csv
- hammah-role-g-training-part1.json
- hammah-role-g-training-part2.json
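To work with either entity as a single table, the split JSON parts can be recombined; a sketch under the assumption that each part can be read directly by pandas (add lines=True to read_json if the parts turn out to be newline-delimited records):

```python
import glob

import pandas as pd

# Hypothetical paths; adjust to where the dataset files live.
parts = sorted(glob.glob("hammah-user123-training-part*.json"))

# read_json options depend on the exact layout of the parts
# (add lines=True if they are newline-delimited JSON records).
frames = [pd.read_json(p) for p in parts]
user123 = pd.concat(frames, ignore_index=True)

print(len(user123), "events for user-123")
```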
This is a small dataset augmented from the artificially generated transaction network demo data from the authors of Inductive Graph Representation Learning for Fraud Detection. The original demo data of 753 labeled transactions was downloaded from the paper's GitHub repo on 02/10/2022 with an MD5 hash of 64af64fcc6e3d55d25111a3f257378a4. We augmented the training dataset to increase benign transactions by replicating that portion of the dataset, for a total of 12053 transactions.
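The augmentation amounts to oversampling the benign class; a minimal sketch of the idea, where the filename, the fraud_label column, and the replication factor are assumptions rather than the dataset's documented schema:

```python
import pandas as pd

# Hypothetical filename and label column; adjust to the actual files and schema.
df = pd.read_csv("fraud-transactions-training.csv")

benign = df[df["fraud_label"] == 0]   # assumed: 0 marks benign transactions
replicas = 15                         # replication factor chosen to reach the desired size
augmented = pd.concat([df] + [benign] * replicas, ignore_index=True)

print(len(df), "->", len(augmented), "transactions after augmentation")
```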
This sample dataset consists of a subset of Apache logs collected from a Linux system running the Apache web server, part of a larger public log dataset on loghub. The file was downloaded on 01/14/2020 with an MD5 hash of 1c3a706386b3ebc03a2ae07a2d864d66. The logs were parsed using an Apache log parsing package to create a labeled dataset.
Log validation data is provided in CSV and JSON formats.
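As a generic illustration of Apache log parsing (shown here for the common/combined access-log format; the dataset's lines may follow a different Apache format, and this is not the specific parsing package used to build the labels):

```python
import re

# Combined Log Format: host ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
match = LOG_RE.match(line)
if match:
    print(match.groupdict())
```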
The SMS Spam Collection is a public set of 5574 labeled SMS messages collected for mobile phone spam research, hosted at the UCI Machine Learning Repository (SMS Spam Collection Data Set), last accessed on 11/09/2022 with an MD5 hash of ab53f9571d479ee677e7b283a06a661a.
During training, 20% of the dataset is randomly selected as the test set and is saved as a jsonlines file for use in pipeline validation.
Additionally, a subset of 100 messages from the dataset was augmented to include sender and recipient information using the Python faker package. If there is any resemblance to real individuals, it is purely coincidental.
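A minimal sketch of the 80/20 split described above, with placeholder filenames and a fixed random seed:

```python
import pandas as pd

# Placeholder filenames; adjust to the actual SMS Spam Collection files.
df = pd.read_csv("sms-spam-collection.csv")

# Randomly hold out 20% of the messages as the test set.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# Save the held-out portion as jsonlines for pipeline validation.
test.to_json("sms-spam-test.jsonlines", orient="records", lines=True)
```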
The dataset was generated by running ransomware and benign processes in a lab environment and recording the output from several Volatility framework plugins, including cmdline, envars, handles, ldrmodules, netscan, pslist, threadlist, and vadinfo. The training CSV file contains 530 columns, a combination of features from the Volatility plugins. This data collection is part of DOCA AppShield.
Training data CSV consists of 87968 preprocessed and labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities.
The validation set contains raw data from 27 AppShield snapshots.
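A small sketch for loading and inspecting the training CSV; the filename is a placeholder, and the per-plugin column grouping only works if the column names actually carry the plugin name as a prefix, which is an assumption:

```python
import pandas as pd

# Hypothetical filename; substitute the actual training CSV from the dataset.
df = pd.read_csv("ransomware-training-data.csv")
print(df.shape)  # inspect row / column counts

# If column names carry the originating plugin as a prefix (an assumption),
# a rough per-plugin feature breakdown can be derived like this:
plugins = ["cmdline", "envars", "handles", "ldrmodules", "netscan", "pslist", "threadlist", "vadinfo"]
for p in plugins:
    n = sum(col.lower().startswith(p) for col in df.columns)
    print(f"{p}: {n} columns")
```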
This dataset contains a small sample of anonymized Linux kernel logs from a DGX machine prior to a hardware failure. The training dataset contains 1359 logs labeled as indicators of the root cause or not. A model trained on this set can be robust enough to also correctly identify previously undetected errors from the unseen-errors file as a root cause.
This data contains 2000 synthetic pcap payloads generated to mimic sensitive and benign data found in nested JSONs from web APIs and environment variables. Each row is labeled for the presence or absence of 10 different kinds of sensitive information. The data was generated using the Python faker package and lists of the most common passwords. If there is any resemblance to real individuals, it is purely coincidental.
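A rough sketch of how such labeled nested-JSON payloads can be generated with faker; the fields and the reduced label set below are illustrative assumptions, not the dataset's schema:

```python
import json
import random

from faker import Faker

fake = Faker()
Faker.seed(0)
random.seed(0)

COMMON_PASSWORDS = ["123456", "password", "qwerty"]  # stand-in for a real common-password list

def make_payload():
    """Build one nested JSON payload plus per-category sensitive/not-sensitive labels."""
    has_secret = random.random() < 0.5
    payload = {
        "user": {"name": fake.user_name(), "email": fake.email()},
        "env": {
            "API_KEY": fake.sha256() if has_secret else "",
            "PASSWORD": random.choice(COMMON_PASSWORDS) if has_secret else "",
        },
    }
    # Only a few of the 10 label categories are shown here.
    labels = {"email": 1, "password": int(has_secret), "secret_keys": int(has_secret)}
    return {"data": json.dumps(payload), **labels}

rows = [make_payload() for _ in range(5)]
print(rows[0])
```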
Morpheus contributors will make every effort to keep datasets up-to-date and accurate. However, the data submitted to this repository is provided on an “as-is” basis and there is no warranty or guarantee of any kind that the information is accurate, complete, current or suitable for any particular purpose. It is the responsibility of all persons who use Morpheus to independently confirm the accuracy of the data, information, and results obtained via the Morpheus example workflows.