Morpheus Datasets

Small datasets for testing training scripts, inference scripts, and pipelines.

Anomalous Behavioral Profiling (ABP)

This is a labeled dataset of 1241 nvidia-smi logs, generated once per minute from a single Tesla V100 in our lab environment while it ran either GPU malware or benign workflows.

Sample Training Data

Pipeline Validation Data

The same data is provided in both CSV and JSON Lines formats.
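As a minimal sketch of how the two formats relate, the snippet below converts CSV text to JSON Lines and reads it back using only the standard library. The column names in `sample_csv` are hypothetical placeholders, not the actual ABP schema, and the helper names are illustrative.

```python
import csv
import io
import json


def csv_to_jsonlines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per line; the csv module yields every value as a string.
    return "\n".join(json.dumps(row) for row in reader)


def read_jsonlines(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


# Hypothetical two-row sample; these columns are assumptions for illustration.
sample_csv = "timestamp,gpu_util,label\n1,87,1\n2,3,0\n"
records = read_jsonlines(csv_to_jsonlines(sample_csv))
```

Note that values read through the `csv` module stay strings, so a real comparison between the CSV and JSON Lines files may need type conversion first.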

Digital Fingerprinting (DFP) Data

DFP Azure Logs

This is a synthetic dataset of Azure AD logs covering the activities of 20 accounts (85 applications involved, 3567 records in total). The activities are split into a training set and an inference set. An anomaly is included in the inference set for model validation. The data was generated using the Python Faker package. If there is any resemblance to real individuals, it is purely coincidental.

Sample Training Data

Pipeline Validation Data

Data for the pipeline validation contains an anomalous activity for a single user.

  • Account: [email protected]
  • Time: 2022/08/31
  • Description:
    • Anomalously high log volume (100+)
    • New IP for the account
    • New location for the account (new country, state, city, latitude, longitude)
    • New browser
    • New app access (80 new apps accessed by the account on the day)
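The volume and new-app characteristics listed above can be checked by hand with a small rule-based pass over the records. This is only an illustrative sketch, not the way any pipeline scores events; the field names `account`, `day`, and `app`, the function names, and the default threshold of 100 are assumptions chosen to mirror the description.

```python
from collections import Counter, defaultdict


def daily_profile(records):
    """Summarize log volume and the set of apps per (account, day)."""
    volume = Counter()
    apps = defaultdict(set)
    for rec in records:
        key = (rec["account"], rec["day"])
        volume[key] += 1
        apps[key].add(rec["app"])
    return volume, apps


def flag_anomalies(train, inference, volume_threshold=100):
    """Flag (account, day) pairs whose inference log volume meets the threshold.

    For each flagged pair, also report how many apps were never seen for
    that account in the training records.
    """
    known_apps = defaultdict(set)
    for rec in train:
        known_apps[rec["account"]].add(rec["app"])

    volume, apps = daily_profile(inference)
    flagged = {}
    for (account, day), count in volume.items():
        if count >= volume_threshold:
            new_apps = apps[(account, day)] - known_apps[account]
            flagged[(account, day)] = {"volume": count, "new_apps": len(new_apps)}
    return flagged
```

A check like this is useful mainly as a sanity test that the injected anomaly is present in the inference set before running a trained model against it.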

This dataset is stored in our S3 bucket. It can be downloaded using a script.

DFP CloudTrail Logs

This is a synthetic dataset of AWS CloudTrail log events with activities from two entities/users, stored in separate files.

Files for user-123 include a single CSV and split JSON versions of the same data:

Sample Training Data

Pipeline Validation Data

Files for role-g include a single CSV and split JSON versions of the same data:

Sample Training Data

Pipeline Validation Data

Fraud Detection

This is a small dataset augmented from the artificially generated transaction network demo data from the authors of Inductive Graph Representation Learning for Fraud Detection. The original demo data of 753 labeled transactions was downloaded from the paper's GitHub repo on 02/10/2022 with an MD5 hash of 64af64fcc6e3d55d25111a3f257378a4. We augmented the training dataset by replicating its benign transactions, for a total of 12053 transactions.
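To confirm that a downloaded copy of a source file matches a recorded MD5 hash such as the one above, a check along the following lines can be used. It is a minimal sketch using `hashlib`; the function names are illustrative, and the file path passed in is whatever local path the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

MD5 is used here only as an integrity check against the recorded value; it is not suitable as protection against deliberate tampering.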

Sample Training Data

Pipeline Validation Data

Log Parsing

This sample dataset consists of a subset of Apache logs collected from a Linux system running an Apache web server, taken from a larger public log dataset on loghub. The file was downloaded on 01/14/2020 with an MD5 hash of 1c3a706386b3ebc03a2ae07a2d864d66. The logs were parsed using an Apache log parsing package to create a labeled dataset.

Sample Training Data

Pipeline Validation Data

Log validation data in CSV and JSON format

Phishing Detection

The SMS Spam Collection is a public set of 5574 labeled SMS messages collected for mobile phone spam research. It is hosted at the UCI Machine Learning Repository (SMS Spam Collection Data Set) and was last accessed on 11/09/2022 with an MD5 hash of ab53f9571d479ee677e7b283a06a661a. During training, 20% of the dataset is randomly selected as the test set and saved as a jsonlines file for use in pipeline validation.
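The random 20% hold-out described above can be sketched as follows. This is an illustration under assumptions, not the training script itself: the function names, the fixed seed, and the use of Python's `random` module are all choices made for the example.

```python
import json
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows with a fixed seed and return (train_rows, test_rows)."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def to_jsonlines(rows):
    """Serialize rows as JSON Lines text: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)
```

Fixing the seed keeps the split reproducible, so the saved test file stays aligned with the model that was trained on the remaining rows.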

Pipeline Validation Data

Example Data for Developer Guide

Additionally, a subset of 100 messages from the dataset was augmented to include sender and recipient information using the Python Faker package. If there is any resemblance to real individuals, it is purely coincidental.

Ransomware Detection via AppShield

The dataset was generated by running ransomware and benign processes in a lab environment and recording the output of several plugins from the Volatility framework, including cmdline, envars, handles, ldrmodules, netscan, pslist, threadlist, and vadinfo. The training CSV file contains 530 columns, a combination of features from the Volatility plugins. This data collection is part of DOCA AppShield.

Sample Training Data

The training data CSV consists of 87968 preprocessed and labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities.

Pipeline Validation Data

The validation set contains raw data from 27 AppShield snapshots.

Root Cause

This dataset contains a small sample of anonymized Linux kernel logs from a DGX machine, recorded prior to a hardware failure. The training dataset contains 1359 logs, each labeled as an indicator of the root cause or not. A model trained on this set can be robust enough to also identify the previously unseen errors in the unseen-errors file as root causes.

Sample Training Data

Pipeline Validation Data

Sensitive Information Detection (SID)

This data contains 2000 synthetic pcap payloads generated to mimic sensitive and benign data found in nested JSONs from web APIs and environment variables. Each row is labeled for the presence or absence of 10 different kinds of sensitive information. The data was generated using the Python Faker package and lists of the most common passwords. If there is any resemblance to real individuals, it is purely coincidental.

Sample Training Data

Pipeline Validation Data

Disclaimer

Morpheus contributors will make every effort to keep datasets up-to-date and accurate. However, the data submitted to this repository is provided on an “as-is” basis and there is no warranty or guarantee of any kind that the information is accurate, complete, current or suitable for any particular purpose. It is the responsibility of all persons who use Morpheus to independently confirm the accuracy of the data, information, and results obtained via the Morpheus example workflows.