Encryption Analysis calculates the entropies of data flows and classifies each flow as either encrypted, text, media, or unknown.
For step-by-step instructions on getting started, see the Getting Started document.
The system should have Wireshark/TShark installed; we have tested v2.6.7, v2.6.8, and v2.6.10 on our machines and found them to work.
We use Python 3 unless otherwise specified.
The Jupyter Notebook encryption_sample.ipynb provides steps to parse a pcap file and label each flow as one of the four data types (encrypted, text, media, unknown).
encryption.sh is the command-line equivalent of the Jupyter Notebook and can be run directly in the terminal.
Usage: ./encryption.sh in_pcap out_csv ek_json
Example: ./encryption.sh sample.pcap sample.csv sample.json
The sample code demonstrates how we processed a single file. To process the whole dataset (traffic of 34,586 experiments), adapt the code to your own cluster environment.
in_pcap - The path to the input pcap file.
out_csv - The path to the output CSV file that will be generated from ek_json.
ek_json - The path to the intermediate JSON file that will be generated from in_pcap.
Note: If out_csv and ek_json do not exist, they will be generated by the scripts. If they already exist, they will be overwritten.
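To go from a single file to many, one option is a small driver that builds one encryption.sh command per capture. The following is only a sketch under assumptions: the function names build_jobs and run_jobs, the flat directory of *.pcap files, and the choice to name each output after its pcap are all hypothetical and not part of the provided scripts. On a cluster, the same command lists could be handed to a job scheduler instead of being run in sequence.

```python
import subprocess
from pathlib import Path


def build_jobs(pcap_dir, out_dir, script="./encryption.sh"):
    """Return one encryption.sh command line per *.pcap file in pcap_dir."""
    out_dir = Path(out_dir)
    jobs = []
    # Non-recursive on purpose: two captures with the same name in different
    # subdirectories would otherwise map to the same output files.
    for pcap in sorted(Path(pcap_dir).glob("*.pcap")):
        # Hypothetical naming scheme: outputs share the pcap's base name.
        out_csv = out_dir / (pcap.stem + ".csv")
        ek_json = out_dir / (pcap.stem + ".json")
        # Argument order follows the documented usage:
        #   ./encryption.sh in_pcap out_csv ek_json
        jobs.append([script, str(pcap), str(out_csv), str(ek_json)])
    return jobs


def run_jobs(jobs):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in jobs:
        subprocess.run(cmd, check=True)


# Example (the output directory must already exist):
#   run_jobs(build_jobs("pcaps", "results"))
```

Because existing outputs are overwritten, it is worth checking that no two jobs share an out_csv or ek_json path before running them.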
First, TShark decodes the pcap file into the JSON file. shrink_compute.py then performs analysis on the JSON file, which produces the CSV file.
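For readers unfamiliar with the entropy measure behind the analysis step, the sketch below computes the Shannon entropy of a byte string. It is an illustration only and is not taken from shrink_compute.py: the function name shannon_entropy is hypothetical, and we assume byte-level entropy in bits per byte, whereas the script may use a different unit or normalisation.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # H = -sum(p * log2(p)) over the byte values that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition a payload made of one repeated byte has entropy 0, and a payload in which all 256 byte values are equally frequent has entropy 8. Encrypted and compressed payloads tend to sit near the upper end, while plain text sits noticeably lower.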
The CSV file has ten headings. Their meanings are listed below:
ip_src - The IP address of the source.
ip_dest - The IP address of the destination.
srcport - The transport layer source port number.
dstport - The transport layer destination port number.
tp_proto - The transport layer protocol.
data_proto - The application layer protocol.
data_type - The data type. Either unknown, text, media, compressed, or encrypted.
data_len - The length of the data in bytes.
entropy - The entropy of the data.
reason - Information about the output.
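Once a CSV file has been produced, a quick way to inspect it is to tally rows and bytes per data type. The sketch below is a hypothetical helper, not part of the provided code. It assumes the CSV starts with a header row using the ten names listed above and that data_len holds an integer.

```python
import csv
from collections import Counter


def summarize(csv_path):
    """Count rows and sum data_len per data_type in an output CSV."""
    rows = Counter()
    nbytes = Counter()
    with open(csv_path, newline="") as f:
        # DictReader takes the column names from the header row.
        for row in csv.DictReader(f):
            rows[row["data_type"]] += 1
            nbytes[row["data_type"]] += int(row["data_len"])
    return rows, nbytes


# Example:
#   rows, nbytes = summarize("sample.csv")
#   for data_type, count in rows.most_common():
#       print(data_type, count, nbytes[data_type])
```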