The tools in this repository demonstrate how two machine learning methods, LDA and K-Means, can be used to categorize JA3 pre-hash values. The JA3 algorithm is specified here: https://github.com/salesforce/ja3
Different TLS libraries produce different JA3 values, and a single TLS library can produce several JA3 values depending on various conditions. Because the JA3 fingerprint is an MD5 hash, the fingerprints alone cannot show that two values are close to each other. The JA3 pre-hash string, however, lists the parameters used in the TLS Client Hello message, so it can reveal that two different values were generated by the same TLS library. We use two unsupervised learning algorithms to find clusters or topics in a set of JA3 pre-hash values.
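The difference between the fingerprint and the pre-hash string can be illustrated with a minimal sketch. The two pre-hash strings below are made-up examples, not data from this repository, and the helper names (`ja3_hash`, `tokens`, `jaccard`) are hypothetical. The Jaccard similarity is used here only to show that the pre-hash strings carry comparable structure; it is not the method used by the scripts, which rely on LDA and K-Means.

```python
import hashlib


def ja3_hash(pre_hash):
    """Return the JA3 fingerprint: the MD5 hex digest of the pre-hash string."""
    return hashlib.md5(pre_hash.encode("ascii")).hexdigest()


def tokens(pre_hash):
    """Split a pre-hash string into field-tagged tokens such as 'c:4865'.

    The pre-hash format is:
    TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats
    with the values inside each field separated by '-'.
    """
    tags = ("v", "c", "e", "g", "p")  # hypothetical one-letter tags per field
    result = set()
    for tag, field in zip(tags, pre_hash.split(",")):
        for value in field.split("-"):
            if value:  # a field may be empty
                result.add(f"{tag}:{value}")
    return result


def jaccard(a, b):
    """Jaccard similarity of two token sets (size of intersection over union)."""
    return len(a & b) / len(a | b)


# Two made-up Client Hello descriptions that differ by one extension (21).
hello_a = "771,4865-4866-4867-49195,0-23-65281-10-11,29-23-24,0"
hello_b = "771,4865-4866-4867-49195,0-23-65281-10-11-21,29-23-24,0"

# The fingerprints are unrelated hex strings.
print(ja3_hash(hello_a))
print(ja3_hash(hello_b))

# The pre-hash strings share 14 of 15 distinct tokens.
similarity = jaccard(tokens(hello_a), tokens(hello_b))
print(round(similarity, 3))
```

Tagging each value with its field keeps, for example, cipher `23` distinct from extension `23`, which a plain split on `-` and `,` would conflate.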
These scripts and the example data provided here are used in the article "Categorizing TLS traffic based on JA3 pre-hash values" by Jenny Heino, Antti Hakkala and Seppo Virtanen, presented at the 14th International Conference on Ambient Systems, Networks and Technologies (ANT), March 15 - 17, 2023, Leuven, Belgium, and published in Procedia Computer Science 220C (2023) pp. 94-101.
The repository structure is as follows:
This folder contains the example data used in the publication and for creating the example models.
This folder will contain the graphs generated by the scripts.
This folder contains the example models, and is the default location for new models when training them.
This folder contains the relevant scripts.