Machine Learning Models

Marco Rosa edited this page Oct 15, 2021 · 19 revisions

Since regex scanners tend to produce many false positive discoveries, machine learning models can be used to reduce the number of discoveries that must be analysed manually. In particular, the models automatically classify the discoveries they judge to be noise as false_positive (i.e., spam).

Each model needs an implementation in the credentialdigger/models folder. Any binaries a model requires are downloaded automatically on the fly.

Supported Models

If you want to propose a new model to reduce false positive discoveries, please contact us or open an issue in the project.

Path Model

The Path Model uses regular expressions to recognise files that typically contain fake credentials.

After a pre-processing phase, the file path of a discovery is matched against a regular expression to guess whether the credentials it contains are real. According to our observations, documentation (e.g., README and .md files in general), tutorials, tests, virtual environments, and dependencies pushed to the repository (e.g., node_modules) don't contain real secrets used in production.
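The idea can be illustrated with a minimal sketch. It is not the project's actual implementation: the pattern, the pre-processing step, the function names, and the dictionary layout of a discovery (file_name, state) are all assumptions made for this example.

```python
import re

# Hypothetical pattern (NOT the one shipped with the project): it matches
# paths inside documentation, tutorial, test, virtual environment, or
# dependency folders, plus README files and any .md file.
FAKE_PATH_PATTERN = re.compile(
    r"(^|/)(docs?|documentation|tutorials?|examples?|tests?"
    r"|node_modules|venv|\.venv|site-packages)(/|$)"
    r"|(^|/)readme[^/]*$"
    r"|\.md$"
)


def preprocess(path: str) -> str:
    """Assumed pre-processing: unify separators, trim, and lowercase."""
    return path.strip().replace("\\", "/").lower()


def is_likely_false_positive(path: str) -> bool:
    """Return True if the path looks like a file that holds fake credentials."""
    return FAKE_PATH_PATTERN.search(preprocess(path)) is not None


def classify(discoveries: list[dict]) -> list[dict]:
    """Return copies of the discoveries, flagging the ones on suspicious paths."""
    classified = []
    for discovery in discoveries:
        updated = dict(discovery)
        if is_likely_false_positive(updated["file_name"]):
            updated["state"] = "false_positive"
        classified.append(updated)
    return classified


if __name__ == "__main__":
    sample = [
        {"file_name": "README.md", "state": "new"},
        {"file_name": "node_modules/pkg/index.js", "state": "new"},
        {"file_name": "src/config/settings.py", "state": "new"},
    ]
    for discovery in classify(sample):
        print(discovery["file_name"], discovery["state"])
```

Matching whole path segments, rather than bare substrings, keeps a folder such as src/latest from being mistaken for a tests folder.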

Up to v4.3 we used an ML approach based on fastText, but we switched to regular expressions in v4.4 since they proved more efficient without loss of precision. Please visit the OLD machine learning models page for further information on the old Path Model.

Password Model

TODO