Filters are functions that return a boolean value to indicate whether a specific condition should be skipped.
Example #1:
This is an example of a filter:
# source: detect_secrets.filters.common
def is_invalid_file(filename: str) -> bool:
return not os.path.isfile(filename)
This will ignore all invalid files, and ensure that we don't perform unnecessary scans on files that don't exist.
Example #2:
# source: detect_secrets.filters.heuristic
def is_potential_uuid(secret: str) -> bool:
return bool(_get_uuid_regex().search(secret))
@lru_cache(maxsize=1)
def _get_uuid_regex() -> Pattern:
return re.compile(
r'[a-f0-9]{8}\-[a-f0-9]{4}\-[a-f0-9]{4}\-[a-f0-9]{4}\-[a-f0-9]{12}',
re.IGNORECASE,
)
This above example is another example of a filter. Unlike the former example, this serves
to make sure that UUIDs aren't classified as secrets by returning True
(hence, skipping)
if the raw secret value looks like a UUID.
This is a current list of filters that ship with detect-secrets
. All filters are found under
the detect_secrets.filters
namespace.
Name | Description |
---|---|
allowlist.is_line_allowlisted |
Powers the inline allowlisting functionality. |
common.is_invalid_file |
Ignores files that are not files (e.g. links). |
common.is_baseline_file |
Ignores the baseline file itself. |
common.is_ignored_due_to_verification_policies |
Powers secret verification functionality. |
gibberish.should_exclude_secret |
Excludes secrets that are not gibberish looking strings. |
heuristic.is_indirect_reference |
Primarily for KeywordDetector , filters secrets like secret = get_secret_key() . |
heuristic.is_likely_id_string |
Ignores secret values prefixed with id . |
heuristic.is_lock_file |
Ignores common lock files. |
heuristic.is_non_text_file |
Ignores non-text files (e.g. archives, images). |
heuristic.is_not_alphanumeric_string |
Ignores secrets that do not have a single alphanumeric character in it. |
heuristic.is_potential_uuid |
Ignores uuid looking secret values. |
heuristic.is_prefixed_with_dollar_sign |
Primarily for KeywordDetector , filters secrets like secret = $variableName; . |
heuristic.is_sequential_string |
Ignores secrets like abcdefg . |
heuristic.is_swagger_file |
Ignores swagger files and paths, like swagger-ui.html or /swagger/. |
heuristic.is_templated_secret |
Ignores secrets like secret = <key> , secret = {{key}} and secret = ${key} . |
regex.should_exclude_line |
Powers the --exclude-lines functionality. |
regex.should_exclude_file |
Powers the --exclude-files functionality. |
regex.should_exclude_secret |
Powers the --exclude-secrets functionality. |
wordlist.should_exclude_secret |
Powers the --word-list functionality. |
As of version 1.0, all built-in filters are included in every scan. A list of these can be found in the baseline produced:
$ detect-secrets scan test_data
{
"generated_at": "2020-11-13T19:05:03Z",
"version": "1.0.0",
"plugins_used": [...],
"filters_used": [
{
"path": "detect_secrets.filters.heuristic.is_potential_uuid"
},
{
"path": "detect_secrets.filters.common.is_ignored_due_to_verification_policies",
"min_level": 2,
}
...
],
"results": {...}
}
This path
value specifies the import path for the filter function to be used. If there are
any additional values that are needed for filter configuration (e.g. min_level
in
is_ignored_due_to_verification_policies
), they will also be listed within the same dictionary.
There are also a set of default filters that are added to every scan.
These can be found in
detect_secrets.settings.Settings.DEFAULT_FILTERS
and power global exclusion policies (e.g. inline allowlisting via pragma: allowlist secret
).
For the sake of brevity, these default filters are not included in the baseline output.
You can disable any filters with the --disable-filter
flag. For example:
$ detect-secrets scan test_data --disable-filter detect_secrets.filters.heuristic.is_prefixed_with_dollar_sign
If you're running detect-secrets
as a package, you can specify the filters you want by
customizing your own settings object. e.g.
from detect_secrets.core import baseline
from detect_secrets.settings import transient_settings
config = {
'filters_used': [
{
'path': 'detect_secrets.filters.heuristic.is_potential_uuid',
},
],
}
with transient_settings(config):
secrets = baseline.create('.')
You can provide your own filters with the --filter
flag. For example, if you wanted to provide
a custom file, you can provide it in the format:
/path/to/file::function_name
. For example:
$ cat custom_filter.py
def is_invalid_secret(secret: str) -> bool:
return secret == 'invalid'
$ detect-secrets scan --filter custom_filter.py::is_invalid_secret
You can also provide an import path to the filter function you desire:
$ detect-secrets scan --filter detect_secrets.filters.heuristic.is_md5_string
If you want to systematically exclude results from the scan, filters are the way to go. The most
common filters are configured via --exclude-lines
and --exclude-files
, however, if you need
a more custom solution for your environment, you may want to experiment with custom filters!
Before you can write your own filter, it's important to understand how they fit into the larger secret scanning engine.
Filters are dynamically executed as part of the scan process, and are sequentially run based on the information that is currently known at that specific stage of the process. For example, when you scan a file, this is the process it takes:
-
Check to see whether the filename should be scanned (filter)
-
Obtain lines by processing file.
-
For each line,
a. Check to see whether the line should be scanned (filter)
b. For each plugin, attempt to find a secret on that line, and determine whether it is indeed a secret (filter)
-
Aggregate secrets.
In this example, you can see that step #1's filters will operate on filenames alone (e.g. Example #1 above), yet step #3b will have the filename, line, secret found, as well as the plugin which found it.
Therefore, filters need to declare the variables that they will need to make a decision. Then, the dependency injection system will auto-magically call the registered filters based on the information available to it at the time.
For more details, check out detect_secrets/core/scan.py
.
Filters MUST only depend on some combination of the following variables:
Variable Name | Type | Description |
---|---|---|
filename |
string | The file path being scanned. |
line |
string | The line being scanned. |
plugin |
detect_secrets.core.plugins.util.Plugin |
The plugin that found the secret. |
secret |
string | The raw secret value. |
context |
detect_secrets.util.code_snippet.CodeSnippet |
Lines of code surrounding secret. |
Expect that your filter function will be called multiple times during a scan process. Imagine: if your filter operates on lines, it would need to be called for every line, in every file that you scan! If your filter has a long initialization process, it will considerably slow down the overall scan process.
Example #2 demonstrates a better way to accomplish this. By caching the compiled regex as such, the regex will only need to be created once for the entire scan, avoiding unnecessary detriments in scan speed.
The detect_secrets.filters.wordlist.should_exclude_secret
is a fantastic example of this.
This wordlist filter leverages the Aho-Corasick algorithm to process a list of words and to
determine whether a secret value is in the list of known false positives. However, the
initialization process of the automaton engine is a slow one -- and definitely not something we
want to be initializing for every secret found!
To get around this problem, we initialize
the automaton once on load, then cache it so that all
future executions will merely use the cached version.
Furthermore, the pattern of reading the filter settings from the global Settings
object is also
encouraged: read it once, setup your filter, and have smooth executions for the rest of the scan.