SpiderFoot's goal is to automate OSINT collection and analysis to the greatest extent possible. Since its inception, SpiderFoot has focused heavily on automating OSINT collection and entity extraction, but the automation of common analysis tasks -- beyond some reporting and visualisations -- has been left entirely to the user. This has meant that the strength of SpiderFoot's data collection capabilities has sometimes also been its weakness: with so much data collected, users have often needed to export it and use other tools to weed out the data of interest.
We started tackling this analysis gap with the launch of SpiderFoot HX in 2019 through the introduction of the "Correlations" feature. This feature was represented by some 30 "correlation rules" that ran with each scan, analyzing data and presenting results that reflect SpiderFoot's opinionated view of what may be important or interesting. Here are a few of those rules as examples:
- Hosts/IPs reported as malicious by multiple data sources
- Outlier web servers (can be an indication of shadow IT)
- Databases exposed on the Internet
- Open ports revealing software versions
- and many more.
With the release of SpiderFoot 4.0 we wanted to bring this capability from SpiderFoot HX to the community, but also re-imagine it at the same time, so that the community might not simply run rules we provide, but also write their own correlation rules and contribute them back. We also hope that, just as with modules, we will see a long list of contributions made in the years ahead so that all may benefit.
With that said, let's get into what these rules look like and how to write one.
The rules themselves are written in YAML. Why YAML? It's easy to read and write, allows for comments, and is increasingly commonplace in many modern tools.
The simplest way to think of a SpiderFoot correlation rule is like a simple database query that consists of a few sections:
- Defining the rule itself (`id`, `version` and `meta` sections).
- Stating what you'd like to extract from the scan results (`collections` section).
- Grouping that data in some way (`aggregation` section; optional).
- Performing some analysis over that data (`analysis` section; optional).
- Presenting the results (`headline` section).
Here's an example rule that looks at SpiderFoot scan results for data revealing open TCP ports where the banner (the data returned upon connecting to the port) reports a software version. It does so by applying some regular expressions to the content of `TCP_PORT_OPEN_BANNER` data elements, filtering out some false positives and then grouping the results by the banner itself, so that one correlation result is created per banner revealing a version:
```yaml
id: open_port_version
version: 1
meta:
  name: Open TCP port reveals version
  description: >
    A possible software version has been revealed on an open port. Such
    information may reveal the use of old/unpatched software used by
    the target.
  risk: INFO
collections:
  - collect:
      - method: exact
        field: type
        value: TCP_PORT_OPEN_BANNER
      - method: regex
        field: data
        value: .*[0-9]\.[0-9].*
      - method: regex
        field: data
        value: not .*Mime-Version.*
      - method: regex
        field: data
        value: not .*HTTP/1.*
aggregation:
  field: data
headline: "Software version revealed on open port: {data}"
```
To show this in practice, we can run a simple scan against a target, in this case focusing on performing a port scan:
```
# python3.9 ./sf.py -s www.binarypool.com -m sfp_dnsresolve,sfp_portscan_tcp
2022-04-06 08:14:58,476 [INFO] sflib : Scan [94EB5F0B] for 'www.binarypool.com' initiated.
...
sfp_portscan_tcp    Open TCP Port Banner    SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.10
...
2022-04-06 08:15:23,110 [INFO] correlation : New correlation [open_port_version]: Software version revealed on open port: SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.10
2022-04-06 08:15:23,244 [INFO] sflib : Scan [94EB5F0B] completed.
```
We can see above that a port was found to be open by the `sfp_portscan_tcp` module, and it happens to include a version. The correlation rule `open_port_version` picked this up and reported it. This is also visible in the web interface.
NOTE: Rules will only succeed if relevant data exists in your scan results in the first place. In other words, correlation rules analyze scan data, they don't collect data from targets.
In short, SpiderFoot translates the YAML rules into a combination of queries against the backend database of scan results and Python logic to filter and group the results, creating "correlation results" in the SpiderFoot database. These results can be viewed in the SpiderFoot web interface or from the SpiderFoot CLI. You can also query them directly out of the SQLite database if you like (they are in the `tbl_scan_correlation_results` table, and the `tbl_scan_correlation_results_events` table maps the events (data elements) to the correlation results).
Each rule exists as a YAML file within the `correlations` folder in the SpiderFoot installation path. Here you can see the list of rules in 4.0, which we hope to grow over time:
```
cert_expired.yaml                    host_only_from_certificatetransparency.yaml  outlier_ipaddress.yaml
cloud_bucket_open.yaml               http_errors.yaml                              outlier_registrar.yaml
cloud_bucket_open_related.yaml       human_name_in_whois.yaml                      outlier_webserver.yaml
data_from_base64.yaml                internal_host.yaml                            remote_desktop_exposed.yaml
data_from_docmeta.yaml               multiple_malicious.yaml                       root_path_needs_auth.yaml
database_exposed.yaml                multiple_malicious_affiliate.yaml             stale_host.yaml
dev_or_test_system.yaml              multiple_malicious_cohost.yaml                strong_affiliate_certs.yaml
dns_zone_transfer_possible.yaml      name_only_from_pasteleak_site.yaml            strong_similardomain_crossref.yaml
egress_ip_from_wikipedia.yaml        open_port_version.yaml                        template.yaml
email_in_multiple_breaches.yaml      outlier_cloud.yaml                            vulnerability_critical.yaml
email_in_whois.yaml                  outlier_country.yaml                          vulnerability_high.yaml
email_only_from_pasteleak_site.yaml  outlier_email.yaml                            vulnerability_mediumlow.yaml
host_only_from_bruteforce.yaml       outlier_hostname.yaml
```
The rules themselves are broken down into the following components:
Meta: Describes the rule itself so that humans understand what the rule does and the risk level of any results. This information is used mostly in the web interface and CLI.
Collections: A collection represents a set of data pulled from scan results, to be used in later aggregation and analysis stages. Each rule can have multiple collections.
Aggregations: An aggregation buckets the collected data elements into distinct groups for analysis.
Analysis: Analysis performs (you guessed it) analysis on the data to whittle down the data elements to what ultimately gets reported. For example, the analysis stage may look only for cases where the data field is repeated in the data set, indicating it was found multiple times, and therefore discard anything appearing only once.
Headline: The headline represents the actual correlation title that summarizes what was found. You can think of this as equivalent to a meal name (beef stew), and all the data elements as being the ingredients (beef, tomatoes, onions, etc.).
To create your own rule, simply copy the `template.yaml` file in the `correlations` folder to a meaningful name that matches the ID you intend to give the rule, e.g. `aws_cloud_usage.yaml`, and edit the rule to fit your needs. Save it and re-start SpiderFoot for the rule to be loaded. If there are any syntax errors, SpiderFoot will abort at startup and (hopefully) give you enough information to know where the error is.
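To give a feel for the end result, here is a minimal sketch of what such a new rule could look like once edited. The `aws_cloud_usage` ID, the regular expression and the headline text are invented for illustration and are not the contents of the shipped `template.yaml`:

```yaml
id: aws_cloud_usage
version: 1
meta:
  name: AWS cloud usage identified
  description: >
    One or more hostnames associated with the target appear to be
    hosted on AWS, indicating use of Amazon's cloud services.
  risk: INFO
collections:
  - collect:
      # The first method pulls data from the database: all hostnames.
      - method: exact
        field: type
        value: INTERNET_NAME
      # Subsequent methods refine that set, here down to AWS hostnames.
      - method: regex
        field: data
        value: .*\.amazonaws\.com
aggregation:
  field: data
headline: "AWS cloud usage identified: {data}"
```

The optional `analysis` section is omitted in this sketch; it is covered below.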
The `template.yaml` file is also a good next point of reference to better understand the structure of the rules and how to use them. We also recommend taking a look through the actual rules themselves to see the concepts in practice.
id: The internal ID for this rule, which needs to match the filename.
version: The rule syntax version. This must be `1` for now.
meta: This section contains a few important fields used to describe the rule.
- name: A short, human-readable name for the rule.
- description: A longer (can be multi-paragraph) description of the rule.
- risk: The risk level represented by this rule's findings. Can be `INFO`, `LOW`, `MEDIUM` or `HIGH`.
collection: A correlation rule contains one or more `collect` blocks. Each `collect` block contains one or more `method` blocks telling SpiderFoot what criteria to use for extracting data from the database and how to filter it down.

- collect: Technically, the first `method` block in each `collect` block is what actually pulls data from the database, and each subsequent `method` refines that dataset down to what you're seeking. You may have multiple `collect` blocks overall, but the rule remains that within each `collect`, the first `method` pulls data from the database and subsequent `method` blocks within the `collect` refine that data.
  - method: Each `method` block tells SpiderFoot how to collect and refine data. Each `collect` must contain at least one `method` block. Valid methods are `exact` for performing an exact match of the chosen `field` to the supplied `value`, or `regex` to perform regular expression matching.
  - field: Each `method` block has a `field` upon which the matching should be performed. Valid fields are `type` (e.g. `INTERNET_NAME`), `module` (e.g. `sfp_whois`) and `data`, which would be the value of the data element (e.g. in the case of an `INTERNET_NAME`, the `data` would be the hostname). After the first `method` block, you can also prefix the field with `source.`, `child.` or `entity.` to refer to the fields of the source, children or relevant entities of the collected data, respectively (see `multiple_malicious.yaml` and `data_from_docmeta.yaml` as examples of this approach, and the sketch after this list).
  - value: Here you supply the value or values you wish to match against the field you supplied in `field`. If your `method` was `regex`, this would be a regular expression.
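To make the prefixes concrete, here is a small, purely illustrative `collect` fragment (not one of the shipped rules) that pulls web content and then refines it using the source and entity fields of each data element; the `/admin` pattern is an arbitrary example:

```yaml
collections:
  - collect:
      # First method pulls data from the database: all web page content.
      - method: exact
        field: type
        value: TARGET_WEB_CONTENT
      # Refine to content whose source (the URL it was fetched from)
      # contains /admin...
      - method: regex
        field: source.data
        value: .*/admin.*
      # ...and whose owning entity is a hostname.
      - method: exact
        field: entity.type
        value: INTERNET_NAME
```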
aggregation: With all the data elements in their collections, you can now aggregate them into buckets for further analysis, or immediately generate results from the rule. While the collection phase is about obtaining the data from the database and filtering it down to data of interest, the aggregation phase is about grouping that data in different ways in order to support analysis and/or group the reported results.

Aggregation simply iterates through the data elements in each collection and places them into groups based on the `field` specified. For instance, if you pick the `type` field, you'll end up with data elements sharing the same `type` grouped together. The purpose of this grouping is two-fold: to support the analysis stage, or, if you don't have an analysis stage, to define how your correlation results will be grouped for the user.

- field: The `field` defines how you'd like your data elements grouped together. Just like the `field` option in `method` blocks above, you may prefix the field with `source.`, `child.` or `entity.` to apply the aggregation to those fields of the data element instead. For example, if you intended to look for multiple occurrences of a hostname, you would specify `data` here, since you want to count the number of times the value of the `data` field appears (see the sketch after this list).
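As a short sketch, continuing the illustrative fragment above: grouping the collected web content by the hostname it ultimately belongs to would look like this, producing one correlation result per host:

```yaml
aggregation:
  # Group collected data elements by the entity (e.g. hostname) they belong to.
  field: entity.data
```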
analysis: The analysis section applies (you guessed it) some analysis to the aggregated results, or to the collections directly if you didn't perform any aggregation, and drops candidate results if they fail this stage. Various analysis `method` types exist, and each takes different options, described below.
- method: The analysis method to apply; one of the following.
  - threshold: Drop any collection/aggregation of data elements that does not meet the defined thresholds. You would use this analysis method when you want to generate a result only when a data element has appeared more or less often than a specified limit, for instance reporting when an email address appears just once, or more than 100 times (see the sketch after this list).
    - field: The field you want to apply the threshold to. As per above, you can use the `child.`, `source.` and `entity.` field prefixes here too.
    - count_unique_only: By default the threshold is applied to the `field` specified on all data elements, but by setting `count_unique_only` to `true`, you can limit the threshold to only unique values in the `field` specified, so as not to also count duplicates.
    - minimum: The minimum number of data elements that must appear within the collection or aggregation.
    - maximum: The maximum number of data elements that must appear within the collection or aggregation.
  - outlier: Only keep outliers within any collection/aggregation of data elements.
    - maximum_percent: The maximum percentage of the overall results that an aggregation can represent. This method requires that you have performed an aggregation on a certain field in order to function. For example, if you aggregate on the `data` field of your collections and one of those buckets contains less than 10% of the overall volume, it will be reported as an outlier.
    - noisy_percent: By default this is `10`, meaning that if the average percentage of every bucket is below 10%, outliers are not reported since the dataset is anomalous.
  - first_collection_only: Only keep data elements that appeared in the first collection but not in any others. For example, this is handy for finding cases where data was found by one or several data sources but not others.
    - field: The field you want to use for looking up data elements between collections.
  - match_all_to_first_collection: Only keep data elements that have matched in some way to the first collection. This requires an aggregation to have been performed, as the field used for aggregation is what will be used for checking for a match.
    - match_method: How to match between all collections and the first collection. Options are `contains` (simple wildcard match), `exact` and `subnet`, which reports a match if the field contains an IP address that falls within a subnet held in the first collection's field.
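For instance, a hypothetical `analysis` block that keeps only buckets containing at least two distinct values could look like the following sketch; the threshold of 2 is arbitrary and not taken from any shipped rule:

```yaml
analysis:
  # Keep only collections/aggregations with two or more unique 'data' values.
  - method: threshold
    field: data
    count_unique_only: true
    minimum: 2
```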
headline: After all data elements have been collected, filtered down, aggregated and analyzed, any data elements remaining are what we call "correlation results": the results of your correlation rule. These need a "headline" to summarize the findings, which you can define here. To place any value from your data into the headline, enclose the field in `{}`, e.g. `{entity.data}`. There are two ways to write a `headline` rule. The typical way is to simply have `headline: titletexthere`, or you can have it as a block, in which case you can be more granular about how the correlation results are published (see the sketch after this list):

- text: The headline text, as described above.
- publish_collections: The collection you wish to have associated with the correlation result. This is not often needed, but is useful in combination with the `match_all_to_first_collection` analysis rule in cases where your first collection is only used as a reference point and does not actually contain any data elements you wish to publish with this correlation result. Take a look at the `egress_ip_from_wikipedia.yaml` rule for an example of this used in practice.
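As a sketch of the block form, loosely modelled on the idea behind `egress_ip_from_wikipedia.yaml` (the headline text and the collection reference below are illustrative; check that rule for the exact `publish_collections` value syntax):

```yaml
headline:
  text: "Egress IP address identified: {data}"
  # Publish only the data elements from the collection of interest,
  # rather than the reference collection.
  publish_collections: 1
```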
Every data element pulled in the first match rule (i.e. the first `method` block) in a collection will also have any children (data resulting from that data element), the source (the data element that this data element was generated from) and entity (the source, or source of the source, etc. that was an entity like an IP address, domain, etc.). This enables you to prefix subsequent (and only subsequent!) match block field names with `child.`, `source.` and `entity.` if you wish to match based on those fields. These prefixes, as shown above, can also be used in the `aggregation`, `analysis` and `headline` sections too.

It is vital to note that these prefixes always refer to the first match block within each `collect` block, since every subsequent match block is always a refinement of the first.
This can be complicated, so let's use an example to illustrate. Let's say your scan has found a hostname (a data element type of `INTERNET_NAME`) of `foo`, and it found that within some web page content (a data element type of `TARGET_WEB_CONTENT`) of "This is some web content: foo", which was from a URL (data element type of `LINKED_URL_INTERNAL`) of "https://bar/page.html", which was in turn from another host named `bar`. Here's the data discovery path:

`bar` [`INTERNET_NAME`] -> `https://bar/page.html` [`LINKED_URL_INTERNAL`] -> `This is some web content: foo` [`TARGET_WEB_CONTENT`] -> `foo` [`INTERNET_NAME`]
If we were to look at `This is some web content: foo` in our rule, here are the `data` and `type` fields you would expect to exist (`module` would also exist but has been left out of this example for brevity):

- `data`: `This is some web content: foo`
- `type`: `TARGET_WEB_CONTENT`
- `source.data`: `https://bar/page.html`
- `source.type`: `LINKED_URL_INTERNAL`
- `child.data`: `foo`
- `child.type`: `INTERNET_NAME`
- `entity.type`: `INTERNET_NAME`
- `entity.data`: `bar`
Notice how the `entity.type` and `entity.data` fields for "This is some web content: foo" are not the `LINKED_URL_INTERNAL` data element, but actually the `bar` `INTERNET_NAME` data element. This is because an `INTERNET_NAME` is an entity, but a `LINKED_URL_INTERNAL` is not.
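Tying this back to rule syntax, a hypothetical rule fragment built around this example might refine on `child.type` and report on `entity.data` like so (illustrative only, not a shipped rule):

```yaml
collections:
  - collect:
      # Pull all web page content found during the scan...
      - method: exact
        field: type
        value: TARGET_WEB_CONTENT
      # ...and keep only content that led to the discovery of a hostname.
      - method: exact
        field: child.type
        value: INTERNET_NAME
aggregation:
  field: entity.data
headline: "Web content on {entity.data} revealed a hostname"
```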
You can look in `spiderfoot/db.py` to see which data types are entities and which are not.