Because the significant_terms aggregation works by analyzing statistics, it needs a certain volume of data before it becomes effective. That means we can’t simply index a handful of example documents for the demo.
Instead, we prepared a dataset that contains about 80,000 documents and saved it as a snapshot in our public demo repository. To "restore" this dataset into your cluster:
-
Add the following setting to your elasticsearch.yml configuration file to whitelist the Elastic demo repository:
repositories.url.allowed_urls: ["http://download.elastic.co/*"]
-
Restart Elasticsearch.
-
Run the following snapshot commands. (For more information about using snapshots, see Backing Up Your Cluster.)
PUT /_snapshot/sigterms (1)
{
"type": "url",
"settings": {
"url": "http://download.elastic.co/definitiveguide/sigterms_demo/"
}
}
GET /_snapshot/sigterms/_all (2)
POST /_snapshot/sigterms/snapshot/_restore (3)
GET /mlmovies,mlratings/_recovery (4)
-
Register a new read-only URL repository pointing at the demo snapshot
-
(Optional) Inspect the repository to learn details about available snapshots
-
Begin the Restore process. This will download two indices into your cluster: mlmovies and mlratings.
-
(Optional) Monitor the Restore process using the Recovery API
-
Note: The dataset is around 50 MB and may take some time to download.
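For readers who would rather script the restore than type it into a console, the four requests above could be driven from Python along the following lines. This is only a sketch: the `BASE_URL` (a local cluster on port 9200) and the helper names `build_request` and `run_restore` are assumptions made for illustration, and only the standard library is used.

```python
import json
import urllib.request

# Assumption: a local cluster listening on the default HTTP port.
BASE_URL = "http://localhost:9200"

# (method, path, body) for each step, in the order given above.
RESTORE_STEPS = [
    ("PUT", "/_snapshot/sigterms", {
        "type": "url",
        "settings": {
            "url": "http://download.elastic.co/definitiveguide/sigterms_demo/"
        },
    }),
    ("GET", "/_snapshot/sigterms/_all", None),
    ("POST", "/_snapshot/sigterms/snapshot/_restore", None),
    ("GET", "/mlmovies,mlratings/_recovery", None),
]


def build_request(method, path, body=None, base_url=BASE_URL):
    """Build (but do not send) the HTTP request for one step."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def run_restore(base_url=BASE_URL):
    """Send every step in order and print each raw JSON response."""
    for method, path, body in RESTORE_STEPS:
        request = build_request(method, path, body, base_url)
        with urllib.request.urlopen(request) as response:
            print(method, path, response.status)
            print(response.read().decode("utf-8"))
```

Calling `run_restore()` against a running cluster would issue the requests in sequence. The final recovery call only reports progress at that moment, so it may need to be repeated until both indices have finished restoring.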
In this demo, we are going to look at movie ratings by users of MovieLens. At MovieLens, users make movie recommendations so other users can find new movies to watch. For this demo, we are going to recommend movies by using significant_terms based on an input movie.
Let’s take a look at some sample data to get a feel for what we are working with. There are two indices in this dataset, mlmovies and mlratings. Let’s look at mlmovies first:
GET mlmovies/_search (1)
{
"took": 4,
"timed_out": false,
"_shards": {...},
"hits": {
"total": 10681,
"max_score": 1,
"hits": [
{
"_index": "mlmovies",
"_type": "mlmovie",
"_id": "2",
"_score": 1,
"_source": {
"offset": 2,
"bytes": 34,
"title": "Jumanji (1995)"
}
},
....
-
Execute a search without a query, so that we can see a random sampling of docs.
Each document in mlmovies represents a single movie. The two important pieces of data are the _id of the movie and the title of the movie. You can ignore offset and bytes; they are artifacts of the process used to extract this data from the original CSV files. There are 10,681 movies in this dataset.
Now let’s look at mlratings:
GET mlratings/_search
{
"took": 3,
"timed_out": false,
"_shards": {...},
"hits": {
"total": 69796,
"max_score": 1,
"hits": [
{
"_index": "mlratings",
"_type": "mlrating",
"_id": "00IC-2jDQFiQkpD6vhbFYA",
"_score": 1,
"_source": {
"offset": 1,
"bytes": 108,
"movie": [122,185,231,292,
316,329,355,356,362,364,370,377,420,
466,480,520,539,586,588,589,594,616
],
"user": 1
}
},
...
Here we can see the recommendations of individual users. Each document represents a single user, denoted by the user ID field. The movie field holds a list of movies that this user watched and recommended.
The first strategy we could take is trying to recommend movies based on popularity. Given a particular movie, we find all users who recommended that movie. Then we aggregate all their recommendations and take the top five most popular.
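Before turning to Elasticsearch, the strategy just described can be sketched in a few lines of plain Python. The toy `ratings` list and the function name `most_popular_with` are made up for illustration; the documents only mimic the shape of mlratings, and the movie IDs are not real.

```python
from collections import Counter

# Toy stand-in for mlratings: one document per user (IDs are made up).
ratings = [
    {"user": 1, "movie": [10, 20, 30]},
    {"user": 2, "movie": [10, 20, 40]},
    {"user": 3, "movie": [20, 30]},
    {"user": 4, "movie": [10, 20, 50]},
]


def most_popular_with(movie_id, docs, size=5):
    """Rank the movies recommended by users who also recommended movie_id."""
    counts = Counter()
    for doc in docs:
        if movie_id in doc["movie"]:          # keep only fans of the input movie
            counts.update(set(doc["movie"]))  # one vote per user per movie
    del counts[movie_id]                      # never recommend the input itself
    return counts.most_common(size)


print(most_popular_with(10, ratings))
```

The filter step corresponds to restricting the documents to one movie's fans, and the counting step is what a bucketing aggregation does for us on the server.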
We can express that easily with a terms aggregation and some filtering. Let’s look at Talladega Nights, a comedy about NASCAR racing starring Will Ferrell. Ideally, our recommender should find other comedies in a similar style (and more than likely also starring Will Ferrell).
First we need to find the Talladega Nights ID:
GET mlmovies/_search
{
"query": {
"match": {
"title": "Talladega Nights"
}
}
}
...
"hits": [
{
"_index": "mlmovies",
"_type": "mlmovie",
"_id": "46970", (1)
"_score": 3.658795,
"_source": {
"offset": 9575,
"bytes": 74,
"title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
}
},
...
-
Talladega Nights is ID 46970.
Armed with the ID, we can now filter the ratings and apply our terms aggregation to find the most popular movies from people who also like Talladega Nights:
GET mlratings/_search
{
"size" : 0, (1)
"query": {
"filtered": {
"filter": {
"term": {
"movie": 46970 (2)
}
}
}
},
"aggs": {
"most_popular": {
"terms": {
"field": "movie", (3)
"size": 6
}
}
}
}
-
We execute our query on mlratings this time, and set the size to 0 since we are interested only in the aggregation results.
-
Apply a filter on the ID corresponding to Talladega Nights.
-
Finally, find the most popular movies by using a terms bucket.
We perform the search on the mlratings index, and apply a filter for the ID of Talladega Nights. Since aggregations operate within the scope of the query, this effectively restricts the aggregation to the users who recommended Talladega Nights. Finally, we execute a terms aggregation to bucket the most popular movies. We request the top six results, since Talladega Nights itself will likely be returned as a bucket (and we don’t want to recommend the same movie).
The results come back like so:
{
...
"aggregations": {
"most_popular": {
"buckets": [
{
"key": 46970,
"key_as_string": "46970",
"doc_count": 271
},
{
"key": 2571,
"key_as_string": "2571",
"doc_count": 197
},
{
"key": 318,
"key_as_string": "318",
"doc_count": 196
},
{
"key": 296,
"key_as_string": "296",
"doc_count": 183
},
{
"key": 2959,
"key_as_string": "2959",
"doc_count": 183
},
{
"key": 260,
"key_as_string": "260",
"doc_count": 90
}
]
}
}
...
We need to correlate these back to their original titles, which can be done with a simple filtered query:
GET mlmovies/_search
{
"query": {
"filtered": {
"filter": {
"ids": {
"values": [2571,318,296,2959,260]
}
}
}
}
}
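The ID list in that lookup is just the bucket keys with the input movie removed. A small helper could build the request body from the aggregation buckets; the function name `ids_query_from_buckets` is hypothetical, and the buckets are copied from the response shown earlier.

```python
def ids_query_from_buckets(buckets, exclude_id):
    """Turn terms-aggregation buckets into the ids lookup shown above."""
    ids = [b["key"] for b in buckets if b["key"] != exclude_id]
    return {"query": {"filtered": {"filter": {"ids": {"values": ids}}}}}


# Buckets copied from the most_popular response above.
buckets = [
    {"key": 46970, "doc_count": 271},
    {"key": 2571, "doc_count": 197},
    {"key": 318, "doc_count": 196},
    {"key": 296, "doc_count": 183},
    {"key": 2959, "doc_count": 183},
    {"key": 260, "doc_count": 90},
]

query = ids_query_from_buckets(buckets, exclude_id=46970)
print(query["query"]["filtered"]["filter"]["ids"]["values"])
```

Because the lookup is a pure filter, the hits are not necessarily returned in bucket order, so a caller would probably want to re-sort the titles by the order of the bucket keys.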
And finally, we end up with the following list:
-
Matrix, The
-
Shawshank Redemption
-
Pulp Fiction
-
Fight Club
-
Star Wars Episode IV: A New Hope
OK, that is certainly a good list! I like all of those movies. But that’s the problem: almost everyone likes those movies. They are universally well liked, which means they show up in nearly everyone’s recommendations. The list is basically a recommendation of popular movies, not of movies related to Talladega Nights.
This is easily verified by running the aggregation again, but without the filter on Talladega Nights. This gives a list of the five most popular movies overall:
GET mlratings/_search
{
"size" : 0,
"aggs": {
"most_popular": {
"terms": {
"field": "movie",
"size": 5
}
}
}
}
This returns a list that is very similar:
-
Shawshank Redemption
-
Silence of the Lambs, The
-
Pulp Fiction
-
Forrest Gump
-
Star Wars Episode IV: A New Hope
Clearly, just checking the most popular movies is not sufficient to build a good, discriminating recommender.
Now that the scene is set, let’s try using significant_terms. significant_terms will analyze the group of people who enjoy Talladega Nights (the foreground group) and determine which movies are most popular. It will then construct a list of popular films for everyone (the background group) and compare the two.
The statistical anomalies will be the movies that are over-represented in the foreground compared to the background. Theoretically, this should be a list of comedies, since people who enjoy Will Ferrell comedies will recommend them at a higher rate than the background population of people.
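To make the foreground/background idea concrete, here is a toy sketch in plain Python. It ranks movies by a simple ratio of foreground rate to background rate, which is a deliberate simplification and not the scoring that significant_terms itself applies. The data and the function names `doc_counts` and `over_representation` are invented for illustration.

```python
from collections import Counter

# Toy stand-in for mlratings (movie IDs are made up): movie 1 is the input
# movie, movie 2 is a niche companion, movie 9 is liked by nearly everyone.
ratings = [
    {"user": 1, "movie": [1, 2, 9]},
    {"user": 2, "movie": [1, 2, 9]},
    {"user": 3, "movie": [1, 9]},
    {"user": 4, "movie": [9]},
    {"user": 5, "movie": [9]},
    {"user": 6, "movie": [9]},
    {"user": 7, "movie": [3, 9]},
    {"user": 8, "movie": [2]},
]


def doc_counts(docs):
    """Count, for every movie, how many documents mention it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc["movie"]))
    return counts


def over_representation(movie_id, docs):
    """Return {movie: foreground rate / background rate} for fans of movie_id."""
    foreground = [d for d in docs if movie_id in d["movie"]]
    fg, bg = doc_counts(foreground), doc_counts(docs)
    return {
        m: (fg[m] / len(foreground)) / (bg[m] / len(docs))
        for m in fg
        if m != movie_id
    }


lift = over_representation(1, ratings)
for movie, ratio in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(movie, round(ratio, 2))
```

In this toy data, movie 9 is the most popular film among fans of movie 1, yet movie 2 ranks higher here because nearly everyone likes movie 9 while movie 2 is concentrated in the foreground group.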
Let’s give it a shot:
GET mlratings/_search
{
"size" : 0,
"query": {
"filtered": {
"filter": {
"term": {
"movie": 46970
}
}
}
},
"aggs": {
"most_sig": {
"significant_terms": { (1)
"field": "movie",
"size": 6
}
}
}
}
-
The setup is nearly identical; we just use significant_terms instead of terms.
As you can see, the query is nearly the same. We filter for users who liked Talladega Nights; this forms the foreground group. By default, significant_terms will use the entire index as the background, so we don’t need to do anything special.
The results come back as a list of buckets similar to terms, but with some extra metadata:
...
"aggregations": {
"most_sig": {
"doc_count": 271, (1)
"buckets": [
{
"key": 46970,
"key_as_string": "46970",
"doc_count": 271,
"score": 256.549815498155,
"bg_count": 271
},
{
"key": 52245, (2)
"key_as_string": "52245",
"doc_count": 59, (3)
"score": 17.66462367106966,
"bg_count": 185 (4)
},
{
"key": 8641,
"key_as_string": "8641",
"doc_count": 107,
"score": 13.884387742677438,
"bg_count": 762
},
{
"key": 58156,
"key_as_string": "58156",
"doc_count": 17,
"score": 9.746428133759462,
"bg_count": 28
},
{
"key": 52973,
"key_as_string": "52973",
"doc_count": 95,
"score": 9.65770100311672,
"bg_count": 857
},
{
"key": 35836,
"key_as_string": "35836",
"doc_count": 128,
"score": 9.199001116457955,
"bg_count": 1610
}
]
...
-
The top-level doc_count shows the number of docs in the foreground group.
-
Each bucket lists the key (in this case, the movie ID) being aggregated.
-
A doc_count for that bucket.
-
And a background count, which shows how many documents in the entire background contain this value.
You can see that the first bucket we get back is Talladega Nights. It is found in all 271 documents, which is not surprising. Let’s look at the next bucket: key 52245.
This ID corresponds to Blades of Glory, a comedy about male figure skating that also stars Will Ferrell. We can see that it was recommended 59 times by the people who also liked Talladega Nights. This means that about 22% of the foreground group recommended Blades of Glory (59 / 271 = 0.2177).
In contrast, Blades of Glory was recommended only 185 times in the entire dataset, which equates to a mere 0.27% (185 / 69796 = 0.00265). Blades of Glory is therefore a statistical anomaly: it is uncommonly common in the group of people who like Talladega Nights. We just found a good recommendation!
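The same arithmetic also accounts for the score field. The sketch below assumes the JLH heuristic (the difference between the foreground and background rates, multiplied by their ratio), which we take to be the default for significant_terms; plugging in the counts from the response reproduces the reported scores. The function name `jlh_score` is our own.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH: (fg rate - bg rate) * (fg rate / bg rate)."""
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


FG_TOTAL = 271      # top-level doc_count: users who liked Talladega Nights
BG_TOTAL = 69796    # hits.total for mlratings

# (key, doc_count, bg_count) copied from the first three buckets above.
buckets = [
    (46970, 271, 271),
    (52245, 59, 185),
    (8641, 107, 762),
]

for key, doc_count, bg_count in buckets:
    score = jlh_score(doc_count, FG_TOTAL, bg_count, BG_TOTAL)
    print(f"{key}: fg={doc_count / FG_TOTAL:.4f} "
          f"bg={bg_count / BG_TOTAL:.5f} score={score:.4f}")
```

A term scores highly only when it is both much more frequent in the foreground than in the background and reasonably common in the foreground, which is why a universally popular film does not dominate the list.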
If we look at the entire list, they are all comedies that would fit as good recommendations (many of which also star Will Ferrell):
-
Blades of Glory
-
Anchorman: The Legend of Ron Burgundy
-
Semi-Pro
-
Knocked Up
-
40-Year-Old Virgin, The
This is just one example of the power of significant_terms. Once you start using significant_terms, you find many situations where you don’t want the most popular; you want the most uncommonly common. This simple aggregation can uncover some surprisingly sophisticated trends in your data.