Add boxplot documentation to close aggregations content gaps #7168

Closed
32 changes: 15 additions & 17 deletions _aggregations/index.md
@@ -13,21 +13,17 @@ redirect_from:

# Aggregations

OpenSearch isn’t just for search. Aggregations let you tap into OpenSearch's powerful analytics engine to analyze your data and extract statistics from it.

The use cases of aggregations vary from analyzing data in real time to take some action to using OpenSearch Dashboards to create a visualization dashboard.

OpenSearch can perform aggregations on massive datasets in milliseconds. Compared to queries, aggregations consume more CPU cycles and memory.
With OpenSearch aggregations, you can analyze data and extract statistics. Aggregations support real-time analysis, creating visualizations using OpenSearch Dashboards, and handling large datasets efficiently. Compared to queries, however, aggregations consume more computational resources like CPU and memory.

## Aggregations on text fields

By default, OpenSearch doesn't support aggregations on a text field. Because text fields are tokenized, an aggregation on a text field has to reverse the tokenization process back to its original string and then formulate an aggregation based on that. This kind of an operation consumes significant memory and degrades cluster performance.
By default, OpenSearch does not support aggregations on a text field. Because text fields are tokenized, an aggregation on a text field has to reverse the tokenization process back to its original string and then formulate an aggregation based on that. This kind of an operation consumes significant memory and degrades cluster performance.

While you can enable aggregations on text fields by setting the `fielddata` parameter to `true` in the mapping, the aggregations are still based on the tokenized words and not on the raw text.

We recommend keeping a raw version of the text field as a `keyword` field that you can aggregate on.
It is recommended that you keep a raw version of the text field as a `keyword` field that you can aggregate on.

In this case, you can perform aggregations on the `title.raw` field, instead of on the `title` field:
In this case, you can perform aggregations on the `title.raw` field, instead of on the `title` field, as shown in the following request:

```json
PUT movies
@@ -64,19 +60,19 @@ GET _search
}
```

If you’re only interested in the aggregation result and not in the results of the query, set `size` to 0.
If you are only interested in the aggregation result and not in the results of the query, set `size` to 0.

In the `aggs` property (you can use `aggregations` if you want), you can define any number of aggregations. Each aggregation is defined by its name and one of the types of aggregations that OpenSearch supports.

The name of the aggregation helps you to distinguish between different aggregations in the response. The `AGG_TYPE` property is where you specify the type of aggregation.
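
As a rough illustration of that structure, the following Python sketch assembles a request body containing a single named aggregation. The helper `build_agg_request` and the aggregation name `avg_taxful_total_price` are hypothetical; only `size`, `aggs`, and the `avg` type on the `taxful_total_price` field come from this page.

```python
import json


def build_agg_request(agg_name, agg_type, field, size=0):
    """Build a search body with one named aggregation.

    agg_name labels the result in the response; agg_type is the
    aggregation type (for example, "avg").
    """
    return {
        "size": size,  # 0 returns only the aggregation result, no hits
        "aggs": {agg_name: {agg_type: {"field": field}}},
    }


body = build_agg_request("avg_taxful_total_price", "avg", "taxful_total_price")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `GET <index>/_search` request.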

## Sample aggregation

This section uses the OpenSearch Dashboards sample ecommerce data and web log data. To add the sample data, log in to OpenSearch Dashboards, choose **Home**, and then choose **Try our sample data**. For **Sample eCommerce orders** and **Sample web logs**, choose **Add data**.
This section uses the OpenSearch Dashboards sample ecommerce data and web log data. To add the sample data, log in to OpenSearch Dashboards, choose **Home**, and then choose **Add sample data**. For **Sample eCommerce orders** and **Sample web logs**, choose **Add data**.

### avg

To find the average value of the `taxful_total_price` field:
To find the average value of the `taxful_total_price` field, enter the following request:

```json
GET opensearch_dashboards_sample_data_ecommerce/_search
@@ -124,22 +120,24 @@ The aggregation block in the response shows the average value for the `taxful_to

## Types of aggregations

There are three main types of aggregations:
There are three types of aggregations:

- Metric aggregations - Calculate metrics such as `sum`, `min`, `max`, and `avg` on numeric fields.
- Bucket aggregations - Sort query results into groups based on some criteria.
- Pipeline aggregations - Pipe the output of one aggregation as an input to another.
- [Metric aggregations]({{site.url}}{{site.baseurl}}/aggregations/metric/index/) - Calculate metrics such as `sum`, `min`, `max`, and `avg` on numeric fields.
- [Bucket aggregations]({{site.url}}{{site.baseurl}}/aggregations/bucket/index/) - Sort query results into groups based on some criteria.
- [Pipeline aggregations]({{site.url}}{{site.baseurl}}/aggregations/pipeline-agg/) - Pipe the output of one aggregation as an input to another.

## Nested aggregations

Aggregations within aggregations are called nested or subaggregations.
Aggregations within aggregations are called _nested_ or _subaggregations_.

Metric aggregations produce simple results and can't contain nested aggregations.
Metric aggregations produce simple results and cannot contain nested aggregations.

Bucket aggregations produce buckets of documents that you can nest in other aggregations. You can perform complex analysis on your data by nesting metric and bucket aggregations within bucket aggregations.

### General nested aggregation syntax

The following syntax is for a nested aggregation:

```json
{
"aggs": {
150 changes: 150 additions & 0 deletions _aggregations/metric/boxplot.md
@@ -0,0 +1,150 @@
---
layout: default
title: Boxplot
parent: Metric aggregations
grand_parent: Aggregations
nav_order: 15
---

# Boxplot

Check failure on line 9 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: Boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

A boxplot aggregation calculates the statistical distribution of a numeric field. It provides a summary of the data that includes the following key statistics: minimum value, first quartile, median, third quartile, and maximum value.

Check failure on line 11 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

## Syntax

The basic syntax for the boxplot aggregation is as follows:

Check failure on line 15 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

```json
{
"aggs": {
"boxplot_agg_name": {
"boxplot": {
"field": "numeric_field"
}
}
}
}
```
{% include copy-curl.html %}

Replace `boxplot_agg_name` with a descriptive name for your aggregation and `numeric_field` with the name of the numeric field you want to analyze.

## Example use case
Contributor Author


Developer: Please verify that the use case examples are relevant to the user and are accurate. If any changes are needed, please provide an updated example request. Thank you.


Let's say you have a dataset of website load times, and you want to analyze their distribution using the boxplot aggregation. Here's an example query:

Check failure on line 34 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

```json
GET website_logs/_search
{
"size": 0,
"aggs": {
"load_time_boxplot": {
"boxplot": {
"field": "load_time_ms"
}
}
}
}
```
{% include copy-curl.html %}

This query returns a response similar to the following:

```json
{
"aggregations": {
"load_time_boxplot": {
"min": 100.0,
"max": 5000.0,
"q1": 500.0,
"q2": 1000.0,
"q3": 2000.0
}
}
}
```
{% include copy-curl.html %}
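
One common way to read these five values is to derive the interquartile range (IQR) and outlier fences from them. The Python sketch below does this for the sample response above. The 1.5 × IQR rule is the conventional Tukey boxplot convention and is an assumption here, not something the aggregation is stated to return; `summarize_spread` is a hypothetical helper.

```python
def summarize_spread(boxplot, k=1.5):
    """Derive the IQR and Tukey-style fences from a five-number summary."""
    iqr = boxplot["q3"] - boxplot["q1"]
    return {
        "iqr": iqr,
        "lower_fence": boxplot["q1"] - k * iqr,
        "upper_fence": boxplot["q3"] + k * iqr,
    }


# Sample response copied from the example above.
response = {
    "aggregations": {
        "load_time_boxplot": {
            "min": 100.0,
            "max": 5000.0,
            "q1": 500.0,
            "q2": 1000.0,
            "q3": 2000.0,
        }
    }
}

spread = summarize_spread(response["aggregations"]["load_time_boxplot"])
print(spread)
```

With these sample values, the maximum of 5000.0 lies above the upper fence of 4250.0, so under this convention it would be flagged as a potential outlier.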

## Advanced options

The boxplot aggregation in OpenSearch offers several advanced options to customize its behavior:

Check failure on line 70 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

- Scripting: You can use scripts to transform or calculate values on the fly, allowing for more complex data processing.
- Compression: By adjusting the compression parameter, you can control the trade-off between memory usage and approximation accuracy.
- Missing value handling: You can specify how to treat documents with missing values in the target field.

These advanced options provide more control over the boxplot aggregation, allowing you to handle complex scenarios and tailor the analysis to your specific requirements.
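
As a minimal sketch of how these options fit into one request body, the following Python helper assembles a boxplot aggregation from whichever options are supplied. The helper name `boxplot_body` is hypothetical, and the parameter names (`field`, `script`, `compression`, `missing`) are taken from the examples on this page rather than verified against a running cluster.

```python
import json


def boxplot_body(agg_name, field=None, script=None, compression=None, missing=None):
    """Build a search body with one boxplot aggregation (hypothetical helper)."""
    if field is None and script is None:
        raise ValueError("provide a field, a script, or both")
    options = {
        "field": field,
        "script": {"source": script} if script is not None else None,
        "compression": compression,
        "missing": missing,
    }
    # Drop unset options; compare with None so that missing=0 is kept.
    boxplot = {key: value for key, value in options.items() if value is not None}
    return {"size": 0, "aggs": {agg_name: {"boxplot": boxplot}}}


body = boxplot_body(
    "load_time_boxplot", field="load_time_ms", compression=5000, missing=0
)
print(json.dumps(body, indent=2))
```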

Check failure on line 76 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

### Scripting

You can use the `script` parameter to perform custom calculations or transformations on the fly, for example, to analyze the square root of a numeric field.

#### Example request

```json
GET website_logs/_search
{
"size": 0,
"aggs": {
"load_time_boxplot": {
"boxplot": {
"script": {
"source": "Math.sqrt(doc['load_time_ms'].value)"
}
}
}
}
}
```
{% include copy-curl.html %}
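
To get a feel for what such a script changes, the following Python sketch applies the same square-root transformation locally before summarizing. It is only an illustration: the sample values are made up, `five_number_summary` is a hypothetical helper, and it computes exact quartiles with `statistics.quantiles`, whereas the aggregation may approximate them.

```python
import math
import statistics


def five_number_summary(values):
    """Return min, quartiles, and max using exact (inclusive) quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "q2": q2, "q3": q3, "max": max(values)}


# Made-up load times in milliseconds, for illustration only.
load_times_ms = [100.0, 400.0, 900.0, 1600.0, 2500.0]

# Equivalent of Math.sqrt(doc['load_time_ms'].value) applied per document.
transformed = [math.sqrt(value) for value in load_times_ms]

summary = five_number_summary(transformed)
print(summary)
```

The summary describes the transformed values, not the original field, so results are in units of the square root of milliseconds.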

### Compression

The `compression` parameter controls the trade-off between memory usage and accuracy for the boxplot calculation. A higher value provides better accuracy at the cost of higher memory usage, while a lower value reduces memory usage but produces coarser approximations. The default value is `3000`.

Check failure on line 103 in _aggregations/metric/boxplot.md (GitHub Actions / vale)

[OpenSearch.Spelling] Error: boxplot. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

#### Example request

```json
GET website_logs/_search
{
"size": 0,
"aggs": {
"load_time_boxplot": {
"boxplot": {
"field": "load_time_ms",
"compression": 5000
}
}
}
}
```
{% include copy-curl.html %}

### Missing value handling

By default, documents with missing values in the target field are ignored. However, you can specify how to handle them using the following parameters:

- `missing`: Treat missing values as if they were specified explicitly.
- `missing_inv`: Treat missing values as if they were infinite values.
- `missing_neg_value`: Treat missing values as if they had a specified negative value.
- `missing_pos_value`: Treat missing values as if they had a specified positive value.

#### Example request

```json
GET website_logs/_search
{
"size": 0,
"aggs": {
"load_time_boxplot": {
"boxplot": {
"field": "load_time_ms",
"missing": 0
}
}
}
}
```
{% include copy-curl.html %}

In this example, missing values in the `load_time_ms` field are treated as zeros.
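
Substituting a value for missing fields can shift the summary noticeably. The following Python sketch compares the two behaviors on a few made-up documents; `collect_load_times` is a hypothetical helper that mimics ignoring missing values (the default described above) versus replacing them with a fixed value.

```python
import statistics


def collect_load_times(docs, missing=None):
    """Collect load_time_ms values, skipping or substituting missing ones."""
    values = []
    for doc in docs:
        value = doc.get("load_time_ms")
        if value is None:
            value = missing  # stays None when no substitute is given
        if value is not None:
            values.append(float(value))
    return values


# Made-up documents; two of them have no load_time_ms field.
docs = [
    {"load_time_ms": 200},
    {"url": "/about"},
    {"load_time_ms": 800},
    {"load_time_ms": 500},
    {"url": "/home"},
]

ignored = collect_load_times(docs)             # missing values skipped
as_zero = collect_load_times(docs, missing=0)  # missing values counted as 0

print(min(ignored), statistics.median(ignored))
print(min(as_zero), statistics.median(as_zero))
```

Here the median drops from 500.0 to 200.0 once the two missing values are counted as zeros, which is worth keeping in mind when choosing a substitute value.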