Merge branch 'main' into set-processor
vagimeli authored May 30, 2024
2 parents 520aed5 + 9af765f commit 2609707
Showing 9 changed files with 467 additions and 8 deletions.
2 changes: 2 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -4,6 +4,8 @@ _Describe what this change achieves._
### Issues Resolved
_List any issues this PR will resolve, e.g. Closes [...]._

### Version
_List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all._

### Checklist
- [ ] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the [Developers Certificate of Origin](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).
158 changes: 158 additions & 0 deletions _aggregations/metric/median-absolute-deviation.md
@@ -0,0 +1,158 @@
---
layout: default
title: Median absolute deviation
parent: Metric aggregations
grand_parent: Aggregations
nav_order: 65
redirect_from:
- /query-dsl/aggregations/metric/median-absolute-deviation/
---

# Median absolute deviation aggregations

The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a field. Median absolute deviation is a statistical measure of data variability. Because it measures dispersion from the median rather than from the mean, it is a robust measure of variability that is less affected by outliers than measures such as standard deviation.

Median absolute deviation is calculated as follows:<br>
median_absolute_deviation = median(|X<sub>i</sub> - median(X)|)
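
As a minimal local sketch of the formula above, the following Python snippet computes an exact median absolute deviation for a small list. It is only an illustration: the function name and sample values are made up for this example, and OpenSearch itself returns an approximation rather than this exact computation.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical sample with one large outlier (100).
sample = [1, 2, 3, 4, 100]
print(median_absolute_deviation(sample))  # prints 1
```

The outlier barely moves the result, which is the robustness property described above.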

The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:


```json
GET opensearch_dashboards_sample_data_flights/_search
{
"size": 0,
"aggs": {
"median_absolute_deviation_DistanceMiles": {
"median_absolute_deviation": {
"field": "DistanceMiles"
}
}
}
}
```
{% include copy-curl.html %}

#### Example response

```json
{
"took": 35,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"median_absolute_deviation_DistanceMiles": {
"value": 1829.8993624441966
}
}
}
```

### Missing

By default, documents in which the field is missing or has a null value are ignored during computation. To use a substitute value for those documents instead, specify the `missing` parameter, as shown in the following request:

```json
GET opensearch_dashboards_sample_data_flights/_search
{
"size": 0,
"aggs": {
"median_absolute_deviation_distanceMiles": {
"median_absolute_deviation": {
"field": "DistanceMiles",
"missing": 1000
}
}
}
}
```
{% include copy-curl.html %}

#### Example response

```json
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"median_absolute_deviation_distanceMiles": {
"value": 1829.6443646143355
}
}
}
```
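
The effect of `missing` can be sketched locally in Python. This is an illustration under assumptions, not the OpenSearch implementation: the function name and the sample values are hypothetical, and `None` stands in for a missing or null field.

```python
from statistics import median


def mad_with_missing(values, missing=None):
    """Exact MAD in which None entries are skipped or replaced by `missing`."""
    if missing is None:
        present = [v for v in values if v is not None]  # default: ignore
    else:
        present = [missing if v is None else v for v in values]  # substitute
    center = median(present)
    return median([abs(v - center) for v in present])


# Hypothetical distances; two documents have no value.
distances = [100, 200, None, 400, None]
print(mad_with_missing(distances))                # prints 100
print(mad_with_missing(distances, missing=1000))  # prints 300
```

Substituting a value changes both the median and the deviations, so the result can shift noticeably when many documents lack the field.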

### Compression

The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances performance and estimation accuracy through the `compression` parameter (default value: `1000`). Lower `compression` values improve performance but may reduce estimation accuracy, while higher values improve accuracy at the cost of increased computational overhead. The following request sets `compression` to `10`:

```json
GET opensearch_dashboards_sample_data_flights/_search
{
"size": 0,
"aggs": {
"median_absolute_deviation_DistanceMiles": {
"median_absolute_deviation": {
"field": "DistanceMiles",
"compression": 10
}
}
}
}
```
{% include copy-curl.html %}

#### Example response

```json
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"median_absolute_deviation_DistanceMiles": {
"value": 1836.265614211182
}
}
}
```
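
To put the effect of `compression` in perspective, the two `value` results shown on this page (default compression and `compression` set to `10`) can be compared directly. The following Python sketch only does that arithmetic. Note that the default-compression result is itself an estimate, so this is a difference between two estimates, not a true error.

```python
# Values copied from the example responses on this page.
default_value = 1829.8993624441966          # default compression
low_compression_value = 1836.265614211182   # "compression": 10


def relative_difference(estimate, reference):
    """Absolute difference expressed as a fraction of the reference."""
    return abs(estimate - reference) / abs(reference)


diff = relative_difference(low_compression_value, default_value)
print(f"{diff:.2%}")
```

For this dataset, the two results differ by roughly a third of a percent.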
12 changes: 12 additions & 0 deletions _api-reference/nodes-apis/nodes-stats.md
@@ -53,6 +53,7 @@ script_cache | Statistics about script cache.
indexing_pressure | Statistics about the node's indexing pressure.
shard_indexing_pressure | Statistics about shard indexing pressure.
search_backpressure | Statistics related to search backpressure.
cluster_manager_throttling | Statistics related to throttled tasks on the cluster manager node.
resource_usage_stats | Node-level resource usage statistics, such as CPU and JVM memory.
admission_control | Statistics about admission control.
caches | Statistics about caches.
@@ -832,6 +833,7 @@ http.total_opened | Integer | The total number of HTTP connections the node has
[indexing_pressure](#indexing_pressure) | Object | Statistics related to the node's indexing pressure.
[shard_indexing_pressure](#shard_indexing_pressure) | Object | Statistics related to indexing pressure at the shard level.
[search_backpressure]({{site.url}}{{site.baseurl}}/opensearch/search-backpressure#search-backpressure-stats-api) | Object | Statistics related to search backpressure.
[cluster_manager_throttling](#cluster_manager_throttling) | Object | Statistics related to throttled tasks on the cluster manager node.
[resource_usage_stats](#resource_usage_stats) | Object | Statistics related to resource usage for the node.
[admission_control](#admission_control) | Object | Statistics related to admission control for the node.
[caches](#caches) | Object | Statistics related to caches on the node.
@@ -1282,6 +1284,16 @@ total_rejections_breakup_shadow_mode.throughput_degradation_limits | Integer | T
enabled | Boolean | Specifies whether the shard indexing pressure feature is turned on for the node.
enforced | Boolean | If true, the shard indexing pressure runs in enforced mode (there are rejections). If false, the shard indexing pressure runs in shadow mode (there are no rejections, but statistics are recorded and can be retrieved in the `total_rejections_breakup_shadow_mode` object). Only applicable if shard indexing pressure is enabled.

### `cluster_manager_throttling`

The `cluster_manager_throttling` object contains statistics about throttled tasks on the cluster manager node. It is populated only for the node that is currently elected as the cluster manager.

Field | Field type | Description
:--- | :--- | :---
stats | Object | Statistics about throttled tasks on the cluster manager node.
stats.total_throttled_tasks | Long | The total number of throttled tasks.
stats.throttled_tasks_per_task_type | Object | A breakdown of statistics by individual task type, specified as key-value pairs. The keys are individual task types, and their values represent the number of requests that were throttled.
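
As a rough sketch of how these fields might be consumed, the following Python function walks an already parsed node stats response and summarizes throttling per node. The node IDs, task type names, and counts in the sample are hypothetical placeholders, and the code assumes that nodes other than the elected cluster manager either omit the object or leave it empty.

```python
def summarize_throttling(nodes_stats):
    """Return {node_id: (total, [(task_type, count), ...])} for reporting nodes."""
    summary = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        stats = node.get("cluster_manager_throttling", {}).get("stats")
        if not stats:
            continue  # assumption: non-elected nodes report nothing here
        per_type = stats.get("throttled_tasks_per_task_type", {})
        # Most heavily throttled task types first.
        ranked = sorted(per_type.items(), key=lambda item: item[1], reverse=True)
        summary[node_id] = (stats.get("total_throttled_tasks", 0), ranked)
    return summary


# Hypothetical response fragment; IDs, task types, and counts are made up.
sample_response = {
    "nodes": {
        "node-a": {
            "cluster_manager_throttling": {
                "stats": {
                    "total_throttled_tasks": 5,
                    "throttled_tasks_per_task_type": {
                        "create-index": 2,
                        "put-mapping": 3,
                    },
                }
            }
        },
        "node-b": {},
    }
}
print(summarize_throttling(sample_response))
```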

### `resource_usage_stats`

The `resource_usage_stats` object contains the resource usage statistics. Each entry is specified by the node ID and has the following properties.
6 changes: 4 additions & 2 deletions _ingest-pipelines/processors/foreach.md
@@ -1,11 +1,13 @@
---
layout: default
title: `foreach`
title: Foreach
parent: Ingest processors
nav_order: 110
---

# `foreach` processor
<!-- vale off -->
# Foreach processor
<!-- vale on -->

The `foreach` processor is used to iterate over a list of values in an input document and apply a transformation to each value. This can be useful for tasks like processing all the elements in an array consistently, such as converting all string elements in an array to lowercase or uppercase.

135 changes: 135 additions & 0 deletions _ingest-pipelines/processors/join.md
@@ -0,0 +1,135 @@
---
layout: default
title: Join
parent: Ingest processors
nav_order: 160
---

# Join processor

The `join` processor concatenates the elements of an array into a single string value, using a specified separator between each element. It throws an exception if the provided input is not an array.

The following is the syntax for the `join` processor:

```json
{
"join": {
"field": "field_name",
"separator": "separator_string"
}
}
```
{% include copy-curl.html %}

## Configuration parameters

The following table lists the required and optional parameters for the `join` processor.

Parameter | Required/Optional | Description
:--- | :--- | :---
`field` | Required | The name of the field to which the join operation is applied. Must be an array.
`separator` | Required | The string separator to use when joining field values.
`target_field` | Optional | The field to which the joined value is assigned. If not specified, then the field is updated in place.
`description` | Optional | A description of the processor's purpose or configuration.
`if` | Optional | A condition for running the processor.
`ignore_failure` | Optional | Specifies whether to ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
`on_failure` | Optional | Specifies how to handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
`tag` | Optional | An identifier for the processor. Useful for debugging and metrics.
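
The behavior of `field`, `separator`, and `target_field` can be approximated locally with a short Python sketch. This is an illustration, not the processor's implementation: the function name is hypothetical, and converting each element to a string before joining is an assumption.

```python
def apply_join(document, field, separator, target_field=None):
    """Join the array in `field` into one string, as described in the table above."""
    value = document.get(field)
    if not isinstance(value, list):
        # The processor throws an exception for non-array input.
        raise ValueError(f"field [{field}] is not an array")
    result = dict(document)  # leave the caller's document untouched
    # Assumption: elements are converted to strings before joining.
    result[target_field or field] = separator.join(str(item) for item in value)
    return result


doc = {"uri": ["app", "home", "overview"]}
print(apply_join(doc, "uri", "/"))
# {'uri': 'app/home/overview'}
print(apply_join(doc, "uri", "/", target_field="path"))
# {'uri': ['app', 'home', 'overview'], 'path': 'app/home/overview'}
```

Without `target_field`, the array is replaced in place. With it, the original array is kept and the joined string is written to the new field.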

## Using the processor

Follow these steps to use the processor in a pipeline.

### Step 1: Create a pipeline

The following query creates a pipeline named `example-join-pipeline` that uses the `join` processor to concatenate all the values of the `uri` field, separating them with the specified separator `/`:

```json
PUT _ingest/pipeline/example-join-pipeline
{
"description": "Example pipeline using the join processor",
"processors": [
{
"join": {
"field": "uri",
"separator": "/"
}
}
]
}
```
{% include copy-curl.html %}

### Step 2 (Optional): Test the pipeline

It is recommended that you test your pipeline before you ingest documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/example-join-pipeline/_simulate
{
"docs": [
{
"_source": {
"uri": [
"app",
"home",
"overview"
]
}
}
]
}
```
{% include copy-curl.html %}

#### Response

The following example response confirms that the pipeline is working as expected:

```json
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_source": {
"uri": "app/home/overview"
},
"_ingest": {
"timestamp": "2024-05-24T02:16:01.00659117Z"
}
}
}
]
}
```
{% include copy-curl.html %}

### Step 3: Ingest a document

The following query ingests a document into an index named `testindex1`:

```json
POST testindex1/_doc/1?pipeline=example-join-pipeline
{
"uri": [
"app",
"home",
"overview"
]
}
```
{% include copy-curl.html %}

### Step 4 (Optional): Retrieve the document

To retrieve the document, run the following query:

```json
GET testindex1/_doc/1
```
{% include copy-curl.html %}
1 change: 1 addition & 0 deletions _install-and-configure/plugins.md
@@ -285,6 +285,7 @@ The following plugins are bundled with all OpenSearch distributions except for m
| Job Scheduler | [opensearch-job-scheduler](https://github.com/opensearch-project/job-scheduler) | 1.0.0 |
| k-NN | [opensearch-knn](https://github.com/opensearch-project/k-NN) | 1.0.0 |
| ML Commons | [opensearch-ml](https://github.com/opensearch-project/ml-commons) | 1.3.0 |
| Skills | [opensearch-skills](https://github.com/opensearch-project/skills) | 2.12.0 |
| Neural Search | [neural-search](https://github.com/opensearch-project/neural-search) | 2.4.0 |
| Observability | [opensearch-observability](https://github.com/opensearch-project/observability) | 1.2.0 |
| Performance Analyzer<sup>2</sup> | [opensearch-performance-analyzer](https://github.com/opensearch-project/performance-analyzer) | 1.0.0 |
26 changes: 21 additions & 5 deletions _search-plugins/knn/approximate-knn.md
@@ -127,10 +127,26 @@ GET my-knn-index-1/_search
}
```

`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which
indicates how many results the query actually returns. The plugin returns `k` amount of results for each shard
(and each segment) and `size` amount of results for the entire query. The plugin supports a maximum `k` value of 10,000.
Starting in OpenSearch 2.14, in addition to using the `k` variable, both the `min_score` and `max_distance` variables can be used for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
### The number of returned results

In the preceding query, `k` represents the number of neighbors returned by the search of each graph. You must also include the `size` option, indicating the final number of results that you want the query to return.

For the NMSLIB and Faiss engines, `k` represents the maximum number of documents returned for each segment of a shard. For the Lucene engine, `k` represents the number of documents returned for a shard. The maximum value of `k` is 10,000.

For any engine, each shard returns at most `size` results to the coordinator node. Thus, the maximum number of results that the coordinator node receives is `size * number of shards`. After the coordinator node consolidates the results received from all shards, the query returns the top `size` results.

The following table provides examples of the number of results returned by various engines in several scenarios. For these examples, assume that the number of documents contained in the segments and shards is sufficient to return the number of results specified in the table.

`size` | `k` | Number of primary shards | Number of segments per shard | Number of returned results, Faiss/NMSLIB | Number of returned results, Lucene
:--- | :--- | :--- | :--- | :--- | :---
10 | 1 | 1 | 4 | 4 | 1
10 | 10 | 1 | 4 | 10 | 10
10 | 1 | 2 | 4 | 8 | 2

The number of results returned by Faiss/NMSLIB differs from the number of results returned by Lucene only when `k` is smaller than `size`. If `k` and `size` are equal, all engines return the same number of results.
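
The rows in the preceding table can be reproduced with a small Python sketch. The function name and engine labels are hypothetical, and the logic reflects one reading of the table: Faiss and NMSLIB collect up to `k` results per segment, Lucene collects up to `k` per shard, each shard forwards at most `size` results, and the coordinator keeps the top `size`. As in the table, it assumes that every segment holds enough documents.

```python
def returned_results(size, k, shards, segments_per_shard, engine):
    """Estimate the number of results a k-NN query returns."""
    if engine in ("faiss", "nmslib"):
        per_shard_candidates = k * segments_per_shard  # k per segment
    elif engine == "lucene":
        per_shard_candidates = k  # k per shard
    else:
        raise ValueError(f"unknown engine: {engine}")
    per_shard = min(size, per_shard_candidates)  # each shard sends at most `size`
    return min(size, per_shard * shards)  # coordinator keeps the top `size`


# (size, k, shards, segments per shard), matching the table rows.
for row in [(10, 1, 1, 4), (10, 10, 1, 4), (10, 1, 2, 4)]:
    print(row, returned_results(*row, engine="faiss"), returned_results(*row, engine="lucene"))
```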

Starting in OpenSearch 2.14, you can use `k`, `min_score`, or `max_distance` for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).

### Building a k-NN index from a model

For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to train every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
@@ -311,4 +327,4 @@ included in the distance function.
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }
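
The reason a zero vector is invalid can be seen in a plain Python sketch of the cosine similarity formula. This is a local illustration, not the plugin's code: the function name is hypothetical, and raising `ValueError` is only a stand-in for the rejection described above.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero magnitude would make the denominator 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # prints 1.0
```

Calling `cosine_similarity([0.0, 0.0], [1.0, 0.0])` raises the error instead of dividing by 0.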