Merge branch 'main' into set-processor

opensearch-project · May 30, 2024 · 2609707 · 2609707
2 parents 520aed5 + 9af765f
commit 2609707
Showing 9 changed files with 467 additions and 8 deletions.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -4,6 +4,8 @@ _Describe what this change achieves._
 ### Issues Resolved
 _List any issues this PR will resolve, e.g. Closes [...]._
 
+### Version
+_List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all._
 
 ### Checklist
 - [ ] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the [Developers Certificate of Origin](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

diff --git a/_aggregations/metric/median-absolute-deviation.md b/_aggregations/metric/median-absolute-deviation.md
@@ -0,0 +1,158 @@
+---
+layout: default
+title: Median absolute deviation
+parent: Metric aggregations
+grand_parent: Aggregations
+nav_order: 65
+redirect_from:
+  - /query-dsl/aggregations/metric/median-absolute-deviation/
+---
+
+# Median absolute deviation aggregations
+
+The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a field. Median absolute deviation is a statistical measure of data variability. Because it measures dispersion from the median, it is a robust measure of variability that is less affected by outliers in a dataset.
+
+Median absolute deviation is calculated as follows:<br>
+median_absolute_deviation = median(|X<sub>i</sub> - median(X)|)
+
+The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:
+
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_DistanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 35,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_DistanceMiles": {
+      "value": 1829.8993624441966
+    }
+  }
+}
+```
+
+### Missing
+
+By default, if a field is missing or has a null value in a document, it is ignored during computation. However, you can specify a value to be used for those missing or null fields by using the `missing` parameter, as shown in the following request:
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_distanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles",
+        "missing": 1000
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 7,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_distanceMiles": {
+      "value": 1829.6443646143355
+    }
+  }
+}
+```
+
+### Compression
+
+The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances performance and estimation accuracy through the `compression` parameter (default value: `1000`). Lower `compression` values improve performance but may reduce estimation accuracy, while higher values improve accuracy at the cost of increased computational overhead. The following request sets `compression` to `10`:
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_DistanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles",
+        "compression": 10
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 1,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_DistanceMiles": {
+      "value": 1836.265614211182
+    }
+  }
+}
+```
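
Note on `median-absolute-deviation.md`: the formula given in this file can be checked by hand on a small dataset. The following Python sketch computes the exact median absolute deviation using only the standard library. The function name and the sample values are made up for illustration and are not taken from the flights dataset. The sketch is exact, whereas the aggregation returns a t-digest estimate, so results on real data may differ slightly.

```python
from statistics import median


def median_absolute_deviation(values):
    """Return median(|x_i - median(x)|) for a non-empty collection of numbers."""
    center = median(values)
    return median(abs(x - center) for x in values)


# Hypothetical sample data with one large outlier.
distances = [1, 2, 3, 4, 100]
print(median_absolute_deviation(distances))  # prints 1
```

The outlier `100` does not move the result: the median is `3`, the absolute deviations are `2, 1, 0, 1, 97`, and their median is `1`.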
diff --git a/_api-reference/nodes-apis/nodes-stats.md b/_api-reference/nodes-apis/nodes-stats.md
@@ -53,6 +53,7 @@ script_cache | Statistics about script cache.
 indexing_pressure | Statistics about the node's indexing pressure.
 shard_indexing_pressure | Statistics about shard indexing pressure.
 search_backpressure | Statistics related to search backpressure.
+cluster_manager_throttling | Statistics related to throttled tasks on the cluster manager node.
 resource_usage_stats | Node-level resource usage statistics, such as CPU and JVM memory.
 admission_control | Statistics about admission control.
 caches | Statistics about caches. 
@@ -832,6 +833,7 @@ http.total_opened | Integer | The total number of HTTP connections the node has
 [indexing_pressure](#indexing_pressure) | Object | Statistics related to the node's indexing pressure.
 [shard_indexing_pressure](#shard_indexing_pressure) | Object | Statistics related to indexing pressure at the shard level.
 [search_backpressure]({{site.url}}{{site.baseurl}}/opensearch/search-backpressure#search-backpressure-stats-api) | Object | Statistics related to search backpressure.
+[cluster_manager_throttling](#cluster_manager_throttling) | Object | Statistics related to throttled tasks on the cluster manager node.
 [resource_usage_stats](#resource_usage_stats) | Object | Statistics related to resource usage for the node.
 [admission_control](#admission_control) | Object | Statistics related to admission control for the node.
 [caches](#caches) | Object | Statistics related to caches on the node.
@@ -1282,6 +1284,16 @@ total_rejections_breakup_shadow_mode.throughput_degradation_limits | Integer | T
 enabled | Boolean | Specifies whether the shard indexing pressure feature is turned on for the node.
 enforced | Boolean | If true, the shard indexing pressure runs in enforced mode (there are rejections). If false, the shard indexing pressure runs in shadow mode (there are no rejections, but statistics are recorded and can be retrieved in the `total_rejections_breakup_shadow_mode` object). Only applicable if shard indexing pressure is enabled. 
 
+### `cluster_manager_throttling`
+
+The `cluster_manager_throttling` object contains statistics about throttled tasks on the cluster manager node. It is populated only for the node that is currently elected as the cluster manager.  
+
+Field | Field type | Description
+:--- | :--- | :---
+stats | Object | Statistics about throttled tasks on the cluster manager node.
+stats.total_throttled_tasks | Long | The total number of throttled tasks.
+stats.throttled_tasks_per_task_type | Object | A breakdown of statistics by individual task type, specified as key-value pairs. The keys are individual task types, and their values represent the number of requests that were throttled.
+
 ### `resource_usage_stats`
 
 The `resource_usage_stats` object contains the resource usage statistics. Each entry is specified by the node ID and has the following properties.

diff --git a/_ingest-pipelines/processors/foreach.md b/_ingest-pipelines/processors/foreach.md
@@ -1,11 +1,13 @@
 ---
 layout: default
-title: `foreach`
+title: Foreach
 parent: Ingest processors
 nav_order: 110
 ---
 
-# `foreach` processor
+<!-- vale off -->
+# Foreach processor
+<!-- vale on -->
 
 The `foreach` processor is used to iterate over a list of values in an input document and apply a transformation to each value. This can be useful for tasks like processing all the elements in an array consistently, such as converting all elements in a string to lowercase or uppercase.
 

diff --git a/_ingest-pipelines/processors/join.md b/_ingest-pipelines/processors/join.md
@@ -0,0 +1,135 @@
+---
+layout: default
+title: Join
+parent: Ingest processors
+nav_order: 160
+---
+
+# Join processor
+
+The `join` processor concatenates the elements of an array into a single string value, using a specified separator between each element. It throws an exception if the provided input is not an array.
+
+The following is the syntax for the `join` processor:
+
+```json
+{
+  "join": {
+    "field": "field_name",
+    "separator": "separator_string"
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Configuration parameters
+
+The following table lists the required and optional parameters for the `join` processor.
+
+Parameter | Required/Optional | Description |
+|-----------|-----------|-----------|
+`field` | Required | The name of the field to which the join operator is applied. Must be an array.
+`separator` | Required | A string separator to use when joining field values.
+`target_field` | Optional | The field to which the joined value is assigned. If not specified, then the field is updated in place.
+`description` | Optional | A description of the processor's purpose or configuration.
+`if` | Optional | Specifies to conditionally execute the processor.
+`ignore_failure` | Optional | Specifies to ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`on_failure` | Optional | Specifies to handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`tag` | Optional | An identifier for the processor. Useful for debugging and metrics.
+
+## Using the processor
+
+Follow these steps to use the processor in a pipeline.
+
+### Step 1: Create a pipeline
+
+The following query creates a pipeline named `example-join-pipeline` that uses the `join` processor to concatenate all the values of the `uri` field, separating them with the specified separator `/`:
+
+```json
+PUT _ingest/pipeline/example-join-pipeline  
+{  
+  "description": "Example pipeline using the join processor",  
+  "processors": [  
+    {  
+      "join": {  
+        "field": "uri",  
+        "separator": "/"  
+      }  
+    }  
+  ]  
+}  
+```
+{% include copy-curl.html %}
+
+### Step 2 (Optional): Test the pipeline
+
+It is recommended that you test your pipeline before you ingest documents.
+{: .tip}
+
+To test the pipeline, run the following query:
+
+```json
+POST _ingest/pipeline/example-join-pipeline/_simulate  
+{  
+  "docs": [  
+    {  
+      "_source": {  
+        "uri": [  
+          "app",  
+          "home",  
+          "overview"  
+        ]  
+      }  
+    }  
+  ]  
+}
+```
+{% include copy-curl.html %}
+
+#### Response
+
+The following example response confirms that the pipeline is working as expected:
+
+```json
+{  
+  "docs": [  
+    {  
+      "doc": {  
+        "_index": "_index",  
+        "_id": "_id",  
+        "_source": {  
+          "uri": "app/home/overview"  
+        },  
+        "_ingest": {  
+          "timestamp": "2024-05-24T02:16:01.00659117Z"  
+        }  
+      }  
+    }  
+  ]  
+}  
+```
+{% include copy-curl.html %}
+
+### Step 3: Ingest a document 
+
+The following query ingests a document into an index named `testindex1`:
+
+```json
+POST testindex1/_doc/1?pipeline=example-join-pipeline  
+{  
+  "uri": [  
+    "app",  
+    "home",  
+    "overview"  
+  ]  
+} 
+```
+{% include copy-curl.html %}
+
+### Step 4 (Optional): Retrieve the document
+
+To retrieve the document, run the following query:
+
+```json
+GET testindex1/_doc/1
+```
+{% include copy-curl.html %}
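
Note on `join.md`: the behavior described in this file (join the array elements with a separator, write the result in place or to `target_field`, and fail on non-array input) can be sketched in a few lines of Python. This is an illustration only, not the OpenSearch implementation. The function name `join_field` and the error message are hypothetical.

```python
def join_field(document, field, separator, target_field=None):
    """Mimic the documented join processor behavior on a plain dict.

    Illustrative sketch only; the error text below is made up.
    """
    value = document[field]
    if not isinstance(value, list):
        raise ValueError(f"field [{field}] is not an array and cannot be joined")
    result = dict(document)  # leave the input document untouched
    result[target_field or field] = separator.join(str(item) for item in value)
    return result


doc = {"uri": ["app", "home", "overview"]}
print(join_field(doc, "uri", "/"))  # {'uri': 'app/home/overview'}
```

Passing a `target_field` keeps the original array and stores the joined string under the new key, which mirrors the in-place versus `target_field` distinction in the parameter table.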
diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md
@@ -285,6 +285,7 @@ The following plugins are bundled with all OpenSearch distributions except for m
 | Job Scheduler | [opensearch-job-scheduler](https://github.com/opensearch-project/job-scheduler) | 1.0.0 |
 | k-NN | [opensearch-knn](https://github.com/opensearch-project/k-NN) | 1.0.0 |
 | ML Commons | [opensearch-ml](https://github.com/opensearch-project/ml-commons) | 1.3.0 |
+| Skills | [opensearch-skills](https://github.com/opensearch-project/skills) | 2.12.0 |
 | Neural Search | [neural-search](https://github.com/opensearch-project/neural-search) | 2.4.0 |
 | Observability | [opensearch-observability](https://github.com/opensearch-project/observability) | 1.2.0 |
 | Performance Analyzer<sup>2</sup> | [opensearch-performance-analyzer](https://github.com/opensearch-project/performance-analyzer) | 1.0.0 |

diff --git a/_search-plugins/knn/approximate-knn.md b/_search-plugins/knn/approximate-knn.md
@@ -127,10 +127,26 @@ GET my-knn-index-1/_search
 }
 ```
 
-`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which
-indicates how many results the query actually returns. The plugin returns `k` amount of results for each shard
-(and each segment) and `size` amount of results for the entire query. The plugin supports a maximum `k` value of 10,000.
-Starting in OpenSearch 2.14, in addition to using the `k` variable, both the `min_score` and `max_distance` variables can be used for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
+### The number of returned results
+
+In the preceding query, `k` represents the number of neighbors returned by the search of each graph. You must also include the `size` option, indicating the final number of results that you want the query to return.  
+
+For the NMSLIB and Faiss engines, `k` is applied to every segment of a shard, so a shard can return up to `k` documents per segment. For the Lucene engine, `k` represents the number of documents returned for a shard. The maximum value of `k` is 10,000.
+
+For any engine, each shard returns at most `size` results to the coordinator node. Thus, the total number of results that the coordinator node receives is at most `size * number of shards`. After the coordinator node consolidates the results received from all shards, the query returns the top `size` results.
+
+The following table provides examples of the number of results returned by various engines in several scenarios. For these examples, assume that the number of documents contained in the segments and shards is sufficient to return the number of results specified in the table.
+
+`size` | `k` | Number of primary shards | Number of segments per shard | Number of returned results, Faiss/NMSLIB | Number of returned results, Lucene
+:--- | :--- | :--- | :--- | :--- | :---
+10 | 1 | 1 | 4 | 4 | 1
+10 | 10 | 1 | 4 | 10 | 10
+10 | 1 | 2 | 4 | 8 | 2
+
+The number of results returned by Faiss/NMSLIB differs from the number of results returned by Lucene only when `k` is smaller than `size`. If `k` and `size` are equal, all engines return the same number of results. 
+
+Starting in OpenSearch 2.14, you can use `k`, `min_score`, or `max_distance` for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
+
 ### Building a k-NN index from a model
 
 For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to train every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
@@ -311,4 +327,4 @@ included in the distance function.
 With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
 such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
 containing the zero vector will be rejected and a corresponding exception will be thrown.
-{: .note }
+{: .note }
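
Note on `approximate-knn.md`: the result counts in the table added under "The number of returned results" follow from simple arithmetic. The Python sketch below is a model inferred from that table, not plugin code. It assumes that Faiss/NMSLIB apply `k` per segment, that Lucene applies `k` per shard, that each shard forwards at most `size` results, and that every segment holds enough documents. The function name `returned_results` is hypothetical.

```python
def returned_results(size, k, shards, segments_per_shard, engine):
    """Estimate the number of hits returned by an approximate k-NN query."""
    if engine in ("faiss", "nmslib"):
        per_shard = k * segments_per_shard  # k applies to each segment
    elif engine == "lucene":
        per_shard = k  # k applies to the shard as a whole
    else:
        raise ValueError(f"unknown engine: {engine}")
    per_shard = min(per_shard, size)  # a shard sends at most `size` results
    return min(size, per_shard * shards)  # coordinator keeps the top `size`


# The three rows of the table: (size, k, primary shards, segments per shard).
for row in [(10, 1, 1, 4), (10, 10, 1, 4), (10, 1, 2, 4)]:
    print(row, returned_results(*row, "faiss"), returned_results(*row, "lucene"))
```

Under these assumptions the sketch reproduces the table: 4 and 1, 10 and 10, and 8 and 2 for Faiss/NMSLIB and Lucene, respectively.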