Merge branch 'main' into main
vagimeli authored Jun 6, 2024
2 parents f8ef118 + 6e2f1d4 commit d498755
Showing 49 changed files with 5,323 additions and 118 deletions.
2 changes: 2 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -4,6 +4,8 @@ _Describe what this change achieves._
### Issues Resolved
_List any issues this PR will resolve, e.g. Closes [...]._

### Version
_List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all._

### Checklist
- [ ] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the [Developers Certificate of Origin](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).
1 change: 1 addition & 0 deletions .github/vale/styles/Vocab/OpenSearch/Products/accept.txt
@@ -8,6 +8,7 @@ Amazon SageMaker
Ansible
Auditbeat
AWS Cloud
Cohere Command
Cognito
Dashboards Query Language
Data Prepper
80 changes: 41 additions & 39 deletions STYLE_GUIDE.md

Large diffs are not rendered by default.

76 changes: 38 additions & 38 deletions TERMS.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions _about/version-history.md
@@ -29,6 +29,7 @@ OpenSearch version | Release highlights | Release date
[2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. | 16 June 2022
[2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022
[2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022
[1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024
[1.3.16](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.16.md) | Includes bug fixes and maintenance updates for OpenSearch security, index management, performance analyzer, and reporting. | 23 April 2024
[1.3.15](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.15.md) | Includes bug fixes and maintenance updates for cross-cluster replication, SQL, OpenSearch Dashboards reporting, and alerting. | 05 March 2024
[1.3.14](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.14.md) | Includes bug fixes and maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 12 December 2023
158 changes: 158 additions & 0 deletions _aggregations/metric/median-absolute-deviation.md
@@ -0,0 +1,158 @@
---
layout: default
title: Median absolute deviation
parent: Metric aggregations
grand_parent: Aggregations
nav_order: 65
redirect_from:
- /query-dsl/aggregations/metric/median-absolute-deviation/
---

# Median absolute deviation aggregations

The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a numeric field's values. Median absolute deviation is a statistical measure of data variability. Because it measures dispersion from the median, it is a robust measure of variability that is less affected by outliers in a dataset.

Median absolute deviation is calculated as follows:<br>
median_absolute_deviation = median(|X<sub>i</sub> - median(X)|)
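
For example, for the values {1, 2, 3, 4, 9}, the median is 3 and the absolute deviations from the median are {2, 1, 0, 1, 6}, so the median absolute deviation (the median of those deviations) is 1.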

The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:


```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_DistanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles"
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 35,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_DistanceMiles": {
      "value": 1829.8993624441966
    }
  }
}
```

### Missing

By default, if a field is missing or has a null value in a document, it is ignored during computation. However, you can specify a value to be used for those missing or null fields by using the `missing` parameter, as shown in the following request:

```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_distanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles",
        "missing": 1000
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_distanceMiles": {
      "value": 1829.6443646143355
    }
  }
}
```

### Compression

The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances performance and estimation accuracy through the `compression` parameter (default value: `1000`). Lower `compression` values improve performance but may reduce estimation accuracy; higher values improve accuracy at the cost of greater computational overhead. The following request sets `compression` to `10`:

```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_DistanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles",
        "compression": 10
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_DistanceMiles": {
      "value": 1836.265614211182
    }
  }
}
```
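
Note that with `compression` set to `10`, the returned estimate (`1836.27`) differs from the default-compression estimate (`1829.90`) in the first example, illustrating the accuracy cost of lower `compression` values.
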
23 changes: 23 additions & 0 deletions _api-reference/nodes-apis/nodes-stats.md
@@ -53,6 +53,8 @@ script_cache | Statistics about script cache.
indexing_pressure | Statistics about the node's indexing pressure.
shard_indexing_pressure | Statistics about shard indexing pressure.
search_backpressure | Statistics related to search backpressure.
cluster_manager_throttling | Statistics related to throttled tasks on the cluster manager node.
weighted_routing | Statistics relevant to weighted round robin requests.
resource_usage_stats | Node-level resource usage statistics, such as CPU and JVM memory.
admission_control | Statistics about admission control.
caches | Statistics about caches.
@@ -832,6 +834,8 @@ http.total_opened | Integer | The total number of HTTP connections the node has opened.
[indexing_pressure](#indexing_pressure) | Object | Statistics related to the node's indexing pressure.
[shard_indexing_pressure](#shard_indexing_pressure) | Object | Statistics related to indexing pressure at the shard level.
[search_backpressure]({{site.url}}{{site.baseurl}}/opensearch/search-backpressure#search-backpressure-stats-api) | Object | Statistics related to search backpressure.
[cluster_manager_throttling](#cluster_manager_throttling) | Object | Statistics related to throttled tasks on the cluster manager node.
[weighted_routing](#weighted_routing) | Object | Statistics relevant to weighted round robin requests.
[resource_usage_stats](#resource_usage_stats) | Object | Statistics related to resource usage for the node.
[admission_control](#admission_control) | Object | Statistics related to admission control for the node.
[caches](#caches) | Object | Statistics related to caches on the node.
@@ -1282,6 +1286,25 @@ total_rejections_breakup_shadow_mode.throughput_degradation_limits | Integer | T
enabled | Boolean | Specifies whether the shard indexing pressure feature is turned on for the node.
enforced | Boolean | If true, the shard indexing pressure runs in enforced mode (there are rejections). If false, the shard indexing pressure runs in shadow mode (there are no rejections, but statistics are recorded and can be retrieved in the `total_rejections_breakup_shadow_mode` object). Only applicable if shard indexing pressure is enabled.

### `cluster_manager_throttling`

The `cluster_manager_throttling` object contains statistics about throttled tasks on the cluster manager node. It is populated only for the node that is currently elected as the cluster manager.

Field | Field type | Description
:--- | :--- | :---
stats | Object | Statistics about throttled tasks on the cluster manager node.
stats.total_throttled_tasks | Long | The total number of throttled tasks.
stats.throttled_tasks_per_task_type | Object | A breakdown of statistics by individual task type, specified as key-value pairs. The keys are individual task types, and their values represent the number of requests that were throttled.
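
As an illustration, the `cluster_manager_throttling` section of a response might look like the following hypothetical fragment (the task type and counts are invented for this example):

```json
"cluster_manager_throttling": {
  "stats": {
    "total_throttled_tasks": 18,
    "throttled_tasks_per_task_type": {
      "put-mapping": 18
    }
  }
}
```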

### `weighted_routing`

The `weighted_routing` object contains statistics about weighted round robin requests. Specifically, it contains a counter of the number of times this node has served a request while its routing weight was set to zero.

Field | Field type | Description
:--- | :--- | :---
stats | Object | Statistics about weighted routing.
fail_open_count | Integer | The number of times a shard on this node has served a request while the routing weight for the node was set to zero.
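
As an illustration, the `weighted_routing` section of a response might look like the following hypothetical fragment (the count is invented for this example):

```json
"weighted_routing": {
  "stats": {
    "fail_open_count": 3
  }
}
```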

### `resource_usage_stats`

The `resource_usage_stats` object contains the resource usage statistics. Each entry is specified by the node ID and has the following properties.
2 changes: 1 addition & 1 deletion _benchmark/quickstart.md
@@ -18,7 +18,7 @@ To perform the Quickstart steps, you'll need to fulfill the following prerequisites

## Set up an OpenSearch cluster

If you don't already have an active OpenSearch cluster, you can launch a new OpenSearch cluster to use with OpenSerch Benchmark.
If you don't already have an active OpenSearch cluster, you can launch a new OpenSearch cluster to use with OpenSearch Benchmark.

- Using **Docker Compose**. For instructions on how to use Docker Compose, see [OpenSearch Quickstart]({{site.url}}{{site.baseurl}}/quickstart/).
- Using **Tar**. For instructions on how to install OpenSearch with Tar, see [Installing OpenSearch > Tarball]({{site.url}}{{site.baseurl}}/install-and-configure/install-opensearch/tar#step-1-download-and-unpack-opensearch).
1 change: 1 addition & 0 deletions _dashboards/dashboards-assistant/index.md
@@ -122,3 +122,4 @@ The following screenshot shows a saved conversation, along with actions you can

- [Getting started guide for OpenSearch Assistant in OpenSearch Dashboards](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md)
- [OpenSearch Assistant configuration through the REST API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/opensearch-assistant/)
- [Build your own chatbot]({{site.url}}{{site.baseurl}}/ml-commons-plugin/tutorials/build-chatbot/)
@@ -45,3 +45,6 @@ The [`newline` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/config

[Apache Avro] helps streamline streaming data pipelines. It is most efficient when used with the [`avro` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/s3#avro-codec) inside an `s3` sink.

## `event_json`

The `event_json` output codec converts event data and metadata into JSON format to send to a sink, such as an S3 sink. The `event_json` input codec reads the event and its metadata to create an event in Data Prepper.
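
As a minimal sketch, an `s3` sink entry using the `event_json` codec might look like the following (the bucket name and region are placeholders, and other required `s3` sink settings are omitted):

```yaml
sink:
  - s3:
      bucket: example-bucket   # placeholder bucket name
      aws:
        region: us-east-1      # placeholder AWS Region
      codec:
        event_json: {}         # write each event and its metadata as JSON
```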
@@ -20,6 +20,7 @@ Option | Required | Type | Description
identification_keys | Yes | List | An unordered list by which to group events. Events with the same values as these keys are put into the same group. If an event does not contain one of the `identification_keys`, then the value of that key is considered to be equal to `null`. At least one identification_key is required (for example, `["sourceIp", "destinationIp", "port"]`).
action | Yes | AggregateAction | The action to be performed on each group. One of the [available aggregate actions](#available-aggregate-actions) must be provided, or you can create custom aggregate actions. `remove_duplicates` and `put_all` are the available actions. For more information, see [Creating New Aggregate Actions](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/aggregate-processor#creating-new-aggregate-actions).
group_duration | No | String | The amount of time that a group should exist before it is concluded automatically. Supports ISO_8601 notation strings ("PT20.345S", "PT15M", etc.) as well as simple notation for seconds (`"60s"`) and milliseconds (`"1500ms"`). Default value is `180s`.
local_mode | No | Boolean | When `local_mode` is set to `true`, the aggregation is performed locally on each Data Prepper node instead of forwarding events to a specific node based on the `identification_keys` using a hash function. Default is `false`.
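
As an illustration, an `aggregate` processor entry that aggregates locally on each Data Prepper node might look like the following sketch (the keys and action are illustrative):

```yaml
processor:
  - aggregate:
      identification_keys: ["sourceIp", "destinationIp"]  # group events that share these values
      action:
        put_all:               # combine each group's events into a single event
      group_duration: "60s"    # conclude a group after 60 seconds
      local_mode: true         # aggregate on this node instead of forwarding by hash
```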

## Available aggregate actions

@@ -176,4 +177,4 @@ The `aggregate` processor includes the following custom metrics.

**Gauge**

* `currentAggregateGroups`: The current number of groups. This gauge decreases when a group concludes and increases when an event initiates the creation of a new group.
* `currentAggregateGroups`: This gauge represents the current number of active aggregate groups. It decreases when an aggregate group is completed and its results are emitted and increases when a new event initiates the creation of a new aggregate group.
@@ -33,6 +33,8 @@ You can use the `key_value` processor to parse the specified field into key-value pairs.
| recursive | Specifies whether to recursively obtain additional key-value pairs from values. The extra key-value pairs will be stored as sub-keys of the root key. Default is `false`. The levels of recursive parsing must be defined by different brackets for each level: `[]`, `()`, and `<>`, in this order. Any other configurations specified will only be applied to the outmost keys. <br />When `recursive` is `true`: <br /> `remove_brackets` cannot also be `true`;<br />`skip_duplicate_values` will always be `true`; <br />`whitespace` will always be `"strict"`. | If `recursive` is true, `{"item1=[item1-subitem1=item1-subitem1-value&item1-subitem2=(item1-subitem2-subitem2A=item1-subitem2-subitem2A-value&item1-subitem2-subitem2B=item1-subitem2-subitem2B-value)]&item2=item2-value"}` will parse into `{"item1": {"item1-subitem1": "item1-subitem1-value", "item1-subitem2" {"item1-subitem2-subitem2A": "item1-subitem2-subitem2A-value", "item1-subitem2-subitem2B": "item1-subitem2-subitem2B-value"}}}`. |
| overwrite_if_destination_exists | Specifies whether to overwrite existing fields if there are key conflicts when writing parsed fields to the event. Default is `true`. | If `overwrite_if_destination_exists` is `true` and destination is `null`, `{"key1": "old_value", "message": "key1=new_value"}` will parse into `{"key1": "new_value", "message": "key1=new_value"}`. |
| tags_on_failure | When a `kv` operation causes a runtime exception within the processor, the operation is safely stopped without crashing the processor, and the event is tagged with the provided tags. | If `tags_on_failure` is set to `["keyvalueprocessor_failure"]`, `{"tags": ["keyvalueprocessor_failure"]}` will be added to the event's metadata in the event of a runtime exception. |
| value_grouping | Specifies whether to group values using predefined value grouping delimiters: `{...}`, `[...]`, `<...>`, `(...)`, `"..."`, `'...'`, `http://... (space)`, and `https:// (space)`. If this flag is enabled, then the content between the delimiters is considered to be one entity and is not parsed for key-value pairs. Default is `false`. | If `value_grouping` is `true`, then `{"key1=[a=b,c=d]&key2=value2"}` parses to `{"key1": "[a=b,c=d]", "key2": "value2"}`. |
| drop_keys_with_no_value | Specifies whether keys should be dropped if they have a null value. Default is `false`. | If `drop_keys_with_no_value` is set to `true`, then `{"key1=value1&key2"}` parses to `{"key1": "value1"}`. |
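
As an illustration, a `key_value` processor entry combining these two options might look like the following sketch (the `source` field name is illustrative):

```yaml
processor:
  - key_value:
      source: "message"               # field containing the raw key-value text
      value_grouping: true            # keep bracketed or quoted content together as one value
      drop_keys_with_no_value: true   # drop keys that parse to a null value
```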



@@ -42,4 +44,4 @@ Content will be added to this section.
## Metrics
Content will be added to this section. --->
18 changes: 18 additions & 0 deletions _data-prepper/pipelines/configuration/processors/write_json.md
@@ -0,0 +1,18 @@
---
layout: default
title: write_json
parent: Processors
grand_parent: Pipelines
nav_order: 56
---

# write_json


The `write_json` processor converts an object in an event into a JSON string. You can customize the processor to choose the source and target field names.

| Option | Description | Example |
| :--- | :--- | :--- |
| source | Mandatory field that specifies the name of the field in the event containing the message or object to be parsed. | If `source` is set to `"message"` and the input is `{"message": {"key1":"value1", "key2":{"key3":"value3"}}}`, then the `write_json` processor generates `{"message": "{\"key1\":\"value1\", \"key2\":{\"key3\":\"value3\"}}"}`. |
| target | An optional field that specifies the name of the field in which the resulting JSON string should be stored. If `target` is not specified, then the `source` field is used. | |
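
As an illustration, a `write_json` processor entry might look like the following sketch (the field names are illustrative):

```yaml
processor:
  - write_json:
      source: "message"        # event field holding the object to serialize
      target: "message_json"   # store the JSON string here instead of overwriting "message"
```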

2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/s3.md
@@ -104,7 +104,7 @@ Option | Required | Type | Description
`s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration.
`scan` | No | [scan](#scan) | The S3 scan configuration.
`delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`.
`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leaving this value at the default unless your S3 objects are less than 1MB. Performance may decrease for larger S3 objects. This setting only affects SQS-based sources. Default is `1`.
`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leave this value as the default unless your S3 objects are less than 1 MB in size. Performance may decrease for larger S3 objects. This setting affects SQS-based sources and S3-Scan sources. Default is `1`.



19 changes: 18 additions & 1 deletion _data-prepper/pipelines/expression-syntax.md
@@ -114,7 +114,23 @@ null != <JSON Pointer>
null != /response
```

#### Conditional expression
## Type check operator

The type check operator tests whether the value at a JSON Pointer is of a certain data type.

### Syntax
```
<JSON Pointer> typeof <DataType>
```
Supported data types are `integer`, `long`, `boolean`, `double`, `string`, `map`, and `array`.

#### Example
```
/response typeof integer
/message typeof string
```

### Conditional expression

A conditional expression is used to chain together multiple expressions and/or values.

@@ -218,6 +234,7 @@ White space is **required** surrounding set initializers, priority expressions,
| `==`, `!=` | Equality operators | No | `/status == 200`<br>`/status_code==200` | |
| `and`, `or`, `not` | Conditional operators | Yes | `/a<300 and /b>200` | `/b<300and/b>200` |
| `,` | Set value delimiter | No | `/a in {200, 202}`<br>`/a in {200,202}`<br>`/a in {200 , 202}` | `/a in {200,}` |
| `typeof` | Type check operator | Yes | `/a typeof integer`<br>`/a typeof long`<br>`/a typeof string`<br> `/a typeof double`<br> `/a typeof boolean`<br>`/a typeof map`<br>`/a typeof array` |`/a typeof /b`<br>`/a typeof 2` |


## Functions