diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 7eccae7052..bbf3b8d035 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -4,6 +4,8 @@ _Describe what this change achieves._
 ### Issues Resolved
 _List any issues this PR will resolve, e.g. Closes [...]._
+### Version
+_List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all._
 ### Checklist
 - [ ] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the [Developers Certificate of Origin](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).
diff --git a/_aggregations/metric/median-absolute-deviation.md b/_aggregations/metric/median-absolute-deviation.md
new file mode 100644
index 0000000000..7332d7eb2f
--- /dev/null
+++ b/_aggregations/metric/median-absolute-deviation.md
@@ -0,0 +1,158 @@
+---
+layout: default
+title: Median absolute deviation
+parent: Metric aggregations
+grand_parent: Aggregations
+nav_order: 65
+redirect_from:
+  - /query-dsl/aggregations/metric/median-absolute-deviation/
+---
+
+# Median absolute deviation aggregations
+
+The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a field. Median absolute deviation is a statistical measure of data variability. Because the median absolute deviation measures dispersion from the median, it provides a more robust measure of variability that is less affected by outliers in a dataset.
+
+Median absolute deviation is calculated as follows:
+median_absolute_deviation = median(|X<sub>i</sub> - Median(X<sub>i</sub>)|)
+
+The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:
+
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_DistanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 35,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_DistanceMiles": {
+      "value": 1829.8993624441966
+    }
+  }
+}
+```
+
+### Missing
+
+By default, if a field is missing or has a null value in a document, it is ignored during computation. However, you can specify a value to be used for those missing or null fields by using the `missing` parameter, as shown in the following request:
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_distanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles",
+        "missing": 1000
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 7,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_distanceMiles": {
+      "value": 1829.6443646143355
+    }
+  }
+}
+```
+
+### Compression
+
+The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances performance and estimation accuracy through the `compression`
parameter (default value: `1000`). Adjusting the `compression` value affects the trade-off between computational efficiency and precision. Lower `compression` values improve performance but may reduce estimation accuracy, while higher values enhance accuracy at the cost of increased computational overhead, as shown in the following request:
+
+```json
+GET opensearch_dashboards_sample_data_flights/_search
+{
+  "size": 0,
+  "aggs": {
+    "median_absolute_deviation_DistanceMiles": {
+      "median_absolute_deviation": {
+        "field": "DistanceMiles",
+        "compression": 10
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+#### Example response
+
+```json
+{
+  "took": 1,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 10000,
+      "relation": "gte"
+    },
+    "max_score": null,
+    "hits": []
+  },
+  "aggregations": {
+    "median_absolute_deviation_DistanceMiles": {
+      "value": 1836.265614211182
+    }
+  }
+}
+```
diff --git a/_api-reference/nodes-apis/nodes-stats.md b/_api-reference/nodes-apis/nodes-stats.md
index f28d30c0af..1d504aae2e 100644
--- a/_api-reference/nodes-apis/nodes-stats.md
+++ b/_api-reference/nodes-apis/nodes-stats.md
@@ -53,6 +53,7 @@ script_cache | Statistics about script cache.
 indexing_pressure | Statistics about the node's indexing pressure.
 shard_indexing_pressure | Statistics about shard indexing pressure.
 search_backpressure | Statistics related to search backpressure.
+cluster_manager_throttling | Statistics related to throttled tasks on the cluster manager node.
 resource_usage_stats | Node-level resource usage statistics, such as CPU and JVM memory.
 admission_control | Statistics about admission control.
 caches | Statistics about caches.
@@ -832,6 +833,7 @@ http.total_opened | Integer | The total number of HTTP connections the node has
 [indexing_pressure](#indexing_pressure) | Object | Statistics related to the node's indexing pressure.
 [shard_indexing_pressure](#shard_indexing_pressure) | Object | Statistics related to indexing pressure at the shard level.
 [search_backpressure]({{site.url}}{{site.baseurl}}/opensearch/search-backpressure#search-backpressure-stats-api) | Object | Statistics related to search backpressure.
+[cluster_manager_throttling](#cluster_manager_throttling) | Object | Statistics related to throttled tasks on the cluster manager node.
 [resource_usage_stats](#resource_usage_stats) | Object | Statistics related to resource usage for the node.
 [admission_control](#admission_control) | Object | Statistics related to admission control for the node.
 [caches](#caches) | Object | Statistics related to caches on the node.
@@ -1282,6 +1284,16 @@ total_rejections_breakup_shadow_mode.throughput_degradation_limits | Integer | T
 enabled | Boolean | Specifies whether the shard indexing pressure feature is turned on for the node.
 enforced | Boolean | If true, the shard indexing pressure runs in enforced mode (there are rejections). If false, the shard indexing pressure runs in shadow mode (there are no rejections, but statistics are recorded and can be retrieved in the `total_rejections_breakup_shadow_mode` object). Only applicable if shard indexing pressure is enabled.
+### `cluster_manager_throttling`
+
+The `cluster_manager_throttling` object contains statistics about throttled tasks on the cluster manager node. It is populated only for the node that is currently elected as the cluster manager.
+
+Field | Field type | Description
+:--- | :--- | :---
+stats | Object | Statistics about throttled tasks on the cluster manager node.
+stats.total_throttled_tasks | Long | The total number of throttled tasks.
+stats.throttled_tasks_per_task_type | Object | A breakdown of statistics by individual task type, specified as key-value pairs. The keys are individual task types, and their values represent the number of requests that were throttled.
+
 ### `resource_usage_stats`
 The `resource_usage_stats` object contains the resource usage statistics. Each entry is specified by the node ID and has the following properties.
diff --git a/_ingest-pipelines/processors/foreach.md b/_ingest-pipelines/processors/foreach.md
index 72a0ed1420..d0f962e618 100644
--- a/_ingest-pipelines/processors/foreach.md
+++ b/_ingest-pipelines/processors/foreach.md
@@ -1,11 +1,13 @@
 ---
 layout: default
-title: `foreach`
+title: Foreach
 parent: Ingest processors
 nav_order: 110
 ---
-# `foreach` processor
+
+# Foreach processor
+
 The `foreach` processor is used to iterate over a list of values in an input document and apply a transformation to each value. This can be useful for tasks like processing all the elements in an array consistently, such as converting all elements in a string to lowercase or uppercase.
diff --git a/_ingest-pipelines/processors/join.md b/_ingest-pipelines/processors/join.md
new file mode 100644
index 0000000000..c2cdcfe4de
--- /dev/null
+++ b/_ingest-pipelines/processors/join.md
@@ -0,0 +1,135 @@
+---
+layout: default
+title: Join
+parent: Ingest processors
+nav_order: 160
+---
+
+# Join processor
+
+The `join` processor concatenates the elements of an array into a single string value, using a specified separator between each element. It throws an exception if the provided input is not an array.
+
+The following is the syntax for the `join` processor:
+
+```json
+{
+  "join": {
+    "field": "field_name",
+    "separator": "separator_string"
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Configuration parameters
+
+The following table lists the required and optional parameters for the `join` processor.
+
+Parameter | Required/Optional | Description |
+|-----------|-----------|-----------|
+`field` | Required | The name of the field to which the `join` processor is applied. Must be an array.
+`separator` | Required | A string separator to use when joining field values.
+`target_field` | Optional | The field to assign the joined value to. If not specified, then the field is updated in place.
+`description` | Optional | A description of the processor's purpose or configuration.
+`if` | Optional | Specifies to conditionally execute the processor.
+`ignore_failure` | Optional | Specifies to ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`on_failure` | Optional | Specifies to handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`tag` | Optional | An identifier for the processor. Useful for debugging and metrics.
+
+## Using the processor
+
+Follow these steps to use the processor in a pipeline.
+
+### Step 1: Create a pipeline
+
+The following query creates a pipeline named `example-join-pipeline` that uses the `join` processor to concatenate all the values of the `uri` field, separating them with the specified separator `/`:
+
+```json
+PUT _ingest/pipeline/example-join-pipeline
+{
+  "description": "Example pipeline using the join processor",
+  "processors": [
+    {
+      "join": {
+        "field": "uri",
+        "separator": "/"
+      }
+    }
+  ]
+}
+```
+{% include copy-curl.html %}
+
+### Step 2 (Optional): Test the pipeline
+
+It is recommended that you test your pipeline before you ingest documents.
+{: .tip}
+
+To test the pipeline, run the following query:
+
+```json
+POST _ingest/pipeline/example-join-pipeline/_simulate
+{
+  "docs": [
+    {
+      "_source": {
+        "uri": [
+          "app",
+          "home",
+          "overview"
+        ]
+      }
+    }
+  ]
+}
+```
+{% include copy-curl.html %}
+
+#### Response
+
+The following example response confirms that the pipeline is working as expected:
+
+```json
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_source": {
+          "uri": "app/home/overview"
+        },
+        "_ingest": {
+          "timestamp": "2024-05-24T02:16:01.00659117Z"
+        }
+      }
+    }
+  ]
+}
+```
+{% include copy-curl.html %}
+
+### Step 3: Ingest a document
+
+The following query ingests a document into an index named `testindex1`:
+
+```json
+POST testindex1/_doc/1?pipeline=example-join-pipeline
+{
+  "uri": [
+    "app",
+    "home",
+    "overview"
+  ]
+}
+```
+{% include copy-curl.html %}
+
+### Step 4 (Optional): Retrieve the document
+
+To retrieve the document, run the following query:
+
+```json
+GET testindex1/_doc/1
+```
+{% include copy-curl.html %}
diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md
index 6b0b28769e..bbfbce9796 100644
--- a/_install-and-configure/plugins.md
+++ b/_install-and-configure/plugins.md
@@ -285,6 +285,7 @@ The following plugins are bundled with all OpenSearch distributions except for m
 | Job Scheduler | [opensearch-job-scheduler](https://github.com/opensearch-project/job-scheduler) | 1.0.0 |
 | k-NN | [opensearch-knn](https://github.com/opensearch-project/k-NN) | 1.0.0 |
 | ML Commons | [opensearch-ml](https://github.com/opensearch-project/ml-commons) | 1.3.0 |
+| Skills | [opensearch-skills](https://github.com/opensearch-project/skills) | 2.12.0 |
 | Neural Search | [neural-search](https://github.com/opensearch-project/neural-search) | 2.4.0 |
 | Observability | [opensearch-observability](https://github.com/opensearch-project/observability) | 1.2.0 |
 | Performance Analyzer<sup>2</sup> |
[opensearch-performance-analyzer](https://github.com/opensearch-project/performance-analyzer) | 1.0.0 |
diff --git a/_search-plugins/knn/approximate-knn.md b/_search-plugins/knn/approximate-knn.md
index 7d3e119349..c0a9557728 100644
--- a/_search-plugins/knn/approximate-knn.md
+++ b/_search-plugins/knn/approximate-knn.md
@@ -127,10 +127,26 @@ GET my-knn-index-1/_search
 }
 ```
-`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which
-indicates how many results the query actually returns. The plugin returns `k` amount of results for each shard
-(and each segment) and `size` amount of results for the entire query. The plugin supports a maximum `k` value of 10,000.
-Starting in OpenSearch 2.14, in addition to using the `k` variable, both the `min_score` and `max_distance` variables can be used for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
+### The number of returned results
+
+In the preceding query, `k` represents the number of neighbors returned by the search of each graph. You must also include the `size` option, indicating the final number of results that you want the query to return.
+
+For the NMSLIB and Faiss engines, `k` represents the maximum number of documents returned from each segment of a shard. For the Lucene engine, `k` represents the number of documents returned for a shard. The maximum value of `k` is 10,000.
+
+For any engine, each shard returns up to `size` results to the coordinator node. Thus, the total number of results that the coordinator node receives is at most `size * number of shards`. After the coordinator node consolidates the results received from all nodes, the query returns the top `size` results.
+
+The following table provides examples of the number of results returned by various engines in several scenarios.
For these examples, assume that the number of documents contained in the segments and shards is sufficient to return the number of results specified in the table.
+
+`size` | `k` | Number of primary shards | Number of segments per shard | Number of returned results, Faiss/NMSLIB | Number of returned results, Lucene
+:--- | :--- | :--- | :--- | :--- | :---
+10 | 1 | 1 | 4 | 4 | 1
+10 | 10 | 1 | 4 | 10 | 10
+10 | 1 | 2 | 4 | 8 | 2
+
+The number of results returned by Faiss/NMSLIB differs from the number of results returned by Lucene only when `k` is smaller than `size`. If `k` and `size` are equal, all engines return the same number of results.
+
+Starting in OpenSearch 2.14, you can use `k`, `min_score`, or `max_distance` for [radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
+
 ### Building a k-NN index from a model
 For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to training every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
@@ -311,4 +327,4 @@ included in the distance function.
 With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected and a corresponding exception will be thrown.
-{: .note }
\ No newline at end of file
+{: .note }
diff --git a/_search-plugins/knn/knn-vector-quantization.md b/_search-plugins/knn/knn-vector-quantization.md
index 96db75b3eb..549437f346 100644
--- a/_search-plugins/knn/knn-vector-quantization.md
+++ b/_search-plugins/knn/knn-vector-quantization.md
@@ -51,7 +51,7 @@ PUT /test-index
           "space_type": "l2",
           "parameters": {
             "encoder": {
-              "name": "sq",
+              "name": "sq"
             },
             "ef_construction": 256,
             "m": 8
diff --git a/_security/configuration/best-practices.md b/_security/configuration/best-practices.md
new file mode 100644
index 0000000000..97457cdb4b
--- /dev/null
+++ b/_security/configuration/best-practices.md
@@ -0,0 +1,133 @@
+---
+layout: default
+title: Best practices
+parent: Configuration
+nav_order: 11
+---
+
+# Best practices for OpenSearch security
+
+Setting up security in OpenSearch is crucial for protecting your data. Here are 10 best practices that offer clear steps for keeping your system safe.
+
+## 1. Use your own PKI to set up SSL/TLS
+
+Although using your own public key infrastructure (PKI), such as [AWS Certificate Manager](https://docs.aws.amazon.com/crypto/latest/userguide/awspki-service-acm.html), requires more initial effort, a custom PKI provides you with the flexibility needed to set up SSL/TLS in the most secure and performant way.
+
+### Enable SSL/TLS for node- and REST-layer traffic
+
+SSL/TLS is enabled by default on the transport layer, which is used for node-to-node communication. SSL/TLS is disabled by default on the REST layer.
+
+The following setting is required in order to enable encryption on the REST layer:
+
+```
+plugins.security.ssl.http.enabled: true
+```
+{% include copy.html %}
+
+
+For additional configuration options, such as specifying certificate paths, keys, and certificate authority files, refer to [Configuring TLS certificates]({{site.url}}{{site.baseurl}}/security/configuration/tls/).
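+
+As a minimal sketch of the certificate-related options mentioned above, the REST-layer settings in `opensearch.yml` might look like the following. The file names `node1_http.pem`, `node1_http-key.pem`, and `root-ca.pem` are placeholders assumed for illustration only. Replace them with files issued by your own PKI and check the exact setting names against the linked TLS documentation for your version:
+
+```yaml
+# Enable TLS on the REST layer (disabled by default).
+plugins.security.ssl.http.enabled: true
+# Placeholder file names, assumed for illustration; paths are relative to the config directory.
+plugins.security.ssl.http.pemcert_filepath: node1_http.pem
+plugins.security.ssl.http.pemkey_filepath: node1_http-key.pem
+plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem
+```
+{% include copy.html %}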
+
+### Replace all demo certificates with your own PKI
+
+The certificates generated when initializing an OpenSearch cluster with `install_demo_configuration.sh` are not suitable for production. These should be replaced with your own certificates.
+
+You can generate custom certificates in a few different ways. One approach is to use OpenSSL, described in detail at [Generating self-signed certificates]({{site.url}}{{site.baseurl}}/security/configuration/generate-certificates/). Alternatively, there are online tools available that can simplify the certificate creation process, such as the following:
+
+- [SearchGuard TLS Tool](https://docs.search-guard.com/latest/offline-tls-tool)
+- [TLSTool by dylandreimerink](https://github.com/dylandreimerink/tlstool)
+
+## 2. Prefer client certificate authentication for API authentication
+
+Client certificate authentication offers a secure alternative to password authentication and is more suitable for machine-to-machine interactions. It also ensures low performance overhead because the authentication occurs on the TLS level. Nearly all client software, such as curl and client libraries, supports this authentication method.
+
+For detailed configuration instructions and additional information about client certificate authentication, see [Enabling client certificate authentication]({{site.url}}{{site.baseurl}}/security/authentication-backends/client-auth/#enabling-client-certificate-authentication).
+
+
+## 3. Prefer SSO using SAML or OpenID for OpenSearch Dashboards authentication
+
+Implementing single sign-on (SSO) with protocols like SAML or OpenID for OpenSearch Dashboards authentication enhances security by delegating credential management to a dedicated system.
+
+This approach minimizes direct interaction with passwords in OpenSearch, streamlines authentication processes, and prevents clutter in the internal user database.
For more information, go to the [SAML section of the OpenSearch documentation]({{site.url}}{{site.baseurl}}/security/authentication-backends/saml/).
+
+## 4. Limit the number of roles assigned to a user
+
+Prioritizing fewer, more intricate user roles over numerous simplistic roles enhances security and simplifies administration.
+
+Additional best practices for role management include:
+
+1. Role granularity: Define roles based on specific job functions or access requirements to minimize unnecessary privileges.
+2. Regular role review: Regularly review and audit assigned roles to ensure alignment with organizational policies and access needs.
+
+For more information about roles, go to the documentation on [defining users and roles in OpenSearch]({{site.url}}{{site.baseurl}}/security/access-control/users-roles/).
+
+## 5. Verify DLS, FLS, and field masking
+
+If you have configured Document Level Security (DLS), Field Level Security (FLS), or field masking, make sure you double-check your role definitions, especially if a user is mapped to multiple roles. It is highly recommended that you test this by making a GET request to `_plugins/_security/authinfo`.
+
+The following resources provide detailed examples and additional configurations:
+
+ - [Document-level security]({{site.url}}{{site.baseurl}}/security/access-control/document-level-security/).
+ - [Field-level security]({{site.url}}{{site.baseurl}}/security/access-control/field-level-security/).
+ - [Field masking]({{site.url}}{{site.baseurl}}/security/access-control/field-masking/).
+
+## 6. Use only the essentials for the audit logging configuration
+
+Extensive audit logging can degrade system performance due to the following:
+
+- Each logged event adds to the processing load.
+- Audit logs can quickly grow in size, consuming significant disk space.
+
+To ensure optimal performance, disable unnecessary logging and be selective about which logs are used.
If not strictly required by compliance regulations, consider turning off audit logging. If audit logging is essential for your cluster, configure it according to your compliance requirements.
+
+Whenever possible, adhere to these recommendations:
+
+- Set `audit.log_request_body` to `false`.
+- Set `audit.resolve_bulk_requests` to `false`.
+- Enable `compliance.write_log_diffs`.
+- Minimize entries for `compliance.read_watched_fields`.
+- Minimize entries for `compliance.write_watched_indices`.
+
+## 7. Consider disabling the private tenant
+
+In many cases, the use of private tenants is unnecessary, although this feature is enabled by default. As a result, every OpenSearch Dashboards user is provided with their own private tenant and a corresponding new index in which to save objects. This can lead to a large number of unnecessary indexes. Evaluate whether private tenants are needed in your cluster. If private tenants are not needed, disable the feature by adding the following configuration to the `config.yml` file:
+
+```yaml
+config:
+  dynamic:
+    kibana:
+      multitenancy_enabled: true
+      private_tenant_enabled: false
+```
+{% include copy.html %}
+
+## 8. Manage the configuration using `securityadmin.sh`
+
+Use `securityadmin.sh` to manage the configuration of your clusters. `securityadmin.sh` is a command-line tool provided by OpenSearch for managing security configurations. It allows administrators to efficiently manage security settings, including roles, role mappings, and other security-related configurations within an OpenSearch cluster.
+
+Using `securityadmin.sh` provides the following benefits:
+
+1. Consistency: By using `securityadmin.sh`, administrators can ensure consistency across security configurations within a cluster. This helps to maintain a standardized and secure environment.
+2.
Automation: `securityadmin.sh` enables automation of security configuration tasks, making it easier to deploy and manage security settings across multiple nodes or clusters.
+3. Version control: Security configurations managed through `securityadmin.sh` can be version controlled using standard version control systems like Git. This facilitates tracking changes, auditing, and reverting to previous configurations.
+
+To avoid overwriting changes made through the OpenSearch Dashboards UI or the OpenSearch API, first back up the current configuration by running the `securityadmin.sh` tool with the `-backup` option. This ensures that all configurations are captured before uploading the modified configuration with `securityadmin.sh`.
+
+For more detailed information about using `securityadmin.sh` and managing OpenSearch security configurations, refer to the following resources:
+- [Applying changes to configuration files]({{site.url}}{{site.baseurl}}/security/configuration/security-admin/)
+- [Modifying YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/)
+
+## 9. Replace all default passwords
+
+When initializing OpenSearch with the demo configuration, many default passwords are provided for internal users in `internal_users.yml`, such as `admin`, `kibanaserver`, and `logstash`.
+
+You should change the passwords for these users to strong, complex passwords either at startup or as soon as possible once the cluster is running. Creating password configurations is a straightforward procedure, especially when using the scripts bundled with OpenSearch, like `hash.sh` or `hash.bat`, located in the `plugins/opensearch-security/tools` directory.
+
+The `kibanaserver` user is a crucial component that allows OpenSearch Dashboards to communicate with the OpenSearch cluster. By default, this user is preconfigured with a default password in the demo configuration.
This should be replaced with a strong, unique password in the OpenSearch configuration, and the `opensearch_dashboards.yml` file should be updated to reflect this change.
+
+
+## 10. Getting help
+
+If you need additional help, you can do the following:
+
+- Create an issue on GitHub at [OpenSearch-project/security](https://github.com/opensearch-project/security/security) or [OpenSearch-project/OpenSearch](https://github.com/opensearch-project/OpenSearch/security).
+- Ask a question on the [OpenSearch forum](https://forum.opensearch.org/tag/cve).