Merge branch 'main' into multiple-indices
vagimeli authored May 7, 2024
2 parents dde8ffe + 203c22a commit b213fc2
Showing 8 changed files with 134 additions and 34 deletions.
17 changes: 9 additions & 8 deletions _automating-configurations/workflow-steps.md
@@ -25,11 +25,11 @@ The following table lists the workflow step types. The `user_inputs` fields for

|Step type |Corresponding API |Description |
|--- |--- |--- |
-|`noop` |No API | A no-operation (no-op) step that does nothing. It may be useful in some cases for synchronizing parallel steps. |
+| `noop` | No API | A no-operation (no-op) step that does nothing, which is useful for synchronizing parallel steps. If the `user_inputs` field contains a `delay` key, this step will wait for the specified amount of time. |
|`create_connector` |[Create Connector]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/connector-apis/create-connector/) |Creates a connector to a model hosted on a third-party platform. |
|`delete_connector` |[Delete Connector]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/connector-apis/delete-connector/) |Deletes a connector to a model hosted on a third-party platform. |
|`register_model_group` |[Register Model Group]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-group-apis/register-model-group/) |Registers a model group. The model group will be deleted automatically once no model is present in the group. |
-|`register_remote_model` |[Register Model (remote)]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#register-a-model-hosted-on-a-third-party-platform) |Registers a model hosted on a third-party platform. If the `user_inputs` field contains a `deploy` key that is set to `true`, also deploys the model. |
+|`register_remote_model` |[Register Model (remote)]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#register-a-model-hosted-on-a-third-party-platform) | Registers a model hosted on a third-party platform. If the `user_inputs` field contains a `deploy` key that is set to `true`, the model is also deployed. |
|`register_local_pretrained_model` |[Register Model (pretrained)]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#register-a-pretrained-text-embedding-model) | Registers an OpenSearch-provided pretrained text embedding model that is hosted on your OpenSearch cluster. If the `user_inputs` field contains a `deploy` key that is set to `true`, also deploys the model. |
|`register_local_sparse_encoding_model` |[Register Model (sparse)]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#register-a-pretrained-sparse-encoding-model) | Registers an OpenSearch-provided pretrained sparse encoding model that is hosted on your OpenSearch cluster. If the `user_inputs` field contains a `deploy` key that is set to `true`, also deploys the model. |
|`register_local_custom_model` |[Register Model (custom)]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#register-a-custom-model) | Registers a custom model that is hosted on your OpenSearch cluster. If the `user_inputs` field contains a `deploy` key that is set to `true`, also deploys the model. |
@@ -45,13 +45,14 @@ The following table lists the workflow step types. The `user_inputs` fields for

## Additional fields

-You can include the following additional fields in the `user_inputs` field when indicated.
+You can include the following additional fields in the `user_inputs` field if the field is supported by the indicated step type.

-|Field |Data type |Description |
+|Field |Data type | Step type | Description |
-|--- |--- |--- |
+|--- |--- |--- |--- |
-|`node_timeout` |Time units |A user-provided timeout for this step. For example, `20s` for a 20-second timeout. |
-|`deploy` |Boolean |Applicable to the Register Model step type. If set to `true`, also executes the Deploy Model step. |
-|`tools_order` |List |Applicable only to the Register Agent step type. Specifies the ordering of `tools`. For example, specify `["foo_tool", "bar_tool"]` to sequence those tools in that order. |
+|`node_timeout` | Time units | All | A user-provided timeout for this step. For example, `20s` for a 20-second timeout. |
+|`deploy` | Boolean | Register model | If set to `true`, also deploys the model. |
+|`tools_order` | List | Register agent | Specifies the ordering of `tools`. For example, specify `["foo_tool", "bar_tool"]` to sequence those tools in that order. |
+|`delay` | Time units | No-op | Waits for the specified amount of time. For example, `250ms` specifies to wait for 250 milliseconds before continuing the workflow. |
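
To illustrate the `delay` field, the following is a minimal sketch of a provisioning workflow containing a single `noop` step. It assumes the template layout and Create Workflow endpoint described in the [Workflow tutorial]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-tutorial/); the workflow and step names are hypothetical:

```bash
# A hypothetical single-step workflow that waits 250 ms during provisioning.
curl -XPOST "http://localhost:9200/_plugins/_flow_framework/workflow" \
  -H 'Content-Type: application/json' -d'
{
  "name": "sleep_demo",
  "workflows": {
    "provision": {
      "nodes": [
        {
          "id": "wait_250ms",
          "type": "noop",
          "user_inputs": { "delay": "250ms" }
        }
      ]
    }
  }
}'
```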

You can include the following additional fields in the `previous_node_inputs` field when indicated.

@@ -61,4 +62,4 @@ You can include the following additional fields in the `previous_node_inputs` fi

## Example workflow steps

For example workflow step implementations, see the [Workflow tutorial]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-tutorial/).
55 changes: 55 additions & 0 deletions _benchmark/user-guide/finetine-workloads.md
@@ -0,0 +1,55 @@
---
layout: default
title: Fine-tuning custom workloads
nav_order: 12
parent: User guide
---

# Fine-tuning custom workloads

While custom workloads can help make benchmarks more specific to your application's needs, they sometimes require additional adjustments to make sure they closely resemble a production cluster.

You can fine-tune your custom workloads to more closely match your benchmarking needs by using the `create-workload` feature, which can extract documents either from all indexes or from specific indexes that you select.
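
For example, the following is a sketch of a `create-workload` invocation that extracts documents from two specific indexes. The workload name, host, credentials, and index names are placeholders; verify the flags against your OpenSearch Benchmark version:

```bash
# Extract documents from index-1 and index-2 and write the generated
# workload to ~/workloads (all values below are placeholders).
opensearch-benchmark create-workload \
  --workload-name="my-custom-workload" \
  --target-hosts="https://localhost:9200" \
  --client-options="basic_auth_user:admin,basic_auth_password:<password>,verify_certs:false" \
  --indices="index-1,index-2" \
  --output-path="~/workloads"
```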

## Characteristics to consider

When beginning to use `create-workload` to fine-tune a custom workload, consider the following characteristics of the workload:

1. **Queries** -- Consider the kinds of documents and indexes targeted by the queries, as well as the fields that the queries call.
2. **Shard size** -- Match the shard size of the workload to the shard size of your cluster; otherwise, the benchmark will not simulate the behavior of your application. Lucene operates at the level of individual shards and has no notion of OpenSearch indexes, so calculate the shard size for any index that you want to include in the custom workload.
3. **Shard count** -- Choose the number of shards according to how you want to size the workload and improve query performance. Because each use case is different, you can approach the shard count in two ways:
    1. Divide the ideal index size by the shard size found in step 2 to get the shard count.
    2. Multiply a chosen number of shards by the shard size found in step 2 to get the resulting index size.
4. **Decide how many documents to extract (Recommended)** -- Now that the shard size is set and the number of shards needed in the final index is decided, you can determine how many documents to extract. In some cases, you might not want to retrieve the entire document corpus from an index because the corpus is too large. Instead, you can generate a smaller corpus, as long as it remains *representative* of the production index; in other words, it should contain documents from across the entire index, not only from a certain area, such as the first half or the last third of the index. To decide how many documents to extract:
    1. Multiply the number of shards by the shard size to get the minimum corpus size. Because document sizes vary, add a buffer of additional documents on top of that number; the buffer protects against the extracted corpus coming up smaller than expected, which would leave your shards below the intended size. The total size of the extracted documents should exceed the shard size multiplied by the number of shards.
    2. Divide the store size by the total from the previous step. The result is the extraction multiple, which determines how documents are sampled from the reference index; for example, a multiple of 2 extracts every other document.
5. **Target cluster configuration** -- Factor the configuration and characteristics of the target cluster into how you generate the workload.


## Example

The following example contains an index named `stocks`. The `stocks` index includes documents containing statistics on all stocks being traded on the New York Stock Exchange (NYSE). OpenSearch Dashboards provides information about the `stocks` index, as shown in the following code example:

```
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open stocks asjdkfjacklajldf 12 1 997818020 232823733 720gb 360gb
```
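
The same statistics can also be pulled directly from the cluster; the following is a sketch assuming a local cluster and the `_cat` indices API:

```bash
# Column headers (v) make the output match the table above.
curl -XGET "http://localhost:9200/_cat/indices/stocks?v"
```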

Using this information, you can start adjusting the workload to your specifications, as shown in the following steps:

1. **Fetch queries associated with this index** -- Obtain the queries needed to make requests to the `stocks` index.
2. **Find the shard size for the index** -- To get the shard size of the index, divide the store size by the number of shards in the index: `720 / (12 + (12 * 1)) = 30`. 30 GB is the shard size. You can verify this by dividing the primary store size value by the number of primary shards.
3. **Determine the number of index shards** -- Determine the number of shards needed in the index to represent your application under a production load. For example, if you want your index to hold 300 GB of documents, but 300 GB is too much for the benchmark, determine a number that makes sense. For example, 300 GB of documents divided by the 30 GB shard size determined in the last step, or `300 / 30 = 10`, produces 10 shards. These 10 shards can either be 10 primary shards and 0 replicas, 5 primary shards and 1 replica, or 2 primary shards and 4 replicas. The shard configuration depends on your cluster's index needs.
4. **Decide how many documents to extract** -- To retain a 30 GB shard size across 10 shards, you need to extract at least 300 GB of documents. To determine the extraction multiple, divide the store size value by the ideal index size value; in this example, `720 / 300 = 2.4`. Because you want to make sure you reach 30 GB per shard, it is best to round down and choose 2 as the extraction multiple, which means that OpenSearch Benchmark will extract every other document. (The shell sketch following these steps checks this arithmetic.)
5. **Think about the target cluster configuration** -- Assess the cluster you're planning to work on. Consider the use case, the size of the cluster, and the number of nodes. While fine-tuning the workload, this could be an iterative process where small adjustments to the cluster are made according to the results of the workload runs.
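
As a sanity check, the arithmetic from steps 2 through 4 can be reproduced from the shell (a sketch using `bc` for the fractional result):

```bash
echo "720 / (12 + 12 * 1)" | bc    # shard size: 720 GB / 24 shards = 30 GB
echo "300 / 30" | bc               # shard count for a 300 GB index: 10
echo "scale=1; 720 / 300" | bc     # extraction multiple: 2.4 (round down to 2)
```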


## Replicating metrics

In many cases, a workload will not be able to exactly replicate the metrics of a production cluster. However, you can aim to get as close as possible to your ideal cluster metrics by replicating the following metrics:

* CPU utilization
* Search request rates
* Indexing rates


2 changes: 1 addition & 1 deletion _field-types/supported-field-types/date.md
@@ -62,7 +62,7 @@ OpenSearch has built-in date formats, but you can also create your own custom fo

## Default format

-As of OpenSearch 2.12, the default date format is `strict_date_time_no_millis||strict_date_optional_time||epoch_millis`. To revert the default format back to `strict_date_optional_time||epoch_millis` (the default format for OpenSearch 2.11 and earlier), set the `opensearch.experimental.optimization.datetime_formatter_caching.enabled` feature flag to `false`. For more information about enabling and disabling feature flags, see [Enabling experimental features]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/).
+As of OpenSearch 2.12, you can choose to use an experimental default date format, `strict_date_time_no_millis||strict_date_optional_time||epoch_millis`. To use the experimental default, set the `opensearch.experimental.optimization.datetime_formatter_caching.enabled` feature flag to `true`. For more information about enabling and disabling feature flags, see [Enabling experimental features]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/).
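
For instance, the following is a sketch of enabling the experimental format, assuming the flag can be set as a static setting in `opensearch.yml` on a package-based install (paths and the restart mechanism vary by installation type):

```bash
# Assumption: config at /etc/opensearch/opensearch.yml and a systemd-managed node.
echo "opensearch.experimental.optimization.datetime_formatter_caching.enabled: true" | \
  sudo tee -a /etc/opensearch/opensearch.yml
sudo systemctl restart opensearch
```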

## Built-in formats

16 changes: 16 additions & 0 deletions _install-and-configure/configuring-opensearch/index-settings.md
@@ -39,6 +39,22 @@ OpenSearch supports the following cluster-level index settings. All settings in

- `indices.fielddata.cache.size` (String): The maximum size of the field data cache. May be specified as an absolute value (for example, `8GB`) or a percentage of the node heap (for example, `50%`). This value is static so you must specify it in the `opensearch.yml` file. If you don't specify this setting, the maximum size is unlimited. This value should be smaller than the `indices.breaker.fielddata.limit`. For more information, see [Field data circuit breaker]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/circuit-breaker/#field-data-circuit-breaker-settings).

- `cluster.remote_store.index.path.type` (String): The path strategy for the data stored in the remote store. This setting is effective only for remote-store-enabled clusters and supports the following values:
  - `fixed`: Stores the data in the path structure `<repository_base_path>/<index_uuid>/<shard_id>/`.
  - `hashed_prefix`: Stores the data in the path structure `hash(<shard-data-identifier>)/<repository_base_path>/<index_uuid>/<shard_id>/`.
  - `hashed_infix`: Stores the data in the path structure `<repository_base_path>/hash(<shard-data-identifier>)/<index_uuid>/<shard_id>/`.
  The `shard-data-identifier` is derived from the index UUID, shard ID, kind of data (`translog` or `segments`), and type of data (`data`, `metadata`, or `lock_files`).
  Default is `fixed`.

- `cluster.remote_store.index.path.hash_algorithm` (String): The hash function used to derive the hash value when `cluster.remote_store.index.path.type` is set to `hashed_prefix` or `hashed_infix`. This setting is effective only for remote-store-enabled clusters and supports the following values:
  - `fnv_1a_base64`: Uses the FNV1a hash function and generates a URL-safe, 20-bit, base64-encoded hash value.
  - `fnv_1a_composite_1`: Uses the FNV1a hash function and generates a custom-encoded hash value that scales well with most remote store options. The FNV1a function generates a 64-bit value. The custom encoding uses the most significant 6 bits to create a URL-safe base64 character and the next 14 bits to create a binary string.
  Default is `fnv_1a_composite_1`.

- `cluster.remote_store.translog.transfer_timeout` (Time unit): The timeout for uploading translog and checkpoint files during a sync to the remote store. This setting is applicable only for remote-store-enabled clusters. Default is `30s`.

- `cluster.remote_store.index.segment_metadata.retention.max_count` (Integer): Controls the minimum number of metadata files to keep in the segment repository on a remote store. A value below `1` disables the deletion of stale segment metadata files. Default is `10`.
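
As an example, the following sketch applies the path settings through the Cluster Settings API. It assumes a remote-store-enabled cluster and that these settings can be updated dynamically; if they are static in your version, set them in `opensearch.yml` instead:

```bash
# Persistently switch remote store data paths to the hashed-prefix strategy.
curl -XPUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.remote_store.index.path.type": "hashed_prefix",
    "cluster.remote_store.index.path.hash_algorithm": "fnv_1a_composite_1"
  }
}'
```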


## Index-level index settings

You can specify index settings at index creation. There are two types of index settings:
@@ -56,4 +56,4 @@ The default OpenSearch transport is provided by the `transport-netty4` module an
Plugin | Description
:---------- | :--------
`transport-nio` | The OpenSearch transport based on Java NIO. <br> Installation: `./bin/opensearch-plugin install transport-nio` <br> Configuration (using `opensearch.yml`): <br> `transport.type: nio-transport` <br> `http.type: nio-http-transport`
-`transport-reactor-netty4` | The OpenSearch HTTP transport based on [Project Reactor](https://github.com/reactor/reactor-netty) and Netty 4 (**experimental**) <br> Installation: `./bin/opensearch-plugin install transport-reactor-netty4` <br> Configuration (using `opensearch.yml`): <br> `http.type: reactor-netty4`
+`transport-reactor-netty4` | The OpenSearch HTTP transport based on [Project Reactor](https://github.com/reactor/reactor-netty) and Netty 4 (**experimental**) <br> Installation: `./bin/opensearch-plugin install transport-reactor-netty4` <br> Configuration (using `opensearch.yml`): <br> `http.type: reactor-netty4` <br> `http.type: reactor-netty4-secure`
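
For example, the following is a minimal sketch of switching a node to the experimental reactor-netty4 HTTP transport (it assumes a standard distribution layout; the `-secure` variant additionally expects TLS to be configured):

```bash
./bin/opensearch-plugin install transport-reactor-netty4
# Then set one of the following in opensearch.yml and restart the node:
#   http.type: reactor-netty4
#   http.type: reactor-netty4-secure
```
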
27 changes: 21 additions & 6 deletions _install-and-configure/install-opensearch/debian.md
@@ -37,21 +37,23 @@ This guide assumes that you are comfortable working from the Linux command line
### Install OpenSearch from a package

1. Download the Debian package for the desired version directly from the [OpenSearch downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. The Debian package can be downloaded for both **x64** and **arm64** architectures.
-1. From the CLI, install using `dpkg`.
+1. From the CLI, install the package using `dpkg`:

+For new installations of OpenSearch 2.12 and later, you must define a custom admin password in order to set up a demo security configuration. Use one of the following commands to define a custom admin password:
```bash
# x64
-sudo dpkg -i opensearch-{{site.opensearch_version}}-linux-x64.deb
+sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> dpkg -i opensearch-{{site.opensearch_version}}-linux-x64.deb

# arm64
-sudo dpkg -i opensearch-{{site.opensearch_version}}-linux-arm64.deb
+sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> dpkg -i opensearch-{{site.opensearch_version}}-linux-arm64.deb
```
-For OpenSearch 2.12 and greater, a custom admin password is required in order to set up a security demo configuration. To set a custom admin password, use one the following commands:
+Use the following command for OpenSearch versions 2.11 and earlier:
```bash
# x64
-sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> dpkg -i opensearch-{{site.opensearch_version}}-linux-x64.deb
+sudo dpkg -i opensearch-{{site.opensearch_version}}-linux-x64.deb

# arm64
-sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> dpkg -i opensearch-{{site.opensearch_version}}-linux-arm64.deb
+sudo dpkg -i opensearch-{{site.opensearch_version}}-linux-arm64.deb
```

1. After the installation succeeds, enable OpenSearch as a service.
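
For reference, the following is a sketch of the standard `systemd` commands for this step (the full install guide covers them in detail):

```bash
sudo systemctl daemon-reload
sudo systemctl enable opensearch.service
sudo systemctl start opensearch.service
sudo systemctl status opensearch.service
```
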
@@ -136,14 +138,27 @@ APT, the primary package management tool for Debian-based operating systems, a

1. Choose the version of OpenSearch you want to install:
- Unless otherwise indicated, the latest available version of OpenSearch is installed.

```bash
# For new installations of OpenSearch 2.12 and later, you must define a custom admin password in order to set up a demo security configuration.
# Use one of the following commands to define a custom admin password:
sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> apt-get install opensearch
# Use the following command for OpenSearch versions 2.11 and earlier:
sudo apt-get install opensearch
```
{% include copy.html %}

- To install a specific version of OpenSearch:

```bash
# Specify the version manually using opensearch=<version>
# For new installations of OpenSearch 2.12 and later, you must define a custom admin password in order to set up a demo security configuration.
# Use one of the following commands to define a custom admin password:
sudo env OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password> apt-get install opensearch={{site.opensearch_version}}
# Use the following command for OpenSearch versions 2.11 and earlier:
sudo apt-get install opensearch={{site.opensearch_version}}
```

