diff --git a/.github/vale/styles/Vocab/OpenSearch/Products/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Products/accept.txt index 83e9aee603..9be8da79a9 100644 --- a/.github/vale/styles/Vocab/OpenSearch/Products/accept.txt +++ b/.github/vale/styles/Vocab/OpenSearch/Products/accept.txt @@ -76,7 +76,6 @@ Painless Peer Forwarder Performance Analyzer Piped Processing Language -Point in Time Powershell Python PyTorch diff --git a/.github/workflows/pr_checklist.yml b/.github/workflows/pr_checklist.yml new file mode 100644 index 0000000000..accce0f882 --- /dev/null +++ b/.github/workflows/pr_checklist.yml @@ -0,0 +1,43 @@ +name: PR Checklist + +on: + pull_request: + types: [opened] + +permissions: + pull-requests: write + +jobs: + add-checklist: + runs-on: ubuntu-latest + + steps: + - name: Comment PR with checklist + uses: peter-evans/create-or-update-comment@v3 + with: + token: ${{ secrets.GITHUB_TOKEN }} + issue-number: ${{ github.event.pull_request.number }} + body: | + Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. + + Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a [maintainer](https://github.com/opensearch-project/documentation-website/blob/main/MAINTAINERS.md). + + **When you're ready for doc review, tag the assignee of this PR**. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review. + + - name: Auto assign PR to repo owner + uses: actions/github-script@v6 + with: + script: | + let assignee = context.payload.pull_request.user.login; + const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'vagimeli', 'natebower']; + + if (!prOwners.includes(assignee)) { + assignee = 'hdhalter' + } + + github.rest.issues.addAssignees({ + issue_number: context.issue.number, + owner: context.repo.owner, + repo: context.repo.repo, + assignees: [assignee] + }); \ No newline at end of file diff --git a/.gitignore b/.gitignore index ae2249e73f..446d1deda6 100644 --- a/.gitignore +++ b/.gitignore @@ -4,4 +4,5 @@ _site .DS_Store Gemfile.lock .idea +*.iml .jekyll-cache diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index de44bbe4ee..7afa9d7596 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -100,10 +100,10 @@ Follow these steps to set up your local copy of the repository: #### Troubleshooting -If you encounter an error while trying to build the documentation website, find the error in the following troubleshooting list: +Try the following troubleshooting steps if you encounter an error when trying to build the documentation website: -- When running `rvm install 3.2` if you receive a `Error running '__rvm_make -j10'`, resolve this by running `rvm install 3.2.0 -C --with-openssl-dir=/opt/homebrew/opt/openssl@3.2` instead of `rvm install 3.2`. -- If receive a `bundle install`: `An error occurred while installing posix-spawn (0.3.15), and Bundler cannot continue.` error when trying to run `bundle install`, resolve this by running `gem install posix-spawn -v 0.3.15 -- --with-cflags=\"-Wno-incompatible-function-pointer-types\"`. Then, run `bundle install`. +- If you see the `Error running '__rvm_make -j10'` error when running `rvm install 3.2`, you can resolve it by running `rvm install 3.2.0 -C --with-openssl-dir=/opt/homebrew/opt/openssl@3.2` instead of `rvm install 3.2`. +- If you see the `bundle install`: `An error occurred while installing posix-spawn (0.3.15), and Bundler cannot continue.` error when trying to run `bundle install`, you can resolve it by running `gem install posix-spawn -v 0.3.15 -- --with-cflags=\"-Wno-incompatible-function-pointer-types\"` and then `bundle install`. diff --git a/_automating-configurations/api/create-workflow.md b/_automating-configurations/api/create-workflow.md index 5c501ce4e8..83c0110ac3 100644 --- a/_automating-configurations/api/create-workflow.md +++ b/_automating-configurations/api/create-workflow.md @@ -20,9 +20,9 @@ You can include placeholder expressions in the value of workflow step fields. Fo Once a workflow is created, provide its `workflow_id` to other APIs. -The `POST` method creates a new workflow. The `PUT` method updates an existing workflow. +The `POST` method creates a new workflow. The `PUT` method updates an existing workflow. You can specify the `update_fields` parameter to update specific fields. -You can only update a workflow if it has not yet been provisioned. +You can only update a complete workflow if it has not yet been provisioned. {: .note} ## Path and HTTP methods @@ -58,11 +58,26 @@ POST /_plugins/_flow_framework/workflow?validation=none ``` {% include copy-curl.html %} +You cannot update a full workflow once it has been provisioned, but you can update fields other than the `workflows` field, such as `name` and `description`: + +```json +PUT /_plugins/_flow_framework/workflow/?update_fields=true +{ + "name": "new-template-name", + "description": "A new description for the existing template" +} +``` +{% include copy-curl.html %} + +You cannot specify both the `provision` and `update_fields` parameters at the same time. +{: .note} + The following table lists the available query parameters. All query parameters are optional. User-provided parameters are only allowed if the `provision` parameter is set to `true`. | Parameter | Data type | Description | | :--- | :--- | :--- | | `provision` | Boolean | Whether to provision the workflow as part of the request. Default is `false`. | +| `update_fields` | Boolean | Whether to update only the fields included in the request body. Default is `false`. | | `validation` | String | Whether to validate the workflow. Valid values are `all` (validate the template) and `none` (do not validate the template). Default is `all`. | | User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Only allowed if `provision` is set to `true`. Optional. If `provision` is set to `false`, you can pass these parameters in the [Provision Workflow API query parameters]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/#query-parameters). | diff --git a/_ingest-pipelines/processors/split.md b/_ingest-pipelines/processors/split.md index 2052c3def1..c424ef671c 100644 --- a/_ingest-pipelines/processors/split.md +++ b/_ingest-pipelines/processors/split.md @@ -26,19 +26,18 @@ The following is the syntax for the `split` processor: The following table lists the required and optional parameters for the `split` processor. -Parameter | Required/Optional | Description | -|-----------|-----------|-----------| -`field` | Required | The field containing the string to be split. -`separator` | Required | The delimiter used to split the string. This can be a regular expression pattern. -`preserve_field` | Optional | If set to `true`, preserves empty trailing fields (for example, `''`) in the resulting array. If set to `false`, empty trailing fields are removed from the resulting array. Default is `false`. -`target_field` | Optional | The field where the array of substrings is stored. If not specified, then the field is updated in-place. -`ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified -field. If set to `true`, then the processor ignores missing values in the field and leaves the `target_field` unchanged. Default is `false`. -`description` | Optional | A brief description of the processor. -`if` | Optional | A condition for running the processor. -`ignore_failure` | Optional | Specifies whether the processor continues execution even if it encounters an error. If set to `true`, then failures are ignored. Default is `false`. -`on_failure` | Optional | A list of processors to run if the processor fails. -`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type. +Parameter | Required/Optional | Description +:--- | :--- | :--- +`field` | Required | The field containing the string to be split. +`separator` | Required | The delimiter used to split the string. This can be a regular expression pattern. +`preserve_field` | Optional | If set to `true`, preserves empty trailing fields (for example, `''`) in the resulting array. If set to `false`, empty trailing fields are removed from the resulting array. Default is `false`. +`target_field` | Optional | The field where the array of substrings is stored. If not specified, then the field is updated in-place. +`ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. If set to `true`, then the processor ignores missing values in the field and leaves the `target_field` unchanged. Default is `false`. +`description` | Optional | A brief description of the processor. +`if` | Optional | A condition for running the processor. +`ignore_failure` | Optional | Specifies whether the processor continues execution even if it encounters an error. If set to `true`, then failures are ignored. Default is `false`. +`on_failure` | Optional | A list of processors to run if the processor fails. +`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type. ## Using the processor diff --git a/_query-dsl/compound/hybrid.md b/_query-dsl/compound/hybrid.md index e573d17676..22b3a17fc1 100644 --- a/_query-dsl/compound/hybrid.md +++ b/_query-dsl/compound/hybrid.md @@ -12,11 +12,7 @@ You can use a hybrid query to combine relevance scores from multiple queries int ## Example -Before using a `hybrid` query, you must set up a machine learning (ML) model, ingest documents, and configure a search pipeline with a [`normalization-processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/). - -To learn how to set up an ML model, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). - -Once you set up an ML model, learn how to use the `hybrid` query by following the steps in [Using hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/#using-hybrid-search). +Learn how to use the `hybrid` query by following the steps in [Using hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/#using-hybrid-search). For a comprehensive example, follow the [Neural search tutorial]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search#tutorial). diff --git a/_query-dsl/geo-and-xy/geopolygon.md b/_query-dsl/geo-and-xy/geopolygon.md new file mode 100644 index 0000000000..c53b1379cf --- /dev/null +++ b/_query-dsl/geo-and-xy/geopolygon.md @@ -0,0 +1,177 @@ +--- +layout: default +title: Geopolygon +parent: Geographic and xy queries +grand_parent: Query DSL +nav_order: 30 +--- + +# Geopolygon query + +A geopolygon query returns documents containing geopoints that are within the specified polygon. A document containing multiple geopoints matches the query if at least one geopoint matches the query. + +A polygon is specified by a list of vertices in coordinate form. Unlike specifying a polygon for a geoshape field, the polygon does not have to be closed (specifying the first and last points at the same is unnecessary). Though points do not have to follow either clockwise or counterclockwise order, it is recommended that you list them in either of these orders. This will ensure that the correct polygon is captured. + +The searched document field must be mapped as `geo_point`. +{: .note} + +## Example + +Create a mapping with the `point` field mapped as `geo_point`: + +```json +PUT /testindex1 +{ + "mappings": { + "properties": { + "point": { + "type": "geo_point" + } + } + } +} +``` +{% include copy-curl.html %} + +Index a geopoint, specifying its latitude and longitude: + +```json +PUT testindex1/_doc/1 +{ + "point": { + "lat": 73.71, + "lon": 41.32 + } +} +``` +{% include copy-curl.html %} + +Search for documents whose `point` objects are within the specified `geo_polygon`: + +```json +GET /testindex1/_search +{ + "query": { + "bool": { + "must": { + "match_all": {} + }, + "filter": { + "geo_polygon": { + "point": { + "points": [ + { "lat": 74.5627, "lon": 41.8645 }, + { "lat": 73.7562, "lon": 42.6526 }, + { "lat": 73.3245, "lon": 41.6189 }, + { "lat": 74.0060, "lon": 40.7128 } + ] + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +The polygon specified in the preceding request is the quadrilateral depicted in the following image. The matching document is within this quadrilateral. The coordinates of the quadrilateral vertices are specified in `(latitude, longitude)` format. + +![Search for points within the specified quadrilateral]({{site.url}}{{site.baseurl}}/images/geopolygon-query.png) + +The response contains the matching document: + +```json +{ + "took": 6, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex1", + "_id": "1", + "_score": 1, + "_source": { + "point": { + "lat": 73.71, + "lon": 41.32 + } + } + } + ] + } +} +``` + +In the preceding search request, you specified the polygon vertices in clockwise order: + +```json +"geo_polygon": { + "point": { + "points": [ + { "lat": 74.5627, "lon": 41.8645 }, + { "lat": 73.7562, "lon": 42.6526 }, + { "lat": 73.3245, "lon": 41.6189 }, + { "lat": 74.0060, "lon": 40.7128 } + ] + } +} +``` + +Alternatively, you can specify the vertices in counterclockwise order: + +```json +"geo_polygon": { + "point": { + "points": [ + { "lat": 74.5627, "lon": 41.8645 }, + { "lat": 74.0060, "lon": 40.7128 }, + { "lat": 73.3245, "lon": 41.6189 }, + { "lat": 73.7562, "lon": 42.6526 } + ] + } +} +``` + +The resulting query response contains the same matching document. + +However, if you specify the vertices in the following order: + +```json +"geo_polygon": { + "point": { + "points": [ + { "lat": 74.5627, "lon": 41.8645 }, + { "lat": 74.0060, "lon": 40.7128 }, + { "lat": 73.7562, "lon": 42.6526 }, + { "lat": 73.3245, "lon": 41.6189 } + ] + } +} +``` + +The response returns no results. + +## Request fields + +Geopolygon queries accept the following fields. + +Field | Data type | Description +:--- | :--- | :--- +`_name` | String | The name of the filter. Optional. +`validation_method` | String | The validation method. Valid values are `IGNORE_MALFORMED` (accept geopoints with invalid coordinates), `COERCE` (try to coerce coordinates to valid values), and `STRICT` (return an error when coordinates are invalid). Optional. Default is `STRICT`. +`ignore_unmapped` | Boolean | Specifies whether to ignore an unmapped field. If set to `true`, then the query does not return any documents that contain an unmapped field. If set to `false`, then an exception is thrown when the field is unmapped. Optional. Default is `false`. + +## Accepted formats + +You can specify the geopoint coordinates when indexing a document and searching for documents in any [format]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/geo-point#formats) accepted by the geopoint field type. \ No newline at end of file diff --git a/_query-dsl/geo-and-xy/index.md b/_query-dsl/geo-and-xy/index.md index cb0559927d..83cdbf08d7 100644 --- a/_query-dsl/geo-and-xy/index.md +++ b/_query-dsl/geo-and-xy/index.md @@ -30,7 +30,7 @@ OpenSearch provides the following geographic query types: - [**Geo-bounding box queries**]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/geo-and-xy/geo-bounding-box/): Return documents with geopoint field values that are within a bounding box. - [**Geodistance queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geodistance/): Return documents with geopoints that are within a specified distance from the provided geopoint. -- **Geopolygon queries**: Return documents with geopoints that are within a polygon. +- [**Geopolygon queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geodistance/): Return documents containing geopoints that are within a polygon. - **Geoshape queries**: Return documents that contain: - Geoshapes and geopoints that have one of four spatial relations to the provided shape: `INTERSECTS`, `DISJOINT`, `WITHIN`, or `CONTAINS`. - Geopoints that intersect the provided shape. \ No newline at end of file diff --git a/_search-plugins/hybrid-search.md b/_search-plugins/hybrid-search.md index b0fb4d5bef..7f08d63d0f 100644 --- a/_search-plugins/hybrid-search.md +++ b/_search-plugins/hybrid-search.md @@ -12,7 +12,7 @@ Introduced 2.11 Hybrid search combines keyword and neural search to improve search relevance. To implement hybrid search, you need to set up a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) that runs at search time. The search pipeline you'll configure intercepts search results at an intermediate stage and applies the [`normalization_processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/) to them. The `normalization_processor` normalizes and combines the document scores from multiple query clauses, rescoring the documents according to the chosen normalization and combination techniques. **PREREQUISITE**
-Before using hybrid search, you must set up a text embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). +To follow this example, you must set up a text embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). If you have already generated text embeddings, ingest the embeddings into an index and skip to [Step 4](#step-4-configure-a-search-pipeline). {: .note} ## Using hybrid search diff --git a/_search-plugins/knn/settings.md b/_search-plugins/knn/settings.md index f4ef057cfb..4d84cc80bb 100644 --- a/_search-plugins/knn/settings.md +++ b/_search-plugins/knn/settings.md @@ -12,17 +12,28 @@ The k-NN plugin adds several new cluster settings. To learn more about static an ## Cluster settings +The following table lists all available cluster-level k-NN settings. For more information about cluster settings, see [Configuring OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#updating-cluster-settings-using-the-api) and [Updating cluster settings using the API]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#updating-cluster-settings-using-the-api). + +Setting | Static/Dynamic | Default | Description +:--- | :--- | :--- | :--- +`knn.plugin.enabled`| Dynamic | `true` | Enables or disables the k-NN plugin. +`knn.algo_param.index_thread_qty` | Dynamic | `1` | The number of threads used for native library index creation. Keeping this value low reduces the CPU impact of the k-NN plugin but also reduces indexing performance. +`knn.cache.item.expiry.enabled` | Dynamic | `false` | Whether to remove native library indexes that have not been accessed for a certain duration from memory. +`knn.cache.item.expiry.minutes` | Dynamic | `3h` | If enabled, the amount of idle time before a native library index is removed from memory. +`knn.circuit_breaker.unset.percentage` | Dynamic | `75` | The native memory usage threshold for the circuit breaker. Memory usage must be lower than this percentage of `knn.memory.circuit_breaker.limit` in order for `knn.circuit_breaker.triggered` to remain `false`. +`knn.circuit_breaker.triggered` | Dynamic | `false` | True when memory usage exceeds the `knn.circuit_breaker.unset.percentage` value. +`knn.memory.circuit_breaker.limit` | Dynamic | `50%` | The native memory limit for native library indexes. At the default value, if a machine has 100 GB of memory and the JVM uses 32 GB, then the k-NN plugin uses 50% of the remaining 68 GB (34 GB). If memory usage exceeds this value, then the plugin removes the native library indexes used least recently. +`knn.memory.circuit_breaker.enabled` | Dynamic | `true` | Whether to enable the k-NN memory circuit breaker. +`knn.model.index.number_of_shards`| Dynamic | `1` | The number of shards to use for the model system index, which is the OpenSearch index that stores the models used for approximate nearest neighbor (ANN) search. +`knn.model.index.number_of_replicas`| Dynamic | `1` | The number of replica shards to use for the model system index. Generally, in a multi-node cluster, this value should be at least 1 in order to increase stability. +`knn.model.cache.size.limit` | Dynamic | `10%` | The model cache limit cannot exceed 25% of the JVM heap. +`knn.faiss.avx2.disabled` | Static | `false` | A static setting that specifies whether to disable the SIMD-based `libopensearchknn_faiss_avx2.so` library and load the non-optimized `libopensearchknn_faiss.so` library for the Faiss engine on machines with x64 architecture. For more information, see [SIMD optimization for the Faiss engine]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#simd-optimization-for-the-faiss-engine). + +## Index settings + +The following table lists all available index-level k-NN settings. All settings are static. For information about updating static index-level settings, see [Updating a static index setting]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#updating-a-static-index-setting). + Setting | Default | Description -:--- | :--- | :--- -`knn.algo_param.index_thread_qty` | 1 | The number of threads used for native library index creation. Keeping this value low reduces the CPU impact of the k-NN plugin, but also reduces indexing performance. -`knn.cache.item.expiry.enabled` | false | Whether to remove native library indexes that have not been accessed for a certain duration from memory. -`knn.cache.item.expiry.minutes` | 3h | If enabled, the idle time before removing a native library index from memory. -`knn.circuit_breaker.unset.percentage` | 75% | The native memory usage threshold for the circuit breaker. Memory usage must be below this percentage of `knn.memory.circuit_breaker.limit` for `knn.circuit_breaker.triggered` to remain false. -`knn.circuit_breaker.triggered` | false | True when memory usage exceeds the `knn.circuit_breaker.unset.percentage` value. -`knn.memory.circuit_breaker.limit` | 50% | The native memory limit for native library indexes. At the default value, if a machine has 100 GB of memory and the JVM uses 32 GB, the k-NN plugin uses 50% of the remaining 68 GB (34 GB). If memory usage exceeds this value, k-NN removes the least recently used native library indexes. -`knn.memory.circuit_breaker.enabled` | true | Whether to enable the k-NN memory circuit breaker. -`knn.plugin.enabled`| true | Enables or disables the k-NN plugin. -`knn.model.index.number_of_shards`| 1 | The number of shards to use for the model system index, the OpenSearch index that stores the models used for Approximate Nearest Neighbor (ANN) search. -`knn.model.index.number_of_replicas`| 1 | The number of replica shards to use for the model system index. Generally, in a multi-node cluster, this should be at least 1 to increase stability. -`knn.advanced.filtered_exact_search_threshold`| null | The threshold value for the filtered IDs that is used to switch to exact search during filtered ANN search. If the number of filtered IDs in a segment is less than this setting's value, exact search will be performed on the filtered IDs. -`knn.faiss.avx2.disabled` | False | A static setting that specifies whether to disable the SIMD-based `libopensearchknn_faiss_avx2.so` library and load the non-optimized `libopensearchknn_faiss.so` library for the Faiss engine on machines with x64 architecture. For more information, see [SIMD optimization for the Faiss engine]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#simd-optimization-for-the-faiss-engine). +:--- | :--- | :--- +`index.knn.advanced.filtered_exact_search_threshold`| `null` | The filtered ID threshold value used to switch to exact search during filtered ANN search. If the number of filtered IDs in a segment is lower than this setting's value, then exact search will be performed on the filtered IDs. +`index.knn.algo_param.ef_search` | `100` | `ef` (or `efSearch`) represents the size of the dynamic list for the nearest neighbors used during a search. Higher `ef` values lead to a more accurate but slower search. `ef` cannot be set to a value lower than the number of queried nearest neighbors, `k`. `ef` can take any value between `k` and the size of the dataset. \ No newline at end of file diff --git a/_search-plugins/sql/functions.md b/_search-plugins/sql/functions.md index de3b578e1a..9706148d76 100644 --- a/_search-plugins/sql/functions.md +++ b/_search-plugins/sql/functions.md @@ -32,10 +32,10 @@ The SQL plugin supports the following common functions shared across the SQL and | `expm1` | `expm1(number T) -> double` | `SELECT expm1(0.5)` | | `floor` | `floor(number T) -> long` | `SELECT floor(0.5)` | | `ln` | `ln(number T) -> double` | `SELECT ln(10)` | -| `log` | `log(number T) -> double` or `log(number T, number T) -> double` | `SELECT log(10)`, `SELECT log(2, 16)` | +| `log` | `log(number T) -> double` or `log(number T, number T) -> double` | `SELECT log(10) -> 2.3`, `SELECT log(2, 16) -> 4`| | `log2` | `log2(number T) -> double` | `SELECT log2(10)` | -| `log10` | `log10(number T) -> double` | `SELECT log10(10)` | -| `mod` | `mod(number T, number T) -> T` | `SELECT mod(2, 3)` | +| `log10` | `log10(number T) -> double` | `SELECT log10(100)` | +| `mod` | `mod(number T, number T) -> T` | `SELECT mod(10,4) -> 2 ` | | `modulus` | `modulus(number T, number T) -> T` | `SELECT modulus(2, 3)` | | `multiply` | `multiply(number T, number T) -> T` | `SELECT multiply(2, 3)` | | `pi` | `pi() -> double` | `SELECT pi()` | @@ -162,7 +162,7 @@ Functions marked with * are only available in SQL. | `replace` | `replace(string, string, string) -> string` | `SELECT replace('hello', 'l', 'x')` | | `right` | `right(string, integer) -> string` | `SELECT right('hello', 2)` | | `rtrim` | `rtrim(string) -> string` | `SELECT rtrim('hello ')` | -| `substring` | `substring(string, integer, integer) -> string` | `SELECT substring('hello', 2, 4)` | +| `substring` | `substring(string, integer, integer) -> string` | `SELECT substring('hello', 2, 2) -> 'el'` | | `trim` | `trim(string) -> string` | `SELECT trim(' hello')` | | `upper` | `upper(string) -> string` | `SELECT upper('hello world')` | diff --git a/_search-plugins/sql/ppl/functions.md b/_search-plugins/sql/ppl/functions.md index 275030f723..d192799f2e 100644 --- a/_search-plugins/sql/ppl/functions.md +++ b/_search-plugins/sql/ppl/functions.md @@ -11,7 +11,7 @@ redirect_from: # Commands -`PPL` supports all [`SQL` common]({{site.url}}{{site.baseurl}}/search-plugins/sql/functions/) functions, including [relevance search]({{site.url}}{{site.baseurl}}/search-plugins/sql/full-text/), but also introduces few more functions (called `commands`) which are available in `PPL` only. +`PPL` supports most [`SQL` common]({{site.url}}{{site.baseurl}}/search-plugins/sql/functions/) functions, including [relevance search]({{site.url}}{{site.baseurl}}/search-plugins/sql/full-text/), but also introduces few more functions (called `commands`) which are available in `PPL` only. ## dedup diff --git a/_search-plugins/vector-search.md b/_search-plugins/vector-search.md new file mode 100644 index 0000000000..862b26b375 --- /dev/null +++ b/_search-plugins/vector-search.md @@ -0,0 +1,283 @@ +--- +layout: default +title: Vector search +nav_order: 22 +has_children: false +has_toc: false +--- + +# Vector search + +OpenSearch is a comprehensive search platform that supports a variety of data types, including vectors. OpenSearch vector database functionality is seamlessly integrated with its generic database function. + +In OpenSearch, you can generate vector embeddings, store those embeddings in an index, and use them for vector search. Choose one of the following options: + +- Generate embeddings using a library of your choice before ingesting them into OpenSearch. Once you ingest vectors into an index, you can perform a vector similarity search on the vector space. For more information, see [Working with embeddings generated outside of OpenSearch](#working-with-embeddings-generated-outside-of-opensearch). +- Automatically generate embeddings within OpenSearch. To use embeddings for semantic search, the ingested text (the corpus) and the query need to be embedded using the same model. [Neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) packages this functionality, eliminating the need to manage the internal details. For more information, see [Generating vector embeddings within OpenSearch](#generating-vector-embeddings-in-opensearch). + +## Working with embeddings generated outside of OpenSearch + +After you generate vector embeddings, upload them to an OpenSearch index and search the index using vector search. For a complete example, see [Example](#example). + +### k-NN index + +To build a vector database and use vector search, you must specify your index as a [k-NN index]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/) when creating it by setting `index.knn` to `true`: + +```json +PUT test-index +{ + "settings": { + "index": { + "knn": true, + "knn.algo_param.ef_search": 100 + } + }, + "mappings": { + "properties": { + "my_vector1": { + "type": "knn_vector", + "dimension": 1024, + "method": { + "name": "hnsw", + "space_type": "l2", + "engine": "nmslib", + "parameters": { + "ef_construction": 128, + "m": 24 + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +### k-NN vector + +You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float. + +To save storage space, you can use `byte` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector). + +### k-NN vector search + +Vector search finds the vectors in your database that are most similar to the query vector. OpenSearch supports the following search methods: + +- [Approximate search](#approximate-search) (approximate k-NN, or ANN): Returns approximate nearest neighbors to the query vector. Usually, approximate search algorithms sacrifice indexing speed and search accuracy in exchange for performance benefits such as lower latency, smaller memory footprints, and more scalable search. For most use cases, approximate search is the best option. + +- Exact search (exact k-NN): A brute-force, exact k-NN search of vector fields. OpenSearch supports the following types of exact search: + - [Exact k-NN with scoring script]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-score-script/): Using the k-NN scoring script, you can apply a filter to an index before executing the nearest neighbor search. + - [Painless extensions]({{site.url}}{{site.baseurl}}/search-plugins/knn/painless-functions/): Adds the distance functions as Painless extensions that you can use in more complex combinations. You can use this method to perform a brute-force, exact k-NN search of an index, which also supports pre-filtering. + +### Approximate search + +OpenSearch supports several algorithms for approximate vector search, each with its own advantages. For complete documentation, see [Approximate search]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/). For more information about the search methods and engines, see [Method definitions]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#method-definitions). For method recommendations, see [Choosing the right method]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#choosing-the-right-method). + +To use approximate vector search, specify one of the following search methods (algorithms) in the `method` parameter: + +- Hierarchical Navigable Small World (HNSW) +- Inverted File System (IVF) + +Additionally, specify the engine (library) that implements this method in the `engine` parameter: + +- [Non-Metric Space Library (NMSLIB)](https://github.com/nmslib/nmslib) +- [Facebook AI Similarity Search (Faiss)](https://github.com/facebookresearch/faiss) +- Lucene + +The following table lists the combinations of search methods and libraries supported by the k-NN engine for approximate vector search. + +Method | Engine +:--- | :--- +HNSW | NMSLIB, Faiss, Lucene +IVF | Faiss + +### Engine recommendations + +In general, select NMSLIB or Faiss for large-scale use cases. Lucene is a good option for smaller deployments and offers benefits like smart filtering, where the optimal filtering strategy—pre-filtering, post-filtering, or exact k-NN—is automatically applied depending on the situation. The following table summarizes the differences between each option. + +| | NMSLIB/HNSW | Faiss/HNSW | Faiss/IVF | Lucene/HNSW | +|:---|:---|:---|:---|:---| +| Max dimensions | 16,000 | 16,000 | 16,000 | 1,024 | +| Filter | Post-filter | Post-filter | Post-filter | Filter during search | +| Training required | No | No | Yes | No | +| Similarity metrics | `l2`, `innerproduct`, `cosinesimil`, `l1`, `linf` | `l2`, `innerproduct` | `l2`, `innerproduct` | `l2`, `cosinesimil` | +| Number of vectors | Tens of billions | Tens of billions | Tens of billions | Less than 10 million | +| Indexing latency | Low | Low | Lowest | Low | +| Query latency and quality | Low latency and high quality | Low latency and high quality | Low latency and low quality | High latency and high quality | +| Vector compression | Flat | Flat
Product quantization | Flat
Product quantization | Flat | +| Memory consumption | High | High
Low with PQ | Medium
Low with PQ | High | + +### Example + +In this example, you'll create a k-NN index, add data to the index, and search the data. + +#### Step 1: Create a k-NN index + +First, create an index that will store sample hotel data. Set `index.knn` to `true` and specify the `location` field as a `knn_vector`: + +```json +PUT /hotels-index +{ + "settings": { + "index": { + "knn": true, + "knn.algo_param.ef_search": 100, + "number_of_shards": 1, + "number_of_replicas": 0 + } + }, + "mappings": { + "properties": { + "location": { + "type": "knn_vector", + "dimension": 2, + "method": { + "name": "hnsw", + "space_type": "l2", + "engine": "lucene", + "parameters": { + "ef_construction": 100, + "m": 16 + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +#### Step 2: Add data to your index + +Next, add data to your index. Each document represents a hotel. The `location` field in each document contains a vector specifying the hotel's location: + +```json +POST /_bulk +{ "index": { "_index": "hotels-index", "_id": "1" } } +{ "location": [5.2, 4.4] } +{ "index": { "_index": "hotels-index", "_id": "2" } } +{ "location": [5.2, 3.9] } +{ "index": { "_index": "hotels-index", "_id": "3" } } +{ "location": [4.9, 3.4] } +{ "index": { "_index": "hotels-index", "_id": "4" } } +{ "location": [4.2, 4.6] } +{ "index": { "_index": "hotels-index", "_id": "5" } } +{ "location": [3.3, 4.5] } +``` +{% include copy-curl.html %} + +#### Step 3: Search your data + +Now search for hotels closest to the pin location `[5, 4]`. This location is labeled `Pin` in the following image. Each hotel is labeled with its document number. + +![Hotels on a coordinate plane]({{site.url}}{{site.baseurl}}/images/k-nn-search-hotels.png/) + +To search for the top three closest hotels, set `k` to `3`: + +```json +POST /hotels-index/_search +{ + "size": 3, + "query": { + "knn": { + "location": { + "vector": [ + 5, + 4 + ], + "k": 3 + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains the hotels closest to the specified pin location: + +```json +{ + "took": 1093, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 3, + "relation": "eq" + }, + "max_score": 0.952381, + "hits": [ + { + "_index": "hotels-index", + "_id": "2", + "_score": 0.952381, + "_source": { + "location": [ + 5.2, + 3.9 + ] + } + }, + { + "_index": "hotels-index", + "_id": "1", + "_score": 0.8333333, + "_source": { + "location": [ + 5.2, + 4.4 + ] + } + }, + { + "_index": "hotels-index", + "_id": "3", + "_score": 0.72992706, + "_source": { + "location": [ + 4.9, + 3.4 + ] + } + } + ] + } +} +``` + +### Vector search with filtering + +For information about vector search with filtering, see [k-NN search with filters]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/). + +## Generating vector embeddings in OpenSearch + +[Neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) encapsulates the infrastructure needed to perform semantic vector searches. After you integrate an inference (embedding) service, neural search functions like lexical search, accepting a textual query and returning relevant documents. + +When you index your data, neural search transforms text into vector embeddings and indexes both the text and its vector embeddings in a vector index. When you use a neural query during search, neural search converts the query text into vector embeddings and uses vector search to return the results. + +### Choosing a model + +The first step in setting up neural search is choosing a model. You can upload a model to your OpenSearch cluster, use one of the pretrained models provided by OpenSearch, or connect to an externally hosted model. For more information, see [Integrating ML models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/). + +### Neural search tutorial + +For a step-by-step tutorial, see [Neural search tutorial]({{site.url}}{{site.baseurl}}/search-plugins/neural-search-tutorial/). + +### Search methods + +Choose one of the following search methods to use your model for neural search: + +- [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/): Uses dense retrieval based on text embedding models to search text data. + +- [Hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/): Combines lexical and neural search to improve search relevance. + +- [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/multimodal-search/): Uses neural search with multimodal embedding models to search text and image data. + +- [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/): Uses neural search with sparse retrieval based on sparse embedding models to search text data. + +- [Conversational search]({{site.url}}{{site.baseurl}}/search-plugins/conversational-search/): With conversational search, you can ask questions in natural language, receive a text response, and ask additional clarifying questions. diff --git a/_security/access-control/document-level-security.md b/_security/access-control/document-level-security.md index 08de85bbf7..352fe06a61 100644 --- a/_security/access-control/document-level-security.md +++ b/_security/access-control/document-level-security.md @@ -191,6 +191,10 @@ Adaptive | `adaptive-level` | The default setting that allows OpenSearch to auto OpenSearch combines all DLS queries with the logical `OR` operator. However, when a role that uses DLS is combined with another security role that doesn't use DLS, the query results are filtered to display only documents matching the DLS from the first role. This filter rule also applies to roles that do not grant read documents. +### DLS and write permissions + +Make sure that a user that has DLS-configured roles does not have write permissions. If write permissions are added, the user will be able to index documents which they will not be able to retrieve due to DLS filtering. + ### When to enable `plugins.security.dfm_empty_overrides_all` When to enable the `plugins.security.dfm_empty_overrides_all` setting depends on whether you want to restrict user access to documents without DLS. diff --git a/_security/configuration/tls.md b/_security/configuration/tls.md index bca932bc0c..a4115b8c25 100755 --- a/_security/configuration/tls.md +++ b/_security/configuration/tls.md @@ -137,7 +137,7 @@ plugins.security.authcz.admin_dn: For security reasons, you cannot use wildcards or regular expressions as values for the `admin_dn` setting. -For more information about admin and super admin user roles, see [Admin and super admin roles](https://opensearch.org/docs/latest/security/access-control/users-roles/#admin-and-super-admin-roles) and [Configuring super admin certificates](https://opensearch.org/docs/latest/security/configuration/tls/#configuring-admin-certificates). +For more information about admin and super admin user roles, see [Admin and super admin roles](https://opensearch.org/docs/latest/security/access-control/users-roles/#admin-and-super-admin-roles). ## (Advanced) OpenSSL diff --git a/images/geopolygon-query.png b/images/geopolygon-query.png new file mode 100644 index 0000000000..16d73628de Binary files /dev/null and b/images/geopolygon-query.png differ diff --git a/images/k-nn-search-hotels.png b/images/k-nn-search-hotels.png new file mode 100644 index 0000000000..f17fd171cf Binary files /dev/null and b/images/k-nn-search-hotels.png differ