From d4bb0f56039d5e43ad19a3af14b7b963d995fa97 Mon Sep 17 00:00:00 2001 From: "Theo N. Truong" Date: Mon, 11 Nov 2024 13:37:24 -0700 Subject: [PATCH 01/14] [Jekyll] Spec Insert Plugin (#8692) * Spec Insert A program that insert API Components generated from the OpenSearch OpenAPI Spec into markdown files Signed-off-by: Theo Truong * # Sentence casing Signed-off-by: Theo Truong * # vale:reviewdog Signed-off-by: Theo Truong * # vale:reviewdog Signed-off-by: Theo Truong * # vale:reviewdog Signed-off-by: Theo Truong * # Correction on cron job run time. Signed-off-by: Theo Truong * Apply suggestions from code review Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Daniel (dB.) Doubrovkine Signed-off-by: Theo N. Truong * # More clarity in method documentation Signed-off-by: Theo Truong * Update DEVELOPER_GUIDE.md * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Theo Truong Signed-off-by: Theo N. Truong Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Daniel (dB.) 
Doubrovkine Co-authored-by: Nathan Bower --- .github/workflows/jekyll-spec-insert.yml | 20 +++ .github/workflows/update-api-components.yml | 52 +++++++ DEVELOPER_GUIDE.md | 135 ++++++++++++++++++ Gemfile | 43 ++++-- README.md | 1 + _config.yml | 15 +- spec-insert/.gitignore | 2 + spec-insert/.rspec | 1 + spec-insert/.rubocop.yml | 29 ++++ spec-insert/jekyll-spec-insert.gemspec | 16 +++ spec-insert/lib/api/action.rb | 68 +++++++++ spec-insert/lib/api/operation.rb | 34 +++++ spec-insert/lib/api/parameter.rb | 94 ++++++++++++ spec-insert/lib/doc_processor.rb | 62 ++++++++ spec-insert/lib/insert_arguments.rb | 67 +++++++++ spec-insert/lib/jekyll-spec-insert.rb | 56 ++++++++ .../lib/renderers/base_mustache_renderer.rb | 18 +++ .../lib/renderers/parameter_table_renderer.rb | 51 +++++++ spec-insert/lib/renderers/path_parameters.rb | 21 +++ .../lib/renderers/paths_and_methods.rb | 21 +++ spec-insert/lib/renderers/query_parameters.rb | 25 ++++ spec-insert/lib/renderers/spec_insert.rb | 42 ++++++ spec-insert/lib/renderers/table_renderer.rb | 58 ++++++++ .../templates/path_parameters.mustache | 2 + .../templates/paths_and_methods.mustache | 6 + .../templates/query_parameters.mustache | 5 + .../renderers/templates/spec_insert.mustache | 7 + spec-insert/lib/spec_hash.rb | 60 ++++++++ spec-insert/lib/spec_insert_error.rb | 4 + .../spec/_fixtures/actual_output/.gitignore | 1 + .../_fixtures/expected_output/param_tables.md | 43 ++++++ .../expected_output/paths_and_http_methods.md | 13 ++ .../spec/_fixtures/input/param_tables.md | 38 +++++ .../_fixtures/input/paths_and_http_methods.md | 6 + .../spec/_fixtures/opensearch_spec.yaml | 120 ++++++++++++++++ spec-insert/spec/doc_processor_spec.rb | 24 ++++ spec-insert/spec/spec_helper.rb | 102 +++++++++++++ 37 files changed, 1345 insertions(+), 17 deletions(-) create mode 100644 .github/workflows/jekyll-spec-insert.yml create mode 100644 .github/workflows/update-api-components.yml create mode 100644 DEVELOPER_GUIDE.md create mode 100644 
spec-insert/.gitignore create mode 100644 spec-insert/.rspec create mode 100644 spec-insert/.rubocop.yml create mode 100644 spec-insert/jekyll-spec-insert.gemspec create mode 100644 spec-insert/lib/api/action.rb create mode 100644 spec-insert/lib/api/operation.rb create mode 100644 spec-insert/lib/api/parameter.rb create mode 100644 spec-insert/lib/doc_processor.rb create mode 100644 spec-insert/lib/insert_arguments.rb create mode 100644 spec-insert/lib/jekyll-spec-insert.rb create mode 100644 spec-insert/lib/renderers/base_mustache_renderer.rb create mode 100644 spec-insert/lib/renderers/parameter_table_renderer.rb create mode 100644 spec-insert/lib/renderers/path_parameters.rb create mode 100644 spec-insert/lib/renderers/paths_and_methods.rb create mode 100644 spec-insert/lib/renderers/query_parameters.rb create mode 100644 spec-insert/lib/renderers/spec_insert.rb create mode 100644 spec-insert/lib/renderers/table_renderer.rb create mode 100644 spec-insert/lib/renderers/templates/path_parameters.mustache create mode 100644 spec-insert/lib/renderers/templates/paths_and_methods.mustache create mode 100644 spec-insert/lib/renderers/templates/query_parameters.mustache create mode 100644 spec-insert/lib/renderers/templates/spec_insert.mustache create mode 100644 spec-insert/lib/spec_hash.rb create mode 100644 spec-insert/lib/spec_insert_error.rb create mode 100644 spec-insert/spec/_fixtures/actual_output/.gitignore create mode 100644 spec-insert/spec/_fixtures/expected_output/param_tables.md create mode 100644 spec-insert/spec/_fixtures/expected_output/paths_and_http_methods.md create mode 100644 spec-insert/spec/_fixtures/input/param_tables.md create mode 100644 spec-insert/spec/_fixtures/input/paths_and_http_methods.md create mode 100644 spec-insert/spec/_fixtures/opensearch_spec.yaml create mode 100644 spec-insert/spec/doc_processor_spec.rb create mode 100644 spec-insert/spec/spec_helper.rb diff --git a/.github/workflows/jekyll-spec-insert.yml 
b/.github/workflows/jekyll-spec-insert.yml new file mode 100644 index 0000000000..cefd477be2 --- /dev/null +++ b/.github/workflows/jekyll-spec-insert.yml @@ -0,0 +1,20 @@ +name: Lint and Test Jekyll Spec Insert +on: + push: + paths: + - 'spec-insert/**' + pull_request: + paths: + - 'spec-insert/**' +jobs: + lint-and-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: ruby/setup-ruby@v1 + with: { ruby-version: 3.3.0 } + - run: bundle install + - working-directory: spec-insert + run: | + bundle exec rubocop + bundle exec rspec diff --git a/.github/workflows/update-api-components.yml b/.github/workflows/update-api-components.yml new file mode 100644 index 0000000000..42cc1d2827 --- /dev/null +++ b/.github/workflows/update-api-components.yml @@ -0,0 +1,52 @@ +name: Update API Components +on: + workflow_dispatch: + schedule: + - cron: "0 0 * * 0" # Every Sunday at midnight GMT +jobs: + update-api-components: + if: ${{ github.repository == 'opensearch-project/documentation-website' }} + runs-on: ubuntu-latest + permissions: + contents: write + pull-requests: write + steps: + - uses: actions/checkout@v4 + with: + submodules: recursive + fetch-depth: 0 + + - run: git config --global pull.rebase true + + - uses: ruby/setup-ruby@v1 + with: { ruby-version: 3.3.0 } + + - run: bundle install + + - name: Download spec and insert into documentation + run: bundle exec jekyll spec-insert + + - name: Get current date + id: date + run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_ENV + + - name: GitHub App token + id: github_app_token + uses: tibdex/github-app-token@v2.1.0 + with: + app_id: ${{ secrets.APP_ID }} + private_key: ${{ secrets.APP_PRIVATE_KEY }} + + - name: Create pull request + uses: peter-evans/create-pull-request@v6 + with: + token: ${{ steps.github_app_token.outputs.token }} + commit-message: "Updated API components to reflect the latest OpenSearch API spec (${{ env.date }})" + title: "[AUTOCUT] Update API components to reflect the latest 
OpenSearch API spec (${{ env.date }})" + body: | + Update API components to reflect the latest [OpenSearch API spec](https://github.com/opensearch-project/opensearch-api-specification/releases/download/main-latest/opensearch-openapi.yaml). + Date: ${{ env.date }} + branch: update-api-components-${{ env.date }} + base: main + signoff: true + labels: autocut \ No newline at end of file diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md new file mode 100644 index 0000000000..f414fe0020 --- /dev/null +++ b/DEVELOPER_GUIDE.md @@ -0,0 +1,135 @@ +# Developer guide + - [Introduction](#introduction) + - [Starting the Jekyll server locally](#starting-the-jekyll-server-locally) + - [Using the spec-insert Jekyll plugin](#using-the-spec-insert-jekyll-plugin) + - [Inserting query parameters](#inserting-query-parameters) + - [Inserting path parameters](#inserting-path-parameters) + - [Inserting paths and HTTP methods](#inserting-paths-and-http-methods) + - [Ignoring files and folders](#ignoring-files-and-folders) + - [CI/CD](#cicd) + +## Introduction + +The `.md` documents in this repository are rendered into HTML pages using [Jekyll](https://jekyllrb.com/). These HTML pages are hosted on [opensearch.org](https://opensearch.org/docs/latest/). + +## Starting the Jekyll server locally +You can run the Jekyll server locally to view the rendered HTML pages using the following steps: + +1. Install [Ruby](https://www.ruby-lang.org/en/documentation/installation/) 3.1.0 or later for your operating system. +2. Install the required gems by running `bundle install`. +3. Run `bundle exec jekyll serve` to start the Jekyll server locally (this can take several minutes to complete). +4. Open your browser and navigate to `http://localhost:4000` to view the rendered HTML pages. + +## Using the `spec-insert` Jekyll plugin +The `spec-insert` Jekyll plugin is used to insert API components into Markdown files. 
The plugin downloads the [latest OpenSearch specification](https://github.com/opensearch-project/opensearch-api-specification) and renders the API components from the spec. This aims to reduce the manual effort required to keep the documentation up to date. + +To use this plugin, make sure that you have installed Ruby 3.1.0 or later and the required gems by running `bundle install`. + +Edit your Markdown file and insert the following snippet where you want to render an API component: + +```markdown + + +This is where the API component will be inserted. +Everything between the `spec_insert_start` and `spec_insert_end` tags will be overwritten. + + +``` + +Then run the following Jekyll command to render the API components: +```shell +bundle exec jekyll spec-insert +``` + +If you are working on multiple Markdown files and do not want to keep running the `jekyll spec-insert` command, you can add the `--watch` (or `-W`) flag to the command to watch for changes in the Markdown files and automatically render the API components: + +```shell +bundle exec jekyll spec-insert --watch +``` + +Depending on the text editor you are using, you may need to manually reload the file from disk to see the changes applied by the plugin if the editor does not automatically reload the file periodically. + +The plugin will pull the newest OpenSearch API spec from its [repository](https://github.com/opensearch-project/opensearch-api-specification) if the spec file does not exist locally or if it is older than 24 hours. To tell the plugin to always pull the newest spec, you can add the `--refresh-spec` (or `-R`) flag to the command: + +```shell +bundle exec jekyll spec-insert --refresh-spec +``` + +### Inserting query parameters + +To insert the API query parameters table, use the following snippet: + +```markdown + + +``` + +This will insert the query parameters of the `cat.indices` API into the `.md` file with three default columns: `Parameter`, `Type`, and `Description`. 
There are five columns that can be inserted: `Parameter`, `Type`, `Description`, `Required`, and `Default`. When `Required`/`Default` is not chosen, the information will be written in the `Description` column. + +You can customize the query parameters table with the following columns: + +- `Parameter` +- `Type` +- `Description` +- `Required` +- `Default` + + You can also customize this component with the following settings: + +- `include_global` (Boolean; default is `false`): Includes global query parameters in the table. +- `include_deprecated` (Boolean; default is `true`): Includes deprecated parameters in the table. +- `pretty` (Boolean; default is `false`): Renders the table in the pretty format instead of the compact format. + +The following snippet inserts the specified columns into the query parameters table: + +```markdown + + +``` + +### Inserting path parameters + +To insert the `indices.create` API path parameters table, use the following snippet: + +```markdown + + +``` + +This table behaves identically to the query parameters table except that it does not accept the `include_global` argument. + +### Inserting paths and HTTP methods + +To insert paths and HTTP methods for the `search` API, use the following snippet: + +```markdown + + +``` + +### Ignoring files and folders + +The `spec-insert` plugin ignores all files and folders listed in the [./_config.yml#exclude](./_config.yml) list, which is also the list of files and folders that Jekyll ignores. + +### CI/CD + +The `spec-insert` plugin is run as part of the CI/CD pipeline to ensure that the API components are up to date in the documentation. This is performed through the [update-api-components.yml](.github/workflows/update-api-components.yml) GitHub Actions workflow, which creates a pull request containing the updated API components every Sunday. 
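The exclusion behavior described under "Ignoring files and folders" can be sketched in a few lines of Ruby. This is a minimal illustration, not the plugin's code: `filter_markdown_files` is a hypothetical helper, and the `/repo` paths are made-up sample data. It assumes, as the plugin's `run_once` method in this patch does, that an entry in the `exclude` list is treated as a path prefix relative to the repository root.

```ruby
# Hypothetical helper (not part of the plugin): keeps only the Markdown files
# that do not sit under any excluded path.
# Assumption: each excluded entry is a path prefix relative to `root`.
def filter_markdown_files(files, excluded_paths, root)
  excluded = excluded_paths.map { |path| File.join(root, path) }
  files.select do |file|
    file.end_with?('.md') && excluded.none? { |prefix| file.start_with?(prefix) }
  end
end

# Made-up sample data for illustration only.
root = '/repo'
files = %w[
  /repo/_api-reference/cat.md
  /repo/README.md
  /repo/spec-insert/spec/input.md
  /repo/_config.yml
]

puts filter_markdown_files(files, %w[README.md spec-insert], root)
# prints /repo/_api-reference/cat.md
```

Because the match is a plain prefix comparison, an entry such as `spec-insert` would also exclude a sibling file whose name merely starts with `spec-insert`. If that matters, a trailing slash on directory entries (as in `templates/`) is one way to narrow the match.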
diff --git a/Gemfile b/Gemfile index 7825dcd02b..fee04f3c48 100644 --- a/Gemfile +++ b/Gemfile @@ -1,4 +1,9 @@ -source "http://rubygems.org" +# frozen_string_literal: true + +source 'https://rubygems.org' + +# Manually add csv gem since Ruby 3.4.0 no longer includes it +gem 'csv', '~> 3.0' # Hello! This is where you manage which Jekyll version is used to run. # When you want to use a different version, change it below, save the @@ -8,12 +13,12 @@ source "http://rubygems.org" # # This will help ensure the proper Jekyll version is running. # Happy Jekylling! -gem "jekyll", "~> 4.3.2" +gem 'jekyll', '~> 4.3.2' # This is the default theme for new Jekyll sites. You may change this to anything you like. -gem "just-the-docs", "~> 0.3.3" -gem "jekyll-remote-theme", "~> 0.4" -gem "jekyll-redirect-from", "~> 0.16" +gem 'jekyll-redirect-from', '~> 0.16' +gem 'jekyll-remote-theme', '~> 0.4' +gem 'just-the-docs', '~> 0.3.3' # If you want to use GitHub Pages, remove the "gem "jekyll"" above and # uncomment the line below. To upgrade, run `bundle update github-pages`. @@ -22,21 +27,31 @@ gem "jekyll-redirect-from", "~> 0.16" # If you have any plugins, put them here! group :jekyll_plugins do - gem "jekyll-last-modified-at" - gem "jekyll-sitemap" + gem 'jekyll-last-modified-at' + gem 'jekyll-sitemap' + gem 'jekyll-spec-insert', :path => './spec-insert' end # Windows does not include zoneinfo files, so bundle the tzinfo-data gem -gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby] +gem 'tzinfo-data', platforms: %i[mingw mswin x64_mingw jruby] # Performance-booster for watching directories on Windows -gem "wdm", "~> 0.1.0" if Gem.win_platform? +gem 'wdm', '~> 0.1.0' if Gem.win_platform? 
# Installs webrick dependency for building locally -gem "webrick", "~> 1.7" - +gem 'webrick', '~> 1.7' # Link checker -gem "typhoeus" -gem "ruby-link-checker" -gem "ruby-enum" +gem 'ruby-enum' +gem 'ruby-link-checker' +gem 'typhoeus' + +# Spec Insert +gem 'activesupport', '~> 7' +gem 'mustache', '~> 1' + +group :development, :test do + gem 'rspec' + gem 'rubocop', '~> 1.44', require: false + gem 'rubocop-rake', require: false +end diff --git a/README.md b/README.md index 66beb1948c..52321335c7 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,7 @@ # About the OpenSearch documentation repo The `documentation-website` repository contains the user documentation for OpenSearch. You can find the rendered documentation at [opensearch.org/docs](https://opensearch.org/docs). +The markdown files in this repository are rendered into HTML pages using [Jekyll](https://jekyllrb.com/). Check the [DEVELOPER_GUIDE](DEVELOPER_GUIDE.md) for more information about how to use Jekyll for this repository. ## Contributing diff --git a/_config.yml b/_config.yml index 68b4b1395f..0e45176320 100644 --- a/_config.yml +++ b/_config.yml @@ -311,6 +311,7 @@ plugins: - jekyll-remote-theme - jekyll-redirect-from - jekyll-sitemap + - jekyll-spec-insert # This format has to conform to RFC822 last-modified-at: @@ -320,6 +321,8 @@ last-modified-at: # The following items will not be processed, by default. Create a custom list # to override the default setting. 
exclude: + - README.md + - DEVELOPER_GUIDE.md - Gemfile - Gemfile.lock - node_modules @@ -327,6 +330,12 @@ exclude: - vendor/cache/ - vendor/gems/ - vendor/ruby/ - - README.md - - .idea - - templates + - templates/ + - .sass-cache/ + - .jekyll-cache/ + - .idea/ + - .github/ + - .bundle/ + - _site/ + - spec-insert + - release-notes \ No newline at end of file diff --git a/spec-insert/.gitignore b/spec-insert/.gitignore new file mode 100644 index 0000000000..c9958b86d2 --- /dev/null +++ b/spec-insert/.gitignore @@ -0,0 +1,2 @@ +opensearch-openapi.yaml +rspec_examples.txt diff --git a/spec-insert/.rspec b/spec-insert/.rspec new file mode 100644 index 0000000000..c99d2e7396 --- /dev/null +++ b/spec-insert/.rspec @@ -0,0 +1 @@ +--require spec_helper diff --git a/spec-insert/.rubocop.yml b/spec-insert/.rubocop.yml new file mode 100644 index 0000000000..5b88e922f4 --- /dev/null +++ b/spec-insert/.rubocop.yml @@ -0,0 +1,29 @@ +require: rubocop-rake +AllCops: + Include: + - 'lib/**/*.rb' + - 'Rakefile' + NewCops: enable + +Metrics/CyclomaticComplexity: + Enabled: false +Metrics/MethodLength: + Enabled: false +Metrics/ParameterLists: + Enabled: false +Metrics/AbcSize: + Enabled: false +Metrics/PerceivedComplexity: + Enabled: false + +Layout/EmptyLineAfterGuardClause: + Enabled: false + +Style/MultilineBlockChain: + Enabled: false +Style/SingleLineMethods: + Enabled: false + +Naming/FileName: + Exclude: + - 'lib/jekyll-spec-insert.rb' # For Jekyll to recognize the plugin diff --git a/spec-insert/jekyll-spec-insert.gemspec b/spec-insert/jekyll-spec-insert.gemspec new file mode 100644 index 0000000000..d397f40af2 --- /dev/null +++ b/spec-insert/jekyll-spec-insert.gemspec @@ -0,0 +1,16 @@ +# frozen_string_literal: true + +Gem::Specification.new do |spec| + spec.name = 'jekyll-spec-insert' + spec.version = '0.1.0' + spec.authors = ['Theo Truong'] + spec.email = ['theo.nam.truong@gmail.com'] + + spec.summary = 'A Jekyll plugin for inserting OpenSearch OpenAPI specifications into 
Jekyll sites.' + + spec.files = Dir['lib/**/*.rb'] + spec.require_paths = ['lib'] + + spec.metadata['rubygems_mfa_required'] = 'true' + spec.required_ruby_version = '>= 3.1.0' +end diff --git a/spec-insert/lib/api/action.rb b/spec-insert/lib/api/action.rb new file mode 100644 index 0000000000..5ad3dded77 --- /dev/null +++ b/spec-insert/lib/api/action.rb @@ -0,0 +1,68 @@ +# SPDX-License-Identifier: Apache-2.0 +# +# The OpenSearch Contributors require contributions made to +# this file be licensed under the Apache-2.0 license or a +# compatible open source license. + +# frozen_string_literal: true + +require_relative 'parameter' +require_relative 'operation' + +# A collection of operations that comprise a single API Action +# AKA operation-group +class Action + # @param [SpecHash] spec Parsed OpenAPI spec + def self.actions=(spec) + operations = spec.paths.flat_map do |url, ops| + ops.filter_map { |verb, op| Operation.new(op, url, verb) unless op['x-ignorable'] } + end + @actions = operations.group_by(&:group).values.map { |ops| Action.new(ops) }.index_by(&:full_name) + end + + # @return [Hash] API Actions indexed by operation-group + def self.actions + raise 'Actions not set' unless @actions + @actions + end + + # @return [Array] Operations in the action + attr_reader :operations + + # @param [Array] operations + def initialize(operations) + @operations = operations + @operation = operations.first + @spec = @operation&.spec + end + + # @return [Array] Input arguments. + def arguments; @arguments ||= Parameter.from_operations(@operations.map(&:spec)); end + + # @return [String] Full name of the action (i.e. 
namespace.action) + def full_name; @operation&.group; end + + # @return [String] Name of the action + def name; @operation&.action; end + + # @return [String] Namespace of the action + def namespace; @operation&.namespace; end + + # @return [Array] Sorted unique HTTP verbs + def http_verbs; @operations.map(&:http_verb).uniq.sort; end + + # @return [Array] Unique URLs + def urls; @operations.map(&:url).uniq; end + + # @return [String] Description of the action + def description; @spec&.description; end + + # @return [Boolean] Whether the action is deprecated + def deprecated; @spec&.deprecated; end + + # @return [String] Deprecation message + def deprecation_message; @spec['x-deprecation-message']; end + + # @return [String] API reference + def api_reference; @operation&.external_docs&.url; end +end diff --git a/spec-insert/lib/api/operation.rb b/spec-insert/lib/api/operation.rb new file mode 100644 index 0000000000..6f9fb44cc4 --- /dev/null +++ b/spec-insert/lib/api/operation.rb @@ -0,0 +1,34 @@ +# SPDX-License-Identifier: Apache-2.0 +# +# The OpenSearch Contributors require contributions made to +# this file be licensed under the Apache-2.0 license or a +# compatible open source license. 
+ +# frozen_string_literal: true + +# An API Operation +class Operation + # @return [Openapi3Parser::Node::Operation] Operation Spec + attr_reader :spec + # @return [String] URL + attr_reader :url + # @return [String] HTTP Verb + attr_reader :http_verb + # @return [String] Operation Group + attr_reader :group + # @return [String] API Action + attr_reader :action + # @return [String] API Namespace + attr_reader :namespace + + # @param [Openapi3Parser::Node::Operation] spec Operation Spec + # @param [String] url + # @param [String] http_verb + def initialize(spec, url, http_verb) + @spec = spec + @url = url + @http_verb = http_verb.upcase + @group = spec['x-operation-group'] + @action, @namespace = @group.split('.').reverse + end +end diff --git a/spec-insert/lib/api/parameter.rb b/spec-insert/lib/api/parameter.rb new file mode 100644 index 0000000000..fbd87fd50e --- /dev/null +++ b/spec-insert/lib/api/parameter.rb @@ -0,0 +1,94 @@ +# frozen_string_literal: true + +module ArgLocation + PATH = :path + QUERY = :query +end + +# Represents a parameter of an API action +class Parameter + # @return [String] The name of the parameter + attr_reader :name + # @return [String] The description of the parameter + attr_reader :description + # @return [Boolean] Whether the parameter is required + attr_reader :required + # @return [SpecHash] The JSON schema of the parameter + attr_reader :schema + # @return [String] Argument type in documentation + attr_reader :doc_type + # @return [String] The default value of the parameter + attr_reader :default + # @return [Boolean] Whether the parameter is deprecated + attr_reader :deprecated + # @return [String] The deprecation message + attr_reader :deprecation_message + # @return [String] The OpenSearch version when the parameter was deprecated + attr_reader :version_deprecated + # @return [ArgLocation] The location of the parameter + attr_reader :location + + def initialize(name:, description:, required:, schema:, default:, deprecated:, 
deprecation_message:, + version_deprecated:, location:) + @name = name + @description = description + @required = required + @schema = schema + @doc_type = get_doc_type(schema).gsub('String / List', 'List').gsub('List / String', 'List') + @default = default + @deprecated = deprecated + @deprecation_message = deprecation_message + @version_deprecated = version_deprecated + @location = location + end + + # @param [SpecHash | nil] schema + # @return [String | nil] Documentation type + def get_doc_type(schema) + return nil if schema.nil? + union = schema.anyOf || schema.oneOf + return union.map { |sch| get_doc_type(sch) }.join(' / ') unless union.nil? + return 'Integer' if schema.type == 'integer' + return 'Float' if schema.type == 'number' + return 'Boolean' if schema.type == 'boolean' + return 'String' if schema.type == 'string' + return 'NULL' if schema.type == 'null' + return 'List' if schema.type == 'array' + 'Object' + end + + # @param [SpecHash] Full OpenAPI spec + def self.global=(spec) + @global = spec.components.parameters.filter { |_, p| p['x-global'] }.map { |_, p| from_parameters([p], 1) } + end + + # @return [Array] Global parameters + def self.global + raise 'Global parameters not set' unless @global + @global + end + + # @param [Array] operations List of operations of the same group + # @return [Array] List of parameters of the operation group + def self.from_operations(operations) + operations.flat_map(&:parameters).filter { |param| !param['x-global'] } + .group_by(&:name).values.map { |params| from_parameters(params, operations.size) } + end + + # @param [Array] params List of parameters of the same name + # @param [Integer] opts_count Number of operations involved + # @return [Parameter] Single parameter distilled from the list + def self.from_parameters(params, opts_count) + param = params.first || SpecHash.new + schema = param&.schema || SpecHash.new + Parameter.new(name: param.name, + description: param.description || schema.description, + 
required: params.filter(&:required).size >= opts_count, + schema:, + default: param.default || schema.default, + deprecated: param.deprecated || schema.deprecated, + deprecation_message: param['x-deprecation-message'] || schema['x-deprecation-message'], + version_deprecated: param['x-version-deprecated'] || schema['x-version-deprecated'], + location: params.any? { |p| p.in == 'path' } ? ArgLocation::PATH : ArgLocation::QUERY) + end +end diff --git a/spec-insert/lib/doc_processor.rb b/spec-insert/lib/doc_processor.rb new file mode 100644 index 0000000000..0aaa01061a --- /dev/null +++ b/spec-insert/lib/doc_processor.rb @@ -0,0 +1,62 @@ +# frozen_string_literal: true + +require 'pathname' +require_relative 'renderers/spec_insert' +require_relative 'spec_insert_error' + +# Processes a file, replacing spec-insert blocks with rendered content +class DocProcessor + START_MARKER = // + + def initialize(file_path, logger:) + @file_path = Pathname(file_path) + @logger = logger + end + + # Processes the file, replacing spec-insert blocks with rendered content + # @param [Boolean] write_to_file Whether to write the changes back to the file + def process(write_to_file: true) + relative_path = @file_path.relative_path_from(Pathname.new(Dir.pwd)) + lines = File.readlines(@file_path) + original_content = lines.join + insertions = find_insertions(lines) + return if insertions.empty? + + insertions.reverse_each { |start, finish, insert| lines[start..finish] = insert.render } + rendered_content = lines.join + if write_to_file && rendered_content != original_content + File.write(@file_path, rendered_content) + @logger.info "Spec components inserted into #{relative_path} successfully." + end + rendered_content + rescue SpecInsertError => e + @logger.error "Error processing #{relative_path}. 
#{e.message}" + end + + private + + # @return Array<[Integer, Integer, SpecInsert]> + def find_insertions(lines) + start_indices = lines.each_with_index + .filter { |line, _index| line.match?(START_MARKER) } + .map { |_line, index| index } + end_indices = start_indices.map do |index| + (index..lines.length - 1).find { |i| lines[i].match?(END_MARKER) } + end.compact + + validate_markers!(start_indices, end_indices) + + start_indices.zip(end_indices).map do |start, finish| + [start, finish, SpecInsert.new(lines[start..finish])] + end + end + + # @param [Array] start_indices + # @param [Array] end_indices + def validate_markers!(start_indices, end_indices) + return if start_indices.length == end_indices.length && + start_indices.zip(end_indices).flatten.each_cons(2).all? { |a, b| a < b } + raise SpecInsertError, 'Mismatched "spec_insert_start" and "spec_insert_end" markers.' + end +end diff --git a/spec-insert/lib/insert_arguments.rb b/spec-insert/lib/insert_arguments.rb new file mode 100644 index 0000000000..08b9b4dc9b --- /dev/null +++ b/spec-insert/lib/insert_arguments.rb @@ -0,0 +1,67 @@ +# frozen_string_literal: true + +# Doc Insert Arguments +class InsertArguments + COLUMNS = %w[Parameter Description Required Type Default].freeze + DEFAULT_COLUMNS = %w[Parameter Type Description].freeze + attr_reader :raw + + # @param [Array] lines the lines between + def initialize(lines) + end_index = lines.each_with_index.find { |line, _index| line.match?(/^\s*-->/) }&.last&.- 1 + @raw = lines[1..end_index].filter { |line| line.include?(':') }.to_h do |line| + key, value = line.split(':') + [key.strip, value.strip] + end + end + + # @return [String] + def api + @raw['api'] + end + + # @return [String] + def component + @raw['component'] + end + + # @return [Array] + def columns + cols = parse_array(@raw['columns']) || DEFAULT_COLUMNS + invalid = cols - COLUMNS + raise ArgumentError, "Invalid column(s): #{invalid.join(', ')}" unless invalid.empty? 
+ cols + end + + # @return [Boolean] + def pretty + parse_boolean(@raw['pretty'], default: false) + end + + # @return [Boolean] + def include_global + parse_boolean(@raw['include_global'], default: false) + end + + # @return [Boolean] + def include_deprecated + parse_boolean(@raw['include_deprecated'], default: true) + end + + private + + # @param [String] value comma-separated array + def parse_array(value) + return nil if value.nil? + value.split(',').map(&:strip) + end + + # @param [String] value + # @param [Boolean] default value to return when nil + def parse_boolean(value, default:) + return default if value.nil? + return true if value.in?(%w[true True TRUE yes Yes YES 1]) + return false if value.in?(%w[false False FALSE no No NO 0]) + raise ArgumentError, "Invalid boolean value: #{value}" + end +end diff --git a/spec-insert/lib/jekyll-spec-insert.rb b/spec-insert/lib/jekyll-spec-insert.rb new file mode 100644 index 0000000000..14a8997cc8 --- /dev/null +++ b/spec-insert/lib/jekyll-spec-insert.rb @@ -0,0 +1,56 @@ +# frozen_string_literal: true + +require 'active_support/all' +require 'listen' +require 'yaml' +require_relative 'spec_hash' +require_relative 'doc_processor' + +# Jekyll plugin to insert document components generated from the spec into the Jekyll site +class JekyllSpecInsert < Jekyll::Command + # @param [Mercenary::Program] prog + def self.init_with_program(prog) + prog.command(:'spec-insert') do |c| + c.syntax 'spec-insert [options]' + c.option 'watch', '--watch', '-W', 'Watch for changes and rebuild' + c.option 'refresh-spec', '--refresh-spec', '-R', 'Redownload the OpenSearch API specification' + c.action do |_args, options| + spec_file = File.join(Dir.pwd, 'spec-insert/opensearch-openapi.yaml') + excluded_paths = YAML.load_file('_config.yml')['exclude'] + download_spec(spec_file, forced: options['refresh-spec']) + SpecHash.load_file(spec_file) + run_once(excluded_paths) + watch(excluded_paths) if options['watch'] + end + end + end + + def 
self.download_spec(spec_file, forced: false) + return if !forced && File.exist?(spec_file) && (File.mtime(spec_file) > 1.day.ago) + Jekyll.logger.info 'Downloading OpenSearch API specification...' + system 'curl -L -X GET ' \ + 'https://github.com/opensearch-project/opensearch-api-specification' \ + '/releases/download/main-latest/opensearch-openapi.yaml ' \ + "-o #{spec_file}" + end + + def self.run_once(excluded_paths) + excluded_paths = excluded_paths.map { |path| File.join(Dir.pwd, path) } + Dir.glob(File.join(Dir.pwd, '**/*.md')) + .filter { |file| excluded_paths.none? { |excluded| file.start_with?(excluded) } } + .each { |file| DocProcessor.new(file, logger: Jekyll.logger).process } + end + + def self.watch(excluded_paths) + Jekyll.logger.info "\nWatching for changes...\n" + excluded_paths = excluded_paths.map { |path| /\.#{path}$/ } + + Listen.to(Dir.pwd, only: /\.md$/, ignore: excluded_paths) do |modified, added, _removed| + (modified + added).each { |file| DocProcessor.new(file, logger: Jekyll.logger).process } + end.start + + trap('INT') { exit } + trap('TERM') { exit } + sleep + end +end diff --git a/spec-insert/lib/renderers/base_mustache_renderer.rb b/spec-insert/lib/renderers/base_mustache_renderer.rb new file mode 100644 index 0000000000..2ebd83783d --- /dev/null +++ b/spec-insert/lib/renderers/base_mustache_renderer.rb @@ -0,0 +1,18 @@ +# frozen_string_literal: true + +require 'mustache' + +# Base Mustache Renderer +class BaseMustacheRenderer < Mustache + self.template_path = "#{__dir__}/templates" + + def initialize(output_file = nil) + @output_file = output_file + super + end + + def generate + raise 'Output file not specified' unless @output_file + @output_file&.write(render) + end +end diff --git a/spec-insert/lib/renderers/parameter_table_renderer.rb b/spec-insert/lib/renderers/parameter_table_renderer.rb new file mode 100644 index 0000000000..c23e90c240 --- /dev/null +++ b/spec-insert/lib/renderers/parameter_table_renderer.rb @@ -0,0 +1,51 @@ 
+# frozen_string_literal: true + +require_relative 'table_renderer' + +# Renders a table of parameters of an API action +class ParameterTableRenderer + # @param [Array] parameters + # @param [InsertArguments] args + def initialize(parameters, args) + @columns = args.columns + @pretty = args.pretty + @parameters = parameters + @parameters = @parameters.reject(&:deprecated) unless args.include_deprecated + @parameters += Parameter.global if args.include_global + @parameters = @parameters.sort_by { |arg| [arg.required ? 0 : 1, arg.deprecated ? 1 : 0, arg.name] } + end + + # @return [String] + def render + columns = @columns.map { |col| TableRenderer::Column.new(col, col) } + rows = @parameters.map { |arg| row(arg) } + TableRenderer.new(columns, rows, pretty: @pretty).render_lines.join("\n") + end + + private + + def row(param) + { + 'Parameter' => "`#{param.name}`#{' <br> _DEPRECATED_' if param.deprecated}", + 'Description' => description(param), + 'Required' => param.required ? 'Required' : nil, + 'Type' => param.doc_type, + 'Default' => param.default + } + end + + def description(param) + deprecation = deprecation(param) + required = param.required && @columns.exclude?('Required') ? '**(Required)** ' : '' + description = param.description.gsub("\n", ' ') + default = param.default.nil? || @columns.include?('Default') ? '' : " _(Default: #{param.default})_" + + "#{deprecation}#{required}#{description}#{default}" + end + + def deprecation(param) + message = ": #{param.deprecation_message}" if param.deprecation_message.present? + since = " since #{param.version_deprecated}" if param.version_deprecated.present? 
+ "_(Deprecated#{since}#{message})_ " if param.deprecated + end +end diff --git a/spec-insert/lib/renderers/path_parameters.rb b/spec-insert/lib/renderers/path_parameters.rb new file mode 100644 index 0000000000..07476102f2 --- /dev/null +++ b/spec-insert/lib/renderers/path_parameters.rb @@ -0,0 +1,21 @@ +# frozen_string_literal: true + +require_relative 'base_mustache_renderer' +require_relative 'parameter_table_renderer' + +# Renders path parameters +class PathParameters < BaseMustacheRenderer + self.template_file = "#{__dir__}/templates/path_parameters.mustache" + + # @param [Action] action API Action + # @param [InsertArguments] args + def initialize(action, args) + super(nil) + @params = action.arguments.select { |arg| arg.location == ArgLocation::PATH } + @args = args + end + + def table + ParameterTableRenderer.new(@params, @args).render + end +end diff --git a/spec-insert/lib/renderers/paths_and_methods.rb b/spec-insert/lib/renderers/paths_and_methods.rb new file mode 100644 index 0000000000..f6776c8226 --- /dev/null +++ b/spec-insert/lib/renderers/paths_and_methods.rb @@ -0,0 +1,21 @@ +# frozen_string_literal: true + +require_relative 'base_mustache_renderer' + +# Renders paths and http methods +class PathsAndMethods < 
BaseMustacheRenderer + self.template_file = "#{__dir__}/templates/paths_and_methods.mustache" + + # @param [Action] action API Action + def initialize(action) + super + @action = action + end + + def operations + ljust = @action.operations.map { |op| op.http_verb.length }.max + @action.operations + .sort_by { |op| [op.url.length, op.http_verb] } + .map { |op| { verb: op.http_verb.ljust(ljust), path: op.url } } + end +end diff --git a/spec-insert/lib/renderers/query_parameters.rb b/spec-insert/lib/renderers/query_parameters.rb new file mode 100644 index 0000000000..996a68903c --- /dev/null +++ b/spec-insert/lib/renderers/query_parameters.rb @@ -0,0 +1,25 @@ +# frozen_string_literal: true + +require_relative 'base_mustache_renderer' +require_relative 'parameter_table_renderer' + +# Renders query parameters +class QueryParameters < BaseMustacheRenderer + self.template_file = "#{__dir__}/templates/query_parameters.mustache" + + # @param [Action] action API Action + # @param [InsertArguments] args + def initialize(action, args) + super(nil) + @params = action.arguments.select { |arg| arg.location == ArgLocation::QUERY } + @args = args + end + + def table + ParameterTableRenderer.new(@params, @args).render + end + + def optional + @params.none?(&:required) + end +end diff --git a/spec-insert/lib/renderers/spec_insert.rb b/spec-insert/lib/renderers/spec_insert.rb new file mode 100644 index 0000000000..4840b12e15 --- /dev/null +++ b/spec-insert/lib/renderers/spec_insert.rb @@ -0,0 +1,42 @@ +# frozen_string_literal: true + +require_relative 'base_mustache_renderer' +require_relative '../insert_arguments' +require_relative '../api/action' +require_relative '../spec_insert_error' +require_relative 'paths_and_methods' +require_relative 'path_parameters' +require_relative 'query_parameters' + +# Class to render spec insertions +class SpecInsert < BaseMustacheRenderer + COMPONENTS = Set.new(%w[query_params path_params paths_and_http_methods]).freeze + self.template_file = 
"#{__dir__}/templates/spec_insert.mustache" + + # @param [Array] arg_lines the lines between "" + def initialize(arg_lines) + super + @args = InsertArguments.new(arg_lines) + @action = Action.actions[@args.api] + raise SpecInsertError, '`api` argument not specified.' unless @args.api + raise SpecInsertError, "API Action '#{@args.api}' does not exist in the spec." unless @action + end + + def arguments + @args.raw.map { |key, value| { key:, value: } } + end + + def content + raise SpecInsertError, '`component` argument not specified.' unless @args.component + case @args.component.to_sym + when :query_parameters + QueryParameters.new(@action, @args).render + when :path_parameters + PathParameters.new(@action, @args).render + when :paths_and_http_methods + PathsAndMethods.new(@action).render + else + raise SpecInsertError, "Invalid component: #{@args.component}" + end + end +end diff --git a/spec-insert/lib/renderers/table_renderer.rb b/spec-insert/lib/renderers/table_renderer.rb new file mode 100644 index 0000000000..1cabc435bd --- /dev/null +++ b/spec-insert/lib/renderers/table_renderer.rb @@ -0,0 +1,58 @@ +# frozen_string_literal: true + +# TableRenderer renders a markdown table with the given columns and rows +class TableRenderer + # Column object for rendering markdown tables + class Column + attr_reader :title, :key + attr_accessor :width + + # @param [String] title display title + # @param [String | Symbol] key key to access in row hash + def initialize(title, key) + @title = title + @key = key + @width = 0 + end + end + + # @param [Array] columns + # @param [Array] rows + # @param [Boolean] pretty whether to render a pretty table or a compact one + def initialize(columns, rows, pretty:) + @column = columns + @rows = rows + @pretty = pretty + end + + # @return [Array] + def render_lines + calculate_column_widths if @pretty + [render_column, render_divider] + render_rows + end + + private + + def calculate_column_widths + @column.each do |column| + column.width 
= [@rows.map { |row| row[column.key].to_s.length }.max || 0, column.title.length].max + end + end + + def render_column + columns = @column.map { |column| column.title.ljust(column.width) }.join(' | ') + @pretty ? "| #{columns} |" : columns + end + + def render_divider + dividers = @column.map { |column| ":#{'-' * [column.width + 1, 3].max}" } + @pretty ? "|#{dividers.join('|')}|" : dividers.join(' | ') + end + + def render_rows + @rows.map do |row| + cells = @column.map { |column| row[column.key].to_s.ljust(column.width).gsub('|', '\|') }.join(' | ') + @pretty ? "| #{cells} |" : cells + end + end +end diff --git a/spec-insert/lib/renderers/templates/path_parameters.mustache b/spec-insert/lib/renderers/templates/path_parameters.mustache new file mode 100644 index 0000000000..1b97bededd --- /dev/null +++ b/spec-insert/lib/renderers/templates/path_parameters.mustache @@ -0,0 +1,2 @@ +## Path parameters +{{{table}}} \ No newline at end of file diff --git a/spec-insert/lib/renderers/templates/paths_and_methods.mustache b/spec-insert/lib/renderers/templates/paths_and_methods.mustache new file mode 100644 index 0000000000..5221de6158 --- /dev/null +++ b/spec-insert/lib/renderers/templates/paths_and_methods.mustache @@ -0,0 +1,6 @@ +## Paths and HTTP methods +```json +{{#operations}} +{{{verb}}} {{{path}}} +{{/operations}} +``` \ No newline at end of file diff --git a/spec-insert/lib/renderers/templates/query_parameters.mustache b/spec-insert/lib/renderers/templates/query_parameters.mustache new file mode 100644 index 0000000000..4ca8255180 --- /dev/null +++ b/spec-insert/lib/renderers/templates/query_parameters.mustache @@ -0,0 +1,5 @@ +## Query parameters +{{#optional}} +All query parameters are optional. 
+{{/optional}} +{{{table}}} \ No newline at end of file diff --git a/spec-insert/lib/renderers/templates/spec_insert.mustache b/spec-insert/lib/renderers/templates/spec_insert.mustache new file mode 100644 index 0000000000..63b6323d48 --- /dev/null +++ b/spec-insert/lib/renderers/templates/spec_insert.mustache @@ -0,0 +1,7 @@ + +{{{content}}} + diff --git a/spec-insert/lib/spec_hash.rb b/spec-insert/lib/spec_hash.rb new file mode 100644 index 0000000000..06a872c9b9 --- /dev/null +++ b/spec-insert/lib/spec_hash.rb @@ -0,0 +1,60 @@ +# frozen_string_literal: true + +require 'yaml' +require_relative 'api/action' +require_relative 'api/parameter' + +# Spec class for parsing OpenAPI spec +# It's basically a wrapper around a Hash that allows for accessing hash values as object attributes +# and resolving of $refs +class SpecHash + def self.load_file(file_path) + @raw = YAML.load_file(file_path) + @parsed = SpecHash.new(@raw, parsed: false) + Action.actions = @parsed + Parameter.global = @parsed + end + + # @return [Hash] Raw OpenAPI Spec + class << self; attr_reader :raw; end + + # @return [Spec] Parsed OpenAPI Spec + class << self; attr_reader :parsed; end + + attr_reader :hash + + # @param [Hash] hash + def initialize(hash = {}, parsed: true) + @hash = parsed ? hash : parse(hash) + end + + def [](key) + parse(@hash[key]) + end + + def respond_to_missing?(name, include_private = false) + @hash.key?(name.to_s) || @hash.respond_to?(name) || super + end + + def method_missing(name, ...) + return @hash.send(name, ...) if @hash.respond_to?(name) + parse(@hash[name.to_s]) + end + + private + + def parse(value) + return value.map { |v| parse(v) } if value.is_a?(Array) + return value unless value.is_a?(Hash) + ref = value.delete('$ref') + value.transform_values! 
{ |v| parse(v) } + return SpecHash.new(value) unless ref + SpecHash.new(parse(resolve(ref)).merge(value)) + end + + def resolve(ref) + parts = ref.split('/') + parts.shift + self.class.raw.dig(*parts) + end +end diff --git a/spec-insert/lib/spec_insert_error.rb b/spec-insert/lib/spec_insert_error.rb new file mode 100644 index 0000000000..0ee5ccf159 --- /dev/null +++ b/spec-insert/lib/spec_insert_error.rb @@ -0,0 +1,4 @@ +# frozen_string_literal: true + +# Error unique to the SpecInsert process +class SpecInsertError < StandardError; end diff --git a/spec-insert/spec/_fixtures/actual_output/.gitignore b/spec-insert/spec/_fixtures/actual_output/.gitignore new file mode 100644 index 0000000000..de056073af --- /dev/null +++ b/spec-insert/spec/_fixtures/actual_output/.gitignore @@ -0,0 +1 @@ +**/*.md diff --git a/spec-insert/spec/_fixtures/expected_output/param_tables.md b/spec-insert/spec/_fixtures/expected_output/param_tables.md new file mode 100644 index 0000000000..48df5c2bd4 --- /dev/null +++ b/spec-insert/spec/_fixtures/expected_output/param_tables.md @@ -0,0 +1,43 @@ +Typical Path Parameters Example + + +## Path parameters +Parameter | Type | Description +:--- | :--- | :--- +`index` | List | Comma-separated list of data streams, indexes, and aliases to search. Supports wildcards (`*`). To search all data streams and indexes, omit this parameter or use `*` or `_all`. + + +Query Parameters Example with Global Parameters, Pretty Print, and Custom Columns + + +## Query parameters +| Type | Parameter | Description | Required | Default | +|:--------|:--------------------------|:-----------------------------------------------------------------------------------------------------------------------------------|:---------|:--------| +| Boolean | `analyze_wildcard` | If true, wildcard and prefix queries are analyzed. This parameter can only be used when the q query string parameter is specified. | Required | | +| String | `analyzer` | Analyzer to use for the query string. 
This parameter can only be used when the q query string parameter is specified. | | | +| Boolean | `pretty` | Whether to pretty format the returned JSON response. | | | +| Boolean | `human` <br> _DEPRECATED_ | _(Deprecated since 3.0: Use the `format` parameter instead.)_ Whether to return human readable values for statistics. | | | + + +Query Parameters Example with only Parameter and Description Columns + + +## Query parameters +Parameter | Description +:--- | :--- +`analyze_wildcard` | **(Required)** If true, wildcard and prefix queries are analyzed. This parameter can only be used when the q query string parameter is specified. +`analyzer` | Analyzer to use for the query string. This parameter can only be used when the q query string parameter is specified. + diff --git a/spec-insert/spec/_fixtures/expected_output/paths_and_http_methods.md b/spec-insert/spec/_fixtures/expected_output/paths_and_http_methods.md new file mode 100644 index 0000000000..8ca1569b52 --- /dev/null +++ b/spec-insert/spec/_fixtures/expected_output/paths_and_http_methods.md @@ -0,0 +1,13 @@ + + +## Paths and HTTP methods +```json +GET /_search +POST /_search +GET /{index}/_search +POST /{index}/_search +``` + diff --git a/spec-insert/spec/_fixtures/input/param_tables.md b/spec-insert/spec/_fixtures/input/param_tables.md new file mode 100644 index 0000000000..e53c09026a --- /dev/null +++ b/spec-insert/spec/_fixtures/input/param_tables.md @@ -0,0 +1,38 @@ +Typical Path Parameters Example + + +THIS + TEXT + SHOULD + BE + REPLACED + + +Query Parameters Example with Global Parameters, Pretty Print, and Custom Columns + + + THIS TEXT SHOULD BE REPLACED + + +Query Parameters Example with only Parameter and Description Columns + + +THIS +TEXT +SHOULD +BE +REPLACED + diff --git a/spec-insert/spec/_fixtures/input/paths_and_http_methods.md b/spec-insert/spec/_fixtures/input/paths_and_http_methods.md new file mode 100644 index 0000000000..0e92b8af8e --- /dev/null +++ 
b/spec-insert/spec/_fixtures/input/paths_and_http_methods.md @@ -0,0 +1,6 @@ + + + diff --git a/spec-insert/spec/_fixtures/opensearch_spec.yaml b/spec-insert/spec/_fixtures/opensearch_spec.yaml new file mode 100644 index 
0000000000..7c67f27e69 --- /dev/null +++ b/spec-insert/spec/_fixtures/opensearch_spec.yaml @@ -0,0 +1,120 @@ +openapi: 3.1.0 +info: + title: OpenSearch API Specification + version: 1.0.0 + x-api-version: 2.16.0 +paths: + /_search: + get: + operationId: search.0 + x-operation-group: search + x-version-added: '1.0' + description: Returns results matching a query. + externalDocs: + url: https://opensearch.org/docs/latest/api-reference/search/ + parameters: + - $ref: '#/components/parameters/search___query.analyze_wildcard' + - $ref: '#/components/parameters/search___query.analyzer' + post: + operationId: search.1 + x-operation-group: search + x-version-added: '1.0' + description: Returns results matching a query. + externalDocs: + url: https://opensearch.org/docs/latest/api-reference/search/ + parameters: + - $ref: '#/components/parameters/search___query.analyze_wildcard' + - $ref: '#/components/parameters/search___query.analyzer' + /{index}/_search: + get: + operationId: search.2 + x-operation-group: search + x-version-added: '1.0' + description: Returns results matching a query. + externalDocs: + url: https://opensearch.org/docs/latest/api-reference/search/ + parameters: + - $ref: '#/components/parameters/search___path.index' + - $ref: '#/components/parameters/search___query.analyze_wildcard' + - $ref: '#/components/parameters/search___query.analyzer' + post: + operationId: search.3 + x-operation-group: search + x-version-added: '1.0' + description: Returns results matching a query. + externalDocs: + url: https://opensearch.org/docs/latest/api-reference/search/ + parameters: + - $ref: '#/components/parameters/search___path.index' + - $ref: '#/components/parameters/search___query.analyze_wildcard' + - $ref: '#/components/parameters/search___query.analyzer' +components: + + parameters: + + _global___query.pretty: + name: pretty + in: query + description: Whether to pretty format the returned JSON response. 
+ schema: + type: boolean + default: false + x-global: true + + _global___query.human: + name: human + in: query + description: Whether to return human readable values for statistics. + schema: + type: boolean + default: true + x-global: true + deprecated: true + x-version-deprecated: '3.0' + x-deprecation-message: Use the `format` parameter instead. + + search___path.index: + in: path + name: index + description: |- + Comma-separated list of data streams, indexes, and aliases to search. + Supports wildcards (`*`). + To search all data streams and indexes, omit this parameter or use `*` or `_all`. + required: true + schema: + $ref: '#/components/schemas/_common___Indices' + style: simple + + search___query.analyze_wildcard: + in: query + name: analyze_wildcard + required: true + description: |- + If true, wildcard and prefix queries are analyzed. + This parameter can only be used when the q query string parameter is specified. + schema: + type: boolean + default: false + style: form + + search___query.analyzer: + in: query + name: analyzer + description: |- + Analyzer to use for the query string. + This parameter can only be used when the q query string parameter is specified. 
+ schema: + type: string + style: form + + schemas: + + _common___Indices: + oneOf: + - $ref: '#/components/schemas/_common___IndexName' + - type: array + items: + $ref: '#/components/schemas/_common___IndexName' + + _common___IndexName: + type: string diff --git a/spec-insert/spec/doc_processor_spec.rb b/spec-insert/spec/doc_processor_spec.rb new file mode 100644 index 0000000000..073613a2a9 --- /dev/null +++ b/spec-insert/spec/doc_processor_spec.rb @@ -0,0 +1,24 @@ +# frozen_string_literal: true + +require_relative 'spec_helper' +require_relative '../lib/doc_processor' +require_relative '../lib/spec_hash' + +describe DocProcessor do + SpecHash.load_file('spec/_fixtures/opensearch_spec.yaml') + + def test_file(file_name) + expected_output = File.read("#{__dir__}/_fixtures/expected_output/#{file_name}.md") + actual_output = described_class.new("#{__dir__}/_fixtures/input/#{file_name}.md", logger: Logger.new($stdout)).process(write_to_file: false) + File.write("./spec/_fixtures/actual_output/#{file_name}.md", actual_output) + expect(actual_output).to eq(expected_output) + end + + it 'inserts the param tables correctly' do + test_file('param_tables') + end + + it 'inserts the paths and http methods correctly' do + test_file('paths_and_http_methods') + end +end diff --git a/spec-insert/spec/spec_helper.rb b/spec-insert/spec/spec_helper.rb new file mode 100644 index 0000000000..74d9dc9bb9 --- /dev/null +++ b/spec-insert/spec/spec_helper.rb @@ -0,0 +1,102 @@ +# This file was generated by the `rspec --init` command. Conventionally, all +# specs live under a `spec` directory, which RSpec adds to the `$LOAD_PATH`. +# The generated `.rspec` file contains `--require spec_helper` which will cause +# this file to always be loaded, without a need to explicitly require it in any +# files. +# +# Given that it is always loaded, you are encouraged to keep this file as +# light-weight as possible. 
Requiring heavyweight dependencies from this file +# will add to the boot time of your test suite on EVERY test run, even for an +# individual file that may not need all of that loaded. Instead, consider making +# a separate helper file that requires the additional dependencies and performs +# the additional setup, and require it from the spec files that actually need +# it. +# +# See https://rubydoc.info/gems/rspec-core/RSpec/Core/Configuration +RSpec.configure do |config| + # rspec-expectations config goes here. You can use an alternate + # assertion/expectation library such as wrong or the stdlib/minitest + # assertions if you prefer. + config.expect_with :rspec do |expectations| + # This option will default to `true` in RSpec 4. It makes the `description` + # and `failure_message` of custom matchers include text for helper methods + # defined using `chain`, e.g.: + # be_bigger_than(2).and_smaller_than(4).description + # # => "be bigger than 2 and smaller than 4" + # ...rather than: + # # => "be bigger than 2" + expectations.include_chain_clauses_in_custom_matcher_descriptions = true + end + + # rspec-mocks config goes here. You can use an alternate test double + # library (such as bogus or mocha) by changing the `mock_with` option here. + config.mock_with :rspec do |mocks| + # Prevents you from mocking or stubbing a method that does not exist on + # a real object. This is generally recommended, and will default to + # `true` in RSpec 4. + mocks.verify_partial_doubles = true + end + + # This option will default to `:apply_to_host_groups` in RSpec 4 (and will + # have no way to turn it off -- the option exists only for backwards + # compatibility in RSpec 3). It causes shared context metadata to be + # inherited by the metadata hash of host groups and examples, rather than + # triggering implicit auto-inclusion in groups with matching metadata. 
+ config.shared_context_metadata_behavior = :apply_to_host_groups + + # The settings below are suggested to provide a good initial experience + # with RSpec, but feel free to customize to your heart's content. + + # This allows you to limit a spec run to individual examples or groups + # you care about by tagging them with `:focus` metadata. When nothing + # is tagged with `:focus`, all examples get run. RSpec also provides + # aliases for `it`, `describe`, and `context` that include `:focus` + # metadata: `fit`, `fdescribe` and `fcontext`, respectively. + config.filter_run_when_matching :focus + + # Allows RSpec to persist some state between runs in order to support + # the `--only-failures` and `--next-failure` CLI options. We recommend + # you configure your source control system to ignore this file. + config.example_status_persistence_file_path = 'rspec_examples.txt' + + # Limits the available syntax to the non-monkey patched syntax that is + # recommended. For more details, see: + # https://rspec.info/features/3-12/rspec-core/configuration/zero-monkey-patching-mode/ + config.disable_monkey_patching! + + # This setting enables warnings. It's recommended, but in some cases may + # be too noisy due to issues in dependencies. + config.warnings = true + + # Many RSpec users commonly either run the entire suite or an individual + # file, and it's useful to allow more verbose expected_output when running an + # individual spec file. + if config.files_to_run.one? + # Use the documentation formatter for detailed expected_output, + # unless a formatter has already been configured + # (e.g. via a command-line flag). + config.default_formatter = 'doc' + end + + # Print the 10 slowest examples and example groups at the + # end of the spec run, to help surface which specs are running + # particularly slow. + config.profile_examples = 10 + + # Run specs in random order to surface order dependencies. 
If you find an + # order dependency and want to debug it, you can fix the order by providing + # the seed, which is printed after each run. + # --seed 1234 + config.order = :random + + # Seed global randomization in this process using the `--seed` CLI option. + # Setting this allows you to use `--seed` to deterministically reproduce + # test failures related to randomization by passing the same `--seed` value + # as the one that triggered the failure. + Kernel.srand config.seed + + config.expose_dsl_globally = true +end + +require 'active_support/all' +require 'rspec' From 8d3ec419e67d46064bfcde376d1795abe71b2b3a Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Tue, 12 Nov 2024 18:10:21 +0000 Subject: [PATCH 02/14] Add keep type docs #8063 (#8122) * adding keep type docs #8063 Signed-off-by: Anton Rubin * Update keep-types.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Update keep-types.md Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * fixing types Signed-off-by: Anton Rubin * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/index.md | 2 +- _analyzers/token-filters/keep-types.md | 115 +++++++++++++++++++++++++ 2 files changed, 116 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/keep-types.md diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 9976feed60..0d87ce72aa 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -32,7 +32,7 @@ Token filter 
| Underlying Lucene token filter| Description `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. `hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. -`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. +[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. 
`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. `keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. diff --git a/_analyzers/token-filters/keep-types.md b/_analyzers/token-filters/keep-types.md new file mode 100644 index 0000000000..59e617f567 --- /dev/null +++ b/_analyzers/token-filters/keep-types.md @@ -0,0 +1,115 @@ +--- +layout: default +title: Keep types +parent: Token filters +nav_order: 180 +--- + +# Keep types token filter + +The `keep_types` token filter is a type of token filter used in text analysis to control which token types are kept or discarded. Different tokenizers produce different token types, for example, `<HOST>`, `<NUM>`, or `<ALPHANUM>`. + +The `keyword`, `simple_pattern`, and `simple_pattern_split` tokenizers do not support the `keep_types` token filter because these tokenizers do not support token type attributes. +{: .note} + +## Parameters + +The `keep_types` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`types` | Required | List of strings | List of token types to be kept or discarded (determined by the `mode`). +`mode`| Optional | String | Whether to `include` or `exclude` the token types specified in `types`. Default is `include`. 
+ + +## Example + +The following example request creates a new index named `test_index` and configures an analyzer with a `keep_types` filter: + +```json +PUT /test_index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["lowercase", "keep_types_filter"] + } + }, + "filter": { + "keep_types_filter": { + "type": "keep_types", + "types": ["<ALPHANUM>"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /test_index/_analyze +{ + "analyzer": "custom_analyzer", + "text": "Hello 2 world! This is an example." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "hello", + "start_offset": 0, + "end_offset": 5, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "world", + "start_offset": 8, + "end_offset": 13, + "type": "<ALPHANUM>", + "position": 2 + }, + { + "token": "this", + "start_offset": 15, + "end_offset": 19, + "type": "<ALPHANUM>", + "position": 3 + }, + { + "token": "is", + "start_offset": 20, + "end_offset": 22, + "type": "<ALPHANUM>", + "position": 4 + }, + { + "token": "an", + "start_offset": 23, + "end_offset": 25, + "type": "<ALPHANUM>", + "position": 5 + }, + { + "token": "example", + "start_offset": 26, + "end_offset": 33, + "type": "<ALPHANUM>", + "position": 6 + } + ] +} +``` From 239d06f62053ed9c5535312b977dbb79cea45a13 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Wed, 13 Nov 2024 13:48:45 -0500 Subject: [PATCH 03/14] Revert "Document setting allowing size > 0 queries into request cache (#8634)" (#8737) This reverts commit 274987d6481c3ad67b1594908131ebe931e04f86. 
--- _search-plugins/caching/request-cache.md | 1 - 1 file changed, 1 deletion(-) diff --git a/_search-plugins/caching/request-cache.md b/_search-plugins/caching/request-cache.md index 768c75fc92..124152300b 100644 --- a/_search-plugins/caching/request-cache.md +++ b/_search-plugins/caching/request-cache.md @@ -28,7 +28,6 @@ Setting | Data type | Default | Level | Static/Dynamic | Description `indices.cache.cleanup_interval` | Time unit | `1m` (1 minute) | Cluster | Static | Schedules a recurring background task that cleans up expired entries from the cache at the specified interval. `indices.requests.cache.size` | Percentage | `1%` | Cluster | Static | The cache size as a percentage of the heap size (for example, to use 1% of the heap, specify `1%`). `index.requests.cache.enable` | Boolean | `true` | Index | Dynamic | Enables or disables the request cache. -`indices.requests.cache.enable_for_all_requests` | Boolean | `false` | Cluster | Dynamic | Enables or disables caching queries in which `size` is greater than `0`. 
### Example From 87a36f72f36107010f8581945773992fbf10cb08 Mon Sep 17 00:00:00 2001 From: pjuri <47542497+pjuri@users.noreply.github.com> Date: Wed, 13 Nov 2024 21:27:01 +0100 Subject: [PATCH 04/14] Update snapshot-restore.md (#8734) * Update snapshot-restore.md Adds info from: https://github.com/opensearch-project/OpenSearch/issues/16305 Signed-off-by: pjuri <47542497+pjuri@users.noreply.github.com> * Update _tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: pjuri <47542497+pjuri@users.noreply.github.com> * Update _tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: pjuri <47542497+pjuri@users.noreply.github.com> * Update _tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md Co-authored-by: Nathan Bower Signed-off-by: pjuri <47542497+pjuri@users.noreply.github.com> --------- Signed-off-by: pjuri <47542497+pjuri@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../snapshots/snapshot-restore.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/_tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md b/_tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md index 759080bdec..ac717633f6 100644 --- a/_tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md +++ b/_tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore.md @@ -110,6 +110,20 @@ You will most likely not need to specify any parameters except for `location`. F sudo ./bin/opensearch-keystore add s3.client.default.secret_key ``` +1. 
(Optional) If you're using a custom S3 endpoint (for example, MinIO), disable the Amazon EC2 metadata connection: + + ```bash + export AWS_EC2_METADATA_DISABLED=true + ``` + + If you're installing OpenSearch using Helm, update the following settings in your values file: + + ```yml + extraEnvs: + - name: AWS_EC2_METADATA_DISABLED + value: "true" + ``` + 1. (Optional) If you're using temporary credentials, add your session token: ```bash From 57157d056d1abfb80c0a68dbf14e941d5c1d35ea Mon Sep 17 00:00:00 2001 From: Kaushal Kumar Date: Wed, 13 Nov 2024 15:13:17 -0800 Subject: [PATCH 05/14] add wlm feature overview (#8632) * add wlm feature overview Signed-off-by: Kaushal Kumar * address automated comments Signed-off-by: Kaushal Kumar * address automated comments Signed-off-by: Kaushal Kumar * Recommit intial changes Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update wlm-feature-overview.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update wlm-feature-overview.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Grammar and typo fixes Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _tuning-your-cluster/availability-and-recovery/workload-management/wlm-feature-overview.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * move permissions section Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update wlm-feature-overview.md Signed-off-by: Naarcha-AWS 
<97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update wlm-feature-overview.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Kaushal Kumar Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../wlm-feature-overview.md | 194 ++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 _tuning-your-cluster/availability-and-recovery/workload-management/wlm-feature-overview.md diff --git a/_tuning-your-cluster/availability-and-recovery/workload-management/wlm-feature-overview.md b/_tuning-your-cluster/availability-and-recovery/workload-management/wlm-feature-overview.md new file mode 100644 index 0000000000..956a01a774 --- /dev/null +++ b/_tuning-your-cluster/availability-and-recovery/workload-management/wlm-feature-overview.md @@ -0,0 +1,194 @@ +--- +layout: default +title: Workload management +nav_order: 70 +has_children: true +parent: Availability and recovery +--- + +Introduced 2.18 +{: .label .label-purple } + +# Workload management + +Workload management allows you to group search traffic and isolate network resources, preventing the overuse of network resources by specific requests. It offers the following benefits: + +- Tenant-level admission control and reactive query management. When resource usage exceeds configured limits, it automatically identifies and cancels demanding queries, ensuring fair resource distribution. + +- Tenant-level isolation within the cluster for search workloads, operating at the node level. 
+ +## Installing workload management + +To install workload management, use the following command: + +```bash +./bin/opensearch-plugin install workload-management +``` +{% include copy-curl.html %} + +## Query groups + +A _query group_ is a logical grouping of tasks with defined resource limits. System administrators can dynamically manage query groups using the Workload Management APIs. These query groups can be used to create search requests with resource limits. + +### Permissions + +Only users with administrator-level permissions can create and update query groups using the Workload Management APIs. + +### Operating modes + +The following operating modes determine the operating level for a query group: + +- **Disabled mode**: Workload management is disabled. + +- **Enabled mode**: Workload management is enabled and will cancel and reject queries once the query group's configured thresholds are reached. + +- **Monitor_only mode** (Default): Workload management will monitor tasks but will not cancel or reject any queries. + +### Example request + +The following example request adds a query group named `analytics`: + +```json +PUT _wlm/query_group +{ + "name": "analytics", + "resiliency_mode": "enforced", + "resource_limits": { + "cpu": 0.4, + "memory": 0.2 + } +} +``` +{% include copy-curl.html %} + +When creating a query group, make sure that the sum of the resource limits for a single resource, such as `cpu` or `memory`, does not exceed `1`. + +### Example response + +OpenSearch responds with the set resource limits and the `_id` for the query group: + +```json +{ + "_id":"preXpc67RbKKeCyka72_Gw", + "name":"analytics", + "resiliency_mode":"enforced", + "resource_limits":{ + "cpu":0.4, + "memory":0.2 + }, + "updated_at":1726270184642 +} +``` + +## Using `queryGroupId` + +You can associate a query request with a `queryGroupId` to manage and allocate resources within the limits defined by the query group. 
By using this ID, request routing and tracking are associated with the query group, ensuring resource quotas and task limits are maintained. + +The following example query uses the `queryGroupId` to ensure that the query does not exceed that query group's resource limits: + +```json +GET testindex/_search +Host: localhost:9200 +Content-Type: application/json +queryGroupId: preXpc67RbKKeCyka72_Gw +{ + "query": { + "match": { + "field_name": "value" + } + } +} +``` +{% include copy-curl.html %} + +## Workload management settings + +The following settings can be used to customize workload management using the `_cluster/settings` API. + +| **Setting name** | **Description** | +| :--- | :--- | +| `wlm.query_group.duress_streak` | Determines the node duress threshold. Once the threshold is reached, the node is marked as `in duress`. | +| `wlm.query_group.enforcement_interval` | Defines the monitoring interval. | +| `wlm.query_group.mode` | Defines the [operating mode](#operating-modes). | +| `wlm.query_group.node.memory_rejection_threshold` | Defines the query group level `memory` threshold. When the threshold is reached, the request is rejected. | +| `wlm.query_group.node.cpu_rejection_threshold` | Defines the query group level `cpu` threshold. When the threshold is reached, the request is rejected. | +| `wlm.query_group.node.memory_cancellation_threshold` | Controls whether the node is considered to be in duress when the `memory` threshold is reached. Requests routed to nodes in duress are canceled. | +| `wlm.query_group.node.cpu_cancellation_threshold` | Controls whether the node is considered to be in duress when the `cpu` threshold is reached. Requests routed to nodes in duress are canceled. | + +When setting rejection and cancellation thresholds, remember that the rejection threshold for a resource should always be lower than the cancellation threshold. 
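The threshold rule above can be checked before a settings update is sent. The following is a minimal Python sketch, not part of OpenSearch: the function name `check_thresholds` is hypothetical, the threshold values are assumed to be fractions, and `0.8` and `0.9` are illustrative numbers rather than documented defaults.

```python
RESOURCES = ("cpu", "memory")


def check_thresholds(settings):
    """Return a list of problems found in a dict of workload management settings.

    Assumption: each threshold is a fraction, and for every resource the
    rejection threshold must be strictly lower than the cancellation threshold.
    """
    problems = []
    for resource in RESOURCES:
        rejection = settings[f"wlm.query_group.node.{resource}_rejection_threshold"]
        cancellation = settings[f"wlm.query_group.node.{resource}_cancellation_threshold"]
        if not rejection < cancellation:
            problems.append(
                f"{resource}: rejection threshold {rejection} is not lower "
                f"than cancellation threshold {cancellation}"
            )
    return problems


# Illustrative values only.
example_settings = {
    "wlm.query_group.node.cpu_rejection_threshold": 0.8,
    "wlm.query_group.node.cpu_cancellation_threshold": 0.9,
    "wlm.query_group.node.memory_rejection_threshold": 0.8,
    "wlm.query_group.node.memory_cancellation_threshold": 0.9,
}
print(check_thresholds(example_settings))
```

An empty list means that every rejection threshold is lower than its cancellation counterpart. The check runs entirely on the client and does not contact the cluster.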
+ +## Workload Management Stats API + +The Workload Management Stats API returns workload management metrics for a query group, using the following method: + +```json +GET _wlm/stats +``` +{% include copy-curl.html %} + +### Example response + +```json +{ + "_nodes": { + "total": 1, + "successful": 1, + "failed": 0 + }, + "cluster_name": "XXXXXXYYYYYYYY", + "A3L9EfBIQf2anrrUhh_goA": { + "query_groups": { + "16YGxFlPRdqIO7K4EACJlw": { + "total_completions": 33570, + "total_rejections": 0, + "total_cancellations": 0, + "cpu": { + "current_usage": 0.03319935314357281, + "cancellations": 0, + "rejections": 0 + }, + "memory": { + "current_usage": 0.002306486276211217, + "cancellations": 0, + "rejections": 0 + } + }, + "DEFAULT_QUERY_GROUP": { + "total_completions": 42572, + "total_rejections": 0, + "total_cancellations": 0, + "cpu": { + "current_usage": 0, + "cancellations": 0, + "rejections": 0 + }, + "memory": { + "current_usage": 0, + "cancellations": 0, + "rejections": 0 + } + } + } + } +} +``` +{% include copy-curl.html %} + +### Response body fields + +| Field name | Description | +| :--- | :--- | +| `total_completions` | The total number of request completions in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. | +| `total_rejections` | The total number of request rejections in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. | +| `total_cancellations` | The total number of cancellations in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. | +| `cpu` | The `cpu` resource type statistics for the `query_group`. | +| `memory` | The `memory` resource type statistics for the `query_group`. | + +### Resource type statistics + +| Field name | Description | +| :--- | :---- | +| `current_usage` | The resource usage for the `query_group` at the given node based on the last run of the monitoring thread. 
This value is updated based on the `wlm.query_group.enforcement_interval`. | +| `cancellations` | The number of cancellations resulting from the cancellation threshold being reached. | +| `rejections` | The number of rejections resulting from the rejection threshold being reached. | + From c4d59f215131ec07eefb640705212f84f3a637dc Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Thu, 14 Nov 2024 09:21:56 -0500 Subject: [PATCH 06/14] Update index.md (#8743) Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _query-dsl/geo-and-xy/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_query-dsl/geo-and-xy/index.md b/_query-dsl/geo-and-xy/index.md index ee51e1e523..9bcf6a9462 100644 --- a/_query-dsl/geo-and-xy/index.md +++ b/_query-dsl/geo-and-xy/index.md @@ -30,7 +30,7 @@ OpenSearch provides the following geographic query types: - [**Geo-bounding box queries**]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/geo-and-xy/geo-bounding-box/): Return documents with geopoint field values that are within a bounding box. - [**Geodistance queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geodistance/): Return documents with geopoints that are within a specified distance from the provided geopoint. -- [**Geopolygon queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geodistance/): Return documents containing geopoints that are within a polygon. +- [**Geopolygon queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geopolygon/): Return documents containing geopoints that are within a polygon. - [**Geoshape queries**]({{site.url}}{{site.baseurl}}/query-dsl/geo-and-xy/geoshape/): Return documents that contain: - Geoshapes and geopoints that have one of four spatial relations to the provided shape: `INTERSECTS`, `DISJOINT`, `WITHIN`, or `CONTAINS`. - - Geopoints that intersect the provided shape. 
\ No newline at end of file + - Geopoints that intersect the provided shape. From 6a331a5378e310107754e6a4cd7c536a527cfde0 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 19:58:32 +0000 Subject: [PATCH 07/14] Add dictionary decompounder docs #7979 (#7994) * Adding dictionary decompounder docs #7979 Signed-off-by: Anton Rubin * Update dictionary-decompounder.md Signed-off-by: AntonEliatra * Update dictionary-decompounder.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../token-filters/dictionary-decompounder.md | 101 ++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 102 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/dictionary-decompounder.md diff --git a/_analyzers/token-filters/dictionary-decompounder.md b/_analyzers/token-filters/dictionary-decompounder.md new file mode 100644 index 0000000000..ced6fd6fbc --- /dev/null +++ b/_analyzers/token-filters/dictionary-decompounder.md @@ -0,0 +1,101 @@ +--- +layout: default +title: Dictionary decompounder +parent: Token filters +nav_order: 110 +--- + +# Dictionary decompounder token filter + +The `dictionary_decompounder` token filter is used to split compound words into their constituent parts based on a predefined dictionary. 
This filter is particularly useful for languages like German, Dutch, or Finnish, in which compound words are common, so breaking them down can improve search relevance. The `dictionary_decompounder` token filter determines whether each token (word) can be split into smaller tokens based on a list of known words. If the token can be split into known words, the filter generates the subtokens for the token. + +## Parameters + +The `dictionary_decompounder` token filter has the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words. +`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line. +`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`. +`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`. +`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`. +`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`. 
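To make the interaction of these parameters concrete, the following Python sketch models the splitting logic in a simplified form. It is an illustration under stated assumptions, not the Lucene implementation: the function name `decompound` is hypothetical, matching is done on lowercased text, and the original token is always kept in the output.

```python
def decompound(token, word_list, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return the original token followed by any dictionary subwords found in it."""
    words = {w.lower() for w in word_list}
    tokens = [token]
    # Tokens shorter than min_word_size are never split.
    if len(token) < min_word_size:
        return tokens
    lowered = token.lower()
    # Try every start position and every allowed subword length.
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            candidate = lowered[start:start + size]
            if candidate in words:
                if only_longest_match:
                    # Sizes increase, so the last match at this start is the longest.
                    longest = candidate
                else:
                    tokens.append(candidate)
        if longest is not None:
            tokens.append(longest)
    return tokens


print(decompound("slowgreenturtleswim", ["slow", "green", "turtle"]))
```

In this simplified model, `only_longest_match` keeps the longest dictionary word per start position. The actual filter may resolve overlapping matches differently, so treat the sketch as a reading aid for the parameter table only.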
+ +## Example + +The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter: + +```json +PUT /decompound_example +{ + "settings": { + "analysis": { + "filter": { + "my_dictionary_decompounder": { + "type": "dictionary_decompounder", + "word_list": ["slow", "green", "turtle"] + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["lowercase", "my_dictionary_decompounder"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /decompound_example/_analyze +{ + "analyzer": "my_analyzer", + "text": "slowgreenturtleswim" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "slowgreenturtleswim", + "start_offset": 0, + "end_offset": 19, + "type": "", + "position": 0 + }, + { + "token": "slow", + "start_offset": 0, + "end_offset": 19, + "type": "", + "position": 0 + }, + { + "token": "green", + "start_offset": 0, + "end_offset": 19, + "type": "", + "position": 0 + }, + { + "token": "turtle", + "start_offset": 0, + "end_offset": 19, + "type": "", + "position": 0 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 0d87ce72aa..d2f4ce0660 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -25,7 +25,7 @@ Token filter | Underlying Lucene token filter| Description [`decimal_digit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/decimal-digit/) | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). 
[`delimited_payload`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-payload/) | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters preceding the delimiter, and a payload consists of all characters following the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. [`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. -`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. +[`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Splits compound words into their constituent parts based on a predefined dictionary. Useful for many Germanic languages. 
`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. `elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). `fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. From eccc64f8ada89e17fd62379d9784e9336ae51490 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 20:08:29 +0000 Subject: [PATCH 08/14] Add edge n-gram token filter docs #7980 (#8025) * adding edge n-gram token filter docs #7980 Signed-off-by: Anton Rubin * fixing vale errors Signed-off-by: Anton Rubin * Update edge-ngram.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * adding comparison to ngram token filter Signed-off-by: Anton Rubin * Update edge-ngram.md Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/edge-ngram.md | 111 
+++++++++++++++++++++++++ _analyzers/token-filters/index.md | 4 +- 2 files changed, 113 insertions(+), 2 deletions(-) create mode 100644 _analyzers/token-filters/edge-ngram.md diff --git a/_analyzers/token-filters/edge-ngram.md b/_analyzers/token-filters/edge-ngram.md new file mode 100644 index 0000000000..be3eaf6fab --- /dev/null +++ b/_analyzers/token-filters/edge-ngram.md @@ -0,0 +1,111 @@ +--- +layout: default +title: Edge n-gram +parent: Token filters +nav_order: 120 +--- +# Edge n-gram token filter +The `edge_ngram` token filter is very similar to the `ngram` token filter, where a particular string is split into substrings of different lengths. The `edge_ngram` token filter, however, generates n-grams (substrings) only from the beginning (edge) of a token. It's particularly useful in scenarios like autocomplete or prefix matching, where you want to match the beginning of words or phrases as the user types them. + +## Parameters + +The `edge_ngram` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`min_gram` | Optional | Integer | The minimum length of the n-grams that will be generated. Default is `1`. +`max_gram` | Optional | Integer | The maximum length of the n-grams that will be generated. Default is `1` for the `edge_ngram` filter and `2` for custom token filters. Avoid setting this parameter to a low value. If the value is set too low, only very short n-grams will be generated and the search term will not be found. For example, if `max_gram` is set to `3` and you index the word "banana", the longest generated token will be "ban". If the user searches for "banana", no matches will be returned. You can use the `truncate` token filter in a search analyzer to mitigate this risk. +`preserve_original` | Optional | Boolean | Includes the original token in the output. Default is `false`. 
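The effect of `min_gram`, `max_gram`, and `preserve_original` can be sketched in a few lines of Python. This is a simplified model for a single, already tokenized and lowercased token, not the Lucene implementation: the function name `edge_ngrams` is hypothetical, and the position of the preserved original token in the output is an assumption.

```python
def edge_ngrams(token, min_gram=1, max_gram=1, preserve_original=False):
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    grams = [token[:n] for n in range(min_gram, longest + 1)]
    # Keep the full token when requested and not already produced as a prefix.
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


print(edge_ngrams("banana", min_gram=1, max_gram=3))
print(edge_ngrams("banana", min_gram=1, max_gram=3, preserve_original=True))
```

With `max_gram` set to `3`, the longest prefix of "banana" is "ban", which is why a search for the full word finds nothing unless the original token is preserved or the search text is truncated to the same length.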
+ +## Example + +The following example request creates a new index named `edge_ngram_example` and configures an analyzer with the `edge_ngram` filter: + +```json +PUT /edge_ngram_example +{ + "settings": { + "analysis": { + "filter": { + "my_edge_ngram": { + "type": "edge_ngram", + "min_gram": 3, + "max_gram": 4 + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["lowercase", "my_edge_ngram"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /edge_ngram_example/_analyze +{ + "analyzer": "my_analyzer", + "text": "slow green turtle" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "slo", + "start_offset": 0, + "end_offset": 4, + "type": "", + "position": 0 + }, + { + "token": "slow", + "start_offset": 0, + "end_offset": 4, + "type": "", + "position": 0 + }, + { + "token": "gre", + "start_offset": 5, + "end_offset": 10, + "type": "", + "position": 1 + }, + { + "token": "gree", + "start_offset": 5, + "end_offset": 10, + "type": "", + "position": 1 + }, + { + "token": "tur", + "start_offset": 11, + "end_offset": 17, + "type": "", + "position": 2 + }, + { + "token": "turt", + "start_offset": 11, + "end_offset": 17, + "type": "", + "position": 2 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index d2f4ce0660..95a09f0807 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -25,8 +25,8 @@ Token filter | Underlying Lucene token filter| Description [`decimal_digit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/decimal-digit/) | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to 
basic Latin digits (0--9). [`delimited_payload`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-payload/) | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters preceding the delimiter, and a payload consists of all characters following the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. [`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. -[`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Splits compound words into their constituent parts based on a predefined dictionary. Useful for many Germanic languages. -`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. 
+[`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. +[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. `elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). `fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. 
From b0a56c1a748bf4746763fda501258b236a8aec5e Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 20:12:24 +0000 Subject: [PATCH 09/14] Add fingerprint token filter #7982 (#8059) * adding fingerprint token filter #7982 Signed-off-by: Anton Rubin * fixing typo Signed-off-by: Anton Rubin * Update fingerprint.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * Update _analyzers/token-filters/fingerprint.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/fingerprint.md | 86 +++++++++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 87 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/fingerprint.md diff --git a/_analyzers/token-filters/fingerprint.md b/_analyzers/token-filters/fingerprint.md new file mode 100644 index 0000000000..75c6615459 --- /dev/null +++ b/_analyzers/token-filters/fingerprint.md @@ -0,0 +1,86 @@ +--- +layout: default +title: Fingerprint +parent: Token filters +nav_order: 140 +--- + +# Fingerprint token filter + +The `fingerprint` token filter is used to standardize and deduplicate text. This is particularly useful when consistency in text processing is crucial. The `fingerprint` token filter achieves this by processing text using the following steps: + +1. **Lowercasing**: Converts all text to lowercase. +2. **Splitting**: Breaks the text into tokens. +3. **Sorting**: Arranges the tokens in alphabetical order. +4. **Removing duplicates**: Eliminates repeated tokens. +5. 
**Joining tokens**: Combines the tokens into a single string, typically joined by a space or another specified separator. + +## Parameters + +The `fingerprint` token filter can be configured with the following two parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_output_size` | Optional | Integer | Limits the length of the generated fingerprint string. If the concatenated string exceeds the `max_output_size`, the filter will not produce any output, resulting in an empty token. Default is `255`. +`separator` | Optional | String | Defines the character(s) used to join the tokens into a single string after they have been sorted and deduplicated. Default is space (`" "`). + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `fingerprint` token filter: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "filter": { + "my_fingerprint": { + "type": "fingerprint", + "max_output_size": 200, + "separator": "-" + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_fingerprint" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "OpenSearch is a powerful search engine that scales easily" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "a-easily-engine-is-opensearch-powerful-scales-search-that", + "start_offset": 0, + "end_offset": 57, + "type": "fingerprint", + "position": 0 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 95a09f0807..f04357df15 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -28,7 +28,7 @@ Token filter 
| Underlying Lucene token filter| Description [`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. [`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. `elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). -`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. +[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. 
`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. From f98dcafccfefe288c9b19fbf710301e37a083b8f Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 20:25:04 +0000 Subject: [PATCH 10/14] Add elision token filter docs #7981 (#8026) * adding elision token filter docs #7981 Signed-off-by: Anton Rubin * Update elision.md Signed-off-by: AntonEliatra * Update elision.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/token-filters/elision.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/elision.md | 124 ++++++++++++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 
125 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/elision.md diff --git a/_analyzers/token-filters/elision.md b/_analyzers/token-filters/elision.md new file mode 100644 index 0000000000..b5dd5134b6 --- /dev/null +++ b/_analyzers/token-filters/elision.md @@ -0,0 +1,124 @@ +--- +layout: default +title: Elision +parent: Token filters +nav_order: 130 +--- + +# Elision token filter + +The `elision` token filter is used to remove elided characters from words in certain languages. Elision typically occurs in languages such as French, in which words are often contracted and combined with the following word, typically by omitting a vowel and replacing it with an apostrophe. + +The `elision` token filter is already preconfigured in the following [language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/): `catalan`, `french`, `irish`, and `italian`. +{: .note} + +## Parameters + +The custom `elision` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`articles` | Required if `articles_path` is not configured | Array of strings | Defines which articles or short words should be removed when they appear as part of an elision. +`articles_path` | Required if `articles` is not configured | String | Specifies the path to a custom list of articles that should be removed during the analysis process. +`articles_case` | Optional | Boolean | Specifies whether the filter is case sensitive when matching elisions. Default is `false`. + +## Example + +The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. 
The following example request creates a new index named `french_texts` and configures an analyzer with the `french_elision` filter: + +```json +PUT /french_texts +{ + "settings": { + "analysis": { + "filter": { + "french_elision": { + "type": "elision", + "articles": [ "l", "t", "m", "d", "n", "s", "j" ] + } + }, + "analyzer": { + "french_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["lowercase", "french_elision"] + } + } + } + }, + "mappings": { + "properties": { + "text": { + "type": "text", + "analyzer": "french_analyzer" + } + } + } +} + +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /french_texts/_analyze +{ + "analyzer": "french_analyzer", + "text": "L'étudiant aime l'école et le travail." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "étudiant", + "start_offset": 0, + "end_offset": 10, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "aime", + "start_offset": 11, + "end_offset": 15, + "type": "<ALPHANUM>", + "position": 1 + }, + { + "token": "école", + "start_offset": 16, + "end_offset": 23, + "type": "<ALPHANUM>", + "position": 2 + }, + { + "token": "et", + "start_offset": 24, + "end_offset": 26, + "type": "<ALPHANUM>", + "position": 3 + }, + { + "token": "le", + "start_offset": 27, + "end_offset": 29, + "type": "<ALPHANUM>", + "position": 4 + }, + { + "token": "travail", + "start_offset": 30, + "end_offset": 37, + "type": "<ALPHANUM>", + "position": 5 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index f04357df15..2a3d4fe784 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -27,7 +27,7 @@ Token filter | Underlying Lucene token filter| Description [`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | 
[DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. [`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. [`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. -`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). +[`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). 
[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. `hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. 
From 01c0d493c1bb6e02dc7a33d076a209ae9059f6ec Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 20:57:23 +0000 Subject: [PATCH 11/14] Add hunspell token filter #8061 (#8070) * adding hunspell token filter #8061 Signed-off-by: Anton Rubin * adding dedup and example where to download files Signed-off-by: Anton Rubin * Update hunspell.md Signed-off-by: AntonEliatra * Update hunspell.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/token-filters/hunspell.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/hunspell.md | 108 +++++++++++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 109 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/hunspell.md diff --git a/_analyzers/token-filters/hunspell.md b/_analyzers/token-filters/hunspell.md new file mode 100644 index 0000000000..6720ba74de --- /dev/null +++ b/_analyzers/token-filters/hunspell.md @@ -0,0 +1,108 @@ +--- +layout: default +title: Hunspell +parent: Token filters +nav_order: 160 +--- + +# Hunspell token filter + +The `hunspell` token filter is used for stemming and morphological analysis of words in a specific language. This filter applies Hunspell dictionaries, which are widely used in spell checkers. It works by breaking down words into their root forms (stemming). 
+ +The Hunspell dictionary files are automatically loaded at startup from the `/hunspell/` directory. For example, the `en_GB` locale must have at least one `.aff` file and one or more `.dic` files in the `/hunspell/en_GB/` directory. + +You can download these files from [LibreOffice dictionaries](https://github.com/LibreOffice/dictionaries). + +## Parameters + +The `hunspell` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`language/lang/locale` | At least one of the three is required | String | Specifies the language for the Hunspell dictionary. +`dedup` | Optional | Boolean | Determines whether to remove multiple duplicate stemming terms for the same token. Default is `true`. +`dictionary` | Optional | Array of strings | Configures the dictionary files to be used for the Hunspell dictionary. Default is all files in the `/hunspell/` directory. +`longest_only` | Optional | Boolean | Specifies whether only the longest stemmed version of the token should be returned. Default is `false`. 
+ +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `hunspell` filter: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "filter": { + "my_hunspell_filter": { + "type": "hunspell", + "lang": "en_GB", + "dedup": true, + "longest_only": true + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_hunspell_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "the turtle moves slowly" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "the", + "start_offset": 0, + "end_offset": 3, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "turtle", + "start_offset": 4, + "end_offset": 10, + "type": "<ALPHANUM>", + "position": 1 + }, + { + "token": "move", + "start_offset": 11, + "end_offset": 16, + "type": "<ALPHANUM>", + "position": 2 + }, + { + "token": "slow", + "start_offset": 17, + "end_offset": 23, + "type": "<ALPHANUM>", + "position": 3 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 2a3d4fe784..fd931d1efc 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -30,7 +30,7 @@ Token filter | Underlying Lucene token filter| Description [`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). 
[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. -`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. +[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. 
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. From 1bb7f3e073da935e7c4f7e0e02d6b09872b27723 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 21:01:16 +0000 Subject: [PATCH 12/14] Add keep words token filter docs #8064 (#8124) * adding keep words token filter docs #8064 Signed-off-by: Anton Rubin * fixing vale errors Signed-off-by: Anton Rubin * Update keep-words.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Update keep-words.md Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/token-filters/index.md | 2 +- _analyzers/token-filters/keep-words.md | 92 ++++++++++++++++++++++++++ 2 files changed, 93 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/keep-words.md diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index fd931d1efc..f6b020b51f 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -33,7 +33,7 @@ Token filter | Underlying Lucene token filter| Description 
[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. -`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. +[`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. `keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. 
`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. `kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. diff --git a/_analyzers/token-filters/keep-words.md b/_analyzers/token-filters/keep-words.md new file mode 100644 index 0000000000..4a6b199e5c --- /dev/null +++ b/_analyzers/token-filters/keep-words.md @@ -0,0 +1,92 @@ +--- +layout: default +title: Keep words +parent: Token filters +nav_order: 190 +--- + +# Keep words token filter + +The `keep_words` token filter is designed to keep only certain words during the analysis process. This filter is useful if you have a large body of text but are only interested in certain keywords or terms. + +## Parameters + +The `keep_words` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`keep_words` | Required if `keep_words_path` is not configured | List of strings | The list of words to keep. +`keep_words_path` | Required if `keep_words` is not configured | String | The path to the file containing the list of words to keep. +`keep_words_case` | Optional | Boolean | Whether to lowercase all words during comparison. Default is `false`. 
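The interaction between `keep_words` and `keep_words_case` described in the table above can be illustrated with a small Python sketch. This is only an approximation of the behavior the table describes, not the Lucene implementation: the function name `keep_words_filter` is hypothetical, and the input is assumed to be a list of tokens already produced by a tokenizer.

```python
def keep_words_filter(tokens, keep_words, keep_words_case=False):
    """Return (position, token) pairs for tokens that appear in keep_words.

    Hypothetical helper for illustration only. When keep_words_case is True,
    both the word list and each token are lowercased before comparison, so
    matching is case insensitive, but the emitted token keeps its original case.
    """
    if keep_words_case:
        allowed = {word.lower() for word in keep_words}
        return [(pos, tok) for pos, tok in enumerate(tokens) if tok.lower() in allowed]
    allowed = set(keep_words)
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok in allowed]


# Tokens as a standard-style tokenizer might split
# "Hello, world! This is an OpenSearch example."
tokens = ["Hello", "world", "This", "is", "an", "OpenSearch", "example"]
kept = keep_words_filter(tokens, ["example", "world", "opensearch"], keep_words_case=True)
print(kept)
```

With `keep_words_case` left at its default of `false`, the same call would drop `OpenSearch` because the word list contains only the lowercase form.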
+ + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `keep_words` filter: + +```json +PUT my_index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_keep_word": { + "tokenizer": "standard", + "filter": [ "keep_words_filter" ] + } + }, + "filter": { + "keep_words_filter": { + "type": "keep", + "keep_words": ["example", "world", "opensearch"], + "keep_words_case": true + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my_index/_analyze +{ + "analyzer": "custom_keep_word", + "text": "Hello, world! This is an OpenSearch example." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "world", + "start_offset": 7, + "end_offset": 12, + "type": "<ALPHANUM>", + "position": 1 + }, + { + "token": "OpenSearch", + "start_offset": 25, + "end_offset": 35, + "type": "<ALPHANUM>", + "position": 5 + }, + { + "token": "example", + "start_offset": 36, + "end_offset": 43, + "type": "<ALPHANUM>", + "position": 6 + } + ] +} +``` From 883b673fd47416037ea71212e50dc81df8d32ec4 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 14 Nov 2024 21:09:19 +0000 Subject: [PATCH 13/14] Add hyphenation_decompounder token filter docs #8062 (#8120) * adding hyphenation_decompounder token filter docs #8062 Signed-off-by: Anton Rubin * Update hyphenation-decompounder.md Signed-off-by: AntonEliatra * Update hyphenation-decompounder.md Signed-off-by: AntonEliatra * updating parameter table Signed-off-by: Anton Rubin * Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/token-filters/hyphenation-decompounder.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/token-filters/hyphenation-decompounder.md Signed-off-by: kolchfa-aws 
<105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: AntonEliatra Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../token-filters/hyphenation-decompounder.md | 102 ++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 103 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/hyphenation-decompounder.md diff --git a/_analyzers/token-filters/hyphenation-decompounder.md b/_analyzers/token-filters/hyphenation-decompounder.md new file mode 100644 index 0000000000..6e53d4dfd5 --- /dev/null +++ b/_analyzers/token-filters/hyphenation-decompounder.md @@ -0,0 +1,102 @@ +--- +layout: default +title: Hyphenation decompounder +parent: Token filters +nav_order: 170 +--- + +# Hyphenation decompounder token filter + +The `hyphenation_decompounder` token filter is used to break down compound words into their constituent parts. This filter is particularly useful for languages like German, Dutch, and Swedish, in which compound words are common. The filter uses hyphenation patterns (typically defined in .xml files) to identify the possible locations within a compound word where it can be split into components. These components are then checked against a provided dictionary. If there is a match, those components are treated as valid tokens. For more information about hyphenation pattern files, see [FOP XML Hyphenation Patterns](https://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns). + +## Parameters + +The `hyphenation_decompounder` token filter can be configured with the following parameters. 
+ +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`hyphenation_patterns_path` | Required | String | The path (relative to the `config` directory or absolute) to the hyphenation patterns file, which contains the language-specific rules for word splitting. The file is typically in XML format. Sample files can be downloaded from the [OFFO SourceForge project](https://sourceforge.net/projects/offo/). +`word_list` | Required if `word_list_path` is not set | Array of strings | A list of words used to validate the components generated by the hyphenation patterns. +`word_list_path` | Required if `word_list` is not set | String | The path (relative to the `config` directory or absolute) to a list of subwords. +`max_subword_size` | Optional | Integer | The maximum subword length. If the generated subword exceeds this length, it will not be added to the generated tokens. Default is `15`. +`min_subword_size` | Optional | Integer | The minimum subword length. If the generated subword is shorter than the specified length, it will not be added to the generated tokens. Default is `2`. +`min_word_size` | Optional | Integer | The minimum word character length. Word tokens shorter than this length are excluded from decomposition into subwords. Default is `5`. +`only_longest_match` | Optional | Boolean | Only includes the longest subword in the generated tokens. Default is `false`. 
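To make the size parameters in the table above more concrete, the following Python sketch shows one way they could constrain candidate subwords. It is a deliberately simplified illustration and not the Lucene algorithm: the function `decompound` is hypothetical, the hyphenation points are passed in by hand rather than derived from a patterns file, the whole word is assumed not to be re-emitted as its own subword, and `only_longest_match` is reduced to keeping a single longest match.

```python
def decompound(word, hyphenation_points, word_list,
               min_word_size=5, min_subword_size=2, max_subword_size=15,
               only_longest_match=False):
    """Return the original word followed by any accepted subwords.

    Hypothetical helper for illustration only. hyphenation_points is a list
    of character offsets at which the word may be split (an assumed input).
    """
    tokens = [word]  # the original token is always kept
    if len(word) < min_word_size:
        return tokens  # words that are too short are not decomposed

    dictionary = {entry.lower() for entry in word_list}
    # Candidate boundaries: start of word, each hyphenation point, end of word.
    bounds = sorted({0, len(word), *hyphenation_points})

    matches = []
    for i, start in enumerate(bounds):
        for end in bounds[i + 1:]:
            part = word[start:end]
            if part == word:
                continue  # assumption: skip the undivided word itself
            if not (min_subword_size <= len(part) <= max_subword_size):
                continue  # outside the allowed subword length range
            if part.lower() in dictionary:
                matches.append(part)

    if only_longest_match and matches:
        matches = [max(matches, key=len)]
    return tokens + matches


result = decompound("notebook", [4], ["notebook", "note", "book"], min_subword_size=3)
print(result)
```

In this sketch, a word shorter than `min_word_size` is returned unchanged, and a candidate such as `note` is accepted only because it is both within the subword length bounds and present in the word list.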
+ +## Example + +The following example request creates a new index named `test_index` and configures an analyzer with a `hyphenation_decompounder` filter: + +```json +PUT /test_index +{ + "settings": { + "analysis": { + "filter": { + "my_hyphenation_decompounder": { + "type": "hyphenation_decompounder", + "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml", + "word_list": ["notebook", "note", "book"], + "min_subword_size": 3, + "min_word_size": 5, + "only_longest_match": false + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_hyphenation_decompounder" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /test_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "notebook" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "notebook", + "start_offset": 0, + "end_offset": 8, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "note", + "start_offset": 0, + "end_offset": 8, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "book", + "start_offset": 0, + "end_offset": 8, + "type": "<ALPHANUM>", + "position": 0 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index f6b020b51f..8e72a19dbb 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -31,7 +31,7 @@ Token filter | Underlying Lucene token filter| Description [`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. 
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. [`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries. -`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. +[`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. 
 [`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
 `keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.

From 37bbf249c5e6be32ce986a4c91ad11c18ec00279 Mon Sep 17 00:00:00 2001
From: AntonEliatra
Date: Thu, 14 Nov 2024 21:09:38 +0000
Subject: [PATCH 14/14] Add keyword marker token filter docs #8065 (#8134)

* Add keyword marker token filter docs #8065

Signed-off-by: Anton Rubin

* Update keyword-marker.md

Signed-off-by: AntonEliatra

* updating parameter table

Signed-off-by: Anton Rubin

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: AntonEliatra

* Change example

Signed-off-by: Fanit Kolchina

* Add article to elision token filter

Signed-off-by: Fanit Kolchina

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin
Signed-off-by: AntonEliatra
Signed-off-by: Fanit Kolchina
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina
Co-authored-by: Nathan Bower
---
 _analyzers/token-filters/elision.md        |   2 +-
 _analyzers/token-filters/index.md          |   2 +-
 _analyzers/token-filters/keyword-marker.md | 127 +++++++++++++++++++++
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100644 _analyzers/token-filters/keyword-marker.md

diff --git a/_analyzers/token-filters/elision.md b/_analyzers/token-filters/elision.md
index b5dd5134b6..abc6dba658 100644
--- a/_analyzers/token-filters/elision.md
+++ b/_analyzers/token-filters/elision.md
@@ -24,7 +24,7 @@ Parameter | Required/Optional | Data type | Description
 
 ## Example
 
-The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with the `french_elision` filter:
+The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with a `french_elision` filter:
 
 ```json
 PUT /french_texts
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index 8e72a19dbb..003b275782 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -34,7 +34,7 @@ Token filter | Underlying Lucene token filter| Description
 [`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
 [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
 [`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
-`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
+[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
 `keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
 `kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
 `kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md
new file mode 100644
index 0000000000..0ec2cb96f5
--- /dev/null
+++ b/_analyzers/token-filters/keyword-marker.md
@@ -0,0 +1,127 @@
+---
+layout: default
+title: Keyword marker
+parent: Token filters
+nav_order: 200
+---
+
+# Keyword marker token filter
+
+The `keyword_marker` token filter is used to prevent certain tokens from being altered by stemmers or other filters. The `keyword_marker` token filter does this by marking the specified tokens as `keywords`, which prevents any stemming or other processing. This ensures that specific words remain in their original form.
+
+## Parameters
+
+The `keyword_marker` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`ignore_case` | Optional | Boolean | Whether to ignore the letter case when matching keywords. Default is `false`.
+`keywords` | Required if either `keywords_path` or `keywords_pattern` is not set | List of strings | The list of tokens to mark as keywords.
+`keywords_path` | Required if either `keywords` or `keywords_pattern` is not set | String | The path (relative to the `config` directory or absolute) to the list of keywords.
+`keywords_pattern` | Required if either `keywords` or `keywords_path` is not set | String | A [regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens to be marked as keywords.
+
+
+## Example
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_marker` filter. The filter marks the word `example` as a keyword:
+
+```json
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "custom_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "keyword_marker_filter", "stemmer"]
+        }
+      },
+      "filter": {
+        "keyword_marker_filter": {
+          "type": "keyword_marker",
+          "keywords": ["example"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my_index/_analyze
+{
+  "analyzer": "custom_analyzer",
+  "text": "Favorite example"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens. Note that while the word `favorite` was stemmed, the word `example` was not stemmed because it was marked as a keyword:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "favorit",
+      "start_offset": 0,
+      "end_offset": 8,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "example",
+      "start_offset": 9,
+      "end_offset": 16,
+      "type": "<ALPHANUM>",
+      "position": 1
+    }
+  ]
+}
+```
+
+You can further examine the impact of the `keyword_marker` token filter by adding the following parameters to the `_analyze` query:
+
+```json
+GET /my_index/_analyze
+{
+  "analyzer": "custom_analyzer",
+  "text": "This is an OpenSearch example demonstrating keyword marker.",
+  "explain": true,
+  "attributes": "keyword"
+}
+```
+{% include copy-curl.html %}
+
+This will produce additional details in the response similar to the following:
+
+```json
+{
+  "name": "porter_stem",
+  "tokens": [
+    ...
+    {
+      "token": "example",
+      "start_offset": 22,
+      "end_offset": 29,
+      "type": "<ALPHANUM>",
+      "position": 4,
+      "keyword": true
+    },
+    {
+      "token": "demonstr",
+      "start_offset": 30,
+      "end_offset": 43,
+      "type": "<ALPHANUM>",
+      "position": 5,
+      "keyword": false
+    },
+    ...
+  ]
+}
+```