Merge branch 'main' into adding-language-analyzers-docs
kolchfa-aws authored Nov 14, 2024
2 parents e29f690 + 37bbf24 commit 25e6771
Showing 51 changed files with 2,530 additions and 29 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/jekyll-spec-insert.yml
@@ -0,0 +1,20 @@
name: Lint and Test Jekyll Spec Insert
on:
  push:
    paths:
      - 'spec-insert/**'
  pull_request:
    paths:
      - 'spec-insert/**'
jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with: { ruby-version: 3.3.0 }
      - run: bundle install
      - working-directory: spec-insert
        run: |
          bundle exec rubocop
          bundle exec rspec
52 changes: 52 additions & 0 deletions .github/workflows/update-api-components.yml
@@ -0,0 +1,52 @@
name: Update API Components
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * 0" # Every Sunday at midnight GMT
jobs:
  update-api-components:
    if: ${{ github.repository == 'opensearch-project/documentation-website' }}
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
          fetch-depth: 0

      - run: git config --global pull.rebase true

      - uses: ruby/setup-ruby@v1
        with: { ruby-version: 3.3.0 }

      - run: bundle install

      - name: Download spec and insert into documentation
        run: bundle exec jekyll spec-insert

      - name: Get current date
        id: date
        run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_ENV

      - name: GitHub App token
        id: github_app_token
        uses: tibdex/[email protected]
        with:
          app_id: ${{ secrets.APP_ID }}
          private_key: ${{ secrets.APP_PRIVATE_KEY }}

      - name: Create pull request
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ steps.github_app_token.outputs.token }}
          commit-message: "Updated API components to reflect the latest OpenSearch API spec (${{ env.date }})"
          title: "[AUTOCUT] Update API components to reflect the latest OpenSearch API spec (${{ env.date }})"
          body: |
            Update API components to reflect the latest [OpenSearch API spec](https://github.com/opensearch-project/opensearch-api-specification/releases/download/main-latest/opensearch-openapi.yaml).
            Date: ${{ env.date }}
          branch: update-api-components-${{ env.date }}
          base: main
          signoff: true
          labels: autocut
135 changes: 135 additions & 0 deletions DEVELOPER_GUIDE.md
@@ -0,0 +1,135 @@
# Developer guide
- [Introduction](#introduction)
- [Starting the Jekyll server locally](#starting-the-jekyll-server-locally)
- [Using the spec-insert Jekyll plugin](#using-the-spec-insert-jekyll-plugin)
- [Inserting query parameters](#inserting-query-parameters)
- [Inserting path parameters](#inserting-path-parameters)
- [Inserting paths and HTTP methods](#inserting-paths-and-http-methods)
- [Ignoring files and folders](#ignoring-files-and-folders)
- [CI/CD](#cicd)

## Introduction

The `.md` documents in this repository are rendered into HTML pages using [Jekyll](https://jekyllrb.com/). These HTML pages are hosted on [opensearch.org](https://opensearch.org/docs/latest/).

## Starting the Jekyll server locally
You can run the Jekyll server locally to view the rendered HTML pages using the following steps:

1. Install [Ruby](https://www.ruby-lang.org/en/documentation/installation/) 3.1.0 or later for your operating system.
2. Install the required gems by running `bundle install`.
3. Run `bundle exec jekyll serve` to start the Jekyll server locally (this can take several minutes to complete).
4. Open your browser and navigate to `http://localhost:4000` to view the rendered HTML pages.

## Using the `spec-insert` Jekyll plugin
The `spec-insert` Jekyll plugin is used to insert API components into Markdown files. The plugin downloads the [latest OpenSearch specification](https://github.com/opensearch-project/opensearch-api-specification) and renders the API components from the spec. This aims to reduce the manual effort required to keep the documentation up to date.

To use this plugin, make sure that you have installed Ruby 3.1.0 or later and the required gems by running `bundle install`.

Edit your Markdown file and insert the following snippet where you want to render an API component:

```markdown
<!-- spec_insert_start
api: <API_NAME>
component: <COMPONENT_NAME>
other_param: <OTHER_PARAM>
-->

This is where the API component will be inserted.
Everything between the `spec_insert_start` and `spec_insert_end` tags will be overwritten.

<!-- spec_insert_end -->
```

Then run the following Jekyll command to render the API components:
```shell
bundle exec jekyll spec-insert
```
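
As a rough illustration of what the command has to locate in each file, the following Ruby sketch scans a Markdown string for `spec_insert_start`/`spec_insert_end` comment pairs and reads the `key: value` arguments between them. It is not the plugin's actual implementation: the `SPEC_INSERT_BLOCK` pattern and the `parse_arguments` and `find_spec_insert_blocks` helpers are hypothetical names introduced only for this example.

```ruby
# Hypothetical pattern (an assumption, not taken from the plugin source).
# Captures the argument lines of the start comment and everything up to
# the matching end comment.
SPEC_INSERT_BLOCK = /<!--\s*spec_insert_start\s*\n(?<args>.*?)-->(?<body>.*?)<!--\s*spec_insert_end\s*-->/m

# Turns "key: value" lines into a hash of strings.
def parse_arguments(raw)
  raw.lines.map(&:strip).reject(&:empty?).to_h do |line|
    key, value = line.split(':', 2)
    [key.strip, value.to_s.strip]
  end
end

# Returns one hash per spec-insert block found in the Markdown text.
def find_spec_insert_blocks(markdown)
  blocks = []
  position = 0
  while (match = SPEC_INSERT_BLOCK.match(markdown, position))
    blocks << { args: parse_arguments(match[:args]), body: match[:body] }
    position = match.end(0) # continue scanning after this block
  end
  blocks
end
```

In this sketch, the `body` of each block is the text that would be overwritten, which is why anything written between the two comments by hand is lost on the next run.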

If you are working on multiple Markdown files and do not want to keep running the `jekyll spec-insert` command, you can add the `--watch` (or `-W`) flag to the command to watch for changes in the Markdown files and automatically render the API components:

```shell
bundle exec jekyll spec-insert --watch
```

If your text editor does not automatically reload files that change on disk, you may need to reload the file manually to see the changes applied by the plugin.

The plugin will pull the newest OpenSearch API spec from its [repository](https://github.com/opensearch-project/opensearch-api-specification) if the spec file does not exist locally or if it is older than 24 hours. To tell the plugin to always pull the newest spec, you can add the `--refresh-spec` (or `-R`) flag to the command:

```shell
bundle exec jekyll spec-insert --refresh-spec
```
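
The refresh rule described above (download when the file is missing, older than 24 hours, or a refresh is forced) can be sketched in Ruby as follows. This is a minimal sketch under stated assumptions, not the plugin's code: `spec_stale?`, its `refresh:` keyword, and the `SPEC_MAX_AGE_SECONDS` constant are hypothetical names, and the file's modification time is assumed to be what determines its age.

```ruby
# Assumed maximum age of the local spec file: 24 hours, in seconds.
SPEC_MAX_AGE_SECONDS = 24 * 60 * 60

# Returns true when the spec should be downloaded again.
# `refresh` stands in for the --refresh-spec flag (hypothetical wiring).
def spec_stale?(path, refresh: false, now: Time.now)
  return true if refresh                # forced refresh
  return true unless File.exist?(path)  # no local copy yet

  now - File.mtime(path) > SPEC_MAX_AGE_SECONDS
end
```

Passing `now:` explicitly keeps the check easy to exercise in tests without waiting for a file to age.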

### Inserting query parameters

To insert the API query parameters table, use the following snippet:

```markdown
<!-- spec_insert_start
api: cat.indices
component: query_parameters
-->
<!-- spec_insert_end -->
```

This inserts the query parameters of the `cat.indices` API into the `.md` file as a table with three default columns: `Parameter`, `Type`, and `Description`. When the `Required` or `Default` column is not included, that information is written in the `Description` column instead.

You can customize the query parameters table with the following columns:

- `Parameter`
- `Type`
- `Description`
- `Required`
- `Default`
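
To make the column behavior concrete, the following Ruby sketch renders one table row and folds the `Required` and `Default` information into the `Description` cell whenever those columns are not selected. It is only an illustration: `render_row`, `description_cell`, the parameter hash keys, and the exact wording placed in the `Description` cell are assumptions, not the plugin's actual output format.

```ruby
# Default columns named in this guide.
DEFAULT_COLUMNS = %w[Parameter Type Description].freeze

# Builds the Description cell, adding Required/Default details only when
# the dedicated columns are absent (wording is hypothetical).
def description_cell(param, columns)
  parts = []
  parts << '**Required.**' if param[:required] && !columns.include?('Required')
  parts << param[:description]
  if !param[:default].nil? && !columns.include?('Default')
    parts << "Default is `#{param[:default]}`."
  end
  parts.join(' ')
end

# Renders a single Markdown table row for the selected columns.
def render_row(param, columns = DEFAULT_COLUMNS)
  cells = columns.map do |column|
    case column
    when 'Parameter' then "`#{param[:name]}`"
    when 'Type' then param[:type]
    when 'Description' then description_cell(param, columns)
    when 'Required' then param[:required] ? 'Required' : ''
    when 'Default' then param[:default].to_s
    end
  end
  cells.join(' | ')
end
```

With the default columns, a parameter that has a default value carries that value in its description; once `Default` is selected as a column, the description stays clean and the value moves to its own cell.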

You can also customize this component with the following settings:

- `include_global` (Boolean; default is `false`): Includes global query parameters in the table.
- `include_deprecated` (Boolean; default is `true`): Includes deprecated parameters in the table.
- `pretty` (Boolean; default is `false`): Renders the table in the pretty format instead of the compact format.

The following snippet applies all three settings to the query parameters table:

```markdown
<!-- spec_insert_start
api: cat.indices
component: query_parameters
include_global: true
include_deprecated: false
pretty: true
-->
<!-- spec_insert_end -->
```

### Inserting path parameters

To insert the `indices.create` API path parameters table, use the following snippet:

```markdown
<!-- spec_insert_start
api: indices.create
component: path_parameters
-->
<!-- spec_insert_end -->
```

This table behaves identically to the query parameters table except that it does not accept the `include_global` argument.

### Inserting paths and HTTP methods

To insert paths and HTTP methods for the `search` API, use the following snippet:

```markdown
<!-- spec_insert_start
api: search
component: paths_and_http_methods
-->
<!-- spec_insert_end -->
```

### Ignoring files and folders

The `spec-insert` plugin ignores all files and folders listed in the [./_config.yml#exclude](./_config.yml) list, which is also the list of files and folders that Jekyll ignores.

### CI/CD

The `spec-insert` plugin is run as part of the CI/CD pipeline to ensure that the API components are up to date in the documentation. This is performed through the [update-api-components.yml](.github/workflows/update-api-components.yml) GitHub Actions workflow, which creates a pull request containing the updated API components every Sunday.
43 changes: 29 additions & 14 deletions Gemfile
@@ -1,4 +1,9 @@
-source "http://rubygems.org"
+# frozen_string_literal: true
+
+source 'https://rubygems.org'
+
+# Manually add csv gem since Ruby 3.4.0 no longer includes it
+gem 'csv', '~> 3.0'

 # Hello! This is where you manage which Jekyll version is used to run.
 # When you want to use a different version, change it below, save the
@@ -8,12 +13,12 @@ source "http://rubygems.org"
 #
 # This will help ensure the proper Jekyll version is running.
 # Happy Jekylling!
-gem "jekyll", "~> 4.3.2"
+gem 'jekyll', '~> 4.3.2'

 # This is the default theme for new Jekyll sites. You may change this to anything you like.
-gem "just-the-docs", "~> 0.3.3"
-gem "jekyll-remote-theme", "~> 0.4"
-gem "jekyll-redirect-from", "~> 0.16"
+gem 'jekyll-redirect-from', '~> 0.16'
+gem 'jekyll-remote-theme', '~> 0.4'
+gem 'just-the-docs', '~> 0.3.3'

 # If you want to use GitHub Pages, remove the "gem "jekyll"" above and
 # uncomment the line below. To upgrade, run `bundle update github-pages`.
@@ -22,21 +27,31 @@ gem "jekyll-redirect-from", "~> 0.16"

 # If you have any plugins, put them here!
 group :jekyll_plugins do
-  gem "jekyll-last-modified-at"
-  gem "jekyll-sitemap"
+  gem 'jekyll-last-modified-at'
+  gem 'jekyll-sitemap'
+  gem 'jekyll-spec-insert', :path => './spec-insert'
 end

 # Windows does not include zoneinfo files, so bundle the tzinfo-data gem
-gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
+gem 'tzinfo-data', platforms: %i[mingw mswin x64_mingw jruby]

 # Performance-booster for watching directories on Windows
-gem "wdm", "~> 0.1.0" if Gem.win_platform?
+gem 'wdm', '~> 0.1.0' if Gem.win_platform?

 # Installs webrick dependency for building locally
-gem "webrick", "~> 1.7"
-
+gem 'webrick', '~> 1.7'

 # Link checker
-gem "typhoeus"
-gem "ruby-link-checker"
-gem "ruby-enum"
+gem 'ruby-enum'
+gem 'ruby-link-checker'
+gem 'typhoeus'
+
+# Spec Insert
+gem 'activesupport', '~> 7'
+gem 'mustache', '~> 1'
+
+group :development, :test do
+  gem 'rspec'
+  gem 'rubocop', '~> 1.44', require: false
+  gem 'rubocop-rake', require: false
+end
1 change: 1 addition & 0 deletions README.md
@@ -3,6 +3,7 @@
 # About the OpenSearch documentation repo

 The `documentation-website` repository contains the user documentation for OpenSearch. You can find the rendered documentation at [opensearch.org/docs](https://opensearch.org/docs).
+The markdown files in this repository are rendered into HTML pages using [Jekyll](https://jekyllrb.com/). Check the [DEVELOPER_GUIDE](DEVELOPER_GUIDE.md) for more information about how to use Jekyll for this repository.


 ## Contributing
101 changes: 101 additions & 0 deletions _analyzers/token-filters/dictionary-decompounder.md
@@ -0,0 +1,101 @@
---
layout: default
title: Dictionary decompounder
parent: Token filters
nav_order: 110
---

# Dictionary decompounder token filter

The `dictionary_decompounder` token filter is used to split compound words into their constituent parts based on a predefined dictionary. This filter is particularly useful for languages like German, Dutch, or Finnish, in which compound words are common, so breaking them down can improve search relevance. The `dictionary_decompounder` token filter determines whether each token (word) can be split into smaller tokens based on a list of known words. If the token can be split into known words, the filter generates the subtokens for the token.

## Parameters

The `dictionary_decompounder` token filter has the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words.
`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line.
`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`.
`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`.
`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`.
`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`.
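
To show how these parameters interact, the following Ruby sketch mimics dictionary-based decompounding on a single token. It is a simplified illustration rather than the filter's real implementation: the `decompound` method is a hypothetical name, and the sketch assumes that `only_longest_match` keeps the longest dictionary match found at each start position.

```ruby
# Simplified sketch of dictionary decompounding for one token.
# Returns the original token followed by any subwords found in word_list.
def decompound(token, word_list, min_word_size: 5, min_subword_size: 2,
               max_subword_size: 15, only_longest_match: false)
  dictionary = word_list.map(&:downcase)
  tokens = [token] # the original token is always kept
  return tokens if token.length < min_word_size

  (0..token.length - min_subword_size).each do |start|
    longest = nil
    (min_subword_size..max_subword_size).each do |length|
      break if start + length > token.length

      candidate = token[start, length]
      next unless dictionary.include?(candidate.downcase)

      if only_longest_match
        # Assumption: keep only the longest match starting at this position.
        longest = candidate if longest.nil? || candidate.length > longest.length
      else
        tokens << candidate
      end
    end
    tokens << longest if longest
  end
  tokens
end
```

For example, `decompound('slowgreenturtleswim', %w[slow green turtle])` returns `["slowgreenturtleswim", "slow", "green", "turtle"]`: `swim` is not in the word list, so it produces no subtoken.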

## Example

The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter:

```json
PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```