forked from opensearch-project/documentation-website
Merge branch 'main' into adding-language-analyzers-docs
Showing 51 changed files with 2,530 additions and 29 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
name: Lint and Test Jekyll Spec Insert | ||
on: | ||
push: | ||
paths: | ||
- 'spec-insert/**' | ||
pull_request: | ||
paths: | ||
- 'spec-insert/**' | ||
jobs: | ||
lint-and-test: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v4 | ||
- uses: ruby/setup-ruby@v1 | ||
with: { ruby-version: 3.3.0 } | ||
- run: bundle install | ||
- working-directory: spec-insert | ||
run: | | ||
bundle exec rubocop | ||
bundle exec rspec |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
name: Update API Components | ||
on: | ||
workflow_dispatch: | ||
schedule: | ||
- cron: "0 0 * * 0" # Every Sunday at midnight GMT | ||
jobs: | ||
update-api-components: | ||
if: ${{ github.repository == 'opensearch-project/documentation-website' }} | ||
runs-on: ubuntu-latest | ||
permissions: | ||
contents: write | ||
pull-requests: write | ||
steps: | ||
- uses: actions/checkout@v4 | ||
with: | ||
submodules: recursive | ||
fetch-depth: 0 | ||
|
||
- run: git config --global pull.rebase true | ||
|
||
- uses: ruby/setup-ruby@v1 | ||
with: { ruby-version: 3.3.0 } | ||
|
||
- run: bundle install | ||
|
||
- name: Download spec and insert into documentation | ||
run: bundle exec jekyll spec-insert | ||
|
||
- name: Get current date | ||
id: date | ||
run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_ENV | ||
|
||
- name: GitHub App token | ||
id: github_app_token | ||
uses: tibdex/[email protected] | ||
with: | ||
app_id: ${{ secrets.APP_ID }} | ||
private_key: ${{ secrets.APP_PRIVATE_KEY }} | ||
|
||
- name: Create pull request | ||
uses: peter-evans/create-pull-request@v6 | ||
with: | ||
token: ${{ steps.github_app_token.outputs.token }} | ||
commit-message: "Updated API components to reflect the latest OpenSearch API spec (${{ env.date }})" | ||
title: "[AUTOCUT] Update API components to reflect the latest OpenSearch API spec (${{ env.date }})" | ||
body: | | ||
Update API components to reflect the latest [OpenSearch API spec](https://github.com/opensearch-project/opensearch-api-specification/releases/download/main-latest/opensearch-openapi.yaml). | ||
Date: ${{ env.date }} | ||
branch: update-api-components-${{ env.date }} | ||
base: main | ||
signoff: true | ||
labels: autocut |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
# Developer guide | ||
- [Introduction](#introduction) | ||
- [Starting the Jekyll server locally](#starting-the-jekyll-server-locally) | ||
- [Using the spec-insert Jekyll plugin](#using-the-spec-insert-jekyll-plugin) | ||
- [Inserting query parameters](#inserting-query-parameters) | ||
- [Inserting path parameters](#inserting-path-parameters) | ||
- [Inserting paths and HTTP methods](#inserting-paths-and-http-methods) | ||
- [Ignoring files and folders](#ignoring-files-and-folders) | ||
- [CI/CD](#cicd) | ||
|
||
## Introduction | ||
|
||
The `.md` documents in this repository are rendered into HTML pages using [Jekyll](https://jekyllrb.com/). These HTML pages are hosted on [opensearch.org](https://opensearch.org/docs/latest/). | ||
|
||
## Starting the Jekyll server locally | ||
You can run the Jekyll server locally to view the rendered HTML pages using the following steps: | ||
|
||
1. Install [Ruby](https://www.ruby-lang.org/en/documentation/installation/) 3.1.0 or later for your operating system. | ||
2. Install the required gems by running `bundle install`. | ||
3. Run `bundle exec jekyll serve` to start the Jekyll server locally (this can take several minutes to complete). | ||
4. Open your browser and navigate to `http://localhost:4000` to view the rendered HTML pages. | ||
|
||
## Using the `spec-insert` Jekyll plugin | ||
The `spec-insert` Jekyll plugin is used to insert API components into Markdown files. The plugin downloads the [latest OpenSearch specification](https://github.com/opensearch-project/opensearch-api-specification) and renders the API components from the spec. This aims to reduce the manual effort required to keep the documentation up to date. | ||
|
||
To use this plugin, make sure that you have installed Ruby 3.1.0 or later and the required gems by running `bundle install`. | ||
|
||
Edit your Markdown file and insert the following snippet where you want to render an API component: | ||
|
||
```markdown | ||
<!-- spec_insert_start | ||
api: <API_NAME> | ||
component: <COMPONENT_NAME> | ||
other_param: <OTHER_PARAM> | ||
--> | ||
|
||
This is where the API component will be inserted. | ||
Everything between the `spec_insert_start` and `spec_insert_end` tags will be overwritten. | ||
|
||
<!-- spec_insert_end --> | ||
``` | ||
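The marker convention above can be illustrated with a minimal Python sketch. This is not the plugin's implementation (the plugin is a Ruby Jekyll plugin whose source is not shown here); the names `parse_args`, `replace_components`, and the `render` callback are hypothetical and exist only for illustration. The sketch finds each `spec_insert_start`/`spec_insert_end` pair, reads the `key: value` arguments, and overwrites everything between the two markers.

```python
import re

# Matches one block: the start comment with its arguments, the old body,
# and the end comment. DOTALL lets "." span multiple lines.
BLOCK = re.compile(
    r"<!--\s*spec_insert_start\s*\n(?P<args>.*?)-->"
    r"(?P<body>.*?)"
    r"<!--\s*spec_insert_end\s*-->",
    re.DOTALL,
)


def parse_args(raw):
    """Turn 'key: value' lines from the start comment into a dict."""
    args = {}
    for line in raw.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            args[key.strip()] = value.strip()
    return args


def replace_components(markdown, render):
    """Overwrite the text between each marker pair with render(args).

    `render` is a hypothetical callback standing in for whatever
    produces the API component from the spec.
    """
    def substitute(match):
        args = parse_args(match.group("args"))
        return (
            "<!-- spec_insert_start\n"
            + match.group("args")
            + "-->\n"
            + render(args)
            + "\n<!-- spec_insert_end -->"
        )

    return BLOCK.sub(substitute, markdown)
```

Because the start comment is written back unchanged, running the replacement repeatedly keeps the arguments in place and only refreshes the generated body.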
|
||
Then run the following Jekyll command to render the API components: | ||
```shell | ||
bundle exec jekyll spec-insert | ||
``` | ||
|
||
If you are working on multiple Markdown files and do not want to keep running the `jekyll spec-insert` command, you can add the `--watch` (or `-W`) flag to the command to watch for changes in the Markdown files and automatically render the API components: | ||
|
||
```shell | ||
bundle exec jekyll spec-insert --watch | ||
``` | ||
|
||
Depending on the text editor you are using, you may need to manually reload the file from disk to see the changes applied by the plugin if the editor does not automatically reload the file periodically. | ||
|
||
The plugin will pull the newest OpenSearch API spec from its [repository](https://github.com/opensearch-project/opensearch-api-specification) if the spec file does not exist locally or if it is older than 24 hours. To tell the plugin to always pull the newest spec, you can add the `--refresh-spec` (or `-R`) flag to the command: | ||
|
||
```shell | ||
bundle exec jekyll spec-insert --refresh-spec | ||
``` | ||
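The refresh rule described above (download when the spec file is missing, older than 24 hours, or when `--refresh-spec` is passed) can be sketched in Python as follows. The function name `needs_download` and its parameters are assumptions made for this illustration and do not come from the plugin itself.

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # 24 hours, as stated in the guide


def needs_download(spec_path, refresh_spec=False, now=None):
    """Return True when a fresh copy of the spec should be fetched.

    `refresh_spec` mirrors the --refresh-spec flag; `now` can be
    injected for testing and defaults to the current time.
    """
    if refresh_spec:
        return True  # the flag forces a download
    if not os.path.exists(spec_path):
        return True  # no local copy yet
    now = time.time() if now is None else now
    return now - os.path.getmtime(spec_path) > MAX_AGE_SECONDS
```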
|
||
### Inserting query parameters | ||
|
||
To insert the API query parameters table, use the following snippet: | ||
|
||
```markdown | ||
<!-- spec_insert_start | ||
api: cat.indices | ||
component: query_parameters | ||
--> | ||
<!-- spec_insert_end --> | ||
``` | ||
|
||
This inserts the query parameters of the `cat.indices` API into the `.md` file with three default columns: `Parameter`, `Type`, and `Description`. Five columns are available: `Parameter`, `Type`, `Description`, `Required`, and `Default`. When the `Required` or `Default` column is not selected, that information is written in the `Description` column instead. | ||
|
||
You can customize the query parameters table with the following columns: | ||
|
||
- `Parameter` | ||
- `Type` | ||
- `Description` | ||
- `Required` | ||
- `Default` | ||
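The fallback of `Required` and `Default` information into the `Description` column can be sketched in Python. The function `build_row`, the dictionary keys, and the exact wording and markup of the folded text are assumptions for illustration only; the plugin's real output format may differ.

```python
DEFAULT_COLUMNS = ["Parameter", "Type", "Description"]


def build_row(param, columns=DEFAULT_COLUMNS):
    """Render one Markdown table row for a parameter (hypothetical format)."""
    description = param.get("description", "")
    # When a column is not selected, fold its information into Description.
    if "Required" not in columns and param.get("required"):
        description = "**(Required)** " + description
    if "Default" not in columns and param.get("default") is not None:
        description += f" _(Default: `{param['default']}`)_"
    cells = {
        "Parameter": f"`{param['name']}`",
        "Type": param.get("type", ""),
        "Description": description.strip(),
        "Required": "Required" if param.get("required") else "",
        "Default": "" if param.get("default") is None else f"`{param['default']}`",
    }
    return "| " + " | ".join(cells[c] for c in columns) + " |"


# Illustrative data only; this parameter is made up, not taken from the spec.
example = {"name": "example_param", "type": "String",
           "description": "An example.", "default": "green"}
print(build_row(example))
print(build_row(example, ["Parameter", "Default", "Description"]))
```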
|
||
You can also customize this component with the following settings: | ||
|
||
- `include_global` (Boolean; default is `false`): Includes global query parameters in the table. | ||
- `include_deprecated` (Boolean; default is `true`): Includes deprecated parameters in the table. | ||
- `pretty` (Boolean; default is `false`): Renders the table in the pretty format instead of the compact format. | ||
|
||
The following snippet applies these settings to the query parameters table: | ||
|
||
```markdown | ||
<!-- spec_insert_start | ||
api: cat.indices | ||
component: query_parameters | ||
include_global: true | ||
include_deprecated: false | ||
pretty: true | ||
--> | ||
<!-- spec_insert_end --> | ||
``` | ||
|
||
### Inserting path parameters | ||
|
||
To insert the `indices.create` API path parameters table, use the following snippet: | ||
|
||
```markdown | ||
<!-- spec_insert_start | ||
api: indices.create | ||
component: path_parameters | ||
--> | ||
<!-- spec_insert_end --> | ||
``` | ||
|
||
This table behaves identically to the query parameters table except that it does not accept the `include_global` argument. | ||
|
||
### Inserting paths and HTTP methods | ||
|
||
To insert paths and HTTP methods for the `search` API, use the following snippet: | ||
|
||
```markdown | ||
<!-- spec_insert_start | ||
api: search | ||
component: paths_and_http_methods | ||
--> | ||
<!-- spec_insert_end --> | ||
``` | ||
|
||
### Ignoring files and folders | ||
|
||
The `spec-insert` plugin ignores all files and folders listed in the [./_config.yml#exclude](./_config.yml) list, which is also the list of files and folders that Jekyll ignores. | ||
|
||
### CI/CD | ||
|
||
The `spec-insert` plugin is run as part of the CI/CD pipeline to ensure that the API components are up to date in the documentation. This is performed through the [update-api-components.yml](.github/workflows/update-api-components.yml) GitHub Actions workflow, which creates a pull request containing the updated API components every Sunday. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
--- | ||
layout: default | ||
title: Dictionary decompounder | ||
parent: Token filters | ||
nav_order: 110 | ||
--- | ||
|
||
# Dictionary decompounder token filter | ||
|
||
The `dictionary_decompounder` token filter is used to split compound words into their constituent parts based on a predefined dictionary. This filter is particularly useful for languages like German, Dutch, or Finnish, in which compound words are common, so breaking them down can improve search relevance. The `dictionary_decompounder` token filter determines whether each token (word) can be split into smaller tokens based on a list of known words. If the token can be split into known words, the filter generates the subtokens for the token. | ||
|
||
## Parameters | ||
|
||
The `dictionary_decompounder` token filter has the following parameters. | ||
|
||
Parameter | Required/Optional | Data type | Description | ||
:--- | :--- | :--- | :--- | ||
`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words. | ||
`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line. | ||
`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`. | ||
`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`. | ||
`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`. | ||
`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`. | ||
|
||
## Example | ||
|
||
The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter: | ||
|
||
```json | ||
PUT /decompound_example | ||
{ | ||
"settings": { | ||
"analysis": { | ||
"filter": { | ||
"my_dictionary_decompounder": { | ||
"type": "dictionary_decompounder", | ||
"word_list": ["slow", "green", "turtle"] | ||
} | ||
}, | ||
"analyzer": { | ||
"my_analyzer": { | ||
"type": "custom", | ||
"tokenizer": "standard", | ||
"filter": ["lowercase", "my_dictionary_decompounder"] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
## Generated tokens | ||
|
||
Use the following request to examine the tokens generated using the analyzer: | ||
|
||
```json | ||
POST /decompound_example/_analyze | ||
{ | ||
"analyzer": "my_analyzer", | ||
"text": "slowgreenturtleswim" | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
The response contains the generated tokens: | ||
|
||
```json | ||
{ | ||
"tokens": [ | ||
{ | ||
"token": "slowgreenturtleswim", | ||
"start_offset": 0, | ||
"end_offset": 19, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
}, | ||
{ | ||
"token": "slow", | ||
"start_offset": 0, | ||
"end_offset": 19, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
}, | ||
{ | ||
"token": "green", | ||
"start_offset": 0, | ||
"end_offset": 19, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
}, | ||
{ | ||
"token": "turtle", | ||
"start_offset": 0, | ||
"end_offset": 19, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
} | ||
] | ||
} | ||
``` |
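The splitting behavior shown in this response can be approximated with a short Python sketch. It is a simplified illustration, not the filter's implementation: the function name `decompound` is hypothetical, offsets and positions are not modeled, and `only_longest_match` is treated as keeping the longest dictionary match per start position, which is an assumption about the underlying behavior.

```python
def decompound(token, word_list, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return the original token followed by any dictionary subwords."""
    dictionary = {word.lower() for word in word_list}
    tokens = [token]  # the original token is always kept
    if len(token) < min_word_size:
        return tokens  # too short to be considered for splitting
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            if lowered[start:start + size] in dictionary:
                candidate = token[start:start + size]
                if only_longest_match:
                    # Sizes increase, so a later match is always longer.
                    longest = candidate
                else:
                    tokens.append(candidate)
        if longest is not None:
            tokens.append(longest)
    return tokens


print(decompound("slowgreenturtleswim", ["slow", "green", "turtle"]))
# ['slowgreenturtleswim', 'slow', 'green', 'turtle']
```

As in the response above, `swim` does not appear in the output because it is not in the word list.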