From b41858a146abd7fa8f248f9457d3e8dee07ffba6 Mon Sep 17 00:00:00 2001
From: AntonEliatra
Date: Wed, 11 Sep 2024 18:08:11 +0100
Subject: [PATCH] Add Ascii folding token filter (#7912)

* adding asciifolding token filter page #7873

Signed-off-by: AntonEliatra

* updating the naming

Signed-off-by: AntonEliatra

* updating as per PR comments

Signed-off-by: AntonEliatra

* updating the heading

Signed-off-by: AntonEliatra

* Updating details as per comments

Signed-off-by: AntonEliatra

* Updating details as per comments

Signed-off-by: AntonEliatra

* Updating details as per comments

Signed-off-by: AntonEliatra

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: AntonEliatra

* updating as per comments

Signed-off-by: Anton Rubin

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: AntonEliatra

* Update asciifolding.md

Signed-off-by: AntonEliatra

---------

Signed-off-by: AntonEliatra
Signed-off-by: Anton Rubin
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower
---
 _analyzers/token-filters/asciifolding.md | 135 +++++++++++++++++++++++
 _analyzers/token-filters/index.md        |   2 +-
 2 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 _analyzers/token-filters/asciifolding.md

diff --git a/_analyzers/token-filters/asciifolding.md b/_analyzers/token-filters/asciifolding.md
new file mode 100644
index 0000000000..d572251988
--- /dev/null
+++ b/_analyzers/token-filters/asciifolding.md
@@ -0,0 +1,135 @@
+---
+layout: default
+title: ASCII folding
+parent: Token filters
+nav_order: 20
+---
+
+# ASCII folding token filter
+
+The `asciifolding` token filter converts non-ASCII characters to their closest ASCII equivalents.
For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is known as *transliteration*.
+
+
+The `asciifolding` token filter offers a number of benefits:
+
+ - **Enhanced search flexibility**: Users often omit accents or special characters when entering queries. The `asciifolding` token filter ensures that such queries still return relevant results.
+ - **Normalization**: Standardizes the indexing process by ensuring that accented characters are consistently converted to their ASCII equivalents.
+ - **Internationalization**: Particularly useful for applications including multiple languages and character sets.
+
+While the `asciifolding` token filter can simplify searches, it may also lead to the loss of specific information, particularly if the distinction between accented and non-accented characters in the dataset is significant.
+{: .warning}
+
+## Parameters
+
+You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This can be particularly useful when you want to match both the original (with accents) and the normalized (without accents) versions of a term in a search query. Default is `false`.
+
+## Example
+
+The following example request creates a new index named `example_index` and defines an analyzer with the `asciifolding` filter and `preserve_original` parameter set to `true`:
+
+```json
+PUT /example_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "custom_ascii_folding": {
+          "type": "asciifolding",
+          "preserve_original": true
+        }
+      },
+      "analyzer": {
+        "custom_ascii_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "custom_ascii_folding"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /example_index/_analyze
+{
+  "analyzer": "custom_ascii_analyzer",
+  "text": "Résumé café naïve coördinate"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "resume",
+      "start_offset": 0,
+      "end_offset": 6,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "résumé",
+      "start_offset": 0,
+      "end_offset": 6,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "cafe",
+      "start_offset": 7,
+      "end_offset": 11,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "café",
+      "start_offset": 7,
+      "end_offset": 11,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "naive",
+      "start_offset": 12,
+      "end_offset": 17,
+      "type": "<ALPHANUM>",
+      "position": 2
+    },
+    {
+      "token": "naïve",
+      "start_offset": 12,
+      "end_offset": 17,
+      "type": "<ALPHANUM>",
+      "position": 2
+    },
+    {
+      "token": "coordinate",
+      "start_offset": 18,
+      "end_offset": 28,
+      "type": "<ALPHANUM>",
+      "position": 3
+    },
+    {
+      "token": "coördinate",
+      "start_offset": 18,
+      "end_offset": 28,
+      "type": "<ALPHANUM>",
+      "position": 3
+    }
+  ]
+}
+```
+
+
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index f4e9c434e7..a9b621d5ab 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -14,7 +14,7 @@ The following table lists all token filters
that OpenSearch supports.
 Token filter | Underlying Lucene token filter| Description
 [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
-`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
+[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
 `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
 `cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
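
Reviewer note: the `preserve_original` behavior described in the new `asciifolding` page can be sketched in a few lines of Python. This is only an approximation under stated assumptions, not how Lucene's `ASCIIFoldingFilter` is implemented: it folds by Unicode decomposition (NFKD) and drops combining marks, which covers the accented letters in the page's example but not characters without a decomposition (for example, *ß* or *ø*), which the real filter handles through its own mapping. The names `ascii_fold` and `fold_tokens` are hypothetical helpers introduced for illustration.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks.

    Assumption: NFKD decomposition stands in for the filter's real
    character mapping; it only handles characters that decompose.
    """
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_tokens(tokens, preserve_original=False):
    """Return (token, position) pairs, mimicking the token stream.

    With preserve_original=True, a token changed by folding is emitted
    twice at the same position: folded form first, then the original.
    """
    stream = []
    for position, token in enumerate(tokens):
        folded = ascii_fold(token)
        stream.append((folded, position))
        if preserve_original and folded != token:
            stream.append((token, position))
    return stream


if __name__ == "__main__":
    # Lowercase and whitespace-split as a stand-in for the analyzer chain.
    words = "Résumé café naïve coördinate".lower().split()
    for token, position in fold_tokens(words, preserve_original=True):
        print(position, token)
```

Tokens that folding leaves unchanged appear once, which matches the page's description that `preserve_original` keeps both forms only when they differ.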