Add Ascii folding token filter (#7912)
* adding asciifolding token filter page #7873

Signed-off-by: AntonEliatra <[email protected]>

* updating the naming

Signed-off-by: AntonEliatra <[email protected]>

* updating as per PR comments

Signed-off-by: AntonEliatra <[email protected]>

* updating the heading

Signed-off-by: AntonEliatra <[email protected]>

* Updating details as per comments

Signed-off-by: AntonEliatra <[email protected]>

* Updating details as per comments

Signed-off-by: AntonEliatra <[email protected]>

* Updating details as per comments

Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

* updating as per comments

Signed-off-by: Anton Rubin <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

* Update asciifolding.md

Signed-off-by: AntonEliatra <[email protected]>

---------

Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
3 people authored Sep 11, 2024
1 parent 9b609c6 commit b41858a
Showing 2 changed files with 136 additions and 1 deletion.
135 changes: 135 additions & 0 deletions _analyzers/token-filters/asciifolding.md
@@ -0,0 +1,135 @@
---
layout: default
title: ASCII folding
parent: Token filters
nav_order: 20
---

# ASCII folding token filter

The `asciifolding` token filter converts non-ASCII characters to their closest ASCII equivalents. For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is known as *transliteration*.


The `asciifolding` token filter offers the following benefits:

- **Enhanced search flexibility**: Users often omit accents or special characters when entering queries. The `asciifolding` token filter ensures that such queries still return relevant results.
- **Normalization**: Accented characters are consistently converted to their ASCII equivalents, so accented and unaccented spellings of a term are indexed the same way.
- **Internationalization**: Particularly useful for applications that handle content in multiple languages and character sets.

While the `asciifolding` token filter can simplify searches, it also discards information: terms that differ only by accents become indistinguishable. This matters if the distinction between accented and non-accented characters is significant in your dataset.
{: .warning}
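
The folding behavior described above can be approximated outside of OpenSearch. The following Python sketch is not the Lucene `ASCIIFoldingFilter` implementation. It assumes that Unicode NFKD decomposition followed by removal of combining marks is a close enough stand-in for accented Latin letters. The `fold_to_ascii` helper is a hypothetical name introduced only for illustration. The last line shows the information loss mentioned in the warning: *résumé* and *resume* become identical after folding.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding (assumption: NFKD + dropping combining marks)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["Résumé", "café", "naïve", "coördinate"]
folded = [fold_to_ascii(word).lower() for word in words]
print(folded)  # ['resume', 'cafe', 'naive', 'coordinate']

# Information loss: two distinct source words collapse to the same form.
print(fold_to_ascii("résumé") == fold_to_ascii("resume"))  # True
```

Characters that have no Unicode decomposition, such as *ø*, pass through this sketch unchanged, whereas the Lucene filter relies on its own character mapping table. Treat the sketch as an illustration of the idea rather than a substitute for the filter.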

## Parameters

You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This can be particularly useful when you want to match both the original (with accents) and the normalized (without accents) versions of a term in a search query. Default is `false`.
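
The effect of `preserve_original` on the token stream can be sketched in Python. This is a simplified model under stated assumptions, not the filter's actual code: `fold_to_ascii` and `fold_tokens` are hypothetical helpers, folding is approximated with Unicode decomposition, and each token is represented as a `(position, token)` pair. When `preserve_original` is enabled, the folded form is emitted first and the original follows at the same position, but only if folding changed the token.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding (assumption: NFKD + dropping combining marks)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_tokens(tokens, preserve_original=False):
    """Return (position, token) pairs, optionally keeping the original token."""
    stream = []
    for position, token in enumerate(tokens):
        folded = fold_to_ascii(token)
        stream.append((position, folded))
        # Keep the original only when folding actually changed the token.
        if preserve_original and folded != token:
            stream.append((position, token))
    return stream


tokens = ["résumé", "café", "plain"]
default_stream = fold_tokens(tokens)
preserved_stream = fold_tokens(tokens, preserve_original=True)
print(default_stream)
print(preserved_stream)
```

With the default setting, the stream contains one token per position. With `preserve_original` enabled, accented terms occupy their position twice, and a token that is already ASCII (`plain`) is not duplicated.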

## Example

The following example request creates a new index named `example_index` and defines an analyzer that uses the `asciifolding` filter with the `preserve_original` parameter set to `true`:

```json
PUT /example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "custom_ascii_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_ascii_folding"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /example_index/_analyze
{
  "analyzer": "custom_ascii_analyzer",
  "text": "Résumé café naïve coördinate"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "resume",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "résumé",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cafe",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "café",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "naive",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "naïve",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "coordinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "coördinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
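
The offsets in the response refer to character positions in the original text, which is why the folded and original forms of each term share the same `start_offset`, `end_offset`, and `position`. The following Python sketch recomputes those values. It assumes that whitespace splitting is an adequate stand-in for the `standard` tokenizer, which holds only for simple input like this, and that the text uses precomposed (NFC) characters.

```python
import re
import unicodedata

# Normalize to NFC so each accented letter counts as a single character.
text = unicodedata.normalize("NFC", "Résumé café naïve coördinate")

# Whitespace splitting stands in for the standard tokenizer (assumption).
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

offsets = [(start, end) for _, start, end in spans]
print(offsets)  # [(0, 6), (7, 11), (12, 17), (18, 28)]

for position, (word, start, end) in enumerate(spans):
    print(position, word.lower(), start, end)
```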


2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -14,7 +14,7 @@ The following table lists all token filters that OpenSearch supports.

Token filter | Underlying Lucene token filter| Description
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
-`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
+[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.