Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add apostrophe token filter page #7871 #7884

Merged
118 changes: 118 additions & 0 deletions _analyzers/token-filters/apostrophe.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
---
layout: default
title: Apostrophe
parent: Token filters
nav_order: 110
---

# Apostrophe token filter

The `apostrophe` token filter's primary function is to remove possessive apostrophes and anything following them. This can be very useful in analyzing text in languages which rely heavily on apostrophes, such as Turkish, where apostrophes serves to separate the root word from suffixes, including possessive suffixes, case markers, and other grammatical endings.
AntonEliatra marked this conversation as resolved.
Show resolved Hide resolved


## Example

Following example can be used to create new index `custom_text_index` with custom analyzer configured in `settings` and used in `mappings`:
AntonEliatra marked this conversation as resolved.
Show resolved Hide resolved

```json
PUT /custom_text_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard", // splits text into words
"filter": [
"lowercase",
"apostrophe"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

You can use the following command to examine the tokens being generated using the created analyzer:
AntonEliatra marked this conversation as resolved.
Show resolved Hide resolved

```json
POST /custom_text_index/_analyze
{
"analyzer": "custom_analyzer",
"text": "John's car is faster than Peter's bike"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "john",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "car",
"start_offset": 7,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "is",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "faster",
"start_offset": 14,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "than",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "peter",
"start_offset": 26,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "bike",
"start_offset": 34,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 6
}
]
}
```

The built in `apostrophe` token filter is not suitable for languages such as French, as the apostrophes are used at the beginning of the words, for example `"C'est l'amour de l'école"` will result in four tokens: "C", "l", "de", "l".
AntonEliatra marked this conversation as resolved.
Show resolved Hide resolved
{: .note}
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Token filters receive the stream of tokens from the tokenizer and add, remove, o
The following table lists all token filters that OpenSearch supports.

Token filter | Underlying Lucene token filter| Description
`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe.
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe.
AntonEliatra marked this conversation as resolved.
Show resolved Hide resolved
`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
Expand Down
Loading