Skip to content

Commit

Permalink
Add apostrophe token filter page #7871 (#7884)
Browse files Browse the repository at this point in the history
* adding apostrophe token filter page #7871

Signed-off-by: AntonEliatra <[email protected]>

* fixing vale error

Signed-off-by: AntonEliatra <[email protected]>

* Update apostrophe-token-filter.md

Signed-off-by: AntonEliatra <[email protected]>

* updating the naming

Signed-off-by: AntonEliatra <[email protected]>

* updating as per the review comments

Signed-off-by: AntonEliatra <[email protected]>

* updating the heading to Apostrophe token filter

Signed-off-by: AntonEliatra <[email protected]>

* updating as per PR comments

Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

---------

Signed-off-by: AntonEliatra <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 6eccc88)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  • Loading branch information
3 people committed Aug 6, 2024
1 parent 22b719f commit 3fd2319
Show file tree
Hide file tree
Showing 2 changed files with 119 additions and 1 deletion.
118 changes: 118 additions & 0 deletions _analyzers/token-filters/apostrophe.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
---
layout: default
title: Apostrophe
parent: Token filters
nav_order: 110
---

# Apostrophe token filter

The `apostrophe` token filter's primary function is to remove possessive apostrophes and anything following them. This can be very useful in analyzing text in languages that rely heavily on apostrophes, such as Turkish, in which apostrophes separate the root word from suffixes, including possessive suffixes, case markers, and other grammatical endings.


## Example

The following example request creates a new index named `custom_text_index` with a custom analyzer configured in `settings` and used in `mappings`:

```json
PUT /custom_text_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard", // splits text into words
"filter": [
"lowercase",
"apostrophe"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the created analyzer:

```json
POST /custom_text_index/_analyze
{
"analyzer": "custom_analyzer",
"text": "John's car is faster than Peter's bike"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "john",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "car",
"start_offset": 7,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "is",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "faster",
"start_offset": 14,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "than",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "peter",
"start_offset": 26,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "bike",
"start_offset": 34,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 6
}
]
}
```

The built-in `apostrophe` token filter is not suitable for languages such as French, in which apostrophes are used at the beginning of words. For example, `"C'est l'amour de l'école"` will result in four tokens: "C", "l", "de", and "l".
{: .note}
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Token filters receive the stream of tokens from the tokenizer and add, remove, o
The following table lists all token filters that OpenSearch supports.

Token filter | Underlying Lucene token filter| Description
`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe.
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
Expand Down

0 comments on commit 3fd2319

Please sign in to comment.