adding classic token filter docs opensearch-project#7876

Signed-off-by: AntonEliatra <[email protected]>
AntonEliatra · Aug 6, 2024 · a000c66
1 parent 639cb38
commit a000c66
Showing 2 changed files with 95 additions and 1 deletion.
diff --git a/_analyzers/token-filters/classic.md b/_analyzers/token-filters/classic.md
@@ -0,0 +1,94 @@
+---
+layout: default
+title: classic
+parent: Token filters
+nav_order: 150
+---
+
+# Classic token filter
+
+The `classic` token filter works alongside the `classic` tokenizer. It processes the tokens produced by that tokenizer by applying the following transformations, which aid text analysis and search:
+ - Removal of possessive endings such as "'s". For example, "John's" becomes "John".
+ - Removal of "." from acronyms. For example, "D.A.R.P.A." becomes "DARPA".
+
+Splitting on internal hyphens (for example, "co-operate" into "co" and "operate") is performed by the `classic` tokenizer rather than by this filter.
+
+## Example
+
+The following example request creates an index named `custom_classic_filter` and defines an analyzer that uses the `classic` tokenizer with the `classic` filter:
+
+```json
+PUT /custom_classic_filter
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "custom_classic": {
+          "type": "custom",
+          "tokenizer": "classic",
+          "filter": ["classic"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated by the analyzer:
+
+```json
+POST /custom_classic_filter/_analyze
+{
+  "analyzer": "custom_classic",
+  "text": "John's co-operate was excellent."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "John",
+      "start_offset": 0,
+      "end_offset": 6,
+      "type": "<APOSTROPHE>",
+      "position": 0
+    },
+    {
+      "token": "co",
+      "start_offset": 7,
+      "end_offset": 9,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "operate",
+      "start_offset": 10,
+      "end_offset": 17,
+      "type": "<ALPHANUM>",
+      "position": 2
+    },
+    {
+      "token": "was",
+      "start_offset": 18,
+      "end_offset": 21,
+      "type": "<ALPHANUM>",
+      "position": 3
+    },
+    {
+      "token": "excellent",
+      "start_offset": 22,
+      "end_offset": 31,
+      "type": "<ALPHANUM>",
+      "position": 4
+    }
+  ]
+}
+```
+
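Review note (not part of this commit): to spot-check the filter's behavior without creating an index, the `_analyze` API can be called with an inline tokenizer and filter. The request below is a minimal sketch; the sample text is an assumption chosen to exercise both a possessive and an acronym, which the example in `classic.md` does not cover together.

```json
POST /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": "D.A.R.P.A. reviewed John's proposal."
}
```

Given the transformations described in `classic.md`, the response would be expected to contain `DARPA` and `John` rather than `D.A.R.P.A.` and `John's`. This has not been run as part of this review, so it should be confirmed against a running cluster before the page is extended with it.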
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
@@ -17,7 +17,7 @@ Token filter | Underlying Lucene token filter|  Description
 `asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
 `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. 
 `cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
-`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
+[`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
 `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
 `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
 `decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).