Add edge n-gram token filter docs #7980 #8025

113 changes: 113 additions & 0 deletions _analyzers/token-filters/edge-ngram.md
@@ -0,0 +1,113 @@
---
layout: default
title: Edge ngram
parent: Token filters
nav_order: 120
---
<!-- vale off -->
# Edge ngram token filter
<!-- vale on -->
The `edge_ngram` token filter generates n-grams (substrings) from the beginning (edge) of each token. It is particularly useful for autocomplete or prefix matching, where you want to match the start of a word or phrase as the user types. For example, with `min_gram` set to `1` and `max_gram` set to `3`, the token `turtle` produces the n-grams `t`, `tu`, and `tur`.

## Parameters

The `edge_ngram` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`min_gram` | Optional | Integer | The minimum length of the n-grams that will be generated. Default is `1`.
`max_gram` | Optional | Integer | The maximum length of the n-grams that will be generated. Default is `1`. Avoid setting this value too low: search terms longer than `max_gram` are not matched because the corresponding token is never generated. For example, if `max_gram` is set to `3` and the word `banana` is indexed, the longest generated token is `ban`, so a search for `banana` returns no matches. You can mitigate this risk by using the `truncate` token filter in the search analyzer, as shown in the example following this table.
`preserve_original` | Optional | Boolean | Whether to include the original token in the output. Default is `false`.
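
To address the `max_gram` limitation described in the preceding table, you can analyze search queries with a `truncate` token filter set to the same length so that query terms are shortened to a length for which n-grams exist in the index. The following is a minimal sketch of this approach; the index, filter, analyzer, and field names are illustrative:

```json
PUT /edge_ngram_truncate_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 4
        },
        "my_truncate": {
          "type": "truncate",
          "length": 4
        }
      },
      "analyzer": {
        "index_time_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edge_ngram"]
        },
        "search_time_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_truncate"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_time_analyzer",
        "search_analyzer": "search_time_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

With this configuration, a query for `banana` against the `title` field is truncated to `bana` at search time, which matches the `bana` n-gram generated at index time.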

## Example

The following example request creates a new index named `edge_ngram_example` and configures an analyzer with an `edge_ngram` filter:

```json
PUT /edge_ngram_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edge_ngram"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /edge_ngram_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slow green turtle"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slo",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gre",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gree",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "tur",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "turt",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
```
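
Note that the full tokens `green` and `turtle` do not appear in the preceding output because they are longer than the configured `max_gram` value. If you also want the filter to emit each unmodified token, set `preserve_original` to `true`. The following sketch illustrates this using an inline filter definition with the `_analyze` API, so no index is required:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "edge_ngram",
      "min_gram": 3,
      "max_gram": 4,
      "preserve_original": true
    }
  ],
  "text": "slow green turtle"
}
```
{% include copy-curl.html %}

In addition to the n-grams, the response then also contains the original tokens `slow`, `green`, and `turtle`.
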
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -24,7 +24,7 @@ Token filter | Underlying Lucene token filter| Description
`delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload.
[`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency.
`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages.
`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.