Resolve tech feedback
Signed-off-by: Junqiu Lei <[email protected]>
junqiu-lei committed Jul 29, 2024
1 parent 0eea1d3 commit ce3eac4
Showing 4 changed files with 62 additions and 8 deletions.
52 changes: 46 additions & 6 deletions _field-types/supported-field-types/knn-vector.md
@@ -367,15 +367,49 @@ GET test-binary-hnsw/_search

The follow example demonstrates how to create a binary vector index with the Faiss engine and IVF algorithm:

Firstly, we need create the training index and model in binary format. For convenience, we use above `test-binary-hnsw` index and `my_vector1` field as the training index and field to train model.
Firstly, we need create the training index with binary format data type:
```json
PUT train-index
{
"mappings": {
"properties": {
"train-field": {
"type": "knn_vector",
"dimension": 8,
"data_type": "binary"
}
}
}
}
```
{% include copy-curl.html %}'

Then, ingest some documents with binary vectors to the training index:
```json
PUT _bulk
{ "index": { "_index": "train-index", "_id": "1" } }
{ "train-field": [1] }
{ "index": { "_index": "train-index", "_id": "2" } }
{ "train-field": [2] }
{ "index": { "_index": "train-index", "_id": "3" } }
{ "train-field": [3] }
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [4] }
{ "index": { "_index": "train-index", "_id": "5" } }
{ "train-field": [5] }
...
```
{% include copy-curl.html %}
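
Reviewer note: the sample documents above store one value per vector because, with `dimension` set to 8, the 8 bits fit in a single byte. The sketch below is only an illustration of that packing and of the Hamming distance and score formulas that appear later in this commit. It assumes that binary vectors are supplied as signed 8-bit integers and that bits are packed most significant bit first; the helper names `pack_bits`, `hamming_distance`, and `hamming_score` are hypothetical and are not part of OpenSearch.

```python
def pack_bits(bits):
    """Pack a list of 0/1 bits into signed 8-bit integers (MSB first, by assumption)."""
    if len(bits) % 8 != 0:
        raise ValueError("the number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            byte = (byte << 1) | bit
        # Reinterpret the unsigned byte (0..255) as a signed int8 (-128..127).
        packed.append(byte - 256 if byte > 127 else byte)
    return packed


def hamming_distance(x, y):
    """Count the differing bits between two equally long int8 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # XOR marks the differing bits; masking with 0xFF keeps exactly 8 bits per value.
    return sum(bin((a ^ b) & 0xFF).count("1") for a, b in zip(x, y))


def hamming_score(x, y):
    """Apply score = 1 / (1 + d), where d is the Hamming distance."""
    return 1 / (1 + hamming_distance(x, y))
```

Under these assumptions, the vectors `[1]` and `[2]` from the bulk request differ in two bit positions, so their distance is 2 and their score is 1/3.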

Then, train the model with the training index and field in binary format, and specify the method space type as `hamming`:

```json
POST _plugins/_knn/models/test-binary-model/_train
{
"training_index": "test-binary-hnsw",
"training_field": "my_vector",
"training_index": "train-index",
"training_field": "train-field",
"dimension": 8,
"description": "My model description",
"description": "model with binary data",
"data_type": "binary",
"method": {
"name": "ivf",
@@ -390,7 +424,13 @@ POST _plugins/_knn/models/test-binary-model/_train
```
{% include copy-curl.html %}

Then create IVF index with the trained model:
Then, make sure the model state is `created`:
```json
GET _plugins/_knn/models/test-binary-model?filter_path=state
```
{% include copy-curl.html %}
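
Reviewer note: training runs asynchronously, so a client script would typically repeat the state check above until the model is ready. The following is a minimal sketch of such a loop, not part of the documented API. `wait_for_model` and the `fetch_state` callable are hypothetical; `fetch_state` is assumed to issue the `GET` request above and return the `state` string, and the `failed` state name is an assumption based on the k-NN model API.

```python
import time


def wait_for_model(fetch_state, timeout_s=60.0, interval_s=1.0, sleep=time.sleep):
    """Poll fetch_state() until it returns "created".

    fetch_state is a caller-supplied function (hypothetical) that returns the
    model's current state as a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "created":
            return state
        if state == "failed":
            raise RuntimeError("model training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in state {state!r} after {timeout_s} s")
        # Any other state (for example, training in progress) means keep waiting.
        sleep(interval_s)
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library and makes it easy to exercise with a stub.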

Then, create IVF index with the trained model:

```json
PUT test-binary-ivf
@@ -402,7 +442,7 @@ PUT test-binary-ivf
},
"mappings": {
"properties": {
"my_vector1": {
"my_vector": {
"type": "knn_vector",
"model_id": "test-binary-model"
}
5 changes: 4 additions & 1 deletion _search-plugins/knn/approximate-knn.md
@@ -368,7 +368,7 @@ Not every method supports each of these spaces. Be sure to check out [the method
</td>
</tr>
<tr>
<td>hammingbit</td>
<td>hamming</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
@@ -383,3 +383,6 @@ With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary format vectors only in OpenSearch 2.16 and later, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}
10 changes: 9 additions & 1 deletion _search-plugins/knn/knn-score-script.md
@@ -323,6 +323,11 @@ A space corresponds to the function used to measure the distance between two poi
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>hamming</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>


@@ -331,4 +336,7 @@ Cosine similarity returns a number between -1 and 1, and because OpenSearch rele
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }
{: .note }

The `hamming` space type is supported for binary format vectors only in OpenSearch 2.16 and later, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}
3 changes: 3 additions & 0 deletions _search-plugins/knn/painless-functions.md
@@ -54,6 +54,9 @@ l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This functi
cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.

The `hamming` space type is supported for binary format vectors only in OpenSearch 2.16 and later, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}

## Constraints

1. If a document’s `knn_vector` field has different dimensions than the query, the function throws an `IllegalArgumentException`.