Add doc for binary format support in k-NN
Signed-off-by: Junqiu Lei <[email protected]>
junqiu-lei committed Jul 25, 2024
1 parent fdfd53f commit 0eea1d3
Showing 5 changed files with 201 additions and 5 deletions.
179 changes: 179 additions & 0 deletions _field-types/supported-field-types/knn-vector.md
@@ -267,3 +267,182 @@ else:
return Byte(bval)
```
{% include copy.html %}

## Binary vector
By switching from float to binary vectors, users can reduce memory costs by a factor of 32.
Using binary vector indexes can lower operational costs while maintaining high recall, making large-scale deployments more economical and efficient.
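
The factor of 32 follows from storing 1 bit per dimension instead of a 32-bit float. The following Python sketch works through that arithmetic for raw vector storage only; the helper name `vector_bytes` and the dimension of 768 are hypothetical, chosen for illustration, and index structures such as HNSW graphs add overhead that is not counted here.

```python
def vector_bytes(dimension: int, bits_per_dimension: int) -> int:
    """Raw storage for one vector in bytes, ignoring any index overhead."""
    return dimension * bits_per_dimension // 8


dimension = 768  # hypothetical embedding size, for illustration only
float_bytes = vector_bytes(dimension, 32)  # one 32-bit float per dimension
binary_bytes = vector_bytes(dimension, 1)  # one bit per dimension

print(float_bytes, binary_bytes, float_bytes // binary_bytes)  # 3072 96 32
```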

### Supported capabilities

- **Approximate k-NN**: Binary vectors are currently supported only by the Faiss engine, with the HNSW and IVF algorithms.
- **Script score k-NN**: Enables the use of binary vectors in script scoring.
- **Painless extensions**: Allows the use of binary vectors with Painless scripting extensions.

### Requirements
There are several requirements for using binary vectors in the OpenSearch k-NN plugin:

#### Data type

The `data_type` of the binary vector index must be `binary`.

#### Space type

The `space_type` of the binary vector index must be `hamming`.

#### Dimension

The `dimension` of the binary vector index must be a multiple of 8.

#### Input vector

Users must encode their binary data into bytes (`int8`). For example, the binary sequence `0, 1, 1, 0, 0, 0, 1, 1` must be packed into the byte value `99` before it is provided as binary vector input.
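
The packing step can be sketched in Python as follows. This is a minimal illustration, not part of the k-NN plugin: the function name `pack_bits` is hypothetical, bits are packed most significant bit first (which matches the `99` example above), and byte values above 127 are mapped into the signed range on the assumption that `int8` means two's-complement values from -128 to 127.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into int8 values, most significant bit first."""
    if len(bits) % 8 != 0:
        # Mirrors the requirement that the dimension be a multiple of 8.
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            value = (value << 1) | bit
        # Assumption: express 128..255 as signed int8 (-128..-1).
        packed.append(value - 256 if value > 127 else value)
    return packed


print(pack_bits([0, 1, 1, 0, 0, 0, 1, 1]))  # [99]
```

A field with `dimension` 8 therefore takes a single-element array such as `[99]`, and a field with `dimension` 16 takes two elements.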

### Examples
The following example demonstrates how to create a binary vector index with the Faiss engine and HNSW algorithm:

```json
PUT test-binary-hnsw
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "space_type": "hamming",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Then ingest some documents with binary vectors:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-hnsw", "_id": "1"}}
{"my_vector1": [7], "price": 4.4}
{"index": {"_index": "test-binary-hnsw", "_id": "2"}}
{"my_vector1": [10], "price": 14.2}
{"index": {"_index": "test-binary-hnsw", "_id": "3"}}
{"my_vector1": [15], "price": 19.1}
{"index": {"_index": "test-binary-hnsw", "_id": "4"}}
{"my_vector1": [99], "price": 1.2}
{"index": {"_index": "test-binary-hnsw", "_id": "5"}}
{"my_vector1": [80], "price": 16.5}
```
{% include copy-curl.html %}


When querying, be sure to use a binary vector:

```json
GET test-binary-hnsw/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [9],
        "k": 2
      }
    }
  }
}
```
{% include copy-curl.html %}

The following example demonstrates how to create a binary vector index with the Faiss engine and IVF algorithm.

First, create the training index and train a model in binary format. For convenience, the following request reuses the `test-binary-hnsw` index and its `my_vector1` field from the preceding example as the training index and training field:

```json
POST _plugins/_knn/models/test-binary-model/_train
{
  "training_index": "test-binary-hnsw",
  "training_field": "my_vector1",
  "dimension": 8,
  "description": "My model description",
  "data_type": "binary",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "hamming",
    "parameters": {
      "nlist": 1,
      "nprobes": 1
    }
  }
}
```
{% include copy-curl.html %}

Then create an IVF index using the trained model:

```json
PUT test-binary-ivf
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "model_id": "test-binary-model"
      }
    }
  }
}
```
{% include copy-curl.html %}

Then ingest some documents with binary vectors:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-ivf", "_id": "1"}}
{"my_vector1": [7], "price": 4.4}
{"index": {"_index": "test-binary-ivf", "_id": "2"}}
{"my_vector1": [10], "price": 14.2}
{"index": {"_index": "test-binary-ivf", "_id": "3"}}
{"my_vector1": [15], "price": 19.1}
{"index": {"_index": "test-binary-ivf", "_id": "4"}}
{"my_vector1": [99], "price": 1.2}
{"index": {"_index": "test-binary-ivf", "_id": "5"}}
{"my_vector1": [80], "price": 16.5}
```
{% include copy-curl.html %}

When querying, be sure to use a binary vector:

```json
GET test-binary-ivf/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [9],
        "k": 2
      }
    }
  }
}
```
{% include copy-curl.html %}
9 changes: 9 additions & 0 deletions _search-plugins/knn/approximate-knn.md
@@ -314,6 +314,10 @@ To learn about using k-NN search with nested fields, see [k-NN search with neste

To learn more about the radial search feature, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).

### Using approximate k-NN with binary vectors

To learn more about using binary vectors with k-NN search, see [k-NN search with binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces.
@@ -363,6 +367,11 @@ Not every method supports each of these spaces. Be sure to check out [the method
\[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
</td>
</tr>
<tr>
<td>hamming</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>
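
The Hamming distance and score formulas in the last table row can be sketched in Python as follows. This is an illustrative reimplementation, not the plugin's code: the function names `hamming_distance` and `score` are hypothetical, and the 8-bit mask assumes that vector elements are `int8` values whose negative entries should be read as their two's-complement bit patterns.

```python
def hamming_distance(x, y):
    """Count the differing bits between two equal-length byte vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # XOR marks differing bits; masking to 8 bits handles negative int8 values.
    return sum(bin((a ^ b) & 0xFF).count("1") for a, b in zip(x, y))


def score(distance):
    """Convert a distance to a score using 1 / (1 + d)."""
    return 1 / (1 + distance)


d = hamming_distance([9], [7])  # 00001001 vs 00000111
print(d, score(d))  # 3 0.25
```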

The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
15 changes: 11 additions & 4 deletions _search-plugins/knn/knn-index.md
@@ -45,6 +45,10 @@ PUT /test-index

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).

## Binary vector

Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of storage space needed. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

## SIMD optimization for the Faiss engine

Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default on Linux machines only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
@@ -104,14 +108,17 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e

### Supported Faiss methods

Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search.
`ivf` | true | l2, innerproduct | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
Method name | Requires training | Supported spaces | Description
:--- | :--- |:------------------------------------------------------------------------------------------| :---
`hnsw` | false | l2, innerproduct, hamming | Hierarchical proximity graph approach to approximate k-NN search.
`ivf` | true | l2, innerproduct, hamming | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.

For hnsw, "innerproduct" is not available when PQ is used.
{: .note}

The `hamming` space type is supported only for binary vectors in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}

#### HNSW parameters

Parameter name | Required | Default | Updatable | Description
1 change: 1 addition & 0 deletions _search-plugins/knn/painless-functions.md
@@ -52,6 +52,7 @@ Function name | Function signature | Description
l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.

## Constraints

2 changes: 1 addition & 1 deletion _search-plugins/vector-search.md
@@ -57,7 +57,7 @@ PUT test-index

You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float.

To save storage space, you can use `byte` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

### k-NN vector search
