From 0eea1d3ffb0b65dc26dfcc1f25430ee480947b97 Mon Sep 17 00:00:00 2001
From: Junqiu Lei <junqiu@amazon.com>
Date: Thu, 25 Jul 2024 16:52:21 -0700
Subject: [PATCH] Add doc for binary format support in k-NN

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
---
 .../supported-field-types/knn-vector.md       | 179 ++++++++++++++++++
 _search-plugins/knn/approximate-knn.md        |   9 +
 _search-plugins/knn/knn-index.md              |  15 +-
 _search-plugins/knn/painless-functions.md     |   1 +
 _search-plugins/vector-search.md              |   2 +-
 5 files changed, 201 insertions(+), 5 deletions(-)
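
Reviewer note (not part of the commit): the "Input Vector" requirement and the Hamming score formula added by this patch can be checked by hand with a small sketch. The helper names `pack_bits`, `hamming_distance`, and `knn_score` are hypothetical and are not part of OpenSearch or the k-NN plugin. The sketch assumes most-significant-bit-first packing (which reproduces the `99` example in knn-vector.md) and assumes that byte values above 127 are written as negative `int8` values.

```python
def pack_bits(bits):
    """Pack 0/1 values (length a multiple of 8) into int8 values, MSB first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            value = (value << 1) | bit
        # Assumption: values above 127 are expressed in the signed int8 range.
        packed.append(value - 256 if value > 127 else value)
    return packed


def hamming_distance(a, b):
    """Count differing bits between two equal-length int8 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Mask to 8 bits so negative int8 values are compared bit for bit.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


def knn_score(distance):
    """Score formula from the spaces table: 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)


print(pack_bits([0, 1, 1, 0, 0, 0, 1, 1]))  # → [99]
print(hamming_distance([9], [10]))          # → 2
```

With the query vector `[9]` from the examples, the documents holding `[10]` and `[15]` are both at Hamming distance 2, so under these assumptions they would be the two nearest neighbors.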

diff --git a/_field-types/supported-field-types/knn-vector.md b/_field-types/supported-field-types/knn-vector.md
index c7f9ec7f2b..a1154e87e9 100644
--- a/_field-types/supported-field-types/knn-vector.md
+++ b/_field-types/supported-field-types/knn-vector.md
@@ -267,3 +267,182 @@ else:
 return Byte(bval)
 ```
 {% include copy.html %}
+
+## Binary vector
+By switching from float to binary vectors, you can reduce memory costs by a factor of 32 because each dimension is stored as a single bit instead of a 32-bit float.
+Binary vector indexes can lower operational costs while maintaining high recall, making large-scale deployments more economical and efficient.
+
+### Supported capabilities
+
+- **Approximate k-NN**: Binary format is currently supported only for the Faiss engine, with the HNSW and IVF algorithms.
+- **Script Score k-NN**: Enables the use of binary vectors in script scoring.
+- **Painless Extensions**: Allows the use of binary vectors with Painless scripting extensions.
+
+### Requirements
+There are several requirements for using binary vectors in the OpenSearch k-NN plugin:
+
+#### Data type
+The `data_type` of the binary vector index must be `binary`.
+
+#### Space type
+
+The `space_type` of the binary vector index must be `hamming`.
+
+#### Dimension
+
+The `dimension` of the binary vector index must be a multiple of 8.
+
+#### Input vector
+
+Encode your binary data into bytes (`int8`) before ingesting it. For example, pack the binary sequence `0, 1, 1, 0, 0, 0, 1, 1` into the byte value `99` and provide that value as the binary vector input.
+
+### Examples
+The following example demonstrates how to create a binary vector index with the Faiss engine and HNSW algorithm:
+
+```json
+PUT test-binary-hnsw
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 8,
+        "data_type": "binary",
+        "method": {
+          "name": "hnsw",
+          "space_type": "hamming",
+          "engine": "faiss",
+          "parameters": {
+            "ef_construction": 128,
+            "m": 24
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then ingest some documents with binary vectors:
+
+```json
+PUT _bulk?refresh=true
+{"index": {"_index": "test-binary-hnsw", "_id": "1"}}
+{"my_vector1": [7], "price": 4.4}
+{"index": {"_index": "test-binary-hnsw", "_id": "2"}}
+{"my_vector1": [10], "price": 14.2}
+{"index": {"_index": "test-binary-hnsw", "_id": "3"}}
+{"my_vector1": [15], "price": 19.1}
+{"index": {"_index": "test-binary-hnsw", "_id": "4"}}
+{"my_vector1": [99], "price": 1.2}
+{"index": {"_index": "test-binary-hnsw", "_id": "5"}}
+{"my_vector1": [80], "price": 16.5}
+```
+{% include copy-curl.html %}
+
+
+When querying, be sure to use a binary vector:
+
+```json
+GET test-binary-hnsw/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector1": {
+        "vector": [9],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+The following example demonstrates how to create a binary vector index with the Faiss engine and IVF algorithm.
+
+First, create the training index and train a model in binary format. For convenience, this example reuses the `test-binary-hnsw` index and its `my_vector1` field as the training index and training field:
+
+```json
+POST _plugins/_knn/models/test-binary-model/_train
+{
+  "training_index": "test-binary-hnsw",
+  "training_field": "my_vector1",
+  "dimension": 8,
+  "description": "My model description",
+  "data_type": "binary",
+  "method": {
+    "name": "ivf",
+    "engine": "faiss",
+    "space_type": "hamming",
+    "parameters": {
+      "nlist": 1,
+      "nprobes": 1
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then create an IVF index with the trained model:
+
+```json
+PUT test-binary-ivf
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "model_id": "test-binary-model"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then ingest some documents with binary vectors:
+
+```json
+PUT _bulk?refresh=true
+{"index": {"_index": "test-binary-ivf", "_id": "1"}}
+{"my_vector1": [7], "price": 4.4}
+{"index": {"_index": "test-binary-ivf", "_id": "2"}}
+{"my_vector1": [10], "price": 14.2}
+{"index": {"_index": "test-binary-ivf", "_id": "3"}}
+{"my_vector1": [15], "price": 19.1}
+{"index": {"_index": "test-binary-ivf", "_id": "4"}}
+{"my_vector1": [99], "price": 1.2}
+{"index": {"_index": "test-binary-ivf", "_id": "5"}}
+{"my_vector1": [80], "price": 16.5}
+```
+{% include copy-curl.html %}
+
+When querying, be sure to use a binary vector:
+
+```json
+GET test-binary-ivf/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector1": {
+        "vector": [9],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
diff --git a/_search-plugins/knn/approximate-knn.md b/_search-plugins/knn/approximate-knn.md
index fa1b4096c7..bcee0dc631 100644
--- a/_search-plugins/knn/approximate-knn.md
+++ b/_search-plugins/knn/approximate-knn.md
@@ -314,6 +314,10 @@ To learn about using k-NN search with nested fields, see [k-NN search with neste
 
 To learn more about the radial search feature, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
 
+### Using approximate k-NN with binary vectors
+
+To learn more about using binary vectors with k-NN search, see [k-NN search with binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+
 ## Spaces
 
 A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces. 
@@ -363,6 +367,11 @@ Not every method supports each of these spaces. Be sure to check out [the method
         \[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
     </td>
   </tr>
+  <tr>
+    <td>hamming</td>
+    <td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
+    <td>\[ score = {1 \over 1 + d } \]</td>
+  </tr>
 </table>
 
 The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
diff --git a/_search-plugins/knn/knn-index.md b/_search-plugins/knn/knn-index.md
index ed8b9217f5..d9a85b2f07 100644
--- a/_search-plugins/knn/knn-index.md
+++ b/_search-plugins/knn/knn-index.md
@@ -45,6 +45,10 @@ PUT /test-index
 
 Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
 
+## Binary vector
+
+Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of storage space needed. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+
 ## SIMD optimization for the Faiss engine
 
 Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default on Linux machines only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
@@ -104,14 +108,17 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e
 
 ### Supported Faiss methods
 
-Method name | Requires training | Supported spaces | Description
-:--- | :--- | :--- | :---
-`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search.
-`ivf` | true | l2, innerproduct | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
+Method name | Requires training | Supported spaces | Description
+:--- | :--- | :--- | :---
+`hnsw` | false | l2, innerproduct, hamming | Hierarchical proximity graph approach to approximate k-NN search.
+`ivf` | true | l2, innerproduct, hamming | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
 
 For hnsw, "innerproduct" is not available when PQ is used.
 {: .note}
 
+The `hamming` space type is supported only for binary vectors and only in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+{: .note}
+
 #### HNSW parameters
 
 Parameter name | Required | Default | Updatable | Description
diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md
index 09eb989702..62d2bc8586 100644
--- a/_search-plugins/knn/painless-functions.md
+++ b/_search-plugins/knn/painless-functions.md
@@ -52,6 +52,7 @@ Function name | Function signature | Description
 l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
+hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.
 
 ## Constraints
 
diff --git a/_search-plugins/vector-search.md b/_search-plugins/vector-search.md
index 862b26b375..1b876e781e 100644
--- a/_search-plugins/vector-search.md
+++ b/_search-plugins/vector-search.md
@@ -57,7 +57,7 @@ PUT test-index
 
 You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float. 
 
-To save storage space, you can use `byte` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
+To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
 
 ### k-NN vector search