Adds section on product quantization for docs #6926
Adds section in vector quantization docs for product quantization. In it, it contains tips for using it as well as memory estimations. Along with this, changed some formatting to make docs easier to write. Signed-off-by: John Mazanec <[email protected]>
LGTM! Thanks
Fix formatting Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Define abbreviation on first mention Signed-off-by: Melissa Vagi <[email protected]>
Doc review complete. Please let me know if you have any questions about my changes. Once you've addressed my feedback, I'll approve the PR as ready for editorial. Thank you.
@@ -10,22 +10,42 @@ has_math: true

# k-NN vector quantization

By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the vector occupies 4 bytes of memory. For use cases that require ingestion on a large scale, keeping `float` vectors can be expensive because OpenSearch needs to construct, load, save, and search graphs (for native `nmslib` and `faiss` engines). To reduce the memory footprint, you can use vector quantization.
By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the
Please fix the line break formatting of lines 13--16.
I added the line breaks so that editing would be easier (i.e., it wouldn't be one line that runs off the screen), and they don't impact rendering. Is this incorrect to do?
Yes, it's incorrect to enter line breaks. The site and OpenSearch Project doc team follow a specific formatting guide. I'll handle formatting the doc before moving it into editorial. https://github.com/opensearch-project/documentation-website/blob/main/FORMATTING_GUIDE.md
Co-authored-by: Melissa Vagi <[email protected]> Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
@jmazanec15 @vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!
In OpenSearch, the training vectors need to be present in an index. In general, the amount of training data will depend on which ANN algorithm will be used and how much data will go into the index. For IVF-based indices, a good number of training vectors to use is `max(1000*nlist, 2^code_size * 1000)`. For HNSW-based indexes, a good number is `2^code_size*1000` training vectors. See [Faiss's documentation](https://github.com/facebookresearch/faiss/wiki/FAQ#how-many-training-points-do-i-need-for-k-means) for more details about the methodology behind calculating these figures.
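The rule of thumb in the paragraph above can be checked with a few lines of Python. This is only a sketch: the function name `recommended_training_vectors` and the convention of passing `nlist=None` for an HNSW-based index are illustrative assumptions, not part of the k-NN plugin or Faiss APIs.

```python
from typing import Optional


def recommended_training_vectors(code_size: int, nlist: Optional[int] = None) -> int:
    """Suggested number of PQ training vectors, per the rule of thumb above.

    Hypothetical helper: pass nlist for an IVF-based index, or leave it as
    None for an HNSW-based index.
    """
    pq_minimum = (2 ** code_size) * 1000  # 2^code_size * 1000
    if nlist is None:
        # HNSW: only the PQ codebook size matters.
        return pq_minimum
    # IVF: take the larger of the coarse-quantizer and PQ requirements.
    return max(1000 * nlist, pq_minimum)


print(recommended_training_vectors(code_size=8))              # HNSW-based index
print(recommended_training_vectors(code_size=8, nlist=1024))  # IVF-based index
```

With `code_size = 8`, the PQ term is 256,000 vectors, so for IVF the `1000*nlist` term only dominates once `nlist` exceeds 256.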
For PQ, the two parameters that need to be selected are _m_ and _code_size_. _m_ determines how many sub-vectors the vectors should be split to encode separately. Consequently, the _dimension_ needs to be divisible by _m_. _code_size_ determines how many bits each sub-vector will be encoded with. In general, a good place to start is setting `code_size = 8` and then tuning _m_ to get the desired trade-off between memory footprint and recall.
I'm not following the second sentence here. Do we mean something like "m determines the number of subvectors into which vectors should be split for separate encoding"? In the fourth sentence, is "with" the correct preposition, or should it be "into"?
Yes, your rewrite is correct. I revised the following sentence to read: _code_size_ determines the number of bits used to encode each subvector.
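To make the roles of _m_ and _code_size_ discussed in this thread concrete, the following sketch computes the subvector length and the size of one PQ-encoded vector. It assumes that an encoded vector is simply _m_ codes of _code_size_ bits each, rounded up to whole bytes, and it counts only the encoded vectors, not codebooks or any graph or IVF overhead, so it should not be read as the memory estimate formula from the docs. The function name `pq_encoding_summary` is hypothetical.

```python
def pq_encoding_summary(dimension: int, m: int, code_size: int = 8) -> dict:
    """Summarize how one vector is split and encoded under PQ (illustrative only)."""
    if m <= 0 or dimension % m != 0:
        # The dimension must be divisible by m so every subvector has equal length.
        raise ValueError(f"dimension {dimension} is not divisible by m={m}")

    code_bytes = (m * code_size + 7) // 8  # m codes of code_size bits, rounded up to bytes
    float_bytes = 4 * dimension            # each float dimension occupies 4 bytes

    return {
        "subvector_dimension": dimension // m,
        "code_bytes": code_bytes,
        "float_bytes": float_bytes,
        "compression_ratio": float_bytes / code_bytes,
    }


# Start from code_size = 8, then vary m to trade memory footprint against recall.
for m in (8, 16, 32):
    print(m, pq_encoding_summary(dimension=256, m=m, code_size=8))
```

A larger _m_ yields more, shorter subvectors and a larger code per vector, which typically improves recall at the cost of memory.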
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
Address editorial feedback Signed-off-by: Melissa Vagi <[email protected]>
@natebower Thank you for the review. I accepted your edits and addressed the rewrite comments.
Doc review and editorial review completed
* Adds section on product quantization for docs Adds section in vector quantization docs for product quantization. In it, it contains tips for using it as well as memory estimations. Along with this, changed some formatting to make docs easier to write. Signed-off-by: John Mazanec <[email protected]> * Update knn-vector-quantization.md Fix formatting Signed-off-by: Melissa Vagi <[email protected]> * Update knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update knn-vector-quantization.md Define abbreviation on first mention Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Melissa Vagi <[email protected]> Signed-off-by: John Mazanec <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update knn-index.md Formatting and copyedits Signed-off-by: Melissa Vagi <[email protected]> * Update 
knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-index.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email 
protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update _search-plugins/knn/knn-vector-quantization.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> * Update knn-vector-quantization.md Address editorial feedback Signed-off-by: Melissa Vagi <[email protected]> --------- Signed-off-by: John Mazanec <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> Co-authored-by: Melissa Vagi <[email protected]> Co-authored-by: Nathan Bower <[email protected]> (cherry picked from commit 9a6bb8a) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Adds a section on product quantization to the vector quantization docs. The section contains tips for using product quantization as well as memory estimations. Along with this, I changed some formatting to make the docs easier to write.
I decided to include a completely accurate memory estimate formula, with a note about the typical number of segments.
We added a section on scalar quantization in 2.13, but it did not include product quantization. Related comment here: https://github.com/opensearch-project/documentation-website/pull/6249/files#r1529479186. This should be backported to 2.13.