Refactor of the neural sparse search tutorial #7922
Conversation
Signed-off-by: zhichao-aws <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.
Signed-off-by: zhichao-aws <[email protected]>
This PR is ready for review, thanks!
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws @zhichao-aws Please see my comments and changes and let me know if you have any questions. Thanks!
Before using neural sparse search, make sure to set up a [pretrained sparse embedding model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/#sparse-encoding-models) or your own sparse embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
{: .note}
- Generate vector embeddings within OpenSearch: Configure an ingest pipeline to generate and store sparse vector embeddings from document text at ingestion time. At query time, input plain text, which will be automatically converted into vector embeddings for search. For complete setup steps, see [Configuring ingest pipelines for neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-with-pipelines/).
- Ingest raw sparse vectors and search using them directly. For complete setup steps, see [Ingesting and searching raw vectors]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-with-raw-vectors/).
"search using them directly" => "use them to search directly"?
Reworded.
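For context, the ingest pipeline option discussed above is configured with the `sparse_encoding` processor. The following is a minimal sketch assuming a sparse encoding model has already been registered and deployed; the pipeline name, model ID, and field names are placeholders:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "Sketch: generate sparse embeddings from passage_text at ingestion time",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<your sparse encoding model ID>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```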
For this tutorial, you’ll use neural sparse search with OpenSearch’s built-in ML model hosting and ingest pipelines. Because the transformation of text to embeddings is performed within OpenSearch, you'll use text when ingesting and searching documents.
At ingestion time, neural sparse search uses a sparse encoding to generate sparse vector embeddings from text fields during ingestion. |
Should "model" following "encoding"? It looks like we don't need "during ingestion" because we already have "At ingestion time".
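As background for this sentence: the sparse vector embeddings that the model generates are token-to-weight maps. An ingested document's generated field might look like the following (field names and weights are illustrative only):

```json
{
  "passage_text": "Hello world",
  "passage_embedding": {
    "hello": 1.42,
    "world": 1.17,
    "greeting": 0.53
  }
}
```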
This tutorial consists of the following steps:
1. [**Configure a sparse encoding model/tokenizer**](#step-1-configure-a-sparse-encoding-modeltokenizer). |
It looks like the periods aren't necessary in this list.
This tutorial consists of the following steps:
1. [**Ingest sparse vectors**](#step-1-ingest-sparse-vectors). |
No periods necessary in this list.
1. [**Ingest sparse vectors**](#step-1-ingest-sparse-vectors).
1. [Create an index](#step-1a-create-an-index).
1. [Ingest documents into the index](#step-1b-ingest-documents-into-the-index).
1. [**Search the data using raw sparse vector**](#step-2-search-the-data-using-a-sparse-vector). |
1. [**Search the data using raw sparse vector**](#step-2-search-the-data-using-a-sparse-vector).
1. [**Search the data using a sparse vector**](#step-2-search-the-data-using-a-sparse-vector)
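For reference, the search step in this list uses the `neural_sparse` query with the `query_tokens` parameter to pass a raw sparse vector directly (supported in recent OpenSearch versions). The index name, field name, and token weights below are illustrative:

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_tokens": {
          "hello": 1.1,
          "world": 0.9
        }
      }
    }
  }
}
```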
## Step 1: Ingest sparse vectors
Once you have generated sparse vector embeddings, you can ingest them into OpenSearch directly. |
"directly ingest"?
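To make the ingestion step concrete: a precomputed sparse vector is ingested with a plain index request in which the vector field holds a token-to-weight map. The index name, field names, and values here are illustrative:

```json
PUT /my-nlp-index/_doc/1
{
  "passage_text": "Hello world",
  "passage_embedding": {
    "hello": 1.42,
    "world": 1.17
  }
}
```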
### Step 1(a): Create an index
In order to ingest documents of raw sparse vectors, create a rank features index: |
Is "of" the right word here, or do we mean something like "containing" or "with"?
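For context on this step, a minimal rank features index mapping might look like the following; the index and field names are placeholders:

```json
PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
```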
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
Thank you, @zhichao-aws!
* refactor — Signed-off-by: zhichao-aws <[email protected]>
* fix — Signed-off-by: zhichao-aws <[email protected]>
* Doc review — Signed-off-by: Fanit Kolchina <[email protected]>
* Link fix — Signed-off-by: Fanit Kolchina <[email protected]>
* Apply suggestions from code review — Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
---------
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit ecd2232)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
We recently received feedback that it's hard to find information about using neural sparse search with raw sparse vectors in the existing documentation. In addition, the neural sparse search tutorial is not as well structured as the neural search tutorial.
This PR refactors the neural sparse search tutorial, mainly addressing the following points:
Adds a Set up an ML sparse encoding model section. It takes a reference from https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/#step-1-set-up-an-ml-language-model but includes more content about the two working modes of neural sparse search, plus a table showing the model combinations we offer in OpenSearch (we'll release v2 models soon, and presenting the combinations in a table will be clearer).
Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
Version
2.15, 2.16
Frontend features
If you're submitting documentation for an OpenSearch Dashboards feature, add a video that shows how a user will interact with the UI step by step. A voiceover is optional.
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.