This sample application is a demo of how one can build search suggestions from a document corpus. It uses documents from Vespa Documentation and extracts terms and phrases. Prefix match is used, so suggestions are shown as the user types.
This sample application is also deployed for vespa-documentation-search, see schema. Note an enhancement to this sample app:
field terms type array<string> {
    indexing: summary | attribute
    attribute: fast-search
}
This solves the problem of prefix searching every term in a phrase. Example: a user searching for "rank" should get a suggestion for "learning to rank". Hence, the script generating suggestions should create something like:
"update": "id:term:term::learning/to/rank",
"create": true,
"fields": {
"term": { "assign": "learning to rank" },
"terms": { "assign": ["learning to rank", "to rank", "rank"] },
An alternative is to use the term field as input to a field defined outside the document block, and split on space:
field terms_from_term type array<string> {
    indexing: input term | trim | split " +" | attribute
    attribute: fast-search
}
With this, prefix queries will hit "inside" phrases, too.
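As a rough illustration of what a generating script could do for the `terms` field, the sketch below builds the phrase suffixes and wraps them in an update operation shaped like the one shown earlier. The function names are hypothetical, the document id scheme (words joined by `/`) is inferred from the single example above, and any further fields of the real update are left out.

```python
import json


def phrase_suffixes(phrase: str) -> list[str]:
    """Return the phrase and every suffix that starts at a word boundary."""
    words = phrase.split()
    return [" ".join(words[i:]) for i in range(len(words))]


def build_update(phrase: str) -> dict:
    """Build an update operation for one suggestion phrase.

    Assumption: the id is the phrase with spaces replaced by '/',
    as in the "learning to rank" example.
    """
    words = phrase.split()
    return {
        "update": "id:term:term::" + "/".join(words),
        "create": True,
        "fields": {
            "term": {"assign": " ".join(words)},
            "terms": {"assign": phrase_suffixes(phrase)},
        },
    }


print(json.dumps(build_update("learning to rank"), indent=2))
```

With the suffixes stored in an array attribute, a prefix query for "rank" or "to r" can match the document for "learning to rank".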
Another consideration is how to remove noise. A simple approach is to require at least two instances of a word in the corpus.
A simplistic ranking based on term frequencies is used - a real application could implement a more sophisticated ranking for better suggestions.
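The noise filter and the frequency-based ordering described above can be sketched in a few lines. This is only an illustration under simple assumptions: terms are lowercase alphanumeric tokens, the threshold of two occurrences is taken from the text, and the function names are made up for this example rather than taken from the sample app.

```python
import re
from collections import Counter


def count_terms(documents):
    """Count how often each lowercase alphanumeric token occurs in the corpus."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def suggestion_candidates(documents, min_count=2):
    """Drop terms seen fewer than min_count times, then sort by frequency.

    Ties are broken alphabetically so the order is deterministic.
    """
    counts = count_terms(documents)
    kept = [(term, n) for term, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


docs = ["Streaming search in Vespa", "Vespa streaming mode", "Rank profiles"]
print(suggestion_candidates(docs))
# prints [('streaming', 2), ('vespa', 2)]
```

The resulting count could be stored in a numeric field and used by a rank profile, which is one simple way to realise a term-frequency ranking.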
For short inputs, a trick is to use range queries with hitLimit on a fast-search attribute. This changes the semantics of the prefix query to only match against documents in the top 1K, which is usually what one wants for short prefix lengths.
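A possible shape for such a query is sketched below as a small helper that builds the YQL string. It assumes a numeric fast-search attribute holding the term frequency; the name `corpus_count` is a placeholder, not necessarily a field in this sample app, and the limit of 1000 mirrors the "top 1K" mentioned above. The annotation syntax follows the style of the query examples in this document; check it against the YQL reference before relying on it.

```python
def short_prefix_yql(prefix: str, rank_field: str = "corpus_count", hit_limit: int = 1000) -> str:
    """Build a prefix query restricted to the top hit_limit documents by rank_field.

    rank_field is a hypothetical numeric attribute with fast-search enabled.
    """
    # Escape backslashes and double quotes so the prefix stays inside the YQL string literal.
    escaped = prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "select documentid,term from sources term where "
        f'term contains ([{{"prefix":true}}]"{escaped}") and '
        f'([{{"hitLimit":{hit_limit},"descending":true}}]range({rank_field},0,Infinity))'
    )


print(short_prefix_yql("st"))
```

The range clause keeps roughly the `hit_limit` documents with the highest values of the attribute, and the prefix match is then evaluated only against that subset.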
Requirements:
- Docker Desktop installed and running. 4GB available memory for Docker is recommended. Refer to Docker memory for details and troubleshooting.
- Alternatively, deploy using Vespa Cloud
- Operating system: Linux, macOS or Windows 10 Pro (Docker requirement)
- Architecture: x86_64 or arm64
- Homebrew to install Vespa CLI, or download a Vespa CLI release from GitHub releases.
- Java 17 installed.
- Apache Maven. This sample app uses custom Java components, and Maven is used to build the application.
Validate the environment; the memory available to Docker must be at least 4GB:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using docker image:
$ vespa config set target local
Pull and start the vespa docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa
Download this sample application:
$ vespa clone incremental-search/search-suggestions myapp && cd myapp
Build the application package:
$ mvn clean package -U
Verify that configuration service (deploy api) is ready:
$ vespa status deploy --wait 300
Deploy the application:
$ vespa deploy --wait 300
It is possible to deploy this app to Vespa Cloud.
Wait for the application endpoint to become available:
$ vespa status --wait 300
Feed the example documents:
$ while read -r line; do echo "$line" > tmp.json; vespa document tmp.json; done < example_feed.jsonl
Check the website, write queries and view suggestions. Open http://localhost:8080/site/ in a browser:
$ curl -s http://localhost:8080/site/
Do a prefix query using YQL contains with the prefix annotation:
$ vespa query 'yql=select documentid,term from sources term where term contains ([{"prefix":true}]"stre");'
YQL with userQuery() and simple query language:
- Note: The term field is defined as a field in the default fieldset
$ vespa query 'yql=select documentid,term from sources term where userQuery()' 'query=str*'
YQL with userInput() and simple query language:
$ vespa query 'yql=select documentid,term from sources term where ([{"defaultIndex":"default"}]userInput(@query))' 'query=str*'
- Note: with userInput, the defaultIndex has to be set; it can be a field or a fieldset
Using regular expression YQL with matches instead of contains:
$ vespa query 'yql=select documentid,term from sources term where term matches "stre"'
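The same prefix query can also be sent straight to the HTTP query API, which is what a web frontend would typically do. The sketch below assumes the local container from the steps above is listening on port 8080 and that results come back in Vespa's default JSON format; the helper names are hypothetical, and the prefix is assumed not to contain double quotes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def prefix_query_url(prefix, endpoint="http://localhost:8080/search/", hits=10):
    """Build the query API URL for a prefix query on the term field."""
    yql = (
        "select documentid,term from sources term "
        'where term contains ([{"prefix":true}]"%s")' % prefix
    )
    return endpoint + "?" + urlencode({"yql": yql, "hits": hits})


def fetch_suggestions(prefix):
    """Run the query and return the matching terms (needs a running container)."""
    with urlopen(prefix_query_url(prefix)) as response:
        result = json.load(response)
    children = result.get("root", {}).get("children", [])
    return [hit["fields"]["term"] for hit in children if "term" in hit.get("fields", {})]


print(prefix_query_url("stre"))
# With the container running: fetch_suggestions("stre")
```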
Shut down and remove the docker container:
$ docker rm -f vespa
Indexed prefix search matches documents where the prefix of the term matches the query terms.
To do an indexed prefix search, the query needs [{"prefix":true}], see example. It is important to note that this type of prefix search is not supported for fields set to index in the schema. Therefore, all fields used for prefix search have to be attributes.
Indexed prefix search is faster than using streaming search, and is also more suitable for situations where multiple concurrent queries might occur and performance is important.
In this sample application indexed prefix search is used to implement search suggestions based on users' previous queries. When user input is stored as documents, some queries will not be suitable as suggestions. A way to remedy this is to filter out user queries that contain terms on a block list, or to keep only queries whose terms are in a set of accepted terms. In this sample application a document processor is used to filter out such queries during document feeding. For demonstration purposes we have implemented both a block list and a set of accepted terms, but usually only one of these is needed.
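The filtering logic itself is small. The sketch below shows one way to combine a block list with a set of accepted terms; it is written in Python for brevity and is not the Java document processor shipped with the sample app. The function name is hypothetical, and the exact semantics (reject if any term is blocked, accept only if every term is accepted) are an assumption.

```python
def is_suitable(query, block_list=frozenset(), accepted_terms=None):
    """Decide whether a user query may be stored as a suggestion.

    block_list: terms that disqualify a query when present.
    accepted_terms: if given, every term of the query must be in this set.
    """
    terms = query.lower().split()
    if not terms:
        return False  # empty queries are never useful suggestions
    if any(term in block_list for term in terms):
        return False
    if accepted_terms is not None and any(term not in accepted_terms for term in terms):
        return False
    return True


blocked = {"password"}
accepted = {"learning", "to", "rank", "streaming", "search"}
print(is_suitable("learning to rank", blocked, accepted))   # True
print(is_suitable("my password", blocked))                  # False
print(is_suitable("streaming sarch", blocked, accepted))    # False
```

A block list is easy to maintain but lets misspellings through, while a set of accepted terms is stricter and needs a vocabulary, for example one extracted from the document corpus.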