Use Opensearch as the tool for searching posts #296

Open
wants to merge 1 commit into
base: main

Willdotwhite
Collaborator

@Willdotwhite Willdotwhite commented Apr 18, 2024

Overview

The core details are:

  • the previous searching tool was extremely naive, entirely using String.contains
  • there was no spell correction, nor an easy way to add it in
  • results came back in an arbitrary order and weren't easily scored (read: we weren't showing the most relevant post first)
  • we had little flexibility to mark some fields as 'nice to have' rather than 'required'; having that option would let us fine-tune searching

A search engine ('SE' for short, used in the codebase) is a much more appropriate tool to use for searching and scoring documents. This change migrates the searching, scoring, and ranking logic to Opensearch to return a list of ordered SearchItem instances, which we then use to return a list of ordered PostItem instances to the user.

Why Opensearch?

Opensearch is an open source search engine forked from Elasticsearch, so it is more approachable to developers already familiar with ES than something like Solr would be. Although it's not the most straightforward app to run from Docker, it's not too painful to set up locally.

We could use any search engine here, as we're not doing anything complex enough to strain the boundaries of one particular tool.

How are we using Opensearch?

The existing GET /posts?... URL stays the same. Instead of querying the DB with a whole bunch of boolean filters, we now query the SE with a combination of fields (details available on request). The SE returns the matching documents ordered from most to least relevant, and we return PostItems to the user in that same order.
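
As a rough illustration of the 'required' vs. 'nice to have' distinction, the sketch below builds an Opensearch `bool` query body as a JSON string: clauses under `must` have to match, while clauses under `should` only raise the relevance score. The field names (`description`, `skills`) and the helper names are hypothetical and not taken from this PR, which builds its query with the Java client instead.

```kotlin
// Quotes a value as a JSON string, escaping backslashes and double quotes.
fun jsonString(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

// Builds a single `match` clause; fuzziness "AUTO" lets the SE tolerate small typos.
fun matchClause(field: String, text: String, fuzzy: Boolean = false): String {
    val options = if (fuzzy) ",\"fuzziness\":\"AUTO\"" else ""
    return "{\"match\":{${jsonString(field)}:{\"query\":${jsonString(text)}$options}}}"
}

// Hypothetical helper: `required` clauses go under `must`, `niceToHave` under `should`.
fun buildPostQuery(required: List<String>, niceToHave: List<String>): String {
    val must = required.joinToString(",")
    val should = niceToHave.joinToString(",")
    return "{\"query\":{\"bool\":{\"must\":[$must],\"should\":[$should]}}}"
}
```

For example, `buildPostQuery(listOf(matchClause("description", "pixel art", fuzzy = true)), listOf(matchClause("skills", "ART")))` would require a description match and rank posts that also mention the skill higher.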

Why a SE and a DB at the same time?

The DB is a persistent data store for posts stored more-or-less exactly as the user uploaded their content, plus extra metadata for us (e.g. createdAt, updatedAt). The SE is a tool that transforms data into its most easily searched form, often by doing the compute work up front and storing the result on disk (trading storage space for search speed).

While we could use Opensearch as a persistent database, we'd need to maintain very clear boundaries between user data and indexed fields used for searching - one of the main benefits of using a search engine is its ability to manipulate data into fields for faster or more powerful searching. Storing these alongside the core user data has the potential to get a bit messy, and we'd be indexing data (such as reportCount) that would never be searched.

Code flow

For anything that doesn't involve searching, the code stays exactly the same:
image

For a search request the API hits the SE, the SE returns a list of PostItem IDs to the API, and the API loads those documents from the DB:
image
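
The step where the API loads documents from the DB needs some care, because a database lookup by a set of IDs generally does not promise to return documents in the order the IDs were given. A minimal sketch of restoring the SE's ranking is below; the `PostItem` here is a cut-down stand-in for the real class, and `orderByRanking` is a hypothetical name.

```kotlin
// Cut-down stand-in for the real PostItem; only the id matters for this sketch.
data class PostItem(val id: String, val description: String)

/**
 * Returns the posts in the same order as the ranked IDs from the search engine.
 * IDs with no matching post (e.g. deleted from the DB since indexing) are skipped.
 */
fun orderByRanking(rankedIds: List<String>, posts: List<PostItem>): List<PostItem> {
    val byId = posts.associateBy { it.id }
    return rankedIds.mapNotNull { byId[it] }
}
```

Skipping missing IDs rather than failing is one possible design choice; it keeps a search working if the index is briefly out of step with the DB.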

Creating, updating and deleting a post basically duplicates the existing flow for the DB onto the SE:
image
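
One way to picture the duplicated write flow is a thin wrapper that applies each write to the DB and then mirrors it onto the SE. This is only a sketch under assumed names (`PostStore`, `InMemoryStore`, `MirroredPostWriter` and the cut-down `Post` are all hypothetical); the PR's actual services may be structured differently.

```kotlin
// Cut-down stand-in for a post document.
data class Post(val id: String, val description: String)

// Hypothetical abstraction: both the DB and the SE accept the same write operations.
interface PostStore {
    fun upsert(post: Post)
    fun delete(id: String)
}

// In-memory implementation, used here in place of a real DB or SE client.
class InMemoryStore : PostStore {
    val items = mutableMapOf<String, Post>()
    override fun upsert(post: Post) { items[post.id] = post }
    override fun delete(id: String) { items.remove(id) }
}

// Writes go to the DB first, then are mirrored onto the SE.
class MirroredPostWriter(private val db: PostStore, private val se: PostStore) {
    fun save(post: Post) {
        db.upsert(post)
        se.upsert(post)
    }

    fun remove(id: String) {
        db.delete(id)
        se.delete(id)
    }
}
```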

// SE
implementation("org.apache.httpcomponents.core5:httpcore5:5.2.4")
implementation("org.apache.httpcomponents.core5:httpcore5-h2:5.2.4")
implementation("org.opensearch.client:opensearch-java:2.10.0")
Collaborator Author

The httpcore libraries provide HTTP request support for the Opensearch client, which makes requests over HTTP.

There is a range of Kotlin-specific Opensearch clients; I've not yet found one that wasn't a complete pig to set up or wasn't completely baffling to wrap my head around. I'm certain they exist, but didn't have any luck on a quick first pass.

Although the Java client is extremely verbose when building queries, I've used it for this first version because forming the initial connection to Opensearch is very concise, and query construction can be tidied up with helper methods if we want to go down that road.

configureInfraRouting()
configurePostRouting()
configureRequestHandling()
configureUserInfoRouting()
Collaborator Author

The infra routing is the only new thing here; I've just sorted the list alphabetically.

import org.litote.kmongo.eq

// TODO: Auth control
fun Application.configureInfraRouting() {
Collaborator Author

The /infra route is intended to provide simple setup tooling for an administrator to avoid us needing lots of helper scripts or ingrained knowledge on setting things up.

This route will load all posts from the DB and set up the posts index in the SE (including special mappings etc). I'd like to keep as much of this setup in code as possible, as we'll forget how to do it otherwise.

This routing block will likely use a secret configuration value (set as a runtime environment variable) as access control.
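
A minimal sketch of what that access check could look like is below. It is an assumption about a possible approach, not what this PR implements (the code still carries a `TODO: Auth control`); the environment variable name `INFRA_ADMIN_SECRET` and the function name `isAuthorised` are hypothetical.

```kotlin
import java.security.MessageDigest

// Hypothetical variable name; the PR does not say what the secret will be called.
const val ADMIN_SECRET_ENV = "INFRA_ADMIN_SECRET"

/**
 * Returns true only when a secret is configured and the caller supplied the same value.
 * The expected secret defaults to the runtime environment variable.
 */
fun isAuthorised(
    providedSecret: String?,
    expectedSecret: String? = System.getenv(ADMIN_SECRET_ENV)
): Boolean {
    // Deny by default when no secret is configured or none was provided.
    if (expectedSecret == null || expectedSecret.isEmpty() || providedSecret == null) return false
    // MessageDigest.isEqual is used instead of == as a precaution against timing differences.
    return MessageDigest.isEqual(providedSecret.toByteArray(), expectedSecret.toByteArray())
}
```

A route handler could call this with a value taken from a request header and respond with a 403 when it returns false.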

@@ -210,83 +189,3 @@ fun Application.configurePostRouting() {
}
}
}

fun getFilterFromParameters(params: Parameters): List<Bson> {
Collaborator Author

I cannot tell you how thrilled I am to have deleted all of this

Collaborator

Even more thrilled than I am to see it go 😁

Willdotwhite added a commit that referenced this pull request Apr 18, 2024
Full details and diagrams will be in the PR (#296), but the core
details are:
* the previous searching tool was extremely naive, entirely using `String.contains`
* there was no spell correction, nor easy way to add it in
* results came back in an arbitrary order and weren't easily scored

A search engine ('SE' for short, used in the codebase) is a much more
appropriate tool. This change migrates the searching, scoring, and ranking
logic to Opensearch to return a list of ordered SearchItem instances,
which we then use to return a list of ordered PostItem instances to the user.
Collaborator

@awildbrysen awildbrysen left a comment

I'm mainly reading through the code at this point. I've gone through about half of the files already but have to postpone the rest.
I've already left a few comments, mainly on Kotlin idioms.

api/src/main/kotlin/com/gmtkgamejam/search/Opensearch.kt (outdated, resolved)
Willdotwhite added a commit that referenced this pull request Apr 19, 2024
Collaborator

@awildbrysen awildbrysen left a comment

One more small suggested change

api/src/main/kotlin/com/gmtkgamejam/search/Opensearch.kt (outdated, resolved)
* SearchAsYouType is a pre-built field type that specialises in fast real time searching
* @see https://opensearch.org/docs/latest/field-types/supported-field-types/search-as-you-type/
*/
"description_shingle" to SearchAsYouTypeProperty.Builder().build()._toProperty()

What is a shingle? This name doesn't feel particularly descriptive or useful.

Collaborator Author

@Willdotwhite Willdotwhite Sep 27, 2024

"Shingle" is a technical term: it's basically the search engine term for a word-level n-gram. If a description is "This is my cool team", the shingles list will be something like:

This is, is my, my cool, cool team

The Wikipedia page on w-shingling is useful reading here
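
To make the idea concrete, here is a small sketch that produces two-word shingles like the ones listed above. It is an illustration only: the function name `shingles` is hypothetical, and the real field type does this work inside the search engine, where an analyzer may also normalise the text (for example by lowercasing it).

```kotlin
/**
 * Splits text on whitespace and returns every run of `size` consecutive words,
 * each joined back together with a single space.
 */
fun shingles(text: String, size: Int = 2): List<String> {
    require(size > 0) { "size must be positive" }
    val words = text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    // windowed() slides a fixed-size window over the words, one word at a time.
    return words.windowed(size) { it.joinToString(" ") }
}
```

Calling `shingles("This is my cool team")` yields `This is`, `is my`, `my cool`, `cool team`; a text with fewer words than `size` yields an empty list.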
