Use Opensearch as the tool for searching posts #296

Willdotwhite · 2024-04-18T09:11:39Z

Overview

The core details are:

* the previous searching tool was extremely naive, entirely using `String.contains`
* there was no spell correction, nor an easy way to add it in
* results came back in an arbitrary order and weren't easily scored (read: we weren't showing the most relevant post first)
* we had little flexibility on making some fields 'nice to have' vs. 'required'; having the option could let us create more fine-tuned searching

A search engine ('SE' for short, used in the codebase) is a much more appropriate tool to use for searching and scoring documents. This change migrates the searching, scoring, and ranking logic to Opensearch to return a list of ordered SearchItem instances, which we then use to return a list of ordered PostItem instances to the user.
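As a rough illustration of the gap described in the list above, here is a minimal in-memory sketch contrasting a substring filter with a search that separates 'required' from 'nice to have' terms and sorts by score. The `Doc` class, both function names, and the one-point-per-term scoring are assumptions made up for this example; Opensearch's own relevance scoring is considerably more sophisticated, and this is not the code in the PR.

```kotlin
// Hypothetical, simplified stand-in for a post; not the project's PostItem.
data class Doc(val id: String, val description: String)

// Naive approach: substring filter only; results keep whatever order they arrived in.
fun naiveSearch(docs: List<Doc>, term: String): List<Doc> =
    docs.filter { it.description.contains(term, ignoreCase = true) }

// Scored approach: every required term must match, and each optional term that
// matches adds one point. Results are sorted with the highest score first.
fun scoredSearch(docs: List<Doc>, required: List<String>, optional: List<String>): List<Doc> =
    docs
        .filter { doc -> required.all { doc.description.contains(it, ignoreCase = true) } }
        .map { doc -> doc to optional.count { doc.description.contains(it, ignoreCase = true) } }
        .sortedByDescending { (_, score) -> score }
        .map { (doc, _) -> doc }
```

With a required term of "artist" and optional terms "unity" and "godot", a post mentioning both engines is returned ahead of one that only mentions "artist", whereas the naive filter returns them in input order.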

Why Opensearch?

Opensearch is an open source search engine forked from Elasticsearch, so it is more approachable to developers familiar with ES than something like Solr. Although it's not the most straightforward app to run from Docker, it's not too painful to set up locally.

We could use any search engine here, as we're not doing anything complex enough to strain the boundaries of one particular tool.

How are we using Opensearch?

The existing GET /posts?... URL stays the same, but instead of querying the DB with a whole bunch of boolean filters we now query the SE with a combination of fields (details available on request). The SE returns the matching documents ordered from most to least relevant, and we return PostItems to the user in that same order.

Why a SE and a DB at the same time?

The DB is a persistent data store for posts stored more-or-less exactly as the user uploaded their content, plus extra metadata for us (e.g. createdAt, updatedAt). The SE is a tool that manipulates data into the most easily searched form, often by doing compute work up front and storing the result on disk (trading storage space for search speed).

While we could use Opensearch as a persistent database, we'd need to maintain very clear boundaries between user data and indexed fields used for searching - one of the main benefits of using a search engine is its ability to manipulate data into fields for faster or more powerful searching. Storing these alongside the core user data has the potential to get a bit messy, and we'd be indexing data (such as reportCount) that would never be searched.

Code flow

For anything that doesn't involve searching, the code stays exactly the same:

For a search request the API hits the SE, the SE returns a list of PostItem IDs to the API, and the API loads those documents from the DB:
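A minimal sketch of the "IDs from the SE, documents from the DB" step, assuming the DB returns the matching documents in no particular order. `LoadedPost` and `orderByRelevance` are hypothetical names for illustration only, not the types or helpers used in this PR.

```kotlin
// Hypothetical minimal shape; the real PostItem has many more fields.
data class LoadedPost(val id: String, val description: String)

// Re-orders DB results to match the relevance order returned by the search engine.
// IDs that the DB no longer has (e.g. deleted since indexing) are skipped.
fun orderByRelevance(rankedIds: List<String>, loaded: Collection<LoadedPost>): List<LoadedPost> {
    val byId = loaded.associateBy { it.id }
    return rankedIds.mapNotNull { byId[it] }
}
```

Building the lookup map once keeps the re-ordering linear in the number of results, and silently dropping unknown IDs is one possible way to tolerate the SE briefly lagging behind the DB.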

Creating, updating and deleting a post basically duplicates the existing flow for the DB onto the SE:
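One way to picture the duplicated write flow is the sketch below. The `PostStore` interface, the in-memory implementation, and the choice to write to the DB before the SE are all assumptions for this example rather than the structure used in the PR.

```kotlin
// Hypothetical interface standing in for both the DB layer and the SE layer.
interface PostStore {
    fun upsert(id: String, body: String)
    fun delete(id: String)
}

// Simple in-memory implementation so the sketch can run without any services.
class InMemoryStore : PostStore {
    val items = mutableMapOf<String, String>()
    override fun upsert(id: String, body: String) { items[id] = body }
    override fun delete(id: String) { items.remove(id) }
}

// Applies each write to the DB first (assumed here to be the source of truth),
// then mirrors the same operation onto the SE.
class MirroredPostWriter(private val db: PostStore, private val se: PostStore) {
    fun save(id: String, body: String) {
        db.upsert(id, body)
        se.upsert(id, body)
    }

    fun remove(id: String) {
        db.delete(id)
        se.delete(id)
    }
}
```

This sketch does not handle the case where the second write fails; since the /infra route described in this PR can rebuild the index from the DB, a full reindex is one plausible recovery path.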

Willdotwhite · 2024-04-18T09:16:44Z

api/build.gradle.kts

+    // SE
+    implementation("org.apache.httpcomponents.core5:httpcore5:5.2.4")
+    implementation("org.apache.httpcomponents.core5:httpcore5-h2:5.2.4")
+    implementation("org.opensearch.client:opensearch-java:2.10.0")


The httpcore libraries provide HTTP request support for the Opensearch client, which makes requests over HTTP.

There are a range of Kotlin-specific Opensearch clients; I've not yet found one that wasn't a complete pig to set up or wasn't completely baffling to wrap my head around. I'm certain they exist, but didn't have any luck on a quick first pass.

Although the Java client is extremely verbose, I've used it for this first version because forming the initial connection to Opensearch is very concise, and query construction can be tidied up with helper methods if we want to go down that road.

Willdotwhite · 2024-04-18T09:17:15Z

api/src/main/kotlin/com/gmtkgamejam/Application.kt

+    configureInfraRouting()
+    configurePostRouting()
+    configureRequestHandling()
+    configureUserInfoRouting()


The infra routing is the only new thing here; I've just sorted the list alphabetically.

Willdotwhite · 2024-04-18T09:19:38Z

api/src/main/kotlin/com/gmtkgamejam/routing/InfraRoutes.kt

+import org.litote.kmongo.eq
+
+// TODO: Auth control
+fun Application.configureInfraRouting() {


The /infra route is intended to provide simple setup tooling for an administrator to avoid us needing lots of helper scripts or ingrained knowledge on setting things up.

This route will load all posts from the DB and set up the posts index in the SE (including special mappings etc). I'd like to keep as much of this setup in code as possible, as we'll forget how to do it otherwise.

This routing block will likely use a secret configuration value (set as a runtime environment variable) as access control.
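A minimal sketch of what a shared-secret check could look like, assuming the caller passes the secret with the request and the expected value is read from the environment. The function name is hypothetical, and how the value would be wired into the route is left open.

```kotlin
import java.security.MessageDigest

// Returns true only when a secret is configured and the caller supplied the same value.
// MessageDigest.isEqual is used instead of == so that the comparison is less likely
// to leak information through timing differences.
fun isInfraRequestAllowed(providedSecret: String?, configuredSecret: String?): Boolean {
    // Fail closed: with no configured secret, nobody gets access.
    if (configuredSecret.isNullOrBlank() || providedSecret == null) return false
    return MessageDigest.isEqual(
        providedSecret.toByteArray(Charsets.UTF_8),
        configuredSecret.toByteArray(Charsets.UTF_8)
    )
}
```

A caller might pass something like `System.getenv("INFRA_SECRET")` as the second argument; that variable name is made up for this example.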

Willdotwhite · 2024-04-18T09:20:05Z

api/src/main/kotlin/com/gmtkgamejam/routing/PostRoutes.kt

@@ -210,83 +189,3 @@ fun Application.configurePostRouting() {
        }
    }
 }
-
-fun getFilterFromParameters(params: Parameters): List<Bson> {


I cannot tell you how thrilled I am to have deleted all of this

Full details and diagrams will be in the PR (#296), but the core details are:

* the previous searching tool was extremely naive, entirely using `String.contains`
* there was no spell correction, nor easy way to add it in
* results came back in an arbitrary order and weren't easily scored

A search engine ('SE' for short, used in the codebase) is a much more appropriate tool. This change migrates the searching, scoring, and ranking logic to Opensearch to return a list of ordered SearchItem instances, which we then use to return a list of ordered PostItem instances to the user.

awildbrysen

Mainly reading through the code at this point; I've gone through about half of the files already but have to postpone the rest.
Already left a few comments, mainly on idiomatic Kotlin.

api/src/main/kotlin/com/gmtkgamejam/ApplicationCallExtensions.kt

awildbrysen · 2024-04-18T18:35:09Z

api/src/main/kotlin/com/gmtkgamejam/routing/PostRoutes.kt

Even more thrilled than I am to see it go 😁

api/src/main/kotlin/com/gmtkgamejam/search/Opensearch.kt

awildbrysen

One more small suggested change

api/src/main/kotlin/com/gmtkgamejam/search/Opensearch.kt

DomHarris · 2024-09-27T11:12:40Z

api/src/main/kotlin/com/gmtkgamejam/search/OpensearchClusterConfigurer.kt

+             * SearchAsYouType is a pre-built field type that specialises in fast real time searching
+             * @see https://opensearch.org/docs/latest/field-types/supported-field-types/search-as-you-type/
+             */
+            "description_shingle" to SearchAsYouTypeProperty.Builder().build()._toProperty()


What is a shingle? This name doesn't feel particularly descriptive or useful.

Shingle is a technical term: it's basically the search engine term for a word-level n-gram. If a description is "This is my cool team", the list of two-word shingles will be something like:

This is, is my, my cool, cool team

The Wikipedia page on w-shingling is useful reading here
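To make the idea concrete, here is a small sketch that produces word shingles from a string. It assumes simple whitespace tokenisation and a hypothetical `shingles` helper; the search engine's own analysis pipeline will typically do more (lowercasing, for instance), so treat this as an illustration of the concept rather than what Opensearch does internally.

```kotlin
// Splits text on whitespace and returns every run of `size` consecutive words.
fun shingles(text: String, size: Int = 2): List<String> {
    require(size >= 1) { "size must be at least 1" }
    return text.trim()
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }
        .windowed(size)
        .map { it.joinToString(" ") }
}
```

Calling `shingles("This is my cool team")` gives the four two-word shingles listed above, and passing `size = 3` gives "This is my", "is my cool", "my cool team".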

Willdotwhite added the 🍩 Don't Merge Yet 🍩 label Apr 18, 2024

Willdotwhite force-pushed the feature/search-engine branch from b8f841f to 137509b Compare April 18, 2024 10:02

awildbrysen reviewed Apr 18, 2024

Willdotwhite force-pushed the feature/search-engine branch from 137509b to 9cceeb5 Compare April 19, 2024 07:28

Willdotwhite force-pushed the feature/search-engine branch from 9cceeb5 to 0106d4b Compare April 19, 2024 07:32

Willdotwhite force-pushed the feature/search-engine branch from 0106d4b to a8dc357 Compare April 19, 2024 07:34

Willdotwhite force-pushed the feature/search-engine branch from a8dc357 to 36e2ee0 Compare April 19, 2024 07:56

Willdotwhite force-pushed the feature/search-engine branch from 36e2ee0 to f931647 Compare April 19, 2024 15:47

awildbrysen reviewed Apr 22, 2024

Willdotwhite force-pushed the feature/search-engine branch from f931647 to f2a81b9 Compare April 22, 2024 16:21

awildbrysen approved these changes Apr 27, 2024

DomHarris reviewed Sep 27, 2024
