Use BM25 for simple search #158

lefnire · 2023-06-24T19:23:22Z

The search bar currently does the following:

If the user typed fewer than n_words words (currently 3, I think), do a literal search: entries.filter(e => e.texts.toLowerCase().includes(search.toLowerCase()))
If the user typed more than n_words words, do a cosine similarity search using sentence_transformers.util.semantic_search.
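To make the two branches concrete, here is a rough Python sketch of that dispatch. It is not the actual Gnothi code: the names `literal_search`, `search`, and the injected `semantic_search` callable are hypothetical, the threshold of 3 is the uncertain value mentioned above, and treating a query of exactly n_words words as semantic is an assumption, since the description leaves that case open.

```python
from typing import Callable, List, Sequence

# Threshold from the description above; the exact value is uncertain.
N_WORDS = 3


def literal_search(query: str, texts: Sequence[str]) -> List[int]:
    """Return indices of entries containing the query, case-insensitively."""
    needle = query.lower()
    return [i for i, text in enumerate(texts) if needle in text.lower()]


def search(
    query: str,
    texts: Sequence[str],
    semantic_search: Callable[[str, Sequence[str]], List[int]],
    n_words: int = N_WORDS,
) -> List[int]:
    """Short queries use substring matching; longer ones go to semantic search.

    `semantic_search` is a stand-in for whatever embedding-based ranking
    is in use; it is injected so this sketch has no heavy dependencies.
    """
    if len(query.split()) < n_words:
        return literal_search(query, texts)
    # Assumption: queries of exactly n_words words also take this branch.
    return semantic_search(query, texts)
```

The literal branch is the one this ticket proposes replacing with BM25, since substring matching returns unranked hits and misses multi-word queries whose words are not adjacent in the entry.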

I was previously using Haystack for this, which has a few conditionals that would make this ticket easier. But the project turned out to be a lot beefier than I anticipated, for too little payoff, so I went back to basics. So, if we want a more sophisticated search pipeline, we can consider:

Try Haystack or Jina again. I'd lean towards Jina, they're making strides.
Consider using a Vector Database. This would be a re-architecting of the current system, which uses PyArrow with vectors stored on S3 as Parquet files. My solution is infinitely scalable and cheap. But a true Vector Database would add a lot of utility, like BM25 searching and ML utilities such as sentiment analysis and question-answering. It would require a Docker setup and ongoing maintenance, though.
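For a sense of scale, BM25 itself is small enough to run in-process without a vector database. The following is a minimal pure-Python sketch of Okapi BM25 over entry texts, not a proposed implementation: the `BM25` class and `tokenize` helper are hypothetical names, the regex tokenizer is deliberately naive, `k1=1.5` and `b=0.75` are common defaults rather than tuned values, and the IDF uses the always-non-negative variant `log(1 + (N - df + 0.5) / (df + 0.5))`.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase word characters only (an assumption)."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 index over a fixed list of texts."""

    def __init__(self, texts: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        # Per-document term frequencies and lengths.
        self.docs = [Counter(tokenize(t)) for t in texts]
        self.lengths = [sum(d.values()) for d in self.docs]
        avg = sum(self.lengths) / len(self.lengths) if self.docs else 0.0
        self.avg_len = avg or 1.0  # avoid division by zero on empty corpora
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in d)
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query: str, index: int) -> float:
        doc, length = self.docs[index], self.lengths[index]
        # Length normalization: longer-than-average docs are penalized.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query: str, top_k: int = 5) -> List[int]:
        """Return indices of matching documents, best score first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        hits = [(s, i) for s, i in scored if s > 0]
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in hits[:top_k]]
```

Usage would look like `BM25(entry_texts).search("work deadline")`, returning entry indices ranked by relevance. The index is rebuilt from scratch here, which is likely fine for one user's entries but would need caching or persistence at larger scale.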

lefnire added the 🔍Search (Search and question-answering) label Jun 24, 2023

lefnire added this to Gnothi Jun 24, 2023

github-project-automation bot moved this to Next in Gnothi Jun 24, 2023

lefnire moved this from Next to Later in Gnothi Jun 24, 2023
