Disclaimer: I will quote historical people, so I give credits. "Render unto Caesar" as they say.
Dixit is an online app that boost your confidence: when you have a smart idea, it lets you see which historical people came upon this same smart idea. "Great minds think alike", as they say 🧐.
But the truth is, these particular great minds probably formulated the idea way better than you could. So why not quote them directly, and "stand on the shoulders of [these] giants"?
Dixit lets you type in your rough-edged idea, and through a "tremendous" "AI powered semantic search", it retrieves the quotes from its database of historical quotations that best rephrase this idea ✨.
The website is coded in Django, to make use of Python's powerful ML libraries. "A good sketch is worth a long speech", so here's a graph of the system design.
graph TD
subgraph GITHUB REPO
searchbar_page(<font color=black>Index / Search page<br>/User sentence/)
author_page(<font color=black>Specific Author page)
quotes_table{<font color=black>fa:fa-database Quotes<br>table<br>4 mo}
faiss_index{<font color=black>fa:fa-search FAISS Index<br>36 mo}
search_request[Embedded sentence]
searchbar_page-.LLM embedding-.->search_request
quotes_table-.Returns closest quotes-.->searchbar_page
search_request-.Similarity search-.->faiss_index
faiss_index -.Query quotes at selected indexes-.-> quotes_table
author_page --Gets author quotes--> quotes_table
end
subgraph POSTGRESQL DATABASE
authors_table{<font color=black>fa:fa-database Authors table}
end
%% Notice that no text in shape are added here instead that is appended further down
author_page--Get author data-->authors_table
classDef green fill:#9f6,stroke:#333,stroke-width:3px;
classDef orange fill:#f96,stroke:#333,stroke-width:3px;
class author_page,searchbar_page green
class quotes_table,faiss_index,authors_table orange
The continuous lines are actions performed by the Specific author page
, while the dotted lines are actions performed by the Index / Search page
. In particular, searching for a setnence triggers the following process:
- Query sentence is embedded by a Large Language Model from the Huggingface library into the vector space
- A similarity search based on Meta's library FAISS returns the indexes of the closest quotes from database (which means the databse of quotes has been previously embedded into vector space).
- The database of quotes is queried at these indexes, and the results are displayed to the user.
As you see above, the server is a clone of a Github repo that both hosts the FAISS index and the Quotes database: while this is certainly not the most storage-light option (and my git push
commands are painful), I have good reasons to do this.
My main constraints are: fast loading of the pages (my Railway hosting is a shared node), not eating up too much RAM (I don't want to pay), and being able to update the quotation database whenever I have new quotes.
- First, why separate Quotes database and FAISS index instead of using a single
Huggingface dataset
with an associated index? Well, thisdataset
class seems to have an implementation flaw preventing one to properly save a dataset with its index: I can either store it without index, then have to recompute the index on boot, which makes the page too slow, or pickle it but it's too heavy an object for my RAM. So I chose to keep a separate FAISS index which can be loaded up very quickly. - But I want to keep them closeby, since whenever I add/remove a line in the quote database, I have to change the associated index with it.
- Why host the FAISS Index on Github repo: you can't perform FAISS search in a POSTGRESQL table, and remote hosting (like Pinecone) would add too much complexity to the system. Here, the index is light, so I kept things simple.
- Why host the Quotes database on Github repo: this allows quick versioning in case I want to add new quotations to the database + it's very light compared to the index (4mo vs 40mo), so it's not a huge burden in memory.
On the other hand, I hosted the authors
table in the POSTGRESQL database proposed for free by Railway (❤), since this table won't change a lot, and this format makes it easily hostable / queriable.
- Website is hosted on the flawless Railway. This provider gives immaculate service so far 🌟 + it's very inexpensive.
- Analytics: Goatcounter. I highly recommend it: it's GDPR compliant, uses no cookies, and is easy to setup as a single script in your
html
header. - Domain name registration: Google Domains. Obtaining a domain name works like a breeze with a Google account.
- DNS provider: Cloudflare. But then to work with Railway, I had to follow this tutorial with Cloudflare as a DNS provider. One little tweak to note: I had to change the SSL/TLS settings to "Full" encryption in Cloudflare for it to properly work.
- Database of quotes has 24,000 quotes. This is a lot if you consider the fact that they're mostly intelligent sentences and not randomly picked, but its's not yet enough to capture all ideas. I'll add many more in the months to come.
- Due to high noise in our databases for recent quotes, I only kept the quotes from older authors, since I needed an automatable criterion and I think time helps a lot to separate the wheat from the chaff. But still, you're very welcome to also propose good recent quotes! (On a purely subjective basis)
- The LLM model used to embed user sentences into the search space is inherently limited in its representation: for instance, sometimes it can misinterpret the semantics of a query. But I am confident that as LLMs improve on semantic search, the results will become even more accurate.
The site is available at dixit.app. You're welcome to try it out, and leave me some feedback! 😃