Chat with your website functionality #96
I've experimented with https://github.com/imartinez/privateGPT, and saw there is a possibility of feeding the text returned by a vector search as context into a self-hosted GPT4All model to return answers to questions. Challenges would include:
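For illustration, the general privateGPT-style flow would look something like the sketch below; the GPT4All model file and the retrieval helper are placeholders rather than anything implemented here:

```python
# Rough sketch of the privateGPT-style flow: run a vector search, then pass the
# retrieved text as context to a locally hosted GPT4All model.
# The model file name and the retrieve_context() helper are placeholders.
from gpt4all import GPT4All

def retrieve_context(question: str) -> str:
    # In the real system this would be the vector search step returning the most
    # relevant text fragments; hard-coded here to keep the sketch runnable.
    return "Ben Nevis is the highest mountain in Scotland, at 1,345 metres."

def answer(question: str) -> str:
    context = retrieve_context(question)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # any locally downloaded GPT4All model
    return model.generate(prompt, max_tokens=256)

print(answer("How high is Ben Nevis?"))
```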
Planning to implement the vector search first, before starting on the question answering. Hence I've created a new issue, #99, for the vector search implementation, and renamed this one to "Chat with your website functionality".
Now that the #99 vector search implementation is working, and the server has been upgraded from 4Gb to 8Gb of RAM as per #110, I've started taking a look at this again. I have a basic Retrieval Augmented Generation setup working on dev, using the 7B parameter Llama 2 chat model quantised down to 3 bit, which is about as low as you can go. On my 7 year old dev machine, context fragments were returned nearly instantly and answers took 30-60s to generate, which is not necessarily too slow. It takes around 4.8Gb of RAM, which will be a bit of a struggle to fit on the production server, but not necessarily out of the question either. Results were at times surprisingly good, e.g. Question: How long does it take to climb Ben Nevis? Getting a simple demo working on dev is one thing, but getting it production ready (e.g. able to work with more than one user at a time, integrating into the existing non-async Flask templates, etc.) is something else entirely.
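The comment above doesn't say which runtime loads the quantised model, so purely as an illustration, a 3-bit quantised build of llama-2-7b-chat can be run on CPU with llama-cpp-python along these lines (the model file name and generation settings are assumptions):

```python
# Illustrative only: running a 3-bit quantised llama-2-7b-chat build on CPU
# with llama-cpp-python. The model path is an assumption.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q3_K_S.gguf",  # ~3 bit quantisation
    n_ctx=2048,    # context window for the prompt plus retrieved fragments
    n_threads=2,   # match the 2 vCPUs on the production server
)

prompt = (
    "Answer using the context provided.\n"
    "Context: ...\n"
    "Question: How long does it take to climb Ben Nevis?\nAnswer:"
)
output = llm(prompt, max_tokens=200, stop=["Question:"])
print(output["choices"][0]["text"])
```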
I've deployed a version to production for early testing, although I haven't put a link to it anywhere because it isn't ready for wider testing just yet. It is using llama-2-7b-chat quantised down to 3 bit, with TorchServe as the model server. I've written a post at https://michael-lewis.com/posts/vector-search-and-retrieval-augmented-generation/ with more information on LLMs, Retrieval Augmented Generation, TorchServe etc. Will update further after testing, and if all goes well will open it up for wider use.
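For reference, TorchServe exposes its inference API as a REST endpoint (port 8080 by default), so the Flask app can call the deployed model with a plain HTTP request. The model name and request payload below are assumptions, since they depend on the custom handler packaged with the model archive:

```python
# Minimal sketch of calling a model deployed on TorchServe from the Flask app.
# "llama2chat" and the JSON payload shape are assumptions; only the
# /predictions/<model_name> endpoint on port 8080 is TorchServe's default.
import requests

def generate_answer(question: str, context: str) -> str:
    response = requests.post(
        "http://localhost:8080/predictions/llama2chat",
        json={"question": question, "context": context},
        timeout=120,  # generation can take tens of seconds on CPU
    )
    response.raise_for_status()
    return response.text
```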
I've swapped from the 7B parameter Llama 2 chat model quantised down to 3 bit, which was too slow, to the 3B parameter Rocket model quantised down to 4 bit. In summary: the source reference link is returned super quickly, some of the generated content is excellent, from a memory perspective it looks viable, and from a CPU and overall response time perspective it might be viable but needs further testing, especially when the indexing is running. The main issue now is that the vector search results are quite poor, so the LLM is given poor context, which means it mostly can't answer the question even though it should have been able to. The workaround for now is to restrict to the site you are interested in querying via the domains selector below the "Ask a question" box. To get this far, I've encountered and resolved (or partially resolved) the following issues:
Open issues are:
Regarding the surprisingly poor quality results: with sentence-transformers/all-MiniLM-L6-v2, "How high is Ben Nevis?" gives a similarity score of 0.3176 against text about mountains containing the words "Ben Nevis" and its height, but a higher score of 0.4072 against some text about someone called Benjamin talking about someone down a well. Similarly, "Can you summarize Immanuel Kant's biography in two sentences?" gives a similarity score of 0.5178 against text containing "Immanuel Kant" and some details of his life, but a higher score of 0.5766 against just the word "Biography". You can test via:
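A minimal script along these lines reproduces the comparison (the candidate texts below are illustrative placeholders, not the actual indexed content):

```python
# Compare a query against candidate texts with cosine similarity, using the
# same embedding model as above. Candidate texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "How high is Ben Nevis?"
candidates = [
    "Ben Nevis is the highest mountain in Scotland, at 1,345 metres.",
    "Benjamin said someone had fallen down the well.",
]

query_embedding = model.encode(query)
candidate_embeddings = model.encode(candidates)
scores = util.cos_sim(query_embedding, candidate_embeddings)
for text, score in zip(candidates, scores[0]):
    print(f"{float(score):.4f}  {text}")
```

Swapping in a different embedding model for comparison is then just a change to the model name.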
I've tested some of the alternative models on the leaderboard at https://huggingface.co/spaces/mteb/leaderboard, and switched to BAAI/bge-small-en-v1.5 because it gives better results (including the expected ones in the examples above) and doesn't take much more memory or CPU. It'll take 7 days for all of the full listings to be reindexed with the new embedding model, and 28 days for all of the basic listings to be reindexed, so it should be ready for testing on production in around 7 days.
As per the comments in #85 and #84, I'd like to experiment with Solr 9's new vector search (the DenseVectorField fieldType and the K-Nearest-Neighbor query parser).
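As a rough sketch of what that might look like (the collection name, vector field name and embedding dimensions below are assumptions, not the actual schema):

```python
# Sketch of a K-Nearest-Neighbor query against Solr 9, assuming the schema has a
# DenseVectorField called "content_vector" holding 384-dimensional embeddings.
# The collection name ("content") and field name are assumptions.
import requests
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vector = model.encode("How high is Ben Nevis?").tolist()

params = {
    "q": "{!knn f=content_vector topK=10}" + str(query_vector),
    "fl": "url,title,score",
}
results = requests.get("http://localhost:8983/solr/content/select", params=params).json()
for doc in results["response"]["docs"]:
    print(doc["score"], doc.get("url"))
```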
Vector search works best on longer phrases, while keyword search works best on specific search terms, and I'm not sure how best to combine the two (I know there are various proposals for hybrid search models, but I'm not sure there are any best practices yet), so the simplest option for now is a separate vector search page. Given the longer phrase input, and the familiarity many people have with things like ChatGPT, it would make sense to have a chat-like interface.
This could be accessed via a new link below the main search box, to the left of "Browse Sites", called e.g. "Chat Search". This would take you to a page with a larger chat box, where you can (in effect) ask a question about content in the searchmysite.net index, and get a link back, and maybe even a summary of the relevant part of the page.
A quick rough estimate suggests I could use a paid-for Large Language Model (LLM) API like OpenAI for the content embeddings for about US$25 a month, which would probably be doable. The issue is that it would also need matching query embeddings and potentially summarisation API calls, which could work out at up to US$0.05 per question, and given I can have over 160,000 searches per day from (unblockable) SEO spam bots, that could be up to around US$8,000 a day, so I don't want the financial risk of using a paid-for API. That means I'll need to use open source language models that are self-hostable on the relatively low spec hardware I'm currently using (2 vCPUs and 4Gb RAM).
Results therefore won't be anywhere near as good as ChatGPT, but hopefully people will understand that I don't have unlimited cash. The main benefit is that the work might encourage more interest in the project. Plus it could form the basis for something a lot better, given there are lots of projects to get some of the larger models running on consumer hardware, e.g. float16 to int8 quantisation with LLaMA, LoRA, etc.