
Chat with your website functionality #96

Open
m-i-l opened this issue Mar 25, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@m-i-l
Contributor

m-i-l commented Mar 25, 2023

As per comments in #85 and #84 I'd like to experiment with Solr 9's new vector search (DenseVectorField fieldType and K-Nearest-Neighbor Query Parser).
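For reference, a query against Solr 9's K-Nearest-Neighbor Query Parser might look something like the following minimal sketch (using Python's requests library; the collection name, vector field name, vector dimensions and returned fields are illustrative assumptions rather than the actual searchmysite.net schema):

import requests

SOLR_URL = "http://localhost:8983/solr/content/select"  # "content" collection name is an assumption

# In practice the query vector would come from embedding the user's question
# with the same model used to embed the indexed content
query_vector = [0.01] * 384  # 384 dimensions assumed, e.g. for a MiniLM-style embedding

params = {
    # {!knn} is the Solr 9 K-Nearest-Neighbor Query Parser; f is the DenseVectorField
    "q": "{!knn f=content_vector topK=10}" + str(query_vector),
    "fl": "url,title,score",  # illustrative field names
    "rows": 10,
}

response = requests.get(SOLR_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("url"), doc.get("score"))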

Vector search works best on longer phrases, while keyword search works best on specific search terms, and I'm not sure how best to combine the two (I know there are various proposals for hybrid search models, but I'm not sure there are any best practices yet), so the simplest option for now is a separate vector search page. Given the longer phrase input, and the familiarity many people have with things like ChatGPT, it would make sense to have a chat-like interface.

This could be accessed via a new link below the main search box, to the left of "Browse Sites", called e.g. "Chat Search". This would take you to a page with a larger chat box, where you can (in effect) ask a question about content in the searchmysite.net index, and get a link back, and maybe even a summary of the relevant part of the page.

A quick rough estimate suggests I could use a paid-for Large Language Model (LLM) API like OpenAI's for content embeddings for about US$25 a month, which would probably be doable. The issue is that it would also need matching query embeddings and potentially summarisation API calls, which could work out at up to US$0.05 per question, and given I can have over 160,000 searches by (unblockable) SEO spam bots per day, I don't want the financial risk of using a paid-for API. That means I'll need to use some open source language models that are self-hostable on the relatively low spec hardware I'm currently using (2 vCPUs and 4Gb RAM).
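To put a number on that risk, the back-of-envelope worst case (using the per-question estimate and bot traffic figures above) works out as follows:

cost_per_question = 0.05        # US$ per question, upper estimate from above
bot_searches_per_day = 160_000  # searches per day from SEO spam bots

worst_case_per_day = cost_per_question * bot_searches_per_day
print(f"Worst case exposure: US${worst_case_per_day:,.0f} per day")  # US$8,000 per day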

Results therefore won't be anywhere near as good as ChatGPT's, but hopefully people will understand that I don't have unlimited cash. The main benefit is that the work might encourage more interest in the project. Plus it could form the basis for something a lot better, given there are lots of projects to get some of the larger models running on consumer hardware, e.g. float16 to int8 quantisation, LLaMA, LoRA, etc.

@m-i-l added the enhancement (New feature or request) label on Mar 25, 2023
This was referenced Jun 24, 2023
@m-i-l changed the title from "Chat-like functionality using vector search" to "Chat with your website functionality" on Jun 24, 2023
@m-i-l
Contributor Author

m-i-l commented Jun 24, 2023

I've experimented with https://github.com/imartinez/privateGPT , and saw there is a possibility of feeding the text chunks returned by a vector search as context into a self-hosted GPT4All model to return answers to questions (a rough sketch of this flow follows the list below). Challenges would include:

  • It was taking a couple of minutes to generate answers to questions. This isn't necessarily a problem if the pages with the answers on them are returned almost immediately.
  • The smallest models need a minimum of 4Gb of RAM, and the whole environment (search, indexing, database etc.) is currently on a 4Gb Hetzner CPX21 instance (3 vCPU, 4Gb RAM, 80Gb disk, EUR8.98 a month). One option might be to switch to one of Hetzner's new ARM64-based CAX21 instances (4 vCPU, 8Gb RAM, 80Gb disk, EUR7.73 a month).
  • I'll need to figure out some way of preventing the spam bots from using the LLM, otherwise they'll effectively mount a denial-of-service attack. I'm wondering if some form of server push (e.g. WebSockets) requiring client-side JavaScript might be enough to stop most bots from (ab)using it.
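As mentioned above, here is a rough sketch of that flow, assuming the gpt4all Python bindings and treating the retrieved chunks as plain strings; the model filename, prompt template and hard-coded chunks are illustrative, not what privateGPT actually does internally:

from gpt4all import GPT4All

# Chunks returned by the vector search, hard-coded here for illustration
context_chunks = [
    "Ben Nevis (Bheinn Nibheis in Scottish Gaelic), Scotland, 1345m",
    "It took nearly 4 hours to reach the summit of Ben Nevis from the visitor centre.",
]
question = "How long does it take to climb Ben Nevis?"

# Any small GGUF chat model supported by gpt4all would do; this filename is an assumption
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

prompt = ("Answer the question using only the context below.\n\n"
          "Context:\n" + "\n".join(context_chunks) + "\n\n"
          "Question: " + question + "\nAnswer: ")

# On low spec CPU-only hardware this generation step is what takes the minutes noted above
print(model.generate(prompt, max_tokens=200))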

Planning to implement the vector search first, before starting on the question answering. Hence I've created a new issue #99 for the vector search implementation, and renamed this one to "Chat with your website functionality".

@m-i-l
Contributor Author

m-i-l commented Oct 28, 2023

Now that the #99 vector search implementation is working, and I've upgraded the server from 4Gb to 8Gb RAM as per #110, I've started taking a look at this again.

I have a basic Retrieval Augmented Generation (RAG) setup working on dev, using the 7B parameter Llama 2 chat model quantized down to 3 bit, which is about as low as you can go. On my 7 year old dev machine, context fragments were returned nearly instantly and the answers took 30-60s to generate, which is not necessarily too slow. It takes around 4.8Gb of RAM, which will be a bit of a struggle to fit on the production server, but not necessarily out of the question either. Results were at times surprisingly good, e.g.

Question: How long does it take to climb ben nevis?
Answer (context): https://michael-lewis.com/posts/climbing-the-three-peaks-snowdon-scafell-pike-and-ben-nevis/
Answer (generated): Based on the context you provided, it takes nearly 4 hours to climb Ben Nevis from the visitor center. The blog post states that it took them 4 hours to reach the summit, which is significantly longer than the time it would take to climb other mountains like Snowdon or Scafell Pike.
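For anyone wanting to reproduce that sort of timing comparison locally, a minimal sketch follows, assuming llama-cpp-python and a 3 bit quantised Llama 2 7B chat GGUF file (the model path, prompt format and context string are illustrative, not the actual dev setup):

import time
from llama_cpp import Llama

# Path to a 3 bit quantised Llama 2 7B chat model is an assumption
llm = Llama(model_path="llama-2-7b-chat.Q3_K_S.gguf", n_ctx=2048)

question = "How long does it take to climb ben nevis?"
# The top chunk(s) returned by the #99 vector search would be used as the context
context = "The blog post states that it took nearly 4 hours to reach the summit of Ben Nevis from the visitor centre."

prompt = f"Answer the question using only the context.\nContext: {context}\nQuestion: {question}\nAnswer:"

start = time.time()
output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])
print(f"Generation took {time.time() - start:.1f}s")  # 30-60s on an old CPU-only machine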

Getting a simple demo working on dev is one thing, but getting it production ready (e.g. able to work with more than one user at a time, integrating into existing non-async Flask templates, etc.) is something else entirely.

m-i-l added a commit that referenced this issue Dec 3, 2023
m-i-l added a commit that referenced this issue Dec 3, 2023
@m-i-l
Contributor Author

m-i-l commented Dec 3, 2023

I've deployed a version to production for early testing, although I haven't put a link to it anywhere because it isn't ready for wider testing just yet.

It is using llama-2-7b-chat quantised down to 3bit, with TorchServe as the model server.
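Calling a TorchServe-hosted model from the web tier looks roughly like the following sketch; TorchServe's inference API listens on port 8080 by default, but the registered model name and request payload shown here are assumptions about the handler rather than the actual searchmysite code:

import requests

TORCHSERVE_URL = "http://localhost:8080/predictions/llama2chat"  # model name is an assumption

payload = {
    "question": "How long does it take to climb Ben Nevis?",
    "context": "It took nearly 4 hours to reach the summit from the visitor centre.",
}

# Long timeout because CPU-only generation can take tens of seconds
response = requests.post(TORCHSERVE_URL, json=payload, timeout=120)
print(response.text)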

I've written a post at https://michael-lewis.com/posts/vector-search-and-retrieval-augmented-generation/ with more information on LLMs, Retrieval Augmented Generation, TorchServe etc.

Will update further after testing, and if all goes well will open up for wider use.

@m-i-l
Contributor Author

m-i-l commented Dec 9, 2023

I've swapped from the 7B parameter Llama 2 chat model quantised down to 3 bit, because that was too slow, to the 3B parameter Rocket model quantised down to 4 bit.

In summary: the source reference link is returned super quickly, some of the generated content is excellent, from a memory perspective it looks viable, and from a CPU and overall response time perspective it might be viable but needs further testing, especially when the indexing is running. The main issue now is that the vector search results are quite poor, so the LLM is given poor context, which means it mostly can't answer the question even though it should have been able to. The workaround for now is to restrict to the site you are interested in querying via the domains selector below the "Ask a question" box.

To get this far, I've encountered and resolved (or partially resolved) the following issues:

  • The TorchServe Docker image doesn't currently build on arm64. The solution was to build my own from scratch, which probably won't be as optimised as the official one.
  • The server timed out connections held open for longer than 60s, while results often took longer than that. I've found some config to keep connections open longer, although connections are still timed out at 100s and I'm not sure why. I'm not spending longer investigating because 60-100s is really too long to wait anyway. I've also performed some TorchServe tuning to reduce inference time on CPU, and swapping to the new model seems to get some responses within 30s.

Open issues are:

  • The content chunks returned by the vector search are often irrelevant to the question, so if those are passed into the LLM it won't be able to generate a good response. A query for "How high is Ben Nevis?", for example, returns a top chunk all about someone called Benjamin, and none of the chunks mentioning the height of Ben Nevis, so it can't answer the question even though it could have done if it had been provided with a better chunk. If you restrict the search to the site which mentions the height of Ben Nevis it is better.
  • Inference takes around 50% of CPU, and indexing also takes around 50% of CPU, so when they are both running it is much more likely to time out. In theory the indexing could be moved to another server.
  • When it times out the model server process continues running, so it still eats up 50% of CPU, which is a bit of a waste considering the generated text will never be shown. I suspect the solution here may be a server push like WebSockets, but that could be a big infrastructure change to support.
  • The API is unprotected so vulnerable to DDoS, e.g. from the spam bots. Most of the spam bots don't run JavaScript, so as long as the operators don't manually inspect the code to find the endpoint it might not be such a risk.

@m-i-l
Contributor Author

m-i-l commented Dec 17, 2023

Regarding the surprisingly poor quality results: with sentence-transformers/all-MiniLM-L6-v2, "How high is Ben Nevis?" gives a similarity score of 0.3176 to text about mountains containing the words "Ben Nevis" and its height, but a higher score of 0.4072 to some text about someone called Benjamin talking about someone down a well. Similarly, "Can you summarize Immanuel Kant's biography in two sentences?" gives a similarity score of 0.5178 to text containing "Immanuel Kant" and some details of his life, but a higher score of 0.5766 to just the word "Biography". You can test this via:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# The text which actually contains Ben Nevis and its height scores lower than the
# unrelated text which mentions someone called Benjamin
question1 = "How high is Ben Nevis?"
answers1 = ["The three peaks in this context are the three highest peaks in Great Britain: Scafell Pike, England, 978m; Snowdon (Yr Wyddfa in Welsh), Wales, 1085m; Ben Nevis (Bheinn Nibheis in Scottish Gaelic), Scotland, 1345m", "Imagine being all that way down in the dark. Hope they thought to haul him up again at the end opined Benjamin, pleasantly."]
print(util.cos_sim(model.encode(question1), model.encode(answers1[0])))  # approx 0.3176
print(util.cos_sim(model.encode(question1), model.encode(answers1[1])))  # approx 0.4072

# The single word "Biography" scores higher than the text containing actual biographical details
question2 = "Can you summarize Immanuel Kant's biography in two sentences?"
answers2 = ["Biography", "Immanuel Kant, born in 1724, was one of the most influential philosophers of the Enlightenment. Although Kant is best known today as a philosopher, his early work focused on physics. He correctly deduced a number of complicated physical phenomena, including the orbital mechanics of the earth and moon, the effects of the earth\u2019s rotation on weather patterns, and how the solar system was formed."]
print(util.cos_sim(model.encode(question2), model.encode(answers2[0])))  # approx 0.5766
print(util.cos_sim(model.encode(question2), model.encode(answers2[1])))  # approx 0.5178

I've tested some of the alternative models on the leaderboard at https://huggingface.co/spaces/mteb/leaderboard , and switched to BAAI/bge-small-en-v1.5 because it gives better results (including the expected ones in the examples above) and doesn't take much more memory or CPU.
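The swap itself is just a model name change in the earlier snippet, e.g. the Ben Nevis test again (note that the bge model card also recommends prefixing short retrieval queries with an instruction, which is worth checking for the version in use):

from sentence_transformers import SentenceTransformer, util

# Same Ben Nevis test as above, with the replacement embedding model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

question = "How high is Ben Nevis?"
answers = ["The three peaks in this context are the three highest peaks in Great Britain: Scafell Pike, England, 978m; Snowdon (Yr Wyddfa in Welsh), Wales, 1085m; Ben Nevis (Bheinn Nibheis in Scottish Gaelic), Scotland, 1345m",
           "Imagine being all that way down in the dark. Hope they thought to haul him up again at the end opined Benjamin, pleasantly."]

# With this model the Ben Nevis text should now score higher than the Benjamin text
print(util.cos_sim(model.encode(question), model.encode(answers[0])))
print(util.cos_sim(model.encode(question), model.encode(answers[1])))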

It'll take 7 days for all the full listings to be reindexed with the new embedding model, and 28 days for all of the basic listings to be reindexed, so it should be ready for testing on production in around 7 days.

@m-i-l closed this as completed on Dec 17, 2023
@m-i-l reopened this on Dec 17, 2023