Hi community,

I've been trying to build my own generative agents as a pet project to learn, and so far I've been using the oobabooga API to make calls to a Llama 2 13B model by TheBloke through exllama. However, I ran into an issue with oobabooga freezing after a few requests, so I started looking for alternatives, and LocalAI looks to be just what I need 👍.

So I'm trying to get LocalAI to run (preferably) the same model, but I haven't managed to so far. I'll detail what I've done and hopefully someone can point me in the right direction!

What I tried so far:
I ran the docker compose from IntelliJ IDEA; the container loaded, rebuilt, and gave a successful response when I accessed the list of available models. (I did not clone the project, though; I'm not sure why that would be needed if you just want to run the docker image?)
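For reference, this is roughly how I'm checking that the server is up; I'm assuming the default port 8080 from the example docker compose, so adjust if yours differs:

```sh
# List the models LocalAI has picked up from the models folder
curl http://localhost:8080/v1/models
```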
I looked for a Llama 2 model in the gallery, but I only found ggml models, none that could run entirely on the GPU with exllama. I tried one of those anyway (I had to download it manually because the gallery download was very slow, by the way) and ran into out-of-memory problems when trying to run it. The container has access to 16GB of RAM, but it does not seem to be offloading anything to the GPU; there already seems to be an open ticket for that.
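From what I can tell from the docs, offloading for the ggml/llama.cpp backend is supposed to be configured in the model yaml; the sketch below is my reading of it (the file name, gpu_layers value and f16 flag are guesses on my part, not something I've confirmed works):

```sh
# Hypothetical model config asking the llama backend to offload layers to the GPU
cat > models/llama2-13b-ggml.yaml <<'EOF'
name: llama2-13b-ggml
backend: llama
parameters:
  model: llama-2-13b.ggmlv3.q4_0.bin   # ggml file placed under /models (illustrative name)
f16: true
gpu_layers: 40                          # how many layers to push to the GPU
EOF
```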
I also saw that exllama support was recently added (feat: Add exllama #881), so I decided to try that as well. Following the PR instructions, I downloaded the model into a subfolder of /models/ and set up a .yaml config for it.
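The config is along these lines; I'm going from the example in the PR, so take the exact fields and the folder name as a sketch rather than a verbatim copy of what I have:

```sh
# Sketch of the exllama model config (the GPTQ folder name is illustrative)
cat > models/exllama.yaml <<'EOF'
name: exllama
backend: exllama
parameters:
  model: TheBloke_Llama-2-13B-GPTQ   # subfolder of /models containing the GPTQ model files
EOF
```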
However, when the docker image starts after rebuilding and I make a request to the completions endpoint, I get an exllama.py error on the imports. It sounds like a wrong CUDA version in the docker container? I've been approaching this top-down, so I don't know much about CUDA 😅.
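For completeness, the request that triggers it is roughly the following (the model name matches the name field from the yaml above, and the prompt is just an example):

```sh
# Completions request against LocalAI's OpenAI-compatible API
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "exllama",
    "prompt": "Hello, how are you today?",
    "temperature": 0.7
  }'
```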