Ridiculously slow on RTX 3090!?! #958

gitwittidbit · 2023-08-25T22:16:35Z

gitwittidbit
Aug 25, 2023

First of all: Thank you very much for making LocalAI available. This is a giant leap for the community!

LocalAI version:
1.24.1 - built an hour ago

Environment, CPU architecture, OS, and Version:
Linux localAI 6.1.0-11-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) x86_64 GNU/Linux
This is a VM with 8 XEON cores, 32 GB ram and an NVIDIA RTX 3090

Describe the bug
It works but it is super slow: I asked "I want to make cheesecake" and the response took 21 minutes (!). That can't be right.

To Reproduce
Run it in an environment like mine and ask for cheesecake.

Expected behavior
Difficult to say but I would expect an answer within a minute or two? Or is that unrealistic?

Logs
`11:32PM DBG Request received:
11:32PM DBG Configuration read: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Parameters: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Prompt (before templating): ### Instruction: I want to make cheesecake
11:32PM DBG Template found, input modified to: ### Instruction: I want to make cheesecake

Response:

11:32PM DBG Prompt (after templating): ### Instruction: I want to make cheesecake

Response:

11:32PM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:32PM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: load time = 6589.55 ms
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: sample time = 460.18 ms / 599 runs ( 0.77 ms per token, 1301.66 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: prompt eval time = 39340.21 ms / 20 tokens ( 1967.01 ms per token, 0.51 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: eval time = 1228961.92 ms / 598 runs ( 2055.12 ms per token, 0.49 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: total time = 1268915.94 ms
11:53PM DBG Response: {"object":"chat.completion","model":"lunademo","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To make a delicious cheesecake, you will need the following ingredients:\n\nFor the crust:\n- 1 1/2 cups graham cracker crumbs\n- 1/4 cup granulated sugar\n- 6 tablespoons (3/4 stick) unsalted butter, melted\n\nFor the filling:\n- 16 ounces cream cheese, softened\n- 1/2 cup granulated sugar\n- 4 large eggs, separated\n- 1 teaspoon vanilla extract\n- 1/2 cup sour cream\n\nOptional toppings:\n- Fresh berries (strawberries, blueberries, raspberries)\n- Whipped cream\n- Chocolate syrup or chocolate shavings\n\nInstructions:\n\n1. Preheat your oven to 325°F (160°C).\n2. Prepare the crust: In a medium bowl, mix together the graham cracker crumbs and sugar. Stir in the melted butter until the mixture is evenly moistened. Press the crust mixture into the bottom of a 9-inch (23 cm) springform pan. Bake for 10 minutes, then let it cool completely.\n3. Prepare the filling: In a large mixing bowl, beat the cream cheese until smooth. Add the granulated sugar and beat until well combined. Beat in the egg yolks one at a time, followed by the vanilla extract.\n4. In a separate bowl, beat the egg whites until stiff peaks form. Gently fold one-fourth of the beaten egg whites into the cream cheese mixture to lighten it, then gently fold in the remaining egg whites.\n5. Pour the filling over the prepared crust and smooth the top with a spatula. Bake for 1 hour and 15 minutes, or until the edges are set and the center is just slightly jiggly.\n6. Let the cheesecake cool in the oven with the door ajar for 30 minutes. Then, remove it from the oven and let it cool completely on a wire rack.\n7. Once cooled, refrigerate the cheesecake for at least 4 hours or overnight to set.\n8. To serve, release the springform pan sides and transfer the cheesecake to a serving plate. If desired, top with fresh berries, whipped cream, chocolate syrup, or shavings.\n\nEnjoy your delicious homemade cheesecake!"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[127.0.0.1]:43580 200 - POST /v1/chat/completions`

Additional context
I tried one the lamas out a while ago (I think I was using oobabooga) on a machine without GPU. It responded slowly but steadily, maybe a couple of tokens per second. Mind you that was without a GPU. With a GPU and a better model, I would expect it to much much quicker. There may be a perception issue at play here as well: In oobabooga you get the response token by token where here you get it all in one go, but have to wait for the last token to be generated. So the wait before you see anything is infinitely longer. But still - 21 minutes???

Could it be that my GPU wasn't actually used? Does it say anything about it in the debug info?

mudler · 2023-08-25T22:52:38Z

mudler
Aug 25, 2023
Maintainer

@gitwittidbit looks like you didn't configured the model to run with the GPU - did you had a look at https://localai.io/basics/getting_started/#cuda ? you need to set the gpu_layers numer to be offloaded

1 reply

gitwittidbit Aug 26, 2023
Author

Ah, I had not. You see, I'm a total noob and I have pretty much no idea what I am doing...

But I am very much interested in LLMs and want to learn. And I am learning a lot here.

Not knowing how many layers one can offload, I tried 35 and it worked. This is much better already! I The result came in much faster at close to 3.3 tk/s.

I saw that it said 35 of 43 layers offloaded. So I thought that I could maybe offload all 43 layers but, for some reason, that didn't work and after this line

9:31AM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:41079): stderr llama_new_context_with_model: kv self size = 3125.00 MB local-ai stopped with an error.

So how can I determine how many layers I can offload to GPU?

And is that already the best I can achieve here? It now takes 3 minutes instead of 21 minutes, an improvement by factor 7. But I had still expected more, TBH.

I also notice that during inference, the power draw of my GPU doesn't go above 137W (out of 370). The fan doesn't spin up (to a level audible over the general background noise). So there still seems to be unused potential left in my card...

Is maybe this wizard-model not well suited to work on a GPU? Are there better (faster) ones? I mean, if this is the top speed one gets, I don't think I am going to have fun running this at home and probably need to move to the cloud.

Please don't understand this as any kind of criticism. LocalAI is a great accomplishment and a tremendous gift to the community! Probably my issues stem from lack of understanding and misconfiguration.

jjqtony · 2023-09-03T12:43:47Z

jjqtony
Sep 3, 2023

I have same problem with RTX3090.
When I execute the following command from getting start:
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.23.2-cublas-cuda12
and curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "hat is an alpaca?"}],
"temperature": 0.1
}'

get response:
12:38PM DBG Request received:
12:38PM DBG Configuration read: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:1 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt: RMSNormEps:0 NGQA:0}
12:38PM DBG Parameters: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:1 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt: RMSNormEps:0 NGQA:0}
12:38PM DBG Prompt (before templating): hat is an alpaca?
12:38PM DBG Template found, input modified to: Q: hat is an alpaca?\nA:
12:38PM DBG Prompt (after templating): Q: hat is an alpaca?\nA:
12:38PM DBG Loading model llama from open-llama-7b-q4_0.bin
12:38PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
12:38PM DBG Loading GRPC Model llama: {backendString:llama modelFile:open-llama-7b-q4_0.bin threads:1 assetDir:/tmp/localai/backend_data context:0xc00003e0b0 gRPCOptions:0xc000852140 externalBackends:map[huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
12:38PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
12:38PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:35317'
12:38PM DBG GRPC Service state dir: /tmp/go-processmanager4235662773
12:38PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35317: connect: connection refused"
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr 2023/09/03 12:38:32 gRPC Server listening at 127.0.0.1:35317
12:38PM DBG GRPC Service Ready
12:38PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:/models/open-llama-7b-q4_0.bin ContextSize:1024 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:35 MainGPU: TensorSplit: Threads:1 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0}
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr create_gpt_params: loading model /models/open-llama-7b-q4_0.bin
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr ggml_init_cublas: found 1 CUDA devices:
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: format = ggjt v3 (latest)
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_vocab = 32000
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_ctx = 1024
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_embd = 4096
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_mult = 256
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_head = 32
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_head_kv = 32
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_layer = 32
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_rot = 128
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_gqa = 1
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: rnorm_eps = 5.0e-06
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: n_ff = 11008
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: freq_base = 10000.0
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: freq_scale = 1
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: ftype = 2 (mostly Q4_0)
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: model size = 7B
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: ggml ctx size = 0.08 MB
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: using CUDA for GPU acceleration
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: mem required = 404.40 MB (+ 512.00 MB per state)
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 320 MB VRAM for the scratch buffer
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: offloading 32 repeating layers to GPU
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: offloading non-repeating layers to GPU
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: offloading v cache to GPU
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: offloading k cache to GPU
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: offloaded 35/35 layers to GPU
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_model_load_internal: total VRAM used: 4378 MB
12:38PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:35317): stderr llama_new_context_with_model: kv self size = 512.00 MB

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ridiculously slow on RTX 3090!?! #958

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

Ridiculously slow on RTX 3090!?! #958

gitwittidbit Aug 25, 2023

Response:

Response:

Replies: 2 comments · 1 reply

mudler Aug 25, 2023 Maintainer

gitwittidbit Aug 26, 2023 Author

jjqtony Sep 3, 2023

gitwittidbit
Aug 25, 2023

Replies: 2 comments 1 reply

mudler
Aug 25, 2023
Maintainer

gitwittidbit Aug 26, 2023
Author

jjqtony
Sep 3, 2023