I'm using LocalAI with an NVIDIA A800 GPU.
I built it from source with make, e.g.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
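For comparison, I now suspect the plain make build above only produces CPU backends. If I'm reading the LocalAI build docs correctly, a CUDA-enabled build would look roughly like this (treat it as my guess, not something I've verified on the A800 yet):

# Rebuild with the cuBLAS/CUDA backend (per the LocalAI build docs);
# this needs the CUDA toolkit (nvcc) installed beforehand.
make clean
make BUILD_TYPE=cublas build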
When I load a model, I get output like this:
12:46AM INF [llama-cpp] Attempting to load
12:46AM INF Loading model 'OpenBioLLM-8B-lora.gguf' with backend llama-cpp
12:46AM DBG Loading model in memory from file: /path/to/model.gguf
12:46AM DBG Loading Model OpenBioLLM-8B-lora.gguf with gRPC (file: /path/to/model.gguf) (backend: llama-cpp): {backendString:llama-cpp model:OpenBioLLM-8B-lora.gguf threads:16 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000431208 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF GPU device found but no CUDA backend present
12:46AM INF [llama-cpp] attempting to load with AVX2 variant
12:46AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-avx2
12:46AM DBG GRPC Service for OpenBioLLM-8B-lora.gguf will be running at: '127.0.0.1:42153'
12:46AM DBG GRPC Service state dir: /tmp/go-processmanager2034810530
12:46AM DBG GRPC Service Started
12:46AM DBG GRPC(OpenBioLLM-8B-lora.gguf-127.0.0.1:42153): stdout Server listening on 127.0.0.1:42153
12:46AM DBG GRPC Service Ready
12:46AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:OpenBioLLM-8B-lora.gguf ContextSize:1024 Seed:1974897596 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:22 MainGPU: TensorSplit: Threads:16 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/path/to/model.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
12:46AM DBG GRPC(OpenBioLLM-8B-lora.gguf-127.0.0.1:42153): stderr llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /path/to/model.gguf (version GGUF V3 (latest))
12:46AM DBG GRPC(OpenBioLLM-8B-lora.gguf-127.0.0.1:42153): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
--- (model structure information omitted) ---
12:46AM DBG GRPC(OpenBioLLM-8B-lora.gguf-127.0.0.1:42153): stdout {"timestamp":1722789978,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":1}
12:46AM DBG GRPC(OpenBioLLM-8B-lora.gguf-127.0.0.1:42153): stdout {"timestamp":1722789978,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":1024}
12:46AM INF [llama-cpp] Loads OK
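Out of curiosity I also listed the backend binaries my build shipped, using the asset path from the GRPC log line above (what a CUDA build would add here is my assumption):

# Show which llama-cpp variants were compiled into this build.
# I only see CPU variants (e.g. avx2); I'd expect a CUDA-enabled
# build to add a cuda-flavoured binary here as well.
ls /tmp/localai/backend_data/backend-assets/grpc/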
I checked nvidia-smi and it shows the GPU is not being used. The load options above also report CUDA:false, even though NGPULayers is 22. I tried setting f16: true in the model's .yaml file (see the snippet below), but it didn't change anything.
My question is: how can I enable GPU acceleration so that inference speeds up? Thanks!
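For completeness, here is roughly what my model config looks like (the file name and values are illustrative; gpu_layers: 22 matches the NGPULayers:22 in the load options above, and as far as I understand it is the knob that controls GPU offloading):

# openbiollm.yaml - illustrative LocalAI model config
name: openbiollm
backend: llama-cpp
context_size: 1024
f16: true
# Number of layers to offload to the GPU; presumably ignored while the
# binary has no CUDA backend, which would explain what I'm seeing.
gpu_layers: 22
parameters:
  model: OpenBioLLM-8B-lora.gguf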