
GPU not detected -- RM detects a driver/library version mismatch. #4

Open
JonBoyleCoding opened this issue Feb 14, 2024 · 17 comments

@JonBoyleCoding

Hi - I just wondered if you had some thoughts. I have a machine with some NVIDIA 2080 Supers in it that, for some reason, doesn't detect the GPU and launches in CPU-only mode. I'm happy to go over to Ollama directly if you're not sure, but I thought you might have come across this, so it seemed worth asking first. Thanks for looking at this either way.

I'm still working from a version of the flake prior to your changes that wrap around your nixpkgs fork (as of this moment, the new version doesn't work for me). I tried using the gpu/cuda package. I've used this on another machine and it works flawlessly (thanks for your hard work on this!).

I noticed this in the log:

67ffy95f824kxbvx4s6an9150sd6zazl-nvidia-x11-545.29.06-6.1.75/lib/libnvidia-ml.so.545.29.06: nvml vram init failure: 18"

To clarify, I'm able to run PyTorch with CUDA on a GPU from within a flake, so I believe the system is set up correctly. Again, I'm happy to open an issue with Ollama if you don't believe you can help.
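(For reference, the kind of dev shell I use for that PyTorch check looks roughly like the sketch below; this isn't my actual flake, and the packages may need allowUnfree/cudaSupport enabled on your channel.)

# rough sketch of a CUDA sanity-check dev shell, not my exact flake
{
  devShells.x86_64-linux.default = pkgs.mkShell {
    packages = [ (pkgs.python3.withPackages (ps: [ ps.torchWithCuda ])) ];
    # inside the shell:
    #   python -c 'import torch; print(torch.cuda.is_available())'
  };
}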

Full log from running ollama serve:

time=2024-02-14T19:13:12.975Z level=INFO source=images.go:863 msg="total blobs: 6"
time=2024-02-14T19:13:12.975Z level=INFO source=images.go:870 msg="total unused blobs removed: 0"
time=2024-02-14T19:13:12.975Z level=INFO source=routes.go:999 msg="Listening on 127.0.0.1:11434 (version 0.1.24)"
time=2024-02-14T19:13:12.975Z level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-02-14T19:13:17.529Z level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [cuda_v12 rocm cpu_avx cpu cpu_avx2]"
time=2024-02-14T19:13:17.529Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-14T19:13:17.529Z level=INFO source=gpu.go:242 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-14T19:13:17.529Z level=INFO source=gpu.go:288 msg="Discovered GPU libraries: [/nix/store/67ffy95f824kxbvx4s6an9150sd6zazl-nvidia-x11-545.29.06-6.1.75/lib/libnvidia-ml.so.545.29.06]"
time=2024-02-14T19:13:17.533Z level=INFO source=gpu.go:300 msg="Unable to load CUDA management library /nix/store/67ffy95f824kxbvx4s6an9150sd6zazl-nvidia-x11-545.29.06-6.1.75/lib/libnvidia-ml.so.545.29.06: nvml vram init failure: 18"
time=2024-02-14T19:13:17.533Z level=INFO source=gpu.go:242 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-14T19:13:17.533Z level=INFO source=gpu.go:288 msg="Discovered GPU libraries: [/nix/store/0x1y6by0mjcm1gn91rdn0bq5bh0f6l1i-rocm-smi-5.7.1/lib/librocm_smi64.so.5.0]"
time=2024-02-14T19:13:17.534Z level=INFO source=gpu.go:317 msg="Unable to load ROCm management library /nix/store/0x1y6by0mjcm1gn91rdn0bq5bh0f6l1i-rocm-smi-5.7.1/lib/librocm_smi64.so.5.0: rocm vram init failure: 8"
time=2024-02-14T19:13:17.534Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-14T19:13:17.534Z level=INFO source=routes.go:1022 msg="no GPU detected"
@abysssol
Owner

Sorry, but I don't have much of an idea about what's going wrong. It seems likely that ollama simply doesn't support that gpu currently. This nvml vram init failure: 18 is similar to the message I get on my amd gpu when ollama is built with cuda (nvml vram init failure: 9), so I can only guess that ollama isn't set up to recognize/use your gpu, or that libnvidia-ml isn't properly detecting it.
It's also possible that the error is caused by building with incompatible library versions, or by a missing library that should be exposed to ollama at runtime. Unfortunately, my knowledge of ollama, go, c++, etc. is too superficial for me to tell what the likely cause is.

I did encounter an issue that seems superficially related, and I opened an issue with ollama. It's probably actually unrelated, but maybe it would be of interest?

Ultimately, I would recommend opening an issue with ollama, since the maintainers there would hopefully know better about what's going wrong (even if it is my nix package that's actually at fault).


I'm still working on a flake from prior to your changes to wrap around your nixpkgs fork (as of this moment, that doesn't work for me).

Would you be willing to open an issue about this with more detail, so I can try to fix it? Does it not build, or does it build but not detect any gpu?

@JonBoyleCoding
Author

I'll open an issue with ollama then. Thanks for your thoughts. I'll try and build the new version again and get back to you in a separate issue.

@JonBoyleCoding
Author

Just to note @abysssol, I tried your most up-to-date version again. I realised what the issue was - I had the flake's nixpkgs input following the unstable branch of nixpkgs (which at the moment you have pointing to your separate repository, pending the PR).

It's still building at the moment, but I imagine there won't be any issues now! Will let you know if it fails any further.

@JonBoyleCoding
Author

JonBoyleCoding commented Feb 14, 2024

Sorry to re-open this issue.

I noticed in your issue that you set the OLLAMA_DEBUG variable. I thought I'd give it a shot and here's the relevant output.

time=2024-02-14T22:41:55.626Z level=INFO source=gpu.go:288 msg="Discovered GPU libraries: [/nix/store/z6557r7pgvmxr9x16a4ffazly8dflh65-nvidia-x11-545.29.06-6.1.77/lib/libnvidia-ml.so.545.29.06]"
wiring nvidia management library functions in /nix/store/z6557r7pgvmxr9x16a4ffazly8dflh65-nvidia-x11-545.29.06-6.1.77/lib/libnvidia-ml.so.545.29.06
dlsym: nvmlInit_v2
dlsym: nvmlShutdown
dlsym: nvmlDeviceGetHandleByIndex
dlsym: nvmlDeviceGetMemoryInfo
dlsym: nvmlDeviceGetCount_v2
dlsym: nvmlDeviceGetCudaComputeCapability
dlsym: nvmlSystemGetDriverVersion
dlsym: nvmlDeviceGetName
dlsym: nvmlDeviceGetSerial
dlsym: nvmlDeviceGetVbiosVersion
dlsym: nvmlDeviceGetBoardPartNumber
dlsym: nvmlDeviceGetBrand
nvmlInit_v2 err: 18
time=2024-02-14T22:41:55.630Z level=INFO source=gpu.go:300 msg="Unable to load CUDA management library /nix/store/z6557r7pgvmxr9x16a4ffazly8dflh65-nvidia-x11-545.29.06-6.1.77/lib/libnvidia-ml.so.545.29.06: nvml vram init failure: 18"

This shows that the error is coming specifically from nvmlInit_v2, and the NVML documentation describes error code 18 as follows:

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g06fa9b5de08c6cc716fbf565e93dd3d0

NVML_ERROR_LIB_RM_VERSION_MISMATCH = 18
    RM detects a driver/library version mismatch.

Investigating further, the drivers I have in the /run/opengl-driver/lib directory are 545.29.02, whereas above you can see ollama is loading 545.29.06. It's only a minor version difference, but perhaps that's where the mismatch is coming from.
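In Nix terms, the mismatch is roughly the following (just a sketch to illustrate, using config and pkgs from my system flake; I haven't actually evaluated this):

let
  # the driver my system actually runs (kernel module plus /run/opengl-driver/lib): 545.29.02 here
  systemDriver = config.hardware.nvidia.package;
  # the nvidia_x11 whose libnvidia-ml.so ollama discovered above: 545.29.06, from the flake's nixpkgs
  ollamaDriver = pkgs.linuxPackages.nvidia_x11;
in
  # NVML returns error 18 (LIB_RM_VERSION_MISMATCH) when these two disagree
  assert systemDriver.version == ollamaDriver.version; ollamaDriver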

This is probably an issue that will be solved once everything ends up in nixpkgs, but I thought I'd dive into the rabbit hole and see if I was correct. Unfortunately I've hit one too many stumbling blocks and it's getting late where I am, so here's where I've got to.

So it seems my issue COULD be coming from a mismatch: my system is on nixpkgs-stable while yours is currently based on nixpkgs-master. What's weird, though, is that I've never had this issue with other deep learning libraries in the past, where I base my flakes on nixpkgs-unstable.

I've tried overriding a bit and got stuck (pkgs here refers to nixpkgs-stable):

let
    ollama = ollama-abysssol.cuda.override {
        cudaGcc = pkgs.gcc11;
        cudaPackages = pkgs.cudaPackages;
        linuxPackages = pkgs.linuxPackages;
    };
in

Ends with:

 > + g++ -fPIC -g -shared -o ../llama.cpp/build/linux/x86_64/cuda_v11/lib/libext_server.so -Wl,--whole-archive ../llama.cpp/build/linux/x86_64/cuda_v11/examples/server/libext_server.a -Wl,--no-whole-archive ../llama.cpp/build/linux/x86_64/cuda_v11/common/libcommon.a ../llama.cpp/build/linux/x86_64/cuda_v11/libllama.a '-Wl,-rpath,$ORIGIN' -lpthread -ldl -lm -L/nix/store/z23gdb356jkbf3nl91c0mk4al1dl81pr-cuda-toolkit/lib -lcudart -lcublas -lcublasLt -lcuda
       > /nix/store/idiaraknw071d20nlqp49s18gbvw4wa0-binutils-2.40/bin/ld: cannot find -lcuda: No such file or directory
       > collect2: error: ld returned 1 exit status

And it's certainly right - there is no libcuda in that cuda-toolkit directory, so I'm guessing there has been a change in how cuda organises its libraries.

If you have any thoughts on how to go about testing this I'd appreciate it - but no worries if not. Ultimately I seem to get stuck diving into rabbit holes like this ;)

@JonBoyleCoding JonBoyleCoding changed the title GPU not detected GPU not detected -- RM detects a driver/library version mismatch. Feb 14, 2024
@abysssol
Owner

Sorry to re-open this issue.

Don't be; this is exactly what reopening issues is meant for. I'm actually rather excited to see that this problem may have a solution after all.
It's also getting late where I am though, so I'll take a closer look tomorrow.

@abysssol
Owner

Sorry for not getting to this today; I've been working on getting ollama 0.1.24 merged into upstream nixpkgs.

@abysssol
Owner

abysssol commented Feb 18, 2024

My hope is that once ollama is available from upstream nixpkgs, all library and driver versions should match, so hopefully your gpu will work then.
I've changed back to vendoring the ollama module instead of using it from my nixpkgs fork, so you can override nixpkgs with your stable version again:

ollama = {
  url = "github:abysssol/ollama-flake";
  inputs.nixpkgs.follows = "nixpkgs";
};

Then, try overriding linuxPackages with whatever kernel you use. That's where libnvidia-ml comes from; maybe different kernel versions ship different library versions.

let
  ollama = ollama-abysssol.cuda.override {
    # use the kernel packages for whichever kernel you run (zen shown here);
    # omit this override entirely if you use the default kernel
    linuxPackages = pkgs.linuxPackages_zen;
  };
in

Maybe that could make a difference? It looks like you already did something similar, so maybe not ... I don't know.
By the way, libcuda comes from cudaPackages.cuda_cudart. Maybe it's not there on stable? (See the quick check sketched at the end of this comment.)
Could you also post a file tree of /run/opengl-driver/lib/? e.g. with tree /run/opengl-driver/lib/ or exa --tree /run/opengl-driver/lib/. /sys/module/nvidia/ may also have something useful.
I think I'll have more time tomorrow to try to figure this out. For Real This Time™
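As for that cuda_cudart check, something like this in nix repl might confirm whether it's even present on your stable channel (a rough sketch; the `or` fallbacks are just so it doesn't throw):

# evaluate with your stable nixpkgs bound to pkgs; "missing" would suggest the
# libcuda link stub that the build needs isn't available there
{
  cudartVersion = pkgs.cudaPackages.cuda_cudart.version or "missing";
  cudartOutputs = pkgs.cudaPackages.cuda_cudart.outputs or [ ];
}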

@abysssol
Owner

abysssol commented Feb 18, 2024

I created a new branch, cuda-testing, that changes how it splices together the cuda-toolkit libs. Could you try building that and see if it makes a difference?

ollama = {
  url = "github:abysssol/ollama-flake/cuda-testing";
  inputs.nixpkgs.follows = "nixpkgs";
};

Unfortunately, I have thus far been unable to find any more information beyond what you already found.

@JonBoyleCoding
Author

Apologies for not responding - I have a number of reports/papers I'm working on at the moment!

Just had a chance to try it. It appears the patches fail - I'm going into a meeting now so I can't debug, but I can leave you with a log for now.

Running phase: patchPhase
applying patch /nix/store/liqb6g8spk497dz0bsxlp4bmadr4189c-remove-git.patch
patching file llm/generate/gen_common.sh
applying patch /nix/store/53fs5wbc3lq27pkcdhg65q9gkf0z8g88-replace-gcc.patch
patching file llm/generate/gen_common.sh
Hunk #1 succeeded at 89 (offset 3 lines).
applying patch /nix/store/65f7ahf1i5m7d1j6l6is50aq93snl0ac-01-cache.diff
patching file llm/llama.cpp/examples/server/server.cpp
applying patch /nix/store/x1jg303zsxd6zzs3k8bkxdn5ykhbh5l3-02-shutdown.diff
patching file llm/llama.cpp/examples/server/server.cpp
Hunk #2 succeeded at 2433 (offset 38 lines).
Hunk #3 succeeded at 3057 (offset 39 lines).
patching file llm/llama.cpp/examples/server/utils.hpp
substituteStream(): ERROR: Invalid command line argument: --replace-fail
/nix/store/i0l5falbdsbfl1lgypdp1jda672bdjw3-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context

@abysssol
Owner

Apologies for not responding

No worries. It's good to prioritize things that actually matter to you. I'll do the same.
To put it in perspective, we're just trying to get a chatbot to run a bit faster. Not an especially urgent endeavor.

substituteStream(): ERROR: Invalid command line argument: --replace-fail

The failure was an oversight of mine: I left in an argument to substituteInPlace that had only been added in unstable nixpkgs.
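For reference, it's the kind of change sketched below in the package's postPatch (not the literal diff; the file name is taken from the patch log above and the strings are placeholders):

# stable nixpkgs' substituteInPlace only knows --replace; --replace-fail
# (which errors out if the pattern isn't found) only exists in unstable
postPatch = ''
  substituteInPlace llm/generate/gen_common.sh \
    --replace "old-text" "new-text"   # placeholders, not the real strings
'';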

@lfdominguez

This is my problem too. I'm redefining nvidia_x11 in my NixOS config like this:

hardware.nvidia.package = pkgs.linuxPackages_cachyos.nvidia_x11.overrideAttrs (s: rec {
        version = "550.40.07";
        name = (builtins.parseDrvName s.name).name + "-" + version;
        src = pkgs.fetchurl {
            url = "https://download.nvidia.com/XFree86/Linux-x86_64/${version}/NVIDIA-Linux-x86_64-${version}.run";
            sha256 = "298936c727b7eefed95bb87eb8d24cfeef1f35fecac864d98e2694d37749a4ad";
        };
    });

That version differs from the one in nixpkgs itself, so ollama fails because it's built against 545, the default version in nixpkgs. How can I change the ollama flake input so that it uses this driver version?

@abysssol
Owner

abysssol commented Feb 24, 2024

I'm not sure if this will work, but I think you can just override nvidia_x11 with your custom driver. Try it and tell me if it works.

{ pkgs, lib, config, ollama }: # add `config` if it's not already an argument
let
  system = "x86_64-linux";

  ollamaCuda = ollama.packages.${system}.cuda.override {
    linuxPackages.nvidia_x11 = config.hardware.nvidia.package;
  };
in
{
  # if you're using the service in nixos-unstable
  services.ollama.package = ollamaCuda;

  # otherwise, put it in system packages
  environment.systemPackages = [
    ollamaCuda
  ];
}

And in your flake.nix inputs:

inputs = {
  ollama.url = "github:abysssol/ollama-flake";
};

@lfdominguez

Hmm, I think that is the way, but it seems the build still isn't finding libcuda from the cuda packages:

/nix/store/idiaraknw071d20nlqp49s18gbvw4wa0-binutils-2.40/bin/ld: cannot find -lcuda: No such file or directory

@lfdominguez

Working!!! With @abysssol's suggestion, only without adding inputs.nixpkgs.follows = "nixpkgs";.

@lfdominguez

I'm using a fully flake-based system, so my OS config and my home-manager config live in my own repo. In the end it looks like this.

In the main flake.nix, under inputs:

ollama = {
    url = "github:abysssol/ollama-flake";
    inputs.utils.follows = "flake-utils";
};

Later, I declare an overlay for this ollama:

overlay-ia = final: prev: {
    ia = {
        ollama = ollama.packages.${system}.cuda.override {
            linuxPackages.nvidia_x11 = pkgs.linuxPackages_cachyos.nvidia_x11.overrideAttrs (s: rec {
                version = "550.40.07";
                name = (builtins.parseDrvName s.name).name + "-" + version;
                src = pkgs.fetchurl {
                    url = "https://download.nvidia.com/XFree86/Linux-x86_64/${version}/NVIDIA-Linux-x86_64-${version}.run";
                    sha256 = "298936c727b7eefed95bb87eb8d24cfeef1f35fecac864d98e2694d37749a4ad";
                };
            });
        };
    };
};

Of course, this nvidia package must be the same as the one declared in my OS config:

hardware.nvidia.package = pkgs.linuxPackages_cachyos.nvidia_x11.overrideAttrs (s: rec {
    version = "550.40.07";
    name = (builtins.parseDrvName s.name).name + "-" + version;
    src = pkgs.fetchurl {
        url = "https://download.nvidia.com/XFree86/Linux-x86_64/${version}/NVIDIA-Linux-x86_64-${version}.run";
        sha256 = "298936c727b7eefed95bb87eb8d24cfeef1f35fecac864d98e2694d37749a4ad";
    };
});

Then I add this overlay to my nixpkgs definition and later use it in my home-manager package definitions. And voilà:

time=2024-02-24T18:36:23.272-03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"

thanks @abysssol

@abysssol
Owner

I'm glad to hear you got it working.

Is there a reason why you're duplicating the definition of your custom driver instead of using a let binding?
It seems like it could cause issues if they ever get out of sync.

@lfdominguez

I'm glad to hear you got it working.

Is there a reason why you're duplicating the definition of your custom driver instead of using a let binding? It seems like it could cause issues if they ever get out of sync.

No, no reason, heheh - that was just while I was testing. Now I'm using a common let binding in the main flake to define the desired nvidia driver.
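Roughly like this (a sketch of my setup collapsed into one place for brevity; in reality it's still split between flake.nix and the OS config, and system and the ollama flake input are in scope as in the snippets above):

let
  # single source of truth for the custom driver, used both for the system
  # driver and for the ollama override so they can't drift apart
  nvidiaPackage = pkgs.linuxPackages_cachyos.nvidia_x11.overrideAttrs (s: rec {
    version = "550.40.07";
    name = (builtins.parseDrvName s.name).name + "-" + version;
    src = pkgs.fetchurl {
      url = "https://download.nvidia.com/XFree86/Linux-x86_64/${version}/NVIDIA-Linux-x86_64-${version}.run";
      sha256 = "298936c727b7eefed95bb87eb8d24cfeef1f35fecac864d98e2694d37749a4ad";
    };
  });
in
{
  hardware.nvidia.package = nvidiaPackage;

  nixpkgs.overlays = [
    (final: prev: {
      ia.ollama = ollama.packages.${system}.cuda.override {
        linuxPackages.nvidia_x11 = nvidiaPackage;
      };
    })
  ];
}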
