
Problem when loading kernel module for NVIDIA RTX 3060 / Driver version 510 #2

Closed
sewtsPatrick opened this issue Mar 21, 2022 · 1 comment

Comments

@sewtsPatrick

When I try to run the GPU container example, nvidia-smi does not work.

dmesg shows the following output:

[ 254.247462] nvidia: loading out-of-tree module taints kernel.
[ 254.247469] nvidia: module license 'NVIDIA' taints kernel.
[ 254.247470] Disabling lock debugging due to kernel taint
[ 254.260134] nvidia-nvlink: Nvlink Core is being initialized, major device number 235

[ 254.260671] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 254.302121] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.54 Tue Feb 8 04:42:21 UTC 2022
[ 254.312290] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 510.54 Tue Feb 8 04:34:06 UTC 2022
[ 254.313570] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 254.314776] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 255.529701] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1463)
[ 255.529730] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 255.529763] BUG: unable to handle page fault for address: 0000000000002a04
[ 255.529765] #PF: supervisor read access in kernel mode
[ 255.529781] #PF: error_code(0x0000) - not-present page
[ 255.529781] PGD 0 P4D 0
[ 255.529784] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 255.529787] CPU: 9 PID: 123060 Comm: nv_queue Tainted: P OE 5.10.43-yocto-standard #1
[ 255.529787] Hardware name: OnLogic K700/RXM-181, BIOS Z01-0001A037 10/13/2021
[ 255.529947] RIP: 0010:_nv009917rm+0x38/0xc0 [nvidia]
[ 255.529948] Code: 9b f0 01 48 8b bb 68 01 00 00 e8 33 59 4d 00 85 c0 74 0f 48 83 c4 08 5b 41 5c c3 0f 1f 80 00 00 00 00 44 89 e7 e8 f8 08 be ff <8b> 90 04 2a 00 00 83 fa 01 74 2f 80 b8 0c 05 00 00 00 74 12 80 b8
[ 255.529949] RSP: 0018:ffffa2dcc064fdd0 EFLAGS: 00010246
[ 255.529950] RAX: 0000000000000000 RBX: ffff933965021c08 RCX: 0000000000000000
[ 255.529951] RDX: ffffa2dcc064fdfc RSI: 0000000000000000 RDI: 0000000000000000
[ 255.529952] RBP: ffff933ab319e000 R08: 0000000000003000 R09: ffffa2dcc064fe00
[ 255.529952] R10: ffff933ab3242900 R11: 0000000000000001 R12: 0000000000000000
[ 255.529953] R13: ffffa2dcc064fec0 R14: ffff93394e806808 R15: ffff933ab3242900
[ 255.529954] FS: 0000000000000000(0000) GS:ffff93408bc40000(0000) knlGS:0000000000000000
[ 255.529955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 255.529955] CR2: 0000000000002a04 CR3: 000000052200c006 CR4: 00000000003706e0
[ 255.529956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 255.529957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 255.529957] Call Trace:
[ 255.530020] ? rm_execute_work_item+0x108/0x120 [nvidia]
[ 255.530071] ? os_execute_work_item+0x4c/0x70 [nvidia]
[ 255.530122] ? _main_loop+0x8c/0x140 [nvidia]
[ 255.530173] ? nvidia_modeset_resume+0x30/0x30 [nvidia]
[ 255.530176] ? kthread+0x129/0x170
[ 255.530177] ? kthread_park+0x90/0x90
[ 255.530178] ? ret_from_fork+0x1f/0x30
[ 255.530179] Modules linked in: nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE) ip6t_REJECT(E) nf_reject_ipv6(E) ip6table_filter(E) xt_state(E) ipt_REJECT(E) nf_reject_ipv4(E) ip6_tables(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xt_owner(E) snd_soc_skl(E) snd_soc_hdac_hda(E) intel_rapl_msr(E) snd_hda_ext_core(E) intel_rapl_common(E) snd_soc_sst_ipc(E) snd_soc_sst_dsp(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_soc_acpi_intel_match(E) snd_soc_acpi(E) coretemp(E) snd_soc_core(E) snd_compress(E) kvm_intel(E) ac97_bus(E) snd_hda_codec_hdmi(E) snd_pcm_dmaengine(E) kvm(E) snd_hda_intel(E) snd_intel_dspcfg(E) irqbypass(E) snd_hda_codec(E) crct10dif_pclmul(E) crc32_pclmul(E) 8250_dw(E) mei_wdt(E) intel_wmi_thunderbolt(E) wmi_bmof(E) ghash_clmulni_intel(E) snd_hda_core(E) mxm_wmi(E) snd_hwdep(E) ttm(E) aesni_intel(E) snd_pcm(E) crypto_simd(E) nvidiafb(E) igb(E) intel_lpss_pci(E) iTCO_wdt(E) mei_me(E) cryptd(E) vgastate(E) intel_pmc_bxt(E) intel_lpss(E)
[ 255.530203] glue_helper(E) efi_pstore(E) pcspkr(E) ee1004(E) iTCO_vendor_support(E) snd_timer(E) fb_ddc(E) cdc_acm(E) dca(E) mei(E) intel_pch_thermal(E) idma64(E) wmi(E) pinctrl_cannonlake(E) evbug(E) video(E) mac_hid(E) acpi_pad(E) acpi_tad(E) sch_fq_codel(E) [last unloaded: nouveau]
[ 255.530213] CR2: 0000000000002a04
[ 255.530214] ---[ end trace 0a22e754d9968912 ]---
[ 255.905919] RIP: 0010:_nv009917rm+0x38/0xc0 [nvidia]
[ 255.905921] Code: 9b f0 01 48 8b bb 68 01 00 00 e8 33 59 4d 00 85 c0 74 0f 48 83 c4 08 5b 41 5c c3 0f 1f 80 00 00 00 00 44 89 e7 e8 f8 08 be ff <8b> 90 04 2a 00 00 83 fa 01 74 2f 80 b8 0c 05 00 00 00 74 12 80 b8
[ 255.905922] RSP: 0018:ffffa2dcc064fdd0 EFLAGS: 00010246
[ 255.905924] RAX: 0000000000000000 RBX: ffff933965021c08 RCX: 0000000000000000
[ 255.905925] RDX: ffffa2dcc064fdfc RSI: 0000000000000000 RDI: 0000000000000000
[ 255.905926] RBP: ffff933ab319e000 R08: 0000000000003000 R09: ffffa2dcc064fe00
[ 255.905926] R10: ffff933ab3242900 R11: 0000000000000001 R12: 0000000000000000
[ 255.905927] R13: ffffa2dcc064fec0 R14: ffff93394e806808 R15: ffff933ab3242900
[ 255.905928] FS: 0000000000000000(0000) GS:ffff93408bc40000(0000) knlGS:0000000000000000
[ 255.905928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 255.905929] CR2: 0000000000002a04 CR3: 000000010dcb8005 CR4: 00000000003706e0
[ 255.905930] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 255.905930] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 916.128870] kauditd_printk_skb: 12 callbacks suppressed
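The key line in the log above is the `RmInitAdapter failed!` error, which means the driver could not initialize the GPU before the oops occurred. A minimal sketch for pulling the error tuple out of a captured log line so it can be searched against NVIDIA release notes and forum reports (the sample line is copied from this report; in practice you would pipe `dmesg` instead):

```shell
# Sample NVRM failure line, taken verbatim from the dmesg output above.
log='[  255.529701] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1463)'

# Extract the error tuple; -o prints only the matching part of the line.
code=$(printf '%s\n' "$log" | grep -o '(0x[0-9a-f]*:0x[0-9a-f]*:[0-9]*)')
echo "$code"   # → (0x26:0x56:1463)
```

On a live system, `dmesg | grep -o '(0x[0-9a-f]*:0x[0-9a-f]*:[0-9]*)'` after a failed module load yields the same tuple.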

My Dockerfile and entry.sh are exactly the same as in the example, except that I specified version 510.54.
I'm not sure whether this is a problem with the NVIDIA driver or with balena; I hope you can help me fix it.
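A common cause of `nvidia-smi` failing in a container is a mismatch between the userspace driver libraries baked into the image and the kernel module version loaded on the host. That is not the case here (the NVRM line above confirms the kernel module is also 510.54), but the check is worth sketching; the values below mirror this report, and comparing them against `/proc/driver/nvidia/version` on the host is the generic form of the check:

```shell
# Compare the driver version pinned in the container build against the
# kernel module version reported by NVRM in dmesg. Both values here are
# taken from this report (Dockerfile pin and the "Kernel Module 510.54" line).
container_ver="510.54"
module_ver="510.54"

if [ "$container_ver" = "$module_ver" ]; then
  echo "driver versions match"
else
  echo "version mismatch: $container_ver vs $module_ver"
fi
```

Since the versions match here, the `RmInitAdapter` failure points at a hardware/platform-level problem rather than a userspace/kernel version skew.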

@sewtsPatrick
Author

OK, I found out this is caused by a problem specific to the industrial PC we wanted to deploy on. I will close this issue.
