Intermittent startup error on CUDA miner #19

Open
closerm opened this issue Mar 2, 2018 · 4 comments

@closerm

closerm commented Mar 2, 2018

When benchmarking the CUDA miner (v0.1.9) I get an intermittent error, as shown below.

        ============================= aion reference miner======================
                        Equihash<210,9> CPU&GPU Miner for AION v0.1.9
                        Base on NiceHash equihash miner.
        ============================= aion reference miner======================

Setting log level to 2
[20:31:50][0x00007f6ae3ad4740] Using SSE2: YES
[20:31:50][0x00007f6ae3ad4740] Using AVX: NO
[20:31:50][0x00007f6ae3ad4740] Using AVX2: NO
[20:31:50][0x00007f6ae3ad4740] Benchmarking CUDA worker (CUDA-TROMP) GeForce GTX 1080 Ti (#0) BLOCKS=64, THREADS=64
[20:31:51][0x00007f6ae3ad4740] Benchmark starting... this may take several minutes, please wait...
[20:32:12][0x00007f6adb04c700] CUDA error 'the launch timed out and was terminated' in func 'solve' line 1186

This doesn't happen every time I launch the miner, but it happened several times in a short period of running different benchmarks.

@closerm

closerm commented Mar 11, 2018

I'm still getting this error pretty consistently. Any thoughts?

        ============================= aion reference miner======================
                        Equihash<210,9> CPU&GPU Miner for AION v0.1.9
                        Base on NiceHash equihash miner.
        ============================= aion reference miner======================

Setting log level to 2
[12:31:40][0x00007f6161bb7740] Using SSE2: YES
[12:31:40][0x00007f6161bb7740] Using AVX: NO
[12:31:40][0x00007f6161bb7740] Using AVX2: NO
[12:31:40][0x00007f615912f700] stratum | Starting miner
[12:31:40][0x00007f615912f700] stratum | Connecting to stratum server 192.168.1.35:3333
[12:31:40][0x00007f615892e700] miner#0 | Starting thread #0 (CUDA-TROMP) GeForce GTX 1080 Ti (#0) BLOCKS=56, THREADS=64
[12:31:40][0x00007f615912f700] stratum | Connected!
[12:31:40][0x00007f615912f700] stratum | Subscribed to stratum server
[12:31:40][0x00007f615912f700] miner | Extranonce is 50000004
[12:31:40][0x00007f615912f700] stratum | Received new job #9
[12:31:40][0x00007f615912f700] stratum | Authorized worker 0x0000000000000000000000000000000000000000000000000000000000000000
[12:31:45][0x00007f615912f700] stratum | Received new job #a
[12:31:51][0x00007f6161bb7740] Speed [15 sec]: 5.16016 I/s, 10.223 Sols/s
[12:32:01][0x00007f6161bb7740] Speed [15 sec]: 2.33333 I/s, 5.26667 Sols/s
[12:32:03][0x00007f615892e700] miner#0 | CUDA error 'the launch timed out and was terminated' in func 'solve' line 1186

@closerm

closerm commented Mar 11, 2018

I am also getting some additional CUDA errors, and this is becoming less "intermittent".

[13:26:34][0x00007f4cb6ffd700] miner#4 | CUDA error 'unspecified launch failure' in func 'solve' line 1186
[13:26:21][0x00007f06f9359700] miner#4 | CUDA error 'an illegal memory access was encountered' in func 'solve' line 1186

[13:26:56][0x00007ffbf13f4700] miner#3 | CUDA error 'the launch timed out and was terminated' in func 'setheadernonce' line 260

These errors are all being produced by the pre-built 0.1.9 CUDA miner.

@closerm

closerm commented Mar 12, 2018

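These errors appear to be related to the NVIDIA driver's watchdog timer, which keeps the X Window display responsive in mixed X / compute environments. When a kernel runs too long on a GPU that also drives a display, the driver terminates it, which matches the 'the launch timed out and was terminated' message.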

Per this thread, the first two options may not be tenable, since they involve not running X, which appears to be required to control fan, power, and clock speeds on the GPU. (I know there are ways to run startx, set the parameters, and have them persist after X is closed, but that process hasn't worked for me.)

The fourth option is working for me right now, though it is the least recommended of the ways forward.

That brings me to option 3, the recommended option, which is effectively "break kernel execution into small enough pieces that their execution does not exceed the driver watchdog." I realize this is a big request, but I gather from other (older) pages that this is a bigger problem on Windows, so it will likely resurface when the miner is released for Windows. Refactoring the kernel code into smaller, faster-executing segments could prevent this problem on both platforms.
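To make the "smaller pieces" idea concrete, here is a minimal host-side sketch in plain C++. It is not the miner's code: `Chunk`, `plan_chunks`, and `run_in_chunks` are hypothetical names, and the `launch` callback only stands in for "launch the kernel on this slice, then synchronize". It also assumes the work can be split into independent consecutive slices, which may not hold for every stage of `solve`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical type: one bounded slice of the work, items [offset, offset + count).
struct Chunk {
    std::size_t offset;
    std::size_t count;
};

// Split `total` work items into consecutive chunks of at most `max_per_launch`
// items each. Returns an empty plan when `max_per_launch` is zero.
std::vector<Chunk> plan_chunks(std::size_t total, std::size_t max_per_launch) {
    std::vector<Chunk> chunks;
    if (max_per_launch == 0) {
        return chunks;  // no valid chunk size, so nothing is planned
    }
    std::size_t offset = 0;
    while (offset < total) {
        const std::size_t count = std::min(max_per_launch, total - offset);
        chunks.push_back({offset, count});
        offset += count;
    }
    assert(offset == total);  // the chunks cover the whole range exactly once
    return chunks;
}

// Feed every chunk to `launch`, one at a time, and return the number of launches.
// Assumption: in a real CUDA build, `launch` would start the kernel on the slice
// and synchronize before returning, so no single launch runs for long.
std::size_t run_in_chunks(std::size_t total, std::size_t max_per_launch,
                          const std::function<void(const Chunk&)>& launch) {
    const std::vector<Chunk> chunks = plan_chunks(total, max_per_launch);
    for (const Chunk& c : chunks) {
        launch(c);
    }
    return chunks.size();
}
```

For example, `plan_chunks(10, 4)` yields three slices covering items 0-3, 4-7, and 8-9. The chunk size would have to be tuned per GPU so that each launch finishes well inside the watchdog limit, at the cost of some extra launch overhead.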

@closerm

closerm commented Apr 12, 2018

Despite the comments above, I am still getting
CUDA error 'an illegal memory access was encountered' in func 'solve' line 1186

even with v0.2.0. It does appear to happen less often, but it has still occurred twice in the past hour.
