Add more docs on troubleshooting #151

Merged
merged 4 commits into from
Nov 26, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: hipercow
Title: High Performance Computing
Version: 1.0.40
Version: 1.0.41
Authors@R: c(person("Rich", "FitzJohn", role = c("aut", "cre"),
email = "[email protected]"),
person("Wes", "Hinsley", role = "aut"),
2 changes: 1 addition & 1 deletion drivers/windows/DESCRIPTION
@@ -1,6 +1,6 @@
Package: hipercow.windows
Title: DIDE HPC Support for Windows
Version: 1.0.40
Version: 1.0.41
Authors@R: c(person("Rich", "FitzJohn", role = c("aut", "cre"),
email = "[email protected]"),
person("Wes", "Hinsley", role = "aut"),
1 change: 1 addition & 0 deletions inst/WORDLIST
@@ -6,6 +6,7 @@ DIDE
DIDE's
HPC
ICT
INLA
InfiniBand
JDK
OpenJDK
2 changes: 1 addition & 1 deletion vignettes/details.Rmd
@@ -72,7 +72,7 @@ These options are only relevant when using hipercow's rrq integration.
Objects passed to and from rrq tasks are usually stored in Redis. However, since these are all stored in memory, larger objects are offloaded to disk instead.
This option controls the threshold used to decide whether or not to offload objects. Objects larger than the configured value (in bytes) are offloaded to disk.

The default value is 100000, ie. 100kB.
The default value is 100000, i.e., 100kB.

This option is used when the queue is first created. Changing it afterwards will have no effect.
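
As a rough illustration of the threshold rule described above, the following base R sketch checks whether an object would exceed the default of 100000 bytes. The helper `would_offload()` is hypothetical and not part of hipercow or rrq, and measuring size via `serialize()` is an assumption; rrq may measure object size differently.

```r
# Default threshold quoted above, in bytes (100kB).
offload_threshold <- 100000

# Hypothetical helper (not a hipercow or rrq function): approximates an
# object's stored size by the length of its serialised form and compares
# that with the threshold.
would_offload <- function(x, threshold = offload_threshold) {
  length(serialize(x, NULL)) > threshold
}

would_offload(1:10)        # FALSE: a few bytes, would stay in Redis
would_offload(runif(1e5))  # TRUE: about 800kB of doubles, would go to disk
```

This can help when deciding whether the default suits your workload before the queue is first created.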

49 changes: 47 additions & 2 deletions vignettes/troubleshooting.Rmd
@@ -210,12 +210,48 @@ In that case, something is different between how the cluster sees the world, and
* Always check you're not running out of disk space. The Q: quota is normally 15GB.
* Find what node your tasks were running on. If you consistently get errors on one node, but not others, then get in touch with Wes, as we do get node failures from time to time, where the fault is not obvious at first.

# My code is slower on the cluster than running locally!
## My code is slower on the cluster than running locally!

* This is expected, especially for single-core tasks. Cluster nodes are often designed for higher overall throughput rather than faster single-core performance, so a single task may run slower on a cluster node than on your own computer. But the cluster node might be able to run 16 or more such tasks at once, without taking any longer, while you continue using your local computer for local things.
* If that is still insufficient, and you still want to compare timings in this way, then check that the cluster is doing *exactly* the same work as your local computer.
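
One way to check that both machines are doing the same work is to run a fixed, seeded workload and record both its result and its elapsed time. The sketch below uses base R only; `workload()` and `time_workload()` are hypothetical stand-ins for your own code, not hipercow functions.

```r
# Hypothetical stand-in for your real task; replace with your own function.
workload <- function(n, seed = 1) {
  set.seed(seed)  # fix the seed so every run does identical work
  x <- runif(n)
  sum(sort(x))
}

# Run the workload once, returning its value and the elapsed wall time.
time_workload <- function(n, seed = 1) {
  elapsed <- system.time(value <- workload(n, seed))[["elapsed"]]
  list(value = value, elapsed = elapsed)
}

local_run <- time_workload(1e6)
local_run$elapsed  # seconds taken on this machine
local_run$value    # compare with the value computed on the cluster
```

If the same call is run as a cluster task, the two values should agree (for example via `all.equal()`); if they do not, the two runs are not doing the same work and the timings are not comparable.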

## Asking for help
# I can't connect to the cluster

There are lots of possible causes of this, and ways that this might manifest as an error message, for example:

```
Error in client_parse_submit(httr_text(r), 1L) :
Job submission has likely failed; could be a login error
```

(we will add other error messages here as we catch them).

By the time you get here, we've thrown a pretty generic error because for some reason we can't tell what has happened. Possible reasons that you might see an error like this:

* Your local internet connection has failed
* Your ZScaler session has timed out and needs re-authentication
* Your cluster session has timed out
* Your DIDE password has expired

You can check most of these by running

```r
windows_check()
```

which will work through many common points of failure and report back what does and does not work. If you want help with diagnosing this sort of error, we would expect to see output from this command.

If that does work, but you are still having what looks like connection problems, then try

```r
hipercow_hello()
```

which will launch a simple job. If this does not work, and you want to ask for help, we would like to see the whole output of this command.

If that works, but your actual job does not work, then something about what you are submitting is causing the problem. In this case, if you are asking for help, we would need to know something about your code, in which case read on for the next section.

# Asking for help

If you need help, you can ask in the "Cluster" Teams channel. This is better than emailing Rich or Wes directly as they may not have time to respond, or may be on leave.

@@ -261,3 +297,12 @@ Too often, we will get requests for help with no information about what was run,
> ```

with this sort of information the problem may just jump out at us, or we may be able to create the error ourselves - either way we may be able to work on the problem and get back to you with a solution rather than a request for more information.

Other tips, and reasons you may have been directed to this page:

* **Please provide the whole error message**. Do not provide only the line that you think is interesting. You can store a great many lines of text in Teams, and you can always attach a file if you need to. We would really like to see as much information as possible.
* **Please do not provide screenshots** or photos of text unless for some reason your computer has lost the ability to copy and paste.
* **Please provide as much context as possible** as to what you are working on, and where. Please don't assume that we remember anything you told us in a previous discussion - we've probably forgotten and you have more context about your problem at the point where you ask than we do.
* In general, please try and reduce the chance that our response to your message has to be another question from us to you. It may feel like you will get to your answer quicker if you don't try and investigate at your end, but it will take much longer overall and use up more of everyone's time.

We do want to help, but expect slower responses where we have to do lots of discovery to find out what your problem is: it will take longer for us to find the time and energy to start digging. The more information you provide, the more likely it is we can spot the error.