mllg edited this page Oct 14, 2014 · 21 revisions

Basic configuration

The package BatchJobs tries to find a configuration at three different possible locations:

  1. The package installation directory,
  2. your user's home directory, or
  3. the working directory of your current R session.

For 2 and 3 the file must be called .BatchJobs.R. The config file deployed with the package (1) is called BatchJobs_global_config.R and resides in the etc subfolder of your package installation directory. Editing it allows a system administrator to set up a basic global configuration for all users.

If more than one configuration file is found, all are used but settings in the more specific file (3 is more specific than 2 which in turn is more specific than 1) overwrite those made in a less specific configuration file.
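The precedence rule can be pictured with plain R lists. This is only an illustration of "the more specific file wins", not the package's actual loading code, and the setting values are made up for the example:

```r
# Illustrative only: mimic the override order with plain lists.
global  <- list(mail.error = "all", debug = FALSE)  # (1) package etc directory
home    <- list(mail.error = "none")                # (2) .BatchJobs.R in the home directory
working <- list(debug = TRUE)                       # (3) .BatchJobs.R in the working directory

# Apply in order of increasing specificity; later entries override earlier ones.
effective <- Reduce(utils::modifyList, list(global, home, working))
str(effective)
```

Here mail.error ends up with the value from the home directory file and debug with the value from the working directory file, while anything set only in the global file would be kept.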

The default settings are meant for interactive usage (using makeClusterFunctionsInteractive) and do not send any status emails. This should allow you to try out BatchJobs locally without any prior configuration or setup. While the configuration file is a standard R file and can potentially contain any valid R code, we encourage you to only set the configuration variables you need. The configuration file(s) may include the following configuration variables (any option you leave out falls back to its default):

cluster.functions = makeClusterFunctionsInteractive()
mail.start = "first+last"
mail.done = "first+last"
mail.error = "all"
mail.from = "<[email protected]>"
mail.to = "<[email protected]>"
mail.control = list(smtpServer="my.mail.server.com")
staged.queries = TRUE

The variable cluster.functions determines your batch system. Please see Cluster-Functions and SSH-Cluster for possible implementations and detailed information on how to set them up.

mail.start, mail.done and mail.error concern the sending of status mails when jobs start, successfully terminate or terminate due to an exception. They can be set to:

  • 'none' = do not mail for any job
  • 'all' = mail for all jobs
  • 'first' = mail for first job
  • 'last' = mail for last job
  • 'first+last' = mail for first and last job

mail.from and mail.to are the sender and recipient addresses for status mails, respectively. The sender address does not necessarily have to exist. Enclose both in <> brackets as in the above example.

mail.control is a control structure for sendmailR. Please consult your local system administrator to find a suitable local mail exchange which will handle all mail delivery.
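As a sketch, a configuration that stays quiet on job start, reports once when the last job is done, and mails on every error could look like this. The addresses and the server name are placeholders, not real values:

```r
# Fragment for .BatchJobs.R; addresses and SMTP server are placeholders.
mail.start = "none"
mail.done = "last"
mail.error = "all"
mail.from = "<batchjobs@example.com>"
mail.to = "<me@example.com>"
mail.control = list(smtpServer = "smtp.example.com")
```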

The option staged.queries enables a mechanism where communication with the database is restricted to the master process. If you are relying on a network file system (e.g. you are on a batch system or use the ad-hoc SSH-cluster) you should set this to TRUE to avoid file system locks. As of BatchJobs 1.5 this mechanism is enabled by default.

Resource Allocation

Which, if any, resources must be specified for allocation is highly dependent on the cluster functions you use and the local setup of your cluster. Please consult your local administrator to see which resources you must request. Having said that, the default.resources variable can be used to define default resource limits used for all jobs. A possible configuration entry might look like the following:

default.resources = list(queue="my_queue", walltime=3600)

Additional resource specifications that you provide during job submission will overwrite these default values. We encourage you to use conservative values as the defaults. This avoids mishaps where you waste computational resources because of a broken, long running job.
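A minimal sketch of overriding a default at submission time, assuming the default.resources entry shown above and a batch system that understands queue and walltime. The registry id, directory and job function are hypothetical:

```r
library(BatchJobs)

# Hypothetical registry id and directory, for illustration only.
reg <- makeRegistry(id = "resource_demo", file.dir = tempfile("resource_demo_"))

# Define one job per element of 1:3.
batchMap(reg, function(x) x^2, 1:3)

# The walltime given here replaces the default of 3600 for these jobs.
# Which resource names are valid depends on your cluster functions.
submitJobs(reg, resources = list(walltime = 7200))
```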

Furthermore, you can set the option max.concurrent.jobs if your scheduler is configured with a hard per-user job limit. submitJobs then holds back further submissions while more than max.concurrent.jobs jobs are on the system and continues once enough jobs have finished.
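For example, with a scheduler limit of 100 jobs per user (an assumed number; ask your administrator for the real one), the configuration entry would be:

```r
# Fragment for .BatchJobs.R; 100 is an assumed per-user limit.
max.concurrent.jobs = 100
```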

Inspecting the current configuration

After you have loaded the BatchJobs package in R, you can inspect the current configuration by calling the getConfig function. Here is the output after a fresh installation of BatchJobs with no configuration file:

> getConfig()
BatchJobs configuration:
  cluster functions: Interactive
  mail.from:
  mail.to:
  mail.start: none
  mail.done: none
  mail.error: none
  default.resources:
  debug: FALSE
  raise.warnings: FALSE
  staged.queries: TRUE
  max.concurrent.jobs: Inf

Explicitly loading configurations

You may load a specific configuration file using the loadConfig function. It raises an error if the configuration file does not exist. Furthermore you may find setConfig useful.
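A short sketch of both functions. The file path is hypothetical, and passing a setting to setConfig as a named argument is written here from memory of the package interface; check ?setConfig if it does not behave as expected:

```r
library(BatchJobs)

# Hypothetical path; loadConfig raises an error if the file does not exist.
loadConfig("~/configs/cluster_a.BatchJobs.R")

# Change a single setting for the current R session only.
setConfig(debug = TRUE)

# Verify the result.
getConfig()
```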

Troubleshooting and Performance

If you run many short-lived jobs or your cluster is very large, the database may become a bottleneck. In that case, you may wish to set staged.queries to TRUE (enabled by default for BatchJobs >= 1.5). This will stage all queries on the slaves to the file system for later execution on the master. If set, slaves never write to the database. Instead they write out all update queries to files. The next time the master queries the database for status information, it will automatically read the staged queries from the shared file system and execute them. In this way the database is always in a synchronized state, at least from your (the user's) perspective. The only disadvantage is that you will experience a small delay on the head node when you call a DB querying function for the first time after jobs have been running for a while. On the upside, there will be no contention for a write lock on the database by the slaves, ensuring fast execution.

Version 1.3 of the package includes the option fs.timeout, which makes the package wait for created files for up to fs.timeout seconds before throwing an exception. If you experience disappearing jobs or get a lot of "file not found" errors, please try setting this option to at least 10 seconds. Note that this feature is disabled by default (fs.timeout is NA).
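The corresponding configuration entry, using the suggested minimum of 10 seconds, would be:

```r
# Fragment for .BatchJobs.R; wait up to 10 seconds for files to appear.
fs.timeout = 10
```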

Debugging

If you encounter problems on your batch system and you suspect this is due to a bug in how the package operates with the OS batch commands, you can set

debug = TRUE

in your configuration file and run a simple test. This will display all generated OS commands in R and their resulting output. Provide us with this output so we can fix the bug.

If your jobs fail and you only get a warning instead of an error from the code, you can set raise.warnings to TRUE in your configuration file. This makes sure that all warnings raised by code during job execution are treated as errors. Technically this is equivalent to options(warn = 2) on the slave.
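The effect of options(warn = 2) can be tried in a plain R session, without BatchJobs. The snippet below is a small base R illustration, not package code; the coercion call merely stands in for job code that raises a warning:

```r
# In the configuration file you would write:
# raise.warnings = TRUE

old <- options(warn = 2)          # promote warnings to errors
res <- tryCatch(
  as.integer("not a number"),     # normally returns NA with a coercion warning
  error = function(e) "stopped with an error"
)
options(old)                      # restore the previous warning level
print(res)
```

With the warning level at 2 the coercion warning is converted into an error, so the error handler runs instead of NA being returned.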
