
Job processing (cookbook)

David Anderson edited this page Nov 12, 2024 · 13 revisions

This cookbook shows how to process large numbers of jobs with BOINC. We'll write two programs: a work generator, which creates jobs, and an assimilator, which handles their output. We'll write these in Python, using C++ programs (supplied by BOINC) to do the low-level work. We could work directly in C++; that would be more efficient but more complex.

We'll build on the previous cookbooks. We'll assume that you've created a BOINC project and created an example application. This application (named 'worker') takes a text file and converts it to uppercase.

In this example, you'll create a directory containing input files - potentially thousands of them. Then - with one command - you'll create a batch of jobs, one per input file. The assimilator will put the output files into a new directory.

This is, of course, a toy example. But it should be straightforward to use the mechanisms to handle real applications.

Set up user permissions

  • If you haven't already done so, create an account on the BOINC project. Make a note of your user ID (an integer, shown on your user page).
  • Go to the project's admin web page.
  • Click User job submission privileges.
  • Click 'Add user'.
  • Enter your user ID and click OK.
  • Select 'All apps' and click OK.

Work generator

To create jobs, we'll use an existing script called submit_batch which is used as follows:

bin/submit_batch user_id app_name infile_dir

It creates a batch of jobs for the given app, owned by the given user. It creates one job for each file in the given directory. It can be used with any app that takes a single input file; in this case we'll use it with worker.

The source for submit_batch is in the BOINC source code. Let's look at how it works, so that you can adapt it to your own apps.

First it gets a list of the files in the input file directory:

    files = []
    for entry in os.scandir(dir):
        if not entry.is_file():
            raise Exception('not file')
        files.append(entry.name)
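The same step can be sketched as a standalone function. This is a minimal sketch, not the code in submit_batch: the name list_input_files is hypothetical, and sorting the names (so that jobs are created in a reproducible order) is an addition.

```python
import os

def list_input_files(indir):
    """Return the sorted names of the regular files in indir.

    Hypothetical helper, not part of submit_batch. Like the excerpt
    above, it rejects any entry that is not a regular file
    (for example a subdirectory).
    """
    names = []
    with os.scandir(indir) as entries:
        for entry in entries:
            if not entry.is_file():
                # Report which entry caused the problem.
                raise ValueError('not a regular file: %s' % entry.path)
            names.append(entry.name)
    return sorted(names)
```

Naming the offending entry in the exception makes it easier to find a stray subdirectory in a directory holding thousands of files.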

Then it creates a batch by running a BOINC-supplied program, create_batch. It parses the batch ID written by this program.

    cmd = [
        'bin/create_batch',
        '--app_name', app_name,
        '--user_id', str(user_id),
        '--njobs', str(len(files)),
        '--name', '%s__%d'%(app_name, int(time.time()))
    ]
    ret = subprocess.run(cmd, capture_output=True)
    if ret.returncode:
        raise Exception('create_batch failed (%d): %s'%(ret.returncode, ret.stdout))
    batch_id = int(ret.stdout)
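Each step in this script follows the same pattern: run a program, check its exit code, and read its output. A minimal sketch of that pattern as a reusable helper is shown here. The names run_checked and parse_batch_id are hypothetical, and including stderr in the error message is an addition; the sketch assumes, as the excerpt above does, that create_batch writes only the batch ID to stdout.

```python
import subprocess

def run_checked(cmd, stdin_text=None):
    """Run cmd, return its stdout as text, and raise on a nonzero exit.

    Hypothetical helper. The error message carries both stdout and
    stderr, since either may hold the program's diagnostics.
    """
    ret = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    if ret.returncode:
        raise RuntimeError('%s failed (%d): %s %s' % (
            cmd[0], ret.returncode, ret.stdout.strip(), ret.stderr.strip()))
    return ret.stdout

def parse_batch_id(output):
    """Parse the integer batch ID from the text written by create_batch."""
    return int(output.strip())
```

With these helpers, creating the batch would reduce to `batch_id = parse_batch_id(run_checked(cmd))`.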

Then it 'stages' the input files, copying them from the input file directory to the project's download hierarchy. It uses a BOINC-supplied program, stage_file, to do this; this program stages all the files in the given directory.

    cmd = ['bin/stage_file', '--copy', dir]
    ret = subprocess.run(cmd, capture_output=True)
    if ret.returncode:
        raise Exception('stage_file failed (%d): %s'%(ret.returncode, ret.stdout))

Then it creates the jobs using a BOINC-supplied program create_work.

    fstr = '\n'.join(files)
    cmd = [
        'bin/create_work',
        '--appname', app_name,
        '--batch', str(batch_id),
        '--stdin'
    ]
    ret = subprocess.run(cmd, input=fstr, capture_output=True, encoding='ascii')
    if ret.returncode:
        raise Exception('create_work failed (%d): %s'%(ret.returncode, ret.stdout))

The --stdin option tells create_work that job descriptions will be passed via stdin, one per line. In this case each job description is just the name of the input file. It could also include command-line parameters; see the create_work documentation for details.
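Because each job is described by one line of ASCII text, some file names cannot be passed this way. The following is a minimal sketch of a check that could run before create_work is called. The function name job_descriptions is hypothetical, and rejecting whitespace rests on the assumption that create_work splits each line into whitespace-separated tokens.

```python
def job_descriptions(files):
    """Build the text passed to create_work on stdin: one file name per line.

    Hypothetical helper. It rejects names that would not survive the
    line-based, ASCII-encoded format used in the excerpt above.
    """
    for name in files:
        if not name.isascii():
            # subprocess.run(..., encoding='ascii') could not encode this name.
            raise ValueError('non-ASCII file name: %r' % name)
        if any(c.isspace() for c in name):
            # Assumption: a space or newline would split the job description.
            raise ValueError('whitespace in file name: %r' % name)
    return '\n'.join(files)
```

Failing early here avoids creating a batch and staging files only to have job creation fail partway through.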

Finally, it marks the batch as in progress.

    cmd = ['bin/create_work', '--enable', str(batch_id)]
    ret = subprocess.run(cmd, capture_output=True)
    if ret.returncode:
        raise Exception('enable batch failed (%d): %s'%(ret.returncode, ret.stdout))

Handling completed jobs

We'll use a program to handle completed jobs. This program moves the output file of the canonical job instance to a directory sample_results/<batch_id>/:

batch_id = sys.argv[1]
outfile_path = sys.argv[2]
fname = os.path.basename(outfile_path)
outdir = 'sample_results/%s'%(batch_id)
os.system('mkdir -p %s'%(outdir))
os.system('mv %s %s/%s'%(outfile_path, outdir, fname))
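When adapting this script to your own app, note that the shell commands above would misbehave if a path contained spaces or shell metacharacters. A minimal sketch of the same logic using only the Python standard library follows. It is not the code in sample_assimilate.py; the function name assimilate and the results_root parameter are hypothetical.

```python
import os
import shutil

def assimilate(batch_id, outfile_path, results_root='sample_results'):
    """Move one output file into results_root/<batch_id>/.

    Hypothetical variant of the script above. Returns the new path.
    """
    outdir = os.path.join(results_root, str(batch_id))
    os.makedirs(outdir, exist_ok=True)  # same effect as 'mkdir -p'
    dest = os.path.join(outdir, os.path.basename(outfile_path))
    shutil.move(outfile_path, dest)     # same effect as 'mv'
    return dest
```

A script built on this would call `assimilate(sys.argv[1], sys.argv[2])`, matching the two arguments the original script reads.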

This script is in the BOINC source tree, in tools/sample_assimilate.py. Copy it to ~/projects/test/bin.

Edit ~/projects/test/config.xml. In the <daemons> section, delete the sample_assimilator entry, and add:

<daemon>
   <cmd>script_assimilator --app worker --script "sample_assimilate.py batch_id files"</cmd>
   <output>assimilator_worker.out</output>
   <pid_file>assimilator_worker.pid</pid_file>
</daemon>
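After editing, it can be worth confirming which daemons the file now declares. The following is a minimal sketch that assumes config.xml is well-formed XML containing `<daemon>` elements with a `<cmd>` child, as in the fragment above; the function name daemon_commands is hypothetical.

```python
import xml.etree.ElementTree as ET

def daemon_commands(config_text):
    """Return the <cmd> text of every <daemon> element in config_text."""
    root = ET.fromstring(config_text)
    return [d.findtext('cmd', default='').strip() for d in root.iter('daemon')]
```

For example, `daemon_commands(open('config.xml').read())` should include the script_assimilator command and no command starting with sample_assimilator.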

Processing a batch of jobs

On the BOINC server, restart the project and make a directory for input files:

cd ~/projects/test
bin/stop
bin/start
mkdir infiles

Put some text files into infiles/. As many as you want; long, short, doesn't matter.
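If you have no text files at hand, a few lines of Python can generate some. This is a minimal sketch; the function name make_sample_inputs, the file naming scheme, and the file contents are all arbitrary choices made for illustration.

```python
import os

def make_sample_inputs(indir, count=5):
    """Write count small ASCII text files into indir and return their names."""
    os.makedirs(indir, exist_ok=True)
    names = []
    for i in range(count):
        name = 'in_%04d.txt' % i
        with open(os.path.join(indir, name), 'w', encoding='ascii') as f:
            f.write('sample input number %d\n' % i)
        names.append(name)
    return names
```

Calling `make_sample_inputs('infiles', 100)` from ~/projects/test would create 100 input files. Since worker converts its input to uppercase, each output file should contain the corresponding line in capitals.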

Submit a batch of jobs, replacing user_id with the user ID you noted earlier:

bin/submit_batch user_id worker infiles

This will submit the jobs, one per input file. Go to your BOINC client and update the test project; it should start downloading and processing the jobs.

To monitor the progress of the jobs, log in to the project web site. The 'Computing' menu will contain a 'Job submission' item. Select that.

You'll see a list of batches you've submitted. The one you just submitted will probably still be in progress. You can see the status of its jobs.

Reload the page occasionally. When all the jobs are completed, the batch will move to 'Completed batches'. At that point the output files will be on the server in ~/projects/test/sample_results/<batch_id>.

Retrying failed jobs

You may find that some jobs fail (VirtualBox occasionally fails for no apparent reason).

You can tell BOINC to retry failed jobs, up to an app-specific limit. This is specified in the app's input template. In this case, add the following to test/templates/worker_in, inside the <workunit> element:

    <max_error_results>3</max_error_results>
    <max_total_results>4</max_total_results>

This retries jobs for that app up to 3 times.
