---
title: How can I distribute a batch of jobs?
category: faq
tags: [distcomp]
redirect_from:
  - /faq/how_can_i_distribute_a_batch_of_jobs/
---

# How can I distribute a batch of jobs?

Peer distributed computing allows you to have multiple computers running the same analysis at the same time, which speeds up your entire analysis considerably. But how can you distribute your jobs?

Let's say we have an analysis script and we want to run that particular analysis for every subject, every condition (say we have 2), every event (again 2), and 5 different frequencies of interest. The analysis script should receive these parameters as input and then do its computation. For example, such an analysis script could look like this:

    function myanalysis(input)

    % our study parameters
    myconds    = {'A','B'};
    myevents   = {'con','incon'};
    myfreqs    = [10:10:50]; % 10, 20, 30, 40, and 50 Hz

    % construct the name of the matfile and of the variable we need
    inputfile = ['/home/mystorage/', input.subj, '_data.mat']; % the matfile which contains the variables
    variable  = [myconds{input.cond}, '_', myevents{input.event}, '_', num2str(myfreqs(input.freq))];
    tmp = load(inputfile, variable); % only load the variable we need

    % perform the analysis
    data = analyze(tmp.(variable));

    % store the result in a matfile with a matching name
    output = ['/home/mystorage/', input.subj, '_newdata', variable, '_.mat'];
    save(output, 'data');
    clear
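
To check that the function behaves as expected, you can call it once by hand with a single job specification; the field names correspond to what the distribution script further below puts in the joblist. This is just a minimal sketch, assuming that '/home/mystorage/subject01_data.mat' and the (here hypothetical) analyze function exist:

    % hypothetical single job: subject01, condition 'A', event 'incon', 30 Hz
    job = struct('subj', 'subject01', 'cond', 1, 'event', 2, 'freq', 3);
    myanalysis(job); % should write /home/mystorage/subject01_newdataA_incon_30_.mat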

This script simply performs the analysis for the parameters it receives. Next we need to distribute this script with different parameters (we don't want multiple computers to do exactly the same work, of course). Here is an example of what such a job distribution script (all subjects, 2 conditions, 2 events, 5 frequencies) could look like. Basically, in our example we have 4 (2 conditions * 2 events) different task-parameter combinations, repeated for 5 different frequencies:

    % our study parameters
    myconds    = {'A','B'};
    myevents   = {'con','incon'};
    myfreqs    = [10:10:50];

    % list subjects
    mysubjs = {'subject01', 'subject02', 'subject03'};
    nsub = 3;

    ctr = 0;
    for s = 1:nsub % all subjects

      for i = 1:20 % 2 conditions * 2 events * 5 frequencies

        % switch condition every 2 jobs
        if (i >  2 && i <=  4) || (i >  6 && i <=  8) || (i > 10 && i <= 12) ...
            || (i > 14 && i <= 16) || (i > 18 && i <= 20)
          cond = 2;
        else
          cond = 1;
        end

        % switch event every job
        if floor(i/2) == i/2 % when i is an even number
          event = 2;
        else                 % when i is an odd number
          event = 1;
        end

        % switch frequency every 4 jobs
        if i > 0 && i <= 4
          freq = 1;
        elseif i > 4 && i <= 8
          freq = 2;
        elseif i > 8 && i <= 12
          freq = 3;
        elseif i > 12 && i <= 16
          freq = 4;
        elseif i > 16 && i <= 20
          freq = 5;
        end

        % check whether the output file already exists
        variable = [myconds{cond}, '_', myevents{event}, '_', num2str(myfreqs(freq))];
        output   = dir(['/home/mystorage/', mysubjs{s}, '_newdata', variable, '_.mat']);

        if isempty(output) % if the matfile does not yet exist, then add to the joblist
          ctr = ctr + 1;
          input{ctr}.cond  = cond;
          input{ctr}.event = event;
          input{ctr}.freq  = freq;
          input{ctr}.subj  = mysubjs{s};
        end
        clear output;
      end
    end

    % distribute and assume one job requires 2 GB of memory and 1 hour of CPU time
    % note: one should test locally before using peercellfun, see below
    peercellfun(@myanalysis, input, 'memreq', 2*(1024^3), 'timreq', 1*3600)
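
As the last comment suggests, it is a good idea to run one or two jobs locally before distributing them. Since 'myanalysis' does not return any output but writes its result to disk, the simplest local test is to call it directly on one element of the joblist, which is exactly what peercellfun will later do for every element. A minimal sketch:

    % run the first job locally as a sanity check before distributing
    myanalysis(input{1});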

Our joblist (i.e. 'input') should now contain 60 jobs (20 different settings * 3 subjects), which are distributed over the computer cluster. Having 'myanalysis' load and save the data itself is memory efficient: none of the output is returned to the MATLAB workspace, it is stored on disk instead.
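
Once all jobs have finished, the results can be read back from disk in a separate loop. A minimal sketch, assuming the same storage directory and naming scheme as in the examples above:

    % collect the saved results from disk after all jobs are done
    d = dir('/home/mystorage/*_newdata*_.mat');
    results = cell(1, numel(d));
    for k = 1:numel(d)
      tmp = load(fullfile('/home/mystorage', d(k).name)); % each file contains a variable 'data'
      results{k} = tmp.data;
    end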