
Workflows


Harvester supports various workflows with push/pull, heartbeat suppression, job late-binding, and one-to-many mapping between jobs and workers. The workflow can be configured per queue. This page explains how each workflow works.

1-to-1 and pull

Workers are submitted to scheduling systems without jobs. Each worker pulls a job directly from PanDA once it gets compute resources. Workers send heartbeats directly to PanDA.

1-to-1 and push

Jobs are prefetched and bound to workers before the workers are submitted to scheduling systems. Each worker has one job. Harvester sends heartbeats for jobs on behalf of workers.

1-to-1 and push and heartbeat suppression

Jobs are prefetched and bound to workers before the workers are submitted to scheduling systems. After that, the workers send heartbeats by themselves, so Harvester's heartbeats are suppressed.

1-to-1 and push and job late-binding

Jobs are prefetched while workers are submitted to scheduling systems asynchronously. Harvester gives one job to each worker once it gets compute resources. Typically this workflow is used for HPCs with long-waiting batch queues and jumbo jobs.

1-to-many and push

This workflow is also known as the multi-worker workflow. Each job is prefetched and bound to multiple workers. Harvester collects and combines information from all workers to send heartbeats for the job. Typically this workflow is used for HPCs with backfill mode and jumbo jobs.

1-to-many and push and job late-binding

This workflow is similar to the above, but the job is given to workers once they get compute resources. Typically this workflow is used for spiky opportunistic resources with jumbo jobs.

Many-to-1 and push

This workflow is also known as the multi-job workflow. Multiple jobs are prefetched and given to a single worker. Harvester converts the worker's information to job attributes and sends them to PanDA in heartbeats. This workflow works with shared_file_messenger and simple_worker_maker as follows:

  1. nJobsPerWorker is defined in panda_queueconfig.json like
                "workerMaker": {
                        "name": "SimpleWorkerMaker",
                        "module": "pandaharvester.harvesterworkermaker.simple_worker_maker",
                        "nJobsPerWorker": 100
                },
  2. If the number of prepared jobs (nPreparedJobs) is larger than nJobsPerWorker, N = int(nPreparedJobs/nJobsPerWorker)*nJobsPerWorker jobs are taken from the DB and M = N/nJobsPerWorker workers are generated. All jobs in each worker come from the same task. For example, with nPreparedJobs=250 and nJobsPerWorker=100, N=200 jobs are taken and M=2 workers are generated.
  3. Each worker makes a directory as its access point and a sub-directory under the access point for each job. The sub-directory path is $ACCESS_POINT/$PandaID, where $PandaID is the PandaID of the job.
  4. A JobSpec file, PoolFileCatalog, and symlinks to input files for each job are placed in the corresponding sub-directory.
  5. The list of PandaIDs is dumped to a JSON file, pandaIDsFile, in the access point; the filename is defined in panda_harvester.cfg. The MPI payload should read the file so that the Nth rank gets the Nth PandaID and changes its run directory to $ACCESS_POINT/$PandaID, as sketched after this list.
  6. Monitor scans all sub-directories in the access point to look for workerAttributesFile, eventStatusDumpJsonFile, and jobReportFile, as described in this section.
  7. Each rank should report jobStatus in workerAttributesFile, which can differ from the final status of the worker. It is good practice for the final status of the worker to always be 'failed', so that 'finished' is not reported for jobs whose ranks didn't set jobStatus correctly.
  8. Stage-out is triggered individually when each rank is finished, or altogether when the worker is done.
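
As a concrete illustration of step 5, here is a minimal sketch of how an MPI payload could map ranks to PandaIDs. It assumes mpi4py, an ACCESS_POINT environment variable, and the placeholder filename pandaIDs.json (the real name is whatever pandaIDsFile is set to in panda_harvester.cfg); none of these details are prescribed by Harvester.

```python
# Minimal sketch of the rank-to-PandaID mapping described in step 5.
# Assumptions (not part of Harvester itself): the payload is launched with
# mpi4py, the access point is passed via the ACCESS_POINT environment
# variable, and the PandaID list file is named "pandaIDs.json".
import json
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

access_point = os.environ["ACCESS_POINT"]

# The messenger dumps the list of PandaIDs as a JSON list in the access point.
with open(os.path.join(access_point, "pandaIDs.json")) as f:
    panda_ids = json.load(f)

# The Nth rank takes the Nth PandaID; ranks beyond the number of jobs stay idle.
if rank < len(panda_ids):
    panda_id = panda_ids[rank]
    # Change to the per-job sub-directory $ACCESS_POINT/$PandaID and run there.
    os.chdir(os.path.join(access_point, str(panda_id)))
    # ... run the actual payload for this job and write workerAttributesFile ...
```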

A new worker maker could be implemented if nJobsPerWorker needs to be defined dynamically, e.g. for backfill; a rough sketch of such a calculation follows.
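
As a rough illustration of what such a worker maker could compute, the helper below derives nJobsPerWorker from the number of free cores reported by a backfill slot. The function name and its inputs are hypothetical and not part of Harvester; a real implementation would live in a new worker-maker plugin configured via the workerMaker block shown above.

```python
# Hypothetical helper for a dynamic multi-job worker maker. Nothing here is
# part of Harvester; it only illustrates the kind of calculation a new
# worker maker could perform for backfill.

def dynamic_n_jobs_per_worker(n_free_cores, cores_per_job, max_jobs_per_worker=100):
    """Pack as many jobs as fit into the currently free backfill slot.

    n_free_cores: cores reported free by the resource (assumed input)
    cores_per_job: cores one job needs
    max_jobs_per_worker: hard cap, analogous to the static nJobsPerWorker
    """
    if cores_per_job <= 0:
        raise ValueError("cores_per_job must be positive")
    n_jobs = n_free_cores // cores_per_job
    return max(1, min(n_jobs, max_jobs_per_worker))


# Example: 3200 free cores and 8-core jobs -> 100 jobs bound to one worker.
print(dynamic_n_jobs_per_worker(3200, 8))
```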
