Skip to content

Workflows

Tadashi Maeno edited this page Nov 15, 2017 · 39 revisions

Harvester supports various workflows with push/pull, heartbeat suppression, job late-binding, and one to many mapping between jobs and workers. It is possible to configure workflow per queue. This page explains how each workflow works.

1-to-1 and pull

Workers are submitted to scheduling systems without jobs. Each worker pulls a job directly from panda once the worker gets compute resources. Workers directly send heartbeats to panda.

1-to-1 and push

Jobs are prefetched and bound to workers before being submitted to scheduling systems. Each worker has one job. Harvester sends heartbeats for jobs on behalf of workers.

1-to-1 and push and heartbeat suppression

Jobs are prefetched and bound to workers before being submitted to scheduling systems. After that workers send heartbeats by themselves.

1-to-1 and push and job late-binding

Jobs are prefetched while workers are submitted to scheduling systems asynchronously. Harvester gives one job to each worker once it gets compute resources. Typically this workflow is used for HPCs with long waiting batch queues and jumbo jobs.

1-to-many and push

This workflow is aka multi-worker workflow. Each job is prefetched to be bound to multiple workers. Harvester collects and combine information from all workers to send heartbeats for the job. Typically this workflow is used for HPCs with backfill mode and jumbo jobs.

1-to-many and push and job late-binding

This workflow is similar to the above, but the job is given to workers once they get compute resources. Typically this workflow is used for spiky opportunistic resources with jumbo jobs.

Many-to-1 and push

This workflow is aka multi-job workflow. Multiple jobs are prefetched to be given to a single worker. Harvester converts worker's information to job attributes and send them to Panda in heartbeats. This workflow works with shared_file_messenger and simple_worker_maker as follows:

  1. nJobsPerWorker is defined in panda_queueconfig.json like
                "workerMaker": {
                        "name": "SimpleWorkerMaker",
                        "module": "pandaharvester.harvesterworkermaker.simple_worker_maker",
                        "nJobsPerWorker": 100
                },
  1. If the number of prepared jobs (nPreparedJobs) is larger than nJobsPerWorker, N (int(nPreparedJobs/nJobsPerWorker)*nJobsPerWorker) jobs are taken from the DB and M (N/nJobsPerWorker) workers are generated. All jobs in each worker come from the same task by default. If the queue has "allowJobMixture":true in queueu_config, one worker is generated for jobs from multiple tasks.
  2. Each worker makes a directory as an access point and makes a sub directory under the access point for each job. The sub directory path is $ACCESS_POINT/$PandaID where $PandaID is the PandaID of the job.
  3. JobSpec file, PoolFileCatalog, and symlinks to input files for each job are placed to the corresponding sub directory.
  4. The list of PandaIDs is dumped to a json file pandaIDsFile in the access point. The filename is defined in panda_harvester.cfg. The MPI payload should read the file so that Nth rank gets Nth PandaID and changes the run directory to $ACCESS_POINT/$PandaID.
  5. Monitor scans all sub directories in the access point to look for workerAttributesFile, eventStatusDumpJsonFile and jobReportFile as described in this section.
  6. Stage-out is triggered for each rank when the tank produces eventStatusDumpJsonFile.
  7. Each rank should report jobStatus in workerAttributesFile which can different from the final status of the worker. It would be a good practice that the final status of the worker is always 'failed', not to report 'finished' for some jobs when those ranks didn't set jobStatus correctly.
  8. Jobs go to finished individually when each rank reports the final jobStatus in workerAttributesFile or altogether when the worker is done.

There could be a new worker_maker if nJobsPerWorker needs to be dynamically defined, e.g. for backfill etc.

Job late-binding and push

This workflow works in combination with other workflows like OneToOne and ManyToOne. In the pull model jobs are dispatched from panda once CPUs become available. On the other hand, with job late-binding and the push model harvester prefetches jobs, i.e., jobs are already dispatched from panda before CPUs become available, and then sends jobs to CPUs once they become available. Typically this is useful for HPCs

  • to fill nodes which became idle since execution time of jobs was shorter than the time length of the batch slot, and/or
  • to run a variety of jobs, which need different runtime environments, in a single batch slot.

Note that with this workflow harvester submit vanilla payloads to the batch system. The payload needs a capability to switch runtime environment and run a new job on HPC.

Clone this wiki locally