Installation

Chris Churas edited this page Mar 7, 2016 · 1 revision

This page contains information on installing and configuring Panfish.

Requirements

  • Linux-based operating system with the rsync, ssh, and time commands
  • Perl with Test::More (perl-Test-Simple) and ExtUtils::MakeMaker (perl-ExtUtils-MakeMaker) installed
  • Sun/Oracle Grid Engine 6.1+ or Open Grid Scheduler installed and configured properly
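As a quick sanity check before building, the prerequisites above can be verified with a short script (a sketch, not part of Panfish):

```shell
#!/bin/sh
# Quick prerequisite check: report which of the required commands
# and Perl modules are available on this host.
for cmd in rsync ssh time perl; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "found:   $cmd"
    else
        echo "MISSING: $cmd"
    fi
done
# The two Perl modules needed by the build
for mod in Test::More ExtUtils::MakeMaker; do
    if perl -M"$mod" -e 1 >/dev/null 2>&1; then
        echo "found:   $mod"
    else
        echo "MISSING: $mod"
    fi
done
```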

Installation

Installing the Panfish application is straightforward: download the source tree, unzip it, and cd into the Panfish main directory (where Makefile.PL resides).

Then do the following:

perl Makefile.PL
make
make test
# the command below may require superuser privileges
make install

The above will install the Panfish application, but will NOT configure it. Configuration is explained below.

Configuration

Panfish requires a configuration file along with a couple of directories where job templates and the job database can reside. All of these paths must be visible to all nodes on the local OGS cluster.

The following instructions will set up Panfish to run jobs on the local cluster as well as the Comet cluster.

Step 1 Create jobs database and templates directories

Create the directories by running the commands below:

mkdir -p /home/<PUT YOUR USERNAME HERE>/panfish/templates
mkdir -p /home/<PUT YOUR USERNAME HERE>/panfish/jobs

Step 2 Create panfish.config configuration file

Panfish looks in the following locations, in this order, for configuration files. Settings found in later files override those in earlier ones:

/etc/panfish.config
<Panfish Bin Directory>/../etc/panfish.config
<Panfish Bin Directory>/panfish.config
~/.panfish.config
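A quick way to see which of these files exist on a host is a short loop (a sketch; PANFISH_BIN is an assumption, set it to the directory containing the installed Panfish binaries):

```shell
#!/bin/sh
# List the Panfish config locations in precedence order; later files
# override earlier ones.  PANFISH_BIN is an assumed install location.
PANFISH_BIN=/usr/local/bin
for f in /etc/panfish.config \
         "$PANFISH_BIN/../etc/panfish.config" \
         "$PANFISH_BIN/panfish.config" \
         "$HOME/.panfish.config"; do
    if [ -f "$f" ]; then
        echo "present: $f"
    else
        echo "absent:  $f"
    fi
done
```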

Create a ~/.panfish.config file and put the following text in it:

Note: Be sure to replace the <...> placeholders in the config file with valid values.

# Tells Panfish which cluster it is running on
this.cluster=local_shadow.q

# Comma delimited list of clusters that can run jobs
cluster.list=local_shadow.q,comet_shadow.q

#
# local cluster configuration
#

# Scheduler on local cluster, right now has to be SGE
local_shadow.q.engine=SGE

# For remote clusters this is the directory where
# panfishchum pushes data to
local_shadow.q.basedir=/home/<PUT YOUR USERNAME HERE>/panfish/shadow

# Path to job database
local_shadow.q.database.dir=/home/<PUT YOUR USERNAME HERE>/panfish/jobs

# Path to job template directory
local_shadow.q.job.template.dir=/home/<PUT YOUR USERNAME HERE>/panfish/templates

# Full path to qsub command 
local_shadow.q.submit=/opt/ge2011.11/bin/linux-x64/qsub

# Full path to qstat command
local_shadow.q.stat=/opt/ge2011.11/bin/linux-x64/qstat

# Bin dir for panfish 
local_shadow.q.bin.dir=/usr/local/bin

# Limits number of concurrent running jobs
local_shadow.q.max.num.running.jobs=50

# Adds delay in seconds where panfish sleeps after a submit
local_shadow.q.submit.sleep=1

# Directory jobs can use as scratch space
local_shadow.q.scratch=/tmp

# Number of jobs to run concurrently per node
local_shadow.q.jobs.per.node=1

# Delay in seconds to wait before submitting batched jobs
local_shadow.q.job.batcher.override.timeout=300

# Delay in seconds panfishline should wait between checking job database
local_shadow.q.line.sleep.time=180

# Directory where panfishline log files should be written, set to 
# /dev/null to not write a log file
local_shadow.q.line.stdout.path=/dev/null

# panfishline log verbosity 0=no logging, 1=some logging, 2=lots of logging
local_shadow.q.line.log.verbosity=1

# Number of retries panfishland should attempt to download files
local_shadow.q.land.max.retries=10

# Delay in seconds between retries
local_shadow.q.land.wait=100

# Timeout in seconds passed to rsync
local_shadow.q.land.rsync.timeout=180

# Connect timeout in seconds passed to rsync
local_shadow.q.land.rsync.contimeout=100

# panfish log verbosity 0=no logging, 1=some logging, 2=lots of logging
local_shadow.q.panfish.log.verbosity=1

# panfishsubmit log verbosity 0=no logging, 1=some logging, 2=lots of logging
local_shadow.q.panfishsubmit.log.verbosity=1

# panfish delay in seconds between checking database
local_shadow.q.panfish.sleep=60
local_shadow.q.io.retry.count=2
local_shadow.q.io.retry.sleep=5
local_shadow.q.io.timeout=30
local_shadow.q.io.connect.timeout=30
local_shadow.q.job.account=
local_shadow.q.job.walltime=168:00:00

#
# Comet configuration
# 
comet_shadow.q.host=<PUT COMET USERNAME HERE>@comet.sdsc.edu
comet_shadow.q.engine=SLURM
comet_shadow.q.basedir=/oasis/projects/nsf/<PUT YOUR PROJECT HERE>/<PUT COMET USERNAME HERE>
comet_shadow.q.database.dir=/home/<PUT COMET USERNAME HERE>/comet/panfish/jobs
comet_shadow.q.submit=/usr/bin/sbatch
comet_shadow.q.stat=/usr/bin/squeue -u <PUT COMET USERNAME HERE>
comet_shadow.q.bin.dir=/home/<PUT COMET USERNAME HERE>/comet/panfish/bin
comet_shadow.q.max.num.running.jobs=50
comet_shadow.q.submit.sleep=1
comet_shadow.q.scratch=`/bin/ls /scratch/$USER/[0-9]* -d`
comet_shadow.q.jobs.per.node=24
comet_shadow.q.job.batcher.override.timeout=60
comet_shadow.q.panfish.log.verbosity=2
comet_shadow.q.panfishsubmit.log.verbosity=1
comet_shadow.q.panfish.sleep=60
comet_shadow.q.io.retry.count=2
comet_shadow.q.io.retry.sleep=5
comet_shadow.q.io.timeout=30
comet_shadow.q.io.connect.timeout=30
comet_shadow.q.job.account=<PUT YOUR PROJECT HERE>
comet_shadow.q.job.walltime=12:00:00

Note: Be sure to replace the <...> placeholders in the config file with valid values:

  • <PUT YOUR USERNAME HERE> - your unix username
  • <PUT YOUR PROJECT HERE> - your project, as shown by show_accounts run on Comet
  • <PUT COMET USERNAME HERE> - your Comet username
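If you prefer not to edit the placeholders by hand, a sed one-liner can fill in the common ones (a hypothetical helper, not part of Panfish; COMET_USER and PROJECT are assumptions you must set yourself):

```shell
#!/bin/sh
# Hypothetical helper: fill in the <...> placeholders in ~/.panfish.config.
# COMET_USER and PROJECT are assumptions -- set them to your Comet username
# and allocation before running.
COMET_USER=yourcometuser
PROJECT=yourproject
CFG="$HOME/.panfish.config"
[ -f "$CFG" ] || { echo "no $CFG found; nothing to do"; exit 0; }
sed -i.bak \
    -e "s|<PUT YOUR USERNAME HERE>|$USER|g" \
    -e "s|<PUT COMET USERNAME HERE>|$COMET_USER|g" \
    -e "s|<PUT YOUR PROJECT HERE>|$PROJECT|g" \
    "$CFG"
echo "placeholders replaced; original saved as $CFG.bak"
```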

Step 3 Create job template files

Under the ~/panfish/templates directory, create a local_shadow.q file and put the following text in it:

#!/bin/sh
#
# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -V
#$ -wd @PANFISH_JOB_CWD@
#$ -o @PANFISH_JOB_STDOUT_PATH@
#$ -e @PANFISH_JOB_STDERR_PATH@
#$ -N @PANFISH_JOB_NAME@
#$ -q all.q
#$ -l h_rt=@PANFISH_WALLTIME@
echo "SGE Id:  ${JOB_ID}.${SGE_TASK_ID}"
/usr/bin/time -p @PANFISH_RUN_JOB_SCRIPT@ @PANFISH_JOB_FILE@

NOTE: The above template assumes the local cluster queue is all.q; if that is not correct, set it to the queue that local jobs should run under.

Under the ~/panfish/templates directory, create a comet_shadow.q file and put the following text in it:

#!/bin/sh
#
#SBATCH -D @PANFISH_JOB_CWD@
#SBATCH -A @PANFISH_ACCOUNT@
#SBATCH -o @PANFISH_JOB_STDOUT_PATH@
#SBATCH -e @PANFISH_JOB_STDERR_PATH@
#SBATCH -J @PANFISH_JOB_NAME@
#SBATCH -p compute
#SBATCH -t @PANFISH_WALLTIME@
#SBATCH --nodes=1
#SBATCH --export=SLURM_UMASK=0022
/usr/bin/time -p @PANFISH_RUN_JOB_SCRIPT@ @PANFISH_JOB_FILE@

Example templates reside in the Panfish source tree under the templates directory.

Step 4 Configure ssh

Enable passwordless ssh to Comet from the host(s) that will be calling panfishland, panfishcast, or panfish.

The safest route is to generate an ssh key and then use ssh-agent. Once set up, this should work without a password prompt:

$ ssh comet.sdsc.edu
Last login: Mon Feb  1 16:16:30 2016 from 127.0.0.1
Rocks 6.2 (SideWinder)
Profile built 08:51 13-Dec-2015

Kickstarted 09:35 13-Dec-2015
                                                                       
                      WELCOME TO 
      __________________  __  _______________
        -----/ ____/ __ \/  |/  / ____/_  __/
          --/ /   / / / / /|_/ / __/   / /
           / /___/ /_/ / /  / / /___  / /
           \____/\____/_/  /_/_____/ /_/
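The key generation and agent setup mentioned above can be done with standard OpenSSH commands; a sketch (all three steps are interactive and prompt for input):

```shell
# Generate a key pair (accept the default path; a passphrase is
# recommended, since ssh-agent will cache it for the session)
ssh-keygen -t rsa

# Install the public key on Comet (prompts for your Comet password once);
# replace the placeholder with your Comet username
ssh-copy-id <PUT COMET USERNAME HERE>@comet.sdsc.edu

# Start an agent for this shell and cache the key
eval "$(ssh-agent -s)"
ssh-add
```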

Step 5 Initialize database and install Panfish on remote clusters

To initialize the database simply run this command:

panfishsetup --setupdball

To install panfish on remote clusters run this command (assumes ssh has been configured as described in Step 4):

panfishsetup --syncall

Step 6 Configure cron job on local node

The panfish daemon is set up to run as a periodic cron job. The instructions below show how to add the command to cron.

To edit cron:

crontab -e

Add the following line to the crontab; in the vi interface, save changes by hitting the escape key and typing :wq

*/5 * * * * /usr/bin/panfish --cron >> /home/<PUT YOUR USERNAME HERE>/panfish/panfish.log 2>&1
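Alternatively, the entry can be appended non-interactively (a sketch, not from the Panfish docs; it uses $HOME where the example above uses /home/<PUT YOUR USERNAME HERE>, which is equivalent on most systems):

```shell
#!/bin/sh
# Append the Panfish entry to the current crontab without opening an
# editor, preserving any existing crontab lines.  $HOME is kept literal
# here; cron expands it when the command runs.
command -v crontab >/dev/null 2>&1 || { echo "crontab not available"; exit 0; }
ENTRY='*/5 * * * * /usr/bin/panfish --cron >> $HOME/panfish/panfish.log 2>&1'
( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab -
crontab -l | grep 'panfish --cron'
```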

Step 7 Configure shadow queues on local cluster

Create local_shadow.q
qconf -aq local_shadow.q

The above command will bring up an editor with the following text:

qname                 local_shadow.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
  • Set hostlist to @allhosts or to a group of hosts that can run the shadow panfishline jobs

Hit escape key then :wq to save changes and exit editor.

Create comet_shadow.q
qconf -aq comet_shadow.q

The above command will bring up an editor with the following text:

qname                 comet_shadow.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
  • Set hostlist to @allhosts or to a group of hosts that can run the shadow panfishline jobs

Hit escape key then :wq to save changes and exit editor.

If done successfully, issuing qstat -g c should return output like this:

$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
all.q                             0.01      0      0      1      1      0      0 
comet_shadow.q                    0.01      0      0      1      1      0      0 
local_shadow.q                    0.01      0      0      1      1      0      0 

Step 8 Configure cron job on remote cluster Comet

In addition to the local cron job that runs Panfish periodically, an instance of Panfish must also run periodically on the remote clusters (i.e. Comet). The purpose of this remote Panfish is to actually schedule the jobs on the remote cluster. For Comet, simply add a cron entry that runs, say, every 15 minutes.

To edit cron:

crontab -e

Add the following line to the crontab; in the vi interface, save changes by hitting the escape key and typing :wq

*/15 * * * * /home/<PUT COMET USERNAME HERE>/comet/panfish/bin/panfish --cron >> /home/<PUT COMET USERNAME HERE>/comet/panfish/bin/panfish.log 2>&1

Step 9 Run a test job

First, create a foo directory in a location visible to all nodes on the local cluster:

mkdir ~/foo

Create a test.sh script in the foo directory with the following contents:

#!/bin/bash
echo "Hello World from `hostname` under the path `pwd`"
echo "PANFISH_BASEDIR = $PANFISH_BASEDIR"
echo "JOB_ID = $JOB_ID"
echo "SGE_TASK_ID = $SGE_TASK_ID"
sleep 1
exit 0

Make test.sh executable

chmod a+x ~/foo/test.sh

Upload foo to remote clusters

cd ~/foo
panfishchum --path `pwd`
Examining ... /home/(your username)/foo ... done.  Took 0 seconds.
Found 185 bytes in 1 files

Skipping local_shadow.q cause this program is running on this cluster
Uploading to comet_shadow.q ... done.  Transfer took 1 seconds.  Rate: 0.00 Mb/sec.

Run job via panfishcast

panfishcast -N hi -t 1-2 -e `pwd`/\$TASK_ID.err -o `pwd`/\$TASK_ID.out -q local_shadow.q `pwd`/test.sh