Installation
This page contains information on installing and configuring Panfish.
- Linux based operating system with rsync, ssh, and time commands
- Perl with Test::More (perl-Test-Simple) and ExtUtils::MakeMaker (perl-ExtUtils-MakeMaker) installed
- Sun/Oracle Grid Engine 6.1+ or Open Grid Scheduler installed and configured properly
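The requirements above can be checked before installing. The following is a minimal sketch and is not part of Panfish: the helper names `have_cmd`, `have_perl_module`, and `report_missing` are hypothetical, and it assumes the scheduler's `qsub` and `qstat` commands and `make` are expected on the PATH.

```shell
#!/bin/sh
# Hypothetical pre-install check; not shipped with Panfish.

# Succeeds if the named command can be found on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds if perl can load the named module.
have_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Prints one line per missing prerequisite; prints nothing if all are present.
report_missing() {
  for cmd in rsync ssh perl make qsub qstat; do
    have_cmd "$cmd" || echo "missing command: $cmd"
  done
  # The job templates on this page call /usr/bin/time by its full path.
  [ -x /usr/bin/time ] || echo "missing command: /usr/bin/time"
  for mod in Test::More ExtUtils::MakeMaker; do
    have_perl_module "$mod" || echo "missing perl module: $mod"
  done
}
```

Calling `report_missing` after sourcing the snippet lists anything that still needs to be installed.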
Installing the Panfish application is straightforward: download the source tree, unzip it, and cd into the Panfish main directory (where Makefile.PL resides).
Then do the following:
perl Makefile.PL
make
make test
# the command below may require superuser privileges
make install
The above will install the Panfish application, but will NOT configure it. Configuration is explained below.
Panfish requires a configuration file along with a couple of directories where job templates and the job database can reside. All of these paths must be visible to all nodes on the local OGS cluster.
The following instructions will set up Panfish to run jobs on the local cluster as well as the Comet cluster.
Create the directories by running the commands below:
mkdir -p /home/<PUT YOUR USERNAME HERE>/panfish/templates
mkdir -p /home/<PUT YOUR USERNAME HERE>/panfish/jobs
Panfish looks for configuration files in the following locations, in this order. If a later file is found, its settings override those from the earlier files:
/etc/panfish.config
/<Panfish Bin Directory>/../etc/panfish.config
/<Panfish Bin Directory>/panfish.config
~/.panfish.config
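To illustrate the override order, here is a minimal sketch that reports which value would win for a given key. It assumes overrides apply per key (the page does not say whether Panfish merges keys or replaces whole files), and `lookup_key` and `bin_dir` are hypothetical names used only for this example.

```shell
#!/bin/sh
# Sketch of "later files override earlier ones", assuming per-key overrides.

# lookup_key KEY FILE...
# Prints the value of KEY from the last readable FILE that defines it,
# or an empty line if no file defines it.
lookup_key() {
  key=$1
  shift
  for f in "$@"; do
    if [ -r "$f" ]; then cat "$f"; fi
  done | awk -v k="$key" '
    index($0, k "=") == 1 { v = substr($0, length(k) + 2) }
    END { print v }'
}

# Example call; bin_dir is an assumed install location.
# bin_dir=/usr/local/bin
# lookup_key this.cluster /etc/panfish.config "$bin_dir/../etc/panfish.config" \
#   "$bin_dir/panfish.config" "$HOME/.panfish.config"
```

Files that do not exist are skipped, so the same call works whether or not a system-wide configuration is present.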
Create a ~/.panfish.config file and put the following text in it:
Note: Be sure to replace <...> in config file with valid values.
# Tells Panfish which cluster it is running on
this.cluster=local_shadow.q
# Comma delimited list of clusters that can run jobs
cluster.list=local_shadow.q,comet_shadow.q
#
# local cluster configuration
#
# Scheduler on local cluster, right now has to be SGE
local_shadow.q.engine=SGE
# For remote clusters this is the directory where
# panfishchum pushes data to
local_shadow.q.basedir=/home/<PUT YOUR USERNAME HERE>/panfish/shadow
# Path to job database
local_shadow.q.database.dir=/home/<PUT YOUR USERNAME HERE>/panfish/jobs
# Path to job template directory
local_shadow.q.job.template.dir=/home/<PUT YOUR USERNAME HERE>/panfish/templates
# Full path to qsub command
local_shadow.q.submit=/opt/ge2011.11/bin/linux-x64/qsub
# Full path to qstat command
local_shadow.q.stat=/opt/ge2011.11/bin/linux-x64/qstat
# Bin dir for panfish
local_shadow.q.bin.dir=/usr/local/bin
# Limits number of concurrent running jobs
local_shadow.q.max.num.running.jobs=50
# Adds delay in seconds where panfish sleeps after a submit
local_shadow.q.submit.sleep=1
# Directory jobs can use as scratch space
local_shadow.q.scratch=/tmp
# Number of jobs to run concurrently per node
local_shadow.q.jobs.per.node=1
# Delay in seconds to wait before submitting under-batched jobs
local_shadow.q.job.batcher.override.timeout=300
# Delay in seconds panfishline should wait between checking job database
local_shadow.q.line.sleep.time=180
# Directory where panfishline log files should be written, set to
# /dev/null to not write a log file
local_shadow.q.line.stdout.path=/dev/null
# panfishline log verbosity 0=no logging, 1=some logging, 2=lots and lots of logging
local_shadow.q.line.log.verbosity=1
# Number of retries panfishland should attempt to download files
local_shadow.q.land.max.retries=10
# Delay in seconds between retries
local_shadow.q.land.wait=100
# Timeout in seconds passed to rsync
local_shadow.q.land.rsync.timeout=180
# Connect timeout in seconds passed to rsync
local_shadow.q.land.rsync.contimeout=100
# panfish log verbosity 0=no logging, 1=some logging, 2=lots and lots of logging
local_shadow.q.panfish.log.verbosity=1
# panfishcast log verbosity 0=no logging, 1=some logging, 2=lots and lots of logging
local_shadow.q.panfishsubmit.log.verbosity=1
# panfish delay in seconds between checking database
local_shadow.q.panfish.sleep=60
local_shadow.q.io.retry.count=2
local_shadow.q.io.retry.sleep=5
local_shadow.q.io.timeout=30
local_shadow.q.io.connect.timeout=30
local_shadow.q.job.account=
local_shadow.q.job.walltime=168:00:00
#
# Comet configuration
#
comet_shadow.q.host=<PUT COMET USERNAME HERE>@comet.sdsc.edu
comet_shadow.q.engine=SLURM
comet_shadow.q.basedir=/oasis/projects/nsf/<PUT YOUR PROJECT HERE>/<PUT COMET USERNAME HERE>
comet_shadow.q.database.dir=/home/<PUT COMET USERNAME HERE>/comet/panfish/jobs
comet_shadow.q.submit=/usr/bin/sbatch
comet_shadow.q.stat=/usr/bin/squeue -u <PUT COMET USERNAME HERE>
comet_shadow.q.bin.dir=/home/<PUT COMET USERNAME HERE>/comet/panfish/bin
comet_shadow.q.max.num.running.jobs=50
comet_shadow.q.submit.sleep=1
comet_shadow.q.scratch=`/bin/ls /scratch/$USER/[0-9]* -d`
comet_shadow.q.jobs.per.node=24
comet_shadow.q.job.batcher.override.timeout=60
comet_shadow.q.panfish.log.verbosity=2
comet_shadow.q.panfishsubmit.log.verbosity=1
comet_shadow.q.panfish.sleep=60
comet_shadow.q.io.retry.count=2
comet_shadow.q.io.retry.sleep=5
comet_shadow.q.io.timeout=30
comet_shadow.q.io.connect.timeout=30
comet_shadow.q.job.account=<PUT YOUR PROJECT HERE>
comet_shadow.q.job.walltime=12:00:00
The <...> placeholders in the config file refer to the following:
- <PUT YOUR USERNAME HERE> - Refers to your unix username on the local cluster
- <PUT YOUR PROJECT HERE> - Refers to the project as reported by the show_accounts command, which can be run on Comet
- <PUT COMET USERNAME HERE> - Comet username
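A forgotten placeholder is easy to miss in a config of this size. The following is a small sketch for spotting any that remain; `unreplaced_placeholders` is a hypothetical helper name and not a Panfish command.

```shell
#!/bin/sh
# unreplaced_placeholders FILE
# Prints "line-number:text" for every line that still contains a
# <PUT ... HERE> marker. Returns non-zero when no markers remain.
unreplaced_placeholders() {
  grep -n '<PUT [^>]* HERE>' "$1"
}

# Example use:
# if unreplaced_placeholders "$HOME/.panfish.config"; then
#   echo "Replace the placeholders listed above before continuing"
# fi
```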
Under the ~/panfish/templates directory, create a local_shadow.q file and put the following text in it:
#!/bin/sh
#
# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -V
#$ -wd @PANFISH_JOB_CWD@
#$ -o @PANFISH_JOB_STDOUT_PATH@
#$ -e @PANFISH_JOB_STDERR_PATH@
#$ -N @PANFISH_JOB_NAME@
#$ -q all.q
#$ -l h_rt=@PANFISH_WALLTIME@
echo "SGE Id: ${JOB_ID}.${SGE_TASK_ID}"
/usr/bin/time -p @PANFISH_RUN_JOB_SCRIPT@ @PANFISH_JOB_FILE@
NOTE: The above template assumes the local cluster queue is all.q. If that is not correct, set it to the queue that local jobs should run under.
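Panfish fills in the @...@ tokens itself when it submits a job. To see what a filled-in template might look like, the sketch below substitutes sample values for three of the tokens with sed. This only previews the result and is not how Panfish performs the substitution; `preview_template` and the sample values are hypothetical.

```shell
#!/bin/sh
# preview_template FILE CWD NAME WALLTIME
# Prints FILE with three template tokens replaced by the given values.
# The values must not contain the characters | & or backslash,
# because they are pasted directly into sed expressions.
preview_template() {
  sed -e "s|@PANFISH_JOB_CWD@|$2|g" \
      -e "s|@PANFISH_JOB_NAME@|$3|g" \
      -e "s|@PANFISH_WALLTIME@|$4|g" "$1"
}

# Example use with made-up values:
# preview_template "$HOME/panfish/templates/local_shadow.q" /tmp/foo hi 12:00:00
```

Tokens that are not listed, such as @PANFISH_JOB_FILE@, are left untouched in the output.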
Under the ~/panfish/templates directory, create a comet_shadow.q file and put the following text in it:
#!/bin/sh
#
#SBATCH -D @PANFISH_JOB_CWD@
#SBATCH -A @PANFISH_ACCOUNT@
#SBATCH -o @PANFISH_JOB_STDOUT_PATH@
#SBATCH -e @PANFISH_JOB_STDERR_PATH@
#SBATCH -J @PANFISH_JOB_NAME@
#SBATCH -p compute
#SBATCH -t @PANFISH_WALLTIME@
#SBATCH --nodes=1
#SBATCH --export=SLURM_UMASK=0022
/usr/bin/time -p @PANFISH_RUN_JOB_SCRIPT@ @PANFISH_JOB_FILE@
Example templates reside in the Panfish source tree under the templates directory.
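Each cluster named in cluster.list gets a template file of the same name in the steps above. As a rough check that none was forgotten, the following sketch lists clusters without a matching template. The helper names are hypothetical, and the one-template-per-cluster naming is inferred from this page rather than from the Panfish source.

```shell
#!/bin/sh
# list_clusters CONFIG
# Prints one cluster name per line from the cluster.list setting.
list_clusters() {
  awk -F= '$1 == "cluster.list" { print $2 }' "$1" | tr ',' '\n'
}

# missing_templates CONFIG TEMPLATE_DIR
# Prints the name of every cluster that has no template file in TEMPLATE_DIR.
missing_templates() {
  list_clusters "$1" | while read -r cluster; do
    [ -n "$cluster" ] || continue
    [ -f "$2/$cluster" ] || echo "$cluster"
  done
}

# Example use:
# missing_templates "$HOME/.panfish.config" "$HOME/panfish/templates"
```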
Enable passwordless ssh to Comet from the host(s) that will be calling panfishland, panfishcast, or panfish.
The safest route is to generate an ssh key and then use ssh-agent. Once set up, this should work without a password prompt:
$ ssh comet.sdsc.edu
Last login: Mon Feb 1 16:16:30 2016 from 127.0.0.1
Rocks 6.2 (SideWinder)
Profile built 08:51 13-Dec-2015
Kickstarted 09:35 13-Dec-2015
WELCOME TO
__________________ __ _______________
-----/ ____/ __ \/ |/ / ____/_ __/
--/ / / / / / /|_/ / __/ / /
/ /___/ /_/ / / / / /___ / /
\____/\____/_/ /_/_____/ /_/
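Because the Panfish commands run from cron, it helps to confirm that ssh succeeds without any prompt at all. The sketch below uses the OpenSSH options BatchMode (fail instead of asking for a password) and ConnectTimeout; the function name `can_ssh_without_prompt` is hypothetical.

```shell
#!/bin/sh
# can_ssh_without_prompt HOST
# Succeeds only if ssh can log in to HOST and run a command
# without prompting for a password or passphrase.
can_ssh_without_prompt() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Example use (replace the placeholder first):
# can_ssh_without_prompt "<PUT COMET USERNAME HERE>@comet.sdsc.edu" \
#   && echo "passwordless ssh works"
```

Note that cron jobs do not inherit an interactive shell's ssh-agent by default, so a check run from a login shell may pass while the cron job still fails.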
To initialize the database simply run this command:
panfishsetup --setupdball
To install panfish on remote clusters, run this command (assumes passwordless ssh has been configured as described above):
panfishsetup --syncall
The panfish daemon is set up to run as a periodic cron job. The instructions below show how to add the command to cron.
To edit cron:
crontab -e
Add the following text to cron. If the editor is vi, save the changes by pressing the Escape key and typing :wq
*/5 * * * * /usr/bin/panfish --cron >> /home/<PUT YOUR USERNAME HERE>/panfish/panfish.log 2>&1
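The same entry can be added without opening an editor. The following is a sketch that appends a line to an existing crontab only if it is not already there; `append_cron_entry` is a hypothetical helper, and the final pipeline into `crontab -` replaces the whole crontab with the filtered output, so review it before use.

```shell
#!/bin/sh
# append_cron_entry ENTRY
# Reads an existing crontab on stdin and prints it with ENTRY added as the
# last line, unless an identical line is already present.
append_cron_entry() {
  entry=$1
  existing=$(cat)
  if printf '%s\n' "$existing" | grep -Fqx -- "$entry"; then
    printf '%s\n' "$existing"
  elif [ -n "$existing" ]; then
    printf '%s\n%s\n' "$existing" "$entry"
  else
    printf '%s\n' "$entry"
  fi
}

# Example use (replace the placeholder first):
# line='*/5 * * * * /usr/bin/panfish --cron >> /home/<PUT YOUR USERNAME HERE>/panfish/panfish.log 2>&1'
# crontab -l 2>/dev/null | append_cron_entry "$line" | crontab -
```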
qconf -aq local_shadow.q
The above command will bring up an editor with the following text:
qname local_shadow.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
- Set hostlist to @allhosts or to a group of hosts that can run the shadow panfishline jobs
Press the Escape key, then type :wq to save the changes and exit the editor.
qconf -aq comet_shadow.q
The above command will bring up an editor with the following text:
qname comet_shadow.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
- Set hostlist to @allhosts or to a group of hosts that can run the shadow panfishline jobs
Press the Escape key, then type :wq to save the changes and exit the editor.
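As an alternative to the interactive editor, Grid Engine's qconf can print a queue definition with -sq and add a queue from a file with -Aq. The sketch below copies an existing queue's definition under a new name. It is only one possible approach: `retarget_queue` is a hypothetical helper, and cloning all.q carries over its hostlist, slots, and other settings, which may differ from the defaults shown above, so review the file before adding it.

```shell
#!/bin/sh
# retarget_queue NAME
# Reads a queue definition on stdin and prints it with the qname line
# changed to NAME. All other lines pass through unchanged.
retarget_queue() {
  sed "s/^qname[[:space:]].*/qname $1/"
}

# Example use, assuming all.q exists on the local cluster:
# qconf -sq all.q | retarget_queue local_shadow.q > /tmp/local_shadow.q.conf
# qconf -Aq /tmp/local_shadow.q.conf
```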
If the queues were created successfully, qstat -g c should return output like this:
$ qstat -g c
CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
--------------------------------------------------------------------------------
all.q 0.01 0 0 1 1 0 0
comet_shadow.q 0.01 0 0 1 1 0 0
local_shadow.q 0.01 0 0 1 1 0 0
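The same check can be scripted. This sketch reads the `qstat -g c` output and reports whether a queue name appears in the first column; `queue_listed` is a hypothetical helper name.

```shell
#!/bin/sh
# queue_listed NAME
# Reads `qstat -g c` output on stdin and succeeds if a row whose
# first column equals NAME is present.
queue_listed() {
  awk -v q="$1" '$1 == q { found = 1 } END { exit !found }'
}

# Example use:
# for q in local_shadow.q comet_shadow.q; do
#   qstat -g c | queue_listed "$q" || echo "queue $q is missing"
# done
```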
In addition to the local panfish cron job that runs Panfish periodically, an instance of Panfish must also be run periodically on the remote clusters (i.e., Comet). The purpose of this remote Panfish is to schedule the jobs on the remote cluster. For Comet, add an entry to cron that runs roughly every 15 minutes.
To edit cron:
crontab -e
Add the following text to cron. If the editor is vi, save the changes by pressing the Escape key and typing :wq
*/15 * * * * /home/<PUT COMET USERNAME HERE>/comet/panfish/bin/panfish --cron >> /home/<PUT COMET USERNAME HERE>/comet/panfish/bin/panfish.log 2>&1
First, create a foo directory in a location visible to all nodes on the local cluster:
mkdir ~/foo
Create a test.sh script in the foo directory with the following contents:
#!/bin/bash
echo "Hello World from `hostname` under the path `pwd`"
echo "PANFISH_BASEDIR = $PANFISH_BASEDIR"
echo "JOB_ID = $JOB_ID"
echo "SGE_TASK_ID = $SGE_TASK_ID"
sleep 1
exit 0
Make test.sh executable
chmod a+x ~/foo/test.sh
Upload foo to remote clusters
cd ~/foo
panfishchum --path `pwd`
Examining ... /home/(your username)/foo ... done. Took 0 seconds.
Found 185 bytes in 1 files
Skipping local_shadow.q cause this program is running on this cluster
Uploading to comet_shadow.q ... done. Transfer took 1 seconds. Rate: 0.00 Mb/sec.
Run job via panfishcast
panfishcast -N hi -t 1-2 -e `pwd`/\$TASK_ID.err -o `pwd`/\$TASK_ID.out -q local_shadow.q `pwd`/test.sh