parsyncfp2, a MultiHost parallel rsync wrapper

Preamble

'parsyncfp2' initially started as a one-off bash script and moved to Perl (as 'parsync') when that became too cumbersome. I incorporated fpart and it became 'parsyncfp'. This MultiHost (MH) version allows it to run over multiple SEND hosts with shared storage to cooperatively send data much faster. Rather than appending yet more acronymic characters to the name, I differentiated it with the major version number, so … 'parsyncfp2' or 'pfp2'. In the docs and src code, you may still find references to 'parsync', 'parsyncfp', as well as various abbrev’s (why does 'abbreviation' need one?).

Introduction

'parsyncfp2' ('pfp2') is a Perl script that wraps Andrew Tridgell’s & Paul Mackerras' miraculous rsync to provide load balancing and parallel operation across network connections, substantially increasing the amount of data it can send simultaneously. 'pfp2' exploits parallel operation to decrease the impact of the TCP Round Trip Time (RTT) and so significantly increase the total bandwidth of data across networks. For more information about the variables surrounding data transfer over networks, see How to Move Data. Even on low-latency networks, it can speed large transfers by 4-10x. However, it is not effective for small transfers, since the startup overhead will reduce the effective throughput.
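To make the RTT argument concrete, here is a rough back-of-envelope sketch. It assumes the textbook window-limited bound (one TCP stream cannot exceed window / RTT) with a fixed window and no packet loss; real TCP stacks scale their windows, so treat the numbers as illustrative only. The function names are made up for this example and are not part of 'pfp2'.

```shell
# Upper bound for ONE window-limited TCP stream, in Mbit/s.
# $1 = window size in bytes, $2 = RTT in milliseconds (both assumptions).
stream_mbps() (
    awk -v w="$1" -v r="$2" \
        'BEGIN { printf "%.2f\n", (w * 8) / (r / 1000) / 1000000 }'
)

# Same bound for N parallel streams, ignoring the link's own capacity.
# $3 = number of parallel streams.
aggregate_mbps() (
    awk -v w="$1" -v r="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", n * (w * 8) / (r / 1000) / 1000000 }'
)

stream_mbps 65536 50        # 64 KiB window, 50 ms RTT -> 10.49
aggregate_mbps 65536 50 8   # 8 such streams           -> 83.89
```

Under these assumptions the bound grows linearly with the number of streams until the link or the storage becomes the limit, which is the effect parallel rsyncs exploit.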

General Features

Versions before 2 were SingleHost (SH) only and could use 10s to 100s of rsyncs to increase aggregate bandwidth. Versions 2 and later also allow MultiHost (MH) send & receive to increase bandwidth saturation to both regular rsync connections and rsyncd servers. This allows the traffic to be split out to servers on different networks as well as sending to multiple filesystems on the receiving end (tho the split dirs would then have to be re-combined).

'pfp2' uses Ganael Laplanche’s excellent fpart to dynamically create 'chunkfiles' for rsync to read, bypassing the need to wait for rsync’s complete recursive scan. ie, it starts the transfer almost immediately, as soon as the first chunk is written. For large, deep trees, this can be quite useful. Also see the filesfrom options below. 'pfp2' also allows huge transfers to take place without the memory overflow sometimes seen with using a single rsync, due to splitting the memory required over many smaller rsync instances.

In both SH and MH, 'pfp2' monitors the system loadavg. It will suspend spawned rsyncs when the 1m load rises above the cutoff ('--maxload'), then UNsuspend them once the load decreases below it.
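The load check described above can be sketched as follows. This is a minimal illustration, not pfp2's actual code: it assumes a Linux '/proc/loadavg', and the use of SIGSTOP/SIGCONT in the comments is an assumption about how suspension could be done. All function names are hypothetical.

```shell
# Print the 1-minute load average (first field of /proc/loadavg on Linux).
one_min_load() {
    cut -d' ' -f1 /proc/loadavg
}

# Exit 0 when the load is above the cutoff. loadavg is fractional, so the
# comparison is done in awk rather than with the shell's integer tests.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load > max) }'
}

# Decide what a monitor loop would do for a given load and cutoff.
next_action() {
    if load_exceeds "$1" "$2"; then
        echo suspend      # e.g. kill -STOP <rsync pid>  (assumed mechanism)
    else
        echo resume       # e.g. kill -CONT <rsync pid>  (assumed mechanism)
    fi
}

next_action 6.2 5.5   # prints: suspend
next_action 3.0 5.5   # prints: resume
```

A real monitor would call something like 'next_action "$(one_min_load)" 5.5' once per check period.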

In the SH version, suspending the parent 'pfp2' (with Ctrl+Z) will suspend all rsync children, regardless of current state. Similarly, if you kill the parent 'pfp2' (Ctrl+C), all the children rsyncs will die with various cries of distress, depending on their states. In the MH version, the spawned rsyncs are running independently on separate hosts and can only be controlled by commands issued to that host. ie you have to 'ssh' to the host and suspend or kill the processes separately. A version where the hosts communicate via sockets is in the works. In the meantime, a 'killer' script pfp2stop is written out at each MH invocation, which will ssh to each of the SEND and REC hosts to kill off all the rsync and 'pfp2' processes running.

'pfp2' can send files to any host with a standard rsync on the other end. In 'normal' client mode (the remote rsync starts up on demand via ssh) the target syntax is either 'host:/fully/qualified/path' or 'host:path' (implying a dir off the user’s HOME dir; specified in other apps as 'host:~/path', but unacceptable to a native rsync). 'pfp2' can also send data to an rsyncd server. The rsyncd target syntax requires a module name ('host::module'). Unless the server is running without any kind of authentication, the user must be pre-registered in the server’s '/etc/rsyncd.conf' and '/etc/rsyncd.secrets' files; see 'man rsyncd.conf'.
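As a quick way to tell the target forms apart, the sketch below classifies a target string by its colons. It is only an illustration of the syntax described above ('target_kind' is a hypothetical helper, and 'rsync://' URLs are deliberately not handled).

```shell
# Classify a transfer target by its syntax.
target_kind() {
    case "$1" in
        *::*) echo rsyncd ;;   # host::module        -> rsyncd server
        *:*)  echo ssh ;;      # host:/abs/path or host:path -> remote shell
        *)    echo local ;;    # no host part        -> local filesystem
    esac
}

target_kind 'hjm@moo:/home/hjm/testparsync'   # prints: ssh
target_kind 'gibson@moon4::circadian'         # prints: rsyncd
target_kind '/media/backupdisk'               # prints: local
```

The '*::*' pattern has to come first, since a module target also matches '*:*'.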

Unless changed by '--interface', 'pfp2' assumes and monitors the routable interface. The transfer will use whatever interface normal routing provides, normally set by the name of the target. While rsync can be used for non-host-based transfers (between mounted filesystems), it works less well than for strictly network-based syncs. 'pfp2' will honor requests to sync across local filesystems and shows low but significant speedup (2x-6x).

'pfp2' only works on dirs and files that originate from the current dir (or the dir specified via '--startdir'). You cannot include dirs and files from discontinuous or higher-level dirs. 'pfp2' also does NOT use rsync’s sophisticated/idiosyncratic treatment of trailing '/'s to direct where files vs dirs are sent; dirs are treated as dirs regardless of the trailing '/'.

The .pfp2 dir : Unless redirected to another dir via the '--altcache' option, this contains the cache dir ('fpcache', which is cleared on each run) and the time-stamped rsync log files. These can accumulate quickly since each rsync instance will leave a date-stamped log. If you use the MH version, the .pfp2 dir is created in the common shared directory ('--commondir'), and contains the (common) fpcache dir. The rsync logs are stored in the host-named subdirectories in the .pfp2 dir and are NOT deleted by the next 'pfp2' run.

If you use the MH version, the STDERR/STDOUT of the entire transfer (the text that’s written to the screen) from each of the SEND hosts is captured in the host-specific dir named 'pfp2-log-(time)_(date)'.

Due to the terminal text coloration, the pfp2-log files are best viewed by cat’ing them to the terminal and then if necessary, copy-pasting them from the terminal.

Odd characters in names : 'pfp2' will refuse to transfer some oddly named files (tho it should copy filenames with spaces fine). Filenames with embedded newlines, DOS EOLs, and some other odd chars will be recorded in the log files in the '.pfp2' dir (see above). You should be able to specify dirs and files in the 'pfp2' command with either/both escaped spaces or with quotes: file\ with\ spaces or 'file with spaces'. Internal to pfp2, rsync rules prevail.

Release License

parsyncfp2 is distributed under the GNU General Public License (GPL) v3.

Installation

Installation of 'parsyncfp2' is fairly simple. There’s not yet a deb or rpm package, but the bits to make it work that are not part of a fairly standard Linux distro are the Perl scripts 'parsyncfp2', 'scut' (like 'cut' but more flexible), and 'stats' (spits out descriptive statistics of whatever semi-numeric stream is fed to it). The rest of the dependencies are listed here:

Debian/Ubuntu-like:

 sudo apt install ethtool iproute2 fpart iw libstatistics-descriptive-perl infiniband-diags
 git clone https://github.com/hjmangalam/parsyncfp2
 cd parsyncfp2; cp parsyncfp2 scut stats ~/bin

RHEL/CentOS/Rocky-like:

 sudo yum install iw fpart ethtool iproute perl-Env.noarch  \
   perl-Statistics-Descriptive wireless-tools infiniband-diags
 git clone https://github.com/hjmangalam/parsyncfp2
 cd parsyncfp2; cp parsyncfp2 scut stats ~/bin

Required utilities and packages

Should the above commands not fulfill the requirements or be missing from your set of repositories, the utilities are listed below.

ethtool - query or control network driver and hardware settings. Install via repository.
ip - show / manipulate routing, network devices, interfaces and tunnels. Install via repository.
fpart - Sort and pack files into partitions. Now in many distro repositories or install from github.
scut - a more intelligent cut. Included in the parsyncfp2 github.
stats - calculate descriptive stats from STDIN. Included in the parsyncfp2 github.
Statistics::Descriptive (Perl module) - basic descriptive statistical functions. Install via repository (see the package names above).

Recommended Utilities

iwconfig - configure a wireless network interface. Needed only for WiFi. Install via repository.
perfquery - query InfiniBand port counters. Needed only for InfiniBand. Install via repository.

Options in detail

'pfp2' has a lot of options, but most are straightforward. The MultiHost and FilesFrom options require a little more description and are described in their own sections below.

Basic Options for both SH and MH

The only native rsync options that 'pfp2' uses are '-a' (archive), '-s' (protect-args), and '-l' (copy symlinks as symlinks). If you need to pass more options to rsync, then it’s up to you to provide them ALL via '--ro' and you must include the entire option string as rsync would see it (--ro='-slaz --times')

In the list of 'pfp2' options below, the brackets indicate:

[i] = integer number, [f] = floating point number, [s] = "quoted string", ( ) = the default if any

--NP|np [i] (sqrt(#CPUs)) : The number of rsync processes to start. The optimal NP depends on many variables. Try the default and increase as needed. No point in using a high NP if your network won’t support it.
--altcache|ac [/path/to/dir] : The alternative cache dir for placing it on another FS or for running multiple SH (not MH) 'pfp2s' simultaneously
--startdir|sd [s] (pwd) : The top-level directory at which 'pfp2' starts looking for files & dirs. You can use globs/regexes with '--startdir', but only if you’re at that point in the dir tree. ie: if you’re not in the dir where the globs can be expanded, then the glob will fail. However, explicit dirs can be set from anywhere if given an existing dir with '--startdir'.
--maxbw [i] in KB/s (unlimited): 'pfp2' appropriates rsync’s bandwidth throttle mechanism, using '--maxbw' as a passthru to rsync’s 'bwlimit' option, but divides it by the NP value so as to keep the total bandwidth the same as the stated limit. It monitors and shows 'total' (not just pfp2’s) bandwidth thru the given interface.
--maxload|ml [f] (NP*2) : max system load - if 1m loadavg > maxload, 1 rsync process will be suspended per 'checkperiod' cycle until the loadavg decreases below the 'maxload'. At that point, the suspended rsyncs will be UNsuspended, one per 'checkperiod'. rsync is very CPU-light; running 6 rsyncs with compression (--ro='-slaz') causes an increase in loadavg of only about 1-2, depending on the storage systems. This is handled independently on each of the SEND hosts.
--chunksize|cs [s] (10G) : aggregate size of files allocated to one rsync process. Can specify in 'human' terms [100M, 50K, 1T] as well as integer bytes. 'pfp2' will warn once when/if you exceed the WARN # of chunkfiles (2000) and abort if you exceed the FATAL # of chunkfiles (5000). You CAN force it to use very high numbers of chunkfiles by setting the number negative ('--chunksize=-50G'), but this can be risky. Optimally, you want to choose a chunksize that will result in a fairly short startup time but will not result in 10s of 1000s of files. Decrease the NUMBER of chunkfiles by increasing the SIZE of the chunkfiles. The sweet spot is a chunksize that results in no more than 10x NP chunkfiles, so if '--NP=20', there should be no more than ~200 chunkfiles, altho you have very broad latitude to set your own preferences.
--interface|I [s] : network interface to monitor (not use; see above). Only SENT bytes are displayed, and the bytes are the total sent thru the link, not just from 'pfp2', so it’s a rough estimate of the bandwidth.
--ro [s] : Options passed to rsync as a quoted string. This option triggers a pause before executing to verify the command. The '--ro' string can pass any rsync option to all the rsyncs that will be started. This allows options like '-z' (compression) or '--exclude-from' (filter out unwanted files). If you use this option, you’re responsible for supplying ALL the options and providing the files and formats required. The '--ro' string is NOT appended to the default '-asl' string. DO NOT use any 'delete' options with this utility. See Hints below.
--checkperiod|cp [i] (3) : Sets the period in seconds between updates. This is a best effort attempt. If chunksize is set so small that 1000s of chunkfiles are created, file IO may lengthen this time.
--rdma : report RDMA bytes thru the IB or IB-bonded interface, otherwise, only TCP/IP bytes will be reported.
--reusechunks [i] (1) : Re-use the chunking data collected for the previous run, using the same chunk size. Useful for restarting a run that was mistakenly ended w/o waiting for fpart to recalculate the chunks. The integer argument is the chunk to start at, so rather than running thru all the (possibly 100s of) chunks, you can start at the one closest to where the interruption occurred.
--verbose|v [0-3] (2) : Sets chattiness. 3=very; 2=normal; 1=less; 0=none. This only affects verbosity post-start; warning & error messages will still be printed. This is a work in progress.
--slowdown [f] (0.5) : Introduces delays between ssh-mediated commands if the RTT is too long. It’s increased in steps automatically for large RTTs, but this option allows you to explicitly slow down the speed at which ssh connections are made. Increment in integer seconds if you see errors like: 'rsync error: unexplained error (code 255) at io.c(xxx) [sender=x.x.x]'
--dispose|d [s] (l) : What to do with the fpart cache files. (l)eave untouched, (c)ompress to a tarball, (d)elete.
--email [s] (none) : Email address to send completion message. The email address should not need escaping or quoting, but it should also work if escaped (joe\@go.com). The SEND host will need a working mailer for this to work.
--nowait : For scripting, sleep for a few sec instead of pausing and waiting for human intervention.
--version|V : Dumps version string and exits
--help|h : Dumps a short version of this help into your pager and then exits when you quit.
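The arithmetic behind '--maxbw' and '--chunksize' can be sketched with a few shell helpers. These are illustrative only: the function names are hypothetical, integer truncation in the bandwidth split is an assumption, and the 'many' verdict is just a label for exceeding the 10 x NP rule of thumb.

```shell
# Per-rsync bandwidth cap: --maxbw (KB/s) split evenly over NP rsyncs.
# $1 = maxbw in KB/s, $2 = NP. Integer division is an assumption here.
per_rsync_bwlimit() {
    echo $(( $1 / $2 ))
}

# Number of chunkfiles for a total size and a chunk size (same units),
# rounded up because a partial chunk still needs its own chunkfile.
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Classify a chunkfile count against the WARN (2000) and FATAL (5000)
# limits and the "no more than 10 x NP" sweet spot. $1 = count, $2 = NP.
chunk_verdict() (
    count=$1
    np=$2
    if   [ "$count" -gt 5000 ]; then echo fatal
    elif [ "$count" -gt 2000 ]; then echo warn
    elif [ "$count" -gt $(( np * 10 )) ]; then echo many
    else echo ok
    fi
)

per_rsync_bwlimit 10000 4                  # 10000 KB/s over 4 rsyncs -> 2500
chunk_count 2000 10                        # 2000 GB in 10 GB chunks  -> 200
chunk_verdict "$(chunk_count 2000 10)" 20  # 200 chunks at NP=20      -> ok
chunk_verdict "$(chunk_count 2000 1)" 20   # 2000 chunks at NP=20     -> many
```

If you know roughly how much data you are sending, running the numbers like this before a transfer is a cheap way to pick a chunksize that stays well clear of the WARN limit.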

MultiHost (MH) Operation

Overview

The single 'pfp2' script has both SH and MH functionality.

The MH options allow you to rsync in parallel streams via multiple SEND hosts to the same or multiple RECEIVE hosts, including sending to different filesystems on the different RECEIVE hosts. The RECEIVE hosts can be:

standard servers which launch matching rsyncs via the usual mechanism. These can also have the same or different endpoints.
rsyncd servers with different modules and as such, can define different authentication for different users and different endpoints for the data. The comprehensive description of how this works is in rsyncd.conf(5). Make sure that the rsyncd can start as many rsyncs as the sending hosts require by modifying the 'max connections' line.

Both types can be mixed in the same hosts string. The MH version requires that the initiator and all SEND hosts have access to a common filesystem for both data and configuration info.

Important

The required last element in a MH command is 'POD::/path' ('POD' for a pod of whales) which is the default path for any RECEIVE hosts that haven’t been defined in the '--hosts' option. This is only the case for regular paths, not for rsyncd module definitions. So while the terminal target path will be appended to otherwise naked RECEIVE hosts, rsyncd modules have to be completely specified in the hosts string as 'host::module' (more info below; see Good examples 5, 6, and 7).

MultiHost 'pfp2' sequence:

start the process on the 'master' host
process the options
check the status & separation of the SEND and REC hosts and rsync some required utilities to the SEND hosts (requested via '--checkhosts') and verify that the 'pfp2' scripts being used are identical.
start the 'fpart' chunking process on the 'master' node (unless it’s been done previously and you’re using the '--reusechunks' option.)
reformat the 'pfp2' command based on the original options and how many SEND hosts were requested
start the SEND host processes (using the same 'pfp2' Perl script), each with the same number of parallel rsyncs.
and then exit the master process.

The SEND hosts will continue to send output back to the originating terminal (prefixed or suffixed with the SEND hostname) so you can decipher which SEND host is saying what. This information is not failsafe since output from different hosts can overwrite each other. If you wish to view the complete output per SEND host, each SEND host log can be found in the host-specific subdir in the file 'pfp-log-${DATE}'.

However, unlike the original parsyncfp or using the SH option, killing or suspending the originating program will have no effect on the SEND hosts; the remote rsyncs are independent and have to be killed manually. This SEND host independence should be addressed shortly via socket-based controls.

In the meantime, a 'killer' script called pfp2stop is automatically generated when a MH run is initiated. It will ssh to each SEND and RECEIVE host and kill off all YOUR rsync and 'pfp2' processes (even those not associated with the instigating pfp2, so be careful). The pfp2stop script is usually placed in your 'parsync_dir' and its exact path is emitted a couple times in the run of the 'pfp2' script as a reminder.

Options for MultiHost transfers

The MultiHost (MH) version allows you to rsync multiple streams of data via multiple SEND hosts to the same or multiple RECEIVE (REC) hosts, including different filesystems on the different REC hosts. The REC hosts can be:

rsyncd servers with multiple modules and as such, can define different auth for different users and different endpoints for the data. The comprehensive description of how this works is described in rsyncd.conf(5).
standard servers which launch matching rsyncs via the usual mechanism. These can also have the same or different endpoints.

In a MH command, the last phrase is the POD:: string. This not only defines the command as MH, but also provides the default storage path for all REC hosts in the '--hosts' argument that lack an explicit one.

Both types can be mixed in the same hosts string. The MH version requires that the master and all the send hosts (which can include the master) have access to a common filesystem for both data and configuration info.

--checkhost : Requests a pre-check to make sure that the SEND & RECEIVE hosts specified with '--hosts' do not have any rsyncs running. If they do, the number of them is reported. Those rsyncs may be valid and independent of 'pfp2' but it may be evidence of a failed 'pfp2' which may interfere with another 'pfp2' launch. This option also pushes the required utilities to the SEND hosts to make sure that they have the utilities necessary to run with full functionality.
--commondir [s] : The shared, common dir in which all chunk files and rsync logs will be stored. Similar to '--altcache' but MUST be readable by all SEND hosts.
--rpath [s] : the remote PATH prefix on the SEND hosts to check for the bits needed to run this. It is prefixed to the remote ssh cmd as 'export PATH=<rpath string>:$PATH;' The 'rpath' string can contain as many paths as you’d like, separated by colons (:), tho vars have to be escaped appropriately.

ie:
  --rpath="~/bin:$HOME/pfp2/bin"

  The default is '~/bin:$parsync_dir/.pfp2'. ':$PATH' is also appended, so
        --rpath="~/bin:$HOME/pfp2/bin"
            is transmitted as:
        --rpath="~/bin:$HOME/pfp2/bin:$PATH"

--hosts [s] : the string argument specifies the SEND and REC hosts, optionally supplying REC hosts with individual alternate paths to store data. The '--hosts' string format is a comma-delimited set of 'Send=Receive' hosts.
example: "s1=r1:/path1,s2=r2:/path2,s3=r3:/path3,s4=r4,s5=r5"
where each 's#' and 'r#' imply a full "user\@host" string. 's#' and 'r#' obey the standard Linux rules that they are either long or short hostnames that are resolvable by your DNS or by an entry in the '/etc/hosts' file or a numeric address (113.42.23.56). Also, each 'r#' can have a storage path appended (r2:/path2). If the REC path is not given, the path from the final 'POD::/path' target is appended. ie pfp [option option option..] POD::/common/default/receive/target.

If you specify 'different' REC paths, the SEND data will be split over those host:/path combinations, so they will have to be manually combined afterwards. This is to allow different remote filesystems to accept high bandwidth transmission without impacting other FS operations. The SEND=REC couplets follow ssh rules so that if the user at one of the hosts is different than the one being used to initiate the process, you’ll have to specify the user. Similarly for the REC host, if the user is different than the initiating USER. ie: in the following option string:

--hosts="cooper=ben,tux@chinstrap=hjm@ben,nash=ben"

'hjm' is the initiating user and is the mediating user on cooper, ben, and nash, while 'tux' is the mediating user on chinstrap. Because 'tux@chinstrap' is mediating the command, ssh assumes the same user on ben, so 'hjm@ben' has to be explicitly specified. The required last element in a MH command is 'POD::/path' which is the default path for any REC hosts that haven’t been defined in the '--hosts' option. (More info below; see Good examples 5, 6 & 7.)

For rsyncd targets, you can specify the REC hosts as:

 r1::module_name
 r2::module_name2
 etc

and you can mix rsyncd targets with regular rsync targets so a valid hosts string could be:

"s1=r1:/path1,s2=r2::mod2,s3=r3:/path3,s4=r4::mod4,s5=r5"

However, unless the rsyncd server is open (without authorization) you must export your RSYNC_PASSWORD in the SEND host’s '~/.bashrc' for this to work, or use '--ro="--password-file=FILE"' to point to a permission-protected file containing the appropriate credentials. Otherwise, the responding rsyncd will query for your rsync user password (not your login password). This is defined in the rsyncd host’s /etc/rsyncd.secrets file and explained in detail in rsyncd.conf(5) ('man 5 rsyncd.conf').
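The '--hosts' rules described above (comma-delimited 'Send=Receive' couplets, with the 'POD::' path appended only to REC hosts that have neither a path nor an rsyncd module) can be sketched as follows. This is an illustration of the rules as documented, not pfp2's own parser; 'expand_hosts' is a hypothetical name, and the sketch does not trim whitespace around the commas.

```shell
# Expand a --hosts string into "SEND REC-target" pairs, one per line.
# $1 = comma-delimited Send=Receive couplets
# $2 = default path taken from the final POD::/path argument
expand_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS='=' read -r send rec; do
        case "$rec" in
            *:*) ;;                  # explicit path or rsyncd module: keep as is
            *)   rec="$rec:$2" ;;    # naked REC host: append the POD path
        esac
        printf '%s %s\n' "$send" "$rec"
    done
}

expand_hosts "s1=r1:/path1,s2=r2::mod2,s4=r4" "/d1/incoming"
# prints:
#   s1 r1:/path1
#   s2 r2::mod2
#   s4 r4:/d1/incoming
```

Note that a 'user@host' REC entry such as 'hjm@ben' contains no colon, so it is treated as a naked host and gets the default path.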

The master 'parsyncfp2' command will exit once the fpart chunking process is finished and leave the rsyncs running independently on the SEND hosts. They will continue to send output back to the originating terminal (prefixed or suffixed with the SEND hostname) so you can decipher which SEND host is saying what.

However, unlike the single-host version, killing or suspending the originating program will have no effect on the SEND hosts; the remote rsyncs will have to be killed manually. This is made easier with a 'kill script' that is generated at every invocation of the MH version, called '${parsync_dir}/pfp2stop', which will kill off ALL YOUR running rsync and 'parsyncfp2' instances (including ones that were not part of the originating parsyncfp2, so be careful).

This SEND host independence should be addressed shortly via socket-based controls.

Stopping a MultiHost pfp2

As noted above in the Overview, a crude pfp2stop bash script is generated for each run of the 'pfp2' MultiHost version and will kill all running rsyncs and 'pfp2' processes on all the hosts specified in the '--hosts' option string.

Options for using filelists

(thanks to Bill Abbott for the inspiration/guidance).

These options were created so that people who use filesystem databases such as Robinhood or Starfish, or filesystems such as GPFS, can generate lists of files directly from these utilities and avoid the (fast, but additional) overhead of running 'fpart'.

These options work with the MH version as well as the SH version.

The 3 options below provide a way of explicitly naming the files you wish to transfer by providing a file of 'fully qualified' filenames. ie. the names start with a leading '/'.

If you use this list directly with rsync, it will remove the leading '/' but then place the file with that otherwise full path inside the target dir. So '/home/hjm/DL/hello.c' would be placed in '/TARGET/home/hjm/DL/hello.c'. If this result is OK, then simply use the '--filesfrom' option to specify the file of files. If this is NOT OK, see the '--trimpath' option below.

If the files in the list are NOT fully qualified then you should make sure that the command is run from the correct dir so that the rsyncs can find the designated dirs & files.
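The path mapping for fully qualified names, with and without '--trimpath', can be sketched as below. 'dest_path' is a hypothetical helper that mirrors the documented behaviour only; it does a plain prefix strip and does not check that the prefix ends on a directory boundary.

```shell
# Where a file from the --filesfrom list lands under the target dir.
# $1 = target dir, $2 = source path from the list, $3 = optional trim prefix
dest_path() (
    target=${1%/}
    src=$2
    trim=${3:-}
    trim=${trim%/}              # a trailing '/' on the prefix is ignored
    if [ -n "$trim" ]; then
        src=${src#"$trim"}      # drop the leading prefix, if present
    fi
    src=${src#/}                # the leading '/' is removed as well
    printf '%s/%s\n' "$target" "$src"
)

dest_path /TARGET /home/hjm/DL/hello.c              # /TARGET/home/hjm/DL/hello.c
dest_path /TARGET /home/hjm/DL/hello.c /home/hjm/   # /TARGET/DL/hello.c
```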

--filesfrom|ff [s] : Take explicit input file list from given file, 1 path name per line.
--trimpath|tp [s] : The path to trim from the front of full path name if '--filesfrom' file contains full path names and you want to trim them. If you want the file '/home/hjm/DL/hello.c' to end up as '/TARGET/DL/hello.c' (ie remove the original '/home/hjm'), you would use the --trimpath option as follows: '--trimpath=/home/hjm'. This will remove the given path before transferring it and assure that the file ends up in the right place. This should work even if the command is executed away from the directory where the files are rooted. If you have already modified the file list to remove the leading dir path, then of course you don’t need to use this option. A trailing '/' is not required; it will be removed regardless.
--trustme|tm : Used with '--filesfrom' above, this allows the use of file lists of the form:

size in bytes<tab>/fully/qualified/filename/path
825692            /home/hjm/nacs/hpc/movedata.txt
87456826          /home/hjm/Downloads/xme.tar.gz

Such a file format can be generated with 'find' in the format:

  find $PWD/{dir} {criteria} -type f -printf '%s %p\n' | sed -e 's/ /\t/'
  ie:
  find $PWD/dir42  -maxdepth 5 -mtime +183 -type f -printf '%s %p\n' | sed -e 's/ /\t/'
  (to find regular files within 5 levels deep and >  183 days old)
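Before handing such a list to '--trustme', it can be worth checking that every line really has the 'size<tab>/full/path' shape. The following is a small sketch of such a check; 'check_trustme_list' is a hypothetical helper, not something shipped with 'pfp2', and it assumes no tabs inside filenames.

```shell
# Succeed only if every line is "<integer size><TAB></absolute/path>".
# $1 = path to the candidate file list
check_trustme_list() {
    awk -F '\t' '
        NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^\// {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example with a throwaway list built from the two entries shown above.
list=$(mktemp)
printf '825692\t/home/hjm/nacs/hpc/movedata.txt\n'  >  "$list"
printf '87456826\t/home/hjm/Downloads/xme.tar.gz\n' >> "$list"
check_trustme_list "$list" && echo "list looks usable"
rm -f "$list"
```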

Hints & Workarounds

Important

rsync '--delete' options will not work with '--ro' because the multiple parallel rsyncs that 'pfp2' launches are independent and therefore don’t know about each other (and so cannot exchange info about what should be deleted or not). Use a final, separate 'rsync --delete' to clean up the transfer if that’s your need.

Also, rsync options related to additional output have been disallowed to avoid confusing pfp2’s IO handling. '-v/--verbose', '--version', and '-h/--help' are caught, and 'pfp2' will die with an error. Most of the info desired from these is captured in the rsync-logfile files in the '.pfp2' dir.

Unless you want to view them, it’s usually a good idea to send all STDERR to '/dev/null' (append '2> /dev/null' to the command) because there are often a variety of utilities that get upset by one thing or another. Generally, silencing the STDERR doesn’t hurt anything.

Examples

Good example 1

% parsyncfp2  --maxload=5.5 --NP=4 \
--chunksize=$((1024 * 1024 * 4)) \
--startdir='/home/hjm' dir[123]  \
hjm@remotehost:~/backups 2> /dev/null

where:

'--maxload=5.5' will start suspending rsync instances when the 1m system load gets to 5.5 and then unsuspending them when it goes below it.
'--NP=4' starts 4 instances of rsync
'--chunksize=$((1024 * 1024 * 4))' sets the chunksize by shell arithmetic; the explicit equivalent is 4194304
'--startdir='/home/hjm'' sets the working dir of this operation to '/home/hjm' and 'dir1 dir2 dir3' are subdirs from '/home/hjm'
the target 'hjm@remotehost:~/backups' is the same target rsync would use
'2> /dev/null' silences all STDERR output from any offended utility.
It uses 4 instances to rsync dir1 dir2 dir3 to hjm@remotehost:~/backups

Good example 2

% parsyncfp2  --checkperiod 6  --NP 3 \
--interface eth0  --chunksize=87682352 \
--ro="--exclude='[abc]*'"  nacs/fabio   \
hjm\@moo:~/backups

The above command shows several options used correctly:

'--chunksize=87682352' - shows that the chunksize option can be used with explicit integers as well as the human specifiers (TGMK).
--ro="--exclude='[abc]*'" - shows the correct form for excluding files based on rsync patterns (note the quoting in block above to protect the pattern as it gets passed thru)
'nacs/fabio' - shows that you can specify subdirs as well as top-level dirs (as long as the shell is positioned in the dir above, or that dir has been specified via '--startdir')

Good example 3

parsyncfp2 -v 1 --nowait --ac pfp2cache1 --NP 4 --cp=5 --cs=50M --ro '-az'  \
linux-4.8.4 moo:~/test

The above command shows:

short version of several options (-v for --verbose, --cp for checkperiod, etc)
shows use of --altcache (--ac pfp2cache1), writing to relative dir pfp2cache1
again shows use of --ro (--ro '-az') indicating 'archive' & 'compression'.
includes '--nowait' to allow unattended scripting of parsyncfp

Good example 4

parsyncfp2 --NP=8 --chunksize=500M --filesfrom=/home/hjm/dl550 \
hjm\@moo:/home/hjm/testparsync

The above command shows:

if you use the '--filesfrom' option, you cannot use explicit source dirs (all the files come from the file of files, which requires full path names)
that the '--chunksize' format can use human abbreviations (m or M for Mega).

Good example 5 (MultiHost)

parsyncfp2 --verbose=2 --ro='-aslz' \
--hosts="bigben=bridgit.ure.edu:/d1/in, \
          pooki=bridgit.ure.edu:/d2/in, \
        stunted=bridgit.ure.edu:/d3/in" \
--hostcheck --ro="-aslz"  --NP 4 --chunk 15G \
--check 5 --dispo=l --interface=wlp3s0 \
--commondir=/home/hjm/pfp2 --startdir /home/hjm/pfp2 \
dir1 dir2 dir3 dir4  POD::/

The above MH command shows:

3 SEND hosts (bigben, pooki, stunted) all sending data to the REC host bridgit.ure.edu altho the data is being split among 3 filesystems.
You could also define 3 REC hosts, writing data to the SAME PATH if that was a better fit:

 ...
  --hosts="bigben=bridgit.ure.edu:/d1/in, \
            pooki=bridgit.ure.edu:/d1/in, \
          stunted=bridgit.ure.edu:/d1/in" \
 ...
  and even shorter:
 ...
  --hosts="bigben=bridgit.ure.edu, \
            pooki=bridgit.ure.edu, \
          stunted=bridgit.ure.edu" \
 ...

with the final argument as:
      POD::/d1/in
which would distribute the same 'POD::' suffix to all REC hosts.

the preferred way of defining the rsyncopts with '--ro=-aslz'
the '--dispo=l' option requests that the cachefiles be left alone. In MH mode the chunk files MUST be left, since all the independent SEND hosts need to reference them until they’re finished.
the 'POD::/' terminal element is the (required) default path for any undefined REC hosts. Since all of the REC hosts paths are defined, they aren’t affected.

Good example 6 (MultiHost)

cd /home/pfp; ~/bin/pfp2  --ro='-slaz' --chunk=50M --dispose=c --NP=6  \
--commondir=/home/pfp --filesfrom=/home/pfp/recentfilelist.txt \
--trustme --trimpath='/home/pfp' --checkhost \
--hosts="stunted=bridgit,bigben=bridgit"  POD::~/test

This example shows:

that you can symlink or rename the 'parsyncfp2' executable anything (to 'pfp2', above) and it will continue to be usable. The executable started is compared to the remote one (and is rsync’ed to the SEND hosts, if the '--checkhost' option is used, as it is here).
using the '--filesfrom' options in MH mode, where the prefix '/home/pfp' is removed from the path of all the filenames with the '--trimpath' option and the filenames are supplied with sizes, indicated by the '--trustme' option.
the TARGET string 'POD::~/test' indicating that the naked RECEIVE host ('bridgit', used by both SEND hosts 'stunted' and 'bigben') is automatically suffixed with the string ':~/test'
an incorrect option '--dispose=c' that is overridden in the process. The chunk files need to be kept until the end so the given '--dispose' option is detected and changed to '--dispose=l' to enable this.
the use of '--checkhost' to make sure all the MH hosts are in good shape to begin a 'pfp2' session.

Good example 7 (MultiHost)

parsyncfp2  --hostcheck --NP=16 --chunk=50G --check 5  \
--hosts="bigben=tux@moon1, \
          pooki=tux@moon2, \
        stunted=tux@moon3, \
         cooper=gibson@moon4::circadian" \
--maxload=20 --ro='-slaz' \
--commondir=/home/pfp --startdir /home/pfp/incoming \
dir1 dir2 dir3 dir4  POD::/d1/incoming

The above multihost command shows 4 SEND hosts (bigben, pooki, stunted, cooper) each sending 16 stream of data to the 4 clustered REC hosts (moon1 - moon4) with the REC data path being provided by the POD default path '/d1/incoming', except for moon4 which is using a rsyncd module as the REC endpoint, with the rsyncd ID 'gibson' as the authorized user (this requires the rsyncd password to be part of the ENV on cooper: ie the ~/.bashrc must contain 'RSYNC_PASSWORD=whateveritis').

Thus there are 64 (4x16) rsync streams pushing data to the REC cluster. This assumes the filesystem on the moon cluster can write that fast and that the intermediate network can provide the bandwidth. It also assumes that the rsync compression requested by the '--ro' argument (--ro='-slaz') can keep the 1m loadavg on each SEND host below the 20 requested by '--maxload=20'. If it doesn’t, the SEND hosts will start to suspend rsyncs until the loadavg goes below 20. The '--commondir' and '--startdir' paths define the shared storage and where in it the data to be sent is stored. '--commondir' and '--startdir' do not have to be identical, but they do have to be R/W available to all the SEND hosts. The '--hostcheck' option makes sure that required utilities are available, that the 'parsyncfp2' program is identical, and also checks the latency between the SEND and REC hosts.
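
The '--maxload' test described above amounts to comparing the 1-minute load average with a limit. The sketch below shows that comparison in shell. It is not taken from the 'pfp2' source: 'load_too_high' is a hypothetical helper, the "at or above the limit" rule is an assumption, and reading /proc/loadavg is Linux-specific.

```shell
# Illustration only: decide whether a SEND host is over its load limit.
# load_too_high is a hypothetical helper, not part of pfp2.
load_too_high() {
    # $1 = 1-minute load average, $2 = limit (as given to --maxload)
    # succeeds (exit 0) when the load is at or above the limit
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 >= max + 0) }'
}

# Linux-only usage: the first field of /proc/loadavg is the 1-minute average
if [ -r /proc/loadavg ]; then
    read -r load1 _rest < /proc/loadavg
    if load_too_high "$load1" 20; then
        echo "loadavg $load1 is not below 20: rsyncs would be suspended"
    else
        echo "loadavg $load1 is below 20: rsyncs keep running"
    fi
fi
```

awk does the comparison because load averages are fractional and the shell's own arithmetic is integer-only.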

ERROR example 1

% pwd
/home/hjm  # executing parsyncfp from here

% parsyncfp2 --NP4  /usr/local  /media/backupdisk

Why this is an error:

- '--NP4' is not an option (parsyncfp will say "Unknown option: np4"). It should be '--NP=4' or '--NP 4'.
- if you were trying to rsync '/usr/local' to '/media/backupdisk', it will fail since there is no /home/hjm/usr/local dir to use as a source. This will be shown in the log files in ~/.parsync/rsync-logfile-<datestamp>_# as a spew of "No such file or directory (2)" errors.

The correct version of the above command is:

% parsyncfp2 --NP=4  --startdir=/usr  local  /media/backupdisk

Note that this example is sending data to another local mounted filesystem, not a remote host. This is OK.
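
Since the source names are resolved relative to the start dir, a quick check before launching a long transfer can catch this kind of mistake. The following is a sketch of such a check, not a 'pfp2' feature: 'check_sources' is a hypothetical helper, and it assumes each source is looked up as '<startdir>/<source>', as the /home/hjm/usr/local failure above suggests.

```shell
# Illustration only: verify that each source exists under the start dir.
# check_sources is a hypothetical pre-flight helper, not part of pfp2.
check_sources() {
    # $1 = start dir; remaining args = dirs/files as they would be given to pfp2
    startdir=$1
    shift
    status=0
    for src in "$@"; do
        if [ -e "$startdir/$src" ]; then
            echo "ok:      $startdir/$src"
        else
            echo "missing: $startdir/$src"
            status=1
        fi
    done
    return "$status"
}

# Reproduce both cases in a scratch tree instead of touching the real /usr
demo=$(mktemp -d)
mkdir -p "$demo/usr/local" "$demo/home/hjm"
check_sources "$demo/home/hjm" /usr/local || echo "fix --startdir or the source names"
check_sources "$demo/usr" local
rm -rf "$demo"
```

The first call mirrors the failing command (start dir is the current dir, source given as an absolute path). The second mirrors the corrected one.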

Error Example 2

% parsyncfp2  hjm@moo.boo.yoo.com:/usr/local --start-dir /home/hjm mooslocal

Why this is an error:

this command is trying to PULL data from a remote SOURCE to a local TARGET. pfp2 doesn’t support that kind of operation yet.

The correct version of the above command is:

# ssh to hjm@moo, install parsyncfp2, then:
% parsyncfp2  --startdir=/usr  local  hjm@remote:/home/hjm/mooslocal

Error Example 3

% parsyncfp2 --NP=4 --chunksize=500M --startdir=/usr/local/bin hjm@remote.host.edu:/home/backups

Why this is an error:

you’ve specified a 'startdir' but haven’t specified the dirs or files to be transferred.

The correct version of the above command is:

% parsyncfp2 --NP=4 --chunksize=500M --startdir=/usr/local bin hjm@remote.host.edu:/home/backups

Block tags, Version 2.243

The following is a functional block list of how 'pfp2' works, described by in-line comments indented to the same degree as the code itself to provide some functional hinting. If you modify the code yourself or want to add more such comments, just prefix them with the obvious '##: ' in the code and run 'grep -n '##: ' pfp2' to list them.

24:##: == COMMON TO MASTER & SLAVES ==
25:##: Lib Requirements
36:##: Dev/github/update gunk
45:##: ITER notes
56:##: Global Vars
79:##: Pre-Getopt var declarations
97:##: Getopt options & Setup
135:##: Var declarations
150:##: Reset colors
155:##: MD5 checks of executable
167:##: Declare run-permanent vars
199:##: Define cache and log dirs
232:##: Get current system stats
244:##: Define & init Getopt flag vars
315:##: ARGV processing
324:##: parse_rsync_target call
352:##: Hostlist processing
465:##: NETIF determination
533:##: IB / perfquery
550:##: get IF_SPEED
568:##: fix .ssh/config
572:##: checkhost on SINGLEMASTER, RSYNCD, RSYNC hosts (NOT POD hosts)
584:##: Check loadavg too high
603:##: == MASTER ONLY ==
637:##: process Files & Dirs to send
684:##: Process $FROMLIST, how to set up fpart cmd
742:##: Warn about OTHER FPs running
763:##: More $FROMLIST proc
891:##: == MASTER ONLY ==
892:##: reformat orig pfp2 arguments for SEND hosts
922:##: Write out pfpstop script
954:##: == SEND hosts only (SH/MH)
955:##: Compose RSYNC_CMD & send it to all the SEND hosts
1001:##: == MASTER ONLY ==
1002:##: Write feedgnuplot script to viz data xfer
1036:##: == MASTER EXITS ==
1037:##: == SEND hosts only
1056:##: init Bandwidth vars
1076:##: Start the overall common rsync loop
1146:##: stats print loop
1450:##: Final rsync log check to verify completions.
1451:##: Detect failed rsyncs and retransmit.
1478:##: Resend failed rsyncs all at once,
1494:##: Calc bytes of rsync logs and convert raw bytes to 'human'
1514:##: Print reminders
1538:##: Exit cleanup: email
1544:##: Dispose of cache
1557:##: Exit message
1578:##: Left over orphan warning
1593:##: == Subroutines
1639:##: parse_rsync_target ($LOCALUSER, $TARGET, $ALTCACHE, $recv_hoststring)
1889:##: checkhost ( "NODETYPE", $HOST2CHECK, $RSYNCMODULE, $ALTCACHE, $VERBOSE, $MAXLOAD )
2009:##: first_run_required_utils ()
2092:##: check_ssh_ok ($HOSTNAME)
2112:##: get_nbr_chunk_files () # 1st ver
2126:##: remove_fp_cache ()
2134:##: check_utils($required_str, $recommend_str)
2180:##: get_rPIDs ($pidfile, $spids)
2243:##: trim ($string)
2254:##: getavgnetbw ($NETIF, $CHECKPERIOD, $PERFQUERY)
2290:##: pause()
2297:##: INFO($message)
2309:##: WARN($string)
2331:##: FATAL($message)
2344:##: DEBUG (__LINE__, $message)
2369:##: fixfilenames ($CUR_FP_FLE, $ROOTDIR)
2404:##: ptgmk ("154.32M")
2422:##: fix_ssh_config ()
2459:##: usage ()
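
The tagging convention can be tried on a scratch file before touching the real script. In the sketch below, 'list_tags' and 'demo.pl' are made up for the demonstration; only the '##: ' prefix and the 'grep -n' call come from the description above.

```shell
# Illustration only: list '##: ' block tags with their line numbers.
# list_tags is a hypothetical wrapper around the grep call described above.
list_tags() {
    # $1 = file to scan; -n prefixes each match with its line number
    grep -n '##: ' "$1"
}

# Build a small scratch Perl file containing two tags and one ordinary comment
tmp=$(mktemp -d)
cat > "$tmp/demo.pl" <<'EOF'
#!/usr/bin/env perl
##: Lib Requirements
use strict;
use warnings;
    ##: Indented tag inside a block
    # an ordinary comment, not listed
print "done\n";
EOF

list_tags "$tmp/demo.pl"
rm -rf "$tmp"
```

Only the two lines carrying the '##: ' prefix are reported. The ordinary '#' comment is ignored.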
