Releases: PanDAWMS/pilot3

3.7.6.8

29 May 10:36
666430e
  • Updates to pilot Harvester module used on supercomputers
    • Work report is now published, leading to successful jobs on Perlmutter
    • Ignoring “RHEL8 and clones are not supported for users”-warning in cpu_arch script, which prevented CPU architecture from being reported
  • Support for NO_CVMFS_OK environment variable
    • If set to any value, it will turn off CVMFS checks by pilot
    • Used on Mare Nostrum queues
    • Requested by P. Collato Soto et al
  • Support for error report produced by the job transform
    • If it exists, pilot will extract error_code and error_diag and report them as exeErrorCode and exeErrorDiag
    • Corresponding pilot error is 1305
    • Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-976
    • Requested by T. Maeno
  • Housekeeping with pylint etc
    • Multiple pilot modules (10+) improved with pylint score of 9+
  • Dropped Python 3.8 support, since unit tests otherwise fail due to the use of more modern type hints
    • Source code is verified for Python versions 3.9, 3.10 and 3.11 in the CI tests
  • Bug fix
    • Pilot did not send job update during long stage-in, which could lead to lost heartbeat
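
The NO_CVMFS_OK behaviour described above ("if set to any value") can be sketched as a simple presence test. This is a minimal illustration, not the pilot's actual code: the helper name cvmfs_checks_enabled is hypothetical, and treating an empty value as "set" is an assumption.

```python
import os


def cvmfs_checks_enabled(environ=None):
    """Return False when NO_CVMFS_OK is present in the environment.

    Hypothetical helper. Only the presence of the variable is tested,
    so an empty value also disables the checks (an assumption).
    """
    env = os.environ if environ is None else environ
    return "NO_CVMFS_OK" not in env


if __name__ == "__main__":
    if cvmfs_checks_enabled():
        print("CVMFS checks would run")
    else:
        print("NO_CVMFS_OK is set, skipping CVMFS checks")
```

Testing for presence rather than truthiness means a value such as 0 or false still turns the checks off, which matches the "any value" wording.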

3.7.5.4

08 May 07:56
dca3186

3.7.4.2

07 May 11:26
1dca4c9
  • Now setting proper error codes for lsetup and remote file open errors in case of timeouts
    • Before, a general payload error was set (1305); now, 1368 - "Remote file open timed out" and 1378 - "Lsetup command timed out during remote file open" will be set
    • Also, increased timeout from 600 s to 900 s for remote file open
  • Bug fix
    • Corrected one of the new cvmfs checking functions which previously failed to read the last cvmfs update file
      • Problem affected sites with non-standard cvmfs mounting
      • Reported by M. Svatos
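
The change in error reporting for timeouts can be summarised as a small lookup. The codes, messages and the 900 s limit are taken from the notes above; the step names, the helper timeout_error and the fallback text are hypothetical and only serve the illustration.

```python
# Timeout for remote file open, raised from 600 s to 900 s in this release
REMOTE_FILE_OPEN_TIMEOUT = 900

# Error codes and messages as quoted in the release notes.
# The dictionary keys are hypothetical step names.
TIMEOUT_ERRORS = {
    "remote_file_open": (1368, "Remote file open timed out"),
    "lsetup": (1378, "Lsetup command timed out during remote file open"),
}

# Code that was set for both cases before this release
GENERAL_PAYLOAD_ERROR = 1305


def timeout_error(step):
    """Return (code, message) for the step that timed out.

    Unknown steps fall back to the general payload error; the fallback
    message is a placeholder, not the pilot's wording.
    """
    return TIMEOUT_ERRORS.get(step, (GENERAL_PAYLOAD_ERROR, "general payload error"))


if __name__ == "__main__":
    for step in ("remote_file_open", "lsetup"):
        code, message = timeout_error(step)
        print(f"{step}: {code} - {message}")
```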

3.7.3.84

30 Apr 11:49
4b75cba
  • Replaced all usages of curl with python native urllib
    • Trace reports, panda server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), as well as prmon dictionary upload to https://pilot.atlas-ml.org/
    • This change is important both for attempting to reduce the number of non-responsive pilots (esp. following trace service curl upload) and for resolving EL9 complications (Tokyo)
  • Corrected cpu model reporting in cpuconsumptionunit string on ARM
    • Previously, UNKNOWN was given as cpu model name
    • Note: cache size is not reported in this case as it is not available in user space
    • Requested by I. Glushkov
  • Added GPU info from prmon to job metrics
    • E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
    • To be used by monitoring
    • Requested by T. Korchuganova, T. Maeno
  • Now aborting stage-in loop in case graceful stop bit has been set
    • Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
    • The graceful stop bit gets set e.g. when there is an unexpected exception thrown
  • Added intersect value from PSS+SWAP fit to job metrics
  • Now performing CVMFS checks at the beginning of the pilot
  • ATLAS
    • Active monitoring of remote file open verification script
      • Now able to abort, e.g. if lsetup is taking too long
  • Support for event service with AthenaMT
  • Housekeeping
    • Additional pilot modules were processed with pylint etc
  • Bug fixes
    • Rubin
      • Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
    • Corrected exception handling when socket.gethostbyaddr() is used
    • Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
    • ATLAS
      • Correction for the case where looping check had not run (thus not set any internal timings) before job suspension occurred
        • Led to problems with some BOINC jobs

Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
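
The stage-in change in this release (aborting the loop once the graceful stop bit is set) can be sketched with a threading.Event. This is a minimal sketch under assumptions: the function stage_in, its parameters and the use of an Event for the stop bit are illustrative, not the pilot's actual interface.

```python
import threading


def stage_in(files, graceful_stop, copy_file):
    """Copy files one by one, stopping early if graceful_stop is set.

    graceful_stop is a threading.Event standing in for the stop bit,
    and copy_file is any callable that transfers a single file.
    Returns the list of files that were staged in.
    """
    staged = []
    for name in files:
        # Check the stop bit before every transfer instead of only at the end
        if graceful_stop.is_set():
            break
        copy_file(name)
        staged.append(name)
    return staged


if __name__ == "__main__":
    stop = threading.Event()
    stop.set()  # e.g. set after an unexpected exception elsewhere in the pilot
    print(stage_in(["file1", "file2"], stop, copy_file=print))
```

Checking the event once per file keeps the worst-case delay to the duration of a single transfer rather than the whole list.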

3.7.2.4

06 Mar 10:22
93d2524
  • Explicitly setting time zone to UTC for payload (ATLAS)
  • Added timeouts to socket usages
    • Set a timeout of 10 seconds to prevent potential hanging when there are problems with DNS resolution, or if the DNS server is slow to respond
    • The socket module is used in relation to rucio traces and could potentially lead to hanging/unresponsive pilots
  • Changed the default -C to -c -i in the container setup (ATLAS)
    • To preserve the process ids
    • Requested by A. De Silva and A. Mete
  • Setting XRD_LOGLEVEL=Debug before remote file open verification script is run (ATLAS)
    • To facilitate debugging problematic site(s)
    • Requested by R. Taylor
  • Bug fix
    • Now setting rucio site name (as well as missing rucio version) correctly when middleware containers are used
    • Reported by R. Kleinemuhl
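
The notes above state that a 10 second timeout was added to socket usages to avoid hanging on slow DNS resolution, but not how it is applied. One possible way to bound a lookup such as socket.gethostbyaddr() is to run it in a worker thread and wait for at most the timeout. This is an assumed technique for illustration only (a socket timeout generally applies to socket objects rather than to resolver calls); the helper name and the resolver parameter are hypothetical.

```python
import concurrent.futures
import socket


def resolve_with_timeout(address, timeout=10.0, resolver=socket.gethostbyaddr):
    """Return the host name for address, or None on failure or timeout.

    The lookup runs in a worker thread so that the caller waits no
    longer than timeout seconds. The resolver argument exists so the
    lookup can be replaced in tests.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(resolver, address)
    try:
        return future.result(timeout=timeout)[0]
    except (concurrent.futures.TimeoutError, OSError):
        # Timed out, or the lookup itself failed (socket.herror and
        # socket.gaierror are subclasses of OSError)
        return None
    finally:
        # Do not wait for a lookup that may still be blocked
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(resolve_with_timeout("127.0.0.1", timeout=10.0))
```

Note that the worker thread is not cancelled on timeout; it simply finishes in the background, so the caller is unblocked but the lookup itself is not interrupted.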

3.7.1.50

13 Feb 10:38
569c213
  • Updated lsetup ‘root pilot’ to lsetup ‘root pilot-default’
    • Requested by A. De Silva
  • Support for job suspension
    • Looping job algorithm takes into account if the pilot has been suspended
  • Updated python executable for running open_remote_file script (now explicit ‘python3’)
    • Requested by A. De Silva
  • Reporting error code 1310 when exception has been caught while executing payload
  • Never include output files in log
    • Previously, output files were left in payload work dir for looping jobs
    • Requested by T. Korchuganova
  • Now reporting support for IPv6 (or lack thereof) in pilot log
    • Dumping explicit IPs to log
  • Using get() instead of direct dictionary lookup to prevent problem with transfer() call in harvester related code
  • Housekeeping
    • Processed ~60% of all files with Flynt (+ manual changes for the strings that Flynt missed or couldn’t handle)
      • Flynt converts strings from %-formatting and .format() to f-strings
    • Linting with pylint status: ~50% of all modules have a score of 9-10
  • When the pilot intercepts an external kill signal, it sends SIGTERM to the main payload process, followed by SIGKILL after a delay
    • The previous 3 s delay might be too short and could lead to lingering payload threads, so it was increased to 10 s
  • Added new pilot error code (1208) for SIGINT
  • Rubin
    • Do not abort pilot when getJob returns None
      • Now experiment specific
    • Requested by W. Guan
  • Bug fixes
    • The exit code for a failed download of a new proxy (following an expired proxy/certificate) was lost. The job should now fail as expected
      • Reported by R. Walker
    • Updated and corrected shell exit code conversions after kill signal
      • Now reporting unique exit code for different kill signals
      • Reported by P. Love
    • Fixed an issue with real-time logging in multi-job pilots
      • The real-time logging was prevented from starting in later jobs
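
The SIGTERM followed by SIGKILL sequence mentioned above can be sketched with the standard subprocess module. This is an illustration under assumptions, not the pilot's implementation: the helper stop_payload is hypothetical, and unlike the description above it only sends SIGKILL if the process is still alive after the grace period.

```python
import subprocess
import sys


def stop_payload(proc, grace_period=10.0):
    """Ask proc to terminate, then force-kill it after grace_period seconds.

    proc is a subprocess.Popen object. On POSIX, terminate() sends
    SIGTERM and kill() sends SIGKILL. Returns the process return code.
    """
    proc.terminate()
    try:
        proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        # The process ignored SIGTERM or is slow to clean up
        proc.kill()
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in payload: a Python child process that sleeps for a minute
    payload = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("payload return code:", stop_payload(payload))
```

A longer grace period gives the payload more time to shut down its own threads and child processes before it is killed outright.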

3.7.0.36

06 Nov 12:46
4b0b372
  • Added verification of exception thrown during payload error interpretation
    • Previously, a FileHandlingFailure exception thrown during payload error interpretation would lead to a crash
    • Also, now verifying that payload stdout/err actually exists before trying to read it
    • Reported by Z. Yang (Rubin)
  • Added xroot client log file to log if it exists
    • Adding ‘XRD_LOGLEVEL=Dump XRD_LOGFILE=xrdlog.txt’ to PQ.environ causes the wrapper to set these env vars, and the pilot also adds them to the payload command. Upon a file transfer, the variables (if set) trigger the xroot client to create a log file that is stored in the pilot’s launch directory
    • There is one file per file transfer, and the pilot renames them to xrdlog-<LFN|payload>.txt (but using the file name pattern from XRD_LOGFILE). They are then copied to the tar directory (the corresponding log file for the job log transfer will of course be missed)
    • Requested by R. Walker
  • Aborting pilot immediately after server instruction ‘tobekilled’
    • The job monitor can otherwise get stuck in a loop that can take some time to get out of
  • The core dump that the pilot produces when a looping payload process has been detected is now created for the youngest child process rather than the main payload process
  • Extended panda secrets usage for logging into docker (for user container)
    • Any secrets are redacted in logs
    • Requested by R. Zhang
  • Housekeeping
    • Corrected all module headers for Apache license 2.0 so they pass the updated apache-license-check
    • Several pilot modules (10%) now pass the pydocstyle test
    • Pilot module was processed with the black tool
    • Some minor linting was done using pylint - currently around 30% of the pilot modules have a score of over 9 out of 10
    • The plan is to process all modules with these commands and eventually use them with local pre-commit and remote GitHub Actions that run after a PR
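
The renaming of xroot client logs described above (xrdlog-<LFN|payload>.txt, following the file name pattern from XRD_LOGFILE) can be sketched as follows. The helper names and the exact way the pattern is split are assumptions made for illustration.

```python
import os


def xrd_log_name(xrd_logfile, label):
    """Build the per-transfer log name from the XRD_LOGFILE pattern.

    label is the LFN of the transferred file, or "payload".
    """
    stem, ext = os.path.splitext(os.path.basename(xrd_logfile))
    return f"{stem}-{label}{ext}"


def rename_xrd_log(launch_dir, xrd_logfile, label):
    """Rename the client log in launch_dir, if it exists.

    Returns the new path, or None when no log file was produced.
    """
    source = os.path.join(launch_dir, os.path.basename(xrd_logfile))
    if not os.path.exists(source):
        return None
    target = os.path.join(launch_dir, xrd_log_name(xrd_logfile, label))
    os.replace(source, target)
    return target


if __name__ == "__main__":
    print(xrd_log_name("xrdlog.txt", "payload"))
```

Renaming after each transfer is what keeps one log per file, since the client would otherwise keep writing to the same XRD_LOGFILE path.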

3.6.9.10

03 Oct 12:02
78c3011
  • Looping killer update
    • Added exception handling for unexpected stat output seen in a job at MWT2, which effectively aborted the looping job killer and led to multiple unsuccessful attempts to kill the payload
    • The updated function is however now only used as a fallback when psutil module is not available (it largely is, except on some HPC sites)

3.6.8.29

27 Sep 13:54
efecbae
  • Redirecting stdout/stderr to files for trace service curls
    • This could prevent thread deadlocks in the standard Python subprocess.communicate() function in case of an overwhelming amount of stdout/stderr. Since subprocess.communicate() is no longer used, the internal timeout capability in subprocess is no longer available and had to be reimplemented with a threading timer, which sets the relevant error code if necessary
    • The ‘last’ output from curl is stored in trace_curl_last.stdout/stderr, and gets appended to trace_curl.stdout/stderr
    • The trace_curl_last.stdout/stderr files are searched for any curl errors (curl command always returns 0 exit code even when there was an error, so the output has to be processed)
    • A failed rucio trace curl operation is now reported with job metrics
      • Example: rucioTraceError=N
    • Increased connection timeout from 20s to 100s to be in line with panda server curl operations (where we don’t see any problems)
    • Related JIRA ticket: https://its.cern.ch/jira/browse/ATLASPANDA-835
  • Reporting prmon read_bytes/total_input_size with job metrics (‘readbyterate’)
    • Information to be used for optimizing brokerage
    • Requested by J. Elmsheuser, R. Walker
  • Extended usage of psutil
    • Job monitoring is now using psutil to discover prmon pid
    • If psutil is not available (e.g. as is the case on marenostrum), the code falls back to old ps command usage
  • Added protection against expired job objects in job_monitor loop
    • Reported by W. Guan/Z. Yang
  • Updated GitHub Action workflows
    • Unit tests and flake8 are now independent workflows
    • Moved to latest flake8 version 6.1.0 for flake8 verification
    • All tests are run for python versions 3.8, 3.9, 3.10 and 3.11
  • Tested pilot running under python versions 3.9.18 and 3.11.5
    • Grid jobs are currently running under python version 3.9.14 but will soon switch to 3.9.18 to be in line with user tools (like rucio)
    • Python version 3.11.5 will be the default version on EL9
    • Requested by A. De Silva
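
The trace service change in this release combines three ideas: redirecting stdout/stderr to files instead of using subprocess.communicate(), enforcing the timeout with a threading timer, and scanning the output for curl errors. A minimal sketch under assumptions follows; the function names are hypothetical, and the "curl: (" prefix is curl's usual error message format rather than something taken from the pilot code.

```python
import subprocess
import threading


def run_with_file_output(cmd, stdout_path, stderr_path, timeout):
    """Run cmd with stdout/stderr written to files.

    A threading.Timer kills the process if it exceeds timeout seconds.
    Returns (return_code, timed_out).
    """
    timed_out = threading.Event()
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)

        def on_timeout():
            timed_out.set()
            proc.kill()

        timer = threading.Timer(timeout, on_timeout)
        timer.start()
        try:
            proc.wait()
        finally:
            timer.cancel()
    return proc.returncode, timed_out.is_set()


def find_curl_errors(path):
    """Return the lines in path that look like curl error messages."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle if line.startswith("curl: (")]
```

Because the output goes straight to files, no pipe buffer can fill up and block the child process, which is the deadlock risk that the release notes attribute to large amounts of stdout/stderr.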

3.6.7.10

18 Sep 09:02
f903d9e
  • Improved reporting of CPU consumption time
    • It was seen in I/O bound payloads that the CPU consumption time was not reported correctly. The pilot now makes sure that no zero values are reported
    • Reported by R. Walker
  • Migration towards using psutil module has started
    • Until now, pilot has relied on executing the ps command for process information, but this is heavy on the system if many ps commands are executed in a short time
    • A. De Silva has made the psutil module available via ALRB, and P. Love has set it up in the wrapper with ‘lsetup psutil’
    • Currently, there is no requirement for psutil (the pilot falls back to other process info if psutil fails to import), but this will change soon
    • Pilot is currently only using psutil to check whether a certain process is running, with a fallback to /proc/{pid}
  • Added protection for failed writing of info dictionary to disk before server update
    • Curl normally reads this dictionary from disk; if the write fails, it should now use the dictionary directly (converted to a string)
    • Previously, the pilot would fail to inform the server, i.e. the job would become a lost heartbeat
    • The pilot might still fail before reaching this point, as it basically relies on disks with space > 0
  • Moved import of google cloud logging to beginning of real-time logging module to prevent an unexplained problem seen in Rubin jobs
    • Previously, said module was only imported when it needed to be used, but for some reason this would occasionally lead to python locking up
    • Requested by Z. Yang (Rubin)
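
The psutil usage described above (checking whether a process is running, with a fallback to /proc/{pid} when psutil cannot be imported) can be sketched as follows. The function names and the proc_root parameter are hypothetical; psutil.pid_exists() is part of the psutil API.

```python
import os

try:
    import psutil
except ImportError:
    # psutil is optional here, mirroring the fallback described above
    psutil = None


def running_via_proc(pid, proc_root="/proc"):
    """Fallback check: a running process has a directory under /proc."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))


def is_process_running(pid):
    """Return True if pid is running, preferring psutil when available."""
    if psutil is not None:
        return psutil.pid_exists(pid)
    return running_via_proc(pid)


if __name__ == "__main__":
    print("current process running:", is_process_running(os.getpid()))
```

Guarding the import at module level keeps the rest of the code free of repeated try/except blocks, and the fallback only relies on the file system layout found on Linux.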