Releases: PanDAWMS/pilot3
3.7.6.8
- Updates to pilot Harvester module used on supercomputers
- Work report is now published, leading to successful jobs on Perlmutter
- Now ignoring the “RHEL8 and clones are not supported for users” warning in the cpu_arch script, which prevented the CPU architecture from being reported
- Support for the NO_CVMFS_OK environment variable (see the sketch below)
- If set to any value, it will turn off CVMFS checks by pilot
- Used on Mare Nostrum queues
- Requested by P. Collato Soto et al
- Support for error report produced by the job transform
- If it exists, the pilot will extract error_code and error_diag and report them as exeErrorCode and exeErrorDiag (see the sketch below)
- Corresponding pilot error is 1305
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-976
- Requested by T. Maeno
- Housekeeping with pylint etc
- Multiple pilot modules (10+) improved with pylint score of 9+
- Dropped python 3.8 support since unit tests otherwise fail due to the use of more modern type hints
- Source code is verified for Python versions 3.9, 3.10 and 3.11 in the CI tests
- Bug fix
- Pilot did not send job update during long stage-in, which could lead to lost heartbeat
- The related function required the job to be in the running state, which it wasn’t
- Reported by F. Barreiro Megino
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-974
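The NO_CVMFS_OK switch above can be honored with a simple presence check on the environment. A minimal sketch, assuming the pilot only tests whether the variable is set (the helper name and surrounding logic are illustrative, not the pilot's actual code):

```python
import os

def cvmfs_checks_enabled() -> bool:
    """Return False when NO_CVMFS_OK is set to any value (illustrative helper)."""
    # Only the presence of the variable matters according to the notes above;
    # its value is not interpreted.
    return "NO_CVMFS_OK" not in os.environ

if cvmfs_checks_enabled():
    pass  # run the usual CVMFS file/directory checks here
```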
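For the transform error report, the pilot extracts error_code and error_diag and forwards them as exeErrorCode/exeErrorDiag. A hedged sketch of that extraction follows; the report's file name and JSON layout are assumptions made here for illustration only, and 1305 is the pilot error code quoted in the notes:

```python
import json
import os

PAYLOAD_ERROR_CODE = 1305  # pilot error code mentioned above

def read_transform_error_report(workdir, name="error_report.json"):
    """Extract error_code/error_diag from a transform-produced report, if present.

    The file name and JSON structure are illustrative assumptions.
    """
    path = os.path.join(workdir, name)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fp:
        report = json.load(fp)
    return {
        "exeErrorCode": report.get("error_code"),
        "exeErrorDiag": report.get("error_diag"),
        "pilotErrorCode": PAYLOAD_ERROR_CODE,
    }
```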
3.7.5.4
- Bug fix
- Corrections to the new cvmfs functions, which did not fail the test with an error code when they should have
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
3.7.4.2
- Now setting proper error codes for lsetup and remote file open errors in case of timeouts (see the sketch below)
- Before, a general payload error was set (1305); now, 1368 - "Remote file open timed out" and 1378 - "Lsetup command timed out during remote file open" will be set
- Also, increased timeout from 600 s to 900 s for remote file open
- Bug fix
- Corrected one of the new cvmfs checking functions which previously failed to read the last cvmfs update file
- Problem affected sites with non-standard cvmfs mounting
- Reported by M. Svatos
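A hedged sketch of how a timeout can be mapped to the two error codes quoted above; the constant names and the in_lsetup flag are illustrative, not the pilot's actual identifiers:

```python
import subprocess

REMOTE_FILE_OPEN_TIMED_OUT = 1368  # "Remote file open timed out"
LSETUP_TIMED_OUT = 1378            # "Lsetup command timed out during remote file open"

def run_with_timeout(cmd, timeout, in_lsetup):
    """Run a command and translate a timeout into one of the error codes above.

    Illustrative sketch; the constant names and the flag are assumptions.
    """
    try:
        subprocess.run(cmd, shell=True, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return LSETUP_TIMED_OUT if in_lsetup else REMOTE_FILE_OPEN_TIMED_OUT
    return 0

# e.g. the remote file open step mentioned above, now capped at 900 s:
# error_code = run_with_timeout("python3 open_remote_file.py ...", 900, in_lsetup=False)
```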
3.7.3.84
- Replaced all usages of curl with Python's native urllib (see the urllib sketch below)
- Trace reports, panda server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), as well as prmon dictionary upload to https://pilot.atlas-ml.org/
- This change is important both for attempting to reduce the number of non-responsive pilots (esp. following trace service curl upload) and for resolving EL9 complications (Tokyo)
- Corrected cpu model reporting in cpuconsumptionunit string on ARM
- Previously, UNKNOWN was given as cpu model name
- Note: cache size is not reported in this case as it is not available in user space
- Requested by I. Glushkov
- Added GPU info from prmon to job metrics
- E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
- To be used by monitoring
- Requested by T. Korchuganova, T. Maeno
- Now aborting stage-in loop in case graceful stop bit has been set
- Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
- The graceful stop bit gets set, e.g., when an unexpected exception is thrown
- Added the intersect (y-intercept) value from the PSS+SWAP fit to job metrics (see the fit sketch below)
- Along with the slope, it allows for plotting the fit on the monitor side
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-907
- Now performing CVMFS checks at the beginning of the pilot
- Using same list of files and directories as the wrapper
- Any failure leads to termination of the pilot with exit code 64
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
- ATLAS
- Active monitoring of remote file open verification script
- Now able to abort, e.g., if lsetup is taking too long
- Support for event service with AthenaMT
- Housekeeping
- Additional pilot modules were processed with pylint etc
- Bug fixes
- Rubin
- Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
- Corrected exception handling when socket.gethostbyaddr() is used
- Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLDDMOPS-5685
- ATLAS
- Correction for the case where looping check had not run (thus not set any internal timings) before job suspension occurred
- Led to problems with some BOINC jobs
Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
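The curl-to-urllib replacement mentioned above can be illustrated with a minimal POST helper built on urllib.request; the URL, payload and certificate handling below are placeholders and do not reproduce the pilot's actual HTTP layer:

```python
import json
import ssl
import urllib.request

def post_json(url, data, timeout=100, cert=None, key=None):
    """POST a JSON-encoded dictionary with urllib instead of spawning curl.

    A simplified sketch; the pilot's real request handling (headers, retries,
    proxy/token usage) is more involved.
    """
    context = ssl.create_default_context()
    if cert and key:
        context.load_cert_chain(certfile=cert, keyfile=key)
    request = urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")
```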
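The slope/intersect pair from the PSS+SWAP fit can be obtained with an ordinary least-squares fit against time. The plain-Python sketch below illustrates the idea and is not the pilot's own fitting code; the job metrics field names in the final comment are assumptions:

```python
def linear_fit(times, pss_plus_swap):
    """Least-squares fit y = slope * t + intersect for PSS+SWAP samples.

    Plain-Python sketch for illustration only.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(pss_plus_swap) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, pss_plus_swap))
    variance = sum((t - mean_t) ** 2 for t in times)
    slope = covariance / variance
    intersect = mean_y - slope * mean_t
    return slope, intersect

# Both values can then be published with the job metrics so the monitor can draw
# the fitted line, e.g. f"slope={slope} intersect={intersect}" (field names assumed).
```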
3.7.2.4
- Explicitly setting time zone to UTC for payload (ATLAS)
- Discussed in https://its.cern.ch/jira/browse/ATEAM-953
- Added timeouts to socket usages (see the sketch below)
- Set a timeout of 10 seconds to prevent potential hanging when there are problems with DNS resolution, or if the DNS server is slow to respond
- The socket module is used in relation to rucio traces and could potentially lead to hanging/unresponsive pilots
- Changed the default -C to -c -i in the container setup (ATLAS)
- To preserve the process ids
- Requested by A. De Silva and A. Mete
- Setting XRD_LOGLEVEL=Debug before remote file open verification script is run (ATLAS)
- To facilitate debugging problematic site(s)
- Requested by R. Taylor
- Bug fix
- Now setting rucio site name (as well as missing rucio version) correctly when middleware containers are used
- Reported by R. Kleinemuhl
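The socket timeout change above follows a simple pattern: cap blocking network calls so a slow or unresponsive DNS server cannot hang the pilot. A hedged sketch with host/port placeholders and simplified error handling:

```python
import socket

def can_connect(host, port, timeout=10):
    """Try to open a TCP connection with a 10 s cap instead of blocking indefinitely.

    Illustrative only; the pilot applies the timeout to its own socket usages
    (e.g. around the rucio trace report).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:  # covers socket.timeout and resolution errors
        print(f"connection to {host}:{port} failed or timed out: {exc}")
        return False
```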
3.7.1.50
- Updated lsetup ‘root pilot’ to lsetup ‘root pilot-default’
- Requested by A. De Silva
- Support for job suspension
- Looping job algorithm takes into account if the pilot has been suspended
- Updated python executable for running open_remote_file script (now explicit ‘python3’)
- Requested by A. De Silva
- Reporting error code 1310 when exception has been caught while executing payload
- Never include output files in log
- Previously, output files were left in payload work dir for looping jobs
- Requested by T. Korchuganova
- Now reporting support for IPv6 (or lack thereof) in the pilot log
- Dumping explicit IPs to log
- Using get() instead of direct dictionary lookup to prevent problem with transfer() call in harvester related code
- Housekeeping
- Processed ~60% of all files with Flynt (+ manual changes for the strings that Flynt missed or couldn’t handle)
- Flynt converts strings from %-formatting and .format() to f-strings
- Linting with pylint status: ~50% of all modules have a score of 9-10
- When the pilot has intercepted an external kill signal, it sends SIGTERM followed by SIGKILL 3 s later to the main payload process
- The 3 s delay might be too short and could lead to lingering payload threads, so it was increased to 10 s (see the sketch below)
- Added new pilot error code (1208) for SIGINT
- Rubin
- Do not abort pilot when getJob returns None
- Now experiment specific
- Requested by W. Guan
- Bug fixes
- The exit code for a failed download of a new proxy (following an expired proxy/certificate) was lost. The job should now fail as expected
- Reported by R. Walker
- Updated and corrected shell exit code conversions after kill signal
- Now reporting unique exit code for different kill signals
- Reported by P. Love
- Fixed an issue with real-time logging in multi-job pilots
- The real-time logging was prevented from starting in later jobs
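The SIGTERM/SIGKILL sequence described above (with the grace period raised from 3 s to 10 s) can be sketched as follows; the function is illustrative and ignores the process-group and child-process handling that the pilot also performs:

```python
import os
import signal
import time

def kill_payload(pid, grace=10):
    """Send SIGTERM, wait a grace period, then SIGKILL (simplified sketch)."""
    try:
        os.kill(pid, signal.SIGTERM)
        time.sleep(grace)  # give payload threads time to wind down (now 10 s)
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process exited on its own during the grace period
```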
3.7.0.36
- Added verification of exception thrown during payload error interpretation
- Previously, a FileHandlingFailure exception thrown during payload error interpretation would lead to a crash
- Also, now verifying that payload stdout/err actually exists before trying to read it
- Reported by Z. Yang (Rubin)
- Added xroot client log file to log if it exists (see the sketch below)
- Adding ‘XRD_LOGLEVEL=Dump XRD_LOGFILE=xrdlog.txt’ to PQ.environ causes the wrapper to set these env vars, and the pilot also adds them to the payload command. Upon a file transfer, the variables (if set) trigger the xroot client to create a log file that is stored in the pilot’s launch directory
- There is one file per file transfer, and the pilot renames them to xrdlog-<LFN|payload>.txt (but using the file name pattern from XRD_LOGFILE). They are then copied to the tar directory (the corresponding log file for the job log transfer will of course be missed)
- Requested by R. Walker
- Aborting pilot immediately after server instruction ‘tobekilled’
- The job monitor can otherwise get stuck in a loop that can take some time to get out of
- The core dump produced by the pilot when a looping payload process has been detected is now created for the youngest child process rather than the main payload process
- Extended panda secrets usage for logging into docker (for user container)
- Any secrets are redacted in logs
- Requested by R. Zhang
- Housekeeping
- Corrected all module headers for Apache license 2.0 so they pass the updated apache-license-check
- Several pilot modules (10%) now pass the pydocstyle test
- Pilot module was processed with the black tool
- Some minor linting was done using pylint - currently around 30% of the pilot modules have a score of over 9 out of 10
- The plan is to process all modules with these commands and eventually use them with local pre-commit and remote GitHub Actions that run after a PR
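The xroot client log handling above boils down to renaming a per-transfer log file and copying it to the tar directory. A hedged sketch, where pattern stands in for the XRD_LOGFILE value and label for the LFN or 'payload':

```python
import os
import shutil

def collect_xrd_log(launch_dir, tar_dir, label, pattern="xrdlog.txt"):
    """Rename xrdlog.txt (if present) to xrdlog-<label>.txt and copy it to tar_dir.

    Illustrative sketch only; pattern corresponds to XRD_LOGFILE, label to the
    LFN or 'payload'.
    """
    source = os.path.join(launch_dir, pattern)
    if not os.path.exists(source):
        return None
    base, ext = os.path.splitext(pattern)
    target = os.path.join(launch_dir, f"{base}-{label}{ext}")
    os.rename(source, target)
    shutil.copy2(target, tar_dir)
    return target
```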
3.6.9.10
- Looping killer update
- Added exception handling for unexpected stat output seen in a job at MWT2, which effectively aborted the looping job killer and led to multiple unsuccessful attempts at killing the payload
- The updated function is, however, now only used as a fallback when the psutil module is not available (it largely is, except on some HPC sites)
3.6.8.29
- Redirecting stdout/stderr to files for trace service curls
- This could prevent thread deadlocks in the standard Python subprocess.communicate() function in case of an overwhelming amount of stdout/stderr. Since subprocess.communicate() is no longer used, the internal timeout capability in subprocess could no longer be used either; it was reimplemented with a threading timer that sets the relevant error code if necessary (see the sketch below)
- The ‘last’ output from curl is stored in trace_curl_last.stdout/stderr, and gets appended to trace_curl.stdout/stderr
- The trace_curl_last.stdout/stderr files are searched for any curl errors (the curl command always returns a 0 exit code even when there was an error, so the output has to be processed)
- A failed rucio trace curl operation is now reported with job metrics
- Example: rucioTraceError=N
- Increased the connection timeout from 20 s to 100 s to be in line with panda server curl operations (where we don’t see any problems)
- Related JIRA ticket: https://its.cern.ch/jira/browse/ATLASPANDA-835
- Reporting prmon read_bytes/total_input_size with job metrics (‘readbyterate’)
- Information to be used for optimizing brokerage
- Requested by J. Elmsheuser, R. Walker
- Extended usage of psutil
- Job monitoring is now using psutil to discover prmon pid
- If psutil is not available (e.g., as is the case on Mare Nostrum), the code falls back to the old ps command usage
- Added protection against expired job objects in job_monitor loop
- Reported by W. Guan/Z. Yang
- Updated GitHub Action workflows
- Unit tests and flake8 are now independent workflows
- Moved to latest flake8 version 6.1.0 for flake8 verification
- All tests are run for python versions 3.8, 3.9, 3.10 and 3.11
- Tested pilot running under python versions 3.9.18 and 3.11.5
- Grid jobs are currently running under python version 3.9.14 but will soon switch to 3.9.18 to be in line with user tools (like rucio)
- Python version 3.11.5 will be the default version on EL9
- Requested by A. De Silva
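The pattern described above for the trace curls - redirecting stdout/stderr to files instead of using subprocess.communicate(), with the timeout reimplemented via a threading timer - can be sketched as follows (error-code handling and the file names are simplified assumptions):

```python
import subprocess
import threading

def run_with_file_output(cmd, stdout_name, stderr_name, timeout=100):
    """Run a command with stdout/stderr redirected to files and a timer-based timeout.

    A simplified sketch of the approach; the real code also sets the relevant
    pilot error code when the timer fires.
    """
    timed_out = {"value": False}

    with open(stdout_name, "wb") as out, open(stderr_name, "wb") as err:
        process = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)

        def _kill():
            timed_out["value"] = True
            process.kill()

        timer = threading.Timer(timeout, _kill)
        timer.start()
        try:
            exit_code = process.wait()
        finally:
            timer.cancel()

    return exit_code, timed_out["value"]
```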
3.6.7.10
- Improved reporting of CPU consumption time
- It was seen in I/O-bound payloads that the CPU consumption time was not reported correctly. The pilot now makes sure that no zero values are reported
- Reported by R. Walker
- Migration towards using psutil module has started
- Until now, the pilot has relied on executing the ps command for process information, but this is heavy on the system if many ps commands are executed in a short time
- A. De Silva has made the psutil module available via ALRB, and it is set up in the wrapper with ‘lsetup psutil’ by P. Love
- Currently, there is no requirement for psutil - the pilot has a fallback to using other process info in case psutil fails to import - but this will change soon
- The pilot is currently only using psutil to find out whether a certain process is running or not, with a fallback to /proc/{pid} (see the sketch below)
- Added protection for failed writing of info dictionary to disk before server update
- Curl normally uses this dictionary, but should now instead use the dictionary explicitly (converted to string)
- Previously, the pilot would fail to inform the server, i.e. the job would become a lost heartbeat
- The pilot might still fail before reaching this point, as it basically relies on disks with space > 0
- Moved import of google cloud logging to beginning of real-time logging module to prevent an unexplained problem seen in Rubin jobs
- Previously, said module was only imported when it needed to be used, but for some reason this would occasionally lead to Python locking up
- Requested by Z. Yang (Rubin)
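The psutil usage with a /proc fallback can be sketched as below; the helper name is illustrative and the real pilot code covers more than a simple liveness check:

```python
import os

try:
    import psutil
except ImportError:
    psutil = None  # fall back to /proc below

def is_process_running(pid):
    """Return True if the given pid is alive, preferring psutil when available.

    Illustrative sketch of the fallback pattern described above.
    """
    if psutil:
        return psutil.pid_exists(pid)
    return os.path.exists(f"/proc/{pid}")
```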