Releases: PanDAWMS/pilot3
3.7.6.8
- Updates to pilot Harvester module used on supercomputers
- Work report is now published, leading to successful jobs on Perlmutter
- Now ignoring the “RHEL8 and clones are not supported for users” warning in the cpu_arch script, which prevented the CPU architecture from being reported
- Support for the NO_CVMFS_OK environment variable (see the sketch below)
- If set to any value, it will turn off CVMFS checks by pilot
- Used on Mare Nostrum queues
- Requested by P. Collato Soto et al
- Support for error report produced by the job transform
- If it exists, the pilot will extract error_code and error_diag and report them as exeErrorCode and exeErrorDiag (see the sketch below)
- Corresponding pilot error is 1305
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-976
- Requested by T. Maeno
- Housekeeping with pylint etc
- Multiple pilot modules (10+) improved with pylint score of 9+
- Dropped python 3.8 support since unit tests otherwise fail due to the use of more modern type hints
- Source code is verified for Python versions 3.9, 3.10 and 3.11 in the CI tests
- Bug fix
- Pilot did not send job update during long stage-in, which could lead to lost heartbeat
- The related function required the job to be in the running state, which it wasn’t
- Reported by F. Barreiro Megino
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-974
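The NO_CVMFS_OK switch above can be honored with a simple presence check on the environment. A minimal sketch, assuming the pilot only tests whether the variable is set (the helper name and surrounding logic are illustrative, not the pilot's actual code):

```python
import os

def cvmfs_checks_enabled() -> bool:
    """Return False when NO_CVMFS_OK is set to any value (illustrative helper)."""
    # Only the presence of the variable matters according to the notes above;
    # its value is not interpreted.
    return "NO_CVMFS_OK" not in os.environ

if cvmfs_checks_enabled():
    pass  # run the usual CVMFS file/directory checks here
```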
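For the transform error report, the pilot extracts error_code and error_diag and forwards them as exeErrorCode/exeErrorDiag. A hedged sketch of that extraction follows; the report's file name and JSON layout are assumptions made here for illustration only, and 1305 is the pilot error code quoted in the notes:

```python
import json
import os

PAYLOAD_ERROR_CODE = 1305  # pilot error code mentioned above

def read_transform_error_report(workdir, name="error_report.json"):
    """Extract error_code/error_diag from a transform-produced report, if present.

    The file name and JSON structure are illustrative assumptions.
    """
    path = os.path.join(workdir, name)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fp:
        report = json.load(fp)
    return {
        "exeErrorCode": report.get("error_code"),
        "exeErrorDiag": report.get("error_diag"),
        "pilotErrorCode": PAYLOAD_ERROR_CODE,
    }
```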
3.7.5.4
- Bug fix
- Corrections to the new cvmfs functions, which did not fail the test with an error code when they should have
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
3.7.4.2
- Now setting proper error codes for lsetup and remote file open errors in case of timeouts (see the sketch below)
- Before, a general payload error was set (1305); now, 1368 - "Remote file open timed out" and 1378 - "Lsetup command timed out during remote file open" will be set
- Also, increased timeout from 600 s to 900 s for remote file open
- Bug fix
- Corrected one of the new cvmfs checking functions which previously failed to read the last cvmfs update file
- Problem affected sites with non-standard cvmfs mounting
- Reported by M. Svatos
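A hedged sketch of how a timeout can be mapped to the two error codes quoted above; the constant names and the in_lsetup flag are illustrative, not the pilot's actual identifiers:

```python
import subprocess

REMOTE_FILE_OPEN_TIMED_OUT = 1368  # "Remote file open timed out"
LSETUP_TIMED_OUT = 1378            # "Lsetup command timed out during remote file open"

def run_with_timeout(cmd, timeout, in_lsetup):
    """Run a command and translate a timeout into one of the error codes above.

    Illustrative sketch; the constant names and the flag are assumptions.
    """
    try:
        subprocess.run(cmd, shell=True, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return LSETUP_TIMED_OUT if in_lsetup else REMOTE_FILE_OPEN_TIMED_OUT
    return 0

# e.g. the remote file open step mentioned above, now capped at 900 s:
# error_code = run_with_timeout("python3 open_remote_file.py ...", 900, in_lsetup=False)
```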
3.7.3.84
- Replaced all usages of curl with Python's native urllib (see the urllib sketch below)
- Trace reports, panda server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), as well as prmon dictionary upload to https://pilot.atlas-ml.org/
- This change is important both for attempting to reduce the number of non-responsive pilots (esp. following trace service curl upload) and for resolving EL9 complications (Tokyo)
- Corrected cpu model reporting in cpuconsumptionunit string on ARM
- Previously, UNKNOWN was given as cpu model name
- Note: cache size is not reported in this case as it is not available in user space
- Requested by I. Glushkov
- Added GPU info from prmon to job metrics
- E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
- To be used by monitoring
- Requested by T. Korchuganova, T. Maeno
- Now aborting stage-in loop in case graceful stop bit has been set
- Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
- The graceful stop bit gets set, e.g., when an unexpected exception is thrown
- Added the intersect (y-intercept) value from the PSS+SWAP fit to job metrics (see the fit sketch below)
- Along with the slope, it allows for plotting the fit on the monitor side
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-907
- Now performing CVMFS checks at the beginning of the pilot
- Using same list of files and directories as the wrapper
- Any failure leads to termination of the pilot with exit code 64
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
- ATLAS
- Active monitoring of remote file open verification script
- Now able to abort, e.g., if lsetup is taking too long
- Support for event service with AthenaMT
- Housekeeping
- Additional pilot modules were processed with pylint etc
- Bug fixes
- Rubin
- Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
- Corrected exception handling when socket.gethostbyaddr() is used
- Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLDDMOPS-5685
- ATLAS
- Correction for the case where looping check had not run (thus not set any internal timings) before job suspension occurred
- Led to problems with some BOINC jobs
Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
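The curl-to-urllib replacement mentioned above can be illustrated with a minimal POST helper built on urllib.request; the URL, payload and certificate handling below are placeholders and do not reproduce the pilot's actual HTTP layer:

```python
import json
import ssl
import urllib.request

def post_json(url, data, timeout=100, cert=None, key=None):
    """POST a JSON-encoded dictionary with urllib instead of spawning curl.

    A simplified sketch; the pilot's real request handling (headers, retries,
    proxy/token usage) is more involved.
    """
    context = ssl.create_default_context()
    if cert and key:
        context.load_cert_chain(certfile=cert, keyfile=key)
    request = urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")
```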
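The slope/intersect pair from the PSS+SWAP fit can be obtained with an ordinary least-squares fit against time. The plain-Python sketch below illustrates the idea and is not the pilot's own fitting code; the job metrics field names in the final comment are assumptions:

```python
def linear_fit(times, pss_plus_swap):
    """Least-squares fit y = slope * t + intersect for PSS+SWAP samples.

    Plain-Python sketch for illustration only.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(pss_plus_swap) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, pss_plus_swap))
    variance = sum((t - mean_t) ** 2 for t in times)
    slope = covariance / variance
    intersect = mean_y - slope * mean_t
    return slope, intersect

# Both values can then be published with the job metrics so the monitor can draw
# the fitted line, e.g. f"slope={slope} intersect={intersect}" (field names assumed).
```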
3.7.2.4
- Explicitly setting time zone to UTC for payload (ATLAS)
- Discussed in https://its.cern.ch/jira/browse/ATEAM-953
- Added timeouts to socket usages (see the sketch below)
- Set a timeout of 10 seconds to prevent potential hanging when there are problems with DNS resolution, or if the DNS server is slow to respond
- The socket module is used in relation to rucio traces and could potentially lead to hanging/unresponsive pilots
- Changed the default -C to -c -i in the container setup (ATLAS)
- To preserve the process ids
- Requested by A. De Silva and A. Mete
- Setting XRD_LOGLEVEL=Debug before remote file open verification script is run (ATLAS)
- To facilitate debugging problematic site(s)
- Requested by R. Taylor
- Bug fix
- Now setting rucio site name (as well as missing rucio version) correctly when middleware containers are used
- Reported by R. Kleinemuhl
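The socket timeout change above follows a simple pattern: cap blocking network calls so a slow or unresponsive DNS server cannot hang the pilot. A hedged sketch with host/port placeholders and simplified error handling:

```python
import socket

def can_connect(host, port, timeout=10):
    """Try to open a TCP connection with a 10 s cap instead of blocking indefinitely.

    Illustrative only; the pilot applies the timeout to its own socket usages
    (e.g. around the rucio trace report).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:  # covers socket.timeout and resolution errors
        print(f"connection to {host}:{port} failed or timed out: {exc}")
        return False
```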
3.7.1.50
- Updated lsetup ‘root pilot’ to lsetup ‘root pilot-default’
- Requested by A. De Silva
- Support for job suspension
- Looping job algorithm takes into account if the pilot has been suspended
- Updated python executable for running open_remote_file script (now explicit ‘python3’)
- Requested by A. De Silva
- Reporting error code 1310 when exception has been caught while executing payload
- Never include output files in log
- Previously, output files were left in payload work dir for looping jobs
- Requested by T. Korchuganova
- Now reporting support for IPv6 (or lack thereof) in the pilot log
- Dumping explicit IPs to log
- Using get() instead of direct dictionary lookup to prevent problem with transfer() call in harvester related code
- Housekeeping
- Processed ~60% of all files with Flynt (+ manual changes for the strings that Flynt missed or couldn’t handle)
- Flynt converts strings from %-formatting and .format() to f-strings
- Linting with pylint status: ~50% of all modules have a score of 9-10
- When the pilot has intercepted an external kill signal, it sends SIGTERM followed by SIGKILL 3 s later to the main payload process
- The 3 s delay might be too short and could lead to lingering payload threads, so it was increased to 10 s (see the sketch below)
- Added new pilot error code (1208) for SIGINT
- Rubin
- Do not abort pilot when getJob returns None
- Now experiment specific
- Requested by W. Guan
- Bug fixes
- The exit code for a failed download of a new proxy (following an expired proxy/certificate) was lost. The job should now fail as expected
- Reported by R. Walker
- Updated and corrected shell exit code conversions after kill signal
- Now reporting unique exit code for different kill signals
- Reported by P. Love
- Fixed an issue with real-time logging in multi-job pilots
- The real-time logging was prevented from starting in later jobs
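The SIGTERM/SIGKILL sequence described above (with the grace period raised from 3 s to 10 s) can be sketched as follows; the function is illustrative and ignores the process-group and child-process handling that the pilot also performs:

```python
import os
import signal
import time

def kill_payload(pid, grace=10):
    """Send SIGTERM, wait a grace period, then SIGKILL (simplified sketch)."""
    try:
        os.kill(pid, signal.SIGTERM)
        time.sleep(grace)  # give payload threads time to wind down (now 10 s)
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process exited on its own during the grace period
```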
3.7.0.36
- Added verification of exception thrown during payload error interpretation
- Previously, a FileHandlingFailure exception thrown during payload error interpretation would lead to a crash
- Also, now verifying that payload stdout/err actually exists before trying to read it
- Reported by Z. Yang (Rubin)
- Added xroot client log file to log if it exists (see the sketch below)
- Adding ‘XRD_LOGLEVEL=Dump XRD_LOGFILE=xrdlog.txt’ to PQ.environ causes the wrapper to set these env vars, and the pilot also adds them to the payload command. Upon a file transfer, the variables (if set) trigger the xroot client to create a log file that is stored in the pilot’s launch directory
- There is one file per file transfer, and the pilot renames them to xrdlog-<LFN|payload>.txt (but using the file name pattern from XRD_LOGFILE). They are then copied to the tar directory (the corresponding log file for the job log transfer will of course be missed)
- Requested by R. Walker
- Aborting pilot immediately after server instruction ‘tobekilled’
- The job monitor can otherwise get stuck in a loop that can take some time to get out of
- The core dump produced by the pilot when a looping payload process has been detected is now created for the youngest child process rather than the main payload process
- Extended panda secrets usage for logging into docker (for user container)
- Any secrets are redacted in logs
- Requested by R. Zhang
- Housekeeping
- Corrected all module headers for Apache license 2.0 so they pass the updated apache-license-check
- Several pilot modules (10%) now pass the pydocstyle test
- Pilot module was processed with the black tool
- Some minor linting was done using pylint - currently around 30% of the pilot modules have a score of over 9 out of 10
- The plan is to process all modules with these commands and eventually use them with local pre-commit and remote GitHub Actions that run after a PR
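The xroot client log handling above boils down to renaming a per-transfer log file and copying it to the tar directory. A hedged sketch, where pattern stands in for the XRD_LOGFILE value and label for the LFN or 'payload':

```python
import os
import shutil

def collect_xrd_log(launch_dir, tar_dir, label, pattern="xrdlog.txt"):
    """Rename xrdlog.txt (if present) to xrdlog-<label>.txt and copy it to tar_dir.

    Illustrative sketch only; pattern corresponds to XRD_LOGFILE, label to the
    LFN or 'payload'.
    """
    source = os.path.join(launch_dir, pattern)
    if not os.path.exists(source):
        return None
    base, ext = os.path.splitext(pattern)
    target = os.path.join(launch_dir, f"{base}-{label}{ext}")
    os.rename(source, target)
    shutil.copy2(target, tar_dir)
    return target
```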
3.6.9.10
- Looping killer update
- Added exception handling for unexpected stat output seen in a job at MWT2, which effectively aborted the looping job killer and led to multiple unsuccessful attempts at killing the payload
- The updated function is, however, now only used as a fallback when the psutil module is not available (it largely is, except on some HPC sites)
3.6.8.29
- Redirecting stdout/stderr to files for trace service curls
- This could prevent thread deadlocks in the standard Python subprocess.communicate() function in case of an overwhelming amount of stdout/stderr. Since subprocess.communicate() is no longer used, the internal timeout capability in subprocess could no longer be used either; it was reimplemented with a threading timer that sets the relevant error code if necessary (see the sketch below)
- The ‘last’ output from curl is stored in trace_curl_last.stdout/stderr, and gets appended to trace_curl.stdout/stderr
- The trace_curl_last.stdout/stderr files are searched for any curl errors (the curl command always returns a 0 exit code even when there was an error, so the output has to be processed)
- A failed rucio trace curl operation is now reported with job metrics
- Example: rucioTraceError=N
- Increased the connection timeout from 20 s to 100 s to be in line with panda server curl operations (where we don’t see any problems)
- Related JIRA ticket: https://its.cern.ch/jira/browse/ATLASPANDA-835
- Reporting prmon read_bytes/total_input_size with job metrics (‘readbyterate’)
- Information to be used for optimizing brokerage
- Requested by J. Elmsheuser, R. Walker
- Extended usage of psutil
- Job monitoring is now using psutil to discover prmon pid
- If psutil is not available (e.g., as is the case on Mare Nostrum), the code falls back to the old ps command usage
- Added protection against expired job objects in job_monitor loop
- Reported by W. Guan/Z. Yang
- Updated GitHub Action workflows
- Unit tests and flake8 are now independent workflows
- Moved to latest flake8 version 6.1.0 for flake8 verification
- All tests are run for python versions 3.8, 3.9, 3.10 and 3.11
- Tested pilot running under python versions 3.9.18 and 3.11.5
- Grid jobs are currently running under python version 3.9.14 but will soon switch to 3.9.18 to be in line with user tools (like rucio)
- Python version 3.11.5 will be the default version on EL9
- Requested by A. De Silva
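The pattern described above for the trace curls - redirecting stdout/stderr to files instead of using subprocess.communicate(), with the timeout reimplemented via a threading timer - can be sketched as follows (error-code handling and the file names are simplified assumptions):

```python
import subprocess
import threading

def run_with_file_output(cmd, stdout_name, stderr_name, timeout=100):
    """Run a command with stdout/stderr redirected to files and a timer-based timeout.

    A simplified sketch of the approach; the real code also sets the relevant
    pilot error code when the timer fires.
    """
    timed_out = {"value": False}

    with open(stdout_name, "wb") as out, open(stderr_name, "wb") as err:
        process = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)

        def _kill():
            timed_out["value"] = True
            process.kill()

        timer = threading.Timer(timeout, _kill)
        timer.start()
        try:
            exit_code = process.wait()
        finally:
            timer.cancel()

    return exit_code, timed_out["value"]
```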
3.6.7.10
- Improved reporting of CPU consumption time
- It was seen in I/O-bound payloads that the CPU consumption time was not reported correctly. The pilot now makes sure that no zero values are reported
- Reported by R. Walker
- Migration towards using psutil module has started
- Until now, the pilot has relied on executing the ps command for process information, but this is heavy on the system if many ps commands are executed in a short time
- A. De Silva has made the psutil module available via ALRB, and it is set up in the wrapper with ‘lsetup psutil’ by P. Love
- Currently, there is no requirement for psutil - the pilot has a fallback to using other process info in case psutil fails to import - but this will change soon
- The pilot is currently only using psutil to find out whether a certain process is running or not, with a fallback to /proc/{pid} (see the sketch below)
- Added protection for failed writing of info dictionary to disk before server update
- Curl normally uses this dictionary, but should now instead use the dictionary explicitly (converted to string)
- Previously, the pilot would fail to inform the server, i.e. the job would become a lost heartbeat
- The pilot might still fail before reaching this point, as it basically relies on disks with space > 0
- Moved import of google cloud logging to beginning of real-time logging module to prevent an unexplained problem seen in Rubin jobs
- Previously, said module was only imported when it needed to be used, but for some reason this would occasionally lead to Python locking up
- Requested by Z. Yang (Rubin)
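The psutil usage with a /proc fallback can be sketched as below; the helper name is illustrative and the real pilot code covers more than a simple liveness check:

```python
import os

try:
    import psutil
except ImportError:
    psutil = None  # fall back to /proc below

def is_process_running(pid):
    """Return True if the given pid is alive, preferring psutil when available.

    Illustrative sketch of the fallback pattern described above.
    """
    if psutil:
        return psutil.pid_exists(pid)
    return os.path.exists(f"/proc/{pid}")
```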