Releases: PanDAWMS/pilot3
3.9.4.15
- Replaced a problematic command execution function with a simpler and improved version
- The previous version caused the output of some commands (including df and find) to be partially missed due to complications with a threaded readout function. This in turn affected the looping job algorithm and led to job failures, at least at BNL, during last weekend
- Internal improvements
- Now handling empty output/file names from find command used by looping job algorithm
- Harmless though, as it would anyway be filtered out later
- Bug fix
- Added missing f in f-statement, which previously led to an uninformative log message for remaining disk space
- Reported by R. Walker
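The effect of the missing f prefix can be reproduced in a few lines. This is only an illustration: the variable name and message text below are hypothetical and not taken from the pilot code.

```python
remaining_gb = 12.5  # hypothetical value, for illustration only

# Without the f prefix, the placeholder is kept literally in the message.
broken = "remaining disk space: {remaining_gb} GB"

# With the f prefix, the value is interpolated as intended.
fixed = f"remaining disk space: {remaining_gb} GB"

print(broken)
print(fixed)
```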
3.9.3.2
- Dealing with unreasonably large CPU consumption times
- Introduced a sleep in the threaded readout function used by the main command execution function, as well as non-blocking queue put of read messages, to prevent this function from spending too much CPU
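A minimal sketch of what such a readout loop might look like, assuming a line-oriented stream and a standard `queue.Queue`. The function name `read_stream`, the `pause` parameter and the choice to drop a line when the queue is full are assumptions made for illustration, not the pilot's actual implementation.

```python
import queue
import time


def read_stream(stream, out_queue, stop_event, pause=0.01):
    """Forward lines from a text stream to out_queue without busy-waiting."""
    for line in iter(stream.readline, ""):
        try:
            # Non-blocking put: never stall the reader on a full queue.
            out_queue.put_nowait(line)
        except queue.Full:
            pass  # assumption: dropping the line is preferable to blocking
        if stop_event.is_set():
            break
        # A short sleep keeps the thread from spinning at 100% CPU.
        time.sleep(pause)
```

In a real command executor this function would typically run in a `threading.Thread` with `proc.stdout` as the stream. Note that the 3.9.4.15 entry above reports that the threaded readout was later replaced because some output could be missed.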
- Other internal improvements
- Pausing attempts to write to OOM score adj file since it is generating too many log errors
- Requested by F. Luehring
- Removed a useless log message that was repeated many times
- Requested by F. Luehring
- Fixed a potential problem with stdout/stderr file object when there is no more space on the device (file objects are now only closed if they still exist)
- This would otherwise lead to lost heartbeat
- Fixed a blocking problem when collecting zombies
- Before, the pilot would wait until the process went away, but this can block for a long time
- Requested by Z. Yang
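One common way to collect zombies without blocking on POSIX systems is `os.waitpid` with the `os.WNOHANG` flag. The sketch below shows that general technique; the helper name `reap_zombies` is hypothetical and the pilot may solve this differently.

```python
import os


def reap_zombies():
    """Collect already-exited child processes without blocking.

    Returns the list of PIDs that were reaped during this call.
    """
    reaped = []
    while True:
        try:
            # WNOHANG makes waitpid return immediately instead of waiting.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # there are no child processes at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

Calling such a helper periodically from a monitoring loop avoids waiting on any single process that is slow to go away.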
3.9.2.41
- Added new pilot option -w / --notokenrenewal to turn off OIDC token renewals
- Needed on Perlmutter and requested by D. Benjamin
- Alternative to use PQ.catchall introduced in previous pilot version
- Pilot will now notice if the voms proxy has less than 72h left at the startup of the pilot - if so, it will fail with new wrapper error code 80
- Harvester has also been updated for this change
- Requested by R. Walker
- Discussed in JIRA ticket ATLASPANDA-1156
- Regarding lingering payload processes leading to too high CPU efficiency
- Discussed in JIRA tickets ADCMONITOR-551 and ATLASPANDA-1204
- Pilot identifies any lingering processes after the payload has finished and kills them. These are unaccounted processes that are children of the pilot and not necessarily of the payload, since processes forked by the payload are inherited by the pilot if the payload is killed by the OS
- Dealing with unreasonably large CPU consumption times
- CPU consumption times are now only measured after ten seconds of payload running
- The pilot tries to sum up the CPU consumption time using /proc/$PID/stat
- A returned CPU consumption time is compared with the previous one; if the quotient is larger than 5 (under normal CPU load) or 10 (under high CPU load), the result is ignored
- A high CPU load is defined as 80% load, as measured during 0.5 s
- Before the resulting CPU consumption time is stored, the existence of /proc/self/statm is verified to make sure there are no problems with /proc itself (the value is discarded if the file no longer exists)
- Limits might be modified in later pilot versions
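The checks listed above can be sketched roughly as follows. The function names are hypothetical, the quotient is assumed to be new value divided by previous value, and only the utime and stime fields of the stat line are summed; the pilot's actual bookkeeping may differ.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces, so split after the
    # closing parenthesis; utime and stime are fields 14 and 15 overall.
    fields = stat_line.rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])


def accept_cpu_time(previous, current, high_load=False):
    """Return True if the new CPU consumption time looks plausible."""
    if previous <= 0:
        return True  # nothing to compare against yet
    limit = 10 if high_load else 5  # limits quoted in the notes above
    return current / previous <= limit


def proc_is_healthy():
    """Check that /proc itself still looks usable."""
    return os.path.exists("/proc/self/statm")
```

Ticks can be converted to seconds by dividing by `os.sysconf("SC_CLK_TCK")`. The high-load flag would come from a separate CPU load measurement, which is not shown here.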
- Further improvements for Karolina HPC
- Switching to IPv4 when using urllib resolved most of the problems with cut job definitions, but not all.
- Force curl and IPv4 using catchall (“curlgetjob”)
- Additional updates are foreseen
- Updates for alternative stage-out algorithm
- For unified queues: the pilot chooses the proper destination by considering the write_lan_analysis and write_lan activities for analysis jobs, or simply write_lan for production jobs
- job.nucleus is excluded as a possible alt-stageout destination
- Details in pilot repo issue 152
- Discussed in JIRA ticket ATLASPANDA-994
- Pilot timing and remote i/o
- Until now, the time it takes for the remote i/o file verification to finish has been a part of the setup time. Now, it is instead added to the stage-in time
- Requested by R. Walker
- The corresponding pilot wiki page has been updated: https://github.com/PanDAWMS/pilot3/wiki/Timing-Measurements
- JIRA ticket ATLASPANDA-1221
- Improved the error message displayed on the monitor job page for remote i/o failures
- Internal improvements
- Added thread synchronization in command execution function to get rid of annoying (but harmless) stderr “Poll: bad file descriptor reading from request pipe”
- Improved exception handling for socket related errors (specifically for initializing the trace report)
- Requested by Z. Yang (BNL/Rubin)
- Improved dmesg handling (verifying that the found memory error belongs to a process that is a known child of the payload)
- Requested by F. Luehring
- JIRA ticket ATLASPANDA-1214
- A few more modules were processed with pylint
- Bug fixes
- Time-outs for remote i/o verification did not work as expected, now corrected
- Fix confirmed e.g. in job 6425468476
- JIRA ticket ATLASPANDA-1210
- Corrected core number reporting in cpuconsumptionunit CPU info string
Contributions from A. Anisenkov, P. Nilsson
3.9.1.13
- Internal improvements
- Now using python native versions instead of executing external commands for
- grep - in the case of AVX2 checks
- uuidgen - used for trace report
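As a rough illustration of what native replacements for these two commands could look like, the sketch below uses the standard `uuid` module and a plain text scan of /proc/cpuinfo content. The helper names are hypothetical and not taken from the pilot code.

```python
import uuid


def new_uuid():
    """Native replacement for calling the external uuidgen command."""
    return str(uuid.uuid4())


def has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in the given cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Compare whole tokens so that e.g. 'avx' does not match 'avx2'.
            if "avx2" in line.split(":", 1)[-1].split():
                return True
    return False
```

On Linux, `has_avx2` would be called with the contents of /proc/cpuinfo read from disk.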
- Bug fixes
- Improved exception handling in thread reading stdout in function for executing commands
- Can otherwise lead to harmless “ValueError: I/O operation on closed file” error (log message only)
- Fixed problem with parsing stdout from arcproxy leading to “Certificate has expired” failures
- Discussed in JIRA ticket ATLASPANDA-1157
- Reported by I. Glushkov
3.9.0.17
- Time-out in url open function now configurable
- Before, a time-out of 30 s was used, but it was seen at Rubin that this might not be enough
- The time-out is now 120 s, but can be changed in the pilot config file
- Requested by W. Guan
- Update to main command execution function which now uses threads to handle stdout/stderr from command
- It is suspected that too much output can cause buffer overflow, which theoretically could hang the pilot python process
- Using a stack instead of recursion in function for finding processes that belong to a group
- Problem seen on UNI-SIEGEN-HEP with a large number of processes, leading to max recursion depth
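The general idea of replacing recursion with an explicit stack can be sketched as below. For simplicity, the example assumes the process relationships are available as a pid-to-parent-pid mapping; both that input format and the function name are assumptions for illustration.

```python
def find_descendants(root_pid, parent_of):
    """Return all descendants of root_pid using an explicit stack.

    parent_of maps each pid to its parent pid (hypothetical input format).
    """
    # Invert the mapping once: parent pid -> list of child pids.
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)

    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        for child in children.get(pid, []):
            if child not in seen:  # guards against cycles in bad input
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found
```

Because the pending work lives in a list rather than on the call stack, the depth of the process tree is limited only by available memory, not by Python's recursion limit.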
- OIDC tokens on HPCs
- Pilot can now skip token renewal if keyword NO_TOKEN_RENEWAL is present in PQ.catchall
- Renewal mechanism on HPCs is done in Harvester
- Bug fixes
- Prevented the threaded heartbeat function from sending anything but the “running” state
- This fixes a rare case when the “finished” state was sent before the log transfer had finished
- Reported by Z. Yang (Rubin)
- Now sending ddmendpoint info with field ‘endpoint’ instead of ‘ddmendpoint’
- See discussion in JIRA ticket about alternative stage-out: ATLASPANDA-994
- Harmless problem in server update function when debug mode is switched on by the pilot
- Unexpected server response: “Succeeded” instead of “StatusCode=0”, where the former could not be parsed by the urllib parse_qs() function
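The difference between the two responses is easy to see with `urllib.parse.parse_qs` directly: a string without a key=value pair yields an empty dictionary. The wrapper function below is a hypothetical sketch of defensive parsing, not the pilot's actual code.

```python
from urllib.parse import parse_qs


def parse_status_code(response_text):
    """Return the integer StatusCode, or None if the field is absent."""
    fields = parse_qs(response_text)
    if "StatusCode" not in fields:
        return None  # e.g. a bare "Succeeded" carries no key=value pair
    return int(fields["StatusCode"][0])


print(parse_qs("StatusCode=0"))
print(parse_qs("Succeeded"))
```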
3.8.2.8
- Alternative stage-out
- Additional stage-out attempt for failed transfers (data and log files) to different storage if configured in astorages
- Being discussed in JIRA ticket ATLASPANDA-994 “Failover stage-out to write_lan/1 RSE”
- Pull request #142
- Improved IPv6 address extraction
- Problem with pattern recognition seen with Alma9 at Wuppertal
- Reported by T. Harenberg
- Replaced remaining ps command usage with a call to the more efficient psutil python module
- A problem during high CPU load was seen on an ARM resource (with 256 cores) due to too many concurrent ps processes (KIT)
- Reported by M. Schnepf
- Now possible to set real-time logging server via PQ.catchall
- Requested by I. Vukotic
- Bug fixes
- Fixed typo in infosys initialization (harmless)
- Pull request #143
- Improved exception handling during randomization of panda server address
- Added socket.gaierror, “Name or service not known”
- Could otherwise lead to lost heartbeat, although problem only seen in jobs that had already failed with sending job updates to server
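A minimal sketch of this kind of exception handling, assuming the randomization resolves the server host name and picks one of the returned addresses. The function name and the injectable `resolver` parameter are hypothetical and exist only to keep the example self-contained.

```python
import random
import socket


def pick_server_ip(hostname, resolver=socket.gethostbyname_ex):
    """Return a randomly chosen IP for hostname, or None if lookup fails."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except socket.gaierror:
        # Raised e.g. for "Name or service not known"; do not let it
        # propagate and take down the calling thread.
        return None
    return random.choice(addresses) if addresses else None
```

A caller that receives `None` could then keep using the unresolved host name instead of failing.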
Contributions from A. Anisenkov, P. Nilsson
3.8.1.66
- Added default path for ifconfig command (used to look up IPv6 info) if command not found
- Support for OIDC tokens in urllib based request function (used for pilot-PanDA server communications)
- Together with a token key, the primary OIDC token is used to download a shorter token, used in the later communications with the PanDA server
- The pilot refreshes the token immediately after launch; the original long-lasting token is overwritten
- The short-lived tokens are refreshed periodically (once every 60 minutes)
- Note: OIDC tokens are used by default if found locally, otherwise X509 is used - i.e. there is no corresponding pilot option to activate the mechanism
- Received SIGTERM signals on Kubernetes resources reported with new error code 1379, “Job was preempted”
- Requested by R. Walker
- Discussed in JIRA ticket ATLASPANDA-1065
- Added two error codes for arcproxy failures
- 1380: “General arcproxy failure” (was previously reported as 1008: “General pilot error, consult batch log”)
- 1381: “Arcproxy failure while loading shared libraries”
- Note: this (1381) is currently only used internally and does not lead to a failed job
- Remote file open container now using EL9 instead of CentOS7
- Required for latest ROOT release
- Requested by A. De Silva
- Skipping setting RUCIO_ACCOUNT for payload
- Requested by R. Walker
- A time-out was added to the gdb command execution (for producing a core dump file) when a looping job has been discovered
- Requested by R. Walker
- Real-time logging
- Now possible to specify real-time logging server (type, protocol, URL and port) via pilot argument
- Previously, it only worked via pilot config
- Requested by W. Guan
- Added Loki real-time logging module (Rubin)
- Real-time logging can now be activated for all jobs on a given queue (relevant for pilot logs, not payload stdout)
- Activation currently via PQ.catchall
- Streaming of pilot logs requested by I. Vukotic
- To be tested more widely
- New pilot option --noworkerpilotstatusupdate can be used to switch off worker pilot status updates
- Needed at NERSC
- Requested by T. Maeno
- Added timeout to urlopen() used for pilot-PanDA server communication
- The default timeout is too short and for getjob operations can lead to “jobdispatcher, 102: Sent job didn't receive reply from pilot within 30 min”-errors
- In case of failure, the pilot will currently fall back to curl-based communication
- Timeout is now explicitly set to 30 s
- Reported by Z. Yang (Rubin)
- Bug fix
- Patch for setting final job completion state before log stage-out had completed
- Leading to “ddm, 200: Could not get GUID/LFN/MD5/FSIZE/SURL from pilot XML”-error
- Reported by R. Walker, discussed in JIRA ticket ATLASPANDA-1047
- Housekeeping with pylint
- The average pylint score of all pilot modules is 9.56
Contributions from W. Guan, P. Nilsson
3.7.9.1
- Patch for unset resource types
- In case the resource type was not set as a pilot option (it usually is, but not always), the pilot refused to start and complained about it
- Affected TRIUMF and praguelcg2_Karolina_MCORE
- Reported by A. De Silva
3.7.8.21
- Remote file open
- Resolved a problem with executing the open remote file script which could abort prematurely
- Redirected output of container command to file (previously, the stdout was dumped to the pilot log which created unnecessary clutter).
- Now reporting lsetup time with job metrics
- Discussed in JIRA ticket ATLASPANDA-1001
- Containerized stage-in/out
- Now using EL9 container instead of outdated Rucio image
- Discussed in JIRA ticket ATLASPANDA-1008
- Resource types pilot argument
- Now supporting patterns in the resource types
- Discussed in JIRA ticket ATLASPANDA-1034
- Event service with Raythena
- Updated athenaopts for release 25
- TRF URL update for vo.darkside.org
- Housekeeping with pylint etc
- Bug fixes
- Wrong logging object used in real-time logging (reported by W. Guan)
- Exception handling for missing psutil python module (reported by R. Walker)
Code contributions from E. Karavakis, J. Esseiva, P. Nilsson
3.7.7.3
- In case of cvmfs failure, pilot now executes diagnostics commands (timeout=60s)
- cvmfs_config stat atlas.cern.ch
- attr -g revision $ATLAS_SW_BASE/atlas.cern.ch
- Requested by I. Glushkov
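Running diagnostic commands under a time limit can be sketched with `subprocess.run` and its `timeout` argument. The helper below is an assumption-laden illustration (hypothetical function name, simplified error reporting), not the pilot's command execution function.

```python
import os
import shlex
import subprocess


def run_diagnostic(command, timeout=60):
    """Run a command with a time limit; return (exit_code, output)."""
    # Expand environment variables such as $ATLAS_SW_BASE before splitting.
    argv = shlex.split(os.path.expandvars(command))
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    except FileNotFoundError:
        return None, "command not found"
    return result.returncode, result.stdout + result.stderr
```

With such a helper, the two commands listed above could be run as `run_diagnostic("cvmfs_config stat atlas.cern.ch")` and `run_diagnostic("attr -g revision $ATLAS_SW_BASE/atlas.cern.ch")`, each limited to the default 60 s.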
- Real-time logging
- Added support for Grafana Loki logging
- SPHENIX
- Fix for the URL from which the user analysis transform should be downloaded in DOMA k8s
- Housekeeping
- Additional modules were processed with pylint etc
Code contributions from Edward Karavakis, Wen Guan, Paul Nilsson