Releases · PanDAWMS/pilot3

18 Dec 10:23

PalNilsson

3.9.4.15

9e76a9f

Latest

Replaced a problematic command execution function with a simpler and improved version
- The previous version caused the output of some commands, including df and find, to be partially missed due to complications with a threaded readout function. This in turn affected the looping job algorithm and led to job failures, at least at BNL, during last weekend
Internal improvements
- Now handling empty output/file names from find command used by looping job algorithm
  - Harmless though, as it would anyway be filtered out later
Bug fix
- Added the missing f prefix to an f-string, which previously led to an uninformative log message for remaining disk space
- Reported by R. Walker
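
The effect of a missing f prefix can be shown with a minimal sketch. The variable name and message text are assumptions made for illustration and are not the pilot's actual log line.

```python
# Hypothetical value; the real pilot computes the remaining disk space itself.
remaining_gb = 12.5

# Without the f prefix the braces are kept literally, so the message is uninformative.
broken = "remaining disk space: {remaining_gb} GB"

# With the f prefix the expression inside the braces is evaluated.
fixed = f"remaining disk space: {remaining_gb} GB"

print(broken)
print(fixed)
```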


11 Dec 13:43

PalNilsson

3.9.3.2

fd44667


Dealing with unreasonably large CPU consumption times
- Introduced a sleep in the threaded readout function used by the main command execution function, as well as a non-blocking queue put of read messages, to prevent this function from using too much CPU
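
A readout thread with a short sleep and a non-blocking queue put could look roughly like the sketch below. The function names, the pause length, the queue size and the choice to drop lines when the queue is full are all assumptions for illustration, not the pilot's actual implementation.

```python
import queue
import subprocess
import sys
import threading
import time


def read_output(stream, out_queue, pause=0.01):
    """Read lines from stream and hand them over without ever blocking on the queue."""
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream: the command has closed its stdout
        try:
            out_queue.put_nowait(line)  # non-blocking put
        except queue.Full:
            pass  # assumption: drop the line rather than stall the reader
        time.sleep(pause)  # yield the CPU between reads


def run(cmd):
    """Run cmd and return (exit code, collected stdout)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    out_queue = queue.Queue(maxsize=10000)
    reader = threading.Thread(target=read_output, args=(proc.stdout, out_queue))
    reader.start()
    proc.wait()
    reader.join()
    proc.stdout.close()
    lines = []
    while not out_queue.empty():
        lines.append(out_queue.get_nowait())
    return proc.returncode, "".join(lines)


if __name__ == "__main__":
    print(run([sys.executable, "-c", "print('hello')"]))
```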
Other internal improvements
- Pausing attempts to write to OOM score adj file since it is generating too many log errors
  - Requested by F. Luehring
- Removed a useless log message that was repeated many times
  - Requested by F. Luehring
- Fixed a potential problem with stdout/stderr file object when there is no more space on the device (file objects are now only closed if they still exist)
  - This would otherwise lead to lost heartbeat
- Fixed a blocking problem when collecting zombies
  - Before, the pilot would wait until the process went away, which could block for a long time
  - Requested by Z. Yang
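
One common way to collect a zombie without blocking is os.waitpid() with the os.WNOHANG flag (Unix only). The sketch below illustrates that technique; the helper name is hypothetical and the pilot's own code may differ.

```python
import os
import subprocess
import sys
import time


def reap_if_finished(pid):
    """Return True if the child was reaped (or is already gone), False if it is still running."""
    try:
        waited_pid, _status = os.waitpid(pid, os.WNOHANG)  # returns at once
    except ChildProcessError:
        return True  # not our child, or already reaped elsewhere
    return waited_pid == pid  # (0, 0) is returned while the child is still alive


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    while not reap_if_finished(child.pid):
        time.sleep(0.05)  # the caller stays free to do other work between polls
    print("child collected")
```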


05 Dec 09:36

PalNilsson

3.9.2.41

95f7af1


Added new pilot option -w / --notokenrenewal to turn off OIDC token renewals
- Needed on Perlmutter and requested by D. Benjamin
- Alternative to using the PQ.catchall setting introduced in the previous pilot version
At startup, the pilot now checks whether the VOMS proxy has less than 72 h left; if so, it fails with the new wrapper error code 80
- Harvester has also been updated for this change
- Requested by R. Walker
- Discussed in JIRA ticket ATLASPANDA-1156
Regarding lingering payload processes leading to too high CPU efficiency
- Discussed in JIRA tickets ADCMONITOR-551 and ATLASPANDA-1204
- Pilot identifies any lingering processes after the payload has finished and kills them. These are unaccounted processes that are children of the pilot but not necessarily of the payload: the payload forks processes, and these are inherited by the pilot if the payload is killed by the OS
Dealing with unreasonably large CPU consumption times
- CPU consumption times are now only measured after ten seconds of payload running
- The pilot tries to sum up the CPU consumption time using /proc/$PID/stat
- A returned CPU consumption time is compared with the previous one; if the quotient is larger than 5 (under normal CPU load) or 10 (under high CPU load), the result is ignored
- A high CPU load is defined as 80% load, as measured during 0.5 s
- Before the resulting CPU consumption time is stored, the pilot verifies that /proc/self/statm still exists, to make sure there are no problems with /proc itself; if the file no longer exists, the result is discarded
- Limits might be modified in later pilot versions
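
The checks above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and only the limits (5 and 10) and the /proc paths are taken from the notes. The field positions follow the proc(5) layout of /proc/[pid]/stat.

```python
import os


def read_cpu_time(pid):
    """Return utime + stime in seconds from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", encoding="utf-8") as stat_file:
            data = stat_file.read()
    except OSError:
        return None
    # The command name (field 2) is in parentheses and may contain spaces,
    # so only the part after the last ')' is split.
    fields = data.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def accept_cpu_time(new, previous, high_load=False):
    """Reject a value whose quotient to the previous one exceeds the limit."""
    limit = 10 if high_load else 5
    if previous <= 0:
        return True  # nothing to compare with yet
    return new / previous <= limit


def proc_is_available():
    """Sanity check on /proc itself before a value is stored."""
    return os.path.exists("/proc/self/statm")
```

For example, with a previous value of 10 s, a new value of 60 s would be ignored under normal load but accepted under high load.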
Further improvements for Karolina HPC
- Switching to IPv4 when using urllib resolved most of the problems with cut job definitions, but not all.
- Force curl and IPv4 using catchall (“curlgetjob”)
- Additional updates are foreseen
Updates for alternative stage-out algorithm
- For unified queues, the pilot chooses the proper destination by considering the write_lan_analysis and write_lan activities for analysis jobs, or only write_lan for production jobs
- job.nucleus is excluded as a possible alt-stageout destination
- Details in pilot repo issue 152
- Discussed in JIRA ticket ATLASPANDA-994
Pilot timing and remote i/o
- Until now, the time it takes for the remote i/o file verification to finish has been a part of the setup time. Now, it is instead added to the stage-in time
  - Requested by R. Walker
  - The corresponding pilot wiki page has been updated: https://github.com/PanDAWMS/pilot3/wiki/Timing-Measurements
  - JIRA ticket ATLASPANDA-1221
- Improved the error message displayed on the monitor job page for remote i/o failures
Internal improvements
- Added thread synchronization in command execution function to get rid of annoying (but harmless) stderr “Poll: bad file descriptor reading from request pipe”
- Improved exception handling for socket related errors (specifically for initializing the trace report)
  - Requested by Z. Yang (BNL/Rubin)
- Improved dmesg handling (verifying that the found memory error belongs to a process that is a known child of the payload)
  - Requested by F. Luehring
  - JIRA ticket ATLASPANDA-1214
- A few more modules were processed with pylint
Bug fixes
- Time-outs for remote i/o verification did not work as expected, now corrected
  - Fix confirmed e.g. in job 6425468476
  - JIRA ticket ATLASPANDA-1210
- Corrected core number reporting in cpuconsumptionunit CPU info string

Contributions from A. Anisenkov, P. Nilsson


17 Oct 12:28

PalNilsson

3.9.1.13

41ed500


Internal improvements
- Now using python native versions instead of executing external commands for
  - grep - in the case of AVX2 checks
  - uuidgen - used for trace report
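
Python-native replacements for these two commands might look like the sketch below. It assumes that a random (version 4) UUID is acceptable for the trace report and that the AVX2 check inspects the flags line of /proc/cpuinfo; the function names are hypothetical.

```python
import uuid


def new_uuid():
    """Replacement for calling the external uuidgen command."""
    return str(uuid.uuid4())


def has_avx2(cpuinfo_text):
    """Replacement for grepping /proc/cpuinfo: look for avx2 among the CPU flags."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx2" in line.partition(":")[2].split()
    return False


if __name__ == "__main__":
    print(new_uuid())
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as cpuinfo:
            print(has_avx2(cpuinfo.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```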
Bug fixes
- Improved exception handling in thread reading stdout in function for executing commands
  - Can otherwise lead to harmless “ValueError: I/O operation on closed file” error (log message only)
- Fixed problem with parsing stdout from arcproxy leading to “Certificate has expired” failures
  - Discussed in JIRA ticket ATLASPANDA-1157
  - Reported by I. Glushkov
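
The “I/O operation on closed file” case can be reproduced with any Python file object that is closed while a reader still holds it. The sketch below shows one way to tolerate it; the function name is hypothetical and the pilot's actual handling may differ.

```python
import io


def read_lines_safely(stream):
    """Collect lines until end of stream, or until the stream is closed under us."""
    lines = []
    while True:
        try:
            line = stream.readline()
        except ValueError:
            break  # raised by readline() when the stream has been closed
        if not line:
            break  # normal end of stream
        lines.append(line)
    return lines


if __name__ == "__main__":
    stream = io.StringIO("first\nsecond\n")
    print(read_lines_safely(stream))
    stream.close()
    print(read_lines_safely(stream))  # no exception, just an empty list
```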


15 Oct 08:42

PalNilsson

3.9.0.17

de0e57c


Time-out in url open function now configurable
- Before, a time-out of 30 s was used, but it was seen at Rubin that this might not be enough
- The time-out is now 120 s, but can be changed in the pilot config file
- Requested by W. Guan
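
A configurable time-out with a 120 s default could be wired up roughly as below. The section name "Pilot" and the option name "url_timeout" are invented for this sketch and are not the pilot's real config keys.

```python
import configparser
import urllib.request

DEFAULT_TIMEOUT = 120  # seconds, the new default named in these notes


def get_url_timeout(config_text):
    """Read the time-out from config text, falling back to the default."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Hypothetical section and option names.
    return parser.getint("Pilot", "url_timeout", fallback=DEFAULT_TIMEOUT)


def open_url(url, config_text=""):
    """Open url with the configured time-out (in seconds)."""
    return urllib.request.urlopen(url, timeout=get_url_timeout(config_text))
```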
Update to main command execution function which now uses threads to handle stdout/stderr from command
- It is suspected that too much output can cause buffer overflow, which theoretically could hang the pilot python process
Using a stack instead of recursion in function for finding processes that belong to a group
- Problem seen on UNI-SIEGEN-HEP with a large number of processes, leading to max recursion depth
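
Replacing recursion with an explicit stack can be sketched as follows. The example assumes that related processes are found by following parent-to-child links held in a plain dictionary; the pilot's real data structures are not shown in these notes.

```python
def find_descendants(root_pid, children_of):
    """Return all descendants of root_pid, using a stack instead of recursion.

    children_of maps a pid to the list of its direct child pids.
    """
    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        for child in children_of.get(pid, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


if __name__ == "__main__":
    # A chain far deeper than Python's default recursion limit.
    deep_chain = {pid: [pid + 1] for pid in range(100000)}
    print(len(find_descendants(0, deep_chain)))
```

Because the pending work lives in a list rather than on the call stack, the depth of the process tree no longer matters.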
OIDC tokens on HPCs
- Pilot can now skip token renewal if keyword NO_TOKEN_RENEWAL is present in PQ.catchall
- Renewal mechanism on HPCs is done in Harvester
Bug fixes
- Prevented the threaded heartbeat function from sending anything but the “running” state
  - This fixes a rare case when the “finished” state was sent before the log transfer had finished
  - Reported by Z. Yang (Rubin)
- Now sending ddmendpoint info with field ‘endpoint’ instead of ‘ddmendpoint’
  - See discussion in JIRA ticket about alternative stage-out: ATLASPANDA-994
- Fixed a harmless problem in the server update function when debug mode is switched on by the pilot
  - Unexpected server response: “Succeeded” instead of “StatusCode=0”, where the former could not be parsed by the parse_qs() urllib function
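
The difference between the two responses is easy to see with urllib.parse.parse_qs(), which only returns key=value pairs. The helper below is a hypothetical sketch of how a caller might tolerate the unexpected form.

```python
from urllib.parse import parse_qs


def get_status_code(response_text):
    """Return the StatusCode as an int, or None if the response has no such field."""
    values = parse_qs(response_text).get("StatusCode")
    if not values:
        return None  # e.g. a bare "Succeeded" carries no key=value pair
    return int(values[0])


if __name__ == "__main__":
    print(parse_qs("StatusCode=0"))
    print(parse_qs("Succeeded"))
    print(get_status_code("Succeeded"))
```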


24 Sep 09:06

PalNilsson

3.8.2.8

128ebe4


Alternative stage-out
- Additional stage-out attempt for failed transfers (data and log files) to different storage if configured in astorages
- Being discussed in JIRA ticket ATLASPANDA-994 “Failover stage-out to write_lan/1 RSE”
- Pull request #142
Improved IPv6 address extraction
- Problem with pattern recognition seen with Alma9 at Wuppertal
- Reported by T. Harenberg
Replaced remaining ps command usage with a call to the more efficient psutil python module
- A problem during high CPU load was seen on an ARM resource (with 256 cores) due to too many concurrent ps processes (KIT)
- Reported by M. Schnepf
Now possible to set real-time logging server via PQ.catchall
- Requested by I. Vukotic
Bug fixes
- Fixed typo in infosys initialization (harmless)
  - Pull request #143
- Improved exception handling during randomization of panda server address
  - Now also catching socket.gaierror (“Name or service not known”)
  - Could otherwise lead to lost heartbeat, although problem only seen in jobs that had already failed with sending job updates to server
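
Catching socket.gaierror around a name lookup might look like the sketch below. The function name, the default port and the fallback of returning the unresolved host name are assumptions for illustration; the resolver argument exists only so the behaviour can be exercised without a network.

```python
import random
import socket


def pick_server_address(hostname, port=443, resolver=socket.getaddrinfo):
    """Pick one of the addresses behind hostname at random.

    Falls back to the host name itself if it cannot be resolved.
    """
    try:
        infos = resolver(hostname, port)
    except socket.gaierror:
        return hostname  # e.g. "Name or service not known"
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses) if addresses else hostname
```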

Contributions from A. Anisenkov, P. Nilsson


09 Sep 12:33

PalNilsson

3.8.1.66

4a45bc2


Added default path for ifconfig command (used to look up IPv6 info) if command not found
Support for OIDC tokens in urllib based request function (used for pilot-PanDA server communications)
- Together with a token key, the primary OIDC token is used to download a shorter token, used in the later communications with the PanDA server
- The pilot refreshes the token immediately after launch, and the original long-lasting token is overwritten
- The short lasting tokens are refreshed periodically (once every 60 minutes)
- Note: OIDC tokens are used by default if found locally, otherwise X509 is used - i.e. there is no corresponding pilot option to activate the mechanism
Received SIGTERM signals on Kubernetes resources reported with new error code 1379, “Job was preempted”
- Requested by R. Walker
- Discussed in JIRA ticket ATLASPANDA-1065
Added two error codes for arcproxy failures
- 1380: “General arcproxy failure” (was previously reported as 1008: “General pilot error, consult batch log”)
- 1381: “Arcproxy failure while loading shared libraries”
  - Note: this (1381) is currently only used internally and does not lead to a failed job
Remote file open container now using EL9 instead of CentOS7
- Required for latest ROOT release
- Requested by A. De Silva
Skipping setting RUCIO_ACCOUNT for payload
- Requested by R. Walker
A time-out was added to the gdb command execution (for producing a core dump file) when a looping job has been discovered
- Requested by R. Walker
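
In Python, a time-out on an external command is commonly applied with subprocess.run(). The sketch below shows that general technique with a hypothetical helper; it is not the pilot's own command execution function, and the gdb invocation itself is not shown.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run cmd and return (exit code, stdout), or (None, "") if it timed out."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing is left behind.
        return None, ""
    return result.returncode, result.stdout


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout=0.5))
```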
Real-time logging
- Now possible to specify real-time logging server (type, protocol, URL and port) via pilot argument
  - Previously, it only worked via pilot config
  - Requested by W. Guan
- Added Loki real-time logging module (Rubin)
- Real-time logging can now be activated for all jobs on a given queue (relevant for pilot logs, not payload stdout)
  - Activation currently via PQ.catchall
  - Streaming of pilot logs requested by I. Vukotic
  - To be tested more widely
New pilot option --noworkerpilotstatusupdate can be used to switch off worker pilot status updates
- Needed at NERSC
- Requested by T. Maeno
Added timeout to urlopen() used for pilot-PanDA server communication
- The default timeout is too short and for getjob operations can lead to “jobdispatcher, 102: Sent job didn't receive reply from pilot within 30 min”-errors
- In case of failure, the pilot will currently fall back to curl-based communication
- Timeout is now explicitly set to 30 s
- Reported by Z. Yang (Rubin)
Bug fix
- Patch for setting final job completion state before log stage-out had completed
  - Leading to “ddm, 200: Could not get GUID/LFN/MD5/FSIZE/SURL from pilot XML”-error
  - Reported by R. Walker, discussed in JIRA ticket ATLASPANDA-1047
Housekeeping with pylint
- The average pylint score of all pilot modules is 9.56

Contributions from W. Guan, P. Nilsson


10 Jul 05:17

PalNilsson

3.7.9.1

568c7fc


Patch for unset resource types
- In case the resource type was not set as a pilot option (it usually is, but not always), the pilot refused to start and complained about it
- Affected TRIUMF and praguelcg2_Karolina_MCORE
- Reported by A. De Silva


03 Jul 10:05

PalNilsson

3.7.8.21

2b450bf


Remote file open
- Resolved a problem with executing the open remote file script which could abort prematurely
- Redirected output of container command to file (previously, the stdout was dumped to the pilot log, which created unnecessary clutter)
- Now reporting lsetup time with job metrics
- Discussed in JIRA ticket ATLASPANDA-1001
Containerized stage-in/out
- Now using EL9 container instead of outdated Rucio image
- Discussed in JIRA ticket ATLASPANDA-1008
Resource types pilot argument
- Now supporting patterns in the resource types
- Discussed in JIRA ticket ATLASPANDA-1034
Event service with Raythena
- Updated athenaopts for release 25
TRF URL update for vo.darkside.org
Housekeeping with pylint etc
Bug fixes
- Wrong logging object used in real-time logging (reported by W. Guan)
- Exception handling for missing psutil python module (reported by R. Walker)

Code contributions from E. Karavakis, J. Esseiva, P. Nilsson


12 Jun 09:55

PalNilsson

3.7.7.3

ed21469


In case of cvmfs failure, pilot now executes diagnostics commands (timeout=60s)
- cvmfs_config stat atlas.cern.ch
- attr -g revision $ATLAS_SW_BASE/atlas.cern.ch
- Requested by I. Glushkov
Real-time logging
- Added support for Grafana Loki logging
SPHENIX
- Fix for the URL from which the user analysis transform should be downloaded in DOMA k8s
Housekeeping
- Additional modules were processed with pylint etc

Code contributions from Edward Karavakis, Wen Guan, Paul Nilsson
