Release 3.7.3.84 · PanDAWMS/pilot3

Replaced all usages of curl with Python's native urllib
- Affects trace reports, PanDA server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), and the prmon dictionary upload to https://pilot.atlas-ml.org/
- This change aims both to reduce the number of non-responsive pilots (esp. following the trace service curl upload) and to resolve EL9 complications (Tokyo)
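As a rough illustration of the curl-to-urllib switch described above, the sketch below builds and sends a JSON POST with `urllib.request`. The helper names `build_json_post` and `send`, the payload, and the default timeout are hypothetical and not taken from the pilot code; only the standard-library calls are real.

```python
import json
import urllib.request


def build_json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request (hypothetical helper, not pilot code)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 20.0) -> str:
    """Send the request and return the response body as text.

    The explicit timeout (an assumed value) keeps a stalled server from
    blocking the caller indefinitely.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Passing a timeout to `urlopen` is one way an in-process client can avoid hanging on an unresponsive service, which is the kind of failure the change is meant to reduce.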
Corrected CPU model reporting in the cpuconsumptionunit string on ARM
- Previously, UNKNOWN was given as the CPU model name
- Note: cache size is not reported in this case as it is not available in user space
- Requested by I. Glushkov
Added GPU info from prmon to job metrics
- E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
- To be used by monitoring
- Requested by T. Korchuganova, T. Maeno
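For monitoring code that consumes these metrics, a minimal sketch of producing and parsing the space-separated `key=value` layout shown in the example follows. The function names are hypothetical, and the assumption that spaces in the GPU name are replaced with underscores is inferred from the example string, not confirmed by the pilot code.

```python
def format_gpu_metrics(gpu_name: str, n_gpu: int) -> str:
    """Format GPU info as space-separated key=value pairs (assumed layout)."""
    # Replace spaces so that the name remains a single token (assumption).
    safe_name = gpu_name.strip().replace(" ", "_")
    return f"GPU_name={safe_name} nGPU={n_gpu}"


def parse_metrics(metrics: str) -> dict:
    """Split a job metrics string into a dictionary of string values."""
    return dict(item.split("=", 1) for item in metrics.split())
```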
Now aborting the stage-in loop if the graceful stop bit has been set
- Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
- The graceful stop bit gets set e.g. when an unexpected exception is thrown
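The behaviour described above can be sketched with a `threading.Event` standing in for the graceful stop bit. The function `stage_in` and the `copy_file` callback are hypothetical names for illustration; the actual pilot implementation may differ.

```python
import threading


def stage_in(files, graceful_stop: threading.Event, copy_file):
    """Copy files one by one, stopping early if graceful_stop is set."""
    done = []
    for name in files:
        if graceful_stop.is_set():
            # Abort here instead of finishing the remaining transfers.
            break
        copy_file(name)
        done.append(name)
    return done
```

Checking the event at the top of each iteration means a stop request takes effect before the next transfer starts, while a transfer already in progress is allowed to finish.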
Added intersect value (the y-intercept) from the PSS+SWAP fit to job metrics
- Along with the slope, it allows for plotting the fit on the monitor side
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-907
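To show why both values are needed for plotting, here is a minimal sketch of a straight-line fit and its reconstruction from the two parameters. It assumes an ordinary least-squares fit of PSS+SWAP against time; the fitting method actually used by the pilot is not stated here, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intersect (assumed method)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intersect = mean_y - slope * mean_x
    return slope, intersect


def fitted_values(x, slope, intersect):
    """Reconstruct the fitted line, as a monitor could do from job metrics."""
    return [slope * xi + intersect for xi in x]
```

With only the slope, the line's vertical position is undetermined; the intersect fixes it, so the monitor can draw the fit without access to the raw memory samples.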
Now performing CVMFS checks at the beginning of the pilot
- Using the same list of files and directories as the wrapper
- Any failure leads to termination of the pilot with exit code 64
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
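A startup check of this kind might look like the sketch below. The function names are hypothetical, and the list of paths is left as a parameter because the wrapper's actual list is not given here; only the exit code 64 comes from the notes above.

```python
import os
import sys

CVMFS_EXIT_CODE = 64  # exit code stated in the release notes


def missing_paths(paths):
    """Return the subset of paths that do not exist."""
    return [p for p in paths if not os.path.exists(p)]


def check_cvmfs_or_exit(paths):
    """Terminate with CVMFS_EXIT_CODE if any required path is missing."""
    missing = missing_paths(paths)
    if missing:
        print(f"CVMFS check failed, missing: {missing}", file=sys.stderr)
        sys.exit(CVMFS_EXIT_CODE)
```

Running the check before any job is fetched means a node with a broken CVMFS mount fails fast rather than partway through a payload.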
ATLAS
- Active monitoring of remote file open verification script
  - Now able to abort, e.g. if lsetup is taking too long
Support for event service with AthenaMT
Housekeeping
- Additional pilot modules were processed with pylint etc.
Bug fixes
- Rubin
  - Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
- Corrected exception handling when socket.gethostbyaddr() is used
- Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
  - Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLDDMOPS-5685
- ATLAS
  - Correction for the case where the looping check had not run (and thus had not set any internal timings) before job suspension occurred
    - Led to problems with some BOINC jobs

Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
