3.7.3.84
- Replaced all usages of curl with Python's native urllib
- Affects trace reports, PanDA server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), as well as the prmon dictionary upload to https://pilot.atlas-ml.org/
- This change is important both for attempting to reduce the number of non-responsive pilots (esp. following trace service curl upload) and for resolving EL9 complications (Tokyo)
- Corrected CPU model reporting in the cpuconsumptionunit string on ARM
- Previously, UNKNOWN was given as the CPU model name
- Note: cache size is not reported in this case as it is not available in user space
- Requested by I. Glushkov
- Added GPU info from prmon to job metrics
- E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
- To be used by monitoring
- Requested by T. Korchuganova, T. Maeno
- Now aborting the stage-in loop if the graceful stop bit has been set
- Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
- The graceful stop bit is set, e.g., when an unexpected exception is thrown
- Added intersect value from PSS+SWAP fit to job metrics
- Along with the slope, it allows for plotting the fit on the monitor side
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-907
- Now performing CVMFS checks at the beginning of the pilot
- Using same list of files and directories as the wrapper
- Any failure leads to termination of the pilot with exit code 64
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-926
- ATLAS
- Active monitoring of remote file open verification script
- Now able to abort, e.g. if lsetup is taking too long
- Support for event service with AthenaMT
- Housekeeping
- Additional pilot modules were processed with pylint etc.
- Bug fixes
- Rubin
- Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
- Corrected exception handling when socket.gethostbyaddr() is used
- Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLDDMOPS-5685
- ATLAS
- Correction for the case where the looping check had not run (and thus had not set any internal timings) before job suspension occurred
- Led to problems with some BOINC jobs
- Rubin
Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
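
For readers unfamiliar with the curl-to-urllib change listed at the top of these notes, the following is a minimal sketch of a form-encoded POST done with urllib and a timeout, so that a stalled server cannot block the caller indefinitely. The function names `build_post_request` and `post`, the default timeout and any URL or payload passed in are hypothetical and are not taken from the pilot code.

```python
import urllib.parse
import urllib.request


def build_post_request(url, data):
    """Return a urllib Request that POSTs `data` as a URL-encoded form."""
    body = urllib.parse.urlencode(data).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def post(url, data, timeout=20):
    """Send the request; return the decoded response body, or None on failure."""
    request = build_post_request(url, data)
    try:
        # the timeout (in seconds) bounds how long a blocking socket operation may take
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print(f"request failed: {exc}")
        return None
```

Keeping the request construction separate from the network call makes it possible to inspect the method, headers and body without contacting a server.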
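
The stage-in change above (leaving the loop once the graceful stop bit is set) can be sketched as follows. This assumes the stop bit behaves like a `threading.Event`; the names `stage_in`, `copy_file` and `fake_copy` and the file names are hypothetical, and the real pilot logic may differ.

```python
import threading


def stage_in(files, graceful_stop, copy_file):
    """Copy files one by one, checking the stop flag before each transfer."""
    done = []
    for name in files:
        if graceful_stop.is_set():
            # a stop was requested elsewhere in the process: leave the loop early
            break
        copy_file(name)
        done.append(name)
    return done


graceful_stop = threading.Event()


def fake_copy(name):
    """Stand-in for a real transfer; simulates a stop request during the second file."""
    if name == "b.root":
        graceful_stop.set()


result = stage_in(["a.root", "b.root", "c.root"], graceful_stop, fake_copy)
print(result)  # ['a.root', 'b.root']
```

The transfer that is already in progress is allowed to finish; only the remaining files are skipped.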
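
To illustrate why the intersect value is useful next to the slope in the PSS+SWAP item above, here is a small sketch that fits a straight line to memory samples and then rebuilds the fitted line from the two numbers alone, which is all a monitor would need for plotting. It assumes an ordinary least-squares fit; the function name `linear_fit`, the sample values and their units are made up for illustration and do not come from the pilot.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# hypothetical samples: time in seconds, PSS+SWAP in arbitrary memory units
times = [0, 60, 120, 180]
memory = [1000, 1120, 1240, 1360]

slope, intercept = linear_fit(times, memory)
print(slope, intercept)  # 2.0 1000.0

# with both values available, the fitted line can be redrawn at any time points
fitted = [intercept + slope * t for t in times]
```

With only the slope, the line's steepness is known but not its vertical position, so it could not be overlaid on the measured points.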