3.7.3.84 #121

Merged
merged 73 commits into from
Apr 30, 2024


PalNilsson
Collaborator

  • Replaced all usages of curl with Python's native urllib
    • This covers trace reports, PanDA server interactions (getJob, updateJob, getProxy, queuedata and transform downloads), and the prmon dictionary upload to https://pilot.atlas-ml.org/
    • This change is important both for reducing the number of non-responsive pilots (especially following the trace service curl upload) and for resolving EL9 complications (Tokyo)
  • Corrected CPU model reporting in the cpuconsumptionunit string on ARM
    • Previously, UNKNOWN was given as the CPU model name
    • Note: cache size is not reported in this case as it is not available in user space
    • Requested by I. Glushkov
  • Added GPU info from prmon to job metrics
    • E.g. “GPU_name=NVIDIA_A100-SXM4-40G nGPU=1”
    • To be used by monitoring
    • Requested by T. Korchuganova, T. Maeno
  • Now aborting the stage-in loop if the graceful stop bit has been set
    • Previously, the stage-in thread would keep running until finished, which unnecessarily delayed the termination of the pilot
    • The graceful stop bit gets set, e.g., when an unexpected exception is thrown
  • Added intersect value from PSS+SWAP fit to job metrics
  • Now performing CVMFS checks at the beginning of the pilot
  • ATLAS
    • Active monitoring of remote file open verification script
      • Now able to abort, e.g., if lsetup is taking too long
  • Support for event service with AthenaMT
  • Housekeeping
    • Additional pilot modules were processed with pylint etc.
  • Bug fixes
    • Rubin
      • Removed memory monitoring files from looping job algorithms (problem reported by W. Guan)
    • Corrected exception handling when socket.gethostbyaddr() is used
    • Corrected the replica sorting algorithm, which did not sort replicas according to read_lan
    • ATLAS
      • Correction for the case where the looping check had not run (and thus had not set any internal timings) before job suspension occurred
        • This led to problems with some BOINC jobs
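
The curl-to-urllib replacement listed above can be illustrated with a minimal sketch. This is not the Pilot's actual code: the helper names `build_post_request` and `send`, the placeholder URL, the payload fields, and the timeout value are all assumptions made for illustration. The point it shows is that `urllib.request.urlopen` accepts an explicit timeout, so a stalled upload fails with an exception instead of leaving the process waiting.

```python
import urllib.parse
import urllib.request


def build_post_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a form-encoded POST request (hypothetical helper, not Pilot code)."""
    data = urllib.parse.urlencode(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send(request: urllib.request.Request, timeout: float = 20.0):
    """Send the request; return the response body, or None on failure."""
    try:
        # The timeout bounds the blocking network operations of this call.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


# Placeholder URL and fields, for illustration only.
request = build_post_request(
    "https://example.invalid/update", {"jobId": "1", "state": "running"}
)
```

A caller would pass `request` to `send()` and treat a `None` result as a failed upload to retry or report.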

Code contributions from P. Nilsson, O. Freyermuth, J. Esseiva
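
The stage-in abort described in the notes above can be sketched as follows, assuming the graceful stop bit is modeled as a `threading.Event`. The function `stage_in` and the `transfer` callback are hypothetical names for illustration, not the Pilot's API. The loop checks the stop bit before each transfer instead of running through all remaining files.

```python
import threading
from typing import Callable, List


def stage_in(
    files: List[str],
    graceful_stop: threading.Event,
    transfer: Callable[[str], None],
) -> List[str]:
    """Transfer files one by one, stopping early once graceful_stop is set."""
    transferred = []
    for name in files:
        if graceful_stop.is_set():
            # Abort the loop rather than finishing the remaining transfers.
            break
        transfer(name)
        transferred.append(name)
    return transferred
```

Checking before each file means a transfer already in progress still completes; only the remaining ones are skipped.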

PalNilsson and others added 30 commits March 7, 2024 08:38
socket.gethostbyaddr() throws a socket.gaierror exception
if the address cannot be resolved, which was not caught before.

Fixes regression introduced by e7d7c6f, after which
Pilot yielded stage-in and stage-out errors with backtrace
for sites using a non-DNS resolvable PANDA_HOSTNAME
(VMs, long-lived containers, ...).
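
A minimal sketch of the kind of handling this commit message describes, assuming a hypothetical helper `resolve_hostname` and a hypothetical fallback value (neither is taken from the Pilot source). It also catches `socket.herror`, which the standard library uses for other address-related lookup failures.

```python
import socket


def resolve_hostname(address: str, fallback: str = "unknown") -> str:
    """Return the primary host name for address, or fallback if lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
    except (socket.gaierror, socket.herror):
        # The address is not resolvable (e.g. a VM or container without a
        # DNS entry); fall back instead of propagating a backtrace.
        return fallback
    return hostname
```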
tracereport: also catch gaierror for gethostbyaddr()
Support event service for AthenaMT jobs
# Conflicts:
#	PILOTVERSION
#	pilot/util/constants.py
fixup: mtes arg typo and ES file path
@PalNilsson PalNilsson merged commit 4b75cba into master Apr 30, 2024
12 checks passed
3 participants