3.9.2.41 #154

Merged
merged 68 commits into from
Dec 5, 2024
PalNilsson
Collaborator

  • Added new pilot option -w / --notokenrenewal to turn off OIDC token renewals
    • Needed on Perlmutter and requested by D. Benjamin
    • Alternative to using the PQ.catchall setting introduced in the previous pilot version
  • The pilot now checks at startup whether the VOMS proxy has less than 72 h left; if so, it fails with the new wrapper error code 80
    • Harvester has also been updated for this change
    • Requested by R. Walker
    • Discussed in JIRA ticket ATLASPANDA-1156
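The startup check described above can be sketched as follows. This is a minimal illustration, not the pilot's actual code: the function and constant names are hypothetical, and the remaining proxy lifetime is taken as an input because the notes do not say how the pilot obtains it. Only the 72 h threshold and error code 80 come from the notes.

```python
# Hypothetical names; only the 72 h threshold and code 80 are from the release notes.
MIN_PROXY_LIFETIME_S = 72 * 3600   # proxy must have at least 72 h left
PROXY_TOO_SHORT_EXIT_CODE = 80     # new wrapper error code


def proxy_exit_code(seconds_left: int) -> int:
    """Return 0 if the proxy has enough lifetime left, otherwise 80."""
    if seconds_left < MIN_PROXY_LIFETIME_S:
        return PROXY_TOO_SHORT_EXIT_CODE
    return 0
```

A wrapper could pass the returned value straight to `sys.exit()` when it is non-zero, so that Harvester sees the failure before any job is fetched.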
  • Regarding lingering payload processes leading to excessively high reported CPU efficiency
    • Discussed in JIRA tickets ADCMONITOR-551 and ATLASPANDA-1204
    • The pilot identifies any processes still lingering after the payload has finished and kills them. These are unaccounted processes that are children of the pilot, not necessarily of the payload: the payload forks processes, and these are inherited by the pilot if the payload is killed by the OS
  • Dealing with unreasonably large CPU consumption times
    • CPU consumption times are now only measured once the payload has been running for ten seconds
    • The pilot tries to sum up the CPU consumption time using /proc/$PID/stat
    • A returned CPU consumption time is compared with the previous one. If the quotient is larger than 5 (under normal CPU load) or 10 (under high CPU load), the result is ignored
    • A high CPU load is defined as 80% load, measured over 0.5 s
    • Before the resulting CPU consumption time is stored, the pilot verifies that /proc/self/statm still exists, to make sure there are no problems with /proc itself. If the file no longer exists, the value is discarded
    • Limits might be modified in later pilot versions
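The plausibility checks listed above can be sketched as below. This is an illustration under stated assumptions rather than the pilot's implementation: all function and constant names are hypothetical, treating exactly 80% as "high load" is an assumption, and accepting the first measurement when there is no previous value is an assumption. The limits 5 and 10 and the /proc paths come from the notes.

```python
import os

# Limits from the release notes; names are hypothetical.
NORMAL_LOAD_MAX_RATIO = 5.0
HIGH_LOAD_MAX_RATIO = 10.0
HIGH_LOAD_THRESHOLD = 0.80  # assumption: load >= 80% counts as high


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def is_plausible(new_time: float, previous_time: float, cpu_load: float) -> bool:
    """Accept the new CPU time unless it jumped by more than the allowed ratio."""
    if previous_time <= 0:
        return True  # assumption: nothing to compare against yet
    if cpu_load >= HIGH_LOAD_THRESHOLD:
        limit = HIGH_LOAD_MAX_RATIO
    else:
        limit = NORMAL_LOAD_MAX_RATIO
    return new_time / previous_time <= limit


def proc_is_healthy(path: str = "/proc/self/statm") -> bool:
    """Return True if the /proc sanity file still exists."""
    return os.path.exists(path)
```

A caller would store a new value only when both `is_plausible(...)` and `proc_is_healthy()` return True, and otherwise keep the previous measurement.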
  • Further improvements for Karolina HPC
    • Switching to IPv4 when using urllib resolved most of the problems with cut job definitions, but not all.
    • Curl and IPv4 can be forced using catchall (“curlgetjob”)
    • Additional updates are foreseen
  • Updates for alternative stage-out algorithm
    • For unified queues, the pilot chooses the proper destination by considering the write_lan_analysis and write_lan activities for analysis jobs, or simply write_lan for production jobs
    • job.nucleus is excluded as a possible alt-stageout destination
    • Details in pilot repo issue 152
    • Discussed in JIRA ticket ATLASPANDA-994
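The selection rule for unified queues can be illustrated roughly as follows. The function names are hypothetical, the order of the two analysis activities is an assumption, and the actual algorithm is described in pilot repo issue 152. Only the activity names and the exclusion of job.nucleus come from the notes.

```python
from typing import List


def stageout_activities(is_analysis_job: bool) -> List[str]:
    """Return the activities to consider for alternative stage-out."""
    if is_analysis_job:
        # Assumption: the analysis-specific activity is considered first.
        return ["write_lan_analysis", "write_lan"]
    return ["write_lan"]


def candidate_destinations(destinations: List[str], nucleus: str) -> List[str]:
    """Drop the job's nucleus from the possible alt-stageout destinations."""
    return [dest for dest in destinations if dest != nucleus]
```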
  • Pilot timing and remote i/o
    • Until now, the time it takes for the remote i/o file verification to finish has been a part of the setup time. Now, it is instead added to the stage-in time
    • Improved the error message displayed on the monitor job page for remote i/o failures
  • Internal improvements
    • Added thread synchronization in the command execution function to get rid of the annoying (but harmless) stderr message “Poll: bad file descriptor reading from request pipe”
    • Improved exception handling for socket related errors (specifically for initializing the trace report)
      • Requested by Z. Yang (BNL/Rubin)
    • Improved dmesg handling (verifying that the found memory error belongs to a process that is a known child of the payload)
    • A few more modules were processed with pylint
  • Bug fixes
    • Time-outs for remote i/o verification did not work as expected; this is now corrected
    • Corrected core number reporting in cpuconsumptionunit CPU info string

Contributions from A. Anisenkov, P. Nilsson

@PalNilsson PalNilsson merged commit 95f7af1 into master Dec 5, 2024
9 checks passed