3.6.0.108
- Pilot now executes the script cpu_arch.py (option --alg gcc), which reports the CPU architecture
- Requested by A. Serhan Mete
- For details, see https://its.cern.ch/jira/browse/ATLINFR-4844
- For ATLAS, pilot uses lsetup cpu_flags; for non-ATLAS experiments, it uses the internal cpu_arch script
- Resilience against slow networks
- Problem seen with a Rubin job that had severe network issues: a jobUpdate was very slow to finish, and by the time it did, the actual job had already finished, which led to problems with a secondary job
- Pilot now verifies that the job workdir still exists when it receives a ‘tobekilled’ instruction, in order to prevent a total abort; ‘tobekilled’ also no longer causes the pilot to end
- Zipping all oversized files (typically payload stdout or other log files created by the payload)
- Previously, pilot deleted these files
- The size of the archive is also checked, and the archive is deleted if it is too big
- Requested by R. Walker
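The zip-then-check behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the pilot's actual code: the function name `archive_oversized_file`, both size parameters, and the decision to remove the original file once it has been archived are assumptions.

```python
import os
import zipfile


def archive_oversized_file(path, max_file_size, max_archive_size):
    """Zip `path` if it is larger than max_file_size (bytes).

    Returns the archive path, or None if the file was within limits
    or the archive itself was too big and had to be deleted.
    All names and limits here are hypothetical.
    """
    if os.path.getsize(path) <= max_file_size:
        return None  # file is within limits, leave it untouched

    archive = path + ".zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(path, arcname=os.path.basename(path))

    # Assumption: the oversized original is replaced by its archive
    os.remove(path)

    # The archive is checked as well and deleted if it is still too big
    if os.path.getsize(archive) > max_archive_size:
        os.remove(archive)
        return None
    return archive
```

Text logs such as payload stdout usually compress well, so archiving keeps most of them available for debugging where deletion would have lost them.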
- Immediate server update after batch kill
- Requested by R. Walker
- Use job.maxwalltime if available instead of PQ.maxwalltime (push queues only)
- Requested by R. Walker
- Redirecting stdout/stderr from remote file open command to files to avoid lost output in case of time-out exception
- Requested by R. Walker
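As an illustration of why redirecting to files helps, the sketch below runs a command with stdout/stderr attached to files, so that anything written before a time-out survives on disk. The helper name `run_redirected` and its parameters are assumptions, and the pilot's own implementation may differ.

```python
import subprocess


def run_redirected(cmd, stdout_path, stderr_path, timeout):
    """Run cmd with stdout/stderr written straight to files.

    Returns the exit code, or None if the command timed out.
    Output produced before the time-out stays in the files.
    """
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # stop the command that ran out of time
            proc.wait()   # reap it so no zombie is left behind
            return None
```

With pipes, output still buffered in the pipe can be lost when the time-out exception is raised; with files, the partial output can be read back afterwards.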
- Pilot now sleeps two minutes (configurable) between PanDA server updates in case of trouble
- Requested by W. Guan
- HTCondor environmental variable
- Now setting new env var HTCondor_JOB_ID for debugging purposes with the following format:
<PanDA ID>:<processing type>:<cluster ID>.<process ID>_<schedd name code>
- Due to a max allowed length of 31 chars, the cluster ID and process ID are converted to hex
- Pilot enforces the max length
- Lustre has the ability to tag the JobID for monitoring purposes. The new variable is defined before any Lustre activity starts
- Requested by D. Benjamin for sPHENIX, but could be useful on all HTCondor systems
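A minimal sketch of how such a value could be assembled is shown below. The function name and arguments are hypothetical, the schedd name code is taken as a ready-made input, and truncation is only an assumed way of enforcing the 31-character limit; the pilot may enforce it differently.

```python
MAX_LENGTH = 31  # maximum length quoted in the release note


def build_htcondor_job_id(panda_id, processing_type, cluster_id, process_id, schedd_code):
    """Build <PanDA ID>:<processing type>:<cluster ID>.<process ID>_<schedd name code>."""
    # Cluster and process IDs are written in hex to save characters
    cluster_hex = format(int(cluster_id), "x")
    process_hex = format(int(process_id), "x")
    value = f"{panda_id}:{processing_type}:{cluster_hex}.{process_hex}_{schedd_code}"
    # Assumption: the length limit is enforced by simple truncation
    return value[:MAX_LENGTH]
```

For example, `build_htcondor_job_id(6000000001, "reco", 255, 16, "s1")` gives `6000000001:reco:ff.10_s1`, which could then be exported with `os.environ["HTCondor_JOB_ID"] = ...` before any Lustre activity starts.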
- Allowing mv copytool to move files to final destination
- Activated via PQ.catchall=..,mv_final_destination
- Requested by D. Benjamin
- Dask updates
- Pilot has been tested running in a pod for dask purposes, both in interactive mode (pilot communicates with server and stages in files if necessary) and non-interactive mode (pilot runs on resource like a normal grid job)
- AlmaLinux9 related update
- Dumping /etc/os-release to log instead of trying to execute lsb_release command, which is not available on AlmaLinux9
- Requested by J. Van Eldik
- Added memory monitoring for sPHENIX
- Based on prmon and same setup as ATLAS is using (but with hardcoded path instead of using asetup)
- Requested by X. Zhao (BNL)
- Raythena updates
- Renamed internal resource from Cori to Nersc since Cori is reaching end of life and transitioning to Perlmutter
- Updated FRONTIER_SERVER to use Nersc proxy
- Fixed an issue when trying to append --preExec to the executable
- Correctly configured event service job with CA
- GitHub PR: #79
- Internal thread handling and job monitoring improved to catch rare failures
- It was reported by Rubin that the pilot could get stuck in some rare cases and not be able to finish
- Pilot is waiting for all internal threads (except main thread) to finish after graceful_stop has been set, but has a five minute time-out in case some thread is stuck
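The wait-with-time-out pattern described above can be sketched as follows. This is an illustration only: `wait_for_threads` is a hypothetical helper, and the 300 second default simply mirrors the five minutes mentioned in the note.

```python
import threading
import time


def wait_for_threads(threads, timeout=300.0):
    """Join the given threads, spending at most `timeout` seconds in total.

    Returns the names of threads that are still alive (i.e. stuck).
    """
    deadline = time.monotonic() + timeout
    for thread in threads:
        # Each join only gets the time that is left of the overall budget
        remaining = deadline - time.monotonic()
        thread.join(max(0.0, remaining))
    return [t.name for t in threads if t.is_alive()]


# Usage sketch: worker threads poll a shared graceful_stop event
graceful_stop = threading.Event()
worker = threading.Thread(target=graceful_stop.wait, name="job_monitor")
worker.start()
graceful_stop.set()
stuck = wait_for_threads([worker], timeout=5.0)  # empty list when all threads ended
```

Using one shared deadline instead of a per-thread time-out bounds the total wait regardless of how many threads are stuck.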
- Pilot checks now optional
- The following config options are now optional: last_heartbeat, machinefeatures, jobfeatures, cpu_usage, threads
- If not present in config.Pilot.checks, pilot will not run the corresponding check
- If the config file is outdated / Pilot.checks is not listed, all checks will run as before
- More checks to follow, including Payload.checks
- Requested by X. Zhao (BNL)
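The selection logic can be illustrated with the sketch below. It assumes an INI-style config file with a `[Pilot]` section and a comma-separated `checks` option; the helper name `selected_checks` is hypothetical and the pilot's real parsing may differ.

```python
import configparser

# Checks named in the release note as optional
OPTIONAL_CHECKS = ["last_heartbeat", "machinefeatures", "jobfeatures", "cpu_usage", "threads"]


def selected_checks(config_text):
    """Return the optional checks that should run for the given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Outdated config without a checks option: run everything as before
    if not parser.has_option("Pilot", "checks"):
        return list(OPTIONAL_CHECKS)
    listed = {name.strip() for name in parser.get("Pilot", "checks").split(",")}
    # Only checks that are explicitly listed will run
    return [name for name in OPTIONAL_CHECKS if name in listed]
```

With `checks=cpu_usage,threads` in the `[Pilot]` section, only those two checks would run; with no `checks` option at all, all five run.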
- Real-time logging
- ssl_enable and ssl_verify are now configurable (ssl_enable=True selects https transport, which is the default; ssl_enable=False selects http transport)
- Requested for sPHENIX by X. Zhao (BNL)
- Bug fixes
- Fixed import problem in gs copy tool
- Requested by W. Guan
- Improved process and process group killing after command execution timeout
- Previously, it could happen that a process lookup after a timeout could trigger a second exception after the initial timeout exception
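The failure mode described above, a lookup error raised while handling the time-out, can be avoided along the lines of the following sketch. It is POSIX-only, and `run_with_timeout` is a hypothetical helper rather than the pilot's actual function.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd in its own process group; kill the whole group on time-out.

    Returns (timed_out, returncode, stdout).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            text=True, start_new_session=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return False, proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            # Kill the process and any children it started
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except ProcessLookupError:
            # The process ended between the time-out and the lookup;
            # swallow this so no second exception escapes
            pass
        out, _ = proc.communicate()  # reap the process and collect output
        return True, proc.returncode, out
```

Starting the command in a new session gives it its own process group, so the kill reaches child processes without touching the caller.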
Contributions from J. Esseiva, P. Nilsson.