3.6.1.31

@PalNilsson PalNilsson released this 27 Jun 09:24
23235f3
  • Prevented CPU architecture script from being executed when not wanted (no change for ATLAS)

    • The script seems to occasionally cause hangs on Rubin resources
  • Time-outs and remote file open verification

    • The remote file open script now always flushes its stdout buffer, and the buffer is also flushed when a time-out exception is received during script execution. This might help in some cases, but not if it is the container setup that is hanging
      • Requested by R. Walker
    • All messages from the remote file open script are now written immediately to a new text file, “remotefileslog-instant.txt”, rather than only being written from stdout after the container has finished
      • Any time-out info will be written to “remote_open.std*” files
      • Container setup and/or time-out exception will be written to “remotefileslog.txt” as before
      • (A later pilot version can extract the last file open message from this file and add to the error diagnostics)
    • Fix for recursive kills after time-out (leading to many kill attempts)
      • Requested by R. Taylor
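The instant logging described in this item can be illustrated with a minimal sketch. It is not the pilot's actual code: the helper name `write_instant` is hypothetical, and only the file name “remotefileslog-instant.txt” comes from the notes above. The idea is that each message is appended and forced to disk at once, so it survives even if the process is later killed after a time-out.

```python
import os


def write_instant(message, path="remotefileslog-instant.txt"):
    """Append one message to the log and force it to disk immediately.

    Hypothetical helper for illustration; the pilot may do this differently.
    """
    with open(path, "a", encoding="utf-8") as log:
        log.write(message.rstrip("\n") + "\n")
        log.flush()             # empty Python's own buffer
        os.fsync(log.fileno())  # ask the OS to write the data to disk
```

For plain stdout, the equivalent is `print(message, flush=True)` or an explicit `sys.stdout.flush()`.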
  • Added support for output file with regular expression

    • Pilot looks for matching files when it finds ‘regex|..’ expression in LFN and updates output file list
    • Requested by T. Maeno, J. Webb and X. Zhao for sPHENIX
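As a rough illustration of the regular expression handling, the sketch below expands such entries in an output file list. It assumes that the pattern is whatever follows the `regex|` prefix and that matching is done against file names in a single work directory; the function name `expand_regex_lfns` is hypothetical and the pilot's real matching rules may differ.

```python
import os
import re


def expand_regex_lfns(lfns, workdir):
    """Replace 'regex|<pattern>' entries with the matching file names in workdir.

    Assumption: the pattern is the text after the first '|' and must match
    the whole file name. Ordinary LFNs are passed through unchanged.
    """
    expanded = []
    for lfn in lfns:
        if lfn.startswith("regex|"):
            pattern = re.compile(lfn.split("|", 1)[1])
            matches = sorted(
                name
                for name in os.listdir(workdir)
                if pattern.fullmatch(name)
                and os.path.isfile(os.path.join(workdir, name))
            )
            expanded.extend(matches)
        else:
            expanded.append(lfn)
    return expanded
```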
  • Added time-out to ps execution (for CPU activity monitoring) since Rubin reported that this operation can hang on problematic nodes
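A time-out of this kind can be sketched with the standard library alone. The helper name `run_with_timeout` and the default of 10 seconds are assumptions made for illustration, not values taken from the pilot.

```python
import subprocess


def run_with_timeout(cmd, timeout=10):
    """Run a command and return its stdout, or None if it timed out.

    subprocess.run() kills the child process when the time-out expires,
    so a hanging command cannot block the caller indefinitely.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

A monitoring loop could then call, for example, `run_with_timeout(["ps", "aux"])` and simply skip that measurement when the result is `None`.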

  • Only show internal memory usage in debug mode

    • To prevent excessive calls to ps command
    • Requested by R. Walker
  • Pilot changes related to PanDA/Dask integration (interactive mode)

    • Updated and improved logic for stage-in when pilot is running in a pod in stager mode
    • Pilot stages in input files then quits
    • Currently this leads to the job finishing even though the user will still be using Jupyter on the resource (a later pilot version can keep the job in running state until end of lease time)
  • Checksum type can now be selected in pilot config

    • checksum_type=md5 or adler32
    • Requested by J. Webb (BNL) for sPHENIX
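To make the two checksum options concrete, here is a minimal sketch of computing either one for a local file. The function name `calculate_checksum` and its parameters are hypothetical; only the choice between md5 and adler32 comes from the notes above.

```python
import hashlib
import zlib


def calculate_checksum(path, checksum_type="adler32", blocksize=1024 * 1024):
    """Return the hex checksum of a file, read in blocks to limit memory use."""
    if checksum_type == "md5":
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(blocksize), b""):
                digest.update(block)
        return digest.hexdigest()
    if checksum_type == "adler32":
        value = 1  # adler32 starting value
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(blocksize), b""):
                value = zlib.adler32(block, value)
        # Zero-pad to 8 hex digits, the usual form for adler32 checksums
        return f"{value & 0xFFFFFFFF:08x}"
    raise ValueError(f"unsupported checksum type: {checksum_type}")
```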
  • Preserving file attributes (timestamps, mode, ownership) while copying pilot source into container (A.A.)
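For a single file, preserving these attributes could look like the sketch below. It is an illustration, not the contributed implementation, and the helper name `copy_preserving_attributes` is hypothetical. `shutil.copy2` carries over the mode and timestamps; ownership is restored separately and may require privileges, so a failure there is tolerated.

```python
import os
import shutil


def copy_preserving_attributes(src, dst):
    """Copy a file, keeping its timestamps, mode and (where permitted) ownership."""
    shutil.copy2(src, dst)  # copies content, permission bits and timestamps
    info = os.stat(src)
    try:
        os.chown(dst, info.st_uid, info.st_gid)
    except PermissionError:
        # Changing the owner usually needs root; keep the copy regardless
        pass
    return dst
```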

  • VP jobs now using ignore_availability=False when looking up replicas (I.V.)

    • To bypass replica sorting issue seen in VP jobs where algorithm picked replica from site that was in downtime
  • Housekeeping

    • Processed multiple files with pylint and implemented solutions (typical scores: 7-9+ / 10)

Code contributions from A. Anisenkov, I. Vukotic, P. Nilsson.