Commissioning on the grid
This page describes commissioning efforts on the grid with the HTCondor plugins, which use Condor-C to feed pilots to HTCondor, CREAM, and ARC CEs.
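For orientation, pilot submission to a grid PQ is driven by an entry in harvester's queue configuration that points at the HTCondor submitter plugin. Below is a minimal sketch written as a Python dict mirroring the JSON layout of panda_queueconfig.json; the PQ name, limits, and template path are placeholder values, and the keys follow common harvester usage.

```python
# Minimal sketch of a harvester queue configuration entry; the real
# panda_queueconfig.json uses the same structure in JSON. The PQ name,
# limits, and template path are placeholders.
queue_config = {
    "CERN-PROD-preprod": {
        "queueStatus": "online",       # submit workers to this PQ
        "prodSourceLabel": "managed",
        "nQueueLimitWorker": 100,      # cap on queued (idle) workers
        "maxWorkers": 2000,            # cap on total workers
        "submitter": {
            # HTCondor submitter plugin: builds a Condor-C submit file from
            # the template and routes the pilot to the CE named in it
            "name": "HTCondorSubmitter",
            "module": "pandaharvester.harvestersubmitter.htcondor_submitter",
            "templateFile": "/opt/harvester/etc/grid_sdf.template",  # placeholder
        },
    },
}
```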
Goal: To see whether a harvester instance degrades when it runs a few thousand workers at a PQ.
Status: Done at CERN public resources with PULL (Jan 2018). Several bugs were fixed, but no performance issue was found. The number of running jobs at CERN-PROD-preprod hovered steadily around 2k; such counts can be cross-checked against the schedd as in the sketch below.
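During such scale tests, the running-job count reported by monitoring can be verified directly on the submit node with plain condor_q. A minimal, harvester-independent sketch (JobStatus == 2 means "running" in HTCondor):

```python
import subprocess

# Count running jobs in the local schedd: condor_q prints one line per job
# matching the constraint, with the requested attributes (-af = autoformat).
result = subprocess.run(
    ["condor_q", "-constraint", "JobStatus == 2", "-af", "ClusterId", "ProcId"],
    capture_output=True, text=True, check=True,
)
n_running = len(result.stdout.splitlines())
print(f"running workers: {n_running}")
```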
Goal: To see whether UPS works properly.
Plan: UPS was implemented and a unified PQ was defined at CERN public resources.
Status: Done at CERN-PROD-DEV_UCORE (Feb 2018). Multiple flavors of pilots were submitted to the PQ and shared the underlying CPU resources; the number of single-core jobs was properly limited (the toy sketch below illustrates the idea). More details are in the presentation.
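The following toy sketch illustrates the idea behind UPS at a unified PQ; it is not the actual harvester/PanDA logic. Pilot flavors share one core budget, and the single-core share is capped so it cannot starve multi-core payloads. The flavor names, the 8-core assumption for MCORE, and the 20% cap are purely illustrative.

```python
# Toy illustration of unified-queue brokering: one core budget shared by
# pilot flavors, with the single-core share capped.
def plan_pilots(total_cores, demand, score_cap_fraction=0.2):
    """demand: {"SCORE": n_jobs, "MCORE": n_jobs}; MCORE assumed 8-core."""
    plan = {"SCORE": 0, "MCORE": 0}
    score_cap = int(total_cores * score_cap_fraction)
    plan["SCORE"] = min(demand.get("SCORE", 0), score_cap)
    cores_left = total_cores - plan["SCORE"]
    plan["MCORE"] = min(demand.get("MCORE", 0), cores_left // 8)
    return plan

print(plan_pilots(1000, {"SCORE": 500, "MCORE": 200}))
# -> {'SCORE': 200, 'MCORE': 100}
```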
Goal: To see whether there is a performance issue in the harvester instance, and to check that pilots are automatically submitted with proper attributes to each PQ without any manual intervention.
Plan: A harvester instance will be configured to submit a small number of pilots to each of the hundreds of PQs where APF is also submitting pilots, so that pilots from harvester and APF go to the same PQs. SchedulerID=harvester-cern_cloud is set on harvester pilots so that they are distinguished from APF pilots and the summary is available in a pandamon page (see the sketch after this step). A couple of HTCondor-CE and CREAM-CE PQs in Asia, the EU, and the US should be tried before going through all PQs. The same pilot wrapper is used for all PQs, i.e. no special wrapper for the US.
Status: Completed in mid-March 2018. All types of CEs work fine except GT5 CEs, which were being retired and thus were ignored. Several changes were made to harvester to decrease its CPU consumption.
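A hedged sketch of how a custom ClassAd attribute such as SchedulerID can be attached at submission, here using the htcondor Python bindings rather than harvester's own submit-template machinery; the executable path and CE endpoint are placeholders.

```python
import htcondor

# Tag pilots with a custom ClassAd attribute so they can be told apart from
# APF pilots. "MY.SchedulerID" is the bindings' spelling of the submit-file
# syntax "+SchedulerID"; string values need embedded quotes.
sub = htcondor.Submit({
    "executable": "/opt/harvester/sbin/pilot_wrapper.sh",  # placeholder
    "universe": "grid",
    # placeholder Condor-C endpoint of an HTCondor-CE
    "grid_resource": "condor ce01.example.org ce01.example.org:9619",
    "MY.SchedulerID": '"harvester-cern_cloud"',
})
htcondor.Schedd().submit(sub)
```

With the attribute in place, pilots can be selected on the query side as well, e.g. with a condor_q constraint on SchedulerID.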
Goal: To establish procedures for migrating from APF to harvester, to keep a sufficient number of running jobs at the site, and to check the performance of the harvester instance, such as CPU and memory usage and the stability and robustness of the service.
Plan: Set up one more harvester node to avoid a single point of failure. Provide job monitoring, harvester monitoring, and node monitoring.
Status: Started at BNL on 20th Mar 2018, with one harvester instance feeding pilots+jobs to BNL_PROD with PUSH to avoid empty pilots, while 4 APF instances fed pilots to the same PQ with PULL. It was confirmed on 27th Mar that harvester was running 800 jobs while the APFs were running 900 jobs. Some jobs failed with Condor HoldReason "CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route or entry in JOB_ROUTER_ENTRIES." and Condor RemoveReason "via condor_rm (by user atlpan)": they were HELD for 6 hours and then got killed. This kind of failure will disappear once the PQ is changed to use PULL (held pilots can be inspected as in the sketch below). After that, it was decided to have a separate PQ, sharing the same slot allocation with BNL_PROD, which is served only by harvester. TBD: UPS or PULL for the PQ to avoid empty pilots.
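Held pilots of the kind described above can be inspected on the submit node with plain condor_q; JobStatus == 5 means "held", and HoldReason carries the job-router message. A minimal, harvester-independent sketch:

```python
import subprocess

# List held jobs together with the reason the schedd gives for the hold.
result = subprocess.run(
    ["condor_q", "-constraint", "JobStatus == 5",
     "-af", "ClusterId", "HoldReason"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    print(line)
```

The removals after six hours correspond to a condor_rm with a similar constraint on how long the job has been in the held state.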
Goal: To confirm that UPS works properly and to see whether harvester can automatically discover a proper queue limit.
Plan: A new UCORE PQ will be created somewhere, and the harvester pilot submission rate to it will be gradually increased. Eventually, existing PQs will be set offline to let UPS manage all CPU resources. The underlying batch system will be reconfigured if necessary.
Status: To be done.
Goal: Full migration to harvester for the PULL PQs which are currently served by APF.
Status: To be done.