Admin FAQ

Harvester Admin Tool

The Harvester Admin Tool, available since Harvester version 0.0.20, provides a command-line interface for harvester administrative operations.

Setup

After installing Harvester, there is a script template harvester-admin.rpmnew.template under {harvester_venv}/local/bin/ . Copy it to harvester-admin and modify it: set userName to the user that runs the admin tool and VIRTUAL_ENV to the harvester venv. E.g., when the venv directory is /opt/harvester :

# cp /opt/harvester/local/bin/harvester-admin.rpmnew.template /opt/harvester/local/bin/harvester-admin
# vim /opt/harvester/local/bin/harvester-admin

One may also want to make harvester-admin available as a plain shell command (rather than invoking the executable by its full path) by adding its directory to $PATH .
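
E.g., for the venv above:

# export PATH=$PATH:/opt/harvester/local/bin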

Usage

Run harvester-admin . The option -h after any command/sub-command prints a help message. Some examples are shown below.

Show help:

# /opt/harvester/local/bin/harvester-admin -h
usage: harvester-admin [-h] [-v] {test,get,fifo,cacher,qconf,kill} ...

positional arguments:
  {test,get,fifo,cacher,qconf,kill}
    test                for testing only
    get                 get attributes of this harvester
    fifo                fifo related
    cacher              cacher related
    qconf               queue configuration
    kill                kill something alive

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose, --debug
                        Print more verbose output. (Debug mode !)

Admin tool test:

# /opt/harvester/local/bin/harvester-admin -v test
[2019-10-04 00:17:03,197 CRITICAL] Harvester Admin Tool: test CRITICAL
[2019-10-04 00:17:03,198 ERROR] Harvester Admin Tool: test ERROR
[2019-10-04 00:17:03,198 WARNING] Harvester Admin Tool: test WARNING
[2019-10-04 00:17:03,198 INFO] Harvester Admin Tool: test INFO
[2019-10-04 00:17:03,198 DEBUG] Harvester Admin Tool: test DEBUG
Harvester Admin Tool: test
[2019-10-04 00:17:03,198 DEBUG] ARGS: Namespace(debug=True, which='test') ; RESULT: None 
[2019-10-04 00:17:03,198 DEBUG] Action completed in 0.001 seconds

Show the help of the qconf (queue configuration) sub-command:

# /opt/harvester/local/bin/harvester-admin qconf -h
usage: harvester-admin qconf [-h] {list,dump,refresh,purge} ...

positional arguments:
  {list,dump,refresh,purge}
    list                List queues. Only active queues listed by default
    dump                Dump queue configurations
    refresh             refresh queue configuration immediately
    purge               Purge the queue thoroughly from harvester DB (Be
                        careful !!)

optional arguments:
  -h, --help            show this help message and exit

List all queue configurations in harvester:

# /opt/harvester/local/bin/harvester-admin qconf list -a
configID : queue name
--------- ------------
   44795 : pic-htcondor_UCORE
   44796 : NIKHEF-ELPROD_MCORE
   ...
   44974 : INFN-T1_UCORE

FIFO

What is FIFO

FIFO in harvester is an optional feature that lets harvester agents take advantage of a message-queue data structure. The main purpose of the FIFO is to reduce the DB polling frequency and lower the CPU usage of the node.

The FIFO has a "Priority Queue" data structure. Different plugins can be chosen as the fifo backend. Existing plugins: SQLite, Redis, MySQL.

So far, only the monitor agent has the option to enable FIFO. The monitor FIFO allows a configurable priority for each PQ: worker chunks with a shorter fifoCheckInterval have higher priority and are checked more frequently.
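
The behavior can be pictured with a plain-Python sketch (illustrative only, not actual harvester code; the chunk names and intervals below are made up): the queue is a heap keyed by each chunk's next-check time, so chunks with shorter intervals keep surfacing at the head.

    import heapq
    import time

    # score = time at which the chunk's next check is due; a chunk with a
    # shorter fifoCheckInterval re-enters the heap with an earlier score,
    # so it reaches the head (and gets checked) more frequently
    fifo = []

    def put(chunk, check_interval):
        heapq.heappush(fifo, (time.time() + check_interval, chunk))

    def get_if_due():
        # peek at the head; dequeue only if its score has passed
        if fifo and fifo[0][0] <= time.time():
            return heapq.heappop(fifo)[1]
        return None

    put("chunk_A", 300)   # hypothetical chunk checked every 5 minutes
    put("chunk_B", 1800)  # hypothetical chunk checked every 30 minutes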

How to set up FIFO

Choose a FIFO backend and set up the related service. Then configure it in harvester.

SQLite

Backend setup:

  • Make sure sqlite3 is installed on the OS.
  • No special service configuration required.

Harvester configuration:

  • In the [fifo] section of panda_harvester.cfg, one should set fifoModule = pandaharvester.harvesterfifo.sqlite_fifo and fifoClass = SqliteFifo to use the SqliteFifo plugin.

  • The database_filename should be set to the sqlite database filename. This must be different from the main Harvester DB (if it uses sqlite) and from other sqlite fifo DBs.

  • It is recommended to use the placeholder $(AGENT) in the filename so that the fifo of each agent gets its own DB.

  • One can locate the DB file on a ramdisk (e.g. under /dev/shm) for better performance.

  • E.g.

    [fifo]
    fifoModule = pandaharvester.harvesterfifo.sqlite_fifo
    fifoClass = SqliteFifo
    database_filename = /dev/shm/$(AGENT)_fifo.db
    

Redis

Backend setup:

  • Make sure the redis service is installed, configured, and running on the harvester node.

  • Install the python redis package via pip within the harvester python venv:

    $ pip install redis
    

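To verify that the backend is reachable, one can run a quick sanity check from within the harvester venv (a minimal snippet, assuming the default localhost:6379 with no password); it prints True if the redis service answers:

    $ python -c "import redis; print(redis.Redis(host='localhost', port=6379).ping())"
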
Harvester configuration:

  • In the [fifo] section of panda_harvester.cfg, one should set fifoModule = pandaharvester.harvesterfifo.redis_fifo and fifoClass = RedisFifo to use the RedisFifo plugin. This configuration suffices if one runs a standalone redis service on localhost with the default port, and with no password or db specified.

  • If the redis service has a specific host, port, db, or password, set redisHost, redisPort, redisDB, and redisPassword accordingly; see the extended example below.

  • E.g.

    [fifo]
    fifoModule = pandaharvester.harvesterfifo.redis_fifo
    fifoClass = RedisFifo
    
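E.g., with a non-default service (the host, port, db, and password values below are hypothetical):

    [fifo]
    fifoModule = pandaharvester.harvesterfifo.redis_fifo
    fifoClass = RedisFifo
    redisHost = redis.example.com
    redisPort = 6380
    redisDB = 1
    redisPassword = changeme
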

MySQL

Note:

  • One can use the same MySQL DB server as the Harvester DB for the FIFO backend, but this is reasonable if and only if the Harvester DB is shared across multiple Harvester nodes (in which case the FIFO should be shared as well).
  • A MySQL FIFO performs worse than a SQLite FIFO on a ramdisk. If the harvester node is standalone, use a SQLite or Redis FIFO instead.

Backend setup:

  • An empty database must be created and granted to a specific db user beforehand; see the sketch after this list.
  • Make sure the harvester node has read/write access, as that db user, to the MySQL/MariaDB database.
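
For example, one way to prepare such a database in the mysql client (a sketch reusing the names from the configuration example below; adjust host, user, and password to the actual setup):

    # mysql -u root -p
    mysql> CREATE DATABASE HARVESTER_FIFO;
    mysql> CREATE USER 'harvester_fifo'@'localhost' IDENTIFIED BY 'paswordforfifo';
    mysql> GRANT ALL PRIVILEGES ON HARVESTER_FIFO.* TO 'harvester_fifo'@'localhost';
    mysql> FLUSH PRIVILEGES;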

Harvester configuration:

  • In the [fifo] section of panda_harvester.cfg, one should set fifoModule = pandaharvester.harvesterfifo.mysql_fifo and fifoClass = MysqlFifo to use the MysqlFifo plugin.

  • The db_host, db_port, db_user, db_password, and db_schema should be specified properly to access the backend DB. This can (and preferably should) be a database different from the main Harvester DB and from other MySQL fifo DBs.

  • E.g. (Say a database named "HARVESTER_FIFO" was created and granted to db user "harvester_fifo" in advance)

    [fifo]
    fifoModule = pandaharvester.harvesterfifo.mysql_fifo
    fifoClass = MysqlFifo

    # database attributes for MySQL
    db_host = fifo-db.example.com
    db_port = 12345
    db_user = harvester_fifo
    db_password = paswordforfifo
    db_schema = HARVESTER_FIFO

How to test / benchmark FIFO

(Available after "14-08-2018 12:51:14 on contrib_cern (by fahui)")

One may wonder whether the FIFO works and how well it performs.

Harvester Admin Tool provides a FIFO benchmark command:

# harvester-admin fifo benchmark
Start fifo benchmark ...
Cleared fifo
Put 500 objects by 1 threads : took 0.237 sec
Now fifo size is 500
Get 500 objects by 1 threads : took 0.392 sec
Now fifo size is 0
Put 500 objects by 1 threads : took 0.210 sec
Now fifo size is 500
Get 500 objects protective dequeue by 1 threads : took 0.433 sec
Now fifo size is 0
Put 500 objects by 1 threads : took 0.222 sec
Now fifo size is 500
Cleared fifo : took 0.001 sec
Now fifo size is 0
Finished fifo benchmark
Summary:
FIFO plugin is: RedisFifo
Benchmark with 500 objects by 1 threads
Put            : 0.447 ms / obj
Get            : 0.786 ms / obj
Get protective : 0.867 ms / obj
Clear          : 0.002 ms / obj

One can also specify the number of objects to benchmark with the -n option.
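
E.g., to benchmark with 1000 objects:

# harvester-admin fifo benchmark -n 1000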

How to set up and configure Monitor FIFO

This section shows the steps to configure and enable FIFO for the harvester monitor agent.

Configurations

(Available after "14-08-2018 12:51:14 on contrib_cern (by fahui)")

To enable the monitor FIFO, at least the fifoEnable = True and fifoCheckInterval lines need to be added to the [monitor] section of harvester.cfg .

Besides, it is reasonable to adjust some existing variables in the [monitor] section. Typically, when the monitor fifo is enabled, one may want to decrease the frequency of worker checks in DB cycles (because almost all checks can now be done in fifo cycles) by increasing sleepTime (and perhaps checkInterval as well).

A minimal configuration may look like this:

[monitor]

nThreads = 3
maxWorkers = 500
lockInterval = 600
checkInterval = 900
sleepTime = 1800
checkTimeout = 3600

fifoEnable = True
fifoCheckInterval = 300

Repopulate monitor fifo

The monitor FIFO is empty (or does not exist) when it is set up for the first time or after a reset (say, a sqlite db on a ramdisk after a node reboot). To let the monitor agent utilize the monitor fifo, one needs to (re)populate it with active worker chunks.

Harvester Admin Tool allows one to do this in one line:

# harvester-admin fifo repopulate monitor
Repopulated monitor fifo

Voila.

N.B.: This operation removes everything from the monitor FIFO first, and then populates it with active worker chunks queried from the Harvester DB. It may take some time (several minutes) if there are many (say 100k) worker records in the Harvester DB.

N.B.: It is recommended to repopulate the monitor fifo while the harvester service is stopped, i.e. when the FIFO is not accessed by other processes, and to restart the harvester service afterwards. (It is nevertheless possible to repopulate the monitor fifo while the harvester service is running.)

How to set up the monitor plugin cache

Some monitor plugins (e.g. the HTCondor monitor) have a cache functionality that utilizes the Harvester FIFO.

To enable and configure it, modify the pluginCache* parameters in the [monitor] section of harvester.cfg . E.g.:

[monitor]

# plugin cache parameters (used if monitor plugin supports)
pluginCacheEnable = True
pluginCacheRefreshInterval = 300

How to set up the monitor event-based check mechanism

(Available since Harvester version 0.1.0-rc)

Besides periodically polling the resource for status updates of all workers, the monitor agent can now also check only a subset of workers at a time, namely those for which it receives an update "event". (Note that "event" here means an update event of the worker, such as a batch status change, which makes the worker worth checking; it has nothing to do with PanDA job events.)

Note that the monitor event-based check mechanism requires the monitor fifo mechanism to be enabled.

Requirements on monitor plugin

To run the harvester monitor agent with the event-based check mechanism, the monitor plugin needs to implement the method report_updated_workers, which reports the workers just updated and their update timestamps (check DummyMonitor for details of this method).

Note that the method should report the workerID (NOT batchID !!) of the workers. Thus, the batch/resource-facing system is responsible for recording the workerID in the batch jobs, where the monitor plugin can access it.

The harvester monitor agent periodically calls the plugin's report_updated_workers to get all newly updated workers in one shot.
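
A hypothetical, self-contained sketch of the idea (the return shape, a mapping of workerID to update timestamp, is an assumption here; DummyMonitor in the harvester source is the authoritative reference):

    import time

    class SketchMonitor:
        """Illustrative only; NOT the real harvester plugin interface."""

        def __init__(self):
            # pretend event store: workerID -> last update timestamp (epoch seconds)
            self._events = {10001: time.time() - 30, 10002: time.time() - 600}

        def report_updated_workers(self, time_window):
            # report workers updated within the last time_window seconds,
            # keyed by workerID (NOT batchID), each with its update timestamp
            now = time.time()
            return {worker_id: timestamp
                    for worker_id, timestamp in self._events.items()
                    if now - timestamp <= time_window}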

Harvester configuration

In the [monitor] section of the harvester configuration, one needs to set both fifoEnable and eventBasedEnable to True, and specify the plugin(s) in eventBasedPlugins in JSON format (an array of objects).

Besides, specify the parameters eventBasedCheckInterval, eventBasedTimeWindow, eventBasedCheckMaxEvents, eventBasedEventLifetime, and eventBasedRemoveMaxEvents.

A complete configuration of the [monitor] section may look like:

[monitor]
nThreads = 6 
maxWorkers = 750
lockInterval = 300
checkInterval = 3600
sleepTime = 2400
workerQueueTimeLimit = 172800

fifoEnable = True
fifoSleepTimeMilli = 5000
fifoCheckInterval = 1800
fifoCheckDuration = 15
checkTimeout = 3600
#fifoMaxWorkersToPopulate = 10000
fifoMaxWorkersPerChunk = 500
fifoForceEnqueueInterval = 10800
fifoMaxPreemptInterval = 60

pluginCacheEnable = True
pluginCacheRefreshInterval = 300

eventBasedEnable = True
eventBasedPlugins = 
  [
    {
      "module": "pandaharvester.harvestermonitor.htcondor_monitor",
      "name": "HTCondorMonitor",
      "submissionHost_list": [
          "aipanda023.cern.ch,aipanda023.cern.ch:19618",
          "aipanda024.cern.ch,aipanda024.cern.ch:19618",
          "aipanda183.cern.ch,aipanda183.cern.ch:19618",
          "aipanda184.cern.ch,aipanda184.cern.ch:19618"
        ]
    }
  ]
eventBasedCheckInterval = 300
eventBasedTimeWindow = 450
eventBasedCheckMaxEvents = 500
eventBasedEventLifetime = 1800
eventBasedRemoveMaxEvents = 2000


PanDA Queue management

How to offline a PQ from harvester

If one just wants harvester to stop submitting more workers for the PQ, as a temporary manual offline, it suffices to add the following line to the object of the PQ in the harvester local queue configuration file. E.g.

"CERN-EXTENSION_GOOGLE_HARVESTER": {
    "queueStatus": "OFFLINE",
    ...
}

How to remove a PQ from harvester

If one wants to remove the PQ completely from harvester (e.g. the PQ is renamed or no longer used), then:

  1. Be sure that no jobs/workers/configs of the PQ are needed any longer.

  2. Modify the pilot_manager of the PQ to be "local" on AGIS, and/or make sure harvester no longer grabs information about this PQ from AGIS.

  3. Remove all lines of the PQ from the harvester local queue configuration file.

  4. Run qconf purge with the harvester admin tool to delete all records of this PQ in the DB. E.g.:

     # harvester-admin qconf purge UKI-LT2-IC-HEP_SL6
     Purged UKI-LT2-IC-HEP_SL6 from harvester DB
    

KaBOOM!

Worker management

How to kill workers in a dead queue or dead CE

Sometimes one finds plenty of queuing workers submitted to a certain dead CE, preventing more jobs from getting activated/submitted to the whole queue. Or maybe a queue is totally blocked due to a site issue, and all workers already submitted to the site will never run.

In such cases, on the harvester instance one can manually kill the workers which block the queue -- the harvester admin tool allows one to kill workers filtered by worker status, queue (site), CE, and submissionhost (e.g. condor schedd).

E.g. Kill all submitted (queuing) workers submitted to CE "ce13.pic.es:9619" and CE "ce14.pic.es:9619" of site "pic-htcondor_UCORE":

# /opt/harvester/local/bin/harvester-admin kill workers --sites pic-htcondor_UCORE --ces ce13.pic.es:9619 ce14.pic.es:9619  --submissionhosts ALL
Sweeper will soon kill 7 workers, with status in ['submitted'], computingSite in ['pic-htcondor_UCORE'], computingElement in ['ce13.pic.es:9619', 'ce14.pic.es:9619'], submissionHost in ALL

E.g. Kill all submitted and idle workers submitted via submissionhost "aipanda183.cern.ch,aipanda183.cern.ch:19618" (full submissionhost name of aipanda183 condor schedd) to the CE "ce13.pic.es:9619" (say, condor GAHP processes to some CE are down on a certain schedd):

# /opt/harvester/local/bin/harvester-admin kill workers --status submitted idle --sites ALL --ces ce13.pic.es:9619 --submissionhosts aipanda183.cern.ch,aipanda183.cern.ch:19618
Sweeper will soon kill 7 workers, with status in ['submitted', 'idle'], computingSite in ALL, computingElement in ['ce13.pic.es:9619'], submissionHost in ['aipanda183.cern.ch,aipanda183.cern.ch:19618']

Rules of the command harvester-admin kill workers:

  • Available filter flags are --status, --sites, --ces, and --submissionhosts
  • After a filter flag there can be one of the following: a single argument (workers matching the argument), multiple arguments separated by spaces (workers matching any of these arguments), or the keyword ALL (no constraint on this flag)
  • --sites, --ces, and --submissionhosts are mandatory. One MUST specify valid argument(s) for them, or ALL
  • --status is optional. Available status arguments are submitted, idle, and running, in any combination. If --status is omitted, it defaults to submitted.
  • All workers matching the conditions of all filter flags will be killed by the sweeper agent soon (in its next cycle).

Note: For the grid, this feature will also be implemented on the BigPanDA webpage for easier manual operation. Furthermore, in the future the monitoring system will automatically spot dead CEs and kill blocked workers.

Get statistics of workers of a PQ

The harvester admin tool provides the query workers command to get the number of workers of the specified PQ, broken down by prodsourcelabel, resource_type (SCORE, MCORE, ...), and worker status.

For example, the worker stats of CERN-PROD_UCORE_2 :

# /opt/harvester/local/bin/harvester-admin query workers CERN-PROD_UCORE_2
{
    "CERN-PROD_UCORE_2": {
        "ANY": {
            "ANY": {
                "running": 0,
                "submitted": 0,
                "to_submit": 0
            }
        },
        "managed": {
            "SCORE": {
                "cancelled": 24,
                "finished": 33,
                "running": 0,
                "submitted": 2,
                "to_submit": 1
            }
        }
    }
}