WMAgent: Switch the new deployment model to less restrictive initialization mode #11990

Closed
todor-ivanov opened this issue May 20, 2024 · 5 comments · Fixed by dmwm/CMSKubernetes#1485 or dmwm/CMSKubernetes#1490

Comments

@todor-ivanov
Contributor

todor-ivanov commented May 20, 2024

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
In the current deployment model we have three levels for deciding whether a deployment counts as a fresh one, meaning that we are about to run the full initialization process from scratch for the agent in question. At the current stage of development, such a process causes a complete wipe-out of all the agent's data, especially the relational database. An explanation of those three levels follows, together with the basic mechanism used to implement this feature:

NOTE:
On every step of the initialization process we check the content of the relevant .init<step> file, and we always compare the current $WMA_BUILD_ID with the one previously initialized at the host.
The WMA_BUILD_ID is a hash generated during the execution of install.sh, at build time for Docker images or at deploy time for virtual environments. There are three levels of comparison we can make:

  • If we want to trigger re-initialization on any WMAgent image rebuild, then the $WMA_BUILD_ID should contain a sha256sum of a random value
  • If we want to trigger re-initialization only on new WMAgent tag builds, then the $WMA_BUILD_ID should contain a sha256sum of the whole $WMA_TAG
  • If we want to trigger re-initialization only on a version change (not on patches or release candidates), then we should split $WMA_TAG into a "release" part and a "patch/release candidate" part (or major/minor, if we want to call them that), and the $WMA_BUILD_ID should contain a sha256sum of only the main version part, e.g.:
WMA_TAG=2.3.3.1
WMA_TAG_RELEASE=2.3.3
WMA_TAG_MAJOR=2.3
WMA_TAG_MINOR=3
WMA_TAG_PATCH=1

The current implementation considers the first one - re-initialization on any WMAgent rebuild.

This issue requests a change of that behavior: a shift from the most restrictive initialization mechanism to the last one, the least restrictive. That way we identify a complete, fresh deployment only on a release change, and we preserve the ability to push fixes to production by just releasing a patch version of the same agent and redeploying the container.
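
To illustrate the three levels, here is a minimal sketch in POSIX shell. The function names `split_tag` and `build_id`, the mode names, and the way the seed is hashed are assumptions made for illustration. The actual install.sh may compute the hash differently, so the digests produced here are not expected to match the ones shown in the build logs.

```shell
#!/bin/sh
# Sketch only: split_tag and build_id are hypothetical helpers,
# not taken from install.sh.

# Split a tag such as 2.3.3.1 or 2.3.4rc3 into the parts named in this issue.
split_tag() {
    WMA_TAG="$1"
    # Release: the leading three dot-separated numbers, e.g. 2.3.4
    WMA_TAG_RELEASE=$(printf '%s\n' "$WMA_TAG" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+).*$/\1/')
    # Patch / release candidate: whatever follows the release, e.g. rc3 or 1
    WMA_TAG_PATCH=$(printf '%s\n' "$WMA_TAG" | sed -E 's/^[0-9]+\.[0-9]+\.[0-9]+\.?//')
    WMA_TAG_MAJOR=${WMA_TAG_RELEASE%.*}     # e.g. 2.3
    WMA_TAG_MINOR=${WMA_TAG_RELEASE##*.}    # e.g. 4
}

# Print a build id for one of the three (assumed) modes:
#   rebuild - hash of random bytes, changes on every build
#   tag     - hash of the whole tag, changes on every new tag
#   release - hash of the release part only, stable across patches and rcs
build_id() {
    mode="$1"
    split_tag "$2"
    case "$mode" in
        rebuild) seed=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n') ;;
        tag)     seed="$WMA_TAG" ;;
        release) seed="$WMA_TAG_RELEASE" ;;
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf '%s' "$seed" | sha256sum | cut -d' ' -f1
}
```

With this sketch, `build_id release 2.3.4rc3` and `build_id release 2.3.4rc4` print the same digest, while `build_id tag` gives a different digest for each of the two tags.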

NOTE:
In addition to triggering the initialization process based on the least restrictive criterion, we also have to think about two more aspects:

  • We should not lose any workflow cache directories, so that we do not have to drain the agents before patching. This means we should now base the host mount points only on the WMA_VER_RELEASE part of the WMA_TAG, rather than on the whole WMA_TAG as we have done up to now. This way the install area of the agent won't change when deploying a new patch version or release candidate tag, and the already running workflows will be able to continue uninterrupted.
  • We should enforce the check and copy of the Runtime code from the container to the host area on any minor version, patch version or release candidate change, because there could be a change in the runtime code that we would fail to pick up when deploying a new patched tag if we skipped that step.
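
The two points above could be sketched roughly as follows. The helper names `host_install_dir` and `needs_runtime_copy` are hypothetical, and the mount root is only an example value. How the last deployed tag is stored is not covered here, so it is passed in as a plain argument.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the real deployment scripts.

# Host install area keyed on the release part only, so that 2.3.4rc3 and
# 2.3.4rc4 share one directory and running workflows keep their cache.
host_install_dir() {
    mount_root="$1"    # example value: /data/dockerMount/srv/wmagent
    release="$2"       # release part of the tag, e.g. 2.3.4
    printf '%s/%s\n' "$mount_root" "$release"
}

# Succeeds (exit status 0) when the full tag differs from the last deployed
# one, i.e. when the Runtime code should be checked and copied again.
needs_runtime_copy() {
    last_tag="$1"
    current_tag="$2"
    [ "$last_tag" != "$current_tag" ]
}
```

For example, `needs_runtime_copy 2.3.4rc3 2.3.4rc4` succeeds, while `host_install_dir` returns the same path for both tags as long as the release part passed in is 2.3.4.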

Describe the solution you'd like
Switch from most restrictive to the least restrictive mode of operation

Describe alternatives you've considered
None. It is a must.

Additional context
Part of the following meta issue: #11314

@todor-ivanov todor-ivanov changed the title WMAgent: Switch the new deployment method to less restrictive initialization mode WMAgent: Switch the new deployment model to less restrictive initialization mode May 20, 2024
@todor-ivanov
Contributor Author

todor-ivanov commented May 20, 2024

And just as expected ... With the PR that fixes the issue, here are the builds of two sequential release candidates:

  • 2.3.4rc3
#11 [ 7/10] RUN /data/install.sh -t 2.3.4rc3
#11 0.115 
#11 0.115 =======================================================================
#11 0.115 Starting new WMAgent deployment with the following initialisation data:
#11 0.115 -----------------------------------------------------------------------
#11 0.116  - WMAgent Tag                : 2.3.4rc3
#11 0.116  - WMAgent Release            : 2.3.4
#11 0.116  - WMAgent Major version      : 2.3
#11 0.116  - WMAgent Minor version      : 4
#11 0.116  - WMAgent Patch/Release cand : rc3
#11 0.116  - WMAgent User               : 
#11 0.116  - WMAgent Root path          : /data
#11 0.118  - Python  Version            : Python 3.8.16
#11 0.118  - Python  Module path        : /usr/local/lib/python3.8/site-packages
#11 0.118 =======================================================================
...
#11 31.92 -----------------------------------------------------------------------
#11 31.92 Start Generating and preserving current build id
#11 31.92 WMA_BUILD_ID:386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
#11 31.92 WMA_BUILD_ID preserved at: /data/.dockerBuildId 
#11 31.92 Done Generating and preserving current build id!
#11 31.92 
#11 31.92 -----------------------------------------------------------------------

  • 2.3.4rc4
...
#11 [ 7/10] RUN /data/install.sh -t 2.3.4rc4
#11 0.115 
#11 0.115 =======================================================================
#11 0.115 Starting new WMAgent deployment with the following initialisation data:
#11 0.115 -----------------------------------------------------------------------
#11 0.115  - WMAgent Tag                : 2.3.4rc4
#11 0.115  - WMAgent Release            : 2.3.4
#11 0.115  - WMAgent Major version      : 2.3
#11 0.115  - WMAgent Minor version      : 4
#11 0.115  - WMAgent Patch/Release cand : rc4
#11 0.115  - WMAgent User               : 
#11 0.115  - WMAgent Root path          : /data
#11 0.117  - Python  Version            : Python 3.8.16
#11 0.117  - Python  Module path        : /usr/local/lib/python3.8/site-packages
#11 0.117 =======================================================================
...

#11 32.24 -----------------------------------------------------------------------
#11 32.24 Start Generating and preserving current build id
#11 32.24 WMA_BUILD_ID:386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
#11 32.24 WMA_BUILD_ID preserved at: /data/.dockerBuildId 
#11 32.24 Done Generating and preserving current build id!
#11 32.24 
#11 32.24 -----------------------------------------------------------------------

As one can see, the WMA_BUILD_ID has not changed across the patch/release candidate version change. This guarantees that switching between those two versions won't start a re-initialization from scratch.
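
For reference, the per-step comparison could look roughly like the sketch below. It assumes that each .init<step> file holds the build id recorded when that step last completed. The function name `check_init_steps` and its arguments are hypothetical and not taken from the agent's scripts.

```shell
#!/bin/sh
# Sketch only: assumes each .init<step> file holds the build id recorded
# when that step last completed. check_init_steps is a hypothetical name.

# Usage: check_init_steps <build_id> <config_dir> <step>...
# Returns 0 only if every listed step was initialized with <build_id>.
check_init_steps() {
    build_id="$1"
    config_dir="$2"
    shift 2
    for step in "$@"; do
        init_file="$config_dir/.init$step"
        if [ ! -f "$init_file" ] || [ "$(cat "$init_file")" != "$build_id" ]; then
            echo "mismatch at step: $step" >&2
            return 1
        fi
    done
    return 0
}
```

Because both release candidates above carry the same WMA_BUILD_ID, a check of this kind would keep passing for every step that was already completed under the earlier tag.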

FYI: @amaltaro

@todor-ivanov todor-ivanov self-assigned this May 20, 2024
@anpicci
Contributor

anpicci commented May 21, 2024

@amaltaro @todor-ivanov I can test this once it is approved

@todor-ivanov
Contributor Author

thanks @anpicci

@todor-ivanov
Contributor Author

Reopening this issue in order to address the change of behavior mentioned in this comment: dmwm/CMSKubernetes#1466 (comment)

@todor-ivanov
Contributor Author

hi @amaltaro,
ok....., here are the results from the match I was talking about earlier today ;) :

  • Building and deploying a fresh image with WMA_TAG=2.3.4rc3. The init output is:
=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
 - WMAgent Version            : 2.3.4rc3
 - WMAgent Release Cycle      : 2.3.4
 - WMAgent User               : cmst1
 - WMAgent Root path          : /data
 - WMAgent Host               : vocms0260.cern.ch
 - WMAgent TeamName           : testbed-vocms0260
 - WMAgent Number             : 0
 - WMAgent Relational DB type : mysql
 - Python  Version            : Python 3.8.16
 - Python  Module path        : /usr/local/lib/python3.8/site-packages
=======================================================

-------------------------------------------------------
Start: Performing basic_checks

Done: Performing basic_checks
-------------------------------------------------------

check_wmasecrets: Checking for changes in the WMAgent.secrets file
check_wmasecrets: No change found.
-------------------------------------------------------
Start: Performing checks for successful WMAgent initialisation steps...
WMA_BUILD_ID: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
wmaInitId: /data/srv/wmagent/2.3.4/config/.initActive
/data/srv/wmagent/2.3.4/config/.initAdmin
/data/srv/wmagent/2.3.4/config/.initAgent
/data/srv/wmagent/2.3.4/config/.initConfig
/data/srv/wmagent/2.3.4/config/.initCouchDB
/data/srv/wmagent/2.3.4/config/.initResourceControl
/data/srv/wmagent/2.3.4/config/.initResourceOpp
/data/srv/wmagent/2.3.4/config/.initRucio
/data/srv/wmagent/2.3.4/config/.initRuntime
/data/srv/wmagent/2.3.4/config/.initSqlDB
/data/srv/wmagent/2.3.4/config/.initUpload
/data/srv/wmagent/2.3.4/config/.initUsing
WARNING: wmaInitId vs. wmaBuildId mismatch
-------------------------------------------------------
Start: Performing Host initialization steps
deploy_to_host: Linking the proper manage file from Config Area
deploy_to_host: Copy the Runtime scripts
_copy_runtime: Copying content from: /usr/local/lib/python3.8/site-packages/WMCore/WMRuntime to: /data/srv/wmagent/2.3.4/install/Docker/
_copy_runtime: Copying content from: /usr/local/etc/ to: /data/srv/wmagent/2.3.4/config/
deploy_to_host: Initialize && Validate && Load WMAgent.secrets
deploy_to_host: checking /data/admin/wmagent/WMAgent.secrets
deploy_to_host: Initialise Rucio config
...
-------------------------------------------------------
Start: Performing checks for successful WMAgent initialisation steps...
WMA_BUILD_ID: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
wmaInitId: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
OK

Docker container has been initialised! However you still need to:
  1) Double check agent configuration: less /data/[dockerMount]/srv/wmagent/current/config/config.py
  2) Start the agent by either of the methods bellow:
     a) From inside the already running container
          * Access the running WMAgent container:
            docker exec -it wmagent bash
          * Use the regular manage script inside the container:
            manage start-agent

     b) From the host - by restarting the whole container
          * Kill the currently running container:
            docker kill wmagent
          * Start a fresh instance of wmagent:
            ./wmagent-docker-run.sh -t <WMA_TAG> && docker logs -f wmagent

     c) If you are deploying inside a virtual environment
          * Activate the environment:
            cd <Deployment_dir> && . bin/activate
          * Use the regular manage script inside the virtual environment:
            manage start-agent

Have a nice day!
  • Rebuilding a release candidate rc4 and restarting the container. Here is the output:

=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
 - WMAgent Version            : 2.3.4rc4
 - WMAgent Release Cycle      : 2.3.4
 - WMAgent User               : cmst1
 - WMAgent Root path          : /data
 - WMAgent Host               : vocms0260.cern.ch
 - WMAgent TeamName           : testbed-vocms0260
 - WMAgent Number             : 0
 - WMAgent Relational DB type : mysql
 - Python  Version            : Python 3.8.16
 - Python  Module path        : /usr/local/lib/python3.8/site-packages
=======================================================

-------------------------------------------------------
Start: Performing basic_checks

Done: Performing basic_checks
-------------------------------------------------------

check_wmasecrets: Checking for changes in the WMAgent.secrets file
check_wmasecrets: No change found.
check_wmatag: This agent has been previously initialized. Checking for WMAgent version change since last run.
check_wmatag: Found version change since last run: 2.3.4rc4 vs. 2.3.4rc3
check_wmatag: Enforcing Runtime code check and copy if needed.
-------------------------------------------------------
Start: Performing checks for successful WMAgent initialisation steps...
WMA_BUILD_ID: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
wmaInitId: /data/srv/wmagent/2.3.4/config/.initRuntime
386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
WARNING: wmaInitId vs. wmaBuildId mismatch
-------------------------------------------------------
Start: Performing Host initialization steps
deploy_to_host: Linking the proper manage file from Config Area
deploy_to_host: Copy the Runtime scripts
_copy_runtime: Copying content from: /usr/local/lib/python3.8/site-packages/WMCore/WMRuntime to: /data/srv/wmagent/2.3.4/install/Docker/
_copy_runtime: Copying content from: /usr/local/etc/ to: /data/srv/wmagent/2.3.4/config/
_copy_runtime: Preserving the current WMA_TAG at the wma_init database
deploy_to_host: Initialize && Validate && Load WMAgent.secrets
deploy_to_host: Initialise Rucio config

Creating proxy  Done

Your proxy is valid until Thu May 30 23:43:09 2024
Done: Performing Host initialization steps
-------------------------------------------------------
_check_mysql: Checking whether the MySQL server is reachable...
_status_of_mysql:
Uptime: 558129  Threads: 2  Questions: 15660081  Slow queries: 0  Opens: 1688  Open tables: 63  Queries per second avg: 28.058
_status_of_mysql: MySQL connection is OK!
_check_mysql: Checking whether the MySQL schema has been installed
_sql_schema_valid: Checking the current SQL Database schema integrity.
_sql_dumpSchema: Dumping the current SQL schema of database: wmagent to /tmp/.wmaSchemaTmp
_sql_dbid_valid: Checking if the current SQL Database Id matches the WMA_BUILD_ID and hostname of the agent.
_sql_dbid_valid: OK: Database recorded and current agent's init parameters match.
_check_couch: Checking whether the CouchDB database is reachable...
_status_of_couch:
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"18f53118737ed74893055db0ffa972e2","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}
_status_of_couch: CouchDB connection is OK!
-------------------------------------------------------
Start: Performing activate_agent
Done: Performing activate_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing init_agent
init_agent: The agent has been properly initialized already.
Done: Performing init_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_tweakconfig
Done: Performing agent_tweakconfig
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_resource_control
agent_resource_control: Agent Resource control has been populated already.
Done: Performing agent_resource_control
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_resource_opp
agent_resource_opp: Agent Opportunistic Resource control has been populated already.
Done: Performing agent_resource_opp
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_upload_config
Done: Performing agent_upload_config
-------------------------------------------------------
-------------------------------------------------------
Start: Performing checks for successful WMAgent initialisation steps...
WMA_BUILD_ID: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
wmaInitId: 386a85ff43996bfa097ad213275fa5294bd3d9668eafd1f41aafeade18423a9a
OK

Docker container has been initialised! However you still need to:
  1) Double check agent configuration: less /data/[dockerMount]/srv/wmagent/current/config/config.py
  2) Start the agent by either of the methods bellow:
     a) From inside the already running container
          * Access the running WMAgent container:
            docker exec -it wmagent bash
          * Use the regular manage script inside the container:
            manage start-agent

     b) From the host - by restarting the whole container
          * Kill the currently running container:
            docker kill wmagent
          * Start a fresh instance of wmagent:
            ./wmagent-docker-run.sh -t <WMA_TAG> && docker logs -f wmagent

     c) If you are deploying inside a virtual environment
          * Activate the environment:
            cd <Deployment_dir> && . bin/activate
          * Use the regular manage script inside the virtual environment:
            manage start-agent

Have a nice day!
  • The initialization has been enforced only for one additional step, which is clearly visible in the logs:
check_wmatag: This agent has been previously initialized. Checking for WMAgent version change since last run.
check_wmatag: Found version change since last run: 2.3.4rc4 vs. 2.3.4rc3
check_wmatag: Enforcing Runtime code check and copy if needed.
...
_copy_runtime: Copying content from: /usr/local/lib/python3.8/site-packages/WMCore/WMRuntime to: /data/srv/wmagent/2.3.4/install/Docker/
_copy_runtime: Copying content from: /usr/local/etc/ to: /data/srv/wmagent/2.3.4/config/
_copy_runtime: Preserving the current WMA_TAG at the wma_init database
... 
  • The initialization process skips all other steps, e.g.:
-------------------------------------------------------
Start: Performing agent_resource_control
agent_resource_control: Agent Resource control has been populated already.
Done: Performing agent_resource_control
-------------------------------------------------------
  • The host mount area is linked in both cases only to the WMA_VER_RELEASE, and not to the full WMA_TAG:
cmst1@vocms0260:wmagent $ ll /data/dockerMount/srv/wmagent/

drwxr-xr-x. 5 cmst1 zh 4096 May 24 00:59 2.3.4
lrwxrwxrwx. 1 cmst1 zh   35 May 24 01:43 current -> /data/dockerMount/srv/wmagent/2.3.4

I think this accomplishes the desired behavior.
