Tier0 WMAgent deployment model for Alma9/RHEL9 #11890

Closed
amaltaro opened this issue Feb 6, 2024 · 21 comments · Fixed by dmwm/CMSKubernetes#1466

Comments

@amaltaro
Contributor

amaltaro commented Feb 6, 2024

Impact of the new feature
Tier0 WMAgent

This issue may be closed by: dmwm/CMSKubernetes#1466

Is your feature request related to a problem? Please describe.
This can potentially become another meta-issue, but it's important to start tracking the Tier0 requirements such that we can properly plan for an impending migration of the Tier0 WMAgent stack, together with condor schedd migration to Alma9 OS.

This goes along with the plans we have for the central production agents, tracked in this meta-issue: #11314
but we need to clarify whether Tier0 agrees with the same deployment model and/or what changes are required for the T0 environment.

Describe the solution you'd like
For now, central production WMAgent deployment will be:

  • package and upload the wmagent package to PyPI
  • build WMAgent Docker images based on https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi/wmagent and the PyPI WMAgent package.
  • assume condor_schedd will be deployed on the host
  • For MariaDB (Fermilab agents), run it from a Docker container
  • For Oracle, there is no need to run any service, however we do need to have an up-to-date tnsnames.ora file
  • (docker) compose all these images together

Describe alternatives you've considered
None

Additional context
See containerization meta-issue: #11314
Depends on: #11981
Depends on: #11982
Depends on: #12010
Depends on: #12013

@todor-ivanov
Contributor

todor-ivanov commented Feb 9, 2024

Thanks for creating this issue @amaltaro

Some updates here on that.

There was a non-zero effort shared between me and @LinaresToine last year to bring the T0 deployment tests up to the level where the WMAgent containers currently are. We have reached the point where the Oracle functionality for the manage scripts and the CouchDB container are needed. Once we resolve #11720 and #11312, the T0 team will be able to fully run T0 agents either from a WMAgent container or directly from the OS, using the script we provide here: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent-venv.sh (provided the underlying OS is alma9, rhel9, or an equivalent one supporting Python 3.8.16 or higher). Last year, during our meetings, we tracked down all the needed package equivalences between distributions and the OS dependencies. Hopefully Antonio still keeps a record of the list :)

FYI @germanfgv @LinaresToine

@todor-ivanov todor-ivanov self-assigned this Mar 8, 2024
@todor-ivanov
Contributor

Logging the current state of this issue:

Yesterday we took advantage of the fact that the two mandatory PRs (dmwm/CMSKubernetes#1409 and dmwm/CMSKubernetes#1451) needed for T0 to start their tests are already in a semifinal stage, and had a meeting between me, Andrea, German and Antonio to discuss the new initialization mechanisms of the agents, with a small hands-on demonstration from my side of how things currently work. During this process I used mostly Docker containers for both CouchDB and WMAgent, because the initialization process for the agent is immutable and independent of the deployment method, be it Docker or a virtual environment.

We agreed on another meeting next week, by which time we should have everything merged from our side, so that they can perform the hands-on activities themselves with guidance from my side.

@todor-ivanov
Contributor

P.S. to the comment above.

During the meeting, we also agreed to start the proper communication with CERN IT in order to assess how safe it is to add the OS UIDs and GIDs to the Docker repositories, as explained in this comment: dmwm/CMSKubernetes#1412 (comment)

I am about to start this communication today, and will include the proper set of people involved from our side.

FYI: @amaltaro @vkuznet @khurtado @anpicci

@todor-ivanov
Contributor

Again for logging the activity on this topic:

Today we had yet another meeting, during which the T0 team was doing the hands-on activities. We found several issues:

  • Additional account management had to happen directly in the scripts. This will be fixed once we add the T0 account to the CMSKubernetes repository, but for that we have to wait for a reply from CERN IT.
  • The CouchDB connection between the agent and the CouchDB instance in the Docker container failed.

Once the latter is debugged, we plan to have another meeting tomorrow.

FYI: @amaltaro @germanfgv @LinaresToine @anpicci

@todor-ivanov
Contributor

Logging the activities again.

Today we had yet another long hands-on meeting between me, @LinaresToine and @anpicci. This time we had to resolve a few more bugs, such as:

  • Mismatch between T0 and Production agent configurations: T0 agents do not configure an ACDC server
  • Typo in fetching Team name from WMAgent.secrets file
  • Missing Tier0.* account handling in init.sh script
  • Skipping agent config upload step for T0 agents

A PR is coming for resolving all of these.

At the very end we managed to fully deploy and initialize the agent. But we failed to start the services because of the old libcurl - to - pycurl backend mismatch (nss vs. openssl). @LinaresToine is going to try to resolve this issue later, so they can start testing the deployment of the T0 related packages and eventual injections.

FYI: @amaltaro

@LinaresToine

A PR in the CMSKubernetes repository was created to fix the issues pointed out in the previous comment. See dmwm/CMSKubernetes#1457.

I continue to look into the libcurl - to - pycurl issue pointed out by @todor-ivanov

@todor-ivanov
Contributor

Hi @LinaresToine, I've left one comment in the PR. Please take a look.

@todor-ivanov
Contributor

Hi @LinaresToine, while you are removing that line I asked about in the review, here is the solution for the backend SSL library mismatch in pycurl. You need to follow both of these steps:

  • At the Host:
[root@vocms0290 data]# yum install curl-openssl libcurl-devel libcurl-openssl-devel
  • Inside the Virtual Environment:
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip uninstall pycurl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ export PYCURL_SSL_LIBRARY=openssl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip install --no-cache-dir --global-option=build_ext --global-option="-L/usr/local/opt/openssl/lib" --global-option="-I/usr/local/opt/openssl/include"  pycurl
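
To confirm that the rebuild picked up the intended backend, one option is to inspect the version string that PycURL reports. This is only a minimal sketch: `pycurl.version` is the module's real attribute, while the helper name `ssl_backend` and the sample version string in the comment are illustrative assumptions, not part of the WMAgent scripts.

```python
import re


def ssl_backend(version_string):
    """Return the SSL library named in a PycURL/libcurl version string, or None."""
    # A typical value looks like (illustrative, not taken from this host):
    #   "PycURL/7.45.2 libcurl/7.76.1 OpenSSL/3.0.7 zlib/1.2.11"
    match = re.search(r"\b(OpenSSL|NSS|GnuTLS)/", version_string)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    try:
        import pycurl
    except ImportError as exc:
        # A compile-time vs. link-time backend mismatch typically surfaces here.
        print("pycurl failed to import:", exc)
    else:
        print("pycurl SSL backend:", ssl_backend(pycurl.version))
```

Running this inside the activated virtual environment should report `openssl` once both steps above have taken effect.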

@LinaresToine

Thank you very much @todor-ivanov. Regarding the PR, I will modify it right now. I will point out an observation in the PR itself.
Regarding the backend solution, I did try the steps you mention inside the virtual environment. I was missing out on the step in the host. Thank you for your insight!!

@LinaresToine

Hello @todor-ivanov. I created a script to keep track of all the steps taken until the manage start-agent step. This script can be seen in dmwm/T0#4931. Worth clarifying that I am sourcing the script, not executing it.

From our meetings I remember the couchdb-docker-build.sh step is not always necessary, however I include it for now.

The script includes the debugging of the pycurl issue with the environment activated. I may be missing something, but I still get the same error message upon running manage start-agent. I continue to look into it.

@todor-ivanov
Contributor

todor-ivanov commented Mar 20, 2024

hi @LinaresToine

"I continue to look into it."

Let's quickly meet again next week and we will try to resolve this together.

There were also a few more highlights from @germanfgv yesterday, about the corrections we need to apply to the manage script in order to avoid bringing the ACDC related variables into the T0 configuration.

@klannon

klannon commented Apr 15, 2024

@todor-ivanov @LinaresToine Can you give an update on this issue? Has there been any more progress since the meeting referenced above?

@LinaresToine

Hello @klannon
Tier 0 recently got an Alma 9 machine, allowing us to successfully deploy the new agent. That is an update regarding the issue with libcurl and pycurl backend misconfigurations. Regarding the fixes in the init and manage scripts, perhaps @todor-ivanov can give an update on that? Such fixes are documented in dmwm/CMSKubernetes#1457

@todor-ivanov
Contributor

Hi @klannon, we plan to close this issue this week. There are a few more lines that need to get into those fixes, which will be faster if I make them myself. Then we will meet with @LinaresToine to complete the process in another hands-on meeting.

@todor-ivanov
Contributor

todor-ivanov commented Apr 19, 2024

So, after our long meeting yesterday with @LinaresToine, we ended up with another possible fix addressing more than one problem: dmwm/CMSKubernetes#1466

But there are still a few minor details to sort out before we close it. I'll push more changes later tonight.

@todor-ivanov
Contributor

todor-ivanov commented May 2, 2024

Hi @amaltaro @anpicci @vkuznet, with our latest commits to dmwm/CMSKubernetes#1466 we (me, @LinaresToine and @germanfgv) managed to properly initialize a T0 agent on an alma9 machine and to test this deployment process during our meeting today. Some new library issues popped up at the last minute when German decided to actually start a replay (he or Antonio will update the issue with the new error). But the biggest success here was that we managed to finish the deployment properly on top of all the other changes we made lately to the WMAgent containers. Please feel free to start your review of the PR in the CMSKubernetes repository.

@LinaresToine

Hello all. As Todor mentions in the comment above, we were able to initialize and start the T0 agent successfully. After starting, the Tier0Feeder component of the agent displays the following error message:

ERROR:StdBase:About to raise exception <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'
ClassName : None
ModuleName : WMCore.WMSpec.WMWorkloadTools
MethodName : _validateArgFunction
ClassInstance : None
FileName : /data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py
LineNumber : 139
ErrorNr : 0

Traceback:
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
if not valFunction(value):

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in <lambda>
"CMSSWVersion": {"validate": lambda x: x in releases(),

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
return TC.releases(arch)

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
for row in self.data():

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
for row in xml_parser(data, pkey):

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
get_children(elem, event, row, key)

File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
for child in elem.getchildren():

@amaltaro @vkuznet @todor-ivanov @anpicci @germanfgv

@anpicci
Contributor

anpicci commented May 6, 2024

Hi @LinaresToine @todor-ivanov, this looks like an error with the XML file itself rather than a bug, right? In addition, the stack trace looks incomplete.

@LinaresToine

LinaresToine commented May 6, 2024

Thank you @anpicci . The WMException ended there after the attribute getchildren was not found. However, I can include this portion also:

2024-05-02 19:02:18,253:139701761603136:ERROR:Tier0FeederPoller:Can't configure for run 359688 and stream Calibration
Traceback (most recent call last):
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
if not valFunction(value):
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in <lambda>
"CMSSWVersion": {"validate": lambda x: x in releases(),
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
return TC.releases(arch)
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
for row in self.data():
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
for row in xml_parser(data, pkey):
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
get_children(elem, event, row, key)
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
for child in elem.getchildren():
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 944, in masterValidation
validateArgumentsCreate(schema, argumentDefinition)
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 276, in validateArgumentsCreate
_validateArgumentOptions(arguments, argumentDefinition, "optional")
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 160, in _validateArgumentOptions
arguments[arg] = _validateArgument(arg, arguments[arg], argDef)
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 101, in _validateArgument
_validateArgFunction(argument, value, argumentDefinition["validate"])
File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 139, in _validateArgFunction
raise WMSpecFactoryException(str(ex))
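
The two tracebacks above are linked by Python's implicit exception chaining: raising a new exception inside an `except` block keeps the original one as `__context__`, which is why the WMException message only repeats the text of the underlying AttributeError. A minimal sketch of that pattern follows; `SpecError`, `run_validator` and `broken_check` are hypothetical stand-ins, not WMCore code.

```python
class SpecError(Exception):
    """Hypothetical stand-in for a wrapper exception such as WMSpecFactoryException."""


def run_validator(value, check):
    """Call a validation function and wrap any failure in SpecError."""
    try:
        return bool(check(value))
    except Exception as ex:
        # Raising here chains implicitly: the original exception is stored
        # in __context__ and printed first in the combined traceback.
        raise SpecError(str(ex))


def broken_check(_value):
    # Mimics calling a method that no longer exists on the object.
    return object().getchildren()


try:
    run_validator("some-release", broken_check)
except SpecError as err:
    print("wrapper message:", err)
    print("original exception type:", type(err.__context__).__name__)
```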

@todor-ivanov
Contributor

We are in the process of final tests here. I gave more details in my comment on the PR, in which I called on @amaltaro and @anpicci for final review: dmwm/CMSKubernetes#1466 (comment)

@LinaresToine

The xml.etree.ElementTree library removed the getchildren() method in Python 3.9 (it had been deprecated since 3.2): https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.getchildren
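
A version-independent replacement is to iterate over the element itself, which yields the same direct children. The sketch below only illustrates the idea: the helper name `children_of` and the sample XML are made up for the example, and this is not necessarily how the fix was applied in WMCore's XMLUtils.py.

```python
import xml.etree.ElementTree as ET


def children_of(elem):
    """Return the direct children of an Element on any Python 3 version."""
    # Element.getchildren() no longer exists in Python 3.9+; iterating the
    # element (or calling list() on it) works on older versions as well.
    return list(elem)


# Illustrative document, not the real TagCollector payload.
root = ET.fromstring("<releases><arch name='el9'/><arch name='el8'/></releases>")
print([child.get("name") for child in children_of(root)])
```

With this approach, a loop such as `for child in elem.getchildren():` becomes simply `for child in elem:`.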
