Tier0 WMAgent deployment model for Alma9/RHEL9 #11890
Comments
Thanks for creating this issue, @amaltaro. Some updates on that: last year @LinaresToine and I shared a non-trivial effort to bring the T0 deployment tests up to the level where the WMAgent containers currently are. We have reached the point where the Oracle functionalities for the |
Logging the current state of this issue: since the two mandatory PRs (dmwm/CMSKubernetes#1409 and dmwm/CMSKubernetes#1451) needed for T0 to start their tests are already in a semi-final stage, we had a meeting yesterday between me, Andrea, German and Antonio to discuss the new initialization mechanisms of the agents, with a small hands-on demonstration from my side of how things currently work. During the demonstration I mostly used Docker containers for both CouchDB and WMAgent, because the agent initialization process is the same regardless of the deployment method, be it Docker or a virtual environment. We agreed on another meeting next week, by which time everything should be merged on our side, so that they can perform the hands-on activities themselves with my guidance. |
P.S. on the comment above: during the meeting we also agreed to start the proper communication with CERN IT in order to assess how safe it is to add the OS UIDs and GIDs to the Docker repositories, as explained in this comment: dmwm/CMSKubernetes#1412 (comment). I am about to start this communication today and will include the relevant people from our side. |
Again, logging the activity on this topic: today we had yet another meeting, during which the T0 team did the hands-on activities. We found several issues, such as:
Once the latter is debugged, we plan to have another meeting tomorrow. |
Logging the activities again. Today we had yet another long hands-on meeting between me, @LinaresToine and @anpicci. This time we had to resolve a few more bugs, such as a
A PR is coming to resolve all of these. In the end we managed to fully deploy and initialize the agent, but we failed to start the services because of the old FYI: @amaltaro |
A PR in the CMSKubernetes repository was created to fix the issues pointed out in the previous comment. See dmwm/CMSKubernetes#1457. I continue to look into the libcurl-to-pycurl issue pointed out by @todor-ivanov |
Hi @LinaresToine, I've left one comment in the PR. Please take a look. |
Hi @LinaresToine, while you are removing the line I asked about in the review, here is the solution to the backend SSL library mismatch in pycurl. You need to follow both of these steps:
|
Thank you very much @todor-ivanov. Regarding the PR, I will modify it right now. I will point out an observation in the PR itself. |
Hello @todor-ivanov. I created a script to keep track of all the steps taken up to the manage start-agent step. The script can be seen in dmwm/T0#4931. It is worth clarifying that I am sourcing the script, not executing it. From our meetings I remember that the couchdb-docker-build.sh step is not always necessary, but I include it for now. The script includes the debugging of the pycurl issue with the environment activated. I may be missing something, but I still get the same error message when running manage start-agent. I continue to look into it. |
Let's quickly meet again next week and try to resolve this together. There were also a few more highlights from @germanfgv yesterday about the corrections we need to apply to the |
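For readers following the pycurl discussion, here is a minimal diagnostic sketch, not the fix proposed in this thread. It assumes the usual curl-style version string (space-separated `name/version` tokens, as exposed by `pycurl.version` or printed on the first line of `curl --version`) and extracts the TLS backend name so that two builds can be compared. The helper name `ssl_backend`, the list of backend names and the sample strings are illustrative assumptions, not output taken from the T0 machines.

```python
# TLS library names that commonly appear in curl-style version strings
# (an illustrative, non-exhaustive list).
KNOWN_SSL_BACKENDS = ("openssl", "nss", "gnutls", "libressl", "wolfssl", "mbedtls")


def ssl_backend(version_string):
    """Return the lower-cased TLS backend found in a curl-style version string, or None."""
    for token in version_string.split():
        # Each token looks like "Name/1.2.3"; keep only the name part.
        name = token.strip("()").split("/", 1)[0].lower()
        if name in KNOWN_SSL_BACKENDS:
            return name
    return None


# Hypothetical sample strings, for illustration only.
sample_pycurl = "PycURL/7.45.2 libcurl/7.76.1 OpenSSL/3.0.7 zlib/1.2.11"
sample_system = "libcurl/7.61.1 NSS/3.79 zlib/1.2.11"

print(ssl_backend(sample_pycurl), ssl_backend(sample_system))

# If pycurl can be imported in the active environment, inspect the real string.
try:
    import pycurl
    print(ssl_backend(pycurl.version))
except ImportError as exc:
    # A backend mismatch typically surfaces here as an ImportError.
    print("pycurl import failed:", exc)
```

If the two backends differ, pycurl was most likely built against a different libcurl than the one found at run time, which is the kind of mismatch discussed above.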
@todor-ivanov @LinaresToine Can you give an update on this issue? Has there been any more progress since the meeting referenced above? |
Hello @klannon |
Hi @klannon, we plan to close this issue this week. There are a few more lines that need to go into those fixes, which will be faster if I add them myself. Then we will meet with @LinaresToine to complete the process in another hands-on meeting. |
After our long meeting yesterday with @LinaresToine, we ended up with another possible fix that addresses more than one problem: dmwm/CMSKubernetes#1466. There are still a few minor details to settle before we close it. I'll push more changes later tonight. |
Hi @amaltaro @anpicci @vkuznet, with our latest commits to dmwm/CMSKubernetes#1466 we (me, @LinaresToine and @germanfgv) managed to properly initialize a T0 agent on an Alma9 machine and to test this deployment process during our meeting today. Some new library issues popped up at the last minute when German decided to actually start a replay (he or Antonio will update the issue with the new error). But the biggest success here was that we managed to finish the deployment properly on top of all the other changes we made lately to the WMAgent containers. Please feel free to start your review of the PR in the CMSKubernetes repository. |
Hello all. As Todor mentions in the comment above, we were able to initialize and start the T0 agent successfully. After starting, the Tier0Feeder component of the agent displays the following error message:
|
Hi @LinaresToine @todor-ivanov, this seems to be an error in the XML file itself rather than a bug, right? In addition, the stack trace looks incomplete. |
Thank you @anpicci . The WMException ended there after the attribute getchildren was not found. However, I can include this portion also:
|
We are in the process of final tests here. I gave more details in my comment on the PR, where I asked @amaltaro and @anpicci for a final review: dmwm/CMSKubernetes#1466 (comment) |
The xml library removed the getchildren method in Python 3.9; it is still documented for 3.8: https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.getchildren |
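As a minimal sketch of a version-independent replacement for getchildren (not necessarily the change applied in the T0 code), the direct children of an element can be obtained by iterating over it or wrapping it in `list()`. The XML snippet and the helper name `children_of` below are made up for illustration.

```python
import xml.etree.ElementTree as ET


def children_of(element):
    """Return the direct children of an element.

    list(element) is the documented replacement for Element.getchildren(),
    which was removed in Python 3.9, and it also works on older versions.
    """
    return list(element)


# Hypothetical XML document, only to exercise the helper.
root = ET.fromstring("<run><stream>Express</stream><stream>Physics</stream></run>")

names = [child.text for child in children_of(root)]
print(names)
```

Iterating directly (`for child in element:`) is equivalent when a list is not needed.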
Impact of the new feature
Tier0 WMAgent
This issue may be closed by: dmwm/CMSKubernetes#1466
Is your feature request related to a problem? Please describe.
This can potentially become another meta-issue, but it's important to start tracking the Tier0 requirements such that we can properly plan for an impending migration of the Tier0 WMAgent stack, together with condor schedd migration to Alma9 OS.
This goes along the plans we have for central production agent, tracked in this meta-issue: #11314
but we need to clarify whether the Tier0 agrees with the same deployment model and/or what changes are required for the T0 environment.
Describe the solution you'd like
For now, central production WMAgent deployment will be:
wmagent package to PyPi
Describe alternatives you've considered
None
Additional context
See containerization meta-issue: #11314
Depends on: #11981
Depends on: #11982
Depends on: #12010
Depends on: #12013