-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor WMAgent initialization #1410
Conversation
hi @amaltaro, As the last time, I have no rights in this repository to ask for a review. I do not know why. But anyway.. Things are starting to get shape now. Feel free to look at the code at you convenience. |
@todor-ivanov please consider adding a clear description of this pull request. |
hi @amaltaro I am adding the description in a while. |
@amaltaro , Sorry it took a while but I am done with the update of this PR description now. Please read it again. |
docker/pypi/wmagent/README.md
Outdated
|
||
|
||
### List of all config/.init* files | ||
* config/.initActive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, why do you name all init files with dot prefix by making them hidden? Is there any concern about their visibility? Second question, where these files? and what is their structure? Which env/and other setup parameters they set. Finally, is there any order in their initialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well..., this has been like this before. And mostly, having them hidden prevents them from messing up with the rest of the config and stateful files mounted at the host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where these files
Previously those were scattered all over the WMAgent
deployment area, depending to which step they were referring to. Now they are all concentrated only in the config
directory. Meaning one can find all of them there:
- From inside the container at
/data/srv/wmagent/<WMA_TAG>/config/.init*
- From the host at
/data/dockerMount/srv/wmagent/<WMA_TAG>/config/.init*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which env/and other setup parameters they set.
They hold only one variable - $WMA_BUILD_ID
- which is the WMAgent image unique identifier set at build time
Finally, is there any order in their initialization?
yes, there is. I think I have listed them in the README in thee order of initialization. if I have not, I will fix that
``` | ||
[cmst1@unit02:current]$ docker kill wmagent | ||
|
||
[cmst1@unit02:current]$ rm /data/dockerMount/srv/wmagent/<WMA_TAG>/config/.init* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is unclear to me why .init
files should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general mechanism is as follows:
On every container restart, the run.sh
script goes and makes a full check of all .init * files if they:
- are present and
- are containing the same
wma_build_id
if one of them is either missing or holding a differentwma_build_id
it means the step has never been properly completed and needs to be performed during this restart (and eventually any other step depending on it)
Same for force intialization. If one (for any reason) would want to repeat an initialization step togather with all it's dependent steps, he just need to delete the proper .init file.
Hi @vkuznet, |
Hi @amaltaro, @vkuznet, @khurtado, A Major update on this.... p.s. I do keep the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I went through this PR multiple times and it is very difficult to follow due to large number of changes. I mostly paid attention to high level logic which seems fine to me. Said that, I still do not understand why did you remove couch and maria DB configs from pypi/wmagent/etc area as they are needed to start corresponding DB. My feedback on this PR is the following. If you managed it to run it we should merge and iterate on missing or misbehaving parts and manage them in smaller chunks. But if @amaltaro will insist on detailed review of every single line then I bet it will require few hours of careful screenining and I still we may easily miss many things due to large impact/changes to lots of different places. The more productive approach it seems to me to apply it to WMA deployment and verify how it will go.
Thanks @vkuznet for your additional look into this PR. I agree this is a big one, and yes it does change a lot of things, and yes it does work well. I am already using this setup for my constant debugging in vocms0260. People may login there, connect to the container and play around so they can get used to the environment. As of:
Those are not needed in the This later is true at the very least for CouchDB. About the SQL
So long story short: other than which are the correct scripts from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, in addition to the comments made along the code, I wanted to leave more general questions/concerns here:
- Please make a PR against the WMCore repo updating the 2 config/secrets file under: https://github.com/dmwm/WMCore/tree/master/deploy
- Please create a new WMCore GH issue reporting what are the Oracle functionalities that are pending implementation (list any MariaDB pending functionalities as well please)
- As mentioned along the code, the new relational database table - for container management - should be defined in the WMCore repository, likely under: https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/Agent/Database
- We can address it in 2 ways, update this PR and create the relevant WMCore PR; Or create yet another WMCore ticket to re-refactor this. I would prefer the former, but given the changes to inject issues into this PR, maybe we should stick to the latter.
- In addition, I don't think it is a good idea to use case sensitive table/column names, as both Oracle and MariaDB are not case sensitive. The best is to use underscore in the variable names, e.g.
init_param
.
- Concerning the manage-common.sh script. Right now (RPM) we need to source the "init.sh" agent file in order to get sqlplus/mysqldump/etc in the environment. If a similar case is needed for the container, then please update the top of that script with the needed information.
- As previously discussed, if it is simple enough, please provide a data flow diagram (even in ASCII will be fine) with the order of scripts executed in a standard container initialization.
- Is there any functionality currently present in https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent.sh which is not yet in run.sh? I guess one of those is the addition of HPC resources, but there could be others. I would suggest to make a GH issue for this.
docker/pypi/wmagent/init.sh
Outdated
|
||
echo "Start sleeping now ...zzz..." | ||
|
||
while true; do sleep 10; done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line just so the container keeps running? If that is the case, is there any reason to keep the start-agent line below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this line is just for keeping the container alive. (Which BTW leads to another change in the wmagent_docker_run.sh
, but I will fix that one separately).
Regarding the line bellow - we better remove it indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , I've always been with the impression the names of those two scripts should be swapped. And now as I look at the ASCII diagram showing script relations, this impression only solidifies. But I did not proceed with the renaming because this would smear all the changes and the Review would become practically impossible... I may suggest, once we are ready for merging as a last step I should swap the names of those two scripts in a single commit separate from the rest. What do you think about it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , Sorry if it was not clear enough - I was talking about swapping the names of: init.sh
<--> run.sh
.
Such that run.sh
gets back as a default entry point for Docker and just calls init.sh
and goes into an endless loop, while init.sh
drives the whole initialization process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@todor-ivanov apologies for the delay. You have a better view of this than me, so please go ahead if you consider it better. If you do so though, please make sure the README file reflects it as well.
@@ -176,7 +189,7 @@ echo "Start $stepMsg" | |||
chmod +x $WMA_DEPLOY_DIR/deploy/renew_proxy.sh $WMA_DEPLOY_DIR/deploy/restartComponent.sh | |||
|
|||
crontab -u $WMA_USER - <<EOF | |||
55 */12 * * * $WMA_DEPLOY_DIR/deploy/renew_proxy.sh | |||
55 */12 * * * $WMA_MANAGE_DIR/manage renew-proxy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is where the continuous proxy renewal process is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the reason why it lives in install.sh
instead of run.sh
, is because it is a predefined logic independent of any runtime parameter. So no need to be part of the init
process at all. It is is easily definable at build time. I may suggest while you are looking to the rest of the PR, if you notice there is more of the logic, which is independent of any runtime parameters or initial configuration parameters, to point them and we can make the effort to move them here as well.
@@ -52,15 +52,15 @@ done | |||
|
|||
dockerOpts=" --network=host --progress=plain --build-arg WMA_TAG=$WMA_TAG " | |||
|
|||
docker build $dockerOpts -t wmagent:$WMA_TAG -t wmagent:latest . | |||
docker build $dockerOpts -t local/wmagent:$WMA_TAG -t local/wmagent:latest . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please educate me on why prefixing it with local
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this line based on experience with with coucdb
container. Because there was an upstream repository called couchdb
which is actually the official one.
Same here - it would be good during local builds and tests one to distinguish whether he/she runs from cmsweb/wmagent
or local/wmagent
.
Hi @amaltaro, Thanks for this Review. I think I have addressed all requested code changes, but let me try to answer your general comments one at a time:
Here it is: dmwm/WMCore#11717
Here it is: dmwm/WMCore#11720
Here I'd chose your second suggestion - to create a new issue just for moving the meta data table creation under wmcore repository . Here it is: dmwm/WMCore#11721
With this implementation
Everything is supposed to be already in the environment, so as long as one sticks to the usage of the
This is also explained in the README file.
Here it is:
And it was also added to the README file as well.
Here it is: dmwm/WMCore#11722 Alan, please feel free to look into this PR again, while I am working on preparing the new set of issues. |
@amaltaro BTW: There was only one remark I could not find what it relates to:
Originating from the same reasoning I intentionally used only names of tables and columns with underscores instead of case sensitive naming. If you have found such please let me know so I can fix that, but I personally double checked and found none.
|
@todor-ivanov I don't understand your statement above. Did you keep:
or not? My suggestion was to actually rename them to init_param and init_value. |
That's easy enough. I thought you were left with the impression those names are following some case sensitive naming convention e.g. |
docker/pypi/wmagent/run.sh
Outdated
fi | ||
[[ $errVal -eq 0 ]] && echo $WMA_BUILD_ID > $WMA_CONFIG_DIR/$service/.dockerInit | ||
done | ||
# TODO: This is to be removed once we decide to run it directly from the deploy area |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro I'd like to draw your attention to these two lines bellow. To me these are obsolete. I kept them just for legacy reasons, but the way how things are currently organized, there is no need to copy the manage script from the $WMA_DEPLOY_DIR to $WMA_MANAGE_DIR both are listed in the agent's PATH, so the script would be easily run from any of those locations. The only difference is that if we use it from $WMA_DEPLOY_DUR it would be run from the docker image, and would never show up in the host mount area ($WMA_MANAGE_DIR is equivalent currently to $WMA_CONFIG_DIR). The current situation gives us a little bit of freedom, though - we have the ability to test some new changes or fix bugs to the manage
script without the need to build a new container every time. And this was the only reason I have not removed this step at once. I'd like to hear your opinion on this one as well.
@amaltaro @vkuznet @khurtado , Yet another report of tests here: As you know, I am already using this setup at But today I also released the two paused workflows from the local queue and let them propagate down to condor. And here is the result:
I couldn't be happier...... Hope you like it as well. |
@todor-ivanov we have 38 commits in this PR, which will make our life miserable if we need to go through the history of these files. On the other hand, I'd rather not risk squashing commits and losing developments along the way. Please let me know how you would like to proceed with the merge of this PR. Note that it is not ready yet and there is one change that you wanted to make (swap run.sh and init.sh). So let us wait for that one. |
8665806
to
b6bb29e
Compare
Thanks for taking yet another look @amaltaro
Well I do not think there is any groundbreaking history in this development that would be lost and we will not be able to restore to a previous state if we need. So I went and merged those 38 commits in 2 - one for code change and one for README updates.
|
@todor-ivanov okay, please let me know when it's ready for a final look. |
91c88d1
to
ec7349c
Compare
hi @amaltaro |
@todor-ivanov for your last commit to renew proxy once it is below 24h. Please change it to 7 days. |
40675ae
to
60b72bd
Compare
hi @amaltaro
Done |
Thanks Todor. Please squash the commits accordingly and we can proceed with merging it. |
… to HOST_MOUNT_DIR && Remove HOSTADMIN mount point. Typo Add all files needed for the new initialisation logic. Remaking the interaction between run and manage scripts Move *-new files as defaults && Remove unneded files. Move _renew_proxy to manage-common && Add manage renew-proxy command && Fix broken _parse_wmasecrets call && Fix broken aliases && Strip login variables and exports from WMA_ENV_FILE Remove redundant parameter to db-prompt && Fix password typos Fix missing return value from init-agent Add database status functions Fix forgotten return from bad WMAgent.secrets parsing Start building the full database checks mechanisms before agent startup Move _status_of_couch to manage-common && call it from run.sh Add rough, and currently broken, mechanism for databse validation Add check for empty database for mysql && reduce wmagent-docker* paramters && improve all functions' printout messages Implement clean-mysql && Reduce global definitions in manage script Remove wmaInitSqlDB when cleaning the whole agent as well. Reorder database checks and agent activation && Fix COUCH_HOST to be read from WMAgent.secrets && Improve printouts Write WMA_BUILD_ID and hostname at the sql database. Implement _sql_dbid_valid Move wmagent database creation to init-agent Stop cleaning databases on WMAgent.secrets file change && Fix RUCIO_HOME && Fix default wmaSchemaFile Implement a single function _exec_mysql to use for all calls to mysql client Fix _renew_proxy Switch -v to -t option in install.sh. Remove unneeded check_services function && Start the agent on a restart upon full intialization Cleanup manage script from old comment lines Fix sql schema dump options Add better pager options to the mysql db-prompt Addressing review comments Fix renamed variables from MYSQL_ADDRESS to MYSQL_HOST Rename wma_init columns Fix broken tree at the host for condor jobs. Fix broken tree at the host for condor jobs one level down - at /data/srv/wmagent instead of /data/srv .
Update README Update README Update README Update README Update README Adding ASCII diagram to the README file && Cleaning comments && Removing a redundant base check. Update README
Renew myProxy 7 days before expiration Renew myProxy 7 days before expiration
60b72bd
to
4acb39b
Compare
hi @amaltaro
Done. I left the commit for swapping |
Thanks Todor. |
fixes dmwm/WMCore#11627
READY
With the current PR few modifications are suggested:
Start reading every parameter needed for the agent initialization process only from
WMAgent.secrets
file and create the config file based only on those (avoid reading any parameter from the command line)Removes the stage of initializing stuff to the container at runtime (like copying the
WMAgent secrets
file every time from the host to the docker image and working only with the copy).Removes the HostAdmin mount point and leaves the container to interact directly with the so provided
WMAgent.secrets
file.Improves
WMAgent.secrets
validation, parsing and loading mechanism.Improves the container runtime wrapper, to properly create current links to the actual wmagent tag to be used in the host mount area.
Adds a file with common definitions of functions between
manage
andrun.sh
.Unifies the initialization steps validation between
manage
andrun.sh
in a single mechanism.Splits actions between the
run.sh
andmanage
scripts such thatrun.sh
takes care of all initialization and validation markers creation and keeping track of, while the othermanage
acts just as a sole executor of management commands and only checks if a given management action is allowed or not.Adds fine grained initialization markers on every initialization step rather than on full areas as it was before
Removes most hard coded paths and rely only on the relational paths created at the Docker file stemming from a single
$WMA_ROOT_DIR
Expose everything from the runtime on proper (extended) set of mount points.
The above is done in parallel with reorganization of the directory structure of the agent, such as:
*
/data/srv/wmagnet/current/install/{wmagent,myslq,couchdb}
or*
/data/srv/wmagent/current/config/{wmagent,myslq,couchdb}
*
/data/srv/wmagent/current/logs/{wmagent,myslq,couchdb}
*
/data/srv/wmagent/current/state/{wmagent,myslq,couchdb}
Because now every service is supposed to run in a separate container with it's won mount point at the host, so we now need the following structure:
*
/data/srv/wmagnet/current/{install,config,logs,state}
and*
/data/srv/couchdb/current/{install,config,logs,state}
and*
/data/srv/mysql/current/{install,config,logs,state}
wmagent-couch
andwmagent-mysql
containers as wellImproves variable naming convention by adding
WMA_
prefix to anything related to the agent container and environmentSimplifies some manage functionalities and fixes others.
Adds minimal help to the manage script.
Adds automatic initialization of the whole environment at login time - improves operations. E.g
manage
script usage now is as simple as: [1]Removes management functions related to databases manipulation from the
manage
script and prepares them to be adopted in themanage
scripts for the relevantCouchDB
andMariaDB
docker container. (I've already prepared a big portion of them, so half of the future work done as well.)Removes socket connections to the SQL database and connects only through
127.0.0.1
Improves status checks for all databases both Relational and CouchDB.
Adds checks of databases directly for both Mysql and Oracle, rather than, as of now, by simply tracking the Mysql server process (this PIDFile is not reachable anymore).
[1]