Running nova_plugin.server.create twice leaks servers #51
If the nova_plugin.server.create operation is run twice on the same node for any reason (e.g. you rerun a partially-failed install workflow), then a new server is created and the existing server is leaked: it is no longer known to Cloudify and will not be cleaned up.

Running the operation twice should either be idempotent or fail (idempotency would be nicer!).

Comments
Hi, just to make sure I understand: you would like Cloudify to identify that a server was already created for a particular node instance and reuse that server in that case? If you are referring to cleanup, then yes, we currently have an issue with re-running failed workflows. The way to handle it for now is to run the "opposite" workflow before running the original one again. So if you run uninstall before running install again, it would have cleaned up that server.
If create is called while there are already runtime properties on the node instance pointing to a VM, and that VM still exists, then create should definitely not create a new VM, overwrite the existing properties, and lose all knowledge of the old (still running) VM. It would be nice for it to just return quietly in this case (much like how calling start on an already-started server is harmless), but failing would be acceptable too. The silently-leaks-resources part is the real problem at the moment.

What I'm trying to achieve is an "install/repair the system" workflow that always tries to progress from the current state towards the fully-installed state. Ideally that would be the built-in "install" workflow, but to support this we at least need lifecycle operations that don't leak resources unpredictably.

Cleanup and rollback are complex problems; I don't have a particular opinion there, other than to point out that it's really hard to make aggregate operations like workflows atomic in the face of failure, so you're going to have to deal with partially-created states at some point. (And of course things like VM failure can result in similar-looking states.)
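To make the ask concrete, here is a minimal sketch of such a guard. This is not the plugin's actual code: it assumes an authenticated python-novaclient handle and a Cloudify operation context, the runtime-property key `external_id` and the function signature are illustrative, and the `ctx` attribute layout may differ between Cloudify versions.

```python
# Minimal sketch of an idempotent create guard; not the plugin's real code.
# Assumptions: `nova_client` is an authenticated python-novaclient handle,
# `ctx` is a Cloudify operation context, and the runtime-property key
# 'external_id' is an illustrative stand-in for whatever the plugin stores.
from novaclient import exceptions as nova_exceptions


def create_server(nova_client, ctx, name, image, flavor):
    existing_id = ctx.instance.runtime_properties.get('external_id')
    if existing_id is not None:
        try:
            nova_client.servers.get(existing_id)
            # The recorded server still exists: return quietly rather than
            # creating a duplicate and leaking the old one.
            ctx.logger.info('Server %s already exists; skipping create.'
                            % existing_id)
            return
        except nova_exceptions.NotFound:
            # Stale pointer: the recorded server is gone, so it is safe
            # to create a fresh one below.
            ctx.logger.warn('Recorded server %s not found; recreating.'
                            % existing_id)
    server = nova_client.servers.create(name, image, flavor)
    ctx.instance.runtime_properties['external_id'] = server.id
```

The "fail" variant the issue allows for would simply raise a NonRecoverableError at the point where this sketch returns quietly.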
BTW, if you have such an "install/repair" workflow, then healing from failures is a bit simpler: tear down (reset) all failed node instances and anything contained within them, then run the install/repair workflow.
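As a rough illustration of that recipe only (every callable here is a hypothetical hook, not Cloudify's API):

```python
# Illustrative pseudologic for the healing recipe above; these callables are
# hypothetical hooks, not Cloudify's API. Given a containment map
# (child -> parent), tear down every failed instance together with everything
# contained in it, then run one idempotent install/repair pass.
def heal(instances, contained_in, is_failed, teardown, install_or_repair):
    def subtree(root):
        # root plus everything transitively contained in it
        members = {root}
        grew = True
        while grew:
            grew = False
            for child, parent in contained_in.items():
                if parent in members and child not in members:
                    members.add(child)
                    grew = True
        return members

    to_reset = set()
    for inst in instances:
        if is_failed(inst):
            to_reset |= subtree(inst)

    def depth(inst):
        d = 0
        while inst in contained_in:
            inst = contained_in[inst]
            d += 1
        return d

    # Tear down the most deeply contained instances first.
    for inst in sorted(to_reset, key=depth, reverse=True):
        teardown(inst)

    # One pass of the idempotent install/repair workflow finishes the job.
    install_or_repair()
```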
I agree with what you said about cleanup and rollback being complex problems; we have already started discussions around this area. Regarding what you are trying to achieve, I would differentiate repair from install: repair implies that something was broken, whereas re-running the install workflow does not mean something is broken, only that something was not installed properly, and here idempotency is very useful, as you mentioned. As for this repair workflow, we have actually already started developing it. It is currently in the testing phase and will be available in the 3.2 release. It does exactly what you described, plus of course executing all the necessary relationship operations. This workflow can be integrated with our monitoring system to achieve automatic healing.
The repair workflow is probably always going to be a superset of the install workflow (consider the "everything fails" case).
Right, it's not just a superset; in the current implementation it will also contain the uninstall/teardown workflow. In any case, I see your point about the OpenStack plugin create operation; I will write back here soon about this.