
Running nova_plugin.server.create twice leaks servers #51

Open
mutability opened this issue Feb 4, 2015 · 6 comments

Comments

@mutability

If the nova_plugin.server.create operation is run twice on the same node for any reason (e.g. you rerun a partially-failed install workflow) then a new server is created, and the existing server is leaked - it is no longer known to Cloudify and will not be cleaned up.

Running the operation twice should either be idempotent or fail (idempotency would be nicer!).
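
For illustration, a minimal sketch of the fail-fast variant could look like the following. This is not the actual nova_plugin code: the runtime-property key name and the overall structure are assumptions.

```python
# Sketch only, not the actual nova_plugin implementation.
# Assumes the create operation records the booted server's id in an
# (assumed) runtime property so a re-run can detect it.
from cloudify import ctx
from cloudify.decorators import operation
from cloudify.exceptions import NonRecoverableError

SERVER_ID_KEY = 'server_id'  # assumed property name


@operation
def create(**kwargs):
    existing_id = ctx.instance.runtime_properties.get(SERVER_ID_KEY)
    if existing_id:
        # Refuse to boot a second VM for the same node instance rather
        # than silently leaking the one that is already recorded.
        raise NonRecoverableError(
            'Server {0} was already created for this node instance; '
            'refusing to create another one'.format(existing_id))
    # ... normal creation path: boot the server, then record its id, e.g.
    # ctx.instance.runtime_properties[SERVER_ID_KEY] = new_server.id
```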

@iliapolo
Contributor

iliapolo commented Feb 4, 2015

Hi, just to make sure I understand: you would like Cloudify to identify that a server was already created for a particular node instance and use that server in that case?

If you are referring to cleanup, then yes, we currently have an issue with re-running failed workflows. The way to handle it for now is to run the "opposite" workflow before running the original one again: if you run uninstall before running install again, that server would have been cleaned up.
An issue to discuss here is probably the rollback ability of a workflow, and the ability to start it again from a particular point, ideally re-running the workflow from the last failure point.

@mutability
Author

If create is called and there are already runtime properties on the node instance pointing to a VM, and that VM still exists, then create should definitely not create a new VM, overwrite the existing properties, and lose knowledge of the old (still running) VM. It would be nice to have it just return quietly in this case (much like how calling start on an already-started server is harmless), but it would be OK to fail too. The silently-leaks-resources part is the real problem at the moment.
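
A rough sketch of that quiet, idempotent behaviour, again with assumed names (the nova_client argument, the runtime-property key, and the helper structure are illustrative, not the plugin's real API):

```python
# Illustrative sketch: reuse the recorded server if it still exists,
# otherwise fall through to the normal creation path. Names are assumed.
from cloudify import ctx
from cloudify.decorators import operation
from novaclient import exceptions as nova_exceptions

SERVER_ID_KEY = 'server_id'  # assumed property name


@operation
def create(nova_client, **kwargs):
    server_id = ctx.instance.runtime_properties.get(SERVER_ID_KEY)
    if server_id:
        try:
            nova_client.servers.get(server_id)
            ctx.logger.info(
                'Server {0} already exists, nothing to do'.format(server_id))
            return  # behaves like calling start on an already-started server
        except nova_exceptions.NotFound:
            # The recorded VM is gone (e.g. deleted out of band); drop the
            # stale property and fall through to create a fresh one.
            del ctx.instance.runtime_properties[SERVER_ID_KEY]
    # ... boot a new server and store its id in runtime properties ...
```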

What I'm trying to achieve is an "install/repair the system" workflow that always tries to progress from the current state towards the fully-installed state. Ideally that would be the built-in "install" workflow. But to support this we at least need lifecycle operations that don't leak resources unpredictably.

Cleanup and rollback are complex problems; I don't have a particular opinion there, other than to point out that it's really hard to make aggregate operations like workflows atomic in the face of failure, so you're going to have to deal with partially-created states at some point. (And of course there are things like VM failure that can result in similar-looking states.)

@mutability
Author

BTW, if you have such an "install/repair" workflow, then healing from failures is a bit simpler: tear down (reset) all failed node instances and anything contained within them, then run the install/repair workflow.
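
As a toy, self-contained illustration of that strategy (plain Python, not Cloudify's workflow API; the state names and containment model are assumptions):

```python
# Toy model of the heal idea: reset every failed instance plus everything
# contained in it, then let an idempotent install/repair pass rebuild only
# what is missing. Purely illustrative.


class Instance(object):
    def __init__(self, name, state='uninitialized', contained=None):
        self.name = name
        self.state = state            # 'uninitialized' | 'started' | 'failed'
        self.contained = contained or []


def walk(instances):
    for instance in instances:
        yield instance
        for descendant in walk(instance.contained):
            yield descendant


def reset(instance):
    """Tear down a failed instance and everything contained in it."""
    for child in instance.contained:
        reset(child)
    print('tearing down {0}'.format(instance.name))
    instance.state = 'uninitialized'


def heal(top_level):
    # 1. Reset all failed instances (and anything contained in them).
    for instance in list(walk(top_level)):
        if instance.state == 'failed':
            reset(instance)
    # 2. Idempotent install/repair pass: only touch what is not running.
    for instance in walk(top_level):
        if instance.state != 'started':
            print('installing {0}'.format(instance.name))
            instance.state = 'started'


vm = Instance('vm', state='failed', contained=[Instance('app', state='failed')])
heal([vm, Instance('db', state='started')])
```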

@iliapolo
Contributor

iliapolo commented Feb 4, 2015

I agree with what you said about cleanup and rollback being complex problems. We have already started discussions in this area.

Regarding what you are trying to achieve, I would differentiate repair from install. Repair implies that something was broken; re-running the install workflow does not mean something is broken, but rather that something was not installed properly, and here idempotency is very useful, like you mentioned.
Basically, post- vs. pre-deployment workflows.

Regarding this repair workflow, we have actually already started developing it. It's currently in the testing phase and will be available in the 3.2 release. It does exactly what you described, plus it executes all the necessary relationship operations, of course.

This workflow can be integrated with our monitoring system to achieve automatic healing.

@mutability
Author

The repair workflow is probably always going to be a superset of the install workflow (consider the "everything fails" case).

@iliapolo
Contributor

iliapolo commented Feb 4, 2015

Right, it's not just a superset; in the current implementation it will also contain the uninstall/teardown workflow. In any case, I see your point about the OpenStack plugin create operation; I will write back here soon about this.
