Exceptions should be handled, especially where it makes sense to attempt a retry #19
We have resolved the "Cannot sign the request without a client name" error. It was due to an organization which had an Admins group with no users in it.
Here's a fun new iteration on the theme of busted chef orgs.
Note that the 'admins' group has no ACLs, and thus no permissions.
Actually, I'm not sure what to make of the above. There were other orgs that didn't have /acls/groups/admins.json but otherwise were backed up okay. In the process of using orgmapper to grant myself permission to see "testorg", we stopped being able to reproduce the error.
The Errno::EEXIST problem reported earlier can be worked around by doing this before restarting the backup:
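A minimal sketch of one possible cleanup, assuming the Errno::EEXIST comes from empty per-org directories left behind by the aborted run; the path handling and approach below are assumptions, not the commenter's exact steps:

```ruby
# Remove empty leaf directories under the backup destination so the next
# run can recreate them cleanly. Assumption: EEXIST is raised when the
# backup tries to recreate directories that already exist but are empty.
require "fileutils"

backup_dir = ARGV.fetch(0, "./chef-backup")   # hypothetical backup path

Dir.glob(File.join(backup_dir, "**/*"))
   .select  { |path| File.directory?(path) && Dir.entries(path).sort == %w[. ..] }
   .sort_by { |path| -path.length }           # deepest directories first
   .each    { |path| FileUtils.rmdir(path) }
```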
I'm increasingly of the opinion that we should try to restore as much as possible even if a given org fails. If one org fails, we could swallow the error, move on to the next, and then print out a report at the end. Thoughts?
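A minimal sketch of that idea, using a hypothetical `backup_org` and org list rather than the real knife-ec-backup internals: each org is wrapped in its own rescue, failures are collected, and a summary is printed at the end.

```ruby
# Continue past a failed org and report at the end (illustrative only).
orgs = %w[acme testorg example]              # hypothetical org list

def backup_org(org)
  raise "admins group has no members" if org == "testorg"   # simulated failure
end

failures = {}

orgs.each do |org|
  begin
    backup_org(org)
    puts "backed up #{org}"
  rescue StandardError => e
    failures[org] = e
    warn "skipping org '#{org}': #{e.class}: #{e.message}"
  end
end

unless failures.empty?
  warn "\n#{failures.size} org(s) failed:"
  failures.each { |org, e| warn "  #{org}: #{e.message}" }
  exit 1
end
```

Exiting non-zero when any org failed keeps the failure visible to whatever is driving the backup, even though the run itself completes.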
That might be helpful, though you mention restores and I was reporting problems specifically with backups. For backups that take 20+ hours to run to completion, the current process is definitely clunky: you wait a long time to reach an org with an error (like the one I keep seeing where the admins group has no members for backup to switch to), fix it in two minutes, wait a long time again to see whether the fix worked, wait longer, and then discover another org is busted the same way and have to start over AGAIN. Having the backup move on to the next org and print a report at the end would certainly improve things here. That said, I wouldn't expect server timeouts or transient network transport/socket issues, like the run I had where a socket couldn't be opened locally ("cannot assign requested address"), to skip the org; those should hang on and retry. Similarly, the EEXIST problem on empty directories seems like it should basically just be ignored.
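For the transient cases (timeouts, connection resets, the local "cannot assign requested address" failures), here is a hedged sketch of a retry wrapper; the `with_retries` helper and the list of retryable exceptions are assumptions, not part of knife-ec-backup:

```ruby
# Retry transient network errors with a simple linear backoff.
require "net/http"

RETRYABLE = [Errno::EADDRNOTAVAIL,   # "cannot assign requested address"
             Errno::ECONNRESET,
             Net::ReadTimeout,
             SocketError].freeze

def with_retries(attempts: 5, base_delay: 2)
  tries = 0
  begin
    yield
  rescue *RETRYABLE => e
    tries += 1
    raise if tries >= attempts
    warn "Transient error (#{e.class}), retry #{tries}/#{attempts - 1}"
    sleep(base_delay * tries)
    retry
  end
end

# Hypothetical usage around a single object download:
# with_retries { rest.get("organizations/#{org}/nodes") }
```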
I agree. As much as possible, errors should be obvious.
This is still an issue even on the 2.0 refactor branch. I'm going to leave this issue open since I think we can probably make the situation better for the use case of long backups. |
We are seeing multiple unhandled exceptions, which are making knife ec backup more fragile than necessary. The big one we keep seeing is that if --concurrency is > 1, the backup will run for a few minutes and then abort like so:
It never aborts on the same file, or after the same amount of time has elapsed, but it's usually between two and four minutes.
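That pattern is what you would expect from unhandled exceptions inside worker threads. The sketch below is illustrative only (a made-up `download` and object list, not the knife-ec-backup code): whichever worker hits an error first re-raises it when the thread is joined and takes the whole run down at a timing-dependent point, so the abort never lands on the same file or after the same elapsed time.

```ruby
# Illustrative: unrescued exceptions in worker threads abort the whole run.
objects     = (1..200).map { |i| "object-#{i}" }
concurrency = 4

download = lambda do |obj|
  sleep(rand / 1000)                                 # simulate network latency
  raise "transient failure on #{obj}" if rand < 0.01 # simulated transient error
end

slice_size = (objects.size / concurrency.to_f).ceil
workers = objects.each_slice(slice_size).map do |batch|
  Thread.new { batch.each { |obj| download.call(obj) } }   # error raised here...
end

workers.each(&:join)   # ...is re-raised here and aborts everything
```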
When we set --concurrency to 1, the backup ran fine for 22 hours, then encountered the following problem and aborted:
Although there is some evidence that restarting a failed backup is supported (it seems to skip some already-downloaded content and update other objects that have changed), the resume is not entirely complete:
Taken together, these problems are catastrophic.