
Exceptions should be handled, especially where it makes sense to attempt a retry #19

strickra opened this issue Jan 17, 2014 · 8 comments


@strickra

We are seeing multiple unhandled exceptions, which make knife ec backup more fragile than necessary. The big one we keep seeing: if --concurrency is greater than 1, the backup runs for a few minutes and then aborts like so:

Created /cookbooks/rabbitmq-0.0.1
Created /cookbooks/nova-0.6.26/templates/default/dashboard.apache.erb
ERROR: internal server error
Response: #<Net::ReadAdapter:0x00000002fd3c28>

It never aborts on the same file, or after the same amount of time has elapsed, but it's usually between two and four minutes.

When we set --concurrency to 1, the backup ran fine for 22 hours, then encountered the following problem and aborted:

Created /acls/roles/build_slave.json
Created /acls/organization.json
Grabbing organization personal-darragh ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
ERROR: ArgumentError: Cannot sign the request without a client name, check that :node_name is assigned

Although there is some evidence that restarting a failed backup is supported (it seems to skip already-downloaded content and update objects that have changed), the resume logic is not entirely complete:

Created /cookbooks/swift-0.0.19/templates/default/cron.d/swift-container-stats-log-creator.erb
Created /cookbooks/swift-0.0.19/templates/default/rsyslog.d/40-swift-object.conf.erb
Created /cookbooks/swift-0.0.19/files/default/systest/ring/account.builder
Created /cookbooks/swift-0.0.19/files/default/systest/ring/container.builder
ERROR: Errno::EEXIST: File exists - /home/strickra/projects/chef11/xfer/aw1/organizations/aw1-ops/cookbooks/icinga-0.3.8

Taken together, these failures are catastrophic for long-running backups.

@strickra
Author

We have resolved the "Cannot sign the request without a client name" error. It was due to an organization which had an Admins group with no users in it.

@strickra
Author

Here's a fun new iteration on the theme of busted chef orgs.

Grabbing organization testorg ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
Created /groups/admins.json
ERROR: ChefFS::FileSystem::OperationFailedError: HTTP error retrieving children: 403 "Forbidden"

Note: the 'admins' group has no ACLs, and thus no permissions.

@strickra
Author

Actually I'm not sure what to make of the above. There were other orgs that didn't have /acls/groups/admins.json but otherwise were backed up okay. In the process of using orgmapper to grant myself permission to see "testorg", we stopped being able to reproduce the error.

@strickra
Author

The Errno::EEXIST problem reported earlier can be worked around by doing this before restarting the backup:

find backup-dir/ -type d -empty -print | xargs rmdir
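This points at the underlying fix: if the backup created directories idempotently on restart, it wouldn't trip over directories left by the previous run. A minimal Ruby sketch of the difference (hypothetical paths; this is not the actual knife-ec-backup code):

```ruby
require 'fileutils'
require 'tmpdir'

dir = File.join(Dir.tmpdir, 'ec-backup-demo', 'cookbooks', 'icinga-0.3.8')

# FileUtils.mkdir_p is idempotent: it creates missing parents and
# silently succeeds if the directory already exists.
FileUtils.mkdir_p(dir)
FileUtils.mkdir_p(dir)   # no Errno::EEXIST on the second call

# The bare Dir.mkdir equivalent reproduces the reported failure and
# must be rescued explicitly:
begin
  Dir.mkdir(dir)
rescue Errno::EEXIST
  # directory left over from a previous run -- safe to continue
end
```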

@stevendanna
Contributor

I'm increasingly of the opinion that we should try to restore as much as possible even if a given org fails. If one org fails, we could swallow the error, move on to the next, and then print out a report at the end. Thoughts?
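Sketching that idea (all names hypothetical; this is not the actual knife-ec-backup code): rescue per-org failures, record them, and keep going, then report at the end:

```ruby
# backup_org is a stand-in for the real per-org download step; here it
# simulates one broken org so the report path is exercised.
def backup_org(org)
  raise "403 Forbidden retrieving children" if org == 'testorg'
end

# Swallow per-org failures, move on to the next org, and return the
# collected errors so they can be printed as a report at the end.
def backup_all_orgs(orgs)
  failures = {}
  orgs.each do |org|
    begin
      backup_org(org)
    rescue StandardError => e
      failures[org] = e            # record the failure and keep going
    end
  end
  failures
end

failures = backup_all_orgs(%w[aw1-ops testorg personal-darragh])
failures.each { |org, e| warn "FAILED #{org}: #{e.message}" }
```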

@strickra
Copy link
Author

strickra commented Feb 6, 2014

That might be helpful, though you mention restores and I was reporting problems specifically with backup. It is definitely the case that for backups that take 20+ hours to run to completion, hitting an org with an error (like the one I keep seeing, where the admins group has no members for backup to switch to) takes a long time; then you fix it in two minutes, wait a long time again to see whether the fix worked, wait longer still, and then discover another org is busted the same way and have to start over AGAIN. This process is pretty clunky, and having the backup move on to the next org and print a report at the end would certainly improve things here.

I wouldn't expect things like server timeouts or transient network transport/socket issues (like the run I had where a socket couldn't be opened locally: "cannot assign requested address") to skip the org; the backup should hang on and retry. Similarly, the EEXIST problem on empty directories seems like it should basically just be ignored.
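A retry wrapper along those lines might look like the following sketch (hypothetical helper, not actual knife-ec-backup code): transient errors get exponential backoff, everything else still aborts:

```ruby
require 'timeout'

# Errors worth retrying: transient socket/timeout conditions like the
# "cannot assign requested address" (Errno::EADDRNOTAVAIL) run above.
TRANSIENT = [Errno::EADDRNOTAVAIL, Errno::ECONNRESET, Timeout::Error].freeze

def with_retries(tries: 5, base_delay: 1)
  attempt = 0
  begin
    yield
  rescue *TRANSIENT
    attempt += 1
    raise if attempt >= tries              # out of retries: surface the error
    sleep(base_delay * 2**(attempt - 1))   # exponential backoff
    retry
  end
end

# Usage: the block is re-run until it succeeds or retries are exhausted.
attempts = 0
with_retries(tries: 3, base_delay: 0) do
  attempts += 1
  raise Errno::ECONNRESET if attempts < 3
end
```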

@jkeiser-oc

I agree. As much as possible, errors should be obvious (you should see them), but if we can move on and do more, we should.


@stevendanna
Contributor

This is still an issue even on the 2.0 refactor branch. I'm going to leave this issue open since I think we can probably make the situation better for the use case of long backups.
