
Exceptions should be handled, especially where it makes sense to attempt a retry #19

strickra opened this issue Jan 17, 2014 · 8 comments


@strickra

We are seeing multiple unhandled exceptions, which make knife ec backup more fragile than necessary. The big one we keep seeing: if --concurrency is greater than 1, the backup runs for a few minutes and then aborts like so:

Created /cookbooks/rabbitmq-0.0.1
Created /cookbooks/nova-0.6.26/templates/default/dashboard.apache.erb
ERROR: internal server error
Response: #<Net::ReadAdapter:0x00000002fd3c28>

It never aborts on the same file, or after the same amount of time has elapsed, but it's usually between two and four minutes.

When we set --concurrency to 1, the backup ran fine for 22 hours, then encountered the following problem and aborted:

Created /acls/roles/build_slave.json
Created /acls/organization.json
Grabbing organization personal-darragh ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
ERROR: ArgumentError: Cannot sign the request without a client name, check that :node_name is assigned

Although there is some evidence that restarting a failed backup is supported (it seems to skip already-downloaded content and update objects that have changed), the resume logic is not entirely complete:

Created /cookbooks/swift-0.0.19/templates/default/cron.d/swift-container-stats-log-creator.erb
Created /cookbooks/swift-0.0.19/templates/default/rsyslog.d/40-swift-object.conf.erb
Created /cookbooks/swift-0.0.19/files/default/systest/ring/account.builder
Created /cookbooks/swift-0.0.19/files/default/systest/ring/container.builder
ERROR: Errno::EEXIST: File exists - /home/strickra/projects/chef11/xfer/aw1/organizations/aw1-ops/cookbooks/icinga-0.3.8

Taken together, these failures are catastrophic for long-running backups.

@strickra
Author

We have resolved the "Cannot sign the request without a client name" error. It was due to an organization which had an Admins group with no users in it.

@strickra
Author

Here's a fun new iteration on the theme of busted chef orgs.

Grabbing organization testorg ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
Created /groups/admins.json
ERROR: ChefFS::FileSystem::OperationFailedError: HTTP error retrieving children: 403 "Forbidden"

Note: the 'admins' group has no ACLs, and thus no permissions.

@strickra
Author

Actually I'm not sure what to make of the above. There were other orgs that didn't have /acls/groups/admins.json but otherwise were backed up okay. In the process of using orgmapper to grant myself permission to see "testorg", we stopped being able to reproduce the error.

@strickra
Author

The Errno::EEXIST problem reported earlier can be worked around by doing this before restarting the backup:

find backup-dir/ -type d -empty -print | xargs rmdir
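This points at the underlying fix: if the backup created directories idempotently on restart, it wouldn't trip over directories left by the previous run. A minimal Ruby sketch of the difference (hypothetical paths; this is not the actual knife-ec-backup code):

```ruby
require 'fileutils'
require 'tmpdir'

dir = File.join(Dir.tmpdir, 'ec-backup-demo', 'cookbooks', 'icinga-0.3.8')

# FileUtils.mkdir_p is idempotent: it creates missing parents and
# silently succeeds if the directory already exists.
FileUtils.mkdir_p(dir)
FileUtils.mkdir_p(dir)   # no Errno::EEXIST on the second call

# The bare Dir.mkdir equivalent reproduces the reported failure and
# must be rescued explicitly:
begin
  Dir.mkdir(dir)
rescue Errno::EEXIST
  # directory left over from a previous run -- safe to continue
end
```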

@stevendanna
Contributor

I'm increasingly of the opinion that we should try to restore as much as possible even if a given org fails. If one org fails, we could swallow the error, move on to the next, and then print out a report at the end. Thoughts?
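Sketching that idea (all names hypothetical; this is not the actual knife-ec-backup code): rescue per-org failures, record them, and keep going, then report at the end:

```ruby
# backup_org is a stand-in for the real per-org download step; here it
# simulates one broken org so the report path is exercised.
def backup_org(org)
  raise "403 Forbidden retrieving children" if org == 'testorg'
end

# Swallow per-org failures, move on to the next org, and return the
# collected errors so they can be printed as a report at the end.
def backup_all_orgs(orgs)
  failures = {}
  orgs.each do |org|
    begin
      backup_org(org)
    rescue StandardError => e
      failures[org] = e            # record the failure and keep going
    end
  end
  failures
end

failures = backup_all_orgs(%w[aw1-ops testorg personal-darragh])
failures.each { |org, e| warn "FAILED #{org}: #{e.message}" }
```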

@strickra
Copy link
Author

strickra commented Feb 6, 2014

That might be helpful, though you mention restores and I was reporting problems specifically with backup. It is definitely the case that for backups that take 20+ hours to run to completion, hitting an org with an error (like the one I keep seeing, where the admins group has no members for backup to switch to) takes a long time; then you fix it in two minutes, wait a long time again to see whether the fix worked, wait longer still, and then discover another org is busted the same way and have to start over AGAIN. This process is pretty clunky, and having the backup move on to the next org and print a report at the end would certainly improve things here.

I wouldn't expect things like server timeouts or transient network transport/socket issues (like the run I had where a socket couldn't be opened locally: "cannot assign requested address") to skip the org; the backup should hang on and retry. Similarly, the EEXIST problem on empty directories seems like it should basically just be ignored.
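A retry wrapper along those lines might look like the following sketch (hypothetical helper, not actual knife-ec-backup code): transient errors get exponential backoff, everything else still aborts:

```ruby
require 'timeout'

# Errors worth retrying: transient socket/timeout conditions like the
# "cannot assign requested address" (Errno::EADDRNOTAVAIL) run above.
TRANSIENT = [Errno::EADDRNOTAVAIL, Errno::ECONNRESET, Timeout::Error].freeze

def with_retries(tries: 5, base_delay: 1)
  attempt = 0
  begin
    yield
  rescue *TRANSIENT
    attempt += 1
    raise if attempt >= tries              # out of retries: surface the error
    sleep(base_delay * 2**(attempt - 1))   # exponential backoff
    retry
  end
end

# Usage: the block is re-run until it succeeds or retries are exhausted.
attempts = 0
with_retries(tries: 3, base_delay: 0) do
  attempts += 1
  raise Errno::ECONNRESET if attempts < 3
end
```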

@jkeiser-oc

I agree. As much as possible, errors should be obvious (you should see them), but if we can move on and do more, we should.


@stevendanna
Contributor

This is still an issue even on the 2.0 refactor branch. I'm going to leave this issue open since I think we can probably make the situation better for the use case of long backups.
