Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[omicron-package] Retry failed downloads automatically #4168

Merged
merged 1 commit into from
Oct 2, 2023

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Sep 29, 2023

  • Automatically retries downloads, with a small amount of constant backoff
  • Provides config options for users to customize these retry attempts / backoffs
  • Extends progress bars to show when download failures occur

Tested manually, by running:

cargo run --bin=omicron-package -- --retry-count 5 package

Then cycling my machine's network connectivity. I observed the "Failed to download prebuilt messages", saw the attempts ticking down, and then reconnected. Once connectivity was restored, the download succeeded.

Fixes #4165

@smklein smklein added Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. build Building the control plane labels Sep 29, 2023
@smklein smklein requested a review from davepacheco September 29, 2023 20:58
@smklein smklein requested a review from jclulow September 29, 2023 21:11
@@ -303,8 +325,63 @@ async fn get_sha256_digest(path: &PathBuf) -> Result<Digest> {
Ok(context.finish())
}

async fn download_prebuilt(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically just refactored to make re-calling it easier

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.

It'd be nice if we could use the existing general-purpose retry stuff but I get that the policies they use today aren't appropriate here.

I also wonder if we're going to want to resume downloads from where they left off. (If the server is failing somewhat often, we might never be able to do a full download, but we would be able to make progress if we resumed partial downloads.) But we can always add this if we continue to see problems.

.content_length()
.ok_or_else(|| anyhow!("Missing Content Length"))?,
);
let mut file = tokio::fs::File::create(&path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that we don't explicitly remove the old file. Does this remove it, or at least truncate it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It truncates it on the second attempt. If it helps, the hash won't match on partial downloads, so it would know that it needs to retry if we re-execute.

@smklein smklein merged commit 1c0553a into main Oct 2, 2023
@smklein smklein deleted the omicron-package-retry branch October 2, 2023 18:01
@jgallagher
Copy link
Contributor

I also wonder if we're going to want to resume downloads from where they left off. (If the server is failing somewhat often, we might never be able to do a full download, but we would be able to make progress if we resumed partial downloads.) But we can always add this if we continue to see problems.

This will probably require adding support in buildomat (oxidecomputer/buildomat#31).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Building the control plane Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI package job fails due to transient buildomat errors
3 participants