Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parallel downloads of the same image #2723

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

nirs
Copy link
Contributor

@nirs nirs commented Oct 11, 2024

When starting multiple vms in parallel using the same uncached yet image, we downloaded the data to the same temporary file (data.tmp) and wrote metadata to the same files (url, time, type, {digest}.digest). We could fail in many ways:

  • One process removes the cache directory when another tries to rename a temporary file to the destination file.
  • One process fails to remove the cache directory since another process started to download and the directory is not empty.
  • Corrupting data when writing to the same file from multiple processes. Mostly likely when writing downloaded data.

Fixes:

  • Do not remove the cache directory before the download. This avoids the failing to remove non-empty directory, and failure to rename a temporary file.
  • Write all data to temporary files, replacing the destination file when the write is complete. This avoid corrupted data.
  • Do not remove destination file when creating a new temporary file or renaming since creating a file truncate existing file and renaming replace it.

Issues:

  • On windows os.Rename is not atomic, but it should work when using NTFS.
  • On failures we may have lefover temporary files ({name}.tmp.{pid}).

Fixes #2722

Logs from creating 10 vms in parallel:
test-parallel-download.tar.gz

When starting multiple vms in parallel using the same uncached yet
image, we downloaded the data to the same temporary file (data.tmp) and
wrote metadata to the same files (url, time, type, {digest}.digest). We
could fail in many ways:

- One process removes the cache directory when another tries to rename a
  temporary file to the destination file.
- One process fails to remove the cache directory since another process
  started to download and the directory is not empty.
- Corrupting data when writing to the same file from multiple processes.
  Mostly likely when writing downloaded data.

Fixes:

- Do not remove the cache directory before the download. This avoids the
  failing to remove non-empty directory, and failure to rename a
  temporary file.
- Write all data to temporary files, replacing the destination file when
  the write is complete. This avoid corrupted data.
- Do not remove destination file when creating a new temporary file or
  renaming since creating a file truncate existing file and renaming
  replace it.

Issues:

- On windows os.Rename is not atomic, but it should work when using
  NTFS.
- On failures we may have lefover temporary files ({name}.tmp.{pid}).

Fixes lima-vm#2722

Signed-off-by: Nir Soffer <[email protected]>
@AkihiroSuda AkihiroSuda added this to the v1.0 milestone Oct 11, 2024
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, can we have unit tests?

@AkihiroSuda AkihiroSuda merged commit a07b7f9 into lima-vm:master Oct 11, 2024
26 checks passed
@nirs nirs deleted the parallel-downloads branch October 11, 2024 18:28
@nirs
Copy link
Contributor Author

nirs commented Oct 11, 2024

Thanks, can we have unit tests?

I did not check the current tests yet, but I don't wan to have a test doing parallel downloads from a real server that will run on very pr.

We can do something like this:

  1. start in process http server serving a directory inside the test temporary directory
  2. add tiny image with random data to the http server root directory
  3. run downloading the tiny image in parallel
  4. verify that all downloads succeeded and contents of cached directory is correct

With a tiny image this can run very quickly.

@AkihiroSuda what do you think?

@AkihiroSuda
Copy link
Member

SGTM

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

Thanks, can we have unit tests?

Here: #2728

@nirs nirs mentioned this pull request Oct 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Downloading images concurrently breaks in many ways
2 participants