Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test parallel downloads #2728

Merged
merged 1 commit into from
Oct 15, 2024
Merged

Conversation

nirs
Copy link
Contributor

@nirs nirs commented Oct 13, 2024

Add sub tests for parallel downloads with "with cache" and "caching-only
mode". The test run 20 concurrent downloads with the current test http
server and ensure that downloads succeeded.

To make it possible to test using the same process, we include now a
counter in the perProcessTempfile. The file name is now:

{filename}.pid.{count}

The test revealed an issue with clonefile, not seen when testing 3
parallel limactl create. When the data file is replaced during a
clonefile syscall, it may fail with:

clonefile failed: no such file or directory

This smells like a darwin bug, but we can avoid this issue by using the
first download result. When a download finishes, we check if data file
exists, and return success if it does. We take a lock for the very
short time needed to check and rename the temporary data file to the
target file.

The parallel tests failed on windows - it seems that os.Rename() does
not work on windows if the target exist. Replacing atomicWrite with a
simpler version that takes a lock and write the file fixes the issue.

Tested using:

% go test ./pkg/downloader -run 'TestDownloadRemote/with_cache/parallel' -count=1000
ok          github.com/lima-vm/lima/pkg/downloader  116.025s

% go test ./pkg/downloader -run 'TestDownloadRemote/caching-only_mode/parallel' -count=1000
ok          github.com/lima-vm/lima/pkg/downloader  98.535s

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

The windows tests fail, probably already broken since #2723:

downloader_test.go:110: Download failed: rename
C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel3648796552\001\cache\download\by-url-sha256\234be6a0193368c635111cf397120fb9724641a6641a1adb8b16ad5b74e951f3\url.tmp.1624.11
C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel3648796552\001\cache\download\by-url-sha256\234be6a0193368c635111cf397120fb9724641a6641a1adb8b16ad5b74e951f3\url:
Access is denied.

os.Rename() calls MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING)
https://github.com/golang/go/blob/b521ebb55a9b26c8824b219376c7f91f7cda6ec2/src/internal/syscall/windows/syscall_windows.go#L360

Maybe it does not work when another process is renaming the same file?

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

The fix works for windows for the metadata files, but renaming the data file fails:

=== RUN   TestDownloadRemote/with_cache/parallel
    downloader_test.go:110: Download failed: rename C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel26671273\001\cache\download\by-url-sha256\6c11daee298801cf583ce308783f6d28f0ddfce3214064d55cc45217174035fc\data.tmp.6508.6 C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel26671273\001\cache\download\by-url-sha256\6c11daee298801cf583ce308783f6d28f0ddfce3214064d55cc45217174035fc\data: Access is denied.
=== RUN   TestDownloadRemote/caching-only_mode
=== RUN   TestDownloadRemote/caching-only_mode/serial
=== RUN   TestDownloadRemote/caching-only_mode/parallel
    downloader_test.go:153: Download failed: rename C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotecaching-only_modeparallel2628695855\001\cache\download\by-url-sha256\6c11daee298801cf583ce308783f6d28f0ddfce3214064d55cc45217174035fc\data.tmp.6508.27 C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotecaching-only_modeparallel2628695855\001\cache\download\by-url-sha256\6c11daee298801cf583ce308783f6d28f0ddfce3214064d55cc45217174035fc\data: Access is denied.

We know that the "data" does not exist, not sure why os.Rename() fails.

@nirs nirs marked this pull request as draft October 13, 2024 12:51
@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

Adding logs:

diff --git a/pkg/downloader/downloader.go b/pkg/downloader/downloader.go
index 2611740..ea7e648 100644
--- a/pkg/downloader/downloader.go
+++ b/pkg/downloader/downloader.go
@@ -679,7 +679,9 @@ func downloadHTTP(ctx context.Context, localPath, lastModified, contentType, url
        // data, and when we return data file exists.
 
        return lockutil.WithDirLock(filepath.Dir(localPath), func() error {
-               if _, err := os.Stat(localPath); err == nil {
+               fi, err := os.Stat(localPath)
+               logrus.Infof("os.Stat('%s') -> %+v, %+v", localPath, fi, err)
+               if err == nil {
                        return nil
                } else if !errors.Is(err, os.ErrNotExist) {
                        return err
  1. First thread did not find the file - expected
=== RUN   TestDownloadRemote/with_cache/parallel
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> <nil>, CreateFile C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data: The system cannot find the file specified."
  1. Next 4 thread did not find the file?!
    • should be impossible: the lock should prevent these threads from checking the file until the first thread rename the temporary file and released the lock.
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> <nil>, CreateFile C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data: The system cannot find the file specified."
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> <nil>, CreateFile C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data: The system cannot find the file specified."
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> <nil>, CreateFile C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data: The system cannot find the file specified."
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> <nil>, CreateFile C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data: The system cannot find the file specified."
  1. From this point, all threads find the file and return - expected
time="2024-10-13T13:41:07Z" level=info msg="os.Stat('C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data') -> &{name:data FileAttributes:32 CreationTime:{LowDateTime:2384380421 HighDateTime:31137141} LastAccessTime:{LowDateTime:2384386325 HighDateTime:31137141} LastWriteTime:{LowDateTime:2384386325 HighDateTime:31137141} FileSizeHigh:0 FileSizeLow:49 ReparseTag:0 filetype:0 Mutex:{state:0 sema:0} path:C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestDownloadRemotewith_cacheparallel1978374964\\001\\cache\\download\\by-url-sha256\\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\\data vol:0 idxhi:0 idxlo:0 appendNameToPath:false}, <nil>"
  1. 3 thread failed to rename - this means the lock is broken - only one thread should be able to rename.
    downloader_test.go:110: Download failed: rename C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data.tmp.908.6 C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data: Access is denied.
    downloader_test.go:110: Download failed: rename C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data.tmp.908.7 C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data: Access is denied.
    downloader_test.go:110: Download failed: rename C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data.tmp.908.8 C:\Users\RUNNER~1\AppData\Local\Temp\TestDownloadRemotewith_cacheparallel1978374964\001\cache\download\by-url-sha256\28be12e26c23dd7b394a4970c0da2ae02769dd834c338a550d7a6264536c0440\data: Access is denied.

@AkihiroSuda looks like the windows lock is broken. We don't have tests for the lockutil so this is expected :-)

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

Or maybe on windows the lock works only for different processes?

We used the same process to simplify the test, but maybe we should test separate processes. We can create a minimal executable that run fileutils.Download(), or maybe create a hidden limactl download command and test it?

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

The windows tests pass now. I will split the lockutil fix to another pr, it is much worse bug than the rare failure for parallel downloads.

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

Added separate PR for lockutil: #2729.

I'll rebase and squash the commit when it is merge.

@nirs
Copy link
Contributor Author

nirs commented Oct 13, 2024

Squashed the test commit and rebase on #2729 since the tests are broken without it. I'll rebase on master after #2729 it is merged.

@nirs nirs marked this pull request as ready for review October 13, 2024 17:17
Add sub tests for parallel downloads with "with cache" and "caching-only
mode". The test run 20 concurrent downloads with the current test http
server and ensure that downloads succeeded.

To make it possible to test using the same process, we include now a
counter in the perProcessTempfile. The file name is now:

    {filename}.pid.{count}

The test revealed an issue with clonefile, not seen when testing 3
parallel `limactl create`. When the data file is replaced during a
clonefile syscall, it may fail with:

    clonefile failed: no such file or directory

This smells like a darwin bug, but we can avoid this issue by using the
first download result. When a download finishes, we check if data file
exists, and return success if it does. We take a lock for the very
short time needed to check and rename the temporary data file to the
target file.

The parallel tests failed on windows - it seems that os.Rename() does
not work on windows if the target exist. Replacing atomicWrite with a
simpler version that takes a lock and write the file fixes the issue.

Tested using:

    % go test ./pkg/downloader -run 'TestDownloadRemote/with_cache/parallel' -count=1000
    ok  	github.com/lima-vm/lima/pkg/downloader	116.025s

    % go test ./pkg/downloader -run 'TestDownloadRemote/caching-only_mode/parallel' -count=1000
    ok  	github.com/lima-vm/lima/pkg/downloader	98.535s

Signed-off-by: Nir Soffer <[email protected]>
@AkihiroSuda AkihiroSuda added this to the v1.0 milestone Oct 13, 2024
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@AkihiroSuda AkihiroSuda merged commit 6f7569f into lima-vm:master Oct 15, 2024
26 checks passed
@nirs nirs deleted the downloader-parallel-test branch October 15, 2024 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants