Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too much lock contention? #247

Open
IanButterworth opened this issue Jul 13, 2024 · 8 comments
Open

Too much lock contention? #247

IanButterworth opened this issue Jul 13, 2024 · 8 comments

Comments

@IanButterworth
Copy link
Member

IanButterworth commented Jul 13, 2024

Trying to investigate this at the JuliaCon hackathon. Opening for discussion.

$ julia -t4

julia> function foo(url, n; downloader=()->Downloads.Downloader(;grace=0))
       Threads.@threads for _ in 1:n
           Downloads.download(url; downloader=downloader())
       end
       end
foo (generic function with 3 methods)

julia> @time foo("https://julialang-s3.julialang.org/bin/mac/aarch64/1.10/julia-1.10.4-macaarch64.dmg", 4, downloader=Returns(nothing))
 28.483014 seconds (916.99 k allocations: 659.244 MiB, 0.70% gc time, 22905 lock conflicts, 1.15% compilation time)

julia> @time foo("https://julialang-s3.julialang.org/bin/mac/aarch64/1.10/julia-1.10.4-macaarch64.dmg", 4)
 26.974909 seconds (1.23 M allocations: 675.870 MiB, 0.82% gc time, 1801 lock conflicts, 1.32% compilation time)

Note 1801 lock conflicts even on the task local Downloader version

@StefanKarpinski
Copy link
Member

Can you figure out which locks have contention?

@StefanKarpinski
Copy link
Member

I suspect it's not the download start lock but probably the IO locks.

@IanButterworth
Copy link
Member Author

IanButterworth commented Jul 13, 2024

This is the most common lock from @profile

lock(multi.lock) do
watcher.readable || watcher.writable || return # !isopen
@check curl_multi_socket_action(multi.handle, sock, flags)
check_multi_info(multi)
end

I've also moved to hosting a file locally on a http server and downloading from tha, to test theoretical limit because the conference and hotel wifi isn't great, and with that I see a lot fewer locks.

one download per thread

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6, downloader=Returns(nothing))
  0.958359 seconds (493.32 k allocations: 945.356 MiB, 7.87% gc time, 2380 lock conflicts)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
  0.949370 seconds (460.36 k allocations: 790.393 MiB, 11.77% gc time, 31 lock conflicts)

many downloads per thread

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60, downloader=Returns(nothing))
 21.439162 seconds (4.94 M allocations: 9.233 GiB, 4.32% gc time, 24624 lock conflicts)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
 12.926270 seconds (5.43 M allocations: 9.108 GiB, 3.51% gc time, 567 lock conflicts)

I'm not sure whether a locally http server is a fair test of this.. but the speed is a lot faster.

And if I set https://curl.se/libcurl/c/CURLOPT_BUFFERSIZE.html

setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000)

we get a good speed up (both here with task local Downloaders)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
  0.568647 seconds (709.51 k allocations: 956.762 MiB, 30.29% gc time, 19 lock conflicts, 102.74% compilation time: <1% of which was recompilation)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
  3.849536 seconds (4.77 M allocations: 9.075 GiB, 21.38% gc time, 14 lock conflicts, 0.53% compilation time)

Apparently the default buffersize is 16kB and the max is 10MB

@IanButterworth
Copy link
Member Author

IanButterworth commented Jul 13, 2024

Take all the numbers here with a pinch of salt, I just tried default settings again and saw similar speeds to the latter results..

Note that the buffersize is increased in this example https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/docs/examples/http2-download.c#L168-L169

and curl itself uses 100kB https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/src/tool_operate.c#L271

Some comparisons:

Downloads.jl - normal default settings from master

julia> for _ in 1:5 @time Downloads.download("http://localhost:8000/julia-1.10.4-macaarch64.dmg"); end
  0.104941 seconds (92.07 k allocations: 158.067 MiB, 7.03% gc time)
  0.090938 seconds (92.08 k allocations: 158.068 MiB, 6.12% gc time)
  0.096240 seconds (92.07 k allocations: 158.067 MiB, 5.23% gc time)
  0.090895 seconds (92.08 k allocations: 158.067 MiB, 5.55% gc time)
  0.088865 seconds (92.08 k allocations: 158.068 MiB, 5.88% gc time)

julia> 163 / 0.09
1811.111111111111

curl cli - 1 GB/s

% url=http://localhost:8000/julia-1.10.4-macaarch64.dmg
% curl -O -s -w "%{speed_download}" "$url" | awk '{printf "Download speed: %.2f GB/s\n", $1/1024/1024/1024}'
Download speed: 1.02 GB/s
...
Download speed: 1.17 GB/s
Download speed: 1.10 GB/s
Download speed: 1.04 GB/s

wget cli - 2.6 GB/s

And wget gives 2.6 GB/s

% wget --output-document=/dev/null http://localhost:8000/julia-1.10.4-macaarch64.dmg
--2024-07-13 23:33:14--  http://localhost:8000/julia-1.10.4-macaarch64.dmg
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163644235 (156M) [application/x-apple-diskimage]
Saving to: '/dev/null'

/dev/null                                     100%[==============================================================================================>] 156.06M  --.-KB/s    in 0.06s   

2024-07-13 23:33:15 (2.50 GB/s) - '/dev/null' saved [163644235/163644235]
...
2024-07-13 23:41:44 (2.72 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:45 (2.67 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:46 (2.82 GB/s) - ‘/dev/null’ saved [163644235/163644235]

Thanks to @giordano for pointing some of this out.

@IanButterworth
Copy link
Member Author

IanButterworth commented Jul 13, 2024

Using JuliaLang/julia#55120 it turns out the lock that is mainly hitting the slow lock is

lock(multi.lock) do

and that's happening with timeout_ms == 0


One thing I'm noticing is that for

Downloads.download(url; downloader=Downloads.Downloader(;grace=0))

we never see lock conflicts.

Nor for

fetch(@spawn Downloads.download(url; downloader=Downloads.Downloader(;grace=0)))

But we do for

Threads.@threads for _ in 1:10
    Downloads.download(url; downloader=Downloads.Downloader(;grace=0))
end

A small number can be attributed to this lock when generating the Downloaders (which could be more efficiently managed to be task-local in the @thread for)

@lock MULTIS_LOCK push!(filter!(m -> m.value isa Multi, MULTIS), WeakRef(multi))

@IanButterworth
Copy link
Member Author

IanButterworth commented Jul 14, 2024

I think a few things we might want to consider doing are:

  1. By default setting setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000) - increase from the default of 16 kB to 10 MB
  2. Increase CONNECT_SEMAPHORE
    const CONNECT_SEMAPHORE = Base.Semaphore(16) # empirically chosen (ie guessed)
  3. Document that when using concurrent tasks it's best to use a task-local Downloader

I'll benchmark the parameter space when on a strong internet connection.

In the meantime I've spun up 4 local http servers on different ports from a ramdisk, and accessing them individually from each task, but I'm not sure it's a fair test.

This is 4 sequential downloads on each of 4 concurrent tasks, 10 times

julia> foo(4, 4, 10)
1532.902446759521 MB/s
2447.040027683379 MB/s
1884.3813202392432 MB/s
2017.8544918394593 MB/s
1973.309438735145 MB/s
1632.805491498009 MB/s
2153.163424235369 MB/s
1935.19328559164 MB/s
1728.2429605092427 MB/s
1970.3497493574216 MB/s

Added results from locally hosted servers here master...ib/increase_buffersize

@IanButterworth
Copy link
Member Author

I think I'm reaching a conclusion here.

  1. The current defaults for buffersize & CONNECT_SEMAPHORE are actually optimal already.
  • Based on local server testing no configuration of them notably beats the current defaults, and lower values are slower
  • Based on CI benchmarks (where Windows CI machines have the fastest network connection it seems) the current defaults actually perform best
  1. Lock contention numbers are high but don't seem to slow things down. Could callbacks be orchestrated to avoid lock contention? A channel? Not sure but either way I'm not sure it's going to speed things up helpful
  2. However using the global Downloader from different threads does seem to hit enough lock contention to slow things down. I'll add a note in the docs to highlight that.

I'll close.

@StefanKarpinski
Copy link
Member

I'd really like for the defaults to be fast and not require jumping through any hoops. We've put so much work into making this concurrent under the hood with a simple blocking API, seems lame to not make it actually fast. So I'd like to actually fix this and make the warning unnecessary.

I have a few paths I'd like to explore. There's an option for multithreaded libcurl usage—there might be a way to make that work and avoid needing a lock. Another option would be to create a Downloader object per thread. The only way to observe that is the easy hook stuff which is an unofficial and not great API anyway, so we could reasonably do that. The other approach would be something with channels or other ways to avoid locks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants