Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TTFX regression with Comonicon CLI on Julia 1.11 #55171

Open
timholy opened this issue Jul 19, 2024 · 22 comments
Open

TTFX regression with Comonicon CLI on Julia 1.11 #55171

timholy opened this issue Jul 19, 2024 · 22 comments
Labels
system:wsl Windows Subsystem Linux (WSL)

Comments

@timholy
Copy link
Member

timholy commented Jul 19, 2024

Following up on a slack conversation, there is indeed a substantial regression in the responsivity of a CLI application written with Comonicon:

Julia 1.10:

~/.julia/bin$ time ./cli --help
# lots of output
real    0m0.285s
user    0m0.395s
sys     0m0.642s

Julia 1.11-rc1:

real    0m3.681s
user    0m3.656s
sys     0m0.866s

I haven't really participated in the Julia 1.11 release and don't have time to check myself, but since it's a pretty sizable regression I'm slapping a milestone on it. (I am not personally affected by CLI TTFX, though.)

Here are the specifics of what I tested:

tim@diva:/tmp/TTFX/CLI$ tree
.
├── Manifest.toml
├── Project.toml
└── src
    └── CLI.jl

1 directory, 3 files
tim@diva:/tmp/TTFX/CLI$ cat src/CLI.jl
module CLI

using Comonicon

"""
ArgParse example implemented in Comonicon.

# Arguments

- `x`: an argument

# Options

- `--opt1 <arg>`: an option
- `-o, --opt2 <arg>`: another option

# Flags

- `-f, --flag`: a flag
"""
Comonicon.@main function main(x; opt1=1, opt2::Int=2, flag=false)
    println("Parsed args:")
    println("flag=>", flag)
    println("arg=>", x)
    println("opt1=>", opt1)
    println("opt2=>", opt2)
end

end # module CLI

Build with

(CLI) pkg> build

and then test with

~/.julia/bin$ time ./cli --help
@timholy timholy added this to the 1.11 milestone Jul 19, 2024
@timholy
Copy link
Member Author

timholy commented Jul 19, 2024

Oh, and if anyone is looking into this, note that the master branch of SnoopCompile works on Julia 1.11. See this issue for progress on the release: timholy/SnoopCompile.jl#382

@IanButterworth
Copy link
Member

IanButterworth commented Jul 19, 2024

I tried to replicate this but

(CLI) pkg> st
Project CLI v0.1.0
Status `~/Documents/GitHub/CLI/Project.toml`
  [863f3e99] Comonicon v1.0.8

(CLI) pkg> build

shell> ls ~/.julia/bin
ls: /Users/ian/.julia/bin: No such file or directory

If anyone familiar with Comonicon can help I'd be happy to investigate (@timholy mentioned on slack that he wasn't sure why this didn't build ~/.julia/bin/cli)

@Sbozzolo

This comment was marked as off-topic.

@KristofferC

This comment was marked as off-topic.

@Roger-luo
Copy link
Contributor

Hi you can try the example package here: https://github.com/comonicon/Comonicon.jl/tree/main/example/Hello

and you can install the entry point by running Hello.comonicon_install()

help?> Hello.comonicon_install
  comonicon_install(;kwargs...)

  Install the CLI manually. This will use the default configuration in
  Comonicon.toml, if it exists. For more detailed reference, please refer to
  Comonicon documentation (https://comonicon.org).

you don't have to test that with an entry point, you can test directly with its Julia entry point

help?> Hello.command_main
  No documentation found.

  Hello.command_main is a Function.

  # 2 methods for generic function "command_main" from Hello:
   [1] command_main(ARGS::Vector{String})
       @ ~/Code/Julia/Comonicon/src/codegen/julia.jl:90
   [2] command_main()
       @ ~/Code/Julia/Comonicon/src/codegen/julia.jl:90

@KristofferC
Copy link
Member

I get:

# 1.10
hello --help  0.37s user 0.31s system 308% cpu 0.219 total

# 1.11
hello --help  1.01s user 1.58s system 354% cpu 0.731 total

And considering both Pkg and REPL is loaded, this does not seem very surprising.

@KristofferC KristofferC removed this from the 1.11 milestone Jul 29, 2024
@timholy
Copy link
Member Author

timholy commented Jul 29, 2024

Hmm, it was much worse for me, I wonder why. Agreed this seems more reasonable.

Is it all TTL? There seems to be a path forward that Jameson and I sketched out for significantly reducing TTL, but it will take some time to explore. Therefore, let me ask a heretical question: if Pkg & REPL are the two culprits and they come up a lot, how important for 1.11 is their exclusion from the sysimg? Clearly we want this eventually, but until juliac is out, I guess I'm unclear on whether the advantages of their exclusion more than compensate for the disadvantages. Something that might be worth considering is "baking them in" for 1.11 (but hopefully that's just a couple of new lines to re-include them, I'm not suggesting we revert any hard work) with the expectation that they'll be excluded in a future release where either the TTL issue has been fixed or the advantages of their exclusion clearly outweigh the disadvantages.

If "putting them back" is not just a couple of lines, then you can ignore this.

@IanButterworth
Copy link
Member

@timholy whats your platform? WSL2?

@KristofferC
Copy link
Member

Regarding this specific case, it seems "self-inflicted". The Hello app here depends on the full Comonicon.jl which pulls in a lot of dependencies (Pkg, REPL) to both do argument parsing and to add a Hello.comonicon_install command inside the app itself. To me, this would better be structured as having Comonicon install the app (Comonicon.install("examples/Hello")) and having a smaller dependency that handles argument parsing that the app itself would depend on.

@KristofferC
Copy link
Member

As of right now, just adding Pkg to the sysimage doesn't work:

LoadError("sysimg.jl", 57, LoadError("/Users/kristoffercarlsson/julia/usr/share/julia/stdlib/v1.12/Pkg/src/Pkg.jl", 3, 
  LoadError("/Users/kristoffercarlsson/julia/usr/share/julia/stdlib/v1.12/Pkg/src/precompile.jl", 101, 
    Base.IOError("could not spawn setenv(`'' a registries/Registry.tar.gz -tgzip

which I guess is from https://github.com/JuliaLang/Pkg.jl/blob/da9f89ffc7df973d273a3952051ad656987bc64b/src/PlatformEngines.jl#L20 and that jlls don't work in the process creating a sysimage since __init__ is disabled in that process.

@IanButterworth
Copy link
Member

Including stdlibs shouldn't be expensive, and in my testing it's not on MacOS nor native linux. And from what I hear native windows neither.

I suspect this is another edge case where WSL2 (which I believe Tim is using) is so slow (100x slower than native linux) for file system operations that our loading code fs stuff is getting in the way.

We can probably replace some readdir's with the much more efficient _readdirx in there.

@IanButterworth
Copy link
Member

A profile would probably help.

@timholy
Copy link
Member Author

timholy commented Jul 29, 2024

In trying to install perf I'm currently hitting microsoft/WSL#10913. I should probably upgrade to 22.04, but that will take some time. In the meantime, given that this is WSL-specific then I agree with removing this from the milestone.

@Roger-luo
Copy link
Contributor

I'm slowly refactoring things in Comonicon to have features like app installation done as a plugin. The package was done in pre-package-image time, so everything was generated as a zero dependency code and injected into the user's code space via a macro. As @KristofferC mentioned, this is no longer necessary and can be done with a separate runtime package to support fancy CLI features (which is what I'm planning).

If excluding stdlib is the only reason for slowing it, I'm OK with not fixing it.

@IanButterworth IanButterworth added the system:wsl Windows Subsystem Linux (WSL) label Jul 29, 2024
@topolarity
Copy link
Member

another edge case where WSL2 (which I believe Tim is using) is so slow (100x slower than native linux) for file system operations that our loading code fs stuff is getting in the way.

I'm not sure that fully explains the slowdown here - WSL2 has very slow filesystem operations, but only when accessing files across the bridge to the native Windows FS.

When accessing files in the Linux filesystem, it should be pretty competitive. Is @timholy keeping his depot on the Windows side?

@IanButterworth
Copy link
Member

IanButterworth commented Aug 1, 2024

Tim provided some redone timing info to me for #55331 and saw faster baseline times than those which were reported here, for unknown reasons.

(Note my understanding is that Tim's julia installation and depot resides within the linux filesystem on WSL2)

@topolarity
Copy link
Member

I'm able to reproduce this on my machine, although I have to have assertions on to get numbers almost as bad as in the OP:

# julia +1.10
~/.julia/bin/cli --help  0.23s user 0.30s system 239% cpu 0.220 total
# julia +nightly
~/.julia/bin/cli --help  1.40s user 0.20s system 138% cpu 1.157 total
# local master (w/ assertions)
~/.julia/bin/cli --help  2.18s user 0.14s system 121% cpu 1.911 total

In case it's helpful, here's an strace -r -tt -T from the nightly and 1.10 runs:
cli_strace_1_10.log
cli_strace_nightly.log

I haven't looked closely enough at the trace yet to know what's dominating here, but we are making ~8x as many syscalls as we used to.

@KristofferC
Copy link
Member

I'm not surprised the syscalls are (much) more now because we went from opening 1 package to a lot of them. But looking at the logs there are some cases like

18:16:27.787691 (+     0.000211) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/Project.toml", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=165, ...}) = 0 <0.000093>
18:16:27.787910 (+     0.000218) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/Project.toml", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=165, ...}) = 0 <0.000090>

18:16:27.790806 (+     0.000760) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/src/Logging.jl", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2404, ...}) = 0 <0.000110>
18:16:27.791096 (+     0.000289) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/src/Logging.jl", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2404, ...}) = 0 <0.000116>

which are two identical stat calls for the same file at almost the same time which is probably not needed?

@topolarity
Copy link
Member

Yeah, agreed - Scrolling the log doesn't show any particular dominator, just a lot of package loading.

This is without #55331 so there's a chance Ian has already cleaned up some of that 🙂

@topolarity
Copy link
Member

The summarized results are also interesting (strace -c):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 50.76    0.103150          14      7087           read
 13.25    0.026934          11      2326      1253 access
 13.00    0.026428          10      2588         1 lseek
  6.84    0.013900           8      1615       118 statx
  5.28    0.010729          11       973        97 openat
  4.17    0.008476           9       878           close
  1.36    0.002770           7       378           mmap
  1.21    0.002454           9       258        72 newfstatat
  0.88    0.001798           5       322           futex
  0.85    0.001722          13       132           getdents64
  0.50    0.001020           9       103           mprotect
  0.43    0.000874           9        92           brk
 ...
------ ----------- ----------- --------- --------- ------------------
100.00    0.203227          11     17065      1584 total

So we're dominated by reads, most of which are us opening all of the .so files.

But in total we're only spending 15-20% of the time doing syscalls (1.38s user 0.18s system).

Taking a perf, it looks like we're suffering from the usual culprits for package loading:
image

(data is in profile.linux-perf.zip), image is taken from https://www.speedscope.app/)

@topolarity
Copy link
Member

topolarity commented Aug 1, 2024

It's also worth investigating why @KristofferC's results seem to be much better parallelized...

For an apples-to-apples test, here's the build.jl that I had to add to the MWE in the OP:

$ cat deps/build.jl
# build.jl
using CLI; CLI.comonicon_install()

@KristofferC
Copy link
Member

KristofferC commented Aug 1, 2024

This is what I am running (on an M1):

~/JuliaTests/CLI
❯ time julia +1.10 --project -e 'using CLI; exit(CLI.command_main())' -- --help  1> /dev/null
julia +1.10 --project -e 'using CLI; exit(CLI.command_main())' -- --help >   0.56s user 0.33s system 360% cpu 0.248 total

~/JuliaTests/CLI
❯ time julia +1.11 --project -e 'using CLI; exit(CLI.command_main())' -- --help  1> /dev/null
julia +1.11 --project -e 'using CLI; exit(CLI.command_main())' -- --help >   1.58s user 0.82s system 203% cpu 1.181 total

KristofferC pushed a commit that referenced this issue Aug 2, 2024
Avoids immediately successive stat calls for the same path during
startup & loading.

According to MacOS Instruments this reduces `stat64` calls during
`--start=no -e "using Pkg"` from 844 to 672.

On MacOS I don't see a speed improvement, but on WSL2 @timholy reported
the test from #55171 sees a
modest improvement.

This PR (10 iterations)
```
tim@diva:~/.julia/bin$ time for i in {1..10}; do ./cli --help &> /dev/null; done

real    0m2.999s
user    0m3.547s
sys     0m6.552s
```

master
```
real    0m3.217s
user    0m3.794s
sys     0m6.700s
```
1.11-rc2:
```
real    0m3.746s
user    0m4.169s
sys     0m6.755s
```

I left the `@debug`s in as they might be useful for similar
investigations.
lazarusA pushed a commit to lazarusA/julia that referenced this issue Aug 17, 2024
Avoids immediately successive stat calls for the same path during
startup & loading.

According to MacOS Instruments this reduces `stat64` calls during
`--start=no -e "using Pkg"` from 844 to 672.

On MacOS I don't see a speed improvement, but on WSL2 @timholy reported
the test from JuliaLang#55171 sees a
modest improvement.

This PR (10 iterations)
```
tim@diva:~/.julia/bin$ time for i in {1..10}; do ./cli --help &> /dev/null; done

real    0m2.999s
user    0m3.547s
sys     0m6.552s
```

master
```
real    0m3.217s
user    0m3.794s
sys     0m6.700s
```
1.11-rc2:
```
real    0m3.746s
user    0m4.169s
sys     0m6.755s
```

I left the `@debug`s in as they might be useful for similar
investigations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
system:wsl Windows Subsystem Linux (WSL)
Projects
None yet
Development

No branches or pull requests

6 participants