build: use enry to detect low priority files #829
Conversation
This is a much more robust detection mechanism. Additionally we have these signals we can also add in:

func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool

My main concern with this change is generated file detection on content using up RAM or CPU. Will monitor this impact on pprof in production.

Test Plan: go test.
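For context, here is a minimal sketch of how these go-enry signals could be combined into a single low-priority check, assuming go-enry v2. The helper name `isLowPriority`, the particular combination of signals, and the example file names are illustrative assumptions, not the PR's actual code; `IsGenerated` is the content-based check mentioned as the RAM/CPU concern.

```go
package main

import (
	"fmt"

	enry "github.com/go-enry/go-enry/v2"
)

// isLowPriority is a hypothetical helper combining enry's path-based signals
// with its content-based generated-file detection. The path checks are cheap;
// IsGenerated inspects the file content, which is the RAM/CPU concern noted above.
func isLowPriority(path string, content []byte) bool {
	return enry.IsVendor(path) ||
		enry.IsTest(path) ||
		enry.IsConfiguration(path) ||
		enry.IsDocumentation(path) ||
		enry.IsDotFile(path) ||
		enry.IsImage(path) ||
		enry.IsGenerated(path, content)
}

func main() {
	// Path-only signal: no content needed.
	fmt.Println(isLowPriority("docs/install.md", nil))

	// Content-based signal: generated-file detection looks at the file body
	// (e.g. a "Code generated ... DO NOT EDIT." header).
	fmt.Println(isLowPriority("zsyscall_windows.go",
		[]byte("// Code generated by 'go generate'; DO NOT EDIT.\n")))
}
```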
These all look like improvements. We overly matched on test before, so we now include test framework code, which is interesting. And we now detect that a file is auto-generated (the syscalls one).
github.com/sourcegraph/sourcegraph-public-snapshot/cmd/frontend/graphqlbackend/testing.go
46:type Test struct {
79:func RunTest(t *testing.T, test *Test) {
58:func RunTests(t *testing.T, tests []*Test) {
hidden 27 more line matches
This result seems worse, no?
I don't think so? The old result has nothing to do with test, and the new result has nothing to do with server. It makes sense that this result ends up higher since it has more exported symbols matching a token?
Nice! I'm also really curious how this affects indexing latency and memory usage. Did you try to index a repo locally, for example sgtest/megarepo? I've found that helpful in catching indexing latency regressions.
I did not! I confess to being lazy here since I don't have that repo cloned on my laptop (and my desktop is waiting for a new hard drive to be delivered). To be honest I thought this was low risk enough to monitor the impact on dotcom, but I realise that may actually not be as clear given you are also shipping memory use improvements. I'll do a quick eval on the sg repo.
Ran on my M2 laptop. Changes in memory are insignificant. Change in runtime is 5%.
(Collapsed summary: maximum resident set size and peak memory footprint, old vs. new.)

Full output:
```shellsession
❯ hyperfine -m 5 --show-output '/usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph' '/usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph'
Benchmark 1: /usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph
2024/09/19 14:15:42 attempting to index 14705 total files
2024/09/19 14:15:45 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:15:51 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.28 real 9.64 user 0.47 sys
979140608 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60781 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
708 signals received
136 voluntary context switches
2905 involuntary context switches
156095300102 instructions retired
34888896521 cycles elapsed
963743040 peak memory footprint
2024/09/19 14:15:51 attempting to index 14705 total files
2024/09/19 14:15:54 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:00 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.35 real 9.70 user 0.47 sys
997048320 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61891 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
756 signals received
183 voluntary context switches
3334 involuntary context switches
156057166866 instructions retired
35090865584 cycles elapsed
981880128 peak memory footprint
2024/09/19 14:16:00 attempting to index 14705 total files
2024/09/19 14:16:04 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:09 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
1002274816 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
62202 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
758 signals received
186 voluntary context switches
3164 involuntary context switches
156249307615 instructions retired
34954748988 cycles elapsed
987204992 peak memory footprint
2024/09/19 14:16:10 attempting to index 14705 total files
2024/09/19 14:16:13 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:19 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.33 real 9.71 user 0.46 sys
971014144 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60276 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
770 signals received
172 voluntary context switches
3122 involuntary context switches
156030424308 instructions retired
35080996626 cycles elapsed
955632896 peak memory footprint
2024/09/19 14:16:19 attempting to index 14705 total files
2024/09/19 14:16:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:28 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
990347264 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61457 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
728 signals received
182 voluntary context switches
3127 involuntary context switches
156072501995 instructions retired
34863512990 cycles elapsed
974851392 peak memory footprint
Time (mean ± σ): 9.319 s ± 0.029 s [User: 9.680 s, System: 0.470 s]
Range (min … max): 9.289 s … 9.359 s 5 runs
Benchmark 2: /usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph

Summary
```
Thanks for testing this! The 5% latency increase is not ideal but seems fine.
Note: I did this indexing with symbols turned off, so we skipped ctags. I suspect with ctags on the change in runtime would not be noticeable.