build: use enry to detect low priority files #829

Merged: 3 commits, Sep 18, 2024

Conversation

keegancsmith
Member

This is a much more robust detection mechanism. Additionally, enry exposes these signals, which we could also add in:

func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool

My main concern with this change is that content-based generated-file detection could use extra RAM or CPU. I will monitor the impact via pprof in production.

Test Plan: go test.
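
As a rough illustration of how such signals could be combined, here is a minimal sketch. It assumes enry-style predicates with the shapes `func(path string) bool` and, for generated-file detection, `func(path string, content []byte) bool`. The `classifier` type, the `lowPriority` method, and the stub rules are hypothetical and exist only for this example; they are not enry's rules or zoekt's actual code. The path-only checks run first so that content is inspected only when necessary, which is the part the RAM/CPU concern above is about.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// classifier groups the per-file signals. In real use these fields would be
// filled with library functions; plain function values keep the sketch free
// of external dependencies. (Hypothetical type, not part of zoekt or enry.)
type classifier struct {
	isVendor    func(path string) bool
	isTest      func(path string) bool
	isGenerated func(path string, content []byte) bool
}

// lowPriority reports whether a file should rank below regular source.
// Cheap path-only checks run before the content-based check.
func (c classifier) lowPriority(path string, content []byte) bool {
	if c.isVendor(path) || c.isTest(path) {
		return true
	}
	return c.isGenerated(path, content)
}

// stubClassifier returns toy signals for demonstration only.
// They are NOT enry's actual detection rules.
func stubClassifier() classifier {
	return classifier{
		isVendor: func(p string) bool { return strings.HasPrefix(p, "vendor/") },
		isTest:   func(p string) bool { return strings.HasSuffix(p, "_test.go") },
		isGenerated: func(_ string, content []byte) bool {
			return bytes.Contains(content, []byte("Code generated"))
		},
	}
}

func main() {
	c := stubClassifier()
	fmt.Println(c.lowPriority("vendor/lib/a.go", nil))
	fmt.Println(c.lowPriority("cmd/main.go", []byte("package main\n")))
	fmt.Println(c.lowPriority("zsyscall.go", []byte("// Code generated by tool. DO NOT EDIT.\n")))
}
```

Passing the predicates in as values also makes it easy to add further signals later (for example configuration or documentation checks) without changing the call site.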

@keegancsmith keegancsmith requested a review from a team September 17, 2024 11:42
@cla-bot cla-bot bot added the cla-signed label Sep 17, 2024
These all look like improvements. We over-matched on "test" before, so
results now include test framework code, which is interesting. We also
now detect that a file is auto-generated (the syscalls one).
Comment on lines +35 to +39
github.com/sourcegraph/sourcegraph-public-snapshot/cmd/frontend/graphqlbackend/testing.go
46:type Test struct {
79:func RunTest(t *testing.T, test *Test) {
58:func RunTests(t *testing.T, tests []*Test) {
hidden 27 more line matches
Member

This result seems worse, no?

Member Author

I don't think so? The old result has nothing to do with test, and the new result has nothing to do with server. It makes sense that this result ends up higher since it has more exported symbols matching a token?
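
To illustrate the kind of over-matching mentioned in the commit message, here is a small sketch. It assumes the earlier heuristic behaved like a plain substring match on "test", which is an assumption made for this example. `strictTest` is a hypothetical stricter rule and is not the rule enry applies.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// substringTest is an assumed stand-in for a loose heuristic: any path
// containing "test" is treated as a test file.
func substringTest(p string) bool {
	return strings.Contains(strings.ToLower(p), "test")
}

// strictTest is a hypothetical stricter rule: Go test files by suffix, or a
// directory segment named exactly "test" or "tests".
func strictTest(p string) bool {
	if strings.HasSuffix(p, "_test.go") {
		return true
	}
	for _, seg := range strings.Split(path.Dir(p), "/") {
		if seg == "test" || seg == "tests" {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"cmd/frontend/graphqlbackend/testing.go", // test framework code
		"build/builder_test.go",                  // an actual test file
	} {
		fmt.Printf("%s substring=%v strict=%v\n", p, substringTest(p), strictTest(p))
	}
}
```

Under the loose rule `testing.go` is down-ranked as a test file; under the stricter rule it competes as regular source, which is consistent with it now appearing higher in the results.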

build/builder_test.go (resolved)
@keegancsmith keegancsmith merged commit a8d7c8b into main Sep 18, 2024
8 checks passed
@keegancsmith keegancsmith deleted the k/use-enry-classification branch September 18, 2024 09:07
Member

@jtibshirani jtibshirani left a comment

Nice! I'm also really curious how this affects indexing latency and memory usage. Did you try to index a repo locally, for example sgtest/megarepo? I've found that helpful in catching indexing latency regressions.

@keegancsmith
Member Author

I did not! I confess to being lazy here since I don't have that repo cloned on my laptop (and my desktop is waiting for a new hard drive to be delivered). To be honest, I thought this was low risk enough to monitor the impact on dotcom, but I realise that may not actually be as clear given you are also shipping memory use improvements. I'll do a quick eval on the sg repo.

@keegancsmith
Member Author

Ran on my m2 laptop. Changes in memory are insignificant. Change in runtime is 5%.

hyperfine -m 5 --show-output '/usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph' '/usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph'

Summary
 /usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph ran
   1.05 ± 0.01 times faster than /usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph
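
For reference, the 1.05 ± 0.01 figure can be reproduced from the mean ± σ values reported in the full output (9.319 s ± 0.029 s for old, 9.781 s ± 0.044 s for new). The sketch below assumes first-order error propagation for a ratio of two independent means; the `slowdown` helper is hypothetical and only serves this example.

```go
package main

import (
	"fmt"
	"math"
)

// slowdown returns newMean/oldMean and the standard deviation of that ratio
// using first-order error propagation for independent measurements.
func slowdown(oldMean, oldSD, newMean, newSD float64) (ratio, sd float64) {
	ratio = newMean / oldMean
	// Relative errors add in quadrature for a quotient.
	rel := math.Hypot(oldSD/oldMean, newSD/newMean)
	return ratio, ratio * rel
}

func main() {
	ratio, sd := slowdown(9.319, 0.029, 9.781, 0.044)
	fmt.Printf("%.2f ± %.2f\n", ratio, sd) // prints "1.05 ± 0.01"
}
```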

maximum resident set size (bytes, runs sorted ascending)

| old | new |
| --- | --- |
| 971014144 | 979779584 |
| 979140608 | 982417408 |
| 990347264 | 984399872 |
| 997048320 | 985169920 |
| 1002274816 | 995901440 |

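peak memory footprint (bytes, runs sorted ascending; "new" values taken from the full output below)

| old | new |
| --- | --- |
| 955632896 | 964709696 |
| 963743040 | 967462208 |
| 974851392 | 969313600 |
| 981880128 | 970001728 |
| 987204992 | 980913536 |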
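
To put a number on "insignificant", the maximum resident set size samples above can be compared by median. This is a minimal sketch; the `median` and `relChangePct` helpers are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Maximum resident set size per run, in bytes, copied from the table above.
var (
	oldRSS = []int64{971014144, 979140608, 990347264, 997048320, 1002274816}
	newRSS = []int64{979779584, 982417408, 984399872, 985169920, 995901440}
)

// median returns the middle value of xs without modifying the input.
// For an even count it returns the mean of the two middle values.
func median(xs []int64) int64 {
	s := append([]int64(nil), xs...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// relChangePct returns the change from before to after as a percentage.
func relChangePct(before, after int64) float64 {
	return 100 * float64(after-before) / float64(before)
}

func main() {
	mo, mn := median(oldRSS), median(newRSS)
	fmt.Printf("median old=%d new=%d change=%.2f%%\n", mo, mn, relChangePct(mo, mn))
}
```

With these five runs per binary, the medians differ by well under 1%, which is smaller than the spread between runs of the same binary.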

Full output

```shellsession
❯ hyperfine -m 5 --show-output '/usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph' '/usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph'
Benchmark 1: /usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph
2024/09/19 14:15:42 attempting to index 14705 total files
2024/09/19 14:15:45 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:15:51 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.28 real 9.64 user 0.47 sys
979140608 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60781 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
708 signals received
136 voluntary context switches
2905 involuntary context switches
156095300102 instructions retired
34888896521 cycles elapsed
963743040 peak memory footprint
2024/09/19 14:15:51 attempting to index 14705 total files
2024/09/19 14:15:54 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:00 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.35 real 9.70 user 0.47 sys
997048320 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61891 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
756 signals received
183 voluntary context switches
3334 involuntary context switches
156057166866 instructions retired
35090865584 cycles elapsed
981880128 peak memory footprint
2024/09/19 14:16:00 attempting to index 14705 total files
2024/09/19 14:16:04 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:09 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
1002274816 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
62202 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
758 signals received
186 voluntary context switches
3164 involuntary context switches
156249307615 instructions retired
34954748988 cycles elapsed
987204992 peak memory footprint
2024/09/19 14:16:10 attempting to index 14705 total files
2024/09/19 14:16:13 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:19 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.33 real 9.71 user 0.46 sys
971014144 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60276 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
770 signals received
172 voluntary context switches
3122 involuntary context switches
156030424308 instructions retired
35080996626 cycles elapsed
955632896 peak memory footprint
2024/09/19 14:16:19 attempting to index 14705 total files
2024/09/19 14:16:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:28 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
990347264 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61457 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
728 signals received
182 voluntary context switches
3127 involuntary context switches
156072501995 instructions retired
34863512990 cycles elapsed
974851392 peak memory footprint
Time (mean ± σ): 9.319 s ± 0.029 s [User: 9.680 s, System: 0.470 s]
Range (min … max): 9.289 s … 9.359 s 5 runs

Benchmark 2: /usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph
2024/09/19 14:16:28 attempting to index 14705 total files
2024/09/19 14:16:32 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23959425 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:38 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284637856 index bytes (overhead 2.7), 13142 files processed
9.70 real 10.14 user 0.47 sys
982417408 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60994 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
838 signals received
211 voluntary context switches
3265 involuntary context switches
165338717104 instructions retired
36581432105 cycles elapsed
967462208 peak memory footprint
2024/09/19 14:16:38 attempting to index 14705 total files
2024/09/19 14:16:41 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23959425 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:48 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284637856 index bytes (overhead 2.7), 13142 files processed
9.76 real 10.14 user 0.47 sys
979779584 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60822 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
753 signals received
141 voluntary context switches
3011 involuntary context switches
165344725526 instructions retired
36467164510 cycles elapsed
964709696 peak memory footprint
2024/09/19 14:16:48 attempting to index 14705 total files
2024/09/19 14:16:51 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23959425 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:57 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284637856 index bytes (overhead 2.7), 13142 files processed
9.80 real 10.21 user 0.46 sys
984399872 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61105 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
815 signals received
202 voluntary context switches
3562 involuntary context switches
165266447587 instructions retired
36582126526 cycles elapsed
969313600 peak memory footprint
2024/09/19 14:16:58 attempting to index 14705 total files
2024/09/19 14:17:01 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23959425 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:17:07 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284637856 index bytes (overhead 2.7), 13142 files processed
9.81 real 10.21 user 0.46 sys
995901440 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61799 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
786 signals received
244 voluntary context switches
3414 involuntary context switches
165275906471 instructions retired
36506990752 cycles elapsed
980913536 peak memory footprint
2024/09/19 14:17:07 attempting to index 14705 total files
2024/09/19 14:17:11 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23959425 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:17:17 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284637856 index bytes (overhead 2.7), 13142 files processed
9.80 real 10.21 user 0.47 sys
985169920 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61139 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
785 signals received
221 voluntary context switches
3251 involuntary context switches
165226972397 instructions retired
36434857052 cycles elapsed
970001728 peak memory footprint
Time (mean ± σ): 9.781 s ± 0.044 s [User: 10.187 s, System: 0.472 s]
Range (min … max): 9.708 s … 9.818 s 5 runs

Summary
/usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph ran
1.05 ± 0.01 times faster than /usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph

```

@jtibshirani
Member

> Ran on my m2 laptop. Changes in memory are insignificant. Change in runtime is 5%.

Thanks for testing this! The 5% latency increase is not ideal but seems fine.

@keegancsmith
Member Author

> > Ran on my m2 laptop. Changes in memory are insignificant. Change in runtime is 5%.
>
> Thanks for testing this! The 5% latency increase is not ideal but seems fine.

Note: I did this indexing with symbols turned off, so we skipped ctags. I suspect with ctags on the change in runtime would not be noticeable.
