build: use enry to detect low priority files #829
Conversation
This is a much more robust detection mechanism. Additionally we have these signals we can also add in:

func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool

My main concern with this change is generated file detection on content using up RAM or CPU. Will monitor this impact on pprof in production.

Test Plan: go test.
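For context, here is a minimal sketch of how these go-enry signals could be combined into a single low-priority check, assuming go-enry v2. The helper name `isLowPriority`, the particular combination of signals, and the example file names are illustrative assumptions, not the PR's actual code; `IsGenerated` is the content-based check mentioned as the RAM/CPU concern.

```go
package main

import (
	"fmt"

	enry "github.com/go-enry/go-enry/v2"
)

// isLowPriority is a hypothetical helper combining enry's path-based signals
// with its content-based generated-file detection. The path checks are cheap;
// IsGenerated inspects the file content, which is the RAM/CPU concern noted above.
func isLowPriority(path string, content []byte) bool {
	return enry.IsVendor(path) ||
		enry.IsTest(path) ||
		enry.IsConfiguration(path) ||
		enry.IsDocumentation(path) ||
		enry.IsDotFile(path) ||
		enry.IsImage(path) ||
		enry.IsGenerated(path, content)
}

func main() {
	// Path-only signal: no content needed.
	fmt.Println(isLowPriority("docs/install.md", nil))

	// Content-based signal: generated-file detection looks at the file body
	// (e.g. a "Code generated ... DO NOT EDIT." header).
	fmt.Println(isLowPriority("zsyscall_windows.go",
		[]byte("// Code generated by 'go generate'; DO NOT EDIT.\n")))
}
```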
These all look like improvements. We overly matched on test before, so we now include test framework code, which is interesting. And we now detect that a file is auto-generated (the syscalls one).
github.com/sourcegraph/sourcegraph-public-snapshot/cmd/frontend/graphqlbackend/testing.go
46:type Test struct {
79:func RunTest(t *testing.T, test *Test) {
58:func RunTests(t *testing.T, tests []*Test) {
hidden 27 more line matches
This result seems worse, no?
I don't think so? The old result has nothing to do with test, and the new result has nothing to do with server. It makes sense that this result ends up higher since it has more exported symbols matching a token?
Nice! I'm also really curious how this affects indexing latency and memory usage. Did you try to index a repo locally, for example sgtest/megarepo? I've found that helpful in catching indexing latency regressions.
I did not! I confess to being lazy here since I don't have that repo cloned on my laptop (and my desktop is waiting for a new hard drive to be delivered). To be honest I thought this was low risk enough to monitor the impact on dotcom, but I realise that may actually not be as clear given you are also shipping memory use improvements. I'll do a quick eval on the sg repo.
Ran on my M2 laptop. Changes in memory are insignificant. Change in runtime is 5%.
(Collapsed summary: maximum resident set size and peak memory footprint, old vs. new.)

Full output:
```shellsession
❯ hyperfine -m 5 --show-output '/usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph' '/usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph'
Benchmark 1: /usr/bin/time -al ./zoekt-git-index-old -incremental=false -disable_ctags ../sourcegraph
2024/09/19 14:15:42 attempting to index 14705 total files
2024/09/19 14:15:45 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:15:51 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.28 real 9.64 user 0.47 sys
979140608 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60781 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
708 signals received
136 voluntary context switches
2905 involuntary context switches
156095300102 instructions retired
34888896521 cycles elapsed
963743040 peak memory footprint
2024/09/19 14:15:51 attempting to index 14705 total files
2024/09/19 14:15:54 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:00 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.35 real 9.70 user 0.47 sys
997048320 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61891 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
756 signals received
183 voluntary context switches
3334 involuntary context switches
156057166866 instructions retired
35090865584 cycles elapsed
981880128 peak memory footprint
2024/09/19 14:16:00 attempting to index 14705 total files
2024/09/19 14:16:04 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:09 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
1002274816 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
62202 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
758 signals received
186 voluntary context switches
3164 involuntary context switches
156249307615 instructions retired
34954748988 cycles elapsed
987204992 peak memory footprint
2024/09/19 14:16:10 attempting to index 14705 total files
2024/09/19 14:16:13 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:19 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.33 real 9.71 user 0.46 sys
971014144 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
60276 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
770 signals received
172 voluntary context switches
3122 involuntary context switches
156030424308 instructions retired
35080996626 cycles elapsed
955632896 peak memory footprint
2024/09/19 14:16:19 attempting to index 14705 total files
2024/09/19 14:16:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00001.zoekt: 23980144 index bytes (overhead 2.9), 1563 files processed
2024/09/19 14:16:28 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 284819660 index bytes (overhead 2.7), 13142 files processed
9.30 real 9.66 user 0.46 sys
990347264 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
61457 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
728 signals received
182 voluntary context switches
3127 involuntary context switches
156072501995 instructions retired
34863512990 cycles elapsed
974851392 peak memory footprint
Time (mean ± σ): 9.319 s ± 0.029 s [User: 9.680 s, System: 0.470 s]
Range (min … max): 9.289 s … 9.359 s 5 runs
Benchmark 2: /usr/bin/time -al ./zoekt-git-index-new -incremental=false -disable_ctags ../sourcegraph

Summary
```
Thanks for testing this! The 5% latency increase is not ideal but seems fine.
Note: I did this indexing with symbols turned off, so we skipped ctags. I suspect with ctags on the change in runtime would not be noticeable.