This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

☂️ Search: improve Zoekt indexing #58133

Closed
14 tasks done
jtibshirani opened this issue Nov 6, 2023 · 7 comments

@jtibshirani
Member

jtibshirani commented Nov 6, 2023

Zoekt can sometimes fail to index large repos because of timeouts or memory issues. This can result in missing or out-of-date search results. There’s also little visibility into the indexing process: we don't report progress or surface errors clearly, and we don't have good observability tools for debugging problems. This issue tracks a round of improvements we want to make to search indexing.

Indexing performance

Indexing observability

Squash bugs

/cc @sourcegraph/search-platform

@jtibshirani jtibshirani added tracking team/search-platform Issues owned by the search platform team labels Nov 6, 2023
@jtibshirani jtibshirani self-assigned this Nov 6, 2023
@keegancsmith
Member

meta: Nice. Seeing this makes me also want to start using tracking issues for sprints of work.

@jtibshirani
Member Author

jtibshirani commented Nov 11, 2023

Here are some profiling results for sgtest/megarepo, with universal-ctags enabled.

CPU

Showing top 10 nodes out of 177
      flat  flat%   sum%        cum   cum%
    75.64s 26.20% 26.20%     75.66s 26.21%  syscall.syscall
    41.31s 14.31% 40.51%     49.06s 16.99%  runtime.mapassign_fast64
    24.50s  8.49% 49.00%    126.63s 43.86%  github.com/sourcegraph/zoekt.(*postingsBuilder).newSearchableString
    21.91s  7.59% 56.58%     23.89s  8.28%  runtime.mapaccess1_fast64
    14.88s  5.15% 61.74%     14.88s  5.15%  runtime.madvise
    14.16s  4.90% 66.64%     14.19s  4.92%  encoding/binary.PutUvarint (inline)
    13.07s  4.53% 71.17%     13.07s  4.53%  runtime.memclrNoHeapPointers
     9.25s  3.20% 74.37%      9.26s  3.21%  runtime.kevent
     9.17s  3.18% 77.55%      9.17s  3.18%  runtime.pthread_cond_wait
     7.48s  2.59% 80.14%      7.48s  2.59%  runtime.memmove

Takeaways:

  • Spend a lot of time in map operations from (1) checking the max trigram limit and (2) creating postings lists
  • Spend a fair amount of time in memory management / GC
  • Biggest win would be improving ctags latency (a big contributor to syscall.syscall time above)

Memory allocations

Showing top 10 nodes out of 78
      flat  flat%   sum%        cum   cum%
21454.52MB 34.27% 34.27% 21454.52MB 34.27%  github.com/sourcegraph/zoekt.(*postingsBuilder).newSearchableString
 9997.24MB 15.97% 50.24%  9997.24MB 15.97%  github.com/go-git/go-git/v5/plumbing.(*MemoryObject).Write
 9741.64MB 15.56% 65.80%  9741.64MB 15.56%  github.com/sourcegraph/zoekt.CheckText
 6100.36MB  9.74% 75.54%  6100.36MB  9.74%  bytes.growSlice
 3987.56MB  6.37% 81.91%  7250.83MB 11.58%  github.com/sourcegraph/go-ctags.(*ctagsProcess).Parse
 1743.12MB  2.78% 84.70%  1743.12MB  2.78%  bufio.NewReaderSize
 1241.67MB  1.98% 86.68%  3233.26MB  5.16%  encoding/json.Unmarshal
 1168.06MB  1.87% 88.55%  1346.07MB  2.15%  encoding/json.(*decodeState).literalStore
  722.17MB  1.15% 89.70%   733.94MB  1.17%  github.com/sourcegraph/zoekt/build.(*tagsToSections).Convert
  719.29MB  1.15% 90.85%   765.56MB  1.22%  github.com/sourcegraph/zoekt.(*IndexBuilder).addSymbols

Takeaways:

  • Building postings (naturally) allocates a lot
  • Checking the max trigram limit allocates a lot, seems like low-hanging fruit
  • go-git contributes a lot to allocations and is maybe not super efficient
  • ctags parsing also contributes (JSON decoding, parsing the output, adding symbols)
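To illustrate why a trigram-limit check can be allocation-heavy, and one way it might be made cheaper, here is a minimal sketch. It is not Zoekt's actual `CheckText`: the helper name `exceedsTrigramLimit`, the byte-based trigrams packed into a `uint32`, and the caller-supplied reusable map are all assumptions made for illustration.

```go
package main

import "fmt"

// exceedsTrigramLimit reports whether content contains more than maxTrigrams
// distinct byte trigrams. Hypothetical helper for illustration only; it is
// not the check Zoekt performs.
//
// The caller passes in a map that is cleared and reused, so repeated calls
// do not allocate a fresh map for every document.
func exceedsTrigramLimit(content []byte, maxTrigrams int, seen map[uint32]struct{}) bool {
	// Clear the reused map (the Go compiler optimizes this loop).
	for k := range seen {
		delete(seen, k)
	}
	for i := 0; i+3 <= len(content); i++ {
		// Pack three bytes into one integer key.
		key := uint32(content[i])<<16 | uint32(content[i+1])<<8 | uint32(content[i+2])
		seen[key] = struct{}{}
		if len(seen) > maxTrigrams {
			// Stop as soon as the limit is exceeded instead of
			// scanning the rest of the document.
			return true
		}
	}
	return false
}

func main() {
	seen := make(map[uint32]struct{})
	// "abcabc" has three distinct trigrams: abc, bca, cab.
	fmt.Println(exceedsTrigramLimit([]byte("abcabc"), 2, seen))
	fmt.Println(exceedsTrigramLimit([]byte("abcabc"), 3, seen))
}
```

The early exit bounds the map size at `maxTrigrams + 1` entries, and reusing the map across documents trades a small clearing cost for fewer allocations. Whether either change matters in Zoekt would need to be confirmed with a new profile.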

Peak memory usage
I found a few bugs here that caused high memory consumption. I'll add a profile once those are fixed.

@jtibshirani
Member Author

Here's a profile of memory usage after fixing some obvious issues (taken right after we finish building the 10th shard out of ~20).

Peak memory usage

Showing top 10 nodes out of 69
      flat  flat%   sum%        cum   cum%
  733.14MB 33.18% 33.18%   733.14MB 33.18%  bytes.growSlice
     530MB 23.99% 57.17%      530MB 23.99%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
  303.30MB 13.73% 70.90%   303.30MB 13.73%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames
  146.63MB  6.64% 77.53%   146.63MB  6.64%  github.com/sourcegraph/zoekt.(*postingsBuilder).newSearchableString
  105.44MB  4.77% 82.30%   105.44MB  4.77%  github.com/go-git/go-git/v5/plumbing.(*MemoryObject).Write
   91.88MB  4.16% 86.46%   653.84MB 29.59%  github.com/sourcegraph/zoekt/gitindex.prepareNormalBuild
   65.43MB  2.96% 89.42%    65.43MB  2.96%  github.com/sourcegraph/zoekt/build.(*tagsToSections).Convert
   58.84MB  2.66% 92.09%    58.84MB  2.66%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readOffsets
   56.35MB  2.55% 94.64%  1930.45MB 87.37%  github.com/sourcegraph/zoekt/gitindex.indexGitRepo
   43.50MB  1.97% 96.61%    43.50MB  1.97%  encoding/json.(*decodeState).literalStore

Takeaways:

  • Top consumer is bytes.growSlice, which is the buffered index docs ... nothing too concerning here
  • go-git takes a surprising amount of memory on top of this
  • Shard postings contribute some too ... not concerning

@keegancsmith
Member

If you want to experiment with removing go-git, or at least avoiding it for the heavy lifting, you can see a few experiments I did in sourcegraph/zoekt#424. That was me, a while ago, experimenting with ideas for getting data off gitserver more efficiently for searching/indexing.

@jtibshirani
Member Author

Documenting the results of profiling universal-ctags versus scip-ctags on sgtest/megarepo.

Peak memory usage
universal-ctags process: ~85MB
scip-ctags process: ~134MB

Processing time
universal-ctags: took 4 min 25 sec to index repo
scip-ctags: took 4 min 48 sec to index repo

Takeaway: currently, the main benefit of scip-ctags is its superior symbol quality, not its resource usage

@jtibshirani
Member Author

There is definitely more we can do here, but I'm closing this out as a "completed" round of work. Highlights:

Will file follow-up issues about better observability in case of OOMs and about trying GOMEMLIMIT.
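For reference, a minimal sketch of what trying GOMEMLIMIT could look like from inside a Go process. `debug.SetMemoryLimit` is the programmatic equivalent of the `GOMEMLIMIT` environment variable (Go 1.19+). The helper `setSoftLimit` and the choice of a fraction of the container budget are assumptions for illustration; this thread does not establish whether or how Zoekt would set it. Note that the limit is a soft limit on memory managed by the Go runtime of that one process, so it would not cover separate ctags processes.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setSoftLimit sets the Go runtime's soft memory limit to a fraction of the
// memory budget and returns the limit it applied, in bytes.
// Hypothetical helper: the budget and fraction are illustrative inputs.
func setSoftLimit(budgetBytes int64, fraction float64) int64 {
	limit := int64(float64(budgetBytes) * fraction)
	debug.SetMemoryLimit(limit)
	return limit
}

// currentLimit returns the limit currently in effect. A negative argument
// to SetMemoryLimit queries the limit without changing it.
func currentLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	// Assume a 4GiB budget and leave half of it for non-Go memory.
	applied := setSoftLimit(4<<30, 0.5)
	fmt.Println("applied soft limit (bytes):", applied)
	fmt.Println("limit in effect (bytes):", currentLimit())
}
```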

@jtibshirani
Member Author

jtibshirani commented Jan 30, 2024

Here's a rough formula for calculating the peak memory usage of Zoekt indexserver:

  • Indexing process
    • File buffer: 100MB * (num_threads + 1)
    • Postings: 100MB * num_threads ... postings are a form of compression and are rarely larger than the original files
    • In-memory git objects from gogit: < 1GB ... empirical guess (see above benchmark)
  • Ctags processes: < 200MB * num_threads ... empirical guess (see above benchmark)
  • Indexserver itself: < 1GB ... conservative upper bound based on the work it performs

Total: ~400MB * num_threads + 2GB
