index: experiment to limit ngram lookups for large snippets #795

Merged 1 commit into main on Jul 26, 2024

Conversation

keegancsmith
Member

This introduces an experiment where we can stop looking up ngrams once we hit a certain limit. The insight here is that for large substrings we spend more time finding the smallest ngram frequency than a normal search takes. So instead we try to find a good balance between looking for a good pair of ngrams and actually searching the corpus.

The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in Sourcegraph production and see how it affects the performance of the attribution search service.

Test Plan: ran all tests with the env var set to 2. I expected tests that assert on stats to fail and everything else to pass; this was the case.

SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...

Related to https://linear.app/sourcegraph/issue/CODY-3029/investigate-performance-of-guardrails-attribution-endpoint
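The idea above can be sketched in Go. This is a minimal illustration, not zoekt's actual implementation: the env var name matches the PR, but `leastFrequentNgrams`, the `freq` callback, and the trigram slicing are hypothetical stand-ins for the real index iteration code. It shows the trade-off being made: stop scanning for the globally rarest ngram pair once a fixed number of frequency lookups have been spent, and accept the best pair seen so far.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ngramLookupLimit reads the experiment's env var. A non-positive or unset
// value means "no limit", i.e. the original exhaustive behaviour.
// (Hypothetical helper; the real wiring in zoekt may differ.)
func ngramLookupLimit() int {
	if v, err := strconv.Atoi(os.Getenv("SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT")); err == nil && v > 0 {
		return v
	}
	return -1
}

// leastFrequentNgrams scans the trigrams of query, calling freq for each,
// and returns the two rarest ones seen. With limit >= 0 it stops after that
// many lookups, trading ngram quality for less time spent in the index.
func leastFrequentNgrams(query string, freq func(ngram string) int, limit int) (best, second string) {
	const maxInt = int(^uint(0) >> 1)
	bestN, secondN := maxInt, maxInt
	for i := 0; i+3 <= len(query); i++ {
		if limit == 0 {
			break // experiment: accept a "good enough" pair instead of scanning everything
		}
		limit--
		g := query[i : i+3]
		n := freq(g)
		switch {
		case n < bestN:
			second, secondN = best, bestN
			best, bestN = g, n
		case n < secondN:
			second, secondN = g, n
		}
	}
	return best, second
}

func main() {
	// Toy corpus frequencies: "def" is the rarest trigram, but with a
	// lookup limit of 2 we never reach it and settle for ("bcd", "abc").
	freq := map[string]int{"abc": 100, "bcd": 5, "cde": 50, "def": 1}
	lookup := func(g string) int { return freq[g] }
	b, s := leastFrequentNgrams("abcdef", lookup, 2)
	fmt.Println(b, s) // bcd abc

	_ = ngramLookupLimit() // in real code the limit would come from here
}
```

For a large snippet the query has thousands of trigrams, so the unlimited scan performs thousands of index lookups before searching even starts; capping it bounds that cost at the risk of choosing more frequent (less selective) ngrams.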

@keegancsmith keegancsmith requested review from eseliger and a team July 26, 2024 15:13
@cla-bot cla-bot bot added the cla-signed label Jul 26, 2024
@keegancsmith keegancsmith merged commit 12ce07a into main Jul 26, 2024
9 checks passed
@keegancsmith keegancsmith deleted the k/optimization branch July 26, 2024 16:01
xvandish pushed a commit to xvandish/zoekt that referenced this pull request Dec 24, 2024
…aph#795)