index: experiment to limit ngram lookups for large snippets #795

keegancsmith · 2024-07-26T15:13:31Z

This introduces an experiment that stops looking up ngrams once a configurable limit is reached. The insight is that for large substrings we spend more time finding the ngram with the smallest frequency than a normal search takes. So instead we try to strike a balance between looking for two good ngrams and actually searching the corpus.

The plan is to set different values for SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in Sourcegraph production and see how it affects the performance of the attribution search service.
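To illustrate the idea only, here is a minimal sketch of capping frequency lookups while still picking two ngrams to drive the search. It is not the code in this PR: `lookupLimitFromEnv`, `twoRarest`, and the `freq` callback are hypothetical names, and treating an unset, unparsable, or non-positive value as "no limit" is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupLimitFromEnv reads the experiment's environment variable.
// Assumption: unset, unparsable, or non-positive values mean "no limit" (0).
func lookupLimitFromEnv() int {
	n, err := strconv.Atoi(os.Getenv("SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT"))
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

// twoRarest (hypothetical helper) returns the two lowest-frequency ngrams
// among those it looked up, plus the number of lookups performed.
// With limit > 0 it stops after that many lookups, trading a possibly
// worse choice of ngrams for less time spent on frequency lookups.
func twoRarest(ngrams []string, freq func(string) int, limit int) (first, second string, lookups int) {
	firstFreq, secondFreq := -1, -1 // -1 marks "not set yet"
	for _, ng := range ngrams {
		if limit > 0 && lookups >= limit {
			break
		}
		f := freq(ng)
		lookups++
		switch {
		case firstFreq < 0 || f < firstFreq:
			// New rarest ngram; the previous rarest becomes the runner-up.
			second, secondFreq = first, firstFreq
			first, firstFreq = ng, f
		case secondFreq < 0 || f < secondFreq:
			second, secondFreq = ng, f
		}
	}
	return first, second, lookups
}

func main() {
	// Made-up frequencies, purely for illustration.
	counts := map[string]int{"abc": 50, "bcd": 40, "cde": 3, "def": 1}
	freq := func(s string) int { return counts[s] }
	ngrams := []string{"abc", "bcd", "cde", "def"}

	fmt.Println(twoRarest(ngrams, freq, 0)) // no limit: def cde 4
	fmt.Println(twoRarest(ngrams, freq, 2)) // limit 2:  bcd abc 2
	fmt.Println("limit from env:", lookupLimitFromEnv())
}
```

With the limit set to 2, the sketch settles for the first two ngrams it sees instead of the two rarest ones, which is the trade-off the experiment is meant to measure.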

Test Plan: ran all tests with the envvar set to 2. I expected tests that assert on stats to fail, but everything else to pass. This was the case.

SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...

Related to https://linear.app/sourcegraph/issue/CODY-3029/investigate-performance-of-guardrails-attribution-endpoint

keegancsmith requested review from eseliger and a team July 26, 2024 15:13

cla-bot bot added the cla-signed label Jul 26, 2024

eseliger approved these changes Jul 26, 2024

keegancsmith merged commit 12ce07a into main Jul 26, 2024
9 checks passed

keegancsmith deleted the k/optimization branch July 26, 2024 16:01
