Don't truncate file before detecting language #740

jtibshirani · 2024-02-09T21:18:00Z

Currently, we truncate a file's contents to 2048 bytes before passing it to
go-enry. I ran into a few cases where this causes us to misclassify
files.

This PR removes the truncation. It should still be fine in terms of
performance, since go-enry is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.

jtibshirani · 2024-02-09T21:20:07Z

I tried indexing sgtest/megarepo, and did not detect a meaningful difference in latency:

go run ./cmd/zoekt-index -parallelism 1 ../megarepo

Before: 7 min, 4 sec
After: 7 min, 14 sec

keegancsmith

Nice idea comparing latency on megarepo.

Don't truncate file before detecting language

d6d1a6d

jtibshirani mentioned this pull request Feb 9, 2024

☂️ Search: solidify content-based language filtering sourcegraph/sourcegraph-public-snapshot#60341

Closed

5 tasks

keegancsmith approved these changes Feb 11, 2024

jtibshirani merged commit 1c158f9 into main Feb 12, 2024
8 checks passed

jtibshirani deleted the jtibs/lang-detection branch February 12, 2024 02:11
