Don't truncate file before detecting language (#740)
Currently, we truncate a file's contents to 2048 bytes before passing them to
`go-enry`. I ran into a few cases where this truncation causes us to misclassify
files.

This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
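To make the failure mode concrete, here is a minimal sketch of how a 2048-byte cutoff can hide the one line that identifies a file. It uses only the standard library. `looksLikeObjC` and `sampleHeader` are hypothetical stand-ins invented for this illustration; they are not part of `go-enry` or this repository and do not reflect how `go-enry` classifies content internally.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// truncateLimit mirrors the 2048-byte cutoff that the removed code applied.
const truncateLimit = 2048

// truncate returns at most limit bytes of content, like the removed code did.
func truncate(content []byte, limit int) []byte {
	if len(content) > limit {
		return content[:limit]
	}
	return content
}

// looksLikeObjC is a hypothetical content-based check used only for this
// illustration. It is NOT go-enry's logic; it looks for a single marker.
func looksLikeObjC(content []byte) bool {
	return bytes.Contains(content, []byte("@interface"))
}

// sampleHeader builds a hypothetical header file: a long license comment
// (200 lines of 16 bytes each, 3200 bytes in total) followed by the only
// line that distinguishes the file from plain C.
func sampleHeader() []byte {
	license := strings.Repeat("// license text\n", 200)
	return []byte(license + "@interface Widget : NSObject\n@end\n")
}

func main() {
	content := sampleHeader()
	fmt.Println("full content:    ", looksLikeObjC(content))
	fmt.Println("first 2048 bytes:", looksLikeObjC(truncate(content, truncateLimit)))
}
```

With the full content the marker is found; with only the first 2048 bytes it is not, because the distinguishing line sits after the license block. Any classifier that relies on content can lose signal in the same way when its input is cut short.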
jtibshirani authored Feb 12, 2024
1 parent b227501 commit 1c158f9
Showing 1 changed file with 1 addition and 6 deletions.
7 changes: 1 addition & 6 deletions indexbuilder.go
@@ -397,12 +397,7 @@ func (b *IndexBuilder) addSymbols(symbols []*Symbol) {

 func DetermineLanguageIfUnknown(doc *Document) {
 	if doc.Language == "" {
-		c := doc.Content
-		// classifier is faster on small files without losing much accuracy
-		if len(c) > 2048 {
-			c = c[:2048]
-		}
-		doc.Language = enry.GetLanguage(doc.Name, c)
+		doc.Language = enry.GetLanguage(doc.Name, doc.Content)
 	}
 }
