This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

☂️ Search: solidify content-based language filtering #60341

Closed
5 tasks done
jtibshirani opened this issue Feb 8, 2024 · 3 comments
Labels
graph/language-detection Inference of languages from filenames, file contents etc. team/search-platform Issues owned by the search platform team

Comments

@jtibshirani
Member

jtibshirani commented Feb 8, 2024

Currently, we translate lang filters to filters on file extensions. Because multiple languages may share the same extension, this can produce incorrect matches. For example, lang:matlab also matches Objective-C files, since they also end in .m.
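
To make the ambiguity concrete, here is a minimal Python sketch of an extension-only lang filter. The table `EXTENSIONS_BY_LANG` and both function names are hypothetical and heavily abbreviated; they are not Sourcegraph's actual implementation.

```python
import os

# Hypothetical, heavily abbreviated language -> extensions table (assumption).
EXTENSIONS_BY_LANG = {
    "matlab": {".m"},
    "objective-c": {".m", ".h"},
    "c": {".c", ".h"},
    "c++": {".cpp", ".hpp", ".h"},
}


def matches_by_extension(lang, path):
    """Return True if an extension-only lang filter would match this path."""
    ext = os.path.splitext(path)[1].lower()
    return ext in EXTENSIONS_BY_LANG.get(lang.lower(), set())


def candidate_languages(path):
    """List every language in the table that claims this path's extension."""
    ext = os.path.splitext(path)[1].lower()
    return sorted(lang for lang, exts in EXTENSIONS_BY_LANG.items() if ext in exts)
```

With this table, `matches_by_extension("matlab", "AppDelegate.m")` is true even though the file name suggests Objective-C, because the extension alone cannot tell the two apart.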

We have a feature search-content-based-lang-detection that instead matches against the actual language of the file, as determined by go-enry. This issue tracks work to solidify the feature so we're comfortable recommending it to customers.
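
As a rough illustration of content-based matching, the Python sketch below assigns each file one language by looking at its contents and then filters on that label. The detector is a toy stand-in: the regular expression and the names `detect_language` and `filter_by_lang` are assumptions for illustration and do not reproduce go-enry's actual heuristics.

```python
import os
import re

# Toy signals suggesting Objective-C rather than MATLAB (assumption, not go-enry's rules).
OBJC_HINT = re.compile(
    r"^\s*(#import\b|@interface\b|@implementation\b|@end\b)", re.MULTILINE
)

# Extensions treated as unambiguous in this sketch (hypothetical subset).
UNAMBIGUOUS = {".py": "Python", ".go": "Go"}


def detect_language(path, content):
    """Return a single language label for the file, consulting content when needed."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".m":
        return "Objective-C" if OBJC_HINT.search(content) else "MATLAB"
    return UNAMBIGUOUS.get(ext, "Unknown")


def filter_by_lang(files, lang):
    """Keep only the paths whose detected language equals the requested one."""
    return [
        path
        for path, content in files.items()
        if detect_language(path, content).lower() == lang.lower()
    ]
```

Given one Objective-C file and one MATLAB file that both end in .m, `filter_by_lang(files, "matlab")` would return only the MATLAB file in this sketch.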

Note: this feature defaults to off. More work is needed to enable it by default (https://github.com/sourcegraph/sourcegraph/issues/60676).

/cc @sourcegraph/search-platform

@jtibshirani
Member Author

jtibshirani commented Feb 9, 2024

I'd appreciate your thoughts on the new behavior for lang:cpp.

  • Before: we always include files ending in .cpp, .hpp, and .h. This includes C files that end in .h, which can be confusing (since you see both "C++" and "C" listed in the filters sidebar).
  • Now: we always return .cpp and .hpp because these are unambiguous. We return .h files only if they resemble C++, for example if they include something from the C++ standard library or use a namespace.
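
The new rule for lang:cpp can be sketched roughly as follows in Python. The specific patterns in `CPP_HINTS` are assumptions chosen to mirror the two signals mentioned above (a C++ standard library include, a namespace); the real classification is done by go-enry and is more involved.

```python
import os
import re

# Extensions that are always treated as C++ in this sketch.
ALWAYS_CPP = {".cpp", ".hpp"}

# Hypothetical content signals for "resembles C++" (assumption, not go-enry's rules).
CPP_HINTS = [
    # An include of a C++-only standard header, e.g. #include <vector>
    re.compile(
        r"^\s*#\s*include\s*<(vector|string|map|memory|iostream|algorithm)>",
        re.MULTILINE,
    ),
    # A namespace declaration or a using-directive
    re.compile(r"^\s*(using\s+)?namespace\b", re.MULTILINE),
]


def looks_like_cpp(content):
    """Return True if any C++ hint appears in the file content."""
    return any(pattern.search(content) for pattern in CPP_HINTS)


def matches_lang_cpp(path, content):
    """Unambiguous extensions always match; .h matches only if it resembles C++."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALWAYS_CPP:
        return True
    if ext == ".h":
        return looks_like_cpp(content)
    return False
```

Under these assumptions, a .h file containing only `#include <stdio.h>` and plain function declarations would no longer match lang:cpp, which is exactly the case listed under the downsides.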

Downsides of new behavior:

  • It's not super common, but you can have .h files in a C++ repo that just happen to resemble C. See this example from the open source repo grf-labs/grf.
  • go-enry has some classification errors, where something is clearly C++ but it is labeled as C (see this example from the same repo).

For me the trade-off is acceptable. It feels like a natural mental model for each file to have a single language. And C++ users can always put together a special context like lang:C++ OR lang:C to mirror the old behavior. If it's a big problem, we could also introduce custom overrides, like we do for syntax highlighting. (The linguist library allows custom attributes per repo through .gitattributes, but go-enry doesn't support anything like this yet.)
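
A small Python sketch of the single-language model and the lang:C++ OR lang:C workaround. The file paths and labels in `EXAMPLE_DETECTED` are made up for illustration, and `search_by_langs` is a hypothetical helper, not a Sourcegraph API.

```python
# Hypothetical detection results: every file carries exactly one language label.
EXAMPLE_DETECTED = {
    "src/tree.cpp": "C++",
    "src/tree.h": "C++",
    "src/options.h": "C",  # a header in a C++ repo that happens to resemble C
    "README.md": "Markdown",
}


def search_by_langs(detected, langs):
    """Return paths whose single detected language is one of the requested langs.

    Passing several languages models an OR of lang filters.
    """
    wanted = {lang.lower() for lang in langs}
    return [path for path, lang in detected.items() if lang.lower() in wanted]
```

Here `search_by_langs(EXAMPLE_DETECTED, ["C++"])` omits `src/options.h`, while `search_by_langs(EXAMPLE_DETECTED, ["C++", "C"])` includes it, which approximates the old extension-based behavior.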

@varungandhi-src
Contributor

> It's not super common, but you can have .h files in a C++ repo that just happen to resemble C. See this example from the open source repo grf-labs/grf.

There were 14 files in the results that I'm seeing right now. I've annotated the files for which we could improve the results.

> go-enry has some classification errors, where something is clearly C++ but it labels it as C (see this example from same repo).

Agreed. If go-enry checked for 'extern "C"', that would help fix the results for those files.

> For me the trade-off is acceptable

I agree, it's a big improvement compared to what we have, and we can iterate on further improvements down the line.
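
The 'extern "C"' idea mentioned above could look roughly like this in Python. This is only a sketch of the suggested signal: treating a linkage specification as evidence of C++ is the assumption here, the function name is hypothetical, and per the comment above go-enry does not currently perform such a check.

```python
import re

# Matches a C linkage specification such as: extern "C" {  or  extern "C" int f(void);
EXTERN_C = re.compile(r'\bextern\s*"C"')


def has_extern_c(content):
    """Return True if the content contains an extern "C" linkage specification.

    Assumption: such a header is meant to be compiled as C++ at least some of
    the time, so a classifier could count this as a C++ signal.
    """
    return EXTERN_C.search(content) is not None
```

A header that wraps its declarations in `#ifdef __cplusplus` and `extern "C" {` would be flagged by this check, while a plain C header with no linkage specification would not.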

@jtibshirani
Member Author

We made a nice round of improvements and should feel comfortable recommending the feature to customers who have issues with our current lang filters. However, more work is required to really "complete" the feature and enable it by default. I filed https://github.com/sourcegraph/sourcegraph/issues/60676 to track that work.
