Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Symbol search: support content-based lang detection #60626

Merged
merged 6 commits into from
Feb 21, 2024

Conversation

jtibshirani
Copy link
Member

The symbol service already stores each file's detected language. Now, when the
'search-content-based-lang-detection' feature is enabled, the symbols service
filters directly on the language column.

This PR doesn't update Rockskip, since it doesn't store the detected language.

Relates to https://github.com/sourcegraph/sourcegraph/issues/60341

Test plan

Added unit tests, manual testing

@cla-bot cla-bot bot added the cla-signed label Feb 19, 2024
@jtibshirani jtibshirani marked this pull request as ready for review February 20, 2024 18:48
@jtibshirani jtibshirani requested review from varungandhi-src and a team February 20, 2024 23:31
@@ -120,6 +127,10 @@ func makeSearchCondition(column string, regex string, isCaseSensitive bool) *sql
return sqlf.Sprintf(column+" REGEXP %s", regex)
}

func makeLangCondition(lang string) *sqlf.Query {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to be aware of: the languages stored here are produced by ctags, whereas the lang filters are expected to be normalized languages from go-enry. I looked through universal-ctags --list-languages and they seem to match up well, so I don't expect this to be a problem in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong suggestion: Please let's use different types here EnryLanguage and CtagsLanguage. These kinds of problems come up in edge cases. For example,

  • CUDA is all upper-case in ctags but Cuda in enry.
  • COBOL is uppercase in enry but Cobol in ctags.
  • MatLab in ctags is MATLAB in enry.
  • Protobuf in ctags is Protocol Buffers in enry
  • ctags has both Perl6 and Raku (which are the same language IIUC) whereas enry only has Perl 6
  • ObjectiveC in ctags is Objective-C in enry
  • Asm in ctags is Assembly in enry (there's also Unix Assembly)
  • ctags has Zsh and Sh whereas enry has Shell.

Some of these are fixable with the case conversion below. But some are not.

I'm not expecting this PR to have a perfect conversion function. But it seems conceptually wrong to conflate these two.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, there are indeed more exceptions than I realized. To check I understand your suggestion -- we should have an explicit conversion from go-enry language to ctags language, and update naming to make this distinction clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check I understand your suggestion -- we should have an explicit conversion from go-enry language to ctags language, and update naming to make this distinction clear.

Yes, and specifically for naming, I'm emphasizing that we should have different types (not just variable names) for these two, and a conversion function from one type to another.

We have something similar in the syntax highlighter, albeit more complicated, to translate Sublime grammar names to Tree-sitter language names.

https://sourcegraph.com/github.com/sourcegraph/sourcegraph@8379154d11d98bae47f5a7bb17336485e8b93ecb/-/blob/docker-images/syntax-highlighter/crates/syntax-analysis/src/highlighting.rs?L163-201

Right now, the langInclude and langExclude in various places are just strings instead of being domain-specific types, and that's understandable, but changing that to be a more specific type would be valuable. (However, I'm not suggesting you do that in this PR specifically.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that matches my understanding -- I will punt on any new types for this PR.

Repo: api.RepoName(repo),
CommitID: api.CommitID(commit),
Query: "",
IncludeLangs: []string{"Text"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this patch was additive, so I don't quite understand why this part of the test needs an update, since the "Text" string doesn't appear elsewhere in this patch. Is this test going to continue to pass even if []string{"Text"} is omitted/did you add this to make sure we're exercising the 'has language filter' code path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I did this too quickly! I just pushed a correction. Here I added a lang filter on "text" to ensure the new codepath in internal/rockskip/search.go is exercised. If we remove that logic, this test will fail.

Overall, the rockskip tests are really lacking. If we make further changes, we will invest in new unit testing suites.

Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the comment on differences between enry and ctags language names.

cmd/symbols/internal/api/search_sqlite.go Outdated Show resolved Hide resolved
@jtibshirani
Copy link
Member Author

Thanks for the reviews! I realize this isn't the most high quality + complete solution, but just putting in "best effort" here until we can really invest in completing the feature.

@jtibshirani jtibshirani merged commit c168064 into main Feb 21, 2024
13 checks passed
@jtibshirani jtibshirani deleted the jtibs/symbol-search branch February 21, 2024 18:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants