Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement exact-boundary match type #3967

Merged
merged 6 commits into from
Aug 29, 2024
Merged

Implement exact-boundary match type #3967

merged 6 commits into from
Aug 29, 2024

Conversation

junegunn
Copy link
Owner

@junegunn junegunn commented Aug 13, 2024

Close #3963

This implements the exact-boundary match type, which finds exact matches for a search term where both ends are at word boundaries. Exact-boundary term should start with ' and end with '.

  • I specifically didn't choose "TERM" because I often use such patterns to search for string literals in the source code.
    • We couldn't do the same with 'TERM' before, so we're not breaking a use case here.
  • In --exact mode, ' prefix enables fuzzy matching, so exact-boundary term should be in TERM' form.
    • This is not ideal, considering cases where you type it's.
  • Should we allow partial boundary matches? i.e. only one end at a word boundary.

@maxaykin
Copy link

This is not ideal, considering cases where you type it's.

Could you please describe this in more detail? It's not clear for me what is the problem here.

Also I suspect that work of '-suffix in --exact mode is not so obvious from the documentation and, by the way, will the man page be changed as well?

I would like to propose few alternatives:

  • quoting both ends works the same in exact and fuzzy modes (i.e. in --exact mode exact-boundary-match would require quotes from both ends)
  • '-suffix does not require '-prefix (i.e. in exact and fuzzy modes it enables mode exact-boundary-match, if '-prefix is present, it is ignored)
  • the documentation says that '-suffix enables word boundary only when exact matching is in effect in either mode (though, this may be still confusing a bit)

The second option would also allow quicker enabling of exact-boundary-match.

Regarding partial boundary matches, what would it look like? I guess it would be natural if a quote enabled both exactness and word boundary check for that end of term where it is placed.

Everything else looks and works great. I have done a short manual testing of it.

@junegunn
Copy link
Owner Author

junegunn commented Aug 13, 2024

Could you please describe this in more detail?

I was worried about match type being changed multiple times against the user's will.

  • it - exact match
  • it' - exact boundary match
  • it's - exact match

The transition can surprise the user, although if the final query is it's, the result will be the same, so it's probably okay. But the user will no longer be able to match literal it'.

Also I suspect that work of '-suffix in --exact mode is not so obvious from the documentation and, by the way, will the man page be changed as well?

This is just a draft, no documentation, no tests yet.

I would like to propose few alternatives:

The first option is probably the least confusing one, but the users will experience unintended mode changes in --exact mode while typing.

  • 'it - fuzzy match (many results)
  • 'it' - exact boundary match (few results)
  • 'it's - fuzzy match (many results again)

The second option would also allow quicker enabling of exact-boundary-match.

'-prefix syntax was borrowed from Lisp, so users with Lisp experience can see the reasoning behind it. '-suffix enabling boundary match feels a little awkward and arbitrary. On the other hand, I think fully-quoted syntax shows the intention of the user better, i.e. I want 'stricter' matching.

Regarding partial boundary matches, what would it look like?

\bfront, back\b, \bboth\b, à la regular expression.

@maxaykin
Copy link

\bfront, back\b, \bboth\b, à la regular expression.

I don't know if this is really needed but it could be a good option. Though, if implemented it may become a precedent for other requests for regexp-like modifiers. 😉 BTW, I noticed that some people asked for a toggle of case sensitiveness. It is what I would like to have as well (e.g. in the way it is done in vim with modifiers "\C", "\c").

Regarding your implementation of word boundary check (based on bonus), now words cannot include underscore. As far as I understand, this is uncommon (see https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regex). And this is also not what I would expect.

@junegunn
Copy link
Owner Author

junegunn commented Aug 16, 2024

Though, if implemented it may become a precedent for other requests for regexp-like modifiers.

Right. I don't want that.

Regarding your implementation of word boundary check (based on bonus), now words cannot include underscore. As far as I understand, this is uncommon (see https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regex). And this is also not what I would expect.

Thanks for pointing it out. But I don't know, expectations are subjective. When I see src/options_no_pprof.go, underscores do look like word separator, and I would be surprised if I don't get it when I type in 'no', but you may expect different.

Expectations also depend on context. If a project of your interest uses the snake_case convention in its code and file paths, it may be better to not regard _ as a boundary. But some languages use - as a word separator (Lisp, CSS), and I personally work with some projects that use - as a word separator in the file paths. e.g. roles/hadoop/tasks/format-namenode.yml. Should we also not regard - as a boundary for cases like that? Vim provides options like iskeyword to deal with different conventions, but we don't want to go down that rabbit hole. Acknowledging the fact that there's no one-size-fits-all solution, I think this match type should be more inclusive than exclusive. e.g. When you're looking for win, but not _win, you can add !_w to the query to further narrow down the results. 'win' !_w

That said, I'm not going to make the decision right away, I'm going to take some time to use it myself to see how I like it better.

@junegunn
Copy link
Owner Author

After experimenting with this for a while, one thing I can agree is that ..._foo_... should be ranked lower than ...-foo-..., ...[foo]..., ...(foo)..., ... $foo ..., etc. We can still include those underscore-boundary results but lower their ranks.

@junegunn junegunn force-pushed the boundary-query branch 5 times, most recently from d8b702f to a78f152 Compare August 16, 2024 11:40
@maxaykin
Copy link

Expectations also depend on context... there's no one-size-fits-all solution

I understand your point and I agree with it partially but this approach does not solve the problem for me (yes, most of the projects I work with use the snake_case convention, and thus, there are still too many results after entering a query in fzf).

When you're looking for win, but not _win, you can add !_w to the query to further narrow down the results. 'win' !_w

This does not help. Consider the following examples which are quite common, I think (at least, I see them very often and it was one of the reasons to request this feature) :

new_win_id = get_id(win)
win = get_win_by_id(win_id)

The query you suggested will not show them at all while these results may be very important if I search for all places where term 'win' is used. Use of logical OR (i.e. 'win' | !_w) will not help as well since it will include everything else.

Could you please consider possibility of making a configurable list of characters that would be treated as part of words? For example, storing it in an environment variable would allow different setups for projects based on different languages. I believe, it would satisfy most of people.

If not, is there a proper and simple way of hard-coding it in a local copy of fzf repository (I mean something better than what I implemented in my fork)?

Also I noticed something weird in behavior of the current logic: using the above examples as the input data query '_win' shows both lines but queries 'win_' and '_win_' show nothing. This is quite confusing.

@junegunn
Copy link
Owner Author

Have you tried getting used to the change:reload approach? The more I hear from you, the more I feel it's the solution you're looking for; only getting the exact matches using precise regular expressions. fzf is a fuzzy finder and it is expected by its nature that it generates many irrelevant matches. It focuses on implementing a good scoring mechanism so that relevant matches appear first, not on completely excluding less relevant matches from the result.

@maxaykin
Copy link

Have you tried getting used to the change:reload approach?

Yes but typing a regexp every time is inconvenient. So, now this is my backup option. Besides, I would like to have the same way of working with any source of data, not only for what is produced by grep-like tools. And inventing various wrappers/workarounds for different situations is not what anyone would like to do with fzf.

It focuses on implementing a good scoring mechanism so that relevant matches appear first, not on completely excluding less relevant matches from the result.

This contradicts with the behavior of fzf extended search mode, doesn't it? I mean that switches like ^-prefix hide irrelevant results.

@junegunn
Copy link
Owner Author

junegunn commented Aug 17, 2024

Yes but typing a regexp every time is inconvenient.

You can find examples in https://github.com/junegunn/fzf/blob/master/ADVANCED.md where you can toggle between two modes.

And inventing various wrappers/workarounds for different situations

Anything you have in mind in particular?

I mean that switches like ^-prefix hide irrelevant results.

My point is that it's not the main focus of fzf. Yes, you can filter out some stuff using basic prefix, suffix patterns, but that's about it. They are most of the time incapable of completely removing irrelevant matches, and fzf was never designed to do that. If that was the point, fzf would have supported proper regular expressions for precise filtering. If you need it, there are other interactive filter programs that support it, such as peco, but fzf chose not to.

Instead, fzf has focused on improving the scoring mechanism to make more relevant matches appear first. It's an interactive filter program unlike traditional grep, so once you met an irrelevant match, you can stop browsing there. And in this case, I believe fzf should match both x_win_x and x win x, but present the latter first. If you're not interested in _win_ and the likes, you can stop there, but if you're looking for them, you can have them. It meets both expectations nicely.

Also I noticed something weird in behavior of the current logic: using the above examples as the input data query '_win' shows both lines but queries 'win_' and '_win_' show nothing. This is quite confusing.

That is not expected. Let me look into it.

@maxaykin
Copy link

I believe that the only purpose of scoring is to show more suitable results first but all results should be relevant matches. And I don't understand why you say that fzf was not designed to remove irrelevant matches completely. Based on my experience with its modern version, this is what it does most of the time, even if one uses just fuzzy matching mode. Otherwise we would see all lines of source data every time, they would be just sorted in a curtain way.

So, what we are really discussing here is the set of instruments for filtering out irrelevant input data. I see that you personally are satisfied with the current set and trying to avoid its further extending. Ok, you are the author and you decide how to develop and support the tool. I'm not going to continue debating this.

Regarding the suggestions of using regexp (with other tools) for more precise matching, this way looks not so attractive. I really like how it is done in fzf extended mode: you just enter words or their parts that you remember or think of and then you can make just few very quick and simple modifications to reduce number of results if needed. Construction of regular expressions is not what I want to switch back.

@junegunn
Copy link
Owner Author

junegunn commented Aug 19, 2024

but all results should be relevant matches

We seem to have a very different understanding of this program, or maybe we're using the term "relevant" differently.

Fuzzy matching algorithm usually generates many matches that are not "relevant" to the user's intention, but then the scoring algorithm kicks in and tries to present the most likely "relevant" matches at the top of the list.

For example, say I'm looking for a "UserGroupInformation" file under some "security" package whose name I don't exactly remember, I just type in secusergrin. (sec for security, user for user, gr for group, and in for information).

image

This query will generate 1001 matches, but the exact file I want is at the top of the list, I can just press enter. It doesn't matter at all to me that there are 1000 other matches. Those other matches are not relevant to me, and I can easily see that just by a quick glance.

Otherwise we would see all lines of source data every time, they would be just sorted in a curtain way.

So I guess you seem to have interpreted "relevant" as "relevant to the query according to the fuzzy matching algorithm" (yes, of course they are), when I was using it to mean "relevant to the user's intention".

But even so, does it matter if the items I want are at the top of the list?

there are still too many results (e.g. if I look for a variable "Window" there may be many results like "bottom_window" or "InitWindow()")

This was from your initial problem definition. Now that we can type 'window', you can filter out initWindow(). bottom_window will still be in the result, but once you see it, you know you can stop scrolling, because the rest of the matches are no good, in other words, "irrelevant" to you. But the other users who want them can continue browsing, maybe selecting some or all of them.

@maxaykin
Copy link

It is clear form me that scoring helps to present the most relevant (to the intention) matches at the top of the resulting list but it has nothing to do with filtering out irrelevant (to the query) parts of input data. If I understand it correctly, you confirmed that the list of results does always meet the query criteria. Thus, scoring cannot make including irrelevant to the query results acceptable. According to the current behavior of fzf they just should be hidden.

When I requested this feature, I had not anticipated that "word boundary" might have different meaning for different projects/people. Later your changes and our discussion raised this problem. While it is quite fine for you to have underscore as a word delimiter, for me results like bottom_window (if I typed 'window') look irrelevant to the query and must be hidden.

Now the question is how to deal with term "word boundary". If you state in the documentation that it is a specific version of boundary matching, probably that will be formally ok but it will not help people like me.

I'm testing the following local modification to suit my needs:

diff --git a/src/algo/algo.go b/src/algo/algo.go
index f056918..607c741 100644
--- a/src/algo/algo.go
+++ b/src/algo/algo.go
@@ -167,6 +167,7 @@ const (
        charWhite charClass = iota
        charNonWord
        charDelimiter
+       charWord
        charLower
        charUpper
        charLetter
@@ -202,6 +203,8 @@ func Init(scheme string) bool {
                        c = charUpper
                } else if char >= '0' && char <= '9' {
                        c = charNumber
+               } else if char == '_' {
+                       c = charWord
                } else if strings.ContainsRune(whiteChars, char) {
                        c = charWhite
                } else if strings.ContainsRune(delimiterChars, char) {

@junegunn junegunn marked this pull request as ready for review August 19, 2024 15:43
@junegunn
Copy link
Owner Author

but it will not help people like me.

I disagree. I mentioned above how a user with a different expectation can still benefit from it because matches around _ are ranked lower than the rest.

If we decide not to see _ as a word boundary, it will better meet your expectations, but then the implementation will become unusable for the other group of users. As the maintainer of this widely popular program, I strive to find a solution that will serve a larger number of users, even if it means that it will be not "ideal" for some. Adding an option is a cop-out, more options around the core functionality will generally make the program harder to understand and use, so I'd like to avoid that.

@maxaykin
Copy link

I mentioned above how a user with a different expectation can still benefit from it because matches around _ are ranked lower than the rest.

Yes, but this makes the behaviour a bit inconsistent and confusing because considering underscore as a possible part of words is more common. At least, grep and ag do not see it as a word splitter. Similarly people complaints about "exact match" being not fully exact (i.e. matching the case is not enabled by quoting which is not obvious from the documentation). So, I suggest stating these peculiarities in the man page explicitly.

then the implementation will become unusable for the other group of users. As the maintainer of this widely popular program, I strive to find a solution that will serve a larger number of users, even if it means that it will be not "ideal" for some. Adding an option is a cop-out, more options around the core functionality will generally make the program harder to understand and use

In my turn I cannot agree with this. 😄 A larger number of users would be served better if a greater number of their needs were satisfied. An option or a setting or a configuration/environment variable would make it possible to provide a different behaviour for those who need it and not to affect others. You could develop it so that it would not consider underscore as a word splitter by default. Excuse me, please, but not adding it with a reference to possible complexity is precisely a cop-out. I don't think that an additional option would break the compromise between complexity and quality of the tool, look at vim which is much more popular: it is complex but the number of features and its flexibility is its power.

@junegunn
Copy link
Owner Author

Well, it's a philosophical difference in software product design so we'll have to agree to disagree, but thanks for sharing your thoughts.

@maxaykin
Copy link

Thank you for your efforts! 👍

@junegunn junegunn force-pushed the boundary-query branch 2 times, most recently from 372236d to 1d25dcc Compare August 24, 2024 06:28
Only requiring '-suffix in --exact mode is confusing and not
straightforward. Requiring '-prefix in --exact mode means that
the users can experience unintended mode switches while typing.

e.g.
     'it   -> fuzzy (many results)
     'it'  -> boundary (few results)
     'it's -> fuzzy (many results)

However, user who intends to input a boundary query should not be
interested in the intermediate results, and the number of matches
decreases as she types, so it should be okay.

On the other hand, user who does intend to type "it's" will be surprised
by the sudden decrease of the match count, but eventually get the right
result.
@junegunn junegunn merged commit c6d620c into master Aug 29, 2024
5 checks passed
@junegunn junegunn deleted the boundary-query branch August 29, 2024 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Matching with word boundaries
2 participants