-
-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avoid allocating hats to the first letter of a token #1723
base: main
Are you sure you want to change the base?
Conversation
I plan to keep running this for a little while longer, gathering data, but I thought I would share it in case anyone else wants to play with it. (I know the tests are busted.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good with minor tweak
// iterate through the graphemes, marking the first letter | ||
for (const grapheme of graphemes) { | ||
if (grapheme.text.match(/[a-z]/)) { | ||
grapheme.isFirstLetterGrapheme = true; | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't feel like the right place for this logic; seems more specific to hat allocator. I'd be inclined to inject it right after we call getTokenGraphemes
in getTokenRemainingHatCandidates
and add it to HatCandidate
.
I'd also be tempted to handle camel case / snake case, but yeah prob not necessary in this draft? Shouldn't be too hard, tho; just for each char
- if it's lowercase letter:
a. returnfalse
if prev char exists and is letter
b. else returntrue
- else if it's uppercase letter:
a. returntrue
if previous char doesn't exist or is not uppercase letter
b. else returnfalse
- else return
false
I'd also advise using Unicode character classes, eg \p{Lu}
, \p{L}
, \P{L}
etc
Also, it looks like your code should handle the case of the same grapheme appearing twice in the same token well, right? Ie we'll end up with two candidates, and end up preferring the one that is not first in its token, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Word splitting (for brains) is subtle. One case this would get wrong is languages that prefer all-caps initialisms, like Go and Swift. Consider URLRequest: The second R
begins a word, but would be marked as not-initial by your heuristic.
I was also worried about whether we needed to respect the settings from the user about how to tokenize.
But the perfect is the enemy of the good! I'll implement your suggestion and play with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates and add it to HatCandidate.
That seems reasonable. Will play with that. It'll spare me fixing 94 tests. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another hard word-splitting case: thisIsATest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates
I forgot. At that point we've normalized the grapheme, so we no longer have case information. We could add an original text field to the graphemes. Or punt.
(The normalization is also the reason that I used [a-z]
instead of unicode character classes. But I'll happily switch back on that front.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh right. Keep in mind tho that the normalisation won't always strip accents; the user can customise normalisation. You can use the offsets to get the original text
Your counterexamples for splitting are interesting, maybe better to just keep your heuristic?
But yeah I wonder if it's worth all the trouble 🤔
Maybe let's discuss at meet-up
I propose that we declare that it fixes cursorless-dev#1658, at least for now.
0385b48
to
da3c7f1
Compare
here's another rev. lots of tests are still failing; it's going to be tedious to fix them, so I'd like to wait until we are relatively confident in the rest of the direction. |
...rless-engine/src/processTargets/modifiers/scopeHandlers/WordScopeHandler/WordScopeHandler.ts
Outdated
Show resolved
Hide resolved
notes to self:
|
update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form |
great, thanks! |
@josharian Have you evaluated the difference between just avoiding the first character in the token verses the first character in every subword? When I first thought about this problem I kinda just envisioned the first character in the token, but your implementation is doing every subword which could be better. Any insight? |
I remember thinking at the time that doing sub words was important. But It is not something I ever gathered data about, because the effects are purely qualitative. And a lot of time has now gone by… |
We could get much fancier than this,
but after running this with a day it appears to help some,
and it is nice and simple.
I propose that we declare that it fixes #1658,
at least for now.
Checklist