avoid allocating hats to the first letter of a token #1723

josharian · 2023-08-02T23:34:26Z

We could get much fancier than this,
but after running with this for a day it appears to help some,
and it is nice and simple.

I propose that we declare that it fixes #1658,
at least for now.

Checklist

[/] I have added tests
[/] I have updated the docs and cheatsheet
[/] I have not broken the cheatsheet

josharian · 2023-08-02T23:34:48Z

I plan to keep running this for a little while longer, gathering data, but I thought I would share it in case anyone else wants to play with it.

(I know the tests are busted.)

pokey

Looks good with minor tweak

pokey · 2023-08-04T13:16:36Z

packages/cursorless-engine/src/tokenGraphemeSplitter/tokenGraphemeSplitter.ts

+    // iterate through the graphemes, marking the first letter
+    for (const grapheme of graphemes) {
+      if (grapheme.text.match(/[a-z]/)) {
+        grapheme.isFirstLetterGrapheme = true;
+        break;
+      }
+    }


This doesn't feel like the right place for this logic; seems more specific to hat allocator. I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates and add it to HatCandidate.

I'd also be tempted to handle camel case / snake case, but yeah prob not necessary in this draft? Shouldn't be too hard, tho; just for each char

if it's lowercase letter:
a. return false if prev char exists and is letter
b. else return true

else if it's uppercase letter:
a. return true if previous char doesn't exist or is not uppercase letter
b. else return false

else return false

I'd also advise using Unicode character classes, eg \p{Lu}, \p{L}, \P{L} etc
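A minimal sketch of the per-character rule above, using Unicode character classes. This is illustrative only, not code from the PR: `isWordStart` and `wordStartIndices` are hypothetical names, and the token is split by code point rather than by the grapheme splitter the engine uses.

```typescript
// Hypothetical helpers illustrating the heuristic; not part of the PR.
// Built with the RegExp constructor so the Unicode property escapes are
// evaluated at runtime.
const LETTER = new RegExp("\\p{L}", "u"); // any letter
const UPPER = new RegExp("\\p{Lu}", "u"); // uppercase letter
const LOWER = new RegExp("\\p{Ll}", "u"); // lowercase letter

// Returns true if chars[index] is considered the first letter of a word.
function isWordStart(chars: string[], index: number): boolean {
  const char = chars[index];
  const prev = index > 0 ? chars[index - 1] : undefined;

  if (LOWER.test(char)) {
    // Lowercase letter: starts a word only if no letter precedes it.
    return prev === undefined || !LETTER.test(prev);
  }
  if (UPPER.test(char)) {
    // Uppercase letter: starts a word unless it follows another uppercase letter.
    return prev === undefined || !UPPER.test(prev);
  }
  // Digits, punctuation, caseless letters, etc.
  return false;
}

// Indices (by code point) of every character marked as a word start.
function wordStartIndices(token: string): number[] {
  const chars = Array.from(token);
  const result: number[] = [];
  for (let i = 0; i < chars.length; i++) {
    if (isWordStart(chars, i)) {
      result.push(i);
    }
  }
  return result;
}
```

With this sketch, `wordStartIndices("fooBar")` gives `[0, 3]` and `wordStartIndices("foo_bar")` gives `[0, 4]`. All-caps runs are where it falls short: `wordStartIndices("URLRequest")` gives only `[0]`, and `wordStartIndices("thisIsATest")` gives `[0, 4, 6]`, missing the `T` of `Test`.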

Also, it looks like your code should handle the case of the same grapheme appearing twice in the same token well, right? Ie we'll end up with two candidates, and end up preferring the one that is not first in its token, right?

Word splitting (for brains) is subtle. One case this would get wrong is languages that prefer all-caps initialisms, like Go and Swift. Consider URLRequest: The second R begins a word, but would be marked as not-initial by your heuristic.

I was also worried about whether we needed to respect the settings from the user about how to tokenize.

But the perfect is the enemy of the good! I'll implement your suggestion and play with it.

I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates and add it to HatCandidate.

That seems reasonable. Will play with that. It'll spare me fixing 94 tests. :)

Another hard word-splitting case: thisIsATest

I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates

I forgot. At that point we've normalized the grapheme, so we no longer have case information. We could add an original text field to the graphemes. Or punt.

(The normalization is also the reason that I used [a-z] instead of unicode character classes. But I'll happily switch back on that front.)

Ahh right. Keep in mind tho that the normalisation won't always strip accents; the user can customise normalisation. You can use the offsets to get the original text
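A rough sketch of recovering the original text from offsets. The `Grapheme` shape here, including the field names `tokenStartOffset` and `tokenEndOffset`, is an assumption made for illustration and may not match the engine's actual type; `originalGraphemeText` is a hypothetical helper.

```typescript
// Assumed shape for illustration; field names are not taken from this thread.
interface Grapheme {
  text: string; // normalized text (case and possibly accents stripped)
  tokenStartOffset: number; // start offset within the token text
  tokenEndOffset: number; // end offset (exclusive) within the token text
}

// Hypothetical helper: slice the un-normalized token text using the offsets,
// so case information is available again for word-start detection.
function originalGraphemeText(tokenText: string, grapheme: Grapheme): string {
  return tokenText.substring(
    grapheme.tokenStartOffset,
    grapheme.tokenEndOffset,
  );
}
```

For example, with token text `fooBar` and a normalized grapheme `{ text: "b", tokenStartOffset: 3, tokenEndOffset: 4 }`, the helper returns `"B"`, so no extra original-text field on the grapheme would be needed.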

Your counterexamples for splitting are interesting, maybe better to just keep your heuristic?

But yeah I wonder if it's worth all the trouble 🤔

Maybe let's discuss at meet-up

josharian · 2023-08-08T02:36:16Z

here's another rev. lots of tests are still failing; it's going to be tedious to fix them, so I'd like to wait until we are relatively confident in the rest of the direction.

...rless-engine/src/processTargets/modifiers/scopeHandlers/WordScopeHandler/WordScopeHandler.ts

josharian · 2023-08-12T01:32:56Z

notes to self:

correctly handle _abcTest (are we avoiding _ or a?)
perf test
maybe re-use tokenizers
switch to ranges
tests: stats, fixtures
data gathering for end users
- no phones/replace
- jsonl
- open append/exclusive
- command payload
- rotate monthly
- include extension version

pokey · 2024-06-20T10:20:39Z

update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form

josharian · 2024-06-25T00:20:53Z

update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form

great, thanks!

AndreasArvidsson · 2024-06-25T04:01:19Z

@josharian Have you evaluated the difference between just avoiding the first character in the token versus the first character in every subword? When I first thought about this problem I kinda just envisioned the first character in the token, but your implementation is doing every subword, which could be better. Any insight?

josharian · 2024-06-25T04:19:42Z

I remember thinking at the time that doing subwords was important. But it is not something I ever gathered data about, because the effects are purely qualitative. And a lot of time has now gone by…

josharian requested a review from pokey as a code owner August 2, 2023 23:34

josharian marked this pull request as draft August 3, 2023 00:33

pokey reviewed Aug 4, 2023

avoid allocating hats to the first letter of a word in a token

da3c7f1

I propose that we declare that it fixes cursorless-dev#1658, at least for now.

josharian force-pushed the josh/no-first-word branch from 0385b48 to da3c7f1 Compare August 8, 2023 02:35

pokey reviewed Aug 8, 2023

fix tests

b33055b

pokey assigned AndreasArvidsson Jun 20, 2024

AndreasArvidsson added 3 commits June 25, 2024 03:07

Merge branch 'main' into josh/no-first-word

36904fc

testing

5486b7b

clean up

b3bce4d

AndreasArvidsson and others added 8 commits June 25, 2024 10:25

Added comment

0cd2090

Fix merge conflict in test

021bafa

[pre-commit.ci lite] apply automatic fixes

10dc73e

Testing

fed82c5

testing

90b4e02

Merge branch 'josh/no-first-word' of github.com:josharian/cursorless …

ba616c7

…into josh/no-first-word

update

5ebd10c

Update tests

44aeec7

AndreasArvidsson marked this pull request as ready for review June 25, 2024 10:39

AndreasArvidsson self-requested a review as a code owner June 25, 2024 10:39
