Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid allocating hats to the first letter of a token #1723

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

josharian
Copy link
Collaborator

We could get much fancier than this,
but after running this with a day it appears to help some,
and it is nice and simple.

I propose that we declare that it fixes #1658,
at least for now.

Checklist

  • [/] I have added tests
  • [/] I have updated the docs and cheatsheet
  • [/] I have not broken the cheatsheet

@josharian josharian requested a review from pokey as a code owner August 2, 2023 23:34
@josharian
Copy link
Collaborator Author

josharian commented Aug 2, 2023

I plan to keep running this for a little while longer, gathering data, but I thought I would share it in case anyone else wants to play with it.

(I know the tests are busted.)

@josharian josharian marked this pull request as draft August 3, 2023 00:33
Copy link
Member

@pokey pokey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good with minor tweak

Comment on lines 133 to 139
// iterate through the graphemes, marking the first letter
for (const grapheme of graphemes) {
if (grapheme.text.match(/[a-z]/)) {
grapheme.isFirstLetterGrapheme = true;
break;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel like the right place for this logic; seems more specific to hat allocator. I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates and add it to HatCandidate.

I'd also be tempted to handle camel case / snake case, but yeah prob not necessary in this draft? Shouldn't be too hard, tho; just for each char

  1. if it's lowercase letter:
    a. return false if prev char exists and is letter
    b. else return true
  2. else if it's uppercase letter:
    a. return true if previous char doesn't exist or is not uppercase letter
    b. else return false
  3. else return false

I'd also advise using Unicode character classes, eg \p{Lu}, \p{L}, \P{L} etc

Also, it looks like your code should handle the case of the same grapheme appearing twice in the same token well, right? Ie we'll end up with two candidates, and end up preferring the one that is not first in its token, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Word splitting (for brains) is subtle. One case this would get wrong is languages that prefer all-caps initialisms, like Go and Swift. Consider URLRequest: The second R begins a word, but would be marked as not-initial by your heuristic.

I was also worried about whether we needed to respect the settings from the user about how to tokenize.

But the perfect is the enemy of the good! I'll implement your suggestion and play with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates and add it to HatCandidate.

That seems reasonable. Will play with that. It'll spare me fixing 94 tests. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another hard word-splitting case: thisIsATest

Copy link
Collaborator Author

@josharian josharian Aug 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to inject it right after we call getTokenGraphemes in getTokenRemainingHatCandidates

I forgot. At that point we've normalized the grapheme, so we no longer have case information. We could add an original text field to the graphemes. Or punt.

(The normalization is also the reason that I used [a-z] instead of unicode character classes. But I'll happily switch back on that front.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right. Keep in mind tho that the normalisation won't always strip accents; the user can customise normalisation. You can use the offsets to get the original text

Your counterexamples for splitting are interesting, maybe better to just keep your heuristic?

But yeah I wonder if it's worth all the trouble 🤔

Maybe let's discuss at meet-up

I propose that we declare that it fixes cursorless-dev#1658,
at least for now.
@josharian
Copy link
Collaborator Author

here's another rev. lots of tests are still failing; it's going to be tedious to fix them, so I'd like to wait until we are relatively confident in the rest of the direction.

@josharian
Copy link
Collaborator Author

notes to self:

  • correctly handle _abcTest (are we avoiding _ or a?)
  • perf test
  • maybe re-use tokenizers
  • switch to ranges
  • tests: stats, fixtures
  • data gathering for end users
    • no phones/replace
    • jsonl
    • open append/exclusive
    • command payload
    • rotate monthly
    • include extension version

@pokey
Copy link
Member

pokey commented Jun 20, 2024

update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form

@josharian
Copy link
Collaborator Author

update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form

great, thanks!

@AndreasArvidsson
Copy link
Member

@josharian Have you evaluated the difference between just avoiding the first character in the token verses the first character in every subword? When I first thought about this problem I kinda just envisioned the first character in the token, but your implementation is doing every subword which could be better. Any insight?

@josharian
Copy link
Collaborator Author

I remember thinking at the time that doing sub words was important. But It is not something I ever gathered data about, because the effects are purely qualitative. And a lot of time has now gone by…

@AndreasArvidsson AndreasArvidsson marked this pull request as ready for review June 25, 2024 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

choose hats to avoid the Stroop effect?
3 participants