Consider tweaking tokenization #33

Open

hoelzro opened this issue Nov 27, 2018 · 1 comment
hoelzro commented Nov 27, 2018

I might want to tweak how the plugin uses lunr to tokenize text, so that hyphenated words and URLs are handled better; a rough sketch of one option follows the example below.

Examples:

#5 (comment)

// Pending Jasmine test (xit = skipped): searching for "twitter" should match
// a tiddler whose text is just a twitter.com URL.
xit('should pick up "twitter" in a URL', async function() {
    await prepare();
    var text = 'https://twitter.com/hoelzro/status/877901644125663232';
    $tw.wiki.addTiddler(new $tw.Tiddler(
        $tw.wiki.getCreationFields(),
        { title: 'ContainsTweetLink', type: 'text/vnd.tiddlywiki', text: text },
        $tw.wiki.getModificationFields()
    ));
    await waitForNextTick();
    var results = $tw.wiki.compileFilter('[ftsearch[twitter]]')();
    expect(results).toContain('ContainsTweetLink');
});
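
For the URL case exercised by the test above, here is a minimal sketch of one possible approach, assuming lunr 2.x; the exact separator regex is an illustrative assumption, not the plugin's current configuration. Widening lunr.tokenizer.separator so that URL punctuation also splits tokens would let "twitter" inside a tweet URL be indexed as its own token.

// Assumption: lunr 2.x, whose tokenizer splits on lunr.tokenizer.separator
// (default /[\s\-]+/). Adding URL punctuation to the character class makes
// each path segment of a URL its own token.
var lunr = require('lunr');

lunr.tokenizer.separator = /[\s\-\/:.?#&=]+/;

// 'https://twitter.com/hoelzro/status/877901644125663232' should now tokenize
// roughly to: https, twitter, com, hoelzro, status, 877901644125663232

Note that this keeps the default behaviour of splitting on hyphens, so it does not by itself address the e-mail case discussed in the next comment.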


hoelzro commented Dec 1, 2018

Another interesting data point for this: "e-mail" is treated as two tokens, which kind of screws things up.

Would it make sense just to use a tokenizer that recognizes certain exceptions (like e-mail) and certain special prefixes (like re-)? As an alternative to a list of exceptions, we could have logic that bundles prefixes up to a certain length (e.g. 3 or fewer characters); a rough sketch of that idea is below.
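
A minimal sketch of the prefix-bundling idea, in plain JavaScript; the function name, exception list, and 3-character threshold are illustrative assumptions rather than anything already in the plugin. After the text has been split on hyphens, a fragment is glued back onto the fragment that follows it whenever it is a known exception or short enough to look like a prefix.

// Hypothetical post-processing step: given fragments produced by splitting on
// '-', rejoin a short or whitelisted prefix with the word that follows it, so
// 'e-mail' and 're-index' survive as single tokens.
var PREFIX_EXCEPTIONS = ['e', 're', 'co', 'pre'];   // assumed list, not from the plugin

function mergeShortPrefixes(fragments, maxPrefixLength) {
    maxPrefixLength = maxPrefixLength || 3;
    var merged = [];
    for (var i = 0; i < fragments.length; i++) {
        var current = fragments[i];
        var next = fragments[i + 1];
        var looksLikePrefix = current.length <= maxPrefixLength ||
            PREFIX_EXCEPTIONS.indexOf(current) !== -1;
        if (next !== undefined && looksLikePrefix) {
            merged.push(current + '-' + next);   // e.g. 'e' + 'mail' -> 'e-mail'
            i++;                                 // skip the fragment just consumed
        } else {
            merged.push(current);
        }
    }
    return merged;
}

// mergeShortPrefixes(['e', 'mail'])      -> ['e-mail']
// mergeShortPrefixes(['long', 'words'])  -> ['long', 'words']

An open question is whether the merged form, the split fragments, or both should end up in the index; indexing both would keep ftsearch[mail] working while also letting ftsearch[e-mail] match.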
