Return original text as tokens #1

Merged: 5 commits merged on Oct 25, 2018

@johnmbw (Collaborator) commented Oct 23, 2018

The tokenizer normalises tokens internally (I think so that words match dictionary entries correctly), but it was returning those normalised tokens.

This change makes it return the original tokens instead, which can always be normalised again if wanted, so the text from the source is preserved.

Also adds minimal tests, etc.
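
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `OriginalTextTokens`, the tiny `DICTIONARY`, the two-word greedy matching, and the `normalise` rule (lower-casing plus mapping "ỹ" to "ĩ") are all assumptions made for the example. Normalisation is used only to decide where tokens start and end, and the returned tokens are built from the original words.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OriginalTextTokens {

    // Hypothetical dictionary, stored in normalised form only.
    static final Set<String> DICTIONARY = new HashSet<>();
    static {
        for (String entry : new String[] {"chúng mình", "tư vấn", "kĩ thuật"}) {
            DICTIONARY.add(normalise(entry));
        }
    }

    // Hypothetical normalisation: compose to NFC, lower-case, and map
    // U+1EF9 (y with tilde) to U+0129 (i with tilde).
    static String normalise(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.toLowerCase(Locale.ROOT).replace('\u1EF9', '\u0129');
    }

    // Splits on whitespace, then greedily merges two adjacent words when
    // their normalised form is a dictionary entry. Tokens are built from
    // the original words, so the caller never sees the normalised form.
    static List<String> tokenize(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (i + 1 < words.length) {
                // Words are re-joined with a single space.
                String pair = words[i] + " " + words[i + 1];
                if (DICTIONARY.contains(normalise(pair))) {
                    tokens.add(pair); // original text, not normalise(pair)
                    i += 2;
                    continue;
                }
            }
            tokens.add(words[i]);
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Kỹ thuật" only matches the dictionary after normalisation,
        // but it is returned with its original spelling and casing.
        System.out.println(tokenize("Kỹ thuật mới"));
    }
}
```

A caller that does want normalised tokens can still apply `normalise` to each returned token afterwards, which is the trade-off the description above points at.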

@johnmbw johnmbw force-pushed the return-original-text-as-tokens branch from e27abcd to dac0606 Compare October 23, 2018 12:31
So that we can return the original tokens instead of the normalised ones.
@johnmbw johnmbw force-pushed the return-original-text-as-tokens branch from dac0606 to 180e699 Compare October 23, 2018 13:11
@johnmbw johnmbw force-pushed the return-original-text-as-tokens branch from 180e699 to a65fdbc Compare October 23, 2018 14:15

@ewencluley ewencluley left a comment

While this looks fine, I'm struggling to fully understand why this is being done in the tokeniser. For the other tokenisers, I think we normally normalise as a separate step after tokenisation. Is there a reason that can't be done here, leaving the tokeniser unchanged?

Ignore me, it helps if I read the PR description 🤦‍♂️

// kỹ would be normalised to kĩ internally
checkTokenization(
"Direct message để được chúng mình tư vấn kỹ hơn nhé",
"Direct","message", "để", "được", "chúng mình", "tư vấn", "kỹ", "hơn", "nhé"

NABD: missing space after "Direct",

@johnmbw johnmbw merged commit 5ab98f6 into BrandwatchLtd:master Oct 25, 2018