Regex for separating lemmas may fail in some edge cases #1017

Open
scottkleinman opened this issue Jun 6, 2020 · 0 comments

Comments

@scottkleinman
Contributor

This is a leftover from issue #1013. In scrubber.py, line 169, the regex pattern is supposed to detect whitespace, Unicode punctuation, or the end of the string (see the comment in the preceding commit). However, this commit did not supply a pattern for detecting Unicode punctuation. I have not yet found a concise way to do this, so I have substituted \W (non-word character). This works in most cases, but \W is not equivalent to Unicode punctuation, so it may fail in some edge cases. This needs further testing.
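
To make the possible edge cases concrete, here is a minimal sketch using only the standard library. It is not the code in scrubber.py: the names `is_unicode_punct`, `PUNCT_CHARS`, `BOUNDARY`, and `disagreements` are hypothetical, and "Unicode punctuation" is assumed to mean any character whose general category starts with `P`. It builds an explicit character class from `unicodedata` and lists characters where that class and `\W` disagree.

```python
import re
import sys
import unicodedata


def is_unicode_punct(ch: str) -> bool:
    """True if ch is in a Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")


# Every code point whose general category starts with "P" (assumed definition).
PUNCT_CHARS = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1) if is_unicode_punct(chr(cp))
)

# Hypothetical boundary pattern: whitespace, Unicode punctuation, or end of string.
BOUNDARY = re.compile(r"(?:\s|[" + re.escape(PUNCT_CHARS) + r"]|$)")

# The stand-in described in this issue.
NON_WORD = re.compile(r"\W")


def disagreements(chars: str) -> list:
    """Return the characters that \\W and the punctuation test classify differently."""
    return [c for c in chars if bool(NON_WORD.match(c)) != is_unicode_punct(c)]


# "_" is punctuation (Pc) but a word character, so \W misses it.
# "+" and the euro sign are symbols (Sm, Sc), so \W matches them although
# they are not punctuation.
print(disagreements("_+\u20ac.,\u00ab\u3002"))
```

Building the class scans every code point once at import time, which is verbose but needs no extra dependency. The third-party `regex` package accepts `\p{P}` for the same purpose, if adding a dependency is acceptable.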
