API request: String Homoglyph.toASCII(String) #6
Comments
This would be extremely useful to me. I was looking for a
This change is probably more complicated than it first appears. I don't think a single 'canonical' set of characters could be defined that would make sense for everyone: it would vary depending on the language of the user, and also on the expected content of the text (for example, should the digit '1' be replaced with the letter 'l' or left as it is?). I think the library would have to allow the user to specify what they consider canonical. In addition, it isn't obvious what the correct behaviour should be if letters within the canonical set are homoglyphs of each other. For example, if we just say that the 26 letters of the English alphabet are canonical, do we change the digit 1 to lower-case 'L' or to capital 'I'? I welcome any suggestions regarding a good way to handle this.
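One possible way to handle both concerns, offered only as a sketch: let the caller pass the canonical characters in priority order, and map every member of a homoglyph group to the earliest canonical character that group contains. The class name `CanonicalNormalizerSketch`, its constructor parameters, and the two sample groups are assumptions made up for illustration; none of this is part of the library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the library's API.
// Each homoglyph group is a string of look-alike characters; the caller
// decides what is canonical and, by ordering, which canonical wins a tie.
final class CanonicalNormalizerSketch {
    private final Map<Integer, Integer> replacement = new HashMap<>();

    CanonicalNormalizerSketch(List<String> homoglyphGroups, String canonicalByPriority) {
        int[] canonical = canonicalByPriority.codePoints().toArray();
        for (String group : homoglyphGroups) {
            int chosen = -1;
            // The earliest canonical character found in the group wins.
            for (int candidate : canonical) {
                if (group.indexOf(candidate) >= 0) {
                    chosen = candidate;
                    break;
                }
            }
            if (chosen < 0) {
                continue; // no canonical member: leave this group untouched
            }
            for (int member : group.codePoints().toArray()) {
                replacement.put(member, chosen);
            }
        }
    }

    String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(replacement.getOrDefault(cp, cp)));
        return out.toString();
    }
}
```

With the groups `"1lI|"` and `"0Oo"`, a letters-first priority string such as `"abcdefghijklmnopqrstuvwxyz"` turns `"he1Io"` into `"hello"`, while a digits-first string such as `"0123456789"` turns the same input into `"he110"`. Note that this policy also collapses canonical characters that are homoglyphs of each other (capital 'I' becomes lower-case 'l' in the first case), which may or may not be what a given user wants.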
Running into this - we'd find it very useful to be able to regex-match including homoglyphs, and normalisation is definitely the only way to handle this. My suggestion would be to prioritise normalising to letters: the main use-case for a library like this is automated chat moderation, and it's unlikely for numbers to be useful matches for problematic content (in my opinion). Of course, this doesn't solve the latter part of your question. I think the only real solution there is to support generating permutations instead; then they can all be tested.
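A rough sketch of the permutation idea, under assumptions: each ambiguous character is given a string of candidate replacements, and every combination is generated so that each variant can be tested against a pattern. `VariantExpanderSketch`, `expand`, and the candidate table are hypothetical names for illustration, not library API, and the sketch works on `char` values, so it only covers the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the library's API.
final class VariantExpanderSketch {
    // candidates maps an ambiguous character to all characters it may stand for.
    // Characters without an entry are kept as they are.
    static List<String> expand(String text, Map<Character, String> candidates) {
        List<String> variants = new ArrayList<>();
        variants.add("");
        for (char c : text.toCharArray()) {
            String options = candidates.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>();
            // Extend every variant built so far with every option for this character.
            for (String prefix : variants) {
                for (char option : options.toCharArray()) {
                    next.add(prefix + option);
                }
            }
            variants = next;
        }
        return variants;
    }
}
```

For example, `expand("1ove", Map.of('1', "lI"))` yields the two variants `"love"` and `"Iove"`, and a moderation rule could then be checked with something like `variants.stream().anyMatch(v -> v.matches("love"))`. The number of variants is the product of the option counts, so it grows exponentially with the number of ambiguous characters; a real implementation would likely need a cap.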
Please provide a `toASCII` API which tries to fit each character into the ASCII range and returns a string. For example, the following holds true:

It is useful in scenarios where we want to run complex regex rules on an (approximate) ASCII representation. Building an equivalent complex regex tree with the `Homoglyph.search()` API is not convenient (at least in certain cases).
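To make the request concrete, here is a minimal sketch of what such a function could look like and how it would combine with `java.util.regex`. It is not the library's implementation: `AsciiFoldSketch`, `toAscii`, `matchesFolded`, and the seven-entry table are assumptions chosen for illustration, and a real table would be far larger.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's API. The table is a tiny hand-picked sample.
final class AsciiFoldSketch {
    private static final Map<Integer, Character> TO_ASCII = Map.of(
        0x0430, 'a',  // CYRILLIC SMALL LETTER A
        0x0435, 'e',  // CYRILLIC SMALL LETTER IE
        0x043E, 'o',  // CYRILLIC SMALL LETTER O
        0x0440, 'p',  // CYRILLIC SMALL LETTER ER
        0x0441, 'c',  // CYRILLIC SMALL LETTER ES
        0x0391, 'A',  // GREEK CAPITAL LETTER ALPHA
        0x039F, 'O'   // GREEK CAPITAL LETTER OMICRON
    );

    // Replaces each known look-alike with its ASCII counterpart;
    // anything not in the table is passed through unchanged.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character mapped = TO_ASCII.get(cp);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Runs an ordinary regex against the folded text.
    static boolean matchesFolded(Pattern pattern, String text) {
        return pattern.matcher(toAscii(text)).find();
    }
}
```

With this table, the all-Cyrillic string `"\u0441\u043E\u0440\u0435"` folds to `"cope"`, so a plain pattern such as `Pattern.compile("c[o0]pe")` matches it without any homoglyph-aware regex construction.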