API request: String Homoglyph.toASCII(String) #6

am11 · 2019-09-09T12:41:35Z

Please provide a toASCII API that tries to map each character to its closest ASCII equivalent and returns the resulting string. For example, the following should hold:

Homoglyph homoglyph = HomoglyphBuilder.build();
assertEquals("The quick brown fox jumps over the lazy dog", 
    homoglyph.toASCII("Τһе ԛυіϲκ Ьгоѡɴ ғох јυⅿрѕ оⅴег τһе ⅼаzу ԁоɡ"));

This is useful when we want to run complex regex rules against an (approximate) ASCII representation of the text. Building an equivalent regex tree with the Homoglyph.search() API is not convenient, at least in some cases.
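To make the use case concrete, here is a minimal sketch of the "fold first, then apply an ordinary regex" workflow. It does not use the library: `AsciiFoldSketch`, `toAscii`, `matches`, and the five-entry `FOLD` table are all hypothetical names invented for illustration, and a real implementation would presumably draw on the library's own homoglyph data instead.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class AsciiFoldSketch {
    // Hypothetical, hand-picked table; NOT the library's data.
    private static final Map<Integer, Character> FOLD = Map.of(
        0x0430, 'a',  // CYRILLIC SMALL LETTER A
        0x0435, 'e',  // CYRILLIC SMALL LETTER IE
        0x043E, 'o',  // CYRILLIC SMALL LETTER O
        0x0440, 'p',  // CYRILLIC SMALL LETTER ER
        0x03BF, 'o'   // GREEK SMALL LETTER OMICRON
    );

    // Replaces every code point found in FOLD; everything else passes through unchanged.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character mapped = FOLD.get(cp);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Runs a plain ASCII regex against the folded text.
    static boolean matches(Pattern rule, String text) {
        return rule.matcher(toAscii(text)).find();
    }
}
```

With this in place, a rule such as `Pattern.compile("pay\\s*pal")` can be written once against ASCII and applied to input that mixes in Cyrillic look-alikes.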


codygray · 2021-12-11T10:40:16Z

This would be extremely useful to me. I was looking for a canonicalize function that would do essentially the same thing. I think that, perhaps, canonicalize is a better name than toASCII, since the "base" characters may not be strictly ASCII.

codebox · 2022-09-11T17:51:52Z

This change is probably more complicated than it first appears. I don't think a single 'canonical' set of characters could be defined that would make sense for everyone; it would vary depending on the language of the user and on the expected content of the text (for example, should the digit '1' be replaced with the letter 'l' or left as it is?). I think the library would have to allow the user to specify what they consider canonical. In addition, it isn't obvious what the correct behaviour should be if letters within the canonical set are homoglyphs of each other. For example, if we just say that the 26 letters of the English alphabet are canonical, do we change the digit 1 to lower-case 'L' or to capital 'I'?

I welcome any suggestions regarding a good way to handle this.
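One possible way to handle both points, sketched under stated assumptions: the caller passes the canonical characters as a string in priority order, and when several canonical characters are homoglyphs of the same input character, the one listed first wins. Characters that are already canonical are left alone. `CanonicalizerSketch`, `canonicalize`, and the two tiny `GROUPS` below are hypothetical and are not part of the library.

```java
import java.util.List;
import java.util.Set;

final class CanonicalizerSketch {
    // Hypothetical confusable groups; a real table would be far larger.
    private static final List<Set<Integer>> GROUPS = List.of(
        Set.of((int) '1', (int) 'l', (int) 'I',
               0x0406,   // CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
               0x04CF),  // CYRILLIC SMALL LETTER PALOCHKA
        Set.of((int) 'o',
               0x043E,   // CYRILLIC SMALL LETTER O
               0x03BF)   // GREEK SMALL LETTER OMICRON
    );

    // 'canonical' lists the caller's canonical characters, most preferred first.
    static String canonicalize(String text, String canonical) {
        int[] preferred = canonical.codePoints().toArray();
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(pick(cp, preferred)));
        return out.toString();
    }

    private static int pick(int cp, int[] preferred) {
        // A character the caller already considers canonical is never rewritten.
        for (int p : preferred) {
            if (p == cp) return cp;
        }
        // Otherwise take the highest-priority canonical member of its group.
        for (Set<Integer> group : GROUPS) {
            if (!group.contains(cp)) continue;
            for (int p : preferred) {
                if (group.contains(p)) return p;
            }
        }
        return cp; // no known homoglyph: pass through unchanged
    }
}
```

Under this scheme the '1' question is answered by the caller rather than the library: with a canonical set of only the lower-case English letters, the digit 1 becomes 'l'; if the digits are listed as canonical too, 1 stays as it is; and a caller who lists capital 'I' ahead of 'l' gets 'I'.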

gdude2002 · 2023-03-14T16:45:11Z

Running into this - we'd find it very useful to be able to regex-match including homoglyphs, and normalisation is definitely the only way to handle this.

My suggestion would be to prioritise normalising to letters - the main use-case for a library like this is automated chat moderation; it's unlikely for numbers to be useful matches for problematic content (in my opinion).

Of course, this doesn't solve the latter part of your question - I think the only real solution there is to support generating permutations instead; then they can all be tested.
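A rough sketch of the permutation idea, assuming a small hypothetical table in which each ambiguous character lists every canonical letter it might stand for. `PermutationSketch`, `expand`, and `CANDIDATES` are illustrative names only, and the sketch works on `char` values, so it covers only characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PermutationSketch {
    // Hypothetical candidates per ambiguous character; NOT the library's data.
    private static final Map<Character, String> CANDIDATES = Map.of(
        '1', "lI",
        '0', "oO",
        '\u0430', "a"   // CYRILLIC SMALL LETTER A
    );

    // Returns every reading of 'text' obtained by substituting each candidate in turn.
    static List<String> expand(String text) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : text.toCharArray()) {
            // Characters without an entry only ever stand for themselves.
            String options = CANDIDATES.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>(results.size() * options.length());
            for (String prefix : results) {
                for (char option : options.toCharArray()) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }
}
```

Each returned string can then be tested against the same regex. The number of results is the product of the candidate counts per character, so it grows exponentially with the number of ambiguous characters; a practical version would likely need a cap or a lazy iterator.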
