Implications
The mimic user is essentially able to write text that, by way of its appearance, means one thing to the victim seeing it, and quite another thing to the victim's computer. This has several implications.
If it wasn't obvious, this can be used to subtly break a coworker's code in ways that may be difficult to debug or even notice.
In situations where text alone identifies a third party - such as an email address or a domain name - an attacker could use a mimicked identifier to impersonate that third party.
Someone could theoretically mimic a stolen copy of text to evade automatic copy-detection software, as that software would likely not consider it a match to the original source.
In a context where anti-spam software bases detection on matches of common phrases, a spammer could mimic spam text in a different way every time so that the anti-spam software never considers it a match to known spam.
If someone, for either just or malevolent reasons, wants to keep text from being meaningfully indexed by a search engine, they could mimic it. The search engine would still index the text, but ordinary search terms would no longer match it.
For (again) just or malevolent reasons, automatic censorship mechanisms such as vulgarity filters could theoretically be circumvented by mimicking text. One could even extend this to attempt to evade automatic keyword censorship by a totalitarian government, for instance.
When replacing characters, there are two choices: which original character to replace, and which new character to substitute for it. These choices can be leveraged to encode a hidden data stream; in other words, text can look the same as it did before but carry a hidden message. (This is a planned feature of Mimic.)
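To make the hidden-data idea concrete, here is a minimal sketch of one possible encoding. It is not Mimic's implementation (the feature is only planned). The `HOMOGLYPHS` table and the `embed_bits` and `extract_bits` functions are hypothetical names invented for illustration, and the sketch assumes the cover text starts out as plain ASCII. Each replaceable letter carries one bit: left alone for 0, swapped for its look-alike for 1.

```python
# Hypothetical homoglyph table: Latin letter -> look-alike Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def embed_bits(cover, bits):
    """Hide bits in cover; each replaceable letter carries one bit (1 = swapped)."""
    out = []
    pending = list(bits)
    for ch in cover:
        if ch in HOMOGLYPHS and pending:
            bit = pending.pop(0)
            out.append(HOMOGLYPHS[ch] if bit else ch)
        else:
            out.append(ch)
    if pending:
        raise ValueError("cover text has too few replaceable characters")
    return "".join(out)


def extract_bits(stego, count):
    """Recover the first `count` hidden bits from text made by embed_bits."""
    bits = []
    for ch in stego:
        if len(bits) == count:
            break
        if ch in REVERSE:
            bits.append(1)  # a look-alike was substituted here
        elif ch in HOMOGLYPHS:
            bits.append(0)  # a replaceable letter was left untouched
    return bits
```

This sketch only exploits the first choice (whether to replace a given character). Choosing among several look-alikes for the same letter would raise the capacity above one bit per character.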
In contexts sensitive to the above issues, several approaches could be taken:
- As mimic is already able to do, check for unusual or suspicious characters that are in unexpected Unicode ranges.
- Again, as mimic is already able to do, attempt to replace such characters with characters that are considered more conventional.
- Maintain and utilize an index of known popular "truth" terms (e.g. google.com), and warn if a potential homoglyph attack is attempting to spoof such a truth term.
- As a heavier, more general solution, apply a round-trip render->OCR algorithm to check for discrepancies.
- Re-evaluate whether it is important to support Unicode in the first place for certain applications, and consider using more restrictive character sets.
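The "truth" term approach above could be sketched roughly as follows. This is an illustration under stated assumptions, not how mimic works: the `CONFUSABLES` table is a tiny hand-written sample, `TRUTH_TERMS` is a made-up index, and `skeleton` and `spoofed_truth_term` are hypothetical helper names. A real deployment would need a far larger look-alike table.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (homoglyph -> ASCII).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
}

# Hypothetical index of "truth" terms worth protecting.
TRUTH_TERMS = {"google.com", "example.com"}


def skeleton(text):
    """Replace known look-alikes with their conventional ASCII counterparts."""
    # NFKC folds compatibility forms such as fullwidth letters first.
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized).lower()


def spoofed_truth_term(candidate):
    """Return the truth term that candidate imitates, or None if it imitates none."""
    if candidate.lower() in TRUTH_TERMS:
        return None  # the genuine term, not a spoof
    skel = skeleton(candidate)
    return skel if skel in TRUTH_TERMS else None
```

A caller would warn whenever `spoofed_truth_term` returns something other than `None`, for example when a link's host name only matches a known domain after its look-alikes have been replaced.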
For code editing in particular, there are Unicode safety plugins for editors sprouting up; see Related Software. It generally helps if you use an IDE with real-time in-editor syntax checking; most modern IDEs for most languages support this.
Some simple grep-fu will highlight characters outside the printable ASCII range (space through tilde), thanks to GrumpenKraut:
grep --color -n '[^ -~]'
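Where grep is not available, or where a report naming each offending character is more useful, roughly the same check can be sketched in Python. The function name `find_suspicious` is hypothetical. Like the grep pattern, it flags everything outside the range from space to tilde, so tabs are reported too.

```python
import unicodedata


def find_suspicious(lines):
    """Yield (line_no, column, char, name) for characters outside ' '..'~'."""
    for line_no, line in enumerate(lines, start=1):
        # Ignore the trailing newline, as grep does.
        for column, ch in enumerate(line.rstrip("\n"), start=1):
            if not (" " <= ch <= "~"):
                # Control characters have no Unicode name, hence the default.
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, column, ch, name


if __name__ == "__main__":
    import sys

    # Usage: python find_suspicious.py < some_file
    for line_no, column, ch, name in find_suspicious(sys.stdin):
        print(f"{line_no}:{column}: {ch!r} {name}")
```

Reporting the Unicode name makes it easier to tell a deliberate homoglyph from a legitimate accented or non-Latin character.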
[Wikipedia: Unicode Equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)