Greg Toombs edited this page Oct 31, 2015 · 12 revisions

The mimic user is essentially able to write text that, by way of its appearance, means one thing to the victim seeing it, and quite another thing to the victim's computer. This has several implications.

Coding pranks

Most obviously, mimicked characters can be used to subtly break a coworker's code in ways that may be difficult to debug or even notice.

Scamming / spoofing

In situations where text alone identifies a third party, such as an email address or a domain name, an attacker could use a mimicked identifier to impersonate that third party.
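As a minimal sketch of why this works, the Python snippet below builds a look-alike of a domain name. The domain and the particular substitution (Latin "o" replaced by Cyrillic "о", U+043E) are assumptions chosen for illustration only.

```python
import unicodedata

genuine = "google.com"
# Assumption for illustration: swap Latin "o" (U+006F) for Cyrillic "о" (U+043E).
spoofed = genuine.replace("o", "\u043e")


def describe(text):
    """Return the Unicode name of every non-ASCII character in text."""
    return [unicodedata.name(ch) for ch in text if ord(ch) > 127]


# The two strings render alike in many fonts but are different code points.
print(genuine == spoofed)
print(describe(spoofed))
```

To the computer the two identifiers are unrelated strings, so any check based on exact comparison treats the spoofed one as a distinct, unknown party.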

Plagiarism

Someone could, in theory, mimic a copied text to evade plagiarism-detection software, which would be unlikely to match the mimicked copy against any original source.

Spamming

In a context where anti-spam software bases detection on matches of common phrases, a spammer could mimic spam text in a different way every time so that the anti-spam software never considers it a match to known spam.
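The idea can be sketched in Python as follows. The homoglyph table, the `mimic_text` helper, the phrase list, and the naive filter are all hypothetical stand-ins for illustration; real anti-spam software and real homoglyph tables are far more elaborate.

```python
import random

# Hypothetical, tiny homoglyph table; a real tool would use a much larger one.
HOMOGLYPHS = {
    "a": ["\u0430"],            # CYRILLIC SMALL LETTER A
    "e": ["\u0435"],            # CYRILLIC SMALL LETTER IE
    "o": ["\u043e", "\u03bf"],  # CYRILLIC SMALL LETTER O, GREEK SMALL LETTER OMICRON
}

# Hypothetical list of phrases a naive filter might look for.
KNOWN_SPAM = ["free money"]


def mimic_text(text, rng, rate=0.5):
    """Replace each replaceable character with probability `rate`."""
    out = []
    for ch in text:
        options = HOMOGLYPHS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def naive_filter(message):
    """Flag a message if it contains a known spam phrase verbatim."""
    return any(phrase in message for phrase in KNOWN_SPAM)


rng = random.Random()
message = "free money now"
for _ in range(3):
    variant = mimic_text(message, rng)
    # Variants usually differ from run to run; any substitution inside the
    # phrase defeats the exact-substring match.
    print(variant, naive_filter(variant))
```

Because the substitutions are random, each mimicked message is likely to be a different string, which is what makes a fixed list of known phrases ineffective.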

Evasion of indexing

If someone, for either just or malevolent reasons, wants to prevent text from being meaningfully indexed by a search engine, they could mimic it. The search engine would still index the text, but ordinary search terms would no longer match it.

Evasion of censorship

Again for either just or malevolent reasons, mimicked text could in theory circumvent automatic censorship mechanisms such as vulgarity filters. One could even extend this to attempt to evade automatic keyword censorship by a totalitarian government, for instance.

Steganography

When replacing characters, there are two choices: which original character to replace, and which new character to substitute for it. These choices can be leveraged to encode a hidden data stream; in other words, text can look the same as it did before but carry a hidden message. (This is a planned feature of Mimic.)
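One possible encoding, sketched in Python below, stores one bit per replaceable character: the original character means 0 and its look-alike means 1. This is an assumed scheme for illustration, not a description of how Mimic plans to implement the feature. The `PAIRS` table and the `hide` and `reveal` helpers are hypothetical, and the sketch assumes the cover text contains no look-alike characters to begin with.

```python
# Hypothetical one-to-one homoglyph pairs (Latin -> Cyrillic), for illustration.
PAIRS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}
REVERSE = {v: k for k, v in PAIRS.items()}


def hide(cover, bits):
    """Embed bits in cover: 1 = swap in the look-alike, 0 = keep the original."""
    out = []
    pending = list(bits)
    for ch in cover:
        if ch in PAIRS and pending:
            bit = pending.pop(0)
            out.append(PAIRS[ch] if bit else ch)
        else:
            out.append(ch)
    if pending:
        raise ValueError("cover text has too few replaceable characters")
    return "".join(out)


def reveal(text, count):
    """Recover the first `count` bits from text produced by hide()."""
    bits = []
    for ch in text:
        if len(bits) == count:
            break
        if ch in PAIRS:
            bits.append(0)
        elif ch in REVERSE:
            bits.append(1)
    return bits


secret = [1, 0, 1, 1, 0]
stego = hide("open a case please", secret)
print(reveal(stego, len(secret)) == secret)
```

Using several look-alikes per character, rather than one, would raise the capacity above one bit per replaceable character, at the cost of a larger table that both sides must share.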

Where do we go from here?

In contexts sensitive to the above issues, several approaches could be taken:

  • As mimic is already able to do, check for unusual or suspicious characters that are in unexpected Unicode ranges.
  • Again, as mimic is already able to do, attempt to replace such characters with characters that are considered more conventional.
  • Maintain and utilize an index of known popular "truth" terms (e.g. google.com), and warn if a potential homoglyph attack is attempting to spoof such a truth term.
  • As a heavier, more general solution, apply a round-trip render->OCR algorithm to check for discrepancies.
  • Re-evaluate whether it is important to support Unicode in the first place for certain applications, and consider using more restrictive character sets.
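The first three approaches above can be sketched together in Python. This is an illustrative sketch, not mimic's actual implementation: the `CONFUSABLES` table, the `TRUTH_TERMS` set, and all three helper functions are hypothetical, and a practical tool would need a far larger look-alike table.

```python
# Hypothetical look-alike table and "truth" list; real ones would be far larger.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}
TRUTH_TERMS = {"google.com"}


def suspicious(text):
    """Return (index, character) pairs for characters outside printable ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if not " " <= ch <= "~"]


def normalize(text):
    """Replace known look-alikes with their conventional counterparts."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def spoofed_term(text):
    """Return the truth term that text imitates, or None if there is none."""
    plain = normalize(text)
    if plain != text and plain in TRUTH_TERMS:
        return plain
    return None


candidate = "g\u043e\u043egle.com"
print(suspicious(candidate))
print(normalize(candidate))
print(spoofed_term(candidate))
```

Treating everything outside printable ASCII as suspicious is only appropriate where ASCII is expected, such as most source code; text in other scripts would need a per-script or mixed-script check instead.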

For code editing in particular, there are Unicode safety plugins for editors sprouting up; see Related Software. It generally helps if you use an IDE with real-time in-editor syntax checking; most modern IDEs for most languages support this.

Some simple grep-fu, thanks to GrumpenKraut, will highlight every character that falls outside the range from space to tilde, i.e. anything that is not printable ASCII:

grep --color -n '[^ -~]'

See also

Wikipedia: Unicode equivalence (https://en.wikipedia.org/wiki/Unicode_equivalence)

Wikipedia: IDN homograph attack

Online homoglyph generator