Fun with UTF-8: Homoglyphs
ꓧ𐐬𝗆𐐬𝗀ⅼУрႹ ⅰѕ 𝗌е𝗍 𝗈ſ ဝո𝖾 𝗈г ꝳо𝗋е ɡ𝗋аρႹ𝖾ⅿе𝗌 𝗍Ⴙа𝗍 Ⴙ𝖺ѕ 𝗂ꝱ𝖾ꝴ𝗍𝗂𐐽а𝗅 о𝗋 ѵ𝖾г𝗒 𝗌Ꭵⅿі𝗅аꝵ ⅼоо𝗄 𝗍ᴏ 𝗌ဝⅿ𝖾 о𝗍ꓧ𝖾𝗋 𝗌е𝗍 ဝſ ɡꝵ𝖺рႹеⅿеѕ. As in the previous sentence, which does not use a single ASCII letter:
ꓧ - LISU LETTER XA
𐐬 - DESERET SMALL LETTER LONG O
𝗆 - MATHEMATICAL SANS-SERIF SMALL M
𐐬 - DESERET SMALL LETTER LONG O
𝗀 - MATHEMATICAL SANS-SERIF SMALL G
ⅼ - SMALL ROMAN NUMERAL FIFTY
У - CYRILLIC CAPITAL LETTER U
р - CYRILLIC SMALL LETTER ER
Ⴙ - GEORGIAN CAPITAL LETTER CHIN
...
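A listing like the one above can be produced mechanically. The following is a minimal Python sketch using only the standard `unicodedata` module; the helper names `describe` and `has_ascii_letter` are made up for this illustration, and the sample string is written with escapes so it survives any encoding mishap.

```python
import unicodedata

def describe(text):
    """Return one 'char - UNICODE NAME' line per character of text."""
    return [f"{ch} - {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

def has_ascii_letter(text):
    """True if any character is an ASCII letter (A-Z or a-z)."""
    return any(ch.isascii() and ch.isalpha() for ch in text)

# "lyp" written with a Roman numeral and two Cyrillic letters.
sample = "\u217c\u0423\u0440"
for line in describe(sample):
    print(line)
print(has_ascii_letter(sample))
```

The last line is a quick way to check a claim such as "does not use a single ASCII letter" without inspecting every glyph by eye.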
Homoglyphs are not Unicode-specific, but it was the ability to write in many scripts using a single UTF encoding that made them popular.
Similarity is conditional
It is font dependent. Two sets of graphemes that look very similar (or even identical) in one font may not look that similar in another. For example, т - CYRILLIC SMALL LETTER TE looks like ASCII T, but in cursive fonts (those that resemble connected handwriting) it looks like m.
Similarity is subjective
For many people unfamiliar with the given alphabets, Ǧ and Ğ may look exactly the same. But someone who uses those letters on a daily basis will notice immediately that the first one has a CARON and the other has a BREVE on top.
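When the eye cannot tell two accented letters apart, their canonical decomposition can. Here is a small Python sketch, assuming only the standard `unicodedata` module; the function name `marks` is made up for this illustration.

```python
import unicodedata

def marks(ch):
    """Names of the combining marks in the canonical decomposition of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

print(marks("\u01e6"))  # LATIN CAPITAL LETTER G WITH CARON
print(marks("\u011e"))  # LATIN CAPITAL LETTER G WITH BREVE
```

Both letters decompose to a plain G followed by one combining mark, so the difference becomes a simple string comparison.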
They are not limited to single grapheme
For example, သ - MYANMAR LETTER THA looks like two ASCII o letters. And the other way around: ASCII rn looks like a single ASCII letter m.
Applications?
Fun. 𝐋ǐkǝ pɹoducǐng weird looking bᴜt ɹeadɐble ʇext.
Trolling. A programmer's classic is to replace, in someone's code, ; with ; (GREEK QUESTION MARK, U+037E) and watch some funny debugging attempts. A more advanced version is to modify a keybinding. For example, on macOS create ~/Library/KeyBindings/DefaultKeyBinding.dict with the following content:
{
";" = ("insertText:", ";");
}
And observe how Python suddenly becomes someone's favorite language :P
Just promise you won't troll a stressed-out junior dev before the end of a sprint.
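For the victim's side of this prank, a quick scan for non-ASCII characters usually ends the debugging session. The sketch below is a minimal Python example using only the standard library; `find_non_ascii` is a name invented here, and the C-like sample line is purely illustrative.

```python
import unicodedata

def find_non_ascii(source):
    """Yield (line_no, column, char, name) for every non-ASCII character."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")

# The first statement ends with U+037E instead of an ASCII semicolon.
code = "int x = 1\u037e\nint y = 2;"
hits = list(find_non_ascii(code))
for hit in hits:
    print(hit)

# NFC normalization maps GREEK QUESTION MARK back to ';'.
print(unicodedata.normalize("NFC", "\u037e") == ";")
```

This flags legitimate non-ASCII text in strings and comments as well, so treat the output as a list of places to look rather than a list of errors.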
- Phishing. This is "Fun with UTF-8" sub series, but unfortunately this application is anything but fun. Homoglyphs are massively used to spoof company names, bypass anti-spam filters and create fake domains. For example can you spot difference between
Paypal
andκayΡΠ°l
?
A common way to detect those is to check the Script Unicode property (more on those in this post). A single word using more than one script should be considered suspicious:
$ raku -e '"Paypal".comb.classify( *.uniprop("Script") ).say'
{Latin => [P a y p a l]} # real
$ raku -e '"κayΡΠ°l".comb.classify( *.uniprop("Script") ).say'
{Cyrillic => [Ρ Π°], Latin => [a y l], Lisu => [κ]} # fake
Raku note: the comb method without a parameter extracts a list of characters. Those characters are grouped by the classify method, and the classification key is the output of the uniprop method for a given character.
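The same check can be approximated outside Raku. Python's standard library does not expose the Script property, so this sketch assumes that the first word of a character's Unicode name (LATIN, CYRILLIC, LISU, ...) is a good enough stand-in. That is a heuristic, not the real property: digits and punctuation, for example, would land in groups of their own. The function names are made up for this illustration.

```python
import unicodedata
from collections import defaultdict

def classify_by_script(word):
    """Group characters by the first word of their Unicode name.

    This only approximates the Script property used in the Raku example.
    """
    groups = defaultdict(list)
    for ch in word:
        name = unicodedata.name(ch, "UNKNOWN")
        groups[name.split()[0]].append(ch)
    return dict(groups)

def looks_spoofed(word):
    """A word mixing more than one script-like group is suspicious."""
    return len(classify_by_script(word)) > 1

real = "Paypal"
fake = "\ua4d1ay\u0440\u0430l"  # LISU PA, a, y, CYRILLIC ER, CYRILLIC A, l
print(classify_by_script(real), looks_spoofed(real))
print(classify_by_script(fake), looks_spoofed(fake))
```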
Tools
I maintain the HomoGlypher library/package, which handles common homoglyph operations:
Unwind. From ASCII text, create a list of all possible homoglyphied text variants. This is useful, for example, for checking whether some domain is spoofed.
Collapse. From homoglyphied text, recover all possible ASCII text variants. Useful for normalizing text before passing it to content filters.
Randomize. From ASCII text, create a single homoglyphied text with a given replacement probability.
Tokenize. Create a regular expression token that will match homoglyphied text equivalent to the given ASCII text. I think this may be the only homoglyph-related library in existence with this feature :)
A huge list of mappings is provided, so you won't have to dig through Unicode blocks on your own to find possible similarities between graphemes.
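To make the operations above more concrete, here is a rough Python sketch of three of them. It is not the HomoGlypher API: the function names, the three-entry `HOMOGLYPHS` table and the one-character-to-one-character simplification are all assumptions made for illustration. In particular, `collapse` here returns a single ASCII variant, and multi-grapheme pairs such as rn and m are not handled.

```python
import itertools
import re

# Tiny hypothetical mapping; a real list would be far larger.
HOMOGLYPHS = {
    "a": ["\u0430"],            # CYRILLIC SMALL LETTER A
    "p": ["\u0440"],            # CYRILLIC SMALL LETTER ER
    "P": ["\ua4d1", "\u0420"],  # LISU LETTER PA, CYRILLIC CAPITAL LETTER ER
}
REVERSE = {g: a for a, glyphs in HOMOGLYPHS.items() for g in glyphs}

def unwind(text):
    """All variants of text, the original included, under HOMOGLYPHS."""
    options = [[ch] + HOMOGLYPHS.get(ch, []) for ch in text]
    return ["".join(combo) for combo in itertools.product(*options)]

def collapse(text):
    """Map every known homoglyph back to its ASCII letter."""
    return "".join(REVERSE.get(ch, ch) for ch in text)

def tokenize(text):
    """Compile a regex with one character class per input character."""
    parts = []
    for ch in text:
        alternatives = [ch] + HOMOGLYPHS.get(ch, [])
        parts.append("[" + "".join(re.escape(a) for a in alternatives) + "]")
    return re.compile("".join(parts))

fake = "\ua4d1ay\u0440\u0430l"
print(len(unwind("Pa")))
print(collapse(fake))
print(bool(tokenize("Paypal").fullmatch(fake)))
```

The number of variants grows multiplicatively with text length, so for anything longer than a short name the regex from `tokenize` is the cheaper way to test for a match.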
Give it a try. And if you know other homoglyph libraries please leave a note in the comments for future readers.
Featured ones: