Fun with UTF-8: Homoglyphs

Published at
9/15/2023
Categories
unicode
utf
raku
Author
bbkr

ꓧ𐐬𝗆𐐬𝗀ⅼУрႹ ⅰѕ 𝗌е𝗍 𝗈ſ ဝո𝖾 𝗈г ꝳо𝗋е ɡ𝗋аρႹ𝖾ⅿе𝗌 𝗍Ⴙа𝗍 Ⴙ𝖺ѕ 𝗂ꝱ𝖾ꝴ𝗍𝗂𐐽а𝗅 о𝗋 ѵ𝖾г𝗒 𝗌Ꭵⅿі𝗅аꝵ ⅼꝏ𝗄 𝗍ᴏ 𝗌იო𝖾 о𝗍ꜧ𝖾𝗋 𐑈е𝗍 ဝſ ɡꝵ𝖺рႹеოеѕ. Like the previous sentence, which does not use a single ASCII letter:

ꓧ - LISU LETTER XA
𐐬 - DESERET SMALL LETTER LONG O
𝗆 - MATHEMATICAL SANS-SERIF SMALL M
𐐬 - DESERET SMALL LETTER LONG O
𝗀 - MATHEMATICAL SANS-SERIF SMALL G
ⅼ - SMALL ROMAN NUMERAL FIFTY
У - CYRILLIC CAPITAL LETTER U
р - CYRILLIC SMALL LETTER ER
Ⴙ - GEORGIAN CAPITAL LETTER CHIN
...

Homoglyphs are not specific to Unicode, but it was the ability to write in many scripts using a single UTF encoding that made them popular.

Similarity is conditional

It is font dependent. Two sets of graphemes that look very similar (or even identical) in one font may not look that similar in another. For example т - CYRILLIC SMALL LETTER TE looks like ASCII T, but in cursive fonts (those that resemble handwriting with connected letters) it looks like m.

Similarity is subjective

For many people unfamiliar with the given alphabets, Ǧ and Ğ may look exactly the same. But someone who uses those letters on a daily basis will notice immediately that the first one has a CARON and the other has a BREVE on top.
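When the eye cannot tell two letters apart, the code point names can. A minimal sketch, assuming only Raku's built-in `ord`, `fmt` and `uniname` methods; it should print one line per letter, with names ending in WITH CARON and WITH BREVE respectively:

```raku
# Print the code point and the official Unicode name of each look-alike letter.
for 'Ǧ', 'Ğ' -> $letter {
    say $letter.ord.fmt('U+%04X'), ' ', $letter.uniname;
}
```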

They are not limited to single grapheme

For example ထ - MYANMAR LETTER THA looks like two ASCII o letters. And the other way around: ASCII rn looks like a single ASCII letter m.

Applications?

  • Fun. 𐐑ǃkǝ pɹoducǃng weird looking bᴝt ɹeadɐble ʇext.

  • Trolling. A programmer's classic is to replace ; in someone's code with ; (U+037E GREEK QUESTION MARK) and watch some funny debugging attempts. A more advanced version is to modify a keybinding. For example, on macOS create ~/Library/KeyBindings/DefaultKeyBinding.dict with the following content:

{
    ";" = (insertText:, ";");
}

And observe how Python suddenly becomes someone's favorite language :P

Just promise you won't troll a stressed-out junior dev before the end of a sprint.

  • Phishing. This is a "Fun with UTF-8" sub-series, but unfortunately this application is anything but fun. Homoglyphs are widely used to spoof company names, bypass anti-spam filters and create fake domains. For example, can you spot the difference between Paypal and ꓑayраl?

A common way to detect those is to check the Script Unicode property (more on properties in this post). A single word using more than one script should be considered suspicious:

$ raku -e '"Paypal".comb.classify( *.uniprop("Script") ).say'
{Latin => [P a y p a l]} # real

$ raku -e '"ꓑayраl".comb.classify( *.uniprop("Script") ).say'
{Cyrillic => [р а], Latin => [a y l], Lisu => [ꓑ]} # fake

Raku note: the comb method called without parameters extracts a list of characters. Those characters are grouped by the classify method, and the classification key is the output of the uniprop method for a given character.
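The one-liners above check a single word. As a rough sketch of how the same check could be applied to a whole text, here is a small helper. The names `scripts-of` and `mixed-script-words` are made up for this example, and skipping the Common and Inherited scripts (digits, punctuation, combining marks) is my assumption so that they do not trigger false alarms:

```raku
# Hypothetical helpers for illustration, not part of any library.

# Scripts used by the characters of one word, ignoring script-neutral ones.
sub scripts-of(Str $word) {
    my @scripts = $word.comb.map( *.uniprop('Script') );
    @scripts.grep( { $_ ne 'Common' && $_ ne 'Inherited' } ).unique;
}

# Words of a text that mix more than one script.
sub mixed-script-words(Str $text) {
    $text.words.grep( { scripts-of($_).elems > 1 } );
}

say mixed-script-words('Log in to ꓑayраl or Paypal today');
```

Only the spoofed word should be reported here. Keep in mind this is a heuristic: some languages legitimately mix scripts, and a word spoofed entirely in one foreign script will pass unnoticed.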

Tools

I maintain the HomoGlypher library/package, which handles common homoglyph operations:

  • Unwind. From ASCII text, create a list of all possible homoglyphied text variants. This is useful, for example, for checking whether some domain is spoofed.

  • Collapse. From homoglyphied text, recover all possible ASCII text variants. Useful for normalizing text before passing it to content filters.

  • Randomize. From ASCII text, create a single homoglyphied text with a given replacement probability.

  • Tokenize. Create a regular expression token that will match homoglyphied text equivalent to a given ASCII text. I think this may be the only homoglyph-related library in existence with this feature :)

A huge list of mappings is provided, so you won't have to dig through Unicode blocks on your own to find possible similarities between graphemes.
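To give a feel for what such a mapping does, here is a toy sketch of the collapse idea. It does not use HomoGlypher and is not its API: the three-entry `%to-ascii` table and the `collapse-naive` sub are assumptions made up for this example, and unlike a real collapse it returns a single variant instead of all possible ones.

```raku
# Toy mapping for illustration only; a real list is far longer.
my %to-ascii =
    'ꓑ' => 'P',   # LISU LETTER PA
    'р' => 'p',   # CYRILLIC SMALL LETTER ER
    'а' => 'a';   # CYRILLIC SMALL LETTER A

# Replace every mapped homoglyph with its ASCII look-alike,
# leaving all other characters untouched.
sub collapse-naive(Str $text) {
    $text.comb.map( { %to-ascii{$_} // $_ } ).join;
}

say collapse-naive('ꓑayраl');   # the fake name from the phishing example
```

A one-to-one table like this breaks down quickly, because one glyph may resemble several ASCII letters, or a pair of them. That is why a real tool has to return all candidate variants.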

Give it a try. And if you know other homoglyph libraries please leave a note in the comments for future readers.
