UTF-8 (de)composition

Published at
8/10/2023
Categories
unicode
utf
raku
Author
bbkr

Composition is the process of merging a base character and the combining code points that follow it into a single, precomposed code point.

Let's start with the simple letter a:

$ raku -e '
    my $text = "a";
    $text.uniname.say;
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

LATIN SMALL LETTER A # Code point name
61                   # Code point number
1                    # Single character
1                    # Single code point
1                    # Encoded in UTF-8 using single byte

Raku note: This language has no length method on strings, because in the Unicode world "length" is ambiguous. Instead there are separate methods that ask precisely for the number of characters, the number of code points and the number of bytes.

Let's do the same for the "ogonek" (tiny tail), a combining code point that appeared in previous posts:

$ raku -e '
    my $text = "\c[COMBINING OGONEK]";
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

328 # Code point number
1   # Single character
1   # Single code point
2   # Encoded in UTF-8 using two bytes

And smash them together:

$ raku -e '
    my $text = "a\c[COMBINING OGONEK]";
    $text.say;
    $text.uniname.say;
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

ą                                # Glyph
LATIN SMALL LETTER A WITH OGONEK # Code point name
105                              # Code point number
1                                # Single character
1                                # Single code point
2                                # Encoded in UTF-8 using two bytes

Our two code points, U+0061 and U+0328, were composed together and produced another code point, U+0105. This is more obvious when we look at the glyphs: a + ̨ = ą.
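The composition above can also be checked from the other direction. The following is a minimal sketch, assuming a recent Rakudo; the variable names are made up for illustration. It compares the precomposed spelling with the base-plus-combining spelling and lists the code points each one holds.

```raku
# Two spellings of the same letter: precomposed vs. base + combining mark.
my $precomposed = "\c[LATIN SMALL LETTER A WITH OGONEK]";
my $combined    = "a\c[COMBINING OGONEK]";

# Render a sequence of code points as space-separated hex numbers.
sub as-hex ( $code-points ) { $code-points.list.map( *.base( 16 ) ).join( " " ) }

say $precomposed eq $combined;    # True
say as-hex( $combined.ords );     # 105
say as-hex( $combined.NFD );      # 61 328
```

Both spellings end up as the same single code point inside the string, and the decomposed view is still available on request through NFD.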

[Image: Dog chasing tail (source: Warren Photographics)]

In less technical terms

Composition reflects natural language. Sometimes the base letters of a script were not enough to express the nuances of a given language. To solve that, derivatives of base letters were created by adding small modifiers that indicate differences in pronunciation, accent, tone or stress. Those modifiers are commonly known as diacritics. The best known are the acute, macron, tilde, grave, diaeresis and ogonek.

But why did Unicode decide to provide two ways of expressing the same thing?

Compression

In the example above the base character takes 1 byte and the combining ogonek takes 2 bytes. Because the composed ą has its own code point in the 2-byte range, it can be written using 2 bytes instead of 3. This quickly adds up in alphabets that use diacritics extensively, so +1 for the composed form.
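To put numbers on that saving, here is a small sketch. The helper utf8-bytes is hypothetical (not a Raku built-in): it encodes each code point on its own and sums the sizes, which lets us measure the decomposed form even though Raku strings are always stored composed.

```raku
# Hypothetical helper: UTF-8 size of a sequence of code points,
# computed by encoding every code point separately and summing the sizes.
sub utf8-bytes ( @code-points ) {
    @code-points.map( { .chr.encode.bytes } ).sum
}

say utf8-bytes( "ą".NFC.list );       # 2  (composed)
say utf8-bytes( "ą".NFD.list );       # 3  (decomposed)

say utf8-bytes( "żółw".NFC.list );    # 7  (composed)
say utf8-bytes( "żółw".NFD.list );    # 9  (decomposed)
```

In the Polish word only two of the four letters decompose, and the decomposed form is already two bytes longer.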

Comparison

When comparing two texts, either the composed or the decomposed form can be used, assuming of course that both texts use the same form consistently. However, a problem occurs when there is more than one combining code point, as for example in ǭ.

$ raku -e '
    my $text1 = "\c[LATIN SMALL LETTER O]\c[COMBINING MACRON]\c[COMBINING OGONEK]";
    my $text2 = "\c[LATIN SMALL LETTER O]\c[COMBINING OGONEK]\c[COMBINING MACRON]";
    say $text1 eq $text2;
    $text1.uniname.say;
'

True
LATIN SMALL LETTER O WITH OGONEK AND MACRON

The order of combining characters is irrelevant in composition as long as they belong to different combining classes, as the macron and the ogonek do. Both texts above are equal, despite being built from code points in a different order. A naive code-point-by-code-point comparison of the raw decomposed sequences would fail, so +1 for the composed form.
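The effect can be observed on raw code point sequences. This is a sketch that assumes Uni.new keeps the code points exactly as given, without normalizing them; as-hex is a made-up display helper.

```raku
# The same letter with its two marks given in both possible orders.
my $macron-first = Uni.new( 0x6F, 0x304, 0x328 );   # o + macron + ogonek
my $ogonek-first = Uni.new( 0x6F, 0x328, 0x304 );   # o + ogonek + macron

# Render a sequence of code points as space-separated hex numbers.
sub as-hex ( $code-points ) { $code-points.list.map( *.base( 16 ) ).join( " " ) }

say as-hex( $macron-first );        # 6F 304 328  (raw input, order kept)
say as-hex( $ogonek-first );        # 6F 328 304
say as-hex( $macron-first.NFC );    # 1ED
say as-hex( $ogonek-first.NFC );    # 1ED
say as-hex( $macron-first.NFD );    # 6F 328 304  (canonical order)
say as-hex( $ogonek-first.NFD );    # 6F 328 304
```

The raw sequences differ, while both normalized forms agree, so the comparison only breaks when unnormalized input is compared directly.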

Base comparison

Skipping diacritics is very common. Most of you would type Josip Belusic into a search engine when looking for information about the Croatian inventor Josip Belušić. It becomes even more common on smartphones, where limited keyboard space and single-handed typing discourage proper use of diacritics.

Previously the s and š characters were completely unrelated code points, for example in the ISO-8859-2 encoding. So a lot of search engines used huge mapping dictionaries to implement "Do What I Mean" behavior and return results whether or not diacritics were used in the search query.

With Unicode, not only is it easy to get the base-character form without any diacritic mapping tables:

$ raku -e '"Josip Belušić".samemark( "a" ).say'

Josip Belusic

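Raku note: This counterintuitive syntax is explained here. In short, samemark copies the marks of the pattern string onto the invocant, and the pattern "a" carries no marks. Luckily, a friendlier and faster nomark() method will be added to Raku soon, courtesy of @lizmat.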

But it is also easy to match base characters in regular expressions:

$ raku -e 'say "Josip Belušić" ~~ m:ignoremark/ Belusic /'

「Belušić」 # Matched part of text
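Both techniques can replace the old mapping dictionaries in a simple lookup. The sketch below is only an illustration: search-key is a hypothetical helper, and the list of names is made-up sample data. It strips marks with samemark and folds case with fc before comparing.

```raku
# Hypothetical helper: build a lookup key by stripping marks and folding case.
sub search-key ( Str $text ) {
    $text.samemark( "a" ).fc
}

# Sample data, for illustration only.
my @inventors = "Josip Belušić", "Nikola Tesla", "Faust Vrančić";

# A query typed the lazy way: lowercase, no diacritics.
my $query = "josip belusic";

.say for @inventors.grep( { search-key( $_ ) eq search-key( $query ) } );
# Josip Belušić
```

The original spelling is kept in the data and only the comparison keys are simplified, so results can still be displayed with proper diacritics.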

That gives +2 for the functionality of the decomposed form, resulting in a tie. Both the composed and the decomposed forms provide nice features for people working with text, and it was a good decision to have them both in Unicode.

Stroke trap!

Unicode defines STROKE combining characters, such as COMBINING SHORT STROKE OVERLAY. But stroked letters do not decompose:

$ raku -e '"Grøn gås".samemark("a").say'  # Green goose in Danish

Grøn gas

$ raku -e '"żółw".samemark("a").say' # Turtle in Polish

zołw

$ raku -e '.say for "łø".uninames'

LATIN SMALL LETTER L WITH STROKE
LATIN SMALL LETTER O WITH STROKE

Why? I was unable to find out. They clearly have a base Latin letter. If you know, please share in the comments.
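The difference is visible when the letters of the Polish example are decomposed one by one. A small sketch, with as-hex being a made-up display helper:

```raku
# Render a sequence of code points as space-separated hex numbers.
sub as-hex ( $code-points ) { $code-points.list.map( *.base( 16 ) ).join( " " ) }

say as-hex( "ż".NFD );    # 7A 307  (z + combining dot above)
say as-hex( "ó".NFD );    # 6F 301  (o + combining acute)
say as-hex( "ł".NFD );    # 142     (unchanged, no decomposition)
say as-hex( "ø".NFD );    # F8      (unchanged, no decomposition)
```

The stroked letters come back as the same single code point, so there is no separate mark for samemark to strip.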

More traps!

Æ does not decompose, it is simply LATIN CAPITAL LETTER AE, not A WITH E.

German ß does not decompose to SS because this transition only happens when case is changed.

Kanji does not decompose to Katakana or Hiragana, despite the fact that Katakana / Hiragana glyphs are often part of Kanji characters.

Roman numerals like or do not decompose.
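The first two traps can be checked in the same way. This sketch uses a made-up as-hex helper, and it uses case folding with fc as one example of a case operation that expands ß; the uc line is left without an expected output on purpose.

```raku
# Render a sequence of code points as space-separated hex numbers.
sub as-hex ( $code-points ) { $code-points.list.map( *.base( 16 ) ).join( " " ) }

say as-hex( "Æ".NFD );    # C6  (unchanged, no decomposition)
say as-hex( "ß".NFD );    # DF  (unchanged, no decomposition)

say "groß".fc;            # gross  (case folding expands ß)
say "ß".uc;               # uppercasing, the case change the text refers to
```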

Trivia

  • In Raku you cannot switch between composed and decomposed forms of a string, because all strings are automatically composed. However, there are methods to get the code point sequences of both forms:
$ raku -e '"ǭ".NFC.say; "ǭ".NFD.say;'

NFC:0x<01ed>
NFD:0x<006f 0328 0304>

If you want to find out what a string decomposes into, you can convert it back to code point names:

$ raku -e '.uniname.say for "ǭ".NFD'

LATIN SMALL LETTER O
COMBINING OGONEK
COMBINING MACRON
  • What happens if there is no composed code point and no glyph to represent it?

Funny stuff. Your browser or text editor will try to render it somehow, sometimes as the base character followed by the combining glyph, sometimes as an overlay.

$ raku -e '"\c[LATIN SMALL LETTER H]\c[COMBINING OGONEK]".say'

h̨ # There is no such letter in any alphabet
  • Is composition used only for diacritics?

No. There is a whole Inherited script category (ISO 15924 name "Code for inherited script") with tons of weird combining characters.

$ raku -e '"\c[LATIN SMALL LETTER O]\c[COMBINING LATIN SMALL LETTER O]".say'

oͦ # Snowman?
  • Does decomposition work with Emoji modifiers?

Yes.

$ raku -e 'say "👍🏿" ~~ m:ignoremark/ "👍" /'

「👍🏿」
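Going back to the h with ogonek from the trivia above, a short sketch shows how Raku counts such a letter when no composed code point exists for it:

```raku
my $text = "h\c[COMBINING OGONEK]";

say $text.chars;    # 1  (still a single character)
say $text.codes;    # 2  (but two code points, nothing to compose into)
say $text.NFC.list.map( *.base( 16 ) ).join( " " );    # 68 328
```

Even in the composed form the base letter and the mark stay separate code points, yet the string still counts them as one character.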

Coming up next: Grapheme clusters.
