UTF-8 (de)composition
Composition is the process of merging a base character and the combining code points that follow it into a single character.
Let's start with the simple letter a:
$ raku -e '
my $text = "a";
$text.uniname.say;
$text.ord.base( 16 ).say;
$text.chars.say;
$text.codes.say;
$text.encode.bytes.say;
'
LATIN SMALL LETTER A # Code point name
61 # Code point number
1 # Single character
1 # Single code point
1 # Encoded in UTF-8 using single byte
Raku note: This language has no length method on strings, because in the Unicode world that would be very confusing. Instead there are separate methods to ask precisely for the number of characters, the number of code points and the number of bytes.
Let's do the same for the "ogonek" (tiny tail), a combining code point that appeared in previous posts:
$ raku -e '
my $text = "\c[COMBINING OGONEK]";
$text.ord.base( 16 ).say;
$text.chars.say;
$text.codes.say;
$text.encode.bytes.say;
'
328 # Code point number
1 # Single character
1 # Single code point
2 # Encoded in UTF-8 using two bytes
And smash them together:
$ raku -e '
my $text = "a\c[COMBINING OGONEK]";
$text.say;
$text.uniname.say;
$text.ord.base( 16 ).say;
$text.chars.say;
$text.codes.say;
$text.encode.bytes.say;
'
ą # Glyph
LATIN SMALL LETTER A WITH OGONEK # Code point name
105 # Code point number
1 # Single character
1 # Single code point
2 # Encoded in UTF-8 using two bytes
Our two code points U+61 and U+328 were composed together and produced another code point, U+105. This is more obvious when we look at the glyphs: a + ̨ = ą.
(source: Warren Photographics)
In less technical terms
Composition reflects natural language. Sometimes the base letters of a given script were not enough to express the nuances of a given language. To solve that, derivatives of base letters were created by adding small modifiers that indicate differences in pronunciation: accent, tone or stress. Those modifiers are commonly known as "diacritics". The best known are: acute, macron, tilde, grave, diaeresis, ogonek, etc.
But why did Unicode decide to provide two ways of expressing the same thing?
Compression
In the example above the base character takes 1 byte and the diacritic takes 2 bytes. Because the composed ą code point fits in the 2-byte range of UTF-8, it can be written using 2 bytes instead of 3. This quickly adds up in alphabets that use diacritics extensively, so +1 for the composed form.
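To see how the savings add up for a whole word, here is a minimal sketch that counts UTF-8 bytes for both forms. The name utf8-sizes is made up for this example, and the decomposed size is estimated by encoding each code point of the NFD form separately.

```raku
# utf8-sizes is a name made up for this sketch.
# Returns (bytes when composed, bytes when decomposed).
sub utf8-sizes(Str $text) {
    # Raku strings are kept composed, so this is the size of the composed form
    my $composed = $text.encode("utf8").bytes;

    # Encode every code point of the decomposed form on its own and add up the sizes
    my $decomposed = $text.NFD.list.map(*.chr.encode("utf8").bytes).sum;

    return $composed, $decomposed;
}

say utf8-sizes("\c[LATIN SMALL LETTER A WITH OGONEK]");   # (2 3)
say utf8-sizes("żółw");                                   # (7 9)
```

In "żółw" only ż and ó decompose, so the decomposed form costs one extra byte for each of them.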
Comparison
When comparing two texts, either the composed or the decomposed form can be used, assuming of course that both texts use the same form consistently. However, a problem occurs when there is more than one combining code point, as for example in ǭ.
$ raku -e '
my $text1 = "\c[LATIN SMALL LETTER O]\c[COMBINING MACRON]\c[COMBINING OGONEK]";
my $text2 = "\c[LATIN SMALL LETTER O]\c[COMBINING OGONEK]\c[COMBINING MACRON]";
say $text1 eq $text2;
$text1.uniname.say;
'
True
LATIN SMALL LETTER O WITH OGONEK AND MACRON
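The order of these combining characters is irrelevant in composition (the ogonek attaches below the letter and the macron above it, so they do not interact). Both texts above are equal, despite the fact that they were composed from code points in a different order. The same comparison fails when the raw decomposed code point sequences are compared exactly as they were written, so +1 for the composed form.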
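The difference between comparing raw sequences and comparing composed strings can be sketched as follows. The helper name hex-codes is made up for this example. The last two lines rely on the fact that the normalized decomposition (NFD) also puts combining marks into one canonical order.

```raku
# Raw code point sequences for the same letter, marks in a different order.
my @macron-first = 0x6F, 0x304, 0x328;   # o, COMBINING MACRON, COMBINING OGONEK
my @ogonek-first = 0x6F, 0x328, 0x304;   # o, COMBINING OGONEK, COMBINING MACRON

# hex-codes is a helper name made up for this sketch.
sub hex-codes($codes) { $codes.list.map(*.base(16)).join(" ") }

# Compared exactly as written, the sequences differ.
say hex-codes(@macron-first) eq hex-codes(@ogonek-first);   # False

# Turned into Raku strings, both compose to the same character.
my $text1 = chrs(@macron-first);
my $text2 = chrs(@ogonek-first);
say $text1 eq $text2;        # True
say hex-codes($text1.NFC);   # 1ED

# Normalized decomposition yields one canonical order for both.
say hex-codes($text1.NFD);   # 6F 328 304
say hex-codes($text2.NFD);   # 6F 328 304
```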
Base comparison
Skipping diacritics is very common. Most of you would type Josip Belusic into a search engine when looking for information about the Croatian inventor Josip Belušić. And it becomes even more common with smartphones, where limited keyboard space and single-handed typing discourage proper use of diacritics.
Previously, s and š were completely unrelated code points, for example in the ISO-8859-2 encoding. So a lot of search engines used huge mapping dictionaries to implement "Do What I Mean" behavior and provide results whether or not diacritics were used in the search query.
With Unicode, not only is it easy to get the base form of characters without any diacritic mappings:
$ raku -e '"Josip Belušić".samemark( "a" ).say'
Josip Belusic
Raku note: This counterintuitive syntax is explained here. Luckily, a friendlier and faster nomark() method will be added to Raku soon, courtesy of @lizmat.
But also it is easy to match base characters in regular expressions:
$ raku -e 'say "Josip Belušić" ~~ m:ignoremark/ Belusic /'
「Belušić」 # Matched part of text
That gives +2 for the functionality of the decomposed form, resulting in a tie. Both composed and decomposed forms provide nice features for people working with text, and it was a good decision to have them both in Unicode.
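To show what decomposition makes possible here, below is a minimal sketch that strips diacritics by hand: decompose, drop every code point whose general category is a mark, and build the string again. The name strip-marks is made up for this example, and this is not a claim about how samemark or :ignoremark are implemented internally.

```raku
# strip-marks is a name made up for this sketch.
sub strip-marks(Str $text --> Str) {
    # Decompose into base characters and combining marks
    my @codes = $text.NFD.list;

    # Drop code points whose general category is a mark (Mn, Mc, Me)
    my @base = @codes.grep({ !uniprop($_, "General_Category").starts-with("M") });

    # Build a string again from the remaining code points
    return chrs(@base);
}

say strip-marks("Josip Belušić");                       # Josip Belusic
say strip-marks("Belušić") eq strip-marks("Belusic");   # True
```

Applying the same function to both the query and the searched text gives a diacritic-insensitive comparison without any mapping dictionary.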
Stroke trap!
There are STROKE combining characters defined in Unicode, like COMBINING SHORT STROKE OVERLAY. But stroked letters do not decompose:
$ raku -e '"Grøn gås".samemark("a").say' # Green goose in Danish
Grøn gas
$ raku -e '"żółw".samemark("a").say' # Turtle in Polish
zołw
$ raku -e '.say for "łø".uninames'
LATIN SMALL LETTER L WITH STROKE
LATIN SMALL LETTER O WITH STROKE
Why? I was unable to find out. They clearly have a base Latin letter. If you know, please share in the comments.
More traps!
- Æ does not decompose, it is simply LATIN CAPITAL LETTER AE, not A WITH E.
- German ß does not decompose to SS, because this transition only happens when case is changed.
- Kanji does not decompose to Katakana or Hiragana, despite the fact that Katakana / Hiragana glyphs are often part of Kanji characters.
- Roman numerals like Ⅳ or Ⅺ do not decompose.
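These claims are easy to check: a character that does not decompose is still a single code point after NFD. A small sketch, with ą added for contrast:

```raku
# Count code points after decomposition for a few of the "trap" characters
for "Æ", "ß", "Ⅳ", "Ⅺ", "ą" -> $char {
    say "$char: ", $char.NFD.list.elems;
}
```

The first four lines of output show 1, while ą shows 2 (base letter plus ogonek).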
Trivia
- In Raku you cannot switch between composed and decomposed forms of a string, because all strings are automatically composed. However, there are methods to get the code points of both forms:
$ raku -e '"ǭ".NFC.say; "ǭ".NFD.say;'
NFC:0x<01ed>
NFD:0x<006f 0328 0304>
If you want to find out what a string decomposes into, you can convert it back to code point names:
$ raku -e '.uniname.say for "ǭ".NFD'
LATIN SMALL LETTER O
COMBINING OGONEK
COMBINING MACRON
- What happens if there is no composed code point and no glyph to represent it?
Funny stuff. Your browser or text editor will try to render it somehow. Sometimes as the character followed by the combining glyph, sometimes as an overlay.
$ raku -e '"\c[LATIN SMALL LETTER H]\c[COMBINING OGONEK]".say'
h̨ # There is no such letter in any alphabet
- Is composition used only for diacritics?
No. There is a whole "Code for inherited script" category with tons of weird composable characters.
$ raku -e '"\c[LATIN SMALL LETTER O]\c[COMBINING LATIN SMALL LETTER O]".say'
oͦ # Snowman?
- Does decomposition work with Emoji modifiers?
Yes.
$ raku -e 'say "👍🏿" ~~ m:ignoremark/ "👍" /'
「👍🏿」
Coming up next: Grapheme clusters.