UTF-8 code point properties

Published at
8/5/2023
Categories
unicode
utf
raku
Author
bbkr

Having a single encoding that can express every intent in every language is awesome. But it comes at the cost of dealing with a huge number of characters you have probably never seen before. Luckily, every code point has a set of properties that can help you with text processing. There are over 100 properties in total, but the most useful are:

  • Major/minor category.
$ raku -e '"a".uniprop.say'
Ll

$ raku -e '"3".uniprop.say'
Nd

When uniprop is called without parameters in Raku, it returns the abbreviation of the major/minor category (General_Category in Unicode terms). Ll means letter/lowercase, Nd means number/decimal digit. The full list of these abbreviations is defined by the Unicode Standard. The major categories can also be tested independently, as shown below.

  • Letter, Number, Punctuation, Separator.

The daily bread of text processing.

raku -e '
    my $text = "Is this 1970 Dodge?";
    say $text;
    for "Letter", "Number", "Separator", "Punctuation" {
        $text.uniprops( $_ ).join.say;
    }
'

Is this 1970 Dodge?
1101111000000111110 # Letters
0000000011110000000 # Numbers
0010000100001000000 # Separators
0000000000000000001 # Punctuation

The uniprops method works like uniprop, but returns property values for all characters in the string. Both methods can be given a specific property to test against.
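As a small sketch of how the two ideas combine: calling uniprops without arguments gives one category code per character, and the first letter of each code is the major category. The variable names below are mine, not from any library.

```raku
# Tally the characters of a string by major category, i.e. the first
# letter of the category code (L = letter, N = number, P = punctuation,
# Z = separator).
my $text = "Is this 1970 Dodge?";

my %tally;
%tally{ .substr(0, 1) }++ for $text.uniprops;

.say for %tally.sort;
# L => 11
# N => 4
# P => 1
# Z => 3
```

The counts add up to the 19 characters of the sample sentence, which is a cheap sanity check that nothing fell outside these four groups.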

  • Script
$ raku -e '"aГΦح日".uniprops( "Script" ).say'

(Latin Cyrillic Greek Arabic Han)

A script is a writing system. It should not be confused with an alphabet - for example a and ą are both Latin script, but ą belongs to the Polish alphabet and not to the English one. And it should not be confused with a language, even though the two sometimes align - for example Greek.
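One practical use of the Script property is spotting text that mixes writing systems. Below is a minimal sketch, assuming that the values Common (shared characters such as digits and spaces) and Inherited (combining marks) should be ignored. The helper name scripts-used is hypothetical.

```raku
# Hypothetical helper: list the scripts a string mixes together,
# skipping the "Common" and "Inherited" pseudo-scripts.
sub scripts-used(Str $text) {
    $text.uniprops("Script")
         .unique
         .grep({ $_ ne "Common" && $_ ne "Inherited" })
         .List
}

say scripts-used("paypal");          # (Latin)
say scripts-used("p\x[0430]ypal");   # (Latin Cyrillic) - U+0430 is CYRILLIC SMALL LETTER A
```

A word that looks Latin but reports two scripts is worth a second look.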

How many scripts are there?

$ raku -e '.say for ( 1 .. 1_112_064 ).map( *.uniprop( "Script" ) ).unique'

Common
Latin
Bopomofo
Inherited
Greek
Unknown
Coptic
Cyrillic
Armenian
Hebrew
Arabic
Syriac
Thaana
Nko
...

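The answer is 158. The exact number depends on the Unicode version your Raku was built with, and it includes the special values Common, Inherited and Unknown visible in the output above.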

  • Casing

Did you know that the terms "lowercase" and "uppercase" come from the printing press? A page was composed from metal stamps. Stamps with the letters a, b, c were used more often than stamps with the letters A, B, C. So a, b, c were stored in the lower case on the desk, where they were easier to reach, while A, B, C were stored in the upper case above the desk to save space.

What do we know about the case of A?

$ raku -e 'say $_, " ", "A".uniprop( $_ ) for "Cased", "Lowercase", "Uppercase", "Lowercase_Mapping"'

Cased True
Lowercase False
Uppercase True
Lowercase_Mapping a

I won't go down the rabbit hole of titlecase vs uppercase. Foldcase, used for comparison, will appear in another post of this series. But Cased is an interesting property: it shows whether a given code point has an uppercase/lowercase form. For example, there is no concept of letter case in Kanji:

$ raku -e '"女".uniprop( "Cased" ).say'
False

$ raku -e 'say "女".lc.ord == "女".uc.ord'
True
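
The Cased property is handy when a check should only look at characters that can have a case at all. Here is a rough sketch with a hypothetical helper is-shouting, which assumes that uncased characters (Kanji, digits, spaces) should simply be skipped:

```raku
# Hypothetical helper: True when the text contains cased characters
# and every one of them is uppercase. Uncased characters are ignored.
sub is-shouting(Str $text) {
    my @cased = $text.comb.grep(*.uniprop("Cased"));
    return False unless @cased;
    so @cased.map(*.uniprop("Uppercase")).all
}

say is-shouting("女 STOP");   # True  - only S, T, O, P are cased
say is-shouting("女 Stop");   # False - t, o, p are lowercase
say is-shouting("女");        # False - nothing cased to judge
```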

[Image: dragon, source https://nohat.cc]

  • Numeric value

For numbers, Unicode also stores the value.

$ raku -e '"4 Ⅴ ¾ 8️⃣ ㊷ 兆".uniprops( "Numeric_Value" ).grep( Int|Rat ).say'

(4 5 0.75 8 42 1000000000000)

This is super useful in text normalization. However, you must be aware of different numeral systems if you want to convert text to a numeric type in your programming language. For example ⅤⅠ is not 51 but 6 in Roman numerals. Roman numerals are also tricky for another reason - because of their frequent use on watches and clocks in the past, there are single code points defined for every value up to 12. So Roman 11 can be expressed by a single code point Ⅺ or by two code points ⅩⅠ.
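
To make the warning concrete, here is a minimal sketch of a conversion that only accepts positional decimal digits (category Nd) and refuses everything else, Roman numerals included. The helper name decimal-digits-to-int is hypothetical, and real code may prefer whatever numeric parsing your language already offers.

```raku
# Hypothetical helper: combine decimal digits from any script into an
# integer. Only category Nd is accepted, because only those digits
# are guaranteed to be positional, base-10 digits.
sub decimal-digits-to-int(Str $digits) {
    die "expected only decimal digits, got '$digits'"
        unless $digits.chars && $digits.uniprops.all eq "Nd";
    $digits.uniprops("Numeric_Value").reduce({ $^a * 10 + $^b })
}

say decimal-digits-to-int("\x[0664]\x[0662]");   # 42, ARABIC-INDIC DIGIT FOUR and TWO
say decimal-digits-to-int("\x[096A]\x[0968]");   # 42, DEVANAGARI DIGIT FOUR and TWO
```

Feeding it ⅤⅠ throws an exception instead of quietly returning 51, because Roman numerals are not in the Nd category.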

Just for completeness - there are no numeric values defined for constants like π or ℇ. This is despite the fact that ℇ is EULER CONSTANT U+2107, a character that has no other purpose in life than being a numeric value.

Raku note: I added spaces between graphemes to increase readability, so later I had to filter out NaN values from the result by extracting only Integers and Rationals.

  • More punctuation

Punctuation is huge in Unicode. There are 7 main categories for it - Connector, Dash, Open, Close, Initial, Final, Other. A full explanation is way beyond the scope of this series, but I want to show a few examples that may help you work with text right away.

Regular ASCII? Ethiopic full stop? Double question mark? Extracting sentences from text has never been so easy:

$ raku -e '"!?.።⁇".uniprops( "Sentence_Terminal" ).say'
(True True True True True)

Well, until you get to articles about J. F. Kennedy, but that will be covered in a future post about regular expressions.
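
For illustration, a deliberately naive splitter built on Sentence_Terminal could look like the sketch below. The helper name naive-sentences is hypothetical, and it has exactly the weakness mentioned above: every terminal ends a sentence, so abbreviations get cut into pieces.

```raku
# Hypothetical helper: cut text after every character that has the
# Sentence_Terminal property. Naive on purpose - initials such as
# "J. F." are split as well.
sub naive-sentences(Str $text) {
    my @sentences;
    my $current = "";
    for $text.comb -> $char {
        $current ~= $char;
        if $char.uniprop("Sentence_Terminal") {
            @sentences.push($current.trim);
            $current = "";
        }
    }
    # Keep trailing text that has no terminal at the end
    @sentences.push($current.trim) if $current.trim;
    @sentences
}

.say for naive-sentences("Is this a Dodge? Yes! It is from 1970.");
# Is this a Dodge?
# Yes!
# It is from 1970.
```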

Just as there are tons of sentence terminals, there are also many dashes, 29 to be precise. Now you can find them easily:

$ raku -e '( "-", "—", "⸺", "⸻" ).map( *.uniprops( "Dash" ) ).flat.say'

(True True True True)

Did you know that a hyphen should not be slapped everywhere you mean a dash? According to the rules, a compound such as "semi-final" uses a hyphen, an inclusion such as "I was — as always — hungry" uses an em-dash (named after the width of the letter m), and a range such as "2020–2023" uses an en-dash (named after the width of the letter n). Yeah, right...
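
When you search or compare text, those typographic distinctions usually just get in the way. Here is a minimal sketch of a normalizer, assuming you are fine with replacing every code point that has the Dash property, which may include characters you would rather keep. The helper name normalize-dashes is hypothetical.

```raku
# Hypothetical helper: replace every character with the Dash property
# by a plain ASCII hyphen-minus.
sub normalize-dashes(Str $text) {
    $text.comb.map({ .uniprop("Dash") ?? "-" !! $_ }).join
}

say normalize-dashes("2020\x[2013]2023");   # 2020-2023 (was an en-dash)
say normalize-dashes("semi-final");         # semi-final (unchanged)
```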

For brackets, Unicode brings one more tool to your toolbox - you can check whether a bracket is opening or closing, and you can find the matching one:

$ raku -e '.say for "(}".uniprops;'

Ps # Open_Punctuation
Pe # Close_Punctuation

$ raku -e '.say for "(}".uniprops( "Bidi_Mirroring_Glyph" );'

)
{
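
Put together, these two properties are enough for a bracket checker that needs no hardcoded list of pairs. This is a sketch under one assumption: every opening bracket in the text has a Bidi_Mirroring_Glyph, which is not true for every Ps character. The helper name brackets-balanced is hypothetical.

```raku
# Hypothetical helper: check that brackets are properly nested, using
# only Unicode properties (Ps = opening, Pe = closing).
sub brackets-balanced(Str $text) {
    my @expected;   # stack of closing brackets we are still waiting for
    for $text.comb -> $char {
        given $char.uniprop {
            when "Ps" { @expected.push($char.uniprop("Bidi_Mirroring_Glyph")) }
            when "Pe" { return False unless @expected && @expected.pop eq $char }
        }
    }
    @expected.elems == 0
}

say brackets-balanced("f(a[1], {b})");   # True
say brackets-balanced("(]");             # False - wrong closing bracket
say brackets-balanced("((");             # False - never closed
```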

Can you feel the power already?

This post has only scratched the surface of Unicode properties. But with this knowledge you can take any piece of text and parse it without knowing all the letters and symbols used in different scripts.

Coming up next: Composed / decomposed forms.
