Intl.Segmenter(): Don't use string.split() nor string.length

Published at
7/25/2023
Categories
javascript
intl
encoding
unicode
Author
ayc0
Intl.Segmenter(): Don't use string.split() nor string.length
  1. TL;DR
  2. Explanation
    1. Definitions
    2. UTF-16
    3. String.prototype.length
    4. Unicode composition
    5. Emoji Sequence
  3. Intl.Segmenter
    1. Browser compatibility

The other day I was playing with JS and I saw this:

'é'.length; // precomposed: '\u00E9'
// 1
'é'.length; // decomposed: 'e\u0301'
// 2, not the same output as the line before
'é'.split('').join('|');
// 'e|́'

(Yes, all of those are valid, you can copy paste them 😅)

TL;DR

As an image is worth 1000 words:

Grapheme vs code unit vs code point

You can use Intl.Segmenter

const seg = new Intl.Segmenter('en', { granularity: "grapheme" });

[...seg.segment('🙌🏾')].length
// 1
[...seg.segment('é')].length
// 1

Explanation

This article will talk about character vs code unit vs code point vs grapheme vs glyph.

Definitions

  • Character: generic term that can mean any of the other 4 terms.
  • Code Unit: A code unit is the smallest unit of data in UTF-16 encoding. In UTF-16, each code unit is 16 bits (2 bytes) in size. It can represent a part of a character or a complete character, depending on the character's Unicode value.
  • Code Point: A code point is a numerical value assigned to a specific character in the Unicode standard. It's a unique identifier for each character and is typically represented in hexadecimal. For example, the code point for the letter "A" is U+0041. In UTF-16, every code point is encoded as either 1 or 2 code units.
  • Grapheme: A grapheme is the smallest unit of a writing system that carries meaning and represents a single "user-perceived" character. Every grapheme is made up of at least 1 code point. Not all code points form a visible character on their own, like the zero-width non-joiner.
  • Glyph: A glyph is a visual representation or image of a character. It is the actual shape or form of a character as it appears on a screen or in print. A single character can have multiple glyphs associated with it, representing different typographic variations or font styles.

You can check https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme for more details.

UTF-16

JavaScript strings use UTF-16, unlike many other languages, which use UTF-8. Note that UTF-8 would also have all of those issues.

In UTF-16, characters are encoded in 16-bit chunks (code units). For instance $ is encoded in hexadecimal as 0024 (thus its notation U+0024 or '\u0024'), and € is encoded as 20AC.

Problem: a single 16-bit code unit can only represent 65536 different values, so how do we represent the other characters? UTF-16 has a system where it uses 2 code units (a surrogate pair) to encode the code points above U+FFFF. For instance 𐐷, the code point U+10437, is encoded as D801 DC37 (a high surrogate D801 and a low surrogate DC37).

$, €, and 𐐷 encoded in UTF-16 in code units
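
To make the surrogate pair mechanism concrete, here is a minimal sketch of the arithmetic UTF-16 uses for code points above U+FFFF. The helper names `toSurrogatePair` and `hex` are made up for this illustration and are not part of any API; in practice `String.fromCodePoint` and `charCodeAt` do the work for you.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('Expected a code point between U+10000 and U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format a number as 4 uppercase hex digits.
const hex = (n) => n.toString(16).toUpperCase().padStart(4, '0');

const [high, low] = toSurrogatePair(0x10437);
console.log(hex(high), hex(low)); // D801 DC37

// The engine stores the same two code units for this character.
const s = String.fromCodePoint(0x10437);
console.log(hex(s.charCodeAt(0)), hex(s.charCodeAt(1))); // D801 DC37
console.log(s.length); // 2
```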

String.prototype.length

According to MDN, the length is based on code units:

The length data property of a String value contains the length of the string in UTF-16 code units.

This explains why for 🙌 (U+1F64C) or 𐐷 (U+10437), using .length doesn't return 1 as those are encoded in 2 code units:

'𐐷'.length; // U+10437
// 2
'🙌'.length; // U+1F64C
// 2

One possible fix for this case is to use iterators. According to MDN again, iterators work on code points (they say characters, but they mean code points):

Since length counts code units instead of characters, if you want to get the number of characters, you can first split the string with its iterator, which iterates by characters

And it does work indeed…

[...'𐐷'].length // U+10437
// 1
[...'🙌'].length // U+1F64C
// 1
[...'é'].length
// 2
[...'🙌🏾'].length
// 2

… but not for all characters. Why?

Unicode composition

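Another specificity of Unicode is that it can combine multiple code points to form a single grapheme. When such a sequence and a single precomposed code point represent the same character, the two forms are said to be canonically equivalent
(see https://unicode.org/reports/tr15/#Canon_Compat_Equivalence).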

For instance the letter "Ç" can either be the code point for this character, or the code point for "C" followed by the diacritic mark "◌̧"

Ç <-> C + ◌̧

We can also use normalization NFD and NFC to switch between the precomposed and decomposed forms (see https://unicode.org/reports/tr15/#Norm_Forms):

Many characters are known as canonical composites, or precomposed characters. In the D forms, they are decomposed; in the C forms, they are usually precomposed.

This explains why é's length was either 1 or 2 in the initial example:

  • decomposed form → 2 code points
  • precomposed form → 1 code point

In JavaScript, you can use String.prototype.normalize (MDN):

'é'.length;
// 1
'é'.normalize('NFD').length;
// 2
'é'.normalize('NFD').normalize('NFC').length;
// 1
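
One practical consequence of canonical equivalence is that two strings can look identical and still fail a `===` check. The sketch below normalizes both sides before comparing. `sameText` is a made-up helper name, and picking NFC rather than NFD is an arbitrary choice for this example.

```javascript
// The same visible character written in two canonically equivalent ways.
const precomposed = '\u00E9'; // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

console.log(precomposed === decomposed); // false

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText(precomposed, decomposed)); // true
```

Note that normalization only helps when a precomposed form exists. It does not turn emoji sequences such as the ones in the next section into a single code point.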

Emoji Sequence

Similarly to character compositions, emojis can be combined together with special characters (this is not an exhaustive list):

  • Skin tone modifiers can be used to customize the skin color of emojis
    For instance "🙌🏾" is composed of "🙌" + "🏾" (Medium-Dark Skin Tone modifier)

    [...'🙌🏾'];
    // ['🙌', '🏾']
    
  • Zero-Width Joiner (ZWJ) can be used to merge some emojis together
    For instance "😮‍💨" is composed of "😮" + ZWJ (U+200D) + "💨"
    And "👩‍👩‍👧‍👦" is composed of each individual family member plus ZWJs:

    [...'👩‍👩‍👧‍👦'];
    // ['👩', '\u200D', '👩', '\u200D', '👧', '\u200D', '👦']
    
  • Variation Selectors can be used to choose a different glyph variant for a code point
    For instance "ℹ️" is composed of "ℹ" + Variation Selector-16 (U+FE0F), which forces the display as an emoji

Intl.Segmenter

In 2021, the TC39 committee added Intl.Segmenter to ECMAScript:

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

Once a locale is picked, you can use .segment to get an iterable over the graphemes of a string:

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });

for (const grapheme of seg.segment('Hélô 👩‍👩‍👧‍👦 🙌🏾')) {
    console.log(grapheme.segment);
}
// "H"
// "é"
// "l"
// "ô"
// " "
// "👩‍👩‍👧‍👦"
// " "
// "🙌🏾"

And if you want to get the number of graphemes (like .length), you can transform it to an array first:

[...seg.segment('🙌🏾')].length;
// 1

[...seg.segment('é')].length;
// 1
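
The same idea extends to other operations that are usually written with .length or .slice. As a rough sketch, here are two helpers built on top of Intl.Segmenter. The names `graphemeCount` and `truncateGraphemes` are invented for this example, and the 'en' locale is just an assumption.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: count graphemes without building an intermediate array.
function graphemeCount(text) {
  let count = 0;
  for (const _ of segmenter.segment(text)) count++;
  return count;
}

// Hypothetical helper: keep at most `max` graphemes, never cutting one in half.
function truncateGraphemes(text, max) {
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === max) break;
    result += segment;
    count++;
  }
  return result;
}

// Family emoji: 4 emojis joined by 3 ZWJs (written with escapes to be explicit).
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(family.length);         // 11 code units
console.log(graphemeCount(family)); // 1 grapheme

// slice() works on code units and cuts the sequence after the first emoji.
console.log(family.slice(0, 2) === '\u{1F469}'); // true

// Truncating by graphemes keeps the whole family together.
console.log(truncateGraphemes('e\u0301' + family + '!', 2) === 'e\u0301' + family); // true
```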

Browser compatibility

Sadly, at the time of writing (July 2023), Intl.Segmenter is not supported in Firefox yet; check caniuse.com for the current status. You can track the related Firefox issue if you want to follow its development.
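
If you need to run in an environment without Intl.Segmenter, one option is to feature-detect it and degrade gracefully. The sketch below falls back to splitting by code points, which is still wrong for combined characters and emoji sequences but at least never cuts a surrogate pair in half. `splitGraphemes` is a made-up name, and this fallback is an assumption of this example rather than a recommendation from the spec.

```javascript
// Hypothetical helper: split into graphemes when Intl.Segmenter exists,
// otherwise fall back to code points (a best-effort approximation only).
function splitGraphemes(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  // Fallback: the string iterator yields code points, not graphemes.
  return Array.from(text);
}

// With Intl.Segmenter: 2 items. With the fallback: 3 items.
console.log(splitGraphemes('e\u0301!').length);
```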

Browser compatibility table
