Converting UTF (including emoji) to HTML

Published at
7/30/2021
Categories
javascript
unicode
emoji
utf8
Author
nikkimk

Sometimes my coworker likes to mention things just to get my mind stuck on them. Take the text from this request:

Because of some limitations both in UTF-8 and mysql (less a concern for us now but still..) it would probably be good to have some kind of simple-emoji type of tag. Similar to how we have simple-icon, a simple-icon could be used to provide minor tweaks / accounting for emojis in a consistent way.

So last night I worked on translating UTF (including emoji) into their HTML entities.

Basic Unicode to HTML Entity Conversion

I started with an adapted version of this conversion logic to convert any character that is not one of the 128 ASCII characters (character codes 0 to 127):

function utf2Html(str){
  let result = '', 

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity 
    char2Html = (char) => {
      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

We can check this function quite literally by dropping a UTF-8 checkmark ✓ into it. Its character code, 10003, is the same as its Unicode value, so it can be used to generate the correct HTML entity &#10003;.
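To see where 10003 comes from, here is a minimal sketch that looks at the checkmark on its own. The variable names are only for illustration and are not part of the function above.

```javascript
// '✓' is U+2713, which fits in a single 16-bit code unit
const check = '✓';

// charCodeAt(0) reads that one code unit as a decimal number
const code = check.charCodeAt(0);

// wrap the decimal value the same way decimal2Html does
const entity = `&#${code};`;

console.log(code, entity); // 10003 '&#10003;'
```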

The Problem with Emoji Conversion

While the function above works on special characters like ✓, it won't work on all of the emoji we have available today. I found a really good explanation for this in a post called Unicode in Javascript.

Take the 🤯 emoji, for example.

The character code for this emoji is 55358, so the entity returned by the function above would be &#55358;, which renders as � and does not work.

The Unicode value for 🤯 is actually 129327 (or 0001 1111 1001 0010 1111 in binary). In order to express this character in 16-bit form, it is split into a surrogate pair of 16-bit units, written in string form as \uD83E\uDD2F (according to this handy Surrogate Pair Calculator).
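The same numbers can be read straight out of the string. This is a small sketch, with illustrative variable names, that compares the two halves of the pair with the full code point.

```javascript
const emoji = '🤯';

// each index addresses one 16-bit code unit of the string
const high = emoji.charCodeAt(0); // first half of the pair (0xD83E)
const low = emoji.charCodeAt(1);  // second half of the pair (0xDD2F)

// codePointAt reads the pair as a single Unicode value (0x1F92F)
const codePoint = emoji.codePointAt(0);

console.log(high, low, codePoint);       // 55358 56623 129327
console.log(emoji === '\uD83E\uDD2F');   // true
```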

So in order to get the correct value, we need to know:

  • if a character is one of these surrogate pair emojis, and
  • how to calculate a surrogate pair's value.

Determining if an Emoji is a Surrogate Pair

A JavaScript string's length counts 16-bit code units, not characters.
For letters and simple symbols that is 1, but my emoji reports a length of 2:

JavaScript Result
't'.length 1
'✓'.length 1
'🤯'.length 2

But if I use the spread operator (...), which splits a string by code point instead of by code unit, each of them comes back as a single character.

JavaScript Result
[...'t'].length 1
[...'✓'].length 1
[...'🤯'].length 1

That means that once the string has been spread into characters, I can tell which characters are surrogate pairs if char.length > 1:

function utf2Html(str){
  let result = '', 

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity 
    char2Html = (char) => {
      let item = `${char}`;

      //a length of 2 code units means an emoji surrogate pair 
      if(item.length > 1) {
        //TODO calculate a surrogate pair's value
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character (spread keeps each surrogate pair together)
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}
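A small sketch for seeing how many 16-bit code units each character takes up once a string has been split by code point. The names `sample` and `report` are hypothetical and only used here.

```javascript
const sample = 't✓🤯';

// spreading splits by code point, so the emoji stays in one piece
const report = [...sample].map((char) => ({
  char,
  units: char.length,       // number of 16-bit code units
  isPair: char.length > 1,  // true only for surrogate pairs
}));

console.log(sample.length); // 4 code units in total
console.log(report.map((entry) => entry.units)); // [ 1, 1, 2 ]
```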

Notice I left a //TODO comment about calculating the pair. We'll tackle that next...

Calculating a Surrogate Pair's Unicode Value

I couldn't find a good post on converting a surrogate pair to its Unicode value, so instead I followed these steps for converting from Unicode to surrogate pairs in reverse:

# Step 🤯 Example
1 Get the value of each part of the pair. 55358 / 56623
2 Convert each value to a binary number. 1101100000111110 / 1101110100101111
3 Take the last 10 digits of each number. 0000111110 / 0100101111
4 Concatenate the two binary numbers into a single 20-bit binary number. 00001111100100101111
5 Convert 20-bit number to a decimal number. 63791
6 Add 0x10000 to the new number. 129327
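The steps in the table can also be done with bitwise arithmetic instead of binary strings. This is only a sketch of that alternative; `surrogatePairToCodePoint` is a hypothetical helper name, not something from the original post.

```javascript
// Hypothetical helper: combines a high and low surrogate into one Unicode value
function surrogatePairToCodePoint(high, low) {
  // keep the last 10 bits of each half (steps 1 to 3)
  const highBits = high & 0x3ff;
  const lowBits = low & 0x3ff;

  // join them into one 20-bit number (steps 4 and 5), then add 0x10000 (step 6)
  return ((highBits << 10) | lowBits) + 0x10000;
}

const value = surrogatePairToCodePoint(55358, 56623);
console.log(value); // 129327
console.log(value === '🤯'.codePointAt(0)); // true
```

Masking with 0x3ff does the same job as slicing the last 10 binary digits, and shifting left by 10 makes room for the second half.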

The Completed UTF (Including Emoji) to HTML Function

function utf2Html(str){
  let result = '', 
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity 
    char2Html = (char) => {
      let item = `${char}`;

      //a length of 2 code units means an emoji surrogate pair 
      if(item.length > 1) {

        //handle and convert utf surrogate pairs
        let concat = '';

        //for each part of the pair
        for(let i = 0; i < 2; i++){

          //get the character code value 
          let dec = item.charCodeAt(i),
            //convert to binary 
            bin = dec.toString(2),
            //take the last 10 bits
            last10 = bin.slice(-10);

          //concatenate into 20 bit binary
          concat = concat + last10;
        }

        //add 0x10000 to get unicode value
        let unicode = parseInt(concat,2) + 0x10000;

        //html entity from unicode value
        return decimal2Html(unicode); 
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character (spread keeps each surrogate pair together)
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

Update

Thanks to a comment by LUKE知る, I have an even simpler way to do this:

export function utf2Html(str) {
  return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}
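For a quick check of the shorter version, here is a usage sketch. The function body is repeated without `export` only so the snippet runs as a plain script, and the sample string is my own.

```javascript
// same logic as the function above, minus the export
function utf2Html(str) {
  return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}

// ASCII passes through, everything else becomes a decimal entity
const html = utf2Html('t ✓ 🤯');
console.log(html); // t &#10003; &#129327;
```

Because codePointAt reads a surrogate pair as one value, there is no need to detect or calculate the pair by hand.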

Mind blown meme: Problems Saving Unicode, Convert Symbols to HTML, Many Emoji are Surrogate Pairs, Convert Symbols & Emoji to HTML