


Converting UTF (including emoji) to HTML


Sometimes my coworker likes to mention things just to get my mind stuck on them. Take the text from this request:

Because of some limitations both in UTF-8 and mysql (less a concern for us now but still..) it would probably be good to have some kind of simple-emoji type of tag. Similar to how we have simple-icon, a simple-icon could be used to provide minor tweaks / accounting for emojis in a consistent way.

So last night I worked on translating UTF (including emoji) into their HTML entities.

Basic Unicode to HTML Entity Conversion

I started with an adapted version of this conversion logic to convert any character that is not one of the 128 ASCII characters (character codes 0 to 127):

export function utf2Html(str) {
  let result = '',

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity
    char2Html = (char) => {
      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character (for...of walks the string by code point)
  for (const char of str) {
    result += char2Html(char);
  }

  return result;
}

If we want to check this function (quite literally, by dropping a UTF-8 checkmark ✓ into the function), its character code 10003 is the same as its unicode value, so it can be used to generate the correct HTML entity &#10003; (✓).

The Problem with Emoji Conversion

While the function above works on UTF-8 special characters like ✓, it won't work for all of the emoji we have available today. I found a really good explanation for this in a post called Unicode in Javascript.

Take the 🤯 emoji, for example.

The character code for this emoji is 55358, so the entity returned by the function above would be &#55358;, which does not work (it renders as �).

The unicode value for 🤯 is actually 129327 (or 0001 1111 1001 0010 1111 in binary). In order to express this character in its 16-bit form, it is split into a surrogate pair of 16-bit units, in string form as \uD83E\uDD2F (according to this handy Surrogate Pair Calculator). The entity we actually want is &#129327; (🤯).
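To see the difference for yourself, here is a minimal sketch that uses only built-in string methods. The variable names are my own, picked just for this illustration:

```javascript
// 🤯 is U+1F92F, which does not fit in a single 16-bit unit
const emoji = '🤯';

// charCodeAt reads one 16-bit code unit at a time
const high = emoji.charCodeAt(0); // 55358 (0xD83E), the high surrogate
const low = emoji.charCodeAt(1);  // 56623 (0xDD2F), the low surrogate

// codePointAt reads the whole surrogate pair as one value
const codePoint = emoji.codePointAt(0); // 129327 (0x1F92F)

// the escaped pair is the same string as the emoji itself
const samePair = emoji === '\uD83E\uDD2F'; // true

console.log(high, low, codePoint, samePair);
```

So charCodeAt only ever sees half of the emoji, which is why the first version of the function produces an entity that does not render.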

So in order to get the correct value, we need to know:

  • if a character is one of these surrogate pair emojis, and
  • how to calculate a surrogate pair's value.

Determining if an Emoji is a Surrogate Pair

The JavaScript string length counts 16-bit code units, not characters.
It is 1 for letters and most symbols, but 2 for an emoji that is stored as a surrogate pair.

JavaScript         Result
't'.length         1
'✓'.length         1
'🤯'.length        2

If I use the spread operator (...) instead, the string is split by code point, so each of these counts as a single character.

JavaScript              Result
[...'t'].length         1
[...'✓'].length         1
[...'🤯'].length        1

That means that when I walk the string by code point (as both for...of and the spread operator do), I can tell which characters are surrogate pairs if char.length > 1:

export function utf2Html(str) {
  let result = '',

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity
    char2Html = (char) => {
      let item = `${char}`;

      //a surrogate pair takes up two 16-bit code units
      if (item.length > 1) {
        //TODO calculate a surrogate pair's value
      }

      //ASCII character or html entity from character code
      return item.charCodeAt() > 127 ? decimal2Html(item.charCodeAt()) : item;
    };

  //check each character (for...of walks the string by code point)
  for (const char of str) {
    result += char2Html(char);
  }

  return result;
}

Notice I left a //TODO comment about calculating the pair. We'll tackle that next...

Calculating a Surrogate Pair's Unicode Value

I couldn't find a good post for converting a surrogate pair to its unicode value, so instead I followed these steps for converting from unicode to surrogate pairs in reverse:

#  Step                                                                  🤯 Example
1  Get the value of each part of the pair.                               55358 / 56623
2  Convert each value to a binary number.                                1101100000111110 / 1101110100101111
3  Take the last 10 digits of each number.                               0000111110 / 0100101111
4  Concatenate the two binary numbers into a single 20-bit binary number. 00001111100100101111
5  Convert the 20-bit number to a decimal number.                        63791
6  Add 0x10000 to the new number.                                        129327
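The same steps can also be sketched with bit arithmetic instead of binary strings. This is only an illustration and it assumes the two values really are a valid surrogate pair. surrogatePairToCodePoint is a name I made up for this example, not something from a library:

```javascript
// hypothetical helper: combine a high and low surrogate into one unicode value
function surrogatePairToCodePoint(high, low) {
  // keep the last 10 bits of each half (steps 2 and 3)
  const highBits = high & 0x3ff;
  const lowBits = low & 0x3ff;

  // join them into a single 20-bit number (steps 4 and 5)
  const twentyBits = (highBits << 10) | lowBits;

  // add 0x10000 to get the unicode value (step 6)
  return twentyBits + 0x10000;
}

const pair = '🤯';
const value = surrogatePairToCodePoint(pair.charCodeAt(0), pair.charCodeAt(1));

console.log(value); // 129327
console.log(value === pair.codePointAt(0)); // true
```

Masking with 0x3ff does the same job as slicing off the last 10 binary digits, and shifting left by 10 makes room for the second half.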

The Completed UTF (Including Emoji) to HTML Function

export function utf2Html(str) {
  let result = '',
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity
    char2Html = (char) => {
      let item = `${char}`;

      //a surrogate pair takes up two 16-bit code units
      if (item.length > 1) {

        //handle and convert utf surrogate pairs
        let concat = '';

        //for each part of the pair
        for (let i = 0; i < 2; i++) {

          //get the character code value
          let dec = item.charCodeAt(i),
            //convert to binary
            bin = dec.toString(2),
            //take the last 10 bits
            last10 = bin.slice(-10);

          //concatenate into 20 bit binary
          concat = concat + last10;
        }

        //add 0x10000 to get unicode value
        let unicode = parseInt(concat, 2) + 0x10000;

        //html entity from unicode value
        return decimal2Html(unicode);
      }

      //ASCII character or html entity from character code
      return item.charCodeAt() > 127 ? decimal2Html(item.charCodeAt()) : item;
    };

  //check each character (for...of walks the string by code point)
  for (const char of str) {
    result += char2Html(char);
  }

  return result;
}


Thanks to a comment by LUKE知る, I have an even simpler way to do this:

export function utf2Html(str) {
  return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}
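To try the short version out, here is a small sketch. I dropped the export so it can be pasted straight into a console, and the sample strings are just ones I picked for the example:

```javascript
// same logic as the short version above, without the export
function utf2Html(str) {
  return [...str]
    .map((char) => (char.codePointAt(0) > 127 ? `&#${char.codePointAt(0)};` : char))
    .join('');
}

console.log(utf2Html('plain text')); // plain text
console.log(utf2Html('✓ done'));     // &#10003; done
console.log(utf2Html('🤯'));         // &#129327;
```

This works because the spread operator hands each surrogate pair to map as one item, and codePointAt reads the pair as a single unicode value. An emoji that is built from several code points (a skin tone variation, for example) should simply come out as several entities in a row.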

Mind blown meme: Problems Saving Unicode, Convert Symbols to HTML, Many Emoji are Surrogate Pairs, Convert Symbols & Emoji to HTML
