From 'A' to '😊': How Programming Languages Handle Strings

Published at
12/7/2024
Categories
programming
encoding
unicode
ascii
Author
sandheep_kumarpatro_1c48

Strings: The Unsung Heroes of Programming Magic

When it comes to programming, some data types are downright predictable. Numbers? They’re the mathletes of the digital world, neatly stored as binary representations of integers or floating-point values. Booleans? They’re like light switches: true (on) or false (off). But strings? Strings are a wild mix of alphabets, emojis, and hieroglyphs, speaking every language under the sun. How do programming languages tame these unruly data types?

The secret lies in two powerful concepts: Character Mapping and Encoding Algorithms. Let’s dive into the world of strings and see how these mechanisms turn chaos into order.


The Simple Life: Numbers and Booleans

Before we get into the drama of strings, let’s appreciate the simplicity of numbers and booleans:

Numbers

Numbers in programming are stored as binary values:

  • Integers: Stored as fixed-size binary values (e.g., 8-bit, 32-bit). For example, the decimal number 42 is 101010 in binary, which becomes 00101010 when padded to 8 bits.
  • Floating-Point Numbers: Represented using the IEEE 754 standard, which divides the number into three parts: sign, exponent, and mantissa. For example, 3.14 cannot be written exactly in binary, so the nearest representable value is stored in this layout.
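
To make these layouts visible, here is a minimal sketch that prints the bits of an integer and splits a double-precision value into its three IEEE 754 fields. The helper names intToBits and doubleFields are made up for this illustration, and the sketch assumes the 64-bit double format (1 sign bit, 11 exponent bits, 52 mantissa bits).

```javascript
// Binary digits of an integer, padded to a fixed width.
function intToBits(value, width) {
  return value.toString(2).padStart(width, "0");
}

// Split a number's IEEE 754 double-precision (64-bit) pattern into its fields.
function doubleFields(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value); // big-endian by default, so byte 0 holds the sign bit
  let bits = "";
  for (let i = 0; i < 8; i++) {
    bits += view.getUint8(i).toString(2).padStart(8, "0");
  }
  return {
    sign: bits.slice(0, 1),      // 1 bit
    exponent: bits.slice(1, 12), // 11 bits
    mantissa: bits.slice(12),    // 52 bits
  };
}

console.log(intToBits(42, 8)); // "00101010"
console.log(doubleFields(3.14));
```

The sign and exponent fields are short and easy to read; the 52-bit mantissa of 3.14 is a long pattern because 3.14 has no exact binary representation.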

Booleans

Booleans are even simpler. Conceptually, they need only a single bit (although in practice most languages store them in at least one full byte):

  • 0 for false
  • 1 for true

Both numbers and booleans have fixed and predictable representations, making them easy to process and store.


Strings: The Divas of Data Types

Now, let’s talk about strings. Unlike numbers and booleans, strings are complex because they’re a collection of characters, and each character can belong to a different language or symbol set. This is where Character Mapping and Encoding come into play.


What’s in a String? The Role of Character Mapping

At its core, character mapping assigns each character a unique identifier, called a code point. Think of it as giving every character in existence—from 'A' to '😊'—a VIP pass in the digital world.

ASCII: The OG Character Set

  • Introduced in the 1960s, ASCII is a 7-bit character set with mappings for 128 characters.
    • 'A' → 0x41 (65 in decimal)
    • '3' → 0x33 (51 in decimal)
    • '%' → 0x25 (37 in decimal)
  • ASCII worked great for English but fell short when it came to other languages or symbols.

Unicode: The Global Ambassador

Unicode stepped in to represent characters from virtually every writing system, along with symbols and emojis. Each character is assigned a numeric code point, conventionally written in hexadecimal with a U+ prefix, such as:

  • 'A' → U+0041
  • '😊' → U+1F60A
  • '日' → U+65E5

The U+ notation writes the code point in hexadecimal, which keeps even large values short: the Unicode code space runs from U+0000 to U+10FFFF, so even the most obscure symbols have a home.

ASCII is a subset of Unicode: the first 128 Unicode code points (U+0000 to U+007F) are identical to the ASCII codes.
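
As a quick illustration of character mapping, the sketch below lists the code point of each character in a string using the standard codePointAt and String.fromCodePoint methods. The helper name toCodePointLabels is an assumption made for this example.

```javascript
// Return the "U+XXXX" label for every character in the text.
// Array.from iterates by code point, so an emoji counts as one character.
function toCodePointLabels(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(toCodePointLabels("A😊日")); // [ 'U+0041', 'U+1F60A', 'U+65E5' ]

// The mapping works in both directions.
console.log("A".codePointAt(0));           // 65, the same value as in ASCII
console.log(String.fromCodePoint(0x1f60a)); // "😊"
```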


Encoding: From Code Points to Bytes

Mapping characters to code points is just the beginning. To store or transmit these characters, we need to convert them into a sequence of bytes. This is where Encoding Algorithms come into play.

The Encoding Process

Encoding takes a code point and converts it into a byte (or series of bytes) that the computer can store or transmit. Encoding algorithms might sound intimidating, but simple analogies help. Think of encoding as packing items into boxes before shipping them: different algorithms pack the items (characters) into boxes (bytes) in different ways. For example:

  • 'A' (U+0041) becomes 0x41 in UTF-8.
  • '😊' (U+1F60A) becomes 0xF0 0x9F 0x98 0x8A in UTF-8.

Popular Encoding Algorithms

1. UTF-8: The Flexible Packer

What it does:

UTF-8 is like a smart packer that adjusts the size of the box depending on what’s inside. For small items (like letters and numbers), it uses tiny boxes (1 byte). For bigger items (like emojis), it uses larger boxes (up to 4 bytes).

How it works:

  • ASCII characters (A-Z, a-z, 0-9, and symbols like ! and ?) fit into a 1-byte box.
    • Example: 'A' (code point U+0041) → 0x41 (1 byte).
  • Characters from other languages or special symbols use 2, 3, or 4 bytes.
    • Example: '😊' (code point U+1F60A) → 0xF0 0x9F 0x98 0x8A (4 bytes).

Real-life analogy:

Imagine sending a postcard with a short note (like "Hi!"). UTF-8 uses a small envelope. Now imagine sending a fancy greeting card with pop-ups and glitter—UTF-8 grabs a bigger box to fit everything neatly.

Why it’s great:

UTF-8 saves space. It’s widely used on the web because it handles simple texts (like English) efficiently while still accommodating complex scripts or emojis.
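
To see the variable box sizes in practice, this small sketch uses the standard TextEncoder API (which always produces UTF-8) to print the bytes of a few characters. The helper name utf8Hex is invented for this example.

```javascript
const utf8Encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Return the UTF-8 bytes of a string as hexadecimal labels.
function utf8Hex(text) {
  return Array.from(utf8Encoder.encode(text), (byte) =>
    "0x" + byte.toString(16).toUpperCase().padStart(2, "0")
  );
}

console.log(utf8Hex("A"));  // [ '0x41' ]                       1 byte
console.log(utf8Hex("é"));  // [ '0xC3', '0xA9' ]               2 bytes
console.log(utf8Hex("日")); // [ '0xE6', '0x97', '0xA5' ]       3 bytes
console.log(utf8Hex("😊")); // [ '0xF0', '0x9F', '0x98', '0x8A' ] 4 bytes
```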

2. UTF-16: The Medium-Sized Box

What it does:

UTF-16 starts with medium-sized boxes (2 bytes) for most items. But when something really large shows up (like an emoji or rare symbol), it uses two boxes together, called a surrogate pair (4 bytes in total).

How it works:

  • Most commonly used characters fit in 2 bytes.
    • Example: 'A' (code point U+0041) → 0x0041 (2 bytes).
  • Characters outside the Basic Multilingual Plane (BMP)—like emojis—need a surrogate pair (4 bytes).
    • Example: '😊' (code point U+1F60A) → 0xD83D 0xDE0A (4 bytes).

Real-life analogy:

Think of UTF-16 as a warehouse that uses medium-sized boxes for most products. When something oversized arrives (like a piano), the warehouse combines two boxes to pack it.

Why it’s used:

UTF-16 strikes a balance between simplicity and flexibility. It’s commonly used in programming languages like Java and JavaScript for internal string storage.
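
The surrogate pair can be computed by hand. The sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name toSurrogatePair is an assumption for this example, and it expects a code point above U+FFFF.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;       // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);     // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f60a);
console.log(high.toString(16), low.toString(16)); // "d83d" "de0a"

// JavaScript strings expose the same two code units.
console.log("😊".length);                      // 2
console.log("😊".charCodeAt(0).toString(16)); // "d83d"
console.log("😊".charCodeAt(1).toString(16)); // "de0a"
```

This is also why "😊".length reports 2 in JavaScript: length counts UTF-16 code units, not characters.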

3. UTF-32: The Fixed-Width Solution

What it does:

UTF-32 is like using giant boxes (4 bytes) for every single item, no matter how small. While this makes it predictable and easy to unpack, it also wastes space.

How it works:

  • Every character, no matter how simple or complex, uses 4 bytes.
    • Example: 'A' (code point U+0041) → 0x00000041 (4 bytes).
    • Example: '😊' (code point U+1F60A) → 0x0001F60A (4 bytes).

Real-life analogy:

Imagine packing a single paperclip into a huge box. Sure, it’s easy to find the paperclip later, but you’re wasting a lot of storage space!

Why it’s rarely used:

UTF-32 is predictable but inefficient. It’s mostly used in specialized systems where simplicity is more important than saving space.


Let’s Visualize with an Example

Consider the string: "A😊"

Here’s how each encoding handles it:

| Encoding | 'A' (U+0041) | '😊' (U+1F60A) | Total Bytes |
| -------- | ------------ | -------------- | ----------- |
| UTF-8 | 0x41 (1B) | 0xF0 0x9F 0x98 0x8A (4B) | 5 bytes |
| UTF-16 | 0x0041 (2B) | 0xD83D 0xDE0A (4B) | 6 bytes |
| UTF-32 | 0x00000041 (4B) | 0x0001F60A (4B) | 8 bytes |
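
The byte totals for the two characters 'A' and '😊' can be checked with a few lines of JavaScript. This is a rough sketch: the helper name byteCounts is made up, and it assumes no byte order mark is added and that the string contains no unpaired surrogates.

```javascript
// Count how many bytes a string needs in each encoding (no byte order mark).
function byteCounts(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // actual UTF-8 bytes
    utf16: text.length * 2,                      // .length counts 16-bit code units
    utf32: Array.from(text).length * 4,          // Array.from splits by code point
  };
}

console.log(byteCounts("A😊")); // { utf8: 5, utf16: 6, utf32: 8 }
```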

Encoding in Everyday Life

Imagine you’re texting your friend:

  • Your keyboard input: “😊”
  • Behind the scenes: The emoji’s code point (U+1F60A) is encoded into bytes (e.g., 0xF0 0x9F 0x98 0x8A in UTF-8).
  • On your friend’s phone: These bytes are decoded back into the emoji and displayed on the screen.

Without encoding algorithms, your texts would be gibberish. Encoding ensures that your “😊” stays a smile, no matter the language or device.
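
The round trip can be sketched with the standard TextEncoder and TextDecoder APIs. The variable names are illustrative, and the "misread" line only simulates a receiver that ignores the encoding by treating every byte as its own character.

```javascript
// Sender: turn the emoji into UTF-8 bytes.
const sentBytes = new TextEncoder().encode("😊");
console.log(sentBytes); // Uint8Array holding the 4 bytes F0 9F 98 8A

// Receiver: decode the same bytes as UTF-8 (the TextDecoder default).
const received = new TextDecoder().decode(sentBytes);
console.log(received === "😊"); // true

// A receiver that treats each byte as a separate character gets gibberish.
const misread = String.fromCharCode(...sentBytes);
console.log(misread.length); // 4 characters instead of one emoji
```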


A Code Example: Strings in Action

Here’s how JavaScript handles strings:

let str = "Hello 😊";

// Unicode code points
console.log(str.codePointAt(0).toString(16)); // '48' (U+0048 for 'H')
console.log(str.codePointAt(6).toString(16)); // '1f60a' (U+1F60A for '😊')

// UTF-8 Encoding
const encoder = new TextEncoder();
const utf8Bytes = encoder.encode(str);
console.log(utf8Bytes); // Uint8Array of UTF-8 bytes


Notice that the code points print in hexadecimal only because the code calls toString(16). A code point is just a number; hexadecimal is simply the conventional way to write it.


Why Are Encoded Characters Written in Hexadecimal? Digging Into the Technical Details

When dealing with encoding, data is ultimately represented as binary (1s and 0s) at the hardware level. While binary is the foundation of computing, it’s not practical for humans to read or work with due to its length and complexity. Hexadecimal (base 16) serves as a bridge between human readability and machine-level representation. Let’s unpack this in technical detail.

1. Compactness and Alignment with Binary

Hexadecimal is a base-16 numbering system, where each digit represents four bits (a nibble) of binary data. This alignment with binary makes it much more concise and easier to work with than binary itself.

For example:

  • Binary representation of the character 'A' (code point U+0041 in Unicode): 01000001 (8 bits, 1 byte)
  • Hexadecimal equivalent: 0x41 (just 2 digits)

The hexadecimal digit 4 corresponds to the binary bits 0100, and the hexadecimal digit 1 corresponds to 0001. This direct mapping makes converting between binary and hexadecimal straightforward and lossless.

Contrast this with decimal (base 10):

  • Decimal for 'A': 65

While understandable, decimal doesn’t have a direct alignment with binary, making conversions less intuitive and more error-prone.
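
The nibble-to-digit mapping can be shown directly. In this sketch the helper name byteToNibbles is an assumption; it splits one byte into two 4-bit halves and converts each half to a single hexadecimal digit.

```javascript
// Split a byte (0-255) into two nibbles and map each one to a hex digit.
function byteToNibbles(byte) {
  const bits = byte.toString(2).padStart(8, "0");
  const high = bits.slice(0, 4); // first 4 bits
  const low = bits.slice(4);     // last 4 bits
  const hex =
    parseInt(high, 2).toString(16).toUpperCase() +
    parseInt(low, 2).toString(16).toUpperCase();
  return { bits, high, low, hex };
}

console.log(byteToNibbles(0x41));
// { bits: '01000001', high: '0100', low: '0001', hex: '41' }
```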

2. Representation of Bytes and Code Points

In encoding systems, each character’s code point is ultimately represented as one or more bytes. Hexadecimal simplifies this byte-level representation.

For instance, consider the emoji 😊 (Unicode U+1F60A):

  • Binary representation: 11110000 10011111 10011000 10001010 (32 bits in UTF-8)
  • Hexadecimal representation: 0xF0 0x9F 0x98 0x8A

By grouping every 4 binary bits into a single hexadecimal digit, the byte representation becomes significantly easier to read and manage.

3. Efficient for Low-Level Programming

Programming languages and systems often use hexadecimal when dealing with low-level operations, such as memory addresses, bit manipulation, and binary protocols. Since computers inherently work with bits and bytes, hexadecimal allows developers to:

  • Represent binary data compactly without ambiguity.
  • Align better with memory structures, where data is organized in bytes (8 bits) or words (multiples of bytes).

4. Standard in Encodings

Hexadecimal is the de facto standard in character encoding systems. For example:

  • Unicode defines code points in hexadecimal (e.g., U+0041 for 'A').
  • Encodings like UTF-8 and UTF-16 often represent byte sequences in hexadecimal for documentation and debugging purposes.

5. Example: Breaking Down '😊'

Let’s analyze how the emoji '😊' (Unicode U+1F60A) is encoded in UTF-8:

  1. Determine the Code Point

    Unicode assigns '😊' the code point U+1F60A. This means it has a binary value of:

    0001 1111 0110 0000 1010

  2. UTF-8 Encoding Rules

    UTF-8 uses variable-length encoding:

    • Code points up to U+007F (7 bits) use 1 byte.
    • Code points from U+0080 to U+07FF (11 bits) use 2 bytes.
    • Code points from U+0800 to U+FFFF (16 bits) use 3 bytes.
    • Code points from U+10000 to U+10FFFF (21 bits) use 4 bytes.

    Since U+1F60A is greater than U+FFFF, it uses 4 bytes.

  3. Convert to UTF-8 Bytes

    Following the UTF-8 algorithm for 4-byte encoding:

    • Start with the binary pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    • Pad the code point to 21 bits (0 0001 1111 0110 0000 1010) and split it into groups of 3, 6, 6, and 6 bits: 000 011111 011000 001010
    • Fill the x positions with those groups: 11110000 10011111 10011000 10001010

  4. Represent in Hexadecimal

    • Binary: 11110000 10011111 10011000 10001010
    • Hexadecimal: 0xF0 0x9F 0x98 0x8A

This hexadecimal representation is concise and aligns neatly with byte boundaries.
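
The steps above can be sketched as a small function that applies the UTF-8 bit patterns with shifts and masks. The name encodeCodePointUtf8 is hypothetical, and this is a teaching sketch rather than a full implementation: it does not reject surrogate code points or values above U+10FFFF. In real code, TextEncoder does this work.

```javascript
// Encode a single code point into UTF-8 bytes by hand.
function encodeCodePointUtf8(cp) {
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const manual = encodeCodePointUtf8(0x1f60a);
console.log(manual.map((b) => b.toString(16))); // [ 'f0', '9f', '98', '8a' ]

// Cross-check against the built-in encoder.
const builtIn = Array.from(new TextEncoder().encode("😊"));
console.log(manual.join() === builtIn.join()); // true
```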

6. Readability for Humans

Humans find it easier to interpret and debug data represented in hexadecimal than in binary or decimal:

  • Binary: Long and error-prone (11110000100111111001100010001010)
  • Decimal: Non-intuitive for bytes and code points (e.g., 4036991114 for the four UTF-8 bytes of '😊' read as a single number)
  • Hexadecimal: Compact and aligned with byte structure (0xF0 0x9F 0x98 0x8A)

Why Not Binary, Octal, or Decimal for Code Points?

When choosing a numbering system for code points, we aim to balance efficiency, usability, and compactness. Each traditional system—binary, octal, and decimal—has its merits but also significant drawbacks compared to hexadecimal. Let’s dive into the low-level technical details of why hexadecimal becomes the natural choice.


1. Binary (Base 2)

  • How It Works: Binary uses only two digits: 0 and 1. Each digit represents one bit. For example:

    • 'A' (code point 65): 01000001 in binary.
    • '😊' (code point U+1F60A): 11111011000001010 (its four UTF-8 bytes are 11110000100111111001100010001010).
  • Advantages:

    • Direct alignment with the way computers store and process data.
    • Each bit corresponds directly to a binary switch in hardware.
  • Disadvantages:

    • Extremely verbose. Even a single code point like U+1F60A needs 17 binary digits, and its UTF-8 encoding needs 32.
    • Not human-readable or practical for debugging. For example, distinguishing 1111011 from 1111101 at a glance is difficult.

Binary is the foundation of computing but is too cumbersome for humans to use directly.


2. Octal (Base 8)

  • How It Works: Octal groups 3 bits into a single digit, giving eight possible values per digit (0 to 7). For example:

    • 'A' (binary 01000001): 101 in octal.
    • '😊' (code point U+1F60A, binary 11111011000001010): 373012 in octal.
  • Advantages:

    • More compact than binary (3 bits per digit).
    • Aligns well with older systems whose word sizes were multiples of 3 bits (e.g., the 12-bit PDP-8).
  • Disadvantages:

    • Limited range per digit (0 to 7). It takes more digits to represent larger numbers.
    • Harder to align with byte boundaries (bytes are 8 bits, not divisible evenly by 3).

Octal works for certain legacy systems but doesn’t fit modern architectures.


3. Decimal (Base 10)

  • How It Works: Decimal uses 10 digits (0 to 9). When each decimal digit is stored separately, a scheme known as binary-coded decimal (BCD), every digit needs 4 bits (a "nibble", or half a byte).

  • Advantages:

    • Familiar to humans—it’s our natural numbering system.
  • Disadvantages:

    • Inefficient Binary Representation: Decimal doesn’t fully utilize the available binary combinations:
      • A 4-bit group can represent 16 values (from 0000 to 1111 in binary).
      • Decimal only uses 10 of these (0000 to 1001 for 0 to 9).
      • Numbers 1010 to 1111 (10 to 15) are wasted or require additional bits, which leads to inefficiency.
    • Complex for Encoding: In this digit-by-digit scheme, a decimal number like 10 cannot use the direct binary conversion (1010). It needs two 4-bit groups (0001 0000), one per digit, which consumes more space.

Decimal is intuitive but highly inefficient for computational purposes due to wasted binary space.
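
The digit-by-digit storage described here is usually called binary-coded decimal (BCD). As a rough sketch (the helper name toBcdBits is invented, and it assumes a non-negative integer), the code below compares that layout with a direct binary conversion.

```javascript
// Store each decimal digit in its own 4-bit group (binary-coded decimal).
function toBcdBits(n) {
  return String(n)
    .split("")
    .map((digit) => Number(digit).toString(2).padStart(4, "0"))
    .join(" ");
}

console.log(toBcdBits(10));     // "0001 0000"  (8 bits, one nibble per digit)
console.log((10).toString(2));  // "1010"       (4 bits, direct binary)
console.log(toBcdBits(65));     // "0110 0101"
```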


Why Hexadecimal (Base 16)?

Hexadecimal uses 16 digits (0-9 and A-F), mapping perfectly to 4-bit binary groups. Let’s see why this is ideal for encoding systems:

  • Direct Alignment with Binary:

    • Each hexadecimal digit corresponds exactly to a 4-bit binary group.
    • Example:
      • Binary: 11110000 10011111 10011000 10001010 (32 bits for '😊' in UTF-8).
      • Hexadecimal: 0xF0 0x9F 0x98 0x8A (2 hexadecimal digits for each byte).
  • Compact Representation:

    • A single hexadecimal digit (F) represents as much information as four binary digits (1111).
    • '😊' (UTF-8): 8 hexadecimal characters vs. 32 binary digits.
  • Efficient Storage:

    • No wasted space like in decimal. Hexadecimal uses all 16 possible combinations of 4 bits.
    • Example: The binary range 0000 to 1111 corresponds directly to hexadecimal 0 to F.
  • Ease of Use for Humans:

    • More compact and readable than binary.
    • Easier to convert to and from binary compared to octal or decimal.

Example Comparison

Let’s represent the code point of '😊' (U+1F60A) in different systems:

| System | Representation | Length |
| -------------- | ----------------------- | ------- |
| Binary | 11111011000001010 | 17 digits |
| Octal | 373012 | 6 digits |
| Decimal | 128522 | 6 digits |
| Hexadecimal | 0x1F60A | 5 digits |
Hexadecimal strikes the best balance between compactness and usability.
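
JavaScript can produce each of these representations itself, since Number.prototype.toString accepts a radix. The helper name inBases is an assumption made for this sketch.

```javascript
// Write one number in binary, octal, decimal, and hexadecimal.
function inBases(value) {
  return {
    binary: value.toString(2),
    octal: value.toString(8),
    decimal: value.toString(10),
    hexadecimal: value.toString(16).toUpperCase(),
  };
}

const forms = inBases(0x1f60a);
console.log(forms);
// {
//   binary: '11111011000001010',
//   octal: '373012',
//   decimal: '128522',
//   hexadecimal: '1F60A'
// }

// Digit counts for each representation.
for (const [name, digits] of Object.entries(forms)) {
  console.log(name, digits.length);
}
```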


Why Hexadecimal for Encoding?

Character encoding relies on hexadecimal for the following reasons:

  1. Binary Compatibility: Code points and byte-level encoding map naturally to hexadecimal.
  2. Efficiency: No wasted space in binary representation.
  3. Ease of Debugging: Compact and human-readable for low-level programming.
  4. Industry Standard: Encodings like UTF-8, UTF-16, and ASCII use hexadecimal to describe character bytes and code points.

In summary, while binary, octal, and decimal systems have their niches, hexadecimal stands out as the most efficient and human-friendly representation for character encoding and computing.

Hexadecimal is the unsung hero of encoding:

  • Compact, with each digit representing 4 binary bits.
  • Aligns perfectly with byte boundaries.
  • The default for representing Unicode code points and encoded bytes.
  • A practical choice for humans working with machine-level data.

This balance of human readability and technical efficiency is why encoding processes, whether UTF-8 or UTF-16, prefer hexadecimal over decimal.


Why Strings Matter

Strings aren’t just data—they’re the foundation of communication in programming. From displaying text on a webpage to processing user input, strings make it all possible. Understanding how they work gives us insight into the magic behind modern programming.

So, the next time you write a string full of emojis, accented letters, or symbols, remember: strings may be complex, but they’re what make programming human. After all, isn’t it amazing that a simple “😊” can be broken down into U+1F60A and encoded as 0xF0 0x9F 0x98 0x8A? Now that’s magic.
