dev-resources.site

Implementing UTF-8 Encoding in Zig

Published at
9/24/2024
Categories
zig
unicode
c
text
Author
jmatth11

tl;dr
Created a library to read/write UTF-8 encoded Unicode values in Zig for my simple text editor. Link to GitHub repo

Context

One of my side projects is a simple text editor in C called "Editlite". The purpose of this editor was to explore building one from the ground up, including functionality like plugins and incremental file loading to support editing large files. For this article I'll focus on just the Unicode support I wanted to add, though I may do a deeper dive into the editor in a future article.

Contents

  • Intro
  • The Beginning
  • The Format
  • The Library
  • The Build
  • C Header
  • The Integration

Intro

Learning Zig has been on my todo list for some time now, and I have been trying to find more excuses to work it into projects to become more familiar with the language. I finally found a little project where I could get my feet wet while also producing something that wouldn't be just a throwaway program. I have a simple text editor that currently only supports the ASCII character set, so I thought adding a basic Unicode support library would be the perfect byte-sized intro to Zig I wanted.

The Beginning

Unicode is a standard that lays out how numerous symbols are mapped to code points so that applications can read and correctly display the appropriate text on any system. For example, the number 69 (hex value: 45) maps to the character E, while the number 42069 (hex value: A455) maps to the character ꑕ.

There are different encodings that support this standard, but the one I chose to implement is UTF-8, which represents each code point as a sequence of 1 to 4 bytes (8-bit unsigned integers).

The Format

The best way to visualize the UTF-8 encoding format is with this table from the RFC (RFC 3629). Note that there are two things we will not cover in this article: the byte-order mark, which UTF-8 does not require (a UTF-8 byte sequence has a single fixed byte order), and the UTF-16 surrogates (the reserved code point range U+D800-U+DFFF), which should be considered invalid in UTF-8.

UTF-8 format table
(Fig. 1)
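For readers without the figure, here is a small sketch that restates the RFC 3629 ranges as comments and turns them into a length lookup. The function name sequence_length is hypothetical and is not part of the article's library; returning 0 for unencodable values is likewise an assumption made for illustration.

```zig
const std = @import("std");

// Code point ranges and their UTF-8 layouts, as listed in RFC 3629:
//
//   U+0000  .. U+007F     0xxxxxxx
//   U+0080  .. U+07FF     110xxxxx 10xxxxxx
//   U+0800  .. U+FFFF     1110xxxx 10xxxxxx 10xxxxxx
//   U+10000 .. U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

/// Hypothetical helper: number of bytes needed to encode `cp`,
/// or 0 if `cp` is a UTF-16 surrogate or lies beyond U+10FFFF.
fn sequence_length(cp: u32) u8 {
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp >= 0xD800 and cp <= 0xDFFF) return 0; // reserved for UTF-16
    if (cp <= 0xFFFF) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

test "lengths for the two examples from the text" {
    try std.testing.expect(sequence_length(0x45) == 1); // E
    try std.testing.expect(sequence_length(0xA455) == 3); // U+A455
}
```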

We'll break down each line and start codifying it into Zig for the baseline rules of our library.

The first row, 0xxxxxxx, is what makes UTF-8 backwards compatible with the ASCII character set. The first bit must be 0, which leaves 7 free bits to use. So a byte is a valid 1 byte UTF-8 octet sequence only if its first bit is zero.

zig function checking the one octet marker
(Fig. 2)
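Since the figure is an image, here is a minimal sketch of what such a check could look like. The name is_one_octet is an assumption and may differ from the function shown in Fig. 2.

```zig
const std = @import("std");

/// A byte is a complete one-octet sequence when its high bit is clear.
/// Hypothetical name; the article's own function is the one in Fig. 2.
fn is_one_octet(byte: u8) bool {
    return (byte & 0b1000_0000) == 0;
}

test "ASCII bytes pass, high-bit bytes do not" {
    try std.testing.expect(is_one_octet('E'));
    try std.testing.expect(!is_one_octet(0xEA));
}
```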

The next row starts defining the main pattern we will follow for the other sequence types. The leading 110 bits signify that this is the start of a 2 byte UTF-8 octet sequence, and the following octet leads with 10 to signify that it is a continuation byte in the current sequence. In every marker, a 0 bit terminates the run of leading 1 bits, and that is what lets us tell the different octet types apart.

This is how we could codify those two rules in Zig.

zig functions to check the next and two octet marker
(Fig. 3)
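As a rough sketch of these two rules (the function names are assumptions and may not match Fig. 3), each check masks off the marker bits and compares them with the expected pattern:

```zig
const std = @import("std");

/// Continuation byte: the top two bits are 10.
fn is_next_octet(byte: u8) bool {
    return (byte & 0b1100_0000) == 0b1000_0000;
}

/// Lead byte of a two-octet sequence: the top three bits are 110.
fn is_two_octet(byte: u8) bool {
    return (byte & 0b1110_0000) == 0b1100_0000;
}

test "U+00E9 is encoded as 0xC3 0xA9" {
    try std.testing.expect(is_two_octet(0xC3));
    try std.testing.expect(is_next_octet(0xA9));
}
```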

The following rows follow the same rules but for the 3 byte and 4 byte octet sequences with 1110 and 11110 respectively.

zig functions to check the three and four octet marker
(Fig. 4)
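The same masking idea extends to the longer markers. This is a sketch with assumed names, not a copy of Fig. 4:

```zig
const std = @import("std");

/// Lead byte of a three-octet sequence: the top four bits are 1110.
fn is_three_octet(byte: u8) bool {
    return (byte & 0b1111_0000) == 0b1110_0000;
}

/// Lead byte of a four-octet sequence: the top five bits are 11110.
fn is_four_octet(byte: u8) bool {
    return (byte & 0b1111_1000) == 0b1111_0000;
}

test "lead bytes of three and four octet sequences" {
    try std.testing.expect(is_three_octet(0xEA)); // first byte of U+A455
    try std.testing.expect(is_four_octet(0xF0));
}
```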

Now that we've codified these rules we can move forward with writing our library.

The Library

The first thing we'll do is define some convenience types to work with.

zig enum and structure to represent octet info
(Fig. 5)

I wanted an easy way to keep track of which octet sequence type a certain code point has without having to re-encode it. Also, this library is meant to be used in an existing C project, so we need to back our enum with c_int and add the extern keyword to our structure so that both have a C-compatible layout.
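A sketch of what C-compatible definitions along these lines might look like. Only OCT_ONE, OCT_NEXT and OCT_INVALID are named in the text; the type names, the remaining enum members, their order, and the field names are assumptions.

```zig
const std = @import("std");

/// Backed by c_int so it matches a plain C enum.
/// Member order and the names OCT_TWO/OCT_THREE/OCT_FOUR are assumed.
pub const OctetType = enum(c_int) {
    OCT_ONE,
    OCT_TWO,
    OCT_THREE,
    OCT_FOUR,
    OCT_NEXT,
    OCT_INVALID,
};

/// `extern struct` gives the fields a C-compatible, in-order layout.
/// The field names `val` and `kind` are hypothetical.
pub const CodePoint = extern struct {
    val: u32,
    kind: OctetType,
};

test "a code point carries its octet type along" {
    const e = CodePoint{ .val = 0x45, .kind = .OCT_ONE };
    try std.testing.expect(e.val == 0x45);
    try std.testing.expect(e.kind == .OCT_ONE);
}
```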

zig function to verify utf8 octets
(Fig. 6)

Next we define a convenience function that verifies the continuation bytes of a multi-byte octet sequence.
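A minimal sketch of such a check, assuming it receives only the bytes that follow the lead byte as a slice. The real signature in Fig. 6 may differ.

```zig
const std = @import("std");

/// Returns true when every byte in `rest` carries the 10xxxxxx
/// continuation marker. The parameter shape is an assumption.
fn verify_octets(rest: []const u8) bool {
    for (rest) |b| {
        if ((b & 0b1100_0000) != 0b1000_0000) return false;
    }
    return true;
}

test "trailing bytes of U+A455 are valid continuation bytes" {
    try std.testing.expect(verify_octets(&[_]u8{ 0x91, 0x95 }));
    try std.testing.expect(!verify_octets(&[_]u8{ 0x91, 0x45 }));
}
```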

zig function to get the octet type from u8
(Fig. 7)

The last convenience function we will explain is this one, which determines the octet type of a given 8-bit value. We'll explain the export keyword with the next screenshot.
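A classifier of this kind could be sketched as follows. The enum is repeated here so the block stands alone, and its member names beyond OCT_ONE, OCT_NEXT and OCT_INVALID are assumptions. The article's version is exported; the sketch uses a plain function.

```zig
const std = @import("std");

// Assumed shape of the enum; only some member names appear in the text.
pub const OctetType = enum(c_int) {
    OCT_ONE,
    OCT_TWO,
    OCT_THREE,
    OCT_FOUR,
    OCT_NEXT,
    OCT_INVALID,
};

/// Classifies a single byte by its leading marker bits.
pub fn get_oct_type(byte: u8) OctetType {
    if ((byte & 0b1000_0000) == 0) return .OCT_ONE;
    if ((byte & 0b1100_0000) == 0b1000_0000) return .OCT_NEXT;
    if ((byte & 0b1110_0000) == 0b1100_0000) return .OCT_TWO;
    if ((byte & 0b1111_0000) == 0b1110_0000) return .OCT_THREE;
    if ((byte & 0b1111_1000) == 0b1111_0000) return .OCT_FOUR;
    return .OCT_INVALID; // 11111xxx never appears in UTF-8
}

test "classifying the bytes of E and U+A455" {
    try std.testing.expect(get_oct_type('E') == .OCT_ONE);
    try std.testing.expect(get_oct_type(0xEA) == .OCT_THREE);
    try std.testing.expect(get_oct_type(0x91) == .OCT_NEXT);
}
```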

Now the core of the library! The first function we'll define is the parsing function. I like to start at the unit level so we'll write a function to read the "next" Unicode code point in a given u8 array. Here's the definition of the function:

zig function definition for parsing next code point
(Fig. 8)

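Of course, we add the export keyword so that Zig exposes this function as a symbol that follows the C ABI, and we also need to use C-compatible types, so the arr parameter is a [*]const u8. That is a many-item pointer to an "unknown" number of bytes rather than a slice, so any length has to be passed alongside it.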
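To illustrate the pointer-plus-length pattern in isolation, here is a small exported function that is not part of the article's library. The name count_lead_bytes and its behaviour are purely hypothetical; the point is how a many-item pointer becomes a slice once a length is known.

```zig
const std = @import("std");

/// Hypothetical C-callable helper: counts the bytes in `arr[0..len]`
/// that are not continuation bytes. For well-formed UTF-8 this equals
/// the number of code points.
export fn count_lead_bytes(arr: [*]const u8, len: usize) usize {
    // A many-item pointer has no length; slicing it with a known
    // length yields a regular, bounds-checked Zig slice.
    const bytes: []const u8 = arr[0..len];
    var count: usize = 0;
    for (bytes) |b| {
        if ((b & 0b1100_0000) != 0b1000_0000) count += 1;
    }
    return count;
}

test "E followed by U+A455 is two code points in four bytes" {
    const text: []const u8 = "E\xEA\x91\x95";
    try std.testing.expect(count_lead_bytes(text.ptr, text.len) == 2);
}
```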

Alright, now to the meat of the function:

zig function of implementation of parse functionality
(Fig. 9)

We start off with a reusable invalid_point object to return on errors. Next we do some housekeeping checks. We then set up our initial result by grabbing the starting octet out of the array and determining its octet type with get_oct_type. Next we just switch on the type and try to parse from there.

The first two cases are easy. If the initial type is OCT_INVALID or OCT_NEXT, then this isn't a correctly formatted UTF-8 string, so we return an "invalid" code point. For an OCT_ONE type we just pull the value straight out.

The rest of the cases are a little more involved but still straightforward. We check that the expected number of bytes for the type is available. We also verify that the rest of the bytes are formatted properly with verify_octets. Then we pull each value out and mask it (&) with its corresponding free bits. Lastly, we shift the values based on their position (6 * offset) and bitwise OR them into our code point, most significant bits first.
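The steps above can be condensed into a short sketch. It is not the function from Fig. 9: the name decode_next is hypothetical, it takes a slice rather than a C pointer, it returns null where the article returns its invalid_point value, and it accumulates bits by shifting left by 6 per byte, which is equivalent to shifting each value by 6 * offset. It does not reject overlong encodings or surrogates.

```zig
const std = @import("std");

/// Decodes the code point at the start of `bytes`, or returns null
/// if the sequence is empty, truncated, or badly marked.
fn decode_next(bytes: []const u8) ?u32 {
    if (bytes.len == 0) return null;
    const first = bytes[0];

    // Sequence length from the lead byte's marker bits.
    const len: usize = if ((first & 0b1000_0000) == 0)
        1
    else if ((first & 0b1110_0000) == 0b1100_0000)
        2
    else if ((first & 0b1111_0000) == 0b1110_0000)
        3
    else if ((first & 0b1111_1000) == 0b1111_0000)
        4
    else
        return null; // continuation byte or invalid lead byte

    if (bytes.len < len) return null;

    // Free (payload) bits of the lead byte for each length.
    const lead_mask: u8 = switch (len) {
        1 => 0b0111_1111,
        2 => 0b0001_1111,
        3 => 0b0000_1111,
        else => 0b0000_0111,
    };

    var cp: u32 = first & lead_mask;
    for (bytes[1..len]) |b| {
        if ((b & 0b1100_0000) != 0b1000_0000) return null;
        cp = (cp << 6) | (b & 0b0011_1111); // append 6 payload bits
    }
    return cp;
}

test "decoding the two examples from the text" {
    try std.testing.expect(decode_next("E").? == 0x45);
    try std.testing.expect(decode_next("\xEA\x91\x95").? == 0xA455);
}
```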

That's it for parsing! Now you can use this method to loop over your UTF-8 encoded buffers!

The next crucial function to implement is the "write" function -- to take a u32 code point and write it back out to a UTF-8 encoded buffer.

zig function of write functionality
(Fig. 10)

This function is a thin wrapper for the C API: it takes in a C-style array, turns it into a Zig slice, and passes it on to the real write function.

zig function of write functionality, full
(Fig. 11)

The write function is also straightforward in its implementation: it switches on the code point's type and pulls the appropriate bits out of the u32, most significant first. You'll notice some convenience functions, which are just thin inline functions that ensure each value carries the correct UTF-8 octet sequence marker.
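As a rough counterpart to the decoding step, here is a sketch of an encoder. It is not the code from Fig. 11: the names encode and cont are hypothetical, it branches on the code point's value rather than on a stored type, and it returns the number of bytes written (0 on failure), which is an assumed convention. It uses the single-argument @intCast, so it assumes Zig 0.11 or newer.

```zig
const std = @import("std");

/// Builds a continuation byte from 6 bits of `cp` starting at `shift`.
fn cont(cp: u32, shift: u5) u8 {
    return 0b1000_0000 | @as(u8, @intCast((cp >> shift) & 0x3F));
}

/// Writes `cp` as UTF-8 into `out`; returns bytes written, or 0 if
/// `cp` is not encodable or `out` is too small.
fn encode(cp: u32, out: []u8) usize {
    if (cp <= 0x7F) {
        if (out.len < 1) return 0;
        out[0] = @intCast(cp);
        return 1;
    } else if (cp <= 0x7FF) {
        if (out.len < 2) return 0;
        out[0] = 0b1100_0000 | @as(u8, @intCast(cp >> 6));
        out[1] = cont(cp, 0);
        return 2;
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 and cp <= 0xDFFF) return 0; // surrogates
        if (out.len < 3) return 0;
        out[0] = 0b1110_0000 | @as(u8, @intCast(cp >> 12));
        out[1] = cont(cp, 6);
        out[2] = cont(cp, 0);
        return 3;
    } else if (cp <= 0x10FFFF) {
        if (out.len < 4) return 0;
        out[0] = 0b1111_0000 | @as(u8, @intCast(cp >> 18));
        out[1] = cont(cp, 12);
        out[2] = cont(cp, 6);
        out[3] = cont(cp, 0);
        return 4;
    }
    return 0;
}

test "U+A455 encodes to EA 91 95" {
    var buf: [4]u8 = undefined;
    const n = encode(0xA455, &buf);
    try std.testing.expect(n == 3);
    try std.testing.expect(std.mem.eql(u8, buf[0..n], "\xEA\x91\x95"));
}
```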

Now we can continually write out code points to a given u8 buffer!

The Build

Next, the build.zig file. Most of this is standard but we needed to bundle the Zig compiler runtime into the static build, so this is how the build file looked.

zig build file
(Fig. 12)

C Header

Next we need to generate the C header file to accompany the static library. Luckily the types easily map over.

C header file
(Fig. 13)

You can ignore the __THROWNL and __nonnull(()) calls. The important part to take away is that we need to put the extern keyword in front of our function declarations. Now we can use this in our C programs just as you would any other static library!

The Integration

After integrating this library into my simple text editor I can now read/write and accept Unicode input.

Text editor displaying Unicode characters
(Fig. 14)
