Implementing UTF-8 Encoding in Zig
tl;dr
Created a library to read/write UTF-8 encoded Unicode values in Zig for my simple text editor. Link to GitHub repo
Context
One of my side projects is a simple text editor in C called "Editlite". The purpose of this editor was to explore creating one from the ground up and implementing functionality like Plugins and loading files incrementally to support editing large files. For this article I'll focus on just the Unicode support I wanted to add, though I may do a deeper dive into the editor in a future article.
Contents
- Intro
- The Beginning
- The Format
- The Library
- The Build
- C Header
- The Integration
Intro
Learning Zig has been on my to-do list for some time now, and I have been looking for excuses to work it into projects so I can get more familiar with the language. I finally found a little project where I could get my feet wet while also producing something that wouldn't be just a throwaway program. My simple text editor currently supports only the ASCII character set, so I thought a basic Unicode support library would be the perfect byte-sized intro to Zig.
The Beginning
Unicode is a standard that lays out how numerous symbols are mapped to certain code points so that applications can read and correctly display the appropriate text on any system. For example, the number 69 (hex value: 45) maps to the character E. However, the number 42069 (hex value: A455) maps to the character ꑕ.
There are different encodings that support this standard, but the one I chose to implement is UTF-8, which represents each code point as a sequence of 1 to 4 bytes (8-bit unsigned integers, also called octets).
The Format
The best way to visualize the UTF-8 encoding format is with this table from the RFC (RFC 3629). Note that there are two things we will not cover in this article: the byte-order mark, which UTF-8 does not require (a UTF-8 byte sequence has one fixed order, so there is no endianness to signal), and the UTF-16 surrogates (the reserved code point range U+D800-U+DFFF), which should be considered invalid in UTF-8.
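As a rough sketch of what that table says, the code point ranges can be mapped to sequence lengths like this. The function name sequence_length is hypothetical and is not part of the library; the ranges and bit patterns in the comment are the ones from RFC 3629.

```zig
const std = @import("std");

// Number of bytes needed to encode a code point, following the RFC 3629 table:
//   U+0000..U+007F     -> 0xxxxxxx
//   U+0080..U+07FF     -> 110xxxxx 10xxxxxx
//   U+0800..U+FFFF     -> 1110xxxx 10xxxxxx 10xxxxxx
//   U+10000..U+10FFFF  -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// Returns 0 for values UTF-8 cannot represent (surrogates, or above U+10FFFF).
// Hypothetical helper, for illustration only.
fn sequence_length(code_point: u32) u8 {
    if (code_point >= 0xD800 and code_point <= 0xDFFF) return 0;
    if (code_point <= 0x7F) return 1;
    if (code_point <= 0x7FF) return 2;
    if (code_point <= 0xFFFF) return 3;
    if (code_point <= 0x10FFFF) return 4;
    return 0;
}

test "sequence length follows the RFC ranges" {
    try std.testing.expect(sequence_length(0x45) == 1); // E
    try std.testing.expect(sequence_length(0xA455) == 3); // ꑕ
    try std.testing.expect(sequence_length(0xD800) == 0); // surrogate
}
```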
We'll break down each line and start codifying it into Zig for the baseline rules of our library.
The first example, 0xxxxxxx, allows for this standard to be backwards compatible with the ASCII character set. The first bit needs to be 0 and you have 7 free bits to use. So we need to ensure that first bit is zeroed out to be a valid 1 byte UTF-8 octet sequence.
The next row starts defining the main pattern we will follow for the other sequence types. The leading 110 bits signify that this is a 2 byte UTF-8 octet sequence, and the following octet leads with 10 to signify that it is the next byte in the current octet sequence. So we see that this encoding uses a 0 bit to terminate each run of leading 1 bits, and that is what differentiates all of these octet types.
This is how we could codify those two rules in Zig.
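A minimal sketch of those two rules, assuming hypothetical helper names (is_one_byte, is_two_byte_lead, is_continuation) that may differ from the ones in the library. Each check masks off the marker bits and compares them with the expected pattern.

```zig
const std = @import("std");

// 0xxxxxxx: a single-byte (ASCII-compatible) octet has its top bit clear.
fn is_one_byte(byte: u8) bool {
    return (byte & 0b1000_0000) == 0;
}

// 110xxxxx: lead octet of a 2 byte sequence (compare the top three bits).
fn is_two_byte_lead(byte: u8) bool {
    return (byte & 0b1110_0000) == 0b1100_0000;
}

// 10xxxxxx: a "next" octet inside a multi-byte sequence (compare the top two bits).
fn is_continuation(byte: u8) bool {
    return (byte & 0b1100_0000) == 0b1000_0000;
}

test "one and two byte rules" {
    try std.testing.expect(is_one_byte(0x45)); // E
    // U+00E9 is encoded as the two octets C3 A9.
    try std.testing.expect(is_two_byte_lead(0xC3));
    try std.testing.expect(is_continuation(0xA9));
}
```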
The following rows follow the same rules but for the 3 byte and 4 byte octet sequences, with leading bits 1110 and 11110 respectively.
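The same masking idea extends to the longer sequences. Again this is only a sketch, and the names is_three_byte_lead and is_four_byte_lead are hypothetical.

```zig
const std = @import("std");

// 1110xxxx: lead octet of a 3 byte sequence (compare the top four bits).
fn is_three_byte_lead(byte: u8) bool {
    return (byte & 0b1111_0000) == 0b1110_0000;
}

// 11110xxx: lead octet of a 4 byte sequence (compare the top five bits).
fn is_four_byte_lead(byte: u8) bool {
    return (byte & 0b1111_1000) == 0b1111_0000;
}

test "three and four byte rules" {
    // U+A455 is encoded as EA 91 95, so EA is a 3 byte lead octet.
    try std.testing.expect(is_three_byte_lead(0xEA));
    try std.testing.expect(is_four_byte_lead(0xF0));
    // 11111xxx never appears in valid UTF-8.
    try std.testing.expect(!is_four_byte_lead(0xF8));
}
```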
Now that we've codified these rules we can move forward with writing our library.
The Library
The first thing we'll do is define some convenience types to work with.
I wanted an easy way to keep track of which type of octet sequence a certain code point uses without having to re-encode it. This library is also meant to be used in an existing C project, so we need to create our enum using c_int and add the extern keyword to our structure.
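These types could look something like the sketch below. OCT_INVALID, OCT_NEXT and OCT_ONE are names used in this article; OCT_TWO, OCT_THREE, OCT_FOUR, the type names and the field names are my assumptions for illustration and may not match the library.

```zig
const std = @import("std");

// enum(c_int) gives the enum the integer representation a C enum would use.
pub const OctetType = enum(c_int) {
    OCT_INVALID = 0,
    OCT_NEXT,
    OCT_ONE,
    OCT_TWO, // assumed name
    OCT_THREE, // assumed name
    OCT_FOUR, // assumed name
};

// extern struct guarantees a C-compatible field order and layout.
// The field names here are hypothetical.
pub const CodePoint = extern struct {
    oct_type: OctetType,
    val: u32,
};

test "code point carries its octet type" {
    const p = CodePoint{ .oct_type = .OCT_THREE, .val = 0xA455 };
    try std.testing.expect(p.oct_type == .OCT_THREE);
    try std.testing.expect(@sizeOf(OctetType) == @sizeOf(c_int));
}
```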
Next we define a convenient function to verify multi-byte octet sequences.
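A sketch of what such a check might look like. Only the name verify_octets comes from this article; the parameters are an assumption, and the caller is expected to have checked the bounds already.

```zig
const std = @import("std");

// Returns true when `count` octets starting at arr[start] all match 10xxxxxx.
// Assumed signature; the caller must guarantee start + count is in bounds.
fn verify_octets(arr: [*]const u8, start: usize, count: usize) bool {
    var i: usize = 0;
    while (i < count) : (i += 1) {
        if ((arr[start + i] & 0b1100_0000) != 0b1000_0000) return false;
    }
    return true;
}

test "continuation octets are verified" {
    const good = [_]u8{ 0xEA, 0x91, 0x95 }; // U+A455
    const bad = [_]u8{ 0xEA, 0x45, 0x95 }; // second octet is plain ASCII
    try std.testing.expect(verify_octets(&good, 1, 2));
    try std.testing.expect(!verify_octets(&bad, 1, 2));
}
```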
The last convenience function we will explain is this one, which determines the octet type of a given 8-bit value. We'll explain the export keyword with the next function.
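Here is a sketch of how get_oct_type could classify a byte. The parameter name and the OCT_TWO, OCT_THREE and OCT_FOUR variants are assumptions; the checks run from the shortest marker to the longest so that each byte matches at most one rule.

```zig
const std = @import("std");

pub const OctetType = enum(c_int) {
    OCT_INVALID = 0,
    OCT_NEXT,
    OCT_ONE,
    OCT_TWO, // assumed name
    OCT_THREE, // assumed name
    OCT_FOUR, // assumed name
};

// Classifies a single octet by its leading marker bits.
export fn get_oct_type(byte: u8) OctetType {
    if ((byte & 0b1000_0000) == 0) return .OCT_ONE; // 0xxxxxxx
    if ((byte & 0b1100_0000) == 0b1000_0000) return .OCT_NEXT; // 10xxxxxx
    if ((byte & 0b1110_0000) == 0b1100_0000) return .OCT_TWO; // 110xxxxx
    if ((byte & 0b1111_0000) == 0b1110_0000) return .OCT_THREE; // 1110xxxx
    if ((byte & 0b1111_1000) == 0b1111_0000) return .OCT_FOUR; // 11110xxx
    return .OCT_INVALID; // 11111xxx
}

test "octets are classified by their marker bits" {
    try std.testing.expect(get_oct_type(0x45) == .OCT_ONE);
    try std.testing.expect(get_oct_type(0xEA) == .OCT_THREE);
    try std.testing.expect(get_oct_type(0x91) == .OCT_NEXT);
}
```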
Now the core of the library! The first function we'll define is the parsing function. I like to start at the unit level so we'll write a function to read the "next" Unicode code point in a given u8 array. Here's the definition of the function:
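Of course, we add the export keyword to tell Zig to compile this function with the C ABI, and we also need to use C compatible types. That is why the arr parameter is a [*]const u8, which is a many-item pointer: a pointer to an "unknown" number of u8 values, much like a const pointer in C. Unlike a Zig slice, it carries no length.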
Alright, now to the meat of the function:
We start off with a reusable invalid_point object to return on errors. Next we do some housekeeping checks. We then build our initial result: we grab the starting octet out of the array and determine its octet type with get_oct_type. Next we just switch on the type and try to parse from there.
The first two cases are easy. If the initial type was OCT_INVALID or OCT_NEXT then this isn't a correctly formatted UTF-8 string, so we return an "invalid" code point. For an OCT_ONE type we just pull the value straight out.
The rest of the cases are a little more involved but still straightforward. We check that the expected number of bytes is available based on the type. We also verify that the rest of the bytes are formatted properly with verify_octets. Then we pull each value out and bitwise AND (&) it with the mask for its free bits. Lastly, we shift the values based on their free bits (6 * offset) and bitwise OR them into our code point, most significant bits first.
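Putting those steps together, a parser along these lines might look like the sketch below. The names invalid_point, get_oct_type, verify_octets and arr come from this article; the function name utf8_next, the len and start parameters, the struct fields and the OCT_TWO to OCT_FOUR variants are assumptions. It shifts the accumulator left by 6 for every continuation octet, which gives the same result as shifting each octet by 6 * offset. It does not reject overlong encodings or surrogates.

```zig
const std = @import("std");

pub const OctetType = enum(c_int) { OCT_INVALID = 0, OCT_NEXT, OCT_ONE, OCT_TWO, OCT_THREE, OCT_FOUR };
pub const CodePoint = extern struct { oct_type: OctetType, val: u32 };

fn get_oct_type(byte: u8) OctetType {
    if ((byte & 0b1000_0000) == 0) return .OCT_ONE;
    if ((byte & 0b1100_0000) == 0b1000_0000) return .OCT_NEXT;
    if ((byte & 0b1110_0000) == 0b1100_0000) return .OCT_TWO;
    if ((byte & 0b1111_0000) == 0b1110_0000) return .OCT_THREE;
    if ((byte & 0b1111_1000) == 0b1111_0000) return .OCT_FOUR;
    return .OCT_INVALID;
}

fn verify_octets(arr: [*]const u8, start: usize, count: usize) bool {
    var i: usize = 0;
    while (i < count) : (i += 1) {
        if ((arr[start + i] & 0b1100_0000) != 0b1000_0000) return false;
    }
    return true;
}

// Reads the code point that begins at arr[start]; `len` is the array length.
// Hypothetical name and parameters.
export fn utf8_next(arr: [*]const u8, len: usize, start: usize) CodePoint {
    const invalid_point = CodePoint{ .oct_type = .OCT_INVALID, .val = 0 };
    if (start >= len) return invalid_point;

    const lead = arr[start];
    const oct_type = get_oct_type(lead);
    var extra: usize = 0; // number of continuation octets expected
    var lead_mask: u8 = 0; // free bits of the lead octet
    switch (oct_type) {
        .OCT_INVALID, .OCT_NEXT => return invalid_point,
        .OCT_ONE => return CodePoint{ .oct_type = oct_type, .val = lead },
        .OCT_TWO => {
            extra = 1;
            lead_mask = 0b0001_1111;
        },
        .OCT_THREE => {
            extra = 2;
            lead_mask = 0b0000_1111;
        },
        .OCT_FOUR => {
            extra = 3;
            lead_mask = 0b0000_0111;
        },
    }
    // The sequence needs 1 + extra octets; bail out if the buffer is too short.
    if (len - start <= extra) return invalid_point;
    if (!verify_octets(arr, start + 1, extra)) return invalid_point;

    var val: u32 = lead & lead_mask;
    var i: usize = 1;
    while (i <= extra) : (i += 1) {
        // Make room for six more bits, then OR in the continuation payload.
        val = (val << 6) | (arr[start + i] & 0b0011_1111);
    }
    return CodePoint{ .oct_type = oct_type, .val = val };
}

test "parse ASCII and a three byte sequence" {
    const text = [_]u8{ 0x45, 0xEA, 0x91, 0x95 }; // "E" followed by U+A455
    try std.testing.expect(utf8_next(&text, text.len, 0).val == 0x45);
    try std.testing.expect(utf8_next(&text, text.len, 1).val == 0xA455);
    // Starting in the middle of a sequence is reported as invalid.
    try std.testing.expect(utf8_next(&text, text.len, 2).oct_type == .OCT_INVALID);
}
```

To walk a whole buffer with a function shaped like this, the caller would advance start by 1 to 4 bytes depending on the returned octet type and stop on the first invalid result.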
That's it for parsing! Now you can use this method to loop over your UTF-8 encoded buffers!
The next crucial function to implement is the "write" function, which takes a u32 code point and writes it back out to a UTF-8 encoded buffer.
This function is a thin wrapper for the C API: it takes in a C-style array, turns it into a Zig slice, and passes it on to the real write function.
The write function is also straightforward in its implementation. It switches on the code point's type and pulls the appropriate bits out of the u32, most significant bits first. You'll notice some convenience functions, which are just thin inline functions that ensure each value carries the correct UTF-8 octet sequence marker.
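As a sketch of that shape, assuming Zig 0.11 or newer (for the single-argument @truncate) and hypothetical names throughout (utf8_write, write_slice, lead_octet, next_octet, the struct fields and the OCT_TWO to OCT_FOUR variants), the wrapper and the real write function could look like this. It returns the number of bytes written, or 0 when the type is invalid or the buffer is too small, which is my own convention rather than the library's.

```zig
const std = @import("std");

pub const OctetType = enum(c_int) { OCT_INVALID = 0, OCT_NEXT, OCT_ONE, OCT_TWO, OCT_THREE, OCT_FOUR };
pub const CodePoint = extern struct { oct_type: OctetType, val: u32 };

// Lead octet: marker bits ORed with the masked top bits of the value.
inline fn lead_octet(v: u32, shift: u5, marker: u8, mask: u32) u8 {
    const bits: u8 = @truncate((v >> shift) & mask);
    return marker | bits;
}

// "Next" octet: 10 marker ORed with six bits of the value.
inline fn next_octet(v: u32, shift: u5) u8 {
    const bits: u8 = @truncate((v >> shift) & 0x3F);
    return 0b1000_0000 | bits;
}

// The real write function works on a Zig slice, so the length is checked here.
fn write_slice(buf: []u8, point: CodePoint) usize {
    const v = point.val;
    switch (point.oct_type) {
        .OCT_INVALID, .OCT_NEXT => return 0,
        .OCT_ONE => {
            if (buf.len < 1) return 0;
            buf[0] = lead_octet(v, 0, 0b0000_0000, 0x7F);
            return 1;
        },
        .OCT_TWO => {
            if (buf.len < 2) return 0;
            buf[0] = lead_octet(v, 6, 0b1100_0000, 0x1F);
            buf[1] = next_octet(v, 0);
            return 2;
        },
        .OCT_THREE => {
            if (buf.len < 3) return 0;
            buf[0] = lead_octet(v, 12, 0b1110_0000, 0x0F);
            buf[1] = next_octet(v, 6);
            buf[2] = next_octet(v, 0);
            return 3;
        },
        .OCT_FOUR => {
            if (buf.len < 4) return 0;
            buf[0] = lead_octet(v, 18, 0b1111_0000, 0x07);
            buf[1] = next_octet(v, 12);
            buf[2] = next_octet(v, 6);
            buf[3] = next_octet(v, 0);
            return 4;
        },
    }
}

// Thin C API wrapper: turn the pointer and length into a slice and delegate.
export fn utf8_write(buf: [*]u8, len: usize, point: CodePoint) usize {
    return write_slice(buf[0..len], point);
}

test "write a three byte sequence" {
    var out: [4]u8 = undefined;
    const n = utf8_write(&out, out.len, CodePoint{ .oct_type = .OCT_THREE, .val = 0xA455 });
    try std.testing.expect(n == 3);
    try std.testing.expect(out[0] == 0xEA and out[1] == 0x91 and out[2] == 0x95);
}
```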
Now we can continually write out code points to a given u8 buffer!
The Build
Next, the build.zig file. Most of this is standard, but we needed to bundle the Zig compiler runtime into the static build, so this is how the build file looked.
C Header
Next we need to generate the C header file to accompany the static library. Luckily the types easily map over.
You can ignore the __THROWNL and __nonnull(()) annotations. The important part to take away from this is that we need to put the extern keyword in front of our function declarations. Now we can use this in our C programs the same way you would normally include a static library!
The Integration
After integrating this library into my simple text editor I can now read/write and accept Unicode input.