Logo

dev-resources.site

for different kinds of informations.

Nested Encoding Efficency

Published at
1/9/2023
Categories
serialization
Author
kevincox
Categories
1 categories in total
serialization
open
Author
8 person written this
kevincox
open
Nested Encoding Efficency

It’s relatively common that you want to send data over a restricted channel. For example a URL path can’t contain a ? or a null byte. A common solution is using some encoding format to convert your data into an acceptable string.

Base64

Base64 is a very common format. It can convert arbitrary binary data into an alphabet of 64 characters (plus an extra character for padding if desired).

b64_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "YT8AYiD/"

With Base64 the output is always longer than the input (other than the empty string). In fact the output length is very easy to predict.

# Assumes no padding.
def b64_length(in: bytes) -> int:
    complete_chunks = len(in) // 3
    remainder = len(in) % 3

    complete_encoded = complete_chunks * 4
    remainder_encoded = 0 if remainder == 0 else remainder + 1

    return complete_encoded + remainder_encoded

The remainder is at most 3 bytes, so we can say that as the input data gets bigger the ratio approaches 4/3. Therefore, encoding data as Base64 will make it more or less 1/3 larger.

Base64 takes a very simple approach to encoding, it takes 3 input bytes and maps that to 4 output bytes. Rinse and repeat until you are done (with a little special logic for any remainder that doesn’t fit into a 3 byte chunk). This is simple, quick and predictable, but the output looks like nonsense cat becomes Y2F0 and hello becomes aGVsbG8. What if you want human-readable strings to remain human-readable?

C String Escaping

C-style string escaping is very simple. You have some set of “allowed” characters which are input literally, and anything else needs to be “escaped”.

c_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "a?\x00b \xFF"

Any non-allowed bytes are replaced by a backslash-x (\x) then the hex representation of the byte. In practice most implementations have shortcuts for common characters. For example \\ for a backslash and \0 for a null byte. It is easy enough to read and write by hand.

So what is the efficiency of this encoding? Well
 it depends. If you encode only allowed characters it has a ratio of 1, if you encode only characters with shortcuts it is 2, but if you encode only disallowed characters with no shortcuts it is 4! Unlike Base64 the encoding efficiency is data-dependant.

But what I find interesting is the scaling of nested encoding. For example imagine that I want to encode a single null byte then encode the result, then encode the result again


  1. \0 (2 bytes)
  2. \\0 (3 bytes)
  3. \\\\0 (5 bytes)
  4. \\\\\\\\0 (9 bytes)
  5. \\\\\\\\\\\\\\\\0 (17 bytes)
  6. \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\0 (33 bytes)

Repeated encoding tends toward doubling in size! That doesn’t seem optimal. You would hope this case is rare but if I had $1 for every time I saw JSON encoded inside a JSON string I could buy myself something nice.

Percent Encoding

Percent Encoding is most commonly used in URLs. It is similar to C string escaping except that it doesn’t have a shortcut for the metacharacter itself.

percent_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "a?%00b %FF"
percent_encode(['%']) == "%25"

It is also data-dependant. Its size ratio varies between 1 and 3. However, it doesn’t have as big of a problem with nested encoding.

  1. %00 (3 bytes)
  2. %2500 (5 bytes)
  3. %252500 (7 bytes)
  4. %25252500 (9 bytes)
  5. %2525252500 (11 bytes)
  6. %252525252500 (13 bytes)

Since the encoding doesn’t add additional characters that need escaping the growth on every iteration the growth is a constant factor based on the number of characters in the initial string that require encoding instead of doubling each time.

Can we do better?

It is easy to make a minor improvement. In C-style encoding using \. instead of \\ to represent \ would avoid the problem. Percent Encoding could be slightly optimized by adding shortcuts for common characters (for example %p instead of %25) at the cost of complexity and performance.

Metacharacter switching

One idea is a variable metacharacter. For example imagine encoding "'\ as \"\'\\. Imagine that we could then do \1\"\'\\ to say that the metacharacter is now 1 instead of \. This would avoid adding characters to the rest of the string, in this example it already saves one character.

The main downside is that you need to pick one of the following evils:

  1. Give up streaming support.
  2. Allow the metacharacter switch to occur anywhere in the stream.
  3. Accept that you may not pick the best metacharacter.

Not terrible but not great and not pretty.

Encoding Level

Another idea is declaring the “encoding level” at the start of the message. Upon encountering this the decoder just decrements the counter and passes though the data unchanged.

For example imagine a modification of base64.

  1. Y2F0
  2. =1=Y2F0
  3. =2=Y2F0
  4. =3=Y2F0

This has basically the same problems as the first approach and requires you to add another character to your alphabet or require that the nesting level is always declared (which adds overhead for non-nested encodings).

Research?

I tried looking around for a bit but couldn’t find any other discussion on this. I’m probably using the wrong keywords. If you are aware of any encoding systems that have considered this problem please let me know.

serialization Article's
30 articles in total
Favicon
Java interacting with Apache Avro
Favicon
Apache Avro
Favicon
Protocol Buffers as a Serialization Format
Favicon
WSON (Web Simple Object Notation)
Favicon
Serializing Python Object Using the pickle Module
Favicon
Converting a Unicorn project to Sitecore Content Serialization (SCS)
Favicon
Converting a TDS project to Sitecore Content Serialization (SCS) 
Favicon
Deserializing Polymorphic Lists with Kotlin, Jackson, and Spring Boot
Favicon
Mapify-ts
Favicon
ProtoBuf message serialization in Akka.NET using protobuf-net
Favicon
Pandas Dataframe to AVRO
Favicon
Mapping > Json Converters // true
Favicon
Nested Encoding Efficency
Favicon
Java serialization with Avro
Favicon
Java Serialization with Flatbuffers
Favicon
Java Serialization with Protocol Buffers
Favicon
Serialização de Objectos
Favicon
Effective Java: Consider Serialization Proxies Instead of Serialized Instances
Favicon
Effective Java: For Instance Control, Prefer Enum types to readResolve
Favicon
Effective Java: Write readObject Methods Defensively
Favicon
Effective Java: Consider Using a Custom Serialized Form
Favicon
Effective Java: Prefer Alternatives To Java Serialization
Favicon
ASP.NET XML serialisation issues: Observations on DataContractSerializer
Favicon
ReScript JSON Typed Strongly
Favicon
äœżç”šćșćˆ—ćŒ–ćœšć…©ć€‹ Rails ç«™ć°é–“ć‚łéžç‰©ä»¶
Favicon
Serialização no Kotlin
Favicon
Working with Firebase Cloud Firestore made easier with "withConverter()"
Favicon
Meet Model Object Mapper, a Database Serialization Utility for Django!
Favicon
How to Speak Binary in TypeScript
Favicon
Practical Java 16 - Using Jackson to serialize Records

Featured ones: