Building an email address parser in Rust with nom

Published at: 11/15/2024
Categories: rust, engineering
Author: David Mytton

There is no such thing as 100% security, which is why the philosophy of defense in depth requires multiple security layers. When mitigating form spam, one of those layers is likely to be email address validation.

Arcjet’s security as code SDK includes an email address validation primitive which is bundled in our signup form protection module. This consists of two parts: validation and verification.

Email validation checks whether the user input is a syntactically valid email address, whereas email verification checks whether the address can actually receive email, e.g. is the domain valid and does it have MX records?

We aim to build all Arcjet security primitives local-first because that avoids sending data out of your environment and provides the lowest latency. When it comes to email validation, we’ve written our own email parser which performs that step entirely locally. This is written in Rust and compiled to a WebAssembly module bundled with the SDK.

This post is about how we wrote that email parser.

What is combinator parsing?

Parser combinators are higher-order functions that accept one or more parsers as their input and produce a new parser as the output.

This means that you can implement your parser logic as a bunch of smaller primitive parsers, such as parsers that validate a single character or number, and then use them to build up a much more complex parser.

Using combinators leaves you with much more modular, maintainable, and testable code, because each sub-parser can be tested in isolation.
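To make the idea concrete before bringing in a library, here is a minimal hand-rolled sketch in plain Rust. The names `PResult`, `run_of`, `letters`, `digits`, and `pair` are made up for this illustration and are not nom's API; nom's real combinators are generic over input and error types.

```rust
// Result of a parser: (remaining input, parsed output) or an error message.
type PResult<'a, O> = Result<(&'a str, O), String>;

// Shared helper: match one or more leading characters that satisfy `pred`.
fn run_of<'a>(input: &'a str, pred: impl Fn(char) -> bool, what: &str) -> PResult<'a, &'a str> {
    let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected {what} at {input:?}"));
    }
    Ok((&input[end..], &input[..end]))
}

// Two small primitive parsers.
fn letters(input: &str) -> PResult<'_, &str> {
    run_of(input, |c| c.is_ascii_alphabetic(), "a letter")
}

fn digits(input: &str) -> PResult<'_, &str> {
    run_of(input, |c| c.is_ascii_digit(), "a digit")
}

// Combinator: takes two parsers and returns a new parser that runs them in order.
fn pair<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, (A, B)> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Ok((rest, (a, b)))
    }
}

fn main() {
    let letters_then_digits = pair(letters, digits);
    assert_eq!(letters_then_digits("abc123!"), Ok(("!", ("abc", "123"))));
    assert!(letters_then_digits("123abc").is_err());
}
```

Each primitive can be tested on its own, and `pair` never needs to know what its two sub-parsers actually match.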

Nom

Nom is a parser combinator library written in Rust, designed as a toolkit for building "safe parsers without compromising the speed or memory consumption".

Its design focuses on zero-copy parsing (outputs are slices of the input rather than newly allocated copies), support for streaming input, and bit-level parsing. This makes it efficient and versatile.

Basic parsers

The simplest parser that you can write is a do_nothing parser:

pub fn do_nothing(input: &str) -> IResult<&str, &str> {
    Ok((input, ""))
}

This parser consumes nothing: it returns the entire input, untouched, as the first element of the tuple and an empty string as the second.

Each parser in nom returns an IResult, which is a Result. In the Result::Ok case the parsers shown here return a tuple of two strings, where the first element is the remaining text that was not processed, and the second element is the text that was successfully matched by the parser. In general the second element can be any output type; here it happens to be a string.

If it fails to parse then it will return a Result::Err with the error that caused it not to match.

Parsers that are applied later on can then take the remaining piece (which in this case will be everything) in the first tuple element and do something with it.

The exact opposite of the do_nothing parser is the match_all parser:

pub fn match_all(input: &str) -> IResult<&str, &str> {
    Ok(("", input))
}

This parser selects the entire input, leaving nothing for future parsers to consume (because they must use the contents of the first element).

These two extremes aren’t that useful, but show how the inputs can be transformed and provided to the output for use in the next stage of the parser.
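A small sketch of that hand-off, written without nom so it runs on its own. `ParseResult` and `literal` are illustrative stand-ins (nom's `IResult` carries a richer error type), and `do_nothing` and `match_all` are re-declared against the stand-in type.

```rust
// Same shape as the parsers above: Ok((remaining, matched)) or an error.
type ParseResult<'a> = Result<(&'a str, &'a str), String>;

fn do_nothing(input: &str) -> ParseResult<'_> {
    Ok((input, ""))
}

fn match_all(input: &str) -> ParseResult<'_> {
    Ok(("", input))
}

// Hypothetical helper: match an exact prefix and hand back whatever follows it.
fn literal<'a>(expected: &str, input: &'a str) -> ParseResult<'a> {
    match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(format!("expected {expected:?} at {input:?}")),
    }
}

fn main() {
    // do_nothing consumes nothing, so a later parser still sees the full input.
    let (rest, matched) = do_nothing("hello").unwrap();
    assert_eq!((rest, matched), ("hello", ""));
    assert_eq!(literal("hello", rest), Ok(("", "hello")));

    // match_all consumes everything, so a later parser has nothing left to match.
    let (rest, matched) = match_all("hello").unwrap();
    assert_eq!((rest, matched), ("", "hello"));
    assert!(literal("hello", rest).is_err());
}
```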

Email addresses

The format of email addresses is specified in RFC 5322; however the standard that is defined in this RFC is very broad. In the real world most email providers are much more restrictive in what they’ll allow.

If we dig into the format, we can extract the key components:

  • Email addresses consist of a local part and a domain, separated by a @ character.
  • The local part can consist of either an atom string or a quoted string.
  • An atom string can consist of any alphanumeric character or any of the following: !#$%&'*+/=?^_`{|}~-. Dots are also allowed, but two consecutive dots are forbidden.
  • A quoted string can be any string as long as it's entirely contained within quotes. This means that something like "this-is_a;[email protected],address"@example.com is completely valid per the spec, although virtually no email providers would allow such a strange address to be registered on their service.
  • Then the domain section consists of at least two domain label sections separated by dots.
  • A domain label section can contain only alphanumeric characters or a dash.

These components give us a set of rules we can implement in a parser.
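Before turning them into parsers, the atom rules can be written down as plain predicates. This is only a sketch: the names `ATOM_SPECIALS`, `is_atom_char`, and `is_dot_atom` are assumptions made for illustration, and quoted strings are deliberately left out.

```rust
// Characters allowed in an atom besides ASCII letters and digits.
const ATOM_SPECIALS: &str = "!#$%&'*+/=?^_`{|}~-";

fn is_atom_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || ATOM_SPECIALS.contains(c)
}

// Dot-separated atoms. Splitting on '.' yields an empty piece for a leading,
// trailing, or doubled dot, so rejecting empty pieces covers all three cases.
fn is_dot_atom(local: &str) -> bool {
    local
        .split('.')
        .all(|piece| !piece.is_empty() && piece.chars().all(is_atom_char))
}

fn main() {
    assert!(is_dot_atom("john.smith"));
    assert!(is_dot_atom("this-is_valid+tag"));
    assert!(!is_dot_atom("john..smith")); // two consecutive dots
    assert!(!is_dot_atom("john;smith")); // ';' is not an atom character
    assert!(!is_dot_atom(""));
}
```

A boolean check like this only says yes or no. The parsers built in the following sections also report how much of the input they matched, which is what allows them to be chained.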

Building blocks

First we need to implement parsers to test compliance with the two types of strings allowed in the local part that I described above.

// Parser for atoms (local and domain parts without quotes)
fn atom(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_ascii_alphanumeric() || "!#$%&'*+/=?^_`{|}~-".contains(c))(input)
}

// Parser for quoted strings in the local part
fn quoted_string(input: &str) -> IResult<&str, &str> {
    recognize(delimited(
        char('"'),
        take_while(|c: char| c != '"' && c != '\\'), // Simplistic; RFC allows escaped characters
        char('"'),
    ))(input)
}

Because we are only concerned with testing these simple strings at this stage, it is really easy to build tests to verify that we’re on the right track with these.

Writing tests

One of the great benefits of writing your parsers this way is how easy it is to test the most primitive elements of your overall parser.

#[cfg(test)]
mod tests {
    use crate::{atom, quoted_string};

    #[test]
    fn it_recognizes_atoms_correctly() {
        // It should select the whole string
        assert_eq!(Ok(("", "this-is_valid")), atom("this-is_valid"));
        // It should only select the valid atom section
        assert_eq!(Ok((";is@not-valid", "this")), atom("this;is@not-valid"));
        // The first char is not a valid atom, so it cannot select anything and errors
        assert!(atom("\"this is text\"").is_err());
    }

    #[test]
    fn it_recognizes_quoted_strings_correctly() {
        // It should select the whole string
        assert_eq!(
            Ok(("", "\"this is a quoted string\"")),
            quoted_string("\"this is a quoted string\"")
        );
        // It will error when it isn't a quoted string
        assert!(quoted_string("this has no quotes").is_err());
    }
}

Local part parsing

Now that we can parse the atom and quoted string sections correctly, we can combine the two using combinators to be able to identify a full local part.

// Parser for the local part
fn local_part(input: &str) -> IResult<&str, &str> {
    recognize(separated_list1(char('.'), alt((atom, quoted_string))))(input)
}

First we use separated_list1 to allow for many (but at least one) sections of atom or quoted strings separated by dots. This doesn’t do exactly what we want, however, because it returns a list of the matched string segments without the dots. For example, john.smith would become ["john", "smith"].

To get around this we can wrap it with the recognize combinator. This lets us select the entire string that’s matched by separated_list1 as a single str instead of returning a list.
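To show the difference without pulling in nom, here is a rough library-free sketch. `word`, `dotted_words`, and this `recognize` are hypothetical stand-ins written for illustration; they mimic the behavior described above but are far less general than nom's combinators.

```rust
type PResult<'a, O> = Result<(&'a str, O), String>;

// Simplified element parser: one or more ASCII letters or digits.
fn word(input: &str) -> PResult<'_, &str> {
    let end = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a word at {input:?}"));
    }
    Ok((&input[end..], &input[..end]))
}

// Stand-in for a dot-separated list: returns the pieces without the dots.
fn dotted_words(input: &str) -> PResult<'_, Vec<&str>> {
    let (mut rest, first) = word(input)?;
    let mut pieces = vec![first];
    // Keep going only while a dot is followed by another word.
    while let Some(after_dot) = rest.strip_prefix('.') {
        match word(after_dot) {
            Ok((next_rest, piece)) => {
                pieces.push(piece);
                rest = next_rest;
            }
            Err(_) => break,
        }
    }
    Ok((rest, pieces))
}

// Hand-rolled `recognize`: run a parser, discard its output, and return the
// slice of the input that it consumed.
fn recognize<'a, O>(
    parser: impl Fn(&'a str) -> PResult<'a, O>,
) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| {
        let (rest, _) = parser(input)?;
        let consumed = input.len() - rest.len();
        Ok((rest, &input[..consumed]))
    }
}

fn main() {
    // The list parser loses the dots...
    assert_eq!(dotted_words("john.smith@"), Ok(("@", vec!["john", "smith"])));
    // ...while wrapping it in recognize returns the matched text as one slice.
    assert_eq!(recognize(dotted_words)("john.smith@"), Ok(("@", "john.smith")));
}
```

Because the result is a slice of the original input, nothing is copied or re-joined.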

Domain parsing

Next we need to parse the domain section.

Domain names consist of two or more domain label sections separated by dots. For example, www.example.com has three domain label sections, www, example, and com, separated by two dots.

The specification allows an email address to have either a domain name or an IP address literal, e.g. example@[127.0.0.1]. However, in the real world most email services do not support sending or receiving email from an IP-literal address, so we will not support it in this parser. In the actual Arcjet parser for email validation, it is an option you can enable.

// Parser for domain labels
fn domain_label(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_ascii_alphanumeric() || c == '-')(input)
}

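// Parser for the domain part
// Note: separated_list1 also matches a single label (e.g. `localhost`),
// so the two-label minimum is not enforced here.
fn domain(input: &str) -> IResult<&str, &str> {
    recognize(separated_list1(char('.'), domain_label))(input)
}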
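A list parser that requires "at least one" element is satisfied by a single label, so the "two or more labels" rule needs a separate check. One possible way to do this, sketched here in plain Rust on the string the domain parser returns, is below. `is_domain_label` and `has_valid_labels` are hypothetical helper names, and this is not necessarily how the Arcjet parser does it.

```rust
// A label is non-empty and contains only ASCII letters, digits, or dashes.
fn is_domain_label(label: &str) -> bool {
    !label.is_empty()
        && label
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-')
}

// A domain needs at least two valid labels separated by dots.
fn has_valid_labels(domain: &str) -> bool {
    let labels: Vec<&str> = domain.split('.').collect();
    labels.len() >= 2 && labels.iter().all(|label| is_domain_label(label))
}

fn main() {
    assert!(has_valid_labels("example.com"));
    assert!(has_valid_labels("www.example.com"));
    assert!(!has_valid_labels("localhost")); // only one label
    assert!(!has_valid_labels("example..com")); // empty label
    assert!(!has_valid_labels("exa_mple.com")); // '_' is not allowed in a label
}
```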

Combining it all

Next we can finally combine the parsers that we created above to parse a full email address.

// Parser for the complete email address
fn email_address(input: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(local_part, char('@'), domain)(input)
}

The separated_pair combinator runs the local_part parser, then matches the @ character, then runs the domain parser on whatever follows, discarding the @ itself. After running it you can find the local and domain sections as a tuple in the second element of the result.

To make the interface a little bit more friendly for the rest of your application, it makes sense to wrap the result in a struct so that consumers of the parsed email address don’t need to understand the positioning of items in the tuple.

struct EmailAddress<'a> {
    local: &'a str,
    domain: &'a str,
}

impl<'a> EmailAddress<'a> {
    fn parse(candidate: &'a str) -> Option<Self> {
        email_address(candidate)
            // Destructure nom's (remaining, (local, domain)) result by name.
            .map(|(_remaining, (local, domain))| EmailAddress { local, domain })
            .ok()
    }
}
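One thing to watch for: a successful parse can still leave input unconsumed, and `parse` above ignores the remaining text, so a candidate with trailing characters after a valid address would still produce `Some`. A library-free sketch of rejecting leftover input follows. `parse_fully` and `toy_email` are hypothetical stand-ins for illustration, not the nom-based parser; with nom itself, the `all_consuming` combinator serves a similar purpose.

```rust
// Accept a parse only when nothing is left over.
fn parse_fully<'a, O, E>(
    parser: impl Fn(&'a str) -> Result<(&'a str, O), E>,
    input: &'a str,
) -> Option<O> {
    match parser(input) {
        Ok(("", output)) => Some(output),
        _ => None,
    }
}

// Hypothetical stand-in for `email_address`: everything before the first '@'
// is the local part, followed by a run of letters, digits, dots, or dashes.
fn toy_email(input: &str) -> Result<(&str, (&str, &str)), String> {
    let (local, after_at) = input
        .split_once('@')
        .ok_or_else(|| "missing '@'".to_string())?;
    let end = after_at
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '.' || c == '-'))
        .unwrap_or(after_at.len());
    if local.is_empty() || end == 0 {
        return Err("empty local part or domain".to_string());
    }
    Ok((&after_at[end..], (local, &after_at[..end])))
}

fn main() {
    assert_eq!(
        parse_fully(toy_email, "jane@example.com"),
        Some(("jane", "example.com"))
    );
    // The parser alone succeeds and reports the leftover text...
    assert!(toy_email("jane@example.com extra").is_ok());
    // ...but the strict wrapper rejects it.
    assert_eq!(parse_fully(toy_email, "jane@example.com extra"), None);
}
```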

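Pulling it all together into a single file. Note that this version differs from the snippets above: it matches the local part and domain with simple character-class checks instead of the atom and quoted-string parsers, and it also accepts an IP literal as the domain: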

use nom::{
    branch::alt,
    bytes::complete::{take_until1, take_while1},
    character::complete::char,
    combinator::verify,
    sequence::{delimited, separated_pair},
    IResult,
};

enum RemotePart {
    Domain(String),
    IpLiteral(String),
}

struct EmailAddress {
    local: String,
    remote: RemotePart,
}

fn is_alphanumeric(c: char) -> bool {
    c.is_alphanumeric()
}

fn is_valid_email_char(c: char) -> bool {
    is_alphanumeric(c) || c == '.' || c == '-' || c == '_' || c == '+' || c == '\"'
}

fn is_valid_domain_char(c: char) -> bool {
    is_alphanumeric(c) || c == '.' || c == '-'
}

fn local_part(input: &str) -> IResult<&str, &str> {
    take_while1(is_valid_email_char)(input)
}

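fn is_valid_ip_address(input: &str) -> bool {
    // The original calls a crate-local `is_ip_address` helper that is not shown;
    // the standard library's IP address parser is used here as a stand-in.
    input.parse::<std::net::IpAddr>().is_ok()
}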

fn ip_literal_segment(input: &str) -> IResult<&str, &str> {
    delimited(
        char('['),
        verify(take_until1("]"), is_valid_ip_address),
        char(']'),
    )(input)
}

fn domain_segment(input: &str) -> IResult<&str, &str> {
    take_while1(is_valid_domain_char)(input)
}

fn domain_part(input: &str) -> IResult<&str, &str> {
    alt((domain_segment, ip_literal_segment))(input)
}

pub fn parse_email(input: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(local_part, char('@'), domain_part)(input)
}

pub fn is_email_address(candidate: &str) -> bool {
    let email = parse_email(candidate);
    match email {
        Ok((remaining, _parsed)) => remaining.is_empty(),
        Err(_) => false,
    }
}

fn main() {
    is_email_address("[email protected]");
}

Conclusion

Rust is well suited to writing parsers because of its performance, correctness, and security properties. It also allows us to compile the parser as part of the WebAssembly bundled in the Arcjet SDK. The analysis happens locally within a secure sandbox at near-native speeds, and we don't need to rewrite it as we add more platform SDKs.

Nom has proven to be a flexible library, and we expect to use it more in planned features.
