Logo

dev-resources.site

for different kinds of informations.

Monadic Parser Combinators: an Interactive JS Tutorial (Pt. 1)

Published at
11/2/2018
Categories
parsers
monads
functional
combinators
Author
glebec
Author
6 person written this
glebec
open
Monadic Parser Combinators: an Interactive JS Tutorial (Pt. 1)

Parsing analyzes serial data into a structural result, and is a key step in static code analysis and compilation. Parsers also demonstrate various functional concepts including purity, composition, and monads.

In this interactive tutorial, we will walk through implementing a simple parser combinator library. Combinators allow programmers to easily build up complex programs through the use of a small number of cooperative utility functions.

const color  = P.either(P.literal('red'), P.literal('blue'))
const animal = P.either(P.literal('zebra'), P.literal('giraffe'))
const spaces = P.many1(P.literal(' '))
const tallTale =
    color .chain(c =>
    spaces.chain(_ =>
    animal.chain(a => P.of(`There was a ${c} ${a}.`))))
Enter fullscreen mode Exit fullscreen mode

The intent is partly to learn about parsing and partly to learn about functional programming. No prior experience in either is assumed.

Motivations

Serial data, e.g. from a file or network payload, often must be analyzed into a result before a program can work with it. For example:

  • HTML is a string format for page contents. Browsers parse HTML into the DOM, a tree of in-memory nodes that can be queried and manipulated to affect the rendered web page.
  • JSON is a string format for nested data, often used for configuration files or network payloads. Programs can use JSON.parse to convert a JSON string into a JavaScript Object, which can be easily read and updated during runtime.
  • ESLint is a static code analysis tool for detecting errors and style deviations. ESLint uses a JavaScript parser (Espree) to read the author's program (a text file) and identify potential problems.

In addition, a class of programs called compilers go one step further, folding a parse tree back down into a converted string format. Compilers thus act as translators between strings in one formal language to strings in another.

const codeES2018 = 'num => num + 1'
const codeES5 = compileToES5(codeES2018)
console.log(codeES5) // 'function (num) { return num + 1 }'
Enter fullscreen mode Exit fullscreen mode
  • V8 is a JavaScript engine that uses a compiler to translate JavaScript into executable machine code.
  • Babel compiles JavaScript written in modern and/or domain-specific syntax (e.g. ES2018 + JSX) into older syntax (e.g. ES5).
  • Prettier compiles JavaScript with non-standard formatting into JavaScript with consistent formatting.

In a typical compiler, parsing is referred to as the front end and code generation is called the back end.

    infix       FE (parser)       +      BE (generator)      postfix
 "2 + 3 * 5"         ->          / \           ->          "2 3 5 * +"
                                2   *
                               / \
                              3   5
Enter fullscreen mode Exit fullscreen mode

However, it is also possible to generate results during the parsing step. In that case, no explicit tree is built, though the parser implicitly follows a tree structure while recursing through the input string.

Parsers

Generally, a parser extracts structure from the beginning of some serial input. The input can be an ordinary string, though some parsers may expect a stream (sequentially consumable source, e.g. generator or observable) of tokens (objects recording both linguistic type and lexical content).

Let's start simple and consider a parser to be a function from string to result.

// parse :: String -> *
const parse = s => ?
Enter fullscreen mode Exit fullscreen mode

This begs the question of what our parser produces. The answer depends on the problem. Many parsers produce tree nodes, but parsers can be written to build any result desired, including raw strings, numbers, functions, whatever.

Dealing With Failure

For now let's just capture the literal string value (or lexeme) that our parser matches. Here is an (incomplete) parser intended to match on and result in the lexeme "Triceratops":

// parseTri :: String -> String
const parseTri = s => "Triceratops"
Enter fullscreen mode Exit fullscreen mode

Sadly, this parser is broken. What if our string contains the wrong dinosaur? Parsing "T. Rex" shouldn't result in "Triceratops". We need a way to signal failure.

How would you address this issue? Come up with your own approach; your parsing function should be able to take the following strings and either result in "Triceratops" or else indicate (somehow) that it failed to parse.

// parseTri :: String -> String // change me const parseTri = s => 'Triceratops' // change me // change parseTri so that it can indicate failure console.log(parseTri('Triceratops')) // should succeed console.log(parseTri('Triceratops rules')) // should succeed console.log(parseTri('I <3 Triceratops')) // should fail console.log(parseTri('Pteranodons')) // should fail console.log(parseTri('T. Rex')) // should fail console.log(parseTri('')) // should fail

Signaling Failure: Maybe?

type Parser = String -> Maybe String
Enter fullscreen mode Exit fullscreen mode

An experienced functional programmer would likely reach for a sum type such as the Maybe monad. However, to properly address that concept now would derail this article substantially, so we will circle back to Maybe later on.

Signaling Failure: Array?

type Parser = String -> [String]
Enter fullscreen mode Exit fullscreen mode

Functional parsers commonly represent their results as lists. Failure can thus be represented by the empty list [], and multiple successes (from an ambiguous grammar) can be captured if necessary. Ambiguous grammars lie outside the scope of this article, so we don't strictly need lists.

Signaling Failure: Null

type Parser = String -> String | Null
Enter fullscreen mode Exit fullscreen mode

Both Maybe and lists are good ways of dealing with failure, but for now we will stick with a method JS developers are familiar with: null. The null value, whose inventor Tony Hoare once called his "billion dollar mistake", has serious pitfalls. However, it is idiomatic, approachable, and will suffice for the moment.

Here is our new parser. Try it on some strings and see what happens:

// parseTri :: String -> String | Null const parseTri = s => s.startsWith('Triceratops') ? 'Triceratops' : null console.log(parseTri('...try strings here...'))

Dealing with Leftovers

When we parse a result from a string, in most cases we will only consume part of the input. The remainder will have other values we can parse, which means we need to keep track of where to resume. In a mutable setting like JavaScript, one might be tempted to create a shared index value.

let index = 0
const parseTri = s => {
    const remainder = s.slice(index)
    if (remainder.startsWith('Triceratops')) {
        index += 'Triceratops'.length
        return 'Triceratops'
    }
    return null
}
Enter fullscreen mode Exit fullscreen mode

This works fine for parsing a single string:

let index = 0 const parseTri = s => { const remainder = s.slice(index) if (remainder.startsWith('Triceratops')) { index += 'Triceratops'.length return 'Triceratops' } return null } // (index & parseTri are in scope but hidden) const string = 'TriceratopsTriceratops' const res1 = parseTri(string) // 'Triceratops' const res2 = parseTri(string) // 'Triceratops' const res3 = parseTri(string) // null console.log(res1, res2, res3)

However, shared mutable state can quickly cause problems. What if we want to subsequently parse a different string?

let index = 0 const parseTri = s => { const remainder = s.slice(index) if (remainder.startsWith('Triceratops')) { index += 'Triceratops'.length return 'Triceratops' } return null } const string = 'TriceratopsTriceratops' const res1 = parseTri(string) // 'Triceratops' const res2 = parseTri(string) // 'Triceratops' const res3 = parseTri(string) // null // (index, parseTri, and res1–res3 are in scope but hidden) const string2 = 'Triceratops rules' const res4 = parseTri(string2) console.log(res4) // null // oops!

There are solutions, of course. We could create a Parser class whose instances manage internal state, or a higher-order makeParser function which closes over state. Both of those solutions bury state without actually eliminating it; debugging such hidden state is sometimes even more challenging.

Pure Token Consumption

It turns out that for this problem, we don't actually need mutable state. In functional programming, we prefer pure solutions.

Pure functions are stateless deterministic mappings from input to output, with no side effects. In other words, pure functions have output, but neither depend on nor change the external universe. If you can define your function as a (potentially infinite) table of inputs and outputs, it's pure. Try to induce how f1, f2, and f3 below are defined:

f1 in f1 out f2 in f2 out f3 in f3 out
'hi' 2 [a, b] b 0 -2
'what' 4 [0, 9, 1] 9 1 -1
'pow' 3 [0] undefined 2 2
'forest' 6 [] undefined 3 7
'a' 1 [1, 2, 3] 2 4 14
'' 0 [z, y] y 5 23
… … … … … …

If a parser needs to indicate where the next parse should commence, that implies that the leftover input should itself be a return value. So, our parsers will return two things: a parsed result, and remaining input.

type Parser = String -> String & String
Enter fullscreen mode Exit fullscreen mode

How can you write a function that returns two things? There is more than one viable way; come up with your own approach below, then read on for our take.

// parseTri :: String -> String & String const parseTri = s => s.startsWith('Triceratops') ? 'Triceratops' // fix me to also return leftover input : null // should succeed with "" as a remainder console.log(parseTri('Triceratops')) // should succeed with " rules" as a remainder console.log(parseTri('Triceratops rules')) // should fail console.log(parseTri('T. Rex'))

Tuples and Friends

Because functional programming involves a lot of data flowing in and out of functions, functional languages often feature a lightweight data structure for packaging values together: the tuple.

-- Haskell
myTuple = ("hello", 99) -- a 2-tuple
str = fst myTuple -- "hello"
num = snd myTuple -- 99
Enter fullscreen mode Exit fullscreen mode

JavaScript doesn't have tuples, but it does have Objects and Arrays (which are actually a type of Object). We can emulate tuples easily enough:

const myTuple = ['hello', 99]
const str = myTuple[0] // 'hello'
const num = myTuple[1] // false
Enter fullscreen mode Exit fullscreen mode
const myTuple = { str: 'hello', num: 99 }
const { str } = myTuple // 'hello'
const { num } = myTuple // false
Enter fullscreen mode Exit fullscreen mode

Arrays are more concise, but Objects are more expressive. You probably used one of these in your solution above; we'll use Objects.

// parseTri :: String -> { result: String, remainder: String } | Null const parseTri = s => s.startsWith('Triceratops') ? { result: 'Triceratops', remainder: s.slice(11) } : null console.log(parseTri('Triceratops')) console.log(parseTri('Triceratops is cool')) console.log(parseTri('T. Rex'))

Side Note: Isomorphisms

n-Tuples and n-element JS Arrays (when used solely to store data) are actually isomorphic. Sets A and B are isomorphic if you can define a pair of functions a2b and b2a, such that:

  • a2b maps all values in A to values in B
  • b2a maps all values in B to values in A
  • b2a(a2b(a)) = a
  • a2b(b2a(b)) = b

Isomorphisms are a rich topic with some delightful results, but the upshot is that it is nice to know when two data types are equivalent for a given use case.


Thought puzzle: why might two-valued objects not be isomorphic to 2-el arrays? What information would be lost in translation?


Threading Parsers

Our parser can now report which part of the input was not consumed, but how do we actually use that information? The leftover string from one parse becomes the input to the next parse. Complete the code below.

const assert = require('assert') const parseTri = s => s.startsWith('Triceratops') ? { result: 'Triceratops', remainder: s.slice(11) } : null const solution = [ "const firstParse = parseTri(string)", "const secondParse = parseTri(firstParse.remainder)", "const thirdParse = parseTri(secondParse.remainder)" ].join('\n') // parseTri :: String -> { result: String, remainder: String } | Null // parseTri is in scope in this snippet. You can refer back // to its definition, but you won't need to edit it. const string = 'TriceratopsTriceratops' // use parseTri to obtain the three values below: const firstParse = undefined const secondParse = undefined const thirdParse = undefined // console.log(solution) assert.deepStrictEqual(firstParse, { result: 'Triceratops', remainder: 'Triceratops' }, 'first parse works') assert.deepStrictEqual(secondParse, { result: 'Triceratops', remainder: '' }, 'second parse works') assert.equal(thirdParse, null, 'third parse works')

To see our solution, you can log the hidden variable solution.

Generalizing

We have a single parser which can parse "Triceratops" – not very interesting. Write a few parsers for other dinosaurs.

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const testParsers = () => registerTests(t => { t.deepEqual( parseRex('T. Rex'), { result: 'T. Rex', remainder: '' }, 'parseRex parses "T. Rex"' ) t.deepEqual( parseRex('T. Rex is scary'), { result: 'T. Rex', remainder: ' is scary' }, 'parseRex parses "T. Rex is scary"' ) t.equal( parseRex('Triceratops'), null, 'parseRex fails to parse "Triceratops"' ) t.deepEqual( parseRaptor('Velociraptor'), { result: 'Velociraptor', remainder: '' }, 'parseRaptor parses "Velociraptor"' ) t.deepEqual( parseRaptor('Velociraptor is scary'), { result: 'Velociraptor', remainder: ' is scary' }, 'parseRaptor parses "Velociraptor is scary"' ) t.equal( parseRaptor('Triceratops'), null, 'parseRaptor fails to parse "Triceratops"' ) t.end() }) const solution = [ "const parseRex = s => s.startsWith('T. Rex')", " ? { result: 'T. Rex', remainder: s.slice(6) }", " : null", "", "const parseRaptor = s => s.startsWith('Velociraptor')", " ? { result: 'Velociraptor', remainder: s.slice(12) }", " : null" ].join('\n') // parseRex :: String -> { result: String, remainder: String } | Null const parseRex = undefined // matches "T. Rex" // parseRaptor :: String -> { result: String, remainder: String } | Null const parseRaptor = undefined // matches "Velociraptor" // console.log(solution) await testParsers()

This is pretty repetitive. Let's generalize and make a parseLiteral function.

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const testParsers = () => registerTests(t => { t.deepEqual( parseRex('T. Rex'), { result: 'T. Rex', remainder: '' }, 'parseRex parses "T. Rex"' ) t.deepEqual( parseRex('T. Rex is scary'), { result: 'T. Rex', remainder: ' is scary' }, 'parseRex parses "T. Rex is scary"' ) t.equal( parseRex('Triceratops'), null, 'parseRex fails to parse "Triceratops"' ) t.deepEqual( parseRaptor('Velociraptor'), { result: 'Velociraptor', remainder: '' }, 'parseRaptor parses "Velociraptor"' ) t.deepEqual( parseRaptor('Velociraptor is scary'), { result: 'Velociraptor', remainder: ' is scary' }, 'parseRaptor parses "Velociraptor is scary"' ) t.equal( parseRaptor('Triceratops'), null, 'parseRaptor fails to parse "Triceratops"' ) t.end() }) const solution = [ "const parseLiteral = literal => s => s.startsWith(literal)", " ? { result: literal, remainder: s.slice(literal.length) }", " : null", "", "const parseRex = parseLiteral('T. Rex')", "", "const parseRaptor = parseLiteral('Velociraptor')" ].join('\n') // parseLiteral :: String -> // (String -> { result: String, remainder: String } | Null) const parseLiteral = literal => undefined // parseRex :: String -> { result: String, remainder: String } | Null const parseRex = undefined // use parseLiteral // parseRaptor :: String -> { result: String, remainder: String } | Null const parseRaptor = undefined // use parseLiteral // console.log(solution) await testParsers()

Solving the above demonstrates a common functional technique. Higher-order functions take and/or return functions themselves; parseLiteral does the latter. Instead of hard-coding individual parsers for every string, parseLiteral generalizes, mapping from strings to parsers:

string parseLiteral(string)
'hi' parser for 'hi'
'yo' parser for 'yo'
'wow' parser for 'wow'
… …

Different Result Types

Parsers match on strings, but they don't have to result in them. Write parsers for digit strings which result in numbers.

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const testParsers = () => registerTests(t => { t.deepEqual( parseNum0('0'), { result: 0, remainder: '' }, 'parseNum0 parses "0"' ) t.deepEqual( parseNum0('0 is cool'), { result: 0, remainder: ' is cool' }, 'parseNum0 parses "0 is cool"' ) t.equal( parseNum0('1'), null, 'parseNum0 fails to parse "1"' ) t.deepEqual( parseNum1('1'), { result: 1, remainder: '' }, 'parseNum1 parses "1"' ) t.deepEqual( parseNum1('1 is cool'), { result: 1, remainder: ' is cool' }, 'parseNum1 parses "1 is cool"' ) t.equal( parseNum1('0'), null, 'parseNum1 fails to parse "0"' ) t.end() }) const solution = [ "const parseNum0 = s => s.startsWith('0')", " ? { result: 0, remainder: s.slice(1) }", " : null", "", "const parseNum1 = s => s.startsWith('1')", " ? { result: 1, remainder: s.slice(1) }", " : null" ].join('\n') // parseNum0 :: String -> { result: Number, remainder: String } | Null const parseNum0 = undefined // result should be 0, not '0' // parseNum1 :: String -> { result: Number, remainder: String } | Null const parseNum1 = undefined // result should be 1, not '1' // console.log(solution) await testParsers()

Once we start using our parsers for practical purposes, it will be convenient to transform raw lexemes into processed results specific to our use case.

Types, Signatures, & Aliases

We now have seen two different kinds of parsers: ones that result in strings, and ones that result in numbers.

parseRex  :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null
Enter fullscreen mode Exit fullscreen mode

We haven't drawn special attention to these comments yet, preferring to teach by example, but these are examples of type signatures. A type signature documents what set a value belongs to. For example, the Bool type is a set of two values: true and false. You can read the notation :: as "has type" or "is in the set":

true  :: Bool -- `true`  is in the set `Bool`
false :: Bool -- `false` is in the set `Bool`
Enter fullscreen mode Exit fullscreen mode

Functions also belong to types. The ! (or "not") function happens to be in the set of functions which map values in Bool to values in Bool. We signify this as (!) :: Bool -> Bool. Here is the ! function defined:

Bool input Bool output
true false
false true

Thought puzzle: there are precisely three other functions of type Bool -> Bool. Can you define all of them? Remember, each function will be a table of two rows.


Functions are mappings from one type to another type (e.g. Int -> Bool, String -> String, or Object -> String). Accordingly, functional languages often emphasize their type system. ML and related languages like OCaml, F#, ReasonML, Miranda, Haskell, PureScript, Idris, and Elm are examples of languages with sophisticated typing, including algebraic data types and complete type inference.

JavaScript has a (sometimes infamous) dynamic type system, and lacks an official type notation. The syntax here is borrowed from Haskell and is similar to the system used by Ramda, a JS utility library. It is explained well by both the Ramda wiki and Fantasy Land docs.

We bring this up now because it is beginning to get cumbersome to write out the entire type of our parsers. The parseRex and parseNum1 type signatures differ at only one location, result:

parseRex  :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null
Enter fullscreen mode Exit fullscreen mode

Accordingly, from now on we will use Parser a to mean "a parsing function whose result is of type a":

type Parser a = String -> { result: a, remainder: String } | Null
Enter fullscreen mode Exit fullscreen mode

This type alias is just a shorthand we will use for documentation purposes. It will clean up our comments considerably and make it easier to talk about parsers. For example, now we can categorize parseRex and parseNum1 more concisely:

parseRex  :: Parser String
parseNum1 :: Parser Number
Enter fullscreen mode Exit fullscreen mode

Try it yourself. Simplify our parseLiteral signature using the shorthand:

const solution = [ "parseLiteral :: String -> Parser String" ].join('\n') // long version: // parseLiteral :: String -> // (String -> { result: String, remainder: String } | Null) // short version (fill in): // parseLiteral :: ??? console.log(solution)

Combinators at Last

What if we want to match on more than one possibility? Implement the following parsers. Existing parsers are in scope, use them in your definition:

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const parseLiteral = literal => s => s.startsWith(literal) ? { result: literal, remainder: s.slice(literal.length) } : null const parseTri = parseLiteral('Triceratops') const parseRex = parseLiteral('T. Rex') const parseRaptor = parseLiteral('Velociraptor') const testParsers = () => registerTests(t => { t.deepEqual( parseCarnivore('T. Rex is scary'), { result: 'T. Rex', remainder: ' is scary' }, 'parseCarnivore parses "T. Rex is scary"' ) t.deepEqual( parseCarnivore('Velociraptor is fast'), { result: 'Velociraptor', remainder: ' is fast' }, 'parseCarnivore parses "Velociraptor is fast"' ) t.equal( parseCarnivore('Triceratops eats grass'), null, 'parseCarnivore fails to parse "Triceratops eats grass"' ) t.deepEqual( parseBigDino('T. Rex has small arms'), { result: 'T. Rex', remainder: ' has small arms' }, 'parseBigDino parses "T. Rex has small arms"' ) t.deepEqual( parseBigDino('Triceratops has three horns'), { result: 'Triceratops', remainder: ' has three horns' }, 'parseBigDino parses "Triceratops has three horns"' ) t.equal( parseBigDino('Velociraptor hunts in packs'), null, 'parseBigDino fails to parse "Velociraptor hunts in packs"' ) t.end() }) const solution = [ "// parseCarnivore, parseBigDino :: Parser String", "const parseCarnivore = s => parseRex(s) || parseRaptor(s)", "const parseBigDino = s => parseRex(s) || parseTri(s)" ].join('\n') // parseTri :: Parser String // parseRex :: Parser String // parseRaptor :: Parser String // parseCarnivore :: Parser String const parseCarnivore = undefined // match "T. Rex" OR "Velociraptor" // parseBigDino :: Parser String const parseBigDino = undefined // match "T. Rex" OR "Triceratops" // console.log(solution) await testParsers()

Again, this is getting repetitive. What does the functional programmer do when faced with repetition? Generalize! Write a function either which takes two parsers and returns a new parser that matches either of them.

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const parseLiteral = literal => s => s.startsWith(literal) ? { result: literal, remainder: s.slice(literal.length) } : null const parseTri = parseLiteral('Triceratops') const parseRex = parseLiteral('T. Rex') const parseRaptor = parseLiteral('Velociraptor') const testParsers = () => registerTests(t => { t.deepEqual( parseCarnivore('T. Rex is scary'), { result: 'T. Rex', remainder: ' is scary' }, 'parseCarnivore parses "T. Rex is scary"' ) t.deepEqual( parseCarnivore('Velociraptor is fast'), { result: 'Velociraptor', remainder: ' is fast' }, 'parseCarnivore parses "Velociraptor is fast"' ) t.equal( parseCarnivore('Triceratops eats grass'), null, 'parseCarnivore fails to parse "Triceratops eats grass"' ) t.deepEqual( parseBigDino('T. Rex has small arms'), { result: 'T. Rex', remainder: ' has small arms' }, 'parseBigDino parses "T. Rex has small arms"' ) t.deepEqual( parseBigDino('Triceratops has three horns'), { result: 'Triceratops', remainder: ' has three horns' }, 'parseBigDino parses "Triceratops has three horns"' ) t.equal( parseBigDino('Velociraptor hunts in packs'), null, 'parseBigDino fails to parse "Velociraptor hunts in packs"' ) t.end() }) const solution = [ "// either :: (Parser a, Parser b) -> Parser (a | b)", "const either = (p1, p2) => s => p1(s) || p2(s)" ].join('\n') // parseLiteral :: String -> Parser String // parseTri :: Parser String // parseRex :: Parser String // parseRaptor :: Parser String // either :: (Parser a, Parser b) -> Parser (a | b) const either = (p1, p2) => undefined // parseCarnivore, parseBigDino :: Parser String const parseCarnivore = either(parseRex, parseRaptor) const parseBigDino = either(parseRex, parseTri) // console.log(solution) await testParsers()

Congratulations, you've written a combinator.

Combi-who?

Combinator, like many programming terms, is overloaded – it means different things in different contexts.

  • Combinatory logic is the study of functions which act on functions. For example, the I combinator (x => x) takes in a function and returns the same function. In this field, "combinator" formally means a function with no free (unbound) variables.
  • The combinator pattern is a strategy for building flexible libraries, in which a small number of cooperative utility functions build up rich results (which in turn can be new inputs to those same functions).
    • For example, CSS "combinators" take in CSS selectors and yield new, more complex selectors. div and h1 are both primitive selectors; using the direct child (>) combinator, you can create the new div > h1 selector.

We will define a parser combinator as a function which takes in one or more parsers, and returns a new parser. Some examples:

either :: (Parser a, Parser b) -> Parser (a | b)
both   :: (Parser a, Parser b) -> Parser [a, b]
any    :: (...Parser *)        -> Parser *
many   ::  Parser a            -> Parser [a]
Enter fullscreen mode Exit fullscreen mode

You've already defined either; take a crack at both and any next. For manual testing, parseLiteral and all of our dino parsers are in scope:

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const parseLiteral = literal => s => s.startsWith(literal) ? { result: literal, remainder: s.slice(literal.length) } : null const parseTri = parseLiteral('Triceratops') const parseRex = parseLiteral('T. Rex') const parseRaptor = parseLiteral('Velociraptor') const testTriRexViaBoth = () => registerTests(t => { const parseTriRex2 = both(parseTri, parseRex) t.deepEqual( parseTriRex2('TriceratopsT. Rex'), { result: ['Triceratops', 'T. Rex'], remainder: '' }, 'parseTriRex parses "TriceratopsT. Rex"' ) t.deepEqual( parseTriRex2('TriceratopsT. Rex!!!'), { result: ['Triceratops', 'T. Rex'], remainder: '!!!' }, 'parseTriRex parses "TriceratopsT. Rex!!!"' ) t.equal( parseTriRex2('TriceratopsT. Bone'), null, 'parseTriRex fails to parse "TriceratopsT. Bone"' ) t.equal( parseTriRex2('Triangles'), null, 'parseTriRex fails to parse "Triangles"' ) t.end() }) const solution = [ "const both = (p1, p2) => s => {", " const r1 = p1(s)", " if (!r1) return null", " const r2 = p2(r1.remainder)", " if (!r2) return null", " return {", " result: [r1.result, r2.result],", " remainder: r2.remainder", " }", "}" ].join('\n') // both takes in a two parsers, and makes a parser that matches // the first one and then the second (or fails if EITHER of them // fails). It packages the two successes into an array. // both :: (Parser a, Parser b) -> Parser [a, b] const both = undefined // parseTriRex :: Parser String const parseTriRex = both(parseTri, parseRex) console.log(parseTriRex('TriceratopsT. Rex!')) // succeeds console.log(parseTriRex('Triceratops & T. Rex')) // fails // console.log(solution) await testTriRexViaBoth()

Also, either is in scope if you want it:

const registerTests = require('@runkit/glebec/tape-dispenser/releases/1.0.0') const parseLiteral = literal => s => s.startsWith(literal) ? { result: literal, remainder: s.slice(literal.length) } : null const either = (p1, p2) => s => p1(s) || p2(s) const parseTri = parseLiteral('Triceratops') const parseRex = parseLiteral('T. Rex') const parseRaptor = parseLiteral('Velociraptor') const testDinoViaAny = () => registerTests(t => { const parseDino2 = any(parseTri, parseRex, parseRaptor) t.deepEqual( parseDino2('Velociraptor on the loose'), { result: 'Velociraptor', remainder: ' on the loose' }, 'parseDino parses "Velociraptor on the loose"' ) t.deepEqual( parseDino2('Triceratops for the win'), { result: 'Triceratops', remainder: ' for the win' }, 'parseDino parses "Triceratops for the win"' ) t.deepEqual( parseDino2('T. Rex is a king'), { result: 'T. Rex', remainder: ' is a king' }, 'parseDino parses "T. Rex is a king"' ) t.equal( parseDino2('Wooly mammoths migrate west'), null, 'parseDino fails to parse "Wooly mammoths migrate west"' ) t.end() }) const solution = [ "// any :: (...Parser *) -> Parser *", "const any = (...parsers) => parsers.reduce(either)", ].join('\n') // either :: (Parser a, Parser b) -> Parser (a | b) // any takes in a variable number of parsers, and makes a parser // that matches the first one to match (or fails if they ALL fail). // any :: (...Parser *) -> Parser * const any = undefined // parseDino :: Parser String const parseDino = any(parseTri, parseRex, parseRaptor) // console.log(solution) await testDinoViaAny()

This is a powerful concept. Instead of hand-coding ad-hoc parser functions for every conceivable purpose, we can define a few abstract combinators. Feed simple parsers in, get more powerful parsers out; those new powerful parsers can become, in turn, inputs to the same combinators. We can therefore build up rich behavior from a few easy-to-use tools. This is the essence of composition – construction via combination.

const parseLiteral = literal => s => s.startsWith(literal) ? { result: literal, remainder: s.slice(literal.length) } : null const parseTri = parseLiteral('Triceratops') const parseRex = parseLiteral('T. Rex') const parseRaptor = parseLiteral('Velociraptor') const either = (p1, p2) => s => p1(s) || p2(s) const both = (p1, p2) => s => { const r1 = p1(s) if (!r1) return null const r2 = p2(r1.remainder) if (!r2) return null return { result: [r1.result, r2.result], remainder: r2.remainder } } const any = (...parsers) => parsers.reduce(either) const parseDino = any(parseTri, parseRex, parseRaptor) const parseTwoDinos = both(parseDino, parseDino) console.log(parseTwoDinos('VelociraptorTriceratopsEtc.')) // succeeds console.log(parseTwoDinos('T. RexVelociraptor…')) // succeeds console.log(parseTwoDinos('TriceratopsNope')) // fails

Conclusion to Part 1

Below is the current state of our parser combinator library.

// type Parser a = String -> { result: a, remainder: String } | Null

// parseLiteral :: String -> Parser String
const parseLiteral = literal => s => s.startsWith(literal)
    ? { result: literal, remainder: s.slice(literal.length) }
    : null

// either :: (Parser a, Parser b) -> Parser (a | b)
const either = (p1, p2) => s => p1(s) || p2(s)

// both :: (Parser a, Parser b) -> Parser [a, b]
const both = (p1, p2) => s => {
    const r1 = p1(s)
    if (!r1) return null
    const r2 = p2(r1.remainder)
    if (!r2) return null
    return {
        result: [r1.result, r2.result],
        remainder: r2.remainder
    }
}

// any :: (...Parser *) -> Parser *
const any = (...parsers) => parsers.reduce(either)
Enter fullscreen mode Exit fullscreen mode

In this first installment, we covered:

  • what parsing is
  • some use cases for parsers
  • one (non-functional) way to deal with failure
  • one (functional) way to eliminate state
  • function purity
  • using tuples (or arrays/objects) as return values
  • threading parser results sequentially
  • types, signatures, and aliases
  • basic parser combinators

In an upcoming sequel to this article, we will cover some more challenging issues:

  • dealing with longer sequences
  • dealing with choice
  • functors and monads
  • simulating infix function notation
  • formal grammars & parsing theory
  • recursive descent
  • laziness

Part 2 will be linked here once published.


About the Author

Gabriel Lebec is an instructor for Fullstack Academy of Code, an intensive web applications development program. He has a B.A. in Mathematics and Studio Art, and too many hobbies, including the art-historical study of antique Japanese swords. You can see more by Gabriel at Medium, YouTube, GitHub, and Twitter.

Featured ones: