dev-resources.site
for different kinds of informations.
Create Your Own Programming Language 1: Numbers
In today's post we're going to scaffold out the compiler for Wanda and add numbers to the language.
Numbers will be represented by the JavaScript number type, which is a 64-bit floating point number.
Floating Point Numbers vs. Decimals
If you're not sure what that means, here's the tl;dr: the convention in regular use is to use decimal numbers, i.e. base 10. Computers can only deal with binary numbers, i.e. base 2. That means the numeral 10 in decimal means 10, whereas the numeral 10 in binary means 2.
Floating point numbers are a standard form used for representing decimal numbers in a computer's memory that deal with the issues that come with trying to convert base 10 numbers to base 2 numbers.
That means floating point numbers don't always work the exact same way as the decimals you're used to. For example, in decimal numbers 0.1 * 3 always equals 0.3. With floating point numbers, 0.1 * 3 is almost but not quite equal to 0.3 due to differences between rounding decimal numbers vs. rounding binary numbers.
To be exact, in floating point numbers 0.1 * 3 is 0.30000000000000004.
Some languages abstract these differences away by using BigDecimal numbers, but that would take away from our basic goal which is just to implement a language so we're not going to do it with Wanda. We'll stick to regular, old JavaScript numbers.
I am working on another language that does use BigDecimals, and I'll update you when there's something to show with that.
The Stages of Our Compiler
To recap the intro article, the compiler has several parts. Our compiler will have the following stages as it transforms a string of Wanda source code into its JavaScript output:
- Lexing
- Reading (producing an s-expression tree)
- Expanding
- Parsing (producing an Abstract Syntax Tree (AST))
- Desugaring
- Emitting
Note that later in the series we'll also add a stage for semantic analysis after the parsing stage.
In addition to compiling numbers, we're also going to start building out a CLI interface to the compiler with a REPL (Read-Eval-Print Loop) that will allow us to compile and evaluate code in a single step.
Most of the heavy lifting today will be done in the lexing phase, reading numbers from the text input into tokens. In fact, the expanding and desugaring phases won't do anything at all; they'll just take their input data structure and return it straight away. I'm just including them now so we don't have to add them to the pipeline later.
Project Setup
I use Prettier for code formatting, with the most basic .prettierrc
config possible:
{}
You can run npm init
if you want to initialize the project as a package. I have mine scoped to my name, @jasonsbarr/wanda
. Note that I'm using Node v20.0.0, but you shouldn't need a brand new version of Node for the code to work.
Make sure that in your package.json
file you have the directive "type": "module"
so ES2015 modules work as specified.
All code is in the src
directory unless specified otherwise.
Lexing
The lexer takes an input string of source code and transforms it into tokens that represent lexemes, which are valid bits of text that can be transformed into syntax objects as part of a tree. It takes in a string and outputs an array of token objects.
A token holds 3 bits of information: the token type, its text value, and an object that contains source location (srcloc) info: file, line, column, and position.
Source Location Info
Let's start by creating the SrcLoc
class for the srcloc info. Create a new directory lexer
in the src
directory and put this in SrcLoc.js
:
export class SrcLoc {
constructor(pos, line, col, file) {
this.pos = pos;
this.line = line;
this.col = col;
this.file = file;
}
static new(pos, line, col, file) {
return new SrcLoc(pos, line, col, file);
}
}
Note that I add a static new
method to all my classes as a personal preference so creating new objects is just a method call instead of a new
expression. That's purely my preference; you can do the same with your classes if you like or leave it off. It's up to you.
The Token Type
In the lexer
folder let's add TokenTypes.js
for an enum-like object to represent token types. "Number" is the only token type for now:
export const TokenTypes = {
Number: "Number",
};
Now let's add the Token
class in src/lexer/Token.js
:
export class Token {
constructor(type, value, srcloc) {
this.type = type;
this.value = value;
this.srcloc = srcloc;
}
static new(type, value, srcloc) {
return new Token(type, value, srcloc);
}
}
Helper Functions
Before we get into the lexer proper, we need 2 more things for support:
- A set of helper functions to detect certain characters
- An object to manage the input string as we traverse it and chunk it into tokens
First, the helper functions (in src/lexer/utils.js
). We need to detect dots, digits, whitespace, semicolons (comments begin with a semicolon and continue to the end of the line), newlines, and dashes (minus signs), plus a function for checking if a numeric string is a well-formed number:
// Character matchers
export const isDot = (ch) => /\./.test(ch);
export const isDigit = (ch) => /\d/.test(ch);
export const isWhitespace = (ch) => /\s/.test(ch);
export const isSemicolon = (ch) => /;/.test(ch);
export const isNewline = (ch) => /\n/.test(ch);
export const isDash = (ch) => /\-/.test(ch);
// String matchers
export const isNumber = (str) => /^[+-]?\d+(\.\d+)?$/.test(str);
The InputStream Class
Now a class for the InputStream
. This class manages the internal state of the input as the lexer traverses it and breaks it up into tokens.
The InputStream
class has 5 methods: eof
, lookahead
, next
, peek
, and readWhile
.
-
eof
checks to see if the current position is at or past the end of the input string -
lookahead
gets the character at the position n spots ahead of the current position -
next
gets the character at the current position and advances the input by 1 -
peek
gets the character at the current position, but does not advance the input -
readWhile
consumes every character in the input while a condition is true
src/lexer/InputStream.js
:
import { isNewline } from "./utils.js";
export class InputStream {
constructor(input, file) {
this.input = input;
this.file = file;
this.pos = 0;
this.line = 1;
this.col = 1;
}
static new(input, file) {
return new InputStream(input, file);
}
get length() {
return this.input.length;
}
eof() {
return this.pos >= this.length;
}
lookahead(n = 1) {
return this.input[this.pos + n];
}
next() {
const ch = this.input[this.pos++];
if (isNewline(ch)) {
this.line++;
this.col = 1;
} else {
this.col++;
}
return ch;
}
peek() {
return this.input[this.pos];
}
readWhile(pred) {
let str = "";
while (pred(this.peek())) {
str += this.next();
}
return str;
}
}
In-Language Exceptions
Let's add a SyntaxException
class that inherits from a base Exception
class in the language, which just inherits from the built-in JavaScript Error
class. In src/shared/exceptions.js
:
export class Exception extends Error {
constructor(msg) {
super(msg);
}
}
export class SyntaxException extends Exception {
constructor(value, srcloc) {
super(
`Syntax Exception: invalid syntax ${value} found at ${srcloc.file} (${srcloc.line}:${srcloc.col})`
);
}
}
Now we have the machinery set up so we can create the lexer itself.
The Lexer
We'll create a Lexer class that holds an InputStream
object and has methods for tokenizing the input.
The main tokenize
method creates an array for tokens, then has a loop that continues as long as we're not at the end of the input. Inside the loop we peek
at the current character then perform an action based on it.
We'll skip whitespace and comments, and the only token we're reading right now is a number token. We'll do this with the readNumber
method.
First, we import these functions and types into src/lexer/Lexer.js
:
import { SyntaxException } from "../shared/exceptions.js";
import { SrcLoc } from "./SrcLoc.js";
import { Token } from "./Token.js";
import { TokenTypes } from "./TokenTypes.js";
import {
isDash,
isDigit,
isDot,
isNewline,
isNumber,
isSemicolon,
isWhitespace,
} from "./utils.js";
The Lexer
class constructor is simple:
export class Lexer {
constructor(input) {
this.input = input;
}
static new(input) {
return new Lexer(input);
}
}
For the readNumber
method we first get the srcpos
info off the input, then check to see if there's a minus sign. If there's a minus sign, add it to the number token. Then read any ensuing numbers and decimal points into the token value. We need to be able to tell if it's a valid or invalid number token, so we're just going to take all the numbers and dots in until there are no more left.
Then, if it's not a valid number, throw a SyntaxException
. Otherwise, we create the number token.
In the Lexer
class, beneath the new
method:
readNumber() {
let { pos, line, col, file } = this.input;
const srcloc = SrcLoc.new(pos, line, col, file);
let num = "";
if (isDash(this.input.peek())) {
num += this.input.next();
}
num += this.input.readWhile((ch) => isDigit(ch) || isDot(ch));
if (!isNumber(num)) {
throw new SyntaxException(num, srcloc);
}
return Token.new(TokenTypes.Number, num, srcloc);
}
The tokenize
method runs a loop and dispatches on the current character. Right now all we're doing is skipping whitespace and comments, and reading number tokens. Anything else will cause a SyntaxException
.
In the Lexer
class below the readNumber
method:
tokenize() {
let tokens = [];
while (!this.input.eof()) {
let ch = this.input.peek();
if (isWhitespace(ch)) {
this.input.readWhile(isWhitespace);
} else if (isSemicolon(ch)) {
this.input.readWhile((ch) => !isNewline(ch) && !this.input.eof());
} else if (isDash(ch) && isDigit(this.input.lookahead(1))) {
tokens.push(this.readNumber());
} else if (isDigit(ch)) {
tokens.push(this.readNumber());
} else {
const { pos, line, col, file } = this.input;
throw new SyntaxException(ch, SrcLoc.new(pos, line, col, file));
}
}
return tokens;
}
That's all for the Lexer
class for now! Now we'll abstract it away into a tokenize
function in src/lexer/tokenize.js
:
import { InputStream } from "./InputStream.js";
import { Lexer } from "./Lexer.js";
export const tokenize = (input, file) =>
Lexer.new(InputStream.new(input, file)).tokenize();
Next we'll move onto the reader.
The Reader
"Reader" is the Lisp equivalent of a parser. A conventional reader dispatches based on the current character, similar to how our lexer works, and reads the input stream into a tree of s-expression data structures that are homoiconic with the source code. That means the code and data can essentially be substituted for one another in the programmer's imagination. The underlying data is the same as the outward representation in text.
Our compiler has both a reader and a parser for pedagogical reasons. We'll read the input into s-expressions so we can have all the power and elegance of Lisp macros, but we'll then parse that tree into a fuller AST so I can show you some additional compiler techniques that will help us make the language more interesting.
Technically we could have done all this with the s-expression tree produced by the reader and expander, but it's a little easier to do with a fuller AST and we're prioritizing understandability over efficiency here.
First we'll create a Reader
class that manages the token stream in similar fashion to how the InputStream
class manages the input string.
The Reader
class has 4 methods:
-
eof
, which checks to see if we're at the end of the token stream -
next
, which returns the current token and advances the stream by 1 -
peek
, which returns the token at the current position -
skip
, which skips the current position without returning a token
In src/reader/Reader.js
:
export class Reader {
constructor(tokens) {
this.tokens = tokens;
this.pos = 0;
}
static new(tokens) {
return new Reader(tokens);
}
get length() {
return this.tokens.length;
}
eof() {
return this.pos >= this.length;
}
next() {
return this.tokens[this.pos++];
}
peek() {
return this.tokens[this.pos];
}
skip() {
this.pos++;
}
}
The read
functions take the Reader
as a parameter and read syntactic forms from the token stream. Since we're only reading numbers right now, the read
functions are very similar.
The main read
function currently returns an array of forms. The only form we're reading right now is a number token. In the future, we'll change the data structure created by read
from an array to an s-expression made up of linked lists using cons cells. If you don't know what that means, don't worry about itβI'll show you exactly what that means in a future post. We'll also add additional syntactic forms besides plain tokens and lists.
The read
function dispatches to the readForm
function, which reads forms from the current position in the token stream. Right now our only form is the number token, which is an atomic form, so readForm
just passes the Reader
through to readAtom
. This will change soon.
In src/reader/read.js
, import these types:
import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
import { Reader } from "./Reader.js";
Here's the readAtom
function in its current state:
const readAtom = (reader) => {
const tok = reader.peek();
switch (tok.type) {
case TokenTypes.Number:
reader.skip();
return tok;
default:
throw new SyntaxException(tok.value, tok.srcloc);
}
};
As mentioned above, readForm
currently just delegates to readAtom
:
const readForm = (reader) => {
return readAtom(reader);
};
And read
processes the whole token stream:
export const read = (tokens) => {
const reader = Reader.new(tokens);
let parseTree = [];
while (!reader.eof()) {
parseTree.push(readForm(reader));
}
return parseTree;
};
That's the current iteration of the reader! Soon we'll add additional forms for it to read, including some reader macros that will transform the token stream as it reads it.
The Expander
The expander is simple, since there's nothing yet to expand. It merely takes the output of the reader and returns it. In src/expander/expand.js
:
export const expand = (parseTree) => parseTree;
The Parser
Unlike most Lisp languages, this one has a separate parsing stage that converts the s-expression tree into a JSON-compatible AST made up of object nodes.
Since we're only compiling numbers for now, there are only 2 AST nodes: Program
and NumberLiteral
. I'm writing the AST module as an object of node constructor methods, rather than as a set of classes.
Every node has the properties type
and srcloc
that contain the node type and srcloc
info, respectively. Different node types have different properties based on the data needed to represent the language form they signify.
The Program
node has a body
property that is an array of nodes that make up the program body. NumberLiteral
has a value
property that is the text value from the number token.
In src/parser/ast.js
:
export const ASTTypes = {
Program: "Program",
NumberLiteral: "NumberLiteral",
};
export const AST = {
Program(exprs) {
return {
type: ASTTypes.Program,
body: exprs,
srcloc: exprs[0]?.srcloc ?? SrcLoc.new(0, 0, 0, "none"),
};
},
NumberLiteral(token) {
return {
type: ASTTypes.NumberLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
};
The parser also uses the Reader
class from the reader to manage its input state. This will change when we edit the reader to output a cons data structure instead of a JavaScript array, but it's good enough for now.
In src/parser/parse.js
, import the following:
import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
import { AST } from "./ast.js";
import { Reader } from "../reader/Reader.js";
Like with the reader, we have a main parse
function that dispatches to the parseExpr
function. Since our only expression type right now is numbers, all parseExpr
currently does is dispatch to parsePrimitive
.
The parser will change quite a bit when we start returning a linked list from the reader instead of an array, but this will do for now:
const parsePrimitive = (reader) => {
const token = reader.peek();
switch (token.type) {
case TokenTypes.Number:
reader.skip();
return AST.NumberLiteral(token);
default:
throw new SyntaxException(token.value, token.srcloc);
}
};
const parseExpr = (reader) => {
return parsePrimitive(reader);
};
export const parse = (readTree) => {
let body = [];
const reader = Reader.new(readTree);
while (!reader.eof()) {
body.push(parseExpr(reader));
}
return AST.Program(body);
};
That's it for the parser. In languages with more complicated syntax, parsing is a huge and complicated undertaking. Here, it's pretty simple. Even as we add more forms to the language that require parsing, it will still be much simpler for us than it would be in a language with more complicated syntax.
If there's interest, I may write a tutorial for a more complex parser in the future. Let me know in the comments if that's something you'd like to see!
The Desugarer
The desugarer is the first stage in the back end of the compiler. It takes higher, more complex forms in the language and converts them into simpler core forms. Sometimes you'll see this process called "lowering."
Like the expander above, the desugarer has nothing to do right now since we only have a single form in our language. That will change in the near future. For now, just put this in src/desugarer/desugar.js
:
export const desugar = (ast) => ast;
And with that, we're ready to build the code emitter!
The Emitter
The emitter for our language is going to be fairly simple as well. It simply walks the desugared AST and emits code based on the nodes.
It uses a design pattern called the Visitor Pattern to "visit" each node of the tree and perform an action based on the node type. In this case, the action is emitting the code.
The emitter uses postorder traversal to visit the nodes, meaning it acts on the children at the bottom of the hierarchy before it acts on their parent nodes.
The Dispatcher Method
The emit
method dispatches node-specific emitter methods depending on the node type. It uses a switch
statement on the node.type
property.
Since the only node types we have right now are Program
and NumberLiteral
, those are the only nodes we have emitters for.
The default
case in emit
throws a SyntaxException
, though this is only for completeness to make sure we remember to implement emitter methods for every core node type.
In src/emitter/Emitter.js
:
import { ASTTypes } from "../parser/ast.js";
import { SyntaxException } from "../shared/exceptions.js";
export class Emitter {
constructor(program) {
this.program = program;
}
static new(program) {
return new Emitter(program);
}
emit(node = this.program) {
switch (node.type) {
case ASTTypes.Program:
return this.emitProgram(node);
case ASTTypes.NumberLiteral:
return this.emitNumber(node);
default:
throw new SyntaxException(node.type, node.srcloc);
}
}
emitNumber(node) {
return node.value;
}
emitProgram(node) {
let code = "";
for (let n of node.body) {
code += this.emit(n);
}
return code;
}
}
Then in src/emitter/emit.js
:
import { Emitter } from "./Emitter.js";
export const emit = (ast) => Emitter.new(ast).emit();
With that, we can create the whole compilation pipeline from Wanda input to JavaScript output.
The CLI
We're not going to build out a whole big CLI application just yet. I'll dedicate a whole article to that in the future. It would be nice to have some streamlined functions to let us compile and run the code though, as well as a REPL so we can run test programs.
Let's create a simple compilation pipeline in src/cli/compile.js
:
import { tokenize } from "../lexer/tokenize.js";
import { read } from "../reader/read.js";
import { expand } from "../expander/expand.js";
import { parse } from "../parser/parse.js";
import { desugar } from "../desugarer/desugar.js";
import { emit } from "../emitter/emit.js";
export const compile = (input, file = "stdin") =>
emit(desugar(parse(expand(read(tokenize(input, file))))));
Then an evaluator function using the Node.js vm
module in src/cli/eval.js
:
import vm from "vm";
import { compile } from "./compile.js";
export const EVAL = (input) => vm.runInThisContext(compile(input));
Note that EVAL
is in all caps because eval
is a reserved word in JavaScript.
For our simple REPL we'll need a way to get input from the console. The simplest way to do that is just to install the readline-sync
package from NPM:
npm install readline-sync
And here's the simplest of REPLs in src/cli/repl.js
:
import readlineSync from "readline-sync";
import { EVAL } from "./eval.js";
import { print } from "../printer/print.js";
const read = (prompt) => readlineSync.question(prompt);
export const repl = () => {
let prompt = "wanda> ";
while (true) {
try {
let result = EVAL(read(prompt)) ?? "";
if (result === "") {
process.exit(0);
}
print(result);
} catch (e) {
console.error(e.message);
}
}
};
Finally, we run the CLI by simply calling the repl
function in src/cli/run.js
:
import { repl } from "./repl.js";
repl();
Then run the REPL with node src/cli/run.js
and enter numbers to your heart's content!
Note that any input that is not a valid number (integer or floating point, exponential notation not supported) will cause a SyntaxException
. We'll start fixing that in the next article in the series.
Conclusion
Whew! That was a lot. But you've come a long way β you now have a fully functional compiler and evaluator with REPL! Granted, you can only run the simplest of programs, numeric literals, but we're going to make things much more interesting as we progress through the series.
I hope you understand better now how programming languages and compilers work, and I'm looking forward to adding additional features to Wanda with you in the days ahead.
You can see the code from this article in the Wanda GitHub repo. I'll conveniently tag the code from each article so you can see how things evolve as we move along. The repo code also has JSDoc comments for most of the functions and types, though I can't promise I've remembered to document every single one. There are also tests. You can run the test suite with the npm test
command.
See you next time!
Featured ones: