dev-resources.site
for different kinds of informations.
Create Your Own Programming Language 2: Primitives
In the last post we added compiling and evaluating numbers to the Wanda Programming Language. Now we're going to add the other primitive types: strings, booleans, keywords, and nil.
If you haven't seen the previous post, where we scaffold the compiler and add numbers to the language, you can find it here.
As a reminder, strings are delimited with double quotes, and keywords are valid identifiers that start with a colon:
"hello"
:this-is-a-keyword
true
nil
We're also going to make it so you can pretty-print the AST in the REPL instead of evaluating the expression. This can help with debugging later. If you're not getting the results you expect from your Wanda program, you can print the AST and make sure that what the compiler's getting is the same thing you expect it to get.
High Level Overview of The Changes
Most of the work in this article is going to be done in the lexer. Tokenizing strings turns out to be a lot of work, for one thing.
We'll also make some changes to the reader, the parser, the emitter, and the printer as well as adding a function to print the AST. We'll also make some changes to the CLI so you can start the REPL in AST printing mode if you want.
We'll also add a dedicated file for running the CLI so you can add a wanda
command to your terminal if you wish.
It may not seem like it takes as much work to get other primitives up and running like it did for numbers, but that's because there was a lot of scaffolding we had to do for the compiler as a whole just to make numbers work. Some of the articles will be like that: they'll feel like we didn't do a whole lot, but now we have at least one new language feature to show for it. Others will take a lot of work and not feel like much has changed at the end of the day. That's just how it goes sometimes!
The Lexer
The first thing we'll add to the lexer is new token types for the new primitives, in src/lexer/TokenTypes.js
:
export const TokenTypes = {
Number: "Number", // still there!
String: "String",
Boolean: "Boolean",
Keyword: "Keyword",
Nil: "Nil",
};
We're also going to add new character matchers for double-quotes, colons, valid identifier starting characters, and valid identifier characters. Plus string matchers for booleans and nil. In src/lexer/utils.js
:
// after isPlus...
export const isDoubleQuote = (ch) => /"/.test(ch);
export const isColon = (ch) => /:/.test(ch);
export const isSymbolStart = (ch) => /[=<>%:|?\\/*\p{L}_$!+-]/u.test(ch);
export const isSymbolChar = (ch) =>
/[:=@~<>%:&|?\\/^*&#'\p{L}\p{N}_$!+-]/u.test(ch);
// after isNumber...
export const isBoolean = (str) => /true|false/.test(str);
export const isNil = (str) => /nil/.test(str);
Now in the Lexer
class we'll add new if/else clauses to the main lexer loop for the new primitives in src/lexer/Lexer.js
:
// after the case for else if (isDigit(ch)) {
// ...
} else if (isDoubleQuote(ch)) {
tokens.push(this.readString());
} else if (isColon(ch)) {
tokens.push(this.readKeyword());
} else if (isSymbolStart(ch)) {
tokens.push(this.readSymbol());
} else {
// continue with error for unrecognized syntax
This means we'll need methods for readString
, readKeyword
, and readSymbol
. If you want to have the same as what I've got, I'm putting the methods in alphabetical order.
First, readString
since it's going to end up being the most complicated one:
readString() {
let { pos, line, col, file } = this.input;
const srcloc = SrcLoc.new(pos, line, col, file);
let str = this.input.next(); // collect opening double-quote
str += this.readEscaped();
return Token.new(TokenTypes.String, str, srcloc);
}
No, it doesn't look too complicated... yet... but now we need to create a method called readEscaped
.
readEscaped
will take the string in, one character at a time. If it encounters a backslash, it recognizes it as the start of an escape sequence so it looks ahead and uses the readEscapeSequence
method to read in the escaped character. If it hits an unexpected \n
in a string literal it throws an error, since Wanda strings can only be single line. Also, if it encounters a backtick as part of the string it escapes it for the emitter, since we're going to be using backtick strings in the compiled output. This eliminates some issues we might otherwise have had with newline characters in double-quoted string literals in the output JavaScript.
Note that reading the escape characters in the lexer is a design decision. We could just as well have done them in the reader or parser. Since handling them now is the simplest way to do it, we'll do that.
Here's the readEscaped
method:
readEscaped() {
let str = "";
let escaped = false;
let ended = false;
while (!this.input.eof()) {
let ch = this.input.next();
if (escaped) {
str += this.readEscapeSequence(ch);
escaped = false;
} else if (ch === "\\") {
escaped = true;
} else if (isDoubleQuote(ch)) {
ended = true;
str += ch;
break;
} else if (ch === "\n") {
throw new Exception(
"Unexpected newline in nonterminated single-line string literal"
);
} else if (ch === "`") {
str += "\\`";
} else {
str += ch;
}
}
if (!ended && this.input.eof()) {
throw new Exception(
"Expected double quote to close string literal; got EOF"
);
}
return str;
}
and readEscapeSequence
:
readEscapeSequence(c) {
let str = "";
let seq = "";
if (c === "n") {
str += "\n";
} else if (c === "b") {
str += "\b";
} else if (c === "f") {
str += "\f";
} else if (c === "r") {
str += "\r";
} else if (c === "t") {
str += "\t";
} else if (c === "v") {
str += "\v";
} else if (c === "0") {
str += "\0";
} else if (c === "'") {
str += "'";
} else if (c === '"') {
str += '"';
} else if (c === "\\") {
str += "\\";
} else if (c === "u" || c === "U") {
// is Unicode escape sequence
seq += this.input.readWhile(isHexDigit);
str += String.fromCodePoint(parseInt(seq, 16));
}
return str;
}
Note that this will read Unicode escape characters in the format \uHHHH
with any number of hex numbers necessary to complete a code point.
That's lexing strings. Keywords are considerably simpler:
// between readEscapeSequence and readNumber
readKeyword() {
let { pos, line, col, file } = this.input;
const srcloc = SrcLoc.new(pos, line, col, file);
const kw = this.input.next() + this.input.readWhile(isSymbolChar);
return Token.new(TokenTypes.Keyword, kw, srcloc);
}
We don't need to distinguish here between valid identifier starting characters and valid identifier body characters, since every valid identifier starting character is also a valid identifier body character. We'll use the same trick in the readSymbol
method:
// below readString
readSymbol() {
let { pos, line, col, file } = this.input;
const srcloc = SrcLoc.new(pos, line, col, file);
const sym = this.input.readWhile(isSymbolChar);
if (isBoolean(sym)) {
return Token.new(TokenTypes.Boolean, sym, srcloc);
} else if (isNil(sym)) {
return Token.new(TokenTypes.Nil, sym, srcloc);
}
// Throw for now, since we haven't implemented symbols yet
throw new SyntaxException(sym, srcloc);
}
Note that we're just using the readSymbol
method to read primitive literals for booleans and nil for now. Soon we'll use it to read identifiers as well, at which point we'll remove the exception at the end of the method.
The Reader
The changes to the reader are pretty simple: we just add cases to the readAtom
function to handle the new token types. Just like with numbers, the reader handles the other primitive tokens themselves as the forms for those values.
In src/reader/read.js
, here's the new readAtom
function:
const readAtom = (reader) => {
const tok = reader.peek();
switch (tok.type) {
case TokenTypes.Number:
reader.skip();
return tok;
case TokenTypes.String:
reader.skip();
return tok;
case TokenTypes.Boolean:
reader.skip();
return tok;
case TokenTypes.Keyword:
reader.skip();
return tok;
case TokenTypes.Nil:
reader.skip();
return tok;
default:
throw new SyntaxException(tok.value, tok.srcloc);
}
};
The AST and Parser
First, we add new node types to the ASTTypes
enum in src/parser/ast.js
:
export const ASTTypes = {
Program: "Program",
NumberLiteral: "NumberLiteral",
StringLiteral: "StringLiteral",
BooleanLiteral: "BooleanLiteral",
KeywordLiteral: "KeywordLiteral",
NilLiteral: "NilLiteral",
};
Next, we add new constructor methods to the AST
object for the new primitives:
export const AST = {
// Program and NumberLiteral constructors
StringLiteral(token) {
return {
type: ASTTypes.StringLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
BooleanLiteral(token) {
return {
type: ASTTypes.BooleanLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
KeywordLiteral(token) {
return {
type: ASTTypes.KeywordLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
NilLiteral(token) {
return {
type: ASTTypes.NilLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
}
Now in src/parser/parse.js
we make changes to parsePrimitive
that are very similar to what we did in the reader:
const parsePrimitive = (reader) => {
const token = reader.peek();
switch (token.type) {
case TokenTypes.Number:
reader.skip();
return AST.NumberLiteral(token);
case TokenTypes.String:
reader.skip();
return AST.StringLiteral(token);
case TokenTypes.Boolean:
reader.skip();
return AST.BooleanLiteral(token);
case TokenTypes.Keyword:
reader.skip();
return AST.KeywordLiteral(token);
case TokenTypes.Nil:
reader.skip();
return AST.NilLiteral(token);
default:
throw new SyntaxException(token.value, token.srcloc);
}
};
The Emitter
Now we need to emit code for the new primitive forms. In src/emitter/Emitter.js
we'll need to add emitString
, emitBoolean
, emitKeyword
, and emitNil
methods to the Emitter
class. We'll also add cases for each primitive node to the switch
in the emit
method.
Here's the new emit
method:
emit(node = this.program) {
switch (node.type) {
case ASTTypes.Program:
return this.emitProgram(node);
case ASTTypes.NumberLiteral:
return this.emitNumber(node);
case ASTTypes.StringLiteral:
return this.emitString(node);
case ASTTypes.BooleanLiteral:
return this.emitBoolean(node);
case ASTTypes.KeywordLiteral:
return this.emitKeyword(node);
case ASTTypes.NilLiteral:
return this.emitNil(node);
default:
throw new SyntaxException(node.type, node.srcloc);
}
}
The emitBoolean
method is about as simple as it can get:
emitBoolean(node) {
return node.value;
}
Since keywords are JavaScript symbols under the hood, we'll need to use the Symbol.for
constructor in emitting keywords (Note that we're emitting a string containing the "Symbol.for" call, not the constructor call itself):
emitKeyword(node) {
return `Symbol.for("${node.value}")`;
}
The emitNil
method just returns the string "null":
emitNil(node) {
return "null";
}
Finally, emitString
slices off the opening and closing double quotes and wraps the value in backticks to keep certain escape sequences from causing weird problems when evaluating the result:
emitString(node) {
return "`" + node.value.slice(1, -1) + "`";
}
The Printer
Now that we have some new values besides numbers that will be output by the EVAL
function in the CLI, we'll need to handle them in the printer. Here's the new printString
function in src/printer/printString.js
:
export const printString = (value, withQuotes) => {
switch (typeof value) {
case "number":
return String(value);
case "string":
return withQuotes ? `"${value}"` : value;
case "symbol":
return value.description;
case "boolean":
return String(value);
case "undefined":
return "nil";
case "object":
if (value === null) {
return "nil";
}
default:
throw new Exception(`Invalid print value ${value}`);
}
};
Note that since both null
and undefined
are represented as nil
in Wanda we have to handle both cases in the printer. Also, currently null
is the only "object" type value that will be produced by EVAL
for now.
We're also going to implement an AST printer so we can use it for debugging purposes if needed. Create a new file printAST.js
in src/printer
and import the ASTTypes
enum:
import { ASTTypes } from "../parser/ast.js";
We're going to do the AST printer as a class that implements the Visitor pattern then export a functional interface to it, just like the code emitter. First, the constructors:
class ASTPrinter {
constructor(program) {
this.program = program;
}
static new(program) {
return new ASTPrinter(program);
}
}
We'll need a print
method to dispatch on the AST node type, and individual methods to handle those nodes. Since all our nodes except Program
are primitives, we can actually handle all of those with a single method, printPrimitive
:
printPrimitive(node, indent) {
return `${" ".repeat(indent)}${node.type}: ${
node.type === "NilLiteral" ? "nil" : node.value
}`;
}
Note the indent
property, which we'll eventually use to show how nodes are nested inside other nodes once we have compound forms that aren't just primitives.
We also need a method to handle the Program
node:
printProgram(node, indent) {
let pStr = "";
for (let n of node.body) {
pStr += this.print(n, indent);
+"\n";
}
return pStr;
}
Finally, the print
method to dispatch based on the node type:
print(node = this.program, indent = 0) {
switch (node.type) {
case ASTTypes.Program:
return this.printProgram(node, indent);
case ASTTypes.NumberLiteral:
case ASTTypes.StringLiteral:
case ASTTypes.BooleanLiteral:
case ASTTypes.KeywordLiteral:
case ASTTypes.NilLiteral:
return this.printPrimitive(node, indent);
}
}
Finally, at the bottom of printAST.js
, let's export a function to use the ASTPrinter
:
export const printAST = (ast) => ASTPrinter.new(ast).print();
Improving the CLI
We're going to add 2 AST pretty print functions to the CLI. We're also going to add the ability to invoke the CLI with a print
command and option so the REPL will print the AST given to it as a program. This will allow us to debug things later when our programs get more complicated.
First, here's src/cli/pprint.js
:
import { tokenize } from "../lexer/tokenize.js";
import { read } from "../reader/read.js";
import { expand } from "../expander/expand.js";
import { parse } from "../parser/parse.js";
import { desugar } from "../desugarer/desugar.js";
import { printAST } from "../printer/printAST.js";
export const pprintDesugaredAST = (input, file = "stdin") =>
printAST(desugar(parse(expand(read(tokenize(input, file))))));
export const pprintAST = (input, file = "stdin") =>
printAST(parse(expand(read(tokenize(input, file)))));
We're going to add a helper function that will let us throw an Exception
from an expression position in the compiler's code, in src/shared/fail.js
:
import { Exception } from "./exceptions.js";
export const fail = (msg, exn = Exception) => {
throw new exn(msg);
};
Now we'll make some changes in src/cli/repl.js
:
import readlineSync from "readline-sync";
import { EVAL } from "./eval.js";
import { pprintAST, pprintDesugaredAST } from "./pprint.js";
import { print } from "../printer/print.js";
import { fail } from "../shared/fail.js";
const read = (prompt) => readlineSync.question(prompt);
export const repl = (mode) => {
const proc =
mode === "repl"
? EVAL
: mode === "printDesugared"
? pprintDesugaredAST
: mode === "printAST"
? pprintAST
: fail("Invalid REPL mode specified");
let prompt = "wanda> ";
while (true) {
try {
const input = read(prompt);
if (input === ":quit") {
process.exit(0);
}
let result = proc(input);
print(result, mode === "repl");
} catch (e) {
console.error(e.message);
}
}
};
Note that now if the user enters the keyword :quit
at the prompt it will exit the REPL.
Now in src/cli/run.js
we're going to create an actual run
function that will set the mode for the REPL instead of just importing and calling the repl
function:
import { Exception } from "../shared/exceptions.js";
import { repl } from "./repl.js";
export const run = () => {
let mode = "";
switch (process.argv[2]) {
case "print":
if (process.argv[3] === "-d") {
mode = "printDesugared";
break;
} else if (process.argv[3] === "-a") {
mode = "printAST";
break;
}
case undefined:
case "-i":
case "repl":
mode = "repl";
break;
default:
throw new Exception("Invalid command specified");
}
repl(mode);
};
Now we have a print
command we can use when we execute the program, which is further used with either -a
or -d
options to print either the base AST or its desugared version. Right now there's no difference between the two, but that will change!
Creating The wanda
Command
Now we need a way to call the run
function. If you follow these instructions you'll be able to start the Wanda REPL just using the command wanda
.
First, in the project root, open the package.json
file. Somewhere in the toplevel properties (I've got it just below the "name"
property), add:
{
// ...
"bin": "bin/wanda.js",
// ...
}
Now in the project root create the bin
directory (not src/bin
) and add the file wanda.js
to it. Here are its contents:
#!/usr/bin/env node
import { run } from "../src/cli/run.js";
run();
If the first line looks strange to you, it's called a shebang. It's a convention from UNIX shell scripts. It lets the system know to use Node.js to process the file, just like if you were writing a Bash script it would let the system know to use Bash to process the file.
After saving bin/wanda.js
, make sure you're in the project root in your terminal and enter the command npm link
. This will tell NPM to link your wanda
command as a toplevel system command. Now whatever code is executed in bin/wanda.js
will run when you enter the command wanda
.
You can try it yourself: enter wanda
into your terminal and hit ENTER.
Now you can enter values of any of the primitive types, including numbers, strings, booleans, keywords, and nil.
Want to see the AST printer in action? Run the command wanda print -a
and see it work!
Conclusion
That wraps it up for today! We've now implemented all the primitive types in Wanda. Next time we'll add call expressions and change the data structure output by the reader to be more like that of a conventional Lisp language: using cons cells (linked lists) to represent the forms. This won't seem like a revolutionary change at first, but as it turns out this will enable us to do some pretty neat things in the future.
To see an exact snapshot of the state of the Wanda compiler and CLI as of the end of this article, check out the tag on GitHub. Note that the repo also includes JSDoc comments for the compiler as well as tests.
I'll see you next time!
Featured ones: