dev-resources.site
for different kinds of informations.
Lexing the Source (Pogo Pt: 4)
Intro
In this series I am creating a transpiler from Python to Golang called Pogo. In the last post we defined which tokens will be used in our compiler. Using these tokens we can move onto the next step, the lexer.
What is a Lexer?
A lexer is the part of the compiler that turns the source code, which is usually stored as some sort of array of characters, into an array of tokens. These tokens have much more semantic meaning than a lone character, and a token can be a variable name, bracket, number, etc.
Creating a Lexer struct
The lexer only needs three values to keep check of, the source, its current character, and its current position in the source.
type Lexer struct {
curPos int
curChar byte
source []byte
}
We will be processing the source as a slice of bytes, and characters will be individual bytes.
We also need a way to easily move onto the next token to check.
func (l *Lexer) nextChar() {
l.curPos++
if l.curPos == len(l.source) {
l.curChar = 0 // Nil
} else {
l.curChar = l.source[l.curPos]
}
}
Here we move onto the next character by incrementing the current position. We then get the character from that position, but only if we are looking at an index that is within the length of the source. If it isn't, we return 0. 0 as a byte corresponds to the null character in ASCII, which is usually written as '\0' as a char (although this isn't allowed in Go, which does make the bytes simpler).
Lastly, it's useful to be able to see the next character without incrementing curPos
. We do this using a function called peek
.
func (l *Lexer) peek() byte {
if l.curPos >= len(l.source)-1 {
return 0
}
return l.source[l.curPos+1]
}
This function similarly checks we are within the slice bounds, but otherwise just returns the next character.
Lexing the Source
Now that we have some helpful things to guide us, we can start implementing the logic to separate characters into tokens.
Now, for those clean coders, you may need to look away for a second, compilers can end up messy due to the repetitive yet unique nature of them. All of the rest of the logic will be held in one function.
Before that, we should do some error checking of our source.
if len(input) == 0 {
log.Fatal("[Lex (lex)] Missing input")
}
If the file we are going to lex doesn't have any data, we just leave, no point doing all this computation on literally nothing. In compilers it is also useful to have helpful error messages. As we are just now in development instead of actual use, we will have error messages that are more focused on our ease. In this way, we don't care as much what line in the Python caused the error, but where in the compiler's source code, such as which file and function, the error occurred.
Now, we can finally start lexing to a slice of tokens.
If, else, if, else, if, else
As we loop over the entire input we will be using a heap of if statements such as this one.
if l.curChar == '+' {
token = Token{tokenCode["MO_PLUS"], "+"}
}
This sort of behavior repeats for -
, *
, /
, %
, and so on. The Token
object that we receive is then just appended to the end of the slice, and so on until the entire source has been processed.
Obviously, some things can't be processed like this. The first edge case we have is the newline. In my case, newlines are both \n and \r, both of them in succession. Since I know it is always like this, I can check for both, and when I find one, I skip 2 characters.
if l.curChar == '\r' && l.peek() == '\n' {
token = Token{tokenCode["NEWLINE"], "\n"}
l.nextChar()
}
The next edge case is a comment
else if l.curChar == '#' {
start := l.curPos
for l.peek() != '\r' && l.peek() != '\n' {
l.nextChar()
}
note := string(l.source[start : l.curPos+1])
token = Token{tokenCode["COMMENT_ONE"], note}
}
First, we make a marker for where the comment starts, when we end the marker you can see in the last 2 lines we take a range of the source from the start marker and the current position (plus one to include the position we are currently looking at). And the last line turns that into a token. To find the end of that range, we simply look for when the next newline is. The reason we use peek
instead of incrementing curPos
one more time is so that when the next loop of lexing starts, we begin on the first value of the newline, which will be a pair of '\n' or '\r'.
Now in Python indentation is important, so we need to save that information as a token. But spaces aren't usually important so we don't want to save those.
else if l.curChar == ' ' {
if string(l.source[l.curPos:l.curPos+4]) == " " {
token = Token{tokenCode["INDENT"], " "}
l.nextChar()
l.nextChar()
l.nextChar()
}
}
When we come across a space, we simply check for 4 spaces in a row. If we do find it, then it is important and save the token, then skip 3 more characters.
Error Tangent
Now part of this code that might seem like an error is l.source[l.curPos:l.curPos+4]
. This may seem like an error since when l.curPos
is right up near the length of l.source
, then l.curPos+4
will end up outside of the range of l.source
, which should throw an error. However, in Go this does not cause an error. I don't know exactly why, Go could just be creating an extra check to make sure slices aren't index incorrectly. My other proposition is that is due to how slices are built fundamentally. Since slices are in some way a viewport into a static array, if the array is larger than the slice, you can still index outside of the slice but into the array, which may cause this.
More Tokens
The next type of edge case is the word. Words end up a tokens such as identifiers, keywords, function calls, and much more.
start := l.curPos
for unicode.IsLetter(rune(l.peek())) {
l.nextChar()
}
word := string(l.source[start : l.curPos+1])
This is very similar to the comment code, where we keep note of where we started the word. Once we have done this and found the word, we can now use more if statements to figure out if it's a keyword, in-built function, True
/False
, and all the others. Once we get to the end of this section, if we have not allocated which token the word is, then it must be an identifier, and save it as such.
Numbers are also pretty similar.
start := l.curPos
for unicode.IsDigit(rune(l.peek())) l.peek() == '_' l.peek() == '.' {
l.nextChar()
}
num := string(l.source[start : l.curPos+1])
if num[len(num)-1] == '_' || num[len(num)-1] == '.' {
log.Fatal("[Lex (lex)] Numbers must end with a digit")
return tokens
}
The first section should be seeming familiar but with one difference. The first difference is we are looking for digits (obviously). Next, we also want to allow underscores. And lastly, we also need to allow full stops (periods). Now there is one case I don't want to allow, which is when we don't end the number with a digit, such as 7.
or 15_
. If this occurs we throw an error. Now there are other errors that could occur, such as multiple full stops, but for now, with the personally written Python that will be imported, we will leave this as a future issue.
Lastly, we can safely say if haven't figured out what type of token this character belongs to yet, it's illegal (or not implemented).
// Not Implemented
if token == (Token{}) {
token = Token{tokenCode["ILLEGAL"], ""}
}
And after moving onto the next character, that's it! We've successfully lexed the entire source code. Now this may not have seemed very important, but if you remember from one of the previous posts this was part 1 of 5 in the compilation process, so it's not going to seem like we've made that much progress. But we are much closer than you would think, because only 3 out of 5 of those processes actually push the Python code closer to the Go code, those other processes being parsing and emitting.
Next
We need to start getting ready for the 2nd process, parsing, however, the tokens we have created will not be powerful enough for this job, so we will need to create a more broad data structure, a Structure
. Structure
s will be very similar to tokens, but will have some extra features that will be discussed in the next post.
Featured ones: