Crafting Interpreters in Haskell - Scanning
I have started reading the excellent book Crafting Interpreters. As I read the book, I have decided to implement the Lox interpreter it describes in Haskell. My goal is to share the implementation and add some commentary for each chapter of the book.
Note that at times it might seem like code is duplicated across functions. Often, this is deliberate, so that each piece of code can be looked at in isolation while it is being described in this tutorial. While I have tests for most of this code, I have not spent time making it production ready. Therefore, use this for educational purposes only.
In this part of the series, we will be looking at the scanner/lexer implementation for the Lox language. The code built in this post closely reflects the requirements and techniques described in the Scanning chapter. The code in this series also uses Parsec, so familiarity with that library is assumed. As a refresher on how the library works, I would recommend walking through the examples at jakewheat.github.io. I have used many of the techniques described in that tutorial while building the scanner.
You can find the complete version of this module and related tests at Scanner.hs and test_scanner.hs.
The order of implementation follows the description in the chapter as closely as possible to keep the explanation clear. Since we are using Parsec and not directly walking through the text, our implementation becomes a lot easier. We do not have to maintain state (the position of the cursor, etc.) as described in the book's example. We also use Parsec's ability to look ahead before making decisions. I think it is reasonable to use Parsec and take advantage of this library.
Let’s first get some import statements out of the way.
--- Scanner.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE TemplateHaskell #-}
module Scanner where
import Import hiding (many, (<|>), try)
import Data.Text as T
import Data.Char
import Text.Parsec.String as PS
import Text.Parsec.Char as PC
import Text.Parsec
import RIO.Partial (read)
Layout
- Error Handling
- Defining Tokens
- Running the Scanner
- Single Character Tokens
- Double Character Tokens
- Quoted Strings
- Numeric Values
- Keywords and Identifiers
- Single Line Comments
- Conclusion
Error Handling
Since we are using Parsec, we will just use the error reporting functionality it provides. Therefore, we define the scanner's result type as
type LoxScannerResult = Either ParseError [LoxTokInfo]
We will shortly talk about what LoxTokInfo is. The ParseError type is provided by Parsec.
Defining Tokens
Next, we define the tokens of the Lox language that will be captured by the scanner. The book's enum types translate nicely to a sum type in Haskell.
data LoxTok =
  -- Single-character tokens.
  LEFT_PAREN | RIGHT_PAREN | LEFT_BRACE | RIGHT_BRACE |
  COMMA | DOT | MINUS | PLUS | SEMICOLON | SLASH | STAR |
  -- One or two character tokens.
  BANG | BANG_EQUAL |
  EQUAL | EQUAL_EQUAL |
  GREATER | GREATER_EQUAL |
  LESS | LESS_EQUAL |
  -- Literals/Identifiers/Comments/Numeric values
  IDENTIFIER String | STRING String | NUMBER Double | COMMENT Text |
  -- Keywords.
  AND | CLASS | ELSE | FALSE | FUN | FOR | IF | NIL | OR |
  PRINT | RETURN | SUPER | THIS | TRUE | VAR | WHILE |
  WHITESPACE | -- needed to parse end of tokens
  EOF
  deriving (Show, Eq)
You will notice that the book captures the context around identifiers, STRING literals, and NUMBER values inside the Token object. We can instead capture that information as part of the IDENTIFIER, STRING, and NUMBER data constructors. Notice that we also capture single line COMMENT tokens (whereas the example in the book discards them).
Let’s now define our own record, which is equivalent to the Token object:
data LoxTokInfo = LoxTokInfo
  { tokinfo_type :: LoxTok,
    tokinfo_lexeme :: Maybe T.Text, -- currently not used
    tokinfo_literal :: Maybe LoxObject, -- currently not used
    tok_position :: SourcePos -- provided by Parsec
  }
  deriving (Show, Eq)
Note that currently tokinfo_lexeme and tokinfo_literal are not used. I have left them here to see whether I will need them in later chapters. For now, the context is captured in the data constructors. For tok_position, we are able to just capture the SourcePos type provided by Parsec.
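The record above refers to a LoxObject type whose definition is not shown in this post. If you are following along, a minimal placeholder (an assumption on my part; the real definition lives in the linked Scanner.hs and may carry more structure) is enough to make the module compile:

-- Hypothetical placeholder so the module compiles; the actual
-- LoxObject in Scanner.hs may differ.
data LoxObject = LoxObject
  deriving (Show, Eq)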
Running the Scanner
Before we start working on lexing the tokens, let's create a helper function that will help us test our scanner in ghci as we build up the token parser.
-- We will fill this function in as we progress; for now it is a typed hole
scanToken :: Parser LoxTokInfo
scanToken = _

-- public function that will parse the entire input until `eof`
scanner :: String -> LoxScannerResult
scanner = parse (many scanToken <* eof) ""
Recognizing Single Character Tokens
As described by the LoxTok type, we have a number of single character tokens to recognize. Since we want to reuse the scanner machinery, we first create a mapping and then use it to recognize each token.
charMapping :: [(LoxTok, Char)]
charMapping =
  [ (LEFT_PAREN, '('),
    (RIGHT_PAREN, ')'),
    (LEFT_BRACE, '{'),
    (RIGHT_BRACE, '}'),
    (COMMA, ','),
    (DOT, '.'),
    (MINUS, '-'),
    (PLUS, '+'),
    (SEMICOLON, ';'),
    (SLASH, '/'),
    (STAR, '*'),
    (BANG, '!'),
    (EQUAL, '='),
    (GREATER, '>'),
    (LESS, '<')
  ]
scanSingleCharToken :: Parser LoxTokInfo
scanSingleCharToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> charMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, Char) -> Parser LoxTok
    build (x, y) = x <$ char y <* whitespace
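These scan* functions consume trailing whitespace through a whitespace helper whose definition the post does not show. A minimal sketch, assuming we simply skip spaces, newlines, and tabs (in the style of the jakewheat tutorial; the real definition in Scanner.hs may differ):

-- Assumed helper: skip zero or more whitespace characters.
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"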
If we succeed in scanning a single character token, scanSingleCharToken returns a LoxTokInfo with a value for the token and also the source_pos. We will follow a similar pattern in the other scan* functions as well. Let's update the scanToken function with this tokenizer.
scanToken :: Parser LoxTokInfo
scanToken = scanSingleCharToken
Now, let's fire up ghci with stack ghci and check that our function works:
λ> import Scanner
λ> scanner "{"
Right [LoxTokInfo {tokinfo_type = LEFT_BRACE, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Recognizing Double Character Tokens
As described in the Scanning chapter, we also need to handle two-character tokens such as == and !=. The scanner for such tokens is:
doubleCharMapping :: [(LoxTok, String)]
doubleCharMapping =
  [ (BANG_EQUAL, "!="),
    (EQUAL_EQUAL, "=="),
    (GREATER_EQUAL, ">="),
    (LESS_EQUAL, "<=")
  ]
scanDoubleToken :: Parser LoxTokInfo
scanDoubleToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> doubleCharMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, String) -> Parser LoxTok
    build (x, y) = x <$ string y <* whitespace
This scanner differs from scanSingleCharToken only in the mapping it uses. Nevertheless, we duplicate the code and do not worry about abstraction. Let's try this in ghci after updating our scanToken function. Note that this scanner has to precede the scanSingleCharToken scanner for us to detect ==. Otherwise, we would end up with two = tokens, and scanDoubleToken would never be invoked.
λ> scanner "="
Right [LoxTokInfo {tokinfo_type = EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
λ> scanner "=="
Right [LoxTokInfo {tokinfo_type = EQUAL_EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Note that = scans to EQUAL and == scans to EQUAL_EQUAL.
Recognizing Quoted Strings
Recognizing quoted strings gets a little tricky, since one needs to keep track of escape sequences and nested quotes. We add a couple of helper functions to do just that. Note that I am using the approach from the SO answer referenced in the code snippet below, so I will skip the explanation and direct you to that answer.
-- https://stackoverflow.com/questions/24106314/parser-for-quoted-string-using-parsec
escape :: Parser String
escape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
  return [d, c]

nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"

character :: Parser String
character = fmap return nonEscape <|> escape

scanQuotedString :: Parser LoxTokInfo
scanQuotedString = do
  source_pos <- getPosition
  qstring <- char '"' *> many character <* char '"' <* whitespace
  return $ LoxTokInfo (STRING $ Import.concat qstring) Nothing Nothing source_pos
Let's update the scanToken function and try this in ghci.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> scanSingleCharToken
    <|> try scanQuotedString
Example of quoted string being parsed:
λ> scanner "\"test this with \\n newline\""
Right [LoxTokInfo {tokinfo_type = STRING "test this with \\n newline", tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
Recognizing Numeric Values (we will use Doubles entirely)
Our next task is to scan strings representing numeric values as Double. The scanning functions are:
whitespaceToken :: Parser LoxTokInfo
whitespaceToken = do
  source_pos <- getPosition
  _ <- many1 $ char ' '
  return $ LoxTokInfo WHITESPACE Nothing Nothing source_pos

scanDouble :: Parser LoxTokInfo
scanDouble = do
  source_pos <- getPosition
  let la = lookAhead (whitespaceToken <|> scanSingleCharToken)
  sel <- do
    firstPart <- Text.Parsec.many1 digit
    try (secondCharacter firstPart <* la <* whitespace)
      <|> NUMBER (read firstPart) <$ la <* whitespace
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    secondCharacter :: String -> Parser LoxTok
    secondCharacter firstPart = do
      void $ char '.'
      secondPart <- Text.Parsec.many1 digit <* whitespace
      return $ NUMBER $ read $ Import.concat [firstPart, ".", secondPart]
Again, as described in the book, we need to scan two parts for any numeric value: the string preceding the . and the string following it. The . itself is optional. In our case, we first scan the string preceding the ., and then optionally call secondCharacter to scan the remaining string if one exists. After updating scanToken we have
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
λ> fmap tokenShow $ fromRight [] $ scanner "1.1;"
["LoxTok=NUMBER 1.1","LoxTok=SEMICOLON"]
it :: [String]
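The session above uses a tokenShow helper that this post does not define. A plausible definition, matching the output format above (an assumption on my part; the real one is in the linked Scanner.hs):

-- Assumed helper: render only the token constructor, which is more
-- compact than showing the whole LoxTokInfo record.
tokenShow :: LoxTokInfo -> String
tokenShow t = "LoxTok=" ++ show (tokinfo_type t)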
The lookAhead is needed to handle cases where the numeric value is followed by non-terminating characters such as letters, for example 1.1abc. In this case we want the lexer to fail:
λ> scanner "1.1a"
Left (line 1, column 4):
unexpected "a"
expecting digit, "(", ")", "{", "}", ",", ".", "-", "+", ";", "/", "*", "!", "=", ">" or "<"
it :: LoxScannerResult
Recognizing Keywords and Identifiers
Our code for recognizing keywords will be similar to the code for single and double character tokens. As before, we use a mapping to generate all the cases.
keywordMapping :: [(LoxTok, String)]
keywordMapping =
  [ (AND, "and"),
    (CLASS, "class"),
    (ELSE, "else"),
    (FALSE, "false"),
    (FUN, "fun"),
    (FOR, "for"),
    (IF, "if"),
    (NIL, "nil"),
    (OR, "or"),
    (PRINT, "print"),
    (RETURN, "return"),
    (SUPER, "super"),
    (THIS, "this"),
    (TRUE, "true"),
    (VAR, "var"),
    (WHILE, "while")
  ]
scanKeywordToken :: Parser LoxTokInfo
scanKeywordToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> keywordMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, String) -> Parser LoxTok
    build (x, y) = x <$ string y <* whitespace
And accordingly, let's update the scanToken function.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> try scanKeywordToken
Testing this in ghci shows:
λ> scanner "class"
Right [LoxTokInfo {tokinfo_type = CLASS, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Now, to recognize identifiers, we need to make sure that a keyword is not simply a prefix of a longer identifier (for example, or in orchid). Identifiers can only start with a letter or _, followed by letters, digits, or _. We can implement that parser as follows:
-- http://jakewheat.github.io/intro_to_parsing/#_var
var :: Parser String
var = do
  fc <- firstChar
  rest <- many nonFirstChar
  return (fc : rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
After parsing an identifier, we also need to check that it is not a keyword. Therefore, we wrap the var parser in a checkIfIdentifier function.
checkIfIdentifier :: Parser LoxTokInfo
checkIfIdentifier = do
  source_pos <- getPosition
  s <- var
  result ([(x, y) | (x, y) <- keywordMapping, y == s]) s source_pos
  where
    result xs s source_pos =
      case xs of
        [] -> return $ LoxTokInfo (IDENTIFIER s) Nothing Nothing source_pos
        (x, _):_ -> return $ LoxTokInfo x Nothing Nothing source_pos
Notice that if the scanned string matches one of the keywords, we return the keyword token; otherwise we return the string wrapped inside IDENTIFIER. Once again, let's update the scanToken function. Since checkIfIdentifier now handles keywords as well, scanKeywordToken is no longer needed in the chain.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> checkIfIdentifier
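With this ordering, a keyword is recognized only when the scanned string matches it exactly, so identifiers that merely start with a keyword are handled correctly. A quick hypothetical check (using fromRight from Data.Either to unwrap the result):

λ> tokinfo_type <$> fromRight [] (scanner "or")
[OR]
λ> tokinfo_type <$> fromRight [] (scanner "orchid")
[IDENTIFIER "orchid"]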
Recognizing Single Line Comments
One other addition to the scanner described in the chapter is support for single line comments. We diverge from the chapter and capture the comment text inside a COMMENT data constructor.
scanComment :: Parser LoxTokInfo
scanComment = do
  source_pos <- getPosition
  _ <- string "//"
  -- TODO: Find a better way to do this, scanning this more than once is less desirable
  comment <- try (manyTill anyToken (try (oneOf "\n"))) <|> manyTill anyToken eof
  return $ LoxTokInfo (COMMENT (T.pack comment)) Nothing Nothing source_pos
The scanComment function could potentially be improved to avoid scanning the input twice when the comment is on the last line of the file. For now, I have left it as is.
The final version of scanToken is as follows. Note that we need to place the scanComment parser before scanSingleCharToken, since // shares a prefix with /.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanComment
    <|> try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> checkIfIdentifier
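As a quick sanity check of the comment handling, scanning a line that contains only a comment should yield a single COMMENT token. A hypothetical session:

λ> scanner "// a comment"
Right [LoxTokInfo {tokinfo_type = COMMENT " a comment", tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]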
Conclusion
With the above changes, we have a scanner in Haskell that can lex a Lox program. I hope you found this useful. See you in the follow-up post.