Crafting Interpreters in Haskell - Scanning
I have started reading this excellent book called Crafting Interpreters. As I read this book, I have decided to implement the Lox intepreter described in this book in Haskell. My goal is to share the implementation and add some commentary for each chapter described in this book.
Note that at times, it might seem like code is duplicated across functions. Often, I have taken this deliberate approach so that the code can be looked at in isolation while describing that piece of code in this tutorial. While I have tests for most of this code, I have not spent time to make it production ready. Therefore, only use this for educational purposes.
In this part of the series, we will be looking at the scanner/lexer implementation for the Lox language. The code will be build in this post will closely reflect the requirements/techniques described in Scanning chapter. The code in this series also uses Parsec. Therefore, familiarity with this library is assumed. As a refresher to how this library works, I would recommended walking through examples at jakewheat.github.io. I have used many of the techniques described in this tutorial as I built the scanner.
You can find the complete version of this module and related tests at Scanner.hs and test_scanner.hs.
The order of implementation closely follows the description in the chapter as much as possible to keep the explanation of the implementation clear. Since we are using Parsec and not directly walking through text our implemenation becomes lot easier. We do not have to maintain state (position of cursor) etc as described in the example. We also use Parsec functionality of looking ahead and making decisions. I think this is reasonable to use Parsec and take advantage of this libary.
Let’s first get some import statements out of the way.
--- Scanner.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE TemplateHaskell #-}
module Scanner where
import Import hiding (many, (<|>), try)
import Data.Text as T
import Data.Char
import Text.Parsec.String as PS
import Text.Parsec.Char as PC
import Text.Parsec
import RIO.Partial (read)Layout
- Error Handling
- Defining Tokens
- Running the Scanner
- Single Character Tokens
- Double Character Tokens
- Quoted Strings
- Numeric Values
- Keywords and Identifiers
- Single Line Comments
- Conclusion
Error Handling
Since we are using the Parser we will just use the error reporting functionality provided by Parsec. Therefore, we will define the parser to be of type
type LoxScannerResult = Either ParseError [LoxTokInfo]We will shortly, talk about what LoxTokInfo will be. The ParserError type is provided by Parsec.
Defining Tokens
Next, we define the tokens of the Lox language that will be captured by the scanner. The enum types nicely translate to a sum type in Haskell.
data LoxTok =
-- Single-character tokens.
LEFT_PAREN| RIGHT_PAREN| LEFT_BRACE| RIGHT_BRACE|
COMMA| DOT| MINUS| PLUS| SEMICOLON| SLASH| STAR|
-- One or two character tokens.
BANG| BANG_EQUAL|
EQUAL| EQUAL_EQUAL|
GREATER| GREATER_EQUAL|
LESS| LESS_EQUAL|
-- Literals/Identifiers/Comments/Numeric values
IDENTIFIER String| STRING String| NUMBER Double| COMMENT Text|
-- Keywords.
AND| CLASS| ELSE| FALSE| FUN| FOR| IF| NIL| OR|
PRINT| RETURN| SUPER| THIS| TRUE| VAR| WHILE|
WHITESPACE -- needed to parse end of tokens
EOF
deriving (Show, Eq)You will notice that the book captures the context around Identifiers, STRING literals, NUMBER inside the Token object. But, we can capture that information as part of the IDENTIFIER, STRING, NUMBER data constructors. Notice, that we also capture single line COMMENT tokens (whereas the example in the book discards them).
Let’s now define our own record which is equivalent to the Token object:
data LoxTokInfo = LoxTokInfo {
tokinfo_type:: LoxTok,
tokinfo_lexeme:: Maybe T.Text, -- currently not used
tokinfo_literal:: Maybe LoxObject, -- currently not used
tok_position:: SourcePos -- Provided by Parsec
}
deriving (Show, Eq)Note, that currently, tokinfo_lexeme and tokinfo_literal are not used. I have left it here to see if I will be needing them in later chapters. For now, the context is captured in the data constructors. For tok_position we are able to just capture the SourcePos type provided by Parsec.
Running the scanner
Before we start working on lexing the tokens, let create a helper function that will help us test our scanner in ghci as we build the token parser.
-- We will fill this function as we progress
scanToken :: Parser LoxTokInfo
scanToken = _
-- public function that will parse the entire input until `eof`
scanner :: String -> LoxScannerResult
scanner = parse (many scanToken <* eof) ""Recognize Single Character Tokens
As described by the type LoxTok we have a number of single character tokens to recognize. Since, we want to reuse the scanner implementation, we first create a mapping and then use the mapping to recognize the token.
charMapping :: [(LoxTok, Char)]
charMapping =
[ (LEFT_PAREN, '('),
(RIGHT_PAREN, ')'),
(LEFT_BRACE, '{'),
(RIGHT_BRACE, '}'),
(COMMA, ','),
(DOT, '.'),
(MINUS, '-'),
(PLUS, '+'),
(SEMICOLON, ';'),
(SLASH, '/'),
(STAR, '*'),
(BANG, '!'),
(EQUAL, '='),
(GREATER, '>'),
(LESS, '<')
]
scanSingleCharToken :: Parser LoxTokInfo
scanSingleCharToken = do
source_pos <- getPosition
sel <- choice $ build <$> charMapping
return $ LoxTokInfo sel Nothing Nothing source_pos
where
build :: (LoxTok, Char) -> Parser LoxTok
build (x, y) = x <$ char y <* whitespaceIf we succeed scanning a single character token, the scanSingleCharToken returns LoxTokInfo with a value for the tokenand also the source_pos. We will follow similar pattern in other scan* functions as well. Lets update the scanToken function with this tokenizer.
scanToken :: Parser LoxTokInfo
scanToken =
scanSingleCharTokenNow, lets fire up ghci, with stack ghci and check if our function works:
λ >>import Scanner
λ >>scanner "{"
Right [LoxTokInfo {tokinfo_type = LEFT_BRACE, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResultRecognizing Double Tokens
As described in the Scanning chapter, we also need to handle tokens such as ==, != etc. The scanner to scan such tokens would be:
doubleCharMapping :: [(LoxTok, String)]
doubleCharMapping =
[ (BANG_EQUAL, "!="),
(EQUAL_EQUAL, "=="),
(GREATER_EQUAL, ">="),
(LESS_EQUAL, "<=")
]
scanDoubleToken :: Parser LoxTokInfo
scanDoubleToken = do
source_pos <- getPosition
sel <- choice $ build <$> doubleCharMapping
return $ LoxTokInfo sel Nothing Nothing source_pos
where
build :: (LoxTok, String) -> Parser LoxTok
build (x, y) = x <$ string y <* whitespaceThis scanner just differs from the scanSingleCharToken scanner in the mapping it uses. Nevertheless, we duplicate this code and not worry about abstraction. Let’s try this in ghci after updating our scanToken function. Note, that this scanner has to preceed the scanSingleCharToken scanner for us to detect ==. Or else, we will end up with to = tokens, and the scanDoubleCharToken would never be invoked.
λ >>scanner "="
Right [LoxTokInfo {tokinfo_type = EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
λ >>scanner "=="
Right [LoxTokInfo {tokinfo_type = EQUAL_EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
λ >>Note, that = scans to EQUAL and == scans to EQUAL_EQUAL.
Recognizing Quoted Strings
Recognizing quoted strings getts a little tricky, since one needs to keep track of escape sequences and nested quoted strings. We had a couple of helper functions to just to the same. Note, that I am using this approach from the SO answer noted in the code snippet below. I will skip the explanation for this direct you to the SO answer in this case.
-- -- https://stackoverflow.com/questions/24106314/parser-for-quoted-string-using-parsec
escape :: Parser String
escape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
scanQuotedString :: Parser LoxTokInfo
scanQuotedString = do
source_pos <- getPosition
qstring <- char '"' *> many character <* char '"' <* whitespace
return $ LoxTokInfo (STRING $ Import.concat qstring) Nothing Nothing source_posUpdate the scanToken function and let’s try this in ghci.
scanToken :: Parser LoxTokInfo
scanToken =
try scanDoubleToken
<|> scanSingleCharToken
<|> try scanQuotedStringExample of quoted string being parsed:
λ >>scanner "\"test this with \\n newline\""
Right [LoxTokInfo {tokinfo_type = STRING "test this with \\n newline", tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]Recognizing Numeric values (we will use DOUBLES entirely)
Our next task is to scan strings representing numeric values as `Double. The scanning functions for these would be:
whitespaceToken = do
source_pos <- getPosition
_ <- many1 $ char ' '
return $ LoxTokInfo WHITESPACE Nothing Nothing source_pos
scanDouble :: Parser LoxTokInfo
scanDouble = do
source_pos <- getPosition
let la = lookAhead (whitespaceToken <|> scanSingleCharToken)
sel <- do
firstPart <- Text.Parsec.many1 digit
try (secondCharacter firstPart <* la <* whitespace) <|> NUMBER (read firstPart) <$ la <* whitespace
return $ LoxTokInfo sel Nothing Nothing source_pos
where
secondCharacter :: String -> Parser LoxTok
secondCharacter firstPart = do
void $ char '.'
secondPart <- Text.Parsec.many1 digit <* whitespace
return $ NUMBER $ read $ Import.concat [firstPart, ".", secondPart]Again, as described in the book, we need to scan two parts for any numeric value. The string preceding . and the string following the .. Also, the . is optional. In our case, we first scan the string preceding the ., and then optionally call secondCharacter to scan the remaining string if any exists. After updating scanToken we have
scanToken :: Parser LoxTokInfo
scanToken =
try scanDoubleToken
<|> scanSingleCharToken
<|> try scanQuotedString
<|> try scanDoubleλ >>fmap tokenShow $ fromRight [] $ scanner "1.1;"
["LoxTok=NUMBER 1.1","LoxTok=SEMICOLON"]
it :: [String]The lookAhead is needed to handle cases where the numeric value is preceeded by non-terminating characters like alpha characters. For example, 1.1abc. In this case we want the lexer to fail
λ >>scanner "1.1a"
Left (line 1, column 4):
unexpected "a"
expecting digit, "(", ")", "{", "}", ",", ".", "-", "+", ";", "/", "*", "!", "=", ">" or "<"
it :: LoxScannerResult
λ >>Recognizing Keywords and Identifiers
Our code for recognizing key words will be similar to the ones we had to scan character tokens and double character tokens. Similar to our earlier approach, we will use a map to generate all the cases.
keywordMapping :: [(LoxTok, String)]
keywordMapping =
[
(AND, "and"),
(CLASS, "class"),
(ELSE, "else"),
(FALSE, "false"),
(FUN, "fun"),
(FOR, "for"),
(IF, "if"),
(NIL, "nil"),
(OR, "or"),
(PRINT, "print"),
(RETURN, "return"),
(SUPER, "super"),
(THIS, "this"),
(TRUE, "true"),
(VAR, "var"),
(WHILE, "while")
]
scanKeywordToken :: Parser LoxTokInfo
scanKeywordToken = do
source_pos <- getPosition
sel <- choice $ build <$> keywordMapping
return $ LoxTokInfo sel Nothing Nothing source_pos
where
build :: (LoxTok, String) -> Parser LoxTok
build (x, y) = x <$ string y <* whitespaceAnd, accordingly, let’s update the scanToken function.
scanToken :: Parser LoxTokInfo
scanToken =
try scanDoubleToken
<|> try scanSingleCharToken
<|> try scanQuotedString
<|> try scanDouble
<|> try scanKeywordTokenTesting, this in ghci shows
λ >>scanner "class"
Right [LoxTokInfo {tokinfo_type = CLASS, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Now, to recoginize Identifiers, we need to make sure they are not prefixes of the keywords. Identifiers also can only start with an alphabet followed by digits and _. We can implement that parser as follows:
-- -- http://jakewheat.github.io/intro_to_parsing/#_var
var :: Parser String
var = do
fc <- firstChar
rest <- many nonFirstChar
return (fc : rest)
where
firstChar = satisfy (\a -> isLetter a || a == '_')
nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')But, after parsing the identifier, we also need to parse it and make sure they are not keywords. Therefore, we will wrap the var parser into checkIdentifier function.
checkIfIdentifier :: Parser LoxTokInfo
checkIfIdentifier = do
source_pos <- getPosition
s <- var
result ([(x, y) | (x, y) <- keywordMapping, y == s]) s source_pos
where
result xs s source_pos = do
case xs of
[] -> return $ LoxTokInfo (IDENTIFIER s) Nothing Nothing source_pos
(x, _):_ -> return $ LoxTokInfo x Nothing Nothing source_posNotice that if the scanned string matches any of the keywords, then we return the keyword else return the string wrapped inside the IDENTIFIER. Once again, lets update the scanToken function
scanToken :: Parser LoxTokInfo
scanToken =
try scanDoubleToken
<|> try scanSingleCharToken
<|> try scanQuotedString
<|> try scanDouble
<|> checkIfIdentifierRecognizing single line comments
One other addition to the scanner as described in the chapter is to add support to support single line comments. We diverge from chapter and capture the comment inside a COMMENT data constructor.
scanComment :: Parser LoxTokInfo
scanComment = do
source_pos <- getPosition
_ <- string "//"
-- TODO: Find a better way to do this, scanning this more than once is less desirable
comment <- try (manyTill anyToken (try (oneOf "\n"))) <|> manyTill anyToken eof
return $ LoxTokInfo (COMMENT (T.pack comment)) Nothing Nothing source_posThe scanComment function could potentially be improved to not do two entire parses when there is comment at the last line of the file. For now, I have let it as is.
The final version of scanToken is as follows. Note, that we need to add scanComment parser before scanSingleCharToken, since the // shares a prefix with /.
scanToken :: Parser LoxTokInfo
scanToken =
try scanComment
<|> try scanDoubleToken
<|> try scanSingleCharToken
<|> try scanQuotedString
<|> try scanDouble
<|> checkIfIdentifierConclusion
With the above changes we have scanner replicated in Haskell which can lex a Lox program. Hope you found this useful. See you all in the follow up series.