Crafting Interpreters in Haskell - Scanning
I have started reading the excellent book Crafting Interpreters. As I read the book, I have decided to implement the Lox interpreter it describes in Haskell. My goal is to share the implementation and add some commentary for each chapter of the book.
Note that at times it might seem like code is duplicated across functions. Often, this is deliberate, so that each piece of code can be looked at in isolation while it is being described in this tutorial. While I have tests for most of this code, I have not spent time making it production ready. Therefore, use this for educational purposes only.
In this part of the series, we will be looking at the scanner/lexer implementation for the Lox language. The code built in this post closely reflects the requirements and techniques described in the Scanning chapter. The code in this series also uses Parsec, so familiarity with that library is assumed. As a refresher on how the library works, I would recommend walking through the examples at jakewheat.github.io. I have used many of the techniques described in that tutorial while building the scanner.
You can find the complete version of this module and related tests at Scanner.hs and test_scanner.hs.
The order of implementation follows the description in the chapter as closely as possible to keep the explanation clear. Since we are using Parsec and not directly walking through the text, our implementation becomes a lot easier. We do not have to maintain state (the position of the cursor, etc.) as described in the book's example. We also use Parsec's ability to look ahead before making decisions. I think it is reasonable to use Parsec and take advantage of this library.
Let’s first get some import statements out of the way.
--- Scanner.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE TemplateHaskell #-}
module Scanner where
import Import hiding (many, (<|>), try)
import Data.Text as T
import Data.Char
import Text.Parsec.String as PS
import Text.Parsec.Char as PC
import Text.Parsec
import RIO.Partial (read)
Layout
- Error Handling
- Defining Tokens
- Running the Scanner
- Single Character Tokens
- Double Character Tokens
- Quoted Strings
- Numeric Values
- Keywords and Identifiers
- Single Line Comments
- Conclusion
Error Handling
Since we are using Parsec, we will just use the error reporting functionality it provides. Therefore, we define the scanner's result type as
type LoxScannerResult = Either ParseError [LoxTokInfo]
We will shortly talk about what LoxTokInfo is. The ParseError type is provided by Parsec.
Defining Tokens
Next, we define the tokens of the Lox language that will be captured by the scanner. The book's enum types translate nicely to a sum type in Haskell.
data LoxTok =
  -- Single-character tokens.
  LEFT_PAREN | RIGHT_PAREN | LEFT_BRACE | RIGHT_BRACE |
  COMMA | DOT | MINUS | PLUS | SEMICOLON | SLASH | STAR |
  -- One or two character tokens.
  BANG | BANG_EQUAL |
  EQUAL | EQUAL_EQUAL |
  GREATER | GREATER_EQUAL |
  LESS | LESS_EQUAL |
  -- Literals/Identifiers/Comments/Numeric values
  IDENTIFIER String | STRING String | NUMBER Double | COMMENT Text |
  -- Keywords.
  AND | CLASS | ELSE | FALSE | FUN | FOR | IF | NIL | OR |
  PRINT | RETURN | SUPER | THIS | TRUE | VAR | WHILE |
  WHITESPACE | -- needed to parse end of tokens
  EOF
  deriving (Show, Eq)
You will notice that the book captures the context around identifiers, STRING literals, and NUMBER values inside the Token object. We can instead capture that information as part of the IDENTIFIER, STRING, and NUMBER data constructors. Notice that we also capture single line COMMENT tokens (whereas the example in the book discards them).
Let’s now define our own record, which is equivalent to the Token object:
data LoxTokInfo = LoxTokInfo
  { tokinfo_type :: LoxTok,
    tokinfo_lexeme :: Maybe T.Text, -- currently not used
    tokinfo_literal :: Maybe LoxObject, -- currently not used
    tok_position :: SourcePos -- provided by Parsec
  }
  deriving (Show, Eq)
Note that currently tokinfo_lexeme and tokinfo_literal are not used. I have left them here to see whether I will need them in later chapters. For now, the context is captured in the data constructors. For tok_position, we are able to just capture the SourcePos type provided by Parsec.
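The record above refers to a LoxObject type whose definition is not shown in this post. If you are following along, a minimal placeholder (an assumption on my part; the real definition lives in the linked Scanner.hs and may carry more structure) is enough to make the module compile:

-- Hypothetical placeholder so the module compiles; the actual
-- LoxObject in Scanner.hs may differ.
data LoxObject = LoxObject
  deriving (Show, Eq)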
Running the Scanner
Before we start working on lexing the tokens, let's create a helper function that will help us test our scanner in ghci as we build up the token parser.
-- We will fill this function in as we progress; for now it is a typed hole
scanToken :: Parser LoxTokInfo
scanToken = _

-- public function that will parse the entire input until `eof`
scanner :: String -> LoxScannerResult
scanner = parse (many scanToken <* eof) ""
Recognizing Single Character Tokens
As described by the LoxTok type, we have a number of single character tokens to recognize. Since we want to reuse the scanner machinery, we first create a mapping and then use it to recognize each token.
charMapping :: [(LoxTok, Char)]
charMapping =
  [ (LEFT_PAREN, '('),
    (RIGHT_PAREN, ')'),
    (LEFT_BRACE, '{'),
    (RIGHT_BRACE, '}'),
    (COMMA, ','),
    (DOT, '.'),
    (MINUS, '-'),
    (PLUS, '+'),
    (SEMICOLON, ';'),
    (SLASH, '/'),
    (STAR, '*'),
    (BANG, '!'),
    (EQUAL, '='),
    (GREATER, '>'),
    (LESS, '<')
  ]
scanSingleCharToken :: Parser LoxTokInfo
scanSingleCharToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> charMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, Char) -> Parser LoxTok
    build (x, y) = x <$ char y <* whitespace
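These scan* functions consume trailing whitespace through a whitespace helper whose definition the post does not show. A minimal sketch, assuming we simply skip spaces, newlines, and tabs (in the style of the jakewheat tutorial; the real definition in Scanner.hs may differ):

-- Assumed helper: skip zero or more whitespace characters.
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"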
If we succeed in scanning a single character token, scanSingleCharToken returns a LoxTokInfo with a value for the token and also the source_pos. We will follow a similar pattern in the other scan* functions as well. Let's update the scanToken function with this tokenizer.
scanToken :: Parser LoxTokInfo
scanToken = scanSingleCharToken
Now, let's fire up ghci with stack ghci and check that our function works:
λ> import Scanner
λ> scanner "{"
Right [LoxTokInfo {tokinfo_type = LEFT_BRACE, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Recognizing Double Character Tokens
As described in the Scanning chapter, we also need to handle two-character tokens such as == and !=. The scanner for such tokens is:
doubleCharMapping :: [(LoxTok, String)]
doubleCharMapping =
  [ (BANG_EQUAL, "!="),
    (EQUAL_EQUAL, "=="),
    (GREATER_EQUAL, ">="),
    (LESS_EQUAL, "<=")
  ]
scanDoubleToken :: Parser LoxTokInfo
scanDoubleToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> doubleCharMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, String) -> Parser LoxTok
    build (x, y) = x <$ string y <* whitespace
This scanner differs from scanSingleCharToken only in the mapping it uses. Nevertheless, we duplicate the code and do not worry about abstraction. Let's try this in ghci after updating our scanToken function. Note that this scanner has to precede the scanSingleCharToken scanner for us to detect ==. Otherwise, we would end up with two = tokens, and scanDoubleToken would never be invoked.
λ> scanner "="
Right [LoxTokInfo {tokinfo_type = EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
λ> scanner "=="
Right [LoxTokInfo {tokinfo_type = EQUAL_EQUAL, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Note that = scans to EQUAL and == scans to EQUAL_EQUAL.
Recognizing Quoted Strings
Recognizing quoted strings gets a little tricky, since one needs to keep track of escape sequences and nested quotes. We add a couple of helper functions to do just that. Note that I am using the approach from the SO answer referenced in the code snippet below, so I will skip the explanation and direct you to that answer.
-- https://stackoverflow.com/questions/24106314/parser-for-quoted-string-using-parsec
escape :: Parser String
escape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
  return [d, c]

nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"

character :: Parser String
character = fmap return nonEscape <|> escape

scanQuotedString :: Parser LoxTokInfo
scanQuotedString = do
  source_pos <- getPosition
  qstring <- char '"' *> many character <* char '"' <* whitespace
  return $ LoxTokInfo (STRING $ Import.concat qstring) Nothing Nothing source_pos
Let's update the scanToken function and try this in ghci.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> scanSingleCharToken
    <|> try scanQuotedString
Example of quoted string being parsed:
λ> scanner "\"test this with \\n newline\""
Right [LoxTokInfo {tokinfo_type = STRING "test this with \\n newline", tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
Recognizing Numeric Values (we will use Doubles entirely)
Our next task is to scan strings representing numeric values as Double. The scanning functions are:
whitespaceToken :: Parser LoxTokInfo
whitespaceToken = do
  source_pos <- getPosition
  _ <- many1 $ char ' '
  return $ LoxTokInfo WHITESPACE Nothing Nothing source_pos

scanDouble :: Parser LoxTokInfo
scanDouble = do
  source_pos <- getPosition
  let la = lookAhead (whitespaceToken <|> scanSingleCharToken)
  sel <- do
    firstPart <- Text.Parsec.many1 digit
    try (secondCharacter firstPart <* la <* whitespace)
      <|> NUMBER (read firstPart) <$ la <* whitespace
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    secondCharacter :: String -> Parser LoxTok
    secondCharacter firstPart = do
      void $ char '.'
      secondPart <- Text.Parsec.many1 digit <* whitespace
      return $ NUMBER $ read $ Import.concat [firstPart, ".", secondPart]
Again, as described in the book, we need to scan two parts for any numeric value: the string preceding the . and the string following it. The . itself is optional. In our case, we first scan the string preceding the ., and then optionally call secondCharacter to scan the remaining string if one exists. After updating scanToken we have
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
λ> fmap tokenShow $ fromRight [] $ scanner "1.1;"
["LoxTok=NUMBER 1.1","LoxTok=SEMICOLON"]
it :: [String]
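The session above uses a tokenShow helper that this post does not define. A plausible definition, matching the output format above (an assumption on my part; the real one is in the linked Scanner.hs):

-- Assumed helper: render only the token constructor, which is more
-- compact than showing the whole LoxTokInfo record.
tokenShow :: LoxTokInfo -> String
tokenShow t = "LoxTok=" ++ show (tokinfo_type t)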
The lookAhead is needed to handle cases where the numeric value is followed by non-terminating characters such as letters, for example 1.1abc. In this case we want the lexer to fail:
λ> scanner "1.1a"
Left (line 1, column 4):
unexpected "a"
expecting digit, "(", ")", "{", "}", ",", ".", "-", "+", ";", "/", "*", "!", "=", ">" or "<"
it :: LoxScannerResult
Recognizing Keywords and Identifiers
Our code for recognizing keywords will be similar to the code for single and double character tokens. As before, we use a mapping to generate all the cases.
keywordMapping :: [(LoxTok, String)]
keywordMapping =
  [ (AND, "and"),
    (CLASS, "class"),
    (ELSE, "else"),
    (FALSE, "false"),
    (FUN, "fun"),
    (FOR, "for"),
    (IF, "if"),
    (NIL, "nil"),
    (OR, "or"),
    (PRINT, "print"),
    (RETURN, "return"),
    (SUPER, "super"),
    (THIS, "this"),
    (TRUE, "true"),
    (VAR, "var"),
    (WHILE, "while")
  ]
scanKeywordToken :: Parser LoxTokInfo
scanKeywordToken = do
  source_pos <- getPosition
  sel <- choice $ build <$> keywordMapping
  return $ LoxTokInfo sel Nothing Nothing source_pos
  where
    build :: (LoxTok, String) -> Parser LoxTok
    build (x, y) = x <$ string y <* whitespace
And accordingly, let's update the scanToken function.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> try scanKeywordToken
Testing this in ghci shows:
λ> scanner "class"
Right [LoxTokInfo {tokinfo_type = CLASS, tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]
it :: LoxScannerResult
Now, to recognize identifiers, we need to make sure that a keyword is not simply a prefix of a longer identifier (for example, or in orchid). Identifiers can only start with a letter or _, followed by letters, digits, or _. We can implement that parser as follows:
-- http://jakewheat.github.io/intro_to_parsing/#_var
var :: Parser String
var = do
  fc <- firstChar
  rest <- many nonFirstChar
  return (fc : rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
After parsing an identifier, we also need to check that it is not a keyword. Therefore, we wrap the var parser in a checkIfIdentifier function.
checkIfIdentifier :: Parser LoxTokInfo
checkIfIdentifier = do
  source_pos <- getPosition
  s <- var
  result ([(x, y) | (x, y) <- keywordMapping, y == s]) s source_pos
  where
    result xs s source_pos =
      case xs of
        [] -> return $ LoxTokInfo (IDENTIFIER s) Nothing Nothing source_pos
        (x, _):_ -> return $ LoxTokInfo x Nothing Nothing source_pos
Notice that if the scanned string matches one of the keywords, we return the keyword token; otherwise we return the string wrapped inside IDENTIFIER. Once again, let's update the scanToken function. Since checkIfIdentifier now handles keywords as well, scanKeywordToken is no longer needed in the chain.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> checkIfIdentifier
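With this ordering, a keyword is recognized only when the scanned string matches it exactly, so identifiers that merely start with a keyword are handled correctly. A quick hypothetical check (using fromRight from Data.Either to unwrap the result):

λ> tokinfo_type <$> fromRight [] (scanner "or")
[OR]
λ> tokinfo_type <$> fromRight [] (scanner "orchid")
[IDENTIFIER "orchid"]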
Recognizing Single Line Comments
One other addition to the scanner described in the chapter is support for single line comments. We diverge from the chapter and capture the comment text inside a COMMENT data constructor.
scanComment :: Parser LoxTokInfo
scanComment = do
  source_pos <- getPosition
  _ <- string "//"
  -- TODO: Find a better way to do this, scanning this more than once is less desirable
  comment <- try (manyTill anyToken (try (oneOf "\n"))) <|> manyTill anyToken eof
  return $ LoxTokInfo (COMMENT (T.pack comment)) Nothing Nothing source_pos
The scanComment function could potentially be improved to avoid scanning the input twice when the comment is on the last line of the file. For now, I have left it as is.
The final version of scanToken is as follows. Note that we need to place the scanComment parser before scanSingleCharToken, since // shares a prefix with /.
scanToken :: Parser LoxTokInfo
scanToken =
  try scanComment
    <|> try scanDoubleToken
    <|> try scanSingleCharToken
    <|> try scanQuotedString
    <|> try scanDouble
    <|> checkIfIdentifier
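As a quick sanity check of the comment handling, scanning a line that contains only a comment should yield a single COMMENT token. A hypothetical session:

λ> scanner "// a comment"
Right [LoxTokInfo {tokinfo_type = COMMENT " a comment", tokinfo_lexeme = Nothing, tokinfo_literal = Nothing, tok_position = (line 1, column 1)}]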
Conclusion
With the above changes, we have a scanner in Haskell that can lex a Lox program. I hope you found this useful. See you in the follow-up post.