A Simple Implementation of Pratt Parser in Haskell

2021-10-06

I am currently working through an implementation of language described in Crafting Interpreters book in Haskell. The chapter on Compiling Expressions introduces the user to a parser called the Pratt Parser or Top-Down Operator Precendence parser. During my first pass through this chapter, I found it difficult to follow how the parser was actually working, though the author does a very good job building up a the C code in small chunks. I had to pause at this point and find other resources that would help in understanding how this algorithm worked.

First, I tried reading the author Bob Nystrom’s suggestion and visited his blog post on the same topic here. I still could not get it. Perhaps, I had some pre-concieved notion of this parser which was making it difficult for me. Then, looking around I found a more accessible explanation by Eli Bendersky here. The author does a great job discussing the algorithm with a simple Python based parser. With some staring and playing around with small AST’s, the working of the algorithm started making sense.

After understanding the code examples in Python, and later going through the original paper Top Down Operator Precedence - Vaughan R. Pratt, I realized the actual algorithm was so simple, but yet so powerful. Looking at the algorithm as some conditional depth-first traversal on a tree was very helpful for me.

In this post, I attempt to provide an mental model I used to understand how this algorithm works. Once, we get the basics down, I will show you how to implement this in Haskell for a very basic grammar. The final code we produce with demonstrate the working of the Pratt parser in setting up both associativity and precedence.

Problem

Let’s look at the problem this algorithm solves. Say, we have an expression 4 + 2 * 3. This expression could get parsed into 2 different Abstract Syntax Trees (AST). The first one would parse the expression as (4+2)*3 and produce the result 18, while the second AST would produce 4+(2*3), thus producing 10 on evaluation of the expression.

-- ast where `+` gets higher precedence than `*`
     *
    / \
   +   3
  / \
 4   2

-- ast where `*` gets higher precedence

    +
   / \
  4   *
     / \
    /   \
   2     3

What we need is the second AST. More fundamentally, we need to encode the precedence of these operators in the parser code to parse this expression. Usually, when a recursive-descent parser is used,the functions are arranged in such a manner such the both the associativity and precedence are handled.

The Pratt parser is an alternate algorithm that addresses the precedence requirements. In a recursive-descent(RD) parser, the non-terminals of the grammar become entry points(functions) in our code. But, with a Pratt parser, as you will see later in this post, the tokens of the parsed expression drive the parsing. Based on the token in context, different paths are taken to construct the desired AST.

Approach

Lets’ look at a more complex syntax tree. We will use this AST to understand how to approach this problem.

    -- 1 + (2*3) + 4

                   +
                  / \
                 +   4
                / \
               /   \
              1     *
                   / \
                  /   \
                 2     3

I will slightly rotate this tree to make the levels of operators clearer.

            1-----+---------+
                  |         |
                  |         4
                  |
               2--*--3

Notice that the + operators are at one level, and * operators are at lower level. We can view these levels at the precedence levels. These operators operate at with respect to each other. Here the lower the level in the diagram, the higher the precedence of the operator. Here is another example, that will help clarify, what I am trying to explain.


    -- 1 + 2 * 3 / 6 - 1

                   -
                  / \
                 +   1
                / \
               /   \
              1    `div`
                   / \
                  /   \
                 *    6
                / \
               /   \
              2     3

Lets rotate the tree slightly, so that we line up the operators, on the same plane as their precdence levels. Here + and - share the same precdence level, and * and / share the same precedence level.


      1--------+----------`-`------1
                |
                |
            2---*----3----/
                          |
                          6

Say, we are asked to evaluate the AST example we saw earlier:

    -- 1 + 2 * 3 + 4

                   +
                  / \
                 +   4
                / \
               /   \
              1     *
                   / \
                  /   \
                 2     3

At every node that is rooted with an operator we need to make a decision based on operator the subtree of the operator is rooted on. Here, say we are at the first +, then looking at the right sub-tree, since * has higher precedence we will evaluate that subtree before evaluating +. But, of course we don’t have this tree consructed already to be able to make this decision. We have to work with an expression which looks like the one below and that is where the Pratt parser helps.

1 + 2 * 3 + 4

Let’s define the precedence level of + to be 10 and that of * to be 20. If we had a simple expression such as 1 + 2 * 3, this is how our parser could work:

Parse token 1. Eval this value.
Parse token +. Say,at this point, the precdence level is 10
Parse token 2. At this point, we need to decide if 2 associates with + or the following operator.

Now, at this point we know we would have evaluated 1 which we will call the null denomination of +. But, before we decide whether to evaluate 2, we will have to look ahead to check if 2 associates to another sub-tree rooted with an operator that has higher precedence. Therefore,

Parse token *

Now, since the * has higher precedence that +, we start a new parse tree rooted at *, evaluate 2 as the left operand of * and then perform same steps with 3. Once the sub-tree is evaluated, then we continue with our initial rooted tree from where we had branched off.

That is the crux of this algorithm. Logically, inspect the precedence of the operator that is the root of the subtree and then decide whether to evaluate the sub-tree or just evaluate the operand to the right.

Yet, another way to think about, is given an expression string 1 + 2 * 3, we are processing each token in a depth-first search tree. At each point, we are at a precedence level (call it, rbp). Every time we need to decide whether to evaluate a token, we need to look at the precedence level of that operator that token needs to associate with. If precedence is higher, than we recurse down to that subtree. Once the recursion of the sub-tree returns, we continue with the current level.

Implementing the Pratt Parser in Haskell

A Simple Grammar

Let’s define a grammar for the expressions we will support. We will parse this expression into a Token type list. The list of Tokens will be the input to the Pratt parser.

We will start with support for expressions with binary operations such as +, -, * and /. Later in the post, we will also add a negation (unary) operator.

-- Example expressions:
1 + 2 * 3 + 4

-- parens introduce grouping
(1 + 2) * 3 + 4

-- this is right associative, we will also see how to do this
3 ^ 2 ^ 3
.```

The tokens we will need to produce while scanning this grammar can be defined as follows:

``` haskell
data Token =
  Plus
  | Minus
  | Star
  | Slash
  | Exp
  | Number !Double
  | LParen
  | RParen
  | EndTok
  deriving (Eq, Show)

Set Up A Parser for the Grammar

We will use Parsec to scan the expression and produce the list of Tokens. I will skip the explanation of this code. For now, it is enough to assume that this scanner produces a list of Tokens when the parseExpression function is called with an expression string.

whitespace :: Parser ()
whitespace = void $ PS.many $ oneOf " "

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace

scanNumber :: Parser Token
scanNumber = do
  digits <- lexeme $ PS.many1 digit
  return $ Number (read digits)

scanOperator :: Parser Token
scanOperator = choice $ build <$> [
  (Plus, '+'),
  (Minus, '-'),
  (Slash, '/'),
  (Star, '*'),
  (Exp, '^'),
  (LParen, '('),
  (RParen, ')')
  ]
  where
    build :: (Token, Char) -> Parser Token
    build (x, y) = lexeme $ x <$ char y

parseExpression :: String -> Either PS.ParseError [Token]
parseExpression inp = do
  toks <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
  return $ toks ++ [EndTok]

Define The State

Since we need to pass around the list of [Tokens] while we parse the tokens into a AST, we will perform the entire computation inside a State monad. For our simple grammar, our expression will always produce a Double value.

type TokenS = State [Token] Double

Helper Functions For Handling State

Let’s next set up helper functions, that will let us peek into the current token, or move our state to the next token in the [Token] value. Note, I have chosen to just error out on invalid calls here for sake of exposition. I would follow different approach like, use Maybe or Either data types, in more serious code.

nextToken :: State [Token] Token
nextToken = do
  s <- get
  case s of
    (x:xs) -> do
      put xs
      return x
    _ -> error "Token list is empty"

currToken :: State [Token] Token
currToken = do
  s <- get
  case s of
    (x:_) -> return x
    _ -> error "No more tokens"

Set Up `InFix` Precision Map

The binary operator examples we saw above are examples of infix operators. We will now create a map of precedence levels for the binary operators we will support.

infixPrecedence :: Token -> Double
infixPrecedence tok = case tok of
  Number _ -> 0
  LParen -> 0
  RParen -> 0
  EndTok -> 0
  Minus -> 10
  Plus -> 10
  Star -> 20
  Slash -> 20
  Exp -> 30
  _ -> error $ "prec not defined for = " ++ show tok

Null denominations(`nud`)

These are right operands of the operators we are evaluating. The original paper refer’s to this term as nud and we stick to that terminology.

In our case, for the first version, the only arguments the binary operators accept is the Number data constructor. The nud function just evaluates the token that is passed to it.

nud :: Token -> State [Token] Double
nud (Number x) = return x
nud _ = error "only number literal supported for nud"

Left Denominations(`led`)

And, now we need a way to evaluate the right side of the operator. We handle this in a function called led. This functions accepts the already evaluated left operand. This function is called, when the parser encounters one of the operators. The operator parameter is also passed in as an argument. Note, that by the time this function is called, the head of [Token] list contains the right operand to the operator this function is called for. Therefore, first, the function evaluates its right expression and then does the operator specific evaluation.

To start let’s just focus on + and *. Notice that for +, -, *, /, we first evaluate right operand by calling into the precedenceParser function. The expression function accepts the precedence level of the current operator. The precedenceParser function is the core function which will perform the precedence based depth first traversal.

led :: Double -> Token -> TokenS
led left tok = do
  case tok of
    Plus -> do
      right <- expression (infixPrecedence tok)
      return $ left + right
    Minus -> do
      right <- expression (infixPrecedence tok)
      return $ left - right
    Star -> do
      right <- expression (infixPrecedence tok)
      return $ left * right
    Slash -> do
      right <- expression (infixPrecedence tok)
      return $ left / right
    Exp -> do
      right <- expression $ infixPrecedence tok - 1
      return $ left ** right
    _ -> error $ show tok ++ "not supported"

Precedence Parser

Now, we have all the peripheral functions set up. We are ready to implement the core function of our parser. Usually, I have seen, this function named as precedenceParser. We will use the same name.


-- rbp here is the right-bound precedence level

precedenceParser :: Double -> TokenS
precedenceParser rbp = do
  token <- nextToken
  left <- nud token
  nt1 <- currToken
  go left nt1
  where
    go left' nt' = if rbp < infixPrecedence nt' then do
        void nextToken
        left'' <- led left' nt'
        nt'' <- currToken
        go left'' nt''
      else return left'

Now, lets look at this function closely. The function accepts the precedence level as its argument. Here are the steps, the function follows to parse an expression like `1 + 2 * 3 + 4.

The function is called with 0 and a token state, that looks like [Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]
nextToken, updates the token state to [Plus, Number 2, Star, Number 3, Plus, Number 4] and binds 1 to token.
We call nud on token and this gives us the value of the left expression of the operator.
Now we need to decide whether to just return or evaluate the deeper sub-tree based on precedence level. Remember we start with rbp=0. Since rbp is less the precedence level of +, which is our current token (nt1 <- currToken), we recurse with the helper function go. In the go function, we update the token state to [Number 2, Star, Number 3, Plus, Number 4], and then call the led function. If you recall, the led function accepts an operator token and evaluates the token at the head of the Token list.
Once the go function recursion is completed at the given precedence level, we return the last evaluated left expression.

Here is the call stack (I have also shown the state that gets passed during the calls)

.. precedenceParser 0 [Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]
....... nud (Number 1)  -- we have 1
....... go 1 `+`
............. led 2 `+` [Number 2, Star, Number 3, Plus, Number 4]
..................... precendenceParser 10 [Number 2, Star, Number 3, Plus, Number 4]
........................... nud (Number 2)  -- we have number 2
........................... go 2 `*`
.................................  led 2 `*` [Number 3, Plus, Number 4]
.......................................... precedenceParser 20 [Number 3, Plus, Number 4]
................................................  nud (Number 3) -- we have 3
..................................................... -- no call to `go` since `20` < precedenc of `+` which is `10`
................................................  return 3
.................................  return 6 (2 * 3)
..................... return 1 + 6
.........go 7 `+` [Number 4]
.............led 7 `+`[Number 4]
.....................precendenceParser 10 [Number 4]
...........................nud (Number 7) -- we have 4
...........................no more tokens, `go` not called anymore
.............return 11
.........return 11

Adding support for `Prefix` opeartors

For prefix operatos like a negation operator, we need to create a new precedence map and update our nud function accordingly. For this post, we will hard code the prefixes in the nud function itself. Notice, that a high value of 100 is being passed as precedence of minus operator. We do that since, we want the argument to - evaluated before it can be negated.

nud :: Token -> State [Token] Double
nud (Number x) = return x
nud Minus = do
  right <- precedenceParser 100  -- add this to prefix map
  return $ -right
nud _ = error "only literal supported for nud"

The full code example that is reproduced below also handles grouping () as prefix operators.

Full Code Example

The code below is the entire Parser in one module. There is additional handling for Exp operator and parens in the code below. You can load the code into ghci and run evalAll to run some examples provided at the bottom of the module.

module PrattParser where

import Control.Monad.State.Strict
import Text.Parsec.String
import Text.Parsec.Char
import qualified Text.Parsec as PS
import Text.Parsec.Combinator
import Control.Applicative
import Data.Either


whitespace :: Parser ()
whitespace = void $ PS.many $ oneOf " "

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace

scanNumber :: Parser Token
scanNumber = do
  digits <- lexeme $ PS.many1 digit
  return $ Number (read digits)

scanOperator :: Parser Token
scanOperator = choice $ build <$> [
  (Plus, '+'),
  (Minus, '-'),
  (Slash, '/'),
  (Star, '*'),
  (Exp, '^'),
  (LParen, '('),
  (RParen, ')')
  ]
  where
    build :: (Token, Char) -> Parser Token
    build (x, y) = lexeme $ x <$ char y

parseExpression :: String -> Either PS.ParseError [Token]
parseExpression inp = do
  toks <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
  return $ toks ++ [EndTok]


data Token =
  Plus
  | Minus
  | Star
  | Slash
  | Exp
  | Number !Double
  | LParen
  | RParen
  | EndTok
  deriving (Eq, Show)

type TokenS = State [Token] Double

nextToken :: State [Token] Token
nextToken = do
  s <- get
  case s of
    (x:xs) -> do
      put xs
      return x
    _ -> error "Token list is empty"

currToken :: State [Token] Token
currToken = do
  s <- get
  case s of
    (x:_) -> return x
    _ -> error "No more tokens"

hasToken :: State [Token] Bool
hasToken = do
  s <- get
  case s of
    (_:_) -> return True
    [] -> return False


nud :: Token -> State [Token] Double
nud (Number x) = return x
nud Minus = do
  right <- precedenceParser 100  -- add this to prefix map
  return $ -right
nud LParen = do
  right <- precedenceParser 0
  token <- currToken
  case token of
    RParen -> do
      void nextToken
      return right
    _ -> error $ "unexpected token = " ++ show token ++ " found."
nud _ = error "only literal supported for nud"

infixPrecedence :: Token -> Double
infixPrecedence tok = case tok of
  Number _ -> 0
  LParen -> 0
  RParen -> 0
  EndTok -> 0
  Minus -> 10
  Plus -> 10
  Star -> 20
  Slash -> 20
  Exp -> 30
  _ -> error $ "prec not defined for = " ++ show tok

led :: Double -> Token -> TokenS
led left tok = do
  case tok of
    Plus -> do
      right <- precedenceParser (infixPrecedence tok)
      return $ left + right
    Minus -> do
      right <- precedenceParser (infixPrecedence tok)
      return $ left - right
    Star -> do
      right <- precedenceParser (infixPrecedence tok)
      return $ left * right
    Slash -> do
      right <- precedenceParser (infixPrecedence tok)
      return $ left / right
    Exp -> do
      right <- precedenceParser $ infixPrecedence tok - 1
      return $ left ** right
    _ -> error $ show tok ++ "not supported"


precedenceParser :: Double -> TokenS
precedenceParser rbp = do
  token <- nextToken
  left <- nud token
  nt1 <- currToken
  go left nt1
  where
    go left' nt' = if rbp < infixPrecedence nt' then do
        void nextToken
        left'' <- led left' nt'
        nt'' <- currToken
        go left'' nt''
      else return left'

-- Tests precedenceParser

evalExpression :: [String] -> [Double]
evalExpression = map (evalState (precedenceParser 0) . fromRight [] . parseExpression)

evalAll = evalExpression [
  "1+2+3+4",
  "10-2+1",
  "10-5-1",
  "10+2*3-8",
  "3^2^3",
  "(10+2)*3-8"
  ]

Conclusion

In this post, I have simplified the implementation to help in exposition. If you look at some other examples of Pratt parsers, you will notice the code builds up a table of operators mapped to their respective prefix and infix handling functions along with their precedence. While that code is flexible, but the keys in the mapping end up being strings. In the implementation we have, we get exhaustive-pattern matching gaurantees in both nud and led functions. The table based approach does not provide you this gaurantee.

In future posts, I will be showing how we can use a similar implementation while implementing the virtual VM for lox in Haskell.

References

Top-Down operator precedence parsing

Link to Original TDOP paper

Pratt Parsers: Expression Parsing Made Easy