A Simple Implementation of Pratt Parser in Haskell
I am currently working through an implementation of language described in Crafting Interpreters book in Haskell. The chapter on Compiling Expressions introduces the user to a parser called the Pratt Parser or Top-Down Operator Precendence parser. During my first pass through this chapter, I found it difficult to follow how the parser was actually working, though the author does a very good job building up a the C code in small chunks. I had to pause at this point and find other resources that would help in understanding how this algorithm worked.
First, I tried reading the author Bob Nystrom’s suggestion and visited his blog post on the same topic here. I still could not get it. Perhaps, I had some pre-concieved notion of this parser which was making it difficult for me. Then, looking around I found a more accessible explanation by Eli Bendersky here. The author does a great job discussing the algorithm with a simple Python based parser. With some staring and playing around with small AST’s, the working of the algorithm started making sense.
After understanding the code examples in Python, and later going through the original paper Top Down Operator Precedence - Vaughan R. Pratt, I realized the actual algorithm was so simple, but yet so powerful. Looking at the algorithm as some conditional depth-first traversal on a tree was very helpful for me.
In this post, I attempt to provide an mental model I used to understand how this algorithm works. Once, we get the basics down, I will show you how to implement this in Haskell for a very basic grammar. The final code we produce with demonstrate the working of the Pratt parser in setting up both associativity and precedence.
Problem
Let’s look at the problem this algorithm solves. Say, we have an expression 4 + 2 * 3. This expression could get parsed into 2 different Abstract Syntax Trees (AST). The first one would parse the expression as (4+2)*3 and produce the result 18, while the second AST would produce 4+(2*3), thus producing 10 on evaluation of the expression.
-- ast where `+` gets higher precedence than `*`
*
/ \
+ 3
/ \
4 2
-- ast where `*` gets higher precedence
+
/ \
4 *
/ \
/ \
2 3
What we need is the second AST. More fundamentally, we need to encode the precedence of these operators in the parser code to parse this expression. Usually, when a recursive-descent parser is used,the functions are arranged in such a manner such the both the associativity and precedence are handled.
The Pratt parser is an alternate algorithm that addresses the precedence requirements. In a recursive-descent(RD) parser, the non-terminals of the grammar become entry points(functions) in our code. But, with a Pratt parser, as you will see later in this post, the tokens of the parsed expression drive the parsing. Based on the token in context, different paths are taken to construct the desired AST.
Approach
Lets’ look at a more complex syntax tree. We will use this AST to understand how to approach this problem.
-- 1 + (2*3) + 4
+
/ \
+ 4
/ \
/ \
1 *
/ \
/ \
2 3
I will slightly rotate this tree to make the levels of operators clearer.
1-----+---------+
| |
| 4
|
2--*--3
Notice that the + operators are at one level, and * operators are at lower level. We can view these levels at the precedence levels. These operators operate at with respect to each other. Here the lower the level in the diagram, the higher the precedence of the operator. Here is another example, that will help clarify, what I am trying to explain.
-- 1 + 2 * 3 / 6 - 1
-
/ \
+ 1
/ \
/ \
1 `div`
/ \
/ \
* 6
/ \
/ \
2 3
Lets rotate the tree slightly, so that we line up the operators, on the same plane as their precdence levels. Here + and - share the same precdence level, and * and / share the same precedence level.
1--------+----------`-`------1
|
|
2---*----3----/
|
6
Say, we are asked to evaluate the AST example we saw earlier:
-- 1 + 2 * 3 + 4
+
/ \
+ 4
/ \
/ \
1 *
/ \
/ \
2 3
At every node that is rooted with an operator we need to make a decision based on operator the subtree of the operator is rooted on. Here, say we are at the first +, then looking at the right sub-tree, since * has higher precedence we will evaluate that subtree before evaluating +. But, of course we don’t have this tree consructed already to be able to make this decision. We have to work with an expression which looks like the one below and that is where the Pratt parser helps.
1 + 2 * 3 + 4
Let’s define the precedence level of + to be 10 and that of * to be 20. If we had a simple expression such as 1 + 2 * 3, this is how our parser could work:
- Parse token
1. Eval this value. - Parse token
+. Say,at this point, the precdence level is10 - Parse token
2. At this point, we need to decide if2associates with+or the following operator.
Now, at this point we know we would have evaluated 1 which we will call the null denomination of +. But, before we decide whether to evaluate 2, we will have to look ahead to check if 2 associates to another sub-tree rooted with an operator that has higher precedence. Therefore,
- Parse token
*
Now, since the * has higher precedence that +, we start a new parse tree rooted at *, evaluate 2 as the left operand of * and then perform same steps with 3. Once the sub-tree is evaluated, then we continue with our initial rooted tree from where we had branched off.
That is the crux of this algorithm. Logically, inspect the precedence of the operator that is the root of the subtree and then decide whether to evaluate the sub-tree or just evaluate the operand to the right.
Yet, another way to think about, is given an expression string 1 + 2 * 3, we are processing each token in a depth-first search tree. At each point, we are at a precedence level (call it, rbp). Every time we need to decide whether to evaluate a token, we need to look at the precedence level of that operator that token needs to associate with. If precedence is higher, than we recurse down to that subtree. Once the recursion of the sub-tree returns, we continue with the current level.
Implementing the Pratt Parser in Haskell
A Simple Grammar
Let’s define a grammar for the expressions we will support. We will parse this expression into a Token type list. The list of Tokens will be the input to the Pratt parser.
We will start with support for expressions with binary operations such as +, -, * and /. Later in the post, we will also add a negation (unary) operator.
-- Example expressions:
1 + 2 * 3 + 4
-- parens introduce grouping
(1 + 2) * 3 + 4
-- this is right associative, we will also see how to do this
3 ^ 2 ^ 3
.```
The tokens we will need to produce while scanning this grammar can be defined as follows:
``` haskell
data Token =
Plus
| Minus
| Star
| Slash
| Exp
| Number !Double
| LParen
| RParen
| EndTok
deriving (Eq, Show)Set Up A Parser for the Grammar
We will use Parsec to scan the expression and produce the list of Tokens. I will skip the explanation of this code. For now, it is enough to assume that this scanner produces a list of Tokens when the parseExpression function is called with an expression string.
whitespace :: Parser ()
whitespace = void $ PS.many $ oneOf " "
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
scanNumber :: Parser Token
scanNumber = do
digits <- lexeme $ PS.many1 digit
return $ Number (read digits)
scanOperator :: Parser Token
scanOperator = choice $ build <$> [
(Plus, '+'),
(Minus, '-'),
(Slash, '/'),
(Star, '*'),
(Exp, '^'),
(LParen, '('),
(RParen, ')')
]
where
build :: (Token, Char) -> Parser Token
build (x, y) = lexeme $ x <$ char y
parseExpression :: String -> Either PS.ParseError [Token]
parseExpression inp = do
toks <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
return $ toks ++ [EndTok]Define The State
Since we need to pass around the list of [Tokens] while we parse the tokens into a AST, we will perform the entire computation inside a State monad. For our simple grammar, our expression will always produce a Double value.
type TokenS = State [Token] DoubleHelper Functions For Handling State
Let’s next set up helper functions, that will let us peek into the current token, or move our state to the next token in the [Token] value. Note, I have chosen to just error out on invalid calls here for sake of exposition. I would follow different approach like, use Maybe or Either data types, in more serious code.
nextToken :: State [Token] Token
nextToken = do
s <- get
case s of
(x:xs) -> do
put xs
return x
_ -> error "Token list is empty"
currToken :: State [Token] Token
currToken = do
s <- get
case s of
(x:_) -> return x
_ -> error "No more tokens"Set Up InFix Precision Map
The binary operator examples we saw above are examples of infix operators. We will now create a map of precedence levels for the binary operators we will support.
infixPrecedence :: Token -> Double
infixPrecedence tok = case tok of
Number _ -> 0
LParen -> 0
RParen -> 0
EndTok -> 0
Minus -> 10
Plus -> 10
Star -> 20
Slash -> 20
Exp -> 30
_ -> error $ "prec not defined for = " ++ show tokNull denominations(nud)
These are right operands of the operators we are evaluating. The original paper refer’s to this term as nud and we stick to that terminology.
In our case, for the first version, the only arguments the binary operators accept is the Number data constructor. The nud function just evaluates the token that is passed to it.
nud :: Token -> State [Token] Double
nud (Number x) = return x
nud _ = error "only number literal supported for nud"Left Denominations(led)
And, now we need a way to evaluate the right side of the operator. We handle this in a function called led. This functions accepts the already evaluated left operand. This function is called, when the parser encounters one of the operators. The operator parameter is also passed in as an argument. Note, that by the time this function is called, the head of [Token] list contains the right operand to the operator this function is called for. Therefore, first, the function evaluates its right expression and then does the operator specific evaluation.
To start let’s just focus on + and *. Notice that for +, -, *, /, we first evaluate right operand by calling into the precedenceParser function. The expression function accepts the precedence level of the current operator. The precedenceParser function is the core function which will perform the precedence based depth first traversal.
led :: Double -> Token -> TokenS
led left tok = do
case tok of
Plus -> do
right <- expression (infixPrecedence tok)
return $ left + right
Minus -> do
right <- expression (infixPrecedence tok)
return $ left - right
Star -> do
right <- expression (infixPrecedence tok)
return $ left * right
Slash -> do
right <- expression (infixPrecedence tok)
return $ left / right
Exp -> do
right <- expression $ infixPrecedence tok - 1
return $ left ** right
_ -> error $ show tok ++ "not supported"Precedence Parser
Now, we have all the peripheral functions set up. We are ready to implement the core function of our parser. Usually, I have seen, this function named as precedenceParser. We will use the same name.
-- rbp here is the right-bound precedence level
precedenceParser :: Double -> TokenS
precedenceParser rbp = do
token <- nextToken
left <- nud token
nt1 <- currToken
go left nt1
where
go left' nt' = if rbp < infixPrecedence nt' then do
void nextToken
left'' <- led left' nt'
nt'' <- currToken
go left'' nt''
else return left'Now, lets look at this function closely. The function accepts the precedence level as its argument. Here are the steps, the function follows to parse an expression like `1 + 2 * 3 + 4.
The function is called with
0and atokenstate, that looks like[Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]nextToken, updates thetokenstate to[Plus, Number 2, Star, Number 3, Plus, Number 4]and binds1totoken.We call
nudontokenand this gives us the value of theleftexpression of the operator.Now we need to decide whether to just return or evaluate the deeper sub-tree based on precedence level. Remember we start with
rbp=0. Sincerbpis less the precedence level of+, which is our current token (nt1 <- currToken), we recurse with the helper functiongo. In thegofunction, we update thetokenstate to[Number 2, Star, Number 3, Plus, Number 4], and then call theledfunction. If you recall, theledfunction accepts anoperatortoken and evaluates the token at the head of theTokenlist.Once the
gofunction recursion is completed at the given precedence level, we return the last evaluatedleftexpression.
Here is the call stack (I have also shown the state that gets passed during the calls)
.. precedenceParser 0 [Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]
....... nud (Number 1) -- we have 1
....... go 1 `+`
............. led 2 `+` [Number 2, Star, Number 3, Plus, Number 4]
..................... precendenceParser 10 [Number 2, Star, Number 3, Plus, Number 4]
........................... nud (Number 2) -- we have number 2
........................... go 2 `*`
................................. led 2 `*` [Number 3, Plus, Number 4]
.......................................... precedenceParser 20 [Number 3, Plus, Number 4]
................................................ nud (Number 3) -- we have 3
..................................................... -- no call to `go` since `20` < precedenc of `+` which is `10`
................................................ return 3
................................. return 6 (2 * 3)
..................... return 1 + 6
.........go 7 `+` [Number 4]
.............led 7 `+`[Number 4]
.....................precendenceParser 10 [Number 4]
...........................nud (Number 7) -- we have 4
...........................no more tokens, `go` not called anymore
.............return 11
.........return 11
Adding support for Prefix opeartors
For prefix operatos like a negation operator, we need to create a new precedence map and update our nud function accordingly. For this post, we will hard code the prefixes in the nud function itself. Notice, that a high value of 100 is being passed as precedence of minus operator. We do that since, we want the argument to - evaluated before it can be negated.
nud :: Token -> State [Token] Double
nud (Number x) = return x
nud Minus = do
right <- precedenceParser 100 -- add this to prefix map
return $ -right
nud _ = error "only literal supported for nud"The full code example that is reproduced below also handles grouping () as prefix operators.
Full Code Example
The code below is the entire Parser in one module. There is additional handling for Exp operator and parens in the code below. You can load the code into ghci and run evalAll to run some examples provided at the bottom of the module.
module PrattParser where
import Control.Monad.State.Strict
import Text.Parsec.String
import Text.Parsec.Char
import qualified Text.Parsec as PS
import Text.Parsec.Combinator
import Control.Applicative
import Data.Either
whitespace :: Parser ()
whitespace = void $ PS.many $ oneOf " "
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
scanNumber :: Parser Token
scanNumber = do
digits <- lexeme $ PS.many1 digit
return $ Number (read digits)
scanOperator :: Parser Token
scanOperator = choice $ build <$> [
(Plus, '+'),
(Minus, '-'),
(Slash, '/'),
(Star, '*'),
(Exp, '^'),
(LParen, '('),
(RParen, ')')
]
where
build :: (Token, Char) -> Parser Token
build (x, y) = lexeme $ x <$ char y
parseExpression :: String -> Either PS.ParseError [Token]
parseExpression inp = do
toks <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
return $ toks ++ [EndTok]
data Token =
Plus
| Minus
| Star
| Slash
| Exp
| Number !Double
| LParen
| RParen
| EndTok
deriving (Eq, Show)
type TokenS = State [Token] Double
nextToken :: State [Token] Token
nextToken = do
s <- get
case s of
(x:xs) -> do
put xs
return x
_ -> error "Token list is empty"
currToken :: State [Token] Token
currToken = do
s <- get
case s of
(x:_) -> return x
_ -> error "No more tokens"
hasToken :: State [Token] Bool
hasToken = do
s <- get
case s of
(_:_) -> return True
[] -> return False
nud :: Token -> State [Token] Double
nud (Number x) = return x
nud Minus = do
right <- precedenceParser 100 -- add this to prefix map
return $ -right
nud LParen = do
right <- precedenceParser 0
token <- currToken
case token of
RParen -> do
void nextToken
return right
_ -> error $ "unexpected token = " ++ show token ++ " found."
nud _ = error "only literal supported for nud"
infixPrecedence :: Token -> Double
infixPrecedence tok = case tok of
Number _ -> 0
LParen -> 0
RParen -> 0
EndTok -> 0
Minus -> 10
Plus -> 10
Star -> 20
Slash -> 20
Exp -> 30
_ -> error $ "prec not defined for = " ++ show tok
led :: Double -> Token -> TokenS
led left tok = do
case tok of
Plus -> do
right <- precedenceParser (infixPrecedence tok)
return $ left + right
Minus -> do
right <- precedenceParser (infixPrecedence tok)
return $ left - right
Star -> do
right <- precedenceParser (infixPrecedence tok)
return $ left * right
Slash -> do
right <- precedenceParser (infixPrecedence tok)
return $ left / right
Exp -> do
right <- precedenceParser $ infixPrecedence tok - 1
return $ left ** right
_ -> error $ show tok ++ "not supported"
precedenceParser :: Double -> TokenS
precedenceParser rbp = do
token <- nextToken
left <- nud token
nt1 <- currToken
go left nt1
where
go left' nt' = if rbp < infixPrecedence nt' then do
void nextToken
left'' <- led left' nt'
nt'' <- currToken
go left'' nt''
else return left'
-- Tests precedenceParser
evalExpression :: [String] -> [Double]
evalExpression = map (evalState (precedenceParser 0) . fromRight [] . parseExpression)
evalAll = evalExpression [
"1+2+3+4",
"10-2+1",
"10-5-1",
"10+2*3-8",
"3^2^3",
"(10+2)*3-8"
]Conclusion
In this post, I have simplified the implementation to help in exposition. If you look at some other examples of Pratt parsers, you will notice the code builds up a table of operators mapped to their respective prefix and infix handling functions along with their precedence. While that code is flexible, but the keys in the mapping end up being strings. In the implementation we have, we get exhaustive-pattern matching gaurantees in both nud and led functions. The table based approach does not provide you this gaurantee.
In future posts, I will be showing how we can use a similar implementation while implementing the virtual VM for lox in Haskell.