A Simple Implementation of Pratt Parser in Haskell
I am currently working through an implementation of language described in Crafting Interpreters book in Haskell. The chapter on Compiling Expressions introduces the user to a parser called the Pratt Parser
or Top-Down Operator Precendence
parser. During my first pass through this chapter, I found it difficult to follow how the parser was actually working, though the author does a very good job building up a the C
code in small chunks. I had to pause at this point and find other resources that would help in understanding how this algorithm worked.
First, I tried reading the author Bob Nystrom’s suggestion and visited his blog post on the same topic here. I still could not get it. Perhaps, I had some pre-concieved notion of this parser which was making it difficult for me. Then, looking around I found a more accessible explanation by Eli Bendersky here. The author does a great job discussing the algorithm with a simple Python based parser. With some staring and playing around with small AST’s, the working of the algorithm started making sense.
After understanding the code examples in Python, and later going through the original paper Top Down Operator Precedence - Vaughan R. Pratt, I realized the actual algorithm was so simple, but yet so powerful. Looking at the algorithm as some conditional depth-first traversal on a tree was very helpful for me.
In this post, I attempt to provide an mental model I used to understand how this algorithm works. Once, we get the basics down, I will show you how to implement this in Haskell for a very basic grammar. The final code we produce with demonstrate the working of the Pratt parser in setting up both associativity and precedence.
Problem
Let’s look at the problem this algorithm solves. Say, we have an expression 4 + 2 * 3
. This expression could get parsed into 2 different Abstract Syntax Trees (AST). The first one would parse the expression as (4+2)*3
and produce the result 18
, while the second AST would produce 4+(2*3)
, thus producing 10
on evaluation of the expression.
-- ast where `+` gets higher precedence than `*`
*
/ \
+ 3
/ \
4 2
-- ast where `*` gets higher precedence
+
/ \
4 *
/ \
/ \
2 3
What we need is the second AST. More fundamentally, we need to encode the precedence of these operators in the parser code to parse this expression. Usually, when a recursive-descent parser is used,the functions are arranged in such a manner such the both the associativity and precedence are handled.
The Pratt parser
is an alternate algorithm that addresses the precedence requirements. In a recursive-descent(RD) parser, the non-terminals of the grammar become entry points(functions) in our code. But, with a Pratt parser
, as you will see later in this post, the tokens
of the parsed expression drive the parsing. Based on the token
in context, different paths are taken to construct the desired AST.
Approach
Lets’ look at a more complex syntax tree. We will use this AST to understand how to approach this problem.
-- 1 + (2*3) + 4
+
/ \
+ 4
/ \
/ \
1 *
/ \
/ \
2 3
I will slightly rotate this tree to make the levels
of operators clearer.
1-----+---------+
| |
| 4
|
2--*--3
Notice that the +
operators are at one level, and *
operators are at lower level. We can view these levels at the precedence
levels. These operators operate at with respect to each other. Here the lower
the level in the diagram, the higher the precedence of the operator. Here is another example, that will help clarify, what I am trying to explain.
-- 1 + 2 * 3 / 6 - 1
-
/ \
+ 1
/ \
/ \
1 `div`
/ \
/ \
* 6
/ \
/ \
2 3
Lets rotate the tree slightly, so that we line up the operators, on the same plane as their precdence levels. Here +
and -
share the same precdence level, and *
and /
share the same precedence level.
1--------+----------`-`------1
|
|
2---*----3----/
|
6
Say, we are asked to evaluate the AST example we saw earlier:
-- 1 + 2 * 3 + 4
+
/ \
+ 4
/ \
/ \
1 *
/ \
/ \
2 3
At every node that is rooted with an operator
we need to make a decision based on operator
the subtree of the operator
is rooted on. Here, say we are at the first +
, then looking at the right sub-tree, since *
has higher precedence we will evaluate that subtree before evaluating +
. But, of course we don’t have this tree consructed already to be able to make this decision. We have to work with an expression which looks like the one below and that is where the Pratt parser helps.
1 + 2 * 3 + 4
Let’s define the precedence level of +
to be 10
and that of *
to be 20
. If we had a simple expression such as 1 + 2 * 3
, this is how our parser could work:
- Parse token
1
. Eval this value. - Parse token
+
. Say,at this point, the precdence level is10
- Parse token
2
. At this point, we need to decide if2
associates with+
or the following operator.
Now, at this point we know we would have evaluated 1
which we will call the null
denomination of +
. But, before we decide whether to evaluate 2
, we will have to look ahead to check if 2
associates to another sub-tree rooted with an operator that has higher precedence. Therefore,
- Parse token
*
Now, since the *
has higher precedence that +
, we start a new parse tree rooted at *
, evaluate 2
as the left operand of *
and then perform same steps with 3
. Once the sub-tree is evaluated, then we continue with our initial rooted tree from where we had branched off.
That is the crux of this algorithm. Logically, inspect the precedence of the operator
that is the root
of the subtree and then decide whether to evaluate the sub-tree or just evaluate the operand to the right.
Yet, another way to think about, is given an expression string 1 + 2 * 3
, we are processing each token in a depth-first search tree. At each point, we are at a precedence level (call it, rbp). Every time we need to decide whether to evaluate a token, we need to look at the precedence level of that operator that token needs to associate with. If precedence is higher, than we recurse down to that subtree. Once the recursion of the sub-tree returns, we continue with the current level.
Implementing the Pratt Parser in Haskell
A Simple Grammar
Let’s define a grammar for the expressions we will support. We will parse this expression into a Token
type list. The list of Tokens
will be the input to the Pratt parser.
We will start with support for expressions with binary
operations such as +
, -
, *
and /
. Later in the post, we will also add a negation
(unary) operator.
-- Example expressions:
1 + 2 * 3 + 4
-- parens introduce grouping
1 + 2) * 3 + 4
(
-- this is right associative, we will also see how to do this
3 ^ 2 ^ 3
.```
The tokens we will need to produce while scanning this grammar can be defined as follows:
``` haskell
data Token =
Plus
| Minus
| Star
| Slash
| Exp
| Number !Double
| LParen
| RParen
| EndTok
deriving (Eq, Show)
Set Up A Parser for the Grammar
We will use Parsec
to scan the expression and produce the list of Tokens
. I will skip the explanation of this code. For now, it is enough to assume that this scanner produces a list of Tokens
when the parseExpression
function is called with an expression string.
whitespace :: Parser ()
= void $ PS.many $ oneOf " "
whitespace
lexeme :: Parser a -> Parser a
= p <* whitespace
lexeme p
scanNumber :: Parser Token
= do
scanNumber <- lexeme $ PS.many1 digit
digits return $ Number (read digits)
scanOperator :: Parser Token
= choice $ build <$> [
scanOperator Plus, '+'),
(Minus, '-'),
(Slash, '/'),
(Star, '*'),
(Exp, '^'),
(LParen, '('),
(RParen, ')')
(
]where
build :: (Token, Char) -> Parser Token
= lexeme $ x <$ char y
build (x, y)
parseExpression :: String -> Either PS.ParseError [Token]
= do
parseExpression inp <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
toks return $ toks ++ [EndTok]
Define The State
Since we need to pass around the list of [Tokens]
while we parse the tokens into a AST, we will perform the entire computation inside a State
monad. For our simple grammar, our expression will always produce a Double
value.
type TokenS = State [Token] Double
Helper Functions For Handling State
Let’s next set up helper functions, that will let us peek
into the current token, or move our state to the next
token in the [Token]
value. Note, I have chosen to just error
out on invalid calls here for sake of exposition. I would follow different approach like, use Maybe or Either data types, in more serious code.
nextToken :: State [Token] Token
= do
nextToken <- get
s case s of
:xs) -> do
(x
put xsreturn x
-> error "Token list is empty"
_
currToken :: State [Token] Token
= do
currToken <- get
s case s of
:_) -> return x
(x-> error "No more tokens" _
Set Up InFix
Precision Map
The binary operator
examples we saw above are examples of infix
operators. We will now create a map of precedence levels for the binary operators we will support.
infixPrecedence :: Token -> Double
= case tok of
infixPrecedence tok Number _ -> 0
LParen -> 0
RParen -> 0
EndTok -> 0
Minus -> 10
Plus -> 10
Star -> 20
Slash -> 20
Exp -> 30
-> error $ "prec not defined for = " ++ show tok _
Null denominations(nud
)
These are right
operands of the operators we are evaluating. The original paper refer’s to this term as nud
and we stick to that terminology.
In our case, for the first version, the only arguments
the binary operators accept is the Number
data constructor. The nud
function just evaluates the token that is passed to it.
nud :: Token -> State [Token] Double
Number x) = return x
nud (= error "only number literal supported for nud" nud _
Left Denominations(led
)
And, now we need a way to evaluate the right
side of the operator. We handle this in a function called led
. This functions accepts the already evaluated left
operand. This function is called, when the parser encounters one of the operators. The operator parameter is also passed in as an argument. Note, that by the time this function is called, the head of [Token]
list contains the right
operand to the operator this function is called for. Therefore, first, the function evaluates its right
expression and then does the operator specific evaluation.
To start let’s just focus on +
and *
. Notice that for +
, -
, *
, /
, we first evaluate right
operand by calling into the precedenceParser
function. The expression
function accepts the precedence level of the current operator. The precedenceParser
function is the core function which will perform the precedence based depth first traversal
.
led :: Double -> Token -> TokenS
= do
led left tok case tok of
Plus -> do
<- expression (infixPrecedence tok)
right return $ left + right
Minus -> do
<- expression (infixPrecedence tok)
right return $ left - right
Star -> do
<- expression (infixPrecedence tok)
right return $ left * right
Slash -> do
<- expression (infixPrecedence tok)
right return $ left / right
Exp -> do
<- expression $ infixPrecedence tok - 1
right return $ left ** right
-> error $ show tok ++ "not supported" _
Precedence Parser
Now, we have all the peripheral functions set up. We are ready to implement the core
function of our parser. Usually, I have seen, this function named as precedenceParser
. We will use the same name.
-- rbp here is the right-bound precedence level
precedenceParser :: Double -> TokenS
= do
precedenceParser rbp <- nextToken
token <- nud token
left <- currToken
nt1
go left nt1where
= if rbp < infixPrecedence nt' then do
go left' nt'
void nextToken<- led left' nt'
left'' <- currToken
nt''
go left'' nt''else return left'
Now, lets look at this function closely. The function accepts the precedence
level as its argument. Here are the steps, the function follows to parse an expression like `1 + 2 * 3 + 4.
The function is called with
0
and atoken
state, that looks like[Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]
nextToken
, updates thetoken
state to[Plus, Number 2, Star, Number 3, Plus, Number 4]
and binds1
totoken
.We call
nud
ontoken
and this gives us the value of theleft
expression of the operator.Now we need to decide whether to just return or evaluate the deeper sub-tree based on precedence level. Remember we start with
rbp=0
. Sincerbp
is less the precedence level of+
, which is our current token (nt1 <- currToken
), we recurse with the helper functiongo
. In thego
function, we update thetoken
state to[Number 2, Star, Number 3, Plus, Number 4]
, and then call theled
function. If you recall, theled
function accepts anoperator
token and evaluates the token at the head of theToken
list.Once the
go
function recursion is completed at the given precedence level, we return the last evaluatedleft
expression.
Here is the call stack (I have also shown the state that gets passed during the calls)
.. precedenceParser 0 [Number 1, Plus, Number 2, Star, Number 3, Plus, Number 4]
....... nud (Number 1) -- we have 1
....... go 1 `+`
............. led 2 `+` [Number 2, Star, Number 3, Plus, Number 4]
..................... precendenceParser 10 [Number 2, Star, Number 3, Plus, Number 4]
........................... nud (Number 2) -- we have number 2
........................... go 2 `*`
................................. led 2 `*` [Number 3, Plus, Number 4]
.......................................... precedenceParser 20 [Number 3, Plus, Number 4]
................................................ nud (Number 3) -- we have 3
..................................................... -- no call to `go` since `20` < precedenc of `+` which is `10`
................................................ return 3
................................. return 6 (2 * 3)
..................... return 1 + 6
.........go 7 `+` [Number 4]
.............led 7 `+`[Number 4]
.....................precendenceParser 10 [Number 4]
...........................nud (Number 7) -- we have 4
...........................no more tokens, `go` not called anymore
.............return 11
.........return 11
Adding support for Prefix
opeartors
For prefix operatos like a negation
operator, we need to create a new precedence map and update our nud
function accordingly. For this post, we will hard code the prefixes in the nud
function itself. Notice, that a high value of 100 is being passed as precedence of minus
operator. We do that since, we want the argument to -
evaluated before it can be negated.
nud :: Token -> State [Token] Double
Number x) = return x
nud (Minus = do
nud <- precedenceParser 100 -- add this to prefix map
right return $ -right
= error "only literal supported for nud" nud _
The full code example that is reproduced below also handles grouping ()
as prefix operators.
Full Code Example
The code below is the entire Parser in one module. There is additional handling for Exp
operator and parens
in the code below. You can load the code into ghci
and run evalAll
to run some examples provided at the bottom of the module.
module PrattParser where
import Control.Monad.State.Strict
import Text.Parsec.String
import Text.Parsec.Char
import qualified Text.Parsec as PS
import Text.Parsec.Combinator
import Control.Applicative
import Data.Either
whitespace :: Parser ()
= void $ PS.many $ oneOf " "
whitespace
lexeme :: Parser a -> Parser a
= p <* whitespace
lexeme p
scanNumber :: Parser Token
= do
scanNumber <- lexeme $ PS.many1 digit
digits return $ Number (read digits)
scanOperator :: Parser Token
= choice $ build <$> [
scanOperator Plus, '+'),
(Minus, '-'),
(Slash, '/'),
(Star, '*'),
(Exp, '^'),
(LParen, '('),
(RParen, ')')
(
]where
build :: (Token, Char) -> Parser Token
= lexeme $ x <$ char y
build (x, y)
parseExpression :: String -> Either PS.ParseError [Token]
= do
parseExpression inp <- PS.parse (many1 (PS.try scanNumber <|> scanOperator) <* eof) "" inp
toks return $ toks ++ [EndTok]
data Token =
Plus
| Minus
| Star
| Slash
| Exp
| Number !Double
| LParen
| RParen
| EndTok
deriving (Eq, Show)
type TokenS = State [Token] Double
nextToken :: State [Token] Token
= do
nextToken <- get
s case s of
:xs) -> do
(x
put xsreturn x
-> error "Token list is empty"
_
currToken :: State [Token] Token
= do
currToken <- get
s case s of
:_) -> return x
(x-> error "No more tokens"
_
hasToken :: State [Token] Bool
= do
hasToken <- get
s case s of
:_) -> return True
(_-> return False
[]
nud :: Token -> State [Token] Double
Number x) = return x
nud (Minus = do
nud <- precedenceParser 100 -- add this to prefix map
right return $ -right
LParen = do
nud <- precedenceParser 0
right <- currToken
token case token of
RParen -> do
void nextTokenreturn right
-> error $ "unexpected token = " ++ show token ++ " found."
_ = error "only literal supported for nud"
nud _
infixPrecedence :: Token -> Double
= case tok of
infixPrecedence tok Number _ -> 0
LParen -> 0
RParen -> 0
EndTok -> 0
Minus -> 10
Plus -> 10
Star -> 20
Slash -> 20
Exp -> 30
-> error $ "prec not defined for = " ++ show tok
_
led :: Double -> Token -> TokenS
= do
led left tok case tok of
Plus -> do
<- precedenceParser (infixPrecedence tok)
right return $ left + right
Minus -> do
<- precedenceParser (infixPrecedence tok)
right return $ left - right
Star -> do
<- precedenceParser (infixPrecedence tok)
right return $ left * right
Slash -> do
<- precedenceParser (infixPrecedence tok)
right return $ left / right
Exp -> do
<- precedenceParser $ infixPrecedence tok - 1
right return $ left ** right
-> error $ show tok ++ "not supported"
_
precedenceParser :: Double -> TokenS
= do
precedenceParser rbp <- nextToken
token <- nud token
left <- currToken
nt1
go left nt1where
= if rbp < infixPrecedence nt' then do
go left' nt'
void nextToken<- led left' nt'
left'' <- currToken
nt''
go left'' nt''else return left'
-- Tests precedenceParser
evalExpression :: [String] -> [Double]
= map (evalState (precedenceParser 0) . fromRight [] . parseExpression)
evalExpression
= evalExpression [
evalAll "1+2+3+4",
"10-2+1",
"10-5-1",
"10+2*3-8",
"3^2^3",
"(10+2)*3-8"
]
Conclusion
In this post, I have simplified the implementation to help in exposition. If you look at some other examples of Pratt parsers, you will notice the code builds up a table of operators mapped to their respective prefix
and infix
handling functions along with their precedence. While that code is flexible, but the keys in the mapping end up being strings. In the implementation we have, we get exhaustive-pattern matching gaurantees in both nud
and led
functions. The table based approach does not provide you this gaurantee.
In future posts, I will be showing how we can use a similar implementation while implementing the virtual VM for lox
in Haskell.