Minimal Parser Combinator in Python

2021-10-26

Parser Combinators

A parser can be viewed as a function that parses some input and produces a transformed structure based on this input. Such parsers can be combined to extend their functionality. These functions that are used to combined these parsers are called Parser Combinators

Parser combinators are higher-order functions that combine parsers and other parser combinators to produce another higher-level parser. The idea behind parser combinators are that one starts with a small definitions of parsers and combine these through different combinator functions that extends their functionality.

In this post, we will learn to build a minimal parser combinator library from scratch. The functions provided here can be used to extend the library with more parser combinators.

Let’s first get started with some imports.


# initial imports
import operator
import typing as tp

Structure of a Parser

A parser’s primary functionality at the highest level is to accept some input (say a str), and produce some structured object. This structured object could be a list of lexical tokens, an AST or may be a fully evaluated value. With that in mind, we can model the type of parser as follows:


ParserP = tp.Callable[[str], tp.Tuple[tp.Any, str]]
# A simple parser
example_parser = lambda s: (s[0], s[1:])

Notice the parser defined above is a function that takes as input a str and returns a tuple. The first element of tuple is the input that is consumed by the parser and the second element is the remaining input the parser has not consumed yet. The type of first element in the output could be a lexical token, an AST node or any structured object. For our purpose, we will consider the output to either be a str or a tuple in some cases, as you will see below.

The example_parser listed above just extract the first element of input and returns a tuple as output.

Build Small Parsers

Let’s use this model of the parser, to define a few parsers, that do simple things.

First, lets add a helper class to capture exceptions and also a parse function that will run the parsers we will be building in this post.

class ParserError(Exception):
    def __init__(self, msg, content):
        super().__init__(f"{msg}: {content}")

# parse function here, can be used to run the parsers/parser combinators that we will build below.

#  Some helper functions
def parse(p: ParserP, s: str) -> tp.Tuple[tp.Any, str]:
    (a, s) = p(s)
    return (a, s)

First we build a simple parser, which just consumes the first element of its input.


def anyChar() -> ParserP:
    def func(s):
        return (s[0], s[1:])
    return func

Note that the above parser, returns a func whose type is tp.Callable[[str], tp.Tuple[tp.Any, str]]. This is a example of a function that constructs a parser. This will be a recurring theme of this post, that is functions returning a fully constructed parsers.

And we can test this function as follows:


In [363]: parse(anyChar(), 'hello world')
Out[363]: ('h', 'ello world')

Instead of any character, if we wanted to consume a particular character as the input, then we could build another parser as follows:

def oneChar(c) -> ParserP:
    def func(s):
        if s[0] == c:
            return (s[0], s[1:])
        raise ParserError(f"Unexpected {s[0]}, expecting {c}", s)

    return func

And here are the results of running this parser

In [364]: parse(oneChar('h'), 'hello world')
Out[364]: ('h', 'ello world')

And, if we want to just parse a digit, then we can build another parser


def anyDigit() -> ParserP:
    def func(s):
        if s[0].isdigit():
            return (s[0], s[1:])
        raise ParserError(f"Expected digit, got {s[0]}", s)

    return func

Running this would give used

In [366]: parse(anyDigit(), '123')
Out[366]: ('1', '23')

Abstracting away common functionality

You will notice that the last two parsers we built perform a common operation where they check if the first character satisfies a certain condition before returning the result. We should be able extract this functionality into its own parser. We can create a function called satisfy that will return a parser. The function takes a predicate_function that needs to be satisfied if the input has to be consumer. If not, then the function throws an exception. You will notice that we will use this exception when we want to choose between multiple parsers results.


# Generic Predicate Parser
def satisfy(pred_function: tp.Callable[["char"], bool]) -> ParserP:
    def func(s):
        if not s:
            raise ParserError("Empty string", "")
        if pred_function(s[0]):
            return (s[0], s[1:])
        raise ParserError(f"Unexpected condition", s)

    return func

And now, we can update our earlier parsers to use this new defintion:

def oneCharP(c) -> ParserP:
    return satisfy(lambda c1: c == c1)


def anyDigitP() -> ParserP:
    return satisfy(lambda c: c.isdigit())

Time to build our first parser combinator

The oneCharP function parses one character out of the input. We could combine these parsers to consume more input. We can compose 2 parsers to obtain a new parser using the compose function.

First lets defined the compose function:

def compose(p1: ParserP, p2: ParserP) -> ParserP:
    def func(s):
        (a, s1) = parse(p1, s)
        (b, s2) = parse(p2, s1)
        return ((a, b), s2)

    return func

Notice, the structure of the compose function. We first run the parser p1, capture the new state s1 and use that state to parse using p2. The parser when called returns the two captured tokens a and b. Note, here that the parser returns tuple(a, b). This can vary based on the parer being built. Now, lets use the compose function.

In [369]: hp = oneChar('h')
     ...: ep = oneChar('e')
     ...: parse(compose(hp, ep), "hello world")
     ...:
     ...:
Out[369]: (('h', 'e'), 'llo world')

The compose function is an example of a parser combinator. It accepts two parsers and returns another parser.

A choice parser combinator

There will be scenarios where we have two parsers, and we have to choose between one of them. For example, if we want to consume either a h or H as first character in the input, we could use the choice parser combinator.

# A choice parser combinator

def choice(p1: ParserP, p2: ParserP) -> ParserP:
    def func(s):
        try:
            return p1(s)
        except ParserError:
            return p2(s)
    return func

The choice parser tries the first parser p1 and if that parser raises an exception (remember the satisfy function throws a ParserError), the second parser p2 is tried. Here is an example of that usage.

In [371]: hp = oneChar('h')
     ...: ep = oneChar('H')
     ...: print(parse(choice(hp, ep), "hello world"))
     ...: print(parse(choice(hp, ep), "Hello world"))
     ...:
     ...:
('h', 'ello world')
('H', 'ello world')

Using the parser combinators to parse simple expression

Using the above parser combinators, we can build a simple expression parser that can parse an expression like 1 + 2 or 1 - 2 + 3 etc.

# def expression():
#   parse digit  - we can use anyDigit()
#   parse operator  -  we need to be able to choose different options
#   parse digit  - parse the next digit

To support this expression, we also need a parser to parse the math operators (+, -, etc). Note,we are able to use mathOp (a parser) to build a more powerful parser, using the parser combinator mathOpP. The mathOpP parser combinator uses the choice parser combinator to produce a new parser! This is an example of using a multiple parser combinators to produce a new parser.

# now lets use choiceP
def mathOp(op):
    return satisfy(lambda c: c == op)

def mathOpP() -> ParserP:
    plus = mathOp("+")
    minus = mathOp("-")
    mult = mathOp("*")
    div = mathOp("/")

    return choice(plus, choice(minus, choice(mult, minus)))

We will make our first attempt in implementing the expression parser.

def expression_does_not_work():
    def func(s):

        (digit1, s1) = parse(anyDigitP(), s)
        (op, s2) = parse(mathOpP(), s1)
        (digit2, s3) = parse(anyDigitP(), s2) # this does not work

        return ((int(digit1), op, int(digit2)), s3)

    return func

Running this, we notice that + operator needs to be converted to a python callable like (operator.add) for us to be able to use this.

In [372]: parse(expression_does_not_work(), "1+2")
Out[372]: ((1, '+', 2), '')

The parsed input from mathOpP is not available for us to perform the transformation. One option is for mathOp to do the transformation. But, let’s assume that is not possible. Let’s assume the produced output from mathOp needs to be passed to another sequence of parsers before a decision can be made. This kind of scenario happens often, where subsequent parsers need to construct different parsers based on the earlier parsed input. In that case, we need to use the bind operation to produce the next parser.

Introduce the bind function


ParserF = tp.Callable[[tp.Any], ParserP]

def bind(p1: ParserP, pf: ParserF) -> ParserP:
    def func(s):
        (a, s1) = parse(p1, s)
        # call the passed in function to get the next parser `p2`. This function take an argument that was produced by the first parser `p1`
        p2 = pf(a)
        (b, s2) = parse(p2, s1)
        return (b, s2)

    return func

The bind function is different from the compose function. The two parsers that were passed to the compose function only depended on the string that is being consumed. The second parser passed to compose just takes as argument the unconsumed string from the first parser. But, this is slightly different with the bind function. The second argument passed to the bind function, first takes an input the output produced from the previous parser and produces a new parser. Then, we run the next parser. This is an important distinction. This helps parsers, that need to branch out and make decisions based on the context (that is, based on result of previous parser).

Below, you will notice how we can use the bind function. The output of the mathOp is passed to another function that returns a parser, that when called returns the add function along with the unconsumed state s.

def mathOpP() -> ParserP:
    def f(op):
        if op == "+":
            return operator.add
        elif op == "-":
            return operator.sub
        elif op == "*":
            return operator.mul
        elif op == "/":
            return operator.floordiv

    plus = bind(mathOp("+"), lambda a: lambda s: (f(a), s))
    minus = bind(mathOp("-"), lambda a: lambda s: (f(a), s))
    mult = bind(mathOp("*"), lambda a: lambda s: (f(a), s))
    div = bind(mathOp("/"), lambda a: lambda s: (f(a), s))

    return choice(plus, choice(minus, choice(mult, div)))

Now the expression function works and we can apply the op function on the two digits.


# need a way to recurse
def expression_or_digit():
    return choiceP(expression_2(), anyDigitP())

def expression_2():
    def func(s):
        (digit1, s1) = parse(anyDigitP(), s)
        (op, s2) = parse(mathOpP(), s1)
        (digit2, s3) = parse(expression_or_digit(), s2)
        return (op(int(digit1), int(digit2)), s3)

    return func

Notice that we also need a helper function so that we can recurse across multiple operands.

In [375]: parse(expression_or_digit(), "1+2+5")
Out[375]: (8, '')

Note, that this parser does not respect precedence and associativity. We need to build some more machinery to get that right. That will be out of scope of this post.

Revisiting the strP function.

Now that we have the bind function defined, we can use that function to parse a string. Remember, the earlier version of this function was producing a nested tuple.

# Parse a string
def strP1(es: str) -> ParserP:
    def f2(c):
        def f(x):
            f1 = lambda xs: lambda s: (x + xs, s)
            return bind(oneCharP(c), f1)

        return f

    def func(s):
        p = oneCharP(es[0])
        for c in es[1:]:
            p = bind(p, f2(c))
        return p(s)

    return func

For the reader, here is another example of a parser combinator, that can apply a parser zero or more times.


# Another simple combinator
def many1(p: ParserP):
    def func(s):
        result = []
        while True:
            try:
                (a, s) = parse(p, s)
                result.append(a)
                if not s:
                    break
            except ParserError:
                break
        return ("".join(result), s)

    return func

Conclusion

In this post, I have tried to cover some very basic concepts that make the parser combinators work. We built the basic buildings blocks that allow us to compose many more parsers. I did leave out the implementation of a more robust expression parser. I hope to refine that in future posts.