Alex by example (Write You A Python Lexer)

2021-04-04

Contents

  1. Objective
  2. Preparation
  3. Basic Lexer
  4. Support Numeric Values and String Literals
    1. Lexer for Numeric Values
    2. Lexer for String Literals
  5. Capture the Source Location of Tokens
  6. Store State in State Monad
  7. Support Whitespace Context (and have a Full Lexer!)
    1. The Complexity of Newlines, Indents and Dedents
    2. Capture Newlines, INDENTs and DEDENTs
    3. Addressing Parens and Comments (with Newlines)
    4. Special handling for Comments
    5. Matching Parens
  8. That’s a wrap!
  9. References

Objective

The main goal of this tutorial is to build a lexer for the Python 3 language specification (tested on 3.9) as described here. By working through this tutorial we will replicate the complete behavior of the tokenizer implemented in Lib/tokenize.py and described here. The final implementation in this tutorial has 80+ tests that were directly extracted from the test fixtures and expected results used in test/test_tokenize.py.

By working towards this goal, the reader should acquire enough familiarity to implement a real world lexer for a serious language using Alex.

The objectives of this tutorial are as follows:

  • Build a real-world project using Alex (here we build a lexer for Python 3.9)
  • Learn the different features of Alex as we incrementally build the lexer (macros, set expressions, rules, actions, error handling, threading state)
  • Motivate and use approaches that are required to implement a complete Lexer (context-sensitive lexer with State)

What this article is not

  • This article is not a replacement for the more precise documentation on Alex. But you will benefit greatly from this article if you refer to that documentation as you work through the examples.
  • It does not teach you about lexers and their theory. Some basic understanding of how a lexer works is expected, but reading through the examples should also provide sufficient guidance on how the framework guides you in building a lexer without needing much theory.

Who is this article for

If you are interested in implementing a real-world lexer and using it in your projects, then working through the examples in this article should give you enough knowledge and familiarity.

  • The article assumes beginner-level knowledge of Haskell. The later examples use State monads, and that is as advanced as it gets.
  • Some familiarity with lexers, at least at a very basic level, is assumed
  • Familiarity with Regular Expressions is assumed (we use them a lot in this tutorial!)
  • Familiarity with Python 3 language syntax is assumed

1.1 How the tutorial is organized

Preparation

We build the basic plumbing we need in terms of project creation and adding the required boilerplate code that Alex needs to generate the lexer.

Basic Lexer

We build a basic lexer that supports identifiers and integers. We use this iteration to also provide the functions Alex expects us to implement, extending what we built in the Preparation section.[1]

Support Numeric Values and String Literals

We will use the character set and macro features provided by Alex to support decimals and different floating point representations, including values like 42, 42e10 and 0.42e-10.

Since we are building a lexer for Python 3, we have an interesting task in supporting a wide range of string representations. The variety of string representations supported by Python is described here. Our lexer should support all of these representations by the end of this section.

Capture the Source Location

By now you will have noticed that we have not captured the token positions as they appear in the source code. In this iteration we will handle that and capture the exact starting and ending positions of each token we generate. While doing this we will try to stay true to how the Python tokenizer works in terms of determining which offsets to store.

Store the Lexer State in a State Monad

The Python language is whitespace-sensitive. Until this iteration we did not have to deal with that, but we will have to build support for it soon. Before we do, we have to build the machinery to capture the context the lexer is in. We capture information on whether we encountered newlines, empty lines, comments and whitespace (representing indentation).

Support Whitespace Context (and a Complete Lexer!)

We start adding our initial support for whitespace sensitivity by adding INDENT and DEDENT tokens. Then we handle the complexity involved with newlines when mixed with comments and parens.

By the end of this section we will have a fully functioning lexer which is an (almost) exact replica, up to the tests, of the Python implementation of the tokenizer.

A word on the Code used in this tutorial

The tutorial uses code that is published at write-you-a-lexer. There are five Example folders under the src folder. Each Example folder can be run independently as shown in the section below.

Each incremental iteration of our lexer is implemented as an Example folder.

Each Example folder also has corresponding tests in test_tokenizer*.hs. Each of these tests refers to fixtures inside the test_fixtures folder. Using the files in that folder, tests are generated for each lexer implementation across all Example* folders. The test_fixtures folders are numbered to match the Example number. For example, all fixtures for the lexer implemented under the Example4 folder can be run using test_tokenizer4.hs, which in turn uses the test_fixtures/4/*.txt files as its test fixtures.

2. Preparation (Example1)

As part of the preparation for the tutorial, we will create a stack project (using rio as the template).

stack new wya-lexer rio
cd wya-lexer

Alternatively, the project can be cloned from here. All code used in this tutorial is from this repository.

Changes to package.yaml

You can refer to the package.yaml. An important thing to note in this file is that we add a cabal directive as follows:

build-tools: alex

This directive allows cabal to run alex on any file with the .x extension, which makes the modules generated from those files available to the project. More information on this directive can be found in the documentation for cabal.

2.1 Types and Functions required by Alex

For the initial setup, we will add the following types and functions required by Alex. We will refine them in future iterations. For now, we add the following definitions to src/Example1/LexerUtil.hs. More detailed information about why we need these functions and types can be found in the Alex documentation. We will also get familiar with them as we work through this tutorial.

type Byte = Word8
type AlexInput = (Char, [Byte], String)

-- `encode` here is a UTF-8 encoder (e.g. from Codec.Binary.UTF8.String)
alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (c,b:bs,s) = Just (b,(c,bs,s))
alexGetByte (_,[],[]) = Nothing
alexGetByte (_,[],c:s)  =  case encode [c] of
    b : bs -> Just (b, (c, bs, s))
    [] -> error $ "No byte returned for " ++ show c

One way to understand how these definitions work is to imagine Alex starting with the initial AlexInput. Alex then iteratively calls alexGetByte to get the next byte and tries to match the rules we provide as it reads each byte. Once the bytes of a match have been read, Alex calls the corresponding action and then continues scanning with the new input returned by the earlier calls to alexGetByte. Note that Alex does not make any assumptions about the type of AlexInput; it just needs a value that it can pass back to alexGetByte. The state of this input determines what alexGetByte can work on to produce the next byte.
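
As an illustration of that loop, here is a small sketch (not part of the project; it assumes the definitions above, with encode performing the UTF-8 encoding) that drains an input byte by byte, the same way Alex does:

allBytes :: AlexInput -> [Byte]
allBytes inp = case alexGetByte inp of
  Nothing        -> []
  Just (b, inp') -> b : allBytes inp'

-- e.g. allBytes ('\n', [], "λx") == [206, 187, 120]
-- ('λ' encodes to two UTF-8 bytes, 'x' to one)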

We will be refining the AlexInput type and the alexGetByte function throughout this tutorial.

Since Alex translates every match of the rules we provide into a token, we need to define the tokens we want to capture. We will define the first version as follows in Example1.Tokens.

-- basic tokens
data Token =
      Name
    | Number
    | String
    | Op
    deriving (Eq, Show)

-- the augmented structure that will store more information about the tokens
data TokenInfo = TokenInfo {
    token_type:: Token
  , token_string:: T.Text
  , start_pos:: (Int, Int)
  , end_pos:: (Int, Int)
  }
  deriving (Show, Eq)

Now, in Example1.LexerUtil, we also need to provide an action function that Alex will call to generate the TokenInfo value. The initial implementation here is simple and will be improved later on.

-- adapted from the Alex documentation

type AlexAction = Token -> AlexInput -> Int -> TokenInfo

action :: AlexAction
action tok inp inp_len = let
  (c, _, s) = inp
  in
    TokenInfo
    {
      token_type = tok,
      token_string=T.pack (take inp_len s),
      start_pos=(0, 0),  --- this will be set in correct values later
      end_pos=(0, 0)     --- this will be set in correct values later
    }
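
A quick, hypothetical GHCi check (assuming the definitions above are in scope) shows how action slices the matched text out of the input; the 3 plays the role of the match length that Alex would pass in:

λ >>action Name ('\n', [], "zzz = 1") 3
TokenInfo {token_type = Name, token_string = "zzz", start_pos = (0,0), end_pos = (0,0)}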

Now, we are ready to define our first version of Lexer rules. That will be our task in the first iteration.

3. Basic Lexer (Example1)

In this iteration, we will define a basic set of rules that captures any valid (non-Unicode) Python identifier and also captures decimal values. We will skip newlines and whitespace in this iteration.

Alex provides two building blocks for defining rules. The first is the character set, which describes a set of characters to match; identifiers for character sets are prefixed with the $ symbol. Character sets can be combined and defined as macros; identifiers for macros are prefixed with the @ symbol. Now let's use these two basic constructs to define our first set of lexer rules.

The entire file is here

--- (Example1.Lexer.x)

--- the character set rules
$digit = 0-9   -- digits
$alpha = [a-zA-Z_]  -- alphabetic characters
$white_no_nl = [\ \t]

-- a macro that combines the character set rules
-- here this macro captures any valid Python identifier
@identifier = $alpha [$alpha $digit \_]*

-- a macros that combines character sets to capture a decimal number
@decnumber = (0(_?0)* | [1-9](_?[0-9])*)

-- a macro that currently just includes decimal numbers.
-- Later on we will add more alternatives to this definition
@number = @decnumber

Now it is time to use these macros and let Alex call our action function on matching tokens so that we can build our TokenInfo values. Next, we define the rules.

--- (Example1.Lexer.x)

tokens :-
       --- instruct Alex to call `action` with the `Number` token when the set of characters matches the @number macro
       @number {action Number}

       -- instruct Alex to call `action` with the `Name` token when the set of characters matches the @identifier macro
       @identifier {action Name}

       --- for now we skip whitespaces/newlines/tabs etc
       $white+ ;     -- this ignores new lines as well

Finally, to run the lexer, we provide the following function in Example1.Lexer.x to satisfy module dependencies. We will invoke it from one of our entry-point functions. It recursively calls alexScan until all tokens are scanned and Alex returns an AlexEOF value. (Internally, alexScan calls the alexGetByte function we provided earlier.) Note that this function returns the list of TokenInfo values.


-- adapted from https://www.haskell.org/alex/doc/html/wrappers.html
-- lexer :: String -> [TokenInfo]   (same shape as alexScanTokens from the Alex docs)
lexer str = go ('\n',[],str)
  where go inp@(_,_bs,str) =
          case alexScan inp 0 of
            AlexEOF -> []
            AlexError _ -> error "lexical error"
            AlexSkip  inp' len     -> go inp'
            AlexToken inp' len act -> act inp len : go inp'

To run the lexer, we have a helper function in Example1.LexerRunner.hs that can be run from ghci.


λ >>import Example1.LexerRunner
λ >>runLexer "zzz"
Right [TokenInfo {token_type = Name, token_string = "zzz", start_pos = (0,0), end_pos = (0,0)}]

Note that zzz lexes to a Name token. The start_pos and end_pos are not set yet; we will fix that later.

Here are other examples: decimals generating Number tokens, and an example of skipping whitespace.


it :: Either String [TokenInfo]
λ >>runLexer "42"
Right [TokenInfo {token_type = Number, token_string = "42", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [TokenInfo]
λ >>runLexer "42\nvariable"
Right [TokenInfo {token_type = Number, token_string = "42", start_pos = (0,0), end_pos = (0,0)},TokenInfo {token_type = Name, token_string = "variable", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [TokenInfo]
λ >>

Here is a simplified picture that describes the control flow


runLexer
  |
  |
  V
 lexer (runs in a loop until EOF) <--------------------------
  |                                                          ^
  |                                                          |
  V                                                          |
 alexScan inp 0 (alex scans input for a matching rule)       |
  |                                                          |
  |  (matching rule found, then call corresponding action)   |
  V                                                          |
 action  (if tok == EOF, stop; else loop back)-------------->

With this we have our first iteration of the lexer! We used this iteration to set up the modules, types and functions. We will build upon these simple types and functions in future iterations.

4. Support Numeric Values and String Literals (Example2)

Python has diverse support for how numeric values and string literals can be written in Python code. In this iteration, our objective is to define the rules that handle the entire set of numeric values and string literals supported by Python. The lexical grammar for string literals and numeric values can be found here and here.

4.1 Adding support for numerical values

We will first support numeric values and then work our way towards supporting string literals. The listing below shows the entire set of rules for supporting numeric values. The text following it breaks down these rules and provides some explanation.

--- Example2/Lexer.x

$nonzerodigit = [1-9]
$bindigit     =  [01]
$octdigit     = [0-7]
@hexdigit     =  $digit | [a-f] | [A-F]
@decinteger = (0(_?0)* | [1-9](_?[0-9])*)
@hexinteger   = ([0][xX](_?[0-9a-fA-F])+)
@bininteger = (0[bB](_?[01])+)
@octinteger = (0[oO](_?[0-7])+)
@intnumber    =  @decinteger | @bininteger | @octinteger | @hexinteger

@digitpart     =  $digit([_]|$digit)*
@fraction      =  [\.] @digitpart
@pointfloat    =  (@digitpart)* @fraction | @digitpart[\.]

@exponent      =  [eE] ([\+\-]?) @digitpart
@exponentfloat =  (@digitpart | @pointfloat)* @exponent

@floatnumber   =  @pointfloat | @exponentfloat

@imagnumber    =  (@floatnumber | @digitpart) [jJ]
@number = @imagnumber | @floatnumber | @intnumber

We first declare our primitive character sets as follows

$digit        = [0-9] -- this is provided by Alex. We leave it here for completeness
$nonzerodigit = [1-9]
$bindigit     =  [01]
$octdigit     = [0-7]

Now, let's support decimal integers such as 42, 42_00 and 00_00. We declare similar macros for hex, binary and octal numbers.

@decinteger = (0(_?0)* | [1-9](_?[0-9])*)
@hexinteger   = ([0][xX](_?[0-9a-fA-F])+)
@bininteger = (0[bB](_?[01])+)
@octinteger = (0[oO](_?[0-7])+)

Finally, we put together all integer representations into one macro

@intnumber    =  @decinteger | @bininteger | @octinteger | @hexinteger
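
For reference, here are a few ordinary Python literals that each of these macros is meant to accept:

-- @decinteger : 42, 42_000, 0, 0_0
-- @hexinteger : 0x2a, 0X_2A
-- @bininteger : 0b1010, 0B10_10
-- @octinteger : 0o52, 0O5_2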

Next, we deal with floating point representations. Note that Python supports floating point numbers such as .42, 0.42, 42. and 42_42.42_42.

@digitpart     =  $digit([_]|$digit)*
@fraction      =  [\.] @digitpart
@pointfloat    =  (@digitpart)* @fraction | @digitpart[\.]

We also need to add support for exponents, such as 1e12, 1_1e1_1, 0.12e12_1 etc.

@exponent      =  [eE] ([\+\-]?) @digitpart
@exponentfloat =  (@digitpart | @pointfloat)* @exponent

Finally, we need a simple declaration for imaginary numbers.

@imagnumber    =  (@floatnumber | @digitpart) [jJ]

Now, we put together all our rules for supporting integers, floats and imaginary numbers.

@number = @imagnumber | @floatnumber | @intnumber

4.2 Lexer for String Literals

As you may already be aware, Python supports specifying string literals in diverse ways. Here are some samples of strings that can appear inside a Python program.

"abc"  # double quoted string
'abc'  # single quoted string
"abc\  # string with line continuation character
def"
"""
This is an example of a multi-line string
"""
'''
Another example of a multi-line string
'''

# supported string prefixes (combinations of r, f, b, u)
b'\xe\0e\xe\xe'  # byte string
rf"string with prefixes"

At first, supporting these variations might seem daunting. Fortunately, the Python lexical grammar documentation provides well-broken-down definitions that we can use to support this variety of representations.

# https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U" | "f" | "F"
                     | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

# byte string

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

We will use the above definitions to define the rules our lexer needs. We will deal with regular strings first; the same logic applies to byte strings. We take the same approach as we took for numeric values: the listing below displays the full set of rules we will have defined by the end of this section.

-- start string related rules
-- adapted from https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
$lf = \n  -- line feed
$cr = \r  -- carriage return
@eol_pattern = $lf | $cr $lf | $cr

@stringprefix    =  r | u | R | U | f | F
                     | fr | Fr | fR | FR | rf | rF | Rf | RF

$short_str_char = [^ \n \r ' \" \\]
$shortstringchar_nosinglequote = [^ ' \\ \n]
$shortstringchar_nodoublequote = [^ \" \\ \n]
@stringescapeseq =  [\\](\\|'|\"|@eol_pattern|$short_str_char)
@shortstringitem_single =  $shortstringchar_nosinglequote | @stringescapeseq
@shortstringitem_double =  $shortstringchar_nodoublequote | @stringescapeseq
@shortstring     =  ' @shortstringitem_single* ' | \" @shortstringitem_double* \"

$longstringchar  = [. \n] # [' \"]
@longstringitem_single  =  $longstringchar | @stringescapeseq | @one_single_quote | @two_single_quotes | \"
@longstringitem_double  =  $longstringchar | @stringescapeseq | @one_double_quote | @two_double_quotes | \'
@longstring      =  (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")

@stringliteral   =  (@stringprefix)* (@shortstring | @longstring)

That looks complicated. Let’s break it up and look at this in parts. First let’s define the character sets and the string prefixes that will be supported.

$lf = \n  -- line feed
$cr = \r  -- carriage return
@eol_pattern = $lf | $cr $lf | $cr

@stringprefix    =  r | u | R | U | f | F
                     | fr | Fr | fR | FR | rf | rF | Rf | RF

We will create two groups of rules. The first group, prefixed with short, will support strings enclosed in single or double quotes. The second group will support long strings that are created by enclosing them in either ''' or """. Let's start with the first group.

For a short string representation, we don't want to allow the \n, \r, ', " or \ characters unless they are escaped by a backslash (\). The exception to this rule is that we allow unescaped single quotes inside a double-quoted string and vice versa. The following rules define these requirements.

-- these characters only allowed with escape sequence
$short_str_char = [^ \n \r ' \" \\]

-- with escape character (\) we allow certain characters
@stringescapeseq =  [\\](\\|'|\"|@eol_pattern|$short_str_char)

-- these characters not allowed inside a single-quoted string
$shortstringchar_nosinglequote = [^ ' \\ \n]

-- these characters not allowed inside a double-quoted string
$shortstringchar_nodoublequote = [^ \" \\ \n]

-- combine the above rules to identify a valid character in single quoted string
@shortstringitem_single =  $shortstringchar_nosinglequote | @stringescapeseq

-- combine the above rules to identify a valid character in a double quoted string
@shortstringitem_double =  $shortstringchar_nodoublequote | @stringescapeseq

-- put together above rules to created a double-quoted or single-quoted string
@shortstring     =  ' @shortstringitem_single* ' | \" @shortstringitem_double* \"

For the long-form string, the rules are simpler, since many more characters are allowed inside this string without escaping

-- adapted from the language-python package

-- the only characters not allowed are the ' or " quotes, depending on which quote style wraps the string
$longstringchar  = [. \n] # [' \"]

-- set up for single quoted long string
@longstringitem_single  =  $longstringchar | @stringescapeseq | @one_single_quote | @two_single_quotes | \"

-- set up for double quoted long string
@longstringitem_double  =  $longstringchar | @stringescapeseq | @one_double_quote | @two_double_quotes | \'
@longstring      =  (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")
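
Note that @one_single_quote, @two_single_quotes, @one_double_quote and @two_double_quotes are used above but not shown. Presumably they are defined along the lines of the language-python package, matching one (or two) quote characters that are not followed by the closing quote; this is an assumption, and the definitions in the repository may differ:

-- assumed definitions (not shown in the listings above)
@one_single_quote  = ' [^ ']
@two_single_quotes = '' [^ ']
@one_double_quote  = \" [^ \"]
@two_double_quotes = \"\" [^ \"]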

And finally, we put together both of these definitions to add support for a string literal.

@stringliteral   =  (@stringprefix)* (@shortstring | @longstring)

We can follow the same reasoning for the byte-string rules.
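
The tokens section below also refers to a @bytesliteral macro that is not listed here. A sketch of how it can be assembled by mirroring the string rules above (ignoring the ASCII-only restriction for simplicity; the definitions in the repository may differ):

@bytesprefix  = b | B | br | Br | bR | BR | rb | rB | Rb | RB
@shortbytes   = ' @shortstringitem_single* ' | \" @shortstringitem_double* \"
@longbytes    = (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")
@bytesliteral = (@bytesprefix) (@shortbytes | @longbytes)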

Finally, we update our actions to use the new rules we created above. This is how the tokens section looks at the end of this iteration.

tokens :-
       @number {action Number}
       @identifier {action Name}
       @stringliteral | @bytesliteral {action String}
       $white+ ;     -- this ignores new lines as well

Compiling and running our example, we see:

λ >>L2.runLexer "'test'"
Right [TokenInfo {token_type = String, token_string = "'test'", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]
λ >>L2.runLexer "'''this this'''"
Right [TokenInfo {token_type = String, token_string = "'''this this'''", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]
λ >>L2.runLexer "42_42.e+42_42"
Right [TokenInfo {token_type = Number, token_string = "42_42.e+42_42", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]

More examples can be found in the test fixtures for this iteration under test/test_fixtures/2.

5. Capture The Source Location of Tokens (Example3)

Up until now, you will have noticed that the start_pos and end_pos values in TokenInfo are set to (0, 0). We have not captured the location in the input string that was used to generate each TokenInfo value. We will fix that in this iteration.

Before that let’s add support for special characters to our lexer.

Thus far, our lexer only produces NAME, STRING and NUMBER tokens. Let's extend the definition of Token to capture all the special characters and their meanings as follows:

-- Example3/Tokens.hs

data Token =
    .....

    | Lpar
    | Rpar
    | Lsqb
    | Rsqb
    | Colon
    | Comma
    | Semi
    | Plus
    | Minus


    ....
    ...

We also add the corresponding change to the Example3/Lexer.x module.

-- sample of operators we add to Example3/Lexer.x

      "("   { action Lpar }
      ")"   { action Rpar }
      "["   { action Lsqb }
      "]"   { action Rsqb }
      "{"   { action Lbrace }
      "}"   { action Rbrace }
      "..." { action Ellipsis}
      "->"  { action RArrow }
      "."   { action Dot }
      "~"   { action Tilde }
      "+"   { action Plus }
      "-"   { action Minus }
      "**"  { action DoubleStar }
      "*"   { action Star }

With this our lexer supports expressions such as x + y, abc = "xyz" + "def", (a + b) + c etc.

Now, let's get on to the task of capturing the source location in our TokenInfo values. If you recall, in the Preparation section we added functions and types to work with the Alex runtime. We will be updating those here. First, we update AlexInput so that it captures the location, updating it inside the alexGetByte function.

Here is what we need to do. First, update AlexInput so that it holds another attribute of type AlexPosn. We use this value to capture the starting location from which the next token was captured (at which point the action function is called).

Example3/LexerUtil.hs

-- adapted from https://www.haskell.org/alex/doc/html/wrappers.html
data AlexPosn = AlexPosn !Int  -- absolute character offset
                       !Int  -- line number
                       !Int  -- column number
                       deriving (Show, Eq)
type AlexInput = (AlexPosn,     -- current position,
                  Char,         -- previous char
                  [Byte],       -- rest of the bytes for the current char
                  String)       -- current input string

Next, we will extend the alexGetByte function to update the AlexPosn value every time a complete character has been read. To help with this, we use a couple of helper definitions. Note that we adapted this code from the wrapper Alex usually generates (we are not using the wrappers in this tutorial).

alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (pn, c,b:bs,s) = Just (b,(pn, c,bs,s))
alexGetByte (_, _,[],[]) = Nothing
alexGetByte (p,_,[],c:s)  = let
  p' = alexMove p c
  in
  case encode [c] of
    b : bs -> p' `seq` Just (b, (p', c, bs, s))
    [] -> error $ "Not byte returned for " ++ show c


alexStartPos = AlexPosn 0 1 1

-- adapted from code generated by Alex (we removed the special case for \t, since Python counts a tab as offset 1)
alexMove :: AlexPosn -> Char -> AlexPosn
-- alexMove (AlexPn a l c) '\t' = AlexPosn (a+1)  l     ((c+8-1) `div` 8*8+1)  -- tabsize=8
alexMove (AlexPosn a l _) '\n' = AlexPosn (a+1) (l+1)   1
alexMove (AlexPosn a l c) _    = AlexPosn (a+1)  l     (c+1)
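
As a quick sanity check, folding alexMove over a short string shows the absolute offset, line and column advancing as expected (a hypothetical GHCi session):

λ >>foldl alexMove alexStartPos "ab\nc"
AlexPosn 4 2 2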

And finally, we have to use this information when we construct the TokenInfo value in our action function. Note that we can use the starting position in AlexPosn and inp_len (the length of the currently matched token) to initialize start_pos and end_pos.

action :: AlexAction
action tok inp inp_len = let
  (AlexPosn _ line col, c, _, s) = inp  -- Get the AlexPosn and use that info to calculate start, end positions
  in
    TokenInfo
    {
      token_type = tok,
      token_string=T.pack (take inp_len s),
      start_pos=(line, col),
      end_pos=(line, col+inp_len-1)
    }

With these changes, all TokenInfo values constructed in the action function will also capture the source location.

Here are some examples

λ >>import Example3.LexerRunner as L3
λ >>L3.runLexer "a=b+c\nx=1"
Right [TokenInfo {token_type = Name, token_string = "a", start_pos = (1,1), end_pos = (1,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (1,2), end_pos = (1,2)},TokenInfo {token_type = Name, to
ken_string = "b", start_pos = (1,3), end_pos = (1,3)},TokenInfo {token_type = Plus, token_string = "+", start_pos = (1,4), end_pos = (1,4)},TokenInfo {token_type = Name, token_string = "c", start_pos = (1,5), end
_pos = (1,5)},TokenInfo {token_type = Name, token_string = "x", start_pos = (2,1), end_pos = (2,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (2,2), end_pos = (2,2)},TokenInfo {token_type = Number, token_string = "1", start_pos = (2,3), end_pos = (2,3)}]
it :: Either String [Example3.Tokens.TokenInfo]
λ >>

6. Store State in a State Monad (Example4)

By now you might have noticed that we are not handling newlines or indentation yet. Python's grammar is context-sensitive, and the lexer we have constructed so far does not use the context of surrounding tokens. We did not have to store any lexer state until now, but to support the whitespace-sensitive requirements of the Python grammar we will have to preserve that context.

There are many approaches to solving this problem. One could achieve it entirely with the start code machinery provided by Alex. In this tutorial, we will use the State monad machinery to store and retrieve the context. As earlier, we will use the code Alex would have generated if we had used the %wrapper monadUserState directive. One could easily replace this code with the stock State monad.

Let's first think about all the information we would like to store in our state. You might have noticed that we are already implicitly carrying state around in the form of AlexInput, which gets passed to the alexScan function and updated in the alexGetByte function. So maybe we can fold that information into our new state. We also need to store additional information to handle newlines, indentation, empty lines, lines with just comments, etc. Later on we will also have to handle cases where parentheses are used to break statements up across multiple lines.

With these requirements in mind, we define the following two data structures. Again, they are adapted heavily from the monadUserState wrapper Alex provides and could easily be replaced by a State monad with a custom structure. But we stick to these definitions to stay close to the Alex documentation and to be consistent with other resources you may find on the internet.

First, we abstract AlexInput and AlexPosn into our new state definition.


-- Example4/LexerUtil.hs

-- The Alex state monad
-- This record encapsulates all of the `AlexInput` attributes and also contains the `AlexPosn`.
-- We also add the `AlexUserState` to this structure; we will use it later
data AlexState = AlexState {
        alex_pos :: !AlexPosn,  -- position at current input location
        alex_inp :: String,     -- the current input
        alex_chr :: !Char,      -- the character before the input
        alex_bytes :: [Byte],   -- rest of the bytes for the current char
        alex_scd :: !Int,       -- the current startcode
        -- Used in later iterations
        alex_ust :: AlexUserState -- AlexUserState will be defined in the user program
    }

data AlexUserState = AlexUserState {
       userStateStartCode :: !Int,
       userStateIndentStack :: [Int],
       userStatePendingTokens :: [TokenInfo],
       userStateParenStack:: [Token],
       userStatePrevComment :: Bool,
       userStatePrevToken :: Token
     }
     deriving (Show)

We will now define a new type Alex a and the required monadic instances for it.

-- Example4/LexerUtil.hs

newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) }

instance Functor Alex where
  fmap f a = Alex $ \s -> case unAlex a s of
                            Left msg -> Left msg
                            Right (s', a') -> Right (s', f a')

instance Applicative Alex where
  pure a   = Alex $ \s -> Right (s, a)
  fa <*> a = Alex $ \s -> case unAlex fa s of
                            Left msg -> Left msg
                            Right (s', f) -> case unAlex a s' of
                                               Left msg -> Left msg
                                               Right (s'', b) -> Right (s'', f b)

instance Monad Alex where
  m >>= k  = Alex $ \s -> case unAlex m s of
                                Left msg -> Left msg
                                Right (s',a) -> unAlex (k a) s'
  return = pure

We will also add a getter/setter for this state monad. We will use these functions to get/set the AlexInput and AlexPosn values.

-- needed by Alex
alexError :: String -> Alex a
alexError message = Alex $ const $ Left message

alexGetInput :: Alex AlexInput
alexGetInput
 = Alex $ \s@AlexState{alex_pos=pos,alex_chr=c,alex_bytes=bs,alex_inp=inp__} ->
        Right (s, (pos,c,bs,inp__))

alexSetInput :: AlexInput -> Alex ()
alexSetInput (pos,c,bs,inp)
 = Alex $ \s -> Right (s{alex_pos=pos,alex_chr=c,alex_bytes=bs,alex_inp=inp}, ())
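
Later code also relies on alexGetUserState and alexSetUserState (names that come from Alex's monadUserState wrapper). A minimal sketch consistent with the AlexState record above (the versions in the repository may differ):

alexGetUserState :: Alex AlexUserState
alexGetUserState = Alex $ \s -> Right (s, alex_ust s)

alexSetUserState :: AlexUserState -> Alex ()
alexSetUserState ust = Alex $ \s -> Right (s{alex_ust=ust}, ())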

Now, we update our lexer function to use the state monad. Note that we break up the original version from Example3: the lexer function first gets the input state and calls alexScan; after alexScan returns, we update the input state and then call the action function. This is where we capture state between calls to alexScan.

Since lexer returns only a single token, we need to call it recursively until we reach the EOF token. The recursion is handled in the lexerFold function, which invokes lexer repeatedly until it detects the EOF token and then returns all the collected TokenInfo values.


lexer :: Alex TokenInfo  -- This is our state monad
lexer = do
      inp <- alexGetInput
      case alexScan inp 0 of
        AlexEOF -> alexEOF
        AlexError ((AlexPosn _ line column),_,_,_) -> alexError $ "lexical error at line " ++ (show line) ++ ", column " ++ (show column)
        AlexSkip  inp' _len -> do
          alexSetInput inp'
          lexer -- skip and loop around
        AlexToken inp' len action -> do
          alexSetInput inp'
          action inp len

-- adapted from language-python
lexerFold :: Alex [TokenInfo]
lexerFold = loop []
  where
    loop toks = do
      token_info@TokenInfo {..} <- lexer
      case token_type of
        EOF -> return $ L.reverse toks
        _ -> loop (token_info : toks)

-- from generated Alex file
runAlex :: String -> Alex a -> Either String a
runAlex inp (Alex f) =
  case f
    ( AlexState
        { alex_pos = alexStartPos,
          alex_inp = inp,
          alex_chr = '\n',
          alex_bytes = [],
          alex_ust = alexInitUserState,
          alex_scd = 0
        }
    ) of
    Left msg -> Left msg
    Right (_, a) -> Right a

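The lexer function above also calls alexEOF, which is not shown in the listing. A minimal sketch of what it needs to do, namely construct a terminating EOF token at the current position (the actual version in the repository may differ):

alexEOF :: Alex TokenInfo
alexEOF = do
  (AlexPosn _ line col, _, _, _) <- alexGetInput
  return $ TokenInfo
    { token_type   = EOF
    , token_string = T.empty
    , start_pos    = (line, col - 1)
    , end_pos      = (line, col - 1)
    }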

We also update the action function to return Alex TokenInfo instead of TokenInfo. Note that we also adjust the start_pos and end_pos values to align with the values produced by the Python tokenizer.


-- helper function to construct the TokenInfo value
-- We will add more logic in this function later
constructToken :: Token -> (AlexPosn, b1, c1, String) -> Int -> (AlexPosn, b2, c2, d) -> TokenInfo
constructToken tok inp inp_len n_inp = let
    (AlexPosn _ line col, _, _,s) = inp
    (AlexPosn _ nline ncol, _c, _rest, _s) = n_inp -- new input in state

    start_pos = (line, col - 1)
    tok_str = T.pack $ take inp_len s
    end_pos = (nline, ncol -1)
    in
      TokenInfo {
              token_type=tok,
              token_string=tok_str,
              start_pos=start_pos,
              end_pos=end_pos
              }

type AlexAction result = AlexInput -> Int -> Alex result

-- Update action to run inside the Alex state
action :: Token -> AlexAction TokenInfo
action tok inp inp_len = do
       -- this has new updated input
       n_inp@(AlexPosn _ nline ncol,c, rest,s) <- alexGetInput
       return $ constructToken tok inp inp_len n_inp

With these changes, we have completely abstracted AlexInput and AlexPosn away into the state monad. We added a custom user state, which we will use soon. We also threaded the state through the alexScan calls until we see the EOF token.

Let's now test our changes in GHCi.

λ >>import Example4.LexerRunner as L4
λ >>L4.runLexer "a=b+c"
Right [TokenInfo {token_type = Name, token_string = "a", start_pos = (1,0), end_pos = (1,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (1,1), end_pos = (1,2)},TokenInfo {token_type = Name, token_string = "b", start_pos = (1,2), end_pos = (1,3)},TokenInfo {token_type = Plus, token_string = "+", start_pos = (1,3), end_pos = (1,4)},TokenInfo {token_type = Name, token_string = "c", start_pos = (1,4), end_pos = (1,5)}]
it :: Either String [Example4.Tokens.TokenInfo]
λ >>

7. Supporting Whitespace context (and having a Full Lexer!) (Example5)

It has been interesting so far, but we are not done yet. This will probably be the most difficult task to implement; we went through the process of setting up the state monad just to accomplish the task at hand. Let's look at some examples of how the Python tokenizer behaves when it comes to handling whitespace.

7.1 The Complexity of Newlines, Indents and Dedents

In [55]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:     x
    ...:         y
    ...: """
    ...:
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[55]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    x \n'),
 TokenInfo(type=1 (NAME), string='x', start=(3, 4), end=(3, 5), line='    x \n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 6), end=(3, 7), line='    x \n'),
 TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        y\n'),
 TokenInfo(type=1 (NAME), string='y', start=(4, 8), end=(4, 9), line='        y\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='        y\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]

You will notice that we have a NEWLINE followed by an INDENT, and then finally two DEDENTs so that the positions end up back at column 0. Therefore, to accomplish this task, we will have to capture the location where we see a newline followed by whitespace. We will do that in the next section.

7.2 Capture Newlines, INDENTs and DEDENTs

We first add a new rule to the lexer

-- Example5/Lexer.x

       \n$white* {startWhite}
       $white+ ;   -- ignore this, since we only care about significant (leading) whitespace

Now we will implement the startWhite action to handle this rule. To keep track of indentation levels we make use of the userStateIndentStack attribute of the AlexUserState record; the initialization function sets it to [1] and sets userStatePendingTokens to an empty list.
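
For reference, the initial user state could look roughly like this (a sketch; the indent stack and pending-token values are as described above, while the start-code and previous-token values are assumptions):

alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState
  { userStateStartCode     = 0          -- assumed
  , userStateIndentStack   = [1]        -- as described above
  , userStatePendingTokens = []
  , userStateParenStack    = []
  , userStatePrevComment   = False
  , userStatePrevToken     = Newline    -- assumed starting value
  }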

Here is the implementation of the startWhite action. When it is called we perform the steps described after the code listing; we will break this code down below.

startWhite:: AlexAction TokenInfo
startWhite inp inp_len = do
     is <- userStateIndentStack <$> alexGetUserState
     let cur = case is of
           c:_ -> c
           _ -> error "Indentation stack is not set, alteast one element should be present. Empty list found"

     let (AlexPosn _ line _, _, _,s) = inp
     -- new input
     n_inp@(AlexPosn _ nline ncol, _, _, ns) <- alexGetInput

     newline_tok <- newLineAction inp inp_len
     userState <- alexGetUserState

     if  | True -> do    -- this if condition is a placeholder which we will update later
          -- at this point we have an indentation, but this
          -- indentation could start on a different line depending on
          -- preceding empty lines. For every preceding empty line we
          -- will insert an NL token
          let parts = L.map T.length . T.splitOn "\n" . T.pack . take inp_len $ s
          let pos = L.last parts + 1
          let nl_tokens = constructNLTokens
                          -- the first new line will be returned at
                          -- end of this function
                          (line + 1)
                          -- ["\n", ....., ""] therefore adjust for
                          -- the 2 items (first \n and last
                          -- non-newline)
                          (L.length parts - 2)
          when (pos > cur) $
                 alexSetUserState $ userState {
            userStateIndentStack = pos:is,
            -- takes care of adding the preceding new lines as well
            userStatePendingTokens=nl_tokens ++ [constructToken Indent inp inp_len n_inp]}
          when (pos < cur) $ do
                 let (pre, post) = span (> pos) is
                 let top = case post of
                       t:_ -> t
                       [] -> error $ unwords ["Invalid indent with cur= ", show cur]
                 if pos == top then
                    alexSetUserState $ userState {
                          userStateIndentStack = post,
                          userStatePendingTokens=nl_tokens ++ map (const (constructToken Dedent inp inp_len n_inp)) pre}
                 else
                   error $ "Invalid indents : " ++ "pos = " ++ show pos ++ " top= " ++ show top ++ "userState = " ++ show userState ++ "pre = " ++ show pre ++ "post = " ++ show post
          when (pos == cur) $
                  alexSetUserState $ userState {
                  userStatePendingTokens=nl_tokens}

          -- set prev token
          alexSetPrevToken newline_tok
          return (constructToken newline_tok inp inp_len n_inp)

  1. when (pos > cur): This happens when we encounter a new, deeper indentation. We check whether the indentation is further out than the value at the top of the userStateIndentStack. If it is, we push the new position onto the stack. We also add an INDENT token to the userStatePendingTokens stack, since after we yield a NEWLINE token we need to yield all the tokens pending there. The lexer loop emits the pending tokens by popping this stack before calling the lexer for the next token.
  2. when (pos < cur): This condition is true when we encounter an outer indent. We could have dedented out by more than one level of indentation, so we need to compute the number of DEDENT tokens that need to be added to the pending stack.
  3. when (pos == cur): When the indentation level has not changed, we still have to introduce any NL tokens that may be needed for empty lines within the same indentation level.

There is one other subtle condition we need to handle. You will notice that we are constructing a specific number of nl_tokens. This supports the behaviour of the original Python tokenizer, which introduces NL tokens for empty lines. The empty lines do not get INDENT tokens.
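
constructNLTokens, used in the listing above, simply builds one NL token per preceding empty line, starting at the given line number. A sketch of its assumed shape (the version in the repository may differ):

constructNLTokens :: Int -> Int -> [TokenInfo]
constructNLTokens start_line count =
  [ TokenInfo { token_type   = Nl
              , token_string = T.pack "\n"
              , start_pos    = (line, 0)
              , end_pos      = (line, 1)
              }
  | line <- take count [start_line ..] ]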

Here is another example from the Python tokenizer where you will notice two NL tokens before the INDENT token. That is the reason for appending the extra NL tokens in all of the conditions above.

In [57]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:
    ...:
    ...:     x
    ...: """
    ...:
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[57]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=61 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=61 (NL), string='\n', start=(4, 0), end=(4, 1), line='\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(5, 0), end=(5, 4), line='    x \n'),
 TokenInfo(type=1 (NAME), string='x', start=(5, 4), end=(5, 5), line='    x \n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 6), end=(5, 7), line='    x \n'),
 TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')]

Now, this alone is not sufficient. You will notice that we have an if | True block in the code above. We will extend this if construct below.

You will notice that we are also setting the userStatePrevToken values in this function. We will next look at why we need that when dealing with comments.

Finally, we still have some more scenarios to handle:

  • We cannot emit an INDENT if we are within the context of an open paren (. Therefore, we need to add that condition to our startWhite action. Notice that we capture the userStatePrevToken after constructing every token; we can use this information to infer whether we are within the context of a (.
  • If we just encountered a #comment by itself on a line, then the newline token generated has to be NL and not NEWLINE. We will also not create an INDENT in this case, even if the comment is indented.

7.3 Addressing Parens and Comments (with Newlines)

We will now update our startWhite action to handle these scenarios. We introduce new conditions under the if construct to handle the paren and comment scenarios.

startWhite:: AlexAction TokenInfo
startWhite inp inp_len = do
     is <- userStateIndentStack <$> alexGetUserState
     let cur = case is of
           c:_ -> c
           _ -> error "Indentation stack is not set, alteast one element should be present. Empty list found"

     parenDepth <- length . userStateParenStack <$> alexGetUserState
     let (AlexPosn _ line _, _, _,s) = inp
     -- new input
     n_inp@(AlexPosn _ nline ncol, _, _, ns) <- alexGetInput

     newline_tok <- newLineAction inp inp_len
     userState <- alexGetUserState

     if | (parenDepth > 0) ->  action Newline inp inp_len
        | T.isPrefixOf "#" (T.pack ns) -> do  -- this means we are on a comment only line
            alexSetPrevToken newline_tok
            return (constructToken newline_tok inp inp_len n_inp)    -- no INDENTs on a comment-only line
        | otherwise -> ... -- rest of the code as above

For the above code additions to work, we also need the comment action and the paren handling explained further down.

We will be using a helper function newLineAction to decide whether to generate an NL or a NEWLINE token, depending on a flag we keep about comments. The userStatePrevComment flag is set in the commentAction function whenever a comment is found on a new line (followed by just whitespace).

7.4 Special handling for Comments

-- Example5/LexerUtil.hs

newLineAction :: AlexAction Token
newLineAction inp inp_len = do
  is_prev_comment <- userStatePrevComment <$> alexGetUserState
  if is_prev_comment then (do
   alexSetPrevComment False
   return Nl) else return Newline

commentAction :: AlexAction TokenInfo
commentAction inp inp_len = do
  prevToken <-  userStatePrevToken <$> alexGetUserState
  let prevCommentFlag = prevToken == Nl || prevToken == Newline

  -- here we only flag prevComment if it was followed by a newline
  -- we need this since is the only time we don't introduce the `INDENT`
  alexSetPrevComment prevCommentFlag

  -- this has new updated input
  --((AlexPosn _ line col),c, rest,s) <- alexGetInput
  let (AlexPosn _ line col,c, rest,s) = inp

  let token_string = T.stripStart . T.pack $ take inp_len s
  let new_len = T.length token_string

  let new_pos = col + inp_len - new_len - 1
  let start_pos = (line, new_pos)
  let end_pos = (line, new_pos + new_len)

  return $ TokenInfo {
    token_type=Comment,
    token_string=token_string,
    start_pos=start_pos,
    end_pos=end_pos
    }

Here is the subtlety we are trying to handle for #comments that appear on a line by themselves, as opposed to comments beside some statement.

In [59]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:
    ...:     # comment here gets NL
    ...:     x # comment here get NEWLINE
    ...: """
    ...:
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[59]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=61 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=60 (COMMENT), string='# comment here gets NL', start=(4, 4), end=(4, 26), line='    # comment here gets NL\n'),
 TokenInfo(type=61 (NL), string='\n', start=(4, 26), end=(4, 27), line='    # comment here gets NL\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(5, 0), end=(5, 4), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=1 (NAME), string='x', start=(5, 4), end=(5, 5), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=60 (COMMENT), string='# comment here get NEWLINE', start=(5, 6), end=(5, 32), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 32), end=(5, 33), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')]

Observe when NL tokens are generated as opposed to NEWLINE tokens. Understanding this will help you understand the code snippets above.

One other thing we need to handle is making sure we have matching nested parens when we generate tokens for them. We add these two functions to handle that scenario:

7.5 Validate Matching Parens

-- Paren handling - adapted from the language-python library
openParen:: Token -> AlexAction TokenInfo
openParen token inp inp_len = do
           userState <- alexGetUserState
           alexSetUserState $ userState {
                            userStateParenStack=token:userStateParenStack userState}
           action token inp inp_len


closeParen:: Token -> AlexAction TokenInfo
closeParen token inp inp_len = do
       userState <- alexGetUserState
       let parenStack = userStateParenStack userState
       case parenStack of
            t:ts -> case matchParen t token of
            -- TODO: Add line number info
                 False -> error $ "Parens don't match " ++ show t ++ "and" ++ show token
                 True -> do
                       -- pop the stack
                       alexSetUserState $ userState {
                                  userStateParenStack=ts}
                       action token inp inp_len
            -- TODO: Add line number info
            [] -> error $ "No paren's to pop for " ++ show token


matchParen :: Token -> Token -> Bool
matchParen t1 t2 = case (t1, t2) of
           (Lpar, Rpar) -> True
           (Lsqb, Rsqb) -> True
           (Lbrace, Rbrace) -> True
           (_, _) -> False

And finally, to handle the subtleties of how the original Python tokenizer treats comments, we need a specialized action for comments. Here we set up a flag to identify cases where a #comment was seen on a line without any other statements.

We will accordingly update our rules for parens and comments to call these actions.

-- Example5/Lexer.x

      "("   { openParen Lpar }
      ")"   { closeParen Rpar }
      "["   { openParen Lsqb }
      "]"   { closeParen Rsqb }
      "{"   { openParen Lbrace }
      "}"   { closeParen Rbrace }

For comments, to handle the newline behavior, we introduce a couple more rules:

-- Example5/Lexer.x

$newline = [\r \n]
$not_newline = ~$newline
@commentline = (($white_no_nl)*  \# ($not_newline)*)
@empty_line = (($white_no_nl)*\n)

tokens :-
       @empty_line {action Nl}
       @commentline {commentAction}

And that's it. We have a version of a lexer for Python 3 which is compatible with this tokenizer (up to our tests).

8. That’s a Wrap!

Phew, if you got this far, then congratulations! You built a complete lexer in Haskell using Alex. The lexer you built matches the behavior of the Python tokenizer at least up to the 80+ tests we ran against it.[2] That is an awesome feat! For some of us this is a college-level project, but for others it is quite an accomplishment!

For further reading, I would recommend taking a look at the Alex documentation. A lot of what is explained there should make sense if you got this far in the tutorial. I have also left a list of references that I used while implementing this project. There are other lexers in the wild that have been built using Alex; hopefully, this tutorial has given you the knowledge you need to understand those implementations.

Limitations of the lexer we built

  1. We do not support Unicode characters in identifiers
  2. The ‘\’ line continuation character is not supported in our implementation. That can easily be added.
  3. We do not generate the ENCODING and ENDMARKER tokens. The code for these is commented out to keep the tests simple.

9. References

  1. Alex: most of the data structures used were adapted from this documentation
  2. language-python: the implementation here helped me understand other approaches to handling whitespace context and code organization
  3. jmoy/alexhappy: the examples here for handling whitespace/indentation were the building blocks for the whitespace handling in this tutorial

  1. Alex has a concept called wrappers, where versions of these functions are automatically generated. But to learn how to use Alex, we will write them from scratch.↩︎

  2. The current implementation of the lexer does not support Unicode characters in identifiers, even though Unicode characters can be present in string literals.↩︎