Alex by example (Write You A Python Lexer)
Contents
- Objective
- Preparation
- Basic Lexer
- Support Numeric Values and String Values
- Capture the Source Location of Tokens
- Store State in State Monad
- Support Whitespace Context (and have a Full Lexer!)
- That’s a wrap!
- References
Objective
The main goal of this tutorial is to build a lexer for the Python 3 language specification (tested against 3.9) as described here. By working through this tutorial we will be replicating the complete behavior of the tokenizer implemented in Lib/tokenize.py and described here. The final implementation of this tutorial has around 80+ tests that were directly extracted from the test fixtures and results used in test/test_tokenize.py.
By working towards this goal, the reader should acquire enough familiarity to implement a real world lexer for a serious language using Alex.
The objectives of this tutorial are as follows:
- Build a real-world project using Alex (here we build a lexer for Python 3.9)
- Learn the different features of Alex as we incrementally build the lexer (macros, set expressions, rules, actions, error handling, threading state)
- Motivate and use the approaches required to implement a complete lexer (a context-sensitive lexer with state)
What this article is not
- This article is not a replacement for the more precise documentation on Alex. But you will greatly benefit from this article if you refer to the documentation as you work through the examples.
- It does not teach you about lexers and their theory. Some basic understanding of how a lexer works is expected, but reading through the examples should also provide sufficient guidance on how the framework guides you in building a lexer without understanding much theory.
Who is this article for
If you are interested in implementing a real world lexer and using it in your projects, then working through the example in this article should give you enough knowledge and familiarity.
- The article assumes beginner-level knowledge of Haskell. The later examples use the State monad, and that is as advanced as it gets.
- Someone who is familiar with lexers at least at a very basic level
- Familiarity with regular expressions is assumed (we use them a lot in this tutorial!)
- Familiarity with the Python 3 language syntax is assumed
1.1 How the tutorial is organized
Preparation
We build the basic plumbing we need in terms of project creation, adding the required boilerplate code that Alex needs to generate the lexer.
Basic Lexer
We build a basic lexer that supports identifiers and integers. We use this iteration to also provide the functions Alex expects us to implement, extending what we built in the Preparation section.
Support Numeric Values And String Values
We will use the character set and macro features provided by Alex to support decimals and different floating point representations. Examples include values like 42, 42e10, 0.42e-10, etc.
Since we are building a lexer for the Python 3 version of the language, we have an interesting task in supporting a wide range of string representations. The variety of string representations supported by Python is documented here. Our lexer should support all these representations at the end of this section.
Capture the Source Location
By now you would have noticed that we have not captured the Token positions as they appear in the source code. In this iteration we will handle that and capture the exact starting and ending positions of each Token we generate. While working on this we will try to stay true to how the Python tokenizer works in terms of determining which offsets to store.
Store the Lexer State in a State Monad
The Python language is whitespace-sensitive. Until this iteration we did not have to deal with that, but we will have to build support for it soon. Before we add support for whitespace sensitivity, we have to build the machinery to capture the context the lexer is in. We capture information on whether we encountered newlines, empty lines, comments and whitespace (representing indentation).
Support Whitespace Context (and a Complete Lexer!)
We start adding our initial support for whitespace sensitivity by adding INDENT and DEDENT tokens. Then we handle the complexity involved with newlines when mixed with comments and parens.
By the end of this section we will have a fully functioning lexer which is an (almost) true (up to the tests) replica of the Python implementation of the tokenizer.
A word on the Code used in this tutorial
The tutorial uses code that is published at write-you-a-lexer. There are 5 Example folders under the src folder. Each Example folder can be run independently as shown in the section below.
Each incremental iteration of our lexer is implemented as an Example folder.
Each Example folder also has a corresponding test, test_tokenizer*.hs. Each of these tests refers to fixtures inside the test_fixtures folder. Using the files in the test_fixtures folder, tests are generated for each lexer implementation across all Example* folders. The test_fixtures folders are numbered to match the Example number. For example, all test fixtures for the lexer implemented under the Example4 folder can be run using test_tokenizer4.hs, which in turn uses the test_fixtures/4/*.txt files as the test fixtures.
2. Preparation (Example1)
As part of the preparation for the tutorial, we will create a stack project (using rio as the template).
stack new wya-lexer rio
cd wya-lexer
Alternatively, the project can be cloned from here. All code used in this tutorial is from this repository.
Changes to package.yaml
You can refer to the package.yaml. An important thing to note in this file is that we add a cabal directive as follows:
build-tools: alex
This directive allows cabal to execute alex on any file with the .x extension. Running alex on these files makes all *.x modules available to the project. More information on this directive can be found in the documentation for cabal.
2.1 Types and Functions required by Alex
For the initial setup, we will add the following types and functions required by Alex. We will refine these types and functions in future iterations. For now, we will add the following definitions to src/Example1/LexerUtil.hs. More detailed information about why we need these functions and types can be found in the Alex documentation. We will also get familiar with these types and functions as we work through this tutorial.
type Byte = Word8

type AlexInput = (Char, [Byte], String)

alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (c, b:bs, s) = Just (b, (c, bs, s))
alexGetByte (_, [], [])  = Nothing
alexGetByte (_, [], c:s) = case encode [c] of
    b : bs -> Just (b, (c, bs, s))
    []     -> error $ "Not byte returned for " ++ show c
One way to understand how these definitions work is to imagine Alex starting with the initial input AlexInput. Then Alex iteratively calls alexGetByte to get the next byte. Alex needs to match the rules we provide as it reads each byte from the call to alexGetByte. Once all the bytes for a character have been read, Alex will call an action and then recursively call this function with the new input that was returned as part of the earlier call to alexGetByte. Note that Alex does not make any assumptions about the type of AlexInput, just that it needs an input to recursively pass to alexGetByte. The state of this input determines what alexGetByte can work on to get the next byte.
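To make that iteration concrete, here is a small standalone driver that calls alexGetByte by hand until it returns Nothing. It assumes the Byte/AlexInput/alexGetByte definitions above and that encode comes from Codec.Binary.UTF8.String (the utf8-string package); allBytes itself is a made-up helper for illustration, not part of the tutorial code.

-- Collect every byte Alex would see for a given input string.
allBytes :: String -> [Byte]
allBytes str = go ('\n', [], str)
  where
    go inp = case alexGetByte inp of
      Nothing        -> []
      Just (b, inp') -> b : go inp'

-- λ> allBytes "ab"
-- [97,98]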
We will be refining the AlexInput type and the alexGetByte function throughout this tutorial.
Since Alex translates all matches of the rules we provide into a token, we need to provide definitions of the Token values we want to capture. We will define the first version as follows in Example1.Tokens.
-- basic tokens
data Token =
    Name
  | Number
  | String
  | Op
  deriving (Eq, Show)

-- the augmented structure that will store more information about the tokens
data TokenInfo = TokenInfo {
    token_type   :: Token
  , token_string :: T.Text
  , start_pos    :: (Int, Int)
  , end_pos      :: (Int, Int)
  } deriving (Show, Eq)
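For instance, once everything is wired up, lexing the identifier zzz should eventually produce a value shaped like this (positions stay at (0,0) until we fill them in during a later iteration; the value below is written out by hand for illustration):

zzzToken :: TokenInfo
zzzToken = TokenInfo { token_type   = Name
                     , token_string = T.pack "zzz"
                     , start_pos    = (0, 0)
                     , end_pos      = (0, 0) }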
Now, in Example1.LexerUtil, we also need to provide an action function that Alex will call to generate the TokenInfo value. The initial implementation here is again simple and will be improved later on.
-- adopted from Alex documentation
type AlexAction = Token -> AlexInput -> Int -> TokenInfo

action :: AlexAction
action tok inp inp_len = let
    (c, _, s) = inp
  in
    TokenInfo {
        token_type   = tok,
        token_string = T.pack (take inp_len s),
        start_pos    = (0, 0), --- this will be set to correct values later
        end_pos      = (0, 0)  --- this will be set to correct values later
      }
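To get a feel for how Alex will call this, here is a quick hand-invocation (the input tuple is made up for illustration): Alex hands action the input as it looked before the match plus the length of the match, and action slices the token text out of the remaining string.

-- Calling `action` the way Alex would after matching 3 characters of the
-- remaining input "abc = 1" against the @identifier macro:
demo :: TokenInfo
demo = action Name ('\n', [], "abc = 1") 3
-- demo == TokenInfo { token_type = Name, token_string = "abc"
--                   , start_pos = (0,0), end_pos = (0,0) }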
Now, we are ready to define our first version of Lexer rules. That will be our task in the first iteration.
3. Basic Lexer (Example1)
In this iteration, we will define a basic set of rules that captures any valid (non-Unicode) identifier in Python and also captures decimal values. We will skip newlines and whitespace in this iteration.
Alex provides two building blocks to define rules. One construct is to define a character set to match text on. The identifiers for a character set are preceded by the $ symbol. These character set rules can be combined together and defined as macros. The identifiers defining macros are prefixed with the @ symbol. Now let's use these 2 basic constructs to define our first set of lexer rules.
The entire file is here
--- (Example1.Lexer.x)

--- the character set rules
$digit = 0-9            -- digits
$alpha = [a-zA-Z_]      -- alphabetic characters
$white_no_nl = [\ \t]

-- a macro that combines the character set rules
-- here this macro captures any valid Python identifier
@identifier = $alpha [$alpha $digit \_]*

-- a macro that combines character sets to capture a decimal number
@decnumber = (0(_?0)* | [1-9](_?[0-9])*)

-- a macro that currently just wraps a decimal number.
-- Later on we will add more macros to this definition
@number = @decnumber
Now, it is time to use these macros and let Alex call our action function on matching tokens so that we can build our TokenInfo value. Therefore, next we define the rules.
--- (Example1.Lexer.x)
tokens :-

--- instruct Alex to call `action` with the `Number` token when the set of characters matches the @number macro
@number {action Number}

-- instruct Alex to call `action` with the `Name` token when the set of characters matches the @identifier macro
@identifier {action Name}

--- for now we skip whitespaces/newlines/tabs etc
$white+ ; -- this ignores new lines as well
Finally, to run the lexer we will provide the following function in (Example1.Lexer.x) to satisfy module dependencies. We will invoke this function in one of our entry point functions. This function recursively calls alexScan until all tokens are scanned and Alex returns an AlexEOF value. (Under the hood, alexScan calls the alexGetByte function we provided earlier.) Note that this function returns the list of TokenInfo values.
-- adapted from https://www.haskell.org/alex/doc/html/wrappers.html
-- alexScanTokens :: String -> [TokenInfo]
lexer str = go ('\n', [], str)
  where go inp@(_, _bs, str) =
          case alexScan inp 0 of
            AlexEOF -> []
            AlexError _ -> error "lexical error"
            AlexSkip inp' len -> go inp'
            AlexToken inp' len act -> act inp len : go inp'
To run the lexer, we have a helper function in Example1.LexerRunner.hs that can be run from GHCi.
λ> import Example1.LexerRunner
λ> runLexer "zzz"
Right [TokenInfo {token_type = Name, token_string = "zzz", start_pos = (0,0), end_pos = (0,0)}]
Note that zzz parses to a Name token. The start_pos and end_pos are not set. We will fix that later.
Here are other examples of decimals generating Number tokens, and also an example of skipping whitespace.
λ> runLexer "42"
Right [TokenInfo {token_type = Number, token_string = "42", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [TokenInfo]
λ> runLexer "42\nvariable"
Right [TokenInfo {token_type = Number, token_string = "42", start_pos = (0,0), end_pos = (0,0)},TokenInfo {token_type = Name, token_string = "variable", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [TokenInfo]
Here is a simplified picture that describes the control flow
runLexer
    |
    |
    V
lexer (runs in a loop until EOF)  <-----------------------------
    |                                                          ^
    |                                                          |
    V                                                          |
alexScan inp 0 (alex scans input for a matching rule)          |
    |                                                          |
    | (matching rule found, then call corresponding action)    |
    V                                                          |
action (if tok == EOF then END, else loop back to lexer) ------>
With this we have our first iteration of the lexer! We used this iteration to set up the modules, types and functions. We will build upon these simple types and functions in future iterations.
4. Support Numeric Values and String Literals (Example2)
Python has diverse support for how numeric values and string literals can be defined in the Python code. In this iteration, our objective is to define the rules that will handle the entire set of numerical values and also string literals supported by Python. The lexical grammar for string literals and numerical values can be found here and here.
4.1 Adding support for numerical values
We will first support numerical values and then work our way toward supporting string literals. The listing below shows the entire set of rules for supporting numerical values. The text following the listing breaks down these rules and provides some explanation.
--- Example2/Lexer.x
$nonzerodigit = [1-9]
$bindigit = [01]
$octdigit = [0-7]
@hexdigit = $digit | [a-f] | [A-F]
@decinteger = (0(_?0)* | [1-9](_?[0-9])*)
@hexinteger = ([0][xX](_?[0-9a-fA-F])+)
@bininteger = (0[bB](_?[01])+)
@octinteger = (0[oO](_?[0-7])+)
@intnumber = @decinteger | @bininteger | @octinteger | @hexinteger
@digitpart = $digit([_]|$digit)*
@fraction = [\.] @digitpart
@pointfloat = (@digitpart)* @fraction | @digitpart[\.]
@exponent = [eE] ([\+\-]?) @digitpart
@exponentfloat = (@digitpart | @pointfloat)* @exponent
@floatnumber = @pointfloat | @exponentfloat
@imagnumber = (@floatnumber | @digitpart) [jJ]
@number = @imagnumber | @floatnumber | @intnumber
We first declare our primitive character sets as follows
$digit = [0-9] -- this is provided by Alex. We leave it here for completeness
$nonzerodigit = [1-9]
$bindigit = [01]
$octdigit = [0-7]
Now, let's support decimal integers such as 42, 42_00, 00_00, etc. We declare similar macros for hex, binary and octal numbers.
@decinteger = (0(_?0)* | [1-9](_?[0-9])*)
@hexinteger = ([0][xX](_?[0-9a-fA-F])+)
@bininteger = (0[bB](_?[01])+)
@octinteger = (0[oO](_?[0-7])+)
Finally, we put together all integer representations into one macro
@intnumber = @decinteger | @bininteger | @octinteger | @hexinteger
Next, we deal with floating point representations. Note that Python supports floating point numbers such as .42, 0.42, 42., and 42_42.42_42.
@digitpart = $digit([_]|$digit)*
@fraction = [\.] @digitpart
@pointfloat = (@digitpart)* @fraction | @digitpart[\.]
We also need to add support for exponents, such as 1e12, 1_1e1_1, 0.12e12_1, etc.
@exponent = [eE] ([\+\-]?) @digitpart
@exponentfloat = (@digitpart | @pointfloat)* @exponent
Finally, we need a simple declaration for imaginary numbers.
@imagnumber = (@floatnumber | @digitpart) [jJ]
Now, we put together all our rules for supporting integers, floats and imaginary numbers.
@number = @imagnumber | @floatnumber | @intnumber
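With these macros in place, the new numeric forms should lex as Number tokens. A quick hand-run check from GHCi (after importing Example2.LexerRunner as L2, as in the session shown later; the output shape mirrors the earlier examples):

λ> L2.runLexer "0x_FF"
Right [TokenInfo {token_type = Number, token_string = "0x_FF", start_pos = (0,0), end_pos = (0,0)}]
λ> L2.runLexer "0b10_01"
Right [TokenInfo {token_type = Number, token_string = "0b10_01", start_pos = (0,0), end_pos = (0,0)}]
λ> L2.runLexer "42j"
Right [TokenInfo {token_type = Number, token_string = "42j", start_pos = (0,0), end_pos = (0,0)}]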
4.2 Lexer for String Literals
As you may already be aware, Python supports specifying string literals in diverse ways. Here are some samples of strings that can be specified inside a Python program.
"abc" # double quoted string
'abc' # single quoted string
"abc\ # string with line continuation character
def"
"""
This is an example of a multi-line string
"""
'''
Another example of a multi-line string
'''
# supported string prefixes (combinations of r, f, b, u)
'\xe\0e\xe\xe' # byte string
brf"string with prefixes"
At first, supporting these variations might seem daunting. Fortunately, the Python lexical grammar documentation provides well-broken-down definitions that we can use to support this variety of representations.
# https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
stringliteral ::= [stringprefix](shortstring | longstring)
stringprefix ::= "r" | "u" | "R" | "U" | "f" | "F"
                 | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::= shortstringchar | stringescapeseq
longstringitem ::= longstringchar | stringescapeseq
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar ::= <any source character except "\">
stringescapeseq ::= "\" <any source character>
# byte string
bytesliteral ::= bytesprefix(shortbytes | longbytes)
bytesprefix ::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes ::= "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes ::= "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::= shortbyteschar | bytesescapeseq
longbytesitem ::= longbyteschar | bytesescapeseq
shortbyteschar ::= <any ASCII character except "\" or newline or the quote>
longbyteschar ::= <any ASCII character except "\">
bytesescapeseq ::= "\" <any ASCII character>
We will use the above definitions to define the rules our lexer needs. We will first deal with regular strings; we can apply the same logic to byte strings. We will take the same approach as we took for numeric values. The listing below displays the full set of rules we will have defined by the end of this section.
-- start string related rules
-- adopted from https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
$lf = \n -- line feed
$cr = \r -- carriage return
@eol_pattern = $lf | $cr $lf | $cr
@stringprefix = r | u | R | U | f | F
| fr | Fr | fR | FR | rf | rF | Rf | RF
$short_str_char = [^ \n \r ' \" \\]
$shortstringchar_nosinglequote = [^ ' \\ \n]
$shortstringchar_nodoublequote = [^ \" \\ \n]
@stringescapeseq = [\\](\\|'|\"|@eol_pattern|$short_str_char)
@shortstringitem_single = $shortstringchar_nosinglequote | @stringescapeseq
@shortstringitem_double = $shortstringchar_nodoublequote | @stringescapeseq
@shortstring = ' @shortstringitem_single* ' | \" @shortstringitem_double* \"
$longstringchar = [. \n] # [' \"]
@longstringitem_single = $longstringchar | @stringescapeseq | @one_single_quote | @two_single_quotes | \"
@longstringitem_double = $longstringchar | @stringescapeseq | @one_double_quote | @two_double_quotes | \'
@longstring = (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")
@stringliteral = (@stringprefix)* (@shortstring | @longstring)
That looks complicated. Let’s break it up and look at this in parts. First let’s define the character sets and the string prefixes that will be supported.
$lf = \n -- line feed
$cr = \r -- carriage return
@eol_pattern = $lf | $cr $lf | $cr
@stringprefix = r | u | R | U | f | F | fr | Fr | fR | FR | rf | rF | Rf | RF
We will create 2 groups of rules. One group, prefixed with short, will support strings enclosed in single or double quotes. The second group will support long strings that are created by enclosing them in either ''' or """. Let's start with the first group.
For a short string representation, we don't want to allow newline, carriage return, quote or backslash characters unless they are escaped by a backslash (\). The exception to this rule is that we allow unescaped single quotes in a double-quoted string and vice versa. The following rules define these requirements.
-- these characters are only allowed with an escape sequence
$short_str_char = [^ \n \r ' \" \\]

-- with the escape character (\) we allow certain characters
@stringescapeseq = [\\](\\|'|\"|@eol_pattern|$short_str_char)

-- these characters are not allowed inside a single-quoted string
$shortstringchar_nosinglequote = [^ ' \\ \n]

-- these characters are not allowed inside a double-quoted string
$shortstringchar_nodoublequote = [^ \" \\ \n]

-- combine the above rules to identify a valid character in a single-quoted string
@shortstringitem_single = $shortstringchar_nosinglequote | @stringescapeseq

-- combine the above rules to identify a valid character in a double-quoted string
@shortstringitem_double = $shortstringchar_nodoublequote | @stringescapeseq

-- put together the above rules to create a double-quoted or single-quoted string
@shortstring = ' @shortstringitem_single* ' | \" @shortstringitem_double* \"
For the long-form string, the rules are simpler, since many more characters are allowed inside this string without escaping.
-- adopted from the language-python package
-- the only characters that are not allowed are the ' or " quotes, depending upon the string wrapper used
$longstringchar = [. \n] # [' \"]

-- set up for single-quoted long string
@longstringitem_single = $longstringchar | @stringescapeseq | @one_single_quote | @two_single_quotes | \"

-- set up for double-quoted long string
@longstringitem_double = $longstringchar | @stringescapeseq | @one_double_quote | @two_double_quotes | \'

@longstring = (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")
And finally, we put together both these definitions to add support for a string literal.
@stringliteral = (@stringprefix)* (@shortstring | @longstring)
We can follow the same reasoning for the byte-string rules.
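The byte-string rules are not reproduced here; a sketch that mirrors the string rules above (macro names chosen to match the @bytesliteral used in the tokens section below; the actual definitions in the repository may differ slightly) could look like:

@bytesprefix = b | B | br | Br | bR | BR | rb | rB | Rb | RB
@shortbytes = ' @shortstringitem_single* ' | \" @shortstringitem_double* \"
@longbytes = (''' @longstringitem_single* ''') | (\"\"\" @longstringitem_double* \"\"\")
@bytesliteral = (@bytesprefix) (@shortbytes | @longbytes)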
Finally, we update our rules to use the new macros we created above. This is how the tokens section looks at the end of this iteration.
tokens :-
@number {action Number}
@identifier {action Name}
@stringliteral | @bytesliteral {action String}
$white+ ; -- this ignores new lines as well
Compiling and running our example, we see:
λ> L2.runLexer "'test'"
Right [TokenInfo {token_type = String, token_string = "'test'", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]
λ> L2.runLexer "'''this this'''"
Right [TokenInfo {token_type = String, token_string = "'''this this'''", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]
λ> L2.runLexer "42_42.e+42_42"
Right [TokenInfo {token_type = Number, token_string = "42_42.e+42_42", start_pos = (0,0), end_pos = (0,0)}]
it :: Either String [Example2.Tokens.TokenInfo]
More examples can be found in the test fixtures for this iteration under test/test_fixtures/2.
5. Capture The Source Location of Tokens (Example3)
Up until now, you will have noticed that the start_pos and end_pos values in TokenInfo are set to (0, 0). We have not captured the location in the input string that was used to generate a particular TokenInfo value. We will fix that in this iteration.
Before that let’s add support for special characters to our lexer.
Thus far, our lexer only produces the NAME, STRING and NUMBER tokens. Let's extend the definition of Token to capture all the special characters and their meaning as follows:
-- Example3/Tokens.hs
data Token =
.....
| Lpar
| Rpar
| Lsqb
| Rsqb
| Colon
| Comma
| Semi
| Plus
| Minus
....
...
We also add the corresponding change to the Example3/Lexer.x module.
-- sample of operators we add to Example3/Lexer.x
"(" { action Lpar }
")" { action Rpar }
"[" { action Lsqb }
"]" { action Rsqb }
"{" { action Lbrace }
"}" { action Rbrace }
"..." { action Ellipsis}
"->" { action RArrow }
"." { action Dot }
"~" { action Tilde }
"+" { action Plus }
"-" { action Minus }
"**" { action DoubleStar }
"*" { action Star }
With this our lexer supports expressions such as x + y, abc = "xyz" + "def", (a + b) + c, etc.
Now, let's get on to the task of capturing the source location information in our TokenInfo value. If you recall, in the preparation section we added functions and types to work with the Alex runtime. We will be updating those functions here. First, we will update AlexInput to capture the location as we update the AlexInput values inside the alexGetByte function.
Here is what we need to do. First, update AlexInput so that it holds another attribute of type AlexPosn. We use this value to capture the starting location from where the next token was captured (at which point the action function is called).
Example3/LexerUtil.hs
-- adapted from https://www.haskell.org/alex/doc/html/wrappers.html
data AlexPosn = AlexPosn !Int -- absolute character offset
                         !Int -- line number
                         !Int -- column number
                deriving (Show, Eq)

type AlexInput = (AlexPosn, -- current position
                  Char,     -- previous char
                  [Byte],   -- rest of the bytes for the current char
                  String)   -- current input string
Next, we will extend the alexGetByte function to update the AlexPosn values every time a complete character has been read. To help with this, we use two helper functions. Note that we just adopted this code from the wrapper Alex usually generates (but we are not using the wrappers in this tutorial).
alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (pn, c, b:bs, s) = Just (b, (pn, c, bs, s))
alexGetByte (_, _, [], [])   = Nothing
alexGetByte (p, _, [], c:s)  = let
    p' = alexMove p c
  in
    case encode [c] of
      b : bs -> p' `seq` Just (b, (p', c, bs, s))
      []     -> error $ "Not byte returned for " ++ show c

alexStartPos = AlexPosn 0 1 1

-- adapted from code generated by Alex (removed the change for \t, since Python counts this as offset=1)
alexMove :: AlexPosn -> Char -> AlexPosn
-- alexMove (AlexPosn a l c) '\t' = AlexPosn (a+1) l ((c+8-1) `div` 8*8+1) -- tabsize=8
alexMove (AlexPosn a l _) '\n' = AlexPosn (a+1) (l+1) 1
alexMove (AlexPosn a l c) _    = AlexPosn (a+1) l (c+1)
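To see how the position advances, here is alexMove applied by hand starting from alexStartPos (the two helper names are just for illustration):

posAfterA, posAfterNewline :: AlexPosn
posAfterA       = alexMove alexStartPos 'a'   -- AlexPosn 1 1 2: same line, next column
posAfterNewline = alexMove posAfterA '\n'     -- AlexPosn 2 2 1: next line, column resets to 1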
And finally, we have to use this information when we construct the TokenInfo value in our action function. Note that we can use the starting value present in AlexPosn and inp_len (the length of the currently captured token) to initialize start_pos and end_pos.
action :: AlexAction
action tok inp inp_len = let
    -- Get the AlexPosn and use that info to calculate start, end positions
    (AlexPosn _ line col, c, _, s) = inp
  in
    TokenInfo {
        token_type   = tok,
        token_string = T.pack (take inp_len s),
        start_pos    = (line, col),
        end_pos      = (line, col+inp_len-1)
      }
With these changes, all TokenInfo values that are constructed in the action function will also capture the source location.
Here are some examples:
λ> import Example3.LexerRunner as L3
λ> L3.runLexer "a=b+c\nx=1"
Right [TokenInfo {token_type = Name, token_string = "a", start_pos = (1,1), end_pos = (1,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (1,2), end_pos = (1,2)},TokenInfo {token_type = Name, token_string = "b", start_pos = (1,3), end_pos = (1,3)},TokenInfo {token_type = Plus, token_string = "+", start_pos = (1,4), end_pos = (1,4)},TokenInfo {token_type = Name, token_string = "c", start_pos = (1,5), end_pos = (1,5)},TokenInfo {token_type = Name, token_string = "x", start_pos = (2,1), end_pos = (2,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (2,2), end_pos = (2,2)},TokenInfo {token_type = Number, token_string = "1", start_pos = (2,3), end_pos = (2,3)}]
it :: Either String [Example3.Tokens.TokenInfo]
6. Store state in a State Monad (Example4)
By now you might have noticed we are not handling newlines or indentation yet. Python's grammar is context-sensitive, and the lexer we constructed so far does not use the context of surrounding tokens. We did not have to store the state of our lexer until now, but to support the whitespace-sensitive requirements of the Python grammar we will have to preserve the state of the context.
There are many approaches to solving this problem. One could entirely use the start_code machinery provided by Alex to achieve this. In this tutorial, we will adopt the State monad machinery to store and retrieve the context. As earlier, we will use the code Alex would have generated if we had used the %wrapper monadUserState directive. One could easily replace this code with the stock State monad.
Let's first think about all the information we would like to store in our state. You might have noticed that we are already implicitly carrying around state in the form of AlexInput, which gets passed to the alexScan function and gets updated in the alexGetByte function. So maybe we can fold that information into our new state. There is also additional information we need to store to handle newlines, indentation, empty new lines, lines with just comments, etc. Later on we will also have to handle cases where parentheses are used to break statements across multiple lines.
With these requirements in mind, we can define the following 2 data structures. Again, the data structures are adopted heavily from the monadUserState wrapper Alex provides. This could easily be replaced by a State monad with a custom structure, but we stick to these definitions to understand the Alex documentation and be consistent with other resources we find on the internet.
First, we abstract the AlexInput and AlexPosn into our new state definition.
-- Example4/LexerUtil.hs
-- The Alex state monad
-- This record encapsulates all of the `AlexInput` attributes and also contains the `AlexPosn`.
-- We also add the `AlexUserState` to this structure and we will use them later
data AlexState = AlexState {
alex_pos :: !AlexPosn, -- position at current input location
alex_inp :: String, -- the current input
alex_chr :: !Char, -- the character before the input
alex_bytes :: [Byte], -- rest of the bytes for the current char
alex_scd :: !Int, -- the current startcode
-- Used in later iterations
alex_ust :: AlexUserState -- AlexUserState will be defined in the user program
}
data AlexUserState = AlexUserState {
    userStateStartCode     :: !Int,
    userStateIndentStack   :: [Int],
    userStatePendingTokens :: [TokenInfo],
    userStateParenStack    :: [Token],
    userStatePrevComment   :: Bool,
    userStatePrevToken     :: Token
  } deriving (Show)
We will now define a new type, Alex a, and the required monadic interfaces for it.
-- Example4/LexerUtil.hs
newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) }
instance Functor Alex where
  fmap f a = Alex $ \s -> case unAlex a s of
    Left msg -> Left msg
    Right (s', a') -> Right (s', f a')

instance Applicative Alex where
  pure a = Alex $ \s -> Right (s, a)
  fa <*> a = Alex $ \s -> case unAlex fa s of
    Left msg -> Left msg
    Right (s', f) -> case unAlex a s' of
      Left msg -> Left msg
      Right (s'', b) -> Right (s'', f b)

instance Monad Alex where
  m >>= k = Alex $ \s -> case unAlex m s of
    Left msg -> Left msg
    Right (s', a) -> unAlex (k a) s'
  return = pure
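As an aside, this hand-written newtype plus instances is isomorphic to the stock StateT-over-Either stack from the transformers package; a minimal sketch of that correspondence (not what the tutorial code uses) is:

import Control.Monad.Trans.State.Strict (StateT (..))

-- Alex a behaves like StateT AlexState (Either String) a, modulo the order
-- of the pair in the result (Alex returns (state, value), StateT (value, state)).
toStateT :: Alex a -> StateT AlexState (Either String) a
toStateT (Alex f) = StateT $ \s -> fmap (\(s', a) -> (a, s')) (f s)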
We will also add a getter/setter for this state monad. We will use these functions to get/set the AlexInput and AlexPosn values.
-- needed by Alex
alexError :: String -> Alex a
alexError message = Alex $ const $ Left message

alexGetInput :: Alex AlexInput
alexGetInput = Alex $ \s@AlexState{alex_pos=pos,alex_chr=c,alex_bytes=bs,alex_inp=inp__} ->
  Right (s, (pos,c,bs,inp__))

alexSetInput :: AlexInput -> Alex ()
alexSetInput (pos,c,bs,inp) = Alex $ \s -> Right (s{alex_pos=pos,alex_chr=c,alex_bytes=bs,alex_inp=inp}, ())
Now, we update our lexer function to use the state monad. Note that we break up the original version from Example3. The lexer function first gets the inp state and calls alexScan. After alexScan returns, we update the inp state and then call the action function. This is where we will capture the state between each call to alexScan.
Since the lexer runs only once, we need to call it recursively until we reach the EOF token. The recursion is handled in the lexerFold function. The lexerFold function recursively invokes the lexer until it detects the EOF token and then returns all the collected TokenInfo values.
lexer :: Alex TokenInfo -- This is our state monad
lexer = do
    inp <- alexGetInput
    case alexScan inp 0 of
      AlexEOF -> alexEOF
      AlexError ((AlexPosn _ line column),_,_,_) -> alexError $ "lexical error at line " ++ (show line) ++ ", column " ++ (show column)
      AlexSkip inp' _len -> do
        -- skip and loop around
        alexSetInput inp'
        lexer
      AlexToken inp' len action -> do
        alexSetInput inp'
        action inp len

-- adopted from language-python
lexerFold :: Alex [TokenInfo]
lexerFold = loop []
  where
    loop toks = do
      token_info@TokenInfo {..} <- lexer
      case token_type of
        EOF -> return $ L.reverse toks
        _   -> loop (token_info : toks)

-- from generated Alex file
runAlex :: String -> Alex a -> Either String a
runAlex inp (Alex f) =
  case f
    ( AlexState
        { alex_pos = alexStartPos,
          alex_inp = inp,
          alex_chr = '\n',
          alex_bytes = [],
          alex_ust = alexInitUserState,
          alex_scd = 0
        }
    ) of
    Left msg -> Left msg
    Right (_, a) -> Right a
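With runAlex and lexerFold in place, the top-level entry point (what Example4.LexerRunner.runLexer presumably boils down to) is just the composition of the two; a minimal sketch:

runLexerSketch :: String -> Either String [TokenInfo]
runLexerSketch str = runAlex str lexerFold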
We also update the action function to now return Alex TokenInfo instead of TokenInfo. Note that we also adjusted the start_pos and end_pos values to align with the values produced by the Python tokenizer.
-- helper function to construct the TokenInfo value
-- We will add more logic in this function later
constructToken :: Token -> (AlexPosn, b1, c1, String) -> Int -> (AlexPosn, b2, c2, d) -> TokenInfo
constructToken tok inp inp_len n_inp = let
    (AlexPosn _ line col, _, _, s) = inp
    (AlexPosn _ nline ncol, _c, _rest, _s) = n_inp -- new input in state

    start_pos = (line, col - 1)
    tok_str = T.pack $ take inp_len s
    end_pos = (nline, ncol - 1)
  in
    TokenInfo {
        token_type   = tok,
        token_string = tok_str,
        start_pos    = start_pos,
        end_pos      = end_pos
      }

type AlexAction result = AlexInput -> Int -> Alex result

-- Update action to run inside the Alex state
action :: Token -> AlexAction TokenInfo
action tok inp inp_len = do
    -- this has the new updated input
    n_inp@(AlexPosn _ nline ncol, c, rest, s) <- alexGetInput
    return $ constructToken tok inp inp_len n_inp
With these changes, we have completely abstracted AlexInput and AlexPosn into the State monad. We added a custom user state which we will use soon. We also threaded the state through the alexScan calls until we see the EOF token.
Let’s now test our changes in GHCI.
λ> import Example4.LexerRunner as L4
λ> L4.runLexer "a=b+c"
Right [TokenInfo {token_type = Name, token_string = "a", start_pos = (1,0), end_pos = (1,1)},TokenInfo {token_type = Equal, token_string = "=", start_pos = (1,1), end_pos = (1,2)},TokenInfo {token_type = Name, token_string = "b", start_pos = (1,2), end_pos = (1,3)},TokenInfo {token_type = Plus, token_string = "+", start_pos = (1,3), end_pos = (1,4)},TokenInfo {token_type = Name, token_string = "c", start_pos = (1,4), end_pos = (1,5)}]
it :: Either String [Example4.Tokens.TokenInfo]
7. Supporting Whitespace context (and having a Full Lexer!) (Example5)
It has been interesting so far, but we are not done yet. This will probably be the most difficult task to implement; we had to go through the process of setting up the state monad just to accomplish the task at hand. Let's look at some examples of how the Python tokenizer behaves when it comes to handling whitespace.
7.1 The Complexity of Newlines, Indents and Dedents
In [55]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:     x
    ...:         y
    ...: """
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[55]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    x \n'),
 TokenInfo(type=1 (NAME), string='x', start=(3, 4), end=(3, 5), line='    x \n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 6), end=(3, 7), line='    x \n'),
 TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        y\n'),
 TokenInfo(type=1 (NAME), string='y', start=(4, 8), end=(4, 9), line='        y\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='        y\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]
You will notice that we have NEWLINE followed by INDENT, and then finally also DEDENT (twice), so that start_pos and end_pos end up at column 0. Therefore, to accomplish this task, we will have to capture the location where we see a newline followed by whitespace. We will do that in the next section.
7.2 Capture Newlines, INDENTs and DEDENTs
We first add a new rule to the lexer:
-- Example5/Lexer.x
\n$white* {startWhite}
$white+ ; -- ignore this, since we only care about significant (leading) whitespace
Now we will implement the startWhite action to handle this rule. To keep track of all indentations we make use of the userStateIndentStack attribute of the AlexUserState record. The initialization function sets this value to [1] and sets userStatePendingTokens to an empty list. When the startWhite action is called we perform the steps described after the code listing.
Here is the implementation of the startWhite action. We will break down this code and understand what it is doing below.
startWhite :: AlexAction TokenInfo
startWhite inp inp_len = do
    is <- userStateIndentStack <$> alexGetUserState
    let cur = case is of
          c:_ -> c
          _   -> error "Indentation stack is not set, at least one element should be present. Empty list found"
    let (AlexPosn _ line _, _, _, s) = inp
    -- new input
    n_inp@(AlexPosn _ nline ncol, _, _, ns) <- alexGetInput

    newline_tok <- newLineAction inp inp_len
    userState <- alexGetUserState

    if | True -> do -- this if condition here is a placeholder which we will update later
           -- at this point we have an indentation, but this
           -- indentation could be starting on a different line based on
           -- preceding empty lines. For all preceding empty lines we
           -- will insert an NL token
           let parts = L.map T.length . T.splitOn "\n" . T.pack . take inp_len $ s
           let pos = L.last parts + 1
           let nl_tokens = constructNLTokens
                             -- the first new line will be returned at
                             -- the end of this function
                             (line + 1)
                             -- ["\n", ....., ""] therefore adjust for
                             -- the 2 items (first \n and last
                             -- non-newline)
                             (L.length parts - 2)
           when (pos > cur) $
             alexSetUserState $ userState {
               userStateIndentStack = pos:is,
               -- takes care of adding the preceding new lines as well
               userStatePendingTokens = nl_tokens ++ [constructToken Indent inp inp_len n_inp]}
           when (pos < cur) $ do
             let (pre, post) = span (> pos) is
             let top = case post of
                   t:_ -> t
                   []  -> error $ unwords ["Invalid indent with cur= ", show cur]
             if pos == top then
               alexSetUserState $ userState {
                 userStateIndentStack = post,
                 userStatePendingTokens = nl_tokens ++ map (const (constructToken Dedent inp inp_len n_inp)) pre}
             else
               error $ "Invalid indents : " ++ "pos = " ++ show pos ++ " top= " ++ show top ++ " userState = " ++ show userState ++ " pre = " ++ show pre ++ " post = " ++ show post
           when (pos == cur) $
             alexSetUserState $ userState {
               userStatePendingTokens = nl_tokens}
           -- set prev token
           alexSetPrevToken newline_tok
           return (constructToken newline_tok inp inp_len n_inp)
- `when (pos > cur)`: This happens when we encounter a new indentation. We check to see if the indentation is further out than the value at the top of the userStateIndentStack. If yes, we add the new position to the stack. We also add the INDENT token to the pending stack, since after we yield a NEWLINE token, we need to yield all the tokens in the userStatePendingTokens stack. By adding the INDENT tokens to this stack we achieve that. The lexerLoop generates the pending tokens by popping from the pending stack before calling the lexer to fetch the next token.
- `when (pos < cur)`: This condition is true when we encounter an outer indent. Now, we could have dedented out by more than one level of indentation. Therefore, we need to compute the number of DEDENT tokens that need to be added to the pending stack (see the small worked example after this list).
- `when (pos == cur)`: When the indentation level has not changed, we still have to introduce any NL tokens that may be needed for empty lines within the same indentation level.
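Here is the promised worked example of the dedent computation, with made-up values: suppose the indent stack is [9, 5, 1] (innermost level first) and the new indentation is pos = 1.

λ> span (> 1) [9, 5, 1]
([9,5],[1])
-- pre  = [9,5] -> two DEDENT tokens are pushed onto the pending stack
-- post = [1]   -> pos equals the head of post, so the indentation is valid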
There is one other subtle condition we need to handle. You will notice that we are constructing a specified number of nl_tokens. This is to support the behaviour of the original Python tokenizer, which introduces NL tokens for empty new lines. The empty new lines do not get INDENT tokens.
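startWhite also relies on a constructNLTokens helper that is not shown in this excerpt. A plausible sketch (the exact positions the real helper records may differ) simply emits one NL token per preceding empty line:

constructNLTokens :: Int -> Int -> [TokenInfo]
constructNLTokens start_line count =
  [ TokenInfo { token_type   = Nl
              , token_string = T.pack "\n"
              , start_pos    = (line, 0)
              , end_pos      = (line, 1) }
  | line <- take count [start_line ..] ]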
Here is another example from the Python tokenizer where you will notice 2 NL tokens before the INDENT token. That is the reason behind appending the extra NL tokens across all the conditions encountered above.
In [57]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:
    ...:
    ...:     x
    ...: """
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[57]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=61 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=61 (NL), string='\n', start=(4, 0), end=(4, 1), line='\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(5, 0), end=(5, 4), line='    x \n'),
 TokenInfo(type=1 (NAME), string='x', start=(5, 4), end=(5, 5), line='    x \n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 6), end=(5, 7), line='    x \n'),
 TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')]
Now, this alone is not sufficient. You will notice that we have an `if | True` block in the code above. We plan to extend this if construct below.
You will also notice that we are setting the userStatePrevToken value in this function. We will next look at why we need that when dealing with comments.
Finally, we still have some more scenarios to handle:
- We cannot INDENT if we are within the context of a paren (. Therefore, we need to add that condition as well to our startWhite action. Notice that we capture the userStatePrevToken after constructing every token. We can use this information to infer whether we are within the context of a (.
- If we just encountered a #comment by itself on a line, then the newline token generated has to be NL and not NEWLINE. We will also not create an INDENT in this case, even if the comment is indented.
7.3 Addressing Parens and Comments (with Newlines)
We will now update our whiteSpace
action to handle these scenarios. We will introduce new condition under the if
construct to handle the paren
and the comment
scenario.
startWhite :: AlexAction TokenInfo
startWhite inp inp_len = do
    is <- userStateIndentStack <$> alexGetUserState
    let cur = case is of
          c:_ -> c
          _   -> error "Indentation stack is not set, at least one element should be present. Empty list found"

    parenDepth <- length . userStateParenStack <$> alexGetUserState
    let (AlexPosn _ line _, _, _, s) = inp
    -- new input
    n_inp@(AlexPosn _ nline ncol, _, _, ns) <- alexGetInput

    newline_tok <- newLineAction inp inp_len
    userState <- alexGetUserState

    if | (parenDepth > 0) -> action Newline inp inp_len
       | T.isPrefixOf "#" (T.pack ns) -> do -- this means we are on a comment-only line
           alexSetPrevToken newline_tok
           return (constructToken newline_tok inp inp_len n_inp) -- no INDENTs on a comment-only new line
       | otherwise -> ... -- rest of the code as above
For the above code additions to work we also need the comment and paren actions, which are explained further down.
We use a helper function newLineAction to decide whether to generate an NL or NEWLINE token, depending on the flag set regarding comments. The userStatePrevComment flag is set in the commentAction function whenever a comment is found on a new line (and followed by just whitespace).
7.4 Special handling for Comments
-- Example5/LexerUtil.hs
newLineAction :: AlexAction Token
newLineAction inp inp_len = do
    is_prev_comment <- userStatePrevComment <$> alexGetUserState
    if is_prev_comment then (do
        alexSetPrevComment False
        return Nl) else return Newline

commentAction :: AlexAction TokenInfo
commentAction inp inp_len = do
    prevToken <- userStatePrevToken <$> alexGetUserState
    let prevCommentFlag = prevToken == Nl || prevToken == Newline
    -- here we only flag prevComment if the comment was preceded by a newline
    -- we need this since that is the only time we don't introduce the `INDENT`
    alexSetPrevComment prevCommentFlag

    -- this has the new updated input
    --((AlexPosn _ line col),c, rest,s) <- alexGetInput
    let (AlexPosn _ line col, c, rest, s) = inp
    let token_string = T.stripStart . T.pack $ take inp_len s
    let new_len = T.length token_string
    let new_pos = col + inp_len - new_len - 1
    let start_pos = (line, new_pos)
    let end_pos = (line, new_pos + new_len)
    return $ TokenInfo {
        token_type   = Comment,
        token_string = token_string,
        start_pos    = start_pos,
        end_pos      = end_pos
      }
Here is the subtlety we are trying to handle for #comments that are present by themselves, as opposed to comments beside some statement.
In [59]: import tokenize
    ...:
    ...: z = """
    ...: if f:
    ...:
    ...:     # comment here gets NL
    ...:     x # comment here get NEWLINE
    ...: """
    ...:
    ...: list(tokenize.generate_tokens(io.StringIO(z).readline))
Out[59]:
[TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if f:\n'),
 TokenInfo(type=1 (NAME), string='f', start=(2, 3), end=(2, 4), line='if f:\n'),
 TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if f:\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if f:\n'),
 TokenInfo(type=61 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=60 (COMMENT), string='# comment here gets NL', start=(4, 4), end=(4, 26), line='    # comment here gets NL\n'),
 TokenInfo(type=61 (NL), string='\n', start=(4, 26), end=(4, 27), line='    # comment here gets NL\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(5, 0), end=(5, 4), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=1 (NAME), string='x', start=(5, 4), end=(5, 5), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=60 (COMMENT), string='# comment here get NEWLINE', start=(5, 6), end=(5, 32), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 32), end=(5, 33), line='    x # comment here get NEWLINE\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')]
Observe when NL tokens are generated as opposed to NEWLINE tokens. Understanding this will help you understand the code snippets above.
One other thing we need to handle is making sure we have matching nested parens when we generate tokens for parens. We add these 2 functions to handle that scenario:
7.5 Validate Matching Parens
-- Paren handling - adopted from the language-python library
openParen :: Token -> AlexAction TokenInfo
openParen token inp inp_len = do
    userState <- alexGetUserState
    alexSetUserState $ userState {
        userStateParenStack = token:userStateParenStack userState}
    action token inp inp_len

closeParen :: Token -> AlexAction TokenInfo
closeParen token inp inp_len = do
    userState <- alexGetUserState
    let topParen = userStateParenStack userState
    case topParen of
      t:ts -> case matchParen t token of
        -- TODO: Add line number info
        False -> error $ "Parens don't match " ++ show t ++ " and " ++ show token
        True -> do
          -- pop the stack
          alexSetUserState $ userState {
              userStateParenStack = ts}
          action token inp inp_len
      -- TODO: Add line number info
      [] -> error $ "No parens to pop for " ++ show token

matchParen :: Token -> Token -> Bool
matchParen t1 t2 = case (t1, t2) of
    (Lpar, Rpar)     -> True
    (Lsqb, Rsqb)     -> True
    (Lbrace, Rbrace) -> True
    (_, _)           -> False
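A quick check of the pairing logic from GHCi: mismatched pairs fall into the False branch, so an input like (] ends up in closeParen's "Parens don't match" error.

λ> matchParen Lpar Rpar
True
λ> matchParen Lpar Rsqb
False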
And finally, to handle the subtlety of how the original Python tokenizer handles comments, we need a specialized action for comments. Here we set up a flag to identify cases where a #comment was seen on a new line without any other statements.
Accordingly, we update our rules for parens and comments to call these actions.
-- Example5/Lexer.x
"(" { openParen Lpar }
")" { closeParen Rpar }
"[" { openParen Lsqb }
"]" { closeParen Rsqb }
"{" { openParen Lbrace }
"}" { closeParen Rbrace }
For comments, to handle newline behavior, we introduce more rules
-- Example5/Lexer.x
$newline = [\r \n]
$not_newline = ~$newline
@commentline = (($white_no_nl)* \# ($not_newline)*)
@empty_line = (($white_no_nl)*\n)
tokens :-
@empty_line {action Nl}
@commentline {commentAction}
And that's it. We have a version of a lexer for Python 3 which is fully compatible with this tokenizer.
8. That’s a Wrap!
Phew, if you got this far, then congratulations! You built a complete lexer in Haskell using Alex. The lexer you built matches the behavior of the Python tokenizer at least up to the 80+ tests we ran against our lexer². That is an awesome feat! For some of us, this is a college-level project, but for others this is quite an accomplishment!
For further reading, I would recommend taking a look at the Alex documentation. A lot of what is explained in that document should make sense if you got this far in the tutorial. I have also left a bunch of references that I used while I implemented this project. There are other lexers in the wild that have been built using Alex. Hopefully, this tutorial has given you the knowledge you need to understand those implementations.
Limitations of the lexer we built
- We do not support Unicode characters in identifiers
- ‘\’ as line continuation character is not supported in our implementation. That can be easily added.
- We don't generate the ENCODING and ENDMARKER tokens. The code is commented out to make it easy for tests.
9. References
- Alex: Most of the data structures used were adopted from this documentation
- language-python: The implementation here helped me understand other approaches to handling whitespace context and code organization
- jmoy/alexhappy: The examples here for handling whitespace/indentation were the building blocks for handling whitespace in this tutorial
1. Alex has a concept called wrappers where versions of these functions are automatically generated. But, to learn to use Alex, we write them from scratch.↩︎
2. The current implementation of the lexer does not support Unicode characters in identifiers, even though Unicode characters can be present in string literals.↩︎