Notebook

University of Zagreb
Faculty of Electrical Engineering and Computing

Programming in Haskell

Academic Year 2014/2015

Introduction to Parsec¶

Version 1.1

Changelog

1.1 - Added do notation examples

Parsec is a parser combinator library. What does that mean?

Parsers are constructed from simple buildings blocks that are combined in various ways. Let us begin by constructing a simple parser that parses a single digit.

In [2]:

import Text.Parsec.String (Parser)
import Text.Parsec.Char (digit)
import Text.Parsec (parse, ParseError)
import Control.Applicative ((<$>), (<$), (<*>), (<*), (*>), Applicative)
import Data.Char (digitToInt)

err = "An error has occurred"

simpleParse :: Parser a -> String -> Either ParseError a
simpleParse p = parse p err

parseDigit :: Parser Int
parseDigit = digitToInt <$> digit

simpleParse parseDigit "2"

simpleParse parseDigit "9"

simpleParse parseDigit "a"

simpleParse parseDigit "99"

Right 2

Right 9

Left "An error has occurred" (line 1, column 1):
unexpected "a"
expecting digit

Right 9

There are a few things we can note from that example:

A parser is a wrapper around its return type. A Parser that parses a String will be a Parser String.
We need to use a special parse function for parsing. It takes a Parser, a default error message and a String to parse.
We can get either a Left ParseError (like the in the third example) or Right Int, the thing we got out of the parser, if the thing goes right.
The function digitToInt is pure. Parsers are not! We need to "lift" the function to an impure level. I used the Applicative style, simply due to personal preference. However, we could also use the monadic style you use in IO or (in this particular case) fmap. In fact, it turns out that <$> is fmap.
The parser works for a digit, but not for a letter. This is what we want. However, it cannot parse "99"!

Monadic style:

In [4]:

import Control.Monad (liftM)

parseDigit' :: Parser Int
parseDigit' = do
    d <- digit
    return . digitToInt $ d
    
-- Alternatively (we didn't do this yet I think?)
parseDigit'' :: Parser Int
parseDigit'' = liftM digitToInt digit

simpleParse parseDigit' "8"

simpleParse parseDigit'' "8"

Right 8

Right 8

So, getting a single digit is simple enough. But what if we want more than one digit?

This is where the combinator part comes into play. We will use a quantifier - many1 (think of it as Kleene's + operator in regular expressions. Or don't if that doesn't help). There is also a many quantifier that works like Kleene's * operator.

Let us implement a parser for positive integers in general.

In [5]:

import Text.Parsec.Combinator (many1)
import Control.Applicative (many)

number :: Parser Int
number = read <$> many1 digit

simpleParse number "100"
simpleParse number "92"
simpleParse number "-10"

Right 100

Right 92

Left "An error has occurred" (line 1, column 1):
unexpected "-"
expecting digit

Okay! That works for positive integers! But what about negative integers? That doesn't quite work with this - our parser makes no mention of minuses. Let's give it a go.

In [6]:

import Text.Parsec.Char (char)

infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a] 
a <:> b = (:) <$> a <*> b

negative :: Parser Int
negative = read <$> char '-' <:> many1 digit

simpleParse negative "-10"
simpleParse negative "-9999"
simpleParse negative "-0"
simpleParse negative "20"

Right (-10)

Right (-9999)

Right 0

Left "An error has occurred" (line 1, column 1):
unexpected "2"
expecting "-"

Note the <*> operator. The rule is more or less like this - for the first argument of a function you are lifting you use <$>. For other arguments, you use <*>.

So, we now have two separate parsers - one that can parse positive integers and one that can parse negative integers. This is fine, but we would prefer a parser than can parse both and properly evaluate their value.

We can use the choice combinator <|>!

In [7]:

import Control.Applicative ((<|>))

integer :: Parser Int
integer = number <|> negative
-- alternatively Text.Parsec.Combinator.choice [number, negative]

simpleParse integer "100"
simpleParse integer "-100"
simpleParse integer "20"
simpleParse integer " 10"

Right 100

Right (-100)

Right 20

Left "An error has occurred" (line 1, column 1):
unexpected " "
expecting digit or "-"

The choice combinator tries out the parser on the left. If it fails, instead of giving an error, it tries the next one, and so on (we can chain many parsers). If none of the parsers succeed, only then is an error thrown.

The last one failed - this is to be expected. We never really specified any spaces. However, in real world applications of Parsec, we are usually interested in tokens in a parsed text. These tokens are quite often surrounded by whitespace we are not interested in at all! Well, unless we're parsing Python or something.

There is a simple workaround, however. We just need to ignore the spaces.

We do this in two steps:

Strip the leading spaces with the parser
Strip the trailing spaces by defining a token parser

In [8]:

import Text.Parsec.Char (spaces)

token :: Parser a -> Parser a
token = (<* spaces)

betterParse :: Parser a -> String -> Either ParseError a
betterParse p = parse (spaces *> p) err

integer' = token integer

betterParse integer' " 200"
betterParse integer' " -50   "
betterParse (many1 integer') "20 30 40 -18 2 -1"

Right 200

Right (-50)

Right [20,30,40,-18,2,-1]

There is one problem with the choice combinator <|>, however. If the left side of the expression consumes input before it evaluates to an error and hands the string off to the right side, that part of the string will remain consumed.

Let us implement a simple parser that parses two expressions - word lists and integer lists. Let these lists be formed like:

[dog, cat, hamster]

[3, 1, 2]

We will define a helper data type to hold such disparate types.

In [9]:

import Text.Parsec.Combinator (sepBy1)
import Text.Parsec.Char (letter)

symbol :: Char -> Parser Char
symbol = token . char

data MyList = WordList { text :: [String] }
            | IntList  { ints :: [Int] } deriving (Eq, Show)


-- Some helper functions
brackets :: Parser a -> Parser a
brackets = (<* symbol ']') . (symbol '[' *>)

elements :: Parser a -> Parser [a]
elements = flip sepBy1 (symbol ',')

listOf :: Parser a -> Parser [a]
listOf = brackets . elements . token

list :: Parser MyList
list = intlist <|> strlist
    where intlist  = IntList  <$> listOf integer'
          strlist  = WordList <$> listOf (many1 letter)
          
betterParse list "[1, 2, 3, 6, 8]"
betterParse list "[dogcat, monkeypig]" -- WHY?!

Right (IntList {ints = [1,2,3,6,8]})

Left "An error has occurred" (line 1, column 2):
unexpected "d"
expecting white space, digit or "-"

Well, because. We already consumed the '[' symbol before we realised the first parser cannot match the expression, so the second parser doesn't match either. We have to "unconsume" the first symbol. Luckily, people who made Parsec thought of this and created the try function. Basically, if it succeeds in consuming the input, it behaves normally. If it breaks something, however, it claims it never even touched it and passes the entire input onwards.

Let's use that!

In [10]:

import Text.Parsec (try)

list' :: Parser MyList
list' = try intlist <|> strlist
    where intlist  = IntList  <$> listOf integer'
          strlist  = WordList <$> listOf (many1 letter)
          
betterParse list' "[1, 2]"
betterParse list' "[manbearpig, unladenswallow]"

Right (IntList {ints = [1,2]})

Right (WordList {text = ["manbearpig","unladenswallow"]})

The try function is especially useful when dealing with variables and keywords. In many languages if is a keyword, but iff is a valid identified. Of course, you notice it's not a valid character halfway through parsing it, so backtracking is a must.

A small computing language¶

Now that we know some of the basics of Parsec, let us apply them to a particular problem. We want to define a simple language that uses variables and integers to perform simple computations. An example of such a program would be:

a = 3 + 5
b = a - 2 + 7
a = 1

Each expressions is an assignment expression consisting of two parts:

A left hand side variable to which a value is being assigned
A right hand side expression consisting of values, variables, and the + and - operators.

Let us start out by defining the structures of such a language and their evaluation.

In [15]:

import qualified Data.Map as M
import Control.Monad (join)

data Expression = Val   Int
                | Var   String
                | Plus  Expression Expression
                | Minus Expression Expression
                  deriving Show

data Assignment = Assignment String Expression deriving Show

type Program = [Assignment]
                             
type VarTable = M.Map String Int

eval :: VarTable -> Expression -> Maybe Int
eval vt e = case e of (Val   v)   -> Just v
                      (Var   v)   -> M.lookup v vt
                      (Plus  a b) -> (+) <$> eval vt a <*> eval vt b
                      (Minus a b) -> (-) <$> eval vt a <*> eval vt b
                      
assign :: Assignment -> VarTable -> Maybe VarTable
assign (Assignment v e) vt = insert v vt <$> eval vt e
    where insert k = flip (M.insert k)
    
run :: Program -> Maybe VarTable
run = run' (Just M.empty)
    where run' vt []     = vt
          run' vt (a:as) = run' (join $ assign a <$> vt) as

-- a = 7 + -3
-- b = a - 1
-- a = -2
-- c = a + b
program = [ Assignment "a" (Plus (Val 7) (Val (-3)))
          , Assignment "b" (Minus (Var "a") (Val 1))
          , Assignment "a" (Val (-2))
          , Assignment "c" (Plus (Var "a") (Var "b"))
          ]
          
run program

Just (fromList [("a",-2),("b",3),("c",1)])

We now have an evaluator for our small language. However, writing it in Haskell is both tedious and impractical. We want to write it as text and then parse it and run it, like a proper programming language.

We already have a definition for integers, but we need one for variables. Let us say the first character of a variable name has to be a letter and the rest can be alphanumeric.

In [16]:

import Text.Parsec.Char (alphaNum)

variable :: Parser String
variable = token $ letter <:> many alphaNum

betterParse variable "var"
betterParse variable " x12"
betterParse variable "1x"

Right "var"

Right "x12"

Left "An error has occurred" (line 1, column 1):
unexpected "1"
expecting white space or letter

Implementing the Expression parser will sadly not be so trivial. In implementing this operation (and other similar operations) we encounter a problem - left recursion. The Plus and Minus instances use infix operators. This complicates things. There are two ways to solve this problem - manually and using operator tables. We will go through both.

First, let us implement an Expression parser manually, using chainl1.

In [17]:

import Text.Parsec.Combinator (chainl1)

operator :: Parser (Expression -> Expression -> Expression)
operator = plus <|> minus
    where plus  = symbol '+' *> return Plus
          minus = symbol '-' *> return Minus

expression :: Parser Expression
expression = term `chainl1` operator
    where term = val <|> var
          var  = Var <$> variable
          val  = Val <$> integer'

betterParse expression " 9"
betterParse expression "-9"
betterParse expression "1 + 9"
betterParse expression "1 - 9"
betterParse expression "a + 3 - b + c"
betterParse expression "1--3"

Right (Val 9)

Right (Val (-9))

Right (Plus (Val 1) (Val 9))

Right (Minus (Val 1) (Val 9))

Right (Plus (Minus (Plus (Var "a") (Val 3)) (Var "b")) (Var "c"))

Right (Minus (Val 1) (Val (-3)))

This is trivial for such a simple use case, but sometimes we are also concerned with operator priority, have a multitude of possible subexpressions and so on. For that reason, Parsec enables us to quickly build parsers for infix (and prefix) operations using operator tables.

We define a list of lists of operators. Each sublist contains a set of operators. The order does matter - the operators in the first list have the highest priority, the operators in the second second highest and so on.

Let us create a parser for the same example:

In [18]:

import Text.Parsec.Expr

-- Both of our operators have the same priority
table = [[binary '+' Plus, binary '-' Minus]]
        where binary name f = Infix (f <$ (symbol name)) AssocLeft
        
expression' :: Parser Expression
expression' = buildExpressionParser table other
    where other = var <|> val
          var   = Var <$> variable
          val   = Val <$> integer'
          
betterParse expression' "9"
betterParse expression' "-9"
betterParse expression' "1 + 9"
betterParse expression' "1 - 9"
betterParse expression' "a + 3 - b + c"
betterParse expression' "1--3"

Right (Val 9)

Right (Val (-9))

Right (Plus (Val 1) (Val 9))

Right (Minus (Val 1) (Val 9))

Right (Plus (Minus (Plus (Var "a") (Val 3)) (Var "b")) (Var "c"))

Right (Minus (Val 1) (Val (-3)))

Which one you use is a matter of preference, but the second one is certainly preferable in more complex cass.

Note the variable named other in the example. The table lists various combinator function, while the other argument lists parsers that are not constructed by applying unary or binary functions to instances of Expression.

Let us conclude this example by creating the parser for Assignment and a program as a whole, as well as a simple program interpreter.

In [37]:

assignment :: Parser Assignment
assignment = Assignment <$> variable <*> (symbol '=' *> expression)

program :: Parser Program
program = many1 assignment

interpret :: String -> Maybe VarTable
interpret s = case prog of Left  e -> Nothing
                           Right p -> run   p
    where prog = betterParse program s
    
example = "a = 7 - 9 + 3 + 15\nb = a - 9\nc = a + a + b\na = c + b"

interpret example

Just (fromList [("a",46),("b",7),("c",39)])

As you can see, Parsec is quite good at building complex parsers from simple building blocks. This, of course, does not show the full power of Parsec, but the majority of its functionalities were covered here.

For some other useful functions look at the Parsec documentation, especially the Text.Parsec.Char and Text.Parsec.Combinator libraries, as well as the Text.Parsec.Expression entry if you have need for Prefix operators and other associativites (Right or None).

Monadic (`do`) style parsers¶

Parsers are not only Applicatives, they are also monads (as we saw at the beginning). That means we can use do notation, which we are more familiar with. I originally chose Applicative notation due to style preferences.

Let us rewrite the same set of parsers using do notation instead of the Applicative style. Some may find this easier to understand and implement. We will define each parsing as an "action" of some sort.

In [40]:

import Control.Monad (void)
-- Void discards an action -> it turns an m a  to an m ()
-- where m is some monad (IO for example, so int this case IO a -> IO ())

mToken :: Parser a -> Parser a
mToken p = do
    x <- p
    void $ spaces -- Spaces are read, but they are not stored. We just "skip" them.
    return x
    
mSymbol :: Char -> Parser Char
mSymbol = mToken . char

mNumber :: Parser Int
mNumber = do
    num <- many1 digit
    return . read $ num
    
mNegative :: Parser Int
mNegative = do
    pref <- char '-'
    num  <- many1 digit
    return . read $ pref : num

mInteger :: Parser Int
mInteger = mToken p
    where p = mNumber <|> mNegative

mVariable :: Parser String
mVariable = do
    h <- letter
    t <- many alphaNum
    mToken . return $ h : t
    
mTable = [[binary '+' Plus, binary '-' Minus]]
    where binary sym f = Infix (mkParser sym f) AssocLeft
          mkParser s f = do
              void $ mSymbol s -- read +/-, but discard it. It HAS to be there but is not used.
              return f

mExpression :: Parser Expression
mExpression = buildExpressionParser mTable other
    where other = var <|> val
          var   = do
              v <- mVariable
              return $ Var v
          val   = do
              v <- mInteger
              return $ Val v
              
mAssignment :: Parser Assignment
mAssignment = do
    name <- mVariable
    void $ mSymbol '='
    expr <- mExpression
    return $ Assignment name expr
    
mProgram :: Parser Program
mProgram = many1 mAssignment

mInterpret :: String -> Maybe VarTable
mInterpret s = case prog of Right p -> run p
                            _       -> Nothing
            where prog = betterParse mProgram s
            
mInterpret example

Just (fromList [("a",46),("b",7),("c",39)])

Introduction to Parsec¶

A small computing language¶

Monadic (do) style parsers¶

Monadic (`do`) style parsers¶