University of Zagreb
Faculty of Electrical Engineering and Computing
Programming in Haskell
http://www.fer.unizg.hr/predmet/puh
Academic Year 2014/2015
© 2014 Luka Skukan
Version 1.1
Changelog
do
notation examplesParsec is a parser combinator library. What does that mean?
Parsers are constructed from simple buildings blocks that are combined in various ways. Let us begin by constructing a simple parser that parses a single digit.
import Text.Parsec.String (Parser)
import Text.Parsec.Char (digit)
import Text.Parsec (parse, ParseError)
import Control.Applicative ((<$>), (<$), (<*>), (<*), (*>), Applicative)
import Data.Char (digitToInt)
err = "An error has occurred"
simpleParse :: Parser a -> String -> Either ParseError a
simpleParse p = parse p err
parseDigit :: Parser Int
parseDigit = digitToInt <$> digit
simpleParse parseDigit "2"
simpleParse parseDigit "9"
simpleParse parseDigit "a"
simpleParse parseDigit "99"
Right 2
Right 9
Left "An error has occurred" (line 1, column 1): unexpected "a" expecting digit
Right 9
There are a few things we can note from that example:
Parser
that parses a String
will be a Parser String
.Parser
, a default error message and a String
to parse.Left ParseError
(like the in the third example) or Right Int
, the thing we got out of the parser, if the thing goes right.digitToInt
is pure. Parsers are not! We need to "lift" the function to an impure level. I used the Applicative style, simply due to personal preference. However, we could also use the monadic style you use in IO
or (in this particular case) fmap
. In fact, it turns out that <$>
is fmap.Monadic style:
import Control.Monad (liftM)
parseDigit' :: Parser Int
parseDigit' = do
d <- digit
return . digitToInt $ d
-- Alternatively (we didn't do this yet I think?)
parseDigit'' :: Parser Int
parseDigit'' = liftM digitToInt digit
simpleParse parseDigit' "8"
simpleParse parseDigit'' "8"
Right 8
Right 8
So, getting a single digit is simple enough. But what if we want more than one digit?
This is where the combinator part comes into play. We will use a quantifier - many1
(think of it as Kleene's + operator in regular expressions. Or don't if that doesn't help). There is also a many
quantifier that works like Kleene's * operator.
Let us implement a parser for positive integers in general.
import Text.Parsec.Combinator (many1)
import Control.Applicative (many)
number :: Parser Int
number = read <$> many1 digit
simpleParse number "100"
simpleParse number "92"
simpleParse number "-10"
Right 100
Right 92
Left "An error has occurred" (line 1, column 1): unexpected "-" expecting digit
Okay! That works for positive integers! But what about negative integers? That doesn't quite work with this - our parser makes no mention of minuses. Let's give it a go.
import Text.Parsec.Char (char)
infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a]
a <:> b = (:) <$> a <*> b
negative :: Parser Int
negative = read <$> char '-' <:> many1 digit
simpleParse negative "-10"
simpleParse negative "-9999"
simpleParse negative "-0"
simpleParse negative "20"
Right (-10)
Right (-9999)
Right 0
Left "An error has occurred" (line 1, column 1): unexpected "2" expecting "-"
Note the <*>
operator. The rule is more or less like this - for the first argument of a function you are lifting you use <$>
. For other arguments, you use <*>
.
So, we now have two separate parsers - one that can parse positive integers and one that can parse negative integers. This is fine, but we would prefer a parser than can parse both and properly evaluate their value.
We can use the choice combinator <|>
!
import Control.Applicative ((<|>))
integer :: Parser Int
integer = number <|> negative
-- alternatively Text.Parsec.Combinator.choice [number, negative]
simpleParse integer "100"
simpleParse integer "-100"
simpleParse integer "20"
simpleParse integer " 10"
Right 100
Right (-100)
Right 20
Left "An error has occurred" (line 1, column 1): unexpected " " expecting digit or "-"
The choice combinator tries out the parser on the left. If it fails, instead of giving an error, it tries the next one, and so on (we can chain many parsers). If none of the parsers succeed, only then is an error thrown.
The last one failed - this is to be expected. We never really specified any spaces. However, in real world applications of Parsec, we are usually interested in tokens in a parsed text. These tokens are quite often surrounded by whitespace we are not interested in at all! Well, unless we're parsing Python or something.
There is a simple workaround, however. We just need to ignore the spaces.
We do this in two steps:
import Text.Parsec.Char (spaces)
token :: Parser a -> Parser a
token = (<* spaces)
betterParse :: Parser a -> String -> Either ParseError a
betterParse p = parse (spaces *> p) err
integer' = token integer
betterParse integer' " 200"
betterParse integer' " -50 "
betterParse (many1 integer') "20 30 40 -18 2 -1"
Right 200
Right (-50)
Right [20,30,40,-18,2,-1]
There is one problem with the choice combinator <|>
, however. If the left side of the expression consumes input before it evaluates to an error and hands the string off to the right side, that part of the string will remain consumed.
Let us implement a simple parser that parses two expressions - word lists and integer lists. Let these lists be formed like:
[dog, cat, hamster]
[3, 1, 2]
We will define a helper data type to hold such disparate types.
import Text.Parsec.Combinator (sepBy1)
import Text.Parsec.Char (letter)
symbol :: Char -> Parser Char
symbol = token . char
data MyList = WordList { text :: [String] }
| IntList { ints :: [Int] } deriving (Eq, Show)
-- Some helper functions
brackets :: Parser a -> Parser a
brackets = (<* symbol ']') . (symbol '[' *>)
elements :: Parser a -> Parser [a]
elements = flip sepBy1 (symbol ',')
listOf :: Parser a -> Parser [a]
listOf = brackets . elements . token
list :: Parser MyList
list = intlist <|> strlist
where intlist = IntList <$> listOf integer'
strlist = WordList <$> listOf (many1 letter)
betterParse list "[1, 2, 3, 6, 8]"
betterParse list "[dogcat, monkeypig]" -- WHY?!
Right (IntList {ints = [1,2,3,6,8]})
Left "An error has occurred" (line 1, column 2): unexpected "d" expecting white space, digit or "-"
Well, because.
We already consumed the '['
symbol before we realised the first parser cannot match the expression, so the second parser doesn't match either. We have to "unconsume" the first symbol. Luckily, people who made Parsec thought of this and created the try
function. Basically, if it succeeds in consuming the input, it behaves normally. If it breaks something, however, it claims it never even touched it and passes the entire input onwards.
Let's use that!
import Text.Parsec (try)
list' :: Parser MyList
list' = try intlist <|> strlist
where intlist = IntList <$> listOf integer'
strlist = WordList <$> listOf (many1 letter)
betterParse list' "[1, 2]"
betterParse list' "[manbearpig, unladenswallow]"
Right (IntList {ints = [1,2]})
Right (WordList {text = ["manbearpig","unladenswallow"]})
The try
function is especially useful when dealing with variables and keywords. In many languages if
is a keyword, but iff
is a valid identified. Of course, you notice it's not a valid character halfway through parsing it, so backtracking is a must.
Now that we know some of the basics of Parsec, let us apply them to a particular problem. We want to define a simple language that uses variables and integers to perform simple computations. An example of such a program would be:
a = 3 + 5
b = a - 2 + 7
a = 1
Each expressions is an assignment expression consisting of two parts:
+
and -
operators.Let us start out by defining the structures of such a language and their evaluation.
import qualified Data.Map as M
import Control.Monad (join)
data Expression = Val Int
| Var String
| Plus Expression Expression
| Minus Expression Expression
deriving Show
data Assignment = Assignment String Expression deriving Show
type Program = [Assignment]
type VarTable = M.Map String Int
eval :: VarTable -> Expression -> Maybe Int
eval vt e = case e of (Val v) -> Just v
(Var v) -> M.lookup v vt
(Plus a b) -> (+) <$> eval vt a <*> eval vt b
(Minus a b) -> (-) <$> eval vt a <*> eval vt b
assign :: Assignment -> VarTable -> Maybe VarTable
assign (Assignment v e) vt = insert v vt <$> eval vt e
where insert k = flip (M.insert k)
run :: Program -> Maybe VarTable
run = run' (Just M.empty)
where run' vt [] = vt
run' vt (a:as) = run' (join $ assign a <$> vt) as
-- a = 7 + -3
-- b = a - 1
-- a = -2
-- c = a + b
program = [ Assignment "a" (Plus (Val 7) (Val (-3)))
, Assignment "b" (Minus (Var "a") (Val 1))
, Assignment "a" (Val (-2))
, Assignment "c" (Plus (Var "a") (Var "b"))
]
run program
Just (fromList [("a",-2),("b",3),("c",1)])
We now have an evaluator for our small language. However, writing it in Haskell is both tedious and impractical. We want to write it as text and then parse it and run it, like a proper programming language.
We already have a definition for integers, but we need one for variables. Let us say the first character of a variable name has to be a letter and the rest can be alphanumeric.
import Text.Parsec.Char (alphaNum)
variable :: Parser String
variable = token $ letter <:> many alphaNum
betterParse variable "var"
betterParse variable " x12"
betterParse variable "1x"
Right "var"
Right "x12"
Left "An error has occurred" (line 1, column 1): unexpected "1" expecting white space or letter
Implementing the Expression
parser will sadly not be so trivial. In implementing this operation (and other similar operations) we encounter a problem - left recursion. The Plus
and Minus
instances use infix operators. This complicates things. There are two ways to solve this problem - manually and using operator tables. We will go through both.
First, let us implement an Expression parser manually, using chainl1
.
import Text.Parsec.Combinator (chainl1)
operator :: Parser (Expression -> Expression -> Expression)
operator = plus <|> minus
where plus = symbol '+' *> return Plus
minus = symbol '-' *> return Minus
expression :: Parser Expression
expression = term `chainl1` operator
where term = val <|> var
var = Var <$> variable
val = Val <$> integer'
betterParse expression " 9"
betterParse expression "-9"
betterParse expression "1 + 9"
betterParse expression "1 - 9"
betterParse expression "a + 3 - b + c"
betterParse expression "1--3"
Right (Val 9)
Right (Val (-9))
Right (Plus (Val 1) (Val 9))
Right (Minus (Val 1) (Val 9))
Right (Plus (Minus (Plus (Var "a") (Val 3)) (Var "b")) (Var "c"))
Right (Minus (Val 1) (Val (-3)))
This is trivial for such a simple use case, but sometimes we are also concerned with operator priority, have a multitude of possible subexpressions and so on. For that reason, Parsec enables us to quickly build parsers for infix (and prefix) operations using operator tables.
We define a list of lists of operators. Each sublist contains a set of operators. The order does matter - the operators in the first list have the highest priority, the operators in the second second highest and so on.
Let us create a parser for the same example:
import Text.Parsec.Expr
-- Both of our operators have the same priority
table = [[binary '+' Plus, binary '-' Minus]]
where binary name f = Infix (f <$ (symbol name)) AssocLeft
expression' :: Parser Expression
expression' = buildExpressionParser table other
where other = var <|> val
var = Var <$> variable
val = Val <$> integer'
betterParse expression' "9"
betterParse expression' "-9"
betterParse expression' "1 + 9"
betterParse expression' "1 - 9"
betterParse expression' "a + 3 - b + c"
betterParse expression' "1--3"
Right (Val 9)
Right (Val (-9))
Right (Plus (Val 1) (Val 9))
Right (Minus (Val 1) (Val 9))
Right (Plus (Minus (Plus (Var "a") (Val 3)) (Var "b")) (Var "c"))
Right (Minus (Val 1) (Val (-3)))
Which one you use is a matter of preference, but the second one is certainly preferable in more complex cass.
Note the variable named other
in the example. The table lists various combinator function, while the other
argument lists parsers that are not constructed by applying unary or binary functions to instances of Expression
.
Let us conclude this example by creating the parser for Assignment
and a program as a whole, as well as a simple program interpreter.
assignment :: Parser Assignment
assignment = Assignment <$> variable <*> (symbol '=' *> expression)
program :: Parser Program
program = many1 assignment
interpret :: String -> Maybe VarTable
interpret s = case prog of Left e -> Nothing
Right p -> run p
where prog = betterParse program s
example = "a = 7 - 9 + 3 + 15\nb = a - 9\nc = a + a + b\na = c + b"
interpret example
Just (fromList [("a",46),("b",7),("c",39)])
As you can see, Parsec is quite good at building complex parsers from simple building blocks. This, of course, does not show the full power of Parsec, but the majority of its functionalities were covered here.
For some other useful functions look at the Parsec documentation, especially the Text.Parsec.Char
and Text.Parsec.Combinator
libraries, as well as the Text.Parsec.Expression
entry if you have need for Prefix operators and other associativites (Right or None).
do
) style parsers¶Parsers are not only Applicatives, they are also monads (as we saw at the beginning). That means we can use do
notation, which we are more familiar with. I originally chose Applicative notation due to style preferences.
Let us rewrite the same set of parsers using do
notation instead of the Applicative
style. Some may find this easier to understand and implement. We will define each parsing as an "action" of some sort.
import Control.Monad (void)
-- Void discards an action -> it turns an m a to an m ()
-- where m is some monad (IO for example, so int this case IO a -> IO ())
mToken :: Parser a -> Parser a
mToken p = do
x <- p
void $ spaces -- Spaces are read, but they are not stored. We just "skip" them.
return x
mSymbol :: Char -> Parser Char
mSymbol = mToken . char
mNumber :: Parser Int
mNumber = do
num <- many1 digit
return . read $ num
mNegative :: Parser Int
mNegative = do
pref <- char '-'
num <- many1 digit
return . read $ pref : num
mInteger :: Parser Int
mInteger = mToken p
where p = mNumber <|> mNegative
mVariable :: Parser String
mVariable = do
h <- letter
t <- many alphaNum
mToken . return $ h : t
mTable = [[binary '+' Plus, binary '-' Minus]]
where binary sym f = Infix (mkParser sym f) AssocLeft
mkParser s f = do
void $ mSymbol s -- read +/-, but discard it. It HAS to be there but is not used.
return f
mExpression :: Parser Expression
mExpression = buildExpressionParser mTable other
where other = var <|> val
var = do
v <- mVariable
return $ Var v
val = do
v <- mInteger
return $ Val v
mAssignment :: Parser Assignment
mAssignment = do
name <- mVariable
void $ mSymbol '='
expr <- mExpression
return $ Assignment name expr
mProgram :: Parser Program
mProgram = many1 mAssignment
mInterpret :: String -> Maybe VarTable
mInterpret s = case prog of Right p -> run p
_ -> Nothing
where prog = betterParse mProgram s
mInterpret example
Just (fromList [("a",46),("b",7),("c",39)])