Parsing with Haskell Haskell is an excellent language for all your parsing needs. The functional nature of the language makes it easy to compose different building blocks together without worrying about nasty side effects and unforeseen consequences. Since the language is so well-suited for parsing, there are several different libraries out there. Each of them differ a bit in their approaches. We'll explore three libraries in this series. Parsing Primer: Gherkin Syntax Haskell is a truly awesome language for parsing. Haskell expressions tend to compose in simple ways with very clearly controlled side effects. This provides an ideal environment in which to break down parsing into simpler tasks. Thus there are many excellent parsing libraries out there. In this series, we'll be taking a tour of some of these libraries. But before we look at specific code, it will be useful to establish a common example for what we're going to be parsing. In this first part, I'll introduce Gherkin Syntax, the language behind the Cucumber framework. We'll go through the language specifics, then show the basics of how we set ourselves up for success in Haskell. If you're already familiar with Gherkin syntax, feel free to move on to part 2 of this series! Or if you want to challenge yourself with more libraries to learn, you can download our Production Checklist. It'll give you a survey of libraries for many different tasks! Like some of our other series, there's a Github Repository that has all the code for these tutorials! You can observe some Gherkin examples in this directory. We'll compose our Haskell types in this module. GHERKIN BACKGROUND Cucumber is a framework for Behavior Driven Development. Under BDD, we first describe all the general behaviors we want our code to perform in plain language. This paradigm is an alternative to Test Driven Development. There, we use test cases to determine our next programming objectives. But BDD can do both of these if we can take behavior descriptions and automatically create tests from them! This would allow less technical members of a project team to effectively write tests! The main challenge of this is formalizing a language for describing these behaviors. If we have a formal language, then we can parse it. If we can parse it into a reasonable structure, then we can turn that structure into runnable test code. This series will focus on the second part of this problem: turning Gherkin Syntax into a data structure (a Haskell data structure, in our case). GHERKIN SYNTAX Gherkin syntax has many complexities, but for these articles we'll be focusing on the core elements of it. The behaviors you want to test are broken down into a series of features. We describe each feature in its own .feature file. So our overarching task is to read input from a single file and turn it into a Feature object. We begin our description of a feature with the Feature keyword (obviously). We'll give it a title, and then give it an indented description (our example will be a simple banking app): Feature: Registering a User As a potential user I want to be able to create an account with a username, email and password So that I can start depositing money into my account Each feature then has a series of scenarios. These describe specific cases of what can happen as part of this feature. Each scenario begins with the Scenario keyword and a title. Scenario: Successful registration ... Scenario: Email is already taken ... Scenario: Username is already taken ... Each scenario then has a series of Gherkin statements. These statements begin with one of the keywords Given, When, Then, or And. You should use Given statements to describe pre-conditions of the scenario. Then you'll use When to describe the particular action a user is taking to initiate the scenario. And finally, you'll use Then to describe the after effects. Scenario: Email is already taken Given there is already an account with the email "test@test.com" When I register an account with username "test", email "test@test.com" and password "1234abcd!?" Then it should fail with an error: "An account with that email already exists" You can supplement any of these cases with a statement beginning with And. Scenario: Email is already taken Given there is already an account with the email "test@test.com" And there is already an account with the username "test" When I register an account with username "test", email "test@test.com" and password "1234abcd!?" Then it should fail with an error: "An account with that email already exists" And there should still only be one account with the email "test@test.com" Gherkin syntax does not enforce that you use the different keywords in a semantically sound way. We could start every statement with Given and it would still work. But obviously you should do whatever you can to make your tests sound correct. We can also fill in statements with variables in angle brackets. We'll then follow the scenario with a table of examples for those variables: Scenario: Successful Registration Given There is no account with username or email When I register the account with username , email and password Then it should successfully create the account with , , and Examples: | username | email | password | | john doe | john@doe.com | ABCD1234!? | | jane doe | jane.doe@gmail.com | abcdefgh1.aba | | jackson | jackson@yahoo.com | cadsw4ll0p/ | We can also create a Background for the whole feature. This is a scenario-like description of preconditions that exist for every scenario in that feature. This can also have an example table: Feature: User Log In ... Background: Given: There is an existing user with username , email and password Examples: | username | email | password | | john doe | john@doe.com | ABCD1234!? | | jane doe | jane.doe@gmail.com | abcdefgh1.aba | And that's the whole language we're going to be working with! HASKELL DATA STRUCTURES Let's appreciate now how easy it is to create data structures in Haskell to represent this syntax. We'll start with a description of a Feature. It has a title, description (which we'll treat as a list of multiple lines), the background, and then a list of scenarios. We'll also treat the background like a "nameless" scenario that may or may not exist: data Feature = Feature { featureTitle :: String , featureDescription :: [String] , featureBackground :: Maybe Scenario , featureScenarios :: [Scenario] } Now let's describe what a Scenario is. It's main components are its title and a list of statements. We'll also observe that we should have some kind of structure for the list of examples we'll provide: data Scenario = Scenario { scenarioTitle :: String , scenarioStatements :: [Statement] , scenarioExamples :: ExampleTable } This ExampleTable will store a list of possible keys as well as list of tuple maps. Each tuple will contain keys and values. At the scale we're likely to be working at, it's not worth it to use a full Map: data ExampleTable = ExampleTable { exampleTableKeys :: [String] , exampleTableExamples :: [[(String, Value)]] } Now we'll have to define what we mean by a Value. We'll keep it simple and only use literal bools, strings, numbers, and a null value: data Value = ValueNumber Scientific | ValueString String | ValueBool Bool | ValueNull And finally we'll describe a statement. This will have the string itself, as well as a list of variable keywords to interpolate: data Statement = Statement { statementText :: String , statementExampleVariables :: [String] } And that's all there is too it! We can put all these types in a single file and feel pretty good about that. In Java or C++, we would want to make a separate file (or two!) for each type and there would be a lot more boilerplate involved. GENERAL PARSING APPROACH Another reason we'll see that Haskell is good for parsing is the ease of breaking problems down into smaller pieces. We'll have one function for parsing an example table, a different function for parsing a statement, and so on. Then gluing these together will actually be slick and simple! CONCLUSION Now you can move on to part 2 where we'll actually look at how to start parsing this. The first library we'll use is the regex-applicative parsing library. We'll see how we can get a lot of what we want without even using a monadic context! For some more ideas on parsing libraries you can use, check out our free Production Checklist. It will tell you about different libraries for parsing as well as a great many other tasks, from data structures to web APIs! Applicative Parsing In part 1 of this series, we prepared ourselves for parsing by going over the basics of the Gherkin Syntax. In this part, we'll be using Regular Expression (Regex) based, applicative parsing to parse the syntax. We'll start by focusing on the fundamentals of this library and building up a vocabulary of combinators to use. We'll make heavy use of the Applicative typeclass. If you need a refresher on that, check out this article. As we start coding, you can also follow along with the examples on Github here! Most of the code here is in the RegexParser module. If you're itching to try a monadic approach to parsing, be sure to check out part 3 of this series, where we'll learn about the Attoparsec library. If you want to learn about a wider variety of production utilities, download our Production Checklist. It summarizes many other useful libraries for writing higher level Haskell. GETTING STARTED So to start parsing, let's make some notes about our input format. First, we'll treat our input feature document as a single string. We'll remove all empty lines, and then trim leading and trailing whitespace from each line. parseFeatureFromFile :: FilePath -> IO Feature parseFeatureFromFile inputFile = do fileContents <- lines <$> readFile inputFile let nonEmptyLines = filter (not . isEmpty) fileContents let trimmedLines = map trim nonEmptyLines let finalString = unlines trimmedLines case parseFeature finalString of ... ... isEmpty :: String -> Bool isEmpty = all isSpace trim :: String -> String trim input = reverse flippedTrimmed where trimStart = dropWhile isSpace input flipped = reverse trimStart flippedTrimmed = dropWhile isSpace flipped This means a few things for our syntax. First, we don't care about indentation. Second, we ignore extra lines. This means our parsers might allow certain formats we don't want. But that's OK because we're trying to keep things simple. THE RE TYPE With applicative based parsing, the main data type we'll be working with is called RE, for regular expression. This represents a parser, and it's parameterized by two types: data RE s a = ... The s type refers to the fundamental unit we'll be parsing. Since we're parsing our input as a single String, this will be Char. Then the a type is the result of the parsing element. This varies from parser to parser. The most basic combinator we can use is sym. This parses a single symbol of your choosing: sym :: s - > RE s s parseLowercaseA :: RE Char Char parseLowercaseA = sym 'a' To use an RE parser, we call the match function or its infix equivalent =~. These will return a Just value if we can match the entire input string, and Nothing otherwise: >> match parseLowercaseA "a" Just 'a' >> "b" =~ parseLowercaseA Nothing >> "ab" =~ parseLowercaseA Nothing -- (Needs to parse entire input) PREDICATES AND STRINGS Naturally, we'll want some more complicated functionality. Instead of parsing a single input character, we can parse any character that fits a particular predicate by using psym. So if we want to read any character that was not a newline, we could do: parseNonNewline :: RE Char Char parseNonNewline = psym (/= '\n') The string combinator allows us to match a particular full string and then return it: readFeatureWord :: RE Char String readFeatureWord = string "Feature" We'll use this for parsing keywords, though we'll often end up discarding the "result". APPLICATIVE COMBINATORS Now the RE type is applicative. This means we can apply all kinds of applicative combinators over it. One of these is many, which allows us to apply a single parser several times. Here is one combinator that we'll use a lot. It allows us to read everything up until a newline and return the resulting string: readUntilEndOfLine :: RE Char String readUntilEndOfLine = many (psym (/= '\n')) Beyond this, we'll want to make use of the applicative <*> operator to combine different parsers. We can also apply a pure function (or constructor) on top of those by using <$>. Suppose we have a data type that stores two characters. Here's how we can build a parser for it: data TwoChars = TwoChars Char Char parseTwoChars :: RE Char TwoChars parseTwoChars = TwoChars <$> parseNonNewline <*> parseNonNewline ... >> match parseTwoChars "ab" Just (TwoChars 'a' 'b') We can also use <* and *>, which are cousins of the main applicative operator. The first one will parse but then ignore the right hand parse result. The second discards the left side result. parseFirst :: RE Char Char parseFirst = parseNonNewline <* parseNonNewline parseSecond :: RE Char Char parseSecond = parseNonNewline *> parseNonnewline >> match parseFirst "ab" Just 'a' >> match parseSecond "ab" Just 'b' >> match parseFirst "a" Nothing Notice the last one fails because the parser needs to have both inputs! We'll come back to this idea of failure in a second. But now that we know this technique, we can write a couple other useful parsers: readThroughEndOfLine :: RE Char String readThroughEndOfLine = readUntilEndOfLine <* sym '\n' readThroughBar :: RE Char String readThroughBar = readUntilBar <* sym '|' readUntilBar :: RE Char String readUntilBar = many (psym (\c -> c /= '|' && c /= '\n')) The first will parse the rest of the line and then consume the newline character itself. The other parsers accomplish this same task, except with the vertical bar character. We'll need these when we parse the Examples section further down. ALTERNATIVES: DEALING WITH PARSE FAILURE We introduced the notion of a parser "failing" up above. Of course, we need to be able to offer alternatives when a parser fails! Otherwise our language will be very limited in its structure. Luckily, the RE type also implements Alternative. This means we can use the <|> operator to determine an alternative parser when one fails. Let's see this in action: parseFeatureTitle :: RE Char String parseFeatureTitle = string "Feature: " *> readThroughEndOfLine parseScenarioTitle :: RE Char String parseScenarioTitle = string "Scenario: " *> readThroughEndOfLine parseEither :: RE Char String parseEither = parseFeatureTitle <|> parseScenarioTitle >> match parseFeatureTitle "Feature: Login\n" Just "Login" >> match parseFeatureTitle "Scenario: Login\n" Nothing >> match parseEither "Scenario: Login\n" Just "Login" Of course, if ALL the options fail, then we'll still have a failing parser! >> match parseEither "Random: Login\n" Nothing We'll need this to introduce some level of choice into our parsing system. For instance, it's up to the user if they want to include a Background as part of their feature. So we need to be able to read the background if it's there or else move onto parsing a scenario. VALUE PARSER In keeping with our approach from the last article, we're going to start with smaller elements of our syntax. Then we can use these to build larger ones with ease. To that end, let's build a parser for our Value type, the most basic data structure in our syntax. Let's recall what that looks like: data Value = ValueNull | ValueBool Bool | ValueString String | ValueNumber Scientific Since we have different constructors, we'll make a parser for each one. Then we can combine them with alternative syntax: valueParser :: RE Char Value valueParser = nullParser <|> boolParser <|> numberParser <|> stringParser Now our parsers for the null values and boolean values are easy. For each of them, we'll give a few different options about what strings we can use to represent those elements. Then, as with the larger parser, we'll combine them with <|>. nullParser :: RE Char Value nullParser = (string "null" <|> string "NULL" <|> string "Null") *> pure ValueNull boolParser :: RE Char Value boolParser = trueParser *> pure (ValueBool True) <|> falseParser *> pure (ValueBool False) where trueParser = string "True" <|> string "true" <|> string "TRUE" falseParser = string "False" <|> string "false" <|> string "FALSE" ```haskell Notice in both these cases we discard the actual string with *> and then return our constructor. We have to wrap the desired result with pure. # NUMBER AND STRING VALUES Numbers and strings are a little more complicated since we can't rely on hard-coded formats. In the case of numbers, we'll account for integers, decimals, and negative numbers. We'll ignore scientific notation for now. An integer is simple to parse, since we'll have many characters that are all numbers. We use some instead of many to enforce that there is at least one: ```haskell numberParser :: RE Char Value numberPaser = ... where integerParser = some (psym isNumber) A decimal parser will read some numbers, then a decimal point, and then more numbers. We'll insist there is at least one number after the decimal point. numberParser :: RE Char Value numberPaser = ... where integerParser = some (psym isNumber) decimalParser = many (psym isNumber) <*> sym '.' <*> some (psym isNumber) Finally, for negative numbers, we'll read a negative symbol and then one of the other parsers: numberParser :: RE Char Value numberPaser = ... where integerParser = some (psym isNumber) decimalParser = many (psym isNumber) <*> sym '.' <*> some (psym isNumber) negativeParser = sym '-' <*> (decimalParser <|> integerParser) However, we can't combine these parsers as is! Right now, they all return different results! The integer parser returns a single string. The decimal parser returns two strings and the decimal character, and so on. In general, we'll want to combine each parser's results into a single string and then pass them to the read function. This requires mapping a couple functions over our last two parsers: numberParser :: RE Char Value numberPaser = ... where integerParser = some (psym isNumber) decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber) negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser) combineDecimal :: String -> Char -> String -> String combineDecimal base point decimal = base ++ (point : decimal) Now all our number parsers return strings, so we can safely combine them. We'll map the ValueNumber constructor over the value we read from the string. numberParser :: RE Char Value numberPaser = (ValueNumber . read) <$> (negativeParser <|> decimalParser <|> integerParser) where ... Note that order matters! If we put the integer parser first, we'll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure. The last thing we need to do is read a string. We need to read everything in the example cell until we hit a vertical bar, but then ignore any whitespace. Luckily, we have the right combinator for this, and we've even written a trim function already! stringParser :: RE Char Value stringParser = (ValueString . trim) <$> readUntilBar And now our valueParser will work as expected! BUILDING AN EXAMPLE TABLE Now that we can parse individual values, let's figure out how to parse the full example table. We can use our individual value parser to parse a whole line of values! The first step is to read the vertical bar at the start of the line. exampleLineParser :: RE Char [Value] exampleLineParser = sym '|' *> ... Next, we'll build a parser for each cell. It will read the whitespace, then the value, and then read up through the next bar. exampleLineParser :: RE Char [Value] exampleLineParser = sym '|' *> ... where cellParser = many isNonNewlineSpace *> valueParser <* readThroughBar isNonNewlineSpace :: RE Char Char isNonNewlineSpace = psym (\c -> isSpace c && c /= '\n') Now we read many of these and finish by reading the newline: exampleLineParser :: RE Char [Value] exampleLineParser = sym '|' *> many cellParser <* readThroughEndOfLine where cellParser = many isNonNewlineSpace *> valueParser <* readThroughBar Now, we need a similar parser that reads the title column of our examples. This will have the same structure as the value cells, only it will read normal alphabetic strings instead of values. exampleColumnTitleLineParser :: RE Char [String] exampleColumnTitleLineParser = sym '|' *> many cellParser <* readThroughEndOfLine where cellParser = many isNonNewlineSpace *> many (psym isAlpha) <* readThroughBar Now we can start building the full example parser. We'll want to read the string, the column titles, and then the value lines. exampleTableParser :: RE Char ExampleTable exampleTableParser = (string "Examples:" *> readThroughEndOfLine) *> exampleColumnTitleLineParser <*> many exampleLineParser `` We're not quite done yet. We'll need to apply a function over these results that will produce the final ExampleTable. And the trick is that we want to map up the example keys with their values. We can accomplish this with a simple function. It will return zip the keys over each value list using map: ```haskell exampleTableParser :: RE Char ExampleTable exampleTableParser = buildExampleTable <$> (string "Examples:" *> readThroughEndOfLine) *> exampleColumnTitleLineParser <*> many exampleLineParser where buildExampleTable :: [String] -> [[Value]] -> ExampleTable buildExampleTable keys valueLists = ExampleTable keys (map (zip keys) valueLists) STATEMENTS Now we that we can parse the examples for a given scenario, we need to parse the Gherkin statements. To start with, let's make a generic parser that takes the keyword as an argument. Then our full parser will try each of the different statement keywords: parseStatementLine :: String -> RE Char Statement parseStatementLine signal = ... parseStatement :: RE Char Statement parseStatement = parseStatementLine "Given" <|> parseStatementLine "When" <|> parseStatementLine "Then" <|> parseStatementLine "And" Now we'll get the signal word out of the way and parse the statement line itself. parseStatementLine :: String -> RE Char Statement parseStatementLine signal = string signal *> sym ' ' *> ... Parsing the statement is tricky. We want to parse the keys inside brackets and separate them as keys. But we also want them as part of the statement's string. To that end, we'll make two helper parsers. First, nonBrackets will parse everything in a string up through a bracket (or a newline). nonBrackets :: RE Char String nonBrackets = many (psym (\c -> c /= '\n' && c /= '<')) We'll also want a parser that parses the brackets and returns the keyword inside: insideBrackets :: RE Char String insideBrackets = sym '<' *> many (psym (/= '>')) <* sym '>' Now to read a statement, we start with non-brackets, and alternate with keys in brackets. Let's observe that we start and end with non-brackets, since they can be empty. Thus we can represent a line a list of non-bracket/bracket pairs, followed by a last non-bracket part. To make a pair, we combine the parser results in a tuple using the (,) constructor enabled by TupleSections: parseStatementLine :: String -> RE Char Statement parseStatementLine signal = string signal *> sym ' ' *> many ((,) <$> nonBrackets <*> insideBrackets) <*> nonBrackets From here, we need a recursive function that will build up our final statement string and the list of keys. We do this with buildStatement. parseStatementLine :: String -> RE Char Statement parseStatementLine signal = string signal *> sym ' ' *> (buildStatement <$> many ((,) <$> nonBrackets <*> insideBrackets) <*> nonBrackets) where buildStatement :: [(String, String)] -> String -> (String, [String]) buildStatement [] last = (last, []) buildStatement ((str, key) : rest) rem = let (str', keys) = buildStatement rest rem in (str <> "<" <> key <> ">" <> str', key : keys) The last thing we need is a final helper that will take the result of buildStatement and turn it into a Statement. We'll call this finalizeStatement, and then we're done! parseStatementLine :: String -> RE Char Statement parseStatementLine signal = string signal *> sym ' ' *> (finalizeStatement . buildStatement <$> many ((,) <$> nonBrackets <*> insideBrackets) <*> nonBrackets) where buildStatement :: [(String, String)] -> String -> (String, [String]) buildStatement [] last = (last, []) buildStatement ((str, key) : rest) rem = let (str', keys) = buildStatement rest rem in (str <> "<" <> key <> ">" <> str', key : keys) finalizeStatement :: (String, [String]) -> Statement finalizeStatement (regex, variables) = Statement regex variables SCENARIOS Now that we have all our pieces in place, it's quite easy to write the parser for scenario! First we get the title by reading the keyword and then the rest of the line: scenarioParser :: RE Char Scenario scenarioParser = string "Scenario: " *> readThroughEndOfLine ... After that, we read many statements, and then the example table. Since the example table might not exist, we'll provide an alternative that is a pure, empty table. We can wrap everything together by mapping the Scenario constructor over it. scenarioParser :: RE Char Scenario scenarioParser = Scenario <$> (string "Scenario: " *> readThroughEndOfLine) <*> many (statementParser <* sym '\n') <*> (exampleTableParser <|> pure (ExampleTable [] [])) We can also make a "Background" parser that is very similar. All that changes is that we read the string "Background" instead of a title. Since we'll hard-code the title as "Background", we can include it with the constructor and map it over the parser. backgroundParser :: RE Char Scenario backgroundParser = Scenario "Background" <$> (string "Background:" *> readThroughEndOfLine) *> many (statementParser <* sym '\n') <*> (exampleTableParser <|> pure (ExampleTable [] [])) FINALLY THE FEATURE We're almost done! All we have left is to write the featureParser itself! As with scenarios, we'll start with the keyword and a title line: featureParser :: RE Char Feature featureParser = Feature <$> (string "Feature: " *> readThroughEndOfLine) <*> ... Now we'll use the optional combinator to parse the Background if it exists, but return Nothing if it doesn't. Then we'll wrap up with parsing many scenarios! featureParser :: RE Char Feature featureParser = Feature <$> (string "Feature: " *> readThroughEndOfLine) <*> pure [] <*> (optional backgroundParser) <*> (many scenarioParser) Note that here we're ignoring the "description" of a feature we proposed as part of our original syntax and simply giving an empty list of strings. Since there are no keywords for that, it turns out to be painful to deal with it using applicative parsing. When we look at monadic approaches starting next week, we'll see it isn't as hard there. CONCLUSION This wraps up our exploration of applicative parsing. We can see how well suited Haskell is for parsing. The functional nature of the language means it's easy to start with small building blocks like our first parsers. Then we can gradually combine them to make something larger. It can be a little tricky to wrap our heads around all the different operators and combinators. But once you understand the ways in which these let us combine our parsers, they make a lot of sense and are easy to use. You should now move onto part 3 of this series, where we will start learning about monadic parsing. You'll get to see how we use the Attoparsec library to parse this same Gherkin syntax! To further your knowledge of useful Haskell libraries, download our free Production Checklist! It will tell you about libraries for many tasks, from databases to machine learning! If you've never written a line of Haskell before, never fear! Download our Beginners Checklist to learn more! Attoparsec In part 2 of this series we looked at the Regex-based Applicative Parsing library. We took a lot of smaller combinators and put them together to parse our Gherkin syntax (check out part 1 for a quick refresher on that). This week, we'll look at a new library: Attoparsec. Instead of trying to do everything with a purely applicative structure, this library uses a monadic approach. This approach is much more common. It results in syntax that is simpler to read and understand. It will also make it easier for us to add certain features. To follow along with the code for this article, take a look at the AttoParser module on Github! For some more excellent ideas about useful libraries, download our Production Checklist! It includes material on libraries for everything from data structures to machine learning! Finally, if you already know about Attoparsec, feel free to move onto part 4 and learn about Megaparsec! THE PARSER TYPE In applicative parsing, all our parsers had the type RE Char. This type belonged to the Applicative typeclass but was not a Monad. For Attoparsec, we'll instead be using the Parser type, a full monad. So in general we'll be writing parsers with the following types: featureParser :: Parser Feature scenarioParser :: Parser Scenario statementParser :: Parser Statement exampleTableParser :: Parser ExampleTable valueParser :: Parser Value PARSING VALUES The first thing we should realize though is that our parser is still an Applicative! So not everything needs to change! We can still make use of operators like *> and <|>. In fact, we can leave our value parsing code almost exactly the same! For instance, the valueParser, nullParser, and boolParser expressions can remain the same: valueParser :: Parser Value valueParser = nullParser <|> boolParser <|> numberParser <|> stringParser nullParser :: Parser Value nullParser = (string "null" <|> string "NULL" <|> string "Null") *> pure ValueNull boolParser :: Parser Value boolParser = (trueParser *> pure (ValueBool True)) <|> (falseParser *> pure (ValueBool False)) where trueParser = string "True" <|> string "true" <|> string "TRUE" falseParser = string "False" <|> string "false" <|> string "FALSE" If we wanted, we could make these more "monadic" without changing their structure. For instance, we can use return instead of pure (since they are identical). We can also use >> instead of *> to perform monadic actions while discarding a result. Our value parser for numbers changes a bit, but it gets simpler! The authors of Attoparsec provide a convenient parser for reading scientific numbers: numberParser :: Parser Value numberParser = ValueNumber <$> scientific Then for string values, we'll use the takeTill combinator to read all the characters until a vertical bar or newline. Then we'll apply a few text functions to remove the whitespace and get it back to a String. (The Parser monad we're using parses things as Text rather than String). stringParser :: Parser Value stringParser = (ValueString . unpack . strip) <$> takeTill (\c -> c == '|' || c == '\n') PARSING EXAMPLES As we parse the example table, we'll switch to a more monadic approach by using do-syntax. First, we establish a cellParser that will read a value within a cell. cellParser = do skipWhile nonNewlineSpace val <- valueParser skipWhile (not . barOrNewline) char '|' return val Each line in our statement refers to a step of the parsing process. So first we skip all the leading whitespace. Then we parse our value. Then we skip the remaining space, and parse the final vertical bar to end the cell. Then we'll return the value we parsed. It's a lot easier to keep track of what's going on here compared to applicative syntax. It's not hard to see which parts of the input we discard and which we use. If we don't assign the value with <- within do-syntax, we discard the value. If we retrieve it, we'll use it. To complete the exampleLineParser, we parse the initial bar, get many values, close out the line, and then return them: exampleLineParser :: Parser [Value] exampleLineParser = do char '|' cells <- many cellParser char '\n' return cells where cellParser = ... Reading the keys for the table is almost identical. All that changes is that our cellParser uses many letter instead of valueParser. So now we can put these pieces together for our exampleTableParser: exampleTableParser :: Parser ExampleTable exampleTableParser = do string "Examples:" consumeLine keys <- exampleColumnTitleLineParser valueLists <- many exampleLineParser return $ ExampleTable keys (map (zip keys) valueLists) We read the signal string "Examples:", followed by consuming the line. Then we get our keys and values, and build the table with them. Again, this is much simpler than mapping a function like buildExampleTable like in applicative syntax. STATEMENTS The Statement parser is another area where we can improve the clarity of our code. Once again, we'll define two helper parsers. These will fetch the portions outside brackets and then inside brackets, respectively: nonBrackets :: Parser String nonBrackets = many (satisfy (\c -> c /= '\n' && c /= '<')) insideBrackets :: Parser String insideBrackets = do char '<' key <- many letter char '>' return key Now when we put these together, we can more clearly see the steps of the process outlined in do-syntax. First we parse the “signal” word, then a space. Then we get the “pairs” of non-bracketed and bracketed portions. Finally, we'll get one last non-bracketed part: parseStatementLine :: Text -> Parser Statement parseStatementLine signal = do string signal char ' ' pairs <- many ((,) <$> nonBrackets <*> insideBrackets) finalString <- nonBrackets ... Now we can define our helper function buildStatement and call it on its own line in do-syntax. Then we'll return the resulting Statement. This is much easier to read than tracking which functions we map over which sections of the parser: parseStatementLine :: Text -> Parser Statement parseStatementLine signal = do string signal char ' ' pairs <- many ((,) <$> nonBrackets <*> insideBrackets) finalString <- nonBrackets let (fullString, keys) = buildStatement pairs finalString return $ Statement fullString keys where buildStatement :: [(String, String)] -> String -> (String, [String]) buildStatement [] last = (last, []) buildStatement ((str, key) : rest) rem = let (str', keys) = buildStatement rest rem in (str <> "<" <> key <> ">" <> str', key : keys) SCENARIOS AND FEATURES As with applicative parsing, it's now straightforward for us to finish everything off. To parse a scenario, we read the keyword, consume the line to read the title, and read the statements and examples: scenarioParser :: Parser Scenario scenarioParser = do string "Scenario: " title <- consumeLine statements <- many (parseStatement <* char '\n') examples <- (exampleTableParser <|> return (ExampleTable [] [])) return $ Scenario title statements examples Again, we provide an empty ExampleTable as an alternative if there are no examples. The parser for Background looks very similar. The only difference is we ignore the result of the line and instead use Background as the title string. backgroundParser :: Parser Scenario backgroundParser = do string "Background:" consumeLine statements <- many (parseStatement <* char '\n') examples <- (exampleTableParser <|> return (ExampleTable [] [])) return $ Scenario "Background" statements examples Finally, we'll put all this together as a feature. We read the title, get the background if it exists, and read our scenarios: featureParser :: Parser Feature featureParser = do string "Feature: " title <- consumeLine maybeBackground <- optional backgroundParser scenarios <- many scenarioParser return $ Feature title maybeBackground scenarios FEATURE DESCRIPTION One extra feature we'll add now is that we can more easily parse the “description” of a feature. We omitted them in applicative parsing, as it's a real pain to implement. It becomes much simpler when using a monadic approach. The first step we have to take though is to make one parser for all the main elements of our feature. This approach looks like this: featureParser :: Parser Feature featureParser = do string "Feature: " title <- consumeLine (description, maybeBackground, scenarios) <- parseRestOfFeature return $ Feature title description maybeBackground scenarios parseRestOfFeature :: Parser ([String], Maybe Scenario, [Scenario]) parseRestOfFeature = ... Now we'll use a recursive function that reads one line of the description at a time and adds to a growing list. The trick is that we'll use the choice combinator offered by Attoparsec. We'll create two parsers. The first assumes there are no further lines of description. It attempts to parse the background and scenario list. The second reads a line of description, adds it to our growing list, and recurses: parseRestOfFeature :: Parser ([String], Maybe Scenario, [Scenario]) parseRestOfFeature = parseRestOfFeatureTail [] where parseRestOfFeatureTail prevDesc = do (fullDesc, maybeBG, scenarios) <- choice [noDescriptionLine prevDesc, descriptionLine prevDesc] return (fullDesc, maybeBG, scenarios) So we'll first try to run this noDescriptionLineParser. It will try to read the background and then the scenarios as we've always done. If it succeeds, we know we're done. The argument we passed is the full description: where noDescriptionLine prevDesc = do maybeBackground <- optional backgroundParser scenarios <- some scenarioParser return (prevDesc, maybeBackground, scenarios) Now if this parser fails, we know that it means the next line is actually part of the description. So we'll write a parser to consume a full line, and then recurse: descriptionLine prevDesc = do nextLine <- consumeLine parseRestOfFeatureTail (prevDesc ++ [nextLine]) And now we're done! We can parse descriptions! CONCLUSION That wraps up our exploration of Attoparsec. Now you can move on to the fourth and final part of this series where we'll learn about Megaparsec. We'll find that it's syntactically very similar to Attoparsec with a few small exceptions. We'll see how we can use some of the added power of monadic parsing to enrich our syntax. To learn more about cool Haskell libraries, be sure to check out our Production Checklist! It'll tell you a little bit about libraries in all kinds of areas like databases and web APIs. If you've never written Haskell at all, download our Beginner's Checklist! It'll give you all the resources you need to get started on your Haskell journey! Megaparsec In part 3 of this series, we explored the Attoparsec library. It provided us with a clearer syntax to work with compared to applicative parsing, which we learned in part 2. This week, we'll explore one final library: Megaparsec. This library has a lot in common with Attoparsec. In fact, the two have a lot of compatibility by design. Ultimately, we'll find that we don't need to change our syntax a whole lot. But Megaparsec does have a few extra features that can make our lives simpler. To follow the code examples here, head to the Github repository and take a look at the MegaParser module on Github! To learn about more awesome libraries you can use in production, make sure to download our Production Checklist! But never fear if you're new to Haskell! Just take a look at our Beginners checklist and you'll know where to get started! A DIFFERENT PARSER TYPE To start out, the basic parsing type for Megaparsec is a little more complicated. It has two type parameters, e and s, and also comes with a built-in monad transformer ParsecT. data ParsecT e s m a = ParsecT ... type Parsec e s = ParsecT e s Identity The e type allows us to provide some custom error data to our parser. The s type refers to the input type of our parser, typically some variant of String. This parameter also exists under the hood in Attoparsec. But we sidestepped that issue by using the Text module. For now, we'll set up our own type alias that will sweep these parameters under the rug: type MParser = Parsec Void Text TRYING OUR HARDEST Let's start filling in our parsers. There's one structural difference between Attoparsec and Megaparsec. When a parser fails in Attoparsec, its default behavior is to backtrack. This means it acts as though it consumed no input. This is not the case in Megaparsec! A naive attempt to repeat our nullParser code could fail in some ways: nullParser :: MParser Value nullParser = nullWordParser >> return ValueNull where nullWordParser = string "Null" <|> string "NULL" <|> string "null" Suppose we get the input "NULL" for this parser. Our program will attempt to select the first parser, which will parse the N token. Then it will fail on U. It will move on to the second parser, but it will have already consumed the N! Thus the second and third parser will both fail as well! We get around this issue by using the try combinator. Using try gives us the Attoparsec behavior of backtracking if our parser fails. The following will work without issue: nullParser :: MParser Value nullParser = nullWordParser >> return ValueNull where nullWordParser = try (string "Null") <|> try (string "NULL") <|> try (string "null") Even better, Megaparsec also has a convenience function string' for case insensitive parsing. So our null and boolean parsers become even simpler: nullParser :: MParser Value nullParser = M.string' "null" >> return ValueNull boolParser :: MParser Value boolParser = (trueParser >> return (ValueBool True)) <|> (falseParser >> return (ValueBool False)) where trueParser = M.string' "true" falseParser = M.string' "false" Unlike Attoparsec, we don't have a convenient parser for scientific numbers. We'll have to go back to our logic from applicative parsing, only this time with monadic syntax. numberParser :: MParser Value numberParser = (ValueNumber . read) <$> (negativeParser <|> decimalParser <|> integerParser) where integerParser :: MParser String integerParser = M.try (some M.digitChar) decimalParser :: MParser String decimalParser = M.try $ do front <- many M.digitChar M.char '.' back <- some M.digitChar return $ front ++ ('.' : back) negativeParser :: MParser String negativeParser = M.try $ do M.char '-' num <- decimalParser <|> integerParser return $ '-' : num Notice that each of our first two parsers use try to allow proper backtracking. For parsing strings, we'll use the satisfy combinator to read everything up until a bar or newline: stringParser :: MParser Value stringParser = (ValueString . trim) <$> many (M.satisfy (not . barOrNewline)) And then filling in our value parser is easy as it was before: valueParser :: MParser Value valueParser = nullParser <|> boolParser <|> numberParser <|> stringParser FILLING IN THE DETAILS Aside from some trivial alterations, nothing changes about how we parse example tables. The Statement parser requires adding in another try call when we're grabbing our pairs: parseStatementLine :: Text -> MParser Statement parseStatementLine signal = do M.string signal M.char ' ' pairs <- many $ M.try ((,) <$> nonBrackets <*> insideBrackets) finalString <- nonBrackets let (fullString, keys) = buildStatement pairs finalString return $ Statement fullString keys where buildStatement = ... Otherwise, we'll fail on any case where we don't use any keywords in the statement! But it's otherwise the same. Of course, we also need to change how we call our parser in the first place. We'll use the runParser function instead of Attoparsec's parseOnly. This takes an extra argument for the source file of our parser to provide better messages. parseFeatureFromFile :: FilePath -> IO Feature parseFeatureFromFile inputFile = do ... case runParser featureParser finalString inputFile of Left s -> error (show s) Right feature -> return feature But nothing else changes in the structure of our parsers. It's very easy to take Attoparsec code and Megaparsec code and re-use it with the other library! ADDING SOME STATE One bonus we do get from Megaparsec is that its monad transformer makes it easier for us to use other monadic functionality. Our parser for statement lines has always been a little bit clunky. Let's clean it up a little bit by allowing ourselves to store a list of strings as a state object. Here's how we'll change our parser type: type MParser = ParsecT Void Text (State [String]) Now whenever we parse a key using our brackets parser, we can append that key to our existing list using modify. We'll also return the brackets along with the string instead of merely the keyword: insideBrackets :: MParser String insideBrackets = do M.char '<' key <- many M.letterChar M.char '>' modify (++ [key]) -- Store the key in the state! return $ ('<' : key) ++ ['>'] Now instead of forming tuples, we can concatenate the strings we parse! parseStatementLine :: Text -> MParser Statement parseStatementLine signal = do M.string signal M.char ' ' pairs <- many $ M.try ((++) <$> nonBrackets <*> insideBrackets) finalString <- nonBrackets let fullString = concat pairs ++ finalString ... And now how do we get our final list of keys? Simple! We get our state value, reset it, and return everything. No need for our messy buildStatement function! parseStatementLine :: Text -> MParser Statement parseStatementLine signal = do M.string signal M.char ' ' pairs <- many $ M.try ((++) <$> nonBrackets <*> insideBrackets) finalString <- nonBrackets let fullString = concat pairs ++ finalString keys <- get put [] return $ Statement fullString keys When we run this parser at the start, we now have to use runParserT instead of runParser. This returns us an action in the State monad, meaning we have to use evalState to get our final result: parseFeatureFromFile :: FilePath -> IO Feature parseFeatureFromFile inputFile = do ... case evalState (stateAction finalString) [] of Left s -> error (show s) Right feature -> return feature where stateAction s = runParserT featureParser inputFile s BONUSES OF MEGAPARSEC As a last bonus, let's look at error messages in Megaparsec. When we have errors in Attoparsec, the parseOnly function gives us an error string. But it's not that helpful. All it tells us is what individual parser on the inside of our system failed: >> parseOnly nullParser "true" Left "string" >> parseOnly "numberParser" "hello" Left "Failed reading: takeWhile1" These messages don't tell us where within the input it failed, or what we expected instead. Let's compare this to Megaparsec and runParser: >> runParser nullParser "true" "" Left (TrivialError (SourcePos {sourceName = "true", sourceLine = Pos 1, sourceColumn = Pos 1} :| []) (Just EndOfInput) (fromList [Tokens ('n' :| "ull")])) >> runParser numberParser "hello" "" Left (TrivialError (SourcePos {sourceName = "hello", sourceLine = Pos 1, sourceColumn = Pos 1} :| []) (Just EndOfInput) (fromList [Tokens ('-' :| ""),Tokens ('.' :| ""),Label ('d' :| "igit")])) This gives us a lot more information! We can see the string we're trying to parse. We can also see the exact position it fails at. It'll even give us a picture of what parsers it was trying to use. In a larger system, this makes a big difference. We can track down where we've gone wrong either in developing our syntax, or conforming our input to meet the syntax. If we customize the e parameter type, we can even add our own details into the error message to help even more! CONCLUSION This wraps up our exploration of parsing libraries in Haskell! In the past few weeks, we've learned about Applicative parsing, Attoparsec, and Megaparsec. The first provides useful and intuitive combinators for when our language is regular. It allows us to avoid using a monad for parsing and the baggage that might bring. With Attoparsec, we saw an introduction to monadic style parsing. This provided us with a syntax that was easier to understand and where we could see what was happening. Finally in this part, we explored Megaparsec. This library has a lot in common syntactically with Attoparsec. But it provides a few more bells and whistles that can make many tasks easier. Ready to explore some more areas of Haskell development? Want to get some ideas for new libraries to learn? Download our Production Checklist! It'll give you a quick summary of some tools in areas ranging from data structures to web APIs! Never programmed in Haskell before? Want to get started? Check out our Beginners Checklist! It has all the tools you need to start your Haskell journey!