Sunday, January 20, 2013

Parsing Expression Grammar - part 2

At the end of part 1 I had a parser that will parse a PEG, but isn't that useful because its output is simply a recursive data structure of the matched elements.
The next step is to add the ability to process each matching production rule. I do this by associating a function with each rule, the function takes two arguments - the matching string and the parsed elements - and returns a parse result. The parse result type is extended to allow passing an arbitrary type as a parse result...
This adds the Parsed of 'a type to the discriminated union.
Because each grammar element now requires three pieces of information (name, expression, function) I define a record type to make handling this easier. The function declaration of parseExpression changes as expected.
The changes needed to parseExpression are small...
The original pattern match for NonTerminal simply recursively called parse with the definition for the NonTerminal name, now the same recursive call is made but, if the result is a successful match, the function associated with the match is called and the result of that function is passed as the result of the match.

Grammar example - 1

Taking the same grammar I used in part 1, I now need to add a function to each rule.
In this example, I am evaluating the expression as it is being parsed, this is the simplest approach, later on I will show a two stage approach where the expression is converted to an AST that can be evaluated later.
Hopefully each of these is self explanatory, the same function can be used to parse both & and | because they have equivalent structure; the first pattern in parseBin is taken when the Optional second element of the Sequence is not matched.
This code can now evaluate simple boolean expressions...

> parseExpression g "start" "true&false";;
val it : ParseResult<bool> * int = (Parsed false, 10)
> parseExpression g "start" "true|true&false";;
val it : ParseResult<bool> * int = (Parsed true, 15)
> parseExpression g "start" "(true|true)&false";;
val it : ParseResult<bool> * int = (Parsed false, 17)

Grammar example - 2

In this example the same grammar (with some minor changes, such as allowing whitespace) demonstrates parsing into an AST (the BooleanExpr type) which is then evaluated in a second stage.
The code for this artice is on github.
In part 3 I make things a little more practical, expressing the grammar as literal objects is very clumsy - I build a parser that parses textual PEG expressions...

Thursday, January 17, 2013

Parsing Expression Grammar - part 1

I'm not a great language and grammar theoretician, so I'm not going to go into great detail about what a Parsing Expression Grammar (PEG) is, but here are a few observations.

  • PEGs don't require a tokenizing stage - this can be a mixed blessing, but does mean that the grammar can be entirely expressed in one specification
  • Tools like yacc and bison create parsers from LALR grammars - this is a relatively complicated procedure and requires that the grammar either be very carefully constructed to contain no ambiguity, or to be annotated such that the parser always knows how to resolve that ambiguity
  • Grammars expressed as PEGs don't really have this problem (although you can't have left recursion - which I'll get to eventually)


I'm using F# for the code in this series of posts, I'm not a completely fluent F# programmer so there may be some clumsy code in places, but it is a very succinct and clean language for this type of work.

A very simple introduction to PEG

A very minimalist implementation of a PEG only requires a few elements, I'll add some more later, both for completeness and convenience; a PEG expression is comprised of:
  • Terminal symbol - which is commonly represented as text in quotation marks - "foo" would be a terminal symbol
  • A NonTerminal symbol - a word or identifier not in quotations mars - e.g. foo
  • Empty - matches the end of the input - represented with an epsilon 
  • A Sequence - a list of expressions, each of which must be matched in order - represented as a space separated list
  • A Choice - a list of expressions, tested in order, the first match is chosen - represented as a list of expressions separated by /
  • Zero or More - an expression that can occur any number of times, including zero - represented as a * suffix
  • One or More - an expression that can occur any number of times, but must occur at least once - represented as a + suffix
  • Optional - an expression that may or may-not match - represented as a ? suffix
  • And - an expression that must match but that isn't captured by the matching process - represented as a & prefix
  • Not - an expression that must not match and isn't captured by the matching process - represented as a ! prefix

First Implementation

While grammars are usually stored as text files, I'm going to start with the in memory representation of the PEG expressions.
This is a straight-forward translation of the description of the elements into an F# discriminated union type, before I dive into how to process this, I'll show some code that will print a PEG expression in a human readable form.
There is nothing particularly complex about this, the Sequence and Choice patterns use the forward function combination operator >> to pipeline converting the list of expressions into a list of strings (map), and then to join the elements using the appropriate separator (reduce).
To check that this works as expected I'll try a few expressions in F# interactive.
> printPeg <| Terminal "foo";;
"foo"
val it : unit = ()
> printPeg <| NonTerminal "foo";;
foo
val it : unit = ()
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom "*" product?
val it : unit = ()
This doesn't quite work how I want it to, in the last expression it isn't clear what is optional, so I need to make printPeg a touch more complex...
This now perhaps errs on the the cautious side, but adds parenthesis where they are needed...
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom (("*" product)?)
val it : unit = ()
When text is parsed against a grammar, there needs to be a internal representation of the parsed data, in part 2 when I'll add functions to grammar rules, this representation is the input those functions.

  • TerminalSymbol - hopefully self explanatory
  • Production - represents a Sequence, ZeroOrMore, or OneOrMore
  • EmptyMatch - Used for non-capturing successful matches ZeroOrMore (with zero matches), Optional (not present), And, Not
  • Unmatched - No match found
Taking the parse function in parts.
First this the atomic expressions, these are all pretty straightforward.
Moving on the Sequence and Choice
The pattern for Sequence creates an F# seq over the list of sub-expressions for the sequence, when an expression doesn't match the sequence is terminated with an Unmatched parse result; that is used as a signal to mark that the Sequence as a whole didn't match; otherwise the results from the sub-expressions are packaged into a Production parse result.
The pattern for Choice enumerates over the sub-expressions until it finds the first one that matches; that result is returned.
The rest...
ZeroOrMore and OneOrMore are almost the same, DRY suggests that I should break out the common code for these patterns; both collect a sequence of contiguous matches for the sub-expression, they differ only in how they handle the empty collection.
Optional, And, and Not can all be pulled onto one line (depending on how wide your monitor is...) they simply filter the response from the recursive parsing.
parseExpression is then rounded out with a call to the inner function to start the recursive parsing in the context of the grammar and input string.

Trying it out

First here's a simple grammar for boolean arithmetic.
The printPeg function still works...

> Map.fold (fun _ k v -> printfn "%s -> %s" k <| pegToString v) () g;;
and -> atom (("&" and)?)
atom -> bool / paren / not
bool -> "true" / "false"
expr -> or
not -> "!" expr
or -> and (("|" or)?)
paren -> "(" expr ")"
start -> expr <epsilon>
val it : unit = ()
Map has sorted its elements, but the printed representations are as expected.

> parseExpression g "start" "true&true|false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production
           [TerminalSymbol "true";
            Production
              [TerminalSymbol "&";
               Production [TerminalSymbol "true"; EmptyMatch]]];
         Production
           [TerminalSymbol "|";
            Production
              [Production [TerminalSymbol "false"; EmptyMatch]; EmptyMatch]]];
      EmptyMatch], 15)
> parseExpression g "start" "true|true&false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production [TerminalSymbol "true"; EmptyMatch];
         Production
           [TerminalSymbol "|";
            Production
              [Production
                 [TerminalSymbol "true";
                  Production
                    [TerminalSymbol "&";
                     Production [TerminalSymbol "false"; EmptyMatch]]];
               EmptyMatch]]]; EmptyMatch], 15)
> 
This shows how the precedence of the operators is defined explicitly in the grammar, rather than by annotating tokens or rules.
In part 2, I'll turn this into something usable.
The code for this article is on github.