Thursday, January 17, 2013

Parsing Expression Grammar - part 1

I'm not a great language and grammar theoretician, so I'm not going to go into great detail about what a Parsing Expression Grammar (PEG) is, but here are a few observations.

  • PEGs don't require a tokenizing stage - this can be a mixed blessing, but does mean that the grammar can be entirely expressed in one specification
  • Tools like yacc and bison create parsers from LALR grammars - this is a relatively complicated procedure and requires that the grammar either be very carefully constructed to contain no ambiguity, or to be annotated such that the parser always knows how to resolve that ambiguity
  • Grammars expressed as PEGs don't really have this problem (although you can't have left recursion - which I'll get to eventually)


I'm using F# for the code in this series of posts, I'm not a completely fluent F# programmer so there may be some clumsy code in places, but it is a very succinct and clean language for this type of work.

A very simple introduction to PEG

A very minimalist implementation of a PEG only requires a few elements, I'll add some more later, both for completeness and convenience; a PEG expression is comprised of:
  • Terminal symbol - which is commonly represented as text in quotation marks - "foo" would be a terminal symbol
  • A NonTerminal symbol - a word or identifier not in quotations mars - e.g. foo
  • Empty - matches the end of the input - represented with an epsilon 
  • A Sequence - a list of expressions, each of which must be matched in order - represented as a space separated list
  • A Choice - a list of expressions, tested in order, the first match is chosen - represented as a list of expressions separated by /
  • Zero or More - an expression that can occur any number of times, including zero - represented as a * suffix
  • One or More - an expression that can occur any number of times, but must occur at least once - represented as a + suffix
  • Optional - an expression that may or may-not match - represented as a ? suffix
  • And - an expression that must match but that isn't captured by the matching process - represented as a & prefix
  • Not - an expression that must not match and isn't captured by the matching process - represented as a ! prefix

First Implementation

While grammars are usually stored as text files, I'm going to start with the in memory representation of the PEG expressions.
This is a straight-forward translation of the description of the elements into an F# discriminated union type, before I dive into how to process this, I'll show some code that will print a PEG expression in a human readable form.
There is nothing particularly complex about this, the Sequence and Choice patterns use the forward function combination operator >> to pipeline converting the list of expressions into a list of strings (map), and then to join the elements using the appropriate separator (reduce).
To check that this works as expected I'll try a few expressions in F# interactive.
> printPeg <| Terminal "foo";;
"foo"
val it : unit = ()
> printPeg <| NonTerminal "foo";;
foo
val it : unit = ()
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom "*" product?
val it : unit = ()
This doesn't quite work how I want it to, in the last expression it isn't clear what is optional, so I need to make printPeg a touch more complex...
This now perhaps errs on the the cautious side, but adds parenthesis where they are needed...
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom (("*" product)?)
val it : unit = ()
When text is parsed against a grammar, there needs to be a internal representation of the parsed data, in part 2 when I'll add functions to grammar rules, this representation is the input those functions.

  • TerminalSymbol - hopefully self explanatory
  • Production - represents a Sequence, ZeroOrMore, or OneOrMore
  • EmptyMatch - Used for non-capturing successful matches ZeroOrMore (with zero matches), Optional (not present), And, Not
  • Unmatched - No match found
Taking the parse function in parts.
First this the atomic expressions, these are all pretty straightforward.
Moving on the Sequence and Choice
The pattern for Sequence creates an F# seq over the list of sub-expressions for the sequence, when an expression doesn't match the sequence is terminated with an Unmatched parse result; that is used as a signal to mark that the Sequence as a whole didn't match; otherwise the results from the sub-expressions are packaged into a Production parse result.
The pattern for Choice enumerates over the sub-expressions until it finds the first one that matches; that result is returned.
The rest...
ZeroOrMore and OneOrMore are almost the same, DRY suggests that I should break out the common code for these patterns; both collect a sequence of contiguous matches for the sub-expression, they differ only in how they handle the empty collection.
Optional, And, and Not can all be pulled onto one line (depending on how wide your monitor is...) they simply filter the response from the recursive parsing.
parseExpression is then rounded out with a call to the inner function to start the recursive parsing in the context of the grammar and input string.

Trying it out

First here's a simple grammar for boolean arithmetic.
The printPeg function still works...

> Map.fold (fun _ k v -> printfn "%s -> %s" k <| pegToString v) () g;;
and -> atom (("&" and)?)
atom -> bool / paren / not
bool -> "true" / "false"
expr -> or
not -> "!" expr
or -> and (("|" or)?)
paren -> "(" expr ")"
start -> expr <epsilon>
val it : unit = ()
Map has sorted its elements, but the printed representations are as expected.

> parseExpression g "start" "true&true|false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production
           [TerminalSymbol "true";
            Production
              [TerminalSymbol "&";
               Production [TerminalSymbol "true"; EmptyMatch]]];
         Production
           [TerminalSymbol "|";
            Production
              [Production [TerminalSymbol "false"; EmptyMatch]; EmptyMatch]]];
      EmptyMatch], 15)
> parseExpression g "start" "true|true&false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production [TerminalSymbol "true"; EmptyMatch];
         Production
           [TerminalSymbol "|";
            Production
              [Production
                 [TerminalSymbol "true";
                  Production
                    [TerminalSymbol "&";
                     Production [TerminalSymbol "false"; EmptyMatch]]];
               EmptyMatch]]]; EmptyMatch], 15)
> 
This shows how the precedence of the operators is defined explicitly in the grammar, rather than by annotating tokens or rules.
In part 2, I'll turn this into something usable.
The code for this article is on github.

5 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
  4. This comment has been removed by a blog administrator.

    ReplyDelete
  5. This comment has been removed by a blog administrator.

    ReplyDelete