Friday, February 1, 2013

Parsing Expression Grammar - part 3

The PEG parser at the end of part 2 will parse, and call out to code when a rule is parsed, but isn't particularly easy to use, most particularly because the grammar has to be expressed as F# objects. In this post I'll extend the parser to be able to parse PEG from a text file, and to generate F# code for a parser described by that PEG.
There are only two (big) things missing from the code needed to do this, a parser for PEG - which I will write as (you guessed it) a PEG, and the code generator for the parser.

Parsing PEG

Parsing the PEG requires a complete grammar.

Parser changes

I have added a few new terminal types to the Expression type defined in part 1, these allow a grammar to specify a terminal symbol match on one of a selection of characters, to match a character (strictly a UTF-16 character) based on a Unicode category, and to match any character, These make specifying real world grammars (including that for PEG itself) very much easier.
Because I have added terminal types that operate on a per character basis, I've added a simple utility method that gets the next single character from the input, if any. This is this only place in the code where surrogate pairs are specifically handled.
I have extended parseExpression in the obvious way to parse the new terminal expressions.

Parsing PEG

I now have enough pieces to be able to create a PEG parser for PEG, this is in the file pegOfPeg.fs, an example of one of the grammar rules is...
The rest of the files is much the same, with a parse method for each rule in the grammar and a corresponding F# declaration.

Code Generation

The main method for code generation is a recursive function which matches an Expression argument. For the simplest expression types, those that don't require recursively parsing one or more expressions, the code generation consists of writing code that calls out to the function that performs the parsing for that terminal type (for the terminals), matching the end of the input (for the <epsilon> rule), and calls a rule matching function (whose generation I will show below) - for NonTerminal rules. The functions ibprintf and ibprintfn are used to print an indented string to a StringBuilder.
To match Sequence and Choice expressions, the code generator generates a list of lambdas, each of which is recursively generated by the codeGen function, one for each sub-expression of the Sequence or Choice. This list of lambdas is then past to the utility function matchSequence or matchChoice as appropriate.
For the expressions which wrap a single expression (Optional, ZeroOrMore, OneOrMore, And, and Not) a similar approach generates a lambda for the inner expression and passes it to the appropriate utility function.
When matching a top-level grammar rule the code generator creates a function; there are two possible code paths, for rules where there is no code defined to be executed when the rule is matched (in which case the function simply recursively generates the code for the inner expression), and for rules where there is code to be executed on the successful match of the rule, where matching on the result of the inner expression has to be called to ensure that the code is not executed inappropriately. I make no effort to ensure that the code specified in the grammar is valid.
The code generator for the Rule expression doesn't generate a valid function, there is no let rec prefixed; this is because the rule matching methods have to be mutually recursive, so an outer function is used to call codeGen for rules - it is that function which is told whether this is the first rule or not.
The rest of the code generator is found in peg.fs and consists of bolier plate implementations of the functions used by the generated code and utility functions that make the rest of the code generation easier.

Left recursion

A PEG parser cannot work with a grammar that contains left recursion. Left recursion occurs when a non-terminal symbol in the left position in an expression recursively (either directly or indirectly) refers to the same rule. The left position is the first element in a Sequence (this is a simplification!), any element in a Choice, and for the expressions that have a single sub-expression, that sub-expression is always in the left-position. It can be difficult to determine which elements in a sequence are in the left position because a Non-Terminal that successfully matches with zero-width (for example an Optional or ZeroOrMore expression)
a <- a b     The simplest form of left recursion

a <- b c
b <- a / c   Indirect left recursion

a <- b c
b <- d a     Left recursive because d can match without consuming input
d <- e?      even though a is not in the obvious left position
The following code detects simple forms of left recursion, but doesn't find the third example above.

Putting it all together

The code in Program.fs puts this together into a program that takes an input file and writes an output, taking a grammar description, prefix and suffix from the input file. The format of the input file is as follows:

prefix written verbatim to the output file
grammar rules
suffix written verbatim to the output file
The section separators in the input file ((*%% and %%*)) are chosen to be similar to yacc and to allow the input file to be read as valid F#, which makes writing the prefix and suffix easier! The example on github implements a basic calculator program.
If the -x argument is passed to the parser generator, it will attempt to compile and execute the output file, this is done using the FSharpCodeProvider class from the F# power pack, Vladimir Ivanovskiy has a great article about Embedded Scripting using F#.
As usual, all the code for this is on github.

No comments:

Post a Comment