Advanced PGE
We've already looked at some of the basics of parser construction using PGE and NQP. In this chapter we take a more in-depth look at features of the grammar engine that we haven't seen yet. Some of these more advanced features, such as inline PIR code, assertions, function calls, and built-in token types, will make the life of a compiler designer much easier, but are not needed for most basic tasks.
regex, token and proto
A regex is a high-level matching operation that allows backtracking. A token is a low-level matching operation that does not allow backtracking. A proto is like a regex but allows multiple dispatch. Think of a proto declaration as being a prototype or signature that several functions can match.
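The difference between backtracking and non-backtracking matching can be sketched outside of PGE. The Python below is only an illustration, not PGE code: the function match_star_then_literal and its parameters are hypothetical names. It matches the pattern "zero or more of one character, then a literal", either giving characters back on failure (like a regex) or refusing to do so (like a token).

```python
def match_star_then_literal(text, star_char, literal, backtrack):
    """Match `star_char* literal` anchored at the start of text.

    Returns the matched prefix, or None if the match fails.
    """
    # Greedy phase: consume as many star_char characters as possible.
    count = 0
    while count < len(text) and text[count] == star_char:
        count += 1

    while True:
        if text.startswith(literal, count):
            return text[:count + len(literal)]
        # A non-backtracking matcher never gives characters back.
        if not backtrack or count == 0:
            return None
        count -= 1  # give one character back and try the literal again


# Pattern: a* followed by a literal "a", applied to "aaa".
regex_result = match_star_then_literal("aaa", "a", "a", backtrack=True)
token_result = match_star_then_literal("aaa", "a", "a", backtrack=False)
print(regex_result)  # aaa
print(token_result)  # None
```

With backtracking, the greedy star gives back one "a" so the literal can match. Without it, the star keeps all three characters and the match fails.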
Inline PIR Sections
PIR can be embedded directly into both PGE grammar files and NQP files. This fills in gaps that NQP's limitations leave open. It is also sometimes helpful to insert active processing into a grammar, so that the parser can be directed in a more intelligent way.
In NQP, PIR code can be inlined using the PIR statement, followed by a quoted string of PIR code. This quoted string can be in the form of a Perl-like "qw< ... >" type of quotation, if you think that looks better.
In PGE, inline PIR can be inserted using double curly brackets "{{ ... }}". Once in PIR mode, you can access the current match object by calling $Px = find_global "$/" (where $Px is any valid PIR register and x is a number).
Built-In Token Types
PGE predefines default versions of certain rules to help with parsing. You can redefine these rules if you don't like the default behavior.
Calling Functions
Functions or subroutines are an integral part of modern programming practices. As such, support for them is part of the PAST system, and is relatively easy to implement. We're going to cover a little bit of necessary background information first, and then we will discuss how to put all the pieces together to create a system with usable subroutines.
return Described
In Parrot, control flow operations, especially returns from subroutines, are implemented as special control exceptions. The reason a return is done as an exception and not as a basic .return() PIR statement is a little bit complicated. Many languages allow for nested lexical scopes, where variables defined in an "inner" scope cannot be seen, accessed, or modified by statements in the "outer" scope. In most compilers, this behavior is enforced by the compiler directly, and is invisible when the code is converted to assembly and machine languages. However, PIR is like an assembly language for the Parrot system, and it's not possible to hide things at that level. All local variables are local to the entire subroutine and cannot be localized to a single part of a subroutine. To implement nested scopes, Parrot instead uses nested subroutines.
Returns and Return Values
Functions can be made to return a value using the "return" PAST.op type. The return system is based on a control exception. Exceptions, as we've discussed before, move control flow to a specified location called the "exception handler". For a return exception, the handler is the code directly after the original function call. The return values (currently, the return PAST node only allows a single return value) are passed as exception data items and are retrieved by the control exception handler.
All of these details are generally hidden from the programmer, and you can treat a return PAST node exactly like you would expect. You pass a return value, if any, to the return PAST node. The current function ends and its scope is destroyed. Control flow returns to the calling function, and the return value from the function is made available.
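The mechanism described above can be sketched in Python. This is an analogy under stated assumptions, not Parrot's implementation: ReturnControl, call_with_return_handler, and find_first_negative are hypothetical names. The inner block is modelled as a nested subroutine, so a plain return would only leave the inner block. Raising a control exception carries the value out to the handler that sits directly after the original call.

```python
class ReturnControl(Exception):
    """Hypothetical stand-in for a return control exception."""

    def __init__(self, value=None):
        super().__init__()
        self.value = value  # the single return value carried as exception data


def call_with_return_handler(body, *args):
    """Call body; the handler sits directly after the call."""
    try:
        body(*args)
    except ReturnControl as control:
        return control.value
    return None  # body finished without an explicit return


def find_first_negative(numbers):
    # The loop body is a nested subroutine with its own scope.
    def loop_body(n):
        if n < 0:
            raise ReturnControl(n)  # acts as "return n" from the outer function

    for n in numbers:
        loop_body(n)


result = call_with_return_handler(find_first_negative, [3, 1, -4, -1])
print(result)  # -4
```

The exception unwinds through the nested block and the enclosing function in one step, which is what a return from inside a nested scope needs to do.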
Assertions
Repetition Counting with **
MetaSyntactic Assertions
You can call a function from within a rule using the <FUNC( )> format.
Non-Capturing Assertions
Use the <. > form to create a match object that does not capture its contents.
Indirect Rules
A rule of the form <$ > takes a value, which can be a string or some other data, converts it into a regular expression, and then runs it.
Character Classes
Rules of the form <[ ]> contain custom character classes. Rules of the form <-[ ]> are complemented character classes.
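As a rough analogy only, the same idea can be shown with Python's re module, where [ ] is a character class and [^ ] is its complement. The mapping to the PGE forms in the comments is approximate, and the sample word is arbitrary.

```python
import re

vowel = re.compile(r"[aeiou]")       # roughly <[aeiou]>
non_vowel = re.compile(r"[^aeiou]")  # roughly <-[aeiou]>

word = "parrot"
vowels = vowel.findall(word)
others = non_vowel.findall(word)
print(vowels)  # ['a', 'o']
print(others)  # ['p', 'r', 'r', 't']
```

Every character of the input matches exactly one of the two classes, which is what makes the second class the complement of the first.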
Built-in Assertions
<?before>, <!before>
<?after>, <!after>
<?same>, <!same>
<.ws>
<?at()>, <!at()>
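The before and after assertions are zero-width lookahead and lookbehind checks. Python's re module has close counterparts, so a minimal sketch is possible there. The PGE spellings in the comments are approximate and the sample text is made up. The other assertions in the list are not covered here.

```python
import re

text = "price: 42 USD, weight: 7 kg"

# Roughly <?before ' USD'>: digits followed by " USD", which is not consumed.
before_usd = re.findall(r"\d+(?= USD)", text)

# Roughly <!before ' USD'>: whole numbers that are not followed by " USD".
not_before_usd = re.findall(r"\b\d+\b(?! USD)", text)

# Roughly <?after 'weight: '>: digits preceded by "weight: ".
after_weight = re.findall(r"(?<=weight: )\d+", text)

# Roughly <!after 'weight: '>: numbers not preceded by "weight: ".
not_after_weight = re.findall(r"(?<!weight: )\b\d+", text)

print(before_usd)        # ['42']
print(not_before_usd)    # ['7']
print(after_weight)      # ['7']
print(not_after_weight)  # ['42']
```

In each case the asserted text is checked but is not part of the match, so only the digits are returned.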
Partial Matches
You can specify a partial match, which attempts to match as much as possible and never fails, with the <* > form.
Recursive Calls
You can recurse back into subrules of the current match rule using the <~~ > rule.