SPARK Documentation for Release 0.6

John Aycock
Computer Science
University of Victoria
aycock@csc.uvic.ca

This document is a random scattering of thoughts, derived from my own experience and from the valuable feedback I've gotten from:

  • Jan Decaluwe
  • Paul Dubois
  • Mike Fletcher
  • Darrell Gallion
  • Nick Mathewson
  • Gordon McMillan
  • Amit J. Patel
  • Tim Peters
  • Dickon Reed
  • Nicky Sandhu
  • Richard White

The ``real'' documentation is still the IPC7 paper. Please send me any suggestions you have about the framework, or things to add to this document!

This is new in version 0.6.

(The module name has been changed to spark.py; new code should import spark instead of import generic.)

1  The Token Class

The version of Token that's supplied is as minimal as possible. The only thing the framework assumes is a __cmp__ method, so as long as you adhere to this you can expand the class or drop in your own.

GenericParser's default error method attempts to print out the offending token in a readable form, so you may find it useful to have a __str__ or __repr__ in your Token class too.
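
To make this concrete, here's a minimal Token class in the same spirit as the supplied one (a sketch; using an attr attribute to hold a token's text is my own convention):

class Token:
    def __init__(self, type, attr=None):
        self.type = type
        self.attr = attr

    # The framework only needs tokens to be comparable; comparing
    # by type alone works well with grammar rules.
    def __cmp__(self, o):
        return cmp(self.type, o)

    # Not required, but makes error messages readable.
    def __str__(self):
        return self.type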

2  The AST Class

Like the Token class, the AST class is very minimal. The framework assumes that each AST node will have a string-valued type attribute, and that the children of a node are obtainable using the __getitem__ method. Again, you can roll your own class with this type of interface too.

I prefer n-ary trees for my ASTs rather than binary trees, so the example one in examples/paper is quite a bit different from my own.
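
As a sketch of the sort of interface the framework expects, a minimal n-ary AST node might look like:

class AST:
    def __init__(self, type):
        self.type = type        # string-valued type attribute
        self._kids = []

    # Children are reachable by indexing the node itself.
    def __getitem__(self, i):
        return self._kids[i]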

3  The GenericScanner Class

3.1  The error Method

If at some point the input is not matched by any of the regular expressions you've supplied, then the error method will be called (in previous versions an assertion was raised instead). This method is passed the input string and the position within the string at which the error occurred.

GenericScanner provides a default implementation which prints out an error message and raises a SystemExit exception.
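
For example, to get a little more context into the message, you might override it along these lines (a sketch):

class MyScanner(GenericScanner):
    ...
    def error(self, s, pos):
        # s is the input string; pos is where matching failed.
        print 'lexical error near "%s"' % s[pos:pos+10]
        raise SystemExit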

I'm deliberating whether this should be a straight exception, or whether I should modify the interface to permit some form of error recovery...

3.2  Bugs/Limitations/Caveats

Speed.
I've addressed this in this latest release, thanks to some code from Tim Peters. However, GenericScanner is ultimately only as good as Python's RE engine - a peril of code re-use.

Regular expressions.
GenericScanner interprets regular expressions with the re.VERBOSE option enabled. This means you have to escape the # character if you want to literally match it.
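
For example, a rule matching a literal # might look like this sketch (which assumes tokens are collected in self.rv, following the IPC7 paper's examples):

class MyScanner(GenericScanner):
    ...
    def t_hash(self, s):
        r'\#'
        # Unescaped, re.VERBOSE would treat the # as starting
        # a comment inside the pattern.
        self.rv.append(Token(type='hash'))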

4  The GenericParser Class

4.1  Input Grammars

GenericParser should work with any Backus-Naur form (BNF) grammar, including grammars with empty right-hand sides and ambiguous grammars. There are apparently a few rather obscure cases where Earley's parsing algorithm fails, but they're not likely to crop up in any reasonable application.
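
For example, a rule with an empty right-hand side is written with nothing after the ::=; a sketch (the rule names are made up):

class MyParser(GenericParser):
    ...
    def p_stmt_list(self, args):
        '''
            stmt_list ::= stmt_list stmt
            stmt_list ::=
        '''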

4.2  The typestring Method

This is new in version 0.6.

GenericParser can often run faster if it knows the type of its tokens. More specifically, if it knows a token's type as a string. You can tell the parser this by supplying a typestring method, which is passed a token, and must either return that token's type as a string, or None if it's unable to (in the latter case, the parser will simply fall back on a less-efficient method of operation).

For example, if a token's type is always stored in an attribute named type, you might have the following code:

class MyParser(GenericParser):
    ...
    def typestring(self, token):
        return token.type

Note that GenericParser may or may not use typestring; your tokens must still be comparable objects, as before. Supplying typestring is purely optional.

4.3  Bugs/Limitations/Caveats

Speed.
You're using a very general parsing algorithm implemented in Python; it's not going to be blazingly fast, sorry.

Action code.
The entire input has been recognized by GenericParser before any of your action code is executed. This may restrict the types of things you do on-the-fly in your parser actions. It's one argument for building an AST that you can traverse any way you want.

Watch your method names.
I've been bitten by this a few times. Python won't warn you if you inadvertently redefine a method, which I've done when cutting and pasting a bit too freely. It's a bugger to track down too.

4.4  Ambiguity Resolution

Since the IPC7 paper, I added some rudimentary ambiguity resolution code. It was undocumented and far from satisfactory, and it has been supplanted by the following interface.

Ambiguities are not resolved until the parse tree is being traversed. In other words, the input is already known to be syntactically correct; it's just a matter of deciding which parse tree to build when there are multiple choices.

When an ambiguity is reached, a choice will need to be made between two or more rules. These rules must reside in different p_ methods. The method names that correspond to the conflicting rules, minus the initial ``p_,'' are gathered together in a list. This list is sorted by the length of the rules' right-hand sides (shorter rules appear earlier in the list) and passed to the resolve method. The resolve method must choose an element of the list and return its choice.

GenericParser supplies a default implementation of resolve which always selects the rule with the shortest right-hand side (assuming the conflicting rules reside in different p_ methods, of course). This is enough to handle the ``dangling-else'' conflict.

Now some examples. The first one always picks the rule with the shortest right-hand side (this is the default as supplied):

class MyParser(GenericParser):
    ...
    def resolve(self, list):
        return list[0]

The second example always picks the rule with the longest right-hand side:

class MyParser(GenericParser):
    ...
    def resolve(self, list):
        return list[-1]

Here's one which exercises fine-grained control:

class MyParser(GenericParser):
    ...
    def p_if_stmt(self, args):
        '''
            stmt ::= IF expr THEN stmt
        '''
    def p_ifelse_stmt(self, args):
        '''
            stmt ::= IF expr THEN stmt ELSE stmt
        '''
    ...
    def resolve(self, list):
        choice = {
            ('if_stmt', 'ifelse_stmt'): 'if_stmt',
        }
        return choice[tuple(list)]

If you have an ambiguity, but aren't sure where, you may want to override the default resolve with one that simply prints out the list it's passed. This allows you to find the guilty parties, and to ensure that there are no duplicates in the list.
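
A debugging version along those lines might look like this sketch:

class MyParser(GenericParser):
    ...
    def resolve(self, list):
        # Show the conflicting rules, then defer to the default.
        print 'ambiguity:', list
        return GenericParser.resolve(self, list)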

5  The GenericASTBuilder Class

This class will construct syntax trees for you automatically (or at least with minimal intervention). Instead of having your parser be a subclass of GenericParser, you make it a subclass of GenericASTBuilder. GenericASTBuilder is a subclass of GenericParser, meaning the parsing interface is unchanged.

The constructor for GenericASTBuilder takes an extra argument, which is the class that you want AST nodes to be.

It's actually simpler than it sounds. The rest of this section gives a ``cookbook'' explanation. The class of AST nodes is called ``AST'' in the examples below.

5.1  Heterogeneous Parse Tree

Sometimes these are called ``concrete syntax trees.'' By heterogeneous I mean that the leaves of the tree are instances of tokens rather than instances of the AST class. (See the GenericASTTraversal section for more on this.)

class AutoExprParser(GenericASTBuilder):
    def __init__(self, AST, start='expr'):
        GenericASTBuilder.__init__(self, AST, start)

    def p_expr(self, args):
        '''
            expr ::= expr + term
            expr ::= term
            term ::= term * factor
            term ::= factor
            factor ::= number
            factor ::= float
        '''

That's it. The code that uses it would look like:

    parser = AutoExprParser(AST)
    parseTree = parser.parse(tokens)

Except for the extra class argument to the constructor, there are no changes. In AutoExprParser, all the rules are lumped together since there's no further need for actions.

The AST class must support the __setslice__ and __len__ operations in order to use GenericASTBuilder.
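
An AST class fleshed out to meet those requirements might look like this sketch:

class AST:
    def __init__(self, type):
        self.type = type
        self._kids = []

    def __getitem__(self, i):
        return self._kids[i]

    def __len__(self):
        return len(self._kids)

    # GenericASTBuilder attaches children with a slice assignment.
    def __setslice__(self, low, high, seq):
        self._kids[low:high] = seq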

5.2  Homogeneous Parse Tree

To make a homogeneous parse tree, the leaves need to be changed from token instances into AST instances. When GenericASTBuilder encounters a leaf, it calls GenericASTBuilder.terminal, so you simply need to supply your own version of it:

class AutoExprParser(GenericASTBuilder):
    ...
    def terminal(self, token):
        return AST(token.type)

In practice, you'll probably want to copy some attribute values from the token to the AST node too.
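
For instance, assuming tokens carry their text in an attr attribute, as in the examples elsewhere in this document, you might write (a sketch):

class AutoExprParser(GenericASTBuilder):
    ...
    def terminal(self, token):
        # Build an AST leaf and carry the token's attribute over.
        rv = AST(token.type)
        rv.attr = token.attr
        return rv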

5.3  Any-geneous Abstract Syntax Tree

To use GenericASTBuilder for building an abstract syntax tree, there should be a fairly straightforward mapping between the parse tree and the AST you want. Just like GenericASTBuilder.terminal was supplied in the last section, you'll now supply a GenericASTBuilder.nonterminal method. The arguments to this method are the nonterminal it's trying to build a node for, and the node's children.

For example, if I wanted to flatten the parse tree out a bit, I could skip allocating new nodes if there's only one child:

class AutoExprParser(GenericASTBuilder):
    ...
    def nonterminal(self, type, args):
        if len(args) == 1:
            return args[0]
        return GenericASTBuilder.nonterminal(self, type, args)

args is just a list, so you could also delete elements from it, or perform any other transformation you can imagine.

5.4  Bugs/Limitations/Caveats

Ignorance.
Any parser actions you supply in your p_ methods are silently ignored by GenericASTBuilder in the current version. This may change in the future.

6  The GenericASTTraversal Class

This was called the ASTTraversal class in the IPC7 paper. For consistency, I've renamed it and placed it in spark.py along with the other generic classes. For backward compatibility, ASTTraversal is still present in its old spot as a wrapper; new code should use GenericASTTraversal.

I got a great suggestion about heterogeneous ASTs: use the already-present token instances as leaves of the AST. I was all ready to add support into GenericASTTraversal so that it only traversed a node's children if the node had a __getitem__ method present. Then it suddenly occurred to me that GenericASTTraversal already supports heterogeneous ASTs: if you want to use tokens as leaves, just add a method to your token class:

class MyToken:
    ...
    def __getitem__(self, i):
        raise IndexError
    ...

This way the interface to GenericASTTraversal is kept simple.

If you want to prevent further traversal into a subtree during a preorder traversal, calling the prune method will do the trick. I found I needed this when generating code from ASTs.
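
For example, a preorder code generator might emit code for an entire if-statement subtree itself, then prune so the traversal doesn't descend into it; a sketch, with made-up node types (traversal methods are named n_ followed by the node's type, as described in the IPC7 paper):

class CodeGen(GenericASTTraversal):
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.preorder()

    def n_if_stmt(self, node):
        # ... generate code for the whole subtree here ...
        self.prune()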

6.1  The typestring Method

This is new in version 0.6.

To decouple GenericASTTraversal from the node implementation, GenericASTTraversal now calls the typestring method to get the node's type as a string. Like its counterpart in GenericParser, it takes a node as an argument, and must return a string (it may not return None, unlike the one in GenericParser).

The default implementation simply returns the node's type field; this behaviour is backwards-compatible.
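
For example, if your nodes kept their type under a different name, say kind, you could supply (a sketch; kind is hypothetical):

class MyTraversal(GenericASTTraversal):
    ...
    def typestring(self, node):
        return node.kind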

7  The GenericASTMatcher Class

GenericASTMatcher is designed for generating code from ASTs. You supply a set of patterns (in docstrings, of course) and actions; GenericASTMatcher will find a way to cover the AST with your patterns, then invoke the corresponding actions. Actions are called in left-to-right, bottom-up order.

For example:

class AutoInterpret(GenericASTMatcher):
    def __init__(self, ast):
        GenericASTMatcher.__init__(self, 'V', ast)

    def p_n(self, tree):
        '''
            V ::= number
        '''
        tree.value = int(tree.attr)

    def p_add(self, tree):
        '''
            V ::= expr ( V + V )
        '''
        tree.value = tree[0].value + tree[2].value

    def p_multiply(self, tree):
        '''
            V ::= term ( V * V )
        '''
        tree.value = tree[0].value * tree[2].value

    def p_addmul(self, tree):
        '''
            V ::= expr ( V + term ( V * V ) )
        '''
        tree.value = tree[0].value + tree[2][0].value * tree[2][2].value

As in GenericParser, the methods you supply are prefixed with p_, which in this context stands for ``pattern.'' The argument to the p_ methods is the AST node where the pattern is rooted.

Patterns are just trees linearized into prefix form, using parentheses to denote subtrees. On the left-hand side is a symbol which the pattern is effectively ``rewritten'' to if the pattern is matched. For example, the pattern V ::= term ( V * V ) would correspond to:

 term
  /|\    => V
 / | \
V  *  V

You'd use the above class as follows:

    interpreter = AutoInterpret(ast)
    interpreter.match()
    print ast.value

You may optionally supply an AST to the match method. This is so you can create one instance of your matcher class, then have it match different ASTs. For instance, you could use a matcher for implementing a language's arithmetic expressions, and use a GenericASTTraversal for the rest.

AST nodes must support __getitem__ and __cmp__ methods to use GenericASTMatcher.

7.1  Bugs/Limitations/Caveats

Ambiguity.
If there's a conflict between two patterns, then GenericASTMatcher will choose the longest one. Ideally, the entire matching engine will eventually be replaced by a more sophisticated one that'll find an ``optimal'' covering.

Parentheses considered harmful.
You may end up with some strange things happening if you have a terminal/nonterminal named ``('' or ``)'', since these delimit the pattern's tree structure.

Patterns.
GenericASTMatcher's engine will accept patterns that are more general than those described above. These restrictions may be enforced in a later release, however.

8  Inheritance and Generic* Classes

You can override t_ and p_ methods now in the expected fashion; in other words, defining p_foo in a subclass hides p_foo in its superclass. (Previously duplicate method names were not removed when the Generic* classes did their reflection, giving strange results.)
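
For example, in this sketch MyParser's grammar contains foo ::= A B C but not foo ::= A B, since the subclass' p_foo hides the superclass' p_foo:

class BaseParser(GenericParser):
    def p_foo(self, args):
        '''
            foo ::= A B
        '''

class MyParser(BaseParser):
    def p_foo(self, args):
        '''
            foo ::= A B C
        '''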

9  Miscellaneous

Memory leaks!
Several of the Generic* classes keep references to the various t_ and p_ methods. Unfortunately this results in a circular reference, as Darrell Gallion pointed out, which Python's current garbage collector won't collect.

This will only be an issue if you create and destroy instances of the Generic* classes repeatedly; simply using a single instance of GenericScanner repeatedly will not consume extra memory.

I probably won't fix this, since full GC is likely coming soon to a Python near you, and it won't affect many users. If you do need to handle this problem, contact me and I can advise you how to change the SPARK code.

How do I keep track of the line number?
There's currently no automatic way to do this. What I do is keep a line number attribute, say lineno, for each token. Then in my GenericScanner subclass I'll have a method like:

    def t_nl(self, s):
        r'\n'
        self.lineno = self.lineno + 1

(I set self.lineno to 1 in my subclass' tokenize method.)
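
That tokenize method looks something like this sketch (collecting tokens in self.rv follows the IPC7 paper's examples):

class MyScanner(GenericScanner):
    ...
    def tokenize(self, input):
        self.rv = []
        self.lineno = 1
        GenericScanner.tokenize(self, input)
        return self.rv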

What about inherited attributes?
For inherited attributes - ones that propagate from the root to the leaves of a tree - I just set variables in my subclass rather than pass them around explicitly. You may want to avoid relying on this while parsing unless you know exactly how and when your parsing actions are being invoked.

How do I distribute a project that uses SPARK?
As far as I'm concerned, you need only include spark.py. I'd appreciate it if you mentioned SPARK in your documentation and/or web pages. And if you send me the URL of your project's home page, I can add it to the SPARK web page.

