<html>
<head>
<title>PLY (Python Lex-Yacc)</title>
</head>
<body bgcolor="#ffffff">

<h1>PLY (Python Lex-Yacc)</h1>

<b>
David M. Beazley <br>
Department of Computer Science <br>
University of Chicago <br>
Chicago, IL 60637 <br>
beazley@cs.uchicago.edu <br>
</b>

<p>
Documentation version: $Header: /home/stever/bk/newmem2/ext/ply/doc/ply.html 1.1 03/06/06 14:53:34-00:00 stever@ $

<h2>Introduction</h2>

PLY is a Python-only implementation of the popular compiler
construction tools lex and yacc.  The implementation borrows ideas
from a number of previous efforts, most notably John Aycock's SPARK
toolkit.  However, the overall flavor of the implementation is more
closely modeled after the C version of lex and yacc.  The other
significant feature of PLY is that it provides extensive input
validation and error reporting, much more so than other Python parsing
tools.

<p>
Early versions of PLY were developed to support the Introduction to
Compilers Course at the University of Chicago.  In this course,
students built a fully functional compiler for a simple Pascal-like
language.  Their compiler, implemented entirely in Python, had to
include lexical analysis, parsing, type checking, type inference,
nested scoping, and code generation for the SPARC processor.
Approximately 30 different compiler implementations were completed in
this course.  Most of PLY's interface and operation has been motivated by common
usability problems encountered by students.

<p>
Because PLY was primarily developed as an instructional tool, you will
find it to be <em>MUCH</em> more picky about token and grammar rule
specification than most other Python parsing tools.  In part, this
added formality is meant to catch common programming mistakes made by
novice users.  However, advanced users will also find such features to
be useful when building complicated grammars for real programming
languages.  Note that PLY does not provide much in the way
of bells and whistles (e.g., automatic construction of abstract syntax trees
or tree traversal).  Instead, you will find a bare-bones, yet
fully capable lex/yacc implementation written entirely in Python.

<p>
The rest of this document assumes that you are somewhat familiar with
parsing theory, syntax directed translation, and automatic tools such
as lex and yacc. If you are unfamiliar with these topics, you will
probably want to consult an introductory text such as "Compilers:
Principles, Techniques, and Tools", by Aho, Sethi, and Ullman.  "Lex
and Yacc" by John Levine may also be handy.

<h2>PLY Overview</h2>

PLY consists of two separate tools: <tt>lex.py</tt> and
<tt>yacc.py</tt>.  <tt>lex.py</tt> is used to break input text into a
collection of tokens specified by a collection of regular expression
rules.  <tt>yacc.py</tt> is used to recognize language syntax that has
been specified in the form of a context-free grammar.  Currently,
<tt>yacc.py</tt> uses LR parsing and generates its parsing tables
using the SLR algorithm.  LALR(1) parsing may be supported in a future
release.

<p>
The two tools are meant to work together.  Specifically,
<tt>lex.py</tt> provides an external interface in the form of a
<tt>token()</tt> function that returns the next valid token on the
input stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
tokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
often an Abstract Syntax Tree (AST).  However, this is entirely up to
the user.  If desired, <tt>yacc.py</tt> can also be used to implement
simple one-pass compilers.

<p>
Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
features you expect, including extensive error checking, grammar
validation, support for empty productions, error tokens, and ambiguity
resolution via precedence rules.  The primary difference between
<tt>yacc.py</tt> and <tt>yacc</tt> is the use of SLR parsing instead
of LALR(1).  Although this slightly restricts the types of grammars
that can be successfully parsed, it is sufficiently powerful to handle most
kinds of normal programming language constructs.

<p>
Finally, it is important to note that PLY relies on reflection
(introspection) to build its lexers and parsers.  Unlike traditional
lex/yacc, which require a special input file that is converted into a
separate source file, the specifications given to PLY <em>are</em>
valid Python programs.  This means that there are no extra source
files, nor is there a special compiler construction step (e.g., running
yacc to generate Python code for the compiler).

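<p>
As a rough illustration of what building a lexer by reflection can look like, the
following sketch scans a namespace for names that start with the <tt>t_</tt> prefix
(the prefix <tt>lex.py</tt> uses for token rules) and separates string rules from
function rules.  It is not PLY's actual code: the <tt>CalcRules</tt> class, the
<tt>collect_rules()</tt> helper, and the use of a class body as the namespace are
assumptions made only for this example.

```python
# Sketch only: gather t_-prefixed rules from a namespace by introspection.
# CalcRules and collect_rules() are hypothetical names, not part of PLY.

class CalcRules:
    # String rules: the value is the regular expression itself.
    t_PLUS = r'\+'
    t_MINUS = r'-'

    # Function rule: the regular expression is the documentation string.
    def t_NUMBER(t):
        r'\d+'
        return t


def collect_rules(namespace):
    """Return (string_rules, function_rules) found in a namespace mapping."""
    string_rules = {}
    function_rules = {}
    for name, obj in namespace.items():
        if not name.startswith('t_'):
            continue                      # not a token rule
        token_name = name[2:]             # strip the t_ prefix
        if callable(obj):
            function_rules[token_name] = obj.__doc__
        elif isinstance(obj, str):
            string_rules[token_name] = obj
    return string_rules, function_rules


string_rules, function_rules = collect_rules(vars(CalcRules))
print(sorted(string_rules.items()))
print(sorted(function_rules.items()))
```

Because the rules are ordinary Python objects, no separate specification file or
generation step is needed to find them.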
<h2>Lex Example</h2>

<tt>lex.py</tt> is used to write tokenizers.  To do this, each token
must be defined by a regular expression rule.  The following file
implements a very simple lexer for tokenizing simple integer expressions:

<blockquote>
<pre>
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import lex

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    try:
         t.value = int(t.value)
    except ValueError:
         print("Line %d: Number %s is too large!" % (t.lineno, t.value))
         t.value = 0
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.skip(1)

# Build the lexer
lex.lex()

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lex.input(data)

# Tokenize
while True:
    tok = lex.token()
    if not tok: break      # No more input
    print(tok)
</pre>
</blockquote>

In the example, the <tt>tokens</tt> list defines all of the possible
token names that can be produced by the lexer.  This list is always required
and is used to perform a variety of validation checks.  Following the <tt>tokens</tt>
list, regular expressions are written for each token.  Each of these
rules is defined by making a declaration with a special prefix <tt>t_</tt> to indicate that it
defines a token.  For simple tokens, the regular expression can
be specified as a string such as this (note: Python raw strings are used since they are the
most convenient way to write regular expression strings):

<blockquote>
<pre>
t_PLUS = r'\+'
</pre>
</blockquote>

In this case, the name following the <tt>t_</tt> must exactly match one of the
names supplied in <tt>tokens</tt>.   If some kind of action needs to be performed,
a token rule can be specified as a function.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
    r'\d+'
    try:
         t.value = int(t.value)
    except ValueError:
         print("Number %s is too large!" % t.value)
         t.value = 0
    return t
</pre>
</blockquote>

In this case, the regular expression rule is specified in the function documentation string.
The function always takes a single argument, which is an instance of
<tt>LexToken</tt>.   This object has three attributes: <tt>t.type</tt>, the token type;
<tt>t.value</tt>, the lexeme; and <tt>t.lineno</tt>, the current line number.
By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
function can modify the contents of the <tt>LexToken</tt> object as appropriate.  However,
when it is done, the resulting token should be returned.  If no value is returned by the action
function, the token is simply discarded and the next token is read.

<p>
The rule <tt>t_newline()</tt> illustrates a regular expression rule
for a discarded token.  In this case, a rule is written to match
newlines so that proper line number tracking can be performed.
By returning no value, the function causes the newline character to be
discarded.

<p>
The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
that should be completely ignored in the input stream.
Usually this is used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for whitespace in a manner
similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
lexing performance because it is handled as a special case and is checked in a much
more efficient manner than the normal regular expression rules.

<p>
Finally, the <tt>t_error()</tt>
function is used to handle lexing errors that occur when illegal
characters are detected.  In this case, the <tt>t.value</tt> attribute contains the
rest of the input string that has not been tokenized.  In the example, we simply print
the offending character and skip ahead one character by calling <tt>t.skip(1)</tt>.

<p>
To build the lexer, the function <tt>lex.lex()</tt> is used.  This function
uses Python reflection (or introspection) to read the regular expression rules
out of the calling context and build the lexer. Once the lexer has been built, two functions can
be used to control the lexer.

<ul>
<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
<tt>None</tt> if the end of the input text has been reached.
</ul>

The code at the bottom of the example shows how the lexer is actually used.  When executed,
the following output will be produced:

<blockquote>
<pre>
$ python calclex.py
LexToken(NUMBER,3,2)
LexToken(PLUS,'+',2)
LexToken(NUMBER,4,2)
LexToken(TIMES,'*',2)
LexToken(NUMBER,10,2)
LexToken(PLUS,'+',3)
LexToken(MINUS,'-',3)
LexToken(NUMBER,20,3)
LexToken(TIMES,'*',3)
LexToken(NUMBER,2,3)
</pre>
</blockquote>
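
<p>
The <tt>input()</tt>/<tt>token()</tt> protocol can also be sketched without
<tt>lex.py</tt>, using only the <tt>re</tt> module.  The <tt>TinyLexer</tt> class
below is hypothetical and returns plain tuples instead of <tt>LexToken</tt>
objects.  It is meant only to show how ignored characters, discarded newline
tokens, and illegal characters can fit into a single <tt>token()</tt> loop, not
how <tt>lex.py</tt> is implemented.

```python
import re


class TinyLexer:
    """Hypothetical stand-in that mimics the input()/token() protocol."""

    # One named alternative per token; alternatives are tried left to right.
    master = re.compile(r'(?P<NUMBER>\d+)|(?P<PLUS>\+)|(?P<MINUS>-)'
                        r'|(?P<TIMES>\*)|(?P<DIVIDE>/)'
                        r'|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<newline>\n+)')
    ignore = ' \t'

    def input(self, data):
        # Reset the lexer and store a new input string.
        self.data = data
        self.pos = 0
        self.lineno = 1

    def token(self):
        # Return the next (type, value, lineno) tuple, or None at end of input.
        while self.pos < len(self.data):
            if self.data[self.pos] in self.ignore:
                self.pos += 1                    # cheap check, no regex needed
                continue
            m = self.master.match(self.data, self.pos)
            if not m:
                print("Illegal character '%s'" % self.data[self.pos])
                self.pos += 1                    # skip the offending character
                continue
            self.pos = m.end()
            if m.lastgroup == 'newline':
                self.lineno += len(m.group())    # track lines, discard token
                continue
            value = m.group()
            if m.lastgroup == 'NUMBER':
                value = int(value)
            return (m.lastgroup, value, self.lineno)
        return None


lexer = TinyLexer()
lexer.input('\n3 + 4 * 10\n  + -20 *2\n')
tokens = []
while True:
    tok = lexer.token()
    if not tok:
        break                                    # no more input
    tokens.append(tok)
    print(tok)
```

Run on the same test data as the example, the sketch yields ten tokens with the
same types, values, and line numbers as the output shown above.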

<h2>Lex Implementation Notes</h2>

<ul>
<li><tt>lex.py</tt> uses the <tt>re</tt> module to do its pattern matching.  When building the master regular expression,
rules are added in the following order:
<p>
<ol>
<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
<li>Tokens defined by strings are added by sorting them in order of decreasing regular expression length (longer expressions
are added first).
</ol>
<p>
Without this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you
wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
expressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
the order can be explicitly controlled since rules appearing first are checked first.

<P>
<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this
rarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
such as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.

<p>
<li>
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
like this:

<blockquote>
<pre>
reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
</pre>
</blockquote>

<p>
<li>The lexer requires tokens to be defined as class instances with <tt>t.type</tt>, <tt>t.value</tt>, and <tt>t.lineno</tt>
attributes.   By default, tokens are created as instances of the <tt>LexToken</tt> class defined internally to <tt>lex.py</tt>.
If desired, you can create new kinds of tokens provided that they have the three required attributes.   However,
in practice, it is probably safer to stick with the default.

<p>
<li>The only safe attribute for assigning token properties is <tt>t.value</tt>.   In some cases, you may want to attach
a number of different properties to a token (e.g., symbol table entries for identifiers).  To do this, replace <tt>t.value</tt>
with a tuple or class instance. For example:

<blockquote>
<pre>
def t_ID(t):
    ...
    # For identifiers, create a (lexeme, symtab) tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t
</pre>
</blockquote>

Although allowed, do NOT assign additional attributes to the token object.  For example,
<blockquote>
<pre>
def t_ID(t):
    ...
    # Bad implementation of above
    t.symtab = symbol_lookup(t.value)
    ...
</pre>
</blockquote>

The reason you don't want to do this is that the <tt>yacc.py</tt>
module only provides public access to the <tt>t.value</tt> attribute of each token.
Therefore, any other attributes you assign are inaccessible (if you are familiar
with the internals of C lex/yacc, <tt>t.value</tt> is the same as <tt>yylval.tok</tt>).

<p>
<li>To track line numbers, the lexer internally maintains a line
number variable.  Each token automatically gets the value of the
current line number in the <tt>t.lineno</tt> attribute. To modify the
current line number, simply change the <tt>t.lineno</tt> attribute
in a function rule (as previously shown for
<tt>t_newline()</tt>).  Even if the resulting token is discarded,
changes to the line number remain in effect for subsequent tokens.

<p>
<li>To support multiple scanners in the same application, the <tt>lex.lex()</tt> function
actually returns a special <tt>Lexer</tt> object.   This object has two methods,
<tt>input()</tt> and <tt>token()</tt>, that can be used to supply input and get tokens.  For example:

<blockquote>
<pre>
lexer = lex.lex()
lexer.input(sometext)
while True:
    tok = lexer.token()
    if not tok: break
    print(tok)
</pre>
</blockquote>

The functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt>
and <tt>token()</tt> methods of the last lexer created by the lex module.


<p>
<li>To reduce compiler startup time and improve performance, the lexer can be built in optimized mode as follows:

<blockquote>
<pre>
lex.lex(optimize=1)
</pre>
</blockquote>

When used, most error checking and validation is disabled.   This provides a slight performance
gain while tokenizing and tends to chop a few tenths of a second off startup time.  Since it disables
error checking, this mode is not the default and is not recommended during development.  However, once
you have your compiler fully working, it is usually safe to disable the error checks.

<p>
<li>You can enable some additional debugging by building the lexer like this:

<blockquote>
<pre>
lex.lex(debug=1)
</pre>
</blockquote>

<p>
<li>To help you debug your lexer, <tt>lex.py</tt> comes with a simple main program which will
tokenize input read either from standard input or from a file.  To use it, simply put this in your lexer:

<blockquote>
<pre>
if __name__ == '__main__':
     lex.runmain()
</pre>
</blockquote>

Then, run your lexer as a main program, for example <tt>python mylex.py</tt>.

<p>
<li>Since the lexer is written entirely in Python, its performance is
largely determined by that of the Python <tt>re</tt> module.  Although
the lexer has been written to be as efficient as possible, it's not
blazingly fast when used on very large input files.  Sorry.  If
performance is a concern, you might consider upgrading to the most
recent version of Python, creating a hand-written lexer, or offloading
the lexer into a C extension module.  In defense of <tt>lex.py</tt>,
its performance is not <em>that</em> bad when used on reasonably
sized input files.  For instance, lexing a 4700 line C program with
32000 input tokens takes about 20 seconds on a 200 MHz PC.  Obviously,
it will run much faster on a more speedy machine.

</ul>
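
<p>
The rule ordering described in the first note above can be sketched with the
<tt>re</tt> module alone.  The example below is an assumption-laden illustration,
not the code in <tt>lex.py</tt>: the rule tables, <tt>build_master()</tt>, and
<tt>tokenize()</tt> are hypothetical names.  It joins function rules in definition
order, then string rules sorted by decreasing pattern length, so that "==" is
tried before "=".

```python
import re

# Hypothetical rule sets; names and patterns are illustrative only.
function_rules = [                # kept in definition order
    ('NUMBER', r'\d+'),
    ('ID', r'[a-zA-Z_][a-zA-Z_0-9]*'),
]
string_rules = {                  # the order written here does not matter
    'ASSIGN': r'=',
    'EQ': r'==',
    'PLUS': r'\+',
}


def build_master(function_rules, string_rules):
    """Join rules into one pattern: functions first, then longest strings."""
    ordered = list(function_rules)
    ordered += sorted(string_rules.items(),
                      key=lambda item: len(item[1]), reverse=True)
    pattern = '|'.join('(?P<%s>%s)' % (name, regex) for name, regex in ordered)
    return re.compile(pattern)


master = build_master(function_rules, string_rules)


def tokenize(text):
    """Return (type, lexeme) pairs; blanks are skipped, bad input raises."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos] in ' \t':
            pos += 1
            continue
        m = master.match(text, pos)
        if m is None:
            raise ValueError('illegal character %r' % text[pos])
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out


print(tokenize('x == 1'))
print(tokenize('x = 1'))
```

The ordering matters because an <tt>re</tt> alternation takes the first alternative
that matches rather than the longest one.  If "=" were listed before "==", the
input <tt>x == 1</tt> would produce two assignment tokens.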

<h2>Parsing basics</h2>

<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
example, there are a few important bits of background that must be
mentioned.  First, <em>syntax</em> is usually specified in terms of a
context-free grammar (CFG).  For example, if you wanted to parse
simple arithmetic expressions, you might first write an unambiguous
grammar specification like this:

<blockquote>
<pre>
expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )
</pre>
</blockquote>

Next, the semantic behavior of a language is often specified using a
technique known as syntax directed translation.  In syntax directed
translation, attributes are attached to each symbol in a given grammar
rule along with an action.  Whenever a particular grammar rule is
recognized, the action describes what to do.  For example, given the
expression grammar above, you might write the specification for a
simple calculator like this:

<blockquote>
<pre>
Grammar                             Action
--------------------------------    --------------------------------------------
expression0 : expression1 + term    expression0.val = expression1.val + term.val
            | expression1 - term    expression0.val = expression1.val - term.val
            | term                  expression0.val = term.val

term0       : term1 * factor        term0.val = term1.val * factor.val
            | term1 / factor        term0.val = term1.val / factor.val
            | factor                term0.val = factor.val

factor      : NUMBER                factor.val = int(NUMBER.lexval)
            | ( expression )        factor.val = expression.val
</pre>
</blockquote>
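
<p>
One way to read the table above is as a mapping from each grammar rule to a small
function.  The sketch below is only an illustration of that reading and does not
use <tt>yacc.py</tt>: the <tt>actions</tt> dictionary, the <tt>evaluate()</tt>
helper, and the tuple-based parse tree are all hypothetical, and the tree is
written by hand rather than produced by a parser.

```python
# Each production is paired with an action that computes the attribute of
# the left-hand side from the attributes of the right-hand-side symbols.
actions = {
    'expression : expression + term': lambda e, _op, t: e + t,
    'expression : expression - term': lambda e, _op, t: e - t,
    'expression : term':              lambda t: t,
    'term : term * factor':           lambda t, _op, f: t * f,
    'term : term / factor':           lambda t, _op, f: t / f,
    'term : factor':                  lambda f: f,
    'factor : NUMBER':                lambda n: int(n),
    'factor : ( expression )':        lambda _lp, e, _rp: e,
}


def evaluate(node):
    """Evaluate a parse tree bottom-up.

    A node is either a token string (a leaf) or a tuple of the form
    (production, child, child, ...).
    """
    if isinstance(node, str):
        return node                       # leaf: the lexeme itself
    production, children = node[0], node[1:]
    values = [evaluate(child) for child in children]
    return actions[production](*values)


# Hand-built parse tree for "3 + 4 * 10" (no parser is involved here).
tree = ('expression : expression + term',
        ('expression : term', ('term : factor', ('factor : NUMBER', '3'))),
        '+',
        ('term : term * factor',
         ('term : factor', ('factor : NUMBER', '4')),
         '*',
         ('factor : NUMBER', '10')))

result = evaluate(tree)
print(result)
```

In Python 3 the <tt>/</tt> operator performs true division, so a calculator that
should stay in integers would use <tt>//</tt> in the division action instead.
<p>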

Finally, Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
bottom up technique that tries to recognize the right-hand-side of various grammar rules.
Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
grammar symbols are replaced by the grammar symbol on the left-hand-side.

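<p>
The shift-reduce idea can be sketched for the expression grammar with a hand-coded
loop.  This is not how <tt>yacc.py</tt> works internally: a real LR parser consults
generated tables, whereas the sketch below hard-codes its decisions by inspecting
the top of the stack and the next token.  The <tt>tokenize()</tt> and
<tt>parse()</tt> functions are hypothetical, and <tt>expr</tt> abbreviates
<tt>expression</tt>.

```python
import re


def tokenize(text):
    """Split text into (symbol, value) pairs and append the end marker '$'.

    Characters that are neither digits nor operators are silently ignored.
    """
    tokens = []
    for lexeme in re.findall(r'\d+|[-+*/()]', text):
        if lexeme.isdigit():
            tokens.append(('NUMBER', int(lexeme)))
        else:
            tokens.append((lexeme, lexeme))
    tokens.append(('$', None))
    return tokens


def parse(text):
    """Evaluate text with a shift-reduce loop; return (value, action log)."""
    tokens = tokenize(text)
    stack = []                    # (symbol, value) pairs
    log = []                      # actions taken, in order
    pos = 0

    def reduce(count, symbol, value, rule):
        # Replace the top `count` stack entries by the left-hand-side symbol.
        del stack[-count:]
        stack.append((symbol, value))
        log.append('reduce ' + rule)

    while True:
        look = tokens[pos][0]     # next input token (lookahead)
        syms = [sym for sym, _ in stack]
        vals = [val for _, val in stack]
        if syms[-1:] == ['NUMBER']:
            reduce(1, 'factor', vals[-1], 'factor : NUMBER')
        elif syms[-3:] == ['(', 'expr', ')']:
            reduce(3, 'factor', vals[-2], 'factor : ( expr )')
        elif syms[-3:] in (['term', '*', 'factor'], ['term', '/', 'factor']):
            if syms[-2] == '*':
                value = vals[-3] * vals[-1]
            else:
                value = vals[-3] / vals[-1]
            reduce(3, 'term', value, 'term : term %s factor' % syms[-2])
        elif syms[-1:] == ['factor']:
            reduce(1, 'term', vals[-1], 'term : factor')
        elif syms[-1:] == ['term'] and look not in ('*', '/'):
            if syms[-3:-1] in (['expr', '+'], ['expr', '-']):
                if syms[-2] == '+':
                    value = vals[-3] + vals[-1]
                else:
                    value = vals[-3] - vals[-1]
                reduce(3, 'expr', value, 'expr : expr %s term' % syms[-2])
            else:
                reduce(1, 'expr', vals[-1], 'expr : term')
        elif syms == ['expr'] and look == '$':
            log.append('accept')
            return vals[0], log
        elif look == '$':
            raise SyntaxError('unexpected end of input')
        else:
            stack.append(tokens[pos])              # shift the next token
            log.append('shift %s' % tokens[pos][1])
            pos += 1


value, log = parse('3 + 5 * (10 - 20)')
print(value)
for action in log:
    print(action)
```

The check on the lookahead token before reducing a <tt>term</tt> is what gives
<tt>*</tt> and <tt>/</tt> higher precedence than <tt>+</tt> and <tt>-</tt>: a
<tt>term</tt> followed by <tt>*</tt> or <tt>/</tt> stays on the stack until its
right-hand operand has been parsed.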
<p>
LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
input token for patterns.   The details of the algorithm can be found in a compiler text, but the
following example illustrates the steps that are performed if you wanted to parse the expression
<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:

<blockquote>
<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
9    $ expr + term                * ( 10 - 20 )$    Shift *
10   $ expr + term *                ( 10 - 20 )$    Shift (
11   $ expr + term * (                10 - 20 )$    Shift 10
12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
14   $ expr + term * ( term              - 20 )$    Reduce expr : term
15   $ expr + term * ( expr              - 20 )$    Shift -
16   $ expr + term * ( expr -              20 )$    Shift 20
17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
18   $ expr + term * ( expr - factor          )$    Reduce term : factor
19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
20   $ expr + term * ( expr                   )$    Shift )
21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
22   $ expr + term * factor                    $    Reduce term : term * factor
23   $ expr + term                             $    Reduce expr : expr + term
24   $ expr                                    $    Reduce expr
25   $                                         $    Success!
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduWhen parsing the expression, an underlying state machine and the current input token determine what to do next.
2632Sstever@eecs.umich.eduIf the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
2632Sstever@eecs.umich.eduonto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
2632Sstever@eecs.umich.eduusually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
2632Sstever@eecs.umich.eduappropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
2632Sstever@eecs.umich.eduany grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).
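
<p>
The shift and reduce steps described above can be sketched in plain Python.  This is only an illustration
under simplifying assumptions: the decisions are hand-coded for this one grammar instead of being read from
generated tables, the names <tt>tokenize</tt>, <tt>shift_reduce</tt> and <tt>reduce_top</tt> are made up for
the example, and division is floor division.  It is not how <tt>yacc.py</tt> is implemented.

```python
import re

def tokenize(text):
    """Return (symbol, value) pairs, ending with the end marker '$'."""
    tokens = []
    for lexeme in re.findall(r"\d+|[-+*/()]", text):
        if lexeme.isdigit():
            tokens.append(("NUMBER", int(lexeme)))
        else:
            tokens.append((lexeme, lexeme))
    tokens.append(("$", None))
    return tokens

def shift_reduce(text):
    """Evaluate text by shifting and reducing; return (value, actions)."""
    tokens = tokenize(text)
    stack = []      # (symbol, value) pairs; the top of the stack is the end
    actions = []    # log of every shift and reduce, in order
    pos = 0

    def reduce_top(count, symbol, value, rule):
        # Replace the top `count` entries with the left-hand-side symbol.
        del stack[len(stack) - count:]
        stack.append((symbol, value))
        actions.append("reduce " + rule)

    while True:
        look = tokens[pos][0]               # next input token (lookahead)
        syms = [s for s, _ in stack]
        vals = [v for _, v in stack]
        if syms == ["expr"] and look == "$":
            return vals[0], actions         # accept
        if syms[-1:] == ["NUMBER"]:
            reduce_top(1, "factor", vals[-1], "factor : NUMBER")
        elif syms[-3:] == ["(", "expr", ")"]:
            reduce_top(3, "factor", vals[-2], "factor : ( expr )")
        elif syms[-1:] == ["factor"]:
            if syms[-3:-1] == ["term", "*"]:
                reduce_top(3, "term", vals[-3] * vals[-1], "term : term * factor")
            elif syms[-3:-1] == ["term", "/"]:
                reduce_top(3, "term", vals[-3] // vals[-1], "term : term / factor")
            else:
                reduce_top(1, "term", vals[-1], "term : factor")
        elif syms[-1:] == ["term"] and look not in ("*", "/"):
            # A '*' or '/' lookahead means the term is not finished: shift instead.
            if syms[-3:-1] == ["expr", "+"]:
                reduce_top(3, "expr", vals[-3] + vals[-1], "expr : expr + term")
            elif syms[-3:-1] == ["expr", "-"]:
                reduce_top(3, "expr", vals[-3] - vals[-1], "expr : expr - term")
            else:
                reduce_top(1, "expr", vals[-1], "expr : term")
        elif look == "$":
            raise SyntaxError("Syntax error in input!")
        else:
            stack.append(tokens[pos])       # shift
            actions.append("shift " + look)
            pos += 1

value, actions = shift_reduce("3 + 5 * (10 - 20)")
```

Printing <tt>actions</tt> gives one entry per shift or reduce, in the same order as the table above.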
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduIt is important to note that the underlying implementation is actually built around a large finite-state machine
2632Sstever@eecs.umich.eduand some tables.   The construction of these tables is quite complicated and beyond the scope of this discussion.
2632Sstever@eecs.umich.eduHowever, subtle details of this process explain why, in the example above, the parser chooses to shift a token
2632Sstever@eecs.umich.eduonto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Yacc example</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduSuppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
2632Sstever@eecs.umich.eduhow you would do it with <tt>yacc.py</tt>:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edu# Yacc example
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduimport yacc
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu# Get the token map from the lexer.  This is required.
2632Sstever@eecs.umich.edufrom calclex import tokens
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_plus(t):
2632Sstever@eecs.umich.edu    'expression : expression PLUS term'
2632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_minus(t):
2632Sstever@eecs.umich.edu    'expression : expression MINUS term'
2632Sstever@eecs.umich.edu    t[0] = t[1] - t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_term(t):
2632Sstever@eecs.umich.edu    'expression : term'
2632Sstever@eecs.umich.edu    t[0] = t[1]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_term_times(t):
2632Sstever@eecs.umich.edu    'term : term TIMES factor'
2632Sstever@eecs.umich.edu    t[0] = t[1] * t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_term_div(t):
2632Sstever@eecs.umich.edu    'term : term DIVIDE factor'
2632Sstever@eecs.umich.edu    t[0] = t[1] / t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_term_factor(t):
2632Sstever@eecs.umich.edu    'term : factor'
2632Sstever@eecs.umich.edu    t[0] = t[1]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_factor_num(t):
2632Sstever@eecs.umich.edu    'factor : NUMBER'
2632Sstever@eecs.umich.edu    t[0] = t[1]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_factor_expr(t):
2632Sstever@eecs.umich.edu    'factor : LPAREN expression RPAREN'
2632Sstever@eecs.umich.edu    t[0] = t[2]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu# Error rule for syntax errors
2632Sstever@eecs.umich.edudef p_error(t):
2632Sstever@eecs.umich.edu    print "Syntax error in input!"
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu# Build the parser
2632Sstever@eecs.umich.eduyacc.yacc()
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduwhile 1:
2632Sstever@eecs.umich.edu   try:
2632Sstever@eecs.umich.edu       s = raw_input('calc > ')
2632Sstever@eecs.umich.edu   except EOFError:
2632Sstever@eecs.umich.edu       break
2632Sstever@eecs.umich.edu   if not s: continue
2632Sstever@eecs.umich.edu   result = yacc.parse(s)
2632Sstever@eecs.umich.edu   print result
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn this example, each grammar rule is defined by a Python function where the docstring to that function contains the
2632Sstever@eecs.umich.eduappropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit).  Each function accepts a single
2632Sstever@eecs.umich.eduargument <tt>t</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
2632Sstever@eecs.umich.edu<tt>t[i]</tt> are mapped to grammar symbols as shown here:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_expression_plus(t):
2632Sstever@eecs.umich.edu    'expression : expression PLUS term'
2632Sstever@eecs.umich.edu    #   ^            ^        ^    ^
2632Sstever@eecs.umich.edu    #  t[0]         t[1]     t[2] t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduFor tokens, the "value" in the corresponding <tt>t[i]</tt> is the
2632Sstever@eecs.umich.edu<em>same</em> as the value of the <tt>t.value</tt> attribute assigned
2632Sstever@eecs.umich.eduin the lexer module.  For non-terminals, the value is determined by
2632Sstever@eecs.umich.eduwhatever is placed in <tt>t[0]</tt> when rules are reduced.  This
value can be anything at all.  However, it is probably most common for
2632Sstever@eecs.umich.eduthe value to be a simple Python type, a tuple, or an instance.  In this example, we
2632Sstever@eecs.umich.eduare relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
2632Sstever@eecs.umich.edufield.  All of the other rules simply perform various types of integer operations and store
2632Sstever@eecs.umich.eduthe result.
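
<p>
To see the indexing in isolation, a rule function can be called by hand with a stand-in for the sequence
that the parser would normally supply.  The <tt>FakeProduction</tt> class below is hypothetical and exists
only for this sketch; it is not part of <tt>yacc.py</tt>.

```python
class FakeProduction:
    """Hypothetical stand-in for the sequence passed to a rule function."""

    def __init__(self, values):
        # Slot 0 is reserved for the result; slots 1..n hold the symbol values.
        self.slots = [None] + list(values)

    def __getitem__(self, index):
        return self.slots[index]

    def __setitem__(self, index, value):
        self.slots[index] = value

def p_expression_plus(t):
    'expression : expression PLUS term'
    t[0] = t[1] + t[3]

# Values for: expression PLUS term
t = FakeProduction([3, '+', 4])
p_expression_plus(t)
result = t[0]
```

After the call, <tt>result</tt> holds the value the rule stored in <tt>t[0]</tt>, which is what would be
seen by whichever rule uses this <tt>expression</tt> next.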
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduThe first rule defined in the yacc specification determines the starting grammar
2632Sstever@eecs.umich.edusymbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
2632Sstever@eecs.umich.eduthe starting rule is reduced by the parser and no more input is available, parsing
2632Sstever@eecs.umich.edustops and the final value is returned (this value will be whatever the top-most rule
2632Sstever@eecs.umich.eduplaced in <tt>t[0]</tt>).
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>The <tt>p_error(t)</tt> rule is defined to catch syntax errors.  See the error handling section
2632Sstever@eecs.umich.edubelow for more detail.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduTo build the parser, call the <tt>yacc.yacc()</tt> function.  This function
2632Sstever@eecs.umich.edulooks at the module and attempts to construct all of the LR parsing tables for the grammar
2632Sstever@eecs.umich.eduyou have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
2632Sstever@eecs.umich.edusuch as this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edu$ python calcparse.py
2632Sstever@eecs.umich.eduyacc: Generating SLR parsing table...
2632Sstever@eecs.umich.educalc >
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduSince table construction is relatively expensive (especially for large
2632Sstever@eecs.umich.edugrammars), the resulting parsing table is written to the current
2632Sstever@eecs.umich.edudirectory in a file called <tt>parsetab.py</tt>.  In addition, a
2632Sstever@eecs.umich.edudebugging file called <tt>parser.out</tt> is created.  On subsequent
2632Sstever@eecs.umich.eduexecutions, <tt>yacc</tt> will reload the table from
2632Sstever@eecs.umich.edu<tt>parsetab.py</tt> unless it has detected a change in the underlying
2632Sstever@eecs.umich.edugrammar (in which case the tables and <tt>parsetab.py</tt> file are
2632Sstever@eecs.umich.eduregenerated).
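
<p>
One way such change detection could work is to store a signature of the grammar next to the cached table and
compare it on the next run.  The sketch below is an assumption made for illustration, not a description of
what <tt>yacc.py</tt> actually writes to <tt>parsetab.py</tt>; the names <tt>grammar_signature</tt>,
<tt>load_or_build</tt> and <tt>fake_build</tt> are invented, and a dictionary stands in for the file.

```python
import hashlib

def grammar_signature(rule_functions):
    """Hash the docstrings that carry the grammar rules."""
    digest = hashlib.sha256()
    for func in rule_functions:
        digest.update((func.__doc__ or "").encode("utf-8"))
    return digest.hexdigest()

def load_or_build(rule_functions, cache, build):
    """Return (table, rebuilt), reusing the cached table if the grammar is unchanged."""
    signature = grammar_signature(rule_functions)
    if cache.get("signature") == signature:
        return cache["table"], False
    table = build(rule_functions)           # the expensive step
    cache["signature"] = signature
    cache["table"] = table
    return table, True

def p_factor_num(t):
    'factor : NUMBER'
    t[0] = t[1]

def fake_build(rule_functions):
    # Placeholder for real table construction.
    return [func.__doc__ for func in rule_functions]

cache = {}
first = load_or_build([p_factor_num], cache, fake_build)    # builds
second = load_or_build([p_factor_num], cache, fake_build)   # reuses the cache
p_factor_num.__doc__ = 'factor : LPAREN expression RPAREN'  # grammar changed
third = load_or_build([p_factor_num], cache, fake_build)    # rebuilds
```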
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduIf any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
2632Sstever@eecs.umich.edudiagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<ul>
<li>Duplicated function names (if more than one rule function has the same name in the grammar file).
<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
<li>Badly specified grammar rules.
<li>Infinite recursion (rules that can never terminate).
<li>Unused rules and tokens.
<li>Undefined rules and tokens.
2632Sstever@eecs.umich.edu</ul>
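
<p>
Two of these checks, undefined symbols and unused tokens, are simple set operations.  The following sketch
assumes a hypothetical representation of the grammar (a dictionary mapping each rule name to its
alternatives) and a made-up function name, <tt>check_grammar</tt>; it only illustrates the idea and does not
reproduce the diagnostics of <tt>yacc.py</tt>.

```python
def check_grammar(productions, tokens):
    """Return (undefined symbols, unused tokens) as sorted lists."""
    defined = set(productions)
    used = set()
    for alternatives in productions.values():
        for symbols in alternatives:
            used.update(symbols)
    # A used symbol must be either a rule name or a declared token.
    undefined = sorted(used - defined - set(tokens))
    unused_tokens = sorted(set(tokens) - used)
    return undefined, unused_tokens

productions = {
    'expression': [['expression', 'PLUS', 'term'], ['term']],
    'term': [['term', 'TIMES', 'factor'], ['factor']],
}
tokens = ['PLUS', 'TIMES', 'NUMBER']

undefined, unused_tokens = check_grammar(productions, tokens)
```

Here <tt>factor</tt> is used but never defined, and <tt>NUMBER</tt> is declared but never used, because the
rule for <tt>factor</tt> was left out on purpose.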
2632Sstever@eecs.umich.edu
The next few sections discuss some finer points of grammar construction.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Combining Grammar Rule Functions</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduWhen grammar rules are similar, they can be combined into a single function.
2632Sstever@eecs.umich.eduFor example, consider the two rules in our earlier example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_expression_plus(t):
2632Sstever@eecs.umich.edu    'expression : expression PLUS term'
2632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_minus(t):
2632Sstever@eecs.umich.edu    'expression : expression MINUS term'
2632Sstever@eecs.umich.edu    t[0] = t[1] - t[3]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduInstead of writing two functions, you might write a single function like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_expression(t):
2632Sstever@eecs.umich.edu    '''expression : expression PLUS term
2632Sstever@eecs.umich.edu                  | expression MINUS term'''
2632Sstever@eecs.umich.edu    if t[2] == '+':
2632Sstever@eecs.umich.edu        t[0] = t[1] + t[3]
2632Sstever@eecs.umich.edu    elif t[2] == '-':
2632Sstever@eecs.umich.edu        t[0] = t[1] - t[3]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn general, the doc string for any given function can contain multiple grammar rules.  So, it would
2632Sstever@eecs.umich.eduhave also been legal (although possibly confusing) to write this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_binary_operators(t):
2632Sstever@eecs.umich.edu    '''expression : expression PLUS term
2632Sstever@eecs.umich.edu                  | expression MINUS term
2632Sstever@eecs.umich.edu       term       : term TIMES factor
2632Sstever@eecs.umich.edu                  | term DIVIDE factor'''
2632Sstever@eecs.umich.edu    if t[2] == '+':
2632Sstever@eecs.umich.edu        t[0] = t[1] + t[3]
2632Sstever@eecs.umich.edu    elif t[2] == '-':
2632Sstever@eecs.umich.edu        t[0] = t[1] - t[3]
2632Sstever@eecs.umich.edu    elif t[2] == '*':
2632Sstever@eecs.umich.edu        t[0] = t[1] * t[3]
2632Sstever@eecs.umich.edu    elif t[2] == '/':
2632Sstever@eecs.umich.edu        t[0] = t[1] / t[3]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduWhen combining grammar rules into a single function, it is usually a good idea for all of the rules to have
2632Sstever@eecs.umich.edua similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more
2632Sstever@eecs.umich.educomplicated than necessary.
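
<p>
To make the docstring format concrete, here is a rough sketch of how a multi-rule docstring could be split
into individual productions.  The function <tt>parse_rules</tt> is hypothetical and handles only the layout
used in this document (a rule name followed by <tt>:</tt>, then alternatives on lines starting with
<tt>|</tt>); the real parsing code in <tt>yacc.py</tt> may differ.

```python
def parse_rules(doc):
    """Split a grammar docstring into (left-hand side, symbols) pairs."""
    rules = []
    lhs = None
    for line in doc.splitlines():
        parts = line.split()
        if not parts:
            continue                        # skip blank lines
        if parts[0] == '|':
            if lhs is None:
                raise ValueError("'|' before any rule name")
            rules.append((lhs, parts[1:]))  # alternative for the current rule
        elif len(parts) >= 2 and parts[1] == ':':
            lhs = parts[0]                  # start of a new rule
            rules.append((lhs, parts[2:]))
        else:
            raise ValueError("badly specified rule: " + line.strip())
    return rules

def p_binary_operators(t):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''

rules = parse_rules(p_binary_operators.__doc__)
```

Each of the four productions ends up attached to the same function, which is why the action code has to
inspect <tt>t[2]</tt> to find out which one was reduced.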
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Empty Productions</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<tt>yacc.py</tt> can handle empty productions by defining a rule like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_empty(t):
2632Sstever@eecs.umich.edu    'empty :'
2632Sstever@eecs.umich.edu    pass
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduNow to use the empty production, simply use 'empty' as a symbol.  For example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_optitem(t):
    '''optitem : item
               | empty'''
2632Sstever@eecs.umich.edu    ...
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
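
<p>
One detail is worth checking when a rule with alternatives spans several lines.  Python treats only the
first string literal in a function body as the docstring, so an alternative written as a second string
literal on its own line never reaches the grammar.  The two function names below are made up for this
comparison.

```python
def p_optitem_split(t):
    'optitem : item'
    '        | empty'       # a separate statement, not part of the docstring

def p_optitem_joined(t):
    '''optitem : item
               | empty'''

split_doc = p_optitem_split.__doc__
joined_doc = p_optitem_joined.__doc__
```

Only <tt>joined_doc</tt> contains the <tt>empty</tt> alternative, so a triple-quoted string is the safe way
to write multi-line rules.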
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Dealing With Ambiguous Grammars</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduThe expression grammar given in the earlier example has been written in a special format to eliminate ambiguity.
2632Sstever@eecs.umich.eduHowever, in many situations, it is extremely difficult or awkward to write grammars in this format.  A
2632Sstever@eecs.umich.edumuch more natural way to express the grammar is in a more compact form like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduexpression : expression PLUS expression
2632Sstever@eecs.umich.edu           | expression MINUS expression
2632Sstever@eecs.umich.edu           | expression TIMES expression
2632Sstever@eecs.umich.edu           | expression DIVIDE expression
2632Sstever@eecs.umich.edu           | LPAREN expression RPAREN
2632Sstever@eecs.umich.edu           | NUMBER
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
Unfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped:
does the expression mean "(3 * 4) + 5" or "3 * (4 + 5)"?
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
When an ambiguous grammar is given to <tt>yacc.py</tt>, it will print messages about "shift/reduce conflicts"
or "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
whether to reduce a rule or shift a symbol onto the parsing stack.   For example, consider
the string "3 * 4 + 5" and the internal parsing stack:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                                3 * 4 + 5$    Shift 3
2    $ 3                                * 4 + 5$    Reduce expression : NUMBER
3    $ expr                             * 4 + 5$    Shift *
4    $ expr *                             4 + 5$    Shift 4
5    $ expr * 4                             + 5$    Reduce expression : NUMBER
6    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
In this case, when the parser reaches step 6, it has two options.  One is to reduce the
rule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
token <tt>+</tt> onto the stack.   Both options are perfectly legal from the rules
of the context-free grammar.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduBy default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
2632Sstever@eecs.umich.eduexample, the parser will always shift the <tt>+</tt> instead of reducing.    Although this
2632Sstever@eecs.umich.edustrategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
2632Sstever@eecs.umich.eduexpressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
2632Sstever@eecs.umich.edureduced <tt>expr * expr</tt> since multiplication has higher precedence than addition.
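
<p>
The effect of always shifting can be checked with a little arithmetic.  A parser that never reduces until
the input runs out ends up grouping the operators from the right, which the hypothetical helper
<tt>always_shift</tt> below imitates on a flat list of numbers and operator characters.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def always_shift(tokens):
    """Evaluate NUMBER (op NUMBER)* with every operator grouped to the right."""
    value = tokens[-1]
    # Walk the operators from right to left, as the pending reductions would.
    for i in range(len(tokens) - 2, 0, -2):
        value = OPS[tokens[i]](tokens[i - 1], value)
    return value

shifted = always_shift([3, '*', 4, '+', 5])   # 3 * (4 + 5)
intended = (3 * 4) + 5
```

The two results differ (27 versus 17), which is why the default strategy is not acceptable for arithmetic.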
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
2632Sstever@eecs.umich.edutokens to be assigned a precedence level and associativity.  This is done by adding a variable
2632Sstever@eecs.umich.edu<tt>precedence</tt> to the grammar file like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduprecedence = (
2632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
2632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
2632Sstever@eecs.umich.edu)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduThis declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
2632Sstever@eecs.umich.eduthe same precedence level and are left-associative and that
2632Sstever@eecs.umich.edu<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
2632Sstever@eecs.umich.eduFurthermore, the declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
2632Sstever@eecs.umich.eduprecedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
2632Sstever@eecs.umich.eduprecedence specification).
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduThe precedence specification is used to attach a numerical precedence value and associativity direction
2632Sstever@eecs.umich.eduto each grammar rule. This is always determined by the precedence of the right-most terminal symbol.  Therefore,
2632Sstever@eecs.umich.eduif PLUS/MINUS had a precedence of 1 and TIMES/DIVIDE had a precedence of 2, the grammar rules
2632Sstever@eecs.umich.eduwould have precedence values as follows:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduexpression : expression PLUS expression                 # prec = 1, left
2632Sstever@eecs.umich.edu           | expression MINUS expression                # prec = 1, left
2632Sstever@eecs.umich.edu           | expression TIMES expression                # prec = 2, left
2632Sstever@eecs.umich.edu           | expression DIVIDE expression               # prec = 2, left
2632Sstever@eecs.umich.edu           | LPAREN expression RPAREN                   # prec = unknown
2632Sstever@eecs.umich.edu           | NUMBER                                     # prec = unknown
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduWhen shift/reduce conflicts are encountered, the parser generator resolves the conflict by
2632Sstever@eecs.umich.edulooking at the precedence rules and associativity specifiers.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<ol>
2632Sstever@eecs.umich.edu<li>If the current token has higher precedence, it is shifted.
2632Sstever@eecs.umich.edu<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
2632Sstever@eecs.umich.edu<li>If the current token and the grammar rule have the same precedence, the
2632Sstever@eecs.umich.edurule is reduced for left associativity, whereas the token is shifted for right associativity.
2632Sstever@eecs.umich.edu<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
2632Sstever@eecs.umich.edufavor of shifting (the default).
2632Sstever@eecs.umich.edu</ol>
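
<p>
The four rules can be written down as a small decision function.  This is a sketch of the rules as stated
here, not the code used by <tt>yacc.py</tt>; the names <tt>build_levels</tt> and <tt>resolve</tt> are
invented, precedence levels are assumed to be numbered from 1 in the order of the table, and only
<tt>'left'</tt> and <tt>'right'</tt> associativity are handled.

```python
PRECEDENCE = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

def build_levels(precedence):
    """Map each token name to (level, associativity); later entries bind tighter."""
    levels = {}
    for level, entry in enumerate(precedence, start=1):
        for name in entry[1:]:
            levels[name] = (level, entry[0])
    return levels

def resolve(rule_prec, token_prec):
    """Pick 'shift' or 'reduce' from two (level, assoc) pairs; None means unknown."""
    if rule_prec is None or token_prec is None:
        return 'shift'                  # rule 4: nothing known, default to shift
    rule_level, rule_assoc = rule_prec
    token_level = token_prec[0]
    if token_level > rule_level:
        return 'shift'                  # rule 1: token binds tighter
    if rule_level > token_level:
        return 'reduce'                 # rule 2: rule on the stack binds tighter
    if rule_assoc == 'left':
        return 'reduce'                 # rule 3: equal precedence, left associative
    if rule_assoc == 'right':
        return 'shift'                  # rule 3: equal precedence, right associative
    raise ValueError("associativity not handled in this sketch: " + rule_assoc)

levels = build_levels(PRECEDENCE)
# Stack holds "expr * expr" and the next token is PLUS (step 6 of the earlier trace).
decision = resolve(levels['TIMES'], levels['PLUS'])
```

With the table above, <tt>decision</tt> is <tt>'reduce'</tt>, so the multiplication is grouped first.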
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduWhen shift/reduce conflicts are resolved using the first three techniques (with the help of
2632Sstever@eecs.umich.eduprecedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduOne problem with the precedence specifier technique is that it is sometimes necessary to
change the precedence of an operator in certain contexts.  For example, consider a unary-minus operator
2632Sstever@eecs.umich.eduin "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
2632Sstever@eecs.umich.eduHowever, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
2632Sstever@eecs.umich.eduprecedence rules can be given for fictitious tokens like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduprecedence = (
2632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
2632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
2632Sstever@eecs.umich.edu    ('right', 'UMINUS'),            # Unary minus operator
2632Sstever@eecs.umich.edu)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduNow, in the grammar file, we can write our unary minus rule like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_expr_uminus(t):
2632Sstever@eecs.umich.edu    'expression : MINUS expression %prec UMINUS'
2632Sstever@eecs.umich.edu    t[0] = -t[2]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
2632Sstever@eecs.umich.eduof UMINUS in the precedence specifier.
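
<p>
Putting the two ideas together, the precedence of a rule can be looked up as follows: use the token named
after <tt>%prec</tt> if there is one, otherwise use the right-most terminal.  The sketch assumes the rule's
right-hand side is given as a plain string and that the set of terminals is known; <tt>build_levels</tt>,
<tt>rule_precedence</tt> and <tt>TERMINALS</tt> are names chosen for the example.

```python
PRECEDENCE = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
TERMINALS = {'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN', 'NUMBER'}

def build_levels(precedence):
    """Map each token name to (level, associativity), numbering levels from 1."""
    levels = {}
    for level, entry in enumerate(precedence, start=1):
        for name in entry[1:]:
            levels[name] = (level, entry[0])
    return levels

def rule_precedence(right_hand_side, levels, terminals):
    """Return (level, assoc) for a rule, or None if its precedence is unknown."""
    symbols = right_hand_side.split()
    if '%prec' in symbols:
        # An explicit override wins over the default.
        override = symbols[symbols.index('%prec') + 1]
        return levels.get(override)
    for symbol in reversed(symbols):
        if symbol in terminals:
            return levels.get(symbol)   # right-most terminal decides
    return None

levels = build_levels(PRECEDENCE)
unary = rule_precedence('MINUS expression %prec UMINUS', levels, TERMINALS)
binary = rule_precedence('expression MINUS expression', levels, TERMINALS)
parens = rule_precedence('LPAREN expression RPAREN', levels, TERMINALS)
```

Without the override, the unary rule would get the same low precedence as binary <tt>MINUS</tt>.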
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduIt is also possible to specify non-associativity in the <tt>precedence</tt> table. This would
2632Sstever@eecs.umich.edube used when you <em>don't</em> want operations to chain together.  For example, suppose
you wanted to support comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
2632Sstever@eecs.umich.educombinations like <tt>a &lt; b &lt; c</tt>.   To do this, simply specify a rule like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduprecedence = (
2632Sstever@eecs.umich.edu    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
2632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
2632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
2632Sstever@eecs.umich.edu    ('right', 'UMINUS'),            # Unary minus operator
2632Sstever@eecs.umich.edu)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduReduce/reduce conflicts are caused when there are multiple grammar
2632Sstever@eecs.umich.edurules that can be applied to a given set of symbols.  This kind of
2632Sstever@eecs.umich.educonflict is almost always bad and is always resolved by picking the
2632Sstever@eecs.umich.edurule that appears first in the grammar file.   Reduce/reduce conflicts
2632Sstever@eecs.umich.eduare almost always caused when different sets of grammar rules somehow
2632Sstever@eecs.umich.edugenerate the same set of symbols.  For example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduassignment :  ID EQUALS NUMBER
2632Sstever@eecs.umich.edu           |  ID EQUALS expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduexpression : expression PLUS expression
2632Sstever@eecs.umich.edu           | expression MINUS expression
2632Sstever@eecs.umich.edu           | expression TIMES expression
2632Sstever@eecs.umich.edu           | expression DIVIDE expression
2632Sstever@eecs.umich.edu           | LPAREN expression RPAREN
2632Sstever@eecs.umich.edu           | NUMBER
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn this case, a reduce/reduce conflict exists between these two rules:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduassignment  : ID EQUALS NUMBER
2632Sstever@eecs.umich.eduexpression  : NUMBER
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduFor example, if you wrote "a = 5", the parser can't figure out if this
is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
2632Sstever@eecs.umich.eduwhether it's supposed to reduce the 5 as an expression and then reduce
2632Sstever@eecs.umich.eduthe rule <tt>assignment : ID EQUALS expression</tt>.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>The parser.out file</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduTracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
2632Sstever@eecs.umich.eduparsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
2632Sstever@eecs.umich.edu'parser.out' when it generates the parsing table.   The contents of this file look like the following:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduUnused terminals:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduGrammar
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduRule 1     expression -> expression PLUS expression
2632Sstever@eecs.umich.eduRule 2     expression -> expression MINUS expression
2632Sstever@eecs.umich.eduRule 3     expression -> expression TIMES expression
2632Sstever@eecs.umich.eduRule 4     expression -> expression DIVIDE expression
2632Sstever@eecs.umich.eduRule 5     expression -> NUMBER
2632Sstever@eecs.umich.eduRule 6     expression -> LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduTerminals, with rules where they appear
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduTIMES                : 3
2632Sstever@eecs.umich.eduerror                :
2632Sstever@eecs.umich.eduMINUS                : 2
2632Sstever@eecs.umich.eduRPAREN               : 6
2632Sstever@eecs.umich.eduLPAREN               : 6
2632Sstever@eecs.umich.eduDIVIDE               : 4
2632Sstever@eecs.umich.eduPLUS                 : 1
2632Sstever@eecs.umich.eduNUMBER               : 5
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduNonterminals, with rules where they appear
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduexpression           : 1 1 2 2 3 3 4 4 6 0
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduParsing method: SLR
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 0
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    S' -> . expression
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 1
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    S' -> expression .
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    PLUS            shift and go to state 6
2632Sstever@eecs.umich.edu    MINUS           shift and go to state 5
2632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
2632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> LPAREN . expression RPAREN
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 3
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> NUMBER .
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 5
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 5
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 5
2632Sstever@eecs.umich.edu    TIMES           reduce using rule 5
2632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 5
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 5
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 4
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression TIMES . expression
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 5
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression MINUS . expression
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 6
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression PLUS . expression
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 7
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression DIVIDE . expression
2632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
2632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
2632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
2632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
2632Sstever@eecs.umich.edu    expression -> . NUMBER
2632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
2632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 8
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> LPAREN expression . RPAREN
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    RPAREN          shift and go to state 13
2632Sstever@eecs.umich.edu    PLUS            shift and go to state 6
2632Sstever@eecs.umich.edu    MINUS           shift and go to state 5
2632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
2632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 9
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression TIMES expression .
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 3
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 3
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 3
2632Sstever@eecs.umich.edu    TIMES           reduce using rule 3
2632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 3
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 3
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
2632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
2632Sstever@eecs.umich.edu  ! TIMES           [ shift and go to state 4 ]
2632Sstever@eecs.umich.edu  ! DIVIDE          [ shift and go to state 7 ]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 10
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression MINUS expression .
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 2
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 2
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 2
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 2
2632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
2632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 2 ]
2632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 2 ]
2632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
2632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 11
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression PLUS expression .
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 1
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 1
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 1
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 1
2632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
2632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 1 ]
2632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 1 ]
2632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
2632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 12
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> expression DIVIDE expression .
2632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
2632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
2632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
2632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 4
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 4
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 4
2632Sstever@eecs.umich.edu    TIMES           reduce using rule 4
2632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 4
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 4
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
2632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
2632Sstever@eecs.umich.edu  ! TIMES           [ shift and go to state 4 ]
2632Sstever@eecs.umich.edu  ! DIVIDE          [ shift and go to state 7 ]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edustate 13
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    expression -> LPAREN expression RPAREN .
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    $               reduce using rule 6
2632Sstever@eecs.umich.edu    PLUS            reduce using rule 6
2632Sstever@eecs.umich.edu    MINUS           reduce using rule 6
2632Sstever@eecs.umich.edu    TIMES           reduce using rule 6
2632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 6
2632Sstever@eecs.umich.edu    RPAREN          reduce using rule 6
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
The file describes each state of the parser.  Within each state, the "." marks the current
position of the parse within each applicable grammar rule.  The actions for each valid
input token are also listed.  When a shift/reduce or reduce/reduce conflict arises, the actions <em>not</em> selected
are prefixed with an !.  For example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 2 ]
2632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 2 ]
2632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
2632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduBy looking at these rules (and with a little practice), you can usually track down the source
2632Sstever@eecs.umich.eduof most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
2632Sstever@eecs.umich.edubad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.
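
<p>
As a rough aid for that inspection, the following sketch collects the rejected (<tt>!</tt>-prefixed) actions
for each state.  It assumes only the <tt>parser.out</tt> layout shown above; the helper name
<tt>rejected_actions</tt> is made up for this illustration and is not part of PLY.

<blockquote>
<pre>
import re

def rejected_actions(text):
    """Map each state number to the list of actions that lost a conflict."""
    conflicts = {}
    state = None
    for line in text.splitlines():
        m = re.match(r'state (\d+)\s*$', line)
        if m:
            state = int(m.group(1))
            continue
        # Rejected actions look like:  ! TIMES   [ reduce using rule 2 ]
        m = re.match(r'\s*!\s+(\S+)\s+\[\s*(.*?)\s*\]', line)
        if m and state is not None:
            conflicts.setdefault(state, []).append((m.group(1), m.group(2)))
    return conflicts

sample = """state 10

    expression -> expression MINUS expression .

    $               reduce using rule 2
    TIMES           shift and go to state 4

  ! TIMES           [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
"""

print(rejected_actions(sample))
</pre>
</blockquote>

In practice the text would be read from the generated <tt>parser.out</tt> file instead of the inline sample.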
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Syntax Error Handling</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduWhen a syntax error occurs during parsing, the error is immediately
2632Sstever@eecs.umich.edudetected (i.e., the parser does not read any more tokens beyond the
2632Sstever@eecs.umich.edusource of the error).  Error recovery in LR parsers is a delicate
2632Sstever@eecs.umich.edutopic that involves ancient rituals and black-magic.   The recovery mechanism
provided by <tt>yacc.py</tt> is comparable to Unix yacc, so you may want to
consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduWhen a syntax error occurs, <tt>yacc.py</tt> performs the following steps:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<ol>
2632Sstever@eecs.umich.edu<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
2632Sstever@eecs.umich.eduis called with the offending token as an argument.  Afterwards, the parser enters
2632Sstever@eecs.umich.eduan "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
2632Sstever@eecs.umich.eduhas successfully shifted at least 3 tokens onto the parsing stack.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
2632Sstever@eecs.umich.eduwith a special <tt>error</tt> token.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
2632Sstever@eecs.umich.edudeleted.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
2632Sstever@eecs.umich.eduparsing from its initial state.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
2632Sstever@eecs.umich.edushifted onto the parsing stack.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
2632Sstever@eecs.umich.eduparser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
2632Sstever@eecs.umich.edu</ol>
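
<p>
The three-token rule from the first step can be pictured with a toy model.  This is only a sketch of the
bookkeeping described above, not code taken from <tt>yacc.py</tt>; the class <tt>ErrorGate</tt> and its
methods are invented for this illustration.

<blockquote>
<pre>
class ErrorGate:
    """Toy model: report an error only if 3 tokens were shifted since the last one."""
    RECOVERY_SHIFTS = 3

    def __init__(self):
        self.pending = 0          # shifts still needed before leaving recovery mode
        self.reported = []        # tokens that would have been passed to p_error()

    def shift(self, tok):
        # Each successful shift moves the parser one step out of recovery mode.
        if self.pending:
            self.pending -= 1

    def error(self, tok):
        # p_error() is only called when the parser is not already recovering.
        if self.pending == 0:
            self.reported.append(tok)
        self.pending = self.RECOVERY_SHIFTS

    def errok(self):
        # Leave recovery mode at once, so the next error is reported again.
        self.pending = 0

gate = ErrorGate()
gate.error("bad1")        # reported
gate.shift("a")
gate.error("bad2")        # suppressed: only one token shifted since bad1
gate.shift("b")
gate.shift("c")
gate.shift("d")
gate.error("bad3")        # reported again
print(gate.reported)
</pre>
</blockquote>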
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h4>Recovery and resynchronization with error rules</h4>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduThe most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
2632Sstever@eecs.umich.edutoken.  For example, suppose your language had a grammar rule for a print statement like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_statement_print(t):
2632Sstever@eecs.umich.edu     'statement : PRINT expr SEMI'
2632Sstever@eecs.umich.edu     ...
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduTo account for the possibility of a bad expression, you might write an additional grammar rule like this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_statement_print_error(t):
2632Sstever@eecs.umich.edu     'statement : PRINT error SEMI'
2632Sstever@eecs.umich.edu     print "Syntax error in print statement. Bad expression"
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn this case, the <tt>error</tt> token will match any sequence of
2632Sstever@eecs.umich.edutokens that might appear up to the first semicolon that is
2632Sstever@eecs.umich.eduencountered.  Once the semicolon is reached, the rule will be
2632Sstever@eecs.umich.eduinvoked and the <tt>error</tt> token will go away.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduThis type of recovery is sometimes known as parser resynchronization.
2632Sstever@eecs.umich.eduThe <tt>error</tt> token acts as a wildcard for any bad input text and
2632Sstever@eecs.umich.eduthe token immediately following <tt>error</tt> acts as a
2632Sstever@eecs.umich.edusynchronization token.
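
<p>
The effect of such a rule can be imitated by hand.  The sketch below is not how <tt>yacc.py</tt> works internally;
it is a small hand-written loop over <tt>(type, value)</tt> tuples in which everything between a bad expression
and the next <tt>SEMI</tt> plays the role of <tt>error</tt>.  The helper name <tt>parse_prints</tt> is hypothetical,
and the expression is simplified to a single <tt>NUMBER</tt> token.

<blockquote>
<pre>
def parse_prints(tokens):
    """Handle 'PRINT NUMBER SEMI' statements; resynchronize on SEMI after an error."""
    results, errors = [], []
    i = 0
    while i != len(tokens):
        types = [tok[0] for tok in tokens[i:i + 3]]
        if types == ['PRINT', 'NUMBER', 'SEMI']:
            results.append(tokens[i + 1][1])      # good statement
            i += 3
            continue
        # Bad statement: everything up to the next SEMI plays the role of 'error'.
        errors.append(i)
        while i != len(tokens) and tokens[i][0] != 'SEMI':
            i += 1
        if i != len(tokens):
            i += 1                                # consume the synchronization token
    return results, errors

tokens = [('PRINT', 'print'), ('NUMBER', 1), ('SEMI', ';'),
          ('PRINT', 'print'), ('PLUS', '+'), ('PLUS', '+'), ('SEMI', ';'),
          ('PRINT', 'print'), ('NUMBER', 2), ('SEMI', ';')]
print(parse_prints(tokens))
</pre>
</blockquote>

The second statement is reported as an error, yet the third one is still parsed because the loop picks up
again right after the semicolon.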
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
It is important to note that the <tt>error</tt> token usually should not appear as the last token
on the right in an error rule.  For example, a rule like this is best avoided:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_statement_print_error(t):
2632Sstever@eecs.umich.edu    'statement : PRINT error'
2632Sstever@eecs.umich.edu    print "Syntax error in print statement. Bad expression"
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduThis is because the first bad token encountered will cause the rule to
2632Sstever@eecs.umich.edube reduced--which may make it difficult to recover if more bad tokens
2632Sstever@eecs.umich.eduimmediately follow.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h4>Panic mode recovery</h4>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduAn alternative error recovery scheme is to enter a panic mode recovery in which tokens are
2632Sstever@eecs.umich.edudiscarded to a point where the parser might be able to recover in some sensible manner.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduPanic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
2632Sstever@eecs.umich.edufunction starts discarding tokens until it reaches a closing '}'.  Then, it restarts the
2632Sstever@eecs.umich.eduparser in its initial state.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_error(t):
2632Sstever@eecs.umich.edu    print "Whoa. You are seriously hosed."
2632Sstever@eecs.umich.edu    # Read ahead looking for a closing '}'
2632Sstever@eecs.umich.edu    while 1:
2632Sstever@eecs.umich.edu        tok = yacc.token()             # Get the next token
2632Sstever@eecs.umich.edu        if not tok or tok.type == 'RBRACE': break
2632Sstever@eecs.umich.edu    yacc.restart()
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
Alternatively, the following function simply discards the bad token and tells the parser that the error was ok.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_error(t):
2632Sstever@eecs.umich.edu    print "Syntax error at token", t.type
2632Sstever@eecs.umich.edu    # Just discard the token and tell the parser it's okay.
2632Sstever@eecs.umich.edu    yacc.errok()
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<P>
2632Sstever@eecs.umich.eduWithin the <tt>p_error()</tt> function, three functions are available to control the behavior
2632Sstever@eecs.umich.eduof the parser:
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<ul>
2632Sstever@eecs.umich.edu<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
2632Sstever@eecs.umich.edumode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
2632Sstever@eecs.umich.eduerror counters so that the next syntax error will call <tt>p_error()</tt> again.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li><tt>yacc.token()</tt>.  This returns the next token on the input stream.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
2632Sstever@eecs.umich.eduto its initial state.
2632Sstever@eecs.umich.edu</ul>
2632Sstever@eecs.umich.edu
Note: these functions are only available while <tt>p_error()</tt> is executing and are not available
at any other time.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.eduTo supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
2632Sstever@eecs.umich.eduuseful if trying to synchronize on special characters.  For example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edudef p_error(t):
2632Sstever@eecs.umich.edu    # Read ahead looking for a terminating ";"
2632Sstever@eecs.umich.edu    while 1:
2632Sstever@eecs.umich.edu        tok = yacc.token()             # Get the next token
2632Sstever@eecs.umich.edu        if not tok or tok.type == 'SEMI': break
2632Sstever@eecs.umich.edu    yacc.errok()
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    # Return SEMI to the parser as the next lookahead token
2632Sstever@eecs.umich.edu    return tok
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h4>General comments on error handling</h4>
2632Sstever@eecs.umich.edu
For normal types of languages, error recovery with error rules and synchronization tokens is probably the most reliable
2632Sstever@eecs.umich.edutechnique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy
2632Sstever@eecs.umich.eduto recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
2632Sstever@eecs.umich.eduto discard huge portions of the input text to find a valid restart point.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Line Number Tracking</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<tt>yacc.py</tt> automatically tracks line numbers for all of the grammar symbols and tokens it processes.  To retrieve the line
2632Sstever@eecs.umich.edunumbers, two functions are used in grammar rules:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<ul>
<li><tt>t.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>.
2632Sstever@eecs.umich.edu<li><tt>t.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
2632Sstever@eecs.umich.edu</ul>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduFor example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
def p_expression(t):
    'expression : expression PLUS expression'
    t.lineno(1)        # Line number of the left expression
    t.lineno(2)        # Line number of the PLUS operator
    t.lineno(3)        # Line number of the right expression
2632Sstever@eecs.umich.edu    ...
2632Sstever@eecs.umich.edu    start,end = t.linespan(3)    # Start,end lines of the right expression
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduSince line numbers are managed internally by the parser, there is usually no need to modify the line
2632Sstever@eecs.umich.edunumbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
2632Sstever@eecs.umich.eduprivate copy.
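
<p>
A sketch of such a private copy is shown below.  The <tt>FakeProduction</tt> class is only a stand-in so that the
example runs without a parser; in a real grammar, <tt>t</tt> would be the object that <tt>yacc.py</tt> passes to the rule.
The <tt>BinOpNode</tt> class and its <tt>lineno</tt> attribute are assumptions made for this illustration.

<blockquote>
<pre>
class BinOpNode:
    def __init__(self, left, op, right, lineno):
        self.left, self.op, self.right = left, op, right
        self.lineno = lineno          # private copy of the operator's line number

def p_expression_plus(t):
    'expression : expression PLUS expression'
    # Copy the line number into the node so it stays available after parsing.
    t[0] = BinOpNode(t[1], t[2], t[3], t.lineno(2))

class FakeProduction:
    """Stand-in for the object yacc.py passes to a rule (illustration only)."""
    def __init__(self, values, linenos):
        self.values = [None] + values
        self.linenos = [None] + linenos
    def __getitem__(self, n):
        return self.values[n]
    def __setitem__(self, n, value):
        self.values[n] = value
    def lineno(self, n):
        return self.linenos[n]

t = FakeProduction([1, '+', 2], [7, 7, 8])
p_expression_plus(t)
print(t[0].lineno)
</pre>
</blockquote>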
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>AST Construction</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
2632Sstever@eecs.umich.educonstruction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
2632Sstever@eecs.umich.eduand assign nodes to <tt>t[0]</tt> in each rule.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduFor example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.educlass Expr: pass
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.educlass BinOp(Expr):
2632Sstever@eecs.umich.edu    def __init__(self,left,op,right):
2632Sstever@eecs.umich.edu        self.type = "binop"
2632Sstever@eecs.umich.edu        self.left = left
2632Sstever@eecs.umich.edu        self.right = right
2632Sstever@eecs.umich.edu        self.op = op
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.educlass Number(Expr):
2632Sstever@eecs.umich.edu    def __init__(self,value):
2632Sstever@eecs.umich.edu        self.type = "number"
2632Sstever@eecs.umich.edu        self.value = value
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_binop(t):
2632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
2632Sstever@eecs.umich.edu                  | expression MINUS expression
2632Sstever@eecs.umich.edu                  | expression TIMES expression
2632Sstever@eecs.umich.edu                  | expression DIVIDE expression'''
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    t[0] = BinOp(t[1],t[2],t[3])
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_group(t):
2632Sstever@eecs.umich.edu    'expression : LPAREN expression RPAREN'
2632Sstever@eecs.umich.edu    t[0] = t[2]
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_number(t):
2632Sstever@eecs.umich.edu    'expression : NUMBER'
2632Sstever@eecs.umich.edu    t[0] = Number(t[1])
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduTo simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
2632Sstever@eecs.umich.eduFor example:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.educlass Node:
2632Sstever@eecs.umich.edu    def __init__(self,type,children=None,leaf=None):
2632Sstever@eecs.umich.edu         self.type = type
2632Sstever@eecs.umich.edu         if children:
2632Sstever@eecs.umich.edu              self.children = children
2632Sstever@eecs.umich.edu         else:
2632Sstever@eecs.umich.edu              self.children = [ ]
2632Sstever@eecs.umich.edu         self.leaf = leaf
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edudef p_expression_binop(t):
2632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
2632Sstever@eecs.umich.edu                  | expression MINUS expression
2632Sstever@eecs.umich.edu                  | expression TIMES expression
2632Sstever@eecs.umich.edu                  | expression DIVIDE expression'''
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu    t[0] = Node("binop", [t[1],t[3]], t[2])
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
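
<p>
With a generic node like this, a single traversal function can handle every node type.  The following sketch
evaluates such a tree.  The <tt>"number"</tt> node type, the operator strings stored in <tt>leaf</tt>, and the
<tt>evaluate</tt> function are assumptions added for illustration, since the rule above only shows how
<tt>"binop"</tt> nodes are built.

<blockquote>
<pre>
import operator

class Node:
    def __init__(self, type, children=None, leaf=None):
        self.type = type
        self.children = children if children else []
        self.leaf = leaf

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def evaluate(node):
    """Recursively evaluate a tree of Node objects."""
    if node.type == "number":
        return node.leaf
    if node.type == "binop":
        left, right = [evaluate(child) for child in node.children]
        return OPS[node.leaf](left, right)
    raise ValueError("unknown node type: %s" % node.type)

# Tree for 2 + 3 * 4, built the way the grammar rule above would build it
tree = Node("binop",
            [Node("number", leaf=2),
             Node("binop", [Node("number", leaf=3), Node("number", leaf=4)], "*")],
            "+")
print(evaluate(tree))
</pre>
</blockquote>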
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Yacc implementation notes</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<ul>
2632Sstever@eecs.umich.edu<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
2632Sstever@eecs.umich.educan be supplied as follows:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduyacc.parse(lexer=x)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
In this case, <tt>x</tt> must be a Lexer object that at minimum has an <tt>x.token()</tt> method for retrieving the next
token.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
<li>By default, <tt>yacc.py</tt> generates tables in debugging mode (which produces the <tt>parser.out</tt> file and other output).
To disable this, use:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduyacc.yacc(debug=0)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>To change the name of the <tt>parsetab.py</tt> file,  use:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduyacc.yacc(tabmodule="foo")
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<P>
<li>To print copious amounts of debugging output during parsing, use:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.eduyacc.parse(debug=1)
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
2632Sstever@eecs.umich.eduparsers in the same application, do this:
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<blockquote>
2632Sstever@eecs.umich.edu<pre>
2632Sstever@eecs.umich.edup = yacc.yacc()
2632Sstever@eecs.umich.edu...
2632Sstever@eecs.umich.edup.parse()
2632Sstever@eecs.umich.edu</pre>
2632Sstever@eecs.umich.edu</blockquote>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduNote: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>Since the generation of the SLR tables is relatively expensive, previously generated tables are
2632Sstever@eecs.umich.educached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
2632Sstever@eecs.umich.educhecksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
It should be noted that table generation is reasonably efficient, even for grammars that involve around 100 rules
2632Sstever@eecs.umich.eduand several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
2632Sstever@eecs.umich.edumachine.  Please be patient.
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<p>
2632Sstever@eecs.umich.edu<li>Since LR parsing is mostly driven by tables, the performance of the parser is largely independent of the
2632Sstever@eecs.umich.edusize of the grammar.   The biggest bottlenecks will be the lexer and the complexity of your grammar rules.
2632Sstever@eecs.umich.edu</ul>
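
<p>
As an illustration of the first note above, a replacement tokenizer might look like the following sketch.
Only the <tt>token()</tt> and <tt>input()</tt> methods come from the description; the class names, and the
assumption that each token carries <tt>type</tt>, <tt>value</tt>, and <tt>lineno</tt> attributes, are
illustrative and should be checked against <tt>yacc.py</tt>.

<blockquote>
<pre>
class Token:
    def __init__(self, type, value, lineno):
        self.type, self.value, self.lineno = type, value, lineno

class WordLexer:
    """Whitespace-splitting tokenizer exposing input() and token()."""
    def __init__(self):
        self.pending = []

    def input(self, text):
        self.pending = []
        for lineno, line in enumerate(text.splitlines(), 1):
            for word in line.split():
                kind = 'NUMBER' if word.isdigit() else 'NAME'
                value = int(word) if word.isdigit() else word
                self.pending.append(Token(kind, value, lineno))

    def token(self):
        # Return the next token, or None when the input is exhausted.
        if self.pending:
            return self.pending.pop(0)
        return None

x = WordLexer()
x.input("total 42\nnext")
tok = x.token()
while tok:
    print(tok.type, tok.value, tok.lineno)
    tok = x.token()
</pre>
</blockquote>

Provided the token types match those named in the grammar, an object like <tt>x</tt> could then be handed
to the parser with <tt>yacc.parse(lexer=x)</tt>.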
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.edu<h2>Parser and Lexer State Management</h2>
2632Sstever@eecs.umich.edu
2632Sstever@eecs.umich.eduIn advanced parsing applications, you may want to have multiple
2632Sstever@eecs.umich.eduparsers and lexers.  Furthermore, the parser may want to control the
2632Sstever@eecs.umich.edubehavior of the lexer in some way.

<p>
To do this, note that both the lexer and the parser are
implemented as objects.   These objects are returned by the
<tt>lex()</tt> and <tt>yacc()</tt> functions, respectively.  For example:

<blockquote>
<pre>
lexer  = lex.lex()       # Return lexer object
parser = yacc.yacc()     # Return parser object
</pre>
</blockquote>

Within lexer and parser rules, these objects are also available.  In the lexer,
the <tt>lexer</tt> attribute of a token refers to the lexer object in use.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
   r'\d+'
   ...
   print(t.lexer)          # Show lexer object
</pre>
</blockquote>

In the parser, the <tt>lexer</tt> and <tt>parser</tt> attributes of the object passed to a
grammar rule refer to the lexer and parser objects, respectively.

<blockquote>
<pre>
def p_expr_plus(t):
   'expr : expr PLUS expr'
   ...
   print(t.parser)         # Show parser object
   print(t.lexer)          # Show lexer object
</pre>
</blockquote>

If necessary, arbitrary attributes can be attached to the lexer or parser object.
For example, if you wanted to have different parsing modes, you could attach a <tt>mode</tt>
attribute to the parser object and check it later inside your rules.
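
<p>
The following is a minimal sketch of the mode-attribute idea.  To keep it runnable without a complete grammar,
<tt>types.SimpleNamespace</tt> objects stand in for the lexer, the parser, and the objects that PLY passes to
rule functions, so PLY itself is not imported.  The attribute name <tt>mode</tt>, its values, and the rule
<tt>p_hex_directive</tt> are made up for the example.

```python
from types import SimpleNamespace

# Stand-ins for the objects that lex.lex() and yacc.yacc() would return.
lexer = SimpleNamespace()
parser = SimpleNamespace()

# Arbitrary, user-defined attributes (the name "mode" is an assumption).
lexer.mode = "decimal"
parser.mode = "strict"


def t_NUMBER(t):
    # A lexer rule reaches the lexer object through the token.
    base = 16 if t.lexer.mode == "hex" else 10
    t.value = int(t.value, base)
    return t


def p_hex_directive(t):
    # A parser rule can reach both objects, so it can steer the lexer.
    if t.parser.mode == "strict":
        t.lexer.mode = "hex"


# Simulate the calls PLY would make: each rule receives an object
# that carries references to the lexer (and, for parser rules, the parser).
before = t_NUMBER(SimpleNamespace(value="10", lexer=lexer)).value
p_hex_directive(SimpleNamespace(lexer=lexer, parser=parser))
after = t_NUMBER(SimpleNamespace(value="10", lexer=lexer)).value
print(before, after)
```

<p>
The same text <tt>"10"</tt> is converted differently before and after the parser rule runs, because the state
lives on the shared lexer object rather than in a global variable.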

<h2>Using Python's Optimized Mode</h2>

Because PLY uses information from docstrings, parsing and lexing
information must be gathered while running the Python interpreter in
normal mode (i.e., not with the -O or -OO options).  However, if you
specify optimized mode like this:

<blockquote>
<pre>
lex.lex(optimize=1)
yacc.yacc(optimize=1)
</pre>
</blockquote>

then PLY can later be used when Python runs in optimized mode. To make this work,
make sure you first run Python in normal mode.  Once the lexing and parsing tables
have been generated the first time, run Python in optimized mode. PLY will use
the tables without the need for docstrings.

<p>
Beware: running PLY in optimized mode disables a lot of error
checking.  You should only do this when your project has stabilized
and you don't need to do any debugging.
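
<p>
The reason the tables must first be built in normal mode can be seen with the standard library alone: under
<tt>-OO</tt> the interpreter discards docstrings, so a regular expression or grammar rule stored there is no
longer visible at run time.  The sketch below runs a small rule in a child interpreter with and without the
flag.  The helper name <tt>docstring_seen()</tt> and the embedded rule are illustrative only, and PLY is not
imported.

```python
import subprocess
import sys

# A rule written in PLY's style; its regular expression lives in the docstring.
SNIPPET = (
    "def t_NUMBER(t):\n"
    "    r'\\d+'\n"
    "    return t\n"
    "print(t_NUMBER.__doc__)\n"
)


def docstring_seen(flags):
    """Run SNIPPET in a child interpreter and return what it printed."""
    result = subprocess.run(
        [sys.executable] + flags + ["-c", SNIPPET],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


normal = docstring_seen([])          # docstring is available
stripped = docstring_seen(["-OO"])   # docstring has been discarded
print(normal, stripped)
```

<p>
In normal mode the child prints the regular expression; with <tt>-OO</tt> it prints <tt>None</tt>, which is why
previously generated tables are needed in that mode.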

<h2>Where to go from here?</h2>

The <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
compilers textbook for the theory and underlying implementation details of LR parsing.

</body>
</html>