ply.html revision 2632
12632Sstever@eecs.umich.edu<html>
22632Sstever@eecs.umich.edu<head>
32632Sstever@eecs.umich.edu<title>PLY (Python Lex-Yacc)</title>
42632Sstever@eecs.umich.edu</head>
52632Sstever@eecs.umich.edu<body bgcolor="#ffffff">
62632Sstever@eecs.umich.edu
72632Sstever@eecs.umich.edu<h1>PLY (Python Lex-Yacc)</h1>
82632Sstever@eecs.umich.edu
92632Sstever@eecs.umich.edu<b>
102632Sstever@eecs.umich.eduDavid M. Beazley <br>
112632Sstever@eecs.umich.eduDepartment of Computer Science <br>
122632Sstever@eecs.umich.eduUniversity of Chicago <br>
132632Sstever@eecs.umich.eduChicago, IL 60637 <br>
142632Sstever@eecs.umich.edubeazley@cs.uchicago.edu <br>
152632Sstever@eecs.umich.edu</b>
162632Sstever@eecs.umich.edu
172632Sstever@eecs.umich.edu<p>
182632Sstever@eecs.umich.eduDocumentation version: $Header: /home/stever/bk/newmem2/ext/ply/doc/ply.html 1.1 03/06/06 14:53:34-00:00 stever@ $
192632Sstever@eecs.umich.edu
202632Sstever@eecs.umich.edu<h2>Introduction</h2>
212632Sstever@eecs.umich.edu
222632Sstever@eecs.umich.eduPLY is a Python-only implementation of the popular compiler
232632Sstever@eecs.umich.educonstruction tools lex and yacc.  The implementation borrows ideas
242632Sstever@eecs.umich.edufrom a number of previous efforts; most notably John Aycock's SPARK
252632Sstever@eecs.umich.edutoolkit.  However, the overall flavor of the implementation is more
262632Sstever@eecs.umich.educlosely modeled after the C version of lex and yacc.  The other
272632Sstever@eecs.umich.edusignificant feature of PLY is that it provides extensive input
282632Sstever@eecs.umich.eduvalidation and error reporting--much more so than other Python parsing
292632Sstever@eecs.umich.edutools.
302632Sstever@eecs.umich.edu
312632Sstever@eecs.umich.edu<p>
322632Sstever@eecs.umich.eduEarly versions of PLY were developed to support the Introduction to
332632Sstever@eecs.umich.eduCompilers Course at the University of Chicago.  In this course,
342632Sstever@eecs.umich.edustudents built a fully functional compiler for a simple Pascal-like
352632Sstever@eecs.umich.edulanguage.  Their compiler, implemented entirely in Python, had to
362632Sstever@eecs.umich.eduinclude lexical analysis, parsing, type checking, type inference,
372632Sstever@eecs.umich.edunested scoping, and code generation for the SPARC processor.
382632Sstever@eecs.umich.eduApproximately 30 different compiler implementations were completed in
392632Sstever@eecs.umich.eduthis course.  Most of PLY's interface and operation has been motivated by common
402632Sstever@eecs.umich.eduusability problems encountered by students.
412632Sstever@eecs.umich.edu
422632Sstever@eecs.umich.edu<p>
432632Sstever@eecs.umich.eduBecause PLY was primarily developed as an instructional tool, you will
442632Sstever@eecs.umich.edufind it to be <em>MUCH</em> more picky about token and grammar rule
452632Sstever@eecs.umich.eduspecification than most other Python parsing tools.  In part, this
462632Sstever@eecs.umich.eduadded formality is meant to catch common programming mistakes made by
472632Sstever@eecs.umich.edunovice users.  However, advanced users will also find such features to
482632Sstever@eecs.umich.edube useful when building complicated grammars for real programming
492632Sstever@eecs.umich.edulanguages.    It should also be noted that PLY does not provide much in the way
502632Sstever@eecs.umich.eduof bells and whistles (e.g., automatic construction of abstract syntax trees,
512632Sstever@eecs.umich.edutree traversal, etc.).   Instead, you will find a bare-bones, yet
522632Sstever@eecs.umich.edufully capable lex/yacc implementation written entirely in Python.
532632Sstever@eecs.umich.edu
542632Sstever@eecs.umich.edu<p>
552632Sstever@eecs.umich.eduThe rest of this document assumes that you are somewhat familar with
562632Sstever@eecs.umich.eduparsing theory, syntax directed translation, and automatic tools such
572632Sstever@eecs.umich.eduas lex and yacc. If you are unfamilar with these topics, you will
582632Sstever@eecs.umich.eduprobably want to consult an introductory text such as "Compilers:
592632Sstever@eecs.umich.eduPrinciples, Techniques, and Tools", by Aho, Sethi, and Ullman.  "Lex
602632Sstever@eecs.umich.eduand Yacc" by John Levine may also be handy.
612632Sstever@eecs.umich.edu
622632Sstever@eecs.umich.edu<h2>PLY Overview</h2>
632632Sstever@eecs.umich.edu
642632Sstever@eecs.umich.eduPLY consists of two separate tools; <tt>lex.py</tt> and
652632Sstever@eecs.umich.edu<tt>yacc.py</tt>.  <tt>lex.py</tt> is used to break input text into a
662632Sstever@eecs.umich.educollection of tokens specified by a collection of regular expression
672632Sstever@eecs.umich.edurules.  <tt>yacc.py</tt> is used to recognize language syntax that has
682632Sstever@eecs.umich.edubeen specified in the form of a context free grammar.  Currently,
692632Sstever@eecs.umich.edu<tt>yacc.py</tt> uses LR parsing and generates its parsing tables
702632Sstever@eecs.umich.eduusing the SLR algorithm.  LALR(1) parsing may be supported in a future
712632Sstever@eecs.umich.edurelease. 
722632Sstever@eecs.umich.edu
732632Sstever@eecs.umich.edu<p>
742632Sstever@eecs.umich.eduThe two tools are meant to work together.  Specifically,
752632Sstever@eecs.umich.edu<tt>lex.py</tt> provides an external interface in the form of a
762632Sstever@eecs.umich.edu<tt>token()</tt> function that returns the next valid token on the
772632Sstever@eecs.umich.eduinput stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
782632Sstever@eecs.umich.edutokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
792632Sstever@eecs.umich.eduoften an Abstract Syntax Tree (AST).  However, this is entirely up to
802632Sstever@eecs.umich.eduthe user.  If desired, <tt>yacc.py</tt> can also be used to implement
812632Sstever@eecs.umich.edusimple one-pass compilers.
822632Sstever@eecs.umich.edu
832632Sstever@eecs.umich.edu<p>
842632Sstever@eecs.umich.eduLike its Unix counterpart, <tt>yacc.py</tt> provides most of the
852632Sstever@eecs.umich.edufeatures you expect including extensive error checking, grammar
862632Sstever@eecs.umich.eduvalidation, support for empty productions, error tokens, and ambiguity
872632Sstever@eecs.umich.eduresolution via precedence rules.  The primary difference between
882632Sstever@eecs.umich.edu<tt>yacc.py</tt> and <tt>yacc</tt> is the use of SLR parsing instead
892632Sstever@eecs.umich.eduof LALR(1).  Although this slightly restricts the types of grammars
902632Sstever@eecs.umich.eduthan can be successfully parsed, it is sufficiently powerful to handle most
912632Sstever@eecs.umich.edukinds of normal programming language constructs.
922632Sstever@eecs.umich.edu
932632Sstever@eecs.umich.edu<p>
942632Sstever@eecs.umich.eduFinally, it is important to note that PLY relies on reflection
952632Sstever@eecs.umich.edu(introspection) to build its lexers and parsers.  Unlike traditional
962632Sstever@eecs.umich.edulex/yacc which require a special input file that is converted into a
972632Sstever@eecs.umich.eduseparate source file, the specifications given to PLY <em>are</em>
982632Sstever@eecs.umich.eduvalid Python programs.  This means that there are no extra source
992632Sstever@eecs.umich.edufiles nor is there a special compiler construction step (e.g., running
1002632Sstever@eecs.umich.eduyacc to generate Python code for the compiler). 
1012632Sstever@eecs.umich.edu
1022632Sstever@eecs.umich.edu<h2>Lex Example</h2>
1032632Sstever@eecs.umich.edu
1042632Sstever@eecs.umich.edu<tt>lex.py</tt> is used to write tokenizers.  To do this, each token
1052632Sstever@eecs.umich.edumust be defined by a regular expression rule.  The following file
1062632Sstever@eecs.umich.eduimplements a very simple lexer for tokenizing simple integer expressions:
1072632Sstever@eecs.umich.edu
1082632Sstever@eecs.umich.edu<blockquote>
1092632Sstever@eecs.umich.edu<pre>
1102632Sstever@eecs.umich.edu# ------------------------------------------------------------
1112632Sstever@eecs.umich.edu# calclex.py
1122632Sstever@eecs.umich.edu#
1132632Sstever@eecs.umich.edu# tokenizer for a simple expression evaluator for
1142632Sstever@eecs.umich.edu# numbers and +,-,*,/
1152632Sstever@eecs.umich.edu# ------------------------------------------------------------
1162632Sstever@eecs.umich.eduimport lex
1172632Sstever@eecs.umich.edu
1182632Sstever@eecs.umich.edu# List of token names.   This is always required
1192632Sstever@eecs.umich.edutokens = (
1202632Sstever@eecs.umich.edu   'NUMBER',
1212632Sstever@eecs.umich.edu   'PLUS',
1222632Sstever@eecs.umich.edu   'MINUS',
1232632Sstever@eecs.umich.edu   'TIMES',
1242632Sstever@eecs.umich.edu   'DIVIDE',
1252632Sstever@eecs.umich.edu   'LPAREN',
1262632Sstever@eecs.umich.edu   'RPAREN',
1272632Sstever@eecs.umich.edu)
1282632Sstever@eecs.umich.edu
1292632Sstever@eecs.umich.edu# Regular expression rules for simple tokens
1302632Sstever@eecs.umich.edut_PLUS    = r'\+'
1312632Sstever@eecs.umich.edut_MINUS   = r'-'
1322632Sstever@eecs.umich.edut_TIMES   = r'\*'
1332632Sstever@eecs.umich.edut_DIVIDE  = r'/'
1342632Sstever@eecs.umich.edut_LPAREN  = r'\('
1352632Sstever@eecs.umich.edut_RPAREN  = r'\)'
1362632Sstever@eecs.umich.edu
1372632Sstever@eecs.umich.edu# A regular expression rule with some action code
1382632Sstever@eecs.umich.edudef t_NUMBER(t):
1392632Sstever@eecs.umich.edu    r'\d+'
1402632Sstever@eecs.umich.edu    try:
1412632Sstever@eecs.umich.edu         t.value = int(t.value)    
1422632Sstever@eecs.umich.edu    except ValueError:
1432632Sstever@eecs.umich.edu         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
1442632Sstever@eecs.umich.edu	 t.value = 0
1452632Sstever@eecs.umich.edu    return t
1462632Sstever@eecs.umich.edu
1472632Sstever@eecs.umich.edu# Define a rule so we can track line numbers
1482632Sstever@eecs.umich.edudef t_newline(t):
1492632Sstever@eecs.umich.edu    r'\n+'
1502632Sstever@eecs.umich.edu    t.lineno += len(t.value)
1512632Sstever@eecs.umich.edu
1522632Sstever@eecs.umich.edu# A string containing ignored characters (spaces and tabs)
1532632Sstever@eecs.umich.edut_ignore  = ' \t'
1542632Sstever@eecs.umich.edu
1552632Sstever@eecs.umich.edu# Error handling rule
1562632Sstever@eecs.umich.edudef t_error(t):
1572632Sstever@eecs.umich.edu    print "Illegal character '%s'" % t.value[0]
1582632Sstever@eecs.umich.edu    t.skip(1)
1592632Sstever@eecs.umich.edu
1602632Sstever@eecs.umich.edu# Build the lexer
1612632Sstever@eecs.umich.edulex.lex()
1622632Sstever@eecs.umich.edu
1632632Sstever@eecs.umich.edu# Test it out
1642632Sstever@eecs.umich.edudata = '''
1652632Sstever@eecs.umich.edu3 + 4 * 10
1662632Sstever@eecs.umich.edu  + -20 *2
1672632Sstever@eecs.umich.edu'''
1682632Sstever@eecs.umich.edu
1692632Sstever@eecs.umich.edu# Give the lexer some input
1702632Sstever@eecs.umich.edulex.input(data)
1712632Sstever@eecs.umich.edu
1722632Sstever@eecs.umich.edu# Tokenize
1732632Sstever@eecs.umich.eduwhile 1:
1742632Sstever@eecs.umich.edu    tok = lex.token()
1752632Sstever@eecs.umich.edu    if not tok: break      # No more input
1762632Sstever@eecs.umich.edu    print tok
1772632Sstever@eecs.umich.edu</pre>
1782632Sstever@eecs.umich.edu</blockquote>
1792632Sstever@eecs.umich.edu
1802632Sstever@eecs.umich.eduIn the example, the <tt>tokens</tt> list defines all of the possible
1812632Sstever@eecs.umich.edutoken names that can be produced by the lexer.  This list is always required
1822632Sstever@eecs.umich.eduand is used to perform a variety of validation checks.  Following the <tt>tokens</tt>
1832632Sstever@eecs.umich.edulist, regular expressions are written for each token.  Each of these
1842632Sstever@eecs.umich.edurules are defined by making declarations with a special prefix <tt>t_</tt> to indicate that it
1852632Sstever@eecs.umich.edudefines a token.  For simple tokens, the regular expression can
1862632Sstever@eecs.umich.edube specified as strings such as this (note: Python raw strings are used since they are the
1872632Sstever@eecs.umich.edumost convenient way to write regular expression strings):
1882632Sstever@eecs.umich.edu
1892632Sstever@eecs.umich.edu<blockquote>
1902632Sstever@eecs.umich.edu<pre>
1912632Sstever@eecs.umich.edut_PLUS = r'\+'
1922632Sstever@eecs.umich.edu</pre>
1932632Sstever@eecs.umich.edu</blockquote>
1942632Sstever@eecs.umich.edu
1952632Sstever@eecs.umich.eduIn this case, the name following the <tt>t_</tt> must exactly match one of the
1962632Sstever@eecs.umich.edunames supplied in <tt>tokens</tt>.   If some kind of action needs to be performed,
1972632Sstever@eecs.umich.edua token rule can be specified as a function.  For example:
1982632Sstever@eecs.umich.edu
1992632Sstever@eecs.umich.edu<blockquote>
2002632Sstever@eecs.umich.edu<pre>
2012632Sstever@eecs.umich.edudef t_NUMBER(t):
2022632Sstever@eecs.umich.edu    r'\d+'
2032632Sstever@eecs.umich.edu    try:
2042632Sstever@eecs.umich.edu         t.value = int(t.value)
2052632Sstever@eecs.umich.edu    except ValueError:
2062632Sstever@eecs.umich.edu         print "Number %s is too large!" % t.value
2072632Sstever@eecs.umich.edu	 t.value = 0
2082632Sstever@eecs.umich.edu    return t
2092632Sstever@eecs.umich.edu</pre>
2102632Sstever@eecs.umich.edu</blockquote>
2112632Sstever@eecs.umich.edu
2122632Sstever@eecs.umich.eduIn this case, the regular expression rule is specified in the function documentation string. 
2132632Sstever@eecs.umich.eduThe function always takes a single argument which is an instance of 
2142632Sstever@eecs.umich.edu<tt>LexToken</tt>.   This object has attributes of <tt>t.type</tt> which is the token type,
2152632Sstever@eecs.umich.edu<tt>t.value</tt> which is the lexeme, and <tt>t.lineno</tt> which is the current line number.
2162632Sstever@eecs.umich.eduBy default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
2172632Sstever@eecs.umich.edufunction can modify the contents of the <tt>LexToken</tt> object as appropriate.  However, 
2182632Sstever@eecs.umich.eduwhen it is done, the resulting token should be returned.  If no value is returned by the action
2192632Sstever@eecs.umich.edufunction, the token is simply discarded and the next token read.
2202632Sstever@eecs.umich.edu
2212632Sstever@eecs.umich.edu<p>
2222632Sstever@eecs.umich.eduThe rule <tt>t_newline()</tt> illustrates a regular expression rule
2232632Sstever@eecs.umich.edufor a discarded token.  In this case, a rule is written to match
2242632Sstever@eecs.umich.edunewlines so that proper line number tracking can be performed.
2252632Sstever@eecs.umich.eduBy returning no value, the function causes the newline character to be 
2262632Sstever@eecs.umich.edudiscarded.
2272632Sstever@eecs.umich.edu
2282632Sstever@eecs.umich.edu<p>
2292632Sstever@eecs.umich.eduThe special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
2302632Sstever@eecs.umich.eduthat should be completely ignored in the input stream. 
2312632Sstever@eecs.umich.eduUsually this is used to skip over whitespace and other non-essential characters. 
2322632Sstever@eecs.umich.eduAlthough it is possible to define a regular expression rule for whitespace in a manner
2332632Sstever@eecs.umich.edusimilar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
2342632Sstever@eecs.umich.edulexing performance because it is handled as a special case and is checked in a much
2352632Sstever@eecs.umich.edumore efficient manner than the normal regular expression rules.
2362632Sstever@eecs.umich.edu
2372632Sstever@eecs.umich.edu<p>
2382632Sstever@eecs.umich.eduFinally, the <tt>t_error()</tt>
2392632Sstever@eecs.umich.edufunction is used to handle lexing errors that occur when illegal
2402632Sstever@eecs.umich.educharacters are detected.  In this case, the <tt>t.value</tt> attribute contains the
2412632Sstever@eecs.umich.edurest of the input string that has not been tokenized.  In the example, we simply print
2422632Sstever@eecs.umich.eduthe offending character and skip ahead one character by calling <tt>t.skip(1)</tt>.
2432632Sstever@eecs.umich.edu
2442632Sstever@eecs.umich.edu<p>
2452632Sstever@eecs.umich.eduTo build the lexer, the function <tt>lex.lex()</tt> is used.  This function
2462632Sstever@eecs.umich.eduuses Python reflection (or introspection) to read the the regular expression rules
2472632Sstever@eecs.umich.eduout of the calling context and build the lexer. Once the lexer has been built, two functions can
2482632Sstever@eecs.umich.edube used to control the lexer.
2492632Sstever@eecs.umich.edu
2502632Sstever@eecs.umich.edu<ul>
2512632Sstever@eecs.umich.edu<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
2522632Sstever@eecs.umich.edu<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
2532632Sstever@eecs.umich.eduNone if the end of the input text has been reached.
2542632Sstever@eecs.umich.edu</ul>
2552632Sstever@eecs.umich.edu
2562632Sstever@eecs.umich.eduThe code at the bottom of the example shows how the lexer is actually used.  When executed,
2572632Sstever@eecs.umich.eduthe following output will be produced:
2582632Sstever@eecs.umich.edu
2592632Sstever@eecs.umich.edu<blockquote>
2602632Sstever@eecs.umich.edu<pre>
2612632Sstever@eecs.umich.edu$ python example.py
2622632Sstever@eecs.umich.eduLexToken(NUMBER,3,2)
2632632Sstever@eecs.umich.eduLexToken(PLUS,'+',2)
2642632Sstever@eecs.umich.eduLexToken(NUMBER,4,2)
2652632Sstever@eecs.umich.eduLexToken(TIMES,'*',2)
2662632Sstever@eecs.umich.eduLexToken(NUMBER,10,2)
2672632Sstever@eecs.umich.eduLexToken(PLUS,'+',3)
2682632Sstever@eecs.umich.eduLexToken(MINUS,'-',3)
2692632Sstever@eecs.umich.eduLexToken(NUMBER,20,3)
2702632Sstever@eecs.umich.eduLexToken(TIMES,'*',3)
2712632Sstever@eecs.umich.eduLexToken(NUMBER,2,3)
2722632Sstever@eecs.umich.edu</pre>
2732632Sstever@eecs.umich.edu</blockquote>
2742632Sstever@eecs.umich.edu
2752632Sstever@eecs.umich.edu<h2>Lex Implementation Notes</h2>
2762632Sstever@eecs.umich.edu
2772632Sstever@eecs.umich.edu<ul>
2782632Sstever@eecs.umich.edu<li><tt>lex.py</tt> uses the <tt>re</tt> module to do its patten matching.  When building the master regular expression,
2792632Sstever@eecs.umich.edurules are added in the following order:
2802632Sstever@eecs.umich.edu<p>
2812632Sstever@eecs.umich.edu<ol>
2822632Sstever@eecs.umich.edu<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
2832632Sstever@eecs.umich.edu<li>Tokens defined by strings are added by sorting them in order of decreasing regular expression length (longer expressions
2842632Sstever@eecs.umich.eduare added first).
2852632Sstever@eecs.umich.edu</ol>
2862632Sstever@eecs.umich.edu<p>
2872632Sstever@eecs.umich.eduWithout this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you 
2882632Sstever@eecs.umich.eduwanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
2892632Sstever@eecs.umich.eduexpressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
2902632Sstever@eecs.umich.eduthe order can be explicitly controlled since rules appearing first are checked first.
2912632Sstever@eecs.umich.edu
2922632Sstever@eecs.umich.edu<P>
2932632Sstever@eecs.umich.edu<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this 
2942632Sstever@eecs.umich.edurarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
2952632Sstever@eecs.umich.edusuch as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.
2962632Sstever@eecs.umich.edu
2972632Sstever@eecs.umich.edu<p>
2982632Sstever@eecs.umich.edu<li>
2992632Sstever@eecs.umich.eduTo handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
3002632Sstever@eecs.umich.edulike this:
3012632Sstever@eecs.umich.edu
3022632Sstever@eecs.umich.edu<blockquote>
3032632Sstever@eecs.umich.edu<pre>
3042632Sstever@eecs.umich.edureserved = {
3052632Sstever@eecs.umich.edu   'if' : 'IF',
3062632Sstever@eecs.umich.edu   'then' : 'THEN',
3072632Sstever@eecs.umich.edu   'else' : 'ELSE',
3082632Sstever@eecs.umich.edu   'while' : 'WHILE',
3092632Sstever@eecs.umich.edu   ...
3102632Sstever@eecs.umich.edu}
3112632Sstever@eecs.umich.edu
3122632Sstever@eecs.umich.edudef t_ID(t):
3132632Sstever@eecs.umich.edu    r'[a-zA-Z_][a-zA-Z_0-9]*'
3142632Sstever@eecs.umich.edu    t.type = reserved.get(t.value,'ID')    # Check for reserved words
3152632Sstever@eecs.umich.edu    return t
3162632Sstever@eecs.umich.edu</pre>
3172632Sstever@eecs.umich.edu</blockquote>
3182632Sstever@eecs.umich.edu
3192632Sstever@eecs.umich.edu<p>
3202632Sstever@eecs.umich.edu<li>The lexer requires tokens to be defined as class instances with <tt>t.type</tt>, <tt>t.value</tt>, and <tt>t.lineno</tt>
3212632Sstever@eecs.umich.eduattributes.   By default, tokens are created as instances of the <tt>LexToken</tt> class defined internally to <tt>lex.py</tt>.
3222632Sstever@eecs.umich.eduIf desired, you can create new kinds of tokens provided that they have the three required attributes.   However,
3232632Sstever@eecs.umich.eduin practice, it is probably safer to stick with the default.
3242632Sstever@eecs.umich.edu
3252632Sstever@eecs.umich.edu<p>
3262632Sstever@eecs.umich.edu<li>The only safe attribute for assigning token properties is <tt>t.value</tt>.   In some cases, you may want to attach
3272632Sstever@eecs.umich.edua number of different properties to a token (e.g., symbol table entries for identifiers).  To do this, replace <tt>t.value</tt>
3282632Sstever@eecs.umich.eduwith a tuple or class instance. For example:
3292632Sstever@eecs.umich.edu
3302632Sstever@eecs.umich.edu<blockquote>
3312632Sstever@eecs.umich.edu<pre>
3322632Sstever@eecs.umich.edudef t_ID(t):
3332632Sstever@eecs.umich.edu    ...
3342632Sstever@eecs.umich.edu    # For identifiers, create a (lexeme, symtab) tuple
3352632Sstever@eecs.umich.edu    t.value = (t.value, symbol_lookup(t.value))
3362632Sstever@eecs.umich.edu    ...
3372632Sstever@eecs.umich.edu    return t
3382632Sstever@eecs.umich.edu</pre>
3392632Sstever@eecs.umich.edu</blockquote>
3402632Sstever@eecs.umich.edu
3412632Sstever@eecs.umich.eduAlthough allowed, do NOT assign additional attributes to the token object.  For example,
3422632Sstever@eecs.umich.edu<blockquote>
3432632Sstever@eecs.umich.edu<pre>
3442632Sstever@eecs.umich.edudef t_ID(t):
3452632Sstever@eecs.umich.edu    ...
3462632Sstever@eecs.umich.edu    # Bad implementation of above
3472632Sstever@eecs.umich.edu    t.symtab = symbol_lookup(t.value)
3482632Sstever@eecs.umich.edu    ...
3492632Sstever@eecs.umich.edu</pre>
3502632Sstever@eecs.umich.edu</blockquote>
3512632Sstever@eecs.umich.edu
3522632Sstever@eecs.umich.eduThe reason you don't want to do this is that the <tt>yacc.py</tt>
3532632Sstever@eecs.umich.edumodule only provides public access to the <tt>t.value</tt> attribute of each token.
3542632Sstever@eecs.umich.eduTherefore, any other attributes you assign are inaccessible (if you are familiar
3552632Sstever@eecs.umich.eduwith the internals of C lex/yacc, <tt>t.value</tt> is the same as <tt>yylval.tok</tt>).
3562632Sstever@eecs.umich.edu
3572632Sstever@eecs.umich.edu<p>
3582632Sstever@eecs.umich.edu<li>To track line numbers, the lexer internally maintains a line
3592632Sstever@eecs.umich.edunumber variable.  Each token automatically gets the value of the
3602632Sstever@eecs.umich.educurrent line number in the <tt>t.lineno</tt> attribute. To modify the
3612632Sstever@eecs.umich.educurrent line number, simply change the <tt>t.lineno</tt> attribute
3622632Sstever@eecs.umich.eduin a function rule (as previously shown for
3632632Sstever@eecs.umich.edu<tt>t_newline()</tt>).  Even if the resulting token is discarded,
3642632Sstever@eecs.umich.educhanges to the line number remain in effect for subsequent tokens.
3652632Sstever@eecs.umich.edu
3662632Sstever@eecs.umich.edu<p>
3672632Sstever@eecs.umich.edu<li>To support multiple scanners in the same application, the <tt>lex.lex()</tt> function
3682632Sstever@eecs.umich.eduactually returns a special <tt>Lexer</tt> object.   This object has two methods 
3692632Sstever@eecs.umich.edu<tt>input()</tt> and <tt>token()</tt> that can be used to supply input and get tokens.  For example:
3702632Sstever@eecs.umich.edu
3712632Sstever@eecs.umich.edu<blockquote>
3722632Sstever@eecs.umich.edu<pre>
3732632Sstever@eecs.umich.edulexer = lex.lex()
3742632Sstever@eecs.umich.edulexer.input(sometext)
3752632Sstever@eecs.umich.eduwhile 1:
3762632Sstever@eecs.umich.edu    tok = lexer.token()
3772632Sstever@eecs.umich.edu    if not tok: break
3782632Sstever@eecs.umich.edu    print tok
3792632Sstever@eecs.umich.edu</pre>
3802632Sstever@eecs.umich.edu</blockquote>
3812632Sstever@eecs.umich.edu
3822632Sstever@eecs.umich.eduThe functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> 
3832632Sstever@eecs.umich.eduand <tt>token()</tt> methods of the last lexer created by the lex module. 
3842632Sstever@eecs.umich.edu
3852632Sstever@eecs.umich.edu
3862632Sstever@eecs.umich.edu<p>
3872632Sstever@eecs.umich.edu<li>To reduce compiler startup time and improve performance, the lexer can be built in optimized mode as follows:
3882632Sstever@eecs.umich.edu
3892632Sstever@eecs.umich.edu<blockquote>
3902632Sstever@eecs.umich.edu<pre>
3912632Sstever@eecs.umich.edulex.lex(optimize=1)
3922632Sstever@eecs.umich.edu</pre>
3932632Sstever@eecs.umich.edu</blockquote>
3942632Sstever@eecs.umich.edu
3952632Sstever@eecs.umich.eduWhen used, most error checking and validation is disabled.   This provides a slight performance
3962632Sstever@eecs.umich.edugain while tokenizing and tends to chop a few tenths of a second off startup time.  Since it disables
3972632Sstever@eecs.umich.eduerror checking, this mode is not the default and is not recommended during development.  However, once
3982632Sstever@eecs.umich.eduyou have your compiler fully working, it is usually safe to disable the error checks.
3992632Sstever@eecs.umich.edu
4002632Sstever@eecs.umich.edu<p>
4012632Sstever@eecs.umich.edu<li>You can enable some additional debugging by building the lexer like this:
4022632Sstever@eecs.umich.edu
4032632Sstever@eecs.umich.edu<blockquote>
4042632Sstever@eecs.umich.edu<pre>
4052632Sstever@eecs.umich.edulex.lex(debug=1)
4062632Sstever@eecs.umich.edu</pre>
4072632Sstever@eecs.umich.edu</blockquote>
4082632Sstever@eecs.umich.edu
4092632Sstever@eecs.umich.edu<p>
4102632Sstever@eecs.umich.edu<li>To help you debug your lexer, <tt>lex.py</tt> comes with a simple main program which will either
4112632Sstever@eecs.umich.edutokenize input read from standard input or from a file.  To use it, simply put this in your lexer:
4122632Sstever@eecs.umich.edu
4132632Sstever@eecs.umich.edu<blockquote>
4142632Sstever@eecs.umich.edu<pre>
4152632Sstever@eecs.umich.eduif __name__ == '__main__':
4162632Sstever@eecs.umich.edu     lex.runmain()
4172632Sstever@eecs.umich.edu</pre>
4182632Sstever@eecs.umich.edu</blockquote>
4192632Sstever@eecs.umich.edu
4202632Sstever@eecs.umich.eduThen, run you lexer as a main program such as <tt>python mylex.py</tt>
4212632Sstever@eecs.umich.edu
4222632Sstever@eecs.umich.edu<p>
4232632Sstever@eecs.umich.edu<li>Since the lexer is written entirely in Python, its performance is
4242632Sstever@eecs.umich.edulargely determined by that of the Python <tt>re</tt> module.  Although
4252632Sstever@eecs.umich.eduthe lexer has been written to be as efficient as possible, it's not
4262632Sstever@eecs.umich.edublazingly fast when used on very large input files.  Sorry.  If
4272632Sstever@eecs.umich.eduperformance is concern, you might consider upgrading to the most
4282632Sstever@eecs.umich.edurecent version of Python, creating a hand-written lexer, or offloading
4292632Sstever@eecs.umich.eduthe lexer into a C extension module.  In defense of <tt>lex.py</tt>,
4302632Sstever@eecs.umich.eduit's performance is not <em>that</em> bad when used on reasonably
4312632Sstever@eecs.umich.edusized input files.  For instance, lexing a 4700 line C program with
4322632Sstever@eecs.umich.edu32000 input tokens takes about 20 seconds on a 200 Mhz PC.  Obviously,
4332632Sstever@eecs.umich.eduit will run much faster on a more speedy machine.
4342632Sstever@eecs.umich.edu
4352632Sstever@eecs.umich.edu</ul>
4362632Sstever@eecs.umich.edu
4372632Sstever@eecs.umich.edu<h2>Parsing basics</h2>
4382632Sstever@eecs.umich.edu
4392632Sstever@eecs.umich.edu<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
4402632Sstever@eecs.umich.eduexample, there are a few important bits of background that must be
4412632Sstever@eecs.umich.edumentioned.  First, <tt>syntax</tt> is usually specified in terms of a
4422632Sstever@eecs.umich.educontext free grammar (CFG).  For example, if you wanted to parse
4432632Sstever@eecs.umich.edusimple arithmetic expressions, you might first write an unambiguous
4442632Sstever@eecs.umich.edugrammar specification like this:
4452632Sstever@eecs.umich.edu
4462632Sstever@eecs.umich.edu<blockquote>
4472632Sstever@eecs.umich.edu<pre> 
4482632Sstever@eecs.umich.eduexpression : expression + term
4492632Sstever@eecs.umich.edu           | expression - term
4502632Sstever@eecs.umich.edu           | term
4512632Sstever@eecs.umich.edu
4522632Sstever@eecs.umich.eduterm       : term * factor
4532632Sstever@eecs.umich.edu           | term / factor
4542632Sstever@eecs.umich.edu           | factor
4552632Sstever@eecs.umich.edu
4562632Sstever@eecs.umich.edufactor     : NUMBER
4572632Sstever@eecs.umich.edu           | ( expression )
4582632Sstever@eecs.umich.edu</pre>
4592632Sstever@eecs.umich.edu</blockquote>
4602632Sstever@eecs.umich.edu
4612632Sstever@eecs.umich.eduNext, the semantic behavior of a language is often specified using a
4622632Sstever@eecs.umich.edutechnique known as syntax directed translation.  In syntax directed
4632632Sstever@eecs.umich.edutranslation, attributes are attached to each symbol in a given grammar
4642632Sstever@eecs.umich.edurule along with an action.  Whenever a particular grammar rule is
4652632Sstever@eecs.umich.edurecognized, the action describes what to do.  For example, given the
4662632Sstever@eecs.umich.eduexpression grammar above, you might write the specification for a
4672632Sstever@eecs.umich.edusimple calculator like this:
4682632Sstever@eecs.umich.edu
4692632Sstever@eecs.umich.edu<blockquote>
4702632Sstever@eecs.umich.edu<pre> 
4712632Sstever@eecs.umich.eduGrammar                             Action
4722632Sstever@eecs.umich.edu--------------------------------    -------------------------------------------- 
4732632Sstever@eecs.umich.eduexpression0 : expression1 + term    expression0.val = expression1.val + term.val
4742632Sstever@eecs.umich.edu            | expression1 - term    expression0.val = expression1.val - term.val
4752632Sstever@eecs.umich.edu            | term                  expression0.val = term.val
4762632Sstever@eecs.umich.edu
4772632Sstever@eecs.umich.eduterm0       : term1 * factor        term0.val = term1.val * factor.val
4782632Sstever@eecs.umich.edu            | term1 / factor        term0.val = term1.val / factor.val
4792632Sstever@eecs.umich.edu            | factor                term0.val = factor.val
4802632Sstever@eecs.umich.edu
4812632Sstever@eecs.umich.edufactor      : NUMBER                factor.val = int(NUMBER.lexval)
4822632Sstever@eecs.umich.edu            | ( expression )        factor.val = expression.val
4832632Sstever@eecs.umich.edu</pre>
4842632Sstever@eecs.umich.edu</blockquote>
4852632Sstever@eecs.umich.edu
4862632Sstever@eecs.umich.eduFinally, Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
4872632Sstever@eecs.umich.edubottom up technique that tries to recognize the right-hand-side of various grammar rules.
4882632Sstever@eecs.umich.eduWhenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
4892632Sstever@eecs.umich.edugrammar symbols are replaced by the grammar symbol on the left-hand-side. 
4902632Sstever@eecs.umich.edu
4912632Sstever@eecs.umich.edu<p>
4922632Sstever@eecs.umich.eduLR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
4932632Sstever@eecs.umich.eduinput token for patterns.   The details of the algorithm can be found in a compiler text, but the
4942632Sstever@eecs.umich.edufollowing example illustrates the steps that are performed if you wanted to parse the expression 
4952632Sstever@eecs.umich.edu<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:
4962632Sstever@eecs.umich.edu
4972632Sstever@eecs.umich.edu<blockquote>
4982632Sstever@eecs.umich.edu<pre>
4992632Sstever@eecs.umich.eduStep Symbol Stack           Input Tokens            Action
5002632Sstever@eecs.umich.edu---- ---------------------  ---------------------   -------------------------------
5012632Sstever@eecs.umich.edu1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
5022632Sstever@eecs.umich.edu2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
5032632Sstever@eecs.umich.edu3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
5042632Sstever@eecs.umich.edu4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
5052632Sstever@eecs.umich.edu5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
5062632Sstever@eecs.umich.edu6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
5072632Sstever@eecs.umich.edu7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
5082632Sstever@eecs.umich.edu8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
5092632Sstever@eecs.umich.edu9    $ expr + term                * ( 10 - 20 )$    Shift *
5102632Sstever@eecs.umich.edu10   $ expr + term *                ( 10 - 20 )$    Shift (
5112632Sstever@eecs.umich.edu11   $ expr + term * (                10 - 20 )$    Shift 10
5122632Sstever@eecs.umich.edu12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
5132632Sstever@eecs.umich.edu13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
5142632Sstever@eecs.umich.edu14   $ expr + term * ( term              - 20 )$    Reduce expr : term
5152632Sstever@eecs.umich.edu15   $ expr + term * ( expr              - 20 )$    Shift -
5162632Sstever@eecs.umich.edu16   $ expr + term * ( expr -              20 )$    Shift 20
5172632Sstever@eecs.umich.edu17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
5182632Sstever@eecs.umich.edu18   $ expr + term * ( expr - factor          )$    Reduce term : factor
5192632Sstever@eecs.umich.edu19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
5202632Sstever@eecs.umich.edu20   $ expr + term * ( expr                   )$    Shift )
5212632Sstever@eecs.umich.edu21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
5222632Sstever@eecs.umich.edu22   $ expr + term * factor                    $    Reduce term : term * factor
5232632Sstever@eecs.umich.edu23   $ expr + term                             $    Reduce expr : expr + term
5242632Sstever@eecs.umich.edu24   $ expr                                    $    Reduce expr
5252632Sstever@eecs.umich.edu25   $                                         $    Success!
5262632Sstever@eecs.umich.edu</pre>
5272632Sstever@eecs.umich.edu</blockquote>
5282632Sstever@eecs.umich.edu
5292632Sstever@eecs.umich.eduWhen parsing the expression, an underlying state machine and the current input token determine what to do next.  
5302632Sstever@eecs.umich.eduIf the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
5312632Sstever@eecs.umich.eduonto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
5322632Sstever@eecs.umich.eduusually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
5332632Sstever@eecs.umich.eduappropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
5342632Sstever@eecs.umich.eduany grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).
5352632Sstever@eecs.umich.edu
5362632Sstever@eecs.umich.edu<p>
5372632Sstever@eecs.umich.eduIt is important to note that the underlying implementation is actually built around a large finite-state machine
5382632Sstever@eecs.umich.eduand some tables.   The construction of these tables is quite complicated and beyond the scope of this discussion.
5392632Sstever@eecs.umich.eduHowever, subtle details of this process explain why, in the example above, the parser chooses to shift a token
5402632Sstever@eecs.umich.eduonto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.
5412632Sstever@eecs.umich.edu
5422632Sstever@eecs.umich.edu<h2>Yacc example</h2>
5432632Sstever@eecs.umich.edu
5442632Sstever@eecs.umich.eduSuppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
5452632Sstever@eecs.umich.eduhow you would do it with <tt>yacc.py</tt>:
5462632Sstever@eecs.umich.edu
5472632Sstever@eecs.umich.edu<blockquote>
5482632Sstever@eecs.umich.edu<pre>
5492632Sstever@eecs.umich.edu# Yacc example
5502632Sstever@eecs.umich.edu
5512632Sstever@eecs.umich.eduimport yacc
5522632Sstever@eecs.umich.edu
5532632Sstever@eecs.umich.edu# Get the token map from the lexer.  This is required.
5542632Sstever@eecs.umich.edufrom calclex import tokens
5552632Sstever@eecs.umich.edu
5562632Sstever@eecs.umich.edudef p_expression_plus(t):
5572632Sstever@eecs.umich.edu    'expression : expression PLUS term'
5582632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
5592632Sstever@eecs.umich.edu
5602632Sstever@eecs.umich.edudef p_expression_minus(t):
5612632Sstever@eecs.umich.edu    'expression : expression MINUS term'
5622632Sstever@eecs.umich.edu    t[0] = t[1] - t[3]
5632632Sstever@eecs.umich.edu
5642632Sstever@eecs.umich.edudef p_expression_term(t):
5652632Sstever@eecs.umich.edu    'expression : term'
5662632Sstever@eecs.umich.edu    t[0] = t[1]
5672632Sstever@eecs.umich.edu
5682632Sstever@eecs.umich.edudef p_term_times(t):
5692632Sstever@eecs.umich.edu    'term : term TIMES factor'
5702632Sstever@eecs.umich.edu    t[0] = t[1] * t[3]
5712632Sstever@eecs.umich.edu
5722632Sstever@eecs.umich.edudef p_term_div(t):
5732632Sstever@eecs.umich.edu    'term : term DIVIDE factor'
5742632Sstever@eecs.umich.edu    t[0] = t[1] / t[3]
5752632Sstever@eecs.umich.edu
5762632Sstever@eecs.umich.edudef p_term_factor(t):
5772632Sstever@eecs.umich.edu    'term : factor'
5782632Sstever@eecs.umich.edu    t[0] = t[1]
5792632Sstever@eecs.umich.edu
5802632Sstever@eecs.umich.edudef p_factor_num(t):
5812632Sstever@eecs.umich.edu    'factor : NUMBER'
5822632Sstever@eecs.umich.edu    t[0] = t[1]
5832632Sstever@eecs.umich.edu
5842632Sstever@eecs.umich.edudef p_factor_expr(t):
5852632Sstever@eecs.umich.edu    'factor : LPAREN expression RPAREN'
5862632Sstever@eecs.umich.edu    t[0] = t[2]
5872632Sstever@eecs.umich.edu
5882632Sstever@eecs.umich.edu# Error rule for syntax errors
5892632Sstever@eecs.umich.edudef p_error(t):
5902632Sstever@eecs.umich.edu    print "Syntax error in input!"
5912632Sstever@eecs.umich.edu
5922632Sstever@eecs.umich.edu# Build the parser
5932632Sstever@eecs.umich.eduyacc.yacc()
5942632Sstever@eecs.umich.edu
5952632Sstever@eecs.umich.eduwhile 1:
5962632Sstever@eecs.umich.edu   try:
5972632Sstever@eecs.umich.edu       s = raw_input('calc > ')
5982632Sstever@eecs.umich.edu   except EOFError:
5992632Sstever@eecs.umich.edu       break
6002632Sstever@eecs.umich.edu   if not s: continue
6012632Sstever@eecs.umich.edu   result = yacc.parse(s)
6022632Sstever@eecs.umich.edu   print result
6032632Sstever@eecs.umich.edu</pre>
6042632Sstever@eecs.umich.edu</blockquote>
6052632Sstever@eecs.umich.edu
6062632Sstever@eecs.umich.eduIn this example, each grammar rule is defined by a Python function where the docstring to that function contains the
6072632Sstever@eecs.umich.eduappropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit).  Each function accepts a single
6082632Sstever@eecs.umich.eduargument <tt>t</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
6092632Sstever@eecs.umich.edu<tt>t[i]</tt> are mapped to grammar symbols as shown here:
6102632Sstever@eecs.umich.edu
6112632Sstever@eecs.umich.edu<blockquote>
6122632Sstever@eecs.umich.edu<pre>
6132632Sstever@eecs.umich.edudef p_expression_plus(t):
6142632Sstever@eecs.umich.edu    'expression : expression PLUS term'
6152632Sstever@eecs.umich.edu    #   ^            ^        ^    ^
6162632Sstever@eecs.umich.edu    #  t[0]         t[1]     t[2] t[3]
6172632Sstever@eecs.umich.edu
6182632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
6192632Sstever@eecs.umich.edu</pre>
6202632Sstever@eecs.umich.edu</blockquote>
6212632Sstever@eecs.umich.edu
6222632Sstever@eecs.umich.eduFor tokens, the "value" in the corresponding <tt>t[i]</tt> is the
6232632Sstever@eecs.umich.edu<em>same</em> as the value of the <tt>t.value</tt> attribute assigned
6242632Sstever@eecs.umich.eduin the lexer module.  For non-terminals, the value is determined by
6252632Sstever@eecs.umich.eduwhatever is placed in <tt>t[0]</tt> when rules are reduced.  This
6262632Sstever@eecs.umich.eduvalue can be anything at all.  However, it probably most common for
6272632Sstever@eecs.umich.eduthe value to be a simple Python type, a tuple, or an instance.  In this example, we
6282632Sstever@eecs.umich.eduare relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
6292632Sstever@eecs.umich.edufield.  All of the other rules simply perform various types of integer operations and store
6302632Sstever@eecs.umich.eduthe result.
6312632Sstever@eecs.umich.edu
6322632Sstever@eecs.umich.edu<p>
6332632Sstever@eecs.umich.eduThe first rule defined in the yacc specification determines the starting grammar
6342632Sstever@eecs.umich.edusymbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
6352632Sstever@eecs.umich.eduthe starting rule is reduced by the parser and no more input is available, parsing 
6362632Sstever@eecs.umich.edustops and the final value is returned (this value will be whatever the top-most rule
6372632Sstever@eecs.umich.eduplaced in <tt>t[0]</tt>).
6382632Sstever@eecs.umich.edu
6392632Sstever@eecs.umich.edu<p>The <tt>p_error(t)</tt> rule is defined to catch syntax errors.  See the error handling section
6402632Sstever@eecs.umich.edubelow for more detail.
6412632Sstever@eecs.umich.edu
6422632Sstever@eecs.umich.edu<p>
6432632Sstever@eecs.umich.eduTo build the parser, call the <tt>yacc.yacc()</tt> function.  This function
6442632Sstever@eecs.umich.edulooks at the module and attempts to construct all of the LR parsing tables for the grammar
6452632Sstever@eecs.umich.eduyou have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
6462632Sstever@eecs.umich.edusuch as this:
6472632Sstever@eecs.umich.edu
6482632Sstever@eecs.umich.edu<blockquote>
6492632Sstever@eecs.umich.edu<pre>
6502632Sstever@eecs.umich.edu$ python calcparse.py
6512632Sstever@eecs.umich.eduyacc: Generating SLR parsing table...  
6522632Sstever@eecs.umich.educalc > 
6532632Sstever@eecs.umich.edu</pre>
6542632Sstever@eecs.umich.edu</blockquote>
6552632Sstever@eecs.umich.edu
6562632Sstever@eecs.umich.eduSince table construction is relatively expensive (especially for large
6572632Sstever@eecs.umich.edugrammars), the resulting parsing table is written to the current
6582632Sstever@eecs.umich.edudirectory in a file called <tt>parsetab.py</tt>.  In addition, a
6592632Sstever@eecs.umich.edudebugging file called <tt>parser.out</tt> is created.  On subsequent
6602632Sstever@eecs.umich.eduexecutions, <tt>yacc</tt> will reload the table from
6612632Sstever@eecs.umich.edu<tt>parsetab.py</tt> unless it has detected a change in the underlying
6622632Sstever@eecs.umich.edugrammar (in which case the tables and <tt>parsetab.py</tt> file are
6632632Sstever@eecs.umich.eduregenerated).
6642632Sstever@eecs.umich.edu
6652632Sstever@eecs.umich.edu<p>
6662632Sstever@eecs.umich.eduIf any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
6672632Sstever@eecs.umich.edudiagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:
6682632Sstever@eecs.umich.edu
6692632Sstever@eecs.umich.edu<ul>
6702632Sstever@eecs.umich.edu<li>Duplicated function names (if more than one rule function have the same name in the grammar file).
6712632Sstever@eecs.umich.edu<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
6722632Sstever@eecs.umich.edu<li>Badly specified grammar rules.
6732632Sstever@eecs.umich.edu<li>Infinite recursion (rules that can never terminate).
6742632Sstever@eecs.umich.edu<li>Unused rules and tokens
6752632Sstever@eecs.umich.edu<li>Undefined rules and tokens
6762632Sstever@eecs.umich.edu</ul>
6772632Sstever@eecs.umich.edu
6782632Sstever@eecs.umich.eduThe next few sections now discuss a few finer points of grammar construction.
6792632Sstever@eecs.umich.edu
6802632Sstever@eecs.umich.edu<h2>Combining Grammar Rule Functions</h2>
6812632Sstever@eecs.umich.edu
6822632Sstever@eecs.umich.eduWhen grammar rules are similar, they can be combined into a single function.
6832632Sstever@eecs.umich.eduFor example, consider the two rules in our earlier example:
6842632Sstever@eecs.umich.edu
6852632Sstever@eecs.umich.edu<blockquote>
6862632Sstever@eecs.umich.edu<pre>
6872632Sstever@eecs.umich.edudef p_expression_plus(t):
6882632Sstever@eecs.umich.edu    'expression : expression PLUS term'
6892632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
6902632Sstever@eecs.umich.edu
6912632Sstever@eecs.umich.edudef p_expression_minus(t):
6922632Sstever@eecs.umich.edu    'expression : expression MINUS term'
6932632Sstever@eecs.umich.edu    t[0] = t[1] - t[3]
6942632Sstever@eecs.umich.edu</pre>
6952632Sstever@eecs.umich.edu</blockquote>
6962632Sstever@eecs.umich.edu
6972632Sstever@eecs.umich.eduInstead of writing two functions, you might write a single function like this:
6982632Sstever@eecs.umich.edu
6992632Sstever@eecs.umich.edu<blockquote>
7002632Sstever@eecs.umich.edu<pre>
7012632Sstever@eecs.umich.edudef p_expression(t):
7022632Sstever@eecs.umich.edu    '''expression : expression PLUS term
7032632Sstever@eecs.umich.edu                  | expression MINUS term'''
7042632Sstever@eecs.umich.edu    if t[2] == '+':
7052632Sstever@eecs.umich.edu        t[0] = t[1] + t[3]
7062632Sstever@eecs.umich.edu    elif t[2] == '-':
7072632Sstever@eecs.umich.edu        t[0] = t[1] - t[3]
7082632Sstever@eecs.umich.edu</pre>
7092632Sstever@eecs.umich.edu</blockquote>
7102632Sstever@eecs.umich.edu
7112632Sstever@eecs.umich.eduIn general, the doc string for any given function can contain multiple grammar rules.  So, it would
7122632Sstever@eecs.umich.eduhave also been legal (although possibly confusing) to write this:
7132632Sstever@eecs.umich.edu
7142632Sstever@eecs.umich.edu<blockquote>
7152632Sstever@eecs.umich.edu<pre>
7162632Sstever@eecs.umich.edudef p_binary_operators(t):
7172632Sstever@eecs.umich.edu    '''expression : expression PLUS term
7182632Sstever@eecs.umich.edu                  | expression MINUS term
7192632Sstever@eecs.umich.edu       term       : term TIMES factor
7202632Sstever@eecs.umich.edu                  | term DIVIDE factor'''
7212632Sstever@eecs.umich.edu    if t[2] == '+':
7222632Sstever@eecs.umich.edu        t[0] = t[1] + t[3]
7232632Sstever@eecs.umich.edu    elif t[2] == '-':
7242632Sstever@eecs.umich.edu        t[0] = t[1] - t[3]
7252632Sstever@eecs.umich.edu    elif t[2] == '*':
7262632Sstever@eecs.umich.edu        t[0] = t[1] * t[3]
7272632Sstever@eecs.umich.edu    elif t[2] == '/':
7282632Sstever@eecs.umich.edu        t[0] = t[1] / t[3]
7292632Sstever@eecs.umich.edu</pre>
7302632Sstever@eecs.umich.edu</blockquote>
7312632Sstever@eecs.umich.edu
7322632Sstever@eecs.umich.eduWhen combining grammar rules into a single function, it is usually a good idea for all of the rules to have
7332632Sstever@eecs.umich.edua similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more 
7342632Sstever@eecs.umich.educomplicated than necessary.
7352632Sstever@eecs.umich.edu
7362632Sstever@eecs.umich.edu<h2>Empty Productions</h2>
7372632Sstever@eecs.umich.edu
7382632Sstever@eecs.umich.edu<tt>yacc.py</tt> can handle empty productions by defining a rule like this:
7392632Sstever@eecs.umich.edu
7402632Sstever@eecs.umich.edu<blockquote>
7412632Sstever@eecs.umich.edu<pre>
7422632Sstever@eecs.umich.edudef p_empty(t):
7432632Sstever@eecs.umich.edu    'empty :'
7442632Sstever@eecs.umich.edu    pass
7452632Sstever@eecs.umich.edu</pre>
7462632Sstever@eecs.umich.edu</blockquote>
7472632Sstever@eecs.umich.edu
7482632Sstever@eecs.umich.eduNow to use the empty production, simply use 'empty' as a symbol.  For example:
7492632Sstever@eecs.umich.edu
7502632Sstever@eecs.umich.edu<blockquote>
7512632Sstever@eecs.umich.edu<pre>
7522632Sstever@eecs.umich.edudef p_optitem(t):
7532632Sstever@eecs.umich.edu    'optitem : item'
7542632Sstever@eecs.umich.edu    '        | empty'
7552632Sstever@eecs.umich.edu    ...
7562632Sstever@eecs.umich.edu</pre>
7572632Sstever@eecs.umich.edu</blockquote>
7582632Sstever@eecs.umich.edu
7592632Sstever@eecs.umich.edu<h2>Dealing With Ambiguous Grammars</h2>
7602632Sstever@eecs.umich.edu
7612632Sstever@eecs.umich.eduThe expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. 
7622632Sstever@eecs.umich.eduHowever, in many situations, it is extremely difficult or awkward to write grammars in this format.  A
7632632Sstever@eecs.umich.edumuch more natural way to express the grammar is in a more compact form like this:
7642632Sstever@eecs.umich.edu
7652632Sstever@eecs.umich.edu<blockquote>
7662632Sstever@eecs.umich.edu<pre>
7672632Sstever@eecs.umich.eduexpression : expression PLUS expression
7682632Sstever@eecs.umich.edu           | expression MINUS expression
7692632Sstever@eecs.umich.edu           | expression TIMES expression
7702632Sstever@eecs.umich.edu           | expression DIVIDE expression
7712632Sstever@eecs.umich.edu           | LPAREN expression RPAREN
7722632Sstever@eecs.umich.edu           | NUMBER
7732632Sstever@eecs.umich.edu</pre>
7742632Sstever@eecs.umich.edu</blockquote>
7752632Sstever@eecs.umich.edu
7762632Sstever@eecs.umich.eduUnfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
7772632Sstever@eecs.umich.edu"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
7782632Sstever@eecs.umich.eduFor example, does this expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?
7792632Sstever@eecs.umich.edu
7802632Sstever@eecs.umich.edu<p>
7812632Sstever@eecs.umich.eduWhen an ambiguous grammar is given to <tt>yacc.py</tt> it will print messages about "shift/reduce conflicts" 
7822632Sstever@eecs.umich.eduor a "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
7832632Sstever@eecs.umich.eduwhether or not to reduce a rule or shift a symbol on the parsing stack.   For example, consider
7842632Sstever@eecs.umich.eduthe string "3 * 4 + 5" and the internal parsing stack:
7852632Sstever@eecs.umich.edu
7862632Sstever@eecs.umich.edu<blockquote>
7872632Sstever@eecs.umich.edu<pre>
7882632Sstever@eecs.umich.eduStep Symbol Stack           Input Tokens            Action
7892632Sstever@eecs.umich.edu---- ---------------------  ---------------------   -------------------------------
7902632Sstever@eecs.umich.edu1    $                                3 * 4 + 5$    Shift 3
7912632Sstever@eecs.umich.edu2    $ 3                                * 4 + 5$    Reduce : expression : NUMBER
7922632Sstever@eecs.umich.edu3    $ expr                             * 4 + 5$    Shift *
7932632Sstever@eecs.umich.edu4    $ expr *                             4 + 5$    Shift 4
7942632Sstever@eecs.umich.edu5    $ expr * 4                             + 5$    Reduce: expression : NUMBER
7952632Sstever@eecs.umich.edu6    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
7962632Sstever@eecs.umich.edu</pre>
7972632Sstever@eecs.umich.edu</blockquote>
7982632Sstever@eecs.umich.edu
7992632Sstever@eecs.umich.eduIn this case, when the parser reaches step 6, it has two options.  One is the reduce the
8002632Sstever@eecs.umich.edurule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
8012632Sstever@eecs.umich.edutoken <tt>+</tt> on the stack.   Both options are perfectly legal from the rules
8022632Sstever@eecs.umich.eduof the context-free-grammar.
8032632Sstever@eecs.umich.edu
8042632Sstever@eecs.umich.edu<p>
8052632Sstever@eecs.umich.eduBy default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
8062632Sstever@eecs.umich.eduexample, the parser will always shift the <tt>+</tt> instead of reducing.    Although this
8072632Sstever@eecs.umich.edustrategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
8082632Sstever@eecs.umich.eduexpressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
8092632Sstever@eecs.umich.edureduced <tt>expr * expr</tt> since multiplication has higher precedence than addition.
8102632Sstever@eecs.umich.edu
8112632Sstever@eecs.umich.edu<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
8122632Sstever@eecs.umich.edutokens to be assigned a precedence level and associativity.  This is done by adding a variable
8132632Sstever@eecs.umich.edu<tt>precedence</tt> to the grammar file like this:
8142632Sstever@eecs.umich.edu
8152632Sstever@eecs.umich.edu<blockquote>
8162632Sstever@eecs.umich.edu<pre>
8172632Sstever@eecs.umich.eduprecedence = (
8182632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
8192632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
8202632Sstever@eecs.umich.edu)
8212632Sstever@eecs.umich.edu</pre>
8222632Sstever@eecs.umich.edu</blockquote>
8232632Sstever@eecs.umich.edu
8242632Sstever@eecs.umich.eduThis declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
8252632Sstever@eecs.umich.eduthe same precedence level and are left-associative and that 
8262632Sstever@eecs.umich.edu<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
8272632Sstever@eecs.umich.eduFurthermore, the declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
8282632Sstever@eecs.umich.eduprecedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
8292632Sstever@eecs.umich.eduprecedence specification).
8302632Sstever@eecs.umich.edu
8312632Sstever@eecs.umich.edu<p>
8322632Sstever@eecs.umich.eduThe precedence specification is used to attach a numerical precedence value and associativity direction 
8332632Sstever@eecs.umich.eduto each grammar rule. This is always determined by the precedence of the right-most terminal symbol.  Therefore,
8342632Sstever@eecs.umich.eduif PLUS/MINUS had a precedence of 1 and TIMES/DIVIDE had a precedence of 2, the grammar rules
8352632Sstever@eecs.umich.eduwould have precedence values as follows:
8362632Sstever@eecs.umich.edu
8372632Sstever@eecs.umich.edu<blockquote>
8382632Sstever@eecs.umich.edu<pre>
8392632Sstever@eecs.umich.eduexpression : expression PLUS expression                 # prec = 1, left
8402632Sstever@eecs.umich.edu           | expression MINUS expression                # prec = 1, left
8412632Sstever@eecs.umich.edu           | expression TIMES expression                # prec = 2, left
8422632Sstever@eecs.umich.edu           | expression DIVIDE expression               # prec = 2, left
8432632Sstever@eecs.umich.edu           | LPAREN expression RPAREN                   # prec = unknown
8442632Sstever@eecs.umich.edu           | NUMBER                                     # prec = unknown
8452632Sstever@eecs.umich.edu</pre>
8462632Sstever@eecs.umich.edu</blockquote>
8472632Sstever@eecs.umich.edu
8482632Sstever@eecs.umich.eduWhen shift/reduce conflicts are encountered, the parser generator resolves the conflict by
8492632Sstever@eecs.umich.edulooking at the precedence rules and associativity specifiers.
8502632Sstever@eecs.umich.edu
8512632Sstever@eecs.umich.edu<p>
8522632Sstever@eecs.umich.edu<ol>
8532632Sstever@eecs.umich.edu<li>If the current token has higher precedence, it is shifted.
8542632Sstever@eecs.umich.edu<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
8552632Sstever@eecs.umich.edu<li>If the current token and the grammar rule have the same precedence, the
8562632Sstever@eecs.umich.edurule is reduced for left associativity, whereas the token is shifted for right associativity.
8572632Sstever@eecs.umich.edu<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
8582632Sstever@eecs.umich.edufavor of shifting (the default).
8592632Sstever@eecs.umich.edu</ol>
8602632Sstever@eecs.umich.edu
8612632Sstever@eecs.umich.edu<p>
8622632Sstever@eecs.umich.eduWhen shift/reduce conflicts are resolved using the first three techniques (with the help of
8632632Sstever@eecs.umich.eduprecedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar. 
8642632Sstever@eecs.umich.edu
8652632Sstever@eecs.umich.edu<p>
8662632Sstever@eecs.umich.eduOne problem with the precedence specifier technique is that it is sometimes necessary to
8672632Sstever@eecs.umich.educhange the precedence of an operator in certain contents.  For example, consider a unary-minus operator
8682632Sstever@eecs.umich.eduin "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
8692632Sstever@eecs.umich.eduHowever, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
8702632Sstever@eecs.umich.eduprecedence rules can be given for fictitious tokens like this:
8712632Sstever@eecs.umich.edu
8722632Sstever@eecs.umich.edu<blockquote>
8732632Sstever@eecs.umich.edu<pre>
8742632Sstever@eecs.umich.eduprecedence = (
8752632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
8762632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
8772632Sstever@eecs.umich.edu    ('right', 'UMINUS'),            # Unary minus operator
8782632Sstever@eecs.umich.edu)
8792632Sstever@eecs.umich.edu</pre>
8802632Sstever@eecs.umich.edu</blockquote>
8812632Sstever@eecs.umich.edu
8822632Sstever@eecs.umich.eduNow, in the grammar file, we can write our unary minus rule like this:
8832632Sstever@eecs.umich.edu
8842632Sstever@eecs.umich.edu<blockquote>
8852632Sstever@eecs.umich.edu<pre>
8862632Sstever@eecs.umich.edudef p_expr_uminus(t):
8872632Sstever@eecs.umich.edu    'expression : MINUS expression %prec UMINUS'
8882632Sstever@eecs.umich.edu    t[0] = -t[2]
8892632Sstever@eecs.umich.edu</pre>
8902632Sstever@eecs.umich.edu</blockquote>
8912632Sstever@eecs.umich.edu
8922632Sstever@eecs.umich.eduIn this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
8932632Sstever@eecs.umich.eduof UMINUS in the precedence specifier.
8942632Sstever@eecs.umich.edu
8952632Sstever@eecs.umich.edu<p>
8962632Sstever@eecs.umich.eduIt is also possible to specify non-associativity in the <tt>precedence</tt> table. This would
8972632Sstever@eecs.umich.edube used when you <em>don't</em> want operations to chain together.  For example, suppose
8982632Sstever@eecs.umich.eduyou wanted to support a comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
8992632Sstever@eecs.umich.educombinations like <tt>a &lt; b &lt; c</tt>.   To do this, simply specify a rule like this:
9002632Sstever@eecs.umich.edu
9012632Sstever@eecs.umich.edu<blockquote>
9022632Sstever@eecs.umich.edu<pre>
9032632Sstever@eecs.umich.eduprecedence = (
9042632Sstever@eecs.umich.edu    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
9052632Sstever@eecs.umich.edu    ('left', 'PLUS', 'MINUS'),
9062632Sstever@eecs.umich.edu    ('left', 'TIMES', 'DIVIDE'),
9072632Sstever@eecs.umich.edu    ('right', 'UMINUS'),            # Unary minus operator
9082632Sstever@eecs.umich.edu)
9092632Sstever@eecs.umich.edu</pre>
9102632Sstever@eecs.umich.edu</blockquote>
9112632Sstever@eecs.umich.edu
9122632Sstever@eecs.umich.edu<p>
9132632Sstever@eecs.umich.eduReduce/reduce conflicts are caused when there are multiple grammar
9142632Sstever@eecs.umich.edurules that can be applied to a given set of symbols.  This kind of
9152632Sstever@eecs.umich.educonflict is almost always bad and is always resolved by picking the
9162632Sstever@eecs.umich.edurule that appears first in the grammar file.   Reduce/reduce conflicts
9172632Sstever@eecs.umich.eduare almost always caused when different sets of grammar rules somehow
9182632Sstever@eecs.umich.edugenerate the same set of symbols.  For example:
9192632Sstever@eecs.umich.edu
9202632Sstever@eecs.umich.edu<blockquote>
9212632Sstever@eecs.umich.edu<pre>
9222632Sstever@eecs.umich.eduassignment :  ID EQUALS NUMBER
9232632Sstever@eecs.umich.edu           |  ID EQUALS expression
9242632Sstever@eecs.umich.edu           
9252632Sstever@eecs.umich.eduexpression : expression PLUS expression
9262632Sstever@eecs.umich.edu           | expression MINUS expression
9272632Sstever@eecs.umich.edu           | expression TIMES expression
9282632Sstever@eecs.umich.edu           | expression DIVIDE expression
9292632Sstever@eecs.umich.edu           | LPAREN expression RPAREN
9302632Sstever@eecs.umich.edu           | NUMBER
9312632Sstever@eecs.umich.edu</pre>
9322632Sstever@eecs.umich.edu</blockquote>
9332632Sstever@eecs.umich.edu
9342632Sstever@eecs.umich.eduIn this case, a reduce/reduce conflict exists between these two rules:
9352632Sstever@eecs.umich.edu
9362632Sstever@eecs.umich.edu<blockquote>
9372632Sstever@eecs.umich.edu<pre>
9382632Sstever@eecs.umich.eduassignment  : ID EQUALS NUMBER
9392632Sstever@eecs.umich.eduexpression  : NUMBER
9402632Sstever@eecs.umich.edu</pre>
9412632Sstever@eecs.umich.edu</blockquote>
9422632Sstever@eecs.umich.edu
9432632Sstever@eecs.umich.eduFor example, if you wrote "a = 5", the parser can't figure out if this
9442632Sstever@eecs.umich.eduis supposed to reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
9452632Sstever@eecs.umich.eduwhether it's supposed to reduce the 5 as an expression and then reduce
9462632Sstever@eecs.umich.eduthe rule <tt>assignment : ID EQUALS expression</tt>.
9472632Sstever@eecs.umich.edu
9482632Sstever@eecs.umich.edu<h2>The parser.out file</h2>
9492632Sstever@eecs.umich.edu
9502632Sstever@eecs.umich.eduTracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
9512632Sstever@eecs.umich.eduparsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
9522632Sstever@eecs.umich.edu'parser.out' when it generates the parsing table.   The contents of this file look like the following:
9532632Sstever@eecs.umich.edu
9542632Sstever@eecs.umich.edu<blockquote>
9552632Sstever@eecs.umich.edu<pre>
9562632Sstever@eecs.umich.eduUnused terminals:
9572632Sstever@eecs.umich.edu
9582632Sstever@eecs.umich.edu
9592632Sstever@eecs.umich.eduGrammar
9602632Sstever@eecs.umich.edu
9612632Sstever@eecs.umich.eduRule 1     expression -> expression PLUS expression
9622632Sstever@eecs.umich.eduRule 2     expression -> expression MINUS expression
9632632Sstever@eecs.umich.eduRule 3     expression -> expression TIMES expression
9642632Sstever@eecs.umich.eduRule 4     expression -> expression DIVIDE expression
9652632Sstever@eecs.umich.eduRule 5     expression -> NUMBER
9662632Sstever@eecs.umich.eduRule 6     expression -> LPAREN expression RPAREN
9672632Sstever@eecs.umich.edu
9682632Sstever@eecs.umich.eduTerminals, with rules where they appear
9692632Sstever@eecs.umich.edu
9702632Sstever@eecs.umich.eduTIMES                : 3
9712632Sstever@eecs.umich.eduerror                : 
9722632Sstever@eecs.umich.eduMINUS                : 2
9732632Sstever@eecs.umich.eduRPAREN               : 6
9742632Sstever@eecs.umich.eduLPAREN               : 6
9752632Sstever@eecs.umich.eduDIVIDE               : 4
9762632Sstever@eecs.umich.eduPLUS                 : 1
9772632Sstever@eecs.umich.eduNUMBER               : 5
9782632Sstever@eecs.umich.edu
9792632Sstever@eecs.umich.eduNonterminals, with rules where they appear
9802632Sstever@eecs.umich.edu
9812632Sstever@eecs.umich.eduexpression           : 1 1 2 2 3 3 4 4 6 0
9822632Sstever@eecs.umich.edu
9832632Sstever@eecs.umich.edu
9842632Sstever@eecs.umich.eduParsing method: SLR
9852632Sstever@eecs.umich.edu
9862632Sstever@eecs.umich.edu
9872632Sstever@eecs.umich.edustate 0
9882632Sstever@eecs.umich.edu
9892632Sstever@eecs.umich.edu    S' -> . expression
9902632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
9912632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
9922632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
9932632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
9942632Sstever@eecs.umich.edu    expression -> . NUMBER
9952632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
9962632Sstever@eecs.umich.edu
9972632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
9982632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
9992632Sstever@eecs.umich.edu
10002632Sstever@eecs.umich.edu
10012632Sstever@eecs.umich.edustate 1
10022632Sstever@eecs.umich.edu
10032632Sstever@eecs.umich.edu    S' -> expression .
10042632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
10052632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
10062632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
10072632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
10082632Sstever@eecs.umich.edu
10092632Sstever@eecs.umich.edu    PLUS            shift and go to state 6
10102632Sstever@eecs.umich.edu    MINUS           shift and go to state 5
10112632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
10122632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
10132632Sstever@eecs.umich.edu
10142632Sstever@eecs.umich.edu
10152632Sstever@eecs.umich.edustate 2
10162632Sstever@eecs.umich.edu
10172632Sstever@eecs.umich.edu    expression -> LPAREN . expression RPAREN
10182632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
10192632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
10202632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
10212632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
10222632Sstever@eecs.umich.edu    expression -> . NUMBER
10232632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
10242632Sstever@eecs.umich.edu
10252632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
10262632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
10272632Sstever@eecs.umich.edu
10282632Sstever@eecs.umich.edu
10292632Sstever@eecs.umich.edustate 3
10302632Sstever@eecs.umich.edu
10312632Sstever@eecs.umich.edu    expression -> NUMBER .
10322632Sstever@eecs.umich.edu
10332632Sstever@eecs.umich.edu    $               reduce using rule 5
10342632Sstever@eecs.umich.edu    PLUS            reduce using rule 5
10352632Sstever@eecs.umich.edu    MINUS           reduce using rule 5
10362632Sstever@eecs.umich.edu    TIMES           reduce using rule 5
10372632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 5
10382632Sstever@eecs.umich.edu    RPAREN          reduce using rule 5
10392632Sstever@eecs.umich.edu
10402632Sstever@eecs.umich.edu
10412632Sstever@eecs.umich.edustate 4
10422632Sstever@eecs.umich.edu
10432632Sstever@eecs.umich.edu    expression -> expression TIMES . expression
10442632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
10452632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
10462632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
10472632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
10482632Sstever@eecs.umich.edu    expression -> . NUMBER
10492632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
10502632Sstever@eecs.umich.edu
10512632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
10522632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
10532632Sstever@eecs.umich.edu
10542632Sstever@eecs.umich.edu
10552632Sstever@eecs.umich.edustate 5
10562632Sstever@eecs.umich.edu
10572632Sstever@eecs.umich.edu    expression -> expression MINUS . expression
10582632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
10592632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
10602632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
10612632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
10622632Sstever@eecs.umich.edu    expression -> . NUMBER
10632632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
10642632Sstever@eecs.umich.edu
10652632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
10662632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
10672632Sstever@eecs.umich.edu
10682632Sstever@eecs.umich.edu
10692632Sstever@eecs.umich.edustate 6
10702632Sstever@eecs.umich.edu
10712632Sstever@eecs.umich.edu    expression -> expression PLUS . expression
10722632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
10732632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
10742632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
10752632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
10762632Sstever@eecs.umich.edu    expression -> . NUMBER
10772632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
10782632Sstever@eecs.umich.edu
10792632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
10802632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
10812632Sstever@eecs.umich.edu
10822632Sstever@eecs.umich.edu
10832632Sstever@eecs.umich.edustate 7
10842632Sstever@eecs.umich.edu
10852632Sstever@eecs.umich.edu    expression -> expression DIVIDE . expression
10862632Sstever@eecs.umich.edu    expression -> . expression PLUS expression
10872632Sstever@eecs.umich.edu    expression -> . expression MINUS expression
10882632Sstever@eecs.umich.edu    expression -> . expression TIMES expression
10892632Sstever@eecs.umich.edu    expression -> . expression DIVIDE expression
10902632Sstever@eecs.umich.edu    expression -> . NUMBER
10912632Sstever@eecs.umich.edu    expression -> . LPAREN expression RPAREN
10922632Sstever@eecs.umich.edu
10932632Sstever@eecs.umich.edu    NUMBER          shift and go to state 3
10942632Sstever@eecs.umich.edu    LPAREN          shift and go to state 2
10952632Sstever@eecs.umich.edu
10962632Sstever@eecs.umich.edu
10972632Sstever@eecs.umich.edustate 8
10982632Sstever@eecs.umich.edu
10992632Sstever@eecs.umich.edu    expression -> LPAREN expression . RPAREN
11002632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
11012632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
11022632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
11032632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
11042632Sstever@eecs.umich.edu
11052632Sstever@eecs.umich.edu    RPAREN          shift and go to state 13
11062632Sstever@eecs.umich.edu    PLUS            shift and go to state 6
11072632Sstever@eecs.umich.edu    MINUS           shift and go to state 5
11082632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
11092632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
11102632Sstever@eecs.umich.edu
11112632Sstever@eecs.umich.edu
11122632Sstever@eecs.umich.edustate 9
11132632Sstever@eecs.umich.edu
11142632Sstever@eecs.umich.edu    expression -> expression TIMES expression .
11152632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
11162632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
11172632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
11182632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
11192632Sstever@eecs.umich.edu
11202632Sstever@eecs.umich.edu    $               reduce using rule 3
11212632Sstever@eecs.umich.edu    PLUS            reduce using rule 3
11222632Sstever@eecs.umich.edu    MINUS           reduce using rule 3
11232632Sstever@eecs.umich.edu    TIMES           reduce using rule 3
11242632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 3
11252632Sstever@eecs.umich.edu    RPAREN          reduce using rule 3
11262632Sstever@eecs.umich.edu
11272632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
11282632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
11292632Sstever@eecs.umich.edu  ! TIMES           [ shift and go to state 4 ]
11302632Sstever@eecs.umich.edu  ! DIVIDE          [ shift and go to state 7 ]
11312632Sstever@eecs.umich.edu
11322632Sstever@eecs.umich.edustate 10
11332632Sstever@eecs.umich.edu
11342632Sstever@eecs.umich.edu    expression -> expression MINUS expression .
11352632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
11362632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
11372632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
11382632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
11392632Sstever@eecs.umich.edu
11402632Sstever@eecs.umich.edu    $               reduce using rule 2
11412632Sstever@eecs.umich.edu    PLUS            reduce using rule 2
11422632Sstever@eecs.umich.edu    MINUS           reduce using rule 2
11432632Sstever@eecs.umich.edu    RPAREN          reduce using rule 2
11442632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
11452632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
11462632Sstever@eecs.umich.edu
11472632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 2 ]
11482632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 2 ]
11492632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
11502632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
11512632Sstever@eecs.umich.edu
11522632Sstever@eecs.umich.edustate 11
11532632Sstever@eecs.umich.edu
11542632Sstever@eecs.umich.edu    expression -> expression PLUS expression .
11552632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
11562632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
11572632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
11582632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
11592632Sstever@eecs.umich.edu
11602632Sstever@eecs.umich.edu    $               reduce using rule 1
11612632Sstever@eecs.umich.edu    PLUS            reduce using rule 1
11622632Sstever@eecs.umich.edu    MINUS           reduce using rule 1
11632632Sstever@eecs.umich.edu    RPAREN          reduce using rule 1
11642632Sstever@eecs.umich.edu    TIMES           shift and go to state 4
11652632Sstever@eecs.umich.edu    DIVIDE          shift and go to state 7
11662632Sstever@eecs.umich.edu
11672632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 1 ]
11682632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 1 ]
11692632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
11702632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
11712632Sstever@eecs.umich.edu
11722632Sstever@eecs.umich.edustate 12
11732632Sstever@eecs.umich.edu
11742632Sstever@eecs.umich.edu    expression -> expression DIVIDE expression .
11752632Sstever@eecs.umich.edu    expression -> expression . PLUS expression
11762632Sstever@eecs.umich.edu    expression -> expression . MINUS expression
11772632Sstever@eecs.umich.edu    expression -> expression . TIMES expression
11782632Sstever@eecs.umich.edu    expression -> expression . DIVIDE expression
11792632Sstever@eecs.umich.edu
11802632Sstever@eecs.umich.edu    $               reduce using rule 4
11812632Sstever@eecs.umich.edu    PLUS            reduce using rule 4
11822632Sstever@eecs.umich.edu    MINUS           reduce using rule 4
11832632Sstever@eecs.umich.edu    TIMES           reduce using rule 4
11842632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 4
11852632Sstever@eecs.umich.edu    RPAREN          reduce using rule 4
11862632Sstever@eecs.umich.edu
11872632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
11882632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
11892632Sstever@eecs.umich.edu  ! TIMES           [ shift and go to state 4 ]
11902632Sstever@eecs.umich.edu  ! DIVIDE          [ shift and go to state 7 ]
11912632Sstever@eecs.umich.edu
11922632Sstever@eecs.umich.edustate 13
11932632Sstever@eecs.umich.edu
11942632Sstever@eecs.umich.edu    expression -> LPAREN expression RPAREN .
11952632Sstever@eecs.umich.edu
11962632Sstever@eecs.umich.edu    $               reduce using rule 6
11972632Sstever@eecs.umich.edu    PLUS            reduce using rule 6
11982632Sstever@eecs.umich.edu    MINUS           reduce using rule 6
11992632Sstever@eecs.umich.edu    TIMES           reduce using rule 6
12002632Sstever@eecs.umich.edu    DIVIDE          reduce using rule 6
12012632Sstever@eecs.umich.edu    RPAREN          reduce using rule 6
12022632Sstever@eecs.umich.edu</pre>
12032632Sstever@eecs.umich.edu</blockquote>
12042632Sstever@eecs.umich.edu
12052632Sstever@eecs.umich.eduIn the file, each state of the grammar is described.  Within each state the "." indicates the current
12062632Sstever@eecs.umich.edulocation of the parse within any applicable grammar rules.   In addition, the actions for each valid
12072632Sstever@eecs.umich.eduinput token are listed.   When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
12082632Sstever@eecs.umich.eduare prefixed with an !.  For example:
12092632Sstever@eecs.umich.edu
12102632Sstever@eecs.umich.edu<blockquote>
12112632Sstever@eecs.umich.edu<pre>
12122632Sstever@eecs.umich.edu  ! TIMES           [ reduce using rule 2 ]
12132632Sstever@eecs.umich.edu  ! DIVIDE          [ reduce using rule 2 ]
12142632Sstever@eecs.umich.edu  ! PLUS            [ shift and go to state 6 ]
12152632Sstever@eecs.umich.edu  ! MINUS           [ shift and go to state 5 ]
12162632Sstever@eecs.umich.edu</pre>
12172632Sstever@eecs.umich.edu</blockquote>
12182632Sstever@eecs.umich.edu
12192632Sstever@eecs.umich.eduBy looking at these rules (and with a little practice), you can usually track down the source
12202632Sstever@eecs.umich.eduof most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
12212632Sstever@eecs.umich.edubad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.
12222632Sstever@eecs.umich.edu  
12232632Sstever@eecs.umich.edu<h2>Syntax Error Handling</h2>
12242632Sstever@eecs.umich.edu
12252632Sstever@eecs.umich.eduWhen a syntax error occurs during parsing, the error is immediately
12262632Sstever@eecs.umich.edudetected (i.e., the parser does not read any more tokens beyond the
12272632Sstever@eecs.umich.edusource of the error).  Error recovery in LR parsers is a delicate
12282632Sstever@eecs.umich.edutopic that involves ancient rituals and black-magic.   The recovery mechanism
12292632Sstever@eecs.umich.eduprovided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want
12302632Sstever@eecs.umich.educonsult a book like O'Reilly's "Lex and Yacc" for some of the finer details.
12312632Sstever@eecs.umich.edu
12322632Sstever@eecs.umich.edu<p>
12332632Sstever@eecs.umich.eduWhen a syntax error occurs, <tt>yacc.py</tt> performs the following steps:
12342632Sstever@eecs.umich.edu
12352632Sstever@eecs.umich.edu<ol>
12362632Sstever@eecs.umich.edu<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
12372632Sstever@eecs.umich.eduis called with the offending token as an argument.  Afterwards, the parser enters
12382632Sstever@eecs.umich.eduan "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
12392632Sstever@eecs.umich.eduhas successfully shifted at least 3 tokens onto the parsing stack.
12402632Sstever@eecs.umich.edu
12412632Sstever@eecs.umich.edu<p>
12422632Sstever@eecs.umich.edu<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
12432632Sstever@eecs.umich.eduwith a special <tt>error</tt> token.
12442632Sstever@eecs.umich.edu
12452632Sstever@eecs.umich.edu<p>
12462632Sstever@eecs.umich.edu<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
12472632Sstever@eecs.umich.edudeleted.
12482632Sstever@eecs.umich.edu
12492632Sstever@eecs.umich.edu<p>
12502632Sstever@eecs.umich.edu<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
12512632Sstever@eecs.umich.eduparsing from its initial state.
12522632Sstever@eecs.umich.edu
12532632Sstever@eecs.umich.edu<p>
12542632Sstever@eecs.umich.edu<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
12552632Sstever@eecs.umich.edushifted onto the parsing stack.
12562632Sstever@eecs.umich.edu
12572632Sstever@eecs.umich.edu<p>
12582632Sstever@eecs.umich.edu<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
12592632Sstever@eecs.umich.eduparser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
12602632Sstever@eecs.umich.edu</ol>
12612632Sstever@eecs.umich.edu
12622632Sstever@eecs.umich.edu<h4>Recovery and resynchronization with error rules</h4>
12632632Sstever@eecs.umich.edu
12642632Sstever@eecs.umich.eduThe most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
12652632Sstever@eecs.umich.edutoken.  For example, suppose your language had a grammar rule for a print statement like this:
12662632Sstever@eecs.umich.edu
12672632Sstever@eecs.umich.edu<blockquote>
12682632Sstever@eecs.umich.edu<pre>
12692632Sstever@eecs.umich.edudef p_statement_print(t):
12702632Sstever@eecs.umich.edu     'statement : PRINT expr SEMI'
12712632Sstever@eecs.umich.edu     ...
12722632Sstever@eecs.umich.edu</pre>
12732632Sstever@eecs.umich.edu</blockquote>
12742632Sstever@eecs.umich.edu
12752632Sstever@eecs.umich.eduTo account for the possibility of a bad expression, you might write an additional grammar rule like this:
12762632Sstever@eecs.umich.edu
12772632Sstever@eecs.umich.edu<blockquote>
12782632Sstever@eecs.umich.edu<pre>
12792632Sstever@eecs.umich.edudef p_statement_print_error(t):
12802632Sstever@eecs.umich.edu     'statement : PRINT error SEMI'
12812632Sstever@eecs.umich.edu     print "Syntax error in print statement. Bad expression"
12822632Sstever@eecs.umich.edu
12832632Sstever@eecs.umich.edu</pre>
12842632Sstever@eecs.umich.edu</blockquote>
12852632Sstever@eecs.umich.edu
12862632Sstever@eecs.umich.eduIn this case, the <tt>error</tt> token will match any sequence of
12872632Sstever@eecs.umich.edutokens that might appear up to the first semicolon that is
12882632Sstever@eecs.umich.eduencountered.  Once the semicolon is reached, the rule will be
12892632Sstever@eecs.umich.eduinvoked and the <tt>error</tt> token will go away.
12902632Sstever@eecs.umich.edu
12912632Sstever@eecs.umich.edu<p>
12922632Sstever@eecs.umich.eduThis type of recovery is sometimes known as parser resynchronization.
12932632Sstever@eecs.umich.eduThe <tt>error</tt> token acts as a wildcard for any bad input text and
12942632Sstever@eecs.umich.eduthe token immediately following <tt>error</tt> acts as a
12952632Sstever@eecs.umich.edusynchronization token.
12962632Sstever@eecs.umich.edu
12972632Sstever@eecs.umich.edu<p>
12982632Sstever@eecs.umich.eduIt is important to note that the <tt>error</tt> token usually does not appear as the last token
12992632Sstever@eecs.umich.eduon the right in an error rule.  For example:
13002632Sstever@eecs.umich.edu
13012632Sstever@eecs.umich.edu<blockquote>
13022632Sstever@eecs.umich.edu<pre>
13032632Sstever@eecs.umich.edudef p_statement_print_error(t):
13042632Sstever@eecs.umich.edu    'statement : PRINT error'
13052632Sstever@eecs.umich.edu    print "Syntax error in print statement. Bad expression"
13062632Sstever@eecs.umich.edu</pre>
13072632Sstever@eecs.umich.edu</blockquote>
13082632Sstever@eecs.umich.edu
13092632Sstever@eecs.umich.eduThis is because the first bad token encountered will cause the rule to
13102632Sstever@eecs.umich.edube reduced--which may make it difficult to recover if more bad tokens
13112632Sstever@eecs.umich.eduimmediately follow.   
13122632Sstever@eecs.umich.edu
13132632Sstever@eecs.umich.edu<h4>Panic mode recovery</h4>
13142632Sstever@eecs.umich.edu
13152632Sstever@eecs.umich.eduAn alternative error recovery scheme is to enter a panic mode recovery in which tokens are
13162632Sstever@eecs.umich.edudiscarded to a point where the parser might be able to recover in some sensible manner.
13172632Sstever@eecs.umich.edu
13182632Sstever@eecs.umich.edu<p>
13192632Sstever@eecs.umich.eduPanic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
13202632Sstever@eecs.umich.edufunction starts discarding tokens until it reaches a closing '}'.  Then, it restarts the 
13212632Sstever@eecs.umich.eduparser in its initial state.
13222632Sstever@eecs.umich.edu
13232632Sstever@eecs.umich.edu<blockquote>
13242632Sstever@eecs.umich.edu<pre>
13252632Sstever@eecs.umich.edudef p_error(t):
13262632Sstever@eecs.umich.edu    print "Whoa. You are seriously hosed."
13272632Sstever@eecs.umich.edu    # Read ahead looking for a closing '}'
13282632Sstever@eecs.umich.edu    while 1:
13292632Sstever@eecs.umich.edu        tok = yacc.token()             # Get the next token
13302632Sstever@eecs.umich.edu        if not tok or tok.type == 'RBRACE': break
13312632Sstever@eecs.umich.edu    yacc.restart()
13322632Sstever@eecs.umich.edu</pre>
13332632Sstever@eecs.umich.edu</blockquote>
13342632Sstever@eecs.umich.edu
13352632Sstever@eecs.umich.edu<p>
13362632Sstever@eecs.umich.eduThis function simply discards the bad token and tells the parser that the error was ok.
13372632Sstever@eecs.umich.edu
13382632Sstever@eecs.umich.edu<blockquote>
13392632Sstever@eecs.umich.edu<pre>
13402632Sstever@eecs.umich.edudef p_error(t):
13412632Sstever@eecs.umich.edu    print "Syntax error at token", t.type
13422632Sstever@eecs.umich.edu    # Just discard the token and tell the parser it's okay.
13432632Sstever@eecs.umich.edu    yacc.errok()
13442632Sstever@eecs.umich.edu</pre>
13452632Sstever@eecs.umich.edu</blockquote>
13462632Sstever@eecs.umich.edu
13472632Sstever@eecs.umich.edu<P>
13482632Sstever@eecs.umich.eduWithin the <tt>p_error()</tt> function, three functions are available to control the behavior
13492632Sstever@eecs.umich.eduof the parser:
13502632Sstever@eecs.umich.edu<p>
13512632Sstever@eecs.umich.edu<ul>
13522632Sstever@eecs.umich.edu<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
13532632Sstever@eecs.umich.edumode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
13542632Sstever@eecs.umich.eduerror counters so that the next syntax error will call <tt>p_error()</tt> again.
13552632Sstever@eecs.umich.edu
13562632Sstever@eecs.umich.edu<p>
13572632Sstever@eecs.umich.edu<li><tt>yacc.token()</tt>.  This returns the next token on the input stream.
13582632Sstever@eecs.umich.edu
13592632Sstever@eecs.umich.edu<p>
13602632Sstever@eecs.umich.edu<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
13612632Sstever@eecs.umich.eduto its initial state. 
13622632Sstever@eecs.umich.edu</ul>
13632632Sstever@eecs.umich.edu
13642632Sstever@eecs.umich.eduNote: these functions are only available when invoking <tt>p_error()</tt> and are not available
13652632Sstever@eecs.umich.eduat any other time.
13662632Sstever@eecs.umich.edu
13672632Sstever@eecs.umich.edu<p>
13682632Sstever@eecs.umich.eduTo supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
13692632Sstever@eecs.umich.eduuseful if trying to synchronize on special characters.  For example:
13702632Sstever@eecs.umich.edu
13712632Sstever@eecs.umich.edu<blockquote>
13722632Sstever@eecs.umich.edu<pre>
13732632Sstever@eecs.umich.edudef p_error(t):
13742632Sstever@eecs.umich.edu    # Read ahead looking for a terminating ";"
13752632Sstever@eecs.umich.edu    while 1:
13762632Sstever@eecs.umich.edu        tok = yacc.token()             # Get the next token
13772632Sstever@eecs.umich.edu        if not tok or tok.type == 'SEMI': break
13782632Sstever@eecs.umich.edu    yacc.errok()
13792632Sstever@eecs.umich.edu
13802632Sstever@eecs.umich.edu    # Return SEMI to the parser as the next lookahead token
13812632Sstever@eecs.umich.edu    return tok  
13822632Sstever@eecs.umich.edu</pre>
13832632Sstever@eecs.umich.edu</blockquote>
13842632Sstever@eecs.umich.edu
13852632Sstever@eecs.umich.edu<h4>General comments on error handling</h4>
13862632Sstever@eecs.umich.edu
13872632Sstever@eecs.umich.eduFor normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable
13882632Sstever@eecs.umich.edutechnique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy 
13892632Sstever@eecs.umich.eduto recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
13902632Sstever@eecs.umich.eduto discard huge portions of the input text to find a valid restart point.
13912632Sstever@eecs.umich.edu
13922632Sstever@eecs.umich.edu<h2>Line Number Tracking</h2>
13932632Sstever@eecs.umich.edu
13942632Sstever@eecs.umich.edu<tt>yacc.py</tt> automatically tracks line numbers for all of the grammar symbols and tokens it processes.  To retrieve the line
13952632Sstever@eecs.umich.edunumbers, two functions are used in grammar rules:
13962632Sstever@eecs.umich.edu
13972632Sstever@eecs.umich.edu<ul>
13982632Sstever@eecs.umich.edu<li><tt>t.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>
13992632Sstever@eecs.umich.edu<li><tt>t.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
14002632Sstever@eecs.umich.edu</ul>
14012632Sstever@eecs.umich.edu
14022632Sstever@eecs.umich.eduFor example:
14032632Sstever@eecs.umich.edu
14042632Sstever@eecs.umich.edu<blockquote>
14052632Sstever@eecs.umich.edu<pre>
14062632Sstever@eecs.umich.edudef t_expression(t):
14072632Sstever@eecs.umich.edu    'expression : expression PLUS expression'
14082632Sstever@eecs.umich.edu    t.lineno(1)        # Line number of the left expression
14092632Sstever@eecs.umich.edu    t.lineno(2)        # line number of the PLUS operator
14102632Sstever@eecs.umich.edu    t.lineno(3)        # line number of the right expression
14112632Sstever@eecs.umich.edu    ...
14122632Sstever@eecs.umich.edu    start,end = t.linespan(3)    # Start,end lines of the right expression
14132632Sstever@eecs.umich.edu
14142632Sstever@eecs.umich.edu</pre>
14152632Sstever@eecs.umich.edu</blockquote>
14162632Sstever@eecs.umich.edu
14172632Sstever@eecs.umich.eduSince line numbers are managed internally by the parser, there is usually no need to modify the line
14182632Sstever@eecs.umich.edunumbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
14192632Sstever@eecs.umich.eduprivate copy.
14202632Sstever@eecs.umich.edu
14212632Sstever@eecs.umich.edu<h2>AST Construction</h2>
14222632Sstever@eecs.umich.edu
14232632Sstever@eecs.umich.edu<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
14242632Sstever@eecs.umich.educonstruction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
14252632Sstever@eecs.umich.eduand assign nodes to <tt>t[0]</tt> in each rule.
14262632Sstever@eecs.umich.edu
14272632Sstever@eecs.umich.eduFor example:
14282632Sstever@eecs.umich.edu
14292632Sstever@eecs.umich.edu<blockquote>
14302632Sstever@eecs.umich.edu<pre>
14312632Sstever@eecs.umich.educlass Expr: pass
14322632Sstever@eecs.umich.edu
14332632Sstever@eecs.umich.educlass BinOp(Expr):
14342632Sstever@eecs.umich.edu    def __init__(self,left,op,right):
14352632Sstever@eecs.umich.edu        self.type = "binop"
14362632Sstever@eecs.umich.edu        self.left = left
14372632Sstever@eecs.umich.edu        self.right = right
14382632Sstever@eecs.umich.edu        self.op = op
14392632Sstever@eecs.umich.edu
14402632Sstever@eecs.umich.educlass Number(Expr):
14412632Sstever@eecs.umich.edu    def __init__(self,value):
14422632Sstever@eecs.umich.edu        self.type = "number"
14432632Sstever@eecs.umich.edu        self.value = value
14442632Sstever@eecs.umich.edu
14452632Sstever@eecs.umich.edudef p_expression_binop(t):
14462632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
14472632Sstever@eecs.umich.edu                  | expression MINUS expression
14482632Sstever@eecs.umich.edu                  | expression TIMES expression
14492632Sstever@eecs.umich.edu                  | expression DIVIDE expression'''
14502632Sstever@eecs.umich.edu
14512632Sstever@eecs.umich.edu    t[0] = BinOp(t[1],t[2],t[3])
14522632Sstever@eecs.umich.edu
14532632Sstever@eecs.umich.edudef p_expression_group(t):
14542632Sstever@eecs.umich.edu    'expression : LPAREN expression RPAREN'
14552632Sstever@eecs.umich.edu    t[0] = t[2]
14562632Sstever@eecs.umich.edu
14572632Sstever@eecs.umich.edudef p_expression_number(t):
14582632Sstever@eecs.umich.edu    'expression : NUMBER'
14592632Sstever@eecs.umich.edu    t[0] = Number(t[1])
14602632Sstever@eecs.umich.edu</pre>
14612632Sstever@eecs.umich.edu</blockquote>
14622632Sstever@eecs.umich.edu
14632632Sstever@eecs.umich.eduTo simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
14642632Sstever@eecs.umich.eduFor example:
14652632Sstever@eecs.umich.edu
14662632Sstever@eecs.umich.edu<blockquote>
14672632Sstever@eecs.umich.edu<pre>
14682632Sstever@eecs.umich.educlass Node:
14692632Sstever@eecs.umich.edu    def __init__(self,type,children=None,leaf=None):
14702632Sstever@eecs.umich.edu         self.type = type
14712632Sstever@eecs.umich.edu         if children:
14722632Sstever@eecs.umich.edu              self.children = children
14732632Sstever@eecs.umich.edu         else:
14742632Sstever@eecs.umich.edu              self.children = [ ]
14752632Sstever@eecs.umich.edu         self.leaf = leaf
14762632Sstever@eecs.umich.edu	 
14772632Sstever@eecs.umich.edudef p_expression_binop(t):
14782632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
14792632Sstever@eecs.umich.edu                  | expression MINUS expression
14802632Sstever@eecs.umich.edu                  | expression TIMES expression
14812632Sstever@eecs.umich.edu                  | expression DIVIDE expression'''
14822632Sstever@eecs.umich.edu
14832632Sstever@eecs.umich.edu    t[0] = Node("binop", [t[1],t[3]], t[2])
14842632Sstever@eecs.umich.edu</pre>
14852632Sstever@eecs.umich.edu</blockquote>
14862632Sstever@eecs.umich.edu
14872632Sstever@eecs.umich.edu<h2>Yacc implementation notes</h2>
14882632Sstever@eecs.umich.edu
14892632Sstever@eecs.umich.edu<ul>
14902632Sstever@eecs.umich.edu<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
14912632Sstever@eecs.umich.educan be supplied as follows:
14922632Sstever@eecs.umich.edu
14932632Sstever@eecs.umich.edu<blockquote>
14942632Sstever@eecs.umich.edu<pre>
14952632Sstever@eecs.umich.eduyacc.parse(lexer=x)
14962632Sstever@eecs.umich.edu</pre>
14972632Sstever@eecs.umich.edu</blockquote>
14982632Sstever@eecs.umich.eduin this case, <tt>x</tt> must be a Lexer object that minimally has a <tt>x.token()</tt> method for retrieving the next
14992632Sstever@eecs.umich.edutoken.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.
15002632Sstever@eecs.umich.edu
15012632Sstever@eecs.umich.edu<p>
15022632Sstever@eecs.umich.edu<li>By default, the yacc generates tables in debugging mode (which produces the parser.out file and other output).
15032632Sstever@eecs.umich.eduTo disable this, use
15042632Sstever@eecs.umich.edu
15052632Sstever@eecs.umich.edu<blockquote>
15062632Sstever@eecs.umich.edu<pre>
15072632Sstever@eecs.umich.eduyacc.yacc(debug=0)
15082632Sstever@eecs.umich.edu</pre>
15092632Sstever@eecs.umich.edu</blockquote>
15102632Sstever@eecs.umich.edu
15112632Sstever@eecs.umich.edu<p>
15122632Sstever@eecs.umich.edu<li>To change the name of the <tt>parsetab.py</tt> file,  use:
15132632Sstever@eecs.umich.edu
15142632Sstever@eecs.umich.edu<blockquote>
15152632Sstever@eecs.umich.edu<pre>
15162632Sstever@eecs.umich.eduyacc.yacc(tabmodule="foo")
15172632Sstever@eecs.umich.edu</pre>
15182632Sstever@eecs.umich.edu</blockquote>
15192632Sstever@eecs.umich.edu
15202632Sstever@eecs.umich.edu<P>
15212632Sstever@eecs.umich.edu<li>To print copious amounts of debugging during parsing, use:
15222632Sstever@eecs.umich.edu
15232632Sstever@eecs.umich.edu<blockquote>
15242632Sstever@eecs.umich.edu<pre>
15252632Sstever@eecs.umich.eduyacc.parse(debug=1)
15262632Sstever@eecs.umich.edu</pre>
15272632Sstever@eecs.umich.edu</blockquote>
15282632Sstever@eecs.umich.edu
15292632Sstever@eecs.umich.edu<p>
15302632Sstever@eecs.umich.edu<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
15312632Sstever@eecs.umich.eduparsers in the same application, do this:
15322632Sstever@eecs.umich.edu
15332632Sstever@eecs.umich.edu<blockquote>
15342632Sstever@eecs.umich.edu<pre>
15352632Sstever@eecs.umich.edup = yacc.yacc()
15362632Sstever@eecs.umich.edu...
15372632Sstever@eecs.umich.edup.parse()
15382632Sstever@eecs.umich.edu</pre>
15392632Sstever@eecs.umich.edu</blockquote>
15402632Sstever@eecs.umich.edu
15412632Sstever@eecs.umich.eduNote: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.
15422632Sstever@eecs.umich.edu
15432632Sstever@eecs.umich.edu<p>
15442632Sstever@eecs.umich.edu<li>Since the generation of the SLR tables is relatively expensive, previously generated tables are
15452632Sstever@eecs.umich.educached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
15462632Sstever@eecs.umich.educhecksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.
15472632Sstever@eecs.umich.edu
15482632Sstever@eecs.umich.edu<p>
15492632Sstever@eecs.umich.eduIt should be noted that table generation is reasonably efficient, even for grammars that involve around a 100 rules
15502632Sstever@eecs.umich.eduand several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
15512632Sstever@eecs.umich.edumachine.  Please be patient.
15522632Sstever@eecs.umich.edu
15532632Sstever@eecs.umich.edu<p>
15542632Sstever@eecs.umich.edu<li>Since LR parsing is mostly driven by tables, the performance of the parser is largely independent of the
15552632Sstever@eecs.umich.edusize of the grammar.   The biggest bottlenecks will be the lexer and the complexity of your grammar rules.
15562632Sstever@eecs.umich.edu</ul>
15572632Sstever@eecs.umich.edu
15582632Sstever@eecs.umich.edu<h2>Parser and Lexer State Management</h2>
15592632Sstever@eecs.umich.edu
15602632Sstever@eecs.umich.eduIn advanced parsing applications, you may want to have multiple
15612632Sstever@eecs.umich.eduparsers and lexers.  Furthermore, the parser may want to control the
15622632Sstever@eecs.umich.edubehavior of the lexer in some way.
15632632Sstever@eecs.umich.edu
15642632Sstever@eecs.umich.edu<p>
15652632Sstever@eecs.umich.eduTo do this, it is important to note that both the lexer and parser are
15662632Sstever@eecs.umich.eduactually implemented as objects.   These objects are returned by the
15672632Sstever@eecs.umich.edu<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.  For example:
15682632Sstever@eecs.umich.edu
15692632Sstever@eecs.umich.edu<blockquote>
15702632Sstever@eecs.umich.edu<pre>
15712632Sstever@eecs.umich.edulexer  = lex.lex()       # Return lexer object
15722632Sstever@eecs.umich.eduparser = yacc.yacc()     # Return parser object
15732632Sstever@eecs.umich.edu</pre>
15742632Sstever@eecs.umich.edu</blockquote>
15752632Sstever@eecs.umich.edu
15762632Sstever@eecs.umich.eduWithin lexer and parser rules, these objects are also available.  In the lexer,
15772632Sstever@eecs.umich.eduthe "lexer" attribute of a token refers to the lexer object in use.  For example:
15782632Sstever@eecs.umich.edu
15792632Sstever@eecs.umich.edu<blockquote>
15802632Sstever@eecs.umich.edu<pre>
15812632Sstever@eecs.umich.edudef t_NUMBER(t):
15822632Sstever@eecs.umich.edu   r'\d+'
15832632Sstever@eecs.umich.edu   ...
15842632Sstever@eecs.umich.edu   print t.lexer           # Show lexer object
15852632Sstever@eecs.umich.edu</pre>
15862632Sstever@eecs.umich.edu</blockquote>
15872632Sstever@eecs.umich.edu
15882632Sstever@eecs.umich.eduIn the parser, the "lexer" and "parser" attributes refer to the lexer
15892632Sstever@eecs.umich.eduand parser objects respectively.
15902632Sstever@eecs.umich.edu
15912632Sstever@eecs.umich.edu<blockquote>
15922632Sstever@eecs.umich.edu<pre>
15932632Sstever@eecs.umich.edudef p_expr_plus(t):
15942632Sstever@eecs.umich.edu   'expr : expr PLUS expr'
15952632Sstever@eecs.umich.edu   ...
15962632Sstever@eecs.umich.edu   print t.parser          # Show parser object
15972632Sstever@eecs.umich.edu   print t.lexer           # Show lexer object
15982632Sstever@eecs.umich.edu</pre>
15992632Sstever@eecs.umich.edu</blockquote>
16002632Sstever@eecs.umich.edu
16012632Sstever@eecs.umich.eduIf necessary, arbitrary attributes can be attached to the lexer or parser object.
16022632Sstever@eecs.umich.eduFor example, if you wanted to have different parsing modes, you could attach a mode
16032632Sstever@eecs.umich.eduattribute to the parser object and look at it later.
16042632Sstever@eecs.umich.edu
16052632Sstever@eecs.umich.edu<h2>Using Python's Optimized Mode</h2>
16062632Sstever@eecs.umich.edu
16072632Sstever@eecs.umich.eduBecause PLY uses information from doc-strings, parsing and lexing
16082632Sstever@eecs.umich.eduinformation must be gathered while running the Python interpreter in
16092632Sstever@eecs.umich.edunormal mode (i.e., not with the -O or -OO options).  However, if you
16102632Sstever@eecs.umich.eduspecify optimized mode like this:
16112632Sstever@eecs.umich.edu
16122632Sstever@eecs.umich.edu<blockquote>
16132632Sstever@eecs.umich.edu<pre>
16142632Sstever@eecs.umich.edulex.lex(optimize=1)
16152632Sstever@eecs.umich.eduyacc.yacc(optimize=1)
16162632Sstever@eecs.umich.edu</pre>
16172632Sstever@eecs.umich.edu</blockquote>
16182632Sstever@eecs.umich.edu
16192632Sstever@eecs.umich.eduthen PLY can later be used when Python runs in optimized mode. To make this work,
16202632Sstever@eecs.umich.edumake sure you first run Python in normal mode.  Once the lexing and parsing tables
16212632Sstever@eecs.umich.eduhave been generated the first time, run Python in optimized mode. PLY will use
16222632Sstever@eecs.umich.eduthe tables without the need for doc strings.
16232632Sstever@eecs.umich.edu
16242632Sstever@eecs.umich.edu<p>
16252632Sstever@eecs.umich.eduBeware: running PLY in optimized mode disables a lot of error
16262632Sstever@eecs.umich.educhecking.  You should only do this when your project has stabilized
16272632Sstever@eecs.umich.eduand you don't need to do any debugging.
16282632Sstever@eecs.umich.edu  
16292632Sstever@eecs.umich.edu<h2>Where to go from here?</h2>
16302632Sstever@eecs.umich.edu
16312632Sstever@eecs.umich.eduThe <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
16322632Sstever@eecs.umich.educompilers textbook for the theory and underlying implementation details or LR parsing.
16332632Sstever@eecs.umich.edu
16342632Sstever@eecs.umich.edu</body>
16352632Sstever@eecs.umich.edu</html>
16362632Sstever@eecs.umich.edu
16372632Sstever@eecs.umich.edu
16382632Sstever@eecs.umich.edu
16392632Sstever@eecs.umich.edu
16402632Sstever@eecs.umich.edu
16412632Sstever@eecs.umich.edu
16422632Sstever@eecs.umich.edu
1643