ply.html revision 2632
12632Sstever@eecs.umich.edu<html>
22632Sstever@eecs.umich.edu<head>
32632Sstever@eecs.umich.edu<title>PLY (Python Lex-Yacc)</title>
42632Sstever@eecs.umich.edu</head>
52632Sstever@eecs.umich.edu<body bgcolor="#ffffff">
62632Sstever@eecs.umich.edu
72632Sstever@eecs.umich.edu<h1>PLY (Python Lex-Yacc)</h1>
84479Sbinkertn@umich.edu
92632Sstever@eecs.umich.edu<b>
102632Sstever@eecs.umich.eduDavid M. Beazley <br>
114479Sbinkertn@umich.eduDepartment of Computer Science <br>
122632Sstever@eecs.umich.eduUniversity of Chicago <br>
132632Sstever@eecs.umich.eduChicago, IL 60637 <br>
142632Sstever@eecs.umich.edubeazley@cs.uchicago.edu <br>
154479Sbinkertn@umich.edu</b>
162632Sstever@eecs.umich.edu
174479Sbinkertn@umich.edu<p>
184479Sbinkertn@umich.eduDocumentation version: $Header: /home/stever/bk/newmem2/ext/ply/doc/ply.html 1.1 03/06/06 14:53:34-00:00 stever@ $
194479Sbinkertn@umich.edu
204479Sbinkertn@umich.edu<h2>Introduction</h2>
214479Sbinkertn@umich.edu
224479Sbinkertn@umich.eduPLY is a Python-only implementation of the popular compiler
234479Sbinkertn@umich.educonstruction tools lex and yacc.  The implementation borrows ideas
244479Sbinkertn@umich.edufrom a number of previous efforts; most notably John Aycock's SPARK
254479Sbinkertn@umich.edutoolkit.  However, the overall flavor of the implementation is more
264479Sbinkertn@umich.educlosely modeled after the C version of lex and yacc.  The other
274479Sbinkertn@umich.edusignificant feature of PLY is that it provides extensive input
284479Sbinkertn@umich.eduvalidation and error reporting--much more so than other Python parsing
294479Sbinkertn@umich.edutools.
304479Sbinkertn@umich.edu
314479Sbinkertn@umich.edu<p>
324479Sbinkertn@umich.eduEarly versions of PLY were developed to support the Introduction to
334479Sbinkertn@umich.eduCompilers Course at the University of Chicago.  In this course,
344479Sbinkertn@umich.edustudents built a fully functional compiler for a simple Pascal-like
354479Sbinkertn@umich.edulanguage.  Their compiler, implemented entirely in Python, had to
364479Sbinkertn@umich.eduinclude lexical analysis, parsing, type checking, type inference,
374479Sbinkertn@umich.edunested scoping, and code generation for the SPARC processor.
384479Sbinkertn@umich.eduApproximately 30 different compiler implementations were completed in
394479Sbinkertn@umich.eduthis course.  Most of PLY's interface and operation has been motivated by common
404479Sbinkertn@umich.eduusability problems encountered by students.
414479Sbinkertn@umich.edu
424479Sbinkertn@umich.edu<p>
434479Sbinkertn@umich.eduBecause PLY was primarily developed as an instructional tool, you will
444479Sbinkertn@umich.edufind it to be <em>MUCH</em> more picky about token and grammar rule
454479Sbinkertn@umich.eduspecification than most other Python parsing tools.  In part, this
464479Sbinkertn@umich.eduadded formality is meant to catch common programming mistakes made by
474479Sbinkertn@umich.edunovice users.  However, advanced users will also find such features to
484479Sbinkertn@umich.edube useful when building complicated grammars for real programming
494479Sbinkertn@umich.edulanguages.    It should also be noted that PLY does not provide much in the way
504479Sbinkertn@umich.eduof bells and whistles (e.g., automatic construction of abstract syntax trees,
514479Sbinkertn@umich.edutree traversal, etc.).   Instead, you will find a bare-bones, yet
524479Sbinkertn@umich.edufully capable lex/yacc implementation written entirely in Python.
534479Sbinkertn@umich.edu
544479Sbinkertn@umich.edu<p>
554479Sbinkertn@umich.eduThe rest of this document assumes that you are somewhat familar with
564479Sbinkertn@umich.eduparsing theory, syntax directed translation, and automatic tools such
574479Sbinkertn@umich.eduas lex and yacc. If you are unfamilar with these topics, you will
584479Sbinkertn@umich.eduprobably want to consult an introductory text such as "Compilers:
594479Sbinkertn@umich.eduPrinciples, Techniques, and Tools", by Aho, Sethi, and Ullman.  "Lex
604479Sbinkertn@umich.eduand Yacc" by John Levine may also be handy.
614479Sbinkertn@umich.edu
624479Sbinkertn@umich.edu<h2>PLY Overview</h2>
634479Sbinkertn@umich.edu
644479Sbinkertn@umich.eduPLY consists of two separate tools; <tt>lex.py</tt> and
654479Sbinkertn@umich.edu<tt>yacc.py</tt>.  <tt>lex.py</tt> is used to break input text into a
664479Sbinkertn@umich.educollection of tokens specified by a collection of regular expression
674479Sbinkertn@umich.edurules.  <tt>yacc.py</tt> is used to recognize language syntax that has
684479Sbinkertn@umich.edubeen specified in the form of a context free grammar.  Currently,
694479Sbinkertn@umich.edu<tt>yacc.py</tt> uses LR parsing and generates its parsing tables
704479Sbinkertn@umich.eduusing the SLR algorithm.  LALR(1) parsing may be supported in a future
714479Sbinkertn@umich.edurelease. 
724479Sbinkertn@umich.edu
734479Sbinkertn@umich.edu<p>
744479Sbinkertn@umich.eduThe two tools are meant to work together.  Specifically,
754479Sbinkertn@umich.edu<tt>lex.py</tt> provides an external interface in the form of a
764479Sbinkertn@umich.edu<tt>token()</tt> function that returns the next valid token on the
774479Sbinkertn@umich.eduinput stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
784479Sbinkertn@umich.edutokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
794479Sbinkertn@umich.eduoften an Abstract Syntax Tree (AST).  However, this is entirely up to
804479Sbinkertn@umich.eduthe user.  If desired, <tt>yacc.py</tt> can also be used to implement
814479Sbinkertn@umich.edusimple one-pass compilers.
824479Sbinkertn@umich.edu
834479Sbinkertn@umich.edu<p>
844479Sbinkertn@umich.eduLike its Unix counterpart, <tt>yacc.py</tt> provides most of the
854479Sbinkertn@umich.edufeatures you expect including extensive error checking, grammar
864479Sbinkertn@umich.eduvalidation, support for empty productions, error tokens, and ambiguity
874479Sbinkertn@umich.eduresolution via precedence rules.  The primary difference between
884479Sbinkertn@umich.edu<tt>yacc.py</tt> and <tt>yacc</tt> is the use of SLR parsing instead
894479Sbinkertn@umich.eduof LALR(1).  Although this slightly restricts the types of grammars
904479Sbinkertn@umich.eduthan can be successfully parsed, it is sufficiently powerful to handle most
914479Sbinkertn@umich.edukinds of normal programming language constructs.
922632Sstever@eecs.umich.edu
932632Sstever@eecs.umich.edu<p>
942632Sstever@eecs.umich.eduFinally, it is important to note that PLY relies on reflection
952632Sstever@eecs.umich.edu(introspection) to build its lexers and parsers.  Unlike traditional
962632Sstever@eecs.umich.edulex/yacc which require a special input file that is converted into a
974479Sbinkertn@umich.eduseparate source file, the specifications given to PLY <em>are</em>
982632Sstever@eecs.umich.eduvalid Python programs.  This means that there are no extra source
992632Sstever@eecs.umich.edufiles nor is there a special compiler construction step (e.g., running
1002632Sstever@eecs.umich.eduyacc to generate Python code for the compiler). 
1014479Sbinkertn@umich.edu
1024479Sbinkertn@umich.edu<h2>Lex Example</h2>
1034479Sbinkertn@umich.edu
1042632Sstever@eecs.umich.edu<tt>lex.py</tt> is used to write tokenizers.  To do this, each token
1052632Sstever@eecs.umich.edumust be defined by a regular expression rule.  The following file
1062632Sstever@eecs.umich.eduimplements a very simple lexer for tokenizing simple integer expressions:
1074479Sbinkertn@umich.edu
1084479Sbinkertn@umich.edu<blockquote>
1094479Sbinkertn@umich.edu<pre>
1104479Sbinkertn@umich.edu# ------------------------------------------------------------
1112632Sstever@eecs.umich.edu# calclex.py
1122632Sstever@eecs.umich.edu#
1132632Sstever@eecs.umich.edu# tokenizer for a simple expression evaluator for
1142632Sstever@eecs.umich.edu# numbers and +,-,*,/
1154479Sbinkertn@umich.edu# ------------------------------------------------------------
1164479Sbinkertn@umich.eduimport lex
1174479Sbinkertn@umich.edu
1184479Sbinkertn@umich.edu# List of token names.   This is always required
1194479Sbinkertn@umich.edutokens = (
1204479Sbinkertn@umich.edu   'NUMBER',
1214479Sbinkertn@umich.edu   'PLUS',
1224479Sbinkertn@umich.edu   'MINUS',
1234479Sbinkertn@umich.edu   'TIMES',
1244479Sbinkertn@umich.edu   'DIVIDE',
1254479Sbinkertn@umich.edu   'LPAREN',
1264479Sbinkertn@umich.edu   'RPAREN',
1274479Sbinkertn@umich.edu)
1284479Sbinkertn@umich.edu
1292632Sstever@eecs.umich.edu# Regular expression rules for simple tokens
1302632Sstever@eecs.umich.edut_PLUS    = r'\+'
1314479Sbinkertn@umich.edut_MINUS   = r'-'
1324479Sbinkertn@umich.edut_TIMES   = r'\*'
1332632Sstever@eecs.umich.edut_DIVIDE  = r'/'
1342632Sstever@eecs.umich.edut_LPAREN  = r'\('
1352632Sstever@eecs.umich.edut_RPAREN  = r'\)'
1362632Sstever@eecs.umich.edu
1372632Sstever@eecs.umich.edu# A regular expression rule with some action code
1382632Sstever@eecs.umich.edudef t_NUMBER(t):
1392632Sstever@eecs.umich.edu    r'\d+'
1402632Sstever@eecs.umich.edu    try:
1412632Sstever@eecs.umich.edu         t.value = int(t.value)    
1424479Sbinkertn@umich.edu    except ValueError:
1432632Sstever@eecs.umich.edu         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
1442632Sstever@eecs.umich.edu	 t.value = 0
1452632Sstever@eecs.umich.edu    return t
1462632Sstever@eecs.umich.edu
1472632Sstever@eecs.umich.edu# Define a rule so we can track line numbers
1484479Sbinkertn@umich.edudef t_newline(t):
1494479Sbinkertn@umich.edu    r'\n+'
1502632Sstever@eecs.umich.edu    t.lineno += len(t.value)
1512632Sstever@eecs.umich.edu
1524479Sbinkertn@umich.edu# A string containing ignored characters (spaces and tabs)
1534479Sbinkertn@umich.edut_ignore  = ' \t'
1544479Sbinkertn@umich.edu
1554479Sbinkertn@umich.edu# Error handling rule
1564479Sbinkertn@umich.edudef t_error(t):
1574479Sbinkertn@umich.edu    print "Illegal character '%s'" % t.value[0]
1584479Sbinkertn@umich.edu    t.skip(1)
1594479Sbinkertn@umich.edu
1604479Sbinkertn@umich.edu# Build the lexer
1614479Sbinkertn@umich.edulex.lex()
1624479Sbinkertn@umich.edu
1634479Sbinkertn@umich.edu# Test it out
1644479Sbinkertn@umich.edudata = '''
1654479Sbinkertn@umich.edu3 + 4 * 10
1664479Sbinkertn@umich.edu  + -20 *2
1674479Sbinkertn@umich.edu'''
1684479Sbinkertn@umich.edu
1694479Sbinkertn@umich.edu# Give the lexer some input
1704479Sbinkertn@umich.edulex.input(data)
1714479Sbinkertn@umich.edu
1724479Sbinkertn@umich.edu# Tokenize
1734479Sbinkertn@umich.eduwhile 1:
1744479Sbinkertn@umich.edu    tok = lex.token()
1754479Sbinkertn@umich.edu    if not tok: break      # No more input
1764479Sbinkertn@umich.edu    print tok
1774479Sbinkertn@umich.edu</pre>
1784479Sbinkertn@umich.edu</blockquote>
1794479Sbinkertn@umich.edu
1804479Sbinkertn@umich.eduIn the example, the <tt>tokens</tt> list defines all of the possible
1814479Sbinkertn@umich.edutoken names that can be produced by the lexer.  This list is always required
1824479Sbinkertn@umich.eduand is used to perform a variety of validation checks.  Following the <tt>tokens</tt>
1834479Sbinkertn@umich.edulist, regular expressions are written for each token.  Each of these
1844479Sbinkertn@umich.edurules are defined by making declarations with a special prefix <tt>t_</tt> to indicate that it
1854479Sbinkertn@umich.edudefines a token.  For simple tokens, the regular expression can
1864479Sbinkertn@umich.edube specified as strings such as this (note: Python raw strings are used since they are the
1874479Sbinkertn@umich.edumost convenient way to write regular expression strings):
1884479Sbinkertn@umich.edu
1894479Sbinkertn@umich.edu<blockquote>
1904479Sbinkertn@umich.edu<pre>
1914479Sbinkertn@umich.edut_PLUS = r'\+'
1924479Sbinkertn@umich.edu</pre>
1934479Sbinkertn@umich.edu</blockquote>
1944479Sbinkertn@umich.edu
1954479Sbinkertn@umich.eduIn this case, the name following the <tt>t_</tt> must exactly match one of the
1964479Sbinkertn@umich.edunames supplied in <tt>tokens</tt>.   If some kind of action needs to be performed,
1974479Sbinkertn@umich.edua token rule can be specified as a function.  For example:
1984479Sbinkertn@umich.edu
1994479Sbinkertn@umich.edu<blockquote>
2004479Sbinkertn@umich.edu<pre>
2014479Sbinkertn@umich.edudef t_NUMBER(t):
2024479Sbinkertn@umich.edu    r'\d+'
2034479Sbinkertn@umich.edu    try:
2044479Sbinkertn@umich.edu         t.value = int(t.value)
2054479Sbinkertn@umich.edu    except ValueError:
2064479Sbinkertn@umich.edu         print "Number %s is too large!" % t.value
2074479Sbinkertn@umich.edu	 t.value = 0
2084479Sbinkertn@umich.edu    return t
2094479Sbinkertn@umich.edu</pre>
2104479Sbinkertn@umich.edu</blockquote>
2114479Sbinkertn@umich.edu
2124479Sbinkertn@umich.eduIn this case, the regular expression rule is specified in the function documentation string. 
2132632Sstever@eecs.umich.eduThe function always takes a single argument which is an instance of 
2142632Sstever@eecs.umich.edu<tt>LexToken</tt>.   This object has attributes of <tt>t.type</tt> which is the token type,
2152632Sstever@eecs.umich.edu<tt>t.value</tt> which is the lexeme, and <tt>t.lineno</tt> which is the current line number.
2162632Sstever@eecs.umich.eduBy default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
2172632Sstever@eecs.umich.edufunction can modify the contents of the <tt>LexToken</tt> object as appropriate.  However, 
2182632Sstever@eecs.umich.eduwhen it is done, the resulting token should be returned.  If no value is returned by the action
2192632Sstever@eecs.umich.edufunction, the token is simply discarded and the next token read.
2202632Sstever@eecs.umich.edu
2212632Sstever@eecs.umich.edu<p>
2224479Sbinkertn@umich.eduThe rule <tt>t_newline()</tt> illustrates a regular expression rule
2232632Sstever@eecs.umich.edufor a discarded token.  In this case, a rule is written to match
2242632Sstever@eecs.umich.edunewlines so that proper line number tracking can be performed.
2252632Sstever@eecs.umich.eduBy returning no value, the function causes the newline character to be 
2262632Sstever@eecs.umich.edudiscarded.
2272632Sstever@eecs.umich.edu
2282632Sstever@eecs.umich.edu<p>
2292632Sstever@eecs.umich.eduThe special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
2302632Sstever@eecs.umich.eduthat should be completely ignored in the input stream. 
2312632Sstever@eecs.umich.eduUsually this is used to skip over whitespace and other non-essential characters. 
2322632Sstever@eecs.umich.eduAlthough it is possible to define a regular expression rule for whitespace in a manner
2332632Sstever@eecs.umich.edusimilar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
2342632Sstever@eecs.umich.edulexing performance because it is handled as a special case and is checked in a much
2352632Sstever@eecs.umich.edumore efficient manner than the normal regular expression rules.
2362632Sstever@eecs.umich.edu
2372632Sstever@eecs.umich.edu<p>
2382632Sstever@eecs.umich.eduFinally, the <tt>t_error()</tt>
2392632Sstever@eecs.umich.edufunction is used to handle lexing errors that occur when illegal
2402632Sstever@eecs.umich.educharacters are detected.  In this case, the <tt>t.value</tt> attribute contains the
2412632Sstever@eecs.umich.edurest of the input string that has not been tokenized.  In the example, we simply print
2422632Sstever@eecs.umich.eduthe offending character and skip ahead one character by calling <tt>t.skip(1)</tt>.
2432632Sstever@eecs.umich.edu
2442632Sstever@eecs.umich.edu<p>
2452632Sstever@eecs.umich.eduTo build the lexer, the function <tt>lex.lex()</tt> is used.  This function
2462632Sstever@eecs.umich.eduuses Python reflection (or introspection) to read the the regular expression rules
2472632Sstever@eecs.umich.eduout of the calling context and build the lexer. Once the lexer has been built, two functions can
2482632Sstever@eecs.umich.edube used to control the lexer.
2492632Sstever@eecs.umich.edu
2502632Sstever@eecs.umich.edu<ul>
2512632Sstever@eecs.umich.edu<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
2522632Sstever@eecs.umich.edu<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
2532632Sstever@eecs.umich.eduNone if the end of the input text has been reached.
2542632Sstever@eecs.umich.edu</ul>
2552632Sstever@eecs.umich.edu
2564479Sbinkertn@umich.eduThe code at the bottom of the example shows how the lexer is actually used.  When executed,
2572632Sstever@eecs.umich.eduthe following output will be produced:
2582632Sstever@eecs.umich.edu
2592632Sstever@eecs.umich.edu<blockquote>
2602632Sstever@eecs.umich.edu<pre>
2612632Sstever@eecs.umich.edu$ python example.py
2622632Sstever@eecs.umich.eduLexToken(NUMBER,3,2)
2632632Sstever@eecs.umich.eduLexToken(PLUS,'+',2)
2644479Sbinkertn@umich.eduLexToken(NUMBER,4,2)
2652632Sstever@eecs.umich.eduLexToken(TIMES,'*',2)
2662632Sstever@eecs.umich.eduLexToken(NUMBER,10,2)
2672632Sstever@eecs.umich.eduLexToken(PLUS,'+',3)
2682632Sstever@eecs.umich.eduLexToken(MINUS,'-',3)
2694479Sbinkertn@umich.eduLexToken(NUMBER,20,3)
2704479Sbinkertn@umich.eduLexToken(TIMES,'*',3)
2714479Sbinkertn@umich.eduLexToken(NUMBER,2,3)
2724479Sbinkertn@umich.edu</pre>
2734479Sbinkertn@umich.edu</blockquote>
2744479Sbinkertn@umich.edu
2754479Sbinkertn@umich.edu<h2>Lex Implementation Notes</h2>
2762632Sstever@eecs.umich.edu
2772632Sstever@eecs.umich.edu<ul>
2782632Sstever@eecs.umich.edu<li><tt>lex.py</tt> uses the <tt>re</tt> module to do its patten matching.  When building the master regular expression,
2792632Sstever@eecs.umich.edurules are added in the following order:
2802632Sstever@eecs.umich.edu<p>
2812632Sstever@eecs.umich.edu<ol>
2822632Sstever@eecs.umich.edu<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
2832632Sstever@eecs.umich.edu<li>Tokens defined by strings are added by sorting them in order of decreasing regular expression length (longer expressions
2842632Sstever@eecs.umich.eduare added first).
2852632Sstever@eecs.umich.edu</ol>
2862632Sstever@eecs.umich.edu<p>
2872632Sstever@eecs.umich.eduWithout this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you 
2882632Sstever@eecs.umich.eduwanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
2892632Sstever@eecs.umich.eduexpressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
2902632Sstever@eecs.umich.eduthe order can be explicitly controlled since rules appearing first are checked first.
2912632Sstever@eecs.umich.edu
2922632Sstever@eecs.umich.edu<P>
2934479Sbinkertn@umich.edu<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this 
2944479Sbinkertn@umich.edurarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
2954479Sbinkertn@umich.edusuch as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.
2964479Sbinkertn@umich.edu
2974479Sbinkertn@umich.edu<p>
2984479Sbinkertn@umich.edu<li>
2994479Sbinkertn@umich.eduTo handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
3004479Sbinkertn@umich.edulike this:
3014479Sbinkertn@umich.edu
3024479Sbinkertn@umich.edu<blockquote>
3034479Sbinkertn@umich.edu<pre>
3044479Sbinkertn@umich.edureserved = {
3054479Sbinkertn@umich.edu   'if' : 'IF',
3064479Sbinkertn@umich.edu   'then' : 'THEN',
3074479Sbinkertn@umich.edu   'else' : 'ELSE',
3084479Sbinkertn@umich.edu   'while' : 'WHILE',
3094479Sbinkertn@umich.edu   ...
3104479Sbinkertn@umich.edu}
3114479Sbinkertn@umich.edu
3124479Sbinkertn@umich.edudef t_ID(t):
3134479Sbinkertn@umich.edu    r'[a-zA-Z_][a-zA-Z_0-9]*'
3144479Sbinkertn@umich.edu    t.type = reserved.get(t.value,'ID')    # Check for reserved words
3154479Sbinkertn@umich.edu    return t
3164479Sbinkertn@umich.edu</pre>
3174479Sbinkertn@umich.edu</blockquote>
3184479Sbinkertn@umich.edu
3194479Sbinkertn@umich.edu<p>
3204479Sbinkertn@umich.edu<li>The lexer requires tokens to be defined as class instances with <tt>t.type</tt>, <tt>t.value</tt>, and <tt>t.lineno</tt>
3214479Sbinkertn@umich.eduattributes.   By default, tokens are created as instances of the <tt>LexToken</tt> class defined internally to <tt>lex.py</tt>.
3224479Sbinkertn@umich.eduIf desired, you can create new kinds of tokens provided that they have the three required attributes.   However,
3234479Sbinkertn@umich.eduin practice, it is probably safer to stick with the default.
3244479Sbinkertn@umich.edu
3254479Sbinkertn@umich.edu<p>
3264479Sbinkertn@umich.edu<li>The only safe attribute for assigning token properties is <tt>t.value</tt>.   In some cases, you may want to attach
3274479Sbinkertn@umich.edua number of different properties to a token (e.g., symbol table entries for identifiers).  To do this, replace <tt>t.value</tt>
3284479Sbinkertn@umich.eduwith a tuple or class instance. For example:
3294479Sbinkertn@umich.edu
3304479Sbinkertn@umich.edu<blockquote>
3314479Sbinkertn@umich.edu<pre>
3324479Sbinkertn@umich.edudef t_ID(t):
3334479Sbinkertn@umich.edu    ...
3344479Sbinkertn@umich.edu    # For identifiers, create a (lexeme, symtab) tuple
3354479Sbinkertn@umich.edu    t.value = (t.value, symbol_lookup(t.value))
3364479Sbinkertn@umich.edu    ...
3374479Sbinkertn@umich.edu    return t
3384479Sbinkertn@umich.edu</pre>
3394479Sbinkertn@umich.edu</blockquote>
3404479Sbinkertn@umich.edu
3414479Sbinkertn@umich.eduAlthough allowed, do NOT assign additional attributes to the token object.  For example,
3424479Sbinkertn@umich.edu<blockquote>
3434479Sbinkertn@umich.edu<pre>
3444479Sbinkertn@umich.edudef t_ID(t):
3454479Sbinkertn@umich.edu    ...
3464479Sbinkertn@umich.edu    # Bad implementation of above
3474479Sbinkertn@umich.edu    t.symtab = symbol_lookup(t.value)
3484479Sbinkertn@umich.edu    ...
3494479Sbinkertn@umich.edu</pre>
3504479Sbinkertn@umich.edu</blockquote>
3514479Sbinkertn@umich.edu
3524479Sbinkertn@umich.eduThe reason you don't want to do this is that the <tt>yacc.py</tt>
3534479Sbinkertn@umich.edumodule only provides public access to the <tt>t.value</tt> attribute of each token.
3544479Sbinkertn@umich.eduTherefore, any other attributes you assign are inaccessible (if you are familiar
3554479Sbinkertn@umich.eduwith the internals of C lex/yacc, <tt>t.value</tt> is the same as <tt>yylval.tok</tt>).
3564479Sbinkertn@umich.edu
3574479Sbinkertn@umich.edu<p>
3584479Sbinkertn@umich.edu<li>To track line numbers, the lexer internally maintains a line
3594479Sbinkertn@umich.edunumber variable.  Each token automatically gets the value of the
3604479Sbinkertn@umich.educurrent line number in the <tt>t.lineno</tt> attribute. To modify the
3614479Sbinkertn@umich.educurrent line number, simply change the <tt>t.lineno</tt> attribute
3624479Sbinkertn@umich.eduin a function rule (as previously shown for
3632632Sstever@eecs.umich.edu<tt>t_newline()</tt>).  Even if the resulting token is discarded,
3642632Sstever@eecs.umich.educhanges to the line number remain in effect for subsequent tokens.
3652632Sstever@eecs.umich.edu
3662632Sstever@eecs.umich.edu<p>
3672632Sstever@eecs.umich.edu<li>To support multiple scanners in the same application, the <tt>lex.lex()</tt> function
3682632Sstever@eecs.umich.eduactually returns a special <tt>Lexer</tt> object.   This object has two methods 
3692632Sstever@eecs.umich.edu<tt>input()</tt> and <tt>token()</tt> that can be used to supply input and get tokens.  For example:
3702632Sstever@eecs.umich.edu
3712632Sstever@eecs.umich.edu<blockquote>
3722632Sstever@eecs.umich.edu<pre>
3732632Sstever@eecs.umich.edulexer = lex.lex()
3742632Sstever@eecs.umich.edulexer.input(sometext)
3754479Sbinkertn@umich.eduwhile 1:
3764479Sbinkertn@umich.edu    tok = lexer.token()
3772632Sstever@eecs.umich.edu    if not tok: break
3782632Sstever@eecs.umich.edu    print tok
3792632Sstever@eecs.umich.edu</pre>
3802632Sstever@eecs.umich.edu</blockquote>
3812632Sstever@eecs.umich.edu
3822632Sstever@eecs.umich.eduThe functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> 
3832632Sstever@eecs.umich.eduand <tt>token()</tt> methods of the last lexer created by the lex module. 
3842632Sstever@eecs.umich.edu
3852632Sstever@eecs.umich.edu
3862632Sstever@eecs.umich.edu<p>
3872632Sstever@eecs.umich.edu<li>To reduce compiler startup time and improve performance, the lexer can be built in optimized mode as follows:
3882632Sstever@eecs.umich.edu
3892632Sstever@eecs.umich.edu<blockquote>
3902632Sstever@eecs.umich.edu<pre>
3914479Sbinkertn@umich.edulex.lex(optimize=1)
3922632Sstever@eecs.umich.edu</pre>
3934479Sbinkertn@umich.edu</blockquote>
3944479Sbinkertn@umich.edu
3954479Sbinkertn@umich.eduWhen used, most error checking and validation is disabled.   This provides a slight performance
3962632Sstever@eecs.umich.edugain while tokenizing and tends to chop a few tenths of a second off startup time.  Since it disables
3972632Sstever@eecs.umich.eduerror checking, this mode is not the default and is not recommended during development.  However, once
3982632Sstever@eecs.umich.eduyou have your compiler fully working, it is usually safe to disable the error checks.
3992632Sstever@eecs.umich.edu
4002632Sstever@eecs.umich.edu<p>
4012632Sstever@eecs.umich.edu<li>You can enable some additional debugging by building the lexer like this:
4024479Sbinkertn@umich.edu
4032632Sstever@eecs.umich.edu<blockquote>
4042632Sstever@eecs.umich.edu<pre>
4052632Sstever@eecs.umich.edulex.lex(debug=1)
4062632Sstever@eecs.umich.edu</pre>
4074479Sbinkertn@umich.edu</blockquote>
4082632Sstever@eecs.umich.edu
4092632Sstever@eecs.umich.edu<p>
4102632Sstever@eecs.umich.edu<li>To help you debug your lexer, <tt>lex.py</tt> comes with a simple main program which will either
4112632Sstever@eecs.umich.edutokenize input read from standard input or from a file.  To use it, simply put this in your lexer:
4122632Sstever@eecs.umich.edu
4132632Sstever@eecs.umich.edu<blockquote>
4142632Sstever@eecs.umich.edu<pre>
4152632Sstever@eecs.umich.eduif __name__ == '__main__':
4162632Sstever@eecs.umich.edu     lex.runmain()
4172632Sstever@eecs.umich.edu</pre>
4182632Sstever@eecs.umich.edu</blockquote>
4192632Sstever@eecs.umich.edu
4202632Sstever@eecs.umich.eduThen, run you lexer as a main program such as <tt>python mylex.py</tt>
4212632Sstever@eecs.umich.edu
4222632Sstever@eecs.umich.edu<p>
4232632Sstever@eecs.umich.edu<li>Since the lexer is written entirely in Python, its performance is
4242632Sstever@eecs.umich.edulargely determined by that of the Python <tt>re</tt> module.  Although
4252632Sstever@eecs.umich.eduthe lexer has been written to be as efficient as possible, it's not
4262632Sstever@eecs.umich.edublazingly fast when used on very large input files.  Sorry.  If
4272632Sstever@eecs.umich.eduperformance is concern, you might consider upgrading to the most
4282632Sstever@eecs.umich.edurecent version of Python, creating a hand-written lexer, or offloading
4292632Sstever@eecs.umich.eduthe lexer into a C extension module.  In defense of <tt>lex.py</tt>,
4302632Sstever@eecs.umich.eduit's performance is not <em>that</em> bad when used on reasonably
4312632Sstever@eecs.umich.edusized input files.  For instance, lexing a 4700 line C program with
4322632Sstever@eecs.umich.edu32000 input tokens takes about 20 seconds on a 200 Mhz PC.  Obviously,
4332632Sstever@eecs.umich.eduit will run much faster on a more speedy machine.
4342632Sstever@eecs.umich.edu
4352632Sstever@eecs.umich.edu</ul>
4362632Sstever@eecs.umich.edu
4374479Sbinkertn@umich.edu<h2>Parsing basics</h2>
4384479Sbinkertn@umich.edu
4392632Sstever@eecs.umich.edu<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
4404479Sbinkertn@umich.eduexample, there are a few important bits of background that must be
4414479Sbinkertn@umich.edumentioned.  First, <tt>syntax</tt> is usually specified in terms of a
4424479Sbinkertn@umich.educontext free grammar (CFG).  For example, if you wanted to parse
4434479Sbinkertn@umich.edusimple arithmetic expressions, you might first write an unambiguous
4444479Sbinkertn@umich.edugrammar specification like this:
4454479Sbinkertn@umich.edu
4464479Sbinkertn@umich.edu<blockquote>
4474479Sbinkertn@umich.edu<pre> 
4484479Sbinkertn@umich.eduexpression : expression + term
4494479Sbinkertn@umich.edu           | expression - term
4504479Sbinkertn@umich.edu           | term
4514479Sbinkertn@umich.edu
4524479Sbinkertn@umich.eduterm       : term * factor
4534479Sbinkertn@umich.edu           | term / factor
4544479Sbinkertn@umich.edu           | factor
4554479Sbinkertn@umich.edu
4564479Sbinkertn@umich.edufactor     : NUMBER
4574479Sbinkertn@umich.edu           | ( expression )
4582632Sstever@eecs.umich.edu</pre>
4592632Sstever@eecs.umich.edu</blockquote>
4602632Sstever@eecs.umich.edu
4612632Sstever@eecs.umich.eduNext, the semantic behavior of a language is often specified using a
4622632Sstever@eecs.umich.edutechnique known as syntax directed translation.  In syntax directed
4634479Sbinkertn@umich.edutranslation, attributes are attached to each symbol in a given grammar
4642632Sstever@eecs.umich.edurule along with an action.  Whenever a particular grammar rule is
4652632Sstever@eecs.umich.edurecognized, the action describes what to do.  For example, given the
4662632Sstever@eecs.umich.eduexpression grammar above, you might write the specification for a
4672632Sstever@eecs.umich.edusimple calculator like this:
4682632Sstever@eecs.umich.edu
4692632Sstever@eecs.umich.edu<blockquote>
4704479Sbinkertn@umich.edu<pre> 
4714479Sbinkertn@umich.eduGrammar                             Action
4724479Sbinkertn@umich.edu--------------------------------    -------------------------------------------- 
4734479Sbinkertn@umich.eduexpression0 : expression1 + term    expression0.val = expression1.val + term.val
4744479Sbinkertn@umich.edu            | expression1 - term    expression0.val = expression1.val - term.val
4754479Sbinkertn@umich.edu            | term                  expression0.val = term.val
4764479Sbinkertn@umich.edu
4774479Sbinkertn@umich.eduterm0       : term1 * factor        term0.val = term1.val * factor.val
4782632Sstever@eecs.umich.edu            | term1 / factor        term0.val = term1.val / factor.val
4792632Sstever@eecs.umich.edu            | factor                term0.val = factor.val
4804479Sbinkertn@umich.edu
4814479Sbinkertn@umich.edufactor      : NUMBER                factor.val = int(NUMBER.lexval)
4824479Sbinkertn@umich.edu            | ( expression )        factor.val = expression.val
4834479Sbinkertn@umich.edu</pre>
4842632Sstever@eecs.umich.edu</blockquote>
4852632Sstever@eecs.umich.edu
4862632Sstever@eecs.umich.eduFinally, Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
4874479Sbinkertn@umich.edubottom up technique that tries to recognize the right-hand-side of various grammar rules.
4884479Sbinkertn@umich.eduWhenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
4894479Sbinkertn@umich.edugrammar symbols are replaced by the grammar symbol on the left-hand-side. 
4904479Sbinkertn@umich.edu
4914479Sbinkertn@umich.edu<p>
4924479Sbinkertn@umich.eduLR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
4934479Sbinkertn@umich.eduinput token for patterns.   The details of the algorithm can be found in a compiler text, but the
4944479Sbinkertn@umich.edufollowing example illustrates the steps that are performed if you wanted to parse the expression 
4954479Sbinkertn@umich.edu<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:
4964479Sbinkertn@umich.edu
4974479Sbinkertn@umich.edu<blockquote>
4984479Sbinkertn@umich.edu<pre>
4994479Sbinkertn@umich.eduStep Symbol Stack           Input Tokens            Action
5004479Sbinkertn@umich.edu---- ---------------------  ---------------------   -------------------------------
5014479Sbinkertn@umich.edu1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
5024479Sbinkertn@umich.edu2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
5034479Sbinkertn@umich.edu3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
5044479Sbinkertn@umich.edu4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
5054479Sbinkertn@umich.edu5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
5064479Sbinkertn@umich.edu6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
5074479Sbinkertn@umich.edu7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
5084479Sbinkertn@umich.edu8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
5094479Sbinkertn@umich.edu9    $ expr + term                * ( 10 - 20 )$    Shift *
5104479Sbinkertn@umich.edu10   $ expr + term *                ( 10 - 20 )$    Shift (
5114479Sbinkertn@umich.edu11   $ expr + term * (                10 - 20 )$    Shift 10
5124479Sbinkertn@umich.edu12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
5134479Sbinkertn@umich.edu13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
5144479Sbinkertn@umich.edu14   $ expr + term * ( term              - 20 )$    Reduce expr : term
5154479Sbinkertn@umich.edu15   $ expr + term * ( expr              - 20 )$    Shift -
5162632Sstever@eecs.umich.edu16   $ expr + term * ( expr -              20 )$    Shift 20
5172632Sstever@eecs.umich.edu17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
5184479Sbinkertn@umich.edu18   $ expr + term * ( expr - factor          )$    Reduce term : factor
5194479Sbinkertn@umich.edu19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
5204479Sbinkertn@umich.edu20   $ expr + term * ( expr                   )$    Shift )
5214479Sbinkertn@umich.edu21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
5224479Sbinkertn@umich.edu22   $ expr + term * factor                    $    Reduce term : term * factor
5234479Sbinkertn@umich.edu23   $ expr + term                             $    Reduce expr : expr + term
5244479Sbinkertn@umich.edu24   $ expr                                    $    Reduce expr
5254479Sbinkertn@umich.edu25   $                                         $    Success!
5264479Sbinkertn@umich.edu</pre>
5274479Sbinkertn@umich.edu</blockquote>
5284479Sbinkertn@umich.edu
5294479Sbinkertn@umich.eduWhen parsing the expression, an underlying state machine and the current input token determine what to do next.  
5304479Sbinkertn@umich.eduIf the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
5314479Sbinkertn@umich.eduonto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
5324479Sbinkertn@umich.eduusually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
5334479Sbinkertn@umich.eduappropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
5344479Sbinkertn@umich.eduany grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).
5354479Sbinkertn@umich.edu
5364479Sbinkertn@umich.edu<p>
5374479Sbinkertn@umich.eduIt is important to note that the underlying implementation is actually built around a large finite-state machine
5384479Sbinkertn@umich.eduand some tables.   The construction of these tables is quite complicated and beyond the scope of this discussion.
5394479Sbinkertn@umich.eduHowever, subtle details of this process explain why, in the example above, the parser chooses to shift a token
5404479Sbinkertn@umich.eduonto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.
5414479Sbinkertn@umich.edu
5422632Sstever@eecs.umich.edu<h2>Yacc example</h2>
5432632Sstever@eecs.umich.edu
5444479Sbinkertn@umich.eduSuppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
5454479Sbinkertn@umich.eduhow you would do it with <tt>yacc.py</tt>:
5464479Sbinkertn@umich.edu
5474479Sbinkertn@umich.edu<blockquote>
5484479Sbinkertn@umich.edu<pre>
5494479Sbinkertn@umich.edu# Yacc example
5504479Sbinkertn@umich.edu
5514479Sbinkertn@umich.eduimport yacc
5524479Sbinkertn@umich.edu
5534479Sbinkertn@umich.edu# Get the token map from the lexer.  This is required.
5544479Sbinkertn@umich.edufrom calclex import tokens
5554479Sbinkertn@umich.edu
5564479Sbinkertn@umich.edudef p_expression_plus(t):
5574479Sbinkertn@umich.edu    'expression : expression PLUS term'
5584479Sbinkertn@umich.edu    t[0] = t[1] + t[3]
5594479Sbinkertn@umich.edu
5604479Sbinkertn@umich.edudef p_expression_minus(t):
5614479Sbinkertn@umich.edu    'expression : expression MINUS term'
5624479Sbinkertn@umich.edu    t[0] = t[1] - t[3]
5634479Sbinkertn@umich.edu
5644479Sbinkertn@umich.edudef p_expression_term(t):
5654479Sbinkertn@umich.edu    'expression : term'
5664479Sbinkertn@umich.edu    t[0] = t[1]
5674479Sbinkertn@umich.edu
5684479Sbinkertn@umich.edudef p_term_times(t):
5694479Sbinkertn@umich.edu    'term : term TIMES factor'
5704479Sbinkertn@umich.edu    t[0] = t[1] * t[3]
5714479Sbinkertn@umich.edu
5724479Sbinkertn@umich.edudef p_term_div(t):
5734479Sbinkertn@umich.edu    'term : term DIVIDE factor'
5744479Sbinkertn@umich.edu    t[0] = t[1] / t[3]
5754479Sbinkertn@umich.edu
5764479Sbinkertn@umich.edudef p_term_factor(t):
5774479Sbinkertn@umich.edu    'term : factor'
5784479Sbinkertn@umich.edu    t[0] = t[1]
5794479Sbinkertn@umich.edu
5804479Sbinkertn@umich.edudef p_factor_num(t):
5814479Sbinkertn@umich.edu    'factor : NUMBER'
5824479Sbinkertn@umich.edu    t[0] = t[1]
5834479Sbinkertn@umich.edu
5844479Sbinkertn@umich.edudef p_factor_expr(t):
5854479Sbinkertn@umich.edu    'factor : LPAREN expression RPAREN'
5864479Sbinkertn@umich.edu    t[0] = t[2]
5874479Sbinkertn@umich.edu
5884479Sbinkertn@umich.edu# Error rule for syntax errors
5894479Sbinkertn@umich.edudef p_error(t):
5904479Sbinkertn@umich.edu    print "Syntax error in input!"
5914479Sbinkertn@umich.edu
5924479Sbinkertn@umich.edu# Build the parser
5934479Sbinkertn@umich.eduyacc.yacc()
5944479Sbinkertn@umich.edu
5954479Sbinkertn@umich.eduwhile 1:
5964479Sbinkertn@umich.edu   try:
5974479Sbinkertn@umich.edu       s = raw_input('calc > ')
5984479Sbinkertn@umich.edu   except EOFError:
5994479Sbinkertn@umich.edu       break
6004479Sbinkertn@umich.edu   if not s: continue
6014479Sbinkertn@umich.edu   result = yacc.parse(s)
6024479Sbinkertn@umich.edu   print result
6034479Sbinkertn@umich.edu</pre>
6044479Sbinkertn@umich.edu</blockquote>
6054479Sbinkertn@umich.edu
6064479Sbinkertn@umich.eduIn this example, each grammar rule is defined by a Python function where the docstring to that function contains the
6074479Sbinkertn@umich.eduappropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit).  Each function accepts a single
6084479Sbinkertn@umich.eduargument <tt>t</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
6094479Sbinkertn@umich.edu<tt>t[i]</tt> are mapped to grammar symbols as shown here:
6104479Sbinkertn@umich.edu
6114479Sbinkertn@umich.edu<blockquote>
6124479Sbinkertn@umich.edu<pre>
6134479Sbinkertn@umich.edudef p_expression_plus(t):
6144479Sbinkertn@umich.edu    'expression : expression PLUS term'
6154479Sbinkertn@umich.edu    #   ^            ^        ^    ^
6162632Sstever@eecs.umich.edu    #  t[0]         t[1]     t[2] t[3]
6172632Sstever@eecs.umich.edu
6182632Sstever@eecs.umich.edu    t[0] = t[1] + t[3]
6192632Sstever@eecs.umich.edu</pre>
6202632Sstever@eecs.umich.edu</blockquote>
6212632Sstever@eecs.umich.edu
6222632Sstever@eecs.umich.eduFor tokens, the "value" in the corresponding <tt>t[i]</tt> is the
6232632Sstever@eecs.umich.edu<em>same</em> as the value of the <tt>t.value</tt> attribute assigned
6242632Sstever@eecs.umich.eduin the lexer module.  For non-terminals, the value is determined by
6252632Sstever@eecs.umich.eduwhatever is placed in <tt>t[0]</tt> when rules are reduced.  This
6262632Sstever@eecs.umich.eduvalue can be anything at all.  However, it probably most common for
6272632Sstever@eecs.umich.eduthe value to be a simple Python type, a tuple, or an instance.  In this example, we
6284479Sbinkertn@umich.eduare relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
6294479Sbinkertn@umich.edufield.  All of the other rules simply perform various types of integer operations and store
6304479Sbinkertn@umich.eduthe result.
6314479Sbinkertn@umich.edu
6324479Sbinkertn@umich.edu<p>
6334479Sbinkertn@umich.eduThe first rule defined in the yacc specification determines the starting grammar
6342632Sstever@eecs.umich.edusymbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
6352632Sstever@eecs.umich.eduthe starting rule is reduced by the parser and no more input is available, parsing 
6364479Sbinkertn@umich.edustops and the final value is returned (this value will be whatever the top-most rule
6374479Sbinkertn@umich.eduplaced in <tt>t[0]</tt>).
6384479Sbinkertn@umich.edu
6394479Sbinkertn@umich.edu<p>The <tt>p_error(t)</tt> rule is defined to catch syntax errors.  See the error handling section
6404479Sbinkertn@umich.edubelow for more detail.
6412632Sstever@eecs.umich.edu
6422632Sstever@eecs.umich.edu<p>
6432632Sstever@eecs.umich.eduTo build the parser, call the <tt>yacc.yacc()</tt> function.  This function
6444479Sbinkertn@umich.edulooks at the module and attempts to construct all of the LR parsing tables for the grammar
6454479Sbinkertn@umich.eduyou have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
6464479Sbinkertn@umich.edusuch as this:
6474479Sbinkertn@umich.edu
6484479Sbinkertn@umich.edu<blockquote>
6494479Sbinkertn@umich.edu<pre>
6504479Sbinkertn@umich.edu$ python calcparse.py
6512632Sstever@eecs.umich.eduyacc: Generating SLR parsing table...  
6522632Sstever@eecs.umich.educalc > 
6532632Sstever@eecs.umich.edu</pre>
6544479Sbinkertn@umich.edu</blockquote>
6554479Sbinkertn@umich.edu
6564479Sbinkertn@umich.eduSince table construction is relatively expensive (especially for large
6572632Sstever@eecs.umich.edugrammars), the resulting parsing table is written to the current
6582632Sstever@eecs.umich.edudirectory in a file called <tt>parsetab.py</tt>.  In addition, a
6592632Sstever@eecs.umich.edudebugging file called <tt>parser.out</tt> is created.  On subsequent
6604479Sbinkertn@umich.eduexecutions, <tt>yacc</tt> will reload the table from
6614479Sbinkertn@umich.edu<tt>parsetab.py</tt> unless it has detected a change in the underlying
6624479Sbinkertn@umich.edugrammar (in which case the tables and <tt>parsetab.py</tt> file are
6634479Sbinkertn@umich.eduregenerated).
6644479Sbinkertn@umich.edu
6652632Sstever@eecs.umich.edu<p>
6662632Sstever@eecs.umich.eduIf any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
6672632Sstever@eecs.umich.edudiagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:
6684479Sbinkertn@umich.edu
6694479Sbinkertn@umich.edu<ul>
6704479Sbinkertn@umich.edu<li>Duplicated function names (if more than one rule function have the same name in the grammar file).
6714479Sbinkertn@umich.edu<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
6724479Sbinkertn@umich.edu<li>Badly specified grammar rules.
6734479Sbinkertn@umich.edu<li>Infinite recursion (rules that can never terminate).
6744479Sbinkertn@umich.edu<li>Unused rules and tokens
6754479Sbinkertn@umich.edu<li>Undefined rules and tokens
6764479Sbinkertn@umich.edu</ul>
6774479Sbinkertn@umich.edu
6784479Sbinkertn@umich.eduThe next few sections now discuss a few finer points of grammar construction.
6794479Sbinkertn@umich.edu
6804479Sbinkertn@umich.edu<h2>Combining Grammar Rule Functions</h2>
6814479Sbinkertn@umich.edu
6824479Sbinkertn@umich.eduWhen grammar rules are similar, they can be combined into a single function.
6834479Sbinkertn@umich.eduFor example, consider the two rules in our earlier example:
6844479Sbinkertn@umich.edu
6854479Sbinkertn@umich.edu<blockquote>
6864479Sbinkertn@umich.edu<pre>
6874479Sbinkertn@umich.edudef p_expression_plus(t):
6884479Sbinkertn@umich.edu    'expression : expression PLUS term'
6894479Sbinkertn@umich.edu    t[0] = t[1] + t[3]
6904479Sbinkertn@umich.edu
6914479Sbinkertn@umich.edudef p_expression_minus(t):
6924479Sbinkertn@umich.edu    'expression : expression MINUS term'
6934479Sbinkertn@umich.edu    t[0] = t[1] - t[3]
6944479Sbinkertn@umich.edu</pre>
6954479Sbinkertn@umich.edu</blockquote>
6964479Sbinkertn@umich.edu
6974479Sbinkertn@umich.eduInstead of writing two functions, you might write a single function like this:
6984479Sbinkertn@umich.edu
6994479Sbinkertn@umich.edu<blockquote>
7004479Sbinkertn@umich.edu<pre>
7014479Sbinkertn@umich.edudef p_expression(t):
7024479Sbinkertn@umich.edu    '''expression : expression PLUS term
7034479Sbinkertn@umich.edu                  | expression MINUS term'''
7044479Sbinkertn@umich.edu    if t[2] == '+':
7054479Sbinkertn@umich.edu        t[0] = t[1] + t[3]
7064479Sbinkertn@umich.edu    elif t[2] == '-':
7074479Sbinkertn@umich.edu        t[0] = t[1] - t[3]
7082632Sstever@eecs.umich.edu</pre>
7094479Sbinkertn@umich.edu</blockquote>
7104479Sbinkertn@umich.edu
7114479Sbinkertn@umich.eduIn general, the doc string for any given function can contain multiple grammar rules.  So, it would
7124479Sbinkertn@umich.eduhave also been legal (although possibly confusing) to write this:
7134479Sbinkertn@umich.edu
7144479Sbinkertn@umich.edu<blockquote>
7154479Sbinkertn@umich.edu<pre>
7164479Sbinkertn@umich.edudef p_binary_operators(t):
7174479Sbinkertn@umich.edu    '''expression : expression PLUS term
7184479Sbinkertn@umich.edu                  | expression MINUS term
7194479Sbinkertn@umich.edu       term       : term TIMES factor
7204479Sbinkertn@umich.edu                  | term DIVIDE factor'''
7214479Sbinkertn@umich.edu    if t[2] == '+':
7224479Sbinkertn@umich.edu        t[0] = t[1] + t[3]
7234479Sbinkertn@umich.edu    elif t[2] == '-':
7244479Sbinkertn@umich.edu        t[0] = t[1] - t[3]
7254479Sbinkertn@umich.edu    elif t[2] == '*':
7264479Sbinkertn@umich.edu        t[0] = t[1] * t[3]
7274479Sbinkertn@umich.edu    elif t[2] == '/':
7284479Sbinkertn@umich.edu        t[0] = t[1] / t[3]
7294479Sbinkertn@umich.edu</pre>
7304479Sbinkertn@umich.edu</blockquote>
7314479Sbinkertn@umich.edu
7324479Sbinkertn@umich.eduWhen combining grammar rules into a single function, it is usually a good idea for all of the rules to have
7334479Sbinkertn@umich.edua similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more 
7344479Sbinkertn@umich.educomplicated than necessary.
7354479Sbinkertn@umich.edu
7364479Sbinkertn@umich.edu<h2>Empty Productions</h2>
7372632Sstever@eecs.umich.edu
7382632Sstever@eecs.umich.edu<tt>yacc.py</tt> can handle empty productions by defining a rule like this:
7392632Sstever@eecs.umich.edu
7402632Sstever@eecs.umich.edu<blockquote>
7412632Sstever@eecs.umich.edu<pre>
7422632Sstever@eecs.umich.edudef p_empty(t):
7432632Sstever@eecs.umich.edu    'empty :'
7442632Sstever@eecs.umich.edu    pass
7454479Sbinkertn@umich.edu</pre>
7464479Sbinkertn@umich.edu</blockquote>
7474479Sbinkertn@umich.edu
7484479Sbinkertn@umich.eduNow to use the empty production, simply use 'empty' as a symbol.  For example:
7494479Sbinkertn@umich.edu
7504479Sbinkertn@umich.edu<blockquote>
7514479Sbinkertn@umich.edu<pre>
7524479Sbinkertn@umich.edudef p_optitem(t):
7534479Sbinkertn@umich.edu    'optitem : item'
7544479Sbinkertn@umich.edu    '        | empty'
7554479Sbinkertn@umich.edu    ...
7564479Sbinkertn@umich.edu</pre>
7574479Sbinkertn@umich.edu</blockquote>
7584479Sbinkertn@umich.edu
7594479Sbinkertn@umich.edu<h2>Dealing With Ambiguous Grammars</h2>
7604479Sbinkertn@umich.edu
7614479Sbinkertn@umich.eduThe expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. 
7624479Sbinkertn@umich.eduHowever, in many situations, it is extremely difficult or awkward to write grammars in this format.  A
7634479Sbinkertn@umich.edumuch more natural way to express the grammar is in a more compact form like this:
7644479Sbinkertn@umich.edu
7654479Sbinkertn@umich.edu<blockquote>
7664479Sbinkertn@umich.edu<pre>
7674479Sbinkertn@umich.eduexpression : expression PLUS expression
7684479Sbinkertn@umich.edu           | expression MINUS expression
7694479Sbinkertn@umich.edu           | expression TIMES expression
7704479Sbinkertn@umich.edu           | expression DIVIDE expression
7714479Sbinkertn@umich.edu           | LPAREN expression RPAREN
7724479Sbinkertn@umich.edu           | NUMBER
7734479Sbinkertn@umich.edu</pre>
7744479Sbinkertn@umich.edu</blockquote>
7754479Sbinkertn@umich.edu
7764479Sbinkertn@umich.eduUnfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
7774479Sbinkertn@umich.edu"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
7784479Sbinkertn@umich.eduFor example, does this expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?
7794479Sbinkertn@umich.edu
7804479Sbinkertn@umich.edu<p>
7814479Sbinkertn@umich.eduWhen an ambiguous grammar is given to <tt>yacc.py</tt> it will print messages about "shift/reduce conflicts" 
7824479Sbinkertn@umich.eduor a "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
7834479Sbinkertn@umich.eduwhether or not to reduce a rule or shift a symbol on the parsing stack.   For example, consider
7844479Sbinkertn@umich.eduthe string "3 * 4 + 5" and the internal parsing stack:
7854479Sbinkertn@umich.edu
7864479Sbinkertn@umich.edu<blockquote>
7874479Sbinkertn@umich.edu<pre>
7884479Sbinkertn@umich.eduStep Symbol Stack           Input Tokens            Action
7894479Sbinkertn@umich.edu---- ---------------------  ---------------------   -------------------------------
7904479Sbinkertn@umich.edu1    $                                3 * 4 + 5$    Shift 3
7914479Sbinkertn@umich.edu2    $ 3                                * 4 + 5$    Reduce : expression : NUMBER
7924479Sbinkertn@umich.edu3    $ expr                             * 4 + 5$    Shift *
7934479Sbinkertn@umich.edu4    $ expr *                             4 + 5$    Shift 4
7944479Sbinkertn@umich.edu5    $ expr * 4                             + 5$    Reduce: expression : NUMBER
7954479Sbinkertn@umich.edu6    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
7964479Sbinkertn@umich.edu</pre>
7974479Sbinkertn@umich.edu</blockquote>
7984479Sbinkertn@umich.edu
7994479Sbinkertn@umich.eduIn this case, when the parser reaches step 6, it has two options.  One is the reduce the
8004479Sbinkertn@umich.edurule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
8014479Sbinkertn@umich.edutoken <tt>+</tt> on the stack.   Both options are perfectly legal from the rules
8024479Sbinkertn@umich.eduof the context-free-grammar.
8034479Sbinkertn@umich.edu
8044479Sbinkertn@umich.edu<p>
8054479Sbinkertn@umich.eduBy default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
8064479Sbinkertn@umich.eduexample, the parser will always shift the <tt>+</tt> instead of reducing.    Although this
8074479Sbinkertn@umich.edustrategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
8084479Sbinkertn@umich.eduexpressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
8094479Sbinkertn@umich.edureduced <tt>expr * expr</tt> since multiplication has higher precedence than addition.
8104479Sbinkertn@umich.edu
8114479Sbinkertn@umich.edu<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
8124479Sbinkertn@umich.edutokens to be assigned a precedence level and associativity.  This is done by adding a variable
8134479Sbinkertn@umich.edu<tt>precedence</tt> to the grammar file like this:
8144479Sbinkertn@umich.edu
8154479Sbinkertn@umich.edu<blockquote>
8164479Sbinkertn@umich.edu<pre>
8174479Sbinkertn@umich.eduprecedence = (
8184479Sbinkertn@umich.edu    ('left', 'PLUS', 'MINUS'),
8194479Sbinkertn@umich.edu    ('left', 'TIMES', 'DIVIDE'),
8204479Sbinkertn@umich.edu)
8214479Sbinkertn@umich.edu</pre>
8224479Sbinkertn@umich.edu</blockquote>
8234479Sbinkertn@umich.edu
8244479Sbinkertn@umich.eduThis declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
8254479Sbinkertn@umich.eduthe same precedence level and are left-associative and that 
8264479Sbinkertn@umich.edu<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
8274479Sbinkertn@umich.eduFurthermore, the declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
8284479Sbinkertn@umich.eduprecedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
8294479Sbinkertn@umich.eduprecedence specification).
8304479Sbinkertn@umich.edu
8314479Sbinkertn@umich.edu<p>
8324479Sbinkertn@umich.eduThe precedence specification is used to attach a numerical precedence value and associativity direction 
8334479Sbinkertn@umich.eduto each grammar rule. This is always determined by the precedence of the right-most terminal symbol.  Therefore,
8344479Sbinkertn@umich.eduif PLUS/MINUS had a precedence of 1 and TIMES/DIVIDE had a precedence of 2, the grammar rules
8354479Sbinkertn@umich.eduwould have precedence values as follows:
8364479Sbinkertn@umich.edu
8374479Sbinkertn@umich.edu<blockquote>
8384479Sbinkertn@umich.edu<pre>
8394479Sbinkertn@umich.eduexpression : expression PLUS expression                 # prec = 1, left
8404479Sbinkertn@umich.edu           | expression MINUS expression                # prec = 1, left
8414479Sbinkertn@umich.edu           | expression TIMES expression                # prec = 2, left
8424479Sbinkertn@umich.edu           | expression DIVIDE expression               # prec = 2, left
8434479Sbinkertn@umich.edu           | LPAREN expression RPAREN                   # prec = unknown
8444479Sbinkertn@umich.edu           | NUMBER                                     # prec = unknown
8454479Sbinkertn@umich.edu</pre>
8464479Sbinkertn@umich.edu</blockquote>
8474479Sbinkertn@umich.edu
8484479Sbinkertn@umich.eduWhen shift/reduce conflicts are encountered, the parser generator resolves the conflict by
8494479Sbinkertn@umich.edulooking at the precedence rules and associativity specifiers.
8504479Sbinkertn@umich.edu
8514479Sbinkertn@umich.edu<p>
8524479Sbinkertn@umich.edu<ol>
8534479Sbinkertn@umich.edu<li>If the current token has higher precedence, it is shifted.
8544479Sbinkertn@umich.edu<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
8554479Sbinkertn@umich.edu<li>If the current token and the grammar rule have the same precedence, the
8564479Sbinkertn@umich.edurule is reduced for left associativity, whereas the token is shifted for right associativity.
8574479Sbinkertn@umich.edu<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
8584479Sbinkertn@umich.edufavor of shifting (the default).
8594479Sbinkertn@umich.edu</ol>
8604479Sbinkertn@umich.edu
8614479Sbinkertn@umich.edu<p>
8624479Sbinkertn@umich.eduWhen shift/reduce conflicts are resolved using the first three techniques (with the help of
8634479Sbinkertn@umich.eduprecedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar. 
8644479Sbinkertn@umich.edu
8654479Sbinkertn@umich.edu<p>
8664479Sbinkertn@umich.eduOne problem with the precedence specifier technique is that it is sometimes necessary to
8674479Sbinkertn@umich.educhange the precedence of an operator in certain contents.  For example, consider a unary-minus operator
8684479Sbinkertn@umich.eduin "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
8694479Sbinkertn@umich.eduHowever, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
8704479Sbinkertn@umich.eduprecedence rules can be given for fictitious tokens like this:
8714479Sbinkertn@umich.edu
8724479Sbinkertn@umich.edu<blockquote>
8734479Sbinkertn@umich.edu<pre>
8744479Sbinkertn@umich.eduprecedence = (
8754479Sbinkertn@umich.edu    ('left', 'PLUS', 'MINUS'),
8764479Sbinkertn@umich.edu    ('left', 'TIMES', 'DIVIDE'),
8774479Sbinkertn@umich.edu    ('right', 'UMINUS'),            # Unary minus operator
8784479Sbinkertn@umich.edu)
8794479Sbinkertn@umich.edu</pre>
8804479Sbinkertn@umich.edu</blockquote>
8814479Sbinkertn@umich.edu
8824479Sbinkertn@umich.eduNow, in the grammar file, we can write our unary minus rule like this:
8834479Sbinkertn@umich.edu
8844479Sbinkertn@umich.edu<blockquote>
8854479Sbinkertn@umich.edu<pre>
8864479Sbinkertn@umich.edudef p_expr_uminus(t):
8874479Sbinkertn@umich.edu    'expression : MINUS expression %prec UMINUS'
8884479Sbinkertn@umich.edu    t[0] = -t[2]
8894479Sbinkertn@umich.edu</pre>
8904479Sbinkertn@umich.edu</blockquote>
8914479Sbinkertn@umich.edu
8924479Sbinkertn@umich.eduIn this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
8934479Sbinkertn@umich.eduof UMINUS in the precedence specifier.
8944479Sbinkertn@umich.edu
8954479Sbinkertn@umich.edu<p>
8964479Sbinkertn@umich.eduIt is also possible to specify non-associativity in the <tt>precedence</tt> table. This would
8974479Sbinkertn@umich.edube used when you <em>don't</em> want operations to chain together.  For example, suppose
8984479Sbinkertn@umich.eduyou wanted to support a comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
8994479Sbinkertn@umich.educombinations like <tt>a &lt; b &lt; c</tt>.   To do this, simply specify a rule like this:
9004479Sbinkertn@umich.edu
9014479Sbinkertn@umich.edu<blockquote>
9024479Sbinkertn@umich.edu<pre>
9034479Sbinkertn@umich.eduprecedence = (
9044479Sbinkertn@umich.edu    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
9054479Sbinkertn@umich.edu    ('left', 'PLUS', 'MINUS'),
9064479Sbinkertn@umich.edu    ('left', 'TIMES', 'DIVIDE'),
9074479Sbinkertn@umich.edu    ('right', 'UMINUS'),            # Unary minus operator
9084479Sbinkertn@umich.edu)
9094479Sbinkertn@umich.edu</pre>
9104479Sbinkertn@umich.edu</blockquote>
9114479Sbinkertn@umich.edu
9124479Sbinkertn@umich.edu<p>
9134479Sbinkertn@umich.eduReduce/reduce conflicts are caused when there are multiple grammar
9144479Sbinkertn@umich.edurules that can be applied to a given set of symbols.  This kind of
9154479Sbinkertn@umich.educonflict is almost always bad and is always resolved by picking the
9164479Sbinkertn@umich.edurule that appears first in the grammar file.   Reduce/reduce conflicts
9174479Sbinkertn@umich.eduare almost always caused when different sets of grammar rules somehow
9184479Sbinkertn@umich.edugenerate the same set of symbols.  For example:
9194479Sbinkertn@umich.edu
9204479Sbinkertn@umich.edu<blockquote>
9214479Sbinkertn@umich.edu<pre>
9224479Sbinkertn@umich.eduassignment :  ID EQUALS NUMBER
9234479Sbinkertn@umich.edu           |  ID EQUALS expression
9244479Sbinkertn@umich.edu           
9254479Sbinkertn@umich.eduexpression : expression PLUS expression
9264479Sbinkertn@umich.edu           | expression MINUS expression
9274479Sbinkertn@umich.edu           | expression TIMES expression
9284479Sbinkertn@umich.edu           | expression DIVIDE expression
9294479Sbinkertn@umich.edu           | LPAREN expression RPAREN
9304479Sbinkertn@umich.edu           | NUMBER
9314479Sbinkertn@umich.edu</pre>
9324479Sbinkertn@umich.edu</blockquote>
9334479Sbinkertn@umich.edu
9344479Sbinkertn@umich.eduIn this case, a reduce/reduce conflict exists between these two rules:
9354479Sbinkertn@umich.edu
9364479Sbinkertn@umich.edu<blockquote>
9374479Sbinkertn@umich.edu<pre>
9384479Sbinkertn@umich.eduassignment  : ID EQUALS NUMBER
9394479Sbinkertn@umich.eduexpression  : NUMBER
9404479Sbinkertn@umich.edu</pre>
9414479Sbinkertn@umich.edu</blockquote>
9424479Sbinkertn@umich.edu
9434479Sbinkertn@umich.eduFor example, if you wrote "a = 5", the parser can't figure out if this
9444479Sbinkertn@umich.eduis supposed to reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
9454479Sbinkertn@umich.eduwhether it's supposed to reduce the 5 as an expression and then reduce
9464479Sbinkertn@umich.eduthe rule <tt>assignment : ID EQUALS expression</tt>.
9474479Sbinkertn@umich.edu
9484479Sbinkertn@umich.edu<h2>The parser.out file</h2>
9494479Sbinkertn@umich.edu
9504479Sbinkertn@umich.eduTracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
9514479Sbinkertn@umich.eduparsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
9524479Sbinkertn@umich.edu'parser.out' when it generates the parsing table.   The contents of this file look like the following:
9534479Sbinkertn@umich.edu
9544479Sbinkertn@umich.edu<blockquote>
9554479Sbinkertn@umich.edu<pre>
9564479Sbinkertn@umich.eduUnused terminals:
9574479Sbinkertn@umich.edu
9584479Sbinkertn@umich.edu
9594479Sbinkertn@umich.eduGrammar
9604479Sbinkertn@umich.edu
9614479Sbinkertn@umich.eduRule 1     expression -> expression PLUS expression
9624479Sbinkertn@umich.eduRule 2     expression -> expression MINUS expression
9634479Sbinkertn@umich.eduRule 3     expression -> expression TIMES expression
9644479Sbinkertn@umich.eduRule 4     expression -> expression DIVIDE expression
9654479Sbinkertn@umich.eduRule 5     expression -> NUMBER
9664479Sbinkertn@umich.eduRule 6     expression -> LPAREN expression RPAREN
9674479Sbinkertn@umich.edu
9684479Sbinkertn@umich.eduTerminals, with rules where they appear
9694479Sbinkertn@umich.edu
9704479Sbinkertn@umich.eduTIMES                : 3
9714479Sbinkertn@umich.eduerror                : 
9724479Sbinkertn@umich.eduMINUS                : 2
9734479Sbinkertn@umich.eduRPAREN               : 6
9744479Sbinkertn@umich.eduLPAREN               : 6
9754479Sbinkertn@umich.eduDIVIDE               : 4
9764479Sbinkertn@umich.eduPLUS                 : 1
9774479Sbinkertn@umich.eduNUMBER               : 5
9784479Sbinkertn@umich.edu
9794479Sbinkertn@umich.eduNonterminals, with rules where they appear
9804479Sbinkertn@umich.edu
9814479Sbinkertn@umich.eduexpression           : 1 1 2 2 3 3 4 4 6 0
9824479Sbinkertn@umich.edu
9834479Sbinkertn@umich.edu
9844479Sbinkertn@umich.eduParsing method: SLR
9854479Sbinkertn@umich.edu
9864479Sbinkertn@umich.edu
9874479Sbinkertn@umich.edustate 0
9884479Sbinkertn@umich.edu
9894479Sbinkertn@umich.edu    S' -> . expression
9904479Sbinkertn@umich.edu    expression -> . expression PLUS expression
9914479Sbinkertn@umich.edu    expression -> . expression MINUS expression
9924479Sbinkertn@umich.edu    expression -> . expression TIMES expression
9934479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
9944479Sbinkertn@umich.edu    expression -> . NUMBER
9954479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
9964479Sbinkertn@umich.edu
9974479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
9984479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
9994479Sbinkertn@umich.edu
10004479Sbinkertn@umich.edu
10014479Sbinkertn@umich.edustate 1
10024479Sbinkertn@umich.edu
10034479Sbinkertn@umich.edu    S' -> expression .
10044479Sbinkertn@umich.edu    expression -> expression . PLUS expression
10054479Sbinkertn@umich.edu    expression -> expression . MINUS expression
10064479Sbinkertn@umich.edu    expression -> expression . TIMES expression
10074479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
10084479Sbinkertn@umich.edu
10094479Sbinkertn@umich.edu    PLUS            shift and go to state 6
10104479Sbinkertn@umich.edu    MINUS           shift and go to state 5
10114479Sbinkertn@umich.edu    TIMES           shift and go to state 4
10124479Sbinkertn@umich.edu    DIVIDE          shift and go to state 7
10134479Sbinkertn@umich.edu
10144479Sbinkertn@umich.edu
10154479Sbinkertn@umich.edustate 2
10164479Sbinkertn@umich.edu
10174479Sbinkertn@umich.edu    expression -> LPAREN . expression RPAREN
10184479Sbinkertn@umich.edu    expression -> . expression PLUS expression
10194479Sbinkertn@umich.edu    expression -> . expression MINUS expression
10204479Sbinkertn@umich.edu    expression -> . expression TIMES expression
10214479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
10224479Sbinkertn@umich.edu    expression -> . NUMBER
10234479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
10244479Sbinkertn@umich.edu
10254479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
10264479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
10274479Sbinkertn@umich.edu
10284479Sbinkertn@umich.edu
10294479Sbinkertn@umich.edustate 3
10304479Sbinkertn@umich.edu
10314479Sbinkertn@umich.edu    expression -> NUMBER .
10324479Sbinkertn@umich.edu
10334479Sbinkertn@umich.edu    $               reduce using rule 5
10344479Sbinkertn@umich.edu    PLUS            reduce using rule 5
10354479Sbinkertn@umich.edu    MINUS           reduce using rule 5
10364479Sbinkertn@umich.edu    TIMES           reduce using rule 5
10374479Sbinkertn@umich.edu    DIVIDE          reduce using rule 5
10384479Sbinkertn@umich.edu    RPAREN          reduce using rule 5
10394479Sbinkertn@umich.edu
10404479Sbinkertn@umich.edu
10414479Sbinkertn@umich.edustate 4
10424479Sbinkertn@umich.edu
10434479Sbinkertn@umich.edu    expression -> expression TIMES . expression
10444479Sbinkertn@umich.edu    expression -> . expression PLUS expression
10454479Sbinkertn@umich.edu    expression -> . expression MINUS expression
10464479Sbinkertn@umich.edu    expression -> . expression TIMES expression
10474479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
10484479Sbinkertn@umich.edu    expression -> . NUMBER
10494479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
10504479Sbinkertn@umich.edu
10514479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
10524479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
10534479Sbinkertn@umich.edu
10544479Sbinkertn@umich.edu
10554479Sbinkertn@umich.edustate 5
10564479Sbinkertn@umich.edu
10574479Sbinkertn@umich.edu    expression -> expression MINUS . expression
10584479Sbinkertn@umich.edu    expression -> . expression PLUS expression
10594479Sbinkertn@umich.edu    expression -> . expression MINUS expression
10604479Sbinkertn@umich.edu    expression -> . expression TIMES expression
10614479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
10624479Sbinkertn@umich.edu    expression -> . NUMBER
10634479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
10644479Sbinkertn@umich.edu
10654479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
10664479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
10674479Sbinkertn@umich.edu
10684479Sbinkertn@umich.edu
10694479Sbinkertn@umich.edustate 6
10704479Sbinkertn@umich.edu
10714479Sbinkertn@umich.edu    expression -> expression PLUS . expression
10724479Sbinkertn@umich.edu    expression -> . expression PLUS expression
10734479Sbinkertn@umich.edu    expression -> . expression MINUS expression
10744479Sbinkertn@umich.edu    expression -> . expression TIMES expression
10754479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
10764479Sbinkertn@umich.edu    expression -> . NUMBER
10774479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
10784479Sbinkertn@umich.edu
10794479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
10804479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
10814479Sbinkertn@umich.edu
10824479Sbinkertn@umich.edu
10834479Sbinkertn@umich.edustate 7
10844479Sbinkertn@umich.edu
10854479Sbinkertn@umich.edu    expression -> expression DIVIDE . expression
10864479Sbinkertn@umich.edu    expression -> . expression PLUS expression
10874479Sbinkertn@umich.edu    expression -> . expression MINUS expression
10884479Sbinkertn@umich.edu    expression -> . expression TIMES expression
10894479Sbinkertn@umich.edu    expression -> . expression DIVIDE expression
10904479Sbinkertn@umich.edu    expression -> . NUMBER
10914479Sbinkertn@umich.edu    expression -> . LPAREN expression RPAREN
10924479Sbinkertn@umich.edu
10934479Sbinkertn@umich.edu    NUMBER          shift and go to state 3
10944479Sbinkertn@umich.edu    LPAREN          shift and go to state 2
10954479Sbinkertn@umich.edu
10964479Sbinkertn@umich.edu
10974479Sbinkertn@umich.edustate 8
10984479Sbinkertn@umich.edu
10994479Sbinkertn@umich.edu    expression -> LPAREN expression . RPAREN
11004479Sbinkertn@umich.edu    expression -> expression . PLUS expression
11014479Sbinkertn@umich.edu    expression -> expression . MINUS expression
11024479Sbinkertn@umich.edu    expression -> expression . TIMES expression
11034479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
11044479Sbinkertn@umich.edu
11054479Sbinkertn@umich.edu    RPAREN          shift and go to state 13
11064479Sbinkertn@umich.edu    PLUS            shift and go to state 6
11074479Sbinkertn@umich.edu    MINUS           shift and go to state 5
11084479Sbinkertn@umich.edu    TIMES           shift and go to state 4
11094479Sbinkertn@umich.edu    DIVIDE          shift and go to state 7
11104479Sbinkertn@umich.edu
11114479Sbinkertn@umich.edu
11124479Sbinkertn@umich.edustate 9
11134479Sbinkertn@umich.edu
11144479Sbinkertn@umich.edu    expression -> expression TIMES expression .
11154479Sbinkertn@umich.edu    expression -> expression . PLUS expression
11164479Sbinkertn@umich.edu    expression -> expression . MINUS expression
11174479Sbinkertn@umich.edu    expression -> expression . TIMES expression
11184479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
11194479Sbinkertn@umich.edu
11204479Sbinkertn@umich.edu    $               reduce using rule 3
11214479Sbinkertn@umich.edu    PLUS            reduce using rule 3
11224479Sbinkertn@umich.edu    MINUS           reduce using rule 3
11234479Sbinkertn@umich.edu    TIMES           reduce using rule 3
11244479Sbinkertn@umich.edu    DIVIDE          reduce using rule 3
11254479Sbinkertn@umich.edu    RPAREN          reduce using rule 3
11264479Sbinkertn@umich.edu
11274479Sbinkertn@umich.edu  ! PLUS            [ shift and go to state 6 ]
11284479Sbinkertn@umich.edu  ! MINUS           [ shift and go to state 5 ]
11294479Sbinkertn@umich.edu  ! TIMES           [ shift and go to state 4 ]
11304479Sbinkertn@umich.edu  ! DIVIDE          [ shift and go to state 7 ]
11314479Sbinkertn@umich.edu
11324479Sbinkertn@umich.edustate 10
11334479Sbinkertn@umich.edu
11344479Sbinkertn@umich.edu    expression -> expression MINUS expression .
11354479Sbinkertn@umich.edu    expression -> expression . PLUS expression
11364479Sbinkertn@umich.edu    expression -> expression . MINUS expression
11374479Sbinkertn@umich.edu    expression -> expression . TIMES expression
11384479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
11394479Sbinkertn@umich.edu
11404479Sbinkertn@umich.edu    $               reduce using rule 2
11414479Sbinkertn@umich.edu    PLUS            reduce using rule 2
11424479Sbinkertn@umich.edu    MINUS           reduce using rule 2
11434479Sbinkertn@umich.edu    RPAREN          reduce using rule 2
11444479Sbinkertn@umich.edu    TIMES           shift and go to state 4
11454479Sbinkertn@umich.edu    DIVIDE          shift and go to state 7
11464479Sbinkertn@umich.edu
11474479Sbinkertn@umich.edu  ! TIMES           [ reduce using rule 2 ]
11484479Sbinkertn@umich.edu  ! DIVIDE          [ reduce using rule 2 ]
11494479Sbinkertn@umich.edu  ! PLUS            [ shift and go to state 6 ]
11504479Sbinkertn@umich.edu  ! MINUS           [ shift and go to state 5 ]
11514479Sbinkertn@umich.edu
11524479Sbinkertn@umich.edustate 11
11534479Sbinkertn@umich.edu
11544479Sbinkertn@umich.edu    expression -> expression PLUS expression .
11554479Sbinkertn@umich.edu    expression -> expression . PLUS expression
11564479Sbinkertn@umich.edu    expression -> expression . MINUS expression
11574479Sbinkertn@umich.edu    expression -> expression . TIMES expression
11584479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
11594479Sbinkertn@umich.edu
11604479Sbinkertn@umich.edu    $               reduce using rule 1
11614479Sbinkertn@umich.edu    PLUS            reduce using rule 1
11624479Sbinkertn@umich.edu    MINUS           reduce using rule 1
11634479Sbinkertn@umich.edu    RPAREN          reduce using rule 1
11644479Sbinkertn@umich.edu    TIMES           shift and go to state 4
11654479Sbinkertn@umich.edu    DIVIDE          shift and go to state 7
11664479Sbinkertn@umich.edu
11674479Sbinkertn@umich.edu  ! TIMES           [ reduce using rule 1 ]
11684479Sbinkertn@umich.edu  ! DIVIDE          [ reduce using rule 1 ]
11694479Sbinkertn@umich.edu  ! PLUS            [ shift and go to state 6 ]
11704479Sbinkertn@umich.edu  ! MINUS           [ shift and go to state 5 ]
11714479Sbinkertn@umich.edu
11724479Sbinkertn@umich.edustate 12
11734479Sbinkertn@umich.edu
11744479Sbinkertn@umich.edu    expression -> expression DIVIDE expression .
11754479Sbinkertn@umich.edu    expression -> expression . PLUS expression
11764479Sbinkertn@umich.edu    expression -> expression . MINUS expression
11774479Sbinkertn@umich.edu    expression -> expression . TIMES expression
11784479Sbinkertn@umich.edu    expression -> expression . DIVIDE expression
11794479Sbinkertn@umich.edu
11804479Sbinkertn@umich.edu    $               reduce using rule 4
11814479Sbinkertn@umich.edu    PLUS            reduce using rule 4
11824479Sbinkertn@umich.edu    MINUS           reduce using rule 4
11834479Sbinkertn@umich.edu    TIMES           reduce using rule 4
11844479Sbinkertn@umich.edu    DIVIDE          reduce using rule 4
11854479Sbinkertn@umich.edu    RPAREN          reduce using rule 4
11864479Sbinkertn@umich.edu
11874479Sbinkertn@umich.edu  ! PLUS            [ shift and go to state 6 ]
11884479Sbinkertn@umich.edu  ! MINUS           [ shift and go to state 5 ]
11894479Sbinkertn@umich.edu  ! TIMES           [ shift and go to state 4 ]
11904479Sbinkertn@umich.edu  ! DIVIDE          [ shift and go to state 7 ]
11914479Sbinkertn@umich.edu
11924479Sbinkertn@umich.edustate 13
11934479Sbinkertn@umich.edu
11944479Sbinkertn@umich.edu    expression -> LPAREN expression RPAREN .
11954479Sbinkertn@umich.edu
11964479Sbinkertn@umich.edu    $               reduce using rule 6
11974479Sbinkertn@umich.edu    PLUS            reduce using rule 6
11984479Sbinkertn@umich.edu    MINUS           reduce using rule 6
11994479Sbinkertn@umich.edu    TIMES           reduce using rule 6
12004479Sbinkertn@umich.edu    DIVIDE          reduce using rule 6
12014479Sbinkertn@umich.edu    RPAREN          reduce using rule 6
12024479Sbinkertn@umich.edu</pre>
12034479Sbinkertn@umich.edu</blockquote>
12044479Sbinkertn@umich.edu
12054479Sbinkertn@umich.eduIn the file, each state of the grammar is described.  Within each state the "." indicates the current
12064479Sbinkertn@umich.edulocation of the parse within any applicable grammar rules.   In addition, the actions for each valid
12074479Sbinkertn@umich.eduinput token are listed.   When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
12084479Sbinkertn@umich.eduare prefixed with an !.  For example:
12094479Sbinkertn@umich.edu
12104479Sbinkertn@umich.edu<blockquote>
12114479Sbinkertn@umich.edu<pre>
12124479Sbinkertn@umich.edu  ! TIMES           [ reduce using rule 2 ]
12134479Sbinkertn@umich.edu  ! DIVIDE          [ reduce using rule 2 ]
12144479Sbinkertn@umich.edu  ! PLUS            [ shift and go to state 6 ]
12154479Sbinkertn@umich.edu  ! MINUS           [ shift and go to state 5 ]
12164479Sbinkertn@umich.edu</pre>
12174479Sbinkertn@umich.edu</blockquote>
12184479Sbinkertn@umich.edu
12194479Sbinkertn@umich.eduBy looking at these rules (and with a little practice), you can usually track down the source
12204479Sbinkertn@umich.eduof most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
12214479Sbinkertn@umich.edubad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.
12224479Sbinkertn@umich.edu  
12234479Sbinkertn@umich.edu<h2>Syntax Error Handling</h2>
12244479Sbinkertn@umich.edu
12254479Sbinkertn@umich.eduWhen a syntax error occurs during parsing, the error is immediately
12264479Sbinkertn@umich.edudetected (i.e., the parser does not read any more tokens beyond the
12274479Sbinkertn@umich.edusource of the error).  Error recovery in LR parsers is a delicate
12284479Sbinkertn@umich.edutopic that involves ancient rituals and black-magic.   The recovery mechanism
12294479Sbinkertn@umich.eduprovided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want
12304479Sbinkertn@umich.educonsult a book like O'Reilly's "Lex and Yacc" for some of the finer details.
12314479Sbinkertn@umich.edu
12324479Sbinkertn@umich.edu<p>
12334479Sbinkertn@umich.eduWhen a syntax error occurs, <tt>yacc.py</tt> performs the following steps:
12344479Sbinkertn@umich.edu
12354479Sbinkertn@umich.edu<ol>
12364479Sbinkertn@umich.edu<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
12374479Sbinkertn@umich.eduis called with the offending token as an argument.  Afterwards, the parser enters
12384479Sbinkertn@umich.eduan "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
12394479Sbinkertn@umich.eduhas successfully shifted at least 3 tokens onto the parsing stack.
12404479Sbinkertn@umich.edu
12414479Sbinkertn@umich.edu<p>
12424479Sbinkertn@umich.edu<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
12434479Sbinkertn@umich.eduwith a special <tt>error</tt> token.
12444479Sbinkertn@umich.edu
12454479Sbinkertn@umich.edu<p>
12464479Sbinkertn@umich.edu<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
12474479Sbinkertn@umich.edudeleted.
12484479Sbinkertn@umich.edu
12494479Sbinkertn@umich.edu<p>
12504479Sbinkertn@umich.edu<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
12514479Sbinkertn@umich.eduparsing from its initial state.
12524479Sbinkertn@umich.edu
12534479Sbinkertn@umich.edu<p>
12544479Sbinkertn@umich.edu<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
12554479Sbinkertn@umich.edushifted onto the parsing stack.
12564479Sbinkertn@umich.edu
12574479Sbinkertn@umich.edu<p>
12584479Sbinkertn@umich.edu<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
12594479Sbinkertn@umich.eduparser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
12604479Sbinkertn@umich.edu</ol>
12614479Sbinkertn@umich.edu
12624479Sbinkertn@umich.edu<h4>Recovery and resynchronization with error rules</h4>
12634479Sbinkertn@umich.edu
12644479Sbinkertn@umich.eduThe most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
12654479Sbinkertn@umich.edutoken.  For example, suppose your language had a grammar rule for a print statement like this:
12664479Sbinkertn@umich.edu
12674479Sbinkertn@umich.edu<blockquote>
12684479Sbinkertn@umich.edu<pre>
12694479Sbinkertn@umich.edudef p_statement_print(t):
12704479Sbinkertn@umich.edu     'statement : PRINT expr SEMI'
12714479Sbinkertn@umich.edu     ...
12724479Sbinkertn@umich.edu</pre>
12734479Sbinkertn@umich.edu</blockquote>
12744479Sbinkertn@umich.edu
12754479Sbinkertn@umich.eduTo account for the possibility of a bad expression, you might write an additional grammar rule like this:
12764479Sbinkertn@umich.edu
12774479Sbinkertn@umich.edu<blockquote>
12784479Sbinkertn@umich.edu<pre>
12794479Sbinkertn@umich.edudef p_statement_print_error(t):
12804479Sbinkertn@umich.edu     'statement : PRINT error SEMI'
12814479Sbinkertn@umich.edu     print "Syntax error in print statement. Bad expression"
12824479Sbinkertn@umich.edu
12834479Sbinkertn@umich.edu</pre>
12844479Sbinkertn@umich.edu</blockquote>
12854479Sbinkertn@umich.edu
12864479Sbinkertn@umich.eduIn this case, the <tt>error</tt> token will match any sequence of
12874479Sbinkertn@umich.edutokens that might appear up to the first semicolon that is
12884479Sbinkertn@umich.eduencountered.  Once the semicolon is reached, the rule will be
12894479Sbinkertn@umich.eduinvoked and the <tt>error</tt> token will go away.
12904479Sbinkertn@umich.edu
12914479Sbinkertn@umich.edu<p>
12924479Sbinkertn@umich.eduThis type of recovery is sometimes known as parser resynchronization.
12934479Sbinkertn@umich.eduThe <tt>error</tt> token acts as a wildcard for any bad input text and
12944479Sbinkertn@umich.eduthe token immediately following <tt>error</tt> acts as a
12954479Sbinkertn@umich.edusynchronization token.
12964479Sbinkertn@umich.edu
12974479Sbinkertn@umich.edu<p>
12984479Sbinkertn@umich.eduIt is important to note that the <tt>error</tt> token usually does not appear as the last token
12994479Sbinkertn@umich.eduon the right in an error rule.  For example:
13004479Sbinkertn@umich.edu
13014479Sbinkertn@umich.edu<blockquote>
13024479Sbinkertn@umich.edu<pre>
13034479Sbinkertn@umich.edudef p_statement_print_error(t):
13044479Sbinkertn@umich.edu    'statement : PRINT error'
13054479Sbinkertn@umich.edu    print "Syntax error in print statement. Bad expression"
13064479Sbinkertn@umich.edu</pre>
13074479Sbinkertn@umich.edu</blockquote>
13084479Sbinkertn@umich.edu
13094479Sbinkertn@umich.eduThis is because the first bad token encountered will cause the rule to
13104479Sbinkertn@umich.edube reduced--which may make it difficult to recover if more bad tokens
13114479Sbinkertn@umich.eduimmediately follow.   
13124479Sbinkertn@umich.edu
13134479Sbinkertn@umich.edu<h4>Panic mode recovery</h4>
13142632Sstever@eecs.umich.edu
13152632Sstever@eecs.umich.eduAn alternative error recovery scheme is to enter a panic mode recovery in which tokens are
13162632Sstever@eecs.umich.edudiscarded to a point where the parser might be able to recover in some sensible manner.
13172632Sstever@eecs.umich.edu
13182632Sstever@eecs.umich.edu<p>
13194479Sbinkertn@umich.eduPanic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
13202632Sstever@eecs.umich.edufunction starts discarding tokens until it reaches a closing '}'.  Then, it restarts the 
13212632Sstever@eecs.umich.eduparser in its initial state.
13224479Sbinkertn@umich.edu
13234479Sbinkertn@umich.edu<blockquote>
13244479Sbinkertn@umich.edu<pre>
13254479Sbinkertn@umich.edudef p_error(t):
13264479Sbinkertn@umich.edu    print "Whoa. You are seriously hosed."
13274479Sbinkertn@umich.edu    # Read ahead looking for a closing '}'
13284479Sbinkertn@umich.edu    while 1:
13294479Sbinkertn@umich.edu        tok = yacc.token()             # Get the next token
13304479Sbinkertn@umich.edu        if not tok or tok.type == 'RBRACE': break
13314479Sbinkertn@umich.edu    yacc.restart()
13322632Sstever@eecs.umich.edu</pre>
13332632Sstever@eecs.umich.edu</blockquote>
13344479Sbinkertn@umich.edu
13354479Sbinkertn@umich.edu<p>
13362632Sstever@eecs.umich.eduThis function simply discards the bad token and tells the parser that the error was ok.
13372632Sstever@eecs.umich.edu
13382632Sstever@eecs.umich.edu<blockquote>
13394479Sbinkertn@umich.edu<pre>
13404479Sbinkertn@umich.edudef p_error(t):
13412632Sstever@eecs.umich.edu    print "Syntax error at token", t.type
13422632Sstever@eecs.umich.edu    # Just discard the token and tell the parser it's okay.
13432632Sstever@eecs.umich.edu    yacc.errok()
13442632Sstever@eecs.umich.edu</pre>
13452632Sstever@eecs.umich.edu</blockquote>
13462632Sstever@eecs.umich.edu
13472632Sstever@eecs.umich.edu<P>
13482632Sstever@eecs.umich.eduWithin the <tt>p_error()</tt> function, three functions are available to control the behavior
13492632Sstever@eecs.umich.eduof the parser:
13502632Sstever@eecs.umich.edu<p>
13512632Sstever@eecs.umich.edu<ul>
13522632Sstever@eecs.umich.edu<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
13532632Sstever@eecs.umich.edumode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
13542632Sstever@eecs.umich.eduerror counters so that the next syntax error will call <tt>p_error()</tt> again.
13552632Sstever@eecs.umich.edu
13562632Sstever@eecs.umich.edu<p>
13572632Sstever@eecs.umich.edu<li><tt>yacc.token()</tt>.  This returns the next token on the input stream.
13582632Sstever@eecs.umich.edu
13594479Sbinkertn@umich.edu<p>
13604479Sbinkertn@umich.edu<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
13614479Sbinkertn@umich.eduto its initial state. 
13624479Sbinkertn@umich.edu</ul>
13634479Sbinkertn@umich.edu
13642632Sstever@eecs.umich.eduNote: these functions are only available when invoking <tt>p_error()</tt> and are not available
13652632Sstever@eecs.umich.eduat any other time.
13662632Sstever@eecs.umich.edu
13672632Sstever@eecs.umich.edu<p>
13682632Sstever@eecs.umich.eduTo supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
13692632Sstever@eecs.umich.eduuseful if trying to synchronize on special characters.  For example:
13702632Sstever@eecs.umich.edu
13712632Sstever@eecs.umich.edu<blockquote>
13722632Sstever@eecs.umich.edu<pre>
13732632Sstever@eecs.umich.edudef p_error(t):
13742632Sstever@eecs.umich.edu    # Read ahead looking for a terminating ";"
13752632Sstever@eecs.umich.edu    while 1:
13762632Sstever@eecs.umich.edu        tok = yacc.token()             # Get the next token
13772632Sstever@eecs.umich.edu        if not tok or tok.type == 'SEMI': break
13782632Sstever@eecs.umich.edu    yacc.errok()
13792632Sstever@eecs.umich.edu
13802632Sstever@eecs.umich.edu    # Return SEMI to the parser as the next lookahead token
13812632Sstever@eecs.umich.edu    return tok  
13822632Sstever@eecs.umich.edu</pre>
13832632Sstever@eecs.umich.edu</blockquote>
13842632Sstever@eecs.umich.edu
13852632Sstever@eecs.umich.edu<h4>General comments on error handling</h4>
13862632Sstever@eecs.umich.edu
13872632Sstever@eecs.umich.eduFor normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable
13884479Sbinkertn@umich.edutechnique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy 
13894479Sbinkertn@umich.eduto recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
13904479Sbinkertn@umich.eduto discard huge portions of the input text to find a valid restart point.
13914479Sbinkertn@umich.edu
13924479Sbinkertn@umich.edu<h2>Line Number Tracking</h2>
13934479Sbinkertn@umich.edu
13942632Sstever@eecs.umich.edu<tt>yacc.py</tt> automatically tracks line numbers for all of the grammar symbols and tokens it processes.  To retrieve the line
13952632Sstever@eecs.umich.edunumbers, two functions are used in grammar rules:
13962632Sstever@eecs.umich.edu
13972632Sstever@eecs.umich.edu<ul>
13982632Sstever@eecs.umich.edu<li><tt>t.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>
13992632Sstever@eecs.umich.edu<li><tt>t.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
14002632Sstever@eecs.umich.edu</ul>
14012632Sstever@eecs.umich.edu
14022632Sstever@eecs.umich.eduFor example:
14032632Sstever@eecs.umich.edu
14042632Sstever@eecs.umich.edu<blockquote>
14052632Sstever@eecs.umich.edu<pre>
14062632Sstever@eecs.umich.edudef t_expression(t):
14072632Sstever@eecs.umich.edu    'expression : expression PLUS expression'
14082632Sstever@eecs.umich.edu    t.lineno(1)        # Line number of the left expression
14092632Sstever@eecs.umich.edu    t.lineno(2)        # line number of the PLUS operator
14102632Sstever@eecs.umich.edu    t.lineno(3)        # line number of the right expression
14112632Sstever@eecs.umich.edu    ...
14122632Sstever@eecs.umich.edu    start,end = t.linespan(3)    # Start,end lines of the right expression
14132632Sstever@eecs.umich.edu
14142632Sstever@eecs.umich.edu</pre>
14152632Sstever@eecs.umich.edu</blockquote>
14162632Sstever@eecs.umich.edu
14172632Sstever@eecs.umich.eduSince line numbers are managed internally by the parser, there is usually no need to modify the line
14182632Sstever@eecs.umich.edunumbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
14192632Sstever@eecs.umich.eduprivate copy.
14202632Sstever@eecs.umich.edu
14212632Sstever@eecs.umich.edu<h2>AST Construction</h2>
14222632Sstever@eecs.umich.edu
14232632Sstever@eecs.umich.edu<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
14242632Sstever@eecs.umich.educonstruction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
14252632Sstever@eecs.umich.eduand assign nodes to <tt>t[0]</tt> in each rule.
14262632Sstever@eecs.umich.edu
14272632Sstever@eecs.umich.eduFor example:
14282632Sstever@eecs.umich.edu
14292632Sstever@eecs.umich.edu<blockquote>
14302632Sstever@eecs.umich.edu<pre>
14312632Sstever@eecs.umich.educlass Expr: pass
14322632Sstever@eecs.umich.edu
14332632Sstever@eecs.umich.educlass BinOp(Expr):
14342632Sstever@eecs.umich.edu    def __init__(self,left,op,right):
14352632Sstever@eecs.umich.edu        self.type = "binop"
14362632Sstever@eecs.umich.edu        self.left = left
14372632Sstever@eecs.umich.edu        self.right = right
14382632Sstever@eecs.umich.edu        self.op = op
14392632Sstever@eecs.umich.edu
14402632Sstever@eecs.umich.educlass Number(Expr):
14412632Sstever@eecs.umich.edu    def __init__(self,value):
14422632Sstever@eecs.umich.edu        self.type = "number"
14432632Sstever@eecs.umich.edu        self.value = value
14444479Sbinkertn@umich.edu
14454479Sbinkertn@umich.edudef p_expression_binop(t):
14462632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
14472632Sstever@eecs.umich.edu                  | expression MINUS expression
14482632Sstever@eecs.umich.edu                  | expression TIMES expression
14494479Sbinkertn@umich.edu                  | expression DIVIDE expression'''
14504479Sbinkertn@umich.edu
14514479Sbinkertn@umich.edu    t[0] = BinOp(t[1],t[2],t[3])
14524479Sbinkertn@umich.edu
14534479Sbinkertn@umich.edudef p_expression_group(t):
14544479Sbinkertn@umich.edu    'expression : LPAREN expression RPAREN'
14554479Sbinkertn@umich.edu    t[0] = t[2]
14562632Sstever@eecs.umich.edu
14572632Sstever@eecs.umich.edudef p_expression_number(t):
14582632Sstever@eecs.umich.edu    'expression : NUMBER'
14592632Sstever@eecs.umich.edu    t[0] = Number(t[1])
14602632Sstever@eecs.umich.edu</pre>
14612632Sstever@eecs.umich.edu</blockquote>
14622632Sstever@eecs.umich.edu
14632632Sstever@eecs.umich.eduTo simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
14644479Sbinkertn@umich.eduFor example:
14652632Sstever@eecs.umich.edu
14662632Sstever@eecs.umich.edu<blockquote>
14672632Sstever@eecs.umich.edu<pre>
14682632Sstever@eecs.umich.educlass Node:
14694479Sbinkertn@umich.edu    def __init__(self,type,children=None,leaf=None):
14702632Sstever@eecs.umich.edu         self.type = type
14714479Sbinkertn@umich.edu         if children:
14724479Sbinkertn@umich.edu              self.children = children
14734479Sbinkertn@umich.edu         else:
14742632Sstever@eecs.umich.edu              self.children = [ ]
14754479Sbinkertn@umich.edu         self.leaf = leaf
14764479Sbinkertn@umich.edu	 
14774479Sbinkertn@umich.edudef p_expression_binop(t):
14782632Sstever@eecs.umich.edu    '''expression : expression PLUS expression
14794479Sbinkertn@umich.edu                  | expression MINUS expression
14804479Sbinkertn@umich.edu                  | expression TIMES expression
14814479Sbinkertn@umich.edu                  | expression DIVIDE expression'''
14822632Sstever@eecs.umich.edu
14834479Sbinkertn@umich.edu    t[0] = Node("binop", [t[1],t[3]], t[2])
14844479Sbinkertn@umich.edu</pre>
14854479Sbinkertn@umich.edu</blockquote>
14862632Sstever@eecs.umich.edu
14874479Sbinkertn@umich.edu<h2>Yacc implementation notes</h2>
14884479Sbinkertn@umich.edu
14894479Sbinkertn@umich.edu<ul>
14902632Sstever@eecs.umich.edu<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
14914479Sbinkertn@umich.educan be supplied as follows:
14924479Sbinkertn@umich.edu
14934479Sbinkertn@umich.edu<blockquote>
14942632Sstever@eecs.umich.edu<pre>
14954479Sbinkertn@umich.eduyacc.parse(lexer=x)
14964479Sbinkertn@umich.edu</pre>
14974479Sbinkertn@umich.edu</blockquote>
14982632Sstever@eecs.umich.eduin this case, <tt>x</tt> must be a Lexer object that minimally has a <tt>x.token()</tt> method for retrieving the next
14994479Sbinkertn@umich.edutoken.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.
15002632Sstever@eecs.umich.edu
15012632Sstever@eecs.umich.edu<p>
15024479Sbinkertn@umich.edu<li>By default, the yacc generates tables in debugging mode (which produces the parser.out file and other output).
15032632Sstever@eecs.umich.eduTo disable this, use
15042632Sstever@eecs.umich.edu
15052632Sstever@eecs.umich.edu<blockquote>
15062632Sstever@eecs.umich.edu<pre>
15072632Sstever@eecs.umich.eduyacc.yacc(debug=0)
15084479Sbinkertn@umich.edu</pre>
15094479Sbinkertn@umich.edu</blockquote>
15104479Sbinkertn@umich.edu
15112632Sstever@eecs.umich.edu<p>
15122632Sstever@eecs.umich.edu<li>To change the name of the <tt>parsetab.py</tt> file,  use:
15132632Sstever@eecs.umich.edu
15142632Sstever@eecs.umich.edu<blockquote>
15152632Sstever@eecs.umich.edu<pre>
15162632Sstever@eecs.umich.eduyacc.yacc(tabmodule="foo")
15172632Sstever@eecs.umich.edu</pre>
15182632Sstever@eecs.umich.edu</blockquote>
15192632Sstever@eecs.umich.edu
15202632Sstever@eecs.umich.edu<P>
15212632Sstever@eecs.umich.edu<li>To print copious amounts of debugging during parsing, use:
15222632Sstever@eecs.umich.edu
15234479Sbinkertn@umich.edu<blockquote>
15244479Sbinkertn@umich.edu<pre>
15254479Sbinkertn@umich.eduyacc.parse(debug=1)
15262632Sstever@eecs.umich.edu</pre>
15272632Sstever@eecs.umich.edu</blockquote>
15282632Sstever@eecs.umich.edu
15294479Sbinkertn@umich.edu<p>
15302632Sstever@eecs.umich.edu<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
15312632Sstever@eecs.umich.eduparsers in the same application, do this:
15324479Sbinkertn@umich.edu
15334479Sbinkertn@umich.edu<blockquote>
15344479Sbinkertn@umich.edu<pre>
15352632Sstever@eecs.umich.edup = yacc.yacc()
15362632Sstever@eecs.umich.edu...
15372632Sstever@eecs.umich.edup.parse()
15384479Sbinkertn@umich.edu</pre>
15394479Sbinkertn@umich.edu</blockquote>
15402632Sstever@eecs.umich.edu
15414479Sbinkertn@umich.eduNote: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.
15422632Sstever@eecs.umich.edu
15432632Sstever@eecs.umich.edu<p>
15442632Sstever@eecs.umich.edu<li>Since the generation of the SLR tables is relatively expensive, previously generated tables are
15452632Sstever@eecs.umich.educached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
15462632Sstever@eecs.umich.educhecksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.
15472632Sstever@eecs.umich.edu
15484479Sbinkertn@umich.edu<p>
15494479Sbinkertn@umich.eduIt should be noted that table generation is reasonably efficient, even for grammars that involve around a 100 rules
15504479Sbinkertn@umich.eduand several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
15514479Sbinkertn@umich.edumachine.  Please be patient.
15524479Sbinkertn@umich.edu
15532632Sstever@eecs.umich.edu<p>
15542632Sstever@eecs.umich.edu<li>Since LR parsing is mostly driven by tables, the performance of the parser is largely independent of the
15552632Sstever@eecs.umich.edusize of the grammar.   The biggest bottlenecks will be the lexer and the complexity of your grammar rules.
15562632Sstever@eecs.umich.edu</ul>
15572632Sstever@eecs.umich.edu
15584479Sbinkertn@umich.edu<h2>Parser and Lexer State Management</h2>
15594479Sbinkertn@umich.edu
15604479Sbinkertn@umich.eduIn advanced parsing applications, you may want to have multiple
15614479Sbinkertn@umich.eduparsers and lexers.  Furthermore, the parser may want to control the
15622632Sstever@eecs.umich.edubehavior of the lexer in some way.
15632632Sstever@eecs.umich.edu
15642632Sstever@eecs.umich.edu<p>
15652632Sstever@eecs.umich.eduTo do this, it is important to note that both the lexer and parser are
15662632Sstever@eecs.umich.eduactually implemented as objects.   These objects are returned by the
15672632Sstever@eecs.umich.edu<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.  For example:
15682632Sstever@eecs.umich.edu
15692632Sstever@eecs.umich.edu<blockquote>
15702632Sstever@eecs.umich.edu<pre>
15712632Sstever@eecs.umich.edulexer  = lex.lex()       # Return lexer object
15722632Sstever@eecs.umich.eduparser = yacc.yacc()     # Return parser object
15734479Sbinkertn@umich.edu</pre>
15742632Sstever@eecs.umich.edu</blockquote>
15752632Sstever@eecs.umich.edu
15762632Sstever@eecs.umich.eduWithin lexer and parser rules, these objects are also available.  In the lexer,
15772632Sstever@eecs.umich.eduthe "lexer" attribute of a token refers to the lexer object in use.  For example:
15782632Sstever@eecs.umich.edu
15792632Sstever@eecs.umich.edu<blockquote>
15802632Sstever@eecs.umich.edu<pre>
15812632Sstever@eecs.umich.edudef t_NUMBER(t):
15822632Sstever@eecs.umich.edu   r'\d+'
15832632Sstever@eecs.umich.edu   ...
15842632Sstever@eecs.umich.edu   print t.lexer           # Show lexer object
15854479Sbinkertn@umich.edu</pre>
15862632Sstever@eecs.umich.edu</blockquote>
15872632Sstever@eecs.umich.edu
15882632Sstever@eecs.umich.eduIn the parser, the "lexer" and "parser" attributes refer to the lexer
15892632Sstever@eecs.umich.eduand parser objects respectively.
15902632Sstever@eecs.umich.edu
15912632Sstever@eecs.umich.edu<blockquote>
15922632Sstever@eecs.umich.edu<pre>
15932632Sstever@eecs.umich.edudef p_expr_plus(t):
15942632Sstever@eecs.umich.edu   'expr : expr PLUS expr'
15952632Sstever@eecs.umich.edu   ...
15962632Sstever@eecs.umich.edu   print t.parser          # Show parser object
15972632Sstever@eecs.umich.edu   print t.lexer           # Show lexer object
15982632Sstever@eecs.umich.edu</pre>
15992632Sstever@eecs.umich.edu</blockquote>
16002632Sstever@eecs.umich.edu
16012632Sstever@eecs.umich.eduIf necessary, arbitrary attributes can be attached to the lexer or parser object.
16024479Sbinkertn@umich.eduFor example, if you wanted to have different parsing modes, you could attach a mode
16034479Sbinkertn@umich.eduattribute to the parser object and look at it later.
16042632Sstever@eecs.umich.edu
16052632Sstever@eecs.umich.edu<h2>Using Python's Optimized Mode</h2>
16062632Sstever@eecs.umich.edu
16072632Sstever@eecs.umich.eduBecause PLY uses information from doc-strings, parsing and lexing
16082632Sstever@eecs.umich.eduinformation must be gathered while running the Python interpreter in
16092632Sstever@eecs.umich.edunormal mode (i.e., not with the -O or -OO options).  However, if you
16104479Sbinkertn@umich.eduspecify optimized mode like this:
16112632Sstever@eecs.umich.edu
16124479Sbinkertn@umich.edu<blockquote>
16132632Sstever@eecs.umich.edu<pre>
16142632Sstever@eecs.umich.edulex.lex(optimize=1)
16152632Sstever@eecs.umich.eduyacc.yacc(optimize=1)
16164479Sbinkertn@umich.edu</pre>
16172632Sstever@eecs.umich.edu</blockquote>
16182632Sstever@eecs.umich.edu
16192632Sstever@eecs.umich.eduthen PLY can later be used when Python runs in optimized mode. To make this work,
16202632Sstever@eecs.umich.edumake sure you first run Python in normal mode.  Once the lexing and parsing tables
16212632Sstever@eecs.umich.eduhave been generated the first time, run Python in optimized mode. PLY will use
16222632Sstever@eecs.umich.eduthe tables without the need for doc strings.
16232632Sstever@eecs.umich.edu
16244479Sbinkertn@umich.edu<p>
16252632Sstever@eecs.umich.eduBeware: running PLY in optimized mode disables a lot of error
16262632Sstever@eecs.umich.educhecking.  You should only do this when your project has stabilized
16274479Sbinkertn@umich.eduand you don't need to do any debugging.
16284479Sbinkertn@umich.edu  
16294479Sbinkertn@umich.edu<h2>Where to go from here?</h2>
16304479Sbinkertn@umich.edu
16312632Sstever@eecs.umich.eduThe <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
16322632Sstever@eecs.umich.educompilers textbook for the theory and underlying implementation details or LR parsing.
16332632Sstever@eecs.umich.edu
16342632Sstever@eecs.umich.edu</body>
16352632Sstever@eecs.umich.edu</html>
16362632Sstever@eecs.umich.edu
16372632Sstever@eecs.umich.edu
16382632Sstever@eecs.umich.edu
16394479Sbinkertn@umich.edu
16402632Sstever@eecs.umich.edu
16412632Sstever@eecs.umich.edu
16422632Sstever@eecs.umich.edu
16432632Sstever@eecs.umich.edu