<html>
<head>
<title>PLY (Python Lex-Yacc)</title>
</head>
<body bgcolor="#ffffff">

<h1>PLY (Python Lex-Yacc)</h1>

<b>
David M. Beazley <br>
Department of Computer Science <br>
University of Chicago <br>
Chicago, IL 60637 <br>
beazley@cs.uchicago.edu <br>
</b>

<p>
Documentation version: $Header: /home/stever/bk/newmem2/ext/ply/doc/ply.html 1.1 03/06/06 14:53:34-00:00 stever@ $

<h2>Introduction</h2>

PLY is a Python-only implementation of the popular compiler
construction tools lex and yacc.  The implementation borrows ideas
from a number of previous efforts, most notably John Aycock's SPARK
toolkit.  However, the overall flavor of the implementation is more
closely modeled after the C version of lex and yacc.  The other
significant feature of PLY is that it provides extensive input
validation and error reporting--much more so than other Python parsing
tools.

<p>
Early versions of PLY were developed to support the Introduction to
Compilers course at the University of Chicago.  In this course,
students built a fully functional compiler for a simple Pascal-like
language.  Their compiler, implemented entirely in Python, had to
include lexical analysis, parsing, type checking, type inference,
nested scoping, and code generation for the SPARC processor.
Approximately 30 different compiler implementations were completed in
this course.  Most of PLY's interface and operation has been motivated by common
usability problems encountered by students.

<p>
Because PLY was primarily developed as an instructional tool, you will
find it to be <em>MUCH</em> more picky about token and grammar rule
specification than most other Python parsing tools.  In part, this
added formality is meant to catch common programming mistakes made by
novice users.  However, advanced users will also find such features to
be useful when building complicated grammars for real programming
languages.  It should also be noted that PLY does not provide much in the way
of bells and whistles (e.g., automatic construction of abstract syntax trees,
tree traversal, etc.).  Instead, you will find a bare-bones, yet
fully capable lex/yacc implementation written entirely in Python.

<p>
The rest of this document assumes that you are somewhat familiar with
parsing theory, syntax-directed translation, and automatic tools such
as lex and yacc.  If you are unfamiliar with these topics, you will
probably want to consult an introductory text such as "Compilers:
Principles, Techniques, and Tools" by Aho, Sethi, and Ullman.  "Lex
and Yacc" by John Levine may also be handy.

<h2>PLY Overview</h2>

PLY consists of two separate tools: <tt>lex.py</tt> and
<tt>yacc.py</tt>.  <tt>lex.py</tt> is used to break input text into a
collection of tokens specified by a collection of regular expression
rules.  <tt>yacc.py</tt> is used to recognize language syntax that has
been specified in the form of a context-free grammar.  Currently,
<tt>yacc.py</tt> uses LR parsing and generates its parsing tables
using the SLR algorithm.  LALR(1) parsing may be supported in a future
release.

<p>
The two tools are meant to work together.  Specifically,
<tt>lex.py</tt> provides an external interface in the form of a
<tt>token()</tt> function that returns the next valid token on the
input stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
tokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
often an Abstract Syntax Tree (AST).  However, this is entirely up to
the user.  If desired, <tt>yacc.py</tt> can also be used to implement
simple one-pass compilers.

<p>
Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
features you expect including extensive error checking, grammar
validation, support for empty productions, error tokens, and ambiguity
resolution via precedence rules.  The primary difference between
<tt>yacc.py</tt> and <tt>yacc</tt> is the use of SLR parsing instead
of LALR(1).  Although this slightly restricts the types of grammars
that can be successfully parsed, it is sufficiently powerful to handle most
kinds of normal programming language constructs.

<p>
Finally, it is important to note that PLY relies on reflection
(introspection) to build its lexers and parsers.  Unlike traditional
lex/yacc, which require a special input file that is converted into a
separate source file, the specifications given to PLY <em>are</em>
valid Python programs.  This means that there are no extra source
files, nor is there a special compiler construction step (e.g., running
yacc to generate Python code for the compiler).

<h2>Lex Example</h2>

<tt>lex.py</tt> is used to write tokenizers.  To do this, each token
must be defined by a regular expression rule.  The following file
implements a simple lexer for tokenizing integer expressions:

<blockquote>
<pre>
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import lex

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    try:
         t.value = int(t.value)
    except ValueError:
         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
         t.value = 0
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.skip(1)

# Build the lexer
lex.lex()

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lex.input(data)

# Tokenize
while 1:
    tok = lex.token()
    if not tok: break      # No more input
    print tok
</pre>
</blockquote>

In the example, the <tt>tokens</tt> list defines all of the possible
token names that can be produced by the lexer.  This list is always required
and is used to perform a variety of validation checks.  Following the <tt>tokens</tt>
list, a regular expression is written for each token.  Each of these
rules is defined by making a declaration with the special prefix <tt>t_</tt> to indicate that it
defines a token.  For simple tokens, the regular expression can
be specified as a string such as this (note: Python raw strings are used since they are the
most convenient way to write regular expression strings):

<blockquote>
<pre>
t_PLUS = r'\+'
</pre>
</blockquote>

In this case, the name following the <tt>t_</tt> prefix must exactly match one of the
names supplied in <tt>tokens</tt>.  If some kind of action needs to be performed,
a token rule can be specified as a function.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
    r'\d+'
    try:
         t.value = int(t.value)
    except ValueError:
         print "Number %s is too large!" % t.value
         t.value = 0
    return t
</pre>
</blockquote>

In this case, the regular expression rule is specified in the function documentation string.
The function always takes a single argument, which is an instance of
<tt>LexToken</tt>.  This object has attributes <tt>t.type</tt>, which is the token type,
<tt>t.value</tt>, which is the lexeme, and <tt>t.lineno</tt>, which is the current line number.
By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
function can modify the contents of the <tt>LexToken</tt> object as appropriate.  However,
when it is done, the resulting token should be returned.  If no value is returned by the action
function, the token is simply discarded and the next token read.

<p>
The rule <tt>t_newline()</tt> illustrates a regular expression rule
for a discarded token.  In this case, a rule is written to match
newlines so that proper line number tracking can be performed.
By returning no value, the function causes the newline character to be
discarded.
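
<p>
The same idea works for any token you want to recognize but throw away.
For instance, a rule for discarding comments might look like this (a minimal
sketch; the <tt>t_COMMENT</tt> name and the '#' comment syntax are
assumptions, not part of the example grammar above):

<blockquote>
<pre>
def t_COMMENT(t):
    r'\#.*'
    pass      # No return value.  Token is discarded
</pre>
</blockquote>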

<p>
The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
that should be completely ignored in the input stream.
Usually this is used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for whitespace in a manner
similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
lexing performance because it is handled as a special case and is checked in a much
more efficient manner than the normal regular expression rules.

<p>
Finally, the <tt>t_error()</tt>
function is used to handle lexing errors that occur when illegal
characters are detected.  In this case, the <tt>t.value</tt> attribute contains the
rest of the input string that has not been tokenized.  In the example, we simply print
the offending character and skip ahead one character by calling <tt>t.skip(1)</tt>.

<p>
To build the lexer, the function <tt>lex.lex()</tt> is used.  This function
uses Python reflection (or introspection) to read the regular expression rules
out of the calling context and build the lexer.  Once the lexer has been built, two functions can
be used to control the lexer.

<ul>
<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
None if the end of the input text has been reached.
</ul>

The code at the bottom of the example shows how the lexer is actually used.  When executed,
the following output will be produced:

<blockquote>
<pre>
$ python example.py
LexToken(NUMBER,3,2)
LexToken(PLUS,'+',2)
LexToken(NUMBER,4,2)
LexToken(TIMES,'*',2)
LexToken(NUMBER,10,2)
LexToken(PLUS,'+',3)
LexToken(MINUS,'-',3)
LexToken(NUMBER,20,3)
LexToken(TIMES,'*',3)
LexToken(NUMBER,2,3)
</pre>
</blockquote>

<h2>Lex Implementation Notes</h2>

<ul>
<li><tt>lex.py</tt> uses the <tt>re</tt> module to do its pattern matching.  When building the master regular expression,
rules are added in the following order:
<p>
<ol>
<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
<li>Tokens defined by strings are added by sorting them in order of decreasing regular expression length (longer expressions
are added first).
</ol>
<p>
Without this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you
wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
expressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
the order can be explicitly controlled since rules appearing first are checked first.
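
<p>
For instance, under the sorting rule above, these two string rules are safe to write in
either order (a minimal sketch; the token names here are assumptions and would also need
to appear in the <tt>tokens</tt> list):

<blockquote>
<pre>
t_EQUALS   = r'='       # 1 character
t_EQUALITY = r'=='      # 2 characters; added first because it is longer
</pre>
</blockquote>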

<p>
<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this
rarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
such as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.
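
<p>
To lex a file, read its entire contents into a string first (here
<tt>somefile.txt</tt> is just a placeholder name):

<blockquote>
<pre>
data = open("somefile.txt").read()     # Read the whole file as one string
lex.input(data)
</pre>
</blockquote>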

<p>
<li>
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
like this:

<blockquote>
<pre>
reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
</pre>
</blockquote>

<p>
<li>The lexer requires tokens to be defined as class instances with <tt>t.type</tt>, <tt>t.value</tt>, and <tt>t.lineno</tt>
attributes.   By default, tokens are created as instances of the <tt>LexToken</tt> class defined internally to <tt>lex.py</tt>.
If desired, you can create new kinds of tokens provided that they have the three required attributes.   However,
in practice, it is probably safer to stick with the default.

<p>
<li>The only safe attribute for assigning token properties is <tt>t.value</tt>.   In some cases, you may want to attach
a number of different properties to a token (e.g., symbol table entries for identifiers).  To do this, replace <tt>t.value</tt>
with a tuple or class instance.  For example:

<blockquote>
<pre>
def t_ID(t):
    ...
    # For identifiers, create a (lexeme, symtab) tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t
</pre>
</blockquote>

Although allowed, do NOT assign additional attributes to the token object.  For example,

<blockquote>
<pre>
def t_ID(t):
    ...
    # Bad implementation of above
    t.symtab = symbol_lookup(t.value)
    ...
</pre>
</blockquote>

The reason you don't want to do this is that the <tt>yacc.py</tt>
module only provides public access to the <tt>t.value</tt> attribute of each token.
Therefore, any other attributes you assign are inaccessible (if you are familiar
with the internals of C lex/yacc, <tt>t.value</tt> is the same as <tt>yylval.tok</tt>).

<p>
<li>To track line numbers, the lexer internally maintains a line
number variable.  Each token automatically gets the value of the
current line number in the <tt>t.lineno</tt> attribute.  To modify the
current line number, simply change the <tt>t.lineno</tt> attribute
in a function rule (as previously shown for
<tt>t_newline()</tt>).  Even if the resulting token is discarded,
changes to the line number remain in effect for subsequent tokens.

<p>
<li>To support multiple scanners in the same application, the <tt>lex.lex()</tt> function
actually returns a special <tt>Lexer</tt> object.   This object has two methods,
<tt>input()</tt> and <tt>token()</tt>, that can be used to supply input and get tokens.  For example:

<blockquote>
<pre>
lexer = lex.lex()
lexer.input(sometext)
while 1:
    tok = lexer.token()
    if not tok: break
    print tok
</pre>
</blockquote>

The functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt>
and <tt>token()</tt> methods of the last lexer created by the lex module.

<p>
<li>To reduce compiler startup time and improve performance, the lexer can be built in optimized mode as follows:

<blockquote>
<pre>
lex.lex(optimize=1)
</pre>
</blockquote>

When used, most error checking and validation is disabled.   This provides a slight performance
gain while tokenizing and tends to chop a few tenths of a second off startup time.  Since it disables
error checking, this mode is not the default and is not recommended during development.  However, once
you have your compiler fully working, it is usually safe to disable the error checks.

<p>
<li>You can enable some additional debugging by building the lexer like this:

<blockquote>
<pre>
lex.lex(debug=1)
</pre>
</blockquote>

<p>
<li>To help you debug your lexer, <tt>lex.py</tt> comes with a simple main program which will
tokenize input read from standard input or from a file.  To use it, simply put this in your lexer:

<blockquote>
<pre>
if __name__ == '__main__':
     lex.runmain()
</pre>
</blockquote>

Then, run your lexer as a main program, such as <tt>python mylex.py</tt>.

<p>
<li>Since the lexer is written entirely in Python, its performance is
largely determined by that of the Python <tt>re</tt> module.  Although
the lexer has been written to be as efficient as possible, it's not
blazingly fast when used on very large input files.  Sorry.  If
performance is a concern, you might consider upgrading to the most
recent version of Python, creating a hand-written lexer, or offloading
the lexer into a C extension module.  In defense of <tt>lex.py</tt>,
its performance is not <em>that</em> bad when used on reasonably
sized input files.  For instance, lexing a 4700-line C program with
32000 input tokens takes about 20 seconds on a 200 MHz PC.  Obviously,
it will run much faster on a speedier machine.

</ul>

<h2>Parsing Basics</h2>

<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
example, there are a few important bits of background that must be
mentioned.  First, <em>syntax</em> is usually specified in terms of a
context-free grammar (CFG).  For example, if you wanted to parse
simple arithmetic expressions, you might first write an unambiguous
grammar specification like this:

<blockquote>
<pre>
expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )
</pre>
</blockquote>

Next, the semantic behavior of a language is often specified using a
technique known as syntax-directed translation.  In syntax-directed
translation, attributes are attached to each symbol in a given grammar
rule along with an action.  Whenever a particular grammar rule is
recognized, the action describes what to do.  For example, given the
expression grammar above, you might write the specification for a
simple calculator like this:

<blockquote>
<pre>
Grammar                             Action
--------------------------------    --------------------------------------------
expression0 : expression1 + term    expression0.val = expression1.val + term.val
            | expression1 - term    expression0.val = expression1.val - term.val
            | term                  expression0.val = term.val

term0       : term1 * factor        term0.val = term1.val * factor.val
            | term1 / factor        term0.val = term1.val / factor.val
            | factor                term0.val = factor.val

factor      : NUMBER                factor.val = int(NUMBER.lexval)
            | ( expression )        factor.val = expression.val
</pre>
</blockquote>

Finally, yacc uses a parsing technique known as LR parsing or shift-reduce parsing.  LR parsing is a
bottom-up technique that tries to recognize the right-hand-side of various grammar rules.
Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
grammar symbols are replaced by the grammar symbol on the left-hand-side.

<p>
LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
input token for patterns.   The details of the algorithm can be found in a compiler text, but the
following example illustrates the steps that are performed if you wanted to parse the expression
<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:

<blockquote>
<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
9    $ expr + term                * ( 10 - 20 )$    Shift *
10   $ expr + term *                ( 10 - 20 )$    Shift (
11   $ expr + term * (                10 - 20 )$    Shift 10
12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
14   $ expr + term * ( term              - 20 )$    Reduce expr : term
15   $ expr + term * ( expr              - 20 )$    Shift -
16   $ expr + term * ( expr -              20 )$    Shift 20
17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
18   $ expr + term * ( expr - factor          )$    Reduce term : factor
19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
20   $ expr + term * ( expr                   )$    Shift )
21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
22   $ expr + term * factor                    $    Reduce term : term * factor
23   $ expr + term                             $    Reduce expr : expr + term
24   $ expr                                    $    Reduce expr
25   $                                         $    Success!
</pre>
</blockquote>

When parsing the expression, an underlying state machine and the current input token determine what to do next.
If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
onto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
usually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
appropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).

<p>
It is important to note that the underlying implementation is actually built around a large finite-state machine
and some tables.   The construction of these tables is quite complicated and beyond the scope of this discussion.
However, subtle details of this process explain why, in the example above, the parser chooses to shift a token
onto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.

<h2>Yacc Example</h2>

Suppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
how you would do it with <tt>yacc.py</tt>:

<blockquote>
<pre>
# Yacc example

import yacc

# Get the token map from the lexer.  This is required.
from calclex import tokens

def p_expression_plus(t):
    'expression : expression PLUS term'
    t[0] = t[1] + t[3]

def p_expression_minus(t):
    'expression : expression MINUS term'
    t[0] = t[1] - t[3]

def p_expression_term(t):
    'expression : term'
    t[0] = t[1]

def p_term_times(t):
    'term : term TIMES factor'
    t[0] = t[1] * t[3]

def p_term_div(t):
    'term : term DIVIDE factor'
    t[0] = t[1] / t[3]

def p_term_factor(t):
    'term : factor'
    t[0] = t[1]

def p_factor_num(t):
    'factor : NUMBER'
    t[0] = t[1]

def p_factor_expr(t):
    'factor : LPAREN expression RPAREN'
    t[0] = t[2]

# Error rule for syntax errors
def p_error(t):
    print "Syntax error in input!"

# Build the parser
yacc.yacc()

while 1:
   try:
       s = raw_input('calc > ')
   except EOFError:
       break
   if not s: continue
   result = yacc.parse(s)
   print result
</pre>
</blockquote>

In this example, each grammar rule is defined by a Python function where the docstring to that function contains the
appropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit).  Each function accepts a single
argument <tt>t</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
<tt>t[i]</tt> are mapped to grammar symbols as shown here:

<blockquote>
<pre>
def p_expression_plus(t):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  t[0]         t[1]     t[2] t[3]

    t[0] = t[1] + t[3]
</pre>
</blockquote>

For tokens, the "value" in the corresponding <tt>t[i]</tt> is the
<em>same</em> as the value of the <tt>t.value</tt> attribute assigned
in the lexer module.  For non-terminals, the value is determined by
whatever is placed in <tt>t[0]</tt> when rules are reduced.  This
value can be anything at all.  However, it is probably most common for
the value to be a simple Python type, a tuple, or an instance.  In this example, we
are relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
field.  All of the other rules simply perform various types of integer operations and store
the result.

<p>
The first rule defined in the yacc specification determines the starting grammar
symbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
the starting rule is reduced by the parser and no more input is available, parsing
stops and the final value is returned (this value will be whatever the top-most rule
placed in <tt>t[0]</tt>).

<p>The <tt>p_error(t)</tt> rule is defined to catch syntax errors.  See the error handling section
below for more detail.

<p>
To build the parser, call the <tt>yacc.yacc()</tt> function.  This function
looks at the module and attempts to construct all of the LR parsing tables for the grammar
you have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
such as this:

<blockquote>
<pre>
$ python calcparse.py
yacc: Generating SLR parsing table...
calc >
</pre>
</blockquote>

Since table construction is relatively expensive (especially for large
grammars), the resulting parsing table is written to the current
directory in a file called <tt>parsetab.py</tt>.  In addition, a
debugging file called <tt>parser.out</tt> is created.  On subsequent
executions, <tt>yacc</tt> will reload the table from
<tt>parsetab.py</tt> unless it has detected a change in the underlying
grammar (in which case the tables and <tt>parsetab.py</tt> file are
regenerated).

<p>
If any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
diagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:

<ul>
<li>Duplicated function names (if more than one rule function has the same name in the grammar file).
<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
<li>Badly specified grammar rules.
<li>Infinite recursion (rules that can never terminate).
<li>Unused rules and tokens.
<li>Undefined rules and tokens.
</ul>

The next few sections discuss a few finer points of grammar construction.

<h2>Combining Grammar Rule Functions</h2>

When grammar rules are similar, they can be combined into a single function.
For example, consider the two rules in our earlier example:

<blockquote>
<pre>
def p_expression_plus(t):
    'expression : expression PLUS term'
    t[0] = t[1] + t[3]

def p_expression_minus(t):
    'expression : expression MINUS term'
    t[0] = t[1] - t[3]
</pre>
</blockquote>

Instead of writing two functions, you might write a single function like this:

<blockquote>
<pre>
def p_expression(t):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if t[2] == '+':
        t[0] = t[1] + t[3]
    elif t[2] == '-':
        t[0] = t[1] - t[3]
</pre>
</blockquote>

In general, the docstring for any given function can contain multiple grammar rules.  So, it would
have also been legal (although possibly confusing) to write this:

<blockquote>
<pre>
def p_binary_operators(t):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if t[2] == '+':
        t[0] = t[1] + t[3]
    elif t[2] == '-':
        t[0] = t[1] - t[3]
    elif t[2] == '*':
        t[0] = t[1] * t[3]
    elif t[2] == '/':
        t[0] = t[1] / t[3]
</pre>
</blockquote>

When combining grammar rules into a single function, it is usually a good idea for all of the rules to have
a similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more
complicated than necessary.
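
<p>
If you do combine rules with different structures, the number of symbols can be used to tell
them apart.  Since <tt>t</tt> is a sequence, its length reflects which rule matched.  A minimal
sketch (the unary-minus rule here is purely illustrative, and this relies on the rule object
supporting <tt>len()</tt>):

<blockquote>
<pre>
def p_expression(t):
    '''expression : expression MINUS expression
                  | MINUS expression'''
    if len(t) == 4:               # expression MINUS expression
        t[0] = t[1] - t[3]
    else:                         # MINUS expression
        t[0] = -t[2]
</pre>
</blockquote>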

<h2>Empty Productions</h2>

<tt>yacc.py</tt> can handle empty productions by defining a rule like this:

<blockquote>
<pre>
def p_empty(t):
    'empty :'
    pass
</pre>
</blockquote>

Now to use the empty production, simply use 'empty' as a symbol.  For example:

<blockquote>
<pre>
def p_optitem(t):
    '''optitem : item
               | empty'''
    ...
</pre>
</blockquote>

<h2>Dealing With Ambiguous Grammars</h2>

The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity.
However, in many situations, it is extremely difficult or awkward to write grammars in this format.  A
much more natural way to express the grammar is in a more compact form like this:

<blockquote>
<pre>
expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

Unfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
Does this expression mean "(3 * 4) + 5" or "3 * (4 + 5)"?

<p>
When an ambiguous grammar is given to <tt>yacc.py</tt>, it will print messages about "shift/reduce conflicts"
or "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
whether to reduce a rule or to shift a symbol onto the parsing stack.   For example, consider
the string "3 * 4 + 5" and the internal parsing stack:

<blockquote>
<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                                3 * 4 + 5$    Shift 3
2    $ 3                                * 4 + 5$    Reduce expression : NUMBER
3    $ expr                             * 4 + 5$    Shift *
4    $ expr *                             4 + 5$    Shift 4
5    $ expr * 4                             + 5$    Reduce expression : NUMBER
6    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
</pre>
</blockquote>

In this case, when the parser reaches step 6, it has two options.  One is to reduce the
rule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
token <tt>+</tt> onto the stack.   Both options are perfectly legal from the rules
of the context-free grammar.

<p>
By default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
example, the parser will always shift the <tt>+</tt> instead of reducing.    Although this
strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
expressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
reduced <tt>expr * expr</tt> since multiplication has higher precedence than addition.

<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
tokens to be assigned a precedence level and associativity.  This is done by adding a variable
<tt>precedence</tt> to the grammar file like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
</pre>
</blockquote>

This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
the same precedence level and are left-associative and that
<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
Furthermore, the declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
precedence specification).

<p>
The precedence specification is used to attach a numerical precedence value and associativity direction
to each grammar rule.  This is always determined by the precedence of the right-most terminal symbol.  Therefore,
if PLUS/MINUS had a precedence of 1 and TIMES/DIVIDE had a precedence of 2, the grammar rules
would have precedence values as follows:

<blockquote>
<pre>
expression : expression PLUS expression                 # prec = 1, left
           | expression MINUS expression                # prec = 1, left
           | expression TIMES expression                # prec = 2, left
           | expression DIVIDE expression               # prec = 2, left
           | LPAREN expression RPAREN                   # prec = unknown
           | NUMBER                                     # prec = unknown
</pre>
</blockquote>

When shift/reduce conflicts are encountered, the parser generator resolves the conflict by
looking at the precedence rules and associativity specifiers:

<p>
<ol>
<li>If the current token has higher precedence, it is shifted.
<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
<li>If the current token and the grammar rule have the same precedence, the
rule is reduced for left associativity, whereas the token is shifted for right associativity.
<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
favor of shifting (the default).
</ol>

<p>
When shift/reduce conflicts are resolved using the first three techniques (with the help of
precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar.

<p>
One problem with the precedence specifier technique is that it is sometimes necessary to
change the precedence of an operator in certain contexts.  For example, consider a unary-minus operator
in "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
However, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
precedence rules can be given for fictitious tokens like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

Now, in the grammar file, we can write our unary minus rule like this:

<blockquote>
<pre>
def p_expr_uminus(t):
    'expression : MINUS expression %prec UMINUS'
    t[0] = -t[2]
</pre>
</blockquote>

In this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
of UMINUS in the precedence specifier.

<p>
It is also possible to specify non-associativity in the <tt>precedence</tt> table.  This would
be used when you <em>don't</em> want operations to chain together.  For example, suppose
you wanted to support comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
combinations like <tt>a &lt; b &lt; c</tt>.   To do this, simply specify the precedence like this:

<blockquote>
<pre>
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

<p>
Reduce/reduce conflicts are caused when there are multiple grammar
rules that can be applied to a given set of symbols.  This kind of
conflict is almost always bad and is always resolved by picking the
rule that appears first in the grammar file.   Reduce/reduce conflicts
are almost always caused when different sets of grammar rules somehow
generate the same set of symbols.  For example:

<blockquote>
<pre>
assignment :  ID EQUALS NUMBER
           |  ID EQUALS expression

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

In this case, a reduce/reduce conflict exists between these two rules:

<blockquote>
<pre>
assignment  : ID EQUALS NUMBER
expression  : NUMBER
</pre>
</blockquote>

For example, if you wrote "a = 5", the parser can't figure out if this
is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
whether it's supposed to reduce the 5 as an expression and then reduce
the rule <tt>assignment : ID EQUALS expression</tt>.
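
<p>
The usual fix is to rewrite the grammar so that only one rule can match.  In this case,
the <tt>assignment : ID EQUALS NUMBER</tt> rule is redundant (an expression can already
be a NUMBER), so one way to remove the conflict is simply to delete it:

<blockquote>
<pre>
assignment : ID EQUALS expression

expression : expression PLUS expression
           | ...
           | NUMBER
</pre>
</blockquote>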

<h2>The parser.out File</h2>

Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
parsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
<tt>parser.out</tt> when it generates the parsing table.   The contents of this file look like the following:

<blockquote>
<pre>
Unused terminals:


Grammar

Rule 1     expression -> expression PLUS expression
Rule 2     expression -> expression MINUS expression
Rule 3     expression -> expression TIMES expression
Rule 4     expression -> expression DIVIDE expression
Rule 5     expression -> NUMBER
Rule 6     expression -> LPAREN expression RPAREN

Terminals, with rules where they appear

TIMES                : 3
error                :
MINUS                : 2
RPAREN               : 6
LPAREN               : 6
DIVIDE               : 4
PLUS                 : 1
NUMBER               : 5

Nonterminals, with rules where they appear

expression           : 1 1 2 2 3 3 4 4 6 0


Parsing method: SLR


state 0

    S' -> . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 1

    S' -> expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 2

    expression -> LPAREN . expression RPAREN
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 3

    expression -> NUMBER .

    $               reduce using rule 5
    PLUS            reduce using rule 5
    MINUS           reduce using rule 5
    TIMES           reduce using rule 5
    DIVIDE          reduce using rule 5
    RPAREN          reduce using rule 5


state 4

    expression -> expression TIMES . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 5

    expression -> expression MINUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 6

    expression -> expression PLUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 7

    expression -> expression DIVIDE . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 8

    expression -> LPAREN expression . RPAREN
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    RPAREN          shift and go to state 13
    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 9

    expression -> expression TIMES expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 3
    PLUS            reduce using rule 3
    MINUS           reduce using rule 3
    TIMES           reduce using rule 3
    DIVIDE          reduce using rule 3
    RPAREN          reduce using rule 3

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 10

    expression -> expression MINUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 2
    PLUS            reduce using rule 2
    MINUS           reduce using rule 2
    RPAREN          reduce using rule 2
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 11

    expression -> expression PLUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 1
    PLUS            reduce using rule 1
    MINUS           reduce using rule 1
    RPAREN          reduce using rule 1
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 1 ]
  ! DIVIDE          [ reduce using rule 1 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 12

    expression -> expression DIVIDE expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 4
    PLUS            reduce using rule 4
    MINUS           reduce using rule 4
    TIMES           reduce using rule 4
    DIVIDE          reduce using rule 4
    RPAREN          reduce using rule 4

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 13

    expression -> LPAREN expression RPAREN .

    $               reduce using rule 6
    PLUS            reduce using rule 6
    MINUS           reduce using rule 6
    TIMES           reduce using rule 6
    DIVIDE          reduce using rule 6
    RPAREN          reduce using rule 6
</pre>
</blockquote>

In the file, each state of the grammar is described.  Within each state, the "." indicates the current
location of the parse within any applicable grammar rules.   In addition, the actions for each valid
input token are listed.   When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
are prefixed with an !.  For example:

<blockquote>
<pre>
  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
</pre>
</blockquote>

By looking at these rules (and with a little practice), you can usually track down the source
of most parsing conflicts.  It should also be stressed that not all shift/reduce conflicts are
bad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.

<h2>Syntax Error Handling</h2>

When a syntax error occurs during parsing, the error is immediately
detected (i.e., the parser does not read any more tokens beyond the
source of the error).  Error recovery in LR parsers is a delicate
topic that involves ancient rituals and black magic.   The recovery mechanism
provided by <tt>yacc.py</tt> is comparable to that of Unix yacc, so you may want to
consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.

<p>
When a syntax error occurs, <tt>yacc.py</tt> performs the following steps:

<ol>
<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
is called with the offending token as an argument.  Afterwards, the parser enters
an "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
has successfully shifted at least 3 tokens onto the parsing stack.

<p>
<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
with a special <tt>error</tt> token.

<p>
<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
deleted.

<p>
<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
parsing from its initial state.

<p>
<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
shifted onto the parsing stack.

<p>
<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
parser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
</ol>

<h4>Recovery and resynchronization with error rules</h4>

The most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
token.  For example, suppose your language had a grammar rule for a print statement like this:

<blockquote>
<pre>
def p_statement_print(t):
     'statement : PRINT expr SEMI'
     ...
</pre>
</blockquote>

To account for the possibility of a bad expression, you might write an additional grammar rule like this:

<blockquote>
<pre>
def p_statement_print_error(t):
     'statement : PRINT error SEMI'
     print "Syntax error in print statement. Bad expression"
</pre>
</blockquote>

In this case, the <tt>error</tt> token will match any sequence of
tokens that might appear up to the first semicolon that is
encountered.  Once the semicolon is reached, the rule will be
invoked and the <tt>error</tt> token will go away.

<p>
This type of recovery is sometimes known as parser resynchronization.
The <tt>error</tt> token acts as a wildcard for any bad input text and
the token immediately following <tt>error</tt> acts as a
synchronization token.

<p>
It is important to note that the <tt>error</tt> token usually does not appear as the last token
on the right in an error rule.  For example:

<blockquote>
<pre>
def p_statement_print_error(t):
    'statement : PRINT error'
    print "Syntax error in print statement. Bad expression"
</pre>
</blockquote>

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow.

<h4>Panic mode recovery</h4>

An alternative error recovery scheme is to enter a panic mode recovery in which tokens are
discarded to a point where the parser might be able to recover in some sensible manner.

<p>
Panic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
function starts discarding tokens until it reaches a closing '}'.  Then, it restarts the
parser in its initial state.

<blockquote>
<pre>
def p_error(t):
    print "Whoa. You are seriously hosed."
    # Read ahead looking for a closing '}'
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'RBRACE': break
    yacc.restart()
</pre>
</blockquote>

<p>
The following function simply discards the bad token and tells the parser that the error was OK.

<blockquote>
<pre>
def p_error(t):
    print "Syntax error at token", t.type
    # Just discard the token and tell the parser it's okay.
    yacc.errok()
</pre>
</blockquote>

<p>
Within the <tt>p_error()</tt> function, three functions are available to control the behavior
of the parser:
<p>
<ul>
<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
mode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
error counters so that the next syntax error will call <tt>p_error()</tt> again.

<p>
<li><tt>yacc.token()</tt>.  This returns the next token on the input stream.

<p>
<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
to its initial state.
</ul>

Note: these functions are only available when invoking <tt>p_error()</tt> and are not available
at any other time.

<p>
To supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
useful if trying to synchronize on special characters.  For example:

<blockquote>
<pre>
def p_error(t):
    # Read ahead looking for a terminating ";"
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'SEMI': break
    yacc.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok
</pre>
</blockquote>

<h4>General comments on error handling</h4>

For most languages, error recovery with error rules and resynchronization characters is probably the most reliable
technique.  This is because you can instrument the grammar to catch errors at selected places where it is relatively easy
to recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
to discard huge portions of the input text to find a valid restart point.

<h2>Line Number Tracking</h2>

<tt>yacc.py</tt> automatically tracks line numbers for all of the grammar symbols and tokens it processes.  To retrieve the line
numbers, two functions are used in grammar rules:

<ul>
<li><tt>t.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>.
<li><tt>t.linespan(num)</tt>.  Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
</ul>

For example:

<blockquote>
<pre>
def p_expression(t):
    'expression : expression PLUS expression'
    t.lineno(1)        # Line number of the left expression
    t.lineno(2)        # line number of the PLUS operator
    t.lineno(3)        # line number of the right expression
    ...
    start,end = t.linespan(3)    # Start,end lines of the right expression
</pre>
</blockquote>

Since line numbers are managed internally by the parser, there is usually no need to modify the line
numbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
private copy.
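
<p>
For instance, a rule that builds a parse-tree node might copy the line number into the node
itself.  A minimal sketch, assuming a <tt>Node</tt> class with a <tt>lineno</tt> attribute
(both hypothetical; see the next section for one way to define such a class):

<blockquote>
<pre>
def p_expression_binop(t):
    'expression : expression PLUS expression'
    t[0] = Node("binop", [t[1],t[3]], t[2])
    t[0].lineno = t.lineno(2)    # Save our own copy of the operator's line number
</pre>
</blockquote>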

<h2>AST Construction</h2>

<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
construction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
and assign nodes to <tt>t[0]</tt> in each rule.

For example:

<blockquote>
<pre>
class Expr: pass

class BinOp(Expr):
    def __init__(self,left,op,right):
        self.type = "binop"
        self.left = left
        self.right = right
        self.op = op

class Number(Expr):
    def __init__(self,value):
        self.type = "number"
        self.value = value

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    t[0] = BinOp(t[1],t[2],t[3])

def p_expression_group(t):
    'expression : LPAREN expression RPAREN'
    t[0] = t[2]

def p_expression_number(t):
    'expression : NUMBER'
    t[0] = Number(t[1])
</pre>
</blockquote>

To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
For example:

<blockquote>
<pre>
class Node:
    def __init__(self,type,children=None,leaf=None):
         self.type = type
         if children:
              self.children = children
         else:
              self.children = [ ]
         self.leaf = leaf

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    t[0] = Node("binop", [t[1],t[3]], t[2])
</pre>
</blockquote>
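
With this generic structure, a single recursive function can walk any tree the parser builds.
A minimal sketch, assuming every rule stores a <tt>Node</tt> in <tt>t[0]</tt> (the
<tt>print_tree</tt> name is an assumption, not part of PLY):

<blockquote>
<pre>
def print_tree(node,indent=0):
    # Print a node, then recursively print its children
    print " "*indent + node.type, node.leaf
    for child in node.children:
        print_tree(child,indent+4)

# Example:  print_tree(yacc.parse(data))
</pre>
</blockquote>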

<h2>Yacc Implementation Notes</h2>

<ul>
<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
can be supplied as follows:

<blockquote>
<pre>
yacc.parse(lexer=x)
</pre>
</blockquote>
In this case, <tt>x</tt> must be a Lexer object that minimally has a <tt>x.token()</tt> method for retrieving the next
token.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.
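
<p>
For instance, a hand-written tokenizer only needs to supply those two methods.  A minimal
sketch (the <tt>MyLexer</tt> class and its internals are assumptions; the tokens it returns
must still have the <tt>type</tt>, <tt>value</tt>, and <tt>lineno</tt> attributes described earlier):

<blockquote>
<pre>
class MyLexer:
    def input(self,data):
        self.data = data          # Store the text to be tokenized
    def token(self):
        ...                       # Return the next token object, or None at end of input

yacc.parse(data,lexer=MyLexer())
</pre>
</blockquote>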

<p>
<li>By default, <tt>yacc.py</tt> generates tables in debugging mode (which produces the <tt>parser.out</tt> file and other output).
To disable this, use

<blockquote>
<pre>
yacc.yacc(debug=0)
</pre>
</blockquote>

<p>
<li>To change the name of the <tt>parsetab.py</tt> file,  use:

<blockquote>
<pre>
yacc.yacc(tabmodule="foo")
</pre>
</blockquote>

<p>
<li>To print copious amounts of debugging during parsing, use:

<blockquote>
<pre>
yacc.parse(debug=1)
</pre>
</blockquote>

<p>
<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
parsers in the same application, do this:

<blockquote>
<pre>
p = yacc.yacc()
...
p.parse()
</pre>
</blockquote>

Note: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.

<p>
<li>Since the generation of the SLR tables is relatively expensive, previously generated tables are
cached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
checksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.

<p>
It should be noted that table generation is reasonably efficient, even for grammars that involve around 100 rules
and several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
machine.  Please be patient.

<p>
<li>Since LR parsing is mostly driven by tables, the performance of the parser is largely independent of the
size of the grammar.   The biggest bottlenecks will be the lexer and the complexity of your grammar rules.
</ul>

<h2>Parser and Lexer State Management</h2>

In advanced parsing applications, you may want to have multiple
parsers and lexers.  Furthermore, the parser may want to control the
behavior of the lexer in some way.

<p>
To do this, it is important to note that both the lexer and parser are
actually implemented as objects.   These objects are returned by the
<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.  For example:

<blockquote>
<pre>
lexer  = lex.lex()       # Return lexer object
parser = yacc.yacc()     # Return parser object
</pre>
</blockquote>

Within lexer and parser rules, these objects are also available.  In the lexer,
the "lexer" attribute of a token refers to the lexer object in use.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
   r'\d+'
   ...
   print t.lexer           # Show lexer object
</pre>
</blockquote>

In the parser, the "lexer" and "parser" attributes refer to the lexer
and parser objects respectively.

<blockquote>
<pre>
def p_expr_plus(t):
   'expr : expr PLUS expr'
   ...
   print t.parser          # Show parser object
   print t.lexer           # Show lexer object
</pre>
</blockquote>

If necessary, arbitrary attributes can be attached to the lexer or parser object.
For example, if you wanted to have different parsing modes, you could attach a mode
attribute to the parser object and look at it later.
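
<p>
A minimal sketch of this idea (the <tt>mode</tt> attribute and its values are assumptions,
not a PLY feature):

<blockquote>
<pre>
def p_statement(t):
    'statement : expression SEMI'
    if t.parser.mode == "strict":    # Examine our own parser state from inside a rule
        ...
    t[0] = t[1]

parser = yacc.yacc()
parser.mode = "strict"               # Attach our own state to the parser object
</pre>
</blockquote>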

<h2>Using Python's Optimized Mode</h2>

Because PLY uses information from docstrings, parsing and lexing
information must be gathered while running the Python interpreter in
normal mode (i.e., not with the -O or -OO options).  However, if you
specify optimized mode like this:

<blockquote>
<pre>
lex.lex(optimize=1)
yacc.yacc(optimize=1)
</pre>
</blockquote>

then PLY can later be used when Python runs in optimized mode.  To make this work,
make sure you first run Python in normal mode.  Once the lexing and parsing tables
have been generated the first time, run Python in optimized mode.  PLY will use
the tables without the need for docstrings.
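
<p>
In other words, the workflow looks something like this (the script name is just a placeholder):

<blockquote>
<pre>
$ python mycompiler.py       # First run in normal mode; tables are generated
$ python -O mycompiler.py    # Subsequent runs may use optimized mode
</pre>
</blockquote>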

<p>
Beware: running PLY in optimized mode disables a lot of error
checking.  You should only do this when your project has stabilized
and you don't need to do any debugging.

<h2>Where to Go from Here?</h2>

The <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
compilers textbook for the theory and underlying implementation details of LR parsing.

</body>
</html>