<html>
<head>
<title>PLY (Python Lex-Yacc)</title>
</head>
<body bgcolor="#ffffff">

<h1>PLY (Python Lex-Yacc)</h1>

<b>
David M. Beazley <br>
Department of Computer Science <br>
University of Chicago <br>
Chicago, IL  60637 <br>
beazley@cs.uchicago.edu <br>
</b>

<p>
Documentation version: $Header: /home/stever/bk/newmem2/ext/ply/doc/ply.html 1.1 03/06/06 14:53:34-00:00 stever@ $

<h2>Introduction</h2>

PLY is a Python-only implementation of the popular compiler
construction tools lex and yacc.  The implementation borrows ideas
from a number of previous efforts, most notably John Aycock's SPARK
toolkit.  However, the overall flavor of the implementation is more
closely modeled after the C version of lex and yacc.  The other
significant feature of PLY is that it provides extensive input
validation and error reporting--much more so than other Python parsing
tools.

<p>
Early versions of PLY were developed to support the Introduction to
Compilers course at the University of Chicago.  In this course,
students built a fully functional compiler for a simple Pascal-like
language.  Their compiler, implemented entirely in Python, had to
include lexical analysis, parsing, type checking, type inference,
nested scoping, and code generation for the SPARC processor.
Approximately 30 different compiler implementations were completed in
this course.  Most of PLY's interface and operation has been motivated by common
usability problems encountered by students.

<p>
Because PLY was primarily developed as an instructional tool, you will
find it to be <em>MUCH</em> more picky about token and grammar rule
specification than most other Python parsing tools.  In part, this
added formality is meant to catch common programming mistakes made by
novice users.  However, advanced users will also find such features to
be useful when building complicated grammars for real programming
languages.  It should also be noted that PLY does not provide much in the way
of bells and whistles (e.g., automatic construction of abstract syntax trees,
tree traversal, etc.).  Instead, you will find a bare-bones, yet
fully capable lex/yacc implementation written entirely in Python.

<p>
The rest of this document assumes that you are somewhat familiar with
parsing theory, syntax directed translation, and automatic tools such
as lex and yacc.  If you are unfamiliar with these topics, you will
probably want to consult an introductory text such as "Compilers:
Principles, Techniques, and Tools", by Aho, Sethi, and Ullman.  "Lex
and Yacc" by John Levine may also be handy.

<h2>PLY Overview</h2>

PLY consists of two separate tools: <tt>lex.py</tt> and
<tt>yacc.py</tt>.  <tt>lex.py</tt> is used to break input text into a
collection of tokens specified by a set of regular expression
rules.  <tt>yacc.py</tt> is used to recognize language syntax that has
been specified in the form of a context-free grammar.  Currently,
<tt>yacc.py</tt> uses LR parsing and generates its parsing tables
using the SLR algorithm.  LALR(1) parsing may be supported in a future
release.

<p>
The two tools are meant to work together.  Specifically,
<tt>lex.py</tt> provides an external interface in the form of a
<tt>token()</tt> function that returns the next valid token on the
input stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
tokens and invoke grammar rules.
The output of <tt>yacc.py</tt> is
often an Abstract Syntax Tree (AST).  However, this is entirely up to
the user.  If desired, <tt>yacc.py</tt> can also be used to implement
simple one-pass compilers.

<p>
Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
features you expect including extensive error checking, grammar
validation, support for empty productions, error tokens, and ambiguity
resolution via precedence rules.  The primary difference between
<tt>yacc.py</tt> and <tt>yacc</tt> is the use of SLR parsing instead
of LALR(1).  Although this slightly restricts the types of grammars
that can be successfully parsed, it is sufficiently powerful to handle most
kinds of normal programming language constructs.

<p>
Finally, it is important to note that PLY relies on reflection
(introspection) to build its lexers and parsers.  Unlike traditional
lex/yacc, which require a special input file that is converted into a
separate source file, the specifications given to PLY <em>are</em>
valid Python programs.  This means that there are no extra source
files nor is there a special compiler construction step (e.g., running
yacc to generate Python code for the compiler).

<h2>Lex Example</h2>

<tt>lex.py</tt> is used to write tokenizers.  To do this, each token
must be defined by a regular expression rule.  The following file
implements a very simple lexer for tokenizing simple integer expressions:

<blockquote>
<pre>
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import lex

# List of token names.   This is always required
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Line %d: Number %s is too large!" % (t.lineno,t.value)
        t.value = 0
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.skip(1)

# Build the lexer
lex.lex()

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lex.input(data)

# Tokenize
while 1:
    tok = lex.token()
    if not tok: break      # No more input
    print tok
</pre>
</blockquote>

In the example, the <tt>tokens</tt> list defines all of the possible
token names that can be produced by the lexer.  This list is always required
and is used to perform a variety of validation checks.  Following the <tt>tokens</tt>
list, regular expressions are written for each token.  Each of these
rules is defined by making a declaration with a special prefix <tt>t_</tt> to indicate that it
defines a token.
For simple tokens, the regular expression can
be specified as a string such as this (note: Python raw strings are used since they are the
most convenient way to write regular expression strings):

<blockquote>
<pre>
t_PLUS = r'\+'
</pre>
</blockquote>

In this case, the name following the <tt>t_</tt> prefix must exactly match one of the
names supplied in <tt>tokens</tt>.  If some kind of action needs to be performed,
a token rule can be specified as a function.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t
</pre>
</blockquote>

In this case, the regular expression rule is specified in the function documentation string.
The function always takes a single argument which is an instance of
<tt>LexToken</tt>.  This object has a <tt>t.type</tt> attribute containing the token type,
a <tt>t.value</tt> attribute containing the lexeme, and a <tt>t.lineno</tt> attribute containing the current line number.
By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
function can modify the contents of the <tt>LexToken</tt> object as appropriate.  However,
when it is done, the resulting token should be returned.  If no value is returned by the action
function, the token is simply discarded and the next token read.

<p>
The rule <tt>t_newline()</tt> illustrates a regular expression rule
for a discarded token.  In this case, a rule is written to match
newlines so that proper line number tracking can be performed.
By returning no value, the function causes the newline character to be
discarded.
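<p>
As another example of a discarded token, here is a sketch of a rule (not part of
<tt>calclex.py</tt>) that skips C-style block comments.  Because such a comment may
span several lines, the rule also updates <tt>t.lineno</tt> so that line numbers
stay accurate for the tokens that follow:

<blockquote>
<pre>
# Hypothetical rule: discard /* ... */ comments, keeping line counts correct
def t_COMMENT(t):
    r'/\*(.|\n)*?\*/'
    t.lineno += t.value.count('\n')
    # No return value.  The comment text is discarded.
</pre>
</blockquote>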
<p>
The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
that should be completely ignored in the input stream.
Usually this is used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for whitespace in a manner
similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
lexing performance because it is handled as a special case and is checked in a much
more efficient manner than the normal regular expression rules.

<p>
Finally, the <tt>t_error()</tt>
function is used to handle lexing errors that occur when illegal
characters are detected.  In this case, the <tt>t.value</tt> attribute contains the
rest of the input string that has not been tokenized.  In the example, we simply print
the offending character and skip ahead one character by calling <tt>t.skip(1)</tt>.

<p>
To build the lexer, the function <tt>lex.lex()</tt> is used.  This function
uses Python reflection (or introspection) to read the regular expression rules
out of the calling context and build the lexer.  Once the lexer has been built, two functions can
be used to control the lexer.

<ul>
<li><tt>lex.input(data)</tt>.  Reset the lexer and store a new input string.
<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
None if the end of the input text has been reached.
</ul>

The code at the bottom of the example shows how the lexer is actually used.  When executed,
the following output will be produced:

<blockquote>
<pre>
$ python example.py
LexToken(NUMBER,3,2)
LexToken(PLUS,'+',2)
LexToken(NUMBER,4,2)
LexToken(TIMES,'*',2)
LexToken(NUMBER,10,2)
LexToken(PLUS,'+',3)
LexToken(MINUS,'-',3)
LexToken(NUMBER,20,3)
LexToken(TIMES,'*',3)
LexToken(NUMBER,2,3)
</pre>
</blockquote>

<h2>Lex Implementation Notes</h2>

<ul>
<li><tt>lex.py</tt> uses the <tt>re</tt> module to do its pattern matching.  When building the master regular expression,
rules are added in the following order:
<p>
<ol>
<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
<li>Tokens defined by strings are added by sorting them in order of decreasing regular expression length (longer expressions
are added first).
</ol>
<p>
Without this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you
wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first (see the
sketch after this list).  By sorting regular
expressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
the order can be explicitly controlled since rules appearing first are checked first.

<p>
<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this
rarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
such as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.

<p>
<li>
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
like this:

<blockquote>
<pre>
reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
</pre>
</blockquote>

<p>
<li>The lexer requires tokens to be defined as class instances with <tt>t.type</tt>, <tt>t.value</tt>, and <tt>t.lineno</tt>
attributes.   By default, tokens are created as instances of the <tt>LexToken</tt> class defined internally to <tt>lex.py</tt>.
If desired, you can create new kinds of tokens provided that they have the three required attributes.   However,
in practice, it is probably safer to stick with the default.

<p>
<li>The only safe attribute for assigning token properties is <tt>t.value</tt>.   In some cases, you may want to attach
a number of different properties to a token (e.g., symbol table entries for identifiers).  To do this, replace <tt>t.value</tt>
with a tuple or class instance.  For example:

<blockquote>
<pre>
def t_ID(t):
    ...
    # For identifiers, create a (lexeme, symtab) tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t
</pre>
</blockquote>

Although it works, do NOT assign additional attributes to the token object.  For example,
<blockquote>
<pre>
def t_ID(t):
    ...
    # Bad implementation of above
    t.symtab = symbol_lookup(t.value)
    ...
</pre>
</blockquote>

The reason you don't want to do this is that the <tt>yacc.py</tt>
module only provides public access to the <tt>t.value</tt> attribute of each token.
Therefore, any other attributes you assign are inaccessible (if you are familiar
with the internals of C lex/yacc, <tt>t.value</tt> is the same as <tt>yylval.tok</tt>).

<p>
<li>To track line numbers, the lexer internally maintains a line
number variable.  Each token automatically gets the value of the
current line number in the <tt>t.lineno</tt> attribute.  To modify the
current line number, simply change the <tt>t.lineno</tt> attribute
in a function rule (as previously shown for
<tt>t_newline()</tt>).  Even if the resulting token is discarded,
changes to the line number remain in effect for subsequent tokens.

<p>
<li>To support multiple scanners in the same application, the <tt>lex.lex()</tt> function
actually returns a special <tt>Lexer</tt> object.   This object has two methods,
<tt>input()</tt> and <tt>token()</tt>, that can be used to supply input and get tokens.  For example:

<blockquote>
<pre>
lexer = lex.lex()
lexer.input(sometext)
while 1:
    tok = lexer.token()
    if not tok: break
    print tok
</pre>
</blockquote>

The functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt>
and <tt>token()</tt> methods of the last lexer created by the lex module.

<p>
<li>To reduce compiler startup time and improve performance, the lexer can be built in optimized mode as follows:

<blockquote>
<pre>
lex.lex(optimize=1)
</pre>
</blockquote>

When used, most error checking and validation is disabled.   This provides a slight performance
gain while tokenizing and tends to chop a few tenths of a second off startup time.  Since it disables
error checking, this mode is not the default and is not recommended during development.  However, once
you have your compiler fully working, it is usually safe to disable the error checks.

<p>
<li>You can enable some additional debugging by building the lexer like this:

<blockquote>
<pre>
lex.lex(debug=1)
</pre>
</blockquote>

<p>
<li>To help you debug your lexer, <tt>lex.py</tt> comes with a simple main program which will either
tokenize input read from standard input or from a file.  To use it, simply put this in your lexer:

<blockquote>
<pre>
if __name__ == '__main__':
     lex.runmain()
</pre>
</blockquote>

Then, run your lexer as a main program such as <tt>python mylex.py</tt>.

<p>
<li>Since the lexer is written entirely in Python, its performance is
largely determined by that of the Python <tt>re</tt> module.  Although
the lexer has been written to be as efficient as possible, it's not
blazingly fast when used on very large input files.  Sorry.  If
performance is a concern, you might consider upgrading to the most
recent version of Python, creating a hand-written lexer, or offloading
the lexer into a C extension module.  In defense of <tt>lex.py</tt>,
its performance is not <em>that</em> bad when used on reasonably
sized input files.  For instance, lexing a 4700 line C program with
32000 input tokens takes about 20 seconds on a 200 MHz PC.  Obviously,
it will run much faster on a faster machine.

</ul>
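<p>
Here is the sketch promised in the first note above (the token names <tt>EQ</tt> and
<tt>ASSIGN</tt> are hypothetical and would need to appear in your <tt>tokens</tt> list).
Because string rules are sorted by decreasing length, the two-character pattern is
automatically tried before the one-character pattern:

<blockquote>
<pre>
t_EQ     = r'=='      # Added to the master pattern first (longer)
t_ASSIGN = r'='       # Added second, so "==" never matches as two "="

# Alternatively, as function rules, the ordering is explicit: whichever
# function appears first in the file is checked first.
</pre>
</blockquote>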
<h2>Parsing basics</h2>

<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
example, there are a few important bits of background that must be
mentioned.  First, syntax is usually specified in terms of a
context-free grammar (CFG).  For example, if you wanted to parse
simple arithmetic expressions, you might first write an unambiguous
grammar specification like this:

<blockquote>
<pre>
expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )
</pre>
</blockquote>

Next, the semantic behavior of a language is often specified using a
technique known as syntax directed translation.  In syntax directed
translation, attributes are attached to each symbol in a given grammar
rule along with an action.  Whenever a particular grammar rule is
recognized, the action describes what to do.  For example, given the
expression grammar above, you might write the specification for a
simple calculator like this (the yacc example later in this document
implements exactly this specification):

<blockquote>
<pre>
Grammar                             Action
--------------------------------    --------------------------------------------
expression0 : expression1 + term    expression0.val = expression1.val + term.val
            | expression1 - term    expression0.val = expression1.val - term.val
            | term                  expression0.val = term.val

term0       : term1 * factor        term0.val = term1.val * factor.val
            | term1 / factor        term0.val = term1.val / factor.val
            | factor                term0.val = factor.val

factor      : NUMBER                factor.val = int(NUMBER.lexval)
            | ( expression )        factor.val = expression.val
</pre>
</blockquote>

Finally, Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
bottom-up technique that tries to recognize the right-hand side of various grammar rules.
Whenever a valid right-hand side is found in the input, the appropriate action code is triggered and the
grammar symbols are replaced by the grammar symbol on the left-hand side.

<p>
LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
input token for patterns.
The details of the algorithm can be found in a compiler text, but the
following example illustrates the steps that are performed if you wanted to parse the expression
<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:

<blockquote>
<pre>
Step Symbol Stack                     Input Tokens             Action
---- ------------------------------   ---------------------    -------------------------------
1    $                                3 + 5 * ( 10 - 20 )$     Shift 3
2    $ 3                              + 5 * ( 10 - 20 )$       Reduce factor : NUMBER
3    $ factor                         + 5 * ( 10 - 20 )$       Reduce term : factor
4    $ term                           + 5 * ( 10 - 20 )$       Reduce expr : term
5    $ expr                           + 5 * ( 10 - 20 )$       Shift +
6    $ expr +                         5 * ( 10 - 20 )$         Shift 5
7    $ expr + 5                       * ( 10 - 20 )$           Reduce factor : NUMBER
8    $ expr + factor                  * ( 10 - 20 )$           Reduce term : factor
9    $ expr + term                    * ( 10 - 20 )$           Shift *
10   $ expr + term *                  ( 10 - 20 )$             Shift (
11   $ expr + term * (                10 - 20 )$               Shift 10
12   $ expr + term * ( 10             - 20 )$                  Reduce factor : NUMBER
13   $ expr + term * ( factor         - 20 )$                  Reduce term : factor
14   $ expr + term * ( term           - 20 )$                  Reduce expr : term
15   $ expr + term * ( expr           - 20 )$                  Shift -
16   $ expr + term * ( expr -         20 )$                    Shift 20
17   $ expr + term * ( expr - 20      )$                       Reduce factor : NUMBER
18   $ expr + term * ( expr - factor  )$                       Reduce term : factor
19   $ expr + term * ( expr - term    )$                       Reduce expr : expr - term
20   $ expr + term * ( expr           )$                       Shift )
21   $ expr + term * ( expr )         $                        Reduce factor : (expr)
22   $ expr + term * factor           $                        Reduce term : term * factor
23   $ expr + term                    $                        Reduce expr : expr + term
24   $ expr                           $                        Reduce expr
25   $                                $                        Success!
</pre>
</blockquote>

When parsing the expression, an underlying state machine and the current input token determine what to do next.
If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
onto the stack.  If the top of the stack contains a valid right-hand side of a grammar rule, it is
usually "reduced" and the symbols replaced with the symbol on the left-hand side.  When this reduction occurs, the
appropriate action is triggered (if defined).  If the input token can't be shifted and the top of the stack doesn't match
any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).

<p>
It is important to note that the underlying implementation is actually built around a large finite-state machine
and some tables.   The construction of these tables is quite complicated and beyond the scope of this discussion.
However, subtle details of this process explain why, in the example above, the parser chooses to shift a token
onto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.

<h2>Yacc example</h2>

Suppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
how you would do it with <tt>yacc.py</tt>:

<blockquote>
<pre>
# Yacc example

import yacc

# Get the token map from the lexer.  This is required.
from calclex import tokens

def p_expression_plus(t):
    'expression : expression PLUS term'
    t[0] = t[1] + t[3]

def p_expression_minus(t):
    'expression : expression MINUS term'
    t[0] = t[1] - t[3]

def p_expression_term(t):
    'expression : term'
    t[0] = t[1]

def p_term_times(t):
    'term : term TIMES factor'
    t[0] = t[1] * t[3]

def p_term_div(t):
    'term : term DIVIDE factor'
    t[0] = t[1] / t[3]

def p_term_factor(t):
    'term : factor'
    t[0] = t[1]

def p_factor_num(t):
    'factor : NUMBER'
    t[0] = t[1]

def p_factor_expr(t):
    'factor : LPAREN expression RPAREN'
    t[0] = t[2]

# Error rule for syntax errors
def p_error(t):
    print "Syntax error in input!"

# Build the parser
yacc.yacc()

while 1:
    try:
        s = raw_input('calc > ')
    except EOFError:
        break
    if not s: continue
    result = yacc.parse(s)
    print result
</pre>
</blockquote>

In this example, each grammar rule is defined by a Python function where the docstring to that function contains the
appropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit).  Each function accepts a single
argument <tt>t</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
<tt>t[i]</tt> are mapped to grammar symbols as shown here:

<blockquote>
<pre>
def p_expression_plus(t):
    'expression : expression PLUS term'
    #     ^            ^        ^    ^
    #    t[0]         t[1]     t[2] t[3]

    t[0] = t[1] + t[3]
</pre>
</blockquote>

For tokens, the "value" in the corresponding <tt>t[i]</tt> is the
<em>same</em> as the value of the <tt>t.value</tt> attribute assigned
in the lexer module.  For non-terminals, the value is determined by
whatever is placed in <tt>t[0]</tt> when rules are reduced.  This
value can be anything at all.  However, it is probably most common for
the value to be a simple Python type, a tuple, or an instance.  In this example, we
are relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
field.  All of the other rules simply perform various types of integer operations and store
the result.

<p>
The first rule defined in the yacc specification determines the starting grammar
symbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
the starting rule is reduced by the parser and no more input is available, parsing
stops and the final value is returned (this value will be whatever the top-most rule
placed in <tt>t[0]</tt>).

<p>The <tt>p_error(t)</tt> rule is defined to catch syntax errors.  See the error handling section
below for more detail.

<p>
To build the parser, call the <tt>yacc.yacc()</tt> function.  This function
looks at the module and attempts to construct all of the LR parsing tables for the grammar
you have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
such as this:

<blockquote>
<pre>
$ python calcparse.py
yacc: Generating SLR parsing table...
calc >
</pre>
</blockquote>

Since table construction is relatively expensive (especially for large
grammars), the resulting parsing table is written to the current
directory in a file called <tt>parsetab.py</tt>.  In addition, a
debugging file called <tt>parser.out</tt> is created.
On subsequent
executions, <tt>yacc</tt> will reload the table from
<tt>parsetab.py</tt> unless it has detected a change in the underlying
grammar (in which case the tables and <tt>parsetab.py</tt> file are
regenerated).

<p>
If any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
diagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:

<ul>
<li>Duplicated function names (if more than one grammar rule function has the same name in the grammar file).
<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
<li>Badly specified grammar rules.
<li>Infinite recursion (rules that can never terminate).
<li>Unused rules and tokens.
<li>Undefined rules and tokens.
</ul>

The next few sections discuss a few finer points of grammar construction.

<h2>Combining Grammar Rule Functions</h2>

When grammar rules are similar, they can be combined into a single function.
For example, consider the two rules in our earlier example:

<blockquote>
<pre>
def p_expression_plus(t):
    'expression : expression PLUS term'
    t[0] = t[1] + t[3]

def p_expression_minus(t):
    'expression : expression MINUS term'
    t[0] = t[1] - t[3]
</pre>
</blockquote>

Instead of writing two functions, you might write a single function like this:

<blockquote>
<pre>
def p_expression(t):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if t[2] == '+':
        t[0] = t[1] + t[3]
    elif t[2] == '-':
        t[0] = t[1] - t[3]
</pre>
</blockquote>

In general, the doc string for any given function can contain multiple grammar rules.  So, it would
have also been legal (although possibly confusing) to write this:

<blockquote>
<pre>
def p_binary_operators(t):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if t[2] == '+':
        t[0] = t[1] + t[3]
    elif t[2] == '-':
        t[0] = t[1] - t[3]
    elif t[2] == '*':
        t[0] = t[1] * t[3]
    elif t[2] == '/':
        t[0] = t[1] / t[3]
</pre>
</blockquote>

When combining grammar rules into a single function, it is usually a good idea for all of the rules to have
a similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more
complicated than necessary.

<h2>Empty Productions</h2>

<tt>yacc.py</tt> can handle empty productions by defining a rule like this:

<blockquote>
<pre>
def p_empty(t):
    'empty :'
    pass
</pre>
</blockquote>

Now to use the empty production, simply use 'empty' as a symbol.  For example:

<blockquote>
<pre>
def p_optitem(t):
    '''optitem : item
               | empty'''
    ...
</pre>
</blockquote>

<h2>Dealing With Ambiguous Grammars</h2>

The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity.
However, in many situations, it is extremely difficult or awkward to write grammars in this format.
A
much more natural way to express the grammar is in a more compact form like this:

<blockquote>
<pre>
expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

Unfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
For example, does this expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?

<p>
When an ambiguous grammar is given to <tt>yacc.py</tt>, it will print messages about "shift/reduce conflicts"
or "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
whether to reduce a rule or shift a symbol on the parsing stack.   For example, consider
the string "3 * 4 + 5" and the internal parsing stack:

<blockquote>
<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                      3 * 4 + 5$              Shift 3
2    $ 3                    * 4 + 5$                Reduce expression : NUMBER
3    $ expr                 * 4 + 5$                Shift *
4    $ expr *               4 + 5$                  Shift 4
5    $ expr * 4             + 5$                    Reduce expression : NUMBER
6    $ expr * expr          + 5$                    SHIFT/REDUCE CONFLICT ????
</pre>
</blockquote>

In this case, when the parser reaches step 6, it has two options.  One is to reduce the
rule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
token <tt>+</tt> on the stack.  Both options are perfectly legal from the rules
of the context-free grammar.

<p>
By default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
example, the parser will always shift the <tt>+</tt> instead of reducing.  Although this
strategy works in many cases (including the ambiguous if-then-else, sketched below), it is not enough for arithmetic
expressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong--we should have
reduced <tt>expr * expr</tt> since multiplication has higher precedence than addition.
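<p>
As an aside, here is a sketch (hypothetical <tt>IF</tt>/<tt>ELSE</tt> tokens, not part of the
calculator grammar) of the classic if-then-else case.  Given input such as
<tt>if (a) if (b) s1 else s2</tt>, the parser can either reduce the inner statement or shift
the <tt>ELSE</tt>.  The default preference for shifting attaches the else to the nearest if,
which is the conventional interpretation:

<blockquote>
<pre>
statement : IF LPAREN expression RPAREN statement
          | IF LPAREN expression RPAREN statement ELSE statement
</pre>
</blockquote>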
<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
tokens to be assigned a precedence level and associativity.  This is done by adding a variable
<tt>precedence</tt> to the grammar file like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
</pre>
</blockquote>

This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
the same precedence level and are left-associative and that
<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
Furthermore, the declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
precedence specification).

<p>
The precedence specification is used to attach a numerical precedence value and associativity direction
to each grammar rule.  This is always determined by the precedence of the right-most terminal symbol.  Therefore,
if PLUS/MINUS had a precedence of 1 and TIMES/DIVIDE had a precedence of 2, the grammar rules
would have precedence values as follows:

<blockquote>
<pre>
expression : expression PLUS expression                 # prec = 1, left
           | expression MINUS expression                # prec = 1, left
           | expression TIMES expression                # prec = 2, left
           | expression DIVIDE expression               # prec = 2, left
           | LPAREN expression RPAREN                   # prec = unknown
           | NUMBER                                     # prec = unknown
</pre>
</blockquote>

When shift/reduce conflicts are encountered, the parser generator resolves the conflict by
looking at the precedence rules and associativity specifiers.

<p>
<ol>
<li>If the current token has higher precedence, it is shifted.
<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
<li>If the current token and the grammar rule have the same precedence, the
rule is reduced for left associativity, whereas the token is shifted for right associativity.
<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
favor of shifting (the default).
</ol>

For example, in "3 * 4 + 5" the rule <tt>expression TIMES expression</tt> (precedence 2) is
reduced rather than shifting the lower-precedence <tt>+</tt> token, whereas in "3 + 4 * 5" the
higher-precedence <tt>*</tt> token is shifted.

<p>
When shift/reduce conflicts are resolved using the first three techniques (with the help of
precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar.

<p>
One problem with the precedence specifier technique is that it is sometimes necessary to
change the precedence of an operator in certain contexts.  For example, consider a unary-minus operator
in "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
However, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
precedence rules can be given for fictitious tokens like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

Now, in the grammar file, we can write our unary minus rule like this:

<blockquote>
<pre>
def p_expr_uminus(t):
    'expression : MINUS expression %prec UMINUS'
    t[0] = -t[2]
</pre>
</blockquote>

In this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
of UMINUS in the precedence specifier.

<p>
It is also possible to specify non-associativity in the <tt>precedence</tt> table.  This would
be used when you <em>don't</em> want operations to chain together.  For example, suppose
you wanted to support comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
combinations like <tt>a &lt; b &lt; c</tt>.  To do this, simply specify a rule like this:

<blockquote>
<pre>
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

<p>
Reduce/reduce conflicts are caused when there are multiple grammar
rules that can be applied to a given set of symbols.  This kind of
conflict is almost always bad and is always resolved by picking the
rule that appears first in the grammar file.   Reduce/reduce conflicts
are almost always caused when different sets of grammar rules somehow
generate the same set of symbols.
For example:

<blockquote>
<pre>
assignment :  ID EQUALS NUMBER
           |  ID EQUALS expression

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

In this case, a reduce/reduce conflict exists between these two rules:

<blockquote>
<pre>
assignment  : ID EQUALS NUMBER
expression  : NUMBER
</pre>
</blockquote>

For example, if you wrote "a = 5", the parser can't figure out if this
is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
whether it's supposed to reduce the 5 as an expression and then reduce
the rule <tt>assignment : ID EQUALS expression</tt>.
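<p>
One possible fix (a sketch, not the only option) is to remove the redundant first rule and
let numbers reach <tt>assignment</tt> only through <tt>expression</tt>:

<blockquote>
<pre>
assignment : ID EQUALS expression

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>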
<h2>The parser.out file</h2>

Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
parsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
'parser.out' when it generates the parsing table.  The contents of this file look like the following:

<blockquote>
<pre>
Unused terminals:


Grammar

Rule 1     expression -> expression PLUS expression
Rule 2     expression -> expression MINUS expression
Rule 3     expression -> expression TIMES expression
Rule 4     expression -> expression DIVIDE expression
Rule 5     expression -> NUMBER
Rule 6     expression -> LPAREN expression RPAREN

Terminals, with rules where they appear

TIMES                : 3
error                :
MINUS                : 2
RPAREN               : 6
LPAREN               : 6
DIVIDE               : 4
PLUS                 : 1
NUMBER               : 5

Nonterminals, with rules where they appear

expression           : 1 1 2 2 3 3 4 4 6 0


Parsing method: SLR


state 0

    S' -> . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 1

    S' -> expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 2

    expression -> LPAREN . expression RPAREN
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 3

    expression -> NUMBER .

    $               reduce using rule 5
    PLUS            reduce using rule 5
    MINUS           reduce using rule 5
    TIMES           reduce using rule 5
    DIVIDE          reduce using rule 5
    RPAREN          reduce using rule 5


state 4

    expression -> expression TIMES . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 5

    expression -> expression MINUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 6

    expression -> expression PLUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 7

    expression -> expression DIVIDE . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 8

    expression -> LPAREN expression . RPAREN
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    RPAREN          shift and go to state 13
    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 9

    expression -> expression TIMES expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 3
    PLUS            reduce using rule 3
    MINUS           reduce using rule 3
    TIMES           reduce using rule 3
    DIVIDE          reduce using rule 3
    RPAREN          reduce using rule 3

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 10

    expression -> expression MINUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 2
    PLUS            reduce using rule 2
    MINUS           reduce using rule 2
    RPAREN          reduce using rule 2
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 11

    expression -> expression PLUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 1
    PLUS            reduce using rule 1
    MINUS           reduce using rule 1
    RPAREN          reduce using rule 1
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 1 ]
  ! DIVIDE          [ reduce using rule 1 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 12

    expression -> expression DIVIDE expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 4
    PLUS            reduce using rule 4
    MINUS           reduce using rule 4
    TIMES           reduce using rule 4
    DIVIDE          reduce using rule 4
    RPAREN          reduce using rule 4

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 13

    expression -> LPAREN expression RPAREN .

    $               reduce using rule 6
    PLUS            reduce using rule 6
    MINUS           reduce using rule 6
    TIMES           reduce using rule 6
    DIVIDE          reduce using rule 6
    RPAREN          reduce using rule 6
</pre>
</blockquote>
In the file, each state of the grammar is described.  Within each state, the "." indicates the current
location of the parse within any applicable grammar rules.  In addition, the actions for each valid
input token are listed.   When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
are prefixed with an !.  For example:

<blockquote>
<pre>
  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
</pre>
</blockquote>

By looking at these rules (and with a little practice), you can usually track down the source
of most parsing conflicts.  It should also be stressed that not all shift/reduce conflicts are
bad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.

<h2>Syntax Error Handling</h2>

When a syntax error occurs during parsing, the error is immediately
detected (i.e., the parser does not read any more tokens beyond the
source of the error).  Error recovery in LR parsers is a delicate
topic that involves ancient rituals and black magic.  The recovery mechanism
provided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want to
consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.

<p>
When a syntax error occurs, <tt>yacc.py</tt> performs the following steps:

<ol>
<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
is called with the offending token as an argument.  Afterwards, the parser enters
an "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
has successfully shifted at least 3 tokens onto the parsing stack.

<p>
<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
with a special <tt>error</tt> token.

<p>
<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
deleted.

<p>
<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
parsing from its initial state.

<p>
<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
shifted onto the parsing stack.
<p>
<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
parser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
</ol>

<h4>Recovery and resynchronization with error rules</h4>

The most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
token.  For example, suppose your language had a grammar rule for a print statement like this:

<blockquote>
<pre>
def p_statement_print(t):
    'statement : PRINT expr SEMI'
    ...
</pre>
</blockquote>

To account for the possibility of a bad expression, you might write an additional grammar rule like this:

<blockquote>
<pre>
def p_statement_print_error(t):
    'statement : PRINT error SEMI'
    print "Syntax error in print statement. Bad expression"

</pre>
</blockquote>

In this case, the <tt>error</tt> token will match any sequence of
tokens that might appear up to the first semicolon that is
encountered.  Once the semicolon is reached, the rule will be
invoked and the <tt>error</tt> token will go away.

<p>
This type of recovery is sometimes known as parser resynchronization.
The <tt>error</tt> token acts as a wildcard for any bad input text and
the token immediately following <tt>error</tt> acts as a
synchronization token.

<p>
It is important to note that the <tt>error</tt> token usually does not appear as the last token
on the right in an error rule.  For example:

<blockquote>
<pre>
def p_statement_print_error(t):
    'statement : PRINT error'
    print "Syntax error in print statement. Bad expression"
</pre>
</blockquote>

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow.

<h4>Panic mode recovery</h4>

An alternative error recovery scheme is to enter a panic mode recovery in which tokens are
discarded to a point where the parser might be able to recover in some sensible manner.

<p>
Panic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
function starts discarding tokens until it reaches a closing '}'.  Then, it restarts the
parser in its initial state.

<blockquote>
<pre>
def p_error(t):
    print "Whoa. You are seriously hosed."
    # Read ahead looking for a closing '}'
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'RBRACE': break
    yacc.restart()
</pre>
</blockquote>

<p>
This next function simply discards the bad token and tells the parser that the error was okay.

<blockquote>
<pre>
def p_error(t):
    print "Syntax error at token", t.type
    # Just discard the token and tell the parser it's okay.
    yacc.errok()
</pre>
</blockquote>

<p>
Within the <tt>p_error()</tt> function, three functions are available to control the behavior
of the parser:
<p>
<ul>
<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
mode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
error counters so that the next syntax error will call <tt>p_error()</tt> again.

<p>
<li><tt>yacc.token()</tt>.
This returns the next token on the input stream.

<p>
<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
to its initial state.
</ul>

Note: these functions are only available when invoking <tt>p_error()</tt> and are not available
at any other time.

<p>
To supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
useful if trying to synchronize on special characters.  For example:

<blockquote>
<pre>
def p_error(t):
    # Read ahead looking for a terminating ";"
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'SEMI': break
    yacc.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok
</pre>
</blockquote>

<h4>General comments on error handling</h4>

For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable
technique.  This is because you can instrument the grammar to catch errors at selected places where it is relatively easy
to recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
to discard huge portions of the input text to find a valid restart point.

<h2>Line Number Tracking</h2>

<tt>yacc.py</tt> automatically tracks line numbers for all of the grammar symbols and tokens it processes.  To retrieve the line
numbers, two functions are used in grammar rules:

<ul>
<li><tt>t.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>.
<li><tt>t.linespan(num)</tt>.  Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
</ul>

For example:

<blockquote>
<pre>
def p_expression(t):
    'expression : expression PLUS expression'
    t.lineno(1)        # Line number of the left expression
    t.lineno(2)        # line number of the PLUS operator
    t.lineno(3)        # line number of the right expression
    ...
    start,end = t.linespan(3)    # Start,end lines of the right expression

</pre>
</blockquote>

Since line numbers are managed internally by the parser, there is usually no need to modify the line
numbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
private copy.
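<p>
For instance, here is a sketch (hypothetical rule and node layout) of making such a private copy
by storing the line number of the operator token in the node placed in <tt>t[0]</tt>:

<blockquote>
<pre>
def p_expression_plus(t):
    'expression : expression PLUS expression'
    # Keep the PLUS token's line number with the node itself
    t[0] = ('binop', t[1], t[3], t.lineno(2))
</pre>
</blockquote>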
<h2>AST Construction</h2>

<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
construction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
and assign nodes to <tt>t[0]</tt> in each rule.

For example:

<blockquote>
<pre>
class Expr: pass

class BinOp(Expr):
    def __init__(self,left,op,right):
        self.type = "binop"
        self.left = left
        self.right = right
        self.op = op

class Number(Expr):
    def __init__(self,value):
        self.type = "number"
        self.value = value

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    t[0] = BinOp(t[1],t[2],t[3])

def p_expression_group(t):
    'expression : LPAREN expression RPAREN'
    t[0] = t[2]

def p_expression_number(t):
    'expression : NUMBER'
    t[0] = Number(t[1])
</pre>
</blockquote>

To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
For example:

<blockquote>
<pre>
class Node:
    def __init__(self,type,children=None,leaf=None):
         self.type = type
         if children:
              self.children = children
         else:
              self.children = [ ]
         self.leaf = leaf

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    t[0] = Node("binop", [t[1],t[3]], t[2])
</pre>
</blockquote>

<h2>Yacc implementation notes</h2>

<ul>
<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
can be supplied as follows (see the sketch after this list):

<blockquote>
<pre>
yacc.parse(lexer=x)
</pre>
</blockquote>
In this case, <tt>x</tt> must be a Lexer object that minimally has an <tt>x.token()</tt> method for retrieving the next
token.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.

<p>
<li>By default, yacc generates tables in debugging mode (which produces the parser.out file and other output).
To disable this, use

<blockquote>
<pre>
yacc.yacc(debug=0)
</pre>
</blockquote>

<p>
<li>To change the name of the <tt>parsetab.py</tt> file, use:

<blockquote>
<pre>
yacc.yacc(tabmodule="foo")
</pre>
</blockquote>

<p>
<li>To print copious amounts of debugging during parsing, use:

<blockquote>
<pre>
yacc.parse(debug=1)
</pre>
</blockquote>

<p>
<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
parsers in the same application, do this:

<blockquote>
<pre>
p = yacc.yacc()
...
p.parse()
</pre>
</blockquote>

Note:  The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.

<p>
<li>Since the generation of the SLR tables is relatively expensive, previously generated tables are
cached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
checksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.

<p>
It should be noted that table generation is reasonably efficient, even for grammars that involve around 100 rules
and several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
machine.  Please be patient.

<p>
<li>Since LR parsing is mostly driven by tables, the performance of the parser is largely independent of the
size of the grammar.   The biggest bottlenecks will be the lexer and the complexity of your grammar rules.
</ul>
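<p>
Here is the sketch promised in the first note above: a minimal hand-written tokenizer
object that <tt>yacc.py</tt> could drive in place of <tt>lex.py</tt>.  The class and token
names are hypothetical; the only real requirements are the <tt>token()</tt> and
<tt>input()</tt> methods and the three token attributes:

<blockquote>
<pre>
class Token:
    # Any object with type, value, and lineno attributes will do
    def __init__(self, type, value, lineno):
        self.type = type
        self.value = value
        self.lineno = lineno

class NumberLexer:
    # Splits whitespace-separated integers into NUMBER tokens
    def input(self, data):
        self.toks = [Token('NUMBER', int(w), 1) for w in data.split()]
    def token(self):
        if self.toks:
            return self.toks.pop(0)
        return None              # No more input

# result = yacc.parse(data, lexer=NumberLexer())
</pre>
</blockquote>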
<h2>Parser and Lexer State Management</h2>

In advanced parsing applications, you may want to have multiple
parsers and lexers.  Furthermore, the parser may want to control the
behavior of the lexer in some way.

<p>
To do this, it is important to note that both the lexer and parser are
actually implemented as objects.   These objects are returned by the
<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.  For example:

<blockquote>
<pre>
lexer  = lex.lex()       # Return lexer object
parser = yacc.yacc()     # Return parser object
</pre>
</blockquote>

Within lexer and parser rules, these objects are also available.  In the lexer,
the "lexer" attribute of a token refers to the lexer object in use.  For example:

<blockquote>
<pre>
def t_NUMBER(t):
   r'\d+'
   ...
   print t.lexer           # Show lexer object
</pre>
</blockquote>

In the parser, the "lexer" and "parser" attributes refer to the lexer
and parser objects respectively.

<blockquote>
<pre>
def p_expr_plus(t):
   'expr : expr PLUS expr'
   ...
   print t.parser          # Show parser object
   print t.lexer           # Show lexer object
</pre>
</blockquote>

If necessary, arbitrary attributes can be attached to the lexer or parser object.
For example, if you wanted to have different parsing modes, you could attach a mode
attribute to the parser object and look at it later.

<h2>Using Python's Optimized Mode</h2>

Because PLY uses information from doc-strings, parsing and lexing
information must be gathered while running the Python interpreter in
normal mode (i.e., not with the -O or -OO options).   However, if you
specify optimized mode like this:

<blockquote>
<pre>
lex.lex(optimize=1)
yacc.yacc(optimize=1)
</pre>
</blockquote>

then PLY can later be used when Python runs in optimized mode.  To make this work,
make sure you first run Python in normal mode.   Once the lexing and parsing tables
have been generated the first time, run Python in optimized mode.  PLY will use
the tables without the need for doc strings.

<p>
Beware: running PLY in optimized mode disables a lot of error
checking.  You should only do this when your project has stabilized
and you don't need to do any debugging.

<h2>Where to go from here?</h2>

The <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
compilers textbook for the theory and underlying implementation details of LR parsing.

</body>
</html>