CHANGES
Version 2.3
-----------------------------
02/20/07: beazley
          Fixed a bug with character literals if the literal '.' appeared as the
          last symbol of a grammar rule.  Reported by Ales Smrcka.

02/19/07: beazley
          Warning messages are now redirected to stderr instead of being printed
          to standard output.

02/19/07: beazley
          Added a warning message to lex.py if it detects a literal backslash
          character inside the t_ignore declaration.  This is to help detect
          problems that might occur if someone accidentally defines t_ignore
          as a Python raw string.  For example:

              t_ignore = r' \t'

          The idea for this is from an email I received from David Cimimi who
          reported bizarre behavior in lexing as a result of defining t_ignore
          as a raw string by accident.

02/18/07: beazley
          Performance improvements.  Made some changes to the internal
          table organization and LR parser to improve parsing performance.

02/18/07: beazley
          Automatic tracking of line number and position information must now be
          enabled by a special flag to parse().  For example:

              yacc.parse(data,tracking=True)

          In many applications, it's just not that important to have the
          parser automatically track all line numbers.  By making this an
          optional feature, it allows the parser to run significantly faster
          (more than a 20% speed increase in many cases).  Note: positional
          information is always available for raw tokens---this change only
          applies to positional information associated with nonterminal
          grammar symbols.
          *** POTENTIAL INCOMPATIBILITY ***

02/18/07: beazley
          Yacc no longer supports extended slices of grammar productions.
          However, it does support regular slices.  For example:

              def p_foo(p):
                  '''foo : a b c d e'''
                  p[0] = p[1:3]

          This change is a performance improvement to the parser--it streamlines
          normal access to the grammar values since slices are now handled in
          a __getslice__() method as opposed to __getitem__().

02/12/07: beazley
          Fixed a bug in the handling of token names when combined with
          start conditions.  Bug reported by Todd O'Bryan.

Version 2.2
------------------------------
11/01/06: beazley
          Added lexpos() and lexspan() methods to grammar symbols.  These
          mirror the functionality of lineno() and linespan().  For
          example:

              def p_expr(p):
                  'expr : expr PLUS expr'
                  p.lexpos(1)                 # Lexing position of left-hand expression
                  p.lexpos(2)                 # Lexing position of PLUS
                  start,end = p.lexspan(3)    # Lexing range of right-hand expression

11/01/06: beazley
          Minor change to error handling.  The recommended way to skip characters
          in the input is to use t.lexer.skip() as shown here:

              def t_error(t):
                  print "Illegal character '%s'" % t.value[0]
                  t.lexer.skip(1)

          The old approach of just using t.skip(1) will still work, but won't
          be documented.

10/31/06: beazley
          Discarded tokens can now be specified as simple strings instead of
          functions.  To do this, simply include the text "ignore_" in the
          token declaration.  For example:

              t_ignore_cppcomment = r'//.*'

          Previously, this had to be done with a function.  For example:

              def t_ignore_cppcomment(t):
                  r'//.*'
                  pass

          If start conditions/states are being used, state names should appear
          before the "ignore_" text.
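          As a complete sketch, a string-based ignore rule might be used like
          this (the NUMBER token and the sample input are made up for
          illustration):

              import ply.lex as lex

              tokens = ('NUMBER',)

              t_NUMBER = r'\d+'
              t_ignore = ' \t\n'               # whitespace (line numbers not tracked here)
              t_ignore_cppcomment = r'//.*'    # discarded because of the "ignore_" prefix

              def t_error(t):
                  print "Illegal character '%s'" % t.value[0]
                  t.lexer.skip(1)

              lexer = lex.lex()
              lexer.input("12 // a comment\n34")
              while True:
                  tok = lexer.token()
                  if not tok:
                      break
                  print tok                    # only the two NUMBER tokens appear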
10/19/06: beazley
          The Lex module now provides support for flex-style start conditions
          as described at http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html.
          Please refer to this document to understand this change note.  Refer to
          the PLY documentation for a PLY-specific explanation of how this works.

          To use start conditions, you first need to declare a set of states in
          your lexer file:

              states = (
                  ('foo','exclusive'),
                  ('bar','inclusive')
              )

          This serves the same role as the %s and %x specifiers in flex.

          Once a state has been declared, tokens for that state can be
          declared by defining rules of the form t_state_TOK.  For example:

              t_PLUS = r'\+'          # Rule defined in INITIAL state
              t_foo_NUM = r'\d+'      # Rule defined in foo state
              t_bar_NUM = r'\d+'      # Rule defined in bar state

              t_foo_bar_NUM = r'\d+'  # Rule defined in both foo and bar
              t_ANY_NUM = r'\d+'      # Rule defined in all states

          In addition to defining tokens for each state, the t_ignore and t_error
          specifications can be customized for specific states.  For example:

              t_foo_ignore = " "      # Ignored characters for foo state

              def t_bar_error(t):
                  # Handle errors in bar state
                  t.lexer.skip(1)

          Within token rules, the following methods can be used to change states:

              def t_TOKNAME(t):
                  t.lexer.begin('foo')        # Begin state 'foo'
                  t.lexer.push_state('foo')   # Begin state 'foo', push old state
                                              # onto a stack
                  t.lexer.pop_state()         # Restore previous state
                  t.lexer.current_state()     # Returns name of current state

          These methods mirror the BEGIN(), yy_push_state(), yy_pop_state(), and
          yy_top_state() functions in flex.

          Start states offer one way to write sub-lexers.  For example, the
          parser might instruct the lexer to start generating a different set
          of tokens depending on the context.

          example/yply/ylex.py shows the use of start states to grab C/C++
          code fragments out of traditional yacc specification files.

          *** NEW FEATURE *** Suggested by Daniel Larraz with whom I also
          discussed various aspects of the design.
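          For instance, here is a minimal self-contained sketch that uses an
          exclusive state to discard C-style comments.  The state and token
          names (comment, OPEN, CLOSE, BODY) are made up for illustration:

              import ply.lex as lex

              tokens = ('NUMBER', 'OPEN', 'CLOSE', 'BODY')

              states = (('comment','exclusive'),)

              t_NUMBER = r'\d+'
              t_ignore = ' \t'

              def t_OPEN(t):
                  r'/\*'
                  t.lexer.begin('comment')    # enter the exclusive 'comment' state
                                              # (no return value, so the token is discarded)

              def t_comment_CLOSE(t):
                  r'\*/'
                  t.lexer.begin('INITIAL')    # back to the default state

              def t_comment_BODY(t):
                  r'[^*]+|\*'                 # swallow comment text (CLOSE is tried first)
                  pass

              t_comment_ignore = ''           # nothing is implicitly ignored in 'comment'

              def t_comment_error(t):
                  t.lexer.skip(1)

              def t_error(t):
                  print "Illegal character '%s'" % t.value[0]
                  t.lexer.skip(1)

              lexer = lex.lex()
              lexer.input("1 /* a * comment */ 2")
              while True:
                  tok = lexer.token()
                  if not tok:
                      break
                  print tok                   # only the two NUMBER tokens appear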
10/19/06: beazley
          Minor change to the way in which yacc.py was reporting shift/reduce
          conflicts.  Although the underlying LALR(1) algorithm was correct,
          PLY was under-reporting the number of conflicts compared to yacc/bison
          when precedence rules were in effect.  This change should make PLY
          report the same number of conflicts as yacc.

10/19/06: beazley
          Modified yacc so that grammar rules could also include the '-'
          character.  For example:

              def p_expr_list(p):
                  'expression-list : expression-list expression'

          Suggested by Oldrich Jedlicka.

10/18/06: beazley
          Attribute lexer.lexmatch added so that token rules can access the re
          match object that was generated.  For example:

              def t_FOO(t):
                  r'some regex'
                  m = t.lexer.lexmatch
                  # Do something with m

          This may be useful if you want to access named groups specified within
          the regex for a specific token.  Suggested by Oldrich Jedlicka.

10/16/06: beazley
          Changed the error message that results if an illegal character
          is encountered and no default error function is defined in lex.
          The exception is now more informative about the actual cause of
          the error.

Version 2.1
------------------------------
10/02/06: beazley
          The last Lexer object built by lex() can be found in lex.lexer.
          The last Parser object built by yacc() can be found in yacc.parser.

10/02/06: beazley
          New example added: examples/yply

          This example uses PLY to convert Unix-yacc specification files to
          PLY programs with the same grammar.  This may be useful if you
          want to convert a grammar from bison/yacc to use with PLY.

10/02/06: beazley
          Added support for a start symbol to be specified in the yacc
          input file itself.  Just do this:

              start = 'name'

          where 'name' matches some grammar rule.  For example:

              def p_name(p):
                  'name : A B C'
                  ...

          This mirrors the functionality of the yacc %start specifier.

09/30/06: beazley
          Some new examples added:

          examples/GardenSnake : A simple indentation-based language similar
                                 to Python.  Shows how you might handle
                                 whitespace.  Contributed by Andrew Dalke.

          examples/BASIC       : An implementation of 1964 Dartmouth BASIC.
                                 Contributed by Dave against his better
                                 judgement.

09/28/06: beazley
          Minor patch to allow named groups to be used in lex regular
          expression rules.  For example:

              t_QSTRING = r'''(?P<quote>['"]).*?(?P=quote)'''

          Patch submitted by Adam Ring.

09/28/06: beazley
          LALR(1) is now the default parsing method.  To use SLR, use
          yacc.yacc(method="SLR").  Note: there is no performance impact
          on parsing when using LALR(1) instead of SLR.  However, constructing
          the parsing tables will take a little longer.

09/26/06: beazley
          Change to line number tracking.  To modify line numbers, modify
          the line number of the lexer itself.  For example:

              def t_NEWLINE(t):
                  r'\n'
                  t.lexer.lineno += 1

          This modification is both a cleanup and a performance optimization.
          In past versions, lex was monitoring every token for changes in
          the line number.  This extra processing is unnecessary for the vast
          majority of tokens.  Thus, this new approach cleans it up a bit.

          *** POTENTIAL INCOMPATIBILITY ***
          You will need to change code in your lexer that updates the line
          number.  For example, "t.lineno += 1" becomes "t.lexer.lineno += 1".

09/26/06: beazley
          Added the lexing position to tokens as an attribute lexpos.  This
          is the raw index into the input text at which a token appears.
          This information can be used to compute column numbers and other
          details (e.g., scan backwards from lexpos to the first newline
          to get a column position).
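          For instance, a small helper along the following lines (the name
          find_column is illustrative, not part of PLY itself) converts a
          token's lexpos into a column number:

              def find_column(input, token):
                  # rfind() returns -1 when there is no preceding newline,
                  # which makes the arithmetic come out right on line 1.
                  last_cr = input.rfind('\n', 0, token.lexpos)
                  return token.lexpos - last_cr    # 1-based column number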
09/25/06: beazley
          Changed the name of the __copy__() method on the Lexer class
          to clone().  This is used to clone a Lexer object (e.g., if
          you're running different lexers at the same time).

09/21/06: beazley
          Limitations related to the use of the re module have been eliminated.
          Several users reported problems with regular expressions exceeding
          more than 100 named groups.  To solve this, lex.py is now capable
          of automatically splitting its master regular expression into
          smaller expressions as needed.  This should, in theory, make it
          possible to specify an arbitrarily large number of tokens.

09/21/06: beazley
          Improved error checking in lex.py.  Rules that match the empty string
          are now rejected (otherwise they cause the lexer to enter an infinite
          loop).  An extra check for rules containing '#' has also been added.
          Since lex compiles regular expressions in verbose mode, '#' is
          interpreted as a regex comment; it is critical to use '\#' instead.

09/18/06: beazley
          Added a @TOKEN decorator function to lex.py that can be used to
          define token rules where the documentation string might be computed
          in some way.

              digit      = r'([0-9])'
              nondigit   = r'([_A-Za-z])'
              identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'

              from ply.lex import TOKEN

              @TOKEN(identifier)
              def t_ID(t):
                  # Do whatever

          The @TOKEN decorator merely sets the documentation string of the
          associated token function as needed for lex to work.

          Note: An alternative solution is the following:

              def t_ID(t):
                  # Do whatever

              t_ID.__doc__ = identifier

          Note: Decorators require the use of Python 2.4 or later.  If compatibility
          with old versions is needed, use the latter solution.

          The need for this feature was suggested by Cem Karan.

09/14/06: beazley
          Support for single-character literal tokens has been added to yacc.
          These literals must be enclosed in quotes.  For example:

              def p_expr(p):
                  "expr : expr '+' expr"
                  ...

              def p_expr(p):
                  'expr : expr "-" expr'
                  ...

          In addition to this, it is necessary to tell the lexer module about
          literal characters.  This is done by defining the variable 'literals'
          as a list of characters.  This should be defined in the module that
          invokes the lex.lex() function.  For example:

              literals = ['+','-','*','/','(',')','=']

          or simply

              literals = '+-*/()='

          It is important to note that literals can only be a single character.
          When the lexer fails to match a token using its normal regular expression
          rules, it will check the current character against the literal list.
          If found, it will be returned with a token type set to match the literal
          character.  Otherwise, an illegal character will be signalled.
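          Putting the pieces together, a minimal sketch of a calculator built
          around literal tokens might look like this (the grammar, tokens, and
          precedence table are made up for illustration):

              import ply.lex as lex
              import ply.yacc as yacc

              tokens = ('NUMBER',)
              literals = ['+','-']

              t_ignore = ' '

              def t_NUMBER(t):
                  r'\d+'
                  t.value = int(t.value)
                  return t

              def t_error(t):
                  print "Illegal character '%s'" % t.value[0]
                  t.lexer.skip(1)

              precedence = (('left','+','-'),)

              def p_expr_binop(p):
                  """expr : expr '+' expr
                          | expr '-' expr"""
                  if p[2] == '+':
                      p[0] = p[1] + p[3]
                  else:
                      p[0] = p[1] - p[3]

              def p_expr_number(p):
                  'expr : NUMBER'
                  p[0] = p[1]

              def p_error(p):
                  print "Syntax error"

              lex.lex()
              parser = yacc.yacc()
              print parser.parse("1 + 2 - 3")    # prints 0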
09/14/06: beazley
          Modified PLY to install itself as a proper Python package called 'ply'.
          This will make it a little more friendly to other modules.  This
          changes the usage of PLY only slightly.  Just do this to import the
          modules:

              import ply.lex as lex
              import ply.yacc as yacc

          Alternatively, you can do this:

              from ply import *

          which imports both the lex and yacc modules.
          Change suggested by Lee June.

09/13/06: beazley
          Changed the handling of negative indices when used in production rules.
          A negative production index now accesses already parsed symbols on the
          parsing stack.  For example,

              def p_foo(p):
                  "foo : A B C D"
                  print p[1]       # Value of 'A' symbol
                  print p[2]       # Value of 'B' symbol
                  print p[-1]      # Value of whatever symbol appears before A
                                   # on the parsing stack.

                  p[0] = some_val  # Sets the value of the 'foo' grammar symbol

          This behavior makes it easier to work with embedded actions within the
          parsing rules.  For example, in C-yacc, it is possible to write code like
          this:

              bar: A { printf("seen an A = %d\n", $1); } B { do_stuff; }

          In this example, the printf() code executes immediately after A has been
          parsed.  Within the embedded action code, $1 refers to the A symbol on
          the stack.

          To perform the equivalent action in PLY, you need to write a pair
          of rules like this:

              def p_bar(p):
                  "bar : A seen_A B"
                  do_stuff

              def p_seen_A(p):
                  "seen_A :"
                  print "seen an A =", p[-1]

          The second rule "seen_A" is merely an empty production which should be
          reduced as soon as A is parsed in the "bar" rule above.  The
          negative index p[-1] is used to access whatever symbol appeared
          before the seen_A symbol.

          This feature also makes it possible to support inherited attributes.
          For example:

              def p_decl(p):
                  "decl : scope name"

              def p_scope(p):
                  """scope : GLOBAL
                           | LOCAL"""
                  p[0] = p[1]

              def p_name(p):
                  "name : ID"
                  if p[-1] == "GLOBAL":
                      # ...
                  elif p[-1] == "LOCAL":
                      # ...

          In this case, the name rule is inheriting an attribute from the
          scope declaration that precedes it.

          *** POTENTIAL INCOMPATIBILITY ***
          If you are currently using negative indices within existing grammar rules,
          your code will break.  This should be extremely rare, if not non-existent,
          in most cases.  The argument to various grammar rules is not usually
          processed in the same way as a list of items.

Version 2.0
------------------------------
09/07/06: beazley
          Major cleanup and refactoring of the LR table generation code.  Both SLR
          and LALR(1) table generation is now performed by the same code base with
          only minor extensions for extra LALR(1) processing.

09/07/06: beazley
          Completely reimplemented the entire LALR(1) parsing engine to use the
          DeRemer and Pennello algorithm for calculating lookahead sets.  This
          significantly improves the performance of generating LALR(1) tables
          and has the added feature of actually working correctly!  If you
          experienced weird behavior with LALR(1) in prior releases, this should
          hopefully resolve all of those problems.  Many thanks to
          Andrew Waters and Markus Schoepflin for submitting bug reports
          and helping me test out the revised LALR(1) support.

Version 1.8
------------------------------
08/02/06: beazley
          Fixed a problem related to the handling of default actions in LALR(1)
          parsing.  If you experienced subtle and/or bizarre behavior when trying
          to use the LALR(1) engine, this may correct those problems.  Patch
          contributed by Russ Cox.  Note: This patch has been superseded by
          revisions for LALR(1) parsing in Ply-2.0.

08/02/06: beazley
          Added support for slicing of productions in yacc.
          Patch contributed by Patrick Mezard.

Version 1.7
------------------------------
03/02/06: beazley
          Fixed an infinite recursion problem in the ReduceToTerminals() function
          that would sometimes come up in LALR(1) table generation.  Reported by
          Markus Schoepflin.

03/01/06: beazley
          Added "reflags" argument to lex().  For example:

              lex.lex(reflags=re.UNICODE)

          This can be used to specify optional flags to the re.compile() function
          used inside the lexer.  This may be necessary for special situations such
          as processing Unicode (e.g., if you want escapes like \w and \b to consult
          the Unicode character property database).  The need for this was suggested
          by Andreas Jung.
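          For instance, the following sketch (token name and input made up)
          lexes accented words that the default ASCII-only \w would reject:

              import re
              import ply.lex as lex

              tokens = ('WORD',)

              t_WORD = r'\w+'      # with re.UNICODE, \w consults the Unicode database
              t_ignore = ' '

              def t_error(t):
                  t.lexer.skip(1)

              lexer = lex.lex(reflags=re.UNICODE)
              lexer.input(u'h\xe9llo w\xf6rld')
              while True:
                  tok = lexer.token()
                  if not tok:
                      break
                  print tok.value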
03/01/06: beazley
          Fixed a bug with an uninitialized variable on repeated instantiations of parser
          objects when the write_tables=0 argument was used.  Reported by Michael Brown.

03/01/06: beazley
          Modified lex.py to accept Unicode strings both as the regular expressions for
          tokens and as input.  Hopefully this is the only change needed for Unicode support.
          Patch contributed by Johan Dahl.

03/01/06: beazley
          Modified the class-based interface to work with new-style or old-style classes.
          Patch contributed by Michael Brown (although I tweaked it slightly so it would work
          with older versions of Python).

Version 1.6
------------------------------
05/27/05: beazley
          Incorporated patch contributed by Christopher Stawarz to fix an extremely
          devious bug in LALR(1) parser generation.  This patch should fix problems
          numerous people reported with LALR parsing.

05/27/05: beazley
          Fixed problem with lex.py copy constructor.  Reported by Dave Aitel, Aaron Lav,
          and Thad Austin.

05/27/05: beazley
          Added outputdir option to yacc() to control output directory.  Contributed
          by Christopher Stawarz.

05/27/05: beazley
          Added rununit.py test script to run tests using the Python unittest module.
          Contributed by Miki Tebeka.

Version 1.5
------------------------------
05/26/04: beazley
          Major enhancement.  LALR(1) parsing support is now working.
          This feature was implemented by Elias Ioup (ezioup@alumni.uchicago.edu)
          and optimized by David Beazley.  To use LALR(1) parsing do
          the following:

              yacc.yacc(method="LALR")

          Computing LALR(1) parsing tables takes about twice as long as
          the default SLR method.  However, LALR(1) allows you to handle
          more complex grammars.  For example, the ANSI C grammar
          (in example/ansic) has 13 shift-reduce conflicts with SLR, but
          only has 1 shift-reduce conflict with LALR(1).

05/20/04: beazley
          Added a __len__ method to parser production lists.  Can
          be used in parser rules like this:

              def p_somerule(p):
                  """a : B C D
                       | E F"""
                  if len(p) == 4:
                      # Must have been first rule
                  elif len(p) == 3:
                      # Must be second rule

          Suggested by Joshua Gerth and others.

Version 1.4
------------------------------
04/23/04: beazley
          Incorporated a variety of patches contributed by Eric Raymond.
          These include:

          0.  Cleans up some comments so they don't wrap on an 80-column display.
          1.  Directs compiler errors to stderr where they belong.
          2.  Implements and documents automatic line counting when \n is ignored.
          3.  Changes the way progress messages are dumped when debugging is on.
              The new format is both less verbose and conveys more information than
              the old, including shift and reduce actions.

04/23/04: beazley
          Added a Python setup.py file to simplify installation.  Contributed
          by Adam Kerrison.

04/23/04: beazley
          Added patches contributed by Adam Kerrison.

          -  Some output is now only shown when debugging is enabled.  This
             means that PLY will be completely silent when not in debugging mode.

          -  An optional parameter "write_tables" can be passed to yacc() to
             control whether or not parsing tables are written.  By default,
             it is true, but it can be turned off if you don't want the yacc
             table file.  Note: disabling this will cause yacc() to regenerate
             the parsing table each time.

04/23/04: beazley
          Added patches contributed by David McNab.  This patch adds two
          features:

          -  The parser can be supplied as a class instead of a module.
             For an example of this, see the example/classcalc directory.

          -  Debugging output can be directed to a filename of the user's
             choice.  Use

                 yacc(debugfile="somefile.out")
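          A minimal sketch showing both features together (the class, token,
          and file names are made up; see example/classcalc for the real
          example):

              import ply.lex as lex
              import ply.yacc as yacc

              class Calc:
                  tokens = ('NUMBER',)
                  t_ignore = ' '

                  def t_NUMBER(self, t):
                      r'\d+'
                      t.value = int(t.value)
                      return t

                  def t_error(self, t):
                      t.lexer.skip(1)

                  def p_expr(self, p):
                      'expr : NUMBER'
                      p[0] = p[1]

                  def p_error(self, p):
                      print "Syntax error"

                  def __init__(self):
                      # Build the lexer and parser from this instance and
                      # send the debugging output to a file of our choosing.
                      self.lexer = lex.lex(module=self)
                      self.parser = yacc.yacc(module=self, debugfile="calc.out")

              calc = Calc()
              print calc.parser.parse("42", lexer=calc.lexer)    # prints 42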
Version 1.3
------------------------------
12/10/02: jmdyck
          Various minor adjustments to the code that Dave checked in today.
          Updated test/yacc_{inf,unused}.exp to reflect today's changes.

12/10/02: beazley
          Incorporated a variety of minor bug fixes to empty production
          handling and infinite recursion checking.  Contributed by
          Michael Dyck.

12/10/02: beazley
          Removed bogus recover() method call in yacc.restart()

Version 1.2
------------------------------
11/27/02: beazley
          Lexer and parser objects are now available as an attribute
          of tokens and slices respectively.  For example:

              def t_NUMBER(t):
                  r'\d+'
                  print t.lexer

              def p_expr_plus(t):
                  'expr : expr PLUS expr'
                  print t.lexer
                  print t.parser

          This can be used for state management (if needed).

10/31/02: beazley
          Modified yacc.py to work with Python optimize mode.  To make
          this work, you need to use

              yacc.yacc(optimize=1)

          Furthermore, you need to first run Python in normal mode
          to generate the necessary parsetab.py files.  After that,
          you can use python -O or python -OO.

          Note: optimized mode turns off a lot of error checking.
          Only use when you are sure that your grammar is working.
          Make sure parsetab.py is up to date!

10/30/02: beazley
          Added cloning of Lexer objects.  For example:

              import copy
              l = lex.lex()
              lc = copy.copy(l)

              l.input("Some text")
              lc.input("Some other text")
              ...

          This might be useful if the same "lexer" is meant to
          be used in different contexts---or if multiple lexers
          are running concurrently.
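          Fleshed out into a complete sketch (the token name and inputs are
          made up), the two lexers tokenize independently:

              import copy
              import ply.lex as lex

              tokens = ('NUMBER',)
              t_NUMBER = r'\d+'
              t_ignore = ' '

              def t_error(t):
                  t.lexer.skip(1)

              l = lex.lex()
              lc = copy.copy(l)

              l.input("1 2 3")
              lc.input("40 50")

              print l.token().value     # '1'
              print lc.token().value    # '40'
              print l.token().value     # '2' (each copy keeps its own position)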
In optimized mode (-O, -OO) everything should 686 work normally despite the loss of doc strings. 687 688 To change the name of the file 'lextab.py' use the following: 689 690 lex.lex(lextab="footab") 691 692 (this creates a file footab.py) 693 694 695Version 1.1 October 25, 2001 696------------------------------ 697 69810/25/01: beazley 699 Modified the table generator to produce much more compact data. 700 This should greatly reduce the size of the parsetab.py[c] file. 701 Caveat: the tables still need to be constructed so a little more 702 work is done in parsetab on import. 703 70410/25/01: beazley 705 There may be a possible bug in the cycle detector that reports errors 706 about infinite recursion. I'm having a little trouble tracking it 707 down, but if you get this problem, you can disable the cycle 708 detector as follows: 709 710 yacc.yacc(check_recursion = 0) 711 71210/25/01: beazley 713 Fixed a bug in lex.py that sometimes caused illegal characters to be 714 reported incorrectly. Reported by Sverre J�rgensen. 715 7167/8/01 : beazley 717 Added a reference to the underlying lexer object when tokens are handled by 718 functions. The lexer is available as the 'lexer' attribute. This 719 was added to provide better lexing support for languages such as Fortran 720 where certain types of tokens can't be conveniently expressed as regular 721 expressions (and where the tokenizing function may want to perform a 722 little backtracking). Suggested by Pearu Peterson. 723 7246/20/01 : beazley 725 Modified yacc() function so that an optional starting symbol can be specified. 726 For example: 727 728 yacc.yacc(start="statement") 729 730 Normally yacc always treats the first production rule as the starting symbol. 731 However, if you are debugging your grammar it may be useful to specify 732 an alternative starting symbol. Idea suggested by Rich Salz. 733 734Version 1.0 June 18, 2001 735-------------------------- 736Initial public offering 737 738