ply.html revision 4479
<html>
<head>
<title>PLY (Python Lex-Yacc)</title>
</head>
<body bgcolor="#ffffff">

<h1>PLY (Python Lex-Yacc)</h1>

<b>
David M. Beazley <br>
dave@dabeaz.com<br>
</b>

<p>
<b>PLY Version: 2.3</b>
<p>

<!-- INDEX -->
<div class="sectiontoc">
<ul>
<li><a href="#ply_nn1">Introduction</a>
<li><a href="#ply_nn2">PLY Overview</a>
<li><a href="#ply_nn3">Lex</a>
<ul>
<li><a href="#ply_nn4">Lex Example</a>
<li><a href="#ply_nn5">The tokens list</a>
<li><a href="#ply_nn6">Specification of tokens</a>
<li><a href="#ply_nn7">Token values</a>
<li><a href="#ply_nn8">Discarded tokens</a>
<li><a href="#ply_nn9">Line numbers and positional information</a>
<li><a href="#ply_nn10">Ignored characters</a>
<li><a href="#ply_nn11">Literal characters</a>
<li><a href="#ply_nn12">Error handling</a>
<li><a href="#ply_nn13">Building and using the lexer</a>
<li><a href="#ply_nn14">The @TOKEN decorator</a>
<li><a href="#ply_nn15">Optimized mode</a>
<li><a href="#ply_nn16">Debugging</a>
<li><a href="#ply_nn17">Alternative
specification of lexers</a>
<li><a href="#ply_nn18">Maintaining state</a>
<li><a href="#ply_nn19">Duplicating lexers</a>
<li><a href="#ply_nn20">Internal lexer state</a>
<li><a href="#ply_nn21">Conditional lexing and start conditions</a>
<li><a href="#ply_nn21">Miscellaneous Issues</a>
</ul>
<li><a href="#ply_nn22">Parsing basics</a>
<li><a href="#ply_nn23">Yacc reference</a>
<ul>
<li><a href="#ply_nn24">An example</a>
<li><a href="#ply_nn25">Combining Grammar Rule Functions</a>
<li><a href="#ply_nn26">Character Literals</a>
<li><a href="#ply_nn26">Empty Productions</a>
<li><a href="#ply_nn28">Changing the starting symbol</a>
<li><a href="#ply_nn27">Dealing With Ambiguous Grammars</a>
<li><a href="#ply_nn28">The parser.out file</a>
<li><a href="#ply_nn29">Syntax Error Handling</a>
<ul>
<li><a href="#ply_nn30">Recovery and resynchronization with error rules</a>
<li><a href="#ply_nn31">Panic mode recovery</a>
<li><a href="#ply_nn32">General comments on error handling</a>
</ul>
<li><a href="#ply_nn33">Line Number and Position Tracking</a>
<li><a href="#ply_nn34">AST Construction</a>
<li><a href="#ply_nn35">Embedded Actions</a>
<li><a href="#ply_nn36">Yacc implementation notes</a>
</ul>
<li><a href="#ply_nn37">Parser and Lexer State Management</a>
<li><a
href="#ply_nn38">Using Python's Optimized Mode</a>
<li><a href="#ply_nn39">Where to go from here?</a>
</ul>
</div>
<!-- INDEX -->

<H2><a name="ply_nn1"></a>1. Introduction</H2>

PLY is a pure-Python implementation of the popular compiler
construction tools lex and yacc.  The main goal of PLY is to stay
fairly faithful to the way in which traditional lex/yacc tools work.
This includes supporting LALR(1) parsing as well as providing
extensive input validation, error reporting, and diagnostics.  Thus,
if you've used yacc in another programming language, it should be
relatively straightforward to use PLY.

<p>
Early versions of PLY were developed to support an Introduction to
Compilers Course I taught in 2001 at the University of Chicago.  In this course,
students built a fully functional compiler for a simple Pascal-like
language.  Their compiler, implemented entirely in Python, had to
include lexical analysis, parsing, type checking, type inference,
nested scoping, and code generation for the SPARC processor.
Approximately 30 different compiler implementations were completed in
this course.
Most of PLY's interface and operation have been influenced by common
usability problems encountered by students.

<p>
Since PLY was primarily developed as an instructional tool, you will
find it to be fairly picky about token and grammar rule
specification.  In part, this
added formality is meant to catch common programming mistakes made by
novice users.  However, advanced users will also find such features to
be useful when building complicated grammars for real programming
languages.  It should also be noted that PLY does not provide much in
the way of bells and whistles (e.g., automatic construction of
abstract syntax trees, tree traversal, etc.).  Nor would I consider it
to be a parsing framework.  Instead, you will find a bare-bones, yet
fully capable lex/yacc implementation written entirely in Python.

<p>
The rest of this document assumes that you are somewhat familiar with
parsing theory, syntax-directed translation, and the use of compiler
construction tools such as lex and yacc in other programming
languages.  If you are unfamiliar with these topics, you will probably
want to consult an introductory text such as "Compilers: Principles,
Techniques, and Tools", by Aho, Sethi, and Ullman.  O'Reilly's "Lex
and Yacc" by John Levine may also be handy.
In fact, the O'Reilly book can be
used as a reference for PLY as the concepts are virtually identical.

<H2><a name="ply_nn2"></a>2. PLY Overview</H2>

PLY consists of two separate modules: <tt>lex.py</tt> and
<tt>yacc.py</tt>, both of which are found in a Python package
called <tt>ply</tt>.  The <tt>lex.py</tt> module is used to break input text into a
collection of tokens specified by a collection of regular expression
rules.  <tt>yacc.py</tt> is used to recognize language syntax that has
been specified in the form of a context-free grammar.  <tt>yacc.py</tt> uses LR parsing and generates its parsing tables
using either the LALR(1) (the default) or SLR table generation algorithms.

<p>
The two tools are meant to work together.  Specifically,
<tt>lex.py</tt> provides an external interface in the form of a
<tt>token()</tt> function that returns the next valid token on the
input stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
tokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
often an Abstract Syntax Tree (AST).  However, this is entirely up to
the user.  If desired, <tt>yacc.py</tt> can also be used to implement
simple one-pass compilers.
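<p>
The hand-off just described, a lexer exposing <tt>input()</tt> and <tt>token()</tt> with a parser pulling tokens until <tt>token()</tt> returns no more, can be sketched in plain Python without PLY.  This is an illustrative sketch only, not PLY's implementation; <tt>MiniLexer</tt> and <tt>parse_sum</tt> are hypothetical names:

```python
import re

# A minimal sketch (not PLY's code) of the lexer/parser hand-off:
# the lexer exposes input() and token(), and a tiny "one-pass
# compiler" evaluates the input as tokens are consumed.
class MiniLexer(object):
    def input(self, data):
        # Pair patterns with token types, the idea lex.py automates.
        self._pairs = iter(re.findall(r'(\d+)|(\+)', data))
    def token(self):
        for number, plus in self._pairs:
            if number:
                return ('NUMBER', int(number))
            return ('PLUS', plus)
        return None                     # no more input

def parse_sum(lexer):
    # Evaluate while tokens are consumed: no tree is ever built.
    total, tok = 0, lexer.token()
    while tok is not None:
        if tok[0] == 'NUMBER':
            total += tok[1]
        tok = lexer.token()
    return total

lexer = MiniLexer()
lexer.input('1 + 2 + 3')
result = parse_sum(lexer)               # -> 6
```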

<p>
Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
features you expect including extensive error checking, grammar
validation, support for empty productions, error tokens, and ambiguity
resolution via precedence rules.  In fact, everything that is possible in traditional yacc
should be supported in PLY.

<p>
The primary difference between
<tt>yacc.py</tt> and Unix <tt>yacc</tt> is that <tt>yacc.py</tt>
doesn't involve a separate code-generation process.
Instead, PLY relies on reflection (introspection)
to build its lexers and parsers.  Unlike traditional lex/yacc which
require a special input file that is converted into a separate source
file, the specifications given to PLY <em>are</em> valid Python
programs.  This means that there are no extra source files nor is
there a special compiler construction step (e.g., running yacc to
generate Python code for the compiler).  Since the generation of the
parsing tables is relatively expensive, PLY caches the results and
saves them to a file.  If no changes are detected in the input source,
the tables are read from the cache.  Otherwise, they are regenerated.

<H2><a name="ply_nn3"></a>3. Lex</H2>

<tt>lex.py</tt> is used to tokenize an input string.
For example, suppose
you're writing a programming language and a user supplied the following input string:

<blockquote>
<pre>
x = 3 + 42 * (s - t)
</pre>
</blockquote>

A tokenizer splits the string into individual tokens:

<blockquote>
<pre>
'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'
</pre>
</blockquote>

Tokens are usually given names to indicate what they are.  For example:

<blockquote>
<pre>
'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
'LPAREN','ID','MINUS','ID','RPAREN'
</pre>
</blockquote>

More specifically, the input is broken into pairs of token types and values.  For example:

<blockquote>
<pre>
('ID','x'), ('EQUALS','='), ('NUMBER','3'),
('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
('LPAREN','('), ('ID','s'), ('MINUS','-'),
('ID','t'), ('RPAREN',')')
</pre>
</blockquote>

The identification of tokens is typically done by writing a series of regular expression
rules.  The next section shows how this is done using <tt>lex.py</tt>.
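<p>
To make the idea concrete, the pairs shown above can be produced with a few lines of the standard <tt>re</tt> module.  This is only a sketch of what <tt>lex.py</tt> automates for you; the <tt>token_spec</tt> table and <tt>tokenize()</tt> function below are illustrative names, not PLY APIs:

```python
import re

# A sketch of the token-pair idea using only the stdlib re module.
# Each pattern is wrapped in a named group so the matching token
# type can be recovered from match.lastgroup.
token_spec = [
    ('NUMBER', r'\d+'),
    ('ID',     r'[A-Za-z_]\w*'),
    ('EQUALS', r'='),
    ('PLUS',   r'\+'),
    ('MINUS',  r'-'),
    ('TIMES',  r'\*'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
]
master = re.compile('|'.join('(?P<%s>%s)' % pair for pair in token_spec))

def tokenize(text):
    # finditer() silently skips text (such as spaces) that no rule matches.
    return [(m.lastgroup, m.group()) for m in master.finditer(text)]

pairs = tokenize('x = 3 + 42 * (s - t)')
# pairs[0:3] -> [('ID','x'), ('EQUALS','='), ('NUMBER','3')]
```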

<H3><a name="ply_nn4"></a>3.1 Lex Example</H3>

The following example shows how <tt>lex.py</tt> is used to write a simple tokenizer.

<blockquote>
<pre>
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import ply.lex as lex

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Line %d: Number %s is too large!" % (t.lineno,t.value)
        t.value = 0
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

# Build the lexer
lex.lex()

</pre>
</blockquote>
To use the lexer, you first need to feed it some input text using its <tt>input()</tt> method.  After that, repeated calls to <tt>token()</tt> produce tokens.
The following code shows how this works:

<blockquote>
<pre>

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lex.input(data)

# Tokenize
while 1:
    tok = lex.token()
    if not tok: break      # No more input
    print tok
</pre>
</blockquote>

When executed, the example will produce the following output:

<blockquote>
<pre>
$ python example.py
LexToken(NUMBER,3,2,1)
LexToken(PLUS,'+',2,3)
LexToken(NUMBER,4,2,5)
LexToken(TIMES,'*',2,7)
LexToken(NUMBER,10,2,10)
LexToken(PLUS,'+',3,14)
LexToken(MINUS,'-',3,16)
LexToken(NUMBER,20,3,18)
LexToken(TIMES,'*',3,20)
LexToken(NUMBER,2,3,21)
</pre>
</blockquote>

The tokens returned by <tt>lex.token()</tt> are instances
of <tt>LexToken</tt>.  This object has
attributes <tt>tok.type</tt>, <tt>tok.value</tt>,
<tt>tok.lineno</tt>, and <tt>tok.lexpos</tt>.
The following code shows an example of
accessing these attributes:

<blockquote>
<pre>
# Tokenize
while 1:
    tok = lex.token()
    if not tok: break      # No more input
    print tok.type, tok.value, tok.lineno, tok.lexpos
</pre>
</blockquote>

The <tt>tok.type</tt> and <tt>tok.value</tt> attributes contain the
type and value of the token itself.
<tt>tok.lineno</tt> and <tt>tok.lexpos</tt> contain information about
the location of the token.  <tt>tok.lexpos</tt> is the index of the
token relative to the start of the input text.

<H3><a name="ply_nn5"></a>3.2 The tokens list</H3>

All lexers must provide a list <tt>tokens</tt> that defines all of the possible token
names that can be produced by the lexer.  This list is always required
and is used to perform a variety of validation checks.  The tokens list is also used by the
<tt>yacc.py</tt> module to identify terminals.
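<p>
As a rough illustration of why such a list is useful for validation, the sketch below flags token names that are not legal identifiers as well as duplicate entries.  This is not <tt>lex.py</tt>'s actual code; <tt>check_tokens</tt> is a hypothetical helper:

```python
import re

# A sketch (not lex.py's implementation) of the kind of validation
# the tokens list makes possible: every entry must look like a legal
# token name, and duplicates usually indicate a copy-paste mistake.
tokens = (
    'NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN',
)

def check_tokens(names):
    errors = []
    seen = set()
    for name in names:
        if not re.match(r'^[A-Za-z_][A-Za-z0-9_]*$', name):
            errors.append("Bad token name %r" % name)
        if name in seen:
            errors.append("Duplicate token name %r" % name)
        seen.add(name)
    return errors

problems = check_tokens(tokens)         # -> [] for the list above
```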

<p>
In the example, the following code specified the token names:

<blockquote>
<pre>
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)
</pre>
</blockquote>

<H3><a name="ply_nn6"></a>3.3 Specification of tokens</H3>

Each token is specified by writing a regular expression rule.  Each of these rules is
defined by making declarations with a special prefix <tt>t_</tt> to indicate that it
defines a token.  For simple tokens, the regular expression can
be specified as strings such as this (note: Python raw strings are used since they are the
most convenient way to write regular expression strings):

<blockquote>
<pre>
t_PLUS = r'\+'
</pre>
</blockquote>

In this case, the name following the <tt>t_</tt> must exactly match one of the
names supplied in <tt>tokens</tt>.  If some kind of action needs to be performed,
a token rule can be specified as a function.  For example, this rule matches numbers and
converts the string into a Python integer.

<blockquote>
<pre>
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t
</pre>
</blockquote>

When a function is used, the regular expression rule is specified in the function documentation string.
The function always takes a single argument which is an instance of
<tt>LexToken</tt>.   This object has attributes of <tt>t.type</tt> which is the token type (as a string),
<tt>t.value</tt> which is the lexeme (the actual text matched), <tt>t.lineno</tt> which is the current line number, and <tt>t.lexpos</tt> which
is the position of the token relative to the beginning of the input text.
By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
function can modify the contents of the <tt>LexToken</tt> object as appropriate.  However,
when it is done, the resulting token should be returned.  If no value is returned by the action
function, the token is simply discarded and the next token read.

<p>
Internally, <tt>lex.py</tt> uses the <tt>re</tt> module to do its pattern matching.
When building the master regular expression,
rules are added in the following order:
<p>
<ol>
<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
<li>Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions
are added first).
</ol>
<p>
Without this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you
wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
expressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
the order can be explicitly controlled since rules appearing first are checked first.

<p>
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
like this:

<blockquote>
<pre>
reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
</pre>
</blockquote>

This approach greatly reduces the number of regular expression rules and is likely to make things a little faster.

<p>
<b>Note:</b> You should avoid writing individual rules for reserved words.  For example, if you write rules like this,

<blockquote>
<pre>
t_FOR   = r'for'
t_PRINT = r'print'
</pre>
</blockquote>

those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed".  This is probably not
what you want.

<H3><a name="ply_nn7"></a>3.4 Token values</H3>

When tokens are returned by lex, they have a value that is stored in the <tt>value</tt> attribute.  Normally, the value is the text
that was matched.  However, the value can be assigned to any Python object.  For instance, when lexing identifiers, you may
want to return both the identifier name and information from some sort of symbol table.
To do this, you might write a rule like this:

<blockquote>
<pre>
def t_ID(t):
    ...
    # Look up symbol table information and return a tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t
</pre>
</blockquote>

It is important to note that storing data in other attribute names is <em>not</em> recommended.  The <tt>yacc.py</tt> module only exposes the
contents of the <tt>value</tt> attribute.  Thus, accessing other attributes may be unnecessarily awkward.

<H3><a name="ply_nn8"></a>3.5 Discarded tokens</H3>

To discard a token, such as a comment, simply define a token rule that returns no value.  For example:

<blockquote>
<pre>
def t_COMMENT(t):
    r'\#.*'
    pass
    # No return value. Token discarded
</pre>
</blockquote>

Alternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored.
For example:

<blockquote>
<pre>
t_ignore_COMMENT = r'\#.*'
</pre>
</blockquote>

Be advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise
control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are
sorted by regular expression length).

<H3><a name="ply_nn9"></a>3.6 Line numbers and positional information</H3>

<p>By default, <tt>lex.py</tt> knows nothing about line numbers.  This is because <tt>lex.py</tt> doesn't know anything
about what constitutes a "line" of input (e.g., the newline character or even if the input is textual data).
To update this information, you need to write a special rule.  In the example, the <tt>t_newline()</tt> rule shows how to do this.

<blockquote>
<pre>
# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
</pre>
</blockquote>
Within the rule, the <tt>lineno</tt> attribute of the underlying lexer <tt>t.lexer</tt> is updated.
After the line number is updated, the token is simply discarded since nothing is returned.
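<p>
The bookkeeping performed by <tt>t_newline()</tt> can be mimicked outside of <tt>lex.py</tt>.  The sketch below (<tt>count_lines</tt> is a hypothetical helper, not a PLY API) shows why <tt>t.lexer.lineno += len(t.value)</tt> is correct even when <tt>r'\n+'</tt> matches several newlines at once:

```python
import re

# A sketch of the line-number bookkeeping: r'\n+' matches each run of
# newlines as a single token, and len(t.value) then counts every
# newline in that run.
def count_lines(text):
    lineno = 1                          # lex.py also starts lineno at 1
    for mo in re.finditer(r'\n+', text):
        lineno += len(mo.group())       # same as lineno += len(t.value)
    return lineno

final_line = count_lines('3 + 4\n\n  * 10\n')   # -> 4
```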

<p>
<tt>lex.py</tt> does not perform any kind of automatic column tracking.  However, it does record positional
information related to each token in the <tt>lexpos</tt> attribute.  Using this, it is usually possible to compute
column information as a separate step.  For instance, just count backwards until you reach a newline.

<blockquote>
<pre>
# Compute column.
#     input is the input text string
#     token is a token instance
def find_column(input,token):
    i = token.lexpos
    while i > 0:
        if input[i-1] == '\n': break
        i -= 1
    column = (token.lexpos - i) + 1
    return column
</pre>
</blockquote>

Since column information is often only useful in the context of error handling, calculating the column
position can be performed when needed as opposed to doing it for each token.

<H3><a name="ply_nn10"></a>3.7 Ignored characters</H3>

<p>
The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
that should be completely ignored in the input stream.
Usually this is used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for whitespace in a manner
similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
lexing performance because it is handled as a special case and is checked in a much
more efficient manner than the normal regular expression rules.

<H3><a name="ply_nn11"></a>3.8 Literal characters</H3>

<p>
Literal characters can be specified by defining a variable <tt>literals</tt> in your lexing module.  For example:

<blockquote>
<pre>
literals = [ '+','-','*','/' ]
</pre>
</blockquote>

or alternatively

<blockquote>
<pre>
literals = "+-*/"
</pre>
</blockquote>

A literal character is simply a single character that is returned "as is" when encountered by the lexer.  Literals are checked
after all of the defined regular expression rules.  Thus, if a rule starts with one of the literal characters, the rule will always
take precedence.
<p>
When a literal token is returned, both its <tt>type</tt> and <tt>value</tt> attributes are set to the character itself.  For example, <tt>'+'</tt>.
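<p>
Conceptually, the literal fallback behaves as sketched below; this is a simplification for illustration, not the actual <tt>lex.py</tt> internals:

```python
literals = "+-*/"

def literal_token(c):
    # If no regular-expression rule matched, a character found in
    # literals becomes a token whose type and value are the character.
    if c in literals:
        return (c, c)        # (type, value)
    return None

print(literal_token('+'))    # prints ('+', '+')
print(literal_token('x'))    # prints None
```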
<H3><a name="ply_nn12"></a>3.9 Error handling</H3>

<p>
Finally, the <tt>t_error()</tt>
function is used to handle lexing errors that occur when illegal
characters are detected.  In this case, the <tt>t.value</tt> attribute contains the
rest of the input string that has not been tokenized.  In the example, the error function
was defined as follows:

<blockquote>
<pre>
# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)
</pre>
</blockquote>

In this case, we simply print the offending character and skip ahead one character by calling <tt>t.lexer.skip(1)</tt>.

<H3><a name="ply_nn13"></a>3.10 Building and using the lexer</H3>

<p>
To build the lexer, the function <tt>lex.lex()</tt> is used.  This function
uses Python reflection (or introspection) to read the regular expression rules
out of the calling context and build the lexer. Once the lexer has been built, two functions can
be used to control the lexer.

<ul>
<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
<li><tt>lex.token()</tt>.  Return the next token.
Returns a special <tt>LexToken</tt> instance on success or
None if the end of the input text has been reached.
</ul>

If desired, the lexer can also be used as an object.  The <tt>lex()</tt> function returns a <tt>Lexer</tt> object that
can be used for this purpose.  For example:

<blockquote>
<pre>
lexer = lex.lex()
lexer.input(sometext)
while 1:
    tok = lexer.token()
    if not tok: break
    print tok
</pre>
</blockquote>

<p>
This latter technique should be used if you intend to use multiple lexers in your application.  Simply define each
lexer in its own module and use the object returned by <tt>lex()</tt> as appropriate.

<p>
Note:  The global functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt>
and <tt>token()</tt> methods of the last lexer created by the lex module. 

<H3><a name="ply_nn14"></a>3.11 The @TOKEN decorator</H3>

In some applications, you may want to build tokens from a series of
more complex regular expression rules.
For example:

<blockquote>
<pre>
digit            = r'([0-9])'
nondigit         = r'([_A-Za-z])'
identifier       = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'        

def t_ID(t):
    # want docstring to be identifier above. ?????
    ...
</pre>
</blockquote>

In this case, we want the regular expression rule for <tt>ID</tt> to be one of the variables above.  However, there is no
way to directly specify this using a normal documentation string.  To solve this problem, you can use the <tt>@TOKEN</tt>
decorator.  For example:

<blockquote>
<pre>
from ply.lex import TOKEN

@TOKEN(identifier)
def t_ID(t):
    ...
</pre>
</blockquote>

This will attach <tt>identifier</tt> to the docstring for <tt>t_ID()</tt>, allowing <tt>lex.py</tt> to work normally.  An alternative
approach to this problem is to set the docstring directly like this:

<blockquote>
<pre>
def t_ID(t):
    ...
t_ID.__doc__ = identifier
</pre>
</blockquote>

<b>NOTE:</b> Use of <tt>@TOKEN</tt> requires Python-2.4 or newer.  If you're concerned about backwards compatibility with older
versions of Python, use the alternative approach of setting the docstring directly.

<H3><a name="ply_nn15"></a>3.12 Optimized mode</H3>

For improved performance, it may be desirable to use Python's
optimized mode (e.g., running Python with the <tt>-O</tt>
option).  However, doing so causes Python to ignore documentation
strings.  This presents special problems for <tt>lex.py</tt>.  To
handle this case, you can create your lexer using
the <tt>optimize</tt> option as follows:

<blockquote>
<pre>
lexer = lex.lex(optimize=1)
</pre>
</blockquote>

Next, run Python in its normal operating mode.  When you do
this, <tt>lex.py</tt> will write a file called <tt>lextab.py</tt> to
the current directory.  This file contains all of the regular
expression rules and tables used during lexing.  On subsequent
executions,
<tt>lextab.py</tt> will simply be imported to build the lexer.  This
approach substantially improves the startup time of the lexer and it
works in Python's optimized mode.
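<p>
The <tt>lextab.py</tt> mechanism follows a common generate-and-cache pattern: import previously generated tables if they exist, otherwise rebuild them from the rule definitions. Roughly (a sketch of the idea, not the actual <tt>lex.py</tt> code; the module name below is hypothetical):

```python
def load_or_build(build, table_module='lextab'):
    # Try to import previously generated tables; if no cached module
    # exists, fall back to rebuilding everything from the rules.
    try:
        return __import__(table_module)
    except ImportError:
        return build()

# With no cached module present, the builder runs instead:
print(load_or_build(lambda: 'rebuilt from rules', 'no_such_lextab'))
```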
<p>
To change the name of the lexer-generated file, use the <tt>lextab</tt> keyword argument.  For example:

<blockquote>
<pre>
lexer = lex.lex(optimize=1,lextab="footab")
</pre>
</blockquote>

When running in optimized mode, it is important to note that lex disables most error checking.  Thus, this is really only recommended
if you're sure everything is working correctly and you're ready to start releasing production code.

<H3><a name="ply_nn16"></a>3.13 Debugging</H3>

For the purpose of debugging, you can run <tt>lex()</tt> in a debugging mode as follows:

<blockquote>
<pre>
lexer = lex.lex(debug=1)
</pre>
</blockquote>

This will cause a large amount of debugging information to be printed, including all of the added rules and the master
regular expressions.

In addition, <tt>lex.py</tt> comes with a simple main function which
will either tokenize input read from standard input or from a file specified
on the command line.
To use it, simply put this in your lexer:

<blockquote>
<pre>
if __name__ == '__main__':
     lex.runmain()
</pre>
</blockquote>

<H3><a name="ply_nn17"></a>3.14 Alternative specification of lexers</H3>

As shown in the example, lexers are specified all within one Python module.   If you want to
put token rules in a different module from the one in which you invoke <tt>lex()</tt>, use the
<tt>module</tt> keyword argument.

<p>
For example, you might have a dedicated module that just contains
the token rules:

<blockquote>
<pre>
# module: tokrules.py
# This module just contains the lexing rules

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    try:
         t.value = int(t.value)    
    except ValueError:
         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
         t.value = 0
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)
</pre>
</blockquote>

Now, if you wanted to build a tokenizer from these rules within a different module, you would do the following (shown for Python interactive mode):

<blockquote>
<pre>
>>> import tokrules
>>> <b>lexer = lex.lex(module=tokrules)</b>
>>> lexer.input("3 + 4")
>>> lexer.token()
LexToken(NUMBER,3,1,0)
>>> lexer.token()
LexToken(PLUS,'+',1,2)
>>> lexer.token()
LexToken(NUMBER,4,1,4)
>>> lexer.token()
None
>>>
</pre>
</blockquote>

The <tt>object</tt> option can be used to define lexers as a class instead of a module.
For example:

<blockquote>
<pre>
import ply.lex as lex

class MyLexer:
    # List of token names.   This is always required
    tokens = (
       'NUMBER',
       'PLUS',
       'MINUS',
       'TIMES',
       'DIVIDE',
       'LPAREN',
       'RPAREN',
    )

    # Regular expression rules for simple tokens
    t_PLUS    = r'\+'
    t_MINUS   = r'-'
    t_TIMES   = r'\*'
    t_DIVIDE  = r'/'
    t_LPAREN  = r'\('
    t_RPAREN  = r'\)'

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_NUMBER(self,t):
        r'\d+'
        try:
             t.value = int(t.value)    
        except ValueError:
             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
             t.value = 0
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    <b># Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(object=self, **kwargs)</b>

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while 1:
             tok = self.lexer.token()
             if not tok: break
             print tok

# Build the lexer and try it out
m = MyLexer()
m.build()           # Build the lexer
m.test("3 + 4")     # Test it
</pre>
</blockquote>

For reasons that are subtle, you should <em>NOT</em> invoke <tt>lex.lex()</tt> inside the <tt>__init__()</tt> method of your class.  If you
do, it may cause bizarre behavior if someone tries to duplicate a lexer object.  Keep reading.
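<p>
Both the <tt>module</tt> and <tt>object</tt> options work because <tt>lex()</tt> discovers rules by introspection: it scans the supplied namespace for names beginning with <tt>t_</tt>. A rough sketch of that discovery step (illustrative only, not the actual implementation; the <tt>Rules</tt> class is hypothetical):

```python
class Rules:
    # A few rules in the same style as the lexer classes above.
    t_PLUS  = r'\+'
    t_MINUS = r'-'

    def t_NUMBER(self, t):
        r'\d+'
        return t

def collect_rule_names(obj):
    # Gather every attribute name beginning with 't_', the way lex()
    # locates candidate token rules in a module or object.
    return sorted(name for name in dir(obj) if name.startswith('t_'))

print(collect_rule_names(Rules()))   # prints ['t_MINUS', 't_NUMBER', 't_PLUS']
```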
<H3><a name="ply_nn18"></a>3.15 Maintaining state</H3>

In your lexer, you may want to maintain a variety of state information.  This might include mode settings, symbol tables, and other details.  There are a few
different ways to handle this situation.  First, you could just keep some global variables:

<blockquote>
<pre>
num_count = 0
def t_NUMBER(t):
    r'\d+'
    global num_count
    num_count += 1
    try:
         t.value = int(t.value)    
    except ValueError:
         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
         t.value = 0
    return t
</pre>
</blockquote>

Alternatively, you can store this information inside the Lexer object created by <tt>lex()</tt>.  To do this, you can use the <tt>lexer</tt> attribute
of tokens passed to the various rules. For example:

<blockquote>
<pre>
def t_NUMBER(t):
    r'\d+'
    t.lexer.num_count += 1     # Note use of lexer attribute
    try:
         t.value = int(t.value)    
    except ValueError:
         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
         t.value = 0
    return t

lexer = lex.lex()
lexer.num_count = 0            # Set the initial count
</pre>
</blockquote>

This latter approach has the advantage of storing information inside
the lexer itself---something that may be useful if multiple instances
of the same lexer have been created.  However, it may also feel kind
of "hacky" to the purists.  Just to put their minds at ease, all
internal attributes of the lexer (with the exception of <tt>lineno</tt>) have names that are prefixed
by <tt>lex</tt> (e.g., <tt>lexdata</tt>, <tt>lexpos</tt>, etc.).  Thus,
it should be perfectly safe to store attributes in the lexer that
don't have names starting with that prefix.

<p>
A third approach is to define the lexer as a class as shown in the previous example:

<blockquote>
<pre>
class MyLexer:
    ...
    def t_NUMBER(self,t):
        r'\d+'
        self.num_count += 1
        try:
             t.value = int(t.value)    
        except ValueError:
             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
             t.value = 0
        return t

    def build(self, **kwargs):
        self.lexer = lex.lex(object=self,**kwargs)

    def __init__(self):
        self.num_count = 0

# Create a lexer
m = MyLexer()
lexer = lex.lex(object=m)
</pre>
</blockquote>

The class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and
you need to manage a lot of state.

<H3><a name="ply_nn19"></a>3.16 Duplicating lexers</H3>

<b>NOTE: I am thinking about deprecating this feature.  Post comments on <a href="http://groups.google.com/group/ply-hack">ply-hack@googlegroups.com</a> or send me a private email at dave@dabeaz.com.</b>

<p>
If necessary, a lexer object can be quickly duplicated by invoking its <tt>clone()</tt> method.  For example:

<blockquote>
<pre>
lexer = lex.lex()
...
newlexer = lexer.clone()
</pre>
</blockquote>

When a lexer is cloned, the copy is identical to the original lexer,
including any input text.
However, once created, different text can be
fed to the clone, which can then be used independently.  This capability may
be useful in situations when you are writing a parser/compiler that
involves recursive or reentrant processing.  For instance, if you
needed to scan ahead in the input for some reason, you could create a
clone and use it to look ahead.

<p>
The advantage of using <tt>clone()</tt> instead of reinvoking <tt>lex()</tt> is
that it is significantly faster.  Namely, it is not necessary to re-examine all of the
token rules, build a regular expression, and construct internal tables.  All of this
information can simply be reused in the new lexer.

<p>
Special considerations need to be made when cloning a lexer that is defined as a class.  Previous sections
showed an example of a class <tt>MyLexer</tt>.  If you have the following code:

<blockquote>
<pre>
m = MyLexer()
a = lex.lex(object=m)      # Create a lexer

b = a.clone()              # Clone the lexer
</pre>
</blockquote>

Then both <tt>a</tt> and <tt>b</tt> are going to be bound to the same
object <tt>m</tt>.  If the object <tt>m</tt> contains internal state
related to lexing, this sharing may lead to quite a bit of confusion.
To fix this,
the <tt>clone()</tt> method accepts an optional argument that can be used to supply a new object.  This
can be used to clone the lexer and bind it to a new instance.  For example:

<blockquote>
<pre>
m = MyLexer()              # Create a lexer
a = lex.lex(object=m)

# Create a clone 
n = MyLexer()              # New instance of MyLexer
b = a.clone(n)             # New lexer bound to n
</pre>
</blockquote>

It may make sense to encapsulate all of this inside a method:

<blockquote>
<pre>
class MyLexer:
     ...
     def clone(self):
         c = MyLexer()            # Create a new instance of myself
         # Copy attributes from self to c as appropriate
         ...
         # Clone the lexer
         c.lexer = self.lexer.clone(c)
         return c
</pre>
</blockquote>

The fact that a new instance of <tt>MyLexer</tt> may be created while cloning a lexer is the reason why you should never
invoke <tt>lex.lex()</tt> inside <tt>__init__()</tt>.  If you do, the lexer will be rebuilt from scratch and you lose
all of the performance benefits of using <tt>clone()</tt> in the first place.
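<p>
The reason <tt>clone()</tt> is cheap is that the copy shares the expensive, already-built machinery (the compiled master regular expression and its tables) while receiving its own input text and position. A toy model of that idea (<tt>TinyLexer</tt> is a hypothetical stand-in, not part of PLY):

```python
import copy
import re

class TinyLexer:
    def __init__(self):
        self.master = re.compile(r'\d+')   # the "expensive" shared part
        self.lexdata = ''
        self.lexpos = 0

    def input(self, data):
        self.lexdata, self.lexpos = data, 0

    def clone(self):
        # A shallow copy shares self.master, but the copy can then be
        # fed its own input without disturbing the original lexer.
        return copy.copy(self)

a = TinyLexer()
a.input("123 456")
b = a.clone()
b.input("789")
print(a.lexdata, b.lexdata)   # prints: 123 456 789
print(a.master is b.master)   # prints: True
```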
<H3><a name="ply_nn20"></a>3.17 Internal lexer state</H3>

A Lexer object <tt>lexer</tt> has a number of internal attributes that may be useful in certain
situations.

<p>
<tt>lexer.lexpos</tt>
<blockquote>
This attribute is an integer that contains the current position within the input text.  If you modify
the value, it will change the result of the next call to <tt>token()</tt>.  Within token rule functions, this points
to the first character <em>after</em> the matched text.  If the value is modified within a rule, the next returned token will be
matched at the new position.
</blockquote>

<p>
<tt>lexer.lineno</tt>
<blockquote>
The current value of the line number attribute stored in the lexer.  This can be modified as needed to
change the line number.
</blockquote>

<p>
<tt>lexer.lexdata</tt>
<blockquote>
The current input text stored in the lexer.  This is the string passed with the <tt>input()</tt> method.  It
would probably be a bad idea to modify this unless you really know what you're doing.
</blockquote>

<P>
<tt>lexer.lexmatch</tt>
<blockquote>
This is the raw <tt>Match</tt> object returned by the Python <tt>re.match()</tt> function (used internally by PLY) for the
current token.  If you have written a regular expression that contains named groups, you can use this to retrieve those values.
</blockquote>

<H3><a name="ply_nn21"></a>3.18 Conditional lexing and start conditions</H3>

In advanced parsing applications, it may be useful to have different
lexing states.  For instance, you may want the occurrence of a certain
token or syntactic construct to trigger a different kind of lexing.
PLY supports a feature that allows the underlying lexer to be put into
a series of different states.  Each state can have its own tokens,
lexing rules, and so forth.  The implementation is based largely on
the "start condition" feature of GNU flex.  Details of this can be found
at <a
href="http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html">http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html</a>.

<p>
To define a new lexing state, it must first be declared.  This is done by including a "states" declaration in your
lex file.
For example:

<blockquote>
<pre>
states = (
   ('foo','exclusive'),
   ('bar','inclusive'),
)
</pre>
</blockquote>

This declares two states, <tt>'foo'</tt>
and <tt>'bar'</tt>.  States may be of two types: <tt>'exclusive'</tt>
and <tt>'inclusive'</tt>.  An exclusive state completely overrides the
default behavior of the lexer.  That is, lex will only return tokens
and apply rules defined specifically for that state.  An inclusive
state adds additional tokens and rules to the default set of rules.
Thus, lex will return both the tokens defined by default and
those defined for the inclusive state.

<p>
Once a state has been declared, tokens and rules are declared by including the
state name in the token/rule declaration.
For example:

<blockquote>
<pre>
t_foo_NUMBER = r'\d+'                      # Token 'NUMBER' in state 'foo'
t_bar_ID     = r'[a-zA-Z_][a-zA-Z0-9_]*'   # Token 'ID' in state 'bar'

def t_foo_newline(t):
    r'\n'
    t.lexer.lineno += 1
</pre>
</blockquote>

A token can be declared in multiple states by including multiple state names in the declaration. For example:

<blockquote>
<pre>
t_foo_bar_NUMBER = r'\d+'         # Defines token 'NUMBER' in both state 'foo' and 'bar'
</pre>
</blockquote>

Alternatively, a token can be declared in all states by using 'ANY' in the name.

<blockquote>
<pre>
t_ANY_NUMBER = r'\d+'          # Defines a token 'NUMBER' in all states
</pre>
</blockquote>

If no state name is supplied, as is normally the case, the token is associated with a special state <tt>'INITIAL'</tt>.
For example,
these two declarations are identical:

<blockquote>
<pre>
t_NUMBER = r'\d+'
t_INITIAL_NUMBER = r'\d+'
</pre>
</blockquote>

<p>
States are also associated with the special <tt>t_ignore</tt> and <tt>t_error()</tt> declarations.  For example, if a state treats
these differently, you can declare:

<blockquote>
<pre>
t_foo_ignore = " \t\n"       # Ignored characters for state 'foo'

def t_bar_error(t):          # Special error handler for state 'bar'
    pass
</pre>
</blockquote>

By default, lexing operates in the <tt>'INITIAL'</tt> state.  This state includes all of the normally defined tokens.
For users who aren't using different states, this fact is completely transparent.  If, during lexing or parsing, you want to change
the lexing state, use the <tt>begin()</tt> method.  For example:

<blockquote>
<pre>
def t_begin_foo(t):
    r'start_foo'
    t.lexer.begin('foo')             # Starts 'foo' state
</pre>
</blockquote>

To get out of a state, you use <tt>begin()</tt> to switch back to the initial state.
For example:

<blockquote>
<pre>
def t_foo_end(t):
    r'end_foo'
    t.lexer.begin('INITIAL')        # Back to the initial state
</pre>
</blockquote>

The management of states can also be done with a stack.  For example:

<blockquote>
<pre>
def t_begin_foo(t):
    r'start_foo'
    t.lexer.push_state('foo')        # Starts 'foo' state

def t_foo_end(t):
    r'end_foo'
    t.lexer.pop_state()              # Back to the previous state
</pre>
</blockquote>

<p>
A stack is useful in situations where there are many ways of entering a new lexing state and you merely want to go back
to the previous state afterwards.

<P>
An example might help clarify.  Suppose you were writing a parser and you wanted to grab sections of arbitrary C code enclosed by
curly braces.  That is, whenever you encounter a starting brace '{', you want to read all of the enclosed code up to the ending brace '}'
and return it as a string.  Doing this with a normal regular expression rule is nearly (if not actually) impossible.  This is because braces can
be nested and can be included in comments and strings.
Thus, simply matching up to the first matching '}' character isn't good enough.  Here is how
you might use lexer states to do this:

<blockquote>
<pre>
# Declare the state
states = (
  ('ccode','exclusive'),
)

# Match the first {. Enter ccode state.
def t_ccode(t):
    r'\{'
    t.lexer.code_start = t.lexer.lexpos        # Record the starting position
    t.lexer.level = 1                          # Initial brace level
    t.lexer.begin('ccode')                     # Enter 'ccode' state

# Rules for the ccode state
def t_ccode_lbrace(t):
    r'\{'
    t.lexer.level +=1

def t_ccode_rbrace(t):
    r'\}'
    t.lexer.level -=1

    # If closing brace, return the code fragment
    if t.lexer.level == 0:
         t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
         t.type = "CCODE"
         t.lexer.lineno += t.value.count('\n')
         t.lexer.begin('INITIAL')
         return t

# C or C++ comment (ignore)
def t_ccode_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass

# C string
def t_ccode_string(t):
    r'\"([^\\\n]|(\\.))*?\"'

# C character literal
def t_ccode_char(t):
    r'\'([^\\\n]|(\\.))*?\''

# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
    r'[^\s\{\}\'\"]+'

# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"

# For bad characters, we just skip over them
def t_ccode_error(t):
    t.lexer.skip(1)
</pre>
</blockquote>

In this example, the occurrence of the first '{' causes the lexer to record the starting position and enter a new state <tt>'ccode'</tt>.  A collection of rules then match
various parts of the input that follow (comments, strings, etc.).  All of these rules merely discard the token (by not returning a value).
However, if the closing right brace is encountered, the rule <tt>t_ccode_rbrace</tt> collects all of the code (using the earlier recorded starting
position), stores it, and returns a token 'CCODE' containing all of that text.  When returning the token, the lexing state is restored back to its
initial state.

<H3><a name="ply_nn21"></a>3.19 Miscellaneous Issues</H3>

<P>
<li>The lexer requires input to be supplied as a single input string.
Since most machines have more than enough memory, this
rarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
such as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.

<p>
<li>The lexer should work properly with Unicode strings, both in the token and pattern matching rules and in the input text.

<p>
<li>If you need to supply optional flags to the <tt>re.compile()</tt> function, use the <tt>reflags</tt> option to lex.  For example:

<blockquote>
<pre>
lex.lex(reflags=re.UNICODE)
</pre>
</blockquote>

<p>
<li>Since the lexer is written entirely in Python, its performance is
largely determined by that of the Python <tt>re</tt> module.  Although
the lexer has been written to be as efficient as possible, it's not
blazingly fast when used on very large input files.  If
performance is a concern, you might consider upgrading to the most
recent version of Python, creating a hand-written lexer, or offloading
the lexer into a C extension module.
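<p>
One of those options, a hand-written lexer, needs surprisingly little to cooperate with <tt>yacc.py</tt>: a <tt>token()</tt> method returning objects with <tt>type</tt> and <tt>value</tt> attributes, as the requirements below spell out.  The following is a minimal sketch of that interface (the class and token names here are illustrative, not part of PLY):

```python
import re

class Token(object):
    """Carries the two attributes yacc.py requires of a token."""
    def __init__(self, type, value):
        self.type = type
        self.value = value

class CalcLexer(object):
    """A tiny hand-written lexer exposing the token() protocol."""
    def __init__(self, data):
        # Numbers and single-character arithmetic operators; whitespace skipped
        self._tokens = iter(re.findall(r"\d+|[-+*/()]", data))
    def token(self):
        text = next(self._tokens, None)
        if text is None:
            return None                      # no more tokens
        if text.isdigit():
            return Token("NUMBER", int(text))
        return Token(text, text)             # operator/punctuation token

lexer = CalcLexer("3 + 5 * 2")
types = []
tok = lexer.token()
while tok:
    types.append(tok.type)
    tok = lexer.token()
# types is now ["NUMBER", "+", "NUMBER", "*", "NUMBER"]
```

A real grammar would still need a <tt>tokens</tt> list naming <tt>NUMBER</tt> and friends, exactly as with a lex-generated lexer.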
<p>
If you are going to create a hand-written lexer and you plan to use it with <tt>yacc.py</tt>,
it only needs to conform to the following requirements:

<ul>
<li>It must provide a <tt>token()</tt> method that returns the next token or <tt>None</tt> if no more
tokens are available.
<li>The <tt>token()</tt> method must return an object <tt>tok</tt> that has <tt>type</tt> and <tt>value</tt> attributes.
</ul>

<H2><a name="ply_nn22"></a>4. Parsing basics</H2>

<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
example, there are a few important bits of background that must be
mentioned.  First, <em>syntax</em> is usually specified in terms of a BNF grammar.
For example, if you wanted to parse
simple arithmetic expressions, you might first write an unambiguous
grammar specification like this:

<blockquote>
<pre>
expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )
</pre>
</blockquote>

In the grammar, symbols such as <tt>NUMBER</tt>, <tt>+</tt>, <tt>-</tt>, <tt>*</tt>, and <tt>/</tt> are known
as <em>terminals</em> and correspond to raw input tokens.  Identifiers such as <tt>term</tt> and <tt>factor</tt> refer to more
complex rules, typically comprised of a collection of tokens.  These identifiers are known as <em>non-terminals</em>.
<P>
The semantic behavior of a language is often specified using a
technique known as syntax directed translation.  In syntax directed
translation, attributes are attached to each symbol in a given grammar
rule along with an action.  Whenever a particular grammar rule is
recognized, the action describes what to do.
For example, given the
expression grammar above, you might write the specification for a
simple calculator like this:

<blockquote>
<pre>
Grammar                             Action
--------------------------------    --------------------------------------------
expression0 : expression1 + term    expression0.val = expression1.val + term.val
            | expression1 - term    expression0.val = expression1.val - term.val
            | term                  expression0.val = term.val

term0       : term1 * factor        term0.val = term1.val * factor.val
            | term1 / factor        term0.val = term1.val / factor.val
            | factor                term0.val = factor.val

factor      : NUMBER                factor.val = int(NUMBER.lexval)
            | ( expression )        factor.val = expression.val
</pre>
</blockquote>

A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some
kind of object.  The semantics of the language are then expressed as a collection of methods/operations on these
objects.

<p>
Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
bottom up technique that tries to recognize the right-hand-side of various grammar rules.
Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
grammar symbols are replaced by the grammar symbol on the left-hand-side.

<p>
LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
input token for patterns.  The details of the algorithm can be found in a compiler text, but the
following example illustrates the steps that are performed if you wanted to parse the expression
<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:

<blockquote>
<pre>
Step Symbol Stack                 Input Tokens            Action
---- ---------------------------  ---------------------   -------------------------------
1    $                            3 + 5 * ( 10 - 20 )$    Shift 3
2    $ 3                          + 5 * ( 10 - 20 )$      Reduce factor : NUMBER
3    $ factor                     + 5 * ( 10 - 20 )$      Reduce term : factor
4    $ term                       + 5 * ( 10 - 20 )$      Reduce expr : term
5    $ expr                       + 5 * ( 10 - 20 )$      Shift +
6    $ expr +                     5 * ( 10 - 20 )$        Shift 5
7    $ expr + 5                   * ( 10 - 20 )$          Reduce factor : NUMBER
8    $ expr + factor              * ( 10 - 20 )$          Reduce term : factor
9    $ expr + term                * ( 10 - 20 )$          Shift *
10   $ expr + term *              ( 10 - 20 )$            Shift (
11   $ expr + term * (            10 - 20 )$              Shift 10
12   $ expr + term * ( 10         - 20 )$                 Reduce factor : NUMBER
13   $ expr + term * ( factor     - 20 )$                 Reduce term : factor
14   $ expr + term * ( term       - 20 )$                 Reduce expr : term
15   $ expr + term * ( expr       - 20 )$                 Shift -
16   $ expr + term * ( expr -     20 )$                   Shift 20
17   $ expr + term * ( expr - 20  )$                      Reduce factor : NUMBER
18   $ expr + term * ( expr - factor )$                   Reduce term : factor
19   $ expr + term * ( expr - term )$                     Reduce expr : expr - term
20   $ expr + term * ( expr       )$                      Shift )
21   $ expr + term * ( expr )     $                       Reduce factor : (expr)
22   $ expr + term * factor       $                       Reduce term : term * factor
23   $ expr + term                $                       Reduce expr : expr + term
24   $ expr                       $                       Reduce expr
25   $                            $                       Success!
</pre>
</blockquote>

When parsing the expression, an underlying state machine and the current input token determine what to do next.
If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
onto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
usually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
appropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).
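<p>
The shift/reduce cycle traced above can be sketched in plain Python.  The toy evaluator below is purely illustrative (it is not how <tt>yacc.py</tt> works internally; the real parser is driven by generated LALR tables).  Its hand-written lookahead checks encode, for instance, why the parser shifts <tt>*</tt> in step 9 rather than reducing <tt>expr : expr + term</tt>:

```python
def evaluate(tokens):
    """Evaluate an arithmetic token list with an explicit shift-reduce loop."""
    toks = list(tokens) + ["$"]      # "$" marks the end of the input
    stack = []                       # parse stack of (symbol, value) pairs
    i = 0
    while True:
        la = toks[i]                 # one token of lookahead
        syms = [s for s, v in stack]
        if syms[-1:] == ["NUM"]:                              # factor : NUMBER
            stack[-1] = ("factor", stack[-1][1])
        elif syms[-3:] == ["(", "expr", ")"]:                 # factor : ( expr )
            val = stack[-2][1]
            del stack[-3:]
            stack.append(("factor", val))
        elif syms[-3:] in (["term", "*", "factor"],
                           ["term", "/", "factor"]):          # term : term */ factor
            a, op, b = stack[-3][1], stack[-2][0], stack[-1][1]
            del stack[-3:]
            stack.append(("term", a * b if op == "*" else a / b))
        elif syms[-1:] == ["factor"]:                         # term : factor
            stack[-1] = ("term", stack[-1][1])
        elif syms[-3:] in (["expr", "+", "term"],
                           ["expr", "-", "term"]) \
                and la not in ("*", "/"):                     # expr : expr +- term
            a, op, b = stack[-3][1], stack[-2][0], stack[-1][1]
            del stack[-3:]
            stack.append(("expr", a + b if op == "+" else a - b))
        elif syms[-1:] == ["term"] and la not in ("*", "/") \
                and syms[-2:-1] not in (["+"], ["-"]):        # expr : term
            stack[-1] = ("expr", stack[-1][1])
        elif la != "$":                                       # no reduction: shift
            stack.append(("NUM" if isinstance(la, int) else la, la))
            i += 1
        else:
            break                    # input exhausted and nothing left to reduce
    if [s for s, v in stack] != ["expr"]:
        raise SyntaxError("parse error")
    return stack[0][1]

# evaluate([3, "+", 5, "*", "(", 10, "-", 20, ")"]) returns -47
```

The two <tt>la not in ("*", "/")</tt> tests are the hand-coded stand-ins for what the LALR tables decide automatically: a pending <tt>+</tt>/<tt>-</tt> reduction waits whenever a higher-precedence operator is next in the input.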
<p>
It is important to note that the underlying implementation is built around a large finite-state machine that is encoded
in a collection of tables.  The construction of these tables is quite complicated and beyond the scope of this discussion.
However, subtle details of this process explain why, in the example above, the parser chooses to shift a token
onto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.

<H2><a name="ply_nn23"></a>5. Yacc reference</H2>

This section describes how to write parsers in PLY.

<H3><a name="ply_nn24"></a>5.1 An example</H3>

Suppose you wanted to make a grammar for simple arithmetic expressions as previously described.  Here is
how you would do it with <tt>yacc.py</tt>:

<blockquote>
<pre>
# Yacc example

import ply.yacc as yacc

# Get the token map from the lexer.  This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print "Syntax error in input!"
# Build the parser
yacc.yacc()

# Use this if you want to build the parser using SLR instead of LALR
# yacc.yacc(method="SLR")

while 1:
   try:
       s = raw_input('calc > ')
   except EOFError:
       break
   if not s: continue
   result = yacc.parse(s)
   print result
</pre>
</blockquote>

In this example, each grammar rule is defined by a Python function where the docstring to that function contains the
appropriate context-free grammar specification.  Each function accepts a single
argument <tt>p</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.
The values of
<tt>p[i]</tt> are mapped to grammar symbols as shown here:

<blockquote>
<pre>
def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]

    p[0] = p[1] + p[3]
</pre>
</blockquote>

For tokens, the "value" of the corresponding <tt>p[i]</tt> is the
<em>same</em> as the <tt>p.value</tt> attribute assigned
in the lexer module.  For non-terminals, the value is determined by
whatever is placed in <tt>p[0]</tt> when rules are reduced.  This
value can be anything at all.  However, it is probably most common for
the value to be a simple Python type, a tuple, or an instance.  In this example, we
are relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
field.  All of the other rules simply perform various types of integer operations and store
the result.

<P>
Note: Negative indices have a special meaning in yacc.  In particular, <tt>p[-1]</tt> does
not have the same value as <tt>p[3]</tt> in this example.  Please see the section on "Embedded Actions" for further
details.
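<p>
Since <tt>p[0]</tt> can hold any Python object, a common alternative to evaluating on the fly is to build a parse tree.  The sketch below (an illustration, not part of the calculator above; only a few of the rules are shown, and the rest would follow the same pattern) uses tuples as tree nodes:

```python
# Grammar actions that build a tuple-based parse tree instead of a number.
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = ('+', p[1], p[3])        # node: (operator, left subtree, right subtree)

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]                     # pass the subtree through unchanged

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = ('num', p[1])            # leaf node wrapping the token value
```

In real use these functions receive yacc's own grammar-symbol sequence as <tt>p</tt>; the bodies above only show the shape of the actions.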
<p>
The first rule defined in the yacc specification determines the starting grammar
symbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
the starting rule is reduced by the parser and no more input is available, parsing
stops and the final value is returned (this value will be whatever the top-most rule
placed in <tt>p[0]</tt>).  Note: an alternative starting symbol can be specified using the <tt>start</tt> keyword argument to
<tt>yacc()</tt>.

<p>The <tt>p_error(p)</tt> rule is defined to catch syntax errors.  See the error handling section
below for more detail.

<p>
To build the parser, call the <tt>yacc.yacc()</tt> function.  This function
looks at the module and attempts to construct all of the LR parsing tables for the grammar
you have specified.  The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
such as this:

<blockquote>
<pre>
$ python calcparse.py
yacc: Generating LALR parsing table...
calc >
</pre>
</blockquote>

Since table construction is relatively expensive (especially for large
grammars), the resulting parsing table is written to the current
directory in a file called <tt>parsetab.py</tt>.  In addition, a
debugging file called <tt>parser.out</tt> is created.  On subsequent
executions, <tt>yacc</tt> will reload the table from
<tt>parsetab.py</tt> unless it has detected a change in the underlying
grammar (in which case the tables and <tt>parsetab.py</tt> file are
regenerated).  Note: The names of parser output files can be changed if necessary.  See the notes that follow later.

<p>
If any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
diagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:

<ul>
<li>Duplicated function names (if more than one rule function has the same name in the grammar file).
<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
<li>Badly specified grammar rules.
<li>Infinite recursion (rules that can never terminate).
<li>Unused rules and tokens.
<li>Undefined rules and tokens.
</ul>

The next few sections discuss a few finer points of grammar construction.

<H3><a name="ply_nn25"></a>5.2 Combining Grammar Rule Functions</H3>

When grammar rules are similar, they can be combined into a single function.
For example, consider the two rules in our earlier example:

<blockquote>
<pre>
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]
</pre>
</blockquote>

Instead of writing two functions, you might write a single function like this:

<blockquote>
<pre>
def p_expression(p):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
</pre>
</blockquote>

In general,
the doc string for any given function can contain multiple grammar rules.  So, it would
have also been legal (although possibly confusing) to write this:

<blockquote>
<pre>
def p_binary_operators(p):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
</pre>
</blockquote>

When combining grammar rules into a single function, it is usually a good idea for all of the rules to have
a similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more
complicated than necessary.  However, it is possible to handle simple cases using <tt>len()</tt>.
For example:

<blockquote>
<pre>
def p_expressions(p):
    '''expression : expression MINUS expression
                  | MINUS expression'''
    if len(p) == 4:
        p[0] = p[1] - p[3]
    elif len(p) == 3:
        p[0] = -p[2]
</pre>
</blockquote>

<H3><a name="ply_nn26"></a>5.3 Character Literals</H3>


If desired, a grammar may contain tokens defined as single character literals.  For example:

<blockquote>
<pre>
def p_binary_operators(p):
    '''expression : expression '+' term
                  | expression '-' term
       term       : term '*' factor
                  | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
</pre>
</blockquote>

A character literal must be enclosed in quotes such as <tt>'+'</tt>.  In addition, if literals are used, they must be declared in the
corresponding <tt>lex</tt> file through the use of a special <tt>literals</tt> declaration.
<blockquote>
<pre>
# Literals.  Should be placed in module given to lex()
literals = ['+', '-', '*', '/']
</pre>
</blockquote>

<b>Character literals are limited to a single character</b>.  Thus, it is not legal to specify literals such as <tt>'&lt;='</tt> or <tt>'=='</tt>.  For this, use
the normal lexing rules (e.g., define a rule such as <tt>t_EQ = r'=='</tt>).

<H3><a name="ply_nn26"></a>5.4 Empty Productions</H3>


<tt>yacc.py</tt> can handle empty productions by defining a rule like this:

<blockquote>
<pre>
def p_empty(p):
    'empty :'
    pass
</pre>
</blockquote>

Now to use the empty production, simply use 'empty' as a symbol.  For example:

<blockquote>
<pre>
def p_optitem(p):
    '''optitem : item
               | empty'''
    ...
</pre>
</blockquote>

Note: You can write empty rules anywhere by simply specifying an empty right-hand side.  However, I personally find that
writing an "empty" rule and using "empty" to denote an empty production is easier to read.
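<p>
Since the <tt>p_*</tt> actions are ordinary Python functions, their behavior can be checked by hand.  Below is a minimal sketch of what optional-item productions compute, with the empty alternative yielding <tt>None</tt>.  The <tt>p_optitem_item</tt>/<tt>p_optitem_empty</tt> names and the hand-built <tt>p</tt> lists are illustrative only; in a real grammar, <tt>yacc.py</tt> invokes the actions with a special production object, not a plain list.

```python
# Hypothetical semantic actions for an optional item (sketch only;
# in real PLY, yacc.py calls these with a YaccProduction object).
def p_empty(p):
    'empty :'
    pass                      # an empty production produces no value

def p_optitem_item(p):
    'optitem : item'
    p[0] = p[1]               # pass the item's value up

def p_optitem_empty(p):
    'optitem : empty'
    p[0] = None               # nothing was present

# Simulate the parser invoking an action: slot 0 receives the result,
# slots 1..n hold the values of the right-hand-side symbols.
p = [None, 'widget']
p_optitem_item(p)
assert p[0] == 'widget'

p = [None, None]              # 'empty' contributed no value
p_optitem_empty(p)
assert p[0] is None
```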
<H3><a name="ply_nn28"></a>5.5 Changing the starting symbol</H3>


Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule).  To change this, simply
supply a <tt>start</tt> specifier in your file.  For example:

<blockquote>
<pre>
start = 'foo'

def p_bar(p):
    'bar : A B'

# This is the starting rule due to the start specifier above
def p_foo(p):
    'foo : bar X'
...
</pre>
</blockquote>

The use of a <tt>start</tt> specifier may be useful during debugging since you can use it to have yacc build a subset of
a larger grammar.  For this purpose, it is also possible to specify a starting symbol as an argument to <tt>yacc()</tt>.  For example:

<blockquote>
<pre>
yacc.yacc(start='foo')
</pre>
</blockquote>

<H3><a name="ply_nn27"></a>5.6 Dealing With Ambiguous Grammars</H3>


The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity.
However, in many situations, it is extremely difficult or awkward to write grammars in this format.
A
much more natural way to express the grammar is in a more compact form like this:

<blockquote>
<pre>
expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

Unfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?

<p>
When an ambiguous grammar is given to <tt>yacc.py</tt> it will print messages about "shift/reduce conflicts"
or "reduce/reduce conflicts".  A shift/reduce conflict is caused when the parser generator can't decide
whether to reduce a rule or to shift a symbol on the parsing stack.
For example, consider
the string "3 * 4 + 5" and the internal parsing stack:

<blockquote>
<pre>
Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                                3 * 4 + 5$    Shift 3
2    $ 3                                * 4 + 5$    Reduce : expression : NUMBER
3    $ expr                             * 4 + 5$    Shift *
4    $ expr *                             4 + 5$    Shift 4
5    $ expr * 4                             + 5$    Reduce: expression : NUMBER
6    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
</pre>
</blockquote>

In this case, when the parser reaches step 6, it has two options.  One is to reduce the
rule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
token <tt>+</tt> on the stack.  Both options are perfectly legal from the rules
of the context-free grammar.

<p>
By default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
example, the parser will always shift the <tt>+</tt> instead of reducing.  Although this
strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
expressions.
In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
reduced <tt>expr * expr</tt> since multiplication has higher mathematical precedence than addition.

<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
tokens to be assigned a precedence level and associativity.  This is done by adding a variable
<tt>precedence</tt> to the grammar file like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
</pre>
</blockquote>

This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
the same precedence level and are left-associative and that
<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative.
Within the <tt>precedence</tt> declaration, tokens are ordered from lowest to highest precedence.  Thus,
this declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
precedence specification).

<p>
The precedence specification works by associating a numerical precedence level value and associativity direction to
the listed tokens.
For example, with the above declaration you get:

<blockquote>
<pre>
PLUS   : level = 1, assoc = 'left'
MINUS  : level = 1, assoc = 'left'
TIMES  : level = 2, assoc = 'left'
DIVIDE : level = 2, assoc = 'left'
</pre>
</blockquote>

These values are then used to attach a numerical precedence value and associativity direction
to each grammar rule.  <em>This is always determined by looking at the precedence of the right-most terminal symbol.</em>
For example:

<blockquote>
<pre>
expression : expression PLUS expression            # level = 1, left
           | expression MINUS expression           # level = 1, left
           | expression TIMES expression           # level = 2, left
           | expression DIVIDE expression          # level = 2, left
           | LPAREN expression RPAREN              # level = None (not specified)
           | NUMBER                                # level = None (not specified)
</pre>
</blockquote>

When shift/reduce conflicts are encountered, the parser generator resolves the conflict by
looking at the precedence rules and associativity specifiers.

<p>
<ol>
<li>If the current token has higher precedence, it is shifted.
<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
<li>If the current token and the grammar rule have the same precedence, the
rule is reduced for left associativity, whereas the token is shifted for right associativity.
<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
favor of shifting (the default).
</ol>

For example, if "expression PLUS expression" has been parsed and the next token
is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS".  On the other
hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action
is going to be reduce because "PLUS" has a lower precedence than "TIMES."

<p>
When shift/reduce conflicts are resolved using the first three techniques (with the help of
precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar.

<p>
One problem with the precedence specifier technique is that it is sometimes necessary to
change the precedence of an operator in certain contexts.  For example, consider a unary-minus operator
in "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
However, in our precedence specifier, MINUS has a lower precedence than TIMES.
To deal with this,
precedence rules can be given for fictitious tokens like this:

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

Now, in the grammar file, we can write our unary minus rule like this:

<blockquote>
<pre>
def p_expr_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]
</pre>
</blockquote>

In this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
of UMINUS in the precedence specifier.

<p>
At first, the use of UMINUS in this example may appear very confusing.
UMINUS is not an input token or a grammar rule.  Instead, you should
think of it as the name of a special marker in the precedence table.  When you use the <tt>%prec</tt> qualifier, you're simply
telling yacc that you want the precedence of the expression to be the same as for this special marker instead of the usual precedence.
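<p>
To see what these precedence declarations mean operationally, here is a small hand-rolled precedence-climbing evaluator.  This is an illustrative sketch, not PLY code (the <tt>LEVEL</tt> table, <tt>parse</tt>, and <tt>parse_unary</tt> names are inventions for this example); it mirrors the table above, with unary minus binding tighter than <tt>*</tt> and <tt>/</tt>, just as UMINUS does:

```python
import re

# +,- at level 1 (left-assoc); *,/ at level 2 (left-assoc).
LEVEL = {'+': 1, '-': 1, '*': 2, '/': 2}

def tokenize(text):
    return re.findall(r'\d+|[+\-*/()]', text)

def parse(tokens, min_level=1):
    left = parse_unary(tokens)
    while tokens and tokens[0] in LEVEL and LEVEL[tokens[0]] >= min_level:
        op = tokens.pop(0)
        # Left-associative: the right operand must bind strictly tighter.
        right = parse(tokens, LEVEL[op] + 1)
        left = {'+': left + right, '-': left - right,
                '*': left * right, '/': left / right}[op]
    return left

def parse_unary(tokens):
    tok = tokens.pop(0)
    if tok == '-':                 # unary minus: above every binary level
        return -parse_unary(tokens)
    if tok == '(':
        value = parse(tokens)
        tokens.pop(0)              # discard ')'
        return value
    return int(tok)

assert parse(tokenize('3 + 4 * -5')) == -17   # - binds tighter than *
assert parse(tokenize('3 * 4 + 5')) == 17     # * binds tighter than +
```

PLY's generated LALR tables encode the same decisions; the point of the sketch is only that a precedence level and an associativity per operator are enough to resolve every shift/reduce choice in this grammar.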
<p>
It is also possible to specify non-associativity in the <tt>precedence</tt> table.  This would
be used when you <em>don't</em> want operations to chain together.  For example, suppose
you wanted to support comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
combinations like <tt>a &lt; b &lt; c</tt>.  To do this, simply specify a rule like this:

<blockquote>
<pre>
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)
</pre>
</blockquote>

<p>
If you do this, the occurrence of input text such as <tt>a &lt; b &lt; c</tt> will result in a syntax error.  However, simple
expressions such as <tt>a &lt; b</tt> will still be fine.

<p>
Reduce/reduce conflicts are caused when there are multiple grammar
rules that can be applied to a given set of symbols.  This kind of
conflict is almost always bad and is always resolved by picking the
rule that appears first in the grammar file.
Reduce/reduce conflicts
are almost always caused when different sets of grammar rules somehow
generate the same set of symbols.  For example:

<blockquote>
<pre>
assignment : ID EQUALS NUMBER
           | ID EQUALS expression

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER
</pre>
</blockquote>

In this case, a reduce/reduce conflict exists between these two rules:

<blockquote>
<pre>
assignment  : ID EQUALS NUMBER
expression  : NUMBER
</pre>
</blockquote>

For example, if you wrote "a = 5", the parser can't figure out if this
is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
whether it's supposed to reduce the 5 as an expression and then reduce
the rule <tt>assignment : ID EQUALS expression</tt>.

<p>
It should be noted that reduce/reduce conflicts are notoriously difficult to spot
simply by looking at the input grammar.
To locate these, it is usually easier to look at the
<tt>parser.out</tt> debugging file with an appropriately high level of caffeination.

<H3><a name="ply_nn28"></a>5.7 The parser.out file</H3>


Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
parsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
'parser.out' when it generates the parsing table.  The contents of this file look like the following:

<blockquote>
<pre>
Unused terminals:


Grammar

Rule 1     expression -> expression PLUS expression
Rule 2     expression -> expression MINUS expression
Rule 3     expression -> expression TIMES expression
Rule 4     expression -> expression DIVIDE expression
Rule 5     expression -> NUMBER
Rule 6     expression -> LPAREN expression RPAREN

Terminals, with rules where they appear

TIMES                : 3
error                : 
MINUS                : 2
RPAREN               : 6
LPAREN               : 6
DIVIDE               : 4
PLUS                 : 1
NUMBER               : 5

Nonterminals, with rules where they appear

expression           : 1 1 2 2 3 3 4 4 6 0


Parsing method: LALR


state 0

    S' -> . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 1

    S' -> expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 2

    expression -> LPAREN . expression RPAREN
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 3

    expression -> NUMBER .

    $               reduce using rule 5
    PLUS            reduce using rule 5
    MINUS           reduce using rule 5
    TIMES           reduce using rule 5
    DIVIDE          reduce using rule 5
    RPAREN          reduce using rule 5


state 4

    expression -> expression TIMES . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 5

    expression -> expression MINUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 6

    expression -> expression PLUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 7

    expression -> expression DIVIDE . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER          shift and go to state 3
    LPAREN          shift and go to state 2


state 8

    expression -> LPAREN expression . RPAREN
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    RPAREN          shift and go to state 13
    PLUS            shift and go to state 6
    MINUS           shift and go to state 5
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7


state 9

    expression -> expression TIMES expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 3
    PLUS            reduce using rule 3
    MINUS           reduce using rule 3
    TIMES           reduce using rule 3
    DIVIDE          reduce using rule 3
    RPAREN          reduce using rule 3

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 10

    expression -> expression MINUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 2
    PLUS            reduce using rule 2
    MINUS           reduce using rule 2
    RPAREN          reduce using rule 2
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 11

    expression -> expression PLUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 1
    PLUS            reduce using rule 1
    MINUS           reduce using rule 1
    RPAREN          reduce using rule 1
    TIMES           shift and go to state 4
    DIVIDE          shift and go to state 7

  ! TIMES           [ reduce using rule 1 ]
  ! DIVIDE          [ reduce using rule 1 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]

state 12

    expression -> expression DIVIDE expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $               reduce using rule 4
    PLUS            reduce using rule 4
    MINUS           reduce using rule 4
    TIMES           reduce using rule 4
    DIVIDE          reduce using rule 4
    RPAREN          reduce using rule 4

  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
  ! TIMES           [ shift and go to state 4 ]
  ! DIVIDE          [ shift and go to state 7 ]

state 13

    expression -> LPAREN expression RPAREN .

    $               reduce using rule 6
    PLUS            reduce using rule 6
    MINUS           reduce using rule 6
    TIMES           reduce using rule 6
    DIVIDE          reduce using rule 6
    RPAREN          reduce using rule 6
</pre>
</blockquote>

In the file, each state of the grammar is described.  Within each state the "." indicates the current
location of the parse within any applicable grammar rules.  In addition, the actions for each valid
input token are listed.  When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
are prefixed with an !.  For example:

<blockquote>
<pre>
  ! TIMES           [ reduce using rule 2 ]
  ! DIVIDE          [ reduce using rule 2 ]
  ! PLUS            [ shift and go to state 6 ]
  ! MINUS           [ shift and go to state 5 ]
</pre>
</blockquote>

By looking at these rules (and with a little practice), you can usually track down the source
of most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
bad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.

<H3><a name="ply_nn29"></a>5.8 Syntax Error Handling</H3>


When a syntax error occurs during parsing, the error is immediately
detected (i.e., the parser does not read any more tokens beyond the
source of the error).  Error recovery in LR parsers is a delicate
topic that involves ancient rituals and black-magic.  The recovery mechanism
provided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want to
consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.

<p>
When a syntax error occurs, <tt>yacc.py</tt> performs the following steps:

<ol>
<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
is called with the offending token as an argument.
Afterwards, the parser enters 22772632Sstever@eecs.umich.eduan "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it 22782632Sstever@eecs.umich.eduhas successfully shifted at least 3 tokens onto the parsing stack. 22792632Sstever@eecs.umich.edu 22802632Sstever@eecs.umich.edu<p> 22812632Sstever@eecs.umich.edu<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced 22822632Sstever@eecs.umich.eduwith a special <tt>error</tt> token. 22832632Sstever@eecs.umich.edu 22842632Sstever@eecs.umich.edu<p> 22852632Sstever@eecs.umich.edu<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is 22862632Sstever@eecs.umich.edudeleted. 22872632Sstever@eecs.umich.edu 22882632Sstever@eecs.umich.edu<p> 22892632Sstever@eecs.umich.edu<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start 22902632Sstever@eecs.umich.eduparsing from its initial state. 22912632Sstever@eecs.umich.edu 22922632Sstever@eecs.umich.edu<p> 22932632Sstever@eecs.umich.edu<li>If a grammar rule accepts <tt>error</tt> as a token, it will be 22942632Sstever@eecs.umich.edushifted onto the parsing stack. 22952632Sstever@eecs.umich.edu 22962632Sstever@eecs.umich.edu<p> 22972632Sstever@eecs.umich.edu<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the 22982632Sstever@eecs.umich.eduparser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>. 22992632Sstever@eecs.umich.edu</ol> 23002632Sstever@eecs.umich.edu 23014479Sbinkertn@umich.edu<H4><a name="ply_nn30"></a>5.8.1 Recovery and resynchronization with error rules</H4> 23024479Sbinkertn@umich.edu 23032632Sstever@eecs.umich.edu 23042632Sstever@eecs.umich.eduThe most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt> 23052632Sstever@eecs.umich.edutoken. 
For example, suppose your language had a grammar rule for a print statement like this: 23062632Sstever@eecs.umich.edu 23072632Sstever@eecs.umich.edu<blockquote> 23082632Sstever@eecs.umich.edu<pre> 23094479Sbinkertn@umich.edudef p_statement_print(p): 23102632Sstever@eecs.umich.edu 'statement : PRINT expr SEMI' 23112632Sstever@eecs.umich.edu ... 23122632Sstever@eecs.umich.edu</pre> 23132632Sstever@eecs.umich.edu</blockquote> 23142632Sstever@eecs.umich.edu 23152632Sstever@eecs.umich.eduTo account for the possibility of a bad expression, you might write an additional grammar rule like this: 23162632Sstever@eecs.umich.edu 23172632Sstever@eecs.umich.edu<blockquote> 23182632Sstever@eecs.umich.edu<pre> 23194479Sbinkertn@umich.edudef p_statement_print_error(p): 23202632Sstever@eecs.umich.edu 'statement : PRINT error SEMI' 23212632Sstever@eecs.umich.edu print "Syntax error in print statement. Bad expression" 23222632Sstever@eecs.umich.edu 23232632Sstever@eecs.umich.edu</pre> 23242632Sstever@eecs.umich.edu</blockquote> 23252632Sstever@eecs.umich.edu 23262632Sstever@eecs.umich.eduIn this case, the <tt>error</tt> token will match any sequence of 23272632Sstever@eecs.umich.edutokens that might appear up to the first semicolon that is 23282632Sstever@eecs.umich.eduencountered. Once the semicolon is reached, the rule will be 23292632Sstever@eecs.umich.eduinvoked and the <tt>error</tt> token will go away. 23302632Sstever@eecs.umich.edu 23312632Sstever@eecs.umich.edu<p> 23322632Sstever@eecs.umich.eduThis type of recovery is sometimes known as parser resynchronization. 23332632Sstever@eecs.umich.eduThe <tt>error</tt> token acts as a wildcard for any bad input text and 23342632Sstever@eecs.umich.eduthe token immediately following <tt>error</tt> acts as a 23352632Sstever@eecs.umich.edusynchronization token. 
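<p>
To make the resynchronization concrete, here is a small self-contained sketch (plain Python, not PLY itself; the token stream is hypothetical) of which tokens the <tt>error</tt> wildcard would absorb for a malformed print statement under the rule <tt>'statement : PRINT error SEMI'</tt>:

```python
# Hypothetical token stream for the malformed input "print 3 + * 4 ;".
# The error token absorbs every token after PRINT up to (but not
# including) the synchronizing SEMI.
toks = ['PRINT', 'NUMBER', 'PLUS', 'TIMES', 'NUMBER', 'SEMI']
absorbed = toks[1:toks.index('SEMI')]
assert absorbed == ['NUMBER', 'PLUS', 'TIMES', 'NUMBER']
```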

<p>
It is important to note that the <tt>error</tt> token usually does not appear as the last token
on the right in an error rule. For example:

<blockquote>
<pre>
def p_statement_print_error(p):
     'statement : PRINT error'
     print "Syntax error in print statement. Bad expression"
</pre>
</blockquote>

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow.

<H4><a name="ply_nn31"></a>5.8.2 Panic mode recovery</H4>

An alternative error recovery scheme is to enter a panic mode recovery in which tokens are
discarded to a point where the parser might be able to recover in some sensible manner.

<p>
Panic mode recovery is implemented entirely in the <tt>p_error()</tt> function. For example, this
function starts discarding tokens until it reaches a closing '}'. Then, it restarts the
parser in its initial state.

<blockquote>
<pre>
def p_error(p):
    print "Whoa. You are seriously hosed."
    # Read ahead looking for a closing '}'
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'RBRACE': break
    yacc.restart()
</pre>
</blockquote>

<p>
This function simply discards the bad token and tells the parser that the error was ok.

<blockquote>
<pre>
def p_error(p):
    print "Syntax error at token", p.type
    # Just discard the token and tell the parser it's okay.
    yacc.errok()
</pre>
</blockquote>

<P>
Within the <tt>p_error()</tt> function, three functions are available to control the behavior
of the parser:
<p>
<ul>
<li><tt>yacc.errok()</tt>. This resets the parser state so it doesn't think it's in error-recovery
mode. This will prevent an <tt>error</tt> token from being generated and will reset the internal
error counters so that the next syntax error will call <tt>p_error()</tt> again.

<p>
<li><tt>yacc.token()</tt>. This returns the next token on the input stream.

<p>
<li><tt>yacc.restart()</tt>. This discards the entire parsing stack and resets the parser
to its initial state.
</ul>

Note: these functions are only available when invoking <tt>p_error()</tt> and are not available
at any other time.

<p>
To supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token. This might be
useful if trying to synchronize on special characters. For example:

<blockquote>
<pre>
def p_error(p):
    # Read ahead looking for a terminating ";"
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'SEMI': break
    yacc.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok
</pre>
</blockquote>

<H4><a name="ply_nn32"></a>5.8.3 General comments on error handling</H4>

For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable
technique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy
to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want
to discard huge portions of the input text to find a valid restart point.

<H3><a name="ply_nn33"></a>5.9 Line Number and Position Tracking</H3>

Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of
all tokens. This information is available using the following functions:

<ul>
<li><tt>p.lineno(num)</tt>. Return the line number for symbol <em>num</em>.
<li><tt>p.lexpos(num)</tt>. Return the lexing position for symbol <em>num</em>.
</ul>

For example:

<blockquote>
<pre>
def p_expression(p):
    'expression : expression PLUS expression'
    line  = p.lineno(2)       # line number of the PLUS token
    index = p.lexpos(2)       # Position of the PLUS token
</pre>
</blockquote>

As an optional feature, <tt>yacc.py</tt> can automatically track line numbers and positions for all of the grammar symbols
as well. However, this extra tracking requires extra processing and can significantly slow down parsing.
Therefore, it must be enabled by passing the
<tt>tracking=True</tt> option to <tt>yacc.parse()</tt>. For example:

<blockquote>
<pre>
yacc.parse(data,tracking=True)
</pre>
</blockquote>

Once enabled, the <tt>lineno()</tt> and <tt>lexpos()</tt> methods work for all grammar symbols. In addition, two
other methods can be used:

<ul>
<li><tt>p.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
<li><tt>p.lexspan(num)</tt>. Return a tuple (start,end) with the starting and ending positions for symbol <em>num</em>.
</ul>

For example:

<blockquote>
<pre>
def p_expression(p):
    'expression : expression PLUS expression'
    p.lineno(1)        # Line number of the left expression
    p.lineno(2)        # line number of the PLUS operator
    p.lineno(3)        # line number of the right expression
    ...
    start,end   = p.linespan(3)    # Start,end lines of the right expression
    starti,endi = p.lexspan(3)     # Start,end positions of right expression
</pre>
</blockquote>

Note: The <tt>lexspan()</tt> function only returns the range of values up to the start of the last grammar symbol.
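<p>
The span methods report a range covering a whole production. As a rough illustration (plain Python, no PLY; the line numbers are made up), a <tt>linespan()</tt>-style result combines the first starting line and the last ending line of the symbols that make up the production:

```python
# Hypothetical (startline, endline) pairs for the three symbols of
# 'expression : expression PLUS expression' in a multi-line input.
child_spans = [(3, 3), (3, 3), (3, 5)]   # expression, PLUS, expression
startline = child_spans[0][0]            # where the production begins
endline = child_spans[-1][1]             # where its last symbol ends
assert (startline, endline) == (3, 5)
```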

<p>
Although it may be convenient for PLY to track position information on
all grammar symbols, this is often unnecessary. For example, if you
are merely using line number information in an error message, you can
often just key off of a specific token in the grammar rule. For
example:

<blockquote>
<pre>
def p_bad_func(p):
    'funccall : fname LPAREN error RPAREN'
    # Line number reported from LPAREN token
    print "Bad function call at line", p.lineno(2)
</pre>
</blockquote>

<p>
Similarly, you may get better parsing performance if you only propagate line number
information where it's needed. For example:

<blockquote>
<pre>
def p_fname(p):
    'fname : ID'
    p[0] = (p[1],p.lineno(1))
</pre>
</blockquote>

Finally, it should be noted that PLY does not store position information after a rule has been
processed. If it is important for you to retain this information in an abstract syntax tree, you
must make your own copy.
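<p>
"Making your own copy" simply means stashing the value on a node you own while the rule function still has access to it. A minimal sketch (the <tt>Call</tt> class is a hypothetical stand-in, not part of PLY):

```python
# Inside a rule you would write something like:
#     p[0] = Call(p[1], lineno=p.lineno(1))
# copying the line number out before PLY discards it.
class Call:
    def __init__(self, name, lineno):
        self.name = name
        self.lineno = lineno     # value copied from p.lineno(1)

node = Call("fname", lineno=12)
assert node.lineno == 12
```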

<H3><a name="ply_nn34"></a>5.10 AST Construction</H3>

<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree. However, such
construction is easy enough to do on your own. Simply create a data structure for abstract syntax tree nodes
and assign nodes to <tt>p[0]</tt> in each rule.

For example:

<blockquote>
<pre>
class Expr: pass

class BinOp(Expr):
    def __init__(self,left,op,right):
        self.type = "binop"
        self.left = left
        self.right = right
        self.op = op

class Number(Expr):
    def __init__(self,value):
        self.type = "number"
        self.value = value

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    p[0] = BinOp(p[1],p[2],p[3])

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = Number(p[1])
</pre>
</blockquote>

To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
For example:

<blockquote>
<pre>
class Node:
    def __init__(self,type,children=None,leaf=None):
        self.type = type
        if children:
            self.children = children
        else:
            self.children = [ ]
        self.leaf = leaf

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''

    p[0] = Node("binop", [p[1],p[3]], p[2])
</pre>
</blockquote>

<H3><a name="ply_nn35"></a>5.11 Embedded Actions</H3>

The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example,
suppose you have a rule like this:

<blockquote>
<pre>
def p_foo(p):
    "foo : A B C D"
    print "Parsed a foo", p[1],p[2],p[3],p[4]
</pre>
</blockquote>

<p>
In this case, the supplied action code only executes after all of the
symbols <tt>A</tt>, <tt>B</tt>, <tt>C</tt>, and <tt>D</tt> have been
parsed. Sometimes, however, it is useful to execute small code
fragments during intermediate stages of parsing. For example, suppose
you wanted to perform some action immediately after <tt>A</tt> has
been parsed. To do this, you can write an empty rule like this:

<blockquote>
<pre>
def p_foo(p):
    "foo : A seen_A B C D"
    print "Parsed a foo", p[1],p[3],p[4],p[5]
    print "seen_A returned", p[2]

def p_seen_A(p):
    "seen_A :"
    print "Saw an A = ", p[-1]       # Access grammar symbol to left
    p[0] = some_value                # Assign value to seen_A
</pre>
</blockquote>

<p>
In this example, the empty <tt>seen_A</tt> rule executes immediately
after <tt>A</tt> is shifted onto the parsing stack. Within this
rule, <tt>p[-1]</tt> refers to the symbol on the stack that appears
immediately to the left of the <tt>seen_A</tt> symbol. In this case,
it would be the value of <tt>A</tt> in the <tt>foo</tt> rule
immediately above. Like other rules, a value can be returned from an
embedded action by simply assigning it to <tt>p[0]</tt>.

<p>
The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example,
this grammar has no conflicts:

<blockquote>
<pre>
def p_foo(p):
    """foo : abcd
           | abcx"""

def p_abcd(p):
    "abcd : A B C D"

def p_abcx(p):
    "abcx : A B C X"
</pre>
</blockquote>

However, if you insert an embedded action into one of the rules like this,

<blockquote>
<pre>
def p_foo(p):
    """foo : abcd
           | abcx"""

def p_abcd(p):
    "abcd : A B C D"

def p_abcx(p):
    "abcx : A B seen_AB C X"

def p_seen_AB(p):
    "seen_AB :"
</pre>
</blockquote>

an extra shift-reduce conflict will be introduced. This conflict is caused by the fact that the same symbol <tt>C</tt> appears next in
both the <tt>abcd</tt> and <tt>abcx</tt> rules. The parser can either shift the symbol (<tt>abcd</tt> rule) or reduce the empty rule <tt>seen_AB</tt> (<tt>abcx</tt> rule).

<p>
A common use of embedded rules is to control other aspects of parsing
such as scoping of local variables. For example, if you were parsing C code, you might
write code like this:

<blockquote>
<pre>
def p_statements_block(p):
    "statements : LBRACE new_scope statements RBRACE"
    # Action code
    ...
    pop_scope()        # Return to previous scope

def p_new_scope(p):
    "new_scope :"
    # Create a new scope for local variables
    s = new_scope()
    push_scope(s)
    ...
</pre>
</blockquote>

In this case, the embedded action <tt>new_scope</tt> executes immediately after an <tt>LBRACE</tt> (<tt>{</tt>) symbol is parsed. This might
adjust internal symbol tables and other aspects of the parser. Upon completion of the rule <tt>statements_block</tt>, code might undo the operations performed in the embedded action (e.g., <tt>pop_scope()</tt>).
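<p>
The scope discipline behind that embedded action can be sketched on its own. Here <tt>push_scope</tt>/<tt>pop_scope</tt> are hypothetical helpers (as in the grammar above), modeled with a plain list used as a stack:

```python
scope_stack = []

def push_scope(s):
    # Called from the embedded new_scope action, right after '{' is shifted
    scope_stack.append(s)

def pop_scope():
    # Called at the end of the statements rule, at the matching '}'
    return scope_stack.pop()

push_scope({'x': 'int'})      # entering the outer '{'
push_scope({'y': 'char'})     # entering a nested '{'
pop_scope()                   # matching '}': back to the outer scope
assert scope_stack == [{'x': 'int'}]
```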
27014479Sbinkertn@umich.edu 27024479Sbinkertn@umich.edu<H3><a name="ply_nn36"></a>5.12 Yacc implementation notes</H3> 27034479Sbinkertn@umich.edu 27042632Sstever@eecs.umich.edu 27052632Sstever@eecs.umich.edu<ul> 27064479Sbinkertn@umich.edu<li>The default parsing method is LALR. To use SLR instead, run yacc() as follows: 27074479Sbinkertn@umich.edu 27084479Sbinkertn@umich.edu<blockquote> 27094479Sbinkertn@umich.edu<pre> 27104479Sbinkertn@umich.eduyacc.yacc(method="SLR") 27114479Sbinkertn@umich.edu</pre> 27124479Sbinkertn@umich.edu</blockquote> 27134479Sbinkertn@umich.eduNote: LALR table generation takes approximately twice as long as SLR table generation. There is no 27144479Sbinkertn@umich.edudifference in actual parsing performance---the same code is used in both cases. LALR is preferred when working 27154479Sbinkertn@umich.eduwith more complicated grammars since it is more powerful. 27164479Sbinkertn@umich.edu 27174479Sbinkertn@umich.edu<p> 27184479Sbinkertn@umich.edu 27192632Sstever@eecs.umich.edu<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing. However, an alternative tokenizer 27202632Sstever@eecs.umich.educan be supplied as follows: 27212632Sstever@eecs.umich.edu 27222632Sstever@eecs.umich.edu<blockquote> 27232632Sstever@eecs.umich.edu<pre> 27242632Sstever@eecs.umich.eduyacc.parse(lexer=x) 27252632Sstever@eecs.umich.edu</pre> 27262632Sstever@eecs.umich.edu</blockquote> 27272632Sstever@eecs.umich.eduin this case, <tt>x</tt> must be a Lexer object that minimally has a <tt>x.token()</tt> method for retrieving the next 27282632Sstever@eecs.umich.edutoken. If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method. 27292632Sstever@eecs.umich.edu 27302632Sstever@eecs.umich.edu<p> 27312632Sstever@eecs.umich.edu<li>By default, the yacc generates tables in debugging mode (which produces the parser.out file and other output). 
27322632Sstever@eecs.umich.eduTo disable this, use 27332632Sstever@eecs.umich.edu 27342632Sstever@eecs.umich.edu<blockquote> 27352632Sstever@eecs.umich.edu<pre> 27362632Sstever@eecs.umich.eduyacc.yacc(debug=0) 27372632Sstever@eecs.umich.edu</pre> 27382632Sstever@eecs.umich.edu</blockquote> 27392632Sstever@eecs.umich.edu 27402632Sstever@eecs.umich.edu<p> 27412632Sstever@eecs.umich.edu<li>To change the name of the <tt>parsetab.py</tt> file, use: 27422632Sstever@eecs.umich.edu 27432632Sstever@eecs.umich.edu<blockquote> 27442632Sstever@eecs.umich.edu<pre> 27452632Sstever@eecs.umich.eduyacc.yacc(tabmodule="foo") 27462632Sstever@eecs.umich.edu</pre> 27472632Sstever@eecs.umich.edu</blockquote> 27482632Sstever@eecs.umich.edu 27494479Sbinkertn@umich.edu<p> 27504479Sbinkertn@umich.edu<li>To change the directory in which the <tt>parsetab.py</tt> file (and other output files) are written, use: 27514479Sbinkertn@umich.edu<blockquote> 27524479Sbinkertn@umich.edu<pre> 27534479Sbinkertn@umich.eduyacc.yacc(tabmodule="foo",outputdir="somedirectory") 27544479Sbinkertn@umich.edu</pre> 27554479Sbinkertn@umich.edu</blockquote> 27564479Sbinkertn@umich.edu 27574479Sbinkertn@umich.edu<p> 27584479Sbinkertn@umich.edu<li>To prevent yacc from generating any kind of parser table file, use: 27594479Sbinkertn@umich.edu<blockquote> 27604479Sbinkertn@umich.edu<pre> 27614479Sbinkertn@umich.eduyacc.yacc(write_tables=0) 27624479Sbinkertn@umich.edu</pre> 27634479Sbinkertn@umich.edu</blockquote> 27644479Sbinkertn@umich.edu 27654479Sbinkertn@umich.eduNote: If you disable table generation, yacc() will regenerate the parsing tables 27664479Sbinkertn@umich.edueach time it runs (which may take awhile depending on how large your grammar is). 
<p>
<li>To print copious amounts of debugging output during parsing, use:

<blockquote>
<pre>
yacc.parse(debug=1)
</pre>
</blockquote>

<p>
<li>To redirect the debugging output to a filename of your choosing, use:

<blockquote>
<pre>
yacc.parse(debug=1, debugfile="debugging.out")
</pre>
</blockquote>

<p>
<li>The <tt>yacc.yacc()</tt> function really returns a parser object. If you want to support multiple
parsers in the same application, do this:

<blockquote>
<pre>
p = yacc.yacc()
...
p.parse()
</pre>
</blockquote>

Note: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.

<p>
<li>Since the generation of the LALR tables is relatively expensive, previously generated tables are
cached and reused if possible. The decision to regenerate the tables is determined by taking an MD5
checksum of all grammar rules and precedence rules.
Only in the event of a mismatch are the tables regenerated.

<p>
It should be noted that table generation is reasonably efficient, even for grammars that involve around 100 rules
and several hundred states. For more complex languages such as C, table generation may take 30-60 seconds on a slow
machine. Please be patient.

<p>
<li>Since LR parsing is driven by tables, the performance of the parser is largely independent of the
size of the grammar. The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules.
</ul>

<H2><a name="ply_nn37"></a>6. Parser and Lexer State Management</H2>

In advanced parsing applications, you may want to have multiple
parsers and lexers. Furthermore, the parser may want to control the
behavior of the lexer in some way.

<p>
To do this, it is important to note that both the lexer and parser are
actually implemented as objects. These objects are returned by the
<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.
For example:

<blockquote>
<pre>
lexer  = lex.lex()       # Return lexer object
parser = yacc.yacc()     # Return parser object
</pre>
</blockquote>

To attach the lexer and parser together, make sure you use the <tt>lexer</tt> argument to parse. For example:

<blockquote>
<pre>
parser.parse(text,lexer=lexer)
</pre>
</blockquote>

Within lexer and parser rules, these objects are also available. In the lexer,
the "lexer" attribute of a token refers to the lexer object in use. For example:

<blockquote>
<pre>
def t_NUMBER(t):
   r'\d+'
   ...
   print t.lexer           # Show lexer object
</pre>
</blockquote>

In the parser, the "lexer" and "parser" attributes refer to the lexer
and parser objects respectively.

<blockquote>
<pre>
def p_expr_plus(p):
   'expr : expr PLUS expr'
   ...
   print p.parser          # Show parser object
   print p.lexer           # Show lexer object
</pre>
</blockquote>

If necessary, arbitrary attributes can be attached to the lexer or parser object.
For example, if you wanted to have different parsing modes, you could attach a mode
attribute to the parser object and look at it later.

<H2><a name="ply_nn38"></a>7. Using Python's Optimized Mode</H2>

Because PLY uses information from doc-strings, parsing and lexing
information must be gathered while running the Python interpreter in
normal mode (i.e., not with the -O or -OO options). However, if you
specify optimized mode like this:

<blockquote>
<pre>
lex.lex(optimize=1)
yacc.yacc(optimize=1)
</pre>
</blockquote>

then PLY can later be used when Python runs in optimized mode. To make this work,
make sure you first run Python in normal mode. Once the lexing and parsing tables
have been generated the first time, run Python in optimized mode. PLY will use
the tables without the need for doc-strings.
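<p>
The reason doc-strings matter here can be demonstrated without PLY at all: under <tt>-OO</tt>, Python discards
doc-strings, so the regular expression that a PLY-style rule function stores in its <tt>__doc__</tt> attribute
simply vanishes. The following is a minimal sketch of that effect in plain Python (the <tt>t_NUMBER</tt> function
merely imitates PLY's rule-naming convention; this is not PLY internals):

```python
import subprocess
import sys

# A PLY-style token rule keeps its regex in the function doc-string.
def t_NUMBER(t):
    r'\d+'
    return t

# In normal mode the doc-string (the regex) is available:
print(t_NUMBER.__doc__)        # prints: \d+

# Under -OO, Python strips doc-strings, so the same lookup yields None.
# This is why the tables must be generated in normal mode first.
code = "def t_NUMBER(t):\n    r'\\d+'\n    return t\nprint(t_NUMBER.__doc__)"
result = subprocess.run([sys.executable, "-OO", "-c", code],
                        capture_output=True, text=True)
print(result.stdout.strip())   # prints: None
```

Once the tables exist on disk, PLY no longer needs to consult the doc-strings, which is why the
generate-first, optimize-later workflow described above works.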
<p>
Beware: running PLY in optimized mode disables a lot of error
checking. You should only do this when your project has stabilized
and you don't need to do any debugging.

<H2><a name="ply_nn39"></a>8. Where to go from here?</H2>

The <tt>examples</tt> directory of the PLY distribution contains several simple examples. Please consult a
compilers textbook for the theory and underlying implementation details of LR parsing.

</body>
</html>