ply.html revision 4479
12632Sstever@eecs.umich.edu<html> 22632Sstever@eecs.umich.edu<head> 32632Sstever@eecs.umich.edu<title>PLY (Python Lex-Yacc)</title> 42632Sstever@eecs.umich.edu</head> 52632Sstever@eecs.umich.edu<body bgcolor="#ffffff"> 62632Sstever@eecs.umich.edu 72632Sstever@eecs.umich.edu<h1>PLY (Python Lex-Yacc)</h1> 82632Sstever@eecs.umich.edu 92632Sstever@eecs.umich.edu<b> 102632Sstever@eecs.umich.eduDavid M. Beazley <br> 112632Sstever@eecs.umich.edudave@dabeaz.com<br> 122632Sstever@eecs.umich.edu</b> 132632Sstever@eecs.umich.edu 142632Sstever@eecs.umich.edu<p> 152632Sstever@eecs.umich.edu<b>PLY Version: 2.3</b> 162632Sstever@eecs.umich.edu<p> 172632Sstever@eecs.umich.edu 182632Sstever@eecs.umich.edu<!-- INDEX --> 192632Sstever@eecs.umich.edu<div class="sectiontoc"> 202632Sstever@eecs.umich.edu<ul> 212632Sstever@eecs.umich.edu<li><a href="#ply_nn1">Introduction</a> 222632Sstever@eecs.umich.edu<li><a href="#ply_nn2">PLY Overview</a> 232632Sstever@eecs.umich.edu<li><a href="#ply_nn3">Lex</a> 242632Sstever@eecs.umich.edu<ul> 252632Sstever@eecs.umich.edu<li><a href="#ply_nn4">Lex Example</a> 262632Sstever@eecs.umich.edu<li><a href="#ply_nn5">The tokens list</a> 272632Sstever@eecs.umich.edu<li><a href="#ply_nn6">Specification of tokens</a> 282632Sstever@eecs.umich.edu<li><a href="#ply_nn7">Token values</a> 292632Sstever@eecs.umich.edu<li><a href="#ply_nn8">Discarded tokens</a> 302632Sstever@eecs.umich.edu<li><a href="#ply_nn9">Line numbers and positional information</a> 312632Sstever@eecs.umich.edu<li><a href="#ply_nn10">Ignored characters</a> 322632Sstever@eecs.umich.edu<li><a href="#ply_nn11">Literal characters</a> 332632Sstever@eecs.umich.edu<li><a href="#ply_nn12">Error handling</a> 342632Sstever@eecs.umich.edu<li><a href="#ply_nn13">Building and using the lexer</a> 352632Sstever@eecs.umich.edu<li><a href="#ply_nn14">The @TOKEN decorator</a> 362632Sstever@eecs.umich.edu<li><a href="#ply_nn15">Optimized mode</a> 372632Sstever@eecs.umich.edu<li><a href="#ply_nn16">Debugging</a> 382632Sstever@eecs.umich.edu<li><a href="#ply_nn17">Alternative specification of lexers</a> 392632Sstever@eecs.umich.edu<li><a href="#ply_nn18">Maintaining state</a> 402632Sstever@eecs.umich.edu<li><a href="#ply_nn19">Duplicating lexers</a> 412632Sstever@eecs.umich.edu<li><a href="#ply_nn20">Internal lexer state</a> 422632Sstever@eecs.umich.edu<li><a href="#ply_nn21">Conditional lexing and start conditions</a> 432632Sstever@eecs.umich.edu<li><a href="#ply_nn21">Miscellaneous Issues</a> 442632Sstever@eecs.umich.edu</ul> 452632Sstever@eecs.umich.edu<li><a href="#ply_nn22">Parsing basics</a> 462632Sstever@eecs.umich.edu<li><a href="#ply_nn23">Yacc reference</a> 472632Sstever@eecs.umich.edu<ul> 482632Sstever@eecs.umich.edu<li><a href="#ply_nn24">An example</a> 492632Sstever@eecs.umich.edu<li><a href="#ply_nn25">Combining Grammar Rule Functions</a> 502632Sstever@eecs.umich.edu<li><a href="#ply_nn26">Character Literals</a> 512632Sstever@eecs.umich.edu<li><a href="#ply_nn26">Empty Productions</a> 522632Sstever@eecs.umich.edu<li><a href="#ply_nn28">Changing the starting symbol</a> 532632Sstever@eecs.umich.edu<li><a href="#ply_nn27">Dealing With Ambiguous Grammars</a> 542632Sstever@eecs.umich.edu<li><a href="#ply_nn28">The parser.out file</a> 552632Sstever@eecs.umich.edu<li><a href="#ply_nn29">Syntax Error Handling</a> 562632Sstever@eecs.umich.edu<ul> 572632Sstever@eecs.umich.edu<li><a href="#ply_nn30">Recovery and resynchronization with error rules</a> 582632Sstever@eecs.umich.edu<li><a href="#ply_nn31">Panic mode recovery</a> 592632Sstever@eecs.umich.edu<li><a href="#ply_nn32">General comments on error handling</a> 602632Sstever@eecs.umich.edu</ul> 612632Sstever@eecs.umich.edu<li><a href="#ply_nn33">Line Number and Position Tracking</a> 622632Sstever@eecs.umich.edu<li><a href="#ply_nn34">AST Construction</a> 632632Sstever@eecs.umich.edu<li><a href="#ply_nn35">Embedded Actions</a> 642632Sstever@eecs.umich.edu<li><a href="#ply_nn36">Yacc implementation notes</a> 652632Sstever@eecs.umich.edu</ul> 662632Sstever@eecs.umich.edu<li><a href="#ply_nn37">Parser and Lexer State Management</a> 672632Sstever@eecs.umich.edu<li><a href="#ply_nn38">Using Python's Optimized Mode</a> 682632Sstever@eecs.umich.edu<li><a href="#ply_nn39">Where to go from here?</a> 692632Sstever@eecs.umich.edu</ul> 702632Sstever@eecs.umich.edu</div> 712632Sstever@eecs.umich.edu<!-- INDEX --> 722632Sstever@eecs.umich.edu 732632Sstever@eecs.umich.edu 742632Sstever@eecs.umich.edu 752632Sstever@eecs.umich.edu 762632Sstever@eecs.umich.edu 772632Sstever@eecs.umich.edu 782632Sstever@eecs.umich.edu<H2><a name="ply_nn1"></a>1. Introduction</H2> 792632Sstever@eecs.umich.edu 802632Sstever@eecs.umich.edu 812632Sstever@eecs.umich.eduPLY is a pure-Python implementation of the popular compiler 822632Sstever@eecs.umich.educonstruction tools lex and yacc. The main goal of PLY is to stay 832632Sstever@eecs.umich.edufairly faithful to the way in which traditional lex/yacc tools work. 842632Sstever@eecs.umich.eduThis includes supporting LALR(1) parsing as well as providing 852632Sstever@eecs.umich.eduextensive input validation, error reporting, and diagnostics. Thus, 862632Sstever@eecs.umich.eduif you've used yacc in another programming language, it should be 872632Sstever@eecs.umich.edurelatively straightforward to use PLY. 882632Sstever@eecs.umich.edu 892632Sstever@eecs.umich.edu<p> 902632Sstever@eecs.umich.eduEarly versions of PLY were developed to support an Introduction to 912632Sstever@eecs.umich.eduCompilers Course I taught in 2001 at the University of Chicago. In this course, 922632Sstever@eecs.umich.edustudents built a fully functional compiler for a simple Pascal-like 932632Sstever@eecs.umich.edulanguage. Their compiler, implemented entirely in Python, had to 942632Sstever@eecs.umich.eduinclude lexical analysis, parsing, type checking, type inference, 952632Sstever@eecs.umich.edunested scoping, and code generation for the SPARC processor. 962632Sstever@eecs.umich.eduApproximately 30 different compiler implementations were completed in 972632Sstever@eecs.umich.eduthis course. Most of PLY's interface and operation has been influenced by common 982632Sstever@eecs.umich.eduusability problems encountered by students. 992632Sstever@eecs.umich.edu 1002632Sstever@eecs.umich.edu<p> 1012632Sstever@eecs.umich.eduSince PLY was primarily developed as an instructional tool, you will 1022632Sstever@eecs.umich.edufind it to be fairly picky about token and grammar rule 1032632Sstever@eecs.umich.eduspecification. In part, this 1042632Sstever@eecs.umich.eduadded formality is meant to catch common programming mistakes made by 1052632Sstever@eecs.umich.edunovice users. However, advanced users will also find such features to 1062632Sstever@eecs.umich.edube useful when building complicated grammars for real programming 1072632Sstever@eecs.umich.edulanguages. It should also be noted that PLY does not provide much in 1082632Sstever@eecs.umich.eduthe way of bells and whistles (e.g., automatic construction of 1092632Sstever@eecs.umich.eduabstract syntax trees, tree traversal, etc.). Nor would I consider it 1102632Sstever@eecs.umich.eduto be a parsing framework. Instead, you will find a bare-bones, yet 1112632Sstever@eecs.umich.edufully capable lex/yacc implementation written entirely in Python. 1122632Sstever@eecs.umich.edu 1132632Sstever@eecs.umich.edu<p> 1142632Sstever@eecs.umich.eduThe rest of this document assumes that you are somewhat familar with 1152632Sstever@eecs.umich.eduparsing theory, syntax directed translation, and the use of compiler 1162632Sstever@eecs.umich.educonstruction tools such as lex and yacc in other programming 1172632Sstever@eecs.umich.edulanguages. If you are unfamilar with these topics, you will probably 1182632Sstever@eecs.umich.eduwant to consult an introductory text such as "Compilers: Principles, 1192632Sstever@eecs.umich.eduTechniques, and Tools", by Aho, Sethi, and Ullman. O'Reilly's "Lex 1202632Sstever@eecs.umich.eduand Yacc" by John Levine may also be handy. In fact, the O'Reilly book can be 1212632Sstever@eecs.umich.eduused as a reference for PLY as the concepts are virtually identical. 1222632Sstever@eecs.umich.edu 1232632Sstever@eecs.umich.edu<H2><a name="ply_nn2"></a>2. PLY Overview</H2> 1242632Sstever@eecs.umich.edu 1252632Sstever@eecs.umich.edu 1262632Sstever@eecs.umich.eduPLY consists of two separate modules; <tt>lex.py</tt> and 1272632Sstever@eecs.umich.edu<tt>yacc.py</tt>, both of which are found in a Python package 1282632Sstever@eecs.umich.educalled <tt>ply</tt>. The <tt>lex.py</tt> module is used to break input text into a 1292632Sstever@eecs.umich.educollection of tokens specified by a collection of regular expression 1302632Sstever@eecs.umich.edurules. <tt>yacc.py</tt> is used to recognize language syntax that has 1312632Sstever@eecs.umich.edubeen specified in the form of a context free grammar. <tt>yacc.py</tt> uses LR parsing and generates its parsing tables 1322632Sstever@eecs.umich.eduusing either the LALR(1) (the default) or SLR table generation algorithms. 1332632Sstever@eecs.umich.edu 1342632Sstever@eecs.umich.edu<p> 1352632Sstever@eecs.umich.eduThe two tools are meant to work together. Specifically, 1362632Sstever@eecs.umich.edu<tt>lex.py</tt> provides an external interface in the form of a 1372632Sstever@eecs.umich.edu<tt>token()</tt> function that returns the next valid token on the 1382632Sstever@eecs.umich.eduinput stream. <tt>yacc.py</tt> calls this repeatedly to retrieve 1392632Sstever@eecs.umich.edutokens and invoke grammar rules. The output of <tt>yacc.py</tt> is 1402632Sstever@eecs.umich.eduoften an Abstract Syntax Tree (AST). However, this is entirely up to 1412632Sstever@eecs.umich.eduthe user. If desired, <tt>yacc.py</tt> can also be used to implement 1422632Sstever@eecs.umich.edusimple one-pass compilers. 1432632Sstever@eecs.umich.edu 1442632Sstever@eecs.umich.edu<p> 1452632Sstever@eecs.umich.eduLike its Unix counterpart, <tt>yacc.py</tt> provides most of the 1462632Sstever@eecs.umich.edufeatures you expect including extensive error checking, grammar 1472632Sstever@eecs.umich.eduvalidation, support for empty productions, error tokens, and ambiguity 1482632Sstever@eecs.umich.eduresolution via precedence rules. In fact, everything that is possible in traditional yacc 1492632Sstever@eecs.umich.edushould be supported in PLY. 1502632Sstever@eecs.umich.edu 1512632Sstever@eecs.umich.edu<p> 1522632Sstever@eecs.umich.eduThe primary difference between 1532632Sstever@eecs.umich.edu<tt>yacc.py</tt> and Unix <tt>yacc</tt> is that <tt>yacc.py</tt> 1542632Sstever@eecs.umich.edudoesn't involve a separate code-generation process. 1552632Sstever@eecs.umich.eduInstead, PLY relies on reflection (introspection) 1562632Sstever@eecs.umich.eduto build its lexers and parsers. Unlike traditional lex/yacc which 1572632Sstever@eecs.umich.edurequire a special input file that is converted into a separate source 1582632Sstever@eecs.umich.edufile, the specifications given to PLY <em>are</em> valid Python 1592632Sstever@eecs.umich.eduprograms. This means that there are no extra source files nor is 1602632Sstever@eecs.umich.eduthere a special compiler construction step (e.g., running yacc to 1612632Sstever@eecs.umich.edugenerate Python code for the compiler). Since the generation of the 1622632Sstever@eecs.umich.eduparsing tables is relatively expensive, PLY caches the results and 1632632Sstever@eecs.umich.edusaves them to a file. If no changes are detected in the input source, 1642632Sstever@eecs.umich.eduthe tables are read from the cache. Otherwise, they are regenerated. 1652632Sstever@eecs.umich.edu 1662632Sstever@eecs.umich.edu<H2><a name="ply_nn3"></a>3. Lex</H2> 1672632Sstever@eecs.umich.edu 1682632Sstever@eecs.umich.edu 1692632Sstever@eecs.umich.edu<tt>lex.py</tt> is used to tokenize an input string. For example, suppose 1702632Sstever@eecs.umich.eduyou're writing a programming language and a user supplied the following input string: 1712632Sstever@eecs.umich.edu 1722632Sstever@eecs.umich.edu<blockquote> 1732632Sstever@eecs.umich.edu<pre> 1742632Sstever@eecs.umich.edux = 3 + 42 * (s - t) 1752632Sstever@eecs.umich.edu</pre> 1762632Sstever@eecs.umich.edu</blockquote> 1772632Sstever@eecs.umich.edu 1782632Sstever@eecs.umich.eduA tokenizer splits the string into individual tokens 1792632Sstever@eecs.umich.edu 1802632Sstever@eecs.umich.edu<blockquote> 1812632Sstever@eecs.umich.edu<pre> 1822632Sstever@eecs.umich.edu'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')' 1832632Sstever@eecs.umich.edu</pre> 1842632Sstever@eecs.umich.edu</blockquote> 1852632Sstever@eecs.umich.edu 1862632Sstever@eecs.umich.eduTokens are usually given names to indicate what they are. For example: 1872632Sstever@eecs.umich.edu 1882632Sstever@eecs.umich.edu<blockquote> 1892632Sstever@eecs.umich.edu<pre> 1902632Sstever@eecs.umich.edu'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES', 1912632Sstever@eecs.umich.edu'LPAREN','ID','MINUS','ID','RPAREN' 1922632Sstever@eecs.umich.edu</pre> 1932632Sstever@eecs.umich.edu</blockquote> 1942632Sstever@eecs.umich.edu 1952632Sstever@eecs.umich.eduMore specifically, the input is broken into pairs of token types and values. For example: 1962632Sstever@eecs.umich.edu 1972632Sstever@eecs.umich.edu<blockquote> 1982632Sstever@eecs.umich.edu<pre> 1992632Sstever@eecs.umich.edu('ID','x'), ('EQUALS','='), ('NUMBER','3'), 2002632Sstever@eecs.umich.edu('PLUS','+'), ('NUMBER','42), ('TIMES','*'), 2012632Sstever@eecs.umich.edu('LPAREN','('), ('ID','s'), ('MINUS','-'), 2022632Sstever@eecs.umich.edu('ID','t'), ('RPAREN',')' 2032632Sstever@eecs.umich.edu</pre> 2042632Sstever@eecs.umich.edu</blockquote> 2052632Sstever@eecs.umich.edu 2062632Sstever@eecs.umich.eduThe identification of tokens is typically done by writing a series of regular expression 2072632Sstever@eecs.umich.edurules. The next section shows how this is done using <tt>lex.py</tt>. 2082632Sstever@eecs.umich.edu 2092632Sstever@eecs.umich.edu<H3><a name="ply_nn4"></a>3.1 Lex Example</H3> 2102632Sstever@eecs.umich.edu 2112632Sstever@eecs.umich.edu 2122632Sstever@eecs.umich.eduThe following example shows how <tt>lex.py</tt> is used to write a simple tokenizer. 2132632Sstever@eecs.umich.edu 2142632Sstever@eecs.umich.edu<blockquote> 2152632Sstever@eecs.umich.edu<pre> 2162632Sstever@eecs.umich.edu# ------------------------------------------------------------ 2172632Sstever@eecs.umich.edu# calclex.py 2182632Sstever@eecs.umich.edu# 2192632Sstever@eecs.umich.edu# tokenizer for a simple expression evaluator for 2202632Sstever@eecs.umich.edu# numbers and +,-,*,/ 2212632Sstever@eecs.umich.edu# ------------------------------------------------------------ 2222632Sstever@eecs.umich.eduimport ply.lex as lex 2232632Sstever@eecs.umich.edu 2242632Sstever@eecs.umich.edu# List of token names. This is always required 2252632Sstever@eecs.umich.edutokens = ( 2262632Sstever@eecs.umich.edu 'NUMBER', 2272632Sstever@eecs.umich.edu 'PLUS', 2282632Sstever@eecs.umich.edu 'MINUS', 2292632Sstever@eecs.umich.edu 'TIMES', 2302632Sstever@eecs.umich.edu 'DIVIDE', 2312632Sstever@eecs.umich.edu 'LPAREN', 2322632Sstever@eecs.umich.edu 'RPAREN', 2332632Sstever@eecs.umich.edu) 2342632Sstever@eecs.umich.edu 2352632Sstever@eecs.umich.edu# Regular expression rules for simple tokens 2362632Sstever@eecs.umich.edut_PLUS = r'\+' 2372632Sstever@eecs.umich.edut_MINUS = r'-' 2382632Sstever@eecs.umich.edut_TIMES = r'\*' 2392632Sstever@eecs.umich.edut_DIVIDE = r'/' 2402632Sstever@eecs.umich.edut_LPAREN = r'\(' 2412632Sstever@eecs.umich.edut_RPAREN = r'\)' 2422632Sstever@eecs.umich.edu 2432632Sstever@eecs.umich.edu# A regular expression rule with some action code 2442632Sstever@eecs.umich.edudef t_NUMBER(t): 2452632Sstever@eecs.umich.edu r'\d+' 2462632Sstever@eecs.umich.edu try: 2472632Sstever@eecs.umich.edu t.value = int(t.value) 2482632Sstever@eecs.umich.edu except ValueError: 2492632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 2502632Sstever@eecs.umich.edu t.value = 0 2512632Sstever@eecs.umich.edu return t 2522632Sstever@eecs.umich.edu 2532632Sstever@eecs.umich.edu# Define a rule so we can track line numbers 2542632Sstever@eecs.umich.edudef t_newline(t): 2552632Sstever@eecs.umich.edu r'\n+' 2562632Sstever@eecs.umich.edu t.lexer.lineno += len(t.value) 2572632Sstever@eecs.umich.edu 2582632Sstever@eecs.umich.edu# A string containing ignored characters (spaces and tabs) 2592632Sstever@eecs.umich.edut_ignore = ' \t' 2602632Sstever@eecs.umich.edu 2612632Sstever@eecs.umich.edu# Error handling rule 2622632Sstever@eecs.umich.edudef t_error(t): 2632632Sstever@eecs.umich.edu print "Illegal character '%s'" % t.value[0] 2642632Sstever@eecs.umich.edu t.lexer.skip(1) 2652632Sstever@eecs.umich.edu 2662632Sstever@eecs.umich.edu# Build the lexer 2672632Sstever@eecs.umich.edulex.lex() 2682632Sstever@eecs.umich.edu 2692632Sstever@eecs.umich.edu</pre> 2702632Sstever@eecs.umich.edu</blockquote> 2712632Sstever@eecs.umich.eduTo use the lexer, you first need to feed it some input text using its <tt>input()</tt> method. After that, repeated calls to <tt>token()</tt> produce tokens. The following code shows how this works: 2722632Sstever@eecs.umich.edu 2732632Sstever@eecs.umich.edu<blockquote> 2742632Sstever@eecs.umich.edu<pre> 2752632Sstever@eecs.umich.edu 2762632Sstever@eecs.umich.edu# Test it out 2772632Sstever@eecs.umich.edudata = ''' 2782632Sstever@eecs.umich.edu3 + 4 * 10 2792632Sstever@eecs.umich.edu + -20 *2 2802632Sstever@eecs.umich.edu''' 2812632Sstever@eecs.umich.edu 2822632Sstever@eecs.umich.edu# Give the lexer some input 2832632Sstever@eecs.umich.edulex.input(data) 2842632Sstever@eecs.umich.edu 2852632Sstever@eecs.umich.edu# Tokenize 2862632Sstever@eecs.umich.eduwhile 1: 2872632Sstever@eecs.umich.edu tok = lex.token() 2882632Sstever@eecs.umich.edu if not tok: break # No more input 2892632Sstever@eecs.umich.edu print tok 2902632Sstever@eecs.umich.edu</pre> 2912632Sstever@eecs.umich.edu</blockquote> 2922632Sstever@eecs.umich.edu 2932632Sstever@eecs.umich.eduWhen executed, the example will produce the following output: 2942632Sstever@eecs.umich.edu 2952632Sstever@eecs.umich.edu<blockquote> 2962632Sstever@eecs.umich.edu<pre> 2972632Sstever@eecs.umich.edu$ python example.py 2982632Sstever@eecs.umich.eduLexToken(NUMBER,3,2,1) 2992632Sstever@eecs.umich.eduLexToken(PLUS,'+',2,3) 3002632Sstever@eecs.umich.eduLexToken(NUMBER,4,2,5) 3012632Sstever@eecs.umich.eduLexToken(TIMES,'*',2,7) 3022632Sstever@eecs.umich.eduLexToken(NUMBER,10,2,10) 3032632Sstever@eecs.umich.eduLexToken(PLUS,'+',3,14) 3042632Sstever@eecs.umich.eduLexToken(MINUS,'-',3,16) 3052632Sstever@eecs.umich.eduLexToken(NUMBER,20,3,18) 3062632Sstever@eecs.umich.eduLexToken(TIMES,'*',3,20) 3072632Sstever@eecs.umich.eduLexToken(NUMBER,2,3,21) 3082632Sstever@eecs.umich.edu</pre> 3092632Sstever@eecs.umich.edu</blockquote> 3102632Sstever@eecs.umich.edu 3112632Sstever@eecs.umich.eduThe tokens returned by <tt>lex.token()</tt> are instances 3122632Sstever@eecs.umich.eduof <tt>LexToken</tt>. This object has 3132632Sstever@eecs.umich.eduattributes <tt>tok.type</tt>, <tt>tok.value</tt>, 3142632Sstever@eecs.umich.edu<tt>tok.lineno</tt>, and <tt>tok.lexpos</tt>. The following code shows an example of 3152632Sstever@eecs.umich.eduaccessing these attributes: 3162632Sstever@eecs.umich.edu 3172632Sstever@eecs.umich.edu<blockquote> 3182632Sstever@eecs.umich.edu<pre> 3192632Sstever@eecs.umich.edu# Tokenize 3202632Sstever@eecs.umich.eduwhile 1: 3212632Sstever@eecs.umich.edu tok = lex.token() 3222632Sstever@eecs.umich.edu if not tok: break # No more input 3232632Sstever@eecs.umich.edu print tok.type, tok.value, tok.line, tok.lexpos 3242632Sstever@eecs.umich.edu</pre> 3252632Sstever@eecs.umich.edu</blockquote> 3262632Sstever@eecs.umich.edu 3272632Sstever@eecs.umich.eduThe <tt>tok.type</tt> and <tt>tok.value</tt> attributes contain the 3282632Sstever@eecs.umich.edutype and value of the token itself. 3292632Sstever@eecs.umich.edu<tt>tok.line</tt> and <tt>tok.lexpos</tt> contain information about 3302632Sstever@eecs.umich.eduthe location of the token. <tt>tok.lexpos</tt> is the index of the 3312632Sstever@eecs.umich.edutoken relative to the start of the input text. 3322632Sstever@eecs.umich.edu 3332632Sstever@eecs.umich.edu<H3><a name="ply_nn5"></a>3.2 The tokens list</H3> 3342632Sstever@eecs.umich.edu 3352632Sstever@eecs.umich.edu 3362632Sstever@eecs.umich.eduAll lexers must provide a list <tt>tokens</tt> that defines all of the possible token 3372632Sstever@eecs.umich.edunames that can be produced by the lexer. This list is always required 3382632Sstever@eecs.umich.eduand is used to perform a variety of validation checks. The tokens list is also used by the 3392632Sstever@eecs.umich.edu<tt>yacc.py</tt> module to identify terminals. 3402632Sstever@eecs.umich.edu 3412632Sstever@eecs.umich.edu<p> 3422632Sstever@eecs.umich.eduIn the example, the following code specified the token names: 3432632Sstever@eecs.umich.edu 3442632Sstever@eecs.umich.edu<blockquote> 3452632Sstever@eecs.umich.edu<pre> 3462632Sstever@eecs.umich.edutokens = ( 3472632Sstever@eecs.umich.edu 'NUMBER', 3482632Sstever@eecs.umich.edu 'PLUS', 3492632Sstever@eecs.umich.edu 'MINUS', 3502632Sstever@eecs.umich.edu 'TIMES', 3512632Sstever@eecs.umich.edu 'DIVIDE', 3522632Sstever@eecs.umich.edu 'LPAREN', 3532632Sstever@eecs.umich.edu 'RPAREN', 3542632Sstever@eecs.umich.edu) 3552632Sstever@eecs.umich.edu</pre> 3562632Sstever@eecs.umich.edu</blockquote> 3572632Sstever@eecs.umich.edu 3582632Sstever@eecs.umich.edu<H3><a name="ply_nn6"></a>3.3 Specification of tokens</H3> 3592632Sstever@eecs.umich.edu 3602632Sstever@eecs.umich.edu 3612632Sstever@eecs.umich.eduEach token is specified by writing a regular expression rule. Each of these rules are 3622632Sstever@eecs.umich.eduare defined by making declarations with a special prefix <tt>t_</tt> to indicate that it 3632632Sstever@eecs.umich.edudefines a token. For simple tokens, the regular expression can 3642632Sstever@eecs.umich.edube specified as strings such as this (note: Python raw strings are used since they are the 3652632Sstever@eecs.umich.edumost convenient way to write regular expression strings): 3662632Sstever@eecs.umich.edu 3672632Sstever@eecs.umich.edu<blockquote> 3682632Sstever@eecs.umich.edu<pre> 3692632Sstever@eecs.umich.edut_PLUS = r'\+' 3702632Sstever@eecs.umich.edu</pre> 3712632Sstever@eecs.umich.edu</blockquote> 3722632Sstever@eecs.umich.edu 3732632Sstever@eecs.umich.eduIn this case, the name following the <tt>t_</tt> must exactly match one of the 3742632Sstever@eecs.umich.edunames supplied in <tt>tokens</tt>. If some kind of action needs to be performed, 3752632Sstever@eecs.umich.edua token rule can be specified as a function. For example, this rule matches numbers and 3762632Sstever@eecs.umich.educonverts the string into a Python integer. 3772632Sstever@eecs.umich.edu 3782632Sstever@eecs.umich.edu<blockquote> 3792632Sstever@eecs.umich.edu<pre> 3802632Sstever@eecs.umich.edudef t_NUMBER(t): 3812632Sstever@eecs.umich.edu r'\d+' 3822632Sstever@eecs.umich.edu try: 3832632Sstever@eecs.umich.edu t.value = int(t.value) 3842632Sstever@eecs.umich.edu except ValueError: 3852632Sstever@eecs.umich.edu print "Number %s is too large!" % t.value 3862632Sstever@eecs.umich.edu t.value = 0 3872632Sstever@eecs.umich.edu return t 3882632Sstever@eecs.umich.edu</pre> 3892632Sstever@eecs.umich.edu</blockquote> 3902632Sstever@eecs.umich.edu 3912632Sstever@eecs.umich.eduWhen a function is used, the regular expression rule is specified in the function documentation string. 3922632Sstever@eecs.umich.eduThe function always takes a single argument which is an instance of 3932632Sstever@eecs.umich.edu<tt>LexToken</tt>. This object has attributes of <tt>t.type</tt> which is the token type (as a string), 3942632Sstever@eecs.umich.edu<tt>t.value</tt> which is the lexeme (the actual text matched), <tt>t.lineno</tt> which is the current line number, and <tt>t.lexpos</tt> which 3952632Sstever@eecs.umich.eduis the position of the token relative to the beginning of the input text. 3962632Sstever@eecs.umich.eduBy default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix. The action 3972632Sstever@eecs.umich.edufunction can modify the contents of the <tt>LexToken</tt> object as appropriate. However, 3982632Sstever@eecs.umich.eduwhen it is done, the resulting token should be returned. If no value is returned by the action 3992632Sstever@eecs.umich.edufunction, the token is simply discarded and the next token read. 4002632Sstever@eecs.umich.edu 4012632Sstever@eecs.umich.edu<p> 4022632Sstever@eecs.umich.eduInternally, <tt>lex.py</tt> uses the <tt>re</tt> module to do its patten matching. When building the master regular expression, 4032632Sstever@eecs.umich.edurules are added in the following order: 4042632Sstever@eecs.umich.edu<p> 4052632Sstever@eecs.umich.edu<ol> 4062632Sstever@eecs.umich.edu<li>All tokens defined by functions are added in the same order as they appear in the lexer file. 4072632Sstever@eecs.umich.edu<li>Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions 4082632Sstever@eecs.umich.eduare added first). 4092632Sstever@eecs.umich.edu</ol> 4102632Sstever@eecs.umich.edu<p> 4112632Sstever@eecs.umich.eduWithout this ordering, it can be difficult to correctly match certain types of tokens. For example, if you 4122632Sstever@eecs.umich.eduwanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first. By sorting regular 4132632Sstever@eecs.umich.eduexpressions in order of decreasing length, this problem is solved for rules defined as strings. For functions, 4142632Sstever@eecs.umich.eduthe order can be explicitly controlled since rules appearing first are checked first. 4152632Sstever@eecs.umich.edu 4162632Sstever@eecs.umich.edu<p> 4172632Sstever@eecs.umich.eduTo handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function 4182632Sstever@eecs.umich.edulike this: 4192632Sstever@eecs.umich.edu 4202632Sstever@eecs.umich.edu<blockquote> 4212632Sstever@eecs.umich.edu<pre> 4222632Sstever@eecs.umich.edureserved = { 4232632Sstever@eecs.umich.edu 'if' : 'IF', 4242632Sstever@eecs.umich.edu 'then' : 'THEN', 4252632Sstever@eecs.umich.edu 'else' : 'ELSE', 4262632Sstever@eecs.umich.edu 'while' : 'WHILE', 4272632Sstever@eecs.umich.edu ... 4282632Sstever@eecs.umich.edu} 4292632Sstever@eecs.umich.edu 4302632Sstever@eecs.umich.edudef t_ID(t): 4312632Sstever@eecs.umich.edu r'[a-zA-Z_][a-zA-Z_0-9]*' 4322632Sstever@eecs.umich.edu t.type = reserved.get(t.value,'ID') # Check for reserved words 4332632Sstever@eecs.umich.edu return t 4342632Sstever@eecs.umich.edu</pre> 4352632Sstever@eecs.umich.edu</blockquote> 4362632Sstever@eecs.umich.edu 4372632Sstever@eecs.umich.eduThis approach greatly reduces the number of regular expression rules and is likely to make things a little faster. 4382632Sstever@eecs.umich.edu 4392632Sstever@eecs.umich.edu<p> 4402632Sstever@eecs.umich.edu<b>Note:</b> You should avoid writing individual rules for reserved words. For example, if you write rules like this, 4412632Sstever@eecs.umich.edu 4422632Sstever@eecs.umich.edu<blockquote> 4432632Sstever@eecs.umich.edu<pre> 4442632Sstever@eecs.umich.edut_FOR = r'for' 4452632Sstever@eecs.umich.edut_PRINT = r'print' 4462632Sstever@eecs.umich.edu</pre> 4472632Sstever@eecs.umich.edu</blockquote> 4482632Sstever@eecs.umich.edu 4492632Sstever@eecs.umich.eduthose rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed". This is probably not 4502632Sstever@eecs.umich.eduwhat you want. 4512632Sstever@eecs.umich.edu 4522632Sstever@eecs.umich.edu<H3><a name="ply_nn7"></a>3.4 Token values</H3> 4532632Sstever@eecs.umich.edu 4542632Sstever@eecs.umich.edu 4552632Sstever@eecs.umich.eduWhen tokens are returned by lex, they have a value that is stored in the <tt>value</tt> attribute. Normally, the value is the text 4562632Sstever@eecs.umich.eduthat was matched. However, the value can be assigned to any Python object. For instance, when lexing identifiers, you may 4572632Sstever@eecs.umich.eduwant to return both the identifier name and information from some sort of symbol table. To do this, you might write a rule like this: 4582632Sstever@eecs.umich.edu 4592632Sstever@eecs.umich.edu<blockquote> 4602632Sstever@eecs.umich.edu<pre> 4612632Sstever@eecs.umich.edudef t_ID(t): 4622632Sstever@eecs.umich.edu ... 4632632Sstever@eecs.umich.edu # Look up symbol table information and return a tuple 4642632Sstever@eecs.umich.edu t.value = (t.value, symbol_lookup(t.value)) 4652632Sstever@eecs.umich.edu ... 4662632Sstever@eecs.umich.edu return t 4672632Sstever@eecs.umich.edu</pre> 4682632Sstever@eecs.umich.edu</blockquote> 4692632Sstever@eecs.umich.edu 4702632Sstever@eecs.umich.eduIt is important to note that storing data in other attribute names is <em>not</em> recommended. The <tt>yacc.py</tt> module only exposes the 4712632Sstever@eecs.umich.educontents of the <tt>value</tt> attribute. Thus, accessing other attributes may be unnecessarily awkward. 4722632Sstever@eecs.umich.edu 4732632Sstever@eecs.umich.edu<H3><a name="ply_nn8"></a>3.5 Discarded tokens</H3> 4742632Sstever@eecs.umich.edu 4752632Sstever@eecs.umich.edu 4762632Sstever@eecs.umich.eduTo discard a token, such as a comment, simply define a token rule that returns no value. For example: 4772632Sstever@eecs.umich.edu 4782632Sstever@eecs.umich.edu<blockquote> 4792632Sstever@eecs.umich.edu<pre> 4802632Sstever@eecs.umich.edudef t_COMMENT(t): 4812632Sstever@eecs.umich.edu r'\#.*' 4822632Sstever@eecs.umich.edu pass 4832632Sstever@eecs.umich.edu # No return value. Token discarded 4842632Sstever@eecs.umich.edu</pre> 4852632Sstever@eecs.umich.edu</blockquote> 4862632Sstever@eecs.umich.edu 4872632Sstever@eecs.umich.eduAlternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored. For example: 4882632Sstever@eecs.umich.edu 4892632Sstever@eecs.umich.edu<blockquote> 4902632Sstever@eecs.umich.edu<pre> 4912632Sstever@eecs.umich.edut_ignore_COMMENT = r'\#.*' 4922632Sstever@eecs.umich.edu</pre> 4932632Sstever@eecs.umich.edu</blockquote> 4942632Sstever@eecs.umich.edu 4952632Sstever@eecs.umich.eduBe advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise 4962632Sstever@eecs.umich.educontrol over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are 4972632Sstever@eecs.umich.edusorted by regular expression length). 4982632Sstever@eecs.umich.edu 4992632Sstever@eecs.umich.edu<H3><a name="ply_nn9"></a>3.6 Line numbers and positional information</H3> 5002632Sstever@eecs.umich.edu 5012632Sstever@eecs.umich.edu 5022632Sstever@eecs.umich.edu<p>By default, <tt>lex.py</tt> knows nothing about line numbers. This is because <tt>lex.py</tt> doesn't know anything 5032632Sstever@eecs.umich.eduabout what constitutes a "line" of input (e.g., the newline character or even if the input is textual data). 5042632Sstever@eecs.umich.eduTo update this information, you need to write a special rule. In the example, the <tt>t_newline()</tt> rule shows how to do this. 5052632Sstever@eecs.umich.edu 5062632Sstever@eecs.umich.edu<blockquote> 5072632Sstever@eecs.umich.edu<pre> 5082632Sstever@eecs.umich.edu# Define a rule so we can track line numbers 5092632Sstever@eecs.umich.edudef t_newline(t): 5102632Sstever@eecs.umich.edu r'\n+' 5112632Sstever@eecs.umich.edu t.lexer.lineno += len(t.value) 5122632Sstever@eecs.umich.edu</pre> 5132632Sstever@eecs.umich.edu</blockquote> 5142632Sstever@eecs.umich.eduWithin the rule, the <tt>lineno</tt> attribute of the underlying lexer <tt>t.lexer</tt> is updated. 5152632Sstever@eecs.umich.eduAfter the line number is updated, the token is simply discarded since nothing is returned. 5162632Sstever@eecs.umich.edu 5172632Sstever@eecs.umich.edu<p> 5182632Sstever@eecs.umich.edu<tt>lex.py</tt> does not perform and kind of automatic column tracking. However, it does record positional 5192632Sstever@eecs.umich.eduinformation related to each token in the <tt>lexpos</tt> attribute. Using this, it is usually possible to compute 5202632Sstever@eecs.umich.educolumn information as a separate step. For instance, just count backwards until you reach a newline. 5212632Sstever@eecs.umich.edu 5222632Sstever@eecs.umich.edu<blockquote> 5232632Sstever@eecs.umich.edu<pre> 5242632Sstever@eecs.umich.edu# Compute column. 5252632Sstever@eecs.umich.edu# input is the input text string 5262632Sstever@eecs.umich.edu# token is a token instance 5272632Sstever@eecs.umich.edudef find_column(input,token): 5282632Sstever@eecs.umich.edu i = token.lexpos 5292632Sstever@eecs.umich.edu while i > 0: 5302632Sstever@eecs.umich.edu if input[i] == '\n': break 5312632Sstever@eecs.umich.edu i -= 1 5322632Sstever@eecs.umich.edu column = (token.lexpos - i)+1 5332632Sstever@eecs.umich.edu return column 5342632Sstever@eecs.umich.edu</pre> 5352632Sstever@eecs.umich.edu</blockquote> 5362632Sstever@eecs.umich.edu 5372632Sstever@eecs.umich.eduSince column information is often only useful in the context of error handling, calculating the column 5382632Sstever@eecs.umich.eduposition can be performed when needed as opposed to doing it for each token. 5392632Sstever@eecs.umich.edu 5402632Sstever@eecs.umich.edu<H3><a name="ply_nn10"></a>3.7 Ignored characters</H3> 5412632Sstever@eecs.umich.edu 5422632Sstever@eecs.umich.edu 5432632Sstever@eecs.umich.edu<p> 5442632Sstever@eecs.umich.eduThe special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters 5452632Sstever@eecs.umich.eduthat should be completely ignored in the input stream. 5462632Sstever@eecs.umich.eduUsually this is used to skip over whitespace and other non-essential characters. 5472632Sstever@eecs.umich.eduAlthough it is possible to define a regular expression rule for whitespace in a manner 5482632Sstever@eecs.umich.edusimilar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better 5492632Sstever@eecs.umich.edulexing performance because it is handled as a special case and is checked in a much 5502632Sstever@eecs.umich.edumore efficient manner than the normal regular expression rules. 5512632Sstever@eecs.umich.edu 5522632Sstever@eecs.umich.edu<H3><a name="ply_nn11"></a>3.8 Literal characters</H3> 5532632Sstever@eecs.umich.edu 5542632Sstever@eecs.umich.edu 5552632Sstever@eecs.umich.edu<p> 5562632Sstever@eecs.umich.eduLiteral characters can be specified by defining a variable <tt>literals</tt> in your lexing module. For example: 5572632Sstever@eecs.umich.edu 5582632Sstever@eecs.umich.edu<blockquote> 5592632Sstever@eecs.umich.edu<pre> 5602632Sstever@eecs.umich.eduliterals = [ '+','-','*','/' ] 5612632Sstever@eecs.umich.edu</pre> 5622632Sstever@eecs.umich.edu</blockquote> 5632632Sstever@eecs.umich.edu 5642632Sstever@eecs.umich.eduor alternatively 5652632Sstever@eecs.umich.edu 5662632Sstever@eecs.umich.edu<blockquote> 5672632Sstever@eecs.umich.edu<pre> 5682632Sstever@eecs.umich.eduliterals = "+-*/" 5692632Sstever@eecs.umich.edu</pre> 5702632Sstever@eecs.umich.edu</blockquote> 5712632Sstever@eecs.umich.edu 5722632Sstever@eecs.umich.eduA literal character is simply a single character that is returned "as is" when encountered by the lexer. Literals are checked 5732632Sstever@eecs.umich.eduafter all of the defined regular expression rules. Thus, if a rule starts with one of the literal characters, it will always 5742632Sstever@eecs.umich.edutake precedence. 5752632Sstever@eecs.umich.edu<p> 5762632Sstever@eecs.umich.eduWhen a literal token is returned, both its <tt>type</tt> and <tt>value</tt> attributes are set to the character itself. For example, <tt>'+'</tt>. 5772632Sstever@eecs.umich.edu 5782632Sstever@eecs.umich.edu<H3><a name="ply_nn12"></a>3.9 Error handling</H3> 5792632Sstever@eecs.umich.edu 5802632Sstever@eecs.umich.edu 5812632Sstever@eecs.umich.edu<p> 5822632Sstever@eecs.umich.eduFinally, the <tt>t_error()</tt> 5832632Sstever@eecs.umich.edufunction is used to handle lexing errors that occur when illegal 5842632Sstever@eecs.umich.educharacters are detected. In this case, the <tt>t.value</tt> attribute contains the 5852632Sstever@eecs.umich.edurest of the input string that has not been tokenized. In the example, the error function 5862632Sstever@eecs.umich.eduwas defined as follows: 5872632Sstever@eecs.umich.edu 5882632Sstever@eecs.umich.edu<blockquote> 5892632Sstever@eecs.umich.edu<pre> 5902632Sstever@eecs.umich.edu# Error handling rule 5912632Sstever@eecs.umich.edudef t_error(t): 5922632Sstever@eecs.umich.edu print "Illegal character '%s'" % t.value[0] 5932632Sstever@eecs.umich.edu t.lexer.skip(1) 5942632Sstever@eecs.umich.edu</pre> 5952632Sstever@eecs.umich.edu</blockquote> 5962632Sstever@eecs.umich.edu 5972632Sstever@eecs.umich.eduIn this case, we simply print the offending character and skip ahead one character by calling <tt>t.lexer.skip(1)</tt>. 5982632Sstever@eecs.umich.edu 5992632Sstever@eecs.umich.edu<H3><a name="ply_nn13"></a>3.10 Building and using the lexer</H3> 6002632Sstever@eecs.umich.edu 6012632Sstever@eecs.umich.edu 6022632Sstever@eecs.umich.edu<p> 6032632Sstever@eecs.umich.eduTo build the lexer, the function <tt>lex.lex()</tt> is used. This function 6042632Sstever@eecs.umich.eduuses Python reflection (or introspection) to read the the regular expression rules 6052632Sstever@eecs.umich.eduout of the calling context and build the lexer. Once the lexer has been built, two functions can 6062632Sstever@eecs.umich.edube used to control the lexer. 6072632Sstever@eecs.umich.edu 6082632Sstever@eecs.umich.edu<ul> 6092632Sstever@eecs.umich.edu<li><tt>lex.input(data)</tt>. Reset the lexer and store a new input string. 6102632Sstever@eecs.umich.edu<li><tt>lex.token()</tt>. Return the next token. Returns a special <tt>LexToken</tt> instance on success or 6112632Sstever@eecs.umich.eduNone if the end of the input text has been reached. 6122632Sstever@eecs.umich.edu</ul> 6132632Sstever@eecs.umich.edu 6142632Sstever@eecs.umich.eduIf desired, the lexer can also be used as an object. The <tt>lex()</tt> returns a <tt>Lexer</tt> object that 6152632Sstever@eecs.umich.educan be used for this purpose. For example: 6162632Sstever@eecs.umich.edu 6172632Sstever@eecs.umich.edu<blockquote> 6182632Sstever@eecs.umich.edu<pre> 6192632Sstever@eecs.umich.edulexer = lex.lex() 6202632Sstever@eecs.umich.edulexer.input(sometext) 6212632Sstever@eecs.umich.eduwhile 1: 6222632Sstever@eecs.umich.edu tok = lexer.token() 6232632Sstever@eecs.umich.edu if not tok: break 6242632Sstever@eecs.umich.edu print tok 6252632Sstever@eecs.umich.edu</pre> 6262632Sstever@eecs.umich.edu</blockquote> 6272632Sstever@eecs.umich.edu 6282632Sstever@eecs.umich.edu<p> 6292632Sstever@eecs.umich.eduThis latter technique should be used if you intend to use multiple lexers in your application. Simply define each 6302632Sstever@eecs.umich.edulexer in its own module and use the object returned by <tt>lex()</tt> as appropriate. 6312632Sstever@eecs.umich.edu 6322632Sstever@eecs.umich.edu<p> 6332632Sstever@eecs.umich.eduNote: The global functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> 6342632Sstever@eecs.umich.eduand <tt>token()</tt> methods of the last lexer created by the lex module. 6352632Sstever@eecs.umich.edu 6362632Sstever@eecs.umich.edu<H3><a name="ply_nn14"></a>3.11 The @TOKEN decorator</H3> 6372632Sstever@eecs.umich.edu 6382632Sstever@eecs.umich.edu 6392632Sstever@eecs.umich.eduIn some applications, you may want to define build tokens from as a series of 6402632Sstever@eecs.umich.edumore complex regular expression rules. For example: 6412632Sstever@eecs.umich.edu 6422632Sstever@eecs.umich.edu<blockquote> 6432632Sstever@eecs.umich.edu<pre> 6442632Sstever@eecs.umich.edudigit = r'([0-9])' 6452632Sstever@eecs.umich.edunondigit = r'([_A-Za-z])' 6462632Sstever@eecs.umich.eduidentifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)' 6472632Sstever@eecs.umich.edu 6482632Sstever@eecs.umich.edudef t_ID(t): 6492632Sstever@eecs.umich.edu # want docstring to be identifier above. ????? 6502632Sstever@eecs.umich.edu ... 6512632Sstever@eecs.umich.edu</pre> 6522632Sstever@eecs.umich.edu</blockquote> 6532632Sstever@eecs.umich.edu 6542632Sstever@eecs.umich.eduIn this case, we want the regular expression rule for <tt>ID</tt> to be one of the variables above. However, there is no 6552632Sstever@eecs.umich.eduway to directly specify this using a normal documentation string. To solve this problem, you can use the <tt>@TOKEN</tt> 6562632Sstever@eecs.umich.edudecorator. For example: 6572632Sstever@eecs.umich.edu 6582632Sstever@eecs.umich.edu<blockquote> 6592632Sstever@eecs.umich.edu<pre> 6602632Sstever@eecs.umich.edufrom ply.lex import TOKEN 6612632Sstever@eecs.umich.edu 6622632Sstever@eecs.umich.edu@TOKEN(identifier) 6632632Sstever@eecs.umich.edudef t_ID(t): 6642632Sstever@eecs.umich.edu ... 6652632Sstever@eecs.umich.edu</pre> 6662632Sstever@eecs.umich.edu</blockquote> 6672632Sstever@eecs.umich.edu 6682632Sstever@eecs.umich.eduThis will attach <tt>identifier</tt> to the docstring for <tt>t_ID()</tt> allowing <tt>lex.py</tt> to work normally. An alternative 6692632Sstever@eecs.umich.eduapproach this problem is to set the docstring directly like this: 6702632Sstever@eecs.umich.edu 6712632Sstever@eecs.umich.edu<blockquote> 6722632Sstever@eecs.umich.edu<pre> 6732632Sstever@eecs.umich.edudef t_ID(t): 6742632Sstever@eecs.umich.edu ... 6752632Sstever@eecs.umich.edu 6762632Sstever@eecs.umich.edut_ID.__doc__ = identifier 6772632Sstever@eecs.umich.edu</pre> 6782632Sstever@eecs.umich.edu</blockquote> 6792632Sstever@eecs.umich.edu 6802632Sstever@eecs.umich.edu<b>NOTE:</b> Use of <tt>@TOKEN</tt> requires Python-2.4 or newer. If you're concerned about backwards compatibility with older 6812632Sstever@eecs.umich.eduversions of Python, use the alternative approach of setting the docstring directly. 6822632Sstever@eecs.umich.edu 6832632Sstever@eecs.umich.edu<H3><a name="ply_nn15"></a>3.12 Optimized mode</H3> 6842632Sstever@eecs.umich.edu 6852632Sstever@eecs.umich.edu 6862632Sstever@eecs.umich.eduFor improved performance, it may be desirable to use Python's 6872632Sstever@eecs.umich.eduoptimized mode (e.g., running Python with the <tt>-O</tt> 6882632Sstever@eecs.umich.eduoption). However, doing so causes Python to ignore documentation 6892632Sstever@eecs.umich.edustrings. This presents special problems for <tt>lex.py</tt>. To 6902632Sstever@eecs.umich.eduhandle this case, you can create your lexer using 6912632Sstever@eecs.umich.eduthe <tt>optimize</tt> option as follows: 6922632Sstever@eecs.umich.edu 6932632Sstever@eecs.umich.edu<blockquote> 6942632Sstever@eecs.umich.edu<pre> 6952632Sstever@eecs.umich.edulexer = lex.lex(optimize=1) 6962632Sstever@eecs.umich.edu</pre> 6972632Sstever@eecs.umich.edu</blockquote> 6982632Sstever@eecs.umich.edu 6992632Sstever@eecs.umich.eduNext, run Python in its normal operating mode. When you do 7002632Sstever@eecs.umich.eduthis, <tt>lex.py</tt> will write a file called <tt>lextab.py</tt> to 7012632Sstever@eecs.umich.eduthe current directory. This file contains all of the regular 7022632Sstever@eecs.umich.eduexpression rules and tables used during lexing. On subsequent 7032632Sstever@eecs.umich.eduexecutions, 7042632Sstever@eecs.umich.edu<tt>lextab.py</tt> will simply be imported to build the lexer. This 7052632Sstever@eecs.umich.eduapproach substantially improves the startup time of the lexer and it 7062632Sstever@eecs.umich.eduworks in Python's optimized mode. 7072632Sstever@eecs.umich.edu 7082632Sstever@eecs.umich.edu<p> 7092632Sstever@eecs.umich.eduTo change the name of the lexer-generated file, use the <tt>lextab</tt> keyword argument. For example: 7102632Sstever@eecs.umich.edu 7112632Sstever@eecs.umich.edu<blockquote> 7122632Sstever@eecs.umich.edu<pre> 7132632Sstever@eecs.umich.edulexer = lex.lex(optimize=1,lextab="footab") 7142632Sstever@eecs.umich.edu</pre> 7152632Sstever@eecs.umich.edu</blockquote> 7162632Sstever@eecs.umich.edu 7172632Sstever@eecs.umich.eduWhen running in optimized mode, it is important to note that lex disables most error checking. Thus, this is really only recommended 7182632Sstever@eecs.umich.eduif you're sure everything is working correctly and you're ready to start releasing production code. 7192632Sstever@eecs.umich.edu 7202632Sstever@eecs.umich.edu<H3><a name="ply_nn16"></a>3.13 Debugging</H3> 7212632Sstever@eecs.umich.edu 7222632Sstever@eecs.umich.edu 7232632Sstever@eecs.umich.eduFor the purpose of debugging, you can run <tt>lex()</tt> in a debugging mode as follows: 7242632Sstever@eecs.umich.edu 7252632Sstever@eecs.umich.edu<blockquote> 7262632Sstever@eecs.umich.edu<pre> 7272632Sstever@eecs.umich.edulexer = lex.lex(debug=1) 7282632Sstever@eecs.umich.edu</pre> 7292632Sstever@eecs.umich.edu</blockquote> 7302632Sstever@eecs.umich.edu 7312632Sstever@eecs.umich.eduThis will result in a large amount of debugging information to be printed including all of the added rules and the master 7322632Sstever@eecs.umich.eduregular expressions. 7332632Sstever@eecs.umich.edu 7342632Sstever@eecs.umich.eduIn addition, <tt>lex.py</tt> comes with a simple main function which 7352632Sstever@eecs.umich.eduwill either tokenize input read from standard input or from a file specified 7362632Sstever@eecs.umich.eduon the command line. To use it, simply put this in your lexer: 7372632Sstever@eecs.umich.edu 7382632Sstever@eecs.umich.edu<blockquote> 7392632Sstever@eecs.umich.edu<pre> 7402632Sstever@eecs.umich.eduif __name__ == '__main__': 7412632Sstever@eecs.umich.edu lex.runmain() 7422632Sstever@eecs.umich.edu</pre> 7432632Sstever@eecs.umich.edu</blockquote> 7442632Sstever@eecs.umich.edu 7452632Sstever@eecs.umich.edu<H3><a name="ply_nn17"></a>3.14 Alternative specification of lexers</H3> 7462632Sstever@eecs.umich.edu 7472632Sstever@eecs.umich.edu 7482632Sstever@eecs.umich.eduAs shown in the example, lexers are specified all within one Python module. If you want to 7492632Sstever@eecs.umich.eduput token rules in a different module from the one in which you invoke <tt>lex()</tt>, use the 7502632Sstever@eecs.umich.edu<tt>module</tt> keyword argument. 7512632Sstever@eecs.umich.edu 7522632Sstever@eecs.umich.edu<p> 7532632Sstever@eecs.umich.eduFor example, you might have a dedicated module that just contains 7542632Sstever@eecs.umich.eduthe token rules: 7552632Sstever@eecs.umich.edu 7562632Sstever@eecs.umich.edu<blockquote> 7572632Sstever@eecs.umich.edu<pre> 7582632Sstever@eecs.umich.edu# module: tokrules.py 7592632Sstever@eecs.umich.edu# This module just contains the lexing rules 7602632Sstever@eecs.umich.edu 7612632Sstever@eecs.umich.edu# List of token names. This is always required 7622632Sstever@eecs.umich.edutokens = ( 7632632Sstever@eecs.umich.edu 'NUMBER', 7642632Sstever@eecs.umich.edu 'PLUS', 7652632Sstever@eecs.umich.edu 'MINUS', 7662632Sstever@eecs.umich.edu 'TIMES', 7672632Sstever@eecs.umich.edu 'DIVIDE', 7682632Sstever@eecs.umich.edu 'LPAREN', 7692632Sstever@eecs.umich.edu 'RPAREN', 7702632Sstever@eecs.umich.edu) 7712632Sstever@eecs.umich.edu 7722632Sstever@eecs.umich.edu# Regular expression rules for simple tokens 7732632Sstever@eecs.umich.edut_PLUS = r'\+' 7742632Sstever@eecs.umich.edut_MINUS = r'-' 7752632Sstever@eecs.umich.edut_TIMES = r'\*' 7762632Sstever@eecs.umich.edut_DIVIDE = r'/' 7772632Sstever@eecs.umich.edut_LPAREN = r'\(' 7782632Sstever@eecs.umich.edut_RPAREN = r'\)' 7792632Sstever@eecs.umich.edu 7802632Sstever@eecs.umich.edu# A regular expression rule with some action code 7812632Sstever@eecs.umich.edudef t_NUMBER(t): 7822632Sstever@eecs.umich.edu r'\d+' 7832632Sstever@eecs.umich.edu try: 7842632Sstever@eecs.umich.edu t.value = int(t.value) 7852632Sstever@eecs.umich.edu except ValueError: 7862632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 7872632Sstever@eecs.umich.edu t.value = 0 7882632Sstever@eecs.umich.edu return t 7892632Sstever@eecs.umich.edu 7902632Sstever@eecs.umich.edu# Define a rule so we can track line numbers 7912632Sstever@eecs.umich.edudef t_newline(t): 7922632Sstever@eecs.umich.edu r'\n+' 7932632Sstever@eecs.umich.edu t.lexer.lineno += len(t.value) 7942632Sstever@eecs.umich.edu 7952632Sstever@eecs.umich.edu# A string containing ignored characters (spaces and tabs) 7962632Sstever@eecs.umich.edut_ignore = ' \t' 7972632Sstever@eecs.umich.edu 7982632Sstever@eecs.umich.edu# Error handling rule 7992632Sstever@eecs.umich.edudef t_error(t): 8002632Sstever@eecs.umich.edu print "Illegal character '%s'" % t.value[0] 8012632Sstever@eecs.umich.edu t.lexer.skip(1) 8022632Sstever@eecs.umich.edu</pre> 8032632Sstever@eecs.umich.edu</blockquote> 8042632Sstever@eecs.umich.edu 8052632Sstever@eecs.umich.eduNow, if you wanted to build a tokenizer from these rules from within a different module, you would do the following (shown for Python interactive mode): 8062632Sstever@eecs.umich.edu 8072632Sstever@eecs.umich.edu<blockquote> 8082632Sstever@eecs.umich.edu<pre> 8092632Sstever@eecs.umich.edu>>> import tokrules 8102632Sstever@eecs.umich.edu>>> <b>lexer = lex.lex(module=tokrules)</b> 8112632Sstever@eecs.umich.edu>>> lexer.input("3 + 4") 8122632Sstever@eecs.umich.edu>>> lexer.token() 8132632Sstever@eecs.umich.eduLexToken(NUMBER,3,1,1,0) 8142632Sstever@eecs.umich.edu>>> lexer.token() 8152632Sstever@eecs.umich.eduLexToken(PLUS,'+',1,2) 8162632Sstever@eecs.umich.edu>>> lexer.token() 8172632Sstever@eecs.umich.eduLexToken(NUMBER,4,1,4) 8182632Sstever@eecs.umich.edu>>> lexer.token() 8192632Sstever@eecs.umich.eduNone 8202632Sstever@eecs.umich.edu>>> 8212632Sstever@eecs.umich.edu</pre> 8222632Sstever@eecs.umich.edu</blockquote> 8232632Sstever@eecs.umich.edu 8242632Sstever@eecs.umich.eduThe <tt>object</tt> option can be used to define lexers as a class instead of a module. For example: 8252632Sstever@eecs.umich.edu 8262632Sstever@eecs.umich.edu<blockquote> 8272632Sstever@eecs.umich.edu<pre> 8282632Sstever@eecs.umich.eduimport ply.lex as lex 8292632Sstever@eecs.umich.edu 8302632Sstever@eecs.umich.educlass MyLexer: 8312632Sstever@eecs.umich.edu # List of token names. This is always required 8322632Sstever@eecs.umich.edu tokens = ( 8332632Sstever@eecs.umich.edu 'NUMBER', 8342632Sstever@eecs.umich.edu 'PLUS', 8352632Sstever@eecs.umich.edu 'MINUS', 8362632Sstever@eecs.umich.edu 'TIMES', 8372632Sstever@eecs.umich.edu 'DIVIDE', 8382632Sstever@eecs.umich.edu 'LPAREN', 8392632Sstever@eecs.umich.edu 'RPAREN', 8402632Sstever@eecs.umich.edu ) 8412632Sstever@eecs.umich.edu 8422632Sstever@eecs.umich.edu # Regular expression rules for simple tokens 8432632Sstever@eecs.umich.edu t_PLUS = r'\+' 8442632Sstever@eecs.umich.edu t_MINUS = r'-' 8452632Sstever@eecs.umich.edu t_TIMES = r'\*' 8462632Sstever@eecs.umich.edu t_DIVIDE = r'/' 8472632Sstever@eecs.umich.edu t_LPAREN = r'\(' 8482632Sstever@eecs.umich.edu t_RPAREN = r'\)' 8492632Sstever@eecs.umich.edu 8502632Sstever@eecs.umich.edu # A regular expression rule with some action code 8512632Sstever@eecs.umich.edu # Note addition of self parameter since we're in a class 8522632Sstever@eecs.umich.edu def t_NUMBER(self,t): 8532632Sstever@eecs.umich.edu r'\d+' 8542632Sstever@eecs.umich.edu try: 8552632Sstever@eecs.umich.edu t.value = int(t.value) 8562632Sstever@eecs.umich.edu except ValueError: 8572632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 8582632Sstever@eecs.umich.edu t.value = 0 8592632Sstever@eecs.umich.edu return t 8602632Sstever@eecs.umich.edu 8612632Sstever@eecs.umich.edu # Define a rule so we can track line numbers 8622632Sstever@eecs.umich.edu def t_newline(self,t): 8632632Sstever@eecs.umich.edu r'\n+' 8642632Sstever@eecs.umich.edu t.lexer.lineno += len(t.value) 8652632Sstever@eecs.umich.edu 8662632Sstever@eecs.umich.edu # A string containing ignored characters (spaces and tabs) 8672632Sstever@eecs.umich.edu t_ignore = ' \t' 8682632Sstever@eecs.umich.edu 8692632Sstever@eecs.umich.edu # Error handling rule 8702632Sstever@eecs.umich.edu def t_error(self,t): 8712632Sstever@eecs.umich.edu print "Illegal character '%s'" % t.value[0] 8722632Sstever@eecs.umich.edu t.lexer.skip(1) 8732632Sstever@eecs.umich.edu 8742632Sstever@eecs.umich.edu <b># Build the lexer 8752632Sstever@eecs.umich.edu def build(self,**kwargs): 8762632Sstever@eecs.umich.edu self.lexer = lex.lex(object=self, **kwargs)</b> 8772632Sstever@eecs.umich.edu 8782632Sstever@eecs.umich.edu # Test it output 8792632Sstever@eecs.umich.edu def test(self,data): 8802632Sstever@eecs.umich.edu self.lexer.input(data) 8812632Sstever@eecs.umich.edu while 1: 8822632Sstever@eecs.umich.edu tok = lexer.token() 8832632Sstever@eecs.umich.edu if not tok: break 8842632Sstever@eecs.umich.edu print tok 8852632Sstever@eecs.umich.edu 8862632Sstever@eecs.umich.edu# Build the lexer and try it out 8872632Sstever@eecs.umich.edum = MyLexer() 8882632Sstever@eecs.umich.edum.build() # Build the lexer 8892632Sstever@eecs.umich.edum.test("3 + 4") # Test it 8902632Sstever@eecs.umich.edu</pre> 8912632Sstever@eecs.umich.edu</blockquote> 8922632Sstever@eecs.umich.edu 8932632Sstever@eecs.umich.eduFor reasons that are subtle, you should <em>NOT</em> invoke <tt>lex.lex()</tt> inside the <tt>__init__()</tt> method of your class. If you 8942632Sstever@eecs.umich.edudo, it may cause bizarre behavior if someone tries to duplicate a lexer object. Keep reading. 8952632Sstever@eecs.umich.edu 8962632Sstever@eecs.umich.edu<H3><a name="ply_nn18"></a>3.15 Maintaining state</H3> 8972632Sstever@eecs.umich.edu 8982632Sstever@eecs.umich.edu 8992632Sstever@eecs.umich.eduIn your lexer, you may want to maintain a variety of state information. This might include mode settings, symbol tables, and other details. There are a few 9002632Sstever@eecs.umich.edudifferent ways to handle this situation. First, you could just keep some global variables: 9012632Sstever@eecs.umich.edu 9022632Sstever@eecs.umich.edu<blockquote> 9032632Sstever@eecs.umich.edu<pre> 9042632Sstever@eecs.umich.edunum_count = 0 9052632Sstever@eecs.umich.edudef t_NUMBER(t): 9062632Sstever@eecs.umich.edu r'\d+' 9072632Sstever@eecs.umich.edu global num_count 9082632Sstever@eecs.umich.edu num_count += 1 9092632Sstever@eecs.umich.edu try: 9102632Sstever@eecs.umich.edu t.value = int(t.value) 9112632Sstever@eecs.umich.edu except ValueError: 9122632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 9132632Sstever@eecs.umich.edu t.value = 0 9142632Sstever@eecs.umich.edu return t 9152632Sstever@eecs.umich.edu</pre> 9162632Sstever@eecs.umich.edu</blockquote> 9172632Sstever@eecs.umich.edu 9182632Sstever@eecs.umich.eduAlternatively, you can store this information inside the Lexer object created by <tt>lex()</tt>. To this, you can use the <tt>lexer</tt> attribute 9192632Sstever@eecs.umich.eduof tokens passed to the various rules. For example: 9202632Sstever@eecs.umich.edu 9212632Sstever@eecs.umich.edu<blockquote> 9222632Sstever@eecs.umich.edu<pre> 9232632Sstever@eecs.umich.edudef t_NUMBER(t): 9242632Sstever@eecs.umich.edu r'\d+' 9252632Sstever@eecs.umich.edu t.lexer.num_count += 1 # Note use of lexer attribute 9262632Sstever@eecs.umich.edu try: 9272632Sstever@eecs.umich.edu t.value = int(t.value) 9282632Sstever@eecs.umich.edu except ValueError: 9292632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 9302632Sstever@eecs.umich.edu t.value = 0 9312632Sstever@eecs.umich.edu return t 9322632Sstever@eecs.umich.edu 9332632Sstever@eecs.umich.edulexer = lex.lex() 9342632Sstever@eecs.umich.edulexer.num_count = 0 # Set the initial count 9352632Sstever@eecs.umich.edu</pre> 9362632Sstever@eecs.umich.edu</blockquote> 9372632Sstever@eecs.umich.edu 9382632Sstever@eecs.umich.eduThis latter approach has the advantage of storing information inside 9392632Sstever@eecs.umich.eduthe lexer itself---something that may be useful if multiple instances 9402632Sstever@eecs.umich.eduof the same lexer have been created. However, it may also feel kind 9412632Sstever@eecs.umich.eduof "hacky" to the purists. Just to put their mind at some ease, all 9422632Sstever@eecs.umich.eduinternal attributes of the lexer (with the exception of <tt>lineno</tt>) have names that are prefixed 9432632Sstever@eecs.umich.eduby <tt>lex</tt> (e.g., <tt>lexdata</tt>,<tt>lexpos</tt>, etc.). Thus, 9442632Sstever@eecs.umich.eduit should be perfectly safe to store attributes in the lexer that 9452632Sstever@eecs.umich.edudon't have names starting with that prefix. 9462632Sstever@eecs.umich.edu 9472632Sstever@eecs.umich.edu<p> 9482632Sstever@eecs.umich.eduA third approach is to define the lexer as a class as shown in the previous example: 9492632Sstever@eecs.umich.edu 9502632Sstever@eecs.umich.edu<blockquote> 9512632Sstever@eecs.umich.edu<pre> 9522632Sstever@eecs.umich.educlass MyLexer: 9532632Sstever@eecs.umich.edu ... 9542632Sstever@eecs.umich.edu def t_NUMBER(self,t): 9552632Sstever@eecs.umich.edu r'\d+' 9562632Sstever@eecs.umich.edu self.num_count += 1 9572632Sstever@eecs.umich.edu try: 9582632Sstever@eecs.umich.edu t.value = int(t.value) 9592632Sstever@eecs.umich.edu except ValueError: 9602632Sstever@eecs.umich.edu print "Line %d: Number %s is too large!" % (t.lineno,t.value) 9612632Sstever@eecs.umich.edu t.value = 0 9622632Sstever@eecs.umich.edu return t 9632632Sstever@eecs.umich.edu 9642632Sstever@eecs.umich.edu def build(self, **kwargs): 9652632Sstever@eecs.umich.edu self.lexer = lex.lex(object=self,**kwargs) 9662632Sstever@eecs.umich.edu 9672632Sstever@eecs.umich.edu def __init__(self): 9682632Sstever@eecs.umich.edu self.num_count = 0 9692632Sstever@eecs.umich.edu 9702632Sstever@eecs.umich.edu# Create a lexer 9712632Sstever@eecs.umich.edum = MyLexer() 9722632Sstever@eecs.umich.edulexer = lex.lex(object=m) 9732632Sstever@eecs.umich.edu</pre> 9742632Sstever@eecs.umich.edu</blockquote> 9752632Sstever@eecs.umich.edu 9762632Sstever@eecs.umich.eduThe class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and 9772632Sstever@eecs.umich.eduyou need to manage a lot of state. 9782632Sstever@eecs.umich.edu 9792632Sstever@eecs.umich.edu<H3><a name="ply_nn19"></a>3.16 Duplicating lexers</H3> 9802632Sstever@eecs.umich.edu 9812632Sstever@eecs.umich.edu 9822632Sstever@eecs.umich.edu<b>NOTE: I am thinking about deprecating this feature. Post comments on <a href="http://groups.google.com/group/ply-hack">ply-hack@googlegroups.com</a> or send me a private email at dave@dabeaz.com.</b> 9832632Sstever@eecs.umich.edu 9842632Sstever@eecs.umich.edu<p> 9852632Sstever@eecs.umich.eduIf necessary, a lexer object can be quickly duplicated by invoking its <tt>clone()</tt> method. For example: 9862632Sstever@eecs.umich.edu 9872632Sstever@eecs.umich.edu<blockquote> 9882632Sstever@eecs.umich.edu<pre> 9892632Sstever@eecs.umich.edulexer = lex.lex() 9902632Sstever@eecs.umich.edu... 9912632Sstever@eecs.umich.edunewlexer = lexer.clone() 9922632Sstever@eecs.umich.edu</pre> 9932632Sstever@eecs.umich.edu</blockquote> 9942632Sstever@eecs.umich.edu 9952632Sstever@eecs.umich.eduWhen a lexer is cloned, the copy is identical to the original lexer, 9962632Sstever@eecs.umich.eduincluding any input text. However, once created, different text can be 9972632Sstever@eecs.umich.edufed to the clone which can be used independently. This capability may 9982632Sstever@eecs.umich.edube useful in situations when you are writing a parser/compiler that 9992632Sstever@eecs.umich.eduinvolves recursive or reentrant processing. For instance, if you 10002632Sstever@eecs.umich.eduneeded to scan ahead in the input for some reason, you could create a 10012632Sstever@eecs.umich.educlone and use it to look ahead. 10022632Sstever@eecs.umich.edu 10032632Sstever@eecs.umich.edu<p> 10042632Sstever@eecs.umich.eduThe advantage of using <tt>clone()</tt> instead of reinvoking <tt>lex()</tt> is 10052632Sstever@eecs.umich.eduthat it is significantly faster. Namely, it is not necessary to re-examine all of the 10062632Sstever@eecs.umich.edutoken rules, build a regular expression, and construct internal tables. All of this 10072632Sstever@eecs.umich.eduinformation can simply be reused in the new lexer. 10082632Sstever@eecs.umich.edu 10092632Sstever@eecs.umich.edu<p> 10102632Sstever@eecs.umich.eduSpecial considerations need to be made when cloning a lexer that is defined as a class. Previous sections 10112632Sstever@eecs.umich.edushowed an example of a class <tt>MyLexer</tt>. If you have the following code: 10122632Sstever@eecs.umich.edu 10132632Sstever@eecs.umich.edu<blockquote> 10142632Sstever@eecs.umich.edu<pre> 10152632Sstever@eecs.umich.edum = MyLexer() 10162632Sstever@eecs.umich.edua = lex.lex(object=m) # Create a lexer 10172632Sstever@eecs.umich.edu 10182632Sstever@eecs.umich.edub = a.clone() # Clone the lexer 10192632Sstever@eecs.umich.edu</pre> 10202632Sstever@eecs.umich.edu</blockquote> 10212632Sstever@eecs.umich.edu 10222632Sstever@eecs.umich.eduThen both <tt>a</tt> and <tt>b</tt> are going to be bound to the same 10232632Sstever@eecs.umich.eduobject <tt>m</tt>. If the object <tt>m</tt> contains internal state 10242632Sstever@eecs.umich.edurelated to lexing, this sharing may lead to quite a bit of confusion. To fix this, 10252632Sstever@eecs.umich.eduthe <tt>clone()</tt> method accepts an optional argument that can be used to supply a new object. This 10262632Sstever@eecs.umich.educan be used to clone the lexer and bind it to a new instance. For example: 10272632Sstever@eecs.umich.edu 10282632Sstever@eecs.umich.edu<blockquote> 10292632Sstever@eecs.umich.edu<pre> 10302632Sstever@eecs.umich.edum = MyLexer() # Create a lexer 10312632Sstever@eecs.umich.edua = lex.lex(object=m) 10322632Sstever@eecs.umich.edu 10332632Sstever@eecs.umich.edu# Create a clone 10342632Sstever@eecs.umich.edun = MyLexer() # New instance of MyLexer 10352632Sstever@eecs.umich.edub = a.clone(n) # New lexer bound to n 10362632Sstever@eecs.umich.edu</pre> 10372632Sstever@eecs.umich.edu</blockquote> 10382632Sstever@eecs.umich.edu 10392632Sstever@eecs.umich.eduIt may make sense to encapsulate all of this inside a method: 10402632Sstever@eecs.umich.edu 10412632Sstever@eecs.umich.edu<blockquote> 10422632Sstever@eecs.umich.edu<pre> 10432632Sstever@eecs.umich.educlass MyLexer: 10442632Sstever@eecs.umich.edu ... 10452632Sstever@eecs.umich.edu def clone(self): 10462632Sstever@eecs.umich.edu c = MyLexer() # Create a new instance of myself 10472632Sstever@eecs.umich.edu # Copy attributes from self to c as appropriate 10482632Sstever@eecs.umich.edu ... 10492632Sstever@eecs.umich.edu # Clone the lexer 10502632Sstever@eecs.umich.edu c.lexer = self.lexer.clone(c) 10512632Sstever@eecs.umich.edu return c 10522632Sstever@eecs.umich.edu</pre> 10532632Sstever@eecs.umich.edu</blockquote> 10542632Sstever@eecs.umich.edu 10552632Sstever@eecs.umich.eduThe fact that a new instance of <tt>MyLexer</tt> may be created while cloning a lexer is the reason why you should never 10562632Sstever@eecs.umich.eduinvoke <tt>lex.lex()</tt> inside <tt>__init__()</tt>. If you do, the lexer will be rebuilt from scratch and you lose 10572632Sstever@eecs.umich.eduall of the performance benefits of using <tt>clone()</tt> in the first place. 10582632Sstever@eecs.umich.edu 10592632Sstever@eecs.umich.edu<H3><a name="ply_nn20"></a>3.17 Internal lexer state</H3> 10602632Sstever@eecs.umich.edu 10612632Sstever@eecs.umich.edu 10622632Sstever@eecs.umich.eduA Lexer object <tt>lexer</tt> has a number of internal attributes that may be useful in certain 10632632Sstever@eecs.umich.edusituations. 10642632Sstever@eecs.umich.edu 10652632Sstever@eecs.umich.edu<p> 10662632Sstever@eecs.umich.edu<tt>lexer.lexpos</tt> 10672632Sstever@eecs.umich.edu<blockquote> 10682632Sstever@eecs.umich.eduThis attribute is an integer that contains the current position within the input text. If you modify 10692632Sstever@eecs.umich.eduthe value, it will change the result of the next call to <tt>token()</tt>. Within token rule functions, this points 10702632Sstever@eecs.umich.eduto the first character <em>after</em> the matched text. If the value is modified within a rule, the next returned token will be 10712632Sstever@eecs.umich.edumatched at the new position. 10722632Sstever@eecs.umich.edu</blockquote> 10732632Sstever@eecs.umich.edu 10742632Sstever@eecs.umich.edu<p> 10752632Sstever@eecs.umich.edu<tt>lexer.lineno</tt> 10762632Sstever@eecs.umich.edu<blockquote> 10772632Sstever@eecs.umich.eduThe current value of the line number attribute stored in the lexer. This can be modified as needed to 10782632Sstever@eecs.umich.educhange the line number. 10792632Sstever@eecs.umich.edu</blockquote> 10802632Sstever@eecs.umich.edu 10812632Sstever@eecs.umich.edu<p> 10822632Sstever@eecs.umich.edu<tt>lexer.lexdata</tt> 10832632Sstever@eecs.umich.edu<blockquote> 10842632Sstever@eecs.umich.eduThe current input text stored in the lexer. This is the string passed with the <tt>input()</tt> method. It 10852632Sstever@eecs.umich.eduwould probably be a bad idea to modify this unless you really know what you're doing. 10862632Sstever@eecs.umich.edu</blockquote> 10872632Sstever@eecs.umich.edu 10882632Sstever@eecs.umich.edu<P> 10892632Sstever@eecs.umich.edu<tt>lexer.lexmatch</tt> 10902632Sstever@eecs.umich.edu<blockquote> 10912632Sstever@eecs.umich.eduThis is the raw <tt>Match</tt> object returned by the Python <tt>re.match()</tt> function (used internally by PLY) for the 10922632Sstever@eecs.umich.educurrent token. If you have written a regular expression that contains named groups, you can use this to retrieve those values. 10932632Sstever@eecs.umich.edu</blockquote> 10942632Sstever@eecs.umich.edu 10952632Sstever@eecs.umich.edu<H3><a name="ply_nn21"></a>3.18 Conditional lexing and start conditions</H3> 10962632Sstever@eecs.umich.edu 10972632Sstever@eecs.umich.edu 10982632Sstever@eecs.umich.eduIn advanced parsing applications, it may be useful to have different 10992632Sstever@eecs.umich.edulexing states. For instance, you may want the occurrence of a certain 11002632Sstever@eecs.umich.edutoken or syntactic construct to trigger a different kind of lexing. 11012632Sstever@eecs.umich.eduPLY supports a feature that allows the underlying lexer to be put into 11022632Sstever@eecs.umich.edua series of different states. Each state can have its own tokens, 11032632Sstever@eecs.umich.edulexing rules, and so forth. The implementation is based largely on 11042632Sstever@eecs.umich.eduthe "start condition" feature of GNU flex. Details of this can be found 11052632Sstever@eecs.umich.eduat <a 11062632Sstever@eecs.umich.eduhref="http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html">http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html.</a>. 11072632Sstever@eecs.umich.edu 11082632Sstever@eecs.umich.edu<p> 11092632Sstever@eecs.umich.eduTo define a new lexing state, it must first be declared. This is done by including a "states" declaration in your 11102632Sstever@eecs.umich.edulex file. For example: 11112632Sstever@eecs.umich.edu 11122632Sstever@eecs.umich.edu<blockquote> 11132632Sstever@eecs.umich.edu<pre> 11142632Sstever@eecs.umich.edustates = ( 11152632Sstever@eecs.umich.edu ('foo','exclusive'), 11162632Sstever@eecs.umich.edu ('bar','inclusive'), 11172632Sstever@eecs.umich.edu) 11182632Sstever@eecs.umich.edu</pre> 11192632Sstever@eecs.umich.edu</blockquote> 11202632Sstever@eecs.umich.edu 11212632Sstever@eecs.umich.eduThis declaration declares two states, <tt>'foo'</tt> 11222632Sstever@eecs.umich.eduand <tt>'bar'</tt>. States may be of two types; <tt>'exclusive'</tt> 11232632Sstever@eecs.umich.eduand <tt>'inclusive'</tt>. An exclusive state completely overrides the 11242632Sstever@eecs.umich.edudefault behavior of the lexer. That is, lex will only return tokens 11252632Sstever@eecs.umich.eduand apply rules defined specifically for that state. An inclusive 11262632Sstever@eecs.umich.edustate adds additional tokens and rules to the default set of rules. 11272632Sstever@eecs.umich.eduThus, lex will return both the tokens defined by default in addition 11282632Sstever@eecs.umich.eduto those defined for the inclusive state. 11292632Sstever@eecs.umich.edu 11302632Sstever@eecs.umich.edu<p> 11312632Sstever@eecs.umich.eduOnce a state has been declared, tokens and rules are declared by including the 11322632Sstever@eecs.umich.edustate name in token/rule declaration. For example: 11332632Sstever@eecs.umich.edu 11342632Sstever@eecs.umich.edu<blockquote> 11352632Sstever@eecs.umich.edu<pre> 11362632Sstever@eecs.umich.edut_foo_NUMBER = r'\d+' # Token 'NUMBER' in state 'foo' 11372632Sstever@eecs.umich.edut_bar_ID = r'[a-zA-Z_][a-zA-Z0-9_]*' # Token 'ID' in state 'bar' 11382632Sstever@eecs.umich.edu 11392632Sstever@eecs.umich.edudef t_foo_newline(t): 11402632Sstever@eecs.umich.edu r'\n' 11412632Sstever@eecs.umich.edu t.lexer.lineno += 1 11422632Sstever@eecs.umich.edu</pre> 11432632Sstever@eecs.umich.edu</blockquote> 11442632Sstever@eecs.umich.edu 11452632Sstever@eecs.umich.eduA token can be declared in multiple states by including multiple state names in the declaration. For example: 11462632Sstever@eecs.umich.edu 11472632Sstever@eecs.umich.edu<blockquote> 11482632Sstever@eecs.umich.edu<pre> 11492632Sstever@eecs.umich.edut_foo_bar_NUMBER = r'\d+' # Defines token 'NUMBER' in both state 'foo' and 'bar' 11502632Sstever@eecs.umich.edu</pre> 11512632Sstever@eecs.umich.edu</blockquote> 11522632Sstever@eecs.umich.edu 11532632Sstever@eecs.umich.eduAlternative, a token can be declared in all states using the 'ANY' in the name. 11542632Sstever@eecs.umich.edu 11552632Sstever@eecs.umich.edu<blockquote> 11562632Sstever@eecs.umich.edu<pre> 11572632Sstever@eecs.umich.edut_ANY_NUMBER = r'\d+' # Defines a token 'NUMBER' in all states 11582632Sstever@eecs.umich.edu</pre> 11592632Sstever@eecs.umich.edu</blockquote> 11602632Sstever@eecs.umich.edu 11612632Sstever@eecs.umich.eduIf no state name is supplied, as is normally the case, the token is associated with a special state <tt>'INITIAL'</tt>. For example, 11622632Sstever@eecs.umich.eduthese two declarations are identical: 11632632Sstever@eecs.umich.edu 11642632Sstever@eecs.umich.edu<blockquote> 11652632Sstever@eecs.umich.edu<pre> 11662632Sstever@eecs.umich.edut_NUMBER = r'\d+' 11672632Sstever@eecs.umich.edut_INITIAL_NUMBER = r'\d+' 11682632Sstever@eecs.umich.edu</pre> 11692632Sstever@eecs.umich.edu</blockquote> 11702632Sstever@eecs.umich.edu 11712632Sstever@eecs.umich.edu<p> 11722632Sstever@eecs.umich.eduStates are also associated with the special <tt>t_ignore</tt> and <tt>t_error()</tt> declarations. For example, if a state treats 11732632Sstever@eecs.umich.eduthese differently, you can declare: 11742632Sstever@eecs.umich.edu 11752632Sstever@eecs.umich.edu<blockquote> 11762632Sstever@eecs.umich.edu<pre> 11772632Sstever@eecs.umich.edut_foo_ignore = " \t\n" # Ignored characters for state 'foo' 11782632Sstever@eecs.umich.edu 11792632Sstever@eecs.umich.edudef t_bar_error(t): # Special error handler for state 'bar' 11802632Sstever@eecs.umich.edu pass 11812632Sstever@eecs.umich.edu</pre> 11822632Sstever@eecs.umich.edu</blockquote> 11832632Sstever@eecs.umich.edu 11842632Sstever@eecs.umich.eduBy default, lexing operates in the <tt>'INITIAL'</tt> state. This state includes all of the normally defined tokens. 11852632Sstever@eecs.umich.eduFor users who aren't using different states, this fact is completely transparent. If, during lexing or parsing, you want to change 11862632Sstever@eecs.umich.eduthe lexing state, use the <tt>begin()</tt> method. For example: 11872632Sstever@eecs.umich.edu 11882632Sstever@eecs.umich.edu<blockquote> 11892632Sstever@eecs.umich.edu<pre> 11902632Sstever@eecs.umich.edudef t_begin_foo(t): 11912632Sstever@eecs.umich.edu r'start_foo' 11922632Sstever@eecs.umich.edu t.lexer.begin('foo') # Starts 'foo' state 11932632Sstever@eecs.umich.edu</pre> 11942632Sstever@eecs.umich.edu</blockquote> 11952632Sstever@eecs.umich.edu 11962632Sstever@eecs.umich.eduTo get out of a state, you use <tt>begin()</tt> to switch back to the initial state. For example: 11972632Sstever@eecs.umich.edu 11982632Sstever@eecs.umich.edu<blockquote> 11992632Sstever@eecs.umich.edu<pre> 12002632Sstever@eecs.umich.edudef t_foo_end(t): 12012632Sstever@eecs.umich.edu r'end_foo' 12022632Sstever@eecs.umich.edu t.lexer.begin('INITIAL') # Back to the initial state 12032632Sstever@eecs.umich.edu</pre> 12042632Sstever@eecs.umich.edu</blockquote> 12052632Sstever@eecs.umich.edu 12062632Sstever@eecs.umich.eduThe management of states can also be done with a stack. For example: 12072632Sstever@eecs.umich.edu 12082632Sstever@eecs.umich.edu<blockquote> 12092632Sstever@eecs.umich.edu<pre> 12102632Sstever@eecs.umich.edudef t_begin_foo(t): 12112632Sstever@eecs.umich.edu r'start_foo' 12122632Sstever@eecs.umich.edu t.lexer.push_state('foo') # Starts 'foo' state 12132632Sstever@eecs.umich.edu 12142632Sstever@eecs.umich.edudef t_foo_end(t): 12152632Sstever@eecs.umich.edu r'end_foo' 12162632Sstever@eecs.umich.edu t.lexer.pop_state() # Back to the previous state 12172632Sstever@eecs.umich.edu</pre> 12182632Sstever@eecs.umich.edu</blockquote> 12192632Sstever@eecs.umich.edu 12202632Sstever@eecs.umich.edu<p> 12212632Sstever@eecs.umich.eduThe use of a stack would be useful in situations where there are many ways of entering a new lexing state and you merely want to go back 12222632Sstever@eecs.umich.eduto the previous state afterwards. 12232632Sstever@eecs.umich.edu 12242632Sstever@eecs.umich.edu<P> 12252632Sstever@eecs.umich.eduAn example might help clarify. Suppose you were writing a parser and you wanted to grab sections of arbitrary C code enclosed by 12262632Sstever@eecs.umich.educurly braces. That is, whenever you encounter a starting brace '{', you want to read all of the enclosed code up to the ending brace '}' 12272632Sstever@eecs.umich.eduand return it as a string. Doing this with a normal regular expression rule is nearly (if not actually) impossible. This is because braces can 12282632Sstever@eecs.umich.edube nested and can be included in comments and strings. Thus, simply matching up to the first matching '}' character isn't good enough. Here is how 12292632Sstever@eecs.umich.eduyou might use lexer states to do this: 12302632Sstever@eecs.umich.edu 12312632Sstever@eecs.umich.edu<blockquote> 12322632Sstever@eecs.umich.edu<pre> 12332632Sstever@eecs.umich.edu# Declare the state 12342632Sstever@eecs.umich.edustates = ( 12352632Sstever@eecs.umich.edu ('ccode','exclusive'), 12362632Sstever@eecs.umich.edu) 12372632Sstever@eecs.umich.edu 12382632Sstever@eecs.umich.edu# Match the first {. Enter ccode state. 12392632Sstever@eecs.umich.edudef t_ccode(t): 12402632Sstever@eecs.umich.edu r'\{' 12412632Sstever@eecs.umich.edu t.lexer.code_start = t.lexer.lexpos # Record the starting position 12422632Sstever@eecs.umich.edu t.lexer.level = 1 # Initial brace level 12432632Sstever@eecs.umich.edu t.lexer.begin('ccode') # Enter 'ccode' state 12442632Sstever@eecs.umich.edu 12452632Sstever@eecs.umich.edu# Rules for the ccode state 12462632Sstever@eecs.umich.edudef t_ccode_lbrace(t): 12472632Sstever@eecs.umich.edu r'\{' 12482632Sstever@eecs.umich.edu t.lexer.level +=1 12492632Sstever@eecs.umich.edu 12502632Sstever@eecs.umich.edudef t_ccode_rbrace(t): 12512632Sstever@eecs.umich.edu r'\}' 12522632Sstever@eecs.umich.edu t.lexer.level -=1 12532632Sstever@eecs.umich.edu 12542632Sstever@eecs.umich.edu # If closing brace, return the code fragment 12552632Sstever@eecs.umich.edu if t.lexer.level == 0: 12562632Sstever@eecs.umich.edu t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1] 12572632Sstever@eecs.umich.edu t.type = "CCODE" 12582632Sstever@eecs.umich.edu t.lexer.lineno += t.value.count('\n') 12592632Sstever@eecs.umich.edu t.lexer.begin('INITIAL') 12602632Sstever@eecs.umich.edu return t 12612632Sstever@eecs.umich.edu 12622632Sstever@eecs.umich.edu# C or C++ comment (ignore) 12632632Sstever@eecs.umich.edudef t_ccode_comment(t): 12642632Sstever@eecs.umich.edu r'(/\*(.|\n)*?*/)|(//.*)' 12652632Sstever@eecs.umich.edu pass 12662632Sstever@eecs.umich.edu 12672632Sstever@eecs.umich.edu# C string 12682632Sstever@eecs.umich.edudef t_ccode_string(t): 12692632Sstever@eecs.umich.edu r'\"([^\\\n]|(\\.))*?\"' 12702632Sstever@eecs.umich.edu 12712632Sstever@eecs.umich.edu# C character literal 12722632Sstever@eecs.umich.edudef t_ccode_char(t): 12732632Sstever@eecs.umich.edu r'\'([^\\\n]|(\\.))*?\'' 12742632Sstever@eecs.umich.edu 12752632Sstever@eecs.umich.edu# Any sequence of non-whitespace characters (not braces, strings) 12762632Sstever@eecs.umich.edudef t_ccode_nonspace(t): 12772632Sstever@eecs.umich.edu r'[^\s\{\}\'\"]+' 12782632Sstever@eecs.umich.edu 12792632Sstever@eecs.umich.edu# Ignored characters (whitespace) 12802632Sstever@eecs.umich.edut_ccode_ignore = " \t\n" 12812632Sstever@eecs.umich.edu 12822632Sstever@eecs.umich.edu# For bad characters, we just skip over it 12832632Sstever@eecs.umich.edudef t_ccode_error(t): 12842632Sstever@eecs.umich.edu t.lexer.skip(1) 12852632Sstever@eecs.umich.edu</pre> 12862632Sstever@eecs.umich.edu</blockquote> 12872632Sstever@eecs.umich.edu 12882632Sstever@eecs.umich.eduIn this example, the occurrence of the first '{' causes the lexer to record the starting position and enter a new state <tt>'ccode'</tt>. A collection of rules then match 12892632Sstever@eecs.umich.eduvarious parts of the input that follow (comments, strings, etc.). All of these rules merely discard the token (by not returning a value). 12902632Sstever@eecs.umich.eduHowever, if the closing right brace is encountered, the rule <tt>t_ccode_rbrace</tt> collects all of the code (using the earlier recorded starting 12912632Sstever@eecs.umich.eduposition), stores it, and returns a token 'CCODE' containing all of that text. When returning the token, the lexing state is restored back to its 12922632Sstever@eecs.umich.eduinitial state. 12932632Sstever@eecs.umich.edu 12942632Sstever@eecs.umich.edu<H3><a name="ply_nn21"></a>3.19 Miscellaneous Issues</H3> 12952632Sstever@eecs.umich.edu 12962632Sstever@eecs.umich.edu 12972632Sstever@eecs.umich.edu<P> 12982632Sstever@eecs.umich.edu<li>The lexer requires input to be supplied as a single input string. Since most machines have more than enough memory, this 12992632Sstever@eecs.umich.edurarely presents a performance concern. However, it means that the lexer currently can't be used with streaming data 13002632Sstever@eecs.umich.edusuch as open files or sockets. This limitation is primarily a side-effect of using the <tt>re</tt> module. 13012632Sstever@eecs.umich.edu 13022632Sstever@eecs.umich.edu<p> 13032632Sstever@eecs.umich.edu<li>The lexer should work properly with both Unicode strings given as token and pattern matching rules as 13042632Sstever@eecs.umich.eduwell as for input text. 13052632Sstever@eecs.umich.edu 13062632Sstever@eecs.umich.edu<p> 13072632Sstever@eecs.umich.edu<li>If you need to supply optional flags to the re.compile() function, use the reflags option to lex. For example: 13082632Sstever@eecs.umich.edu 13092632Sstever@eecs.umich.edu<blockquote> 13102632Sstever@eecs.umich.edu<pre> 13112632Sstever@eecs.umich.edulex.lex(reflags=re.UNICODE) 13122632Sstever@eecs.umich.edu</pre> 13132632Sstever@eecs.umich.edu</blockquote> 13142632Sstever@eecs.umich.edu 13152632Sstever@eecs.umich.edu<p> 13162632Sstever@eecs.umich.edu<li>Since the lexer is written entirely in Python, its performance is 13172632Sstever@eecs.umich.edulargely determined by that of the Python <tt>re</tt> module. Although 13182632Sstever@eecs.umich.eduthe lexer has been written to be as efficient as possible, it's not 13192632Sstever@eecs.umich.edublazingly fast when used on very large input files. If 13202632Sstever@eecs.umich.eduperformance is concern, you might consider upgrading to the most 13212632Sstever@eecs.umich.edurecent version of Python, creating a hand-written lexer, or offloading 13222632Sstever@eecs.umich.eduthe lexer into a C extension module. 13232632Sstever@eecs.umich.edu 13242632Sstever@eecs.umich.edu<p> 13252632Sstever@eecs.umich.eduIf you are going to create a hand-written lexer and you plan to use it with <tt>yacc.py</tt>, 13262632Sstever@eecs.umich.eduit only needs to conform to the following requirements: 13272632Sstever@eecs.umich.edu 13282632Sstever@eecs.umich.edu<ul> 13292632Sstever@eecs.umich.edu<li>It must provide a <tt>token()</tt> method that returns the next token or <tt>None</tt> if no more 13302632Sstever@eecs.umich.edutokens are available. 13312632Sstever@eecs.umich.edu<li>The <tt>token()</tt> method must return an object <tt>tok</tt> that has <tt>type</tt> and <tt>value</tt> attributes. 13322632Sstever@eecs.umich.edu</ul> 13332632Sstever@eecs.umich.edu 13342632Sstever@eecs.umich.edu<H2><a name="ply_nn22"></a>4. Parsing basics</H2> 13352632Sstever@eecs.umich.edu 13362632Sstever@eecs.umich.edu 13372632Sstever@eecs.umich.edu<tt>yacc.py</tt> is used to parse language syntax. Before showing an 13382632Sstever@eecs.umich.eduexample, there are a few important bits of background that must be 13392632Sstever@eecs.umich.edumentioned. First, <em>syntax</em> is usually specified in terms of a BNF grammar. 13402632Sstever@eecs.umich.eduFor example, if you wanted to parse 13412632Sstever@eecs.umich.edusimple arithmetic expressions, you might first write an unambiguous 13422632Sstever@eecs.umich.edugrammar specification like this: 13432632Sstever@eecs.umich.edu 13442632Sstever@eecs.umich.edu<blockquote> 13452632Sstever@eecs.umich.edu<pre> 13462632Sstever@eecs.umich.eduexpression : expression + term 13472632Sstever@eecs.umich.edu | expression - term 13482632Sstever@eecs.umich.edu | term 13492632Sstever@eecs.umich.edu 13502632Sstever@eecs.umich.eduterm : term * factor 13512632Sstever@eecs.umich.edu | term / factor 13522632Sstever@eecs.umich.edu | factor 13532632Sstever@eecs.umich.edu 13542632Sstever@eecs.umich.edufactor : NUMBER 13552632Sstever@eecs.umich.edu | ( expression ) 13562632Sstever@eecs.umich.edu</pre> 13572632Sstever@eecs.umich.edu</blockquote> 13582632Sstever@eecs.umich.edu 13592632Sstever@eecs.umich.eduIn the grammar, symbols such as <tt>NUMBER</tt>, <tt>+</tt>, <tt>-</tt>, <tt>*</tt>, and <tt>/</tt> are known 13602632Sstever@eecs.umich.eduas <em>terminals</em> and correspond to raw input tokens. Identifiers such as <tt>term</tt> and <tt>factor</tt> refer to more 13612632Sstever@eecs.umich.educomplex rules, typically comprised of a collection of tokens. These identifiers are known as <em>non-terminals</em>. 13622632Sstever@eecs.umich.edu<P> 13632632Sstever@eecs.umich.eduThe semantic behavior of a language is often specified using a 13642632Sstever@eecs.umich.edutechnique known as syntax directed translation. In syntax directed 13652632Sstever@eecs.umich.edutranslation, attributes are attached to each symbol in a given grammar 13662632Sstever@eecs.umich.edurule along with an action. Whenever a particular grammar rule is 13672632Sstever@eecs.umich.edurecognized, the action describes what to do. For example, given the 13682632Sstever@eecs.umich.eduexpression grammar above, you might write the specification for a 13692632Sstever@eecs.umich.edusimple calculator like this: 13702632Sstever@eecs.umich.edu 13712632Sstever@eecs.umich.edu<blockquote> 13722632Sstever@eecs.umich.edu<pre> 13732632Sstever@eecs.umich.eduGrammar Action 13742632Sstever@eecs.umich.edu-------------------------------- -------------------------------------------- 13752632Sstever@eecs.umich.eduexpression0 : expression1 + term expression0.val = expression1.val + term.val 13762632Sstever@eecs.umich.edu | expression1 - term expression0.val = expression1.val - term.val 13772632Sstever@eecs.umich.edu | term expression0.val = term.val 13782632Sstever@eecs.umich.edu 13792632Sstever@eecs.umich.eduterm0 : term1 * factor term0.val = term1.val * factor.val 13802632Sstever@eecs.umich.edu | term1 / factor term0.val = term1.val / factor.val 13812632Sstever@eecs.umich.edu | factor term0.val = factor.val 13822632Sstever@eecs.umich.edu 13832632Sstever@eecs.umich.edufactor : NUMBER factor.val = int(NUMBER.lexval) 13842632Sstever@eecs.umich.edu | ( expression ) factor.val = expression.val 13852632Sstever@eecs.umich.edu</pre> 13862632Sstever@eecs.umich.edu</blockquote> 13872632Sstever@eecs.umich.edu 13882632Sstever@eecs.umich.eduA good way to think about syntax directed translation is to simply think of each symbol in the grammar as some 13892632Sstever@eecs.umich.edukind of object. The semantics of the language are then expressed as a collection of methods/operations on these 13902632Sstever@eecs.umich.eduobjects. 13912632Sstever@eecs.umich.edu 13922632Sstever@eecs.umich.edu<p> 13932632Sstever@eecs.umich.eduYacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a 13942632Sstever@eecs.umich.edubottom up technique that tries to recognize the right-hand-side of various grammar rules. 13952632Sstever@eecs.umich.eduWhenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the 13962632Sstever@eecs.umich.edugrammar symbols are replaced by the grammar symbol on the left-hand-side. 13972632Sstever@eecs.umich.edu 13982632Sstever@eecs.umich.edu<p> 13992632Sstever@eecs.umich.eduLR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next 14002632Sstever@eecs.umich.eduinput token for patterns. The details of the algorithm can be found in a compiler text, but the 14012632Sstever@eecs.umich.edufollowing example illustrates the steps that are performed if you wanted to parse the expression 14022632Sstever@eecs.umich.edu<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above: 14032632Sstever@eecs.umich.edu 14042632Sstever@eecs.umich.edu<blockquote> 14052632Sstever@eecs.umich.edu<pre> 14062632Sstever@eecs.umich.eduStep Symbol Stack Input Tokens Action 14072632Sstever@eecs.umich.edu---- --------------------- --------------------- ------------------------------- 14082632Sstever@eecs.umich.edu1 $ 3 + 5 * ( 10 - 20 )$ Shift 3 14092632Sstever@eecs.umich.edu2 $ 3 + 5 * ( 10 - 20 )$ Reduce factor : NUMBER 14102632Sstever@eecs.umich.edu3 $ factor + 5 * ( 10 - 20 )$ Reduce term : factor 14112632Sstever@eecs.umich.edu4 $ term + 5 * ( 10 - 20 )$ Reduce expr : term 14122632Sstever@eecs.umich.edu5 $ expr + 5 * ( 10 - 20 )$ Shift + 14132632Sstever@eecs.umich.edu6 $ expr + 5 * ( 10 - 20 )$ Shift 5 14142632Sstever@eecs.umich.edu7 $ expr + 5 * ( 10 - 20 )$ Reduce factor : NUMBER 14152632Sstever@eecs.umich.edu8 $ expr + factor * ( 10 - 20 )$ Reduce term : factor 14162632Sstever@eecs.umich.edu9 $ expr + term * ( 10 - 20 )$ Shift * 14172632Sstever@eecs.umich.edu10 $ expr + term * ( 10 - 20 )$ Shift ( 14182632Sstever@eecs.umich.edu11 $ expr + term * ( 10 - 20 )$ Shift 10 14192632Sstever@eecs.umich.edu12 $ expr + term * ( 10 - 20 )$ Reduce factor : NUMBER 14202632Sstever@eecs.umich.edu13 $ expr + term * ( factor - 20 )$ Reduce term : factor 14212632Sstever@eecs.umich.edu14 $ expr + term * ( term - 20 )$ Reduce expr : term 14222632Sstever@eecs.umich.edu15 $ expr + term * ( expr - 20 )$ Shift - 14232632Sstever@eecs.umich.edu16 $ expr + term * ( expr - 20 )$ Shift 20 14242632Sstever@eecs.umich.edu17 $ expr + term * ( expr - 20 )$ Reduce factor : NUMBER 14252632Sstever@eecs.umich.edu18 $ expr + term * ( expr - factor )$ Reduce term : factor 14262632Sstever@eecs.umich.edu19 $ expr + term * ( expr - term )$ Reduce expr : expr - term 14272632Sstever@eecs.umich.edu20 $ expr + term * ( expr )$ Shift ) 14282632Sstever@eecs.umich.edu21 $ expr + term * ( expr ) $ Reduce factor : (expr) 14292632Sstever@eecs.umich.edu22 $ expr + term * factor $ Reduce term : term * factor 14302632Sstever@eecs.umich.edu23 $ expr + term $ Reduce expr : expr + term 14312632Sstever@eecs.umich.edu24 $ expr $ Reduce expr 14322632Sstever@eecs.umich.edu25 $ $ Success! 14332632Sstever@eecs.umich.edu</pre> 14342632Sstever@eecs.umich.edu</blockquote> 14352632Sstever@eecs.umich.edu 14362632Sstever@eecs.umich.eduWhen parsing the expression, an underlying state machine and the current input token determine what to do next. 14372632Sstever@eecs.umich.eduIf the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted 14382632Sstever@eecs.umich.eduonto the stack. If the top of the stack contains a valid right-hand-side of a grammar rule, it is 14392632Sstever@eecs.umich.eduusually "reduced" and the symbols replaced with the symbol on the left-hand-side. When this reduction occurs, the 14402632Sstever@eecs.umich.eduappropriate action is triggered (if defined). If the input token can't be shifted and the top of stack doesn't match 14412632Sstever@eecs.umich.eduany grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out). 14422632Sstever@eecs.umich.edu 14432632Sstever@eecs.umich.edu<p> 14442632Sstever@eecs.umich.eduIt is important to note that the underlying implementation is built around a large finite-state machine that is encoded 14452632Sstever@eecs.umich.eduin a collection of tables. The construction of these tables is quite complicated and beyond the scope of this discussion. 14462632Sstever@eecs.umich.eduHowever, subtle details of this process explain why, in the example above, the parser chooses to shift a token 14472632Sstever@eecs.umich.eduonto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>. 14482632Sstever@eecs.umich.edu 14492632Sstever@eecs.umich.edu<H2><a name="ply_nn23"></a>5. Yacc reference</H2> 14502632Sstever@eecs.umich.edu 14512632Sstever@eecs.umich.edu 14522632Sstever@eecs.umich.eduThis section describes how to use write parsers in PLY. 14532632Sstever@eecs.umich.edu 14542632Sstever@eecs.umich.edu<H3><a name="ply_nn24"></a>5.1 An example</H3> 14552632Sstever@eecs.umich.edu 14562632Sstever@eecs.umich.edu 14572632Sstever@eecs.umich.eduSuppose you wanted to make a grammar for simple arithmetic expressions as previously described. Here is 14582632Sstever@eecs.umich.eduhow you would do it with <tt>yacc.py</tt>: 14592632Sstever@eecs.umich.edu 14602632Sstever@eecs.umich.edu<blockquote> 14612632Sstever@eecs.umich.edu<pre> 14622632Sstever@eecs.umich.edu# Yacc example 14632632Sstever@eecs.umich.edu 14642632Sstever@eecs.umich.eduimport ply.yacc as yacc 14652632Sstever@eecs.umich.edu 14662632Sstever@eecs.umich.edu# Get the token map from the lexer. This is required. 14672632Sstever@eecs.umich.edufrom calclex import tokens 14682632Sstever@eecs.umich.edu 14692632Sstever@eecs.umich.edudef p_expression_plus(p): 14702632Sstever@eecs.umich.edu 'expression : expression PLUS term' 14712632Sstever@eecs.umich.edu p[0] = p[1] + p[3] 14722632Sstever@eecs.umich.edu 14732632Sstever@eecs.umich.edudef p_expression_minus(p): 14742632Sstever@eecs.umich.edu 'expression : expression MINUS term' 14752632Sstever@eecs.umich.edu p[0] = p[1] - p[3] 14762632Sstever@eecs.umich.edu 14772632Sstever@eecs.umich.edudef p_expression_term(p): 14782632Sstever@eecs.umich.edu 'expression : term' 14792632Sstever@eecs.umich.edu p[0] = p[1] 14802632Sstever@eecs.umich.edu 14812632Sstever@eecs.umich.edudef p_term_times(p): 14822632Sstever@eecs.umich.edu 'term : term TIMES factor' 14832632Sstever@eecs.umich.edu p[0] = p[1] * p[3] 14842632Sstever@eecs.umich.edu 14852632Sstever@eecs.umich.edudef p_term_div(p): 14862632Sstever@eecs.umich.edu 'term : term DIVIDE factor' 14872632Sstever@eecs.umich.edu p[0] = p[1] / p[3] 14882632Sstever@eecs.umich.edu 14892632Sstever@eecs.umich.edudef p_term_factor(p): 14902632Sstever@eecs.umich.edu 'term : factor' 14912632Sstever@eecs.umich.edu p[0] = p[1] 14922632Sstever@eecs.umich.edu 14932632Sstever@eecs.umich.edudef p_factor_num(p): 14942632Sstever@eecs.umich.edu 'factor : NUMBER' 14952632Sstever@eecs.umich.edu p[0] = p[1] 14962632Sstever@eecs.umich.edu 14972632Sstever@eecs.umich.edudef p_factor_expr(p): 14982632Sstever@eecs.umich.edu 'factor : LPAREN expression RPAREN' 14992632Sstever@eecs.umich.edu p[0] = p[2] 15002632Sstever@eecs.umich.edu 15012632Sstever@eecs.umich.edu# Error rule for syntax errors 15022632Sstever@eecs.umich.edudef p_error(p): 15032632Sstever@eecs.umich.edu print "Syntax error in input!" 15042632Sstever@eecs.umich.edu 15052632Sstever@eecs.umich.edu# Build the parser 15062632Sstever@eecs.umich.eduyacc.yacc() 15072632Sstever@eecs.umich.edu 15082632Sstever@eecs.umich.edu# Use this if you want to build the parser using SLR instead of LALR 15092632Sstever@eecs.umich.edu# yacc.yacc(method="SLR") 15102632Sstever@eecs.umich.edu 15112632Sstever@eecs.umich.eduwhile 1: 15122632Sstever@eecs.umich.edu try: 15132632Sstever@eecs.umich.edu s = raw_input('calc > ') 15142632Sstever@eecs.umich.edu except EOFError: 15152632Sstever@eecs.umich.edu break 15162632Sstever@eecs.umich.edu if not s: continue 15172632Sstever@eecs.umich.edu result = yacc.parse(s) 15182632Sstever@eecs.umich.edu print result 15192632Sstever@eecs.umich.edu</pre> 15202632Sstever@eecs.umich.edu</blockquote> 15212632Sstever@eecs.umich.edu 15222632Sstever@eecs.umich.eduIn this example, each grammar rule is defined by a Python function where the docstring to that function contains the 15232632Sstever@eecs.umich.eduappropriate context-free grammar specification. Each function accepts a single 15242632Sstever@eecs.umich.eduargument <tt>p</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of 15252632Sstever@eecs.umich.edu<tt>p[i]</tt> are mapped to grammar symbols as shown here: 15262632Sstever@eecs.umich.edu 15272632Sstever@eecs.umich.edu<blockquote> 15282632Sstever@eecs.umich.edu<pre> 15292632Sstever@eecs.umich.edudef p_expression_plus(p): 15302632Sstever@eecs.umich.edu 'expression : expression PLUS term' 15312632Sstever@eecs.umich.edu # ^ ^ ^ ^ 15322632Sstever@eecs.umich.edu # p[0] p[1] p[2] p[3] 15332632Sstever@eecs.umich.edu 15342632Sstever@eecs.umich.edu p[0] = p[1] + p[3] 15352632Sstever@eecs.umich.edu</pre> 15362632Sstever@eecs.umich.edu</blockquote> 15372632Sstever@eecs.umich.edu 15382632Sstever@eecs.umich.eduFor tokens, the "value" of the corresponding <tt>p[i]</tt> is the 15392632Sstever@eecs.umich.edu<em>same</em> as the <tt>p.value</tt> attribute assigned 15402632Sstever@eecs.umich.eduin the lexer module. For non-terminals, the value is determined by 15412632Sstever@eecs.umich.eduwhatever is placed in <tt>p[0]</tt> when rules are reduced. This 15422632Sstever@eecs.umich.eduvalue can be anything at all. However, it probably most common for 15432632Sstever@eecs.umich.eduthe value to be a simple Python type, a tuple, or an instance. In this example, we 15442632Sstever@eecs.umich.eduare relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value 15452632Sstever@eecs.umich.edufield. All of the other rules simply perform various types of integer operations and store 15462632Sstever@eecs.umich.eduthe result. 15472632Sstever@eecs.umich.edu 15482632Sstever@eecs.umich.edu<P> 15492632Sstever@eecs.umich.eduNote: The use of negative indices have a special meaning in yacc---specially <tt>p[-1]</tt> does 15502632Sstever@eecs.umich.edunot have the same value as <tt>p[3]</tt> in this example. Please see the section on "Embedded Actions" for further 15512632Sstever@eecs.umich.edudetails. 15522632Sstever@eecs.umich.edu 15532632Sstever@eecs.umich.edu<p> 15542632Sstever@eecs.umich.eduThe first rule defined in the yacc specification determines the starting grammar 15552632Sstever@eecs.umich.edusymbol (in this case, a rule for <tt>expression</tt> appears first). Whenever 15562632Sstever@eecs.umich.eduthe starting rule is reduced by the parser and no more input is available, parsing 15572632Sstever@eecs.umich.edustops and the final value is returned (this value will be whatever the top-most rule 15582632Sstever@eecs.umich.eduplaced in <tt>p[0]</tt>). Note: an alternative starting symbol can be specified using the <tt>start</tt> keyword argument to 15592632Sstever@eecs.umich.edu<tt>yacc()</tt>. 15602632Sstever@eecs.umich.edu 15612632Sstever@eecs.umich.edu<p>The <tt>p_error(p)</tt> rule is defined to catch syntax errors. See the error handling section 15622632Sstever@eecs.umich.edubelow for more detail. 15632632Sstever@eecs.umich.edu 15642632Sstever@eecs.umich.edu<p> 15652632Sstever@eecs.umich.eduTo build the parser, call the <tt>yacc.yacc()</tt> function. This function 15662632Sstever@eecs.umich.edulooks at the module and attempts to construct all of the LR parsing tables for the grammar 15672632Sstever@eecs.umich.eduyou have specified. The first time <tt>yacc.yacc()</tt> is invoked, you will get a message 15682632Sstever@eecs.umich.edusuch as this: 15692632Sstever@eecs.umich.edu 15702632Sstever@eecs.umich.edu<blockquote> 15712632Sstever@eecs.umich.edu<pre> 15722632Sstever@eecs.umich.edu$ python calcparse.py 15732632Sstever@eecs.umich.eduyacc: Generating LALR parsing table... 15742632Sstever@eecs.umich.educalc > 15752632Sstever@eecs.umich.edu</pre> 15762632Sstever@eecs.umich.edu</blockquote> 15772632Sstever@eecs.umich.edu 15782632Sstever@eecs.umich.eduSince table construction is relatively expensive (especially for large 15792632Sstever@eecs.umich.edugrammars), the resulting parsing table is written to the current 15802632Sstever@eecs.umich.edudirectory in a file called <tt>parsetab.py</tt>. In addition, a 15812632Sstever@eecs.umich.edudebugging file called <tt>parser.out</tt> is created. On subsequent 15822632Sstever@eecs.umich.eduexecutions, <tt>yacc</tt> will reload the table from 15832632Sstever@eecs.umich.edu<tt>parsetab.py</tt> unless it has detected a change in the underlying 15842632Sstever@eecs.umich.edugrammar (in which case the tables and <tt>parsetab.py</tt> file are 15852632Sstever@eecs.umich.eduregenerated). Note: The names of parser output files can be changed if necessary. See the notes that follow later. 15862632Sstever@eecs.umich.edu 15872632Sstever@eecs.umich.edu<p> 15882632Sstever@eecs.umich.eduIf any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce 15892632Sstever@eecs.umich.edudiagnostic messages and possibly raise an exception. Some of the errors that can be detected include: 15902632Sstever@eecs.umich.edu 15912632Sstever@eecs.umich.edu<ul> 15922632Sstever@eecs.umich.edu<li>Duplicated function names (if more than one rule function have the same name in the grammar file). 15932632Sstever@eecs.umich.edu<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars. 15942632Sstever@eecs.umich.edu<li>Badly specified grammar rules. 15952632Sstever@eecs.umich.edu<li>Infinite recursion (rules that can never terminate). 15962632Sstever@eecs.umich.edu<li>Unused rules and tokens 15972632Sstever@eecs.umich.edu<li>Undefined rules and tokens 15982632Sstever@eecs.umich.edu</ul> 15992632Sstever@eecs.umich.edu 16002632Sstever@eecs.umich.eduThe next few sections now discuss a few finer points of grammar construction. 16012632Sstever@eecs.umich.edu 16022632Sstever@eecs.umich.edu<H3><a name="ply_nn25"></a>5.2 Combining Grammar Rule Functions</H3> 16032632Sstever@eecs.umich.edu 16042632Sstever@eecs.umich.edu 16052632Sstever@eecs.umich.eduWhen grammar rules are similar, they can be combined into a single function. 16062632Sstever@eecs.umich.eduFor example, consider the two rules in our earlier example: 16072632Sstever@eecs.umich.edu 16082632Sstever@eecs.umich.edu<blockquote> 16092632Sstever@eecs.umich.edu<pre> 16102632Sstever@eecs.umich.edudef p_expression_plus(p): 16112632Sstever@eecs.umich.edu 'expression : expression PLUS term' 16122632Sstever@eecs.umich.edu p[0] = p[1] + p[3] 16132632Sstever@eecs.umich.edu 16142632Sstever@eecs.umich.edudef p_expression_minus(t): 16152632Sstever@eecs.umich.edu 'expression : expression MINUS term' 16162632Sstever@eecs.umich.edu p[0] = p[1] - p[3] 16172632Sstever@eecs.umich.edu</pre> 16182632Sstever@eecs.umich.edu</blockquote> 16192632Sstever@eecs.umich.edu 16202632Sstever@eecs.umich.eduInstead of writing two functions, you might write a single function like this: 16212632Sstever@eecs.umich.edu 16222632Sstever@eecs.umich.edu<blockquote> 16232632Sstever@eecs.umich.edu<pre> 16242632Sstever@eecs.umich.edudef p_expression(p): 16252632Sstever@eecs.umich.edu '''expression : expression PLUS term 16262632Sstever@eecs.umich.edu | expression MINUS term''' 16272632Sstever@eecs.umich.edu if p[2] == '+': 16282632Sstever@eecs.umich.edu p[0] = p[1] + p[3] 16292632Sstever@eecs.umich.edu elif p[2] == '-': 16302632Sstever@eecs.umich.edu p[0] = p[1] - p[3] 16312632Sstever@eecs.umich.edu</pre> 16322632Sstever@eecs.umich.edu</blockquote> 16332632Sstever@eecs.umich.edu 16342632Sstever@eecs.umich.eduIn general, the doc string for any given function can contain multiple grammar rules. So, it would 16352632Sstever@eecs.umich.eduhave also been legal (although possibly confusing) to write this: 16362632Sstever@eecs.umich.edu 16372632Sstever@eecs.umich.edu<blockquote> 16382632Sstever@eecs.umich.edu<pre> 16392632Sstever@eecs.umich.edudef p_binary_operators(p): 16402632Sstever@eecs.umich.edu '''expression : expression PLUS term 16412632Sstever@eecs.umich.edu | expression MINUS term 16422632Sstever@eecs.umich.edu term : term TIMES factor 1643 | term DIVIDE factor''' 1644 if p[2] == '+': 1645 p[0] = p[1] + p[3] 1646 elif p[2] == '-': 1647 p[0] = p[1] - p[3] 1648 elif p[2] == '*': 1649 p[0] = p[1] * p[3] 1650 elif p[2] == '/': 1651 p[0] = p[1] / p[3] 1652</pre> 1653</blockquote> 1654 1655When combining grammar rules into a single function, it is usually a good idea for all of the rules to have 1656a similar structure (e.g., the same number of terms). Otherwise, the corresponding action code may be more 1657complicated than necessary. However, it is possible to handle simple cases using len(). For example: 1658 1659<blockquote> 1660<pre> 1661def p_expressions(p): 1662 '''expression : expression MINUS expression 1663 | MINUS expression''' 1664 if (len(p) == 4): 1665 p[0] = p[1] - p[3] 1666 elif (len(p) == 3): 1667 p[0] = -p[2] 1668</pre> 1669</blockquote> 1670 1671<H3><a name="ply_nn26"></a>5.3 Character Literals</H3> 1672 1673 1674If desired, a grammar may contain tokens defined as single character literals. For example: 1675 1676<blockquote> 1677<pre> 1678def p_binary_operators(p): 1679 '''expression : expression '+' term 1680 | expression '-' term 1681 term : term '*' factor 1682 | term '/' factor''' 1683 if p[2] == '+': 1684 p[0] = p[1] + p[3] 1685 elif p[2] == '-': 1686 p[0] = p[1] - p[3] 1687 elif p[2] == '*': 1688 p[0] = p[1] * p[3] 1689 elif p[2] == '/': 1690 p[0] = p[1] / p[3] 1691</pre> 1692</blockquote> 1693 1694A character literal must be enclosed in quotes such as <tt>'+'</tt>. In addition, if literals are used, they must be declared in the 1695corresponding <tt>lex</tt> file through the use of a special <tt>literals</tt> declaration. 1696 1697<blockquote> 1698<pre> 1699# Literals. Should be placed in module given to lex() 1700literals = ['+','-','*','/' ] 1701</pre> 1702</blockquote> 1703 1704<b>Character literals are limited to a single character</b>. Thus, it is not legal to specify literals such as <tt>'<='</tt> or <tt>'=='</tt>. For this, use 1705the normal lexing rules (e.g., define a rule such as <tt>t_EQ = r'=='</tt>). 1706 1707<H3><a name="ply_nn26"></a>5.4 Empty Productions</H3> 1708 1709 1710<tt>yacc.py</tt> can handle empty productions by defining a rule like this: 1711 1712<blockquote> 1713<pre> 1714def p_empty(p): 1715 'empty :' 1716 pass 1717</pre> 1718</blockquote> 1719 1720Now to use the empty production, simply use 'empty' as a symbol. For example: 1721 1722<blockquote> 1723<pre> 1724def p_optitem(p): 1725 'optitem : item' 1726 ' | empty' 1727 ... 1728</pre> 1729</blockquote> 1730 1731Note: You can write empty rules anywhere by simply specifying an empty right hand side. However, I personally find that 1732writing an "empty" rule and using "empty" to denote an empty production is easier to read. 1733 1734<H3><a name="ply_nn28"></a>5.5 Changing the starting symbol</H3> 1735 1736 1737Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule). To change this, simply 1738supply a <tt>start</tt> specifier in your file. For example: 1739 1740<blockquote> 1741<pre> 1742start = 'foo' 1743 1744def p_bar(p): 1745 'bar : A B' 1746 1747# This is the starting rule due to the start specifier above 1748def p_foo(p): 1749 'foo : bar X' 1750... 1751</pre> 1752</blockquote> 1753 1754The use of a <tt>start</tt> specifier may be useful during debugging since you can use it to have yacc build a subset of 1755a larger grammar. For this purpose, it is also possible to specify a starting symbol as an argument to <tt>yacc()</tt>. For example: 1756 1757<blockquote> 1758<pre> 1759yacc.yacc(start='foo') 1760</pre> 1761</blockquote> 1762 1763<H3><a name="ply_nn27"></a>5.6 Dealing With Ambiguous Grammars</H3> 1764 1765 1766The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. 1767However, in many situations, it is extremely difficult or awkward to write grammars in this format. A 1768much more natural way to express the grammar is in a more compact form like this: 1769 1770<blockquote> 1771<pre> 1772expression : expression PLUS expression 1773 | expression MINUS expression 1774 | expression TIMES expression 1775 | expression DIVIDE expression 1776 | LPAREN expression RPAREN 1777 | NUMBER 1778</pre> 1779</blockquote> 1780 1781Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string 1782"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped. 1783For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? 1784 1785<p> 1786When an ambiguous grammar is given to <tt>yacc.py</tt> it will print messages about "shift/reduce conflicts" 1787or a "reduce/reduce conflicts". A shift/reduce conflict is caused when the parser generator can't decide 1788whether or not to reduce a rule or shift a symbol on the parsing stack. For example, consider 1789the string "3 * 4 + 5" and the internal parsing stack: 1790 1791<blockquote> 1792<pre> 1793Step Symbol Stack Input Tokens Action 1794---- --------------------- --------------------- ------------------------------- 17951 $ 3 * 4 + 5$ Shift 3 17962 $ 3 * 4 + 5$ Reduce : expression : NUMBER 17973 $ expr * 4 + 5$ Shift * 17984 $ expr * 4 + 5$ Shift 4 17995 $ expr * 4 + 5$ Reduce: expression : NUMBER 18006 $ expr * expr + 5$ SHIFT/REDUCE CONFLICT ???? 1801</pre> 1802</blockquote> 1803 1804In this case, when the parser reaches step 6, it has two options. One is to reduce the 1805rule <tt>expr : expr * expr</tt> on the stack. The other option is to shift the 1806token <tt>+</tt> on the stack. Both options are perfectly legal from the rules 1807of the context-free-grammar. 1808 1809<p> 1810By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above 1811example, the parser will always shift the <tt>+</tt> instead of reducing. Although this 1812strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic 1813expressions. In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have 1814reduced <tt>expr * expr</tt> since multiplication has higher mathematical precedence than addition. 1815 1816<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual 1817tokens to be assigned a precedence level and associativity. This is done by adding a variable 1818<tt>precedence</tt> to the grammar file like this: 1819 1820<blockquote> 1821<pre> 1822precedence = ( 1823 ('left', 'PLUS', 'MINUS'), 1824 ('left', 'TIMES', 'DIVIDE'), 1825) 1826</pre> 1827</blockquote> 1828 1829This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have 1830the same precedence level and are left-associative and that 1831<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative. 1832Within the <tt>precedence</tt> declaration, tokens are ordered from lowest to highest precedence. Thus, 1833this declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher 1834precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the 1835precedence specification). 1836 1837<p> 1838The precedence specification works by associating a numerical precedence level value and associativity direction to 1839the listed tokens. For example, in the above example you get: 1840 1841<blockquote> 1842<pre> 1843PLUS : level = 1, assoc = 'left' 1844MINUS : level = 1, assoc = 'left' 1845TIMES : level = 2, assoc = 'left' 1846DIVIDE : level = 2, assoc = 'left' 1847</pre> 1848</blockquote> 1849 1850These values are then used to attach a numerical precedence value and associativity direction 1851to each grammar rule. <em>This is always determined by looking at the precedence of the right-most terminal symbol.</em> 1852For example: 1853 1854<blockquote> 1855<pre> 1856expression : expression PLUS expression # level = 1, left 1857 | expression MINUS expression # level = 1, left 1858 | expression TIMES expression # level = 2, left 1859 | expression DIVIDE expression # level = 2, left 1860 | LPAREN expression RPAREN # level = None (not specified) 1861 | NUMBER # level = None (not specified) 1862</pre> 1863</blockquote> 1864 1865When shift/reduce conflicts are encountered, the parser generator resolves the conflict by 1866looking at the precedence rules and associativity specifiers. 1867 1868<p> 1869<ol> 1870<li>If the current token has higher precedence, it is shifted. 1871<li>If the grammar rule on the stack has higher precedence, the rule is reduced. 1872<li>If the current token and the grammar rule have the same precedence, the 1873rule is reduced for left associativity, whereas the token is shifted for right associativity. 1874<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in 1875favor of shifting (the default). 1876</ol> 1877 1878For example, if "expression PLUS expression" has been parsed and the next token 1879is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS". On the other 1880hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action 1881is going to be reduce because "PLUS" has a lower precedence than "TIMES." 1882 1883<p> 1884When shift/reduce conflicts are resolved using the first three techniques (with the help of 1885precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar. 1886 1887<p> 1888One problem with the precedence specifier technique is that it is sometimes necessary to 1889change the precedence of an operator in certain contents. For example, consider a unary-minus operator 1890in "3 + 4 * -5". Normally, unary minus has a very high precedence--being evaluated before the multiply. 1891However, in our precedence specifier, MINUS has a lower precedence than TIMES. To deal with this, 1892precedence rules can be given for fictitious tokens like this: 1893 1894<blockquote> 1895<pre> 1896precedence = ( 1897 ('left', 'PLUS', 'MINUS'), 1898 ('left', 'TIMES', 'DIVIDE'), 1899 ('right', 'UMINUS'), # Unary minus operator 1900) 1901</pre> 1902</blockquote> 1903 1904Now, in the grammar file, we can write our unary minus rule like this: 1905 1906<blockquote> 1907<pre> 1908def p_expr_uminus(p): 1909 'expression : MINUS expression %prec UMINUS' 1910 p[0] = -p[2] 1911</pre> 1912</blockquote> 1913 1914In this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that 1915of UMINUS in the precedence specifier. 1916 1917<p> 1918At first, the use of UMINUS in this example may appear very confusing. 1919UMINUS is not an input token or a grammer rule. Instead, you should 1920think of it as the name of a special marker in the precedence table. When you use the <tt>%prec</tt> qualifier, you're simply 1921telling yacc that you want the precedence of the expression to be the same as for this special marker instead of the usual precedence. 1922 1923<p> 1924It is also possible to specify non-associativity in the <tt>precedence</tt> table. This would 1925be used when you <em>don't</em> want operations to chain together. For example, suppose 1926you wanted to support comparison operators like <tt><</tt> and <tt>></tt> but you didn't want to allow 1927combinations like <tt>a < b < c</tt>. To do this, simply specify a rule like this: 1928 1929<blockquote> 1930<pre> 1931precedence = ( 1932 ('nonassoc', 'LESSTHAN', 'GREATERTHAN'), # Nonassociative operators 1933 ('left', 'PLUS', 'MINUS'), 1934 ('left', 'TIMES', 'DIVIDE'), 1935 ('right', 'UMINUS'), # Unary minus operator 1936) 1937</pre> 1938</blockquote> 1939 1940<p> 1941If you do this, the occurrence of input text such as <tt> a < b < c</tt> will result in a syntax error. However, simple 1942expressions such as <tt>a < b</tt> will still be fine. 1943 1944<p> 1945Reduce/reduce conflicts are caused when there are multiple grammar 1946rules that can be applied to a given set of symbols. This kind of 1947conflict is almost always bad and is always resolved by picking the 1948rule that appears first in the grammar file. Reduce/reduce conflicts 1949are almost always caused when different sets of grammar rules somehow 1950generate the same set of symbols. For example: 1951 1952<blockquote> 1953<pre> 1954assignment : ID EQUALS NUMBER 1955 | ID EQUALS expression 1956 1957expression : expression PLUS expression 1958 | expression MINUS expression 1959 | expression TIMES expression 1960 | expression DIVIDE expression 1961 | LPAREN expression RPAREN 1962 | NUMBER 1963</pre> 1964</blockquote> 1965 1966In this case, a reduce/reduce conflict exists between these two rules: 1967 1968<blockquote> 1969<pre> 1970assignment : ID EQUALS NUMBER 1971expression : NUMBER 1972</pre> 1973</blockquote> 1974 1975For example, if you wrote "a = 5", the parser can't figure out if this 1976is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or 1977whether it's supposed to reduce the 5 as an expression and then reduce 1978the rule <tt>assignment : ID EQUALS expression</tt>. 1979 1980<p> 1981It should be noted that reduce/reduce conflicts are notoriously difficult to spot 1982simply looking at the input grammer. To locate these, it is usually easier to look at the 1983<tt>parser.out</tt> debugging file with an appropriately high level of caffeination. 1984 1985<H3><a name="ply_nn28"></a>5.7 The parser.out file</H3> 1986 1987 1988Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR 1989parsing algorithm. To assist in debugging, <tt>yacc.py</tt> creates a debugging file called 1990'parser.out' when it generates the parsing table. The contents of this file look like the following: 1991 1992<blockquote> 1993<pre> 1994Unused terminals: 1995 1996 1997Grammar 1998 1999Rule 1 expression -> expression PLUS expression 2000Rule 2 expression -> expression MINUS expression 2001Rule 3 expression -> expression TIMES expression 2002Rule 4 expression -> expression DIVIDE expression 2003Rule 5 expression -> NUMBER 2004Rule 6 expression -> LPAREN expression RPAREN 2005 2006Terminals, with rules where they appear 2007 2008TIMES : 3 2009error : 2010MINUS : 2 2011RPAREN : 6 2012LPAREN : 6 2013DIVIDE : 4 2014PLUS : 1 2015NUMBER : 5 2016 2017Nonterminals, with rules where they appear 2018 2019expression : 1 1 2 2 3 3 4 4 6 0 2020 2021 2022Parsing method: LALR 2023 2024 2025state 0 2026 2027 S' -> . expression 2028 expression -> . expression PLUS expression 2029 expression -> . expression MINUS expression 2030 expression -> . expression TIMES expression 2031 expression -> . expression DIVIDE expression 2032 expression -> . NUMBER 2033 expression -> . LPAREN expression RPAREN 2034 2035 NUMBER shift and go to state 3 2036 LPAREN shift and go to state 2 2037 2038 2039state 1 2040 2041 S' -> expression . 2042 expression -> expression . PLUS expression 2043 expression -> expression . MINUS expression 2044 expression -> expression . TIMES expression 2045 expression -> expression . DIVIDE expression 2046 2047 PLUS shift and go to state 6 2048 MINUS shift and go to state 5 2049 TIMES shift and go to state 4 2050 DIVIDE shift and go to state 7 2051 2052 2053state 2 2054 2055 expression -> LPAREN . expression RPAREN 2056 expression -> . expression PLUS expression 2057 expression -> . expression MINUS expression 2058 expression -> . expression TIMES expression 2059 expression -> . expression DIVIDE expression 2060 expression -> . NUMBER 2061 expression -> . LPAREN expression RPAREN 2062 2063 NUMBER shift and go to state 3 2064 LPAREN shift and go to state 2 2065 2066 2067state 3 2068 2069 expression -> NUMBER . 2070 2071 $ reduce using rule 5 2072 PLUS reduce using rule 5 2073 MINUS reduce using rule 5 2074 TIMES reduce using rule 5 2075 DIVIDE reduce using rule 5 2076 RPAREN reduce using rule 5 2077 2078 2079state 4 2080 2081 expression -> expression TIMES . expression 2082 expression -> . expression PLUS expression 2083 expression -> . expression MINUS expression 2084 expression -> . expression TIMES expression 2085 expression -> . expression DIVIDE expression 2086 expression -> . NUMBER 2087 expression -> . LPAREN expression RPAREN 2088 2089 NUMBER shift and go to state 3 2090 LPAREN shift and go to state 2 2091 2092 2093state 5 2094 2095 expression -> expression MINUS . expression 2096 expression -> . expression PLUS expression 2097 expression -> . expression MINUS expression 2098 expression -> . expression TIMES expression 2099 expression -> . expression DIVIDE expression 2100 expression -> . NUMBER 2101 expression -> . LPAREN expression RPAREN 2102 2103 NUMBER shift and go to state 3 2104 LPAREN shift and go to state 2 2105 2106 2107state 6 2108 2109 expression -> expression PLUS . expression 2110 expression -> . expression PLUS expression 2111 expression -> . expression MINUS expression 2112 expression -> . expression TIMES expression 2113 expression -> . expression DIVIDE expression 2114 expression -> . NUMBER 2115 expression -> . LPAREN expression RPAREN 2116 2117 NUMBER shift and go to state 3 2118 LPAREN shift and go to state 2 2119 2120 2121state 7 2122 2123 expression -> expression DIVIDE . expression 2124 expression -> . expression PLUS expression 2125 expression -> . expression MINUS expression 2126 expression -> . expression TIMES expression 2127 expression -> . expression DIVIDE expression 2128 expression -> . NUMBER 2129 expression -> . LPAREN expression RPAREN 2130 2131 NUMBER shift and go to state 3 2132 LPAREN shift and go to state 2 2133 2134 2135state 8 2136 2137 expression -> LPAREN expression . RPAREN 2138 expression -> expression . PLUS expression 2139 expression -> expression . MINUS expression 2140 expression -> expression . TIMES expression 2141 expression -> expression . DIVIDE expression 2142 2143 RPAREN shift and go to state 13 2144 PLUS shift and go to state 6 2145 MINUS shift and go to state 5 2146 TIMES shift and go to state 4 2147 DIVIDE shift and go to state 7 2148 2149 2150state 9 2151 2152 expression -> expression TIMES expression . 2153 expression -> expression . PLUS expression 2154 expression -> expression . MINUS expression 2155 expression -> expression . TIMES expression 2156 expression -> expression . DIVIDE expression 2157 2158 $ reduce using rule 3 2159 PLUS reduce using rule 3 2160 MINUS reduce using rule 3 2161 TIMES reduce using rule 3 2162 DIVIDE reduce using rule 3 2163 RPAREN reduce using rule 3 2164 2165 ! PLUS [ shift and go to state 6 ] 2166 ! MINUS [ shift and go to state 5 ] 2167 ! TIMES [ shift and go to state 4 ] 2168 ! DIVIDE [ shift and go to state 7 ] 2169 2170state 10 2171 2172 expression -> expression MINUS expression . 2173 expression -> expression . PLUS expression 2174 expression -> expression . MINUS expression 2175 expression -> expression . TIMES expression 2176 expression -> expression . DIVIDE expression 2177 2178 $ reduce using rule 2 2179 PLUS reduce using rule 2 2180 MINUS reduce using rule 2 2181 RPAREN reduce using rule 2 2182 TIMES shift and go to state 4 2183 DIVIDE shift and go to state 7 2184 2185 ! TIMES [ reduce using rule 2 ] 2186 ! DIVIDE [ reduce using rule 2 ] 2187 ! PLUS [ shift and go to state 6 ] 2188 ! MINUS [ shift and go to state 5 ] 2189 2190state 11 2191 2192 expression -> expression PLUS expression . 2193 expression -> expression . PLUS expression 2194 expression -> expression . MINUS expression 2195 expression -> expression . TIMES expression 2196 expression -> expression . DIVIDE expression 2197 2198 $ reduce using rule 1 2199 PLUS reduce using rule 1 2200 MINUS reduce using rule 1 2201 RPAREN reduce using rule 1 2202 TIMES shift and go to state 4 2203 DIVIDE shift and go to state 7 2204 2205 ! TIMES [ reduce using rule 1 ] 2206 ! DIVIDE [ reduce using rule 1 ] 2207 ! PLUS [ shift and go to state 6 ] 2208 ! MINUS [ shift and go to state 5 ] 2209 2210state 12 2211 2212 expression -> expression DIVIDE expression . 2213 expression -> expression . PLUS expression 2214 expression -> expression . MINUS expression 2215 expression -> expression . TIMES expression 2216 expression -> expression . DIVIDE expression 2217 2218 $ reduce using rule 4 2219 PLUS reduce using rule 4 2220 MINUS reduce using rule 4 2221 TIMES reduce using rule 4 2222 DIVIDE reduce using rule 4 2223 RPAREN reduce using rule 4 2224 2225 ! PLUS [ shift and go to state 6 ] 2226 ! MINUS [ shift and go to state 5 ] 2227 ! TIMES [ shift and go to state 4 ] 2228 ! DIVIDE [ shift and go to state 7 ] 2229 2230state 13 2231 2232 expression -> LPAREN expression RPAREN . 2233 2234 $ reduce using rule 6 2235 PLUS reduce using rule 6 2236 MINUS reduce using rule 6 2237 TIMES reduce using rule 6 2238 DIVIDE reduce using rule 6 2239 RPAREN reduce using rule 6 2240</pre> 2241</blockquote> 2242 2243In the file, each state of the grammar is described. Within each state the "." indicates the current 2244location of the parse within any applicable grammar rules. In addition, the actions for each valid 2245input token are listed. When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected 2246are prefixed with an !. For example: 2247 2248<blockquote> 2249<pre> 2250 ! TIMES [ reduce using rule 2 ] 2251 ! DIVIDE [ reduce using rule 2 ] 2252 ! PLUS [ shift and go to state 6 ] 2253 ! MINUS [ shift and go to state 5 ] 2254</pre> 2255</blockquote> 2256 2257By looking at these rules (and with a little practice), you can usually track down the source 2258of most parsing conflicts. It should also be stressed that not all shift-reduce conflicts are 2259bad. However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>. 2260 2261<H3><a name="ply_nn29"></a>5.8 Syntax Error Handling</H3> 2262 2263 2264When a syntax error occurs during parsing, the error is immediately 2265detected (i.e., the parser does not read any more tokens beyond the 2266source of the error). Error recovery in LR parsers is a delicate 2267topic that involves ancient rituals and black-magic. The recovery mechanism 2268provided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want 2269consult a book like O'Reilly's "Lex and Yacc" for some of the finer details. 2270 2271<p> 2272When a syntax error occurs, <tt>yacc.py</tt> performs the following steps: 2273 2274<ol> 2275<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function 2276is called with the offending token as an argument. Afterwards, the parser enters 2277an "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it 2278has successfully shifted at least 3 tokens onto the parsing stack. 2279 2280<p> 2281<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced 2282with a special <tt>error</tt> token. 2283 2284<p> 2285<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is 2286deleted. 2287 2288<p> 2289<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start 2290parsing from its initial state. 2291 2292<p> 2293<li>If a grammar rule accepts <tt>error</tt> as a token, it will be 2294shifted onto the parsing stack. 2295 2296<p> 2297<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the 2298parser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>. 2299</ol> 2300 2301<H4><a name="ply_nn30"></a>5.8.1 Recovery and resynchronization with error rules</H4> 2302 2303 2304The most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt> 2305token. For example, suppose your language had a grammar rule for a print statement like this: 2306 2307<blockquote> 2308<pre> 2309def p_statement_print(p): 2310 'statement : PRINT expr SEMI' 2311 ... 2312</pre> 2313</blockquote> 2314 2315To account for the possibility of a bad expression, you might write an additional grammar rule like this: 2316 2317<blockquote> 2318<pre> 2319def p_statement_print_error(p): 2320 'statement : PRINT error SEMI' 2321 print "Syntax error in print statement. Bad expression" 2322 2323</pre> 2324</blockquote> 2325 2326In this case, the <tt>error</tt> token will match any sequence of 2327tokens that might appear up to the first semicolon that is 2328encountered. Once the semicolon is reached, the rule will be 2329invoked and the <tt>error</tt> token will go away. 2330 2331<p> 2332This type of recovery is sometimes known as parser resynchronization. 2333The <tt>error</tt> token acts as a wildcard for any bad input text and 2334the token immediately following <tt>error</tt> acts as a 2335synchronization token. 2336 2337<p> 2338It is important to note that the <tt>error</tt> token usually does not appear as the last token 2339on the right in an error rule. For example: 2340 2341<blockquote> 2342<pre> 2343def p_statement_print_error(p): 2344 'statement : PRINT error' 2345 print "Syntax error in print statement. Bad expression" 2346</pre> 2347</blockquote> 2348 2349This is because the first bad token encountered will cause the rule to 2350be reduced--which may make it difficult to recover if more bad tokens 2351immediately follow. 2352 2353<H4><a name="ply_nn31"></a>5.8.2 Panic mode recovery</H4> 2354 2355 2356An alternative error recovery scheme is to enter a panic mode recovery in which tokens are 2357discarded to a point where the parser might be able to recover in some sensible manner. 2358 2359<p> 2360Panic mode recovery is implemented entirely in the <tt>p_error()</tt> function. For example, this 2361function starts discarding tokens until it reaches a closing '}'. Then, it restarts the 2362parser in its initial state. 2363 2364<blockquote> 2365<pre> 2366def p_error(p): 2367 print "Whoa. You are seriously hosed." 2368 # Read ahead looking for a closing '}' 2369 while 1: 2370 tok = yacc.token() # Get the next token 2371 if not tok or tok.type == 'RBRACE': break 2372 yacc.restart() 2373</pre> 2374</blockquote> 2375 2376<p> 2377This function simply discards the bad token and tells the parser that the error was ok. 2378 2379<blockquote> 2380<pre> 2381def p_error(p): 2382 print "Syntax error at token", p.type 2383 # Just discard the token and tell the parser it's okay. 2384 yacc.errok() 2385</pre> 2386</blockquote> 2387 2388<P> 2389Within the <tt>p_error()</tt> function, three functions are available to control the behavior 2390of the parser: 2391<p> 2392<ul> 2393<li><tt>yacc.errok()</tt>. This resets the parser state so it doesn't think it's in error-recovery 2394mode. This will prevent an <tt>error</tt> token from being generated and will reset the internal 2395error counters so that the next syntax error will call <tt>p_error()</tt> again. 2396 2397<p> 2398<li><tt>yacc.token()</tt>. This returns the next token on the input stream. 2399 2400<p> 2401<li><tt>yacc.restart()</tt>. This discards the entire parsing stack and resets the parser 2402to its initial state. 2403</ul> 2404 2405Note: these functions are only available when invoking <tt>p_error()</tt> and are not available 2406at any other time. 2407 2408<p> 2409To supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token. This might be 2410useful if trying to synchronize on special characters. For example: 2411 2412<blockquote> 2413<pre> 2414def p_error(p): 2415 # Read ahead looking for a terminating ";" 2416 while 1: 2417 tok = yacc.token() # Get the next token 2418 if not tok or tok.type == 'SEMI': break 2419 yacc.errok() 2420 2421 # Return SEMI to the parser as the next lookahead token 2422 return tok 2423</pre> 2424</blockquote> 2425 2426<H4><a name="ply_nn32"></a>5.8.3 General comments on error handling</H4> 2427 2428 2429For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable 2430technique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy 2431to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want 2432to discard huge portions of the input text to find a valid restart point. 2433 2434<H3><a name="ply_nn33"></a>5.9 Line Number and Position Tracking</H3> 2435 2436Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of 2437all tokens. This information is available using the following functions: 2438 2439<ul> 2440<li><tt>p.lineno(num)</tt>. Return the line number for symbol <em>num</em> 2441<li><tt>p.lexpos(num)</tt>. Return the lexing position for symbol <em>num</em> 2442</ul> 2443 2444For example: 2445 2446<blockquote> 2447<pre> 2448def p_expression(p): 2449 'expression : expression PLUS expression' 2450 line = p.lineno(2) # line number of the PLUS token 2451 index = p.lexpos(2) # Position of the PLUS token 2452</pre> 2453</blockquote> 2454 2455As an optional feature, <tt>yacc.py</tt> can automatically track line numbers and positions for all of the grammar symbols 2456as well. However, this 2457extra tracking requires extra processing and can significantly slow down parsing. Therefore, it must be enabled by passing the 2458<tt>tracking=True</tt> option to <tt>yacc.parse()</tt>. For example: 2459 2460<blockquote> 2461<pre> 2462yacc.parse(data,tracking=True) 2463</pre> 2464</blockquote> 2465 2466Once enabled, the <tt>lineno()</tt> and <tt>lexpos()</tt> methods work for all grammar symbols. In addition, two 2467additional methods can be used: 2468 2469<ul> 2470<li><tt>p.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>. 2471<li><tt>p.lexspan(num)</tt>. Return a tuple (start,end) with the starting and ending positions for symbol <em>num</em>. 2472</ul> 2473 2474For example: 2475 2476<blockquote> 2477<pre> 2478def p_expression(p): 2479 'expression : expression PLUS expression' 2480 p.lineno(1) # Line number of the left expression 2481 p.lineno(2) # line number of the PLUS operator 2482 p.lineno(3) # line number of the right expression 2483 ... 2484 start,end = p.linespan(3) # Start,end lines of the right expression 2485 starti,endi = p.lexspan(3) # Start,end positions of right expression 2486 2487</pre> 2488</blockquote> 2489 2490Note: The <tt>lexspan()</tt> function only returns the range of values up to the start of the last grammar symbol. 2491 2492<p> 2493Although it may be convenient for PLY to track position information on 2494all grammar symbols, this is often unnecessary. For example, if you 2495are merely using line number information in an error message, you can 2496often just key off of a specific token in the grammar rule. For 2497example: 2498 2499<blockquote> 2500<pre> 2501def p_bad_func(p): 2502 'funccall : fname LPAREN error RPAREN' 2503 # Line number reported from LPAREN token 2504 print "Bad function call at line", p.lineno(2) 2505</pre> 2506</blockquote> 2507 2508<p> 2509Similarly, you may get better parsing performance if you only propagate line number 2510information where it's needed. For example: 2511 2512<blockquote> 2513<pre> 2514def p_fname(p): 2515 'fname : ID' 2516 p[0] = (p[1],p.lineno(1)) 2517</pre> 2518</blockquote> 2519 2520Finally, it should be noted that PLY does not store position information after a rule has been 2521processed. If it is important for you to retain this information in an abstract syntax tree, you 2522must make your own copy. 2523 2524<H3><a name="ply_nn34"></a>5.10 AST Construction</H3> 2525 2526 2527<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree. However, such 2528construction is easy enough to do on your own. Simply create a data structure for abstract syntax tree nodes 2529and assign nodes to <tt>p[0]</tt> in each rule. 2530 2531For example: 2532 2533<blockquote> 2534<pre> 2535class Expr: pass 2536 2537class BinOp(Expr): 2538 def __init__(self,left,op,right): 2539 self.type = "binop" 2540 self.left = left 2541 self.right = right 2542 self.op = op 2543 2544class Number(Expr): 2545 def __init__(self,value): 2546 self.type = "number" 2547 self.value = value 2548 2549def p_expression_binop(p): 2550 '''expression : expression PLUS expression 2551 | expression MINUS expression 2552 | expression TIMES expression 2553 | expression DIVIDE expression''' 2554 2555 p[0] = BinOp(p[1],p[2],p[3]) 2556 2557def p_expression_group(p): 2558 'expression : LPAREN expression RPAREN' 2559 p[0] = p[2] 2560 2561def p_expression_number(p): 2562 'expression : NUMBER' 2563 p[0] = Number(p[1]) 2564</pre> 2565</blockquote> 2566 2567To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes. 2568For example: 2569 2570<blockquote> 2571<pre> 2572class Node: 2573 def __init__(self,type,children=None,leaf=None): 2574 self.type = type 2575 if children: 2576 self.children = children 2577 else: 2578 self.children = [ ] 2579 self.leaf = leaf 2580 2581def p_expression_binop(p): 2582 '''expression : expression PLUS expression 2583 | expression MINUS expression 2584 | expression TIMES expression 2585 | expression DIVIDE expression''' 2586 2587 p[0] = Node("binop", [p[1],p[3]], p[2]) 2588</pre> 2589</blockquote> 2590 2591<H3><a name="ply_nn35"></a>5.11 Embedded Actions</H3> 2592 2593 2594The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example, 2595suppose you have a rule like this: 2596 2597<blockquote> 2598<pre> 2599def p_foo(p): 2600 "foo : A B C D" 2601 print "Parsed a foo", p[1],p[2],p[3],p[4] 2602</pre> 2603</blockquote> 2604 2605<p> 2606In this case, the supplied action code only executes after all of the 2607symbols <tt>A</tt>, <tt>B</tt>, <tt>C</tt>, and <tt>D</tt> have been 2608parsed. Sometimes, however, it is useful to execute small code 2609fragments during intermediate stages of parsing. For example, suppose 2610you wanted to perform some action immediately after <tt>A</tt> has 2611been parsed. To do this, you can write a empty rule like this: 2612 2613<blockquote> 2614<pre> 2615def p_foo(p): 2616 "foo : A seen_A B C D" 2617 print "Parsed a foo", p[1],p[3],p[4],p[5] 2618 print "seen_A returned", p[2] 2619 2620def p_seen_A(p): 2621 "seen_A :" 2622 print "Saw an A = ", p[-1] # Access grammar symbol to left 2623 p[0] = some_value # Assign value to seen_A 2624 2625</pre> 2626</blockquote> 2627 2628<p> 2629In this example, the empty <tt>seen_A</tt> rule executes immediately 2630after <tt>A</tt> is shifted onto the parsing stack. Within this 2631rule, <tt>p[-1]</tt> refers to the symbol on the stack that appears 2632immediately to the left of the <tt>seen_A</tt> symbol. In this case, 2633it would be the value of <tt>A</tt> in the <tt>foo</tt> rule 2634immediately above. Like other rules, a value can be returned from an 2635embedded action by simply assigning it to <tt>p[0]</tt> 2636 2637<p> 2638The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example, 2639this grammar has no conflicts: 2640 2641<blockquote> 2642<pre> 2643def p_foo(p): 2644 """foo : abcd 2645 | abcx""" 2646 2647def p_abcd(p): 2648 "abcd : A B C D" 2649 2650def p_abcx(p): 2651 "abcx : A B C X" 2652</pre> 2653</blockquote> 2654 2655However, if you insert an embedded action into one of the rules like this, 2656 2657<blockquote> 2658<pre> 2659def p_foo(p): 2660 """foo : abcd 2661 | abcx""" 2662 2663def p_abcd(p): 2664 "abcd : A B C D" 2665 2666def p_abcx(p): 2667 "abcx : A B seen_AB C X" 2668 2669def p_seen_AB(p): 2670 "seen_AB :" 2671</pre> 2672</blockquote> 2673 2674an extra shift-reduce conflict will be introduced. This conflict is caused by the fact that the same symbol <tt>C</tt> appears next in 2675both the <tt>abcd</tt> and <tt>abcx</tt> rules. The parser can either shift the symbol (<tt>abcd</tt> rule) or reduce the empty rule <tt>seen_AB</tt> (<tt>abcx</tt> rule). 2676 2677<p> 2678A common use of embedded rules is to control other aspects of parsing 2679such as scoping of local variables. For example, if you were parsing C code, you might 2680write code like this: 2681 2682<blockquote> 2683<pre> 2684def p_statements_block(p): 2685 "statements: LBRACE new_scope statements RBRACE""" 2686 # Action code 2687 ... 2688 pop_scope() # Return to previous scope 2689 2690def p_new_scope(p): 2691 "new_scope :" 2692 # Create a new scope for local variables 2693 s = new_scope() 2694 push_scope(s) 2695 ... 2696</pre> 2697</blockquote> 2698 2699In this case, the embedded action <tt>new_scope</tt> executes immediately after a <tt>LBRACE</tt> (<tt>{</tt>) symbol is parsed. This might 2700adjust internal symbol tables and other aspects of the parser. Upon completion of the rule <tt>statements_block</tt>, code might undo the operations performed in the embedded action (e.g., <tt>pop_scope()</tt>). 2701 2702<H3><a name="ply_nn36"></a>5.12 Yacc implementation notes</H3> 2703 2704 2705<ul> 2706<li>The default parsing method is LALR. To use SLR instead, run yacc() as follows: 2707 2708<blockquote> 2709<pre> 2710yacc.yacc(method="SLR") 2711</pre> 2712</blockquote> 2713Note: LALR table generation takes approximately twice as long as SLR table generation. There is no 2714difference in actual parsing performance---the same code is used in both cases. LALR is preferred when working 2715with more complicated grammars since it is more powerful. 2716 2717<p> 2718 2719<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing. However, an alternative tokenizer 2720can be supplied as follows: 2721 2722<blockquote> 2723<pre> 2724yacc.parse(lexer=x) 2725</pre> 2726</blockquote> 2727in this case, <tt>x</tt> must be a Lexer object that minimally has a <tt>x.token()</tt> method for retrieving the next 2728token. If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method. 2729 2730<p> 2731<li>By default, the yacc generates tables in debugging mode (which produces the parser.out file and other output). 2732To disable this, use 2733 2734<blockquote> 2735<pre> 2736yacc.yacc(debug=0) 2737</pre> 2738</blockquote> 2739 2740<p> 2741<li>To change the name of the <tt>parsetab.py</tt> file, use: 2742 2743<blockquote> 2744<pre> 2745yacc.yacc(tabmodule="foo") 2746</pre> 2747</blockquote> 2748 2749<p> 2750<li>To change the directory in which the <tt>parsetab.py</tt> file (and other output files) are written, use: 2751<blockquote> 2752<pre> 2753yacc.yacc(tabmodule="foo",outputdir="somedirectory") 2754</pre> 2755</blockquote> 2756 2757<p> 2758<li>To prevent yacc from generating any kind of parser table file, use: 2759<blockquote> 2760<pre> 2761yacc.yacc(write_tables=0) 2762</pre> 2763</blockquote> 2764 2765Note: If you disable table generation, yacc() will regenerate the parsing tables 2766each time it runs (which may take awhile depending on how large your grammar is). 2767 2768<P> 2769<li>To print copious amounts of debugging during parsing, use: 2770 2771<blockquote> 2772<pre> 2773yacc.parse(debug=1) 2774</pre> 2775</blockquote> 2776 2777<p> 2778<li>To redirect the debugging output to a filename of your choosing, use: 2779 2780<blockquote> 2781<pre> 2782yacc.parse(debug=1, debugfile="debugging.out") 2783</pre> 2784</blockquote> 2785 2786<p> 2787<li>The <tt>yacc.yacc()</tt> function really returns a parser object. If you want to support multiple 2788parsers in the same application, do this: 2789 2790<blockquote> 2791<pre> 2792p = yacc.yacc() 2793... 2794p.parse() 2795</pre> 2796</blockquote> 2797 2798Note: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated. 2799 2800<p> 2801<li>Since the generation of the LALR tables is relatively expensive, previously generated tables are 2802cached and reused if possible. The decision to regenerate the tables is determined by taking an MD5 2803checksum of all grammar rules and precedence rules. Only in the event of a mismatch are the tables regenerated. 2804 2805<p> 2806It should be noted that table generation is reasonably efficient, even for grammars that involve around a 100 rules 2807and several hundred states. For more complex languages such as C, table generation may take 30-60 seconds on a slow 2808machine. Please be patient. 2809 2810<p> 2811<li>Since LR parsing is driven by tables, the performance of the parser is largely independent of the 2812size of the grammar. The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules. 2813</ul> 2814 2815<H2><a name="ply_nn37"></a>6. Parser and Lexer State Management</H2> 2816 2817 2818In advanced parsing applications, you may want to have multiple 2819parsers and lexers. Furthermore, the parser may want to control the 2820behavior of the lexer in some way. 2821 2822<p> 2823To do this, it is important to note that both the lexer and parser are 2824actually implemented as objects. These objects are returned by the 2825<tt>lex()</tt> and <tt>yacc()</tt> functions respectively. For example: 2826 2827<blockquote> 2828<pre> 2829lexer = lex.lex() # Return lexer object 2830parser = yacc.yacc() # Return parser object 2831</pre> 2832</blockquote> 2833 2834To attach the lexer and parser together, make sure you use the <tt>lexer</tt> argumemnt to parse. For example: 2835 2836<blockquote> 2837<pre> 2838parser.parse(text,lexer=lexer) 2839</pre> 2840</blockquote> 2841 2842Within lexer and parser rules, these objects are also available. In the lexer, 2843the "lexer" attribute of a token refers to the lexer object in use. For example: 2844 2845<blockquote> 2846<pre> 2847def t_NUMBER(t): 2848 r'\d+' 2849 ... 2850 print t.lexer # Show lexer object 2851</pre> 2852</blockquote> 2853 2854In the parser, the "lexer" and "parser" attributes refer to the lexer 2855and parser objects respectively. 2856 2857<blockquote> 2858<pre> 2859def p_expr_plus(p): 2860 'expr : expr PLUS expr' 2861 ... 2862 print p.parser # Show parser object 2863 print p.lexer # Show lexer object 2864</pre> 2865</blockquote> 2866 2867If necessary, arbitrary attributes can be attached to the lexer or parser object. 2868For example, if you wanted to have different parsing modes, you could attach a mode 2869attribute to the parser object and look at it later. 2870 2871<H2><a name="ply_nn38"></a>7. Using Python's Optimized Mode</H2> 2872 2873 2874Because PLY uses information from doc-strings, parsing and lexing 2875information must be gathered while running the Python interpreter in 2876normal mode (i.e., not with the -O or -OO options). However, if you 2877specify optimized mode like this: 2878 2879<blockquote> 2880<pre> 2881lex.lex(optimize=1) 2882yacc.yacc(optimize=1) 2883</pre> 2884</blockquote> 2885 2886then PLY can later be used when Python runs in optimized mode. To make this work, 2887make sure you first run Python in normal mode. Once the lexing and parsing tables 2888have been generated the first time, run Python in optimized mode. PLY will use 2889the tables without the need for doc strings. 2890 2891<p> 2892Beware: running PLY in optimized mode disables a lot of error 2893checking. You should only do this when your project has stabilized 2894and you don't need to do any debugging. 2895 2896<H2><a name="ply_nn39"></a>8. Where to go from here?</H2> 2897 2898 2899The <tt>examples</tt> directory of the PLY distribution contains several simple examples. Please consult a 2900compilers textbook for the theory and underlying implementation details or LR parsing. 2901 2902</body> 2903</html> 2904 2905 2906 2907 2908 2909 2910 2911