



(16 ratings)
Creating a lexer/tokenizer with Plex
Lexical analysis -- The tokenizer in our recursive descent parser example was (for demonstration purposes) overly simple. You can always write more complex tokenizers by hand. However, for more complex (and real) tokenizers, you may want to use a tool to build your tokenizer.
In this section we'll describe Plex and use it to produce a tokenizer for our recursive descent parser.
You can obtain Plex at http://www.cosc.canterbury.ac.nz/~greg/python/Plex/.
In order to use it, you may want to add Plex-1.1.4/Plex to your PYTHONPATH.
Here is a simple example from the Plex tutorial:
#!/usr/bin/env python
# python_201_plex1.py
#
# Sample Plex lexer
#
import sys
import Plex
def test(infileName):
letter = Plex.Range("AZaz")
digit = Plex.Range("09")
name = letter + Plex.Rep(letter | digit)
number = Plex.Rep1(digit)
space = Plex.Any(" \t\n")
comment = Plex.Str('"') + Plex.Rep( Plex.AnyBut('"')) + Plex.Str('"')
resword = Plex.Str("if", "then", "else", "end")
lexicon = Plex.Lexicon([
(resword, 'keyword'),
(name, 'ident'),
(number, 'int'),
( Plex.Any("+-*/=<>"), Plex.TEXT),
(space, Plex.IGNORE),
(comment, 'comment'),
])
infile = open(infileName, "r")
scanner = Plex.Scanner(lexicon, infile, infileName)
while 1:
token = scanner.read()
position = scanner.position()
print '(%d, %d) tok: %s tokType: %s' % \
(position[1], position[2], token[1], token[0])
if token[0] is None:
break
USAGE_TEXT = """
Usage: python python_201_plex1.py <infile>
"""
def usage():
print USAGE_TEXT
sys.exit(-1)
def main():
args = sys.argv[1:]
if len(args) != 1:
usage()
infileName = args[0]
test(infileName)
if __name__ == '__main__':
main()
#import pdb
#pdb.run('main()')
Comments and explanation:
- Create a lexicon from scanning patterns.
- See the Plex tutorial and reference (and below) for more information on how to construct the patterns that match various tokens.
- Create a scanner with a lexicon, an input file, and an input file name.
- The call "scanner.read()" gets the next token. It returns a tuple containing (1) the token value and (2) the token type.
- The call "scanner.position()" gets the position of the current token. It returns a tuple containing (1) the input file name, (2) the line number, and (3) the column number.
And, here are some comments on constructing the patterns used in a lexicon:
Rangeconstructs a pattern that matches any character in the range.Repconstructs a pattern that matches a sequence of zero or more items.Rep1constructs a pattern that matches a sequence of one or more items.- "pat1 + pat2" constructs a pattern that matches a sequence containing pat1 followed by pat2.
- "pat1 | pat2" constructs a pattern that matches either pat1 or pat2.
Anyconstructs a pattern that matches any one character in its argument.
Now let's revisit our recursive descent parser, this time with a tokenizer built with Plex. The tokenizer is trivial, but will serve as an example of how to hook it into a parser.
#!/usr/bin/env python
"""
python_201_rparser_plex.py
A recursive descent parser example.
This example uses Plex to implement a tokenizer.
The grammar:
Prog ::= Command | Command Prog
Command ::= Func_call
Func_call ::= Term '(' Func_call_list ')'
Func_call_list ::= Func_call | Func_call ',' Func_call_list
Term = <word>
"""
import sys, string, types
import getopt
import Plex
## from IPython.Shell import IPShellEmbed
## ipshell = IPShellEmbed((),
## banner = '>>>>>>>> Into IPython >>>>>>>>',
## exit_msg = '<<<<<<<< Out of IPython <<<<<<<<')
#
# Constants
#
# AST node types
NoneNodeType = 0
ProgNodeType = 1
CommandNodeType = 2
FuncCallNodeType = 3
FuncCallListNodeType = 4
TermNodeType = 5
# Token types
NoneTokType = 0
LParTokType = 1
RParTokType = 2
WordTokType = 3
CommaTokType = 4
EOFTokType = 5
# Dictionary to map node type values to node type names
NodeTypeDict = {
NoneNodeType: 'NoneNodeType',
ProgNodeType: 'ProgNodeType',
CommandNodeType: 'CommandNodeType',
FuncCallNodeType: 'FuncCallNodeType',
FuncCallListNodeType: 'FuncCallListNodeType',
TermNodeType: 'TermNodeType',
}
#
# Representation of a node in the AST (abstract syntax tree).
#
class ASTNode:
def __init__(self, nodeType, *args):
self.nodeType = nodeType
self.children = []
for item in args:
self.children.append(item)
def show(self, level):
self.showLevel(level)
print 'Node -- Type %s' % NodeTypeDict[self.nodeType]
level += 1
for child in self.children:
if isinstance(child, ASTNode):
child.show(level)
elif type(child) == types.ListType:
for item in child:
item.show(level)
else:
self.showLevel(level)
print 'Child:', child
def showLevel(self, level):
for idx in range(level):
print ' ',
#
# The recursive descent parser class.
# Contains the "recognizer" methods, which implement the grammar
# rules (above), one recognizer method for each production rule.
#
class ProgParser:
def __init__(self):
pass
def parseFile(self, infileName):
self.tokens = None
self.tokenType = NoneTokType
self.token = ''
self.lineNo = -1
self.infile = file(infileName, 'r')
self.tokens = genTokens(self.infile, infileName)
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty file'
result = self.prog_reco()
self.infile.close()
self.infile = None
return result
def parseStream(self, instream):
self.tokens = None
self.tokenType = NoneTokType
self.token = ''
self.lineNo = -1
self.tokens = genTokens(self.instream, '<stream>')
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty stream'
result = self.prog_reco()
self.infile.close()
self.infile = None
return result
def prog_reco(self):
commandList = []
while 1:
result = self.command_reco()
if not result:
break
commandList.append(result)
return ASTNode(ProgNodeType, commandList)
def command_reco(self):
if self.tokenType == EOFTokType:
return None
result = self.func_call_reco()
return ASTNode(CommandNodeType, result)
def func_call_reco(self):
if self.tokenType == WordTokType:
term = ASTNode(TermNodeType, self.token)
self.tokenType, self.token, self.lineNo = self.tokens.next()
if self.tokenType == LParTokType:
self.tokenType, self.token, self.lineNo = self.tokens.next()
result = self.func_call_list_reco()
if result:
if self.tokenType == RParTokType:
self.tokenType, self.token, self.lineNo = \
self.tokens.next()
return ASTNode(FuncCallNodeType, term, result)
else:
raise ParseError(self.lineNo, 'missing right paren')
else:
raise ParseError(self.lineNo, 'bad func call list')
else:
raise ParseError(self.lineNo, 'missing left paren')
else:
return None
def func_call_list_reco(self):
terms = []
while 1:
result = self.func_call_reco()
if not result:
break
terms.append(result)
if self.tokenType != CommaTokType:
break
self.tokenType, self.token, self.lineNo = self.tokens.next()
return ASTNode(FuncCallListNodeType, terms)
#
# The parse error exception class.
#
class ParseError(Exception):
def __init__(self, lineNo, msg):
RuntimeError.__init__(self, msg)
self.lineNo = lineNo
self.msg = msg
def getLineNo(self):
return self.lineNo
def getMsg(self):
return self.msg
#
# Generate the tokens.
# Usage - example
# gen = genTokens(infile)
# tokType, tok, lineNo = gen.next()
# ...
def genTokens(infile, infileName):
letter = Plex.Range("AZaz")
digit = Plex.Range("09")
name = letter + Plex.Rep(letter | digit)
lpar = Plex.Str('(')
rpar = Plex.Str(')')
comma = Plex.Str(',')
comment = Plex.Str("#") + Plex.Rep(Plex.AnyBut("\n"))
space = Plex.Any(" \t\n")
lexicon = Plex.Lexicon([
(name, 'word'),
(lpar, 'lpar'),
(rpar, 'rpar'),
(comma, 'comma'),
(comment, Plex.IGNORE),
(space, Plex.IGNORE),
])
scanner = Plex.Scanner(lexicon, infile, infileName)
while 1:
tokenType, token = scanner.read()
name, lineNo, columnNo = scanner.position()
if tokenType == None:
tokType = EOFTokType
token = None
elif tokenType == 'word':
tokType = WordTokType
elif tokenType == 'lpar':
tokType = LParTokType
elif tokenType == 'rpar':
tokType = RParTokType
elif tokenType == 'comma':
tokType = CommaTokType
else:
tokType = NoneTokType
tok = token
yield (tokType, tok, lineNo)
def test(infileName):
parser = ProgParser()
#ipshell('(test) #1\nCtrl-D to exit')
result = None
try:
result = parser.parseFile(infileName)
except ParseError, exp:
sys.stderr.write('ParseError: (%d) %s\n' % \
(exp.getLineNo(), exp.getMsg()))
if result:
result.show(0)
USAGE_TEXT = """
Usage:
python python_201_rparser_plex.py [options] <inputfile>
Options:
-h, --help Display this help message.
Example:
python python_201_rparser_plex.py myfile.txt
"""
def usage():
print USAGE_TEXT
sys.exit(-1)
def main():
args = sys.argv[1:]
try:
opts, args = getopt.getopt(args, 'h', ['help'])
except:
usage()
for opt, val in opts:
if opt in ('-h', '--help'):
usage()
if len(args) != 1:
usage()
infileName = args[0]
test(infileName)
if __name__ == '__main__':
main()
#import pdb
#pdb.run('main()')
And, here is a sample of the data we can apply this parser to:
# Test for recursive descent parser and Plex.
# Command #1
aaa()
# Command #2
bbb (ccc()) # An end of line comment.
# Command #3
ddd(eee(), fff(ggg(), hhh(), iii()))
# End of test
Comments:
- We can now put comments in our input, and they will be ignored. Comments begin with a "#" and continue to the end of line. See the definition of comment in function genTokens.
- This tokenizer does not require us to separate tokens with whitespace as did the simple tokenizer in the earlier version of our recursive descent parser.
- The changes we made over the earlier version were to:
- Import Plex.
- Replace the definition of the tokenizer function genTokens.
- Change the call to genTokens so that the call passes in the file name, which is needed to create the scanner.
- Our new version of genTokens does the following:
- Create patterns for scanning.
- Create a lexicon (an instance of Plex.Lexicon), which uses the patterns.
- Create a scanner (an instance of Plex.Scanner), which uses the lexicon.
- Execute a loop that reads tokens (from the scanner) and "yields" each one.
A survey of existing tools
For complex parsing tasks, you may want to consider the following tools:
- kwParsing -- A parser generator in Python -- http://www.pythonpros.com/arw/kwParsing/kwParsing.html.
- PLY -- Python Lex-Yacc -- http://systems.cs.uchicago.edu/ply/.
- PyLR -- Fast LR parsing in python -- http://starship.python.net/crew/scott/PyLR.html.
- Yapps -- The Yapps Parser Generator System -- http://theory.stanford.edu/~amitp/Yapps/.
And, for lexical analysis, you may also want to look at -- Using Regular Expressions for Lexical Analysis.
20 Random Tutorials from the same category :
Data Structures
Google Sitemaps - Python
Python and Java - A Side by Side Comparison
Why is Python popular with Linux users?
Data Types in Python
Extending and embedding Python
GUI Applications in Python
Modules
Brief Tour of the Standard Library
Python Issues and some Tips
Functions and Strings -Python
An Informal Introduction to Python
Simple Statements In Python
Whetting Your Appetite
History
Modules in Python
Use serialization to store Python objects
Organization in Python
Control Structures In Python
Input and Output - Python













