(16 ratings)   
By: David Kuhlman
Python is an excellent language for text analysis.In some cases, simply splitting lines of text into words will be enough. In these cases use string.split(). In other cases, regular expressions may be able to do the parsing you need. If so, see the...
Added: 02 June 2008    Views: 107  
PathComputers    Programming    Python
Keywords: computer   programming   python   code   coder   language   coding   parsing   plex  
Do you like this tutorial? Now you can support our team to add more :     
 
 
 
Browse Pages :   <<    <    1    2    3    >    >>  

Creating a lexer/tokenizer with Plex

Lexical analysis -- The tokenizer in our recursive descent parser example was (for demonstration purposes) overly simple. You can always write more complex tokenizers by hand. However, for more complex (and real) tokenizers, you may want to use a tool to build your tokenizer.

In this section we'll describe Plex and use it to produce a tokenizer for our recursive descent parser.

You can obtain Plex at http://www.cosc.canterbury.ac.nz/~greg/python/Plex/.

In order to use it, you may want to add Plex-1.1.4/Plex to your PYTHONPATH.

Here is a simple example from the Plex tutorial:

#!/usr/bin/env python



# python_201_plex1.py

#

# Sample Plex lexer

#



import sys

import Plex



def test(infileName):

letter = Plex.Range("AZaz")

digit = Plex.Range("09")

name = letter + Plex.Rep(letter | digit)

number = Plex.Rep1(digit)

space = Plex.Any(" \t\n")

comment = Plex.Str('"') + Plex.Rep( Plex.AnyBut('"')) + Plex.Str('"')

resword = Plex.Str("if", "then", "else", "end")

lexicon = Plex.Lexicon([

(resword, 'keyword'),

(name, 'ident'),

(number, 'int'),

( Plex.Any("+-*/=<>"), Plex.TEXT),

(space, Plex.IGNORE),

(comment, 'comment'),

])

infile = open(infileName, "r")

scanner = Plex.Scanner(lexicon, infile, infileName)

while 1:

token = scanner.read()

position = scanner.position()

print '(%d, %d) tok: %s tokType: %s' % \

(position[1], position[2], token[1], token[0])

if token[0] is None:

break



USAGE_TEXT = """

Usage: python python_201_plex1.py <infile>

"""



def usage():

print USAGE_TEXT

sys.exit(-1)



def main():

args = sys.argv[1:]

if len(args) != 1:

usage()

infileName = args[0]

test(infileName)



if __name__ == '__main__':

main()

#import pdb

#pdb.run('main()')

Comments and explanation:

  • Create a lexicon from scanning patterns.
  • See the Plex tutorial and reference (and below) for more information on how to construct the patterns that match various tokens.
  • Create a scanner with a lexicon, an input file, and an input file name.
  • The call "scanner.read()" gets the next token. It returns a tuple containing (1) the token value and (2) the token type.
  • The call "scanner.position()" gets the position of the current token. It returns a tuple containing (1) the input file name, (2) the line number, and (3) the column number.

And, here are some comments on constructing the patterns used in a lexicon:

  • Range constructs a pattern that matches any character in the range.
  • Rep constructs a pattern that matches a sequence of zero or more items.
  • Rep1 constructs a pattern that matches a sequence of one or more items.
  • "pat1 + pat2" constructs a pattern that matches a sequence containing pat1 followed by pat2.
  • "pat1 | pat2" constructs a pattern that matches either pat1 or pat2.
  • Any constructs a pattern that matches any one character in its argument.

Now let's revisit our recursive descent parser, this time with a tokenizer built with Plex. The tokenizer is trivial, but will serve as an example of how to hook it into a parser.

#!/usr/bin/env python



"""

python_201_rparser_plex.py



A recursive descent parser example.

This example uses Plex to implement a tokenizer.



The grammar:



Prog ::= Command | Command Prog

Command ::= Func_call

Func_call ::= Term '(' Func_call_list ')'

Func_call_list ::= Func_call | Func_call ',' Func_call_list

Term = <word>



"""



import sys, string, types

import getopt

import Plex



## from IPython.Shell import IPShellEmbed

## ipshell = IPShellEmbed((),

## banner = '>>>>>>>> Into IPython >>>>>>>>',

## exit_msg = '<<<<<<<< Out of IPython <<<<<<<<')



#

# Constants

#



# AST node types

NoneNodeType = 0

ProgNodeType = 1

CommandNodeType = 2

FuncCallNodeType = 3

FuncCallListNodeType = 4

TermNodeType = 5



# Token types

NoneTokType = 0

LParTokType = 1

RParTokType = 2

WordTokType = 3

CommaTokType = 4

EOFTokType = 5



# Dictionary to map node type values to node type names

NodeTypeDict = {

NoneNodeType: 'NoneNodeType',

ProgNodeType: 'ProgNodeType',

CommandNodeType: 'CommandNodeType',

FuncCallNodeType: 'FuncCallNodeType',

FuncCallListNodeType: 'FuncCallListNodeType',

TermNodeType: 'TermNodeType',

}



#

# Representation of a node in the AST (abstract syntax tree).

#

class ASTNode:

def __init__(self, nodeType, *args):

self.nodeType = nodeType

self.children = []

for item in args:

self.children.append(item)

def show(self, level):

self.showLevel(level)

print 'Node -- Type %s' % NodeTypeDict[self.nodeType]

level += 1

for child in self.children:

if isinstance(child, ASTNode):

child.show(level)

elif type(child) == types.ListType:

for item in child:

item.show(level)

else:

self.showLevel(level)

print 'Child:', child

def showLevel(self, level):

for idx in range(level):

print ' ',



#

# The recursive descent parser class.

# Contains the "recognizer" methods, which implement the grammar

# rules (above), one recognizer method for each production rule.

#

class ProgParser:

def __init__(self):

pass



def parseFile(self, infileName):

self.tokens = None

self.tokenType = NoneTokType

self.token = ''

self.lineNo = -1

self.infile = file(infileName, 'r')

self.tokens = genTokens(self.infile, infileName)

try:

self.tokenType, self.token, self.lineNo = self.tokens.next()

except StopIteration:

raise RuntimeError, 'Empty file'

result = self.prog_reco()

self.infile.close()

self.infile = None

return result



def parseStream(self, instream):

self.tokens = None

self.tokenType = NoneTokType

self.token = ''

self.lineNo = -1

self.tokens = genTokens(self.instream, '<stream>')

try:

self.tokenType, self.token, self.lineNo = self.tokens.next()

except StopIteration:

raise RuntimeError, 'Empty stream'

result = self.prog_reco()

self.infile.close()

self.infile = None

return result



def prog_reco(self):

commandList = []

while 1:

result = self.command_reco()

if not result:

break

commandList.append(result)

return ASTNode(ProgNodeType, commandList)



def command_reco(self):

if self.tokenType == EOFTokType:

return None

result = self.func_call_reco()

return ASTNode(CommandNodeType, result)



def func_call_reco(self):

if self.tokenType == WordTokType:

term = ASTNode(TermNodeType, self.token)

self.tokenType, self.token, self.lineNo = self.tokens.next()

if self.tokenType == LParTokType:

self.tokenType, self.token, self.lineNo = self.tokens.next()

result = self.func_call_list_reco()

if result:

if self.tokenType == RParTokType:

self.tokenType, self.token, self.lineNo = \

self.tokens.next()

return ASTNode(FuncCallNodeType, term, result)

else:

raise ParseError(self.lineNo, 'missing right paren')

else:

raise ParseError(self.lineNo, 'bad func call list')

else:

raise ParseError(self.lineNo, 'missing left paren')

else:

return None



def func_call_list_reco(self):

terms = []

while 1:

result = self.func_call_reco()

if not result:

break

terms.append(result)

if self.tokenType != CommaTokType:

break

self.tokenType, self.token, self.lineNo = self.tokens.next()

return ASTNode(FuncCallListNodeType, terms)



#

# The parse error exception class.

#

class ParseError(Exception):

def __init__(self, lineNo, msg):

RuntimeError.__init__(self, msg)

self.lineNo = lineNo

self.msg = msg

def getLineNo(self):

return self.lineNo

def getMsg(self):

return self.msg



#

# Generate the tokens.

# Usage - example

# gen = genTokens(infile)

# tokType, tok, lineNo = gen.next()

# ...

def genTokens(infile, infileName):

letter = Plex.Range("AZaz")

digit = Plex.Range("09")

name = letter + Plex.Rep(letter | digit)

lpar = Plex.Str('(')

rpar = Plex.Str(')')

comma = Plex.Str(',')

comment = Plex.Str("#") + Plex.Rep(Plex.AnyBut("\n"))

space = Plex.Any(" \t\n")

lexicon = Plex.Lexicon([

(name, 'word'),

(lpar, 'lpar'),

(rpar, 'rpar'),

(comma, 'comma'),

(comment, Plex.IGNORE),

(space, Plex.IGNORE),

])

scanner = Plex.Scanner(lexicon, infile, infileName)

while 1:

tokenType, token = scanner.read()

name, lineNo, columnNo = scanner.position()

if tokenType == None:

tokType = EOFTokType

token = None

elif tokenType == 'word':

tokType = WordTokType

elif tokenType == 'lpar':

tokType = LParTokType

elif tokenType == 'rpar':

tokType = RParTokType

elif tokenType == 'comma':

tokType = CommaTokType

else:

tokType = NoneTokType

tok = token

yield (tokType, tok, lineNo)



def test(infileName):

parser = ProgParser()

#ipshell('(test) #1\nCtrl-D to exit')

result = None

try:

result = parser.parseFile(infileName)

except ParseError, exp:

sys.stderr.write('ParseError: (%d) %s\n' % \

(exp.getLineNo(), exp.getMsg()))

if result:

result.show(0)



USAGE_TEXT = """

Usage:

python python_201_rparser_plex.py [options] <inputfile>

Options:

-h, --help Display this help message.

Example:

python python_201_rparser_plex.py myfile.txt

"""



def usage():

print USAGE_TEXT

sys.exit(-1)



def main():

args = sys.argv[1:]

try:

opts, args = getopt.getopt(args, 'h', ['help'])

except:

usage()

for opt, val in opts:

if opt in ('-h', '--help'):

usage()

if len(args) != 1:

usage()

infileName = args[0]

test(infileName)



if __name__ == '__main__':

main()

#import pdb

#pdb.run('main()')

And, here is a sample of the data we can apply this parser to:

# Test for recursive descent parser and Plex.

# Command #1

aaa()

# Command #2

bbb (ccc()) # An end of line comment.

# Command #3

ddd(eee(), fff(ggg(), hhh(), iii()))

# End of test

Comments:

  • We can now put comments in our input, and they will be ignored. Comments begin with a "#" and continue to the end of line. See the definition of comment in function genTokens.
  • This tokenizer does not require us to separate tokens with whitespace as did the simple tokenizer in the earlier version of our recursive descent parser.
  • The changes we made over the earlier version were to:
    • Import Plex.
    • Replace the definition of the tokenizer function genTokens.
    • Change the call to genTokens so that the call passes in the file name, which is needed to create the scanner.
  • Our new version of genTokens does the following:
    1. Create patterns for scanning.
    2. Create a lexicon (an instance of Plex.Lexicon), which uses the patterns.
    3. Create a scanner (an instance of Plex.Scanner), which uses the lexicon.
    4. Execute a loop that reads tokens (from the scanner) and "yields" each one.

A survey of existing tools

For complex parsing tasks, you may want to consider the following tools:

And, for lexical analysis, you may also want to look at -- Using Regular Expressions for Lexical Analysis.

Browse Pages :   <<    <    1    2    3    >    >>  
About the Author :
Parsing in Phyton
Articles and Tutorials Directory by www.learnfobia.com
 Rate this tutorial : Rate 1Rate 2Rate 3Rate 4Rate 5
  |    Add to Favorites
  |    Send to Friend
  |    Print
Comments