



(16 ratings)
Python is an excellent language for text analysis.
In some cases, simply splitting lines of text into words will be enough. In these cases use string.split().
In other cases, regular expressions may be able to do the parsing you need. If so, see the section on regular expressions in this document.
However, in some cases, more complex analysis of input text is required. This section describes some of the ways that Python can help you with this complex parsing and analysis.
Special purpose parsers
There are a number of special purpose parsers which you will find in the Python standard library:
- ConfigParser -- Configuration file parser. See Python Library Reference: 5.10 ConfigParser - Configuration file parser.
- getopt -- Parse command line arguments. See Python Library Reference: 6.18 getopt - Parser for command line options.
- optparse -- Powerful parser for command line arguments. (Added in Python 2.3.)
- urlparse -- Parse URLs into components. See Python Library Reference: 11.14 urlparse - Parse URLs into components.
- os.path -- Parsers (and other capabilities) for file paths and names. See Python Library Reference: 6.2 os.path - Common pathname manipulations.
- PyXML
and the XML parsers -- There is lots of support for parsing and
processing XML. Here are a few places to look for support:
- The Python standard library -- Python Library Reference: 13. Structured Markup Processing Tools.
- PyXML -- http://www.python.org/sigs/xml-sig/.
- Gnosis -- http://www.gnosis.cx/download/.
- libxml2 and libxslt -- http://xmlsoft.org.
- Dave' support for Python and XML -- http://www.rexx.com/~dkuhlman.
Writing a recursive descent parser by hand
For simple grammars, this is not so hard.
You will need to implement:
- A recognizer method or function for each production rule in your grammar. Each recognizer method begins looking at the current token, then consumes as many tokens as needed to recognize it's own production rule. It calls the recognizer functions for any non-terminals on its right-hand side.
- A tokenizer -- Something that will enable each recognizer function to get tokens, one by one. There are a variety of ways to do this, e.g. (1) a function that produces a list of tokens from which recognizers can pop tokens; (2) a generator whose next method returns the next token; etc.
Here is an example of a recursive descent parser written in Python. After the example is some explanation.
#!/usr/bin/env python
"""
python_201_rparser.py
A recursive descent parser example.
The grammar:
Prog ::= Command | Command Prog
Command ::= Func_call
Func_call ::= Term '(' Func_call_list ')'
Func_call_list ::= Func_call | Func_call ',' Func_call_list
Term = <word>
"""
import sys, string, types
import getopt
## from IPython.Shell import IPShellEmbed
## ipshell = IPShellEmbed((),
## banner = '>>>>>>>> Into IPython >>>>>>>>',
## exit_msg = '<<<<<<<< Out of IPython <<<<<<<<')
#
# Constants
#
# AST node types
NoneNodeType = 0
ProgNodeType = 1
CommandNodeType = 2
FuncCallNodeType = 3
FuncCallListNodeType = 4
TermNodeType = 5
# Token types
NoneTokType = 0
LParTokType = 1
RParTokType = 2
WordTokType = 3
CommaTokType = 4
EOFTokType = 5
# Dictionary to map node type values to node type names
NodeTypeDict = {
NoneNodeType: 'NoneNodeType',
ProgNodeType: 'ProgNodeType',
CommandNodeType: 'CommandNodeType',
FuncCallNodeType: 'FuncCallNodeType',
FuncCallListNodeType: 'FuncCallListNodeType',
TermNodeType: 'TermNodeType',
}
#
# Representation of a node in the AST (abstract syntax tree).
#
class ASTNode:
def __init__(self, nodeType, *args):
self.nodeType = nodeType
self.children = []
for item in args:
self.children.append(item)
def show(self, level):
self.showLevel(level)
print 'Node -- Type %s' % NodeTypeDict[self.nodeType]
level += 1
for child in self.children:
if isinstance(child, ASTNode):
child.show(level)
elif type(child) == types.ListType:
for item in child:
item.show(level)
else:
self.showLevel(level)
print 'Child:', child
def showLevel(self, level):
for idx in range(level):
print ' ',
#
# The recursive descent parser class.
# Contains the "recognizer" methods, which implement the grammar
# rules (above), one recognizer method for each production rule.
#
class ProgParser:
def __init__(self):
pass
def parseFile(self, infileName):
self.infileName = infileName
self.tokens = None
self.tokenType = NoneTokType
self.token = ''
self.lineNo = -1
self.infile = file(self.infileName, 'r')
self.tokens = genTokens(self.infile)
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty file'
result = self.prog_reco()
self.infile.close()
self.infile = None
return result
def parseStream(self, instream):
self.tokens = genTokens(instream, '<instream>')
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty file'
result = self.prog_reco()
return result
def prog_reco(self):
commandList = []
while 1:
result = self.command_reco()
if not result:
break
commandList.append(result)
return ASTNode(ProgNodeType, commandList)
def command_reco(self):
if self.tokenType == EOFTokType:
return None
result = self.func_call_reco()
return ASTNode(CommandNodeType, result)
def func_call_reco(self):
if self.tokenType == WordTokType:
term = ASTNode(TermNodeType, self.token)
self.tokenType, self.token, self.lineNo = self.tokens.next()
if self.tokenType == LParTokType:
self.tokenType, self.token, self.lineNo = self.tokens.next()
result = self.func_call_list_reco()
if result:
if self.tokenType == RParTokType:
self.tokenType, self.token, self.lineNo = \
self.tokens.next()
return ASTNode(FuncCallNodeType, term, result)
else:
raise ParseError(self.lineNo, 'missing right paren')
else:
raise ParseError(self.lineNo, 'bad func call list')
else:
raise ParseError(self.lineNo, 'missing left paren')
else:
return None
def func_call_list_reco(self):
terms = []
while 1:
result = self.func_call_reco()
if not result:
break
terms.append(result)
if self.tokenType != CommaTokType:
break
self.tokenType, self.token, self.lineNo = self.tokens.next()
return ASTNode(FuncCallListNodeType, terms)
#
# The parse error exception class.
#
class ParseError(Exception):
def __init__(self, lineNo, msg):
RuntimeError.__init__(self, msg)
self.lineNo = lineNo
self.msg = msg
def getLineNo(self):
return self.lineNo
def getMsg(self):
return self.msg
def is_word(token):
for letter in token:
if letter not in string.ascii_letters:
return None
return 1
#
# Generate the tokens.
# Usage:
# gen = genTokens(infile)
# tokType, tok, lineNo = gen.next()
# ...
def genTokens(infile):
lineNo = 0
while 1:
lineNo += 1
try:
line = infile.next()
except:
yield (EOFTokType, None, lineNo)
toks = line.split()
for tok in toks:
if is_word(tok):
tokType = WordTokType
elif tok == '(':
tokType = LParTokType
elif tok == ')':
tokType = RParTokType
elif tok == ',':
tokType = CommaTokType
yield (tokType, tok, lineNo)
def test(infileName):
parser = ProgParser()
#ipshell('(test) #1\nCtrl-D to exit')
result = None
try:
result = parser.parseFile(infileName)
except ParseError, exp:
sys.stderr.write('ParseError: (%d) %s\n' % \
(exp.getLineNo(), exp.getMsg()))
if result:
result.show(0)
USAGE_TEXT = """
Usage:
python rparser.py [options] <inputfile>
Options:
-h, --help Display this help message.
Example:
python rparser.py myfile.txt
"""
def usage():
print USAGE_TEXT
sys.exit(-1)
def main():
args = sys.argv[1:]
try:
opts, args = getopt.getopt(args, 'h', ['help'])
except:
usage()
relink = 1
for opt, val in opts:
if opt in ('-h', '--help'):
usage()
if len(args) != 1:
usage()
test(args[0])
if __name__ == '__main__':
main()
#import pdb
#pdb.run('main()')
And, here is a sample of the data we can apply this parser to:
aaa ( )
bbb ( ccc ( ) )
ddd ( eee ( ) , fff ( ggg ( ) , hhh ( ) , iii ( ) ) )
Comments and explanation:
- The tokenizer is a Python generator. It returns a Python generator that can produce "(tokType, tok, lineNo)" tuples. Our tokenizer is so simple-minded that we have to separate all of our tokens with whitespace. (A little later, we'll see how to use Plex to overcome this limitation.)
- The parser class (ProgParser) contains the recognizer methods that implement the production rules. Each of these methods recognizes a syntactic construct defined by a rule. In our example, these methods have names that end with "_reco".
- We could have, alternatively, implemented our recognizers as global functions, instead of as methods in a class. However, using a class gives us a place to "hang" the variables that are needed across methods and saves us from having to use ("evil") global variables.
- A recognizer method recognizes a terminals (syntactic elements on the right-hand side of the grammar rule for which there is no grammar rule) by (1) checking the token type and the token value, and then (2) calling the tokenizer to get the next token (because it has consumed a token).
- A recognizer method checks for and processes a non-terminal (syntactic elements on the right-hand side for which there is a grammar rule) by calling the recognizer method that implements that non-terminal.
- If a recognizer method finds a syntax error, it raises an exception of class ParserError.
- Since our example recursive descent parser creates an AST (an abstract syntax tree), whenever a recognizer method successfully recognizes a syntactic construct, it creates an instance of class ASTNode to represent it and returns that instance to its caller. The instance of ASTNode has a node type and contains child nodes which were constructed by recognizer methods called by this one (i.e. that represent non-terminals on the right-hand side of a grammar rule).
- Each time a recognizer method "consumes a token", it calls the tokenizer to get the next token (and token type and line number).
- The tokenizer returns a token type in addition to the token value. It also returns a line number for error reporting.
- The syntax tree is constructed from instances of class ASTNode.
- The ASTNode class has a show method, which walks the AST and produces output. You can imagine that a similar method could do code generation. And, you should consider the possibility of writing analogous tree walk methods that perform tasks such as optimization, annotation of the AST, etc.
20 Random Tutorials from the same category :
Floating Point Arithmetic: Issues and Limitations
Google Sitemaps - Python
Parsing in Phyton
Conditionals and recursion - Python
Brief Tour of the Standard Library - Part II
History
Data Types in Python
Extending and embedding Python
Whetting Your Appetite
Brief Tour of the Standard Library
Input and Output - Python
Schema evolution Python
Using the Python Interpreter
Unit Tests
Python vs. Perl
Simple Statements In Python
Data Structures
Defining regular expressions in Phyton
Input and Output
Python and Java - A Side by Side Comparison













