XML Processing with Python: Part Five

When parsing XML, you can choose between two types of parser: SAX and DOM. SAX stands for the Simple API for XML. It was originally implemented only for Java and was added to Python in version 2.0. It is a stream-based, event-driven parser. The events are known as document events, and a document event may be one of several things: the start of an element, the end of an element, a text node, or a comment. For example, the following document:

<?xml version="1.0"?>
<team>
<name>New York Mets</name>
</team>

might fire the following events:

   start document
   start element: team
   start element: name
   characters: New York Mets
   end element: name
   end element: team
   end document

Whenever the parser encounters one of these, it fires an event for the calling application to handle. More precisely, it fires the event for the calling application's Content Handler object. A Content Handler is an object that implements a known interface specified by the SAX API, so the parser knows which methods it can call on it.
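To make the idea of a Content Handler concrete, here is a minimal sketch using Python's standard xml.sax module. The class name EventLogger and its events list are illustrative choices, not part of the SAX API; the method names (startDocument, startElement, characters, and so on) are the ones the ContentHandler interface defines. Note that the line breaks between tags also arrive as character data, and that a parser is allowed to deliver text in more than one chunk, so this sketch simply skips whitespace-only chunks.

```python
import xml.sax

DOCUMENT = b"""<?xml version="1.0"?>
<team>
<name>New York Mets</name>
</team>
"""


class EventLogger(xml.sax.ContentHandler):
    """Illustrative handler: records each document event as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        # Whitespace between tags is reported too; ignore it in this sketch.
        if content.strip():
            self.events.append("characters: " + content)


handler = EventLogger()
xml.sax.parseString(DOCUMENT, handler)
print("\n".join(handler.events))
```

Run against the sample document, this should print a sequence of events in the same order as the list shown earlier.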

When parsing a document with SAX, the document is read and parsed in the order in which it appears. The parser opens the file or other data source as a stream (so it does not have to load it all at once) and fires events as it encounters each part of the document. Because the parser does not wait for the whole document to load, a SAX application can start processing almost as soon as reading begins. The trade-off is that SAX may process part of a document before discovering that it is not well-formed. As a result, SAX-based applications should implement their own error checking.
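The following sketch illustrates that last point with Python's xml.sax module. It assumes a hypothetical document whose final closing tag is misspelled, and a made-up handler class, StartRecorder, that only records element starts. The handler has already received events for the valid part of the document by the time the parser raises SAXParseException, so the application has to decide what to do with that partial work.

```python
import xml.sax

# Hypothetical malformed input: the last closing tag is misspelled as </tem>.
BROKEN = b"<team><name>New York Mets</name></tem>"


class StartRecorder(xml.sax.ContentHandler):
    """Illustrative handler: remembers the name of every element that starts."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


collector = StartRecorder()
error = None
try:
    xml.sax.parseString(BROKEN, collector)
except xml.sax.SAXParseException as exc:
    # The default error handler raises on fatal (well-formedness) errors.
    error = exc

print("elements seen before the error:", collector.started)
print("parse error:", error.getMessage() if error else None)
```

One simple policy, shown here, is to catch the exception and discard or roll back whatever the handler collected; a real application might instead buffer results and commit them only after parsing finishes.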

When working with SAX, document events are handled by event handlers. You declare callback functions for the specific types of document event you care about and pass them to the parser, which calls each one when a matching document event occurs. In Python's xml.sax module, these callbacks are methods on a handler object.
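As a rough sketch of declaring only the callbacks you need, the handler below collects the text of every name element and ignores all other events. TeamNameHandler and its names attribute are hypothetical, chosen for this example; make_parser, setContentHandler, and parse are part of Python's xml.sax module. Text is accumulated in a buffer because the parser may report it in several pieces.

```python
import io
import xml.sax


class TeamNameHandler(xml.sax.ContentHandler):
    """Illustrative handler: collects the text of each <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means we are not inside a <name> element

    def startElement(self, name, attrs):
        if name == "name":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


parser = xml.sax.make_parser()
handler = TeamNameHandler()
parser.setContentHandler(handler)  # hand the callbacks to the parser
parser.parse(io.BytesIO(b"<team><name>New York Mets</name></team>"))
print(handler.names)  # ['New York Mets']
```

Any event type the handler does not override falls through to the do-nothing defaults inherited from ContentHandler.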

In the next article, we will introduce DOM, weigh the pros and cons of SAX and DOM, and discuss the available parsers.
