XML Processing with Python: Part Seven

Python’s standard library includes a SAX parser, xml.sax, and a DOM parser, xml.dom.minidom. This article gives an overview of xml.dom.minidom.

xml.dom.minidom is a lightweight DOM implementation, designed to be simpler and smaller than a full DOM implementation.

Converting from XML to DOM and Back

In the following example, we’ll use the example XML file from the first article in this series, which we’ll save as mlb.xml:

<?xml version="1.0"?>
<mlb>
<team>
<name>New York Mets</name>
<generalmanager>Sandy Alderson</generalmanager>
</team>
<team>
<name>Washington Nationals</name>
<generalmanager>Mike Rizzo</generalmanager>
</team>
<team>
<name>Atlanta Braves</name>
<generalmanager>John Hart</generalmanager>
</team>
</mlb>

Then we’ll enter the following code into our Python interpreter:

from xml.dom.minidom import parse


def printMLB(mlb):
    teams = mlb.getElementsByTagName("team")
    for team in teams:
        print("*****Team*****")
        print("Name: %s"  % team.getElementsByTagName("name")[0].childNodes[0].data)
        for gm in team.getElementsByTagName("generalmanager"):
            print("General manager: %s" % gm.childNodes[0].data)

# Open an XML file and parse it into a DOM
myDoc = parse('mlb.xml')
myMLB = myDoc.getElementsByTagName("mlb")[0]
# Print each team's name and general manager
printMLB(myMLB)
# Insert a new team in the list
newTeam = myDoc.createElement("team")
newTeamName = myDoc.createElement("name")
teamNameText = myDoc.createTextNode("Miami Marlins")
newTeamName.appendChild(teamNameText)
newTeam.appendChild(newTeamName)
newGeneralManager = myDoc.createElement("generalmanager")
generalManager = myDoc.createTextNode("Dan Jennings")
newGeneralManager.appendChild(generalManager)
newTeam.appendChild(newGeneralManager)
myMLB.appendChild(newTeam)
print("Added a new team!")
print("##########################")
printMLB(myMLB)
# Remove a team from the list
# Find New York Mets
for team in myMLB.getElementsByTagName("team"):
    for name in team.getElementsByTagName("name"):
        if name.childNodes[0].data.find("New York Mets") != -1:
            removedTeam = myMLB.removeChild(team)
            removedTeam.unlink()
print("Removed a team.")
print("##########################")
printMLB(myMLB)
# Write back to the XML file
with open("mlb.xml", "w") as mlb:
    mlb.write(myMLB.toprettyxml(" "))

To create a DOM, the document needs to be parsed into a document tree. This is done by calling the parse function from xml.dom.minidom. This function returns a Document object, which has methods for querying child nodes, getting all nodes in the document with a certain name, and creating new nodes, among other things. The getElementsByTagName method returns a list of node objects whose names match the argument; here it is used to extract the root node of the document, the <mlb> node. The printMLB function uses getElementsByTagName again and then, for each team node, prints the name and general manager. An element that contains only text has a single child node, a text node, and the text is stored in the data attribute of that node, so team.getElementsByTagName("name")[0].childNodes[0].data simply retrieves the text node below the <name> element and returns its data as a string.
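The childNodes[0].data lookup assumes that the element exists and contains text. As a minimal sketch of a more defensive lookup, the helper below (first_text is a hypothetical name, not part of minidom, and the <stadium> tag is invented for the example) returns a default when the element is missing or empty:

```python
from xml.dom.minidom import parseString

def first_text(parent, tag_name, default=""):
    """Hypothetical helper: text of the first <tag_name> descendant, or default."""
    matches = parent.getElementsByTagName(tag_name)
    if not matches:
        return default
    # Join every text child, since the text may be split across several nodes.
    return "".join(
        child.data for child in matches[0].childNodes
        if child.nodeType == child.TEXT_NODE
    )

# Small in-memory document so the sketch does not depend on mlb.xml.
doc = parseString(
    "<team><name>Washington Nationals</name><generalmanager/></team>"
)
team = doc.documentElement

name_text = first_text(team, "name")
gm_text = first_text(team, "generalmanager")          # element exists but is empty
stadium_text = first_text(team, "stadium", "unknown")  # element does not exist

print(name_text)
print(repr(gm_text))
print(stadium_text)
```

Whether a missing element should produce a default or an error depends on the application; the default here is only one possible choice.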

Constructing a new node in DOM requires creating a new node as a piece of the Document object, adding all necessary attributes and child nodes, and then attaching it to the correct node in the document tree. The createElement(tagName) method of the Document object creates a new node with a tag name set to whatever argument has been passed in. Adding text nodes is accomplished almost the same way, with a call to createTextNode(string). When all the nodes have been created, the structure is assembled by calling the appendChild method of the node to which the newly created node will be attached. Node also has a method called insertBefore(newChild, refChild) for inserting nodes in an arbitrary location in the list of child nodes, and replaceChild(newChild, oldChild) to replace one node with another.
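The program above only uses appendChild. As a rough sketch of the other two methods, the following example inserts one team before an existing one and then replaces the existing team. The make_team helper is a hypothetical convenience function written for this example, and the document is built in memory rather than read from mlb.xml:

```python
from xml.dom.minidom import parseString

# A tiny in-memory document, so the sketch does not depend on mlb.xml.
doc = parseString("<mlb><team><name>Atlanta Braves</name></team></mlb>")
root = doc.documentElement
braves = root.getElementsByTagName("team")[0]

def make_team(document, team_name):
    """Hypothetical helper: build <team><name>team_name</name></team>."""
    team = document.createElement("team")
    name = document.createElement("name")
    name.appendChild(document.createTextNode(team_name))
    team.appendChild(name)
    return team

# insertBefore(newChild, refChild) places the new node ahead of refChild.
root.insertBefore(make_team(doc, "Miami Marlins"), braves)

# replaceChild(newChild, oldChild) swaps one node for another.
root.replaceChild(make_team(doc, "New York Mets"), braves)
braves.unlink()  # the replaced node is no longer part of the tree

team_names = [n.childNodes[0].data for n in root.getElementsByTagName("name")]
print(team_names)
```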

Removing nodes requires first getting a reference to the node being removed and then calling removeChild(childNode) on its parent. After the child has been removed, it’s advisable to call unlink() on it, which breaks the internal references between the node and any children still attached so that they can be garbage collected. This method is specific to the minidom implementation and is not part of the standard DOM API.

Finally, having made all these changes to the document, it would be a good idea to write the DOM back to the file from which it came. A utility method called toprettyxml is included with xml.dom.minidom. Its optional arguments include an indentation string and a newline string; if not specified, these default to a tab and \n, respectively. This utility renders a DOM as nicely indented XML and is just the thing for writing back to the file.

Note that in the program, we added one team (Miami Marlins) and deleted one team (New York Mets). The program should produce the following output:

*****Team*****
Name: New York Mets
General manager: Sandy Alderson
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
Added a new team!
##########################
*****Team*****
Name: New York Mets
General manager: Sandy Alderson
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
*****Team*****
Name: Miami Marlins
General manager: Dan Jennings
Removed a team.
##########################
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
*****Team*****
Name: Miami Marlins
General manager: Dan Jennings

After running the program, I discovered that toprettyxml inserts extra whitespace. Therefore, it might be better to use toxml() (which doesn’t insert extra whitespace), or to use something else entirely, such as xml.dom.ext.PrettyPrint, as described in this blog article.
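One likely source of the extra whitespace is that the indentation already present in the file is parsed into whitespace-only text nodes, and toprettyxml then adds its own indentation around them. The sketch below is one possible workaround using only minidom: it removes those text nodes before pretty-printing. The strip_whitespace_nodes helper is a hypothetical name, and the approach assumes that whitespace-only text is never meaningful in the document:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_whitespace_nodes(node):
    """Hypothetical helper: drop text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_whitespace_nodes(child)

source = "<mlb>\n  <team>\n    <name>Miami Marlins</name>\n  </team>\n</mlb>"
doc = parseString(source)

# toxml() serializes the tree as parsed, without adding indentation.
compact = doc.documentElement.toxml()

# Remove the indentation text nodes, then let toprettyxml add its own.
strip_whitespace_nodes(doc.documentElement)
pretty = doc.documentElement.toprettyxml(indent=" ")
print(pretty)
```

Copying childNodes into a list before the loop matters, because removing children while iterating over the live list would skip nodes.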

External Links:

Download page for PyXML – an XML parser with a better solution than toprettyxml (PrettyPrint)

Wikipedia page on DOM

XML Processing with Python: Part Six

DOM (Document Object Model)

At the heart of DOM lies the Document object. This is a tree-based representation of the XML document. Tree-based models are a natural fit for XML’s hierarchical structure, making this a very intuitive way of working with XML. Each element in the tree is called a Node object, and it may have attributes, child nodes, text, and so forth, all of which are also objects that are stored in the tree. DOM objects have a number of methods for creating and adding nodes, for finding nodes of a specific type or name, and for reordering or deleting nodes.
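As a small illustration of this object model, the following sketch uses xml.dom.minidom to inspect a node's name, an attribute, and its children. The league attribute is an assumption added for the example; it does not appear in the sample file used elsewhere in this series:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# The league attribute is invented for this example.
doc = parseString('<team league="NL"><name>New York Mets</name></team>')
team = doc.documentElement           # the root element of the Document

print(team.nodeName)                 # tag name of the element
print(team.getAttribute("league"))   # attribute value as a string

name = team.firstChild               # child element <name>
text = name.firstChild               # the text node inside <name>
print(name.nodeType == Node.ELEMENT_NODE)
print(text.nodeType == Node.TEXT_NODE)
print(text.data)
```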

Differences between SAX and DOM

The major difference between SAX and DOM is DOM’s ability to store the entire document in memory and manipulate and search it as a tree, rather than force you to parse the document repeatedly, or force you to build your own in-memory representation of the document. The document is parsed once, and then nodes can be added, removed, or changed in memory and then written back out to a file when the program is finished.

Although either SAX or DOM can do almost anything you might want to do with XML, you might want to use one over the other in certain circumstances. For instance, if you are working on an application in which you will be modifying an XML document repeatedly based on user input, you might want the convenient random access capabilities of DOM. But if you are building an application that needs to process a stream of XML quickly with minimal overhead, SAX might be a better choice for you.

DOM is designed with random access in mind. It provides a tree that can be manipulated at runtime and needs to be loaded into memory only once. SAX is stream-based, so data comes in as a stream one character after the next, but the document isn’t seen in its entirety before it starts getting processed; therefore, if you want to randomly access data, you have to either build a partial tree of the document in memory based on document events, or reparse the document every time you want a different piece of data.
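To make the idea of building a partial in-memory structure from document events concrete, here is a minimal sketch using xml.sax. The TeamIndex class and its attribute names are hypothetical; it keeps only a dictionary mapping team names to general managers and ignores everything else in the stream:

```python
import xml.sax

class TeamIndex(xml.sax.ContentHandler):
    """Hypothetical handler: builds {team name: general manager} from events."""

    def __init__(self):
        super().__init__()
        self.managers = {}
        self._current = None   # tag whose text is being collected, if any
        self._fields = {}      # fields of the team currently being read
        self._text = []

    def startElement(self, name, attrs):
        if name == "team":
            self._fields = {}
        elif name in ("name", "generalmanager"):
            self._current = name
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join it later.
        if self._current:
            self._text.append(content)

    def endElement(self, name):
        if name == self._current:
            self._fields[name] = "".join(self._text)
            self._current = None
        elif name == "team":
            key = self._fields.get("name")
            self.managers[key] = self._fields.get("generalmanager")

data = b"""<mlb>
<team><name>New York Mets</name><generalmanager>Sandy Alderson</generalmanager></team>
<team><name>Atlanta Braves</name><generalmanager>John Hart</generalmanager></team>
</mlb>"""

index = TeamIndex()
xml.sax.parseString(data, index)
print(index.managers["Atlanta Braves"])
```

Once the stream has been consumed, lookups go through the dictionary rather than the document, which is the trade-off the paragraph above describes.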

Most people find the object-oriented behavior of DOM very intuitive and easy to learn. The event-driven model of SAX is more similar to functional programming and can be more challenging to get up to speed on.

If you are working in a memory-limited environment, DOM is probably not a good choice. Even on a fairly high-end system, constructing a DOM tree for a large document (say 2-3 MB) can bring the computer to a halt while it processes. Because SAX treats the document as a stream, it never loads the whole document into memory, so it is preferable if you are memory constrained or working with very large documents.

Using DOM requires a great deal of processing time while the document tree is being built, but once the tree is built, DOM allows for much faster searching and manipulation of nodes because the entire document is in memory. SAX is reasonably fast for searching documents, but less well suited to manipulating them. However, for document transformations, SAX is considered to be the parser of choice because the event-driven model is fast and very compatible with how XSLT works.

In the next article, we’ll look at SAX and DOM parsers for Python.

External Links:

XML DOM Parser at W3Schools

Document Object Model at Wikipedia

XML Processing with Python: Part Five

When parsing XML, you have a choice of two different types of parsers: SAX and DOM. SAX stands for the Simple API for XML. It was originally implemented only for Java and was added to Python in version 2.0. It is a stream-based, event-driven parser. The events are known as document events, and a document event might be one of several things: the start of an element, the end of an element, a text node, or a comment. For example, the following document:

<?xml version="1.0"?>
<team>
<name>New York Mets</name>
</team>

might fire the following events:

   start document
   start element: team
   start element: name
   characters: New York Mets
   end element: name
   end element: team
   end document

Whenever a document event occurs, the parser fires an event for the calling application to handle. More precisely, it fires an event for the calling application’s Content Handler object to handle. Content Handlers are objects that implement a known interface specified by the SAX API from which the parser can call methods.
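As a rough sketch of what such a Content Handler can look like in Python, the class below subclasses xml.sax.ContentHandler and records each event as a line of text. The EventLogger name and its events list are assumptions made for this example, and whitespace-only character data between elements is skipped to keep the log short:

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records document events as strings."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        # Skip the newlines between elements.
        if content.strip():
            self.events.append("characters: " + content)

document = b'<?xml version="1.0"?>\n<team>\n<name>New York Mets</name>\n</team>'

logger = EventLogger()
xml.sax.parseString(document, logger)
for event in logger.events:
    print(event)
```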

When parsing a document with SAX, the document is read and parsed in the order in which it appears. The parser opens the file or another data source as a stream of data (so it doesn’t have to read it all at once) and then fires events whenever an element or other content is encountered. Because the parser does not wait for the whole document to load, SAX can start handling content almost as soon as it begins reading. For the same reason, however, it may process part of a document before discovering that it is badly formed. As a result, SAX-based applications should implement their own error-checking.
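The following sketch shows that behavior with xml.sax: the handler has already received events for the first elements by the time the parser reports that the document is badly formed. The NameCollector class is a hypothetical handler for this example, and catching SAXParseException is only one simple way for an application to do its own error-checking:

```python
import xml.sax

class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the name of every element it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)

# Badly formed on purpose: <name> is closed by </team>.
broken = b"<team><name>New York Mets</team>"

collector = NameCollector()
failed = False
try:
    xml.sax.parseString(broken, collector)
except xml.sax.SAXParseException as err:
    failed = True
    print("Parse error:", err.getMessage(), "at line", err.getLineNumber())

# Events fired before the error have already been handled.
print("Elements seen before the error:", collector.seen)
```

An application that must not act on partial data could buffer its results and commit them only after parsing finishes without an exception.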

When working with SAX, document events are handled by event handlers. You declare callback functions for specific types of document events, which are then passed to the parser and called when a document event occurs that matches the callback function.

In the next article, we will introduce DOM, and the pros and cons of using SAX or DOM, as well as a discussion of available parsers.

External Links:

SAX on Wikipedia

XML Processing with Python: Part Four

XML is similar in structure and form to HTML, and this is not entirely accidental. XML and HTML both originated from SGML and share a number of syntactic features. The earlier versions of HTML are not directly compatible with XML, though, because XML requires that every tag be closed, and certain HTML tags (such as <br> and <img>) don’t require a closing tag. However, the W3C defined XHTML in an attempt to bring the two standards in line with each other. XHTML can be manipulated using the same sets of tools as pure XML. However, Python also comes with specialized libraries designed specifically for dealing with HTML.

The HTMLParser class, unlike the older htmllib module (which was removed in Python 3), is not based on an SGML parser and can be used for both XHTML and earlier versions of HTML. To try using the HTMLParser class, create a sample HTML file named headings.html that contains at least one h1 tag. Then save the file in the directory from which you run Python and run the following code:

from html.parser import HTMLParser
class HeadingParser(HTMLParser):
    inHeading = False
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.inHeading = True
            print("Found a Heading 1")
    def handle_data(self, data):
        if self.inHeading:
            print(data)
    def handle_endtag(self, tag):
        if tag == "h1":
            self.inHeading = False
hParser = HeadingParser()
with open("headings.html", "r") as htmlFile:
    html = htmlFile.read()
hParser.feed(html)

The HTMLParser class defines methods that are called when the parser finds certain types of content, such as a beginning tag, an end tag, or a processing instruction. By default, these methods do nothing. To parse an HTML document, you create a class that inherits from HTMLParser and implements the necessary methods. After a parser class has been created and instantiated, the parser is fed data using the feed method. Data can be fed to it one line at a time or all at once.
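To illustrate feeding data in pieces, the sketch below passes an HTML document to the parser in three chunks, one of which splits a heading in the middle. The HeadingCollector class and the sample markup are assumptions made for this example. Because text can arrive in more than one handle_data call, the handler appends to the current heading instead of assuming a single call:

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Hypothetical parser that collects the text of every <h1>."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_data(self, data):
        # Text may arrive in pieces, so extend the current heading.
        if self.in_h1:
            self.headings[-1] += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

chunks = [
    "<html><body><h1>First ",
    "heading</h1><p>text</p>",
    "<h1>Second</h1></body></html>",
]

parser = HeadingCollector()
for chunk in chunks:
    parser.feed(chunk)
parser.close()  # signal that no more data is coming

print(parser.headings)
```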

This example class only handles tags of type <h1>. When an HTMLParser encounters a tag, the handle_starttag method is called, and the tag name and any attached attributes are passed to it.
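The attributes arrive as a list of (name, value) pairs. As a minimal sketch, assuming headings that carry an id attribute (the markup and the HeadingIds class are invented for this example), the pairs can be turned into a dictionary for easy lookup:

```python
from html.parser import HTMLParser

class HeadingIds(HTMLParser):
    """Hypothetical parser that records the id attribute of each <h1>."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # attrs is a list of (name, value) tuples.
            attributes = dict(attrs)
            self.ids.append(attributes.get("id"))

parser = HeadingIds()
parser.feed('<h1 id="intro" class="title">Intro</h1><h1>No id here</h1>')
parser.close()

print(parser.ids)
```

A heading without an id is recorded as None here; an application might prefer to skip such headings instead.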

The handle_starttag method determines whether the tag is an <h1>. If so, it prints a message saying it has encountered an h1 and sets a flag indicating that the parser is currently inside an <h1>. If text data is found, the handle_data method is called, which uses the flag to determine whether the text is inside an <h1>. If the flag is true, the method prints the text data. If a closing tag is encountered, the handle_endtag method is called, which determines whether the tag that was just closed was an <h1>. If so, it sets the flag to false.

External Links:

HTMLParser at docs.python.org

Using the Python HTMLParser library