XML Processing with Python: Part Seven

Python's standard library provides a SAX parser, xml.sax, and a DOM parser, xml.dom.minidom. This article gives an overview of xml.dom.minidom.

xml.dom.minidom is a lightweight DOM implementation, designed to be simpler and smaller than a full DOM implementation.

Converting from XML to DOM and Back

In the following example, we’ll use the example XML file from the first article in this series, which we’ll save as mlb.xml:

<?xml version="1.0"?>
<mlb>
<team>
<name>New York Mets</name>
<generalmanager>Sandy Alderson</generalmanager>
</team>
<team>
<name>Washington Nationals</name>
<generalmanager>Mike Rizzo</generalmanager>
</team>
<team>
<name>Atlanta Braves</name>
<generalmanager>John Hart</generalmanager>
</team>
</mlb>

Then we’ll enter the following code into our Python interpreter:

from xml.dom.minidom import parse

def printMLB(mlb):
    teams = mlb.getElementsByTagName("team")
    for team in teams:
        print("*****Team*****")
        print("Name: %s"  % team.getElementsByTagName("name")[0].childNodes[0].data)
        for gm in team.getElementsByTagName("generalmanager"):
            print("General manager: %s" % gm.childNodes[0].data)

# Open an XML file and parse it into a DOM
myDoc = parse('mlb.xml')
# The root <mlb> element holds all the team elements
myMLB = myDoc.getElementsByTagName("mlb")[0]
# Print each team's name and general manager
printMLB(myMLB)
# Insert a new team in the list
newTeam = myDoc.createElement("team")
newTeamName = myDoc.createElement("name")
teamNameText = myDoc.createTextNode("Miami Marlins")
newTeamName.appendChild(teamNameText)
newTeam.appendChild(newTeamName)
newGeneralManager = myDoc.createElement("generalmanager")
generalManager = myDoc.createTextNode("Dan Jennings")
newGeneralManager.appendChild(generalManager)
newTeam.appendChild(newGeneralManager)
myMLB.appendChild(newTeam)
print("Added a new team!")
print("##########################")
printMLB(myMLB)
# Remove a team from the list
# Find New York Mets
for team in myMLB.getElementsByTagName("team"):
    for name in team.getElementsByTagName("name"):
        if name.childNodes[0].data.find("New York Mets") != -1:
            removedTeam = myMLB.removeChild(team)
            removedTeam.unlink()
print("Removed a team.")
print("##########################")
printMLB(myMLB)
# Write back to the XML file; the with block closes the file for us
with open("mlb.xml", "w") as mlb:
    mlb.write(myMLB.toprettyxml(" "))

To create a DOM, the document needs to be parsed into a document tree. This is done by calling the parse function from xml.dom.minidom. It returns a Document object, which has methods for querying child nodes, getting all nodes in the document with a given name, and creating new nodes, among other things. The getElementsByTagName method returns a list of node objects whose names match the argument; here it is used to extract the root node of the document, the <mlb> node. The printMLB function uses getElementsByTagName again and then, for each team node, prints the name and general manager. An element that contains only text has a single child node, a text node, and the text is stored in the data attribute of that node, so team.getElementsByTagName("name")[0].childNodes[0].data simply retrieves the text node below the <name> element and returns its data as a string.
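The lookup chain described above can be tried without a file on disk. The following is a minimal sketch, assuming a shortened inline copy of the data and using parseString instead of parse; the text_of helper is a hypothetical name introduced only for this illustration.

```python
from xml.dom.minidom import parseString

# A shortened, single-line copy of the mlb.xml data (an assumption for this sketch).
XML_TEXT = (
    "<mlb>"
    "<team><name>New York Mets</name>"
    "<generalmanager>Sandy Alderson</generalmanager></team>"
    "<team><name>Washington Nationals</name>"
    "<generalmanager>Mike Rizzo</generalmanager></team>"
    "</mlb>"
)

doc = parseString(XML_TEXT)
# documentElement is the root element, the same node that
# doc.getElementsByTagName("mlb")[0] returns.
root = doc.documentElement


def text_of(element, tag_name):
    """Return the text of the first <tag_name> descendant (hypothetical helper)."""
    first_match = element.getElementsByTagName(tag_name)[0]
    # The element's only child is a text node; its string lives in .data.
    return first_match.childNodes[0].data


team_names = [text_of(team, "name") for team in root.getElementsByTagName("team")]
print(team_names)
```

Wrapping the chain in a small function keeps the [0].childNodes[0].data detail in one place, which may be easier to read than repeating it at every call site.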

Constructing a new node in DOM requires creating a new node as a piece of the Document object, adding all necessary attributes and child nodes, and then attaching it to the correct node in the document tree. The createElement(tagName) method of the Document object creates a new node with a tag name set to whatever argument has been passed in. Adding text nodes is accomplished almost the same way, with a call to createTextNode(string). When all the nodes have been created, the structure is built by calling the appendChild method of the node to which the newly created node will be attached. Node also has a method called insertBefore(newChild, refChild) for inserting nodes at an arbitrary location in the list of child nodes, and replaceChild(newChild, oldChild) to replace one node with another.
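The three attachment methods can be compared side by side. This is only a sketch under stated assumptions: the starting document is a one-team stand-in for mlb.xml, and make_team is a hypothetical helper, not part of minidom.

```python
from xml.dom.minidom import parseString

# One-team stand-in for mlb.xml (an assumption for this sketch).
doc = parseString("<mlb><team><name>Atlanta Braves</name></team></mlb>")
root = doc.documentElement


def make_team(document, name):
    """Build a detached <team><name>...</name></team> subtree (hypothetical helper)."""
    team = document.createElement("team")
    name_element = document.createElement("name")
    name_element.appendChild(document.createTextNode(name))
    team.appendChild(name_element)
    return team


braves = root.getElementsByTagName("team")[0]

# appendChild puts the new node at the end of the child list.
root.appendChild(make_team(doc, "Miami Marlins"))

# insertBefore places the new node ahead of an existing child.
root.insertBefore(make_team(doc, "New York Mets"), braves)

# replaceChild swaps one child for another and returns the old one.
old = root.replaceChild(make_team(doc, "Washington Nationals"), braves)

names = [n.childNodes[0].data for n in root.getElementsByTagName("name")]
print(names)
```

Note that the new nodes are always created through the Document object (doc) and only then attached to the tree.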

Removing nodes requires first getting a reference to the node being removed and then calling removeChild(childNode) on its parent. After the child has been removed, it's advisable to call unlink() on it. This breaks the internal references in that node and any children still attached to it, so that they can be garbage collected. This method is specific to the minidom implementation and is not available in xml.dom.
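The remove-then-unlink pattern can be wrapped in a function. The sketch below assumes a two-team inline document and an exact match on the team name; remove_team is a hypothetical helper name. Because minidom's getElementsByTagName builds its list at call time, removing nodes inside the loop does not disturb the iteration.

```python
from xml.dom.minidom import parseString

# Two-team stand-in for mlb.xml (an assumption for this sketch).
doc = parseString(
    "<mlb>"
    "<team><name>New York Mets</name></team>"
    "<team><name>Atlanta Braves</name></team>"
    "</mlb>"
)
root = doc.documentElement


def remove_team(root_element, team_name):
    """Remove every <team> whose <name> text equals team_name; return the count."""
    removed = 0
    for team in root_element.getElementsByTagName("team"):
        name = team.getElementsByTagName("name")[0].childNodes[0].data
        if name == team_name:
            # removeChild returns the detached node, which we then unlink.
            root_element.removeChild(team).unlink()
            removed += 1
    return removed


removed_count = remove_team(root, "New York Mets")
remaining = [n.childNodes[0].data for n in root.getElementsByTagName("name")]
print(removed_count, remaining)
```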

Finally, having made all these changes to the document, it would be a good idea to write the DOM back to the file from which it came. xml.dom.minidom includes a utility method called toprettyxml, which takes two optional arguments: an indentation string and a newline string. If not specified, these default to a tab character and \n, respectively. This utility returns the DOM as nicely indented XML and is just the thing for writing back to the file.
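The two arguments can be passed by keyword as indent and newl. Here is a minimal sketch, assuming a one-team inline document and a throwaway file in a temporary directory (the name mlb_copy.xml is made up for the illustration) so that the real mlb.xml is left alone.

```python
import os
import tempfile
from xml.dom.minidom import parseString

# One-team stand-in for mlb.xml (an assumption for this sketch).
doc = parseString("<mlb><team><name>Miami Marlins</name></team></mlb>")

# indent is repeated once per nesting level; newl ends each line.
pretty = doc.documentElement.toprettyxml(indent="  ", newl="\n")
print(pretty)

# Write the result to a scratch file and read it back.
path = os.path.join(tempfile.mkdtemp(), "mlb_copy.xml")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(pretty)

with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

Calling toprettyxml on the root element, as the program above does, leaves out the XML declaration; calling it on the Document object includes one.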

Note that in the program, we added one team (Miami Marlins) and deleted one team (New York Mets). The program should produce the following output:

*****Team*****
Name: New York Mets
General manager: Sandy Alderson
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
Added a new team!
##########################
*****Team*****
Name: New York Mets
General manager: Sandy Alderson
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
*****Team*****
Name: Miami Marlins
General manager: Dan Jennings
Removed a team.
##########################
*****Team*****
Name: Washington Nationals
General manager: Mike Rizzo
*****Team*****
Name: Atlanta Braves
General manager: John Hart
*****Team*****
Name: Miami Marlins
General manager: Dan Jennings

After running the program, I discovered that toprettyxml inserts extra whitespace. Therefore, it might be better to use toxml() (which doesn't insert extra whitespace), or to use something else entirely, such as xml.dom.ext.PrettyPrint from PyXML, as described in this blog article.
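One likely source of the extra whitespace is that an indented file already contains whitespace-only text nodes, which toprettyxml then indents again. The sketch below is one possible workaround rather than the approach from the linked article: it drops those nodes before printing. The strip_whitespace_nodes helper and the inline sample are assumptions made for this illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# An already-indented document, as a file saved by hand would be.
INDENTED = """<mlb>
  <team>
    <name>Atlanta Braves</name>
  </team>
</mlb>"""


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace (hypothetical helper)."""
    # Copy the child list first, because we modify it while looping.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


doc = parseString(INDENTED)
noisy = doc.documentElement.toprettyxml(indent="  ")  # contains blank lines

strip_whitespace_nodes(doc.documentElement)
clean = doc.documentElement.toprettyxml(indent="  ")  # one element per line
compact = doc.documentElement.toxml()                 # no added whitespace at all

print(clean)
print(compact)
```

This only makes sense for data-style XML like mlb.xml, where whitespace between elements carries no meaning.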

External Links:

Download page for PyXML – an XML parser with a better solution than toprettyxml (PrettyPrint)

Wikipedia page on DOM
