XML Processing with Python: Part One

Extensible Markup Language, or XML, is a powerful method of data storage based on open standards. The vocabulary of XML can be customized to fit whatever kind of data you want to store. Its format makes it human readable while remaining easy for programs to parse. It encourages semantic markup rather than formatting-based markup, separating content from presentation so that a single piece of data can be repurposed many times and displayed in many ways.

XML Processing: A Simple Hierarchical Markup Language

At the core of XML is a simple hierarchical markup language. Tags are used to mark off sections of content with different semantic meanings, and attributes are used to add metadata about the content.

Here is an example of a simple XML document that could be used to describe different baseball teams:

<?xml version="1.0"?>
<mlb>
  <team>
    <name>New York Mets</name>
    <generalmanager>Sandy Alderson</generalmanager>
  </team>
  <team>
    <name>Washington Nationals</name>
    <generalmanager>Mike Rizzo</generalmanager>
  </team>
  <team>
    <name>Atlanta Braves</name>
    <generalmanager>John Hart</generalmanager>
  </team>
</mlb>

Notice that every piece of data is wrapped in a tag, and that the tags are nested in a hierarchy, so a tag's position tells you more about the data it wraps. You probably guessed that <generalmanager> is a child piece of information for <team>, as is <name>.
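To make the hierarchy concrete, here is a minimal sketch that reads a shortened copy of the baseball document with Python's standard xml.etree.ElementTree module. The function name list_teams is only an illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the baseball document from above.
DOC = """<?xml version="1.0"?>
<mlb>
  <team>
    <name>New York Mets</name>
    <generalmanager>Sandy Alderson</generalmanager>
  </team>
  <team>
    <name>Washington Nationals</name>
    <generalmanager>Mike Rizzo</generalmanager>
  </team>
</mlb>"""


def list_teams(xml_text):
    """Return (team name, general manager) pairs, one per <team> element."""
    root = ET.fromstring(xml_text)  # root is the <mlb> element
    return [
        (team.findtext("name"), team.findtext("generalmanager"))
        for team in root.findall("team")  # direct <team> children only
    ]


if __name__ == "__main__":
    for name, manager in list_teams(DOC):
        print(f"{name}: {manager}")
```

Each <team> element is a child of the root, and <name> and <generalmanager> are children of their <team>, which is why the lookups are done relative to each team rather than against the whole document.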

Unlike some other markup languages, such as LaTeX, XML requires every piece of data to be enclosed in tags. The top-level tag is known as the document root, and it encloses everything else in the document. An XML document can have only one document root.
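One way to see the single-root rule in action is to hand a parser documents that break it. The sketch below uses the standard xml.etree.ElementTree module; the helper is_well_formed and the sample strings are illustrative names chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


ONE_ROOT = "<mlb><team/><team/></mlb>"  # a single root enclosing everything
TWO_ROOTS = "<team/><team/>"            # two top-level elements
LOOSE_TEXT = "<mlb/>Sandy Alderson"     # data outside the root element

if __name__ == "__main__":
    for sample in (ONE_ROOT, TWO_ROOTS, LOOSE_TEXT):
        print(sample, "->", is_well_formed(sample))
```

Only the first sample parses. The other two are rejected because they have content outside a single enclosing root element.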

Just before the document root is the XML declaration: <?xml version="1.0"?>. The declaration tells the processor that this is an XML document and which version of XML it uses. It is optional (though recommended) in XML 1.0 and required in XML 1.1. As of this writing, there are two versions of XML: 1.0 (last updated in 2008) and 1.1 (last updated in 2006). Because version 1.1 is not widely supported, our examples concentrate on version 1.0.
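When you generate XML from Python, the serializer can write the declaration for you. The following is a small sketch with xml.etree.ElementTree, assuming Python 3.8 or later, where ET.tostring accepts the xml_declaration argument.

```python
import xml.etree.ElementTree as ET

# Build a tiny document in memory.
root = ET.Element("mlb")
ET.SubElement(root, "team")

# Ask for the declaration explicitly (assumes Python 3.8+).
with_decl = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# For UTF-8 output, ElementTree leaves the declaration out unless asked.
without_decl = ET.tostring(root, encoding="utf-8")

if __name__ == "__main__":
    print(with_decl.decode("utf-8"))
    print(without_decl.decode("utf-8"))
```

Both byte strings parse back to the same <mlb> element. The only difference is whether the declaration line appears in front of the document root.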

One problem with semantic markup is the possibility of confusion as data changes contexts. For instance, you might want to have a list of teams in a database about baseball. However, without a human to look at it, the database has no way of knowing that <team> means a baseball team as opposed to, for example, a football team. This is where namespaces come in. A namespace provides a frame of reference for tags. It is given a unique ID in the form of a URI (usually written as a URL), plus a prefix to apply to tags from that namespace. For example, you might create a baseball namespace with an identifier of http://server.domain.tld/NameSpaces/Baseball and a prefix of mlb:, and use that to provide a frame of reference for the tags. With a namespace, the document would look like this:

<?xml version="1.0"?>
<mlb:baseball
    xmlns:mlb="http://server.domain.tld/NameSpaces/Baseball">
  <mlb:team>
    <mlb:name>New York Mets</mlb:name>
    <mlb:generalmanager>Sandy Alderson</mlb:generalmanager>
  </mlb:team>
  <mlb:team>
    <mlb:name>Washington Nationals</mlb:name>
    <mlb:generalmanager>Mike Rizzo</mlb:generalmanager>
  </mlb:team>
  <mlb:team>
    <mlb:name>Atlanta Braves</mlb:name>
    <mlb:generalmanager>John Hart</mlb:generalmanager>
  </mlb:team>
</mlb:baseball>

It’s now explicit that the team element comes from a set of elements defined by a baseball namespace, and can be treated accordingly.
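A parser makes this frame of reference visible. As a rough sketch, xml.etree.ElementTree replaces each prefix with the full namespace identifier, and searches take a prefix map of your own choosing. The bb prefix in the map below is an arbitrary name picked for this example; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

BASEBALL_NS = "http://server.domain.tld/NameSpaces/Baseball"

# A shortened copy of the namespaced document from above.
DOC = f"""<?xml version="1.0"?>
<mlb:baseball xmlns:mlb="{BASEBALL_NS}">
  <mlb:team>
    <mlb:name>New York Mets</mlb:name>
    <mlb:generalmanager>Sandy Alderson</mlb:generalmanager>
  </mlb:team>
</mlb:baseball>"""

root = ET.fromstring(DOC)

# ElementTree stores tags as "{namespace URI}localname", not "prefix:localname".
root_tag = root.tag

# Prefix map used only for searching; "bb" is our own choice of prefix.
NS = {"bb": BASEBALL_NS}
team_names = [
    team.findtext("bb:name", namespaces=NS)
    for team in root.findall("bb:team", NS)
]

if __name__ == "__main__":
    print(root_tag)
    print(team_names)
```

Because the match is made on the URI, a search for a plain team element with no namespace would find nothing in this document.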

A namespace declaration can be added to any node in a document, and that namespace will be available to every descendant node of that node. In most documents, all namespace declarations are applied to the root element of the document, even if the namespace is not used until deeper in the document. In this case, the namespace is applied to every tag in the document, so the namespace declaration must be on the root element.
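The scoping rule can be checked with a parser. In this sketch, which uses xml.etree.ElementTree, the helper name parses and the two sample strings are illustrative. The first sample uses the prefix on the declaring element and its descendant, while the second uses it on a sibling that lies outside the declaration's scope.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://server.domain.tld/NameSpaces/Baseball"

# Prefix declared on <mlb:team>, used on that element and its child: in scope.
IN_SCOPE = (
    f'<league><mlb:team xmlns:mlb="{NS_URI}">'
    "<mlb:name>Atlanta Braves</mlb:name>"
    "</mlb:team></league>"
)

# Prefix declared on <team>, but used on a sibling element: out of scope.
OUT_OF_SCOPE = (
    f'<league><team xmlns:mlb="{NS_URI}"/>'
    "<mlb:name>Atlanta Braves</mlb:name>"
    "</league>"
)


def parses(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("in scope:", parses(IN_SCOPE))
    print("out of scope:", parses(OUT_OF_SCOPE))
```

The second document is rejected because the mlb prefix is not bound where it is used, which is one practical reason declarations usually go on the root element.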

A document can declare and use multiple namespaces. For instance, a document describing a library might use one namespace for library information and a second one to add publisher information.
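As a sketch of how two namespaces coexist, the document below uses a name element from each of two namespaces. The namespace URIs, the lib and pub prefixes, and the book and publisher values are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, made up for this example.
NS = {
    "lib": "http://server.domain.tld/NameSpaces/Library",
    "pub": "http://server.domain.tld/NameSpaces/Publisher",
}

DOC = f"""<lib:library xmlns:lib="{NS['lib']}" xmlns:pub="{NS['pub']}">
  <lib:book>
    <lib:name>Example Book</lib:name>
    <pub:name>Example Press</pub:name>
  </lib:book>
</lib:library>"""

root = ET.fromstring(DOC)
book = root.find("lib:book", NS)

# Same local name, different namespaces, so the two lookups do not collide.
book_name = book.findtext("lib:name", namespaces=NS)
publisher_name = book.findtext("pub:name", namespaces=NS)

if __name__ == "__main__":
    print(book_name, "/", publisher_name)
```

The two name elements share a local name, but the parser treats them as different elements because their namespaces differ.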

Notice the xmlns: prefix on the namespace declaration. The prefixes xml: and xmlns: are reserved for use by XML itself, and others, such as xsl:, are used by convention for XML's associated languages, so avoid them for your own namespaces.
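Because the xml: prefix is reserved, it can be used without any declaration. The short sketch below relies on the standard xml.etree.ElementTree module and the xml:lang attribute. The element content is just sample data.

```python
import xml.etree.ElementTree as ET

# The namespace URI that the reserved xml: prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# No xmlns:xml declaration is needed for the xml: prefix.
elem = ET.fromstring('<name xml:lang="en">New York Mets</name>')

# ElementTree stores the attribute under "{namespace URI}lang".
lang = elem.get(f"{{{XML_NS}}}lang")

if __name__ == "__main__":
    print(lang)
```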

External Links:

XML at Wikipedia

W3 XML home page