XML Processing with Python: Part Three

In the previous article, we discussed the Document Type Definition (DTD) language. In this article, we will discuss Schema and XPath.

XML Processing with Python: Schema

XML Schema was designed to address some of the limitations of DTDs and to provide a more sophisticated, XML-based language for describing document models. It lets you cleanly specify numeric constraints on content (such as how many times an element may occur), describe character data patterns using regular expressions, and express content models such as sequences, choices, and unrestricted models.

If you translated the hypothetical MLB model into a schema carrying the same information as the DTD, you would end up with something like the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="mlb">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="team" maxOccurs="unbounded">
            <xs:complexType>
               <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
               </xs:sequence>
            </xs:complexType>
         </xs:element>
      </xs:sequence>
      <xs:attribute name="owner" type="xs:string" use="required"/>
   </xs:complexType>
</xs:element>
</xs:schema>

This expresses exactly the same data model as the DTD, but some differences are immediately apparent.

To begin with, the document’s top-level node contains a namespace declaration, specifying that all tags starting with xs: belong to the namespace identified by the URI “http://www.w3.org/2001/XMLSchema”. For practical purposes, this means that the schema is itself an XML document with its own document model, so you can validate your schema using the same tools you would use to validate any other XML document.
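Because a schema is ordinary XML, it can be loaded with any XML parser. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree to read the schema above and list the elements it declares. The variable names are illustrative, and the schema text is repeated so that the example is self-contained. Note that ElementTree only parses the schema; it does not validate documents against it, which typically requires a third-party library.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the xs: prefix is bound to in the schema.
XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="mlb">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="team" maxOccurs="unbounded">
            <xs:complexType>
               <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
               </xs:sequence>
            </xs:complexType>
         </xs:element>
      </xs:sequence>
      <xs:attribute name="owner" type="xs:string" use="required"/>
   </xs:complexType>
</xs:element>
</xs:schema>"""

root = ET.fromstring(SCHEMA)

# ElementTree replaces the prefix with the namespace URI in braces.
print(root.tag)  # {http://www.w3.org/2001/XMLSchema}schema

# Collect every xs:element declaration in document order.
# maxOccurs defaults to 1 when the attribute is absent.
declared = [
    (el.get("name"), el.get("maxOccurs", "1"))
    for el in root.iter(f"{{{XS}}}element")
]
for name, max_occurs in declared:
    print(name, max_occurs)
```

The braces-qualified tag shows why the prefix itself does not matter: any prefix bound to the same URI would produce the same element names.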

Next, notice that the preceding document has a hierarchy very similar to the document it is describing. Rather than create individual elements and link them together using references, the document model mimics the structure of the document as closely as possible. You can also create global elements and then reference them in a structure, but references are optional. This creates a more intuitive structure for visualizing the form of possible documents that can be created from this model.

Finally, schemas support attributes such as maxOccurs, which takes either a non-negative integer or the value unbounded; the latter expresses that any number of that element or grouping may occur. Although this schema doesn’t illustrate it, schemas can express that an element’s content matches a specific regular expression, using the pattern facet, and schemas can express more flexible content models by mixing the choice and sequence content models.
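To make the pattern and choice features more concrete, here is a small sketch. The schema fragment is hypothetical and is not part of the MLB model: it declares a code element limited to three uppercase letters and a contact element that holds either a phone or an email. The Python code only reads the fragment with the standard library and uses the re module as a rough stand-in for the schema's regular expression dialect, which is similar to but not identical to Python's.

```python
import re
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical fragment for illustration only.
FRAGMENT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="code">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="contact">
    <xs:complexType>
      <xs:choice>
        <xs:element name="phone" type="xs:string"/>
        <xs:element name="email" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(FRAGMENT)

# The regular expression lives in the value attribute of xs:pattern.
pattern = schema.find(".//xs:pattern", NS).get("value")

# The alternatives allowed by the choice content model.
alternatives = [
    el.get("name") for el in schema.findall(".//xs:choice/xs:element", NS)
]


def matches(value):
    """Approximate the pattern facet: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


print(pattern)            # [A-Z]{3}
print(alternatives)       # ['phone', 'email']
print(matches("BOS"))     # True
print(matches("Boston"))  # False
```

re.fullmatch is used because a schema pattern must match the entire value, not just a substring of it.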

XML Processing with Python: XPath

XPath is a language for describing locations and node sets within an XML document. An XPath expression contains a description of a pattern that a node must match. If the node matches, it is selected; otherwise, it is ignored. Patterns are composed of a series of steps, either relative to a context node or absolutely defined from the document root. An absolute path begins with a slash, a relative one does not, and each step is separated by a slash.

A step contains three parts: an axis that describes the direction to travel, a node test to select nodes along that axis, and optional predicates, which are Boolean tests that a node must meet. An example step might be ancestor-or-self::team[1], where ancestor-or-self is the axis to move along, team is the node test, and [1] is a predicate specifying to select the first node that meets all the other conditions. If the axis is omitted, it is assumed to refer to the child axis for the current node, so mlb/team[1]/name[1] would select the name of the first team in the MLB database.
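The positional steps described above can be tried with the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is a sketch under a few assumptions: the sample document and its team names are invented for illustration, and because ElementTree evaluates paths relative to the root element (here mlb), the leading mlb step is dropped. Axes such as ancestor-or-self are outside ElementTree's subset and are not shown.

```python
import xml.etree.ElementTree as ET

# Invented sample data that follows the MLB document model.
DOC = """<mlb owner="Hypothetical Owner">
  <team>
    <name>Hypothetical Hawks</name>
    <generalmanager>First Manager</generalmanager>
    <generalmanager>Second Manager</generalmanager>
  </team>
  <team>
    <name>Example Eagles</name>
    <generalmanager>Third Manager</generalmanager>
  </team>
</mlb>"""

root = ET.fromstring(DOC)  # root is the <mlb> element

# Child axis is implied; [1] selects the first matching sibling.
first_name = root.find("team[1]/name[1]").text

# last() selects the final matching sibling.
last_name = root.find("team[last()]/name").text

# Positions count only siblings with the same tag name.
second_gm = root.find("team[1]/generalmanager[2]").text

print(first_name)  # Hypothetical Hawks
print(last_name)   # Example Eagles
print(second_gm)   # Second Manager
```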

A node test can be a node type test as well as a node name. For instance, team/node() will return all child nodes of the selected team node, regardless of whether they are text or elements.

The following table describes a handful of XPath shortcuts:

Shortcut    Meaning
@           Specifies the attribute axis. This is an abbreviation for attribute::.
*           Matches any element along the current axis. On the default child axis, this selects all child elements of the current node.
//          Specifies any descendant of the current node. This is an abbreviation for /descendant-or-self::node()/. If used at the beginning of an XPath, it matches elements anywhere in the document.
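A brief sketch of these shortcuts, again using xml.etree.ElementTree and an invented sample document. Two caveats apply to this particular library: paths on an element must be relative, so the descendant shortcut is written as .// rather than //, and @ is accepted only inside predicates, so the attribute value is read with get() instead of an attribute step.

```python
import xml.etree.ElementTree as ET

# Invented sample data that follows the MLB document model.
DOC = """<mlb owner="Hypothetical Owner">
  <team>
    <name>Hypothetical Hawks</name>
    <generalmanager>First Manager</generalmanager>
  </team>
  <team>
    <name>Example Eagles</name>
    <generalmanager>Second Manager</generalmanager>
  </team>
</mlb>"""

root = ET.fromstring(DOC)

# "*" matches every child element of the context node.
child_tags = [el.tag for el in root.findall("*")]

# ".//" searches all descendants of the context node.
all_names = [el.text for el in root.findall(".//name")]

# ElementTree exposes attributes through get() rather than the @ step.
owner = root.get("owner")

print(child_tags)  # ['team', 'team']
print(all_names)   # ['Hypothetical Hawks', 'Example Eagles']
print(owner)       # Hypothetical Owner
```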
