XML Processing with Python: Part Three

In the previous article, we discussed the Document Type Definition (DTD) language. In this article, we will discuss Schema and XPath.

XML Processing with Python: Schema

Schema was designed to address some of the limitations of DTDs and provide a more sophisticated XML-based language for describing document models. It enables you to cleanly specify numeric models for content, describe character data patterns using regular expressions, and express content models such as sequences, choices, and unrestricted models.

If you wanted to translate the hypothetical Major League Baseball model into a schema with the same information contained in the DTD, you would wind up with something like the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="mlb">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="team" maxOccurs="unbounded">
            <xs:complexType>
               <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
               </xs:sequence>
            </xs:complexType>
         </xs:element>
      </xs:sequence>
      <xs:attribute name="commissioner" type="xs:string" use="required"/>
   </xs:complexType>
</xs:element>
</xs:schema>

This expresses exactly the same data model as the DTD, but some differences are immediately apparent.

To begin with, the document’s top-level node contains a namespace declaration, specifying that all tags starting with xs: belong to the namespace identified by the URL “http://www.w3.org/2001/XMLSchema”. For practical purposes, this means that the schema language itself has a document model, so you can validate your schema using the same tools you would use to validate any other XML document.
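Because the schema is itself XML, an ordinary XML parser can read it. The sketch below uses Python's standard xml.etree.ElementTree module to list the elements a schema declares. The SCHEMA string is a trimmed, elements-only copy of the listing above, and declared_elements is an illustrative helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Trimmed copy of the schema above (element declarations only).
SCHEMA = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="mlb">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="team" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def declared_elements(schema_text):
    """Return (name, maxOccurs) for every xs:element, in document order."""
    root = ET.fromstring(schema_text)
    # ElementTree spells namespaced tags as {namespace-uri}localname.
    return [(el.get("name"), el.get("maxOccurs", "1"))
            for el in root.iter(f"{{{XS}}}element")]

for name, max_occurs in declared_elements(SCHEMA):
    print(name, max_occurs)
```

When maxOccurs is omitted, the helper falls back to "1", which is the default the schema language assigns.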

Next, notice that the preceding document has a hierarchy very similar to the document it is describing. Rather than create individual elements and link them together using references, the document model mimics the structure of the document as closely as possible. You can also create global elements and then reference them in a structure, but references are optional. This creates a more intuitive structure for visualizing the form of possible documents that can be created from this model.

Finally, schemas support attributes such as maxOccurs, which takes either a non-negative integer or the value unbounded, which expresses that any number of that element or grouping may occur. Although this schema doesn’t illustrate it, schemas can express that an element’s content matches a specific regular expression, using the pattern facet, and schemas can express more flexible content models by mixing the choice and sequence content models.
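As a rough illustration of pattern matching in a schema, the sketch below declares a hypothetical abbreviation element (not part of the article's model) restricted to three uppercase letters, reads the pattern back with ElementTree, and tries it with Python's re module. Python's re is not a schema regular-expression engine; re.fullmatch only approximates the whole-value matching that schemas apply, and only for simple patterns like this one.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical declaration: "abbreviation" is an invented element for illustration.
FRAGMENT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="abbreviation">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""

def pattern_for(schema_text, element_name):
    """Return the pattern value declared for element_name, or None."""
    root = ET.fromstring(schema_text)
    for el in root.iter(f"{{{XS}}}element"):
        if el.get("name") == element_name:
            facet = el.find(f".//{{{XS}}}pattern")
            return facet.get("value") if facet is not None else None
    return None

def matches(pattern, text):
    # Schema patterns must match the whole value, so fullmatch is the closest analogue.
    return re.fullmatch(pattern, text) is not None

pattern = pattern_for(FRAGMENT, "abbreviation")
print(pattern, matches(pattern, "NYY"), matches(pattern, "nyy"))
```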

XML Processing with Python: XPath

XPath is a language for describing locations and node sets within an XML document. An XPath expression contains a description of a pattern that a node must match. If the node matches, it is selected; otherwise, it is ignored. Patterns are composed of a series of steps, either relative to a context node or absolutely defined from the document root. An absolute path begins with a slash, a relative one does not, and each step is separated by a slash.

A step contains three parts: an axis that describes the direction to travel, a node test to select nodes along that axis, and optional predicates, which are Boolean tests that a node must meet. An example step might be ancestor-or-self::team[1], where ancestor-or-self is the axis to move along, team is the node test, and [1] is a predicate specifying to select the first node that meets all the other conditions. If the axis is omitted, it is assumed to refer to the child axis for the current node, so mlb/team[1]/name[1] would select the name of the first team in the MLB database.
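The path from the paragraph above can be tried with Python's standard xml.etree.ElementTree module, with two caveats. ElementTree supports only a subset of XPath, so axes such as ancestor-or-self are not available. Paths are also evaluated relative to an element, so with mlb as the context node the path starts at team. The document and the names in it are placeholders invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Placeholder data; the team and manager names are invented.
DOC = """<mlb commissioner="Example Commissioner">
  <team>
    <name>Team A</name>
    <generalmanager>Manager A</generalmanager>
  </team>
  <team>
    <name>Team B</name>
    <generalmanager>Manager B1</generalmanager>
    <generalmanager>Manager B2</generalmanager>
  </team>
</mlb>"""

root = ET.fromstring(DOC)  # root is the <mlb> element, used as the context node

# Two steps on the implicit child axis, each with a positional predicate.
first_name = root.find("team[1]/name[1]")
print(first_name.text)

# Predicates apply per step: second team, then its second general manager.
second_gm = root.find("team[2]/generalmanager[2]")
print(second_gm.text)
```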

A node test can be a node type test, written like a function call, as well as a node name. For instance, team/node() will return all child nodes of the selected team node, regardless of whether they are text or elements.
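ElementTree has no node() test, but the standard xml.dom.minidom module exposes the same idea through childNodes, which holds text nodes and element nodes alike. The sketch below is an analogue rather than an XPath evaluation, and the team content is placeholder data.

```python
from xml.dom import minidom

# Placeholder data; the whitespace between tags becomes text nodes.
TEAM = """<team>
  <name>Team A</name>
  <generalmanager>Manager A</generalmanager>
</team>"""

team = minidom.parseString(TEAM).documentElement

# Roughly what team/node() selects: every child, text or element.
all_children = [child.nodeName for child in team.childNodes]
print(all_children)

# Roughly what team/* selects: element children only.
element_children = [child.nodeName for child in team.childNodes
                    if child.nodeType == child.ELEMENT_NODE]
print(element_children)
```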

The following table describes a handful of shortcuts for axes:

Shortcut | Meaning
@ | Specifies the attribute axis. This is an abbreviation for attribute::.
* | Matches any element along the current axis, whatever its name.
// | Specifies any descendant of the current node. This is an abbreviation for /descendant-or-self::node()/. If used at the beginning of an XPath, it matches nodes anywhere in the document.
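Two of these shortcuts work directly in Python's standard xml.etree.ElementTree module. The third, @, is only accepted inside predicates there, so the sketch reads the attribute value with get() instead. The document is placeholder data invented for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder data; the names are invented.
DOC = """<mlb commissioner="Example Commissioner">
  <team>
    <name>Team A</name>
    <generalmanager>Manager A</generalmanager>
  </team>
  <team>
    <name>Team B</name>
    <generalmanager>Manager B1</generalmanager>
    <generalmanager>Manager B2</generalmanager>
  </team>
</mlb>"""

root = ET.fromstring(DOC)

# // : every generalmanager at any depth below the context node.
all_gms = [gm.text for gm in root.findall(".//generalmanager")]
print(all_gms)

# * : every child element of the first team, whatever its tag.
first_team_children = [el.tag for el in root.findall("team[1]/*")]
print(first_team_children)

# @ : a full XPath engine would select /mlb/@commissioner;
# ElementTree does not return attribute nodes, so read the value directly.
commissioner = root.get("commissioner")
print(commissioner)
```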

XML Processing with Python: Part Two

XML is more than just a way to store hierarchical data; if that were all it offered, it would lose out to the more lightweight data storage methods that already exist. XML’s big strength lies in its extensibility and in its companion standards: XSLT, XPath, Schema, and DTD, along with other standards for querying, linking, describing, displaying, and manipulating data. Schemas and DTDs provide a way to describe XML vocabularies and to validate documents. XSLT provides a powerful transformation engine to turn one XML vocabulary into another, or into HTML, plaintext, PDF, or a host of other formats. XPath is a query language for describing XML node sets. XSL-FO provides a way to create XML that describes the format and layout of a document for transformation to PDF or other visual formats.

Another good thing about XML is that most of the tools for working with XML are also written in XML, and can be manipulated using the same tools. XSLTs are written in XML, as are schemas. What this means in practical terms is that it is easy to use an XSLT to write another XSLT or a schema, or to validate XSLTs or schemas using schemas.

XML Processing with Python: Schemas and Document Type Definitions

Schemas and Document Type Definitions (DTDs) are both ways of implementing document models. A document model is a way of describing the vocabulary and structure of a document. You define the data elements that will be present in your document, what relationship they have to one another, and how many of them you expect. For example, a document model for the XML example in the previous article might read as follows:

Major League Baseball is a collection of teams overseen by a single commissioner. Each team has a name and a general manager.

DTDs and schemas have different ways of expressing this document model, but they both describe the same basic formula for the document. Subtle differences exist between the two, but they have roughly the same capabilities.

Document models are used when you want to be able to validate content against a standard before manipulating or processing it. They are useful whenever you will be interchanging data with an application that may change data models unexpectedly, or when you want to constrain what a user can enter, as in an XML-based documentation system where you will be working with hand-created XML rather than with something from an application.
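To make the idea of validating against a document model concrete, here is a hand-rolled check of the Major League Baseball model written with Python's standard xml.etree.ElementTree module. It is a sketch of what a validator does, not a substitute for a DTD or schema. The function name check_mlb is invented for illustration, and the rule that a team may list more than one general manager follows the DTD shown later in this article.

```python
import xml.etree.ElementTree as ET

def check_mlb(xml_text):
    """Return a list of problems; an empty list means the document fits the model."""
    root = ET.fromstring(xml_text)
    if root.tag != "mlb":
        return [f"root element is <{root.tag}>, expected <mlb>"]

    problems = []
    if "commissioner" not in root.attrib:
        problems.append("mlb is missing the commissioner attribute")

    teams = root.findall("team")
    if not teams:
        problems.append("mlb must contain at least one team")

    for i, team in enumerate(teams, start=1):
        tags = [child.tag for child in team]
        # A team is one name followed by one or more general managers.
        if len(tags) < 2 or tags[0] != "name" or set(tags[1:]) != {"generalmanager"}:
            problems.append(f"team {i} has unexpected children: {tags}")
    return problems

# Placeholder document; the names are invented.
print(check_mlb(
    '<mlb commissioner="Example Commissioner">'
    "<team><name>Team A</name><generalmanager>Manager A</generalmanager></team>"
    "</mlb>"
))
```

Every rule here had to be written by hand; a DTD or schema lets you state the same rules declaratively and hand the checking to a general-purpose validator.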

A DTD is a Document Type Definition. DTDs were the original method of expressing a document model and are commonplace on the Internet. They were originally created for describing SGML, and the syntax has barely changed since that time, so DTDs have had quite a while to proliferate. The World Wide Web Consortium (W3C) continues to express document types using DTDs, so DTDs exist for each of the HTML standards, for Scalable Vector Graphics (SVG), MathML, and for other useful XML vocabularies.

If you were to translate the English description of the example Major League Baseball document into a DTD, it might look something like this:

<?xml version="1.0"?>
<!ELEMENT mlb (team+)>
<!ATTLIST mlb
commissioner CDATA #REQUIRED
>
<!ELEMENT team (name, generalmanager+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT generalmanager (#PCDATA)>

To add a reference to this DTD in the Major League Baseball file discussed before, you would insert a line at the top of the file, after the XML declaration, that reads <!DOCTYPE mlb SYSTEM "mlb.dtd">, where mlb is the name of the document's root element and mlb.dtd is the path to the DTD on your system.
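The sketch below shows where the DOCTYPE line sits and how Python's standard xml.dom.minidom module reports it. The name in the DOCTYPE has to match the document's root element, so the sketch uses mlb. Note that minidom only records the declaration; it does not load mlb.dtd or validate against it, so a separate validating parser would be needed for that. The document content is placeholder data.

```python
from xml.dom import minidom

# Placeholder document; the DOCTYPE follows the XML declaration.
DOC = """<?xml version="1.0"?>
<!DOCTYPE mlb SYSTEM "mlb.dtd">
<mlb commissioner="Example Commissioner">
  <team>
    <name>Team A</name>
    <generalmanager>Manager A</generalmanager>
  </team>
</mlb>"""

doc = minidom.parseString(DOC)

print(doc.doctype.name)             # root element the DTD describes
print(doc.doctype.systemId)         # where the DTD is expected to live
print(doc.documentElement.tagName)  # actual root element of the document
```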

The first line, <?xml version="1.0"?>, tells you that this is going to be an XML document. Technically, this line is optional. The next line, <!ELEMENT mlb (team+)>, tells you that there is an element known as mlb, which can have one or more child elements of the team type. The syntax for element frequencies and groupings in DTDs is terse, but similar to that of regular expressions.
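The resemblance to regular expressions can be made literal. In the sketch below, the two content models from the DTD are rewritten by hand as Python regular expressions and matched against the sequence of child tags of an element. The CONTENT_MODELS table and the children_match helper are illustrative names, and this is an analogy rather than how a DTD validator is actually implemented.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex equivalents of the DTD content models.
# Each child tag is written as "<tag>" so that names cannot run together.
CONTENT_MODELS = {
    "mlb": r"(<team>)+",                    # (team+)
    "team": r"<name>(<generalmanager>)+",   # (name, generalmanager+)
}

def children_match(element):
    """Check an element's child tags against its content model."""
    sequence = "".join(f"<{child.tag}>" for child in element)
    return re.fullmatch(CONTENT_MODELS[element.tag], sequence) is not None

ok = ET.fromstring("<team><name/><generalmanager/><generalmanager/></team>")
wrong_order = ET.fromstring("<team><generalmanager/><name/></team>")
print(children_match(ok), children_match(wrong_order))
```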

The next bit is:
<!ATTLIST mlb
commissioner CDATA #REQUIRED
>

The first line specifies that the mlb element has a list of attributes. Notice that the attribute list is separate from the mlb element declaration itself and linked to it by the element name. If the element name changes, the attribute list must be updated to point to the new element name. Next is a list of attributes for the element. In this case, mlb has only one attribute, but the list can contain any number of attributes. Each attribute declaration has three mandatory parts: an attribute name, an attribute type, and an attribute description. An attribute type can be either a data type, as specified by the DTD specification, or a list of allowed values. The attribute description specifies the behavior of the attribute: whether it is optional or required, and what its default value is, if it has one. Here, CDATA means the value is character data, and #REQUIRED means the attribute must be present.

DTDs have a number of limitations. Although it is possible to express complex structures in DTDs, they become very difficult to maintain. DTDs have difficulty cleanly expressing numeric bounds on a document model. If you wanted to specify that MLB can contain no more than 30 teams, you would have to spell out every occurrence, along the lines of <!ELEMENT mlb (team, team, team, team etc etc)>, which quickly becomes an unreadable mess of code. DTDs also make it hard to permit a number of elements in any order. If you have three elements that you could receive in any order, you have to write <!ELEMENT team ( ( (name, ((generalmanager, stadium) | (stadium, generalmanager))) | (generalmanager, ((name, stadium) | (stadium, name))) | (stadium, ((name, generalmanager) | (generalmanager, name)))))>, which is beginning to look more like LISP and is more complicated than it should be. Finally, DTDs do not allow you to specify a pattern for data. Thankfully, the W3C has published a specification for a slightly more sophisticated language for describing documents, known as Schema.
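A short Python sketch shows how quickly these content models grow. Both helpers are illustrative names that only build strings; nothing here parses or validates a DTD. The bounded model is nested so that each team is matched by exactly one part of the model, which keeps it deterministic, as XML expects of content models.

```python
def at_most(name, limit):
    """Content model allowing between 1 and limit occurrences of one element."""
    model = name
    for _ in range(limit - 1):
        model = f"{name}, ({model})?"
    return f"({model})"

def any_order(names):
    """Nested content model accepting the given elements in every order."""
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        alternatives.append(f"({first}, {any_order(rest)})")
    return "(" + " | ".join(alternatives) + ")"

print(at_most("team", 3))
print(any_order(["name", "generalmanager", "stadium"]))
print(len(at_most("team", 30)))  # length of the 30-team model, in characters
```

The any-order model has one branch per permutation, so it grows factorially with the number of elements, while a schema expresses the 30-team bound with a single maxOccurs="30".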
