XML Processing with Python: Part Two

XML ProcessingXML is more than just a way to store hierarchical data. Otherwise, it would fall to more lightweight data storage methods that already exist. XML’s big strength lies in its extensibility, and its companion standards, XSLT, XPath, Schema, and DTD languages, as well as other standards for querying, linking, describing, displaying and manipulating data. Schemas and DTDs provide a way for describing XML vocabularies and a way to validate documents. XSLT provides a powerful transformation engine to turn one XML vocabulary into another, or into HTML, plaintext, PDF, or a host of other formats. XPath is a query language for describing XML node sets. XSL-FO provides a way to create XML that describes the format and layout of a document for transformation to PDF or other visual formats.

Another good thing about XML is that most of the tools for working with XML are also written in XML, and can be manipulated using the same tools. XSLTs are written in XML, as are schemas. What this means in practical terms is that it is easy to use an XSLT to write another XSLT or a schema, or to validate XSLTs or schemas using schemas.

XML Processing with Python: Schemas and Document Type Definitions

Schemas and Document Type Definitions (DTDs) are both ways of implementing document models. A document model is a way of describing the vocabulary and structure of a document. You define the data elements that will be present in your document, what relationship they have to one another, and how many of them you expect. For example, a document model for the XML example in the previous article might read as follows:

Major League Baseball is a collection of teams overseen by a single commissioner. Each team has a name and a general manager.

DTDs and schemas have different ways of expressing this document model, but they both describe the same basic formula for the document. Subtle differences exist between the two, but they have roughly the same capabilities.

Document models are used when you want to be able to validate content against a standard before manipulating or processing it. They are useful whenever you will be interchanging data with an application that may change data models unexpectedly, or when you want to constrain what a user can enter, as in an XML-based documentation system where you will be working with hand-created XML rather than with something from an application.

A DTD is a Document Type Definition. Therse were the original methods of expressing a document model and are commonplace on the Internet. DTDs were originally created for describing SGML, and the syntax has barely changed since that time, so DTDs have had quite a while to proliferate. The World Wide Web Consortium (W3C) continues to express document types using DTDs, so DTDs exist for each of the HTML standards, for Scalable Vector Graphics (SVG), MathmL and for other useful XML vocabularies.

If you were to translate the English description of the example Major League Baseball document into a DTD, it might look something like this:

<?xml version=”1.0″?>
<!ELEMENT mlb (team+)>
<!ATTLIST mlb
commisioner CDATA #REQUIRED
>
<!ELEMENT team (name, generalmanager+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT generalmanager (#PCDATA)>

To add a reference to this DTD in the library file discussed before, you would insert a line at the top of the file after the XML declaration that read <!DOCTYPE config SYSTEM “mlb.dtd”>, where mlb.dtd was the path to the DTD on your system.

The first line, <?xml version=”1.0″?> tells you that this is going to be an XML document. Technically, this line is optional. The next line, <!ELEMENT mlb (team+)>, tells you that there is an element known as mlb, which can have one or more child elements of the team type. The syntax for element frequencies and groupings in DTDs is terse, but similar to that of regular expressions.

The next bit is:
<!ATTLIST mlb
commissioner CDATA #REQUIRED
>

The first line specifies that the mlb element has a list of attributes. Notice that the attribute list is separate from the mlb element declaration itself and linked to it by the element name. If the element name changes, the attribute list must be updated to point to the new element name. Next is a list of attributes for the element. In this case, mlb has only one attribute, but the list can contain an unbounded number of attributes. The attribute declaration has three mandatory elements: an attribute name, an attribute type, and an attribute description. An attribute type can be either a data type, as specified by the DTD specification, or a list of allowed values. The attribute description is used to specify the behavior of the attribute. A default value can be described here, and whether the attribute is optional or required.

DTDs have a number of limitations. Although it is possible to express complex structures in DTDs, it becomes very difficult to maintain. DTDs have difficulty cleanly expressing numeric bounds on a document model. If you wanted to specify that MLB can contain no more than 30 teams, you could write <!ELEMENT mlb (team, team, team, team etc etc)>, but that quickly becomes an unreadable mess of code. DTDs also make it hard to permit a number of elements in any order. If you have three elements that you could receive in any order, you have to write <!ELEMENT team ( ( (name, ((generalmanager, stadium) | (stadium, generalmanager))) | (generalmanager, ((name, stadium) | (stadium, name))) | (stadium, ((name, generalmanager) | (generalmanager, name)))))>, which is beginnign to look more like LISP and is more complicated than it should be. Finally, DTDs do not allow you to specify a pattern for data. Thankfully, the W3C has published a specification for a slightly more sophisticated language for describing documents, known as Schema.

External Links:

Document Type Definition on Wikipedia