XML Processing with Python: Part Three

In the previous article, we discussed the Document Type Definition (DTD) language. In this article, we will discuss Schema and XPath.

XML Processing with Python: Schema

Schema was designed to address some of the limitations of DTDs and provide a more sophisticated XML-based language for describing document models. It enables you to cleanly specify numeric models for content, describe character data patterns using regular expressions, and express content models such as sequences, choices, and unrestricted models.

If you wanted to translate the hypothetical Major League Baseball model into a schema with the same information contained in the DTD, you would wind up with something like the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="mlb">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="team" maxOccurs="unbounded">
            <xs:complexType>
               <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
               </xs:sequence>
            </xs:complexType>
         </xs:element>
      </xs:sequence>
      <xs:attribute name="commissioner" type="xs:string" use="required"/>
   </xs:complexType>
</xs:element>
</xs:schema>

This expresses exactly the same data model as the DTD, but some differences are immediately apparent.

To begin with, the document’s top-level node contains a namespace declaration, specifying that all tags starting with xs: belong to the namespace identified by the URL "http://www.w3.org/2001/XMLSchema". For practical purposes, this means that the schema is itself an XML document with a document model of its own, so you can validate your schema using the same tools you would use to validate any other XML document.
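Because a schema is ordinary XML, general-purpose XML tools can read it. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a copy of the schema above (without its attribute declaration) held in a string. The variable names are illustrative, and the code only lists what the schema declares; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

schema_text = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="mlb">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="team" maxOccurs="unbounded">
            <xs:complexType>
               <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="generalmanager" type="xs:string" maxOccurs="unbounded"/>
               </xs:sequence>
            </xs:complexType>
         </xs:element>
      </xs:sequence>
   </xs:complexType>
</xs:element>
</xs:schema>"""

root = ET.fromstring(schema_text)

# Collect every xs:element declaration in document order, together with
# its maxOccurs attribute (None when the attribute is not written out).
declared = [(el.get("name"), el.get("maxOccurs"))
            for el in root.iter("{%s}element" % XS)]

for name, max_occurs in declared:
    print(name, max_occurs)
```

ElementTree spells a namespaced tag as {namespace-URL}localname, which is why the xs: prefix does not appear in the search string.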

Next, notice that the preceding document has a hierarchy very similar to the document it is describing. Rather than create individual elements and link them together using references, the document model mimics the structure of the document as closely as possible. You can also create global elements and then reference them in a structure, but references are optional. This creates a more intuitive structure for visualizing the form of possible documents that can be created from this model.

Finally, schemas support attributes such as maxOccurs, which takes either a non-negative integer or the value unbounded, which expresses that any number of that element or grouping may occur. Although this schema doesn’t illustrate it, schemas can express that an element’s content must match a specific regular expression, using the pattern facet, and schemas can express more flexible content models by mixing the choice and sequence content models.

XML Processing with Python: XPath

XPath is a language for describing locations and node sets within an XML document. An XPath expression contains a description of a pattern that a node must match. If the node matches, it is selected; otherwise, it is ignored. Patterns are composed of a series of steps, either relative to a context node or absolutely defined from the document root. An absolute path begins with a slash, a relative one does not, and each step is separated by a slash.

A step contains three parts: an axis that describes the direction to travel, a node test to select nodes along that axis, and optional predicates, which are Boolean tests that a node must meet. An example step might be ancestor-or-self::team[1], where ancestor-or-self is the axis to move along, team is the node test, and [1] is a predicate specifying to select the first node that meets all the other conditions. If the axis is omitted, it is assumed to refer to the child axis for the current node, so mlb/team[1]/name[1] would select the name of the first team in the MLB database.
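As a rough illustration of steps and predicates, the sketch below uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath: child steps and positional predicates work, but axes such as ancestor-or-self are not available. The sample document is a shortened copy of the MLB example, and the mlb root element is the context node, so the path starts at team.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<mlb>
  <team>
    <name>New York Mets</name>
    <generalmanager>Sandy Alderson</generalmanager>
  </team>
  <team>
    <name>Washington Nationals</name>
    <generalmanager>Mike Rizzo</generalmanager>
  </team>
</mlb>""")

# Relative path from the context node (the mlb root element):
# first team child, then its first name child.
first_name = doc.find("team[1]/name[1]").text

# last() is another positional predicate that ElementTree accepts.
last_gm = doc.find("team[last()]/generalmanager").text

print(first_name)
print(last_gm)
```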

A node test can be a function as well as a node name. For instance, team/node() will return all child nodes of the selected team node, regardless of whether they are text or elements.

The following table describes a handful of XPath shortcuts:

Shortcut Meaning
@ Specifies the attribute axis. This is an abbreviation for attribute::.
* Matches any element along the current axis. With no axis given, it selects all child elements of the current node.
// Specifies any descendant of the current node. This is an abbreviation for /descendant-or-self::node()/. If used at the beginning of an XPath, it matches elements anywhere in the document.
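The shortcuts can be tried out with Python's standard xml.etree.ElementTree module, again with the caveat that it supports only part of XPath: here @ is usable inside a predicate, not as a final step. This is a sketch only, and the abbr attribute on team is a hypothetical addition made for illustration; it is not part of the MLB document model.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<mlb>
  <team abbr="NYM">
    <name>New York Mets</name>
    <generalmanager>Sandy Alderson</generalmanager>
  </team>
  <team abbr="WSH">
    <name>Washington Nationals</name>
    <generalmanager>Mike Rizzo</generalmanager>
  </team>
</mlb>""")

# // : every name element at any depth below the context node
all_names = [el.text for el in doc.findall(".//name")]

# *  : every child element of the first team
first_children = [el.tag for el in doc.findall("team[1]/*")]

# @  : attribute test inside a predicate (abbr is a hypothetical attribute)
wsh = [el.findtext("name") for el in doc.findall("team[@abbr='WSH']")]

print(all_names)
print(first_children)
print(wsh)
```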

External Links:

More info on Schema

More info on XPath

XML Processing with Python: Part Two

XML is more than just a way to store hierarchical data. Otherwise, it would lose out to more lightweight data storage methods that already exist. XML’s big strength lies in its extensibility and in its companion standards: the XSLT, XPath, Schema, and DTD languages, as well as other standards for querying, linking, describing, displaying, and manipulating data. Schemas and DTDs provide a way for describing XML vocabularies and a way to validate documents. XSLT provides a powerful transformation engine to turn one XML vocabulary into another, or into HTML, plaintext, PDF, or a host of other formats. XPath is a query language for describing XML node sets. XSL-FO provides a way to create XML that describes the format and layout of a document for transformation to PDF or other visual formats.

Another good thing about XML is that most of the tools for working with XML are also written in XML, and can be manipulated using the same tools. XSLTs are written in XML, as are schemas. What this means in practical terms is that it is easy to use an XSLT to write another XSLT or a schema, or to validate XSLTs or schemas using schemas.

XML Processing with Python: Schemas and Document Type Definitions

Schemas and Document Type Definitions (DTDs) are both ways of implementing document models. A document model is a way of describing the vocabulary and structure of a document. You define the data elements that will be present in your document, what relationship they have to one another, and how many of them you expect. For example, a document model for the XML example in the previous article might read as follows:

Major League Baseball is a collection of teams overseen by a single commissioner. Each team has a name and a general manager.

DTDs and schemas have different ways of expressing this document model, but they both describe the same basic formula for the document. Subtle differences exist between the two, but they have roughly the same capabilities.

Document models are used when you want to be able to validate content against a standard before manipulating or processing it. They are useful whenever you will be interchanging data with an application that may change data models unexpectedly, or when you want to constrain what a user can enter, as in an XML-based documentation system where you will be working with hand-created XML rather than with something from an application.

A DTD is a Document Type Definition. DTDs were the original method of expressing a document model and are commonplace on the Internet. DTDs were originally created for describing SGML, and the syntax has barely changed since that time, so DTDs have had quite a while to proliferate. The World Wide Web Consortium (W3C) continues to express document types using DTDs, so DTDs exist for each of the HTML standards, for Scalable Vector Graphics (SVG), MathML, and for other useful XML vocabularies.

If you were to translate the English description of the example Major League Baseball document into a DTD, it might look something like this:

<?xml version="1.0"?>
<!ELEMENT mlb (team+)>
<!ATTLIST mlb
commissioner CDATA #REQUIRED
>
<!ELEMENT team (name, generalmanager+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT generalmanager (#PCDATA)>

To add a reference to this DTD in the MLB file discussed before, you would insert a line at the top of the file, after the XML declaration, that reads <!DOCTYPE mlb SYSTEM "mlb.dtd">, where mlb is the name of the document's root element and mlb.dtd is the path to the DTD on your system.

The first line, <?xml version="1.0"?>, tells you that this is going to be an XML document. Technically, this line is optional. The next line, <!ELEMENT mlb (team+)>, tells you that there is an element known as mlb, which can have one or more child elements of the team type. The syntax for element frequencies and groupings in DTDs is terse, but similar to that of regular expressions.
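To make the resemblance to regular expressions concrete, the sketch below rewrites the content model (name, generalmanager+) as a Python regular expression over the sequence of child tag names. It is an illustration of the notation only, not a DTD validator (Python's standard library does not validate against DTDs), and the helper name matches_team_model is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# (name, generalmanager+) expressed as a regex over "tag," tokens
TEAM_MODEL = re.compile(r"name,(generalmanager,)+")

def matches_team_model(team):
    """Return True if the child tags of team follow (name, generalmanager+)."""
    tags = "".join(child.tag + "," for child in team)
    return TEAM_MODEL.fullmatch(tags) is not None

good = ET.fromstring(
    "<team><name>Atlanta Braves</name>"
    "<generalmanager>John Hart</generalmanager></team>")
missing_gm = ET.fromstring("<team><name>Atlanta Braves</name></team>")
wrong_order = ET.fromstring(
    "<team><generalmanager>John Hart</generalmanager>"
    "<name>Atlanta Braves</name></team>")

print(matches_team_model(good))
print(matches_team_model(missing_gm))
print(matches_team_model(wrong_order))
```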

The next bit is:
<!ATTLIST mlb
commissioner CDATA #REQUIRED
>

The first line specifies that the mlb element has a list of attributes. Notice that the attribute list is separate from the mlb element declaration itself and linked to it by the element name. If the element name changes, the attribute list must be updated to point to the new element name. Next is a list of attributes for the element. In this case, mlb has only one attribute, but the list can contain an unbounded number of attributes. The attribute declaration has three mandatory elements: an attribute name, an attribute type, and an attribute description. An attribute type can be either a data type, as specified by the DTD specification, or a list of allowed values. The attribute description is used to specify the behavior of the attribute. A default value can be described here, and whether the attribute is optional or required.

DTDs have a number of limitations. Although it is possible to express complex structures in DTDs, they become very difficult to maintain. DTDs have difficulty cleanly expressing numeric bounds on a document model. If you wanted to specify that MLB can contain no more than 30 teams, you could write <!ELEMENT mlb (team, team, team, team etc etc)>, but that quickly becomes an unreadable mess of code. DTDs also make it hard to permit a number of elements in any order. If you have three elements that you could receive in any order, you have to write <!ELEMENT team ( ( (name, ((generalmanager, stadium) | (stadium, generalmanager))) | (generalmanager, ((name, stadium) | (stadium, name))) | (stadium, ((name, generalmanager) | (generalmanager, name)))))>, which is beginning to look more like LISP and is more complicated than it should be. Finally, DTDs do not allow you to specify a pattern for data. Thankfully, the W3C has published a specification for a slightly more sophisticated language for describing documents, known as Schema.

External Links:

Document Type Definition on Wikipedia

XML Processing with Python: Part One

Extensible Markup Language, or XML, is a powerful, open standards-based method of data storage. The vocabulary of XML is infinitely customizable to fit whatever kind of data you want to store. Its format makes it human readable, while remaining easy to parse for programs. It encourages semantic markup, rather than formatting-based markup, separating content and presentation from each other, so that a single piece of data can be repurposed many times and displayed in many ways.

XML Processing: A Simple Hierarchical Markup Language

At the core of XML is a simple hierarchical markup language. Tags are used to mark off sections of content with different semantic meanings, and attributes are used to add metadata about the content.

Here is an example of a simple XML document that could be used to describe different baseball teams:

<?xml version="1.0"?>
<mlb>
<team>
<name>New York Mets</name>
<generalmanager>Sandy Alderson</generalmanager>
</team>
<team>
<name>Washington Nationals</name>
<generalmanager>Mike Rizzo</generalmanager>
</team>
<team>
<name>Atlanta Braves</name>
<generalmanager>John Hart</generalmanager>
</team>
</mlb>

Notice that every piece of data is wrapped in a tag and that tags are nested in a hierarchy that contains further information about the data it wraps. You probably guessed that <generalmanager> is a child piece of information for <team>, as is <name>.
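The hierarchy is easy to walk from Python. Here is a minimal sketch, assuming the standard xml.etree.ElementTree module and the example document held in a string rather than a file:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<mlb>
  <team>
    <name>New York Mets</name>
    <generalmanager>Sandy Alderson</generalmanager>
  </team>
  <team>
    <name>Washington Nationals</name>
    <generalmanager>Mike Rizzo</generalmanager>
  </team>
  <team>
    <name>Atlanta Braves</name>
    <generalmanager>John Hart</generalmanager>
  </team>
</mlb>""")

# Each team element is a child of the root; name and generalmanager
# are children of each team.
teams = [(team.findtext("name"), team.findtext("generalmanager"))
         for team in doc.findall("team")]

for name, gm in teams:
    print(name, "-", gm)
```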

Unlike semantic markup languages like LaTeX, every piece of data in XML must be enclosed in tags. The top-level tag is known as the document root, which encloses everything in the document. An XML document can have only one document root.

Just before the document root is the XML declaration: <?xml version="1.0"?>. This declaration lets the processor know that this is an XML document; it is optional in XML 1.0, but including it is recommended. As of this writing, there are two versions of XML: 1.0 (last updated in 2008) and 1.1 (last updated in 2006). Because version 1.1 is not fully supported yet, for our examples we will be concentrating on version 1.0.

One problem with semantic markup is the possibility for confusion as data changes contexts. For instance, you might want to have a list of teams in a database about baseball. However, without a human to look at it, the database has no way of knowing that <team> means a baseball team, as opposed to, for example, a football team. This is where namespaces come in. A namespace is used to provide a frame of reference for tags and is given a unique ID in the form of a URL, plus a prefix to apply to tags from that namespace. For example, you might create a baseball namespace, with an identifier of http://server.domain.tld/NameSpaces/Baseball and with a prefix of mlb: and use that to provide a frame of reference for the tags. With a namespace, the document would look like this:

<?xml version="1.0"?>
<mlb:baseball
xmlns:mlb="http://server.domain.tld/NameSpaces/Baseball">
<mlb:team>
<mlb:name>New York Mets</mlb:name>
<mlb:generalmanager>Sandy Alderson</mlb:generalmanager>
</mlb:team>
<mlb:team>
<mlb:name>Washington Nationals</mlb:name>
<mlb:generalmanager>Mike Rizzo</mlb:generalmanager>
</mlb:team>
<mlb:team>
<mlb:name>Atlanta Braves</mlb:name>
<mlb:generalmanager>John Hart</mlb:generalmanager>
</mlb:team>
</mlb:baseball>

It’s now explicit that the team element comes from a set of elements defined by a baseball namespace, and can be treated accordingly.
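A parser sees the namespace URL, not the prefix. The sketch below, which assumes Python's standard xml.etree.ElementTree module and a shortened copy of the namespaced document, shows that a lookup must name the namespace to find anything. The NS mapping is an illustrative name; the prefix used in it does not have to match the one in the document.

```python
import xml.etree.ElementTree as ET

NS = {"mlb": "http://server.domain.tld/NameSpaces/Baseball"}

doc = ET.fromstring("""
<mlb:baseball xmlns:mlb="http://server.domain.tld/NameSpaces/Baseball">
  <mlb:team>
    <mlb:name>New York Mets</mlb:name>
    <mlb:generalmanager>Sandy Alderson</mlb:generalmanager>
  </mlb:team>
  <mlb:team>
    <mlb:name>Washington Nationals</mlb:name>
    <mlb:generalmanager>Mike Rizzo</mlb:generalmanager>
  </mlb:team>
</mlb:baseball>""")

# ElementTree stores the tag as {namespace-URL}localname.
print(doc.tag)

# Namespace-aware lookup: the prefix is resolved through the NS mapping.
names = [el.text for el in doc.findall("mlb:team/mlb:name", NS)]

# A lookup without the namespace finds nothing, because a plain
# "team" is a different element from the namespaced one.
unqualified = doc.findall("team")

print(names)
print(unqualified)
```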

A namespace declaration can be added to any node in a document, and that namespace will be available to every descendant node of that node. In most documents, all namespace declarations are applied to the root element of the document, even if the namespace is not used until deeper in the document. In this case, the namespace is applied to every tag in the document, so the namespace declaration must be on the root element.

A document can have and use multiple namespaces. For instance, the preceding baseball example might use one namespace for team information and a second one to add, say, stadium information.

Notice the xmlns: prefix for the namespace declaration. Certain namespace prefixes are reserved for use by XML itself, such as xml: and xmlns:, and others, such as xsl:, are used by convention for its associated languages.

External Links:

XML at Wikipedia

W3 XML home page

Python Database Programming: Part Nine

Python Database Programming: Committing and Rolling Back Transactions

Each connection, while it is engaged in action, manages a transaction. With SQL, data is not modified unless you commit a transaction. The database then guarantees that it will perform all of the modifications in the transaction or none. As a result, you will not leave your database in a potentially erroneous condition.

To commit a transaction, we call the commit method of a connection:

conn.commit()

Note that the transaction methods are part of the connection class, not the cursor class.

If something goes wrong, like an exception is thrown that you can handle, you should call the rollback method to undo the effects of the incomplete transaction; this will restore the database to the state it was in before you started the transaction, guaranteed:

conn.rollback()

The capability to roll back a transaction is very important, because you can handle errors by ensuring that the database does not get changed. In addition, rollbacks are very useful for testing. You can insert, modify and delete a number of rows as part of a unit test and then roll back the transaction to undo the effects of all the changes. This enables your unit tests to run without making any permanent changes to the database. It also enables your unit tests to be run repeatedly, because each run resets the data.
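Here is a minimal sketch of the commit-then-rollback pattern, using sqlite3 with an in-memory database so that it leaves nothing behind. The player table layout and the simulated failure are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table player (idnum integer primary key, team integer)")
cursor.execute("insert into player values (1, 10)")
conn.commit()  # make the initial row permanent

try:
    cursor.execute("update player set team=? where idnum=?", (20, 1))
    # Simulate a failure that happens before the transaction is committed.
    raise RuntimeError("simulated failure")
except RuntimeError:
    conn.rollback()  # undo the uncommitted update

cursor.execute("select team from player where idnum=1")
team_after_rollback = cursor.fetchone()[0]
print(team_after_rollback)  # prints 10: the update was rolled back

cursor.close()
conn.close()
```

The same pattern works in a unit test: make changes, check the results, then call rollback instead of commit so the next run starts from the same data.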

The DB API defines several globals that need to be defined at the module level. You can use these globals to determine information about the database module and the features it supports. The following table lists these globals.

Global What It Holds
apilevel Should hold '2.0' for the DB API 2.0, or '1.0' for the 1.0 API.
paramstyle Defines how you can indicate the placeholders for dynamic data in your SQL statements. The values include:

  • 'qmark': Use question marks; e.g. ?
  • 'numeric': Use a positional number style; e.g. :1, :2, etc.
  • 'named': Use a colon and a name for each parameter; e.g. :name
  • 'format': Use the ANSI C printf format codes; e.g. %s
  • 'pyformat': Use the Python extended format codes; e.g. %(name)s
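These globals can be inspected directly. The sketch below checks them for the sqlite3 module and runs one query in the reported style; other database modules may report different values. sqlite3 also happens to accept the named style, which is shown for comparison.

```python
import sqlite3

api_level = sqlite3.apilevel      # DB API version the module implements
param_style = sqlite3.paramstyle  # placeholder style the module reports

print(api_level)
print(param_style)

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# qmark style: positional question marks, values passed as a sequence
cursor.execute("select ? + ?", (1, 2))
qmark_result = cursor.fetchone()[0]

# named style: colon plus a name, values passed as a mapping
cursor.execute("select :a + :b", {"a": 1, "b": 2})
named_result = cursor.fetchone()[0]

print(qmark_result, named_result)
conn.close()
```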

With a cursor object, you can check the description attribute to see information about the data returned. This information should be a sequence of seven-element sequences, one for each column of result data. These sequences include the following items:

(name, type_code, display_size, internal_size, precision, scale, null_ok)

The DB API requires only the first two items; None can be used for the rest. Some modules, such as sqlite3, fill in only the name, as shown in this example:

(('FIRSTNAME', None, None, None, None, None, None),
('LASTNAME', None, None, None, None, None, None),
('NAME', None, None, None, None, None, None))
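A short sketch of reading the cursor's description attribute with sqlite3 follows. The user table and its columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table user (firstname text, lastname text)")
cursor.execute("select firstname, lastname from user")

# One seven-item tuple per result column; item 0 is the column name.
columns = [col[0] for col in cursor.description]
items_per_column = len(cursor.description[0])

print(columns)
print(items_per_column)
conn.close()
```

The description is available as soon as the query has been executed, even when the query returns no rows.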

With databases, errors happen a lot. The DB API defines a number of errors that must exist in each database module. The following table lists those errors:

Exception Usage
Warning Used for non-fatal issues. Must subclass Exception (StandardError in Python 2).
Error Base class for errors. Must subclass Exception (StandardError in Python 2).
InterfaceError Used for errors in the database module, not the database itself. Must subclass Error.
DatabaseError Used for errors in the database. Must subclass Error.
DataError Subclass of DatabaseError that refers to errors in the data.
OperationalError Subclass of DatabaseError that refers to errors such as the loss of the connection to the database. These errors are generally outside of the control of a Python scripter.
IntegrityError Subclass of DatabaseError for situations that would damage the relational integrity, such as uniqueness constraints or foreign keys.
InternalError Subclass of DatabaseError that refers to errors internal to the database module, such as a cursor no longer being active.
ProgrammingError Subclass of DatabaseError that refers to errors such as bad table name and other things that can safely be blamed on the scripter.
NotSupportedError Subclass of DatabaseError that refers to trying to call unsupported functionality.

Your Python scripts should handle these errors. You can get more information about them by reading the DB API specification.
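As a sketch of what handling these errors can look like, the example below uses sqlite3 and provokes two of them on purpose. The team table and the misspelled table name are assumptions made for illustration, and other database modules may raise a different subclass for the same mistake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table team (idnum integer primary key, name text)")
cursor.execute("insert into team values (20, 'Washington Nationals')")

caught = []

# A duplicate primary key violates a uniqueness constraint.
try:
    cursor.execute("insert into team values (20, 'Duplicate')")
except sqlite3.IntegrityError as exc:
    caught.append(type(exc).__name__)

# Catching the DatabaseError base class also catches its subclasses;
# sqlite3 reports a missing table as an OperationalError.
try:
    cursor.execute("select * from no_such_table")
except sqlite3.DatabaseError as exc:
    caught.append(type(exc).__name__)

print(caught)
conn.close()
```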

External Links:

Database Programming at wiki.python.org

Database Programming at python.about.com

Databases at docs.python-guide.org

Python Database Programming: Part Eight

Figure: Executing updateteam.py at the command line.

In the previous article, we introduced some more complex query operations using SQLite. In this article, we will use SQLite to both modify a table entry and delete a table entry.

Python Database Programming: Updating a Record

First, we need to enter the code for updating the player table, which we will call updateteam.py. We want to re-assign player Travis d’Arnaud to the Washington Nationals:

import sqlite3
import sys

# Usage: python updateteam.py <username> <new team id>
player = sys.argv[1]
newteam = sys.argv[2]

conn = sqlite3.connect('my_database')
cursor = conn.cursor()

# Query to find the player ID:
query = """
select p.idnum
from user u, player p
where u.username=? and u.idnum = p.idnum
"""
cursor.execute(query, (player,))
row = cursor.fetchone()  # a one-column row, or None if there is no match
if row is None:
    sys.exit("No player found for user " + player)
playerid = row[0]

# Now, modify the player:
cursor.execute("update player set team=? where idnum=?", (newteam, playerid))
conn.commit()
cursor.close()
conn.close()

When you run this script, you need to pass the name of the user/player to update, as well as the team ID number of the team to which you want to transfer the player. For example:

>python finduser.py travis
(u'travis', ':', u"Travis d'Arnaud", 'plays for the', u'New York Mets')

>python updateteam.py travis 20

>python finduser.py travis
(u'travis', ':', u"Travis d'Arnaud", 'plays for the', u'Washington Nationals')

The example output shows the before and after picture of the player row, verifying that the updateteam.py script worked.

The updateteam.py script expects two values from the user: the user name of the player to update (in this case, travis) and the team number of the new team (in this case 20, the team ID number of the Washington Nationals). This example also shows the use of the fetchone method on the Cursor. The final SQL statement then updates the player row for the given user to have a new team.
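The fetchone method returns a single row as a tuple, or None when the query matched nothing, so the result should be checked before it is indexed. The sketch below shows both cases against an in-memory database; the find_idnum helper, the simplified user table, and the ID value 7 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table user (idnum integer, username text)")
cursor.execute("insert into user values (7, 'travis')")

def find_idnum(cursor, username):
    """Return the ID for username, or None if there is no such user."""
    cursor.execute("select idnum from user where username=?", (username,))
    row = cursor.fetchone()  # a tuple such as (7,), or None
    return None if row is None else row[0]

found = find_idnum(cursor, "travis")
missing = find_idnum(cursor, "nobody")

print(found)
print(missing)
conn.close()
```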

Figure: Executing terminate.py at the command line.

Python Database Programming: Deleting a Record

The next example uses a similar technique to delete a player from the table.

import sqlite3
import sys

# Usage: python terminate.py <username>
player = sys.argv[1]

conn = sqlite3.connect('my_database')
cursor = conn.cursor()

# Query to find the player ID:
query = """
select p.idnum
from user u, player p
where u.username=? and u.idnum = p.idnum
"""
cursor.execute(query, (player,))
row = cursor.fetchone()  # a one-column row, or None if there is no match
if row is None:
    sys.exit("No player found for user " + player)
playerid = row[0]

# Now, delete the player:
cursor.execute("delete from player where idnum=?", (playerid,))
conn.commit()
cursor.close()
conn.close()

When you run this script, you need to pass the user name of the player to delete. You should see no output unless the script raises an error:

>python finduser.py travis
(u'travis', ':', u"Travis d'Arnaud", 'plays for the', u'Washington Nationals')
>python terminate.py travis
>python finduser.py travis
>

This script uses the same techniques as the updateteam.py script by performing an initial query to get the player ID number for the given user name and then using this ID number in a later SQL statement. With the final SQL statement, the script deletes the player from the player table.

External Links:

Database Programming at wiki.python.org

Database Programming at python.about.com

Databases at docs.python-guide.org