XML Processing with Python: Part Six

DOM (Document Object Model)

At the heart of DOM lies the Document object, a tree-based representation of the XML document. Tree-based models are a natural fit for XML’s hierarchical structure, which makes this a very intuitive way of working with XML. Each item in the tree is a Node object. An element node may have attributes, child nodes, text, and so forth, all of which are also objects stored in the tree. DOM objects provide methods for creating and adding nodes, for finding nodes of a specific type or name, and for reordering or deleting nodes.
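As a rough illustration of those operations, here is a minimal sketch using xml.dom.minidom from the Python standard library. The sample document, its element names, and the variable names are made up for this example; minidom is only one of several DOM implementations you could use.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for illustration.
XML = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

doc = parseString(XML)        # the Document object
root = doc.documentElement    # the <library> element node

# Find nodes by name and read the text node inside each <title>.
books = root.getElementsByTagName("book")
titles = [b.getElementsByTagName("title")[0].firstChild.data for b in books]

# Create a new element with an attribute and a text child, then add it.
new_book = doc.createElement("book")
new_book.setAttribute("id", "3")
title = doc.createElement("title")
title.appendChild(doc.createTextNode("Ulysses"))
new_book.appendChild(title)
root.appendChild(new_book)

# Reorder: move the new book in front of the current first child.
root.insertBefore(new_book, root.firstChild)

# Delete: remove the book whose id attribute is "2".
for book in list(root.getElementsByTagName("book")):
    if book.getAttribute("id") == "2":
        root.removeChild(book)

ids = [b.getAttribute("id") for b in root.getElementsByTagName("book")]
print(titles, ids)
```

Note that everything here, including the attribute and the text inside each title, is reached through node objects rather than through string handling.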

Differences between SAX and DOM

The major difference between SAX and DOM is that DOM stores the entire document in memory, where you can search and manipulate it as a tree. With SAX, you must either parse the document repeatedly or build your own in-memory representation of it. With DOM, the document is parsed once; nodes can then be added, removed, or changed in memory, and the result written back out to a file when the program is finished.
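The parse-once, modify, write-back cycle might look something like the following sketch, again assuming xml.dom.minidom. The file name settings.xml and its contents are hypothetical and are created in a temporary directory so the example is self-contained.

```python
import os
import tempfile
from xml.dom.minidom import parse

# Create a small, hypothetical file so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "settings.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("<settings><theme>light</theme></settings>")

doc = parse(path)  # the document is parsed once, into memory

# Change an existing text node in memory.
theme = doc.getElementsByTagName("theme")[0]
theme.firstChild.data = "dark"

# Add a new element in memory.
font = doc.createElement("font")
font.appendChild(doc.createTextNode("monospace"))
doc.documentElement.appendChild(font)

# Write the modified tree back out when finished.
with open(path, "w", encoding="utf-8") as f:
    doc.writexml(f)

# Read the file back to confirm what was written.
with open(path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

All of the changes happen on the in-memory tree; the file is touched only twice, once to read it and once to write it.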

Although either SAX or DOM can do almost anything you might want to do with XML, there are circumstances in which one is the better fit. For instance, if you are working on an application that modifies an XML document repeatedly based on user input, you might want the convenient random access capabilities of DOM. But if you are building an application that needs to process a stream of XML quickly with minimal overhead, SAX might be a better choice.

DOM is designed with random access in mind. It provides a tree that can be manipulated at runtime and needs to be loaded into memory only once. SAX is stream-based: the parser reports the document piece by piece, as a series of events, and your code starts processing before the document has been seen in its entirety. Therefore, if you want to access data randomly, you have to either build a partial tree of the document in memory based on those events, or reparse the document every time you want a different piece of data.
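To make the event-based model concrete, here is a minimal sketch using the xml.sax module from the Python standard library. The TitleCollector class and the sample document are hypothetical. The handler keeps only the piece of the document it cares about (the titles) and discards everything else as the events go by.

```python
import xml.sax

# Hypothetical sample document, used only for illustration.
XML = (
    b"<library>"
    b"<book id='1'><title>Dune</title></book>"
    b"<book id='2'><title>Emma</title></book>"
    b"</library>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Builds a partial in-memory view (just the titles) from SAX events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        # Called when the parser reaches an opening tag.
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect them all.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        # Called when the parser reaches a closing tag.
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(XML, handler)
titles = handler.titles
print(titles)
```

If you later needed the id attributes as well, you would have to extend the handler and parse the document again, which is the trade-off described above.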

Most people find the object-oriented behavior of DOM very intuitive and easy to learn. The event-driven model of SAX relies on callbacks, which is closer in spirit to functional programming, and it can be more challenging to get up to speed on.

If you are working in a memory-limited environment, DOM is probably not a good choice. Even on a fairly high-end system, constructing a DOM tree for a large document (say 2-3 MB) can bring the computer to a halt while it processes. Because SAX treats the document as a stream, it never loads the whole document into memory, so it is preferable if you are memory constrained or working with very large documents.

Using DOM requires a great deal of processing time while the document tree is being built, but once the tree is built, DOM allows for much faster searching and manipulation of nodes because the entire document is in memory. SAX can search a document reasonably quickly in a single pass, but it is not as efficient for manipulating it. However, for document transformations, SAX is often considered the parser of choice because the event-driven model is fast and very compatible with how XSLT works.

In the next article, we’ll look at SAX and DOM parsers for Python.
