XML Processing with Python: Part Four

XML ProcessingXML is similar in structure and form to HTML. This is not entirely an accidental thing. XML and HTML both originated from SGML and share a number of syntactic features. The earlier versions of HTML are not directly compatible with XML, though, because XML requires that every tag be closed, and certain HTML tags don’t require a closing tag (such as <br> and <img>). However, the W3C has declared the XHTML schema in an attempt to bring the two standards in line with each other. XHTML can be manipulated using the same sets of tools as pure XML. However, Python also comes with specialized libraries designed specifically for dealing with HTML.

The HTMLParser class, unlike the htmllib class, is not based on an SGML parser and can be used for both XHTML and earlier versions of HTML. To try using the HTMLParser class, create a sample HTML file named headings.html that contains at least one h1 tag. Then save the file to your Python directory and run the following code:

from html.parser import HTMLParser
class HeadingParser(HTMLParser):
    inHeading = False
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.inHeading = True
            print("Found a Heading 1")
    def handle_data(self, data):
        if self.inHeading:
            print(data)
    def handle_endtag(self, tag):
        if tag == "h1":
            self.inHeading = False
hParser = HeadingParser()
file = open("headings.html", "r")
html = file.read()
file.close()
hParser.feed(html)

The HTMLParser class defines methods, which are called when the parser finds certain types of content, such as a beginning tag, an end tag, or a processing instruction. By default, these methods do nothing. To parse an HTML document, a class that inherits from HTMLParser and implements the necessary methods must be created. After a parse class has been created and instantiated, the parser is fed data using the feed method. Data can be fed to it one line at a time or all at once.

This example class only handles tags of type <h1>. When an HTMLParser encounters a tag, the handle_starttag method is called, and the tag name and any attached attributes are passed to it.

The handle_starttag method determines whether the tag is an <h1>. If so, it prints a message saying it has encountered an h1 and sets a flag indicating that it is currently an <h1>. If text data is found, the handle_data function is called, which determines whether it is an <h1>, based on the flag. If the flag is true, the method prints the text data. If a closing tag is encountered, the handle_endtag method is called, which determines whether the tag that was just closed was an <h1>. If so, it prints a message, and then sets the flag to false.

External Links:

HTMLParser at docs.python.org

Using the Python HTMLParser library