matplotlib: Part One

matplotlib

matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oritented API for embedding plots into applications using general-purpose GUI toolkits.

Since it is designed to resemble MATLAB, the pylab interface makes matplotlib easy to learn for experienced MATLAB users. Some of the advantages of matplotlib over MATLAB include the following:

  • It is based on Python, a full-featured modern object-oriented programming language suitable for large-scale software development.
  • Free, open source (therefore, there are no license servers)
  • Native SVG support

Unfortunately, matplotlib was designed to work with Python, so it is hard or impossible to use in other languages. Also it heavily relies on packages such as NumPy and SciPy, so there are a number of dependencies.

Installing matplotlib

matplotlib is easy to install in Linux from the repositories. In Debian, Ubuntu, Mint and many other distributions, you just need to type:

sudo apt-get install python-matplotlib

In Fedora, Redhat and CentOS, it’s:

sudo yum install python-matplotlib

Under Windows, in addition to the standard Python installation, you will also need to install compatible versions of setuptools, NumPy, python-dateutil, pytz, pyparsing and six in addition to matplotlib. NumPy depends on files installed with Microsoft Visual C++ 2008 (for Python 2.6 to 3.2) or Microsoft Visual C++ 2010 (for Python 3.3 and 3.4), so you will need to have Visual C++ installed. If not, it may be easier to install one of the scipy-stack compatible Python distributions such as Python(x,y), Enthought Canopy, or Continuum Anaconda, which have matplotlib and many of its dependencies, plus other useful packages, preinstalled.

Installation of Python(x,y) is easy. Download the current version from the official pythonxy download site and run the installer (the installation notes also recommend that you uninstall any other Python distribution before installing it). The installer will present you with a few basic options, after which installation will proceed. Once installation is complete, you can utilize Python(x,y). This package includes a Python interpreter called iPython (you can access it by typing “ipython” at the command line). I chose to use Eclipse instead. Eclipse automatically detected the installation of Python(x,y), and asked if I wanted to use it as the default Python interpreter. I selected this option and proceeded.

We need some code to test matplotlab, so I wrote the following:

'''A short bit of code to test matplotlib
and display a graph'''
import random, pylab

def mpltest():
    dicerolls = []
    for i in range(10):
        x = int(random.random()*6+1)
        y = int(random.random()*6+1)
        dicerolls.append(x+y)
    pylab.plot(dicerolls,'ro')
    pylab.title('Result of dice rolls')
    pylab.xlabel('Trial #')
    pylab.ylabel('Outcome')
    pylab.show()

mpltest()

In mpltest, the first few lines just create a list called dicerolls and use iteration to add data for the dice rolls to the list. The next few lines introduce pylab and some of its functions. pylab.plot will cause the data to be plotted. The second parameter, ‘ro’, will cause only red circles to be shown (“red only”); since each dice roll is independent of the others, we don’t want a line to be drawn connecting the points. pylab.title allows the graph to have a title, and pylab.xlabel and pylab.ylabel creates labels for the x and y-axis, respectively. Nothing actually appears on the screen, though, until we call pylab.show, which we do on the last line of the function call.

It would be nice to graph a continuous function as well, so let’s graph the growth of $10,000 if we have 7 percent interest, compounded yearly:

def interest():
    principal = 10000.0
    interestRate = 0.07
    years = 10
    iValues = []
    for i in range(years+1):
        principal += principal*interestRate
        iValues.append(principal)
    pylab.plot(iValues)
    pylab.title('7% Growth, Compounded Annually')
    pylab.xlabel('Years of Compounding')
    pylab.ylabel('Value of Principal ($)')
    pylab.show()
    
interest()

The format of the function is similar to the first one, although we have replaced the dice rolls with the growth of principal over time. In pylab.plot, we want to have a line connecting the points, so the second parameter is omitted.

We can see the results in the screen capture on the side of this article. matplotlib does seem to work as advertised. We will post more articles about matplotlib in the future, but hopefully this will serve as a useful introduction.

External Links:

matplotlib at Wikipedia

The official matplotlib site

Pyplot tutorial at matplotlib.org

Python (x,y) download site

Python Sets: Part Two

Python setsIn part one of this series, we took a first look at sets, a relatively new core data type in Python. In this article, we will continue cover set methods, as well as a new literal format for sets.

In addition to expressions, the set object provides methods that correspond to the set operations. For example, the set add method inserts one item; update provides an in-place union, and remove deletes an item by value. Run a dir call on any set instance or the set type name to see all the available methods. Assuming we have mySet and otherSet from the previous article:

>>> mySet = set(‘hello’)
>>> otherSet = set([‘a’,’b’,’c’,’d’,’e’])

then we can perform these operations:

>>> si = mySet.intersection(otherSet)
>>> si
{‘e’}
si.add(‘MORE’)
si
{‘MORE’, ‘e’}
>>> si.update(set([‘X’,’Y’]))
>>> si
{‘X’, ‘MORE’, ‘e’, ‘Y’}
>>> si.remove(‘e’)
si
{‘X’, ‘MORE’, ‘Y’ }

As iterable containers, sets can also be used in operations such as len, for loops, and list comprehensions. Because they are unordered, however, they do not support sequence operations like indexing and slicing.

>>> for item in set('aeiou'):
        print(item * 4)

uuuu
iiii
oooo
aaaa
eeee

Although these set expressions generally require two sets, they often work with any iterable types, such as lists, tuples, and ranges. For example:

mySet.union([‘a’,’b’,’c’])
{‘b’, ‘l’, ‘c’, ‘o’, ‘a’, ‘e’, ‘h’}
mySet.intersection((‘e’,’l’,’p’))
{‘e’, ‘l’}

Using Curly Braces

Python 3.0 adds a new set literal form, using the curly braces formerly reserved for dictionaries. In 3.0 and above, the following are equivalent:

set([‘a’,’b’,’c’])
{‘a’, ‘b’, ‘c’}

Because sets are unordered, unique, and immutable, a set’s items behave much like a dictionary’s keys, and therefore the new syntax makes sense. Regardless of how a set is made, 3.0 and above displays sets using the new literal format. The set built-in is still required in 3.0 to create empty sets and to build sets from existing iterable objects, but the new literal format is convenient for initializing sets of known structure:

>>> {‘a’,’b’,’c’}
{‘c’, ‘b’, ‘a’}

>>> S = {‘a’,’e’,’i’,’o’,’u’}
>>> S.add(‘MORE’)
>>> S
{‘MORE’, ‘i’, ‘o’, ‘a’, ‘e’, ‘u’}

All the set processing operations discussed in the previous article work the same in 3.0 and above, but the result sets print differently:

>>> S1 = {‘a’,’b’,’c’}
>>> S1 & {‘a’,’c’}
{‘c’,’a’}
>>> {‘a’,’e’,’i’,’o’,’u’} | S1
{‘b’, ‘i’, ‘o’, ‘c’, ‘a’, ‘e’, ‘u’}
>>> S1 – {‘a’,’b’}
{‘c’}
>>> S1 > {‘a’,’c’}
True

Note that {} is still a dictionary in Python. Empty sets must be created with the set built-in, and print the same way:

>>> S1 – {‘a’,’b’,’c’}
set()
>>> type({})
<class ‘dict’>

>>> mySet = set()
>>> mySet.add(‘a’)
>>> mySet
{‘a’}

As in Python 2.6, sets created with 3.0 literals support the same methods, some of which allow general iterable operands that expressions do not:

>>> {‘a’,’b’,’c’} | {‘c’,’d’}
{‘c’, ‘b’, ‘a’, ‘d’}
>>> {‘a’,’b’,’c’} | [‘c’,’d’]
Traceback (most recent call last):
File “<pyshell#34>”, line 1, in
{‘a’,’b’,’c’} | [‘c’,’d’]
TypeError: unsupported operand type(s) for |: ‘set’ and ‘list’

>>> {‘a’,’b’,’c’}.union([‘c’,’d’])
{‘c’, ‘b’, ‘a’, ‘d’}
>>> {‘a’,’b’,’c’}.union([‘c’,’d’])
{‘c’, ‘b’, ‘a’, ‘d’}
>>> {‘a’,’b’,’c’}.union(set([‘c’,’d’]))
{‘c’, ‘b’, ‘a’, ‘d’}
>>> {‘a’,’b’,’c’}.intersection((‘c’,’e’,’f’))
{‘c’}
>>> {2,3,4}.issubset(range(-5,5))
True
>>>

External Links:

Set, frozenset at docs.python.org – Official documentation on sets and frozensets for Python 3.4

Python Sets: Part One

Python sets

No, not that kind of set.

One of the core Python data types that we did not mention in earlier articles that perhaps deserves some attention is the set. Sets are a recent addition to Python; they are neither mappings (dictionaries) nor sequences (strings, lists and typles). Sets are created by calling the built-in set function or using new set literals and expressions in 3.0, and they support the usual mathematical set operations.

Introduction to Sets

Creating a set can be done two different ways:

>>> mySet = set(‘hello’)
>>> otherSet = {‘a’,’b’,’c’,’d’,’e’}

len returns the number of unique set items, so we get:
>>> len(otherSet)
5
But if we run the same operation on mySet, we get:
>>> len(mySet)
4

>>> ‘h’ in mySet
True
>>> ‘i’ in mySet
False
>>> mySet.isdisjoint(otherSet)
False
>>> newSet = set(‘he’)
newSet.issubset(mySet)
True

Mathematical set operations are generally valid on sets:

>>> mySet, otherSet
({‘o’, ‘e’, ‘l’, ‘h’}, {‘e’, ‘d’, ‘b’, ‘a’, ‘c’})

>>> mySet & otherSet
{‘e’}

>>> mySet | otherSet
{‘e’, ‘h’, ‘o’, ‘d’, ‘b’, ‘c’, ‘a’, ‘l’}
>>> mySet – otherSet
{‘o’, ‘l’, ‘h’}

There are a few other operations. <= tests whether ever element in the left operand set is in the right operand set. For example: >>> newSet <= mySet
True

But what if we want to return False if the sets are equal? Then we use <:
>> newSet < mySet

True

>>> exactcopy = set(mySet)
>>> exactcopy < mySet

False

>>> exactcopy <= mySet

True

We can flip the operand around, and check to see if the left operand set is a superset of the right operand set:

>>> mySet > newSet
True
>>> mySet >= newSet
True

Sets are mutable; you can add and remove items with the add and remove methods:

>>> otherSet.add(‘f’)
{‘e’, ‘f’, ‘d’, ‘b’, ‘c’, ‘a’}
>>> otherSet.remove(‘c’)
>>> otherSet
{‘e’, ‘f’, ‘d’, ‘b’, ‘a’}

You can also iterate over a set. For example:
>>> for s in mySet:
print(s)

o
e
l
h

Since sets are mutable, there are some things we cannot do with them: for example, we cannot use them as dictionary keys. But objects of type frozenset are immutable. Therefore:

>>> unchangeableSet = frozenset(‘abc’)

Although we can’t add and remove items, we can perform the usual set operations on frozensets, or a combination of sets and frozensets:

>>> mySet | unchangeableSet
{‘o’, ‘e’, ‘b’, ‘h’, ‘c’, ‘a’, ‘l’}

For an example of using frozenset to create keys for a dictionary, here’s a sample:

>>> keySet = frozenset(‘abc’)
>>> names = [ ‘Able’, ‘Baker’, ‘Charlie’ ]
>>> myDict = { }
>>> i = 0
>>> for s in keySet:
myDict[s] = names[i]
i += 1

>>> print(myDict)
{‘a’: ‘Able’, ‘b’: ‘Baker’, ‘c’: ‘Charlie’}

Here, we created an immutable set called keySet, and a list of items to put in our dictionary. We iterate through the set, mapping items in the list to keys in keySet. When we print out the results, we see that each item was successfully mapped to a key.

In the next article, we will continue our look at sets.

External Links:

Set, frozenset at docs.python.org – Official documentation on sets and frozensets for Python 3.4

Python Debugging: An Overview

debugging

The Eclipse debugger in action.

At some point, one of your programs is likely to do something you did not expect it to do, and in such a scenario, you are going to have to debug your code. Fortunately, whether you are using the command line Python interpreter, the IDLE interface, or a third-party IDE, there are multiple options for debugging.

Debugging Options

One option is to rely on the error messages Python provides. If you already know Python, and especially if you understand your own code, this is often enough: you just read the error message and go and fix the tagged line and file. It may not always be a good option, however, for larger programs you did not write.

Another possibility, and probably the main way that Python programmers use for debugging their code is to insert print statements and run the program again. Because Python runs immediately after changes, this is usually the quickest way to get more information than error messages provide. Typically, a display of variable values is enough to provide the information you need.

For larger systems you did not write, and for beginners who want to trace code in more detail, most Python development GUIs have some sort of point-and-click debugging support. IDLE has a debugger as well, but it does not appear to be used very often (possibly because it has no command line, or possibly because adding print statements is usually quicker than setting up an IDLE GUI debugging session. But other IDEs such an Eclipse, NetBeans, Komodo and Wing IDE offer advanced point-and-click debuggers.

Python comes with a source-code debugger named pdb, available as a module in Python’s standard library. In pdb, you type commands to step line by line, display variables, set and clear breakpoints, or continue to a breakpoint or error. It can be launched interactively by importing it, or as a top-level script. Either way, it provides a powerful debugging tool.

To provide a real-world example of debugging to solve a coding problem, I introduced a bug into the binary sort code. The code is provided here; some readers may be able to find the bug just by looking at the code. In any case, when we run the code, we get the following result:

[None, None, 17, 144, 372, 51, 79, 123, 519, 997]

This result is wrong for at least two reasons. First, two list members have been replaced with “None”. Secondly, the list is not sorted. Fortunately, this gives us an opportunity to use our debugging skills. The heapSort function does the following:

  1. Creates a binary heap.
  2. While there is at least one item on the heap, delete the top item from the heap. If there is remaining heap data, rebuild the heap, consolidate the heap, and repeat this step.

A cursory review of the code seems to indicate that we correctly deleted the top item and re-inserted it after the heap. The error must be in heapify, rebuildHeap, or consolidateHeap. For starters, I decided to set a breakpoint after heapify is called and check to see if it returns a valid binary heap. If it does, we can be pretty sure the error is in rebuildHeap or consolidateHeap.

In Eclipse, I set a breakpoint in the second line of heapSort by right mouse-clicking on the left column next to that line of code (the column containing numbers indicating what line of code it is) and selecting “Add Breakpoint). Now I was able to debug the code by pressing “F11”. Eclipse now presents a dialog box asking if I want to switch to the debug perspective. Since I do, I press “Yes” and continue.

In the upper right corner of the Debug perspective, there is a section with two tabs: “Variables” and “Breakpoints”. The “Variables” tab provides us the information about local variables, including the current values stored in L. We can confirm that the list items form a valid binary heap:

        997
      /     \
    519     239
   /  \     / \
  144  372 17 29
  / \   /
79  51 123

It does not appear that the problem is in heapify, so it must be in either rebuildHeap or consolidateHeap. To find the bug, we will first step through the heapSort function line by line, by pressing F6.

As soon as we press F6 the first time, the variable high shows up on the variable list, since it has now made its first appearance in the code. The same goes for temp, when it makes its first appearance two lines later. The next line deletes the top node, and the change is reflected in the values for L shown on the “Variables” tab. All is well so far.

But after we run rebuildHeap for the first time, it is a different story. Our “heap” looks like this:

          519
         /   \
      None    29
      /  \    / \
    144  372 17 None
    / \   /
   79 51 123

It looks like rebuildHeap has failed us. Nodes marked “None” should not have children. The next step will be to remove the breakpoint from heapSort and put a breakpoint in rebuildHeap. Then we will restart the program. We terminate the program by pressing “CTRL-F2” and then pressing “F11” again.

Once we do this, we can continue debugging and begin stepping through the source code again. Looking at the “Variables” tab, we can see that lnode (left node) is initially set to 1, the correct value for the left node of the root node. The next line checks to see if there is at least one heap item and a left node. There is, so we proceed to the next line. We have to promote either the left node or the right node. We promote the left node if it is greater than the right node or if there is no right node. If so, we assign the left node’s value to the root node, mark the left node as empty, and run rebuildHeap recursively on the left node. When we reach the recursive rebuildHeap call, we will use “F5” to step into the second copy of rebuildHeap.

Once we start stepping into the second rebuildHeap, we can see something is wrong. lnode is now 5, but as the lnode of the lnode of the root, it should be 3. This is because root is equal to 2 when it should be 1. Looking at the first recursive rebuildHeap call in the function, we see that the second parameter is lnode+1 (which calls the function recursively on the right node), when it should be lnode (to call the function recursively on the left node). If we keep stepping through the code, rebuildHeap will perform operations on the wrong nodes and tamper with the integrity of heap data that should be kept intact. Here, it will promote one of the children of the right node, and erase data, as we can see by viewing list L’s values just before the next rebuildHeap call:

[519, None, 29, 144, 372, 17, None, 79, 51, 123]

In other words, it looks this is what accounts for the error in rebuildHeap. We could continue to step through the code, but to save time, let’s change the second rebuildHeap parameter at line 56 to lnode, remove the breakpoint, and see what happens.

Running the program after that modification results in the following output:

[17, 29, 51, 79, 123, 144, 239, 372, 519, 997]

Thus, it seems as if we have isolated and fixed the bug that was causing the error we were seeing earlier. This does not guarantee that our code is bug-free, but if there are other bugs to be found, we can have some confidence that the methods used here can be applied to finding them.

In this example, we used the Eclipse debugger for debugging, but don’t let this discourage you from using whatever tools you want. The Winpdb system, for example, is an open source stand-alone debugger with advanced debugging support and cross-platform GUI and console interfaces. These options become more important as you start writing larger scripts, but at least Python’s debugging support makes the process of finding errors a lot easier than they might be otherwise.

External Links:

The official Eclipse web site – An excellent IDE with good debugging capabilities

The official Winpdb web site

Classes and Inheritance: Part One

classesIn the previous article, we introduced some hashing functions, but had to pass the number of buckets (numBuckets) to each function, which was not an ideal solution. In this article, we introduce a programming construct that will provide a better way of working with these hashing functions: classes.

Classes Defined

A class is an extensible program-code-template for creating objects, providing initial values for member variables and implementations of functions. When an object is created by a constructor of the class, the resulting object is called an instance of the class, and the member variables specific to the object are called instance variables (unlike class variables, which are variables that are shared across the class).

Classes are essentially combinations of variables and functions. We use the word “encapsulation” to refer to this combination. The variables either belong to the class or specific instances of the class. The functions are subroutines with the ability to operate on objects or classes. Many kinds of functions exist. In most programming language, there are constructors (functions automatically called when an instance of the class is created), destructors (functions automatically called when an instance of the class is deleted or otherwise destroyed), and conversion operators (functions that define how certain operators, e.g. the equality operator, work with instances of the class).

Classes: Implementation in Python

In Python, the simplest form of a class definition looks like this:

class HashObject:
	
	...
	

Class definition must precede any instantiation of the class to have any effect. The statements inside a class definition will usually be function definitions, but other statements are allowed. The function definitions inside a class normally have a peculiar form of argument list, dictated by the calling convention for functions.

Class objects support attribute references and instantiation. Attribute reference use the standard syntax used for all attribute references in Python: obj.name. If the class definition looked like this:

class HashObject:
	i = 123
	def myfunc(self)
		return 'A simple function definition'

then HashObject.i and HashObject.myfunc are valid attribute references. Class attribute can be assigned, so you can change the value of HashObject.i by assignment.

Class instantiation uses function notation; just pretend that the class object is a parameterless function that returns a new instance of the class. For example,

h = HashObject()

creates a new instance of the class HashObject and assigns this object to the local variable h.

The instantiation operation creates an empty object. Many classes like to create objects with instances customized to a specific initial state. Therefore, a class may define a special method, called a constructor. Here is an example:

def __init__(self):
	self.data = []

When a class defines an __init__() method, class instantiation automatically invokes __init__() for the newly-created class instance. In the case of HashObject, a new, initialized instance can be obtained by:

h = HashObject()

The __init__() method may have arguments for greater flexibility. In that case, arguments given to the class instantiation operator are passed on to init. For example:

class HashObject:
   def __init__(self, num):
      self.numBuckets = num

>>> h = HashObject(51)
>>> x.numBuckets
51

With instance objects, there are two kinds of valid attribute names, data attributes and methods. Data attributes correspond to data members in C++. Like local variables, they do not need to be declared and spring into existence when they first have a value assigned to them. For example, if we created an instance of HashObject called h, we can create and print out the value of x like so:
>>> h.x = 123
>>> print(‘h.x = ‘,h.x)
123
The other kind of instance attribute is a method. A method is a function that belongs to an object. Valid method names of an instance object depend on its class. By definition, all attributes of a class that are function objects define corresponding methods of its instances.

Usually, a method is called right after it is bound. Suppose we add a function definition to HashObject:

class HashObject:
    def test(self):
        print('Hello, world!')

We can call the method in object h like so:
>>> h.test()
Hello, world!

Notice that h.test() was called without an argument, even though the function definition for test() specifies an argument. This is where Python deviates from C++. In this example, the call h.test() is exactly equivalent to HashObject.test(h). In general calling a method with a list of n arguments is the same as calling the corresponding function with an argument list that is created by inserting the method’s object before the first argument.

Python classes also supports inheritance. The syntax for a derived class definition looks like this:

class DerivedClass(BaseClass):
	
	...
	

When the class object is constructed, the base class is remembered. If a requested attribute is not found in the class, the search proceeds to look in the base class. This rule is applied recursively if the base class itself is derived from some other class.

If we create an instance of DerivedClass, method references are resolved as follows: the corresponding class attribute is searched, descending down the chain of base classes if necessary, and the method reference is valid if this process yields a function object. Derived classes may also override methods of their base classes. Because methods have no special privileges when calling other methods of the same object, a method of a base class that calls another method defined in the same base class may end up calling a method of a derived class that overrides it. This is similar to virtual functions in C++, where a function is left undefined in the base class but defined in one or more derived classes.

Python also supports a form of multiple inheritance as well. A class definition with multiple base classes looks like this:

class DerivedClass(BaseClass1, BaseClass2, BaseClass3)
	
	...
	

The search for attributes is depth-first, left to right, so if an attribute is not found in the derived class, it is searched for in BaseClass1, then the base classes of BaseClass1, then BaseClass2, and so on. It should be mentioned that the multiple inheritance method resolution in Python uses a dynamic algorithm to linearize the search order, both preserving the search order, and ensuring that each parent is called only once.

One significant way in which Python classes differ from C++ classes is that there is no means of declaring functions or variables to be private or protected. Thus, there are no private variables in Python. The closest you’ll get are variables that start with a double underscore, like __variable. Such a variable cannot be directly referenced outside the class definition, at least not directly. Thus, if we have something like this:

class HashObject:
	def __init__(self, num):
		__numBuckets = num

and we create an instance of HashObject:

>>> h = HashObject(47)

We can’t access __numBuckets directly:

>>> print(h.__numBuckets)

will result in an error message.

Now that we have a solid foundation for understanding Python classes, we should be able to rewrite our hash functions from the previous article as a class. We will do that in the next article.

External Links:

Classes Tutorial on python.org

Class (computer programming) on Wikipedia

Python File I/O

Python file I/OIn the previous articles, we covered quite a bit of rudimentary Python programming, but we haven’t covered on of the most important elements of any programming language: the ability to perform file I/O. In this article, we will cover file operations, and put it into practice by using Python file I/O to generate a file of numbers to sort with the quicksort algorithm developed in the previous article.

Python File I/O: Simple Commands

To open a file, we use open(), which returns a file object, and is commonly used with two arguments: open(filename, mode).

>>> f = open(‘mylist.txt’,’w’)

The first argument is a string containing the filename. The second argument is another string describing how the file will be used. ‘r’ is specified when the file will only be read, ‘w’ for only writing (an existing file with the name name will be erased), and ‘a’ opens the file for appending. Any data written to the file is automatically added to the end. ‘r+’ opens the file for both reading and writing. The mode argument is optional: ‘r’ is the default value if it is omitted.

On Windows, ‘b’ appended to the mode opens the file in binary mode. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This modification is OK for ASCII text files, but it will corrupt binary data like that in JPEG or EXE files. On Unix, it doesn’t hurt to append a ‘b’ to the mode, so you can use it platform-independently for all binary files.

To read a file’s content, call the read() function of the object returned by open() (in the above example, f). read() reads some quantity of data and returns it as a string. When size is omitted or negative, the entire contents of the file, so you probably want to specify a number of bytes if the file is large. If the end of the file has been reached, f.read() will return an empty string:

>>> f.read()
‘This is the contents of a text file.\n’

readline() reads a single line from the file; a newline character is left at the end of the string. It is only omitted on the last line of the file if the file does not end in a newine. Thus if f.readline() returns an empty string, the end of file has been reached, while a blank line is represented by ‘\n’.

For readling lines from a file, you can loop over the file object:

>>> for line in f:
print line

If you want to read all the files of a file, you can also use list(f) or f.readlines().

That covers read operations. write(string) writes the contents of string to the file, returning None.

To write something other than a string, it needs to be converted to a string first; e.g.:

>>> value = (‘The sum is ‘,101)
>>> s = str(value)
>>> f.write(s)

f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file. To change the file’s object position, we use f.seek(offset, param). The position is computed from adding offset to a reference point. The reference point is selected by the param argument. A param value of 0 measures from the beginning of the file; 1 uses the current file position, and 2 uses the end of the file as the reference point. The second parameter can be omitted; the default value is 0, using the beginning of the file as a reference point. You may have noticed that the behavior of this function is similar to that of the fseek() function in C/C++. When you’re done with a file, call f.close() and free up any system resources taken up by the open file. After calling f.close(), attempts to use the file will automatically fail.

Python File I/O: Serialization

Finally, you can write data to a file easily with serialization. The read() function only returns strings, and when you want to save complex data types like nested lists and dictionaries, it can become unwieldy. Python, however, allows you to use the popular data interchange format called JSON (JavaScript Object Notation). The standard module called json can take Python data hierarchies, and convert them to string representations. This process is called serializing, and reconstructing the data from the string representation is called deserializing. To write an object x to a file object f opened for writing, we use:

json.dump(x, f)

And to decode the object, we use the following:

x = json.load(f)

This simple serialization technique can handle lists and dictionaries without any additional effort, but serializing arbitrary class instances in JSON requires some extra work.

Here’s a chance to apply what we know about Python file I/O to generate a file containing randomly-generated numbers. We will then read the data into a list, and use quicksort to sort the list. We start with a function to generate the file:

def generateNumbers(filename,num):
'''Use Python file I/O to generate a file of name 
filename and fill it with num randomly-generated 
integers of range 1 to num'''
temp = []
f = open(filename,'w')
for i in range(0,num):
temp.append(int((random.random()*num)+1))
json.dump(temp,f)
f.close()

As you can see, I used JSON to save the data to the file, which should make the process of reading the numbers in easier. Now for the function to read in the data:

def readList(filename):
'''Use Python file I/O to read in a file of name 
filename using serialization'''
retval = []
f = open(filename,'r')
temp = []
try:
retval = json.load(f)
except EOFError:
print('EOF error')
f.close()
return retval

The following code uses Python file I/O to generate a list of 1000 integers, reads in the same list, sorts the list and prints out both the size of the list along with the sorted list:

generateNumbers('mylist.lst',1000)
myList = readList('mylist.lst')
quickSort(myList)
print('Size of myList = ',len(myList))
print(myList)

We could save the list of integers as ASCII text, but if we did, we would have to parse the list when we read it back in, which would be somewhat cumbersome. Serialization makes the process easy, and it is undoubtedly a technique we will use in the future.

External Links:

Reading and writing Files at docs.python.org

Quicksort: A Python Implementation

QuicksortIn the previous articles, we introduced sorting algorithms and went through the process of coding two efficient sorting algorithms: merge sort and heapsort. In this article, we’ll take a look at one more efficient sorting algorithm: quicksort.

The quicksort algorithm was developed in 1960 by Tony Hoare, and has since gained widespread adoption. A quicksort function has been part of C since 1975, and was standardized as part of ANSI C in 1989. Python does not have a build-in quicksort function, but quicksort is easy enough to code.

Implementing the Quicksort Algorithm

Like all the other efficient sorting algorithms covered in this series of articles, the quicksort algorithm is a recursive, divide and conquer algorithm. It is based on the following steps:

  1. Pick an element (called a pivot) from the list.
  2. Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it. Equal values can go either way.
  3. Apply the above steps recursively to each half of the list.

The pseudocode for this algorithm is as follows:

quicksort(L, low, high):
	if low < high
		p = partition(L, low, high)
		quicksort(L, low, p-1)
		quicksort(L, p+1, high)

The manner in which the pivot is chosen can be key. In some implementations, the leftmost element of the partition would often be chosen as the pivot. This would lead to worst case behavior in already-sorted lists. This illustrates the basic problem with quicksort: if we keep choosing the lowest element (or highest element) as the pivot, then quicksort becomes like a selection sort. One possible alternative is to choose the midpoint, which generally leads to good average case behavior. One possibility, suggested by computer scientist Robert Sedgewick, is to choose the median of the first element, the last element, and the midpoint. This guarantees a better estimate of the optimal pivot in an unsorted list. Here is my Python function for choosing a pivot:

def choosePivot(L,low=0,high=-1):
    if high == -1:
        high = len(L)-1
    mid = low+int((high-low)/2)
    # Return the median of low, middle, and high
    if L[low] < L[mid] and L[low] < L[high] and L[mid] < L[high]:
        return mid
    elif L[mid] < L[low] and L[mid] < L[high] and L[low] < L[high]:
        return low
    else:
        return high

Most of the work will be done by the partitioning function, which performs the following steps:

  1. Choose a pivot.
  2. Move the pivot to the end of the list, putting it out of the way (temporarily).
  3. Initialize two indices, i and j, to the index of the first list element.
  4. If list[i] is less than the pivot, swap it with list[j] and increment j.
  5. Increment i, and repeat step [4] as long as i is less than the index of the last list element.
  6. Swap the pivot (the last list element) with list[j]. Now the list is partitioned.

The Python representation of this function is as follows:

def partition(L,low=0,high=-1):
    if high == -1:
        high = len(L)-1
    # Select a pivot
    pivot = choosePivot(L,low,high)
    # Put the pivot at the end where it
    # will be out of the way
    swap(L,pivot,high)
    # Separate the list into elements 
    # less than and greater than pivot
    j = low
    for i in range(low,high+1):
        if L[i] < L[high]:
            swap(L,i,j)
            j += 1
    # Put the pivot between the two halves
    swap(L,high,j)
    return j

The swap function from heapsort is re-used:

def swap(L,x,y):
    temp = L[x]
    L[x] = L[y]
    L[y] = temp

Now that we have functions to choose the pivot and partition the list, writing the code for the quicksort function is relatively easy:

def quickSort(L,low=0,high=-1):
   '''Sort a list by selecting a "pivot", dividing
   the list into elements less than and greater than
   the pivot, then apply the sort recursively on the
   two list halves'''
   if high == -1:
      high = len(L)-1
   if high > low:
      # Partition into two halves, then sort
      # each half
      pivot = partition(L,low,high)
      # Check to see if pivot > 0; necessary
      # because if pivot == 0, then high = -1
      # and whole list is sorted
      if pivot > 0:
         quickSort(L,low,pivot-1)
      quickSort(L,pivot+1,high)

Recursion Errors (and a Solution)

This implementation of the function will work most of the time, but during testing, I encountered a potential issue. Assume that you have an unsorted list with a lot of duplicates – for example, the following list:

[3, 4, 2, 3, 1, 4, 2, 2, 3, 1, 3, 1, 2, 4, 1, 3, 2, 1, 3, 2]

After a few partitionings, you will likely end up with smaller sublists of all the same number. In such a case, the pivot value returned will be zero. If we allow the recursion to continue, quicksort will be called on a list of n-1 items, then n-2 items, and so on. If the lists are big enough, Python will reach the maximum allowed number of levels of recursion and the interpreter will return an error. Since we don’t want this to happen, I made a further modification to the partition function. After we are done partitioning the list, we will iterate through the list. If there are any two items that are not equal, we will stop iterating through the list and return the pivot value. If not, we will return -1, which will end any further recursive calls to quicksort. The new partition function looks like this:

def partition(L,low=0,high=-1):
    '''partition function for quicksort
    implementation'''
    if high == -1:
        high = len(L)-1
    # Select a pivot
    pivot = choosePivot(L,low,high)
    # Put the pivot at the end where it
    # will be out of the way
    swap(L,pivot,high)
    # Separate the list into elements 
    # less than and greater than pivot
    j = low
    for i in range(low,high+1):
        if L[i] < L[high]:
            swap(L,i,j)
            j += 1
    # Put the pivot between the two halves
    swap(L,high,j)
    for i in range(low,high):
        if L[i] != L[i+1]:
            return j
    return -1

In the worst-case scenario, we have to look through the entire list. Thus, instead of having a maximum of N comparisons for a list of size N, we will have 2N comparisons, but at least we avoid exponential growth in the number of operations, and this will prevent our program from crashing when there are a lot of duplicate items.

I also had to make a slight modification to the quicksort function. Here is the newly-modified function:

def quickSort(L,low=0,high=-1):
    '''quicksort: Sort a list by selecting a "pivot", 
    dividing the list into elements less than and greater 
    than the pivot, then apply the sort recursively on the 
    two list halves'''
    if high == -1:
        high = len(L)-1
    if high > low: 
        # Partition into two halves, then sort
        # each half
        pivot = partition(L,low,high)
        # Check to see if pivot > 0; necessary
        # because if pivot == 0, then high = -1
        # and whole list is sorted
        if pivot > 0:
            quickSort(L,low,pivot-1)
        if pivot > -1:
            quickSort(L,pivot+1,high)

This should work fairly well, but obviously, I may have overlooked something, so if anyone has any further suggestions for improvements to this quicksort implementation, I would love to hear them. The source code for this implementation can be downloaded here.

Built in Sort Functions: Python and C

Finally, it should be mentioned that Python lists have their own sort function. In our example, if we had a list called myList, we could make the following call to sort the list:

myList.sort()

sort takes up to three optional parameters in Python 2.7: cmp, key, and reverse. cmp specifies a function of two arguments (list items) which should return a negative, zero, or positive number depending on whether the first argument is considered smaller than, equal to, or greater than the second argument. key specifies a function of one argument that is used to extract a comparison key from each list element (e.g. str.lower). Reverse is a boolean value. If set to True, then the list elements are sorted as if each comparison were reversed. Newer versions of Python have eliminated the first argument (cmp), which should not be a major issue, since you can still sort the list in order (or reverse order using the “reverse” parameter). The key paramter is potentially useful in a number of cases. Assume we have a list of lists, where each of the nested lists represents an employee record. We want to sort the records by employee ID, but the employee ID is the third item in each record. To sort by employee ID, all we need to do is define a key function:

def getKey(a):
    return a[2]

and call the sort function with getKey for the key parameter.

Quicksort is probably the easiest of the efficient sorting algorithms covered so far to understand and code. And in C/C++, we don’t even have to code it, as we are provided with the qsort function, which takes 4 arguments: a pointer to the array to be sorted; the number of items to be compared; the size of each item in the array, and a pointer to a function to compare two items. The comparison function should have as parameters pointers to the two items being compared, and should return -1 if the first item is less than the second, 0 if they are equal, and 1 if the first item is greater than the second (in order to sort the array in ascending order). Reverse the return values to sort the array in descending order.

External Links:

Quicksort on Wikipedia

Qsort page on cplusplus.com

Heapsort: Part Two

HeapsortIn the previous article, we introduced the heapsort algorithm and coded several helper functions in Python. In this article, we will finally code the heapsort function and test it on some input data.

To recap, we already introduced the following functions:

  • heapify: Turns a list into a binary heap.
  • rebuildHeap: Accepts as input a heap with the root node deleted and rebuilds the heap.
  • consolidateHeap: Moves all empty nodes to the bottom of the heap so we can shrink the heap.
  • swap: Swaps the contents of two variables.

Since we did all the heavy lifting in writing the helper functions, coding heapSort is rather simple:

def heapSort(L):
   '''Given a list, perform a binary sort on it
   '''
   # First, reconstitute list as a binary heap: 
   heapify(L)
   high = len(L)
   # While there is still data in the heap,
   # remove the root node, rebuild and
   # consolidate the heap, and move the old
   # root node to the end
   while high > 0:
      # Delete root node; save in temp
      temp = L[0]
      L[0] = None
      # Rebuild and consolidate heap
      rebuildHeap(L,0,high)
      consolidateHeap(L,high)
      high -= 1
      # Move old root node to end
      L[high] = temp

As you can see, once we create the heap, all we are doing is [1] deleting the root node from the heap by setting its value to “None”; [2] rebuild the heap and consolidate the heap; [3] decrement the variable indicating the size of the heap; [4] move the old root node back into the list right after the heap; [5] repeat until the heap size is 0. You may have noticed that when we call rebuildHeap and consolidateHeap, we specify the size of the heap. This is to ensure the sorted list which we are inserting at the end of the list is not treated as part of the heap.

To test our heapsort function, I reused the list from the previous article, and used the following code:

>>> myList = [123,79,29,997,519,17,239,144,51,372]
>>> heapSort(myList)
>>> print(myList)
[17, 29, 51, 79, 123, 144, 239, 372, 519, 997]

As you can see, the heapsort function seems to work, and has the added advantage (over merge sort) of supporting the ability to sort in place. In the next article, we’ll examine one more efficient sorting algorithm: quicksort.

The source code for this heapsort implementation can be found here.

External Links:

Heapsort on Wikipedia

Python Programming: Part Three (Conditional Statements; Lists, Tuples and Dictionaries)

Python listIn the previous article, we covered variables, and how to save and run Python modules. In this article, we will introduce some more basic concepts including statements that alter the control flow of Python, and lists, tuples and dictionaries.

Up to this point, we have been executing Python statements using a linear control flow. A programming language is of minimal value, however, without conditional statements. Python provides three such statements: if, elif, and else. For example:

>>> if x < 0:
                print(‘x is a negative number’)

In this example, we used the comparison operator “<” (less than) to test to see if x is less than 0. If x is less than zero, we will print a statement that tells us x is a negative number. We might also want to print something out if x is zero or positive:

>>> if x < 0:
               print(‘x is a negative number’)
         elif x == 0:
                print(‘x equals zero’)
         else:
                print(‘x is a positive number’)

Here, the first two lines are the same, but if x is not less than zero, we test it to see if it is equal to zero and if it is, we print a message. If x is not less than zero or zero, we assume it is positive and print another message. Another useful control flow statement is the for statement. The for statement in Python differs a bit from the for statement in C and Pascal. Rather than always iterating over an arithmetic progression of numbers, as in Pascal, or giving the user the ability to define both the iteration step and halting condition as in C, Python’s for statement iterates over the items of any sequence in the order that they appear in the sequence. For example:

>>> word = ‘coffee’
>>> for c in word:
                 print(c)

This will print every letter in the string word on a separate line. If you need to iterate over a sequence of numbers, the built in function range() comes in handy. It generates arithmetic progressions:

>>> for i in range(10):
                 print(i,’ ‘,i**2)

This code will print out two columns: one containing integers 0 through 9, the other containing their squares. If you specify only a single parameter, 0 will be the lower bound and the specified parameter will be the upper bound. But you can specify a lower bound:

>>> for i in range(5,10):
                print(i,’ ‘,i**2)

This code will print out integers 5 through 9 and their squares. We can also specify a step value:

>>> for i in range(1,10,2):
                 print(i,’ ‘,i**2)

This code will print out all odd-numbered integers from 1 through 9 and their squares.

Lists, Tuples and Dictionaries

In C, there are arrays, which act as a compound data type. In Python, there are two sequence types not yet covered: lists and tuples. There are also dictionaries, which is not a sequence type, but is nonetheless very useful.

Lists are collections of some data type. Creating a list can be done quite easily:

>>> mylist = [5, 10, 15]

This will create a list with three items in it. We can iterate through the list as well:

>>> for i in mylist:
                 print(i)

This will print each item in mylist on a separate line. You can use the assignment operator on a list:

>>> otherlist = mylist

Now we will be able to access the list, but it is important to note that otherlist is not a separate copy of mylist. It also points to mylist. So this statement:

>>> otherlist[0] = 100

will alter the contents of mylist:

>>> print(mylist)
[100, 10, 15]

You can also nest a list within a list. For example:

>>> secondlist = [1, 2, 3, 4]
>>> mylist[0] = secondlist

nests secondlist inside mylist as the first element.

You can also create an empty list, by specifying an empty set of brackets in the assignment, like this:

mylist = []

Lists have a built-in method called append that can be used to append an item to the end of a list:

>>> mylist.append(125)

appends 125 to the end of the list.

Whereas a list is a mutable sequence data type, a tuple is an immutable sequence data type. A tuple consists of a number of values separated by commas. For example:

>>> tu = 1, 2, 3

A tuple need not contain all the same data type. The following statement is valid:

>>> tu = 5, 10, 15, ‘twenty’

To confirm that tuples are immutable, we can try to change one of the items:

>>> tu[0] = 121
Traceback (most recent call last):
File “”, line 1, in
TypeError: ‘tuple’ object does not support item assignment

On output, tuples are always enclosed in parentheses, so that nested tuples are interpreted correctly; they may be input with or without surrounding parenthesis, although often parenthesis are necessary anyway if the tuple is part of a larger expression. It is not possible to make an assignment to an individual item of a tuple; however, it is possible to create tuples which contain mutable objects, such as lists.

Another useful data type built into Python is the dictionary. Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type. Strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples. If a tuple contains any mutable object either directly or indirectly, it cannot be used as a key. You cannot use lists as keys, since lists can be modified.

You can think of dictionaries as unordered sets of key: value pairs, with the requirement that the keys are unique within one dictionary. A pair of braces creates an empty dictionary. Placing a comma-separated list of key:value pairs within the braces adds initial key:value pairs to the dictionary.

Here is an example of a dictionary:

>>> generals = { ‘Germany’: ‘Hindenburg’, ‘Russia’: ‘Samsonov’, ‘France’: ‘Nivelle’, ‘United Kingdom’: ‘Haig’ }
>>> print(generals[‘Germany’])
‘Hindenburg’
>>> del generals[‘Russia’]
>>> generals[‘France’] = ‘Foch’
>>> print(generals)
{‘Germany’: ‘Hindenburg’, ‘France’: ‘Foch’, ‘United Kingdom’: ‘Haig’}

You can use for to iterate through a dictionary. For example:

>>> for c in generals.keys()
print(generals[c])

Hindenburg
Foch
Haig

In the next article, we will introduce the concept of modules, and write our first function.

External Links:

Python documentation from the official Python website

Python Programming: Part Two (Python IDLE)

Python IDLEIn the previous article, we introduced some basic concepts about Python and how to use Python at the interactive command line. In this article, we will introduce variables, and also consider how to save a code module and run it, both from the command line and the Python IDLE interface.

The previous article contained the following Python statement:
>>>> a = 3+4

This was our first example of a Python variable. As you may have deduced, in Python, unlike C/C++ or some other languages, you don’t have to declare variables separately. You declare a variable when you use it for the first time. There are three distinct numeric types: integers, floating point numbers, and complex numbers. In addition, Booleans are a subtype of integers.

Numbers are created by numeric literals or as the result of built-in functions and operators. Unadorned integer literals yield integers. For example:
>>> x = 10

yields an integer variable x. Numeric literals containing a decimal point or an exponent sign yield floating point numbers. For example:
>>> x = 3.5

yields a floating point variable x.

You can also create a complex number. Appending ‘j’ or ‘J’ to a numeric literal yields an imaginary number to which you can add an integer or float to get a complex number with real and imaginary parts. For example:
>>> x = 10j + 5

creates a complex number with 5 as the real part and 10 as the imaginary part. You can retrieve both parts by typing:
>>> print(x)

or just the real or imaginary parts:
>>> print(x.real)
>>> print(x.imag)

The variables real and imag, however, are read-only variables and cannot be used to change the values of the real and imaginary components. The constructors int(), float() and complex() can be used to produce numbers of a specific type. For example:
>>> x = int(23)

creates an integer with a value of 23. Interestingly, leaving the parameter blank like this:
>>> x = int()

results in the value 0 being assigned to x.

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points. String literals can be written in several different ways – single quoted, double quoted, or triple quoted:
>>> text = ‘This is a test’
>>> text = “This is a test”
>>> text = ”’This is a test”’
>>> text = “””This is a test”””

Strings are immutable sequences of Unicode code points. Therefore:
>>> text[0] = ‘B’

is not allowed. However, if we want to print a single character from the string, we can do this:
>>> print(text[0])

or, if we want to print multiple characters that represent a subset of the string, we can specify two indices separater by a colon. For example:
>>> print(text[0:3])

will result in ‘This‘ being printed.

So far, we have been typing in programs at the interactive prompt, which has one big disadvantage: programs you type there go away as soon as the Python interpreter executes them. Because the code typed in interactively is never stored in a file, you cannot run it again without retyping it. To save programs permanently, you need to write your code in files, which are usually called modules. Modules are text files containing Python statements. Once a module is coded and saved in a file, you can ask the Python interpreter to execute the module any number of times.

To code a module, open your favorite text editor (e.g. Notepad in Windows, perhaps vi, emacs or gedit in Linux) and type some statements into a new text file called module01.py:

# My first Python script
name = str(input(‘Please enter your name: ‘)) # Read some input
print(‘Welcome to Python, ‘+name) # Print something out

In this module, we introduced three new concepts. The first line of the module is a comment. A hashtag (#) introduces a comment, which is just a description of what the program is doing. Comments can appear on a line by themselves or to the right of a statement. Multiline comments can be added, but must begin and end with three single quote marks, like this:
”’This is an example
 of a multi-line comment that can be inserted into a Python script on the command line or in the Python IDLE interface”’

The other concept we introduced was the input function. input() simply reads characters from the standard input device (in this case, a keyboard). We used the str() constructor to ensure that name is created as a string.

The third concept we introduced was the plus sign (+) to concatenate the literal string ‘Welcome to Python, ‘ and name. As you probably guessed, the program will read in a string from the keyboard representing the user’s name, and will print out a welcome message containing the user’s name.

Once you have saved this text file, you can ask Python to run it by listing its full filename as the first argument to a python command, typed at the system command prompt; e.g.:
> python module01.py

Using the Python IDLE Interface to Run a Module

Python IDLE

Python IDLE interface running our simple script.

Alternatively, from the Python IDLE interface, you can navigate to File -> Open, and then browse to module01.py and open it. A new window should appear with the code for the module in it. Navigate to Run -> Module (or in Windows, just press F5) to run the module, and the output should appear in the main IDLE window, after it prints the “RESTART” banner message:

Please enter your name: Grumbledook
Welcome to Python, Grumbledook

There are other ways we can load a module. We could use the import command, assuming the module is within Python’s path:

>>> import module01

If we make changes to the code, we would have to use the reload command in order for them to take effect:

>>> from imp import reload
>>> reload(module01)

Finally, we could use the exec() and open.read() commands in conjunction to simultaneously load and run the module:

>>> exec(open(‘module01.py’).read())

These three options should work both at the command line and in the Python IDLE interface.

In the next article, we will introduce some additional concepts and also code our first Python function.

External Links:

Wikipedia article on Python IDLE interface

Python IDLE website

Python IDLE wiki

How to Install Python IDLE on Linux

How to Install Python IDLE on eHow