Hash Functions in Python

HashingThe efficient sorting algorithms presented in the previous articles all have their own advantages and drawbacks, and all of them – merge sort, heapsort, and quicksort – have an average complexity of O(n log n). Once a list has been sorted, it becomes relatively easy to find an item, using binary search techniques. But the overhead of doing the sort is an issue, and you may have been wondering if there is a way of avoiding this overhead.

In fact, there is a means of finding items quickly without having the overhead of sorting the list, albeit at the cost of having everything neatly organized into a sequentially-organized list. We can use a hash function. A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size, with slight differences in input data producing very big differences in output data. The values returned by a hash function is called a hash value. Hash functions are primarily used in hash tables to quickly locate a data record. Typically, one field of the record will be designated as the search key. The hash function will be used to map the search key to an index. The index gives the place in the hash table where the corresponding record will be stored.

Implementing Hash Functions

To provide a concrete example, let’s assume we have a series of employee records for a company. Let’s also assume that the records consist of four fields: employee ID, name, job description, and number of years with the company. Assume that employee IDs are unique and that no ID is assigned to more than one worker. We have an employee record that looks like this:

[2175, ‘Homer Simpson’,’Technician’,20]

We decide to use the employee ID as the search key. This ID number is used as input for the hash function, which returns a value (e.g. 33). We store the record in the 33rd location in the hash table. Each location is known as a bucket. A very simple hash function might look something like this:

def hashElement(elem,num):
    if type(elem) == int:
        return elem%num

This function simply takes an integer as input, and returns the modulus of the integer, thus mapping the integer into a more limited range of numbers so we can store the data in a limited number of buckets. We can also generate an empty list, which will serve as our buckets:

def createBuckets(num):
    hashSet = []
    for i in range(num):
        hashSet.append([])
    return hashSet

In Python, we can nest lists inside lists, so it is very easy to represent a record as a list, and then store the list at a location within a list. Our insert function is coded as follows:

def insert(hashSet,i,num):
    hashSet[hashElement(i[0],num)].append(i)

The insert function takes as its parameters a list which contains the buckets, a second list representing the record to be stored in one of the buckets, and the number of buckets. The function calls hashElement to get the index at which to store the record, and appends the record to that location.

Note that it is possible for more than one record to hash to the same location. We call this a collision, and we deal with it by allowing multiple items at the same bucket. This complicates removal of records, since we have to make sure we don’t delete all records in the bucket, just the record that matches the one we want to delete:

def remove(hashSet,i,num):
    newElement = []
    for j in hashSet[hashElement(i,num)]:
        if j[0] != i:
            newElement.append(j)
    hashSet[hashElement(i,num)] = newElement

The remove function takes as arguments the list containing the buckets, an integer representing the employee ID of the employee whose record is to be deleted, and the number of buckets. The function iterates through the bucket containing the employee’s record, and copies all records except that record into a temporary location. It then copies the temporary list into the bucket location. We’ll provide one more function to take an employee ID number and return True if the record exists and False if it does not exist:

def member(hashSet,i,num):
    return i in hashSet[hashElement(i,num)]

Now we can write some code to test these functions:

>>> numBuckets = 51
>>> b = createBuckets(numBuckets)
>>> insert(b,[2175,’Homer Simpson’,’Technician’,20],numBuckets)
>>> print(b[hashElement(2175,numBuckets)])

Here we create a list with 51 buckets, insert a record into the list, and print the record. Running this code results in the following output:

[[2175, ‘Homer Simpson’, ‘Technician’, 20]]

Note that if we try to print out a nonexistent record, the hash function will still work, but the bucket may be empty:

>>> print(b[hashElement(101,numBuckets)])
[]

We could insert additional records if needed:

>>> insert(b,[4158,’Elmer Higgins’,’Engineer’,17],numBuckets)

It is possible to have a collision. If we insert this record:

>>> insert(b,[2583,’Waylon Smithers’,’Assistant’,25],numBuckets)

then there will be a collision at the bucket where we stored the first record:

>>> print(b[hashElement(2175,numBuckets)])
[[2175, ‘Homer Simpson’, ‘Technician’, 20], [2583, ‘Waylon Smithers’, ‘Assistant’, 25]]

But we can delete the first record and leave the other record in the bucket intact:

>>> remove(b,2175,numBuckets)
>>> print(b[hashElement(2583,numBuckets)])
[[2583, ‘Waylon Smithers’, ‘Assistant’, 25]]

When hashing data, we want to ensure that the hash function maps roughly the same number of inputs to each hash value. The method we are using where index = num % numBuckets should work pretty well in our case, where each input value (the employee ID) is equally likely. Another way of mapping the numbers to the bucket range might be to divide the key value by a number But in some cases, it would be less than ideal. For instance, most patrons of a supermarket will live in the same geographic area, so their phone numbers are likely to begin with the same 3 or 4 digits. In that case, the division formula will generate a lot of collisions. However, the modulus formula might work quite well, since it is more sensitive to the trailing digits.

One problem with our hashing functions is that the number of buckets is constantly being passed as a parameter. We could solve this problem by making numBuckets a global variable, but there is a better solution, which we will cover in the next article.

External Links:>

Hash function on Wikipedia

Be Sociable, Share!

Trackbacks

  1. […] the previous article, we introduced some hashing functions, but had to pass the number of buckets (numBuckets) to each function, which was not an ideal […]

Speak Your Mind

*