Hash Tables

Much effort in the data structures field is devoted to designing structures that can store objects based on the objects' keys. Ideally, the objects are inserted, retrieved, and deleted in less than linear time. For example, binary search of a sorted array of N objects takes on the order of log_2 N time, as does insertion and lookup in ordered trees (binary search trees).

In the previous lecture, we learned that an object whose key is a sequence of symbols can be stored within a spelling tree, where the insertion/lookup time is proportional to the length of the key, which is on the order of log_m N, where N is the total number of objects that might be inserted into the tree, and m is the size of the alphabet used to write the keys.

(Note that, when m is 2, then the keys are just binary numerals, and the spelling tree is a variant of a binary search tree.)

Is there any data structure that lets us do insertion, retrieval (and deletion) in better than log_m N time? Only the array data structure has this behavior: If an array has length N, and if the keys are integers in the range, 0..N-1, then array lookup/insertion operates in constant time --- the integer index is multiplied by the size of the array's elements and added to the address of the array's starting location; this gives the location of the desired element in two fixed steps of arithmetic.
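As a rough sketch of that arithmetic (the addresses and sizes below are made-up values, since Java itself hides memory addresses):

// An idealized sketch of the address arithmetic behind array indexing.
// Java hides real addresses, so these numbers are invented; the point is
// that the cost is two arithmetic steps, no matter how large N is.
long startAddress = 0x1000;   // assumed address where the array begins
long elementSize  = 4;        // assumed bytes per element, e.g., one int
int  index        = 7;        // the integer "key"
long elementAddress = startAddress + (index * elementSize);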

But remember that integers are coded as binary numerals --- sequences of 1s and 0s --- so the integers are ``really'' keys coded with a 2-symbol alphabet, stored in one 32-bit computer fullword. But the computer chip is hardwired to compute on fullwords quickly, even processing the bits in parallel. As a result, array indexing takes a ``constant'' amount of time.

To have this speed in processing, we must restrict the ``keys'' used to index an array to be fullword integers in some range, 0..N-1 (or use a finite set of keys that simply map one-one and onto 0..N-1, e.g., characters like 'a', 'b', 'c', ...). Alas, the real world rarely uses a fullword integer as the identification key for a person, an automobile, or a book.

Is there a technique that lets us use ordinary keys --- sequences of symbols --- with an array?

Hash table: an array with ``smarter'' keys

A hash table is an attempt to use an array as a data structure for holding keyed objects. In its basic form, a hash table is an array, indexed by 0..N-1. But the keys that go with objects might be sequences, e.g., 515569876 or QA76.345Z or "Fred Mertz". These keys must be mechanically translated into integers in the range 0..N-1. The translator is called a hashing function: it translates a key into its hash code --- an integer in the range, 0..N-1.

Let h be the name of the hashing function. Then, we write h(k) to get the hash code returned for key k. The basic plan is simple (a sketch in Java follows the list):

  1. Construct a hash table as an array, r, that holds objects indexed by 0..N-1. Initialize all of r's elements to null.
  2. To insert an object, e, into the array, translate its key, ke, into its hash code, h( ke ), and save e in the array:
    r[ h( ke ) ] = e;
    
  3. To retrieve the object with key k, translate the key to its hash code and index the array:
    r[ h( k ) ]
    
  4. To delete the object with key k, erase it from the array:
    r[ h( k ) ] = null;
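Here is a minimal sketch of this plan in Java. The class name is mine, collisions are ignored entirely, and the hash function h is merely assumed --- one way to write it appears later in these notes:

public class NaiveHashTable {
    private Object[] r;

    public NaiveHashTable(int N) {
        r = new Object[N];            // Step 1: every element starts as null
    }

    private int h(String key) {       // an assumed hash function; see later
        return Math.floorMod(key.hashCode(), r.length);
    }

    public void insert(String key, Object e) {
        r[h(key)] = e;                // Step 2
    }

    public Object retrieve(String key) {
        return r[h(key)];             // Step 3
    }

    public void delete(String key) {
        r[h(key)] = null;             // Step 4
    }
}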
    

For this scheme to work, we must devise h as follows:

  1. h must be consistent: applied to the same key, it must always compute the same hash code.
  2. h must be fast: ideally, its computation time depends only on the length of the key, never on the number of objects stored in the table.
  3. h must map every key to an integer in the range, 0..N-1, so that the result can be used to index the array.

In summary, the code for h must be a fast ``numerical game'' for converting a key into an integer.

Perhaps h is written so that it mechanically converts a key, k, to an integer in the range, 0..N-1. There is one last question:

How do we ensure that h maps each unique key to a unique hash code?
Well, given the previous requirements, we can't ensure this! (Perhaps the table has size N but there are more than N distinct objects, each with its own distinct key!) The best we can do is write a hash function that rarely maps two distinct keys to the same integer, and then when this rare event happens (it is called a collision), we must have a procedure to deal with it.

A small example

Before we deal with the technical problems just mentioned, let's consider a simple example of a hash table and pretend that nothing goes wrong. Perhaps the table is this array, r:
private static final int SOME_PRIME_NUMBER = 37;
private Object[] r = new Object[SOME_PRIME_NUMBER];
For reasons explained later, hash tables almost always contain a prime number of element slots.

Say that keys are strings. Perhaps we wish to insert object a1 whose key is "abc". We use some hash function, h, to compute a hash code for "abc", e.g., h("abc") == 7. So, we insert a1 into r[7].

Similarly, perhaps a2 has key "def", and h("def") == 35. After inserting it, the table looks like this:

      0      1            7         35    36
    ------------------------------------------
   | null | null | ... | a1 | ... | a2 | null |
    ------------------------------------------
Most of the space in the table is wasted, but this is the price we pay for using a hash table.

Next, say that we wish to retrieve the value whose key is "def"; since h("def") == 35, we fetch the value at r[35].
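In code, with a pretend hash function that behaves exactly as assumed (the values 7 and 35 are invented for this example), the insertions and the retrieval are ordinary array operations:

r[7] = a1;              // insert a1, whose key is "abc"; pretend h("abc") == 7
r[35] = a2;             // insert a2, whose key is "def"; pretend h("def") == 35
Object found = r[35];   // retrieve the object whose key is "def"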

Finally, say that we insert a3 whose key is "ghi", and by bad luck, h("ghi") == 7. Now, what do we do? We address this momentarily.

Hash functions

A hash function must map a key to an integer in the range, 0..N-1. In the usual case, a key is a sequence, such as a sequence of letters and/or numerals. The hash function must do a ``good'' (but not perfect) job of mapping each unique sequence to a unique integer in 0..N-1. (Can you understand why it is impossible, in general, for any hash function to do a perfect job of this?)

We assume that a key has the format:

k == x0 x1 x2 ... x(m-1) xm
where each xj is a symbol that has a numerical value.

To translate k into an integer in range 0..N-1, we take a two-step approach:

  1. translate k into an almost unique integer ik, which might well fall outside the range of 0..N-1
  2. ``compress'' ik into 0..N-1 by
    hash_code = Math.abs( ik ) % N
    
    That is, take the absolute value and mod by the size of the array.

The usual technique for doing Step 1 is called polynomial coding: Choose a positive int as the base; call it a. Next, compute this integer from the symbols in key k == x0 x1 x2 ... x(m-1) xm:

ik ==  (x0 * a^m) + (x1 * a^(m-1)) + (x2 * a^(m-2)) + ... + (x(m-1) * a)  +  xm

To see this technique at work, let a be 100. Say that the key is "abc". Recall that, in Java, the characters 'a', 'b', 'c' can be treated as integers, specifically,

int a_code = (int)'a';
assigns 97 to a_code. Hence, "abc" can be read as the three-integer sequence, 97 98 99.

Here is the polynomial code for "abc":

(97 * 100^2) + (98 * 100) + 99  ==  979899
At this point, we convert 979899 into its hash code by performing Step 2. Say that N is 37:
Math.abs(979899) % 37 == 28
Hence, the object whose key is "abc" should be stored in element 28 of the hash table.
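Here is how the two steps might be coded in Java; the method names are mine. The loop uses Horner's rule, which computes exactly the powers-of-a sum defined above without computing any powers explicitly:

// Step 1: polynomial coding via Horner's rule:
//   ((..(x0 * a + x1) * a + x2)..) * a + xm
// equals (x0 * a^m) + (x1 * a^(m-1)) + ... + xm.  Overflow merely wraps around.
public static int polyCode(String key, int a) {
    int code = 0;
    for (int j = 0; j < key.length(); j++) {
        code = (code * a) + key.charAt(j);   // each character treated as an int
    }
    return code;
}

// Step 2: compress the polynomial code into the range 0..N-1.  Math.floorMod
// gives the same nonnegative answer as Math.abs(code) % N and also covers the
// corner case code == Integer.MIN_VALUE, where Math.abs returns a negative.
public static int hash(String key, int a, int N) {
    return Math.floorMod(polyCode(key, a), N);
}

With a == 100 and N == 37, polyCode("abc", 100) returns 979899 and hash("abc", 100, 37) returns 28, matching the calculation above.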

The compression step maps all polynomial codes into the range 0..36. Collisions are almost inevitable. Results from number theory suggest that taking the modulus by a prime number reduces the number of collisions.

For fun, calculate the polynomial code for "def", that is, for 100 101 102. Then, calculate the hash code.

There is no reason why the base, a, must be 100 or 10 or whatever. Indeed, experimental evidence has shown that primes like 37 and 41 make good values for bases. (Obviously, small nonprimes, like 2 and 4, make terrible bases for polynomial codings.)

Common sense tells us that, if a key is written with a character set of M characters, then setting a == M would map each key to a unique polynomial code. But extensive analysis has shown that a can be a smaller value than M, e.g., 37 or 41. The point is that we do not want to use a base that inadvertently maps ``too many'' distinct keys into the same polynomial code --- e.g., a == 2 does badly. Also, a long key can easily cause the polynomial coding calculation to overflow the standard integer fullword (which holds 32 bits), and we want the ``overflowed'' number to be nonetheless useful as a polynomial code. Again, bases like 37 and 41 cope well with this situation.

But this is only half of the story --- since the polynomial code must be ``compressed'' into the range 0..N-1, there is the chance that the compression will cause distinct polynomial codes to ``collide'' into the same hash code.

Deep analysis as well as experimental evidence suggest that such collisions can be reduced when

  1. N, the size of the hash table, is a prime. This is because modulo-by-N tends to deemphasize numerical ``repetitions'' and ``patterns'' that can appear when distinct keys share common subphrases and when the polynomial coding introduces patterns due to its powers-of-a expansion.
  2. The size of the hash table, N, is not a multiple (or nearly a multiple) of a, the base used for the polynomial codings. This reduces the chance that ``patterns'' in the polynomial codings are propagated by modulo-by-N. (A small experiment after this list illustrates both points.)
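Here is a small experiment that makes both points concrete. The keys are made up so that they all end in the same character, the base is a == 100 (as in the example earlier), and polyCode is the sketch from the previous section; a table of size 50 (non-prime, and a divisor of the base) is compared against a table of size 53 (prime):

// Five made-up keys sharing the same last character, hashed with base 100
// into a table of size 50 and a table of size 53.
public static void main(String[] args) {
    String[] keys = { "aa", "ba", "ca", "da", "ea" };
    for (int N : new int[] { 50, 53 }) {
        int[] slots = new int[N];
        for (String k : keys) {
            slots[Math.floorMod(polyCode(k, 100), N)]++;   // polyCode: earlier sketch
        }
        int collisions = 0;
        for (int count : slots) {
            if (count > 1) collisions += count - 1;        // extras in a slot collide
        }
        System.out.println("N = " + N + ": " + collisions + " collisions");
    }
}

Because 100 % 50 == 0, a code mod 50 depends only on the key's last character, so all five keys collide into one slot (4 collisions); mod 53, the five codes land in five distinct slots (0 collisions).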

Handling collisions: buckets and spillovers

It is inevitable that two objects with distinct keys receive the same hash code from the hashing function. Such a case is called a collision. Earlier in these notes, we saw an example of a collision, where the hash table looked like this:

      0      1            7         35    36
    ------------------------------------------
   | null | null | ... | a1 | ... | a2 | null |
    ------------------------------------------
and we wished to insert object a3 with key "ghi", but h("ghi") == 7.

A good solution to the collision is to create a linked list at element 7; this is called a bucket:


      0      1            7      
    ------------------------------
   | null | null | ... | * | ... 
    ---------------------|--------
                         v
                        ---
                       | a1 |
                        ----
                       | *  |
                        -|--
                         v
                        ---
                       | a3 |
                        ----
                       |null|
                        ----
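A sketch of a bucket-based table in Java (the class and names are mine; any hash function h will do):

import java.util.LinkedList;

// A sketch of a hash table whose elements are buckets (linked lists).
// Each bucket holds (key, object) pairs, so that retrieval can tell
// which pair in the bucket carries the requested key.
public class BucketHashTable {
    private static class Entry {
        String key;  Object value;
        Entry(String key, Object value) { this.key = key;  this.value = value; }
    }

    private LinkedList<Entry>[] r;

    @SuppressWarnings("unchecked")
    public BucketHashTable(int N) {
        r = new LinkedList[N];
    }

    private int h(String key) {                        // an assumed hash function
        return Math.floorMod(key.hashCode(), r.length);
    }

    public void insert(String key, Object value) {
        int i = h(key);
        if (r[i] == null) r[i] = new LinkedList<>();
        r[i].add(new Entry(key, value));               // a collision just lengthens the bucket
    }

    public Object retrieve(String key) {
        LinkedList<Entry> bucket = r[h(key)];
        if (bucket == null) return null;
        for (Entry e : bucket) {
            if (e.key.equals(key)) return e.value;     // linear search within the bucket
        }
        return null;
    }
}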

A second solution to a collision is to place a3 in the first vacant element following element 7 (``wrapping around'' to element 0 if the right suffix of the array is filled). Here, since element 8 is empty, a3 would be placed there:


      0      1            7    8         35    36
    -----------------------------------------------
   | null | null | ... | a1 | a3 | ... | a2 | null |
    -----------------------------------------------
This is called linear spillover (more commonly known as linear probing).
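A sketch of a table that uses linear spillover (again, the names are mine). The sketch assumes the table is never allowed to fill completely; otherwise the probe loop below would never find a vacant element:

// A sketch of linear spillover: parallel arrays hold the keys and objects,
// so that a probe can tell whether a slot holds the requested key.
public class SpilloverHashTable {
    private String[] keys;
    private Object[] values;

    public SpilloverHashTable(int N) {
        keys = new String[N];
        values = new Object[N];
    }

    private int h(String key) {                        // an assumed hash function
        return Math.floorMod(key.hashCode(), keys.length);
    }

    public void insert(String key, Object value) {
        int i = h(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;                 // probe the next slot, wrapping to 0
        }
        keys[i] = key;
        values[i] = value;
    }

    public Object retrieve(String key) {
        int i = h(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;                 // keep searching past the collision
        }
        return null;                                   // a vacant slot ends the search
    }
}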

Both buckets and linear spillover mean that the retrieval operation must first compute a hash code and then conduct a linear search, beginning at the array element named by the hash code.

Finally, another approach to collisions is rehashing---this is the computation of a second, different hash code when the first hash code caused a collision. Rehashing will not be discussed here (partly because a recomputed hash code can generate a second collision, so then you must do re-rehashing, etc.), but you can consult a data structures text for this technique.

Hash tables are not well suited for deletions, and use of linear spillover makes deletions painful---use buckets if you expect to handle deletions also.

When to use a hash table

As noted earlier, a hash table is attractive because it lets us work with a simple data structure---the array---while ensuring that insertion and lookup time depends on the length of the keys rather than on the number of objects stored. The major disadvantage of a hash table is that, once the table is almost or completely full, lookups and insertions slow dramatically; ultimately, the table must be scrapped and rebuilt at a larger size.

Here are some guidelines for when to employ a hash table:

  1. When you insert and find objects with keys that are sequences of symbols: The sequences neatly map to hash codes by means of polynomial coding.
  2. When the approximate quantity of objects to be saved is known in advance: If you know in advance that approximately M items will be inserted in the hash table, then you construct an array that holds at least 2M elements. (Experience has shown that a size of 2M reduces collisions to a reasonable amount while not wasting too much space. A sketch of this sizing rule appears after the list.)
  3. When there are few or no deletions to be done.
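
Guideline 2 can be turned into a small helper: given the expected quantity, M, choose the smallest prime that is at least 2M. This is only a sketch --- trial division suffices at realistic table sizes:

// Choose a table size that is the smallest prime >= 2 * expectedM.
public static int tableSizeFor(int expectedM) {
    int n = 2 * expectedM;
    while (!isPrime(n)) {
        n = n + 1;
    }
    return n;
}

private static boolean isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long) d * d <= n; d++) {
        if (n % d == 0) return false;
    }
    return true;
}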