In the previous lecture, we learned that an object whose key is a sequence of symbols can be stored within a spelling tree, where the insertion/lookup time is proportional to the length of the key. A key's length is on the order of logmN, where N is the total number of objects that might be inserted into the tree, and m is the size of the alphabet used to write the keys.
(Note that, when m is 2, then the keys are just binary numerals, and the spelling tree is a variant of a binary search tree.)
Is there any data structure that lets us do insertion, retrieval (and deletion) in better than logmN time? Only the array data structure has this behavior: If an array has length N, and if the keys are integers in the range 0..N-1, then array lookup/insertion operates in constant time --- the integer index is multiplied by the size of the array's elements and added to the address of the array's starting location; this gives the location of the desired element in two fixed steps of arithmetic.
But remember that integers are coded as binary numerals --- sequences of 1s and 0s --- so the integers are ``really'' keys coded with a 2-symbol alphabet, stored in one 32-bit computer fullword. But the computer chip is hardwired to compute on fullwords quickly, even processing the bits in parallel. As a result, array indexing takes a ``constant'' amount of time.
To have this speed in processing, we must restrict the ``keys'' used to index an array to be fullword integers in some range, 0..N-1 (or use a finite set of keys that simply map one-one and onto 0..N-1, e.g., characters like 'a', 'b', 'c', ...). Alas, the real world rarely uses a fullword integer as the identification key for a person, an automobile, or a book.
Is there a technique that lets us use ordinary keys --- sequences of symbols --- with an array?
A hash table is an attempt to use an array as a data structure for holding keyed objects. In its basic form, a hash table is an array, indexed by 0..N-1. But the keys that go with objects might be sequences, e.g., 515569876 or QA76.345Z or "Fred Mertz". These keys must be mechanically translated into integers in the range 0..N-1. The translator is called a hashing function; it translates a key into its hash code --- an integer in the range 0..N-1.
Let h be the name of the hashing function. Then, we write h(k) to get the hash code returned for key k. The basic plan is simple:
r[ h( ke ) ] = e;       // insert object e, whose key is ke
... r[ h( ke ) ] ...    // look up the object whose key is ke
r[ h( ke ) ] = null;    // delete the object whose key is ke
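The plan above can be sketched as a tiny Java class. The class name, the use of String keys, and the toy hashing function are illustrative assumptions, not part of the lecture's code; collisions are ignored here and are dealt with later.

```java
public class BasicHashTable {
    private static final int N = 37;      // table size
    private Object[] r = new Object[N];   // the hash table proper

    // a toy hashing function --- any mechanical map from keys into 0..N-1 will do
    private int h(String k) {
        return Math.abs(k.hashCode() % N);
    }

    public void insert(String ke, Object e) { r[h(ke)] = e; }    // insertion
    public Object lookup(String ke)         { return r[h(ke)]; } // retrieval
    public void delete(String ke)           { r[h(ke)] = null; } // deletion
}
```

Notice that each operation does one hash computation and one array access --- no searching.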
For this scheme to work, we must devise h as follows: h is written so that it mechanically converts a key, k, to an integer in the range 0..N-1.

There is one last question: can h map every distinct key to a distinct integer? Well, given the previous requirements, we can't ensure this! (Perhaps the table has size N but there are more than N distinct objects, each with its own distinct key!) The best we can do is write a hash function that rarely maps two distinct keys to the same integer, and then, when this rare event happens (it is called a collision), we must have a procedure to deal with it.
private int SOME_PRIME_NUMBER = 37;
private Object[] r = new Object[SOME_PRIME_NUMBER];

For reasons explained later, hash tables almost always contain a prime-number quantity of element slots.
Say that keys are strings. Perhaps we wish to insert object a1 whose key is "abc". We use some hash function, h, to compute a hash code for "abc", e.g., h("abc") == 7. So, we insert a1 into r[7].
Similarly, perhaps a2 has key "def", and h("def") == 35. After inserting it, the table looks like this:
  0      1           7          35     36
 ------------------------------------------
| null | null | ... | a1 | ... | a2 | null |
 ------------------------------------------

Most of the space in the table is wasted, but this is the price we pay for using a hash table.
Next, say that we wish to retrieve the value whose key is "def"; since h("def") == 35, we fetch the value at r[35].
Finally, say that we insert a3 whose key is "ghi", and by bad luck, h("ghi") == 7. Now, what do we do? We address this momentarily.
We assume that a key has the format

k == x0 x1 x2 ... xm-1 xm

where each xj is a symbol that has a numerical value.
To translate k into an integer in range 0..N-1, we take a two-step approach:

Step 1: Convert the key, k, into an integer; call it ik.

Step 2: Compress ik into the range 0..N-1:

hash_code = Math.abs( ik ) % N

That is, take the absolute value and mod by the size of the array.
The usual technique for doing Step 1 is called polynomial coding: Choose a positive int as the base; call it a. Next, compute this integer from the symbols in key k == x0 x1 x2 ... xm-1 xm:

ik == (x0 * a^m) + (x1 * a^(m-1)) + (x2 * a^(m-2)) + ... + (xm-1 * a) + xm
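The formula above can be computed in Java without raising a to explicit powers, by accumulating left to right (Horner's rule: ik = ik*a + xj at each symbol yields exactly the sum above). The method name, polyCode, is our own choice for this sketch.

```java
class PolyCode {
    // Computes (x0 * a^m) + (x1 * a^(m-1)) + ... + (xm-1 * a) + xm
    // for the symbols of key k, via Horner's rule.
    static int polyCode(String k, int a) {
        int ik = 0;
        for (int j = 0; j < k.length(); j++) {
            ik = ik * a + (int) k.charAt(j);  // int overflow merely wraps --- OK here
        }
        return ik;
    }
}
```

For example, polyCode("abc", 100) computes ((0*100 + 97)*100 + 98)*100 + 99 == 979899.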
To see this technique at work, let a be 100. Say that the key is "abc". Recall that, in Java, the characters 'a', 'b', 'c' can be treated as integers, specifically,
int a_code = (int)'a';

assigns 97 to a_code. Hence, "abc" can be read as the three-integer sequence, 97 98 99.
Here is the polynomial code for "abc":
(97 * 100^2) + (98 * 100) + 99 == 979899

At this point, we convert 979899 into its hash code by performing Step 2. Say that N is 37:

Math.abs(979899) % 37 == 28

Hence, the object whose key is "abc" should be stored in element 28 of the hash table.
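The whole two-step computation for "abc" can be checked directly (the class name here is just for the demonstration):

```java
class HashDemo {
    public static void main(String[] args) {
        // Step 1: polynomial code for "abc", with base a == 100
        int ik = (97 * 100 * 100) + (98 * 100) + 99;
        // Step 2: compression, with table size N == 37
        int hashCode = Math.abs(ik) % 37;
        System.out.println(ik);        // prints 979899
        System.out.println(hashCode);  // prints 28
    }
}
```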
The compression step maps all polynomial codes into the range 0..36, so collisions are almost inevitable. It is a result of number theory that modding by a prime number tends to reduce the number of collisions.
For fun, calculate the polynomial code for "def", that is, for 100 101 102. Then, calculate the hash code.
There is no reason why the base, a, must be 100 or 10 or whatever. Indeed, experimental evidence has shown that primes like 37 and 41 make good values for bases. (Obviously, small nonprimes, like 2 and 4, make terrible bases for polynomial codings.)
Common sense tells us that, if a key is written with a character set of M characters, then setting a == M would map each key to a unique polynomial code. But extensive analysis has shown that a can be a smaller value than M, e.g., 37 or 41. The point is that we do not want to use a base that inadvertently maps ``too many'' distinct keys into the same polynomial code --- e.g., a == 2 does badly. Also, a long key can easily cause the polynomial coding calculation to overflow the standard integer fullword (which holds 32 bits), and we want the ``overflowed'' number to be nonetheless useful as a polynomial code. Again, bases like 37 and 41 cope well with this situation.
But this is only half of the story --- since the polynomial code must be ``compressed'' into the range 0..N-1, there is the chance that the compression will cause distinct polynomial codes to ``collide'' into the same hash code.
Deep analysis as well as experimental evidence suggest that such collisions are reduced when the table size, N, is a prime number --- this is why a hash table almost always has a prime-number length.
Recall that the table looked like this:

  0      1           7          35     36
 ------------------------------------------
| null | null | ... | a1 | ... | a2 | null |
 ------------------------------------------

and we wished to insert object a3 with key "ghi", but h("ghi") == 7.
A good solution to the collision is to create a linked list at element 7; this is called a bucket:
  0      1            7
 ------------------------------
| null | null | ... | * | ...
 --------------------|---------
                     v
                   ------
                   | a1 |
                   |----|
                   |  *-|---+
                   ------   |
                            v
                          ------
                          | a3 |
                          |----|
                          |null|
                          ------
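A bucket table can be sketched in Java by making each array element a linked list of (key, object) pairs. The class and method names here are assumptions for the sketch, not the lecture's code:

```java
import java.util.LinkedList;

public class BucketHashTable {
    // each list node remembers both the key and its object,
    // because colliding keys share the same bucket
    static class Entry {
        String key; Object value;
        Entry(String k, Object v) { key = k; value = v; }
    }

    private static final int N = 37;
    @SuppressWarnings("unchecked")
    private LinkedList<Entry>[] r = new LinkedList[N];

    private int h(String k) { return Math.abs(k.hashCode() % N); }

    public void insert(String key, Object e) {
        int i = h(key);
        if (r[i] == null) r[i] = new LinkedList<>();
        r[i].add(new Entry(key, e));          // a collision just lengthens the list
    }

    public Object lookup(String key) {
        LinkedList<Entry> bucket = r[h(key)];
        if (bucket != null)
            for (Entry en : bucket)           // linear search within the bucket
                if (en.key.equals(key)) return en.value;
        return null;
    }

    public boolean delete(String key) {
        LinkedList<Entry> bucket = r[h(key)];
        return bucket != null && bucket.removeIf(en -> en.key.equals(key));
    }
}
```

Note that deletion simply unlinks a node from the bucket, which is why buckets handle deletions gracefully.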
A second solution to a collision is to place a3 in the first vacant element following element 7 (``wrapping around'' to element 0 if the right suffix of the array is filled). Here, since element 8 is empty, a3 would be placed there:
  0      1           7    8          35     36
 -----------------------------------------------
| null | null | ... | a1 | a3 | ... | a2 | null |
 -----------------------------------------------

This is called linear spillover.
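Linear spillover can be sketched as follows; since colliding keys land in later slots, the table must remember each element's key so that lookup can tell the entries apart. The names below are assumptions for the sketch, and the sketch assumes the table never becomes completely full:

```java
public class SpilloverHashTable {
    private static final int N = 37;
    private String[] keys = new String[N];  // the key stored in each slot
    private Object[] vals = new Object[N];  // the object stored in each slot

    private int h(String k) { return Math.abs(k.hashCode() % N); }

    public void insert(String key, Object e) {
        int i = h(key);
        // walk forward to the first vacant slot, wrapping around at the end
        while (keys[i] != null && !keys[i].equals(key))
            i = (i + 1) % N;
        keys[i] = key;
        vals[i] = e;
    }

    public Object lookup(String key) {
        int i = h(key);
        while (keys[i] != null) {           // an empty slot means the key is absent
            if (keys[i].equals(key)) return vals[i];
            i = (i + 1) % N;
        }
        return null;
    }
}
```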
Both buckets and linear spillover mean that the retrieval operation must first compute a hash code and then do a linear search, beginning at the array element named by the hash code.
Finally, another approach to collisions is rehashing---this is the computation of a second, different hash code when the first hash code caused a collision. Rehashing will not be discussed here (partly because a recomputed hash code can generate a second collision, so then you must do re-rehashing, etc.), but you can consult a data structures text for this technique.
Hash tables are not well suited for deletions, and use of linear spillover makes deletions painful---use buckets if you expect to handle deletions also.
As noted earlier, a hash table is attractive because it lets us work with a simple data structure---the array---while ensuring that insertion and lookup time is based on the length of keys, not on the number of objects stored. The major disadvantage of a hash table is that, once the table is almost or completely full, lookups and insertions slow dramatically, and ultimately, the table must be scrapped and rebuilt at a larger size.
Here are some guidelines for when to employ a hash table: