Spelling Trees

Some people call a data structure a dictionary if it is a collection of ``words,'' and it has methods for inserting and finding specific words in the collection.

If a ``word'' is a sequence of characters, or in general, a sequence (or list) whose elements can be ordered, then there is a clever implementation of a dictionary as a so-called spelling tree or trie (pronounced ``try''). For example, say that we have some objects, o1, o2, o3, o4, and o5, and say that the respective keys of these objects are the words be, bed, bee, been, and it. We can organize the objects so that each key defines a path from the root of the tree to the node that holds the object:

               null
           b/       \i
          null      null
          e|         |t
          o1        o5
        d/ e\  
        o2  o3 
             |n
            o4

For example, object o1, whose key is be, is found by traversing the path labelled b followed by e. Notice that object o2, whose key is bed, is located by following the path labelled b then e, then d.

For completeness' sake, some paths lead to nodes where there are no values; such nodes hold ``null''. For example, the key i leads to a node that holds null. Notice that the ``leaves'' in the drawing are nodes that do not hold (links to) more nodes.

The labels on the branches replace the usual labels (fields) named ``left'' and ``right,'' and a Node may have some nonnegative quantity of subtrees. For this reason, a spelling tree is an example of an ``n-ary'' tree, where the value of n is a nonnegative number. (Of course, a binary tree is a 2-ary tree.)

Note that

Insertions take time that is linear in the length of the key, and there is no limit to the quantity of keyed objects that can be inserted.
Assume that each object that might ever be stored in the tree has a unique key, that the symbols in the keys are chosen from an alphabet of size M, and that there are N such objects. The longest path in a spelling tree is as long as the longest key. If the keys are as short as possible (that is, nearly every sequence of symbols up to the maximum length is used as a key), then the longest key has length about log_M(N); otherwise it is longer.
In that best case, insertions and lookups in a spelling tree take time on the order of log_M(N), which is comparable to the time taken to do insertions and lookups in a balanced binary tree, where M = 2. (For some intuition, calculate the values of log_M(N) for these values of M and N:
M = 2 or 10; N = 32 or 100 or 1000 or 10000.)
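
The calculation suggested above can be done by hand, or with a small Java sketch like the following. It uses the change-of-base rule, log_M(N) = ln(N) / ln(M); the class name LogTable and the method name logBase are made up for this illustration.

```java
class LogTable {
    // Logarithm of n in base m, computed by change of base: ln(n) / ln(m).
    static double logBase(int m, int n) {
        return Math.log(n) / Math.log(m);
    }

    public static void main(String[] args) {
        int[] alphabetSizes = {2, 10};          // the values of M suggested in the text
        int[] objectCounts = {32, 100, 1000, 10000};  // the values of N suggested in the text
        for (int m : alphabetSizes) {
            for (int n : objectCounts) {
                System.out.printf("M = %d, N = %d: log_M(N) is about %.2f%n",
                                  m, n, logBase(m, n));
            }
        }
    }
}
```

For instance, log_2(32) = 5 and log_10(10000) = 4, so a larger alphabet gives shorter paths for the same number of objects.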

In practice, spelling trees are often preferred over binary trees to store keyed objects, because it is easy to work directly with the symbols within the keys. But we will see that the implementation becomes slightly more complicated.

Designing Spelling Trees

To make the previous drawing of a spelling tree come to life, we use a fixed alphabet for the keys (e.g., the characters 'a' through 'z'). Then, the inductive data type definition for a spelling tree might be written like this:

A SpellingTree object is

A Node object, which contains

a Value
a set of SpellingTree objects (which might be empty), where each spelling-tree object is labelled by a symbol of the alphabet.

A Value is either:

an object, called the ``value'', or
empty (also known as ``null'')

That is, a spelling tree is a node that holds/links to other spelling trees.

The above inductive definition is not the only way to define the data type of spelling trees.
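
As one possible reading of the inductive definition, here is a minimal Java sketch. It models the labelled set of subtrees with a java.util.TreeMap from symbols to subtrees, which is only one of several reasonable choices; the class name SpellingTreeSketch and the method size are invented for this example.

```java
import java.util.Map;
import java.util.TreeMap;

class SpellingTreeSketch {
    // The Value: an object, or null when the node is "empty".
    Object value;

    // The labelled set of subtrees: each symbol labels at most one subtree.
    // The map is empty when the node is a leaf.
    final Map<Character, SpellingTreeSketch> subtrees = new TreeMap<>();

    // Counts the nodes by following the inductive definition:
    // one for this node, plus the sizes of all its subtrees.
    int size() {
        int count = 1;
        for (SpellingTreeSketch child : subtrees.values()) {
            count += child.size();
        }
        return count;
    }
}
```

Because the definition is inductive, methods such as size can be written recursively: handle the node itself, then apply the same method to each subtree.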

Implementing Spelling Trees

The inductive definition gives us a strong hint about how to build a spelling tree: class Node would hold (the addresses of) a set of spelling trees, plus a ``value'' (which is an object or null).

There are many ways to model sets, but since the spelling trees in the set are indexed by letters of the alphabet, an array implementation works well. For the example tree above, the root Node object might look like this in heap storage:

   a1 : Node
   ---------------------------------
   |  value ==| null |
   |           ------
   |             'a'  'b'  'c'  ...   'i'  ...   'z' 
   |            ------------------------------------
   |  subtree: |null| a2 |null| ... | a3 | ... |null|
   |            ------------------------------------
   |

subtree is an array whose indexes are the letters of the alphabet used to form keys. (Java arrays are indexed by integers starting at 0, and a letter used directly as an index is treated as its numeric character code, so you must convert each letter to an index in the range 0 to 25 when you code the array.) Since the root has one subtree indexed by 'b' and one subtree indexed by 'i', the array holds nonnull addresses of the Node objects for these two subtrees.
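
A minimal sketch of such a Node class in Java might look like this. It assumes the alphabet is exactly the lowercase letters 'a' through 'z'; the names ALPHABET_SIZE and indexOf are chosen for this illustration only.

```java
class Node {
    // Assumption: keys are built only from the 26 letters 'a' through 'z'.
    static final int ALPHABET_SIZE = 26;

    // The object stored at this node, or null if there is none.
    Object value;

    // One slot per letter; Java initializes every slot to null.
    Node[] subtree = new Node[ALPHABET_SIZE];

    // Converts a letter to its array index: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
    static int indexOf(char letter) {
        if (letter < 'a' || letter > 'z') {
            throw new IllegalArgumentException("not in the alphabet: " + letter);
        }
        return letter - 'a';
    }
}
```

With this conversion, the root's subtree for 'b' lives at subtree[1] and its subtree for 'i' lives at subtree[8].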

Next, the node indexed by 'b' looks like this:

   a2 : Node
   ---------------------------------
   |  value ==| null |
   |           ------
   |             'a'  'b'  'c'  'd'   'e'  ...   'z' 
   |            ------------------------------------
   |  subtree: |null|null|null| null | a4 | ... |null|
   |            ------------------------------------
   |

and its subtree indexed by 'e' looks like this:

   a4 : Node
   ---------------------------------
   |  value ==| o1 |
   |           ------
   |             'a'  'b'  'c'  'd'  'e'  ...   'z' 
   |            ------------------------------------
   |  subtree: |null|null|null| a5 | a6 | ... |null|
   |            ------------------------------------
   |

The main advantage of using an array to label the subtrees is that key processing is fast (because array lookup is fast). The main disadvantage is that a huge alphabet requires huge arrays within each node---this is a major loss if the array holds mostly null addresses.
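
To illustrate why key processing is fast, here is a hedged sketch of insertion and lookup over the array representation: each symbol of the key costs one array access. The class SpellingTree, its methods insert and find, and the use of strings such as "o1" to stand in for the stored objects are all assumptions of this example, as is the restriction to the letters 'a' through 'z'.

```java
class SpellingTree {
    private static final int ALPHABET_SIZE = 26;  // assumes keys use only 'a'..'z'

    private static class Node {
        Object value;                              // null means no object is stored here
        Node[] subtree = new Node[ALPHABET_SIZE];  // all slots start out null
    }

    private final Node root = new Node();

    // Converts a letter to an array index: 'a' -> 0, ..., 'z' -> 25.
    private static int indexOf(char letter) {
        if (letter < 'a' || letter > 'z') {
            throw new IllegalArgumentException("not in the alphabet: " + letter);
        }
        return letter - 'a';
    }

    // Follows the path spelled by the key, creating missing nodes on the way,
    // and stores the value in the node at the end of the path.
    void insert(String key, Object value) {
        Node current = root;
        for (int i = 0; i < key.length(); i++) {
            int index = indexOf(key.charAt(i));
            if (current.subtree[index] == null) {
                current.subtree[index] = new Node();
            }
            current = current.subtree[index];
        }
        current.value = value;
    }

    // Follows the path spelled by the key; returns null if the path
    // runs out or if the final node holds no object.
    Object find(String key) {
        Node current = root;
        for (int i = 0; i < key.length(); i++) {
            current = current.subtree[indexOf(key.charAt(i))];
            if (current == null) {
                return null;
            }
        }
        return current.value;
    }

    public static void main(String[] args) {
        SpellingTree tree = new SpellingTree();
        String[] keys = {"be", "bed", "bee", "been", "it"};
        for (int i = 0; i < keys.length; i++) {
            tree.insert(keys[i], "o" + (i + 1));   // "o1" ... "o5" stand in for the objects
        }
        System.out.println(tree.find("bed"));  // prints o2
        System.out.println(tree.find("i"));    // prints null: the node exists but holds no object
        System.out.println(tree.find("bet"));  // prints null: there is no 't' branch below "be"
    }
}
```

Both methods do work proportional to the length of the key, regardless of how many objects the tree holds.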

When the alphabet is huge or not fixed, a linked list can be used to hold the subtrees for a given Node (but this can make key lookup much slower, because finding the subtree for a symbol means searching the list).
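
The linked-list alternative might be sketched as follows. Each Node keeps a list of (label, subtree) links, so any char can label a branch, but finding a branch means walking the list. The names ListSpellingTree, Branch, and child are hypothetical, chosen only for this example.

```java
class ListSpellingTree {
    // One link in a node's list of labelled subtrees.
    private static class Branch {
        final char label;
        final Node subtree;
        final Branch next;

        Branch(char label, Node subtree, Branch next) {
            this.label = label;
            this.subtree = subtree;
            this.next = next;
        }
    }

    private static class Node {
        Object value;     // null means no object is stored here
        Branch branches;  // head of the list; null when the node has no subtrees

        // Linear search of the list for the subtree with the given label.
        Node child(char label) {
            for (Branch b = branches; b != null; b = b.next) {
                if (b.label == label) {
                    return b.subtree;
                }
            }
            return null;
        }
    }

    private final Node root = new Node();

    void insert(String key, Object value) {
        Node current = root;
        for (int i = 0; i < key.length(); i++) {
            char label = key.charAt(i);
            Node next = current.child(label);
            if (next == null) {
                // No branch with this label yet: add one at the front of the list.
                next = new Node();
                current.branches = new Branch(label, next, current.branches);
            }
            current = next;
        }
        current.value = value;
    }

    Object find(String key) {
        Node current = root;
        for (int i = 0; i < key.length() && current != null; i++) {
            current = current.child(key.charAt(i));
        }
        return current == null ? null : current.value;
    }
}
```

A node now uses space only for the branches it actually has, at the price of a list search at every step of a lookup.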