CIS300 Spring 2005: Assignment 7

10 points; due Friday, April 22, at 10:00pm

Spelling trees and hash tables are sometimes called dictionaries, because they store objects whose keys are ``words'' (sequences of symbols).

A standard use of a ``dictionary'' is counting occurrences of words in a file. For practice in building hash-tables, you must use a hash table to implement a word-frequency counting program. The program reads a file of words and counts the occurrences of each of the words. Then, the program prints all the words and their frequency of occurrences. For example, a text file that contains these lines:

Bed be been bee it. 
It be bed, been it?
It be!

would generate this output, in alphabetical order:

Frequency counts:
be: 3
bed: 2
bee: 1
been: 2
it: 4

Upper-case letters are converted to lower case; punctuation is ignored. Assume that the input file consists of letter-only words and the punctuation symbols, , . ; : - ? !

Behavior

The program is started as java WordCounter.Start
(That is, you must build a package WordCounter, which contains class Start, which contains the main method that starts the program.

Once started, the program shows a dialog that requests the name of a sequential text file, FILENAME. The program reads FILENAME, counts the words, and prints the results in the command window and into an output sequential file named FILENAME.out. The words and their counts appear in alphabetical order.

Implementation

In your application, you will model the dictionary that holds the words and their counts by means of a hash table. If you choose to handle collisions by means of linear spillover, then the class you write will look like this:

package HashTable
public class HashTable
{ private Word[] table;   // array that holds words and their counts
  private int TABLE_SIZE;  // the length of the array

  ... // methods that add/increment a word into the table
      // and print the table's contents

  // The hash function should be placed here, also:
  public int hash(String key)
  { ... } // calculates  key's  hash code, an int in the range, 0..TABLE_SIZE-1
}

If you use ``buckets'' (linked lists) to handle collisions, the class looks like this:

package HashTable
public class HashTable
{ private Cell[] table;   // array that holds lists of words and their counts
  private int TABLE_SIZE;  // the length of the array

   ...
}

where each Cell object will hold a Word object as its value part.

The objects that hold words and their counts will probably look like this:

package HashTable;
public class Word
{ private String the_word; // the word acts as its own ``key''
  private int count;  // how many times the word has appeared

  ... // methods that increment count and return values of the_word and count
}

You should construct the the hash table as an array of prime-numbered size (e.g., 101 elements). Also, use polynomial coding for computing the hash function and use an appropriate base, e.g., 31, 37, or 41.

There is only one annoyance to writing this program: When the program prints the final listing of words and their counts, it must do so in alphabetical order. You should not assume that the words are inserted into the hash table in alphabetical order (indeed, they will not be inserted alphabetically), so you must some standard sorting technique to sort the words for printing.

Here are some Java-related notes that might be helpful:

Sequential-file manipulation is covered in Chapter 11 of the CIS200 notes, www.cis.ksu.edu/~schmidt/CIS200.
Chapter 11 of the CIS200 notes shows how to use a StringTokenizer object to disassemble a line of text into its constituent words. For example,
```
StringTokenizer t = new StringTokenizer(text_line, ",.;:-?! ");
```
will be useful for extracting the words within the string, text_line.
Before you add a word to the hash table, convert it into all lower-case characters first:
```
String lower_case_word = input_word.toLowerCase();
```
Java lets you do integer arithmetic on characters. For example,
```
char c = ...;
total = total + (c * (int)(Math.pow(base, x)));
```
adds to integer variable, total, the underlying integer value of c multiplied by base^x.
The substring and charAt methods extract subparts and single characters from a string, respectively. Remember to use equals for checking two strings for equality.

Submission

Create a directory, Assign7, that holds your packages. Run javadoc on the packages you write. Create the .jar file, Assign7.jar, and submit to CIS300 submissions page.