CIS505 Assignment 4

10 points; due Friday, October 17

Question 1.

Here is an ML function that returns the int that is least in a nonempty list of ints:

===================================================

(* least : int list -> int
   returns the smallest int in NONEMPTY list, ns.
   E.g.,   least [3,1,2]  returns  1 
           least [4]  returns  4
           least [5, 3, 8, 3]  returns  3  *)
fun least [n] = n
|   least (n::ns) =  let val m = least ns 
                     in  if n < m then n else m  end
|   least nil = raise Empty  (* the "empty list" error.  This equation is optional *)

===================================================

For example, least [3,2,7,8] returns 2. This is because least [8] = 8 least [7,8] = 7 least [2,7,8] = 2
The recursive calls lead to least [3,2,7,8] = 2.

Now, make a file, Q1.sml, and place the above coding of least in it. Run it and test it with the above three test cases.

Next, code this function and add it to Q1.sml:

===================================================

(* remove : int * int list  -> int list
   builds an answer list that is exactly  ns  but with the first occurrence, if any, of  m  removed.
   For example,  remove(2, [3, 2 ,7])  returns  [3, 7]
                 remove(2, [3, 2, 7, 2])  returns  [3, 7, 2]
                 remove(2, [3])  returns  [3]
*)
remove(m, nil) = ...
remove(m, (n::ns)) = ...

===================================================

Test your function on at least the three test cases listed above.

Here's the last part: Selection sort repeatedly finds and moves the least element of a sequence to the front:

[3 2 5 1 4] =>  1 [3 2 5 4] =>  1 2 [3 5 4] =>  1 2 3 [4 5] => 1 2 3 4 [5] => 1 2 3 4 5

You can understand the process as (i) finding the least element in the unsorted sequence, (ii) moving it to the front, (iii) repeating steps (i) and (ii) with what's left. Use least and remove in your ML coding of selection sort, which you add to Q2.sml:

===================================================

(* ssort : int list ->  int list   uses the selection-sort strategy to build
   a list that contains the ints in the argument list, sorted in ascending order
      E.g.,  ssort [3,2,5,1,4]  returns  [1,2,3,4,5] 
             ssort [3]  returns  [3] 
             ssort [3,2,1,2]  returns  [1,2,2,3] 
*)
fun ssort nil = ...  (* empty-list case *)
|   ssort xs =  ...  (* non-empty list case *)

===================================================

Test your function on at least the three test cases listed above.

Save the output from all your tests in the file, Q1tests.txt.

How to do the testing

You should install and use SML of New Jersey. Place your functions in the file Q2.sml. To run the functions:

Start a command window; cd to the folder where Q1.sml lives; type sml.
After you see the ML prompt, type use "Q1.sml"; If there are errors, you can try again by retyping use "Q1.sml";. (Try using the up-arrow on your computer to save the retyping.)
When your functions are processed without errors, you call them, e.g., least [1,4,5];
If you test a function and it goes lost ("loops"), press Ctrl+c and then return.
When you want to exit SML but retain the command window, type Ctrl+z and then return.
Copy and paste your output from the test cases into a file, Q1tests.txt

Question 2

Here is a small exercise to get you working with trees. Make a file, Q2.sml, and place in it this definition of IntTree: datatype IntTree = Leaf | Node of int * IntTree * IntTree
Now, code this function, using any of the functions you have encountered so far on lists and trees:

===================================================

(* sumInts : IntTree -> int
     adds up all the ints contained in InTree t and returns the total.
   Examples:   sumInts (Node(3, Node(2, Leaf, Node(4, Leaf, Leaf)), Leaf)) = 9
               sumInts (Leaf) = 0
*)

===================================================

Test your function on the two test cases listed above. Copy and paste your output from the test cases into a file, Q2tests.txt

Question 3

Chapter 7 stated that data bases save their updates separately so that (i) multiple reader processes can finish their searches at the same time an update is being made; and (ii) updates can be rolled back (undone) easily until a commitment is made. This is called optimistic concurrency control and is desirable for huge databases that are shared by hundreds of users at a time, e.g., Amazon.com. Such a database is best implemented with a combination of a spelling tree ("trie", as Dr. Howell calls it) and hash tables, but we can study the basic idea with something simpler that I have coded for you in ML.

The database you will modify is a binary-search tree (ordered tree) holding (key,value) pairs, where a key is a single character and a value is an int. Here is an example database:

In SML, this database's type is defined

datatype DB = Leaf |  Node of (char * int * DB * DB)

and the usual binary search (lookup) is done like this: (* lookup: char * DB -> int searches for a key in a database (search tree) and returns the int value saved with the key. *) fun lookup(key, Leaf) = raise Empty (* empty-data-structure error *) | lookup(key, Node(k, v, left, right)) = if key = k then v else if key < k then lookup(key, left) else (* key >= k *) lookup(key, right)
The lookup is fast: a maximum of log₂N nodes will be searched, where N is the number of entries in the database (e.g., at most 11 nodes searched to find an item in a database of approx. 1000 entries; at most 21 nodes searched to find an item in a database of one million entries).

Say that a new key,value pair, "d", 2, must be added to the database. The standard, destructive update alters the database by removing a leaf and adding a node, like this:

This will crash those searches that are using the tree along the path as the update. Also, if the update must be rolled back, then the tree must be destructively altered again --- more trouble.

How do we alter the database without crashing the active searches on it? The classic, nondestructive update in ML-style goes like this:

(* insert(k, v, db)  adds (k,v) into the database,  db,  constructing
   a new database that shares as much of  db  as possible. *)
fun insert(k, v, Leaf) = Node(k, v, Leaf, Leaf)
|   insert(k, v, Node(kk, vv, left, right)) =
        if k = kk      then  Node(k, v, left, right)
        else if  k < kk then Node(kk, vv, insert(k, v, left), right)
        else  (* k > kk *)   Node(kk, vv, left, insert(k, v, right))

Here's a diagram of what the insertion of "d", 2 does with the above function:

A new "spine" is built, consisting of copies of all the nodes traversed to the insertion point. The other parts of the original database are shared. On the order of log₂N new nodes must be constructed, where N is the number of entries in the database. This is a slightly expensive price to pay for inserting one new entry, but at least the original tree is unharmed, and lookups on the original nodes still complete in order log₂N-time.

Database with `update` nodes

The ML system I have built for you adds just one new node to the database for an update:

datatype DB = Leaf | Node of (char * int * DB * DB) | Update of (char * int * DB * DB)

The lookup algorithm operates the same, because the Update node steers lookups for those keys not-equal-to "d" downwards into the original tree. Also, it is trivial to roll back the update --- we merely reuse the orginal entry point into the original database!

Alas, this simple strategy has a flaw: if there are multiple updates, a linear chain accumulates at the entry point:

and all lookups quickly slow down from order log-time (log₂N) to order linear time (N), which is unacceptable.

You will modify the naive insertion algorithm I have written so that it links each Update node to the appropriate subtrees, preserving more of the log-time lookups. Here is what the insertions look like using the improved algorithm:

When it is time to commit (make permanent) the updates, the database is traversed and rebuilt as a balanced search tree:

You must write the traversal algorithm that collects the contents of the updated tree so that the balanced tree can be constructed.

Operating the prototype database

I have defined the database and have written simple maintenance operations for building, searching, updating, and undoing it. I also wrote a simplistic driver that "loops", asking for a human's inputs. (If the database was a shared, cloud-storage entity, then we would replace the driver with controller code that enforces a communications protocol.)

The code you start from is at www.cis.ksu.edu/~schmidt/505f14/Exercises/Ex4/Q3.sml. Save it as Q3.sml.

Say we start with this list of entries to initialize the database: [(#"a",4), (#"b",9), (#"e",5), (#"f",5), (#"g",2)]. (Note: in SML, #"a" means the char, 'a', and "a" means the string, "a". Sorry about this.) We type the following in the command window:

sml
use "Q3.sml";
run [(#"a",4), (#"b",9), (#"e",5), (#"f",5), (#"g",2)];

The search tree constructed is Node(e,5, Node(b,9, Node(a,4, Leaf, Leaf), Leaf), Node(g,2, Node(f,5, Leaf, Leaf), Leaf))
Next, say the user requests these two updates: > u d 2 > u f 1
(u means "update".) The database changes into the picture shown above that has the two Update nodes chained to the entry to the tree. Now, a lookup of key d gives 2, a lookup of f gives 1, a lookup of b gives 9, etc.: > l d > l f > l b
When we print the contents of the database, > p
we see the following printout, which lists all nodes encountered on all the paths through the updated tree. (Lots of duplicates appear; think about why!) Update(f,1, IGNORE KEYS >= f IN LEFT SUBTREE: Update(d,2, IGNORE KEYS >= d IN LEFT SUBTREE: Node(e,5, Node(b,9, Node(a,4,Leaf,Leaf), Leaf), Node(g,2, Node(f,5,Leaf,Leaf), Leaf)), IGNORE KEYS <= d IN RIGHT SUBTREE: Node(e,5, Node(b,9, Node(a,4,Leaf,Leaf), Leaf), Node(g,2, Node(f,5,Leaf,Leaf), Leaf))), IGNORE KEYS <= f IN RIGHT SUBTREE: Update(d,2, IGNORE KEYS >= d IN LEFT SUBTREE: Node(e,5, Node(b,9, Node(a,4,Leaf,Leaf), Leaf), Node(g,2, Node(f,5,Leaf,Leaf), Leaf)), IGNORE KEYS <= d IN RIGHT SUBTREE: Node(e,5, Node(b,9, Node(a,4,Leaf,Leaf), Leaf), Node(g,2, Node(f,5,Leaf,Leaf), Leaf))))
The IGNORE messages remind you about which parts of the shared tree are unreachable by lookups.

If we "commit" now to all the updates, the database gets traversed, its active pairs are supposed to be collected into the sorted list, [(#"a",4), (#"b",9), (#"d",2), (#"e",5), (#"f",1), (#"g",2)] and the list is rebuilt into a balanced search tree, as in the picture shown above. But the coding I have given you to collect the pairs into the sorted list is incorrect.

Part 3A.

Replace the coding of function update by one that sets correctly the links of the Update node so that most lookups execute in log-time. (See the pictures above!) It is ok to write additional "helper functions" that help update do its job. Do the tests listed in Q3testcases.txt. Copy and paste your output into the file, Q3tests.txt.

Part 3B.

The coding I supplied of function collect, which traverses the database and collects the list of key,value pairs for rebuilding the database, is incorrect! Try it for this example, and you will see that the collected list has duplicates and the rebuilt database is wrong: sml use "Q3.sml"; run [(#"a",4), (#"b",9), (#"e",5), (#"f",5), (#"g",2)]; u d 2 c
This example also goes wrong: sml use "Q3.sml"; run [(#"a",4), (#"b",9), (#"e",5), (#"f",5), (#"g",2)]; u f 1 c
Your job is to recode collect so that it assembles a list of key,value pairs that matches exactly the values that are found by the lookup function. That is, there are no duplicates, overridden values are discarded, and the list is sorted in ascending order of keys. It is ok to write additional "helper functions" that help collect do its job.

The tests you should do are listed in Q3testcases.txt. Copy the results of the tests into Q3tests.txt.

Submission and grading

Place your answers, Q1.sml, Q1tests.txt, Q2.sml, Q2tests.txt, Q3.sml, and Q3tests.txt into a folder, YourFirstNameYourLastName4. Zip the folder into a .zip file and submit the .zip file to the CIS505 site on K-State Online.

Important: This is a solo assignment --- it must be done by you, individually. If you had a partner for Exercise 3, I know it will be tempting to work in a pair, but you gotta do your own work to get these ideas loaded in your brain! It is certainly OK to discuss the assignment with the instructor, TA, or your colleagues, but all the coding must be typed by you alone, and all the concepts in the coding must be stored in your head so that you can reproduce them on demand.