Chapter 7: Data Compression

Chapter 5, on checksums, showed how a line or word of data might be squeezed into an almost-unique number. But the original line cannot be recovered from the number.

Chapter 7 presents techniques for compressing data into a format that can be expanded back into the original artifact. These techniques are used to shrink image, audio, and massive program files to a manageable size for storage or transmission. iPods, iPads, cloud storage, etc., could not exist without data compression.

Discussion question

Discuss how the abbreviations used in texting can be understood as a form of text compression. Why do people use abbreviations in texting? (What are they saving? space? time? their thumbs?)

Lossless compression

The requirement for this chapter is to produce an algorithm that can compress a text file into a smaller form that can later be expanded to the original text.

To illustrate different ways of satisfying the requirement, we'll work with a running example. Here are the lyrics to the song, ``She Loves You,'' by the Beatles. (By the way, it will be helpful to remember the melody that goes with the lyrics. You can play the song on YouTube --- do a search for Beatles She Loves You to find it. Now, do this --- really.)

She loves you, yeah yeah yeah
She loves you, yeah yeah yeah
She loves you, yeah yeah yeah yeah

You think you've lost your love
Well, I saw her yesterday
It's you she's thinking of
And she told me what to say

She says she loves you
And you know that can't be bad
Yes, she loves you
And you know you should be glad

She said you hurt her so
She almost lost her mind
But now she says she knows
You're not the hurting kind

She says she loves you
And you know that can't be bad
Yes, she loves you
And you know you should be glad, ooh!

She loves you, yeah yeah yeah
She loves you, yeah yeah yeah
With a love like that
You know you should be glad

There's lots of repetition, and we count the words to see where it occurs:
a : 1
almost : 1
and : 5
bad : 2
be : 5
but : 1
can't : 2
glad : 3
her : 3
hurt : 1
hurting : 1
i : 1
it's : 1
kind : 1
know : 5
knows : 1
like : 1
lost : 2
love : 2
loves : 9
me : 1
mind : 1
not : 1
now : 1
of : 1
ooh : 1
said : 1
saw : 1
say : 1
says : 3
she : 16
she's : 1
should : 3
so : 1
that : 3
the : 1
think : 1
thinking : 1
to : 1
told : 1
well : 1
what : 1
with : 1
yeah : 16
yes : 2
yesterday : 1
you : 20
you're : 1
you've : 1
your : 1

(To keep this simple, we didn't worry about the case of letters.)

Here's an easy compression technique: (i) make an abbreviation table of the words that appear at least 3 times, and (ii) insert the abbreviations into the lyrics:

ABBREVIATION TABLE:
num word     num word     num word
---------------------------------------------
 1  and       5  know     9   should
 2  be        6  loves    10  that
 3  glad      7  says     11  yeah
 4  her       8  she      12  you


COMPRESSED LYRICS:

8 6 12, 11 11 11
8 6 12, 11 11 11
8 6 12, 11 11 11 11

12 think you've lost your love
Well, I saw 4 yesterday
It's 12 she's thinking of
1 8 told me what to say

8 7 8 6 12
1 12 5 10 can't 2 bad
Yes, 8 6 12
1 12 5 12 9 2 3

8 said 12 hurt 4 so
8 almost lost 4 mind
But now 8 7 8 knows
You're not the hurting kind

8 7 8 6 12
1 12 5 10 can't 2 bad
Yes, 8 6 12
1 12 5 12 9 2 3, ooh!

8 6 12, 11 11 11
8 6 12, 11 11 11
With a love like 10
12 5 12 9 2 3

The abbreviations are shorter than the words they replace, so the table plus the lyrics take less space than before. But we should do more: there are words with common prefixes (think, thinking; you, your, you're, you've; yes, yesterday) that should be compressed together, and more importantly, there are phrase patterns (8 6 12, 11 11 11, and 12 5 12 9 2 3) that appear multiple times and should be compressed. This leads to an important compression algorithm.
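
Here is a minimal Python sketch of steps (i) and (ii), plus the matching expander. One simplification of mine: punctuation stays attached to its word, so the counts come out slightly different from the hand-made table above.

from collections import Counter

def compress(text, threshold=3):
    # step (i): make an abbreviation table of the words that
    # appear at least `threshold` times, numbered alphabetically
    words = text.lower().split()
    counts = Counter(words)
    frequent = sorted(w for w in counts if counts[w] >= threshold)
    table = {w: str(i + 1) for i, w in enumerate(frequent)}
    # step (ii): insert the abbreviations into the lyrics
    return table, " ".join(table.get(w, w) for w in words)

def expand(table, compressed):
    # invert the table to recover the (lower-cased) original
    back = {num: w for w, num in table.items()}
    return " ".join(back.get(w, w) for w in compressed.split())

table, short = compress("she loves you yeah yeah yeah " * 3)
assert expand(table, short) == ("she loves you yeah yeah yeah " * 3).strip()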

Exercise

The above compression technique doesn't compress repeated phrases. Let's try to fix that: A repeated phrase must consist of a sequence of repeated words. So, we should be able to spot the repeated phrases in the song by looking for number sequences that repeat (like 11 11 11). Write down in detail how to do this and what actions should be taken to compress the repeated phrases.

Same as Earlier: LZ77

The compression algorithm inside the zip utility was invented in 1977 by Lempel and Ziv; it is called LZ77; MacCormick calls it the ``Same as Earlier Technique'' (pages 108-109).

Rather than use an abbreviation table, LZ77 cleverly leaves the first occurrence of a repeated word in its original position in the text, and when the repetition occurs, the repeated word is replaced by ``go backwards XX characters and copy the YY letters at that point.'' For example,

luv_luvs_yeah_yeah_yeah is compressed to luv_(4,3)s_yeah(5,10)
because (4,3) means ``go back 4 letters and copy 3 letters'' and (5,10) means ``go back 5 letters and copy 10 letters.'' The (5,10) is a nifty trick because it is copying letters that have just been copied:
EXPANDED PART             WHAT'S LEFT TO EXPAND

                          luv_(4,3)s_yeah(5,10)
luv_                      (4,3)s_yeah(5,10)

*luv_                     (*,3)s_yeah(5,10)   // * marks the copy position
l*uv_l                    (*,2)s_yeah(5,10)
lu*v_lu                   (*,1)s_yeah(5,10)
luv*_luv                  (*,0)s_yeah(5,10)   // done

luv_luv                   s_yeah(5,10)
luv_luvs_yeah             (5,10)
luv_luvs*_yeah            (*,10)
luv_luvs_*yeah_           (*,9)
luv_luvs_y*eah_y          (*,8)
  ...
luv_luvs_yeah*_yeah       (*,5)    // now, we copy what we just copied
  ...
luv_luvs_yeah_yeah_yeah*  (*,0)    //  done

luv_luvs_yeah_yeah_yeah
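
In Python, the expansion algorithm might look like this minimal sketch, where the compressed text is a list mixing literal characters with (back, length) pairs. Copying one character at a time is exactly what makes the (5,10) trick work: a copy can consume characters it has just produced.

def expand(tokens):
    # tokens is a list of literal characters and (back, length) pairs
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            back, length = token
            for _ in range(length):
                # copy one character at a time, so the copy may
                # reuse characters appended moments ago
                out.append(out[-back])
        else:
            out.append(token)
    return "".join(out)

tokens = list("luv_") + [(4, 3)] + list("s_yeah") + [(5, 10)]
print(expand(tokens))   # prints: luv_luvs_yeah_yeah_yeah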

Here's the first half of the LZ77-compressed Beatles song. Spaces are represented by underlines, and the end-of-lines are represented by semicolons. (I apologize if I typed some of the numbers wrong!)
She_loves_you,_yeah(5,10);               // repeat "_yeah" twice
(30,59)(5,5);                            // repeat previous line twice plus "_yeah"

You_think(35,4)'ve_lost(12,4)r(53,5);    // "_you"+"'ve"..."_you"+"r"..."_love"
Well,_I_saw_her_yesterday;
It's(41,4)_she's(55,6)ing_of;
And(23,4)_told_me_what_to_s(57,3)

She(8,4)s(33,4)(100,5)s(70,4);
(51,3)(8,4)_know_that_can't_be_bad;
Yes,(76,28)(8,4)should(49,4)glad,_ooh!;  // "_she_loves_you;
                                         //  And_you_know_" + "you_"..."_be_"

LZ77 is a great compression technique since it doesn't require an abbreviation table and readily finds repeated words, prefixes, and phrases. Its only limitation is that, for long files, it cannot search all the way back to the beginning of the file to find repetitions; it must keep a buffer of, say, the last 500 lines read when searching for possible repetitions.
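
Here is a minimal greedy sketch of the compression side. Two assumptions of mine: the search buffer is measured in characters (the window parameter), and a (back, length) pair is only emitted when the match is at least 3 characters long, since shorter matches don't pay for themselves. Production compressors like zip's are far more engineered, but the idea is the same.

def compress(text, window=500):
    # greedy LZ77: at each position, find the longest match starting
    # inside the search window; a match may run past the current
    # position (that is the overlapping-copy trick)
    tokens, i = [], 0
    while i < len(text):
        best_back, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_back, best_len = i - j, length
        if best_len >= 3:
            tokens.append((best_back, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(compress("luv_luvs_yeah_yeah_yeah"))
# ['l', 'u', 'v', '_', (4, 3), 's', '_', 'y', 'e', 'a', 'h', (5, 10)]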

Exercises

  1. Draw figures like the one above that show how each of these two strings expand:
    lo_(3,2)ve(5,4)(9,5) (should expand to lo_love_lov_love)
    a(1,3)b(2,2) (should expand to aaaabab)
  2. It is important to understand how LZ77 compresses a string. Finish the figure below, which shows how to compress the string, luv_luvs_yeah_yeah_yeah, into luv_(4,3)s_yeah(5,10).
    INPUT STRING                  COMPRESSED VERSION
    -------------------------     --------------------
    *luv_luvs_yeah_yeah_yeah      (nothing) // the * marks how much has been compressed
    
    l*uv_luvs_yeah_yeah_yeah       l
    lu*v_luvs_yeah_yeah_yeah       lu
     ...
    luv_*luvs_yeah_yeah_yeah       luv_
                                         // l  matches an earlier letter in the input
    l!uv_l*uvs_yeah_yeah_yeah      luv_(4,1)  // the ! marks how far we've matched
    lu!v_lu*vs_yeah_yeah_yeah      luv_(4,2)
    luv!_luv*s_yeah_yeah_yeah      luv_(4,3)  // the matching ends here
    luv_luvs*_yeah_yeah_yeah       luv_(4,3)s
    
                                              (finish me)              
    
  3. The algorithm for compression should find the best (longest) match for each repeated string. What should the compressed string be for a_ab_abc? for ab_a_abc? for lo_love_lov_love?

Shorter Symbols: Huffman coding

Each symbol in a text file is represented as an ASCII symbol, which is stored in an 8-bit byte. This representation is not the most space-efficient --- it makes more sense for frequently used letters (like e, t, a, o, i, ...) to have shorter representations and rarely used letters (x, q, z, ...) to have longer ones. This is the principle behind Huffman coding, which MacCormick explains in a simplified way (``The Shorter-Symbol Trick,'' pages 109-113).

Huffman's algorithm for compressing a text file goes like this:

INPUT: a text file
OUTPUT: the compressed text file, in binary
ALGORITHM:
1. count the occurrences of all symbols in the file
2. build the Huffman Tree by repeatedly joining the symbols 
     with the lowest counts (see below)
3. label the branches in the Huffman Tree with 1s and 0s (see below)
4. extract the Huffman Codes for all symbols by examining all
     labelled paths in the Huffman Tree (see below)
5. translate the text file into the Huffman Codes  (see below)
6. return the Huffman Codes and the translated file
We explain the algorithm through an example: compressing the first line of the Beatles song. Look at the figure and read the explanation just underneath:
INPUT:  she_loves_you_yeah_yeah_yeah_

STEP 1: Count symbols
_: 6      a: 3    l: 1
e: 5      s: 2    v: 1
h: 4      o: 2    u: 1
y: 4


STEP 2: Build Huffman Tree
                               _:6---+
               a:3---+               | 
u:1---+              |               _auvl:12-------+
      uv:2--+        auvl:6----------+              |
v:1---+     |        |                              |
            uvl:3----+                              |
      l:1---+                                       |
                                                    _auvlosyeh:29
           o:2---+                                  |
                 os:4-----+                         |
           s:2---+        osy:8-----------+         |
                     y:4--+               |         |
                                          osyeh:17--+
                            e:5---+       |
                                  eh:9----+
                            h:4---+

Step 1, counting symbols, is easy enough. For Step 2 in the picture, the Huffman tree is constructed from left to right: We start with the two symbols with the lowest counts (here, u:1 and v:1), and connect them together, adding their counts. This gives us uv:2.

We repeat: The second time, l:1 and uv:2 have the lowest counts, so we combine them, giving uvl:3. The third time, the two lowest counts are o:2 and s:2, so we combine them into os:4. This repeats (a:3 and uvl:3, etc. etc.!) until all the symbols are collected together, always choosing the two symbols with the lowest counts to merge.

Once the branches of the tree are labelled with 0s and 1s (Step 3), we have the Huffman codes for the symbols (Step 4), where the most-used symbols have the shortest bit codes:

STEP 3: Label the tree                 STEP 4: extract codes:

                   0                   _ : 00
          0     _---+        
  0     a---+       |  0               a : 010
u---+ 0     |    1  *---+              u : 01100
  1 *---+   *-------+   |
v---+   | 1 |           |              v : 01101
      1 *---+           |
    l---+               |              l : 0111
                        |
      0                 *<--
    o---+   0           |              o : 1000
      1 *------+  0     |
    s---+    1 *----+   |              s : 1001
           y---+    | 1 |              y : 101
             0      *---+
           e---+  1 |                  e : 110
             1 *----+
           h---+                       h : 111

In Step 3, we no longer need the counts and the symbol groups, only the symbols at the ends (``leaves'') of the tree. For Step 4, we follow the path from the tree's entry point (``root'') to each symbol to find its Huffman Code.
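
Here is a minimal Python sketch of Steps 1 through 4, using a priority queue to repeatedly pull out the two lowest counts. One caution: when counts tie, different merge choices give different (but equally compact) trees, so the codes printed may not match the figure bit for bit.

import heapq
from collections import Counter

def huffman_codes(text):
    counts = Counter(text)                      # Step 1: count symbols
    # a tree is either a symbol (a leaf) or a (left, right) pair;
    # the integer tie-breaker keeps the heap from comparing trees
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:                        # Step 2: repeatedly join
        n1, _, t1 = heapq.heappop(heap)         # the two lowest counts
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tie, (t1, t2)))
        tie += 1
    codes = {}
    def walk(tree, path):                       # Steps 3 and 4: label the
        if isinstance(tree, tuple):             # branches and read off the
            walk(tree[0], path + "0")           # path to each leaf
            walk(tree[1], path + "1")
        else:
            codes[tree] = path or "0"
    walk(heap[0][2], "")
    return codes

for sym, code in sorted(huffman_codes("she_loves_you_yeah_yeah_yeah_").items(),
                        key=lambda kv: len(kv[1])):
    print(sym, ":", code)
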
STEP 5: Translate the message

10011111100001111000011011101001001011000011000010111001011100...
s   h  e  _ l   o   v    e  s   _ y  o   u    _ y  e  a  h  _ ...

The miracle of this message is that, even though the symbols are represented by bit codes of differing lengths, you can decode it precisely by reading bits until you match a code in the Huffman-code list!
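
That decoding loop is short enough to sketch directly: read bits into a buffer until the buffer matches a code, emit that symbol, and start over. It works precisely because no Huffman code is a prefix of another.

def decode(bits, codes):
    # codes maps each symbol to its bit string, as in Step 4
    back = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit            # read bits until they match
        if buf in back:       # a code in the Huffman-code list
            out.append(back[buf])
            buf = ""
    return "".join(out)

codes = {"_": "00", "a": "010", "u": "01100", "v": "01101", "l": "0111",
         "o": "1000", "s": "1001", "y": "101", "e": "110", "h": "111"}
print(decode("1001111110", codes))   # prints: she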

The one-line message is compressed to 90 bits, which is less than half the size of the original ASCII message (29 symbols at 8 bits each, or 232 bits). The Huffman codes for the letters must be included with the compressed message, and these are saved compactly in tree form. See the details in https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html

Huffman proved (using mathematics) that the Huffman tree produces the most compact representation possible when each symbol in the input file is assigned its own fixed bit code --- within that scheme, you cannot do any better.

The algorithm in the zip utility uses LZ77 to compress repetitions and then encodes the individual symbols as Huffman codes.
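
You can watch the combination at work from Python's standard library: the zlib module implements zip-style DEFLATE compression, which is exactly LZ77 matching followed by Huffman coding.

import zlib

song = b"She loves you, yeah yeah yeah\n" * 3
packed = zlib.compress(song)           # DEFLATE = LZ77 + Huffman codes
print(len(song), "->", len(packed))    # 90 -> roughly 30 bytes
assert zlib.decompress(packed) == song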

Exercises

  1. Finish the translation in ``Step 5'' above for the line of the song. Then, use the code list in ``Step 4'' to convert the sequence of 1s and 0s back to their symbols. Then, use the tree in ``Step 3'' to translate the sequence of 1s and 0s back to their symbols. Using the tree is simpler, yes? Why is that?
  2. Here is a step-by-step presentation of how to build a Huffman tree. For input string, hip_hop:
    SYMBOL COUNTS              HUFFMAN TREE SO FAR
    (i)
    h:2    p:2                 (nothing)
    i:1    _:1
    o:1
    
    (ii)
    h:2    p:2                 _:1--+
    i:1    _o:2                     _o:2
                               o:1--+
    
    (iii)
    h:2                        _:1--+
    p:2                             _o:2--+
    _oi:3                      o:1--+     |
                                          _oi:3 
                                    i:1---+
    (iv)
    (finish me)
    
    At each step, the symbols with the lowest counts are paired together and added to the tree. Finish the tree. Then, label the branches with 0s and 1s. If you did this correctly, then h and p have Huffman codes that are two bits in length, and o has a code of length three. Translate hip_hop into 1s and 0s.
  3. Perhaps you noticed that there can be more than one Huffman tree for a set of symbols and their counts. Rework the previous exercise, where step (i) pairs i and o, and step (ii) pairs _ and p. Finish the tree, label its branches with 1s and 0s, and write out the codes. Compare them to the ones you obtained in the previous exercise. In what (mathematical) sense are they the "same"? (Hint: compress the input string and check the length.)

Lossy compression

Music and image files tend to be huge, and they often need compression. An LZ77 or Huffman-style technique will not give enough compression, so we are forced to discard non-redundant information --- information that can be ``guessed'' later, when it is time to expand the file and play/show it. This is the basis of the mp3 and jpg compression algorithms.

Here is an example of ``lossy text compression,'' applied to words. The algorithm is: remove any vowel surrounded by two consonants (since we can guess what it is later). We also use a few standard texting abbreviations and ignore punctuation. Here's the start of the Beatles song with this form of lossy compression:

(She lvs U yeah yeah yeah)*3  yeah

U thnk Uve lst yur lve
Wll I sw hr ystrdy
Its U shes thnkng of
& she tld me wht 2 sy

Now thnk might expand to thank or think, and lst to any of last, list, lost, lust. If the expansion algorithm guesses wrong, the lyrics will change, but only a little --- it is surprising how much of the song's content is preserved. The same principle holds when compressed music or images are expanded --- there will certainly be inaccuracies in the expanded versions, but it takes close scrutiny to detect them.
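
A minimal sketch of the vowel-dropping rule (the lyrics above also use texting shortenings like U, &, and yur, which this rule alone does not produce):

VOWELS = set("aeiou")

def squeeze(word):
    # drop any vowel whose neighbors on both sides are consonants;
    # neighbors are checked in the original spelling of the word
    out = []
    for i, ch in enumerate(word):
        if (ch in VOWELS and 0 < i < len(word) - 1
                and word[i - 1] not in VOWELS
                and word[i + 1] not in VOWELS):
            continue               # this vowel can be guessed later
        out.append(ch)
    return "".join(out)

print(" ".join(squeeze(w) for w in "well i saw her yesterday".split()))
# prints: wll i sw hr ystrdy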

Discussion (?) question

Expand the above compressed song by singing it, inserting a short e vowel (that is, pronounced "eh" as in "alphabet") whenever needed to keep the song rolling along. What you've done is similar to the expansion done by an mp3 decoder.

JPG

On pages 115-120, MacCormick explains how image files can be compressed by literally discarding every other row and column of image bits. There is some truth to what he says, but the jpg image-compression algorithm is far more complex than what he describes.

I'll give you a few principles here, then you are on your own to read more. First off, a single pixel (dot) in an image is encoded as three numbers, (R,G,B), which list the red, green, and blue content of the dot. The intensity of each falls in the range 0..255. Black is (0,0,0), white is (255,255,255), pure red is (255,0,0), and so on. There is a fun table of colors at http://www.fizzylogic.com/users/bulbul/ColorSpecifier.html

An image is a huge rectangle of pixels ((R,G,B) numbers), laid out in a grid. Such an image is often too huge to store, and the jpg-compression algorithm reformats the grid roughly as follows (an outline of the standard steps): (i) each (R,G,B) pixel is converted into one brightness number and two color numbers, because the human eye is more sensitive to brightness than to color; (ii) the brightness numbers are kept for every pixel, but each small block (say, 2-by-2) of color numbers is replaced by its average, discarding three quarters of the color data; (iii) the remaining numbers are transformed and rounded (this is where the jpg ``quality'' setting acts), then packed with Huffman-style codes.

Now, there is definitely loss of information when the (R,G,B) values are converted and also when the averages are computed, but the algorithm that expands the compressed file and fills in the complete grid of pixels is good at "guessing" missing information and deceiving the human eye, just like most people are good at guessing missing vowels in a word.
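
Here is a sketch of the averaging step alone, assuming a single color channel stored as a list of rows with even dimensions. The expander ``guesses'' by spreading each average back over its 2-by-2 block, which is close enough to fool the eye where colors change slowly.

def downsample(channel):
    # replace each 2x2 block with its average, keeping 1/4 of the numbers
    rows, cols = len(channel), len(channel[0])
    return [[(channel[r][c] + channel[r][c + 1] +
              channel[r + 1][c] + channel[r + 1][c + 1]) // 4
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]

def expand(small):
    # guess the missing pixels by repeating each average 4 times
    return [[v for v in row for _ in (0, 1)]
            for row in small for _ in (0, 1)]

block = [[100, 102, 200, 198],
         [101,  99, 202, 204]]
print(downsample(block))           # prints: [[100, 201]]
print(expand(downsample(block)))   # two rows of [100, 100, 201, 201]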

This explanation is developed better (with pictures!) at http://www.ams.org/samplings/feature-column/fcarc-image-compression