Chapter 7 presents techniques for compressing data into a format that can be expanded into the original artifact. These techniques are used to make image, audio, and massive-program files into a manageable size for storage or transmission. iPods, iPads, cloud storage, etc., could not exist without data compression.
To illustrate different ways of satisfying the requirement, we'll work with a running example. Here are the lyrics to the song, ``She Loves You,'' by the Beatles. (By the way, it will be helpful to remember the melody that goes with the lyrics. You can play the song on YouTube --- do a search for Beatles She Loves You to find it. Now, do this --- really.)
She loves you, yeah yeah yeah
She loves you, yeah yeah yeah
She loves you, yeah yeah yeah yeah
You think you've lost your love
Well, I saw her yesterday
It's you she's thinking of
And she told me what to say
She says she loves you
And you know that can't be bad
Yes, she loves you
And you know you should be glad
She said you hurt her so
She almost lost her mind
But now she says she knows
You're not the hurting kind
She says she loves you
And you know that can't be bad
Yes, she loves you
And you know you should be glad, ooh!
She loves you, yeah yeah yeah
She loves you, yeah yeah yeah
With a love like that
You know you should be glad

There's lots of repetition, and we count the words to see where:
a : 1        almost : 1    and : 5       bad : 2       be : 5
but : 1      can't : 2     glad : 3      her : 3       hurt : 1
hurting : 1  i : 1         it's : 1      kind : 1      know : 5
knows : 1    like : 1      lost : 2      love : 2      loves : 9
me : 1       mind : 1      not : 1       now : 1       of : 1
ooh : 1      said : 1      saw : 1       say : 1       says : 3
she : 16     she's : 1     should : 3    so : 1        that : 3
the : 1      think : 1     thinking : 1  to : 1        told : 1
well : 1     what : 1      with : 1      yeah : 16     yes : 2
yesterday : 1  you : 20    you're : 1    you've : 1    your : 1
Here's an easy compression technique: (i) make an abbreviation table of the words that appear at least 3 times, and (ii) insert the abbreviations into the lyrics:
ABBREVIATION TABLE:

num word    num word    num word
---------------------------------
 1  and      5  know     9  should
 2  be       6  loves   10  that
 3  glad     7  says    11  yeah
 4  her      8  she     12  you

COMPRESSED LYRICS:

8 6 12, 11 11 11
8 6 12, 11 11 11
8 6 12, 11 11 11 11
12 think you've lost your love
Well, I saw 4 yesterday
It's 12 she's thinking of
1 8 told me what to say
8 7 8 6 12
1 12 5 10 can't 2 bad
Yes, 8 6 12
1 12 5 12 9 2 3
8 said 12 hurt 4 so
8 almost lost 4 mind
But now 8 7 8 knows
You're not the hurting kind
8 7 8 6 12
1 12 5 10 can't 2 bad
Yes, 8 6 12
1 12 5 12 9 2 3, ooh!
8 6 12, 11 11 11
8 6 12, 11 11 11
With a love like 10
12 5 12 9 2 3

The abbreviations are shorter than the words they replace, so the table plus the lyrics take less space than before. But we should do more: there are words with common prefixes (think, thinking, you, your, you're, you've, yes, yesterday) that should be compressed together, and more importantly, there are phrase patterns (8 6 12, 11 11 11, 12 5 12 9 2 3) that appear multiple times and should be compressed. This leads to an important compression algorithm.
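The two steps above can be sketched in Python. This is only a sketch: the function name `abbreviate` is mine, and it ignores punctuation and case rather than handling them as the hand-built table does.

```python
from collections import Counter

def abbreviate(text, min_count=3):
    """Sketch of the two-step scheme: (i) number every word that appears
    at least min_count times, (ii) substitute the numbers into the text.
    Punctuation and capitalization are ignored in this simplified version."""
    words = text.lower().split()
    counts = Counter(words)
    frequent = sorted(w for w, c in counts.items() if c >= min_count)
    table = {w: str(i + 1) for i, w in enumerate(frequent)}
    compressed = " ".join(table.get(w, w) for w in words)
    return table, compressed

table, out = abbreviate("she loves you yeah yeah yeah "
                        "she loves you yeah yeah yeah "
                        "she loves you")
# every word here appears at least 3 times, so all of them get numbers
```

To expand, you simply run the substitution in reverse, looking each number up in the table.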
Rather than use an abbreviation table, LZ77 cleverly leaves the first occurrence of a repeated word in its original position in the text, and when the repetition occurs, the repeated word is replaced by ``go backwards XX characters and copy the YY letters at that point.'' For example,
EXPANDED PART             WHAT'S LEFT TO EXPAND
                          luv_(4,3)s_yeah(5,10)
luv_                      (4,3)s_yeah(5,10)
*luv_                     (*,3)s_yeah(5,10)    // * marks the copy position
l*uv_l                    (*,2)s_yeah(5,10)
lu*v_lu                   (*,1)s_yeah(5,10)
luv*_luv                  (*,0)s_yeah(5,10)    // done
luv_luv                   s_yeah(5,10)
luv_luvs_yeah             (5,10)
luv_luvs*_yeah            (*,10)
luv_luvs_*yeah_           (*,9)
luv_luvs_y*eah_y          (*,8)
...
luv_luvs_yeah*_yeah       (*,5)    // now, we copy what we just copied
...
luv_luvs_yeah_yeah_yeah*  (*,0)    // done
luv_luvs_yeah_yeah_yeah

Here's the first half of the LZ77-compressed Beatles song. Spaces are represented by underlines, and the end-of-lines are represented by semicolons. (I apologize if I typed some of the numbers wrong!)
She_loves_you,_yeah(5,10);              // repeat "_yeah" twice
(30,59)(5,5);                           // repeat previous line twice plus "_yeah"
You_think(35,4)'ve_lost(12,4)r(53,5);   // "_you"+"'ve"..."_you"+"r"..."_love"
Well,_I_saw_her_yesterday;
It's(41,4)_she's(55,6)ing_of;
And(23,4)_told_me_what_to_s(57,3);
She(8,4)s(33,4)(100,5)s(70,4);
(51,3)(8,4)_know_that_can't_be_bad;
Yes,(76,28)(8,4)should(49,4)glad,_ooh!; // "_she_loves_you;_And_you_know_" + "you_"..."_be_"

LZ77 is a great compression technique since it doesn't require an abbreviation table and readily finds repeated words, prefixes, and phrases. Its only limitation is that, for long files, it cannot search all the way back to the beginning of the file to find repetitions; it must keep a buffer of, say, the last 500 lines read when searching for possible repetitions.
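The expansion rule is simple enough to sketch in Python. In this sketch (the function name and token layout are my own simplifications), a token is either a literal string or a (back, length) pair; copying one character at a time is what lets a copy overlap the text it is still producing, which is exactly how (5,10) turns one "_yeah" into three:

```python
def expand(tokens):
    """Expand a toy LZ77 token stream.  A tuple (back, length) means
    "go backwards `back` characters and copy `length` characters";
    anything else is literal text.  Copying one character at a time
    lets a copy overlap the text it is still producing."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            back, length = tok
            for _ in range(length):
                out.append(out[-back])   # grab the character `back` places behind
        else:
            out.extend(tok)
    return "".join(out)

# the running example: luv_(4,3)s_yeah(5,10)
expand(["luv_", (4, 3), "s_yeah", (5, 10)])   # -> "luv_luvs_yeah_yeah_yeah"
```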
INPUT STRING               COMPRESSED VERSION
-------------------------  --------------------
*luv_luvs_yeah_yeah_yeah   (nothing)     // the * marks how much has been compressed
l*uv_luvs_yeah_yeah_yeah   l
lu*v_luvs_yeah_yeah_yeah   lu
...
luv_*luvs_yeah_yeah_yeah   luv_          // l matches an earlier letter in the input
l!uv_l*uvs_yeah_yeah_yeah  luv_(4,1)     // the ! marks how far we've matched
lu!v_lu*vs_yeah_yeah_yeah  luv_(4,2)
luv!_luv*s_yeah_yeah_yeah  luv_(4,3)     // the matching ends here
luv_luvs*_yeah_yeah_yeah   luv_(4,3)s
(finish me)
Huffman's algorithm for compressing a text file goes like this:
INPUT: a text file
OUTPUT: the compressed text file, in binary
ALGORITHM:
1. count the occurrences of all symbols in the file
2. build the Huffman Tree by repeatedly joining the symbols with the lowest counts (see below)
3. label the branches in the Huffman Tree with 1s and 0s (see below)
4. extract the Huffman Codes for all symbols by examining all labelled paths in the Huffman Tree (see below)
5. translate the text file into the Huffman Codes (see below)
6. return the Huffman Codes and the translated file

We explain the algorithm through an example, by compressing the first line from the Beatles's song. Look at the figure and read the explanation just underneath:
INPUT: she_loves_you_yeah_yeah_yeah_

STEP 1: Count symbols
  _: 6   a: 3   l: 1   e: 5   s: 2
  v: 1   h: 4   o: 2   u: 1   y: 4

STEP 2: Build Huffman Tree (the merges, in order):
  u:1   + v:1       ->  uv:2
  uv:2  + l:1       ->  uvl:3
  o:2   + s:2       ->  os:4
  a:3   + uvl:3     ->  auvl:6
  os:4  + y:4       ->  osy:8
  h:4   + e:5       ->  eh:9
  _:6   + auvl:6    ->  _auvl:12
  osy:8 + eh:9      ->  osyeh:17
  _auvl:12 + osyeh:17  ->  _auvlosyeh:29

Step 1, counting symbols, is easy enough. For Step 2, the Huffman tree is constructed by repeated merging: we start with the two symbols with the lowest counts (here, u:1 and v:1), and connect them together, adding their counts. This gives us uv:2.
We repeat: The second time, l:1 and uv:2 have the lowest counts, so we combine them, giving uvl:3. The third time, the two lowest counts are o:2 and s:2, so we combine them into os:4. This repeats (a:3 and uvl:3, etc. etc.!) until all the symbols are collected together, always choosing the two symbols with the lowest counts to merge.
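Steps 1 through 4 can be sketched with a priority queue (Python's heapq). A caution about this sketch: the function name is mine, it assumes at least two distinct symbols, and ties may be broken differently than in the figure, so the individual codes can differ from the ones shown below; the total compressed length comes out the same either way.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Sketch of Steps 1-4: count symbols, repeatedly merge the two
    lowest-count entries, and grow each symbol's 0/1 code as its group
    is merged.  Assumes text has at least two distinct symbols."""
    counts = Counter(text)                                   # Step 1
    # heap entries: (count, tie-breaker, {symbol: code-so-far})
    heap = [(c, i, {sym: ""}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:                                     # Step 2
        c1, _, low = heapq.heappop(heap)    # lowest count
        c2, _, high = heapq.heappop(heap)   # second lowest
        # Step 3, incrementally: 0 onto one group's codes, 1 onto the other's
        merged = {s: "0" + code for s, code in low.items()}
        merged.update({s: "1" + code for s, code in high.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]                                        # Step 4

codes = huffman_codes("she_loves_you_yeah_yeah_yeah_")
```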
Once the branches of the tree are labelled with 0s and 1s (Step 3), we have the Huffman codes for the symbols (Step 4), where the most-used symbols have the shortest bit codes:
STEP 3: Label the tree, writing 0 on one branch and 1 on the other at
each fork.

STEP 4: Extract codes:
  _ : 00        o : 1000
  a : 010       s : 1001
  u : 01100     y : 101
  v : 01101     e : 110
  l : 0111      h : 111

In Step 3, we no longer need the counts and the symbol groups, only the symbols at the ends (``leaves'') of the tree. For Step 4, we follow the path from the tree's entry point (``root'') to each symbol to find its Huffman Code.
STEP 5: Translate the message

  1001 111 110 00 0111 1000 01101 110 1001 00 101 1000 01100 00 101 110 010 111 00 ...
   s    h   e  _   l    o     v    e   s   _   y   o     u   _   y   e   a   h  _  ...

(the spaces between codes are only for readability --- the actual message is one unbroken bit string, 10011111100001111000011011101001...). The miracle of this message is that, even though the symbols are represented by bit codes of differing lengths, you can precisely decode the message by reading bits until you match a code in the Huffman-code list!
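That decoding procedure --- read bits into a buffer until the buffer matches a code --- takes only a few lines. Here is a sketch using the code table from Step 4 (the function names are mine):

```python
# the code table from Step 4
CODES = {"_": "00", "a": "010", "u": "01100", "v": "01101", "l": "0111",
         "o": "1000", "s": "1001", "y": "101", "e": "110", "h": "111"}

def encode(text, codes=CODES):
    """Step 5: translate each symbol into its bit code."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes=CODES):
    """Read bits into a buffer until it matches a code.  This never
    misfires because no code is a prefix of another (symbols sit only
    at the leaves of the tree)."""
    reverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in reverse:
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

decode(encode("she_loves_you_"))   # -> "she_loves_you_"
```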
The one-line message is compressed to 90 bits, which is less than half the size of the original ASCII message, which is 232 bits (29 symbols at 8 bits each). The Huffman codes for the letters must be included with the compressed message, and these are saved compactly in tree form. See the details in https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
Huffman proved (using mathematics) that the Huffman tree produces the most compact symbol-by-symbol representation of the input file --- no other code that assigns one fixed bit pattern per symbol can do better.
The algorithm in the zip utility (DEFLATE) uses LZ77 to compress repetitions and encodes individual symbols as Huffman codes.
SYMBOL COUNTS         HUFFMAN TREE SO FAR
(i)   h:2 p:2         (nothing)
      i:1 _:1 o:1

(ii)  h:2 p:2         _:1--+
      i:1 _o:2              _o:2
                      o:1--+

(iii) h:2 p:2         _:1--+
      _oi:3                 _o:2--+
                      o:1--+      _oi:3
                      i:1---------+

(iv)  (finish me)

At each step, the symbols with the lowest counts are paired together and added to the tree. Finish the tree. Then, label the branches with 0s and 1s. If you did this correctly, then h and p have Huffman codes that are two bits in length, and o has a code of length three. Translate hip_hop into 1s and 0s.
Here is an example of ``lossy compression,'' applied to words. The algorithm is: remove any vowel surrounded by two consonants (since we can guess what it is later). We also use a few standard texting abbreviations and ignore punctuation. Here's the start of the Beatles's song with this form of lossy compression:
(She lvs U yeah yeah yeah)*3 yeah U thnk Uve lst yur lve Wll I sw hr ystrdy Its U shes thnkng of & she tld me wht 2 sy

Now thnk might expand to thank or think, and lst to any of last, list, lost, lust. If the expansion algorithm guesses wrong, the lyrics will change, but only a little --- it is surprising how much of the song's content is preserved. This same principle holds true when compressed music or images are expanded --- it is certain there will be inaccuracies in the expanded versions, but it takes close scrutiny to detect them.
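The vowel-dropping rule is mechanical enough to sketch. In this sketch (the function name is mine), the texting abbreviations (U, 2, &) are left out, and y counts as a consonant, an assumption that matches ystrdy above:

```python
def squeeze(word):
    """Lossy rule: drop any vowel that sits between two consonants.
    Neighbours are taken from the original spelling, and 'y' is
    treated as a consonant here."""
    vowels = set("aeiou")
    kept = []
    for i, ch in enumerate(word):
        inside = 0 < i < len(word) - 1
        if (inside and ch in vowels
                and word[i - 1] not in vowels
                and word[i + 1] not in vowels):
            continue   # guessable later, so drop it
        kept.append(ch)
    return "".join(kept)

squeeze("think"), squeeze("lost"), squeeze("yesterday")
# -> ("thnk", "lst", "ystrdy")
```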
I'll give you a few principles here, then you are on your own to read more. First off, a single pixel (dot) in an image is encoded as three numbers, (R,G,B), which list the red, green, and blue content in the dot. The intensity of each falls in the range 0..255. Black is (0,0,0), white is (255,255,255), pure red is (255,0,0), and so on. There is a fun table of colors at http://www.fizzylogic.com/users/bulbul/ColorSpecifier.html
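Since each intensity fits in one byte, a pixel is often written as six hexadecimal digits, which is where web colors like #FF0000 come from. A tiny sketch (the function name is mine):

```python
def to_hex(r, g, b):
    """Each intensity 0..255 fits in one byte, so a pixel is often
    written as two hex digits per channel."""
    assert all(0 <= v <= 255 for v in (r, g, b))
    return "#{:02X}{:02X}{:02X}".format(r, g, b)

to_hex(0, 0, 0), to_hex(255, 255, 255), to_hex(255, 0, 0)
# -> ("#000000", "#FFFFFF", "#FF0000")
```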
An image is a huge rectangle of pixels ((R,G,B) numbers), laid out in a grid. Such an image is often too huge to store, and the JPG-compression algorithm reformats the grid as follows:
This explanation is developed better (with pictures!) at http://www.ams.org/samplings/feature-column/fcarc-image-compression