Project 3 -- EECS380 -- Winter 2001

Compression

Assigned Friday April 6, 2001. Officially released at 6:50pm
Original due date Wed April 18th at 6pm
Extended to Sunday April 22nd at 6pm.

Changes:

Overview

You will write a program which attempts to perform data compression on any type of file. Your program is to provide a number of different compression schemes. You will provide some external documentation on one of the algorithms used.

Comments

Read this entire document before you start on the program. Really. It will help.

The program specification (short form)

Your executable, built by the `make' command, is to be called my_compress. It is to take arguments as follows:
my_compress  {-LZW or -LZY or -H or -D } -I filename -O filename 
Exactly one of -LZW or -LZY or -H or -D must be specified. The -I and -O arguments are required.

The arguments have the following meanings: -LZW compresses using the LZW algorithm, -LZY compresses using your own LZ variant, -H compresses using Huffman encoding, and -D decompresses a file produced by any of the compression options. -I names the input file and -O names the output file. Each algorithm is described below.

Your program may not create or modify any file other than the output file.

Note: In order for the uncompress option to work with any of the three compression algorithms, it will be necessary for the output files to encode a message explaining how they were encoded. We require that this be done by starting each compressed file with a different 4-letter set of ASCII characters. This is discussed in more detail below.

You may not copy or use code from any source other than our text, the STL, and the standard C/C++ libraries. You should clearly cite any code you use from our text.


The following section is from the comp.compression FAQ. We suggest you at least scan the whole entry for question #70 in part 2. Much of the FAQ is actually quite interesting. This part is written by Peter Gutmann. Other links include an animation of the LZ algorithm and a more academic treatment of the LZ algorithms.

The LZ78 family of compressors

LZ78-based schemes work by entering phrases into a *dictionary* and then, when a repeat occurrence of that particular phrase is found, outputting the dictionary index instead of the phrase. There exist several compression algorithms based on this principle, differing mainly in the manner in which they manage the dictionary. The most well-known scheme (in fact the most well-known of all the Lempel-Ziv compressors, the one which is generally (and mistakenly) referred to as "Lempel-Ziv Compression"), is Terry Welch's LZW scheme, which he designed in 1984 for implementation in hardware for high-performance disk controllers.

Input string: /WED/WE/WEE/WEB

Character input:    Code output:    New code value and associated string:
    /W                  /                   256 = /W
    E                   W                   257 = WE
    D                   E                   258 = ED
    /                   D                   259 = D/
    WE                  256                 260 = /WE
    /                   E                   261 = E/
    WEE                 260                 262 = /WEE
    /W                  261                 263 = E/W
    EB                  257                 264 = WEB
    *END*               B
LZW starts with a 4K dictionary, of which entries 0-255 refer to individual bytes, and entries 256-4095 refer to substrings. Each time a new code is generated it means a new string has been parsed. New strings are generated by appending the current character K to the end of an existing string w. The algorithm for LZW compression is as follows:
  set w = NIL
  loop until end of input
      read a character K
      if wK exists in the dictionary
          w = wK
      else
          output the code for w
          add wK to the dictionary
          w = K
  endloop
  output the code for w      (flush the final string)
  
We strongly suggest that you walk through the entire example shown above before going any further.
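
To make the algorithm concrete, here is a minimal sketch of the compressor loop in C++. It is only a sketch under assumptions, not the required implementation: it collects the codes in a vector<int> and leaves packing the 12-bit codes into bytes (see the Thoughts section) to you.

  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  // A minimal sketch of the LZW compressor loop. The 12-bit codes are
  // collected in a vector<int>; packing them into bytes is left out.
  std::vector<int> lzw_compress(FILE* in)
  {
      std::map<std::string, int> dict;
      for (int i = 0; i < 256; ++i)            // entries 0-255: single bytes
          dict[std::string(1, static_cast<char>(i))] = i;
      int next_code = 256;                     // entries 256-4095: substrings

      std::vector<int> out;
      std::string w;                           // w starts out empty (NIL)
      int c;
      while ((c = std::fgetc(in)) != EOF) {
          std::string wk = w + static_cast<char>(c);
          if (dict.count(wk)) {
              w = wk;                          // wK is in the dictionary
          } else {
              out.push_back(dict[w]);          // output the code for w
              if (next_code < 4096)            // table freezes at 4096 entries
                  dict[wk] = next_code++;
              w = std::string(1, static_cast<char>(c));
          }
      }
      if (!w.empty())
          out.push_back(dict[w]);              // flush the final string
      return out;
  }

(A std::map gives logarithmic lookups; anything with reasonable speed is fine, per the Implementation section below.)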

The difficult part now is to see how the decompressor has enough information to reconstruct the data exactly. The key is that the decompressor can build exactly the same table as the compressor built. So the first 4 things output by the compressor are the codes for /, W, E, and D. Note that each of those is output as a 12-bit value rather than an 8-bit char.

A key question is: when the decompressor sees the 256, how can it know that it stands for /W? The answer is that the decompressor knows the first 4 characters the compressor saw, so it knows that the compressor had a table that consisted of

256 = /W
257 = WE
258 = ED
So the decompressor knows exactly what to do with the 256. This observation is key to understanding all of the LZ algorithms.
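
To see this in code, here is a minimal sketch of the matching decompressor in C++, again assuming the codes arrive as a vector<int>. One subtle case is never triggered by the example above: the compressor can emit a code it added only one step earlier, which the decompressor has not built yet; such a code can only stand for w plus the first character of w, and the marked branch handles it.

  #include <cstddef>
  #include <map>
  #include <string>
  #include <vector>

  // A minimal sketch of the LZW decompressor. It rebuilds the compressor's
  // table, one entry behind, and translates each code back into its string.
  std::string lzw_decompress(const std::vector<int>& codes)
  {
      std::map<int, std::string> dict;
      for (int i = 0; i < 256; ++i)
          dict[i] = std::string(1, static_cast<char>(i));
      int next_code = 256;

      std::string out;
      std::string w;                           // the previously decoded string
      for (std::size_t i = 0; i < codes.size(); ++i) {
          std::string entry;
          if (dict.count(codes[i]))
              entry = dict[codes[i]];
          else                                 // code not in the table yet:
              entry = w + w[0];                // it must be w + w's first char
          out += entry;
          if (!w.empty() && next_code < 4096)
              dict[next_code++] = w + entry[0];  // same entry the compressor added
          w = entry;
      }
      return out;
  }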

LZ variants

LZW
For the -LZW option you are to use a fixed table of 4096 entries, with 12-bit output codes (as described above). Once you reach 4096 table entries the table becomes static and unchanging.

LZY
For this argument you can modify the LZ algorithm however you wish. (The Y in LZY stands for `Your'.) You will be graded based on how well your algorithm compresses the data. You may not use the LZW algorithm without at least some change. Looking at various web pages about LZ and LZW techniques might give you some ideas. Also, do not choose algorithms that take significantly longer to run than the LZW algorithm. (Your LZY algorithm should have a runtime within a factor of 2 or 3 of the LZW algorithm.)

Implementation

You should use a reasonable (in terms of speed) format to store your dictionary. Feel free to use the STL as needed...


Huffman encoding

(A local copy of some of the following documentation is also available.)

The basic idea of Huffman encoding is to come up with an encoding scheme that uses shorter encodings for often-occurring strings (or characters). An outstanding description of Huffman encoding is found at the University of Western Australia. Be sure to follow the link to the operation of the Huffman algorithm. The FAQ referenced in the LZ section also has some information on Huffman encoding.

One can do Huffman encoding on characters, groups of characters, words, or whatever else. For this assignment we ask that it be done on a character basis.

Parts of a Huffman-encoded file.

There are three basic parts to most files created with Huffman encoding: The headers, the dictionary, and the data.

The data is simply the encoded message using the encodings generated by the Huffman algorithm.

The dictionary contains the translation from the original encoding to the Huffman encoding. There are many ways to do this.

The header contains whatever other information is needed by the decompressor to be able to decompress the file. This could include the number of dictionary entries and whatever else your algorithm might need.

Huffman Example

Imagine we are writing a Huffman encoder that is designed for small files with only a few characters. Say we have the string AAABBBAAAC. Our Huffman encoding scheme might generate encodings where
A = 1
B = 01
C = 00
The header might consist of a single integer indicating how many dictionary entries there are. There can't be more than 256 different characters, so this count fits in a single byte. (If all 256 characters can appear, you will need a convention such as storing count-1.) Further, the operating system will tell us if we have hit the end of the file, but it does this on a character basis rather than a bit basis. It is very important to know at what _bit_ the data ends. We thus include a character which tells us what bit the last byte of data ends on. (This is the number of bits in the data modulo 8.)

We can use standard ASCII to identify each of the characters in the dictionary using 8 bits. Further, we are sure we will have very few characters in each file, so we guess that no encoding will have more than 7 bits. However we notice that the encodings "1" and "01", for example, would both be stored in 7 bits as 0000001, a major problem. There are a number of ways to get around this. For this scheme we choose to mark the bit just before the actual encoding with a 1. So 01 would be stored (in 8 bits now) as 00000101, while 1 would be stored as 00000011. So in binary we have:

0000 0011    // header indicating we have 3 dictionary entries
0000 0110    // header indicating the last byte of data uses 6 bits.
0100 0001    // ASCII for A
0000 0011    // encoding for A.  Really "1"
0100 0010    // ASCII for B
0000 0101    // encoding for B.  Really "01" 
0100 0011    // ASCII for C
0000 0100    // encoding for C.  Really "00" 
And from the header we can tell that the remaining information is the data section.
1110 1010    // AAABB and 1/2 of the next B
1111 0000    // The rest of the B, AAAC, and two extra zeros.
             // We know the last two zeros aren't an extra char because the
             // header said the last byte uses only 6 bits (111100).
Notice that both the old and new files took up 80 bits. Also notice we did not include the 4-byte identifier required in the specifications section... Still, it was a very small file and we would hope to do better on a larger file.

You do not need to (nor should you!) use the exact scheme shown above. It won't work for larger encodings. But it should provide some ideas. (You can use a scheme as close to the above as you wish.)

Huffman requirements

Your scheme to build the Huffman tree should use the STL's priority_queue. Your dictionary lookup (both when encoding and decoding) must have a better-than-linear runtime in the size of the dictionary.
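
For example, here is a minimal sketch of one way to build the tree with the STL's priority_queue, assuming you have already counted how often each byte occurs. The Node type and build_tree function are illustrative names, not required ones.

  #include <queue>
  #include <vector>

  // One possible node type: leaves carry a character, internal nodes
  // just carry the combined count of their two children.
  struct Node {
      long count;
      unsigned char ch;                    // meaningful only for leaves
      Node* left;
      Node* right;
      Node(long c, unsigned char k) : count(c), ch(k), left(0), right(0) {}
      Node(Node* l, Node* r)
          : count(l->count + r->count), ch(0), left(l), right(r) {}
  };

  // Comparator so the priority_queue hands back the SMALLEST count first.
  struct ByCount {
      bool operator()(const Node* a, const Node* b) const {
          return a->count > b->count;
      }
  };

  // Build the tree from per-character counts (counts[i] = occurrences of
  // byte i). Returns the root, or 0 for an empty file.
  Node* build_tree(const long counts[256])
  {
      std::priority_queue<Node*, std::vector<Node*>, ByCount> pq;
      for (int i = 0; i < 256; ++i)
          if (counts[i] > 0)
              pq.push(new Node(counts[i], static_cast<unsigned char>(i)));
      if (pq.empty())
          return 0;
      while (pq.size() > 1) {              // repeatedly merge the two rarest
          Node* a = pq.top(); pq.pop();
          Node* b = pq.top(); pq.pop();
          pq.push(new Node(a, b));
      }
      return pq.top();
  }

Walking the finished tree (say, 0 for a left branch and 1 for a right branch) yields each character's encoding. Note that a file containing only one distinct character produces a single-leaf tree, which needs a special case.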

The encoding scheme you choose should make a reasonable use of space. Remember the goal is to compress the file size. Some waste (especially in the dictionary and headers) is acceptable.

Your encoding and decoding schemes must have a reasonable run time. Don't worry about making it go really fast. Just don't do anything that is terribly inefficient.


Decompression

If the -D flag is given, the input file should be decoded and sent to the output file. You should not assume that the output file will be a text file. Your program will need to examine the first four characters of the file and determine what compression scheme was used. No other hints (command-line arguments, file extensions) can be used to determine this.
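
For instance, here is a minimal sketch of the dispatch, assuming hypothetical 4-character identifiers "LZW!", "LZY!", and "HUF!" (the actual identifiers are yours to choose, as required in the specification above):

  #include <cstdio>
  #include <cstring>

  enum Scheme { LZW, LZY, HUFFMAN, UNKNOWN };

  // Read the 4-character identifier the compressor wrote and decide which
  // decompression routine to run. The identifiers here are placeholders.
  Scheme detect_scheme(FILE* in)
  {
      char magic[4];
      if (std::fread(magic, 1, 4, in) != 4)
          return UNKNOWN;                  // too short to be one of our files
      if (std::memcmp(magic, "LZW!", 4) == 0) return LZW;
      if (std::memcmp(magic, "LZY!", 4) == 0) return LZY;
      if (std::memcmp(magic, "HUF!", 4) == 0) return HUFFMAN;
      return UNKNOWN;
  }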

Other things

How to turn it in

This is pretty much the same as projects 1 and 2.

Do all of your work, with all needed files, in some directory other than your home directory. Before you turn in your code be sure that:

The simple autograder will be turned on by midnight on Monday April 9th. It will be announced on the newsgroup.

Turn-in list

You will turn in your code and the external documentation described in the Overview; both will be turned in via the autograder.

Grading

A compression scheme is not working if you can't correctly decompress it. There will be no partial credit for any of the "working and reasonable" parts.

You will get zero points for the project if you create any files other than the specified output file!

It is not our intent to grade for coding style. However, your code should be fairly readable, as we will need to review it by hand (to make sure you are implementing the right algorithms!). If we can't understand what is going on we may deduct points!

Thoughts

Many excellent references on Lempel-Ziv compression can be found by Web search, for example via Google. Several sites provide animations of compression algorithms.

Manipulating the individual bits of the characters to be written may be one of the more difficult parts of this assignment for many of you. Make good use of the shift operators, the bit-wise operators, and/or the STL's bit-vectors.
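
For example, here is a minimal sketch of one workable (but not required) layout for the 12-bit LZW codes: every pair of codes is packed into three bytes.

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  // Pack 12-bit codes two-per-three-bytes. An odd final code occupies two
  // bytes, padded with four zero bits; the decompressor needs the code
  // count (or an end marker) to tell that padding from a real code.
  void write_codes(FILE* out, const std::vector<int>& codes)
  {
      for (std::size_t i = 0; i < codes.size(); i += 2) {
          int a = codes[i];
          int b = (i + 1 < codes.size()) ? codes[i + 1] : 0;
          std::fputc((a >> 4) & 0xFF, out);              // high 8 bits of a
          std::fputc(((a & 0xF) << 4) | (b >> 8), out);  // low 4 of a, high 4 of b
          if (i + 1 < codes.size())
              std::fputc(b & 0xFF, out);                 // low 8 bits of b
      }
  }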

Reading and writing binary files in C++ is explained in the handouts. If you didn't do it for HW5, it would be a good idea to play with this a little before you start on the actual program.

As a general rule, for large files you should be within a factor of 50 of the runtime of the Unix utilities compress or gzip on the same file. The basic goal is to not take forever.