Tuesday, April 6
4:30 - 5:30 PM
Room 1003 EECS
Very large databases store enormous amounts of information: terabyte storage systems for radiological images are available commercially, and petabyte systems for scientific and medical databases are expected in the near future. In some cases the amount of information to be stored is so large that it overwhelms the ability of the physical storage mechanism to service arbitrary requests quickly. There are thus fundamental limits on the performance attainable in such systems.
We develop a general model of the information retrieval problem and demonstrate constructive procedures for arranging data in the storage system that are essentially optimal. The arguments use basic information-theoretic concepts such as typical sequences, randomization, and redundancy (though not entropy or mutual information).
Allowing redundancy greatly lowers the achievable access times, even when the redundancy added is an arbitrarily small fraction of the total amount of information in the database. We compute the achievable limits both with and without redundancy; in the case where redundancy is allowed, the limits essentially coincide with lower limits for more general storage systems.
The following simplified example can be greatly generalized. A two-dimensional image is divided into an n × n array of blocks and is to be stored on a one-dimensional tape. Taking successive user requests to be generated by a two-dimensional simple random walk on the image, we seek to arrange the data in the storage system so as to minimize the long-term average access time. (For simplicity, we take the image to be a torus and the tape a loop.) We show that when the blocks are stored without replication, the average access time is at least n/2, so the raster scan is essentially optimal. On the other hand, if redundancy is allowed, the access time can be reduced to less than K·n^a for any a > 0, where K is a constant; the total storage achieving this can be made less than (1 + d) times the original storage, for any d > 0; and there are constructive procedures for arranging the data that achieve this. Thus an arbitrarily small amount of redundancy suffices to reduce the average access time from linear in n to subpolynomial.
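The raster-scan half of this example is easy to check empirically. The following is a minimal simulation sketch, not from the talk itself: it assumes one tape slot per block, an access cost equal to the circular distance traveled on the loop, and illustrative function names (raster_position, circular_distance, average_access_time) of my own. Under the raster layout, each horizontal step of the walk moves the target by 1 slot and each vertical step by n slots, so the measured average comes out near (n + 1)/2, consistent with the n/2 lower bound.

```python
import random

def raster_position(i, j, n):
    """Tape slot of block (i, j) under a raster-scan layout."""
    return i * n + j

def circular_distance(a, b, length):
    """Access time between two slots on a tape loop of the given length."""
    d = abs(a - b) % length
    return min(d, length - d)

def average_access_time(n, steps=100_000, seed=0):
    """Drive a simple random walk on an n x n torus of blocks and
    measure the mean seek distance per request on the tape loop."""
    rng = random.Random(seed)
    length = n * n          # one slot per block: no redundancy
    i = j = 0
    pos = raster_position(i, j, n)
    total = 0
    for _ in range(steps):
        # Move to one of the four neighbors, wrapping on the torus.
        di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        i, j = (i + di) % n, (j + dj) % n
        target = raster_position(i, j, n)
        total += circular_distance(pos, target, length)
        pos = target
    return total / steps

if __name__ == "__main__":
    for n in (16, 64, 256):
        # Expected output is close to (n + 1) / 2 for each n.
        print(n, average_access_time(n))
```

Running this shows the average seek growing linearly in n, which is what the no-replication bound predicts; the subpolynomial access times claimed above require the redundant layouts of the talk, which this sketch does not attempt.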