Tuesday, April 6
4:30 - 5:30 PM
Room 1003 EECS
Abstract:
Very large databases store enormous amounts of information: terabyte storage systems for radiological images are available commercially, and petabyte systems for scientific and medical databases are expected in the near future. In some cases the amount of information to be stored is so large that it overwhelms the ability of the physical storage mechanisms to deliver arbitrary requests quickly. There are thus fundamental limits on attainable performance in such systems. We develop a general model of the information retrieval problem and demonstrate constructive procedures for arranging data in the storage system that are essentially optimum. The arguments use basic information-theoretic concepts such as typical sequences, randomization, and redundancy (though not entropy or mutual information).
Allowing redundancy greatly lowers the achievable access times, even when the amount added is an arbitrarily small fraction of the total amount of information in the database. The achievable limits both with and without redundancy are computed; in the case where redundancy is allowed, the limits essentially coincide with lower limits for more general storage systems.
The following simplified example can be greatly generalized. A two-dimensional image is divided into an n x n array of blocks and is to be stored on a one-dimensional tape. Taking successive user requests to be generated by a two-dimensional simple random walk on the image, we seek to arrange the data in the storage system so as to minimize the long-term average access time. (We take the image to be a torus and the tape a loop for simplicity.) We show that when the blocks are stored without replication, the average access time is at least n/2, and the raster scan is thus essentially optimal. On the other hand, if redundancy is allowed, the access time can be reduced to less than K n^a for any a > 0; the total storage achieving this may be made less than (1+d) times the original storage, for any d > 0; and there are constructive procedures for arranging the data that achieve this. Thus an arbitrarily small amount of redundancy is sufficient to reduce the average access time from linear in n to subpolynomial.
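
As a rough illustration of the non-redundant case, the short Python sketch below simulates the raster-scan arrangement under the request model described above. It is not taken from the talk; it assumes that access time can be modeled simply as the circular seek distance on the loop tape between consecutive requests, and the parameters (n = 64, number of steps) are arbitrary choices for illustration.

    # Minimal sketch (illustrative assumption, not the talk's construction):
    # block (i, j) of an n x n torus image is stored at tape position i*n + j
    # on a loop of length n^2, requests follow a simple random walk on the
    # torus, and access time is taken to be the circular seek distance
    # between consecutive requests.
    import random

    def average_seek_raster(n, steps=100_000, seed=0):
        rng = random.Random(seed)
        i = j = 0                      # current block of the random walk
        prev_pos = i * n + j           # tape position of the previous request
        total = 0
        for _ in range(steps):
            di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            i, j = (i + di) % n, (j + dj) % n   # one step of the walk on the torus
            pos = i * n + j                     # raster-scan tape position
            d = abs(pos - prev_pos)
            total += min(d, n * n - d)          # seek distance on the loop tape
            prev_pos = pos
        return total / steps

    # For the raster scan this comes out near (n + 1)/2, consistent with the
    # n/2 lower bound quoted above (horizontal moves cost about 1, vertical
    # moves cost about n, each occurring half the time).
    print(average_seek_raster(64))    # roughly 32.5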