Massive Datasets

Research Areas -> Theory of Computation -> Massive Datasets
 
Overview
Massive datasets abound in a wide variety of applications facing academic researchers, government agencies, and corporations. Massive datasets arise in the physical sciences through observation of the heavens and earth and through experiments with particle colliders. Genomic information provides massive datasets in biology. Telecommunications, including high-frequency radar as well as voice and data networks, provide examples in engineering. Transactional data arising from commerce, medical care, and other areas are also becoming larger than standard computational techniques can handle.

In many applications, the size of the dataset as presented is much larger than the size of the useful information in the dataset. A typical class of problem is called "heavy hitters," in which we must track the most significant k items out of n >> k. We will want to use resources (time, space, communication) polynomial in k but much less than n. Typically, we present randomized approximation algorithms that, with high probability, give a solution that may be less than optimal but is provably close to optimal, in a sense that depends on the application at hand. We may also need to satisfy additional constraints, such as privacy constraints in cryptographic contexts.

 
Faculty
Strauss, Martin J.


Related Labs, Centers, and Groups
Software Systems Laboratory