Home


Project Overview

Our Publications


Personnel


Reference Shelf


Links


Data


Private


News

Learning Intrinsic Dimension and Entropy of High Dimensional Data Sets



Project Title: Learning Intrinsic Dimension and Entropy of High Dimensional Data Sets

PI Involved: Alfred O. Hero III,
Professor, Electrical Engineering & Computer Science, University of Michigan

Student: Jose A. Costa,
Ph.D. 2005, Electrical Engineering & Computer Science, University of Michigan

Project Description

Interesting classes of signals, from bioinformatics to Internet traffic analysis, are high dimensional in nature and thus appear highly complex. However, in many cases such apparent complexity can be explained by only a few simple variables, as a consequence of coherent structures in the data that lead to strong correlations across dimensions. This work is concerned with the characterization of high dimensional data sets through two key quantities: the data intrinsic dimension ("roughly" characterizes the number of independent variables needed to generate the signal) and entropy ("roughly" measures the information conveyed by the signal). We propose a nonparametric graph theoretic approach. Our method simply constructs an entropic graph - e.g., geodesic-minimal-spanning-tree (GMST), k-nearest neighbor (k-NN) graph, etc - on sub samples of the data set and then uses the total edge lengths of these graphs to estimate dimension and entropy. Figures bellow illustrate these constructions for a 2-dimensional dataset living in a 3D space.
Swiss Roll

Fig.1 : GMST and 4-NN graph on 400 points uniformly distributed on the 2D "Swiss roll" manifold.

 

Linear fit of total length
Fig.2 : Log-log plot of the k-NN total edge lenght as a function of the number of samples. The slope of the corresponding linear fit provides an estimate of the intrinsic dimension, while the intercept with the y-axis determines the intrinsic entropy.

References

1. J.A. Costa, (Hero advisor), " Random graphs for structure discovery in high dimensional data," PhD Thesis, Dept. EECS, Aug. 2005.
2. J. A. Costa and A. O. Hero, " Learning intrinsic dimension and intrinsic entropy of high dimensional datasets," to appear in Proc. of EUSIPCO , Vienna, Austria, September, 2004.
3. J. A. Costa and A. O. Hero, "Geodesic Entropic Graphs for Dimension and Entropy Estimation in Manifold Learning," to appear in IEEE Trans. on Signal Processing, August, 2004.
4. J. A. Costa and A. O Hero, "Manifold Learning Using k-Nearest Neighbor Graphs,"Proc. of IEEE Int. Conf. on Acoust. Speech and Signal Processing, Montreal, Canada, May, 2004.
5. J. A. Costa and A. O Hero, "Entropic Graphs for Manifold Learning,"Proc. of IEEE Asilomar Conf. on Signals, Systems, and Computers, Pacific Groove, CA, November, 2003.
6. J.A. Costa, A. Girotra and A.O. Hero (2005), " Estimating Local Intrinsic Dimension with k-Nearest Neighbor Graphs,, "Proc of IEEE Workshop on Statistical Signal Processing (SSP), Bordeaux, July 2005.
7. J.A Costa and A. O. Hero, " Learning intrinsic dimension and entropy of shapes, in "Statistics and analysis of shapes, Eds. H. Krim and T. Yezzi, Birkhauser, 2006.
8. J.A. Costa, N. Patwari, A.O. Hero (2006), " Distributed Weighted Multidimensional Scaling for Node Localization in Sensor Networks," to appear in 2006 in ACM Journal on Networking.
9. S. Grikschat, J.A. Costa and A.O. Hero, " Dual rooted-diffusions for clustering and classification on manifolds," 2006 IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Toulouse France, 2006.

Software Resources

Intrinsic Dimension and Entropy Estimation in Manifold Learning webpage.