A team of computer science researchers that includes U-M Prof. Atul Prakash is pursuing the first funded investigation of non-consumptive research on a large-scale collection of digitized content. The study, funded by a grant from the Alfred P. Sloan Foundation, is led by Principal Investigators Beth Plale of the Data To Insight Center at Indiana University and Prof. Prakash of the University of Michigan, and is being conducted in partnership with the HathiTrust Research Center (HTRC).
Non-consumptive research involves computational analysis of one or more books without the researcher having the ability to reassemble the collection. Rather than reading the material, researchers use specialized algorithms to analyze text as a massive data set. The Sloan grant will help ensure that this work can be conducted in a secure environment.
The Sloan-funded project will enable the researchers to pursue a line of research on non-consumptive uses of the HathiTrust digital corpus. By the end of the project, they expect to have cyberinfrastructure in place demonstrating that non-consumptive research can be carried out safely even when user-submitted algorithms behave maliciously, whether intentionally or not.
In some cases, HTRC will own the algorithms used by researchers, so HTRC needs to examine the security requirements for users, the algorithms and the data, all within the context of the suite of algorithms available in the Software Environment for the Advancement of Scholarly Research (SEASR). In other cases, researchers will own and submit their own algorithms. Here the Sloan Foundation funding will give scholars the freedom to experiment with new algorithms on a huge body of information, with technological "trust but verify" mechanisms in place to confirm compliance with non-consumptive research policy. The security aspects of the project will leverage research by Prof. Prakash.
Without reading the materials themselves, researchers using their own algorithms might analyze such massive data sets for anything from simple repetition of words to complex linguistic structures or the evolution of word usage across time, space or even demographic class. The HathiTrust repository contains almost 8.6 million digitized volumes, and about 2.2 million of those, roughly 26 percent, are in the public domain and currently available for non-consumptive research.
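To make the idea concrete, the kind of analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not HTRC or SEASR code: the function names `word_counts_by_year` and `relative_frequency`, the `(year, text)` input format and the sample sentences are all assumptions made for the example. The point is that the analysis returns only aggregate counts, never the text itself.

```python
import re
from collections import Counter


def word_counts_by_year(volumes):
    """Aggregate word counts per publication year.

    `volumes` is an iterable of (year, text) pairs. This input format is a
    hypothetical stand-in for however a real corpus would be exposed.
    Only counts leave this function; the original text is not returned.
    """
    counts = {}
    for year, text in volumes:
        words = re.findall(r"[a-z]+", text.lower())
        counts.setdefault(year, Counter()).update(words)
    return counts


def relative_frequency(counts, word):
    """Share of all words in each year that are `word`, ordered by year."""
    return {
        year: year_counts[word] / sum(year_counts.values())
        for year, year_counts in sorted(counts.items())
        if year_counts  # skip years with no words to avoid dividing by zero
    }


# Tiny made-up sample standing in for digitized volumes.
sample = [
    (1850, "The railway and the telegraph"),
    (1850, "A railway journey"),
    (1900, "The telephone replaced the telegraph"),
]

counts = word_counts_by_year(sample)
print(relative_frequency(counts, "telegraph"))  # {1850: 0.125, 1900: 0.2}
```

A researcher working this way sees how often a word appears in each period, but cannot reconstruct any volume from the result.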
Prof. Prakash is affiliated with the Software Systems Laboratory, and his research is in the areas of security and privacy, and cyber-physical systems. He has taught courses in Computer and Network Security (EECS 588), Information Systems Design and Programming (EECS 282), and Building Information Environments (EECS 182).
Posted: September 1, 2011