Michael Cafarella, Assistant Professor of Computer Science and Engineering, was recently awarded an NSF CAREER grant for his research project, "Building and Searching a Structured Web Database."
The CAREER grant is one of NSF's most prestigious awards, conferred for "the early career-development activities of those teacher-scholars who most effectively integrate research and education within the context of the mission of their organization."
This work investigates techniques for extracting and searching Web-embedded structured datasets. For example, a manufacturer's site may contain technical product data, and a governmental site may contain economic statistics. Unfortunately, such data can be hard to isolate from surrounding text, and difficult to find using existing search engines that focus exclusively on documents.
The approach for the data extraction step is to use current incomplete datasets to induce a large "portfolio" of possible extractors, apply all of them to crawled Web content, and then test to learn which are most successful. The approach for the search step is to examine user query logs to find common patterns that describe the relationship between topic words and words that describe the dataset's structure; e.g., "endangered species near the Mississippi river" is a prototype for a many-to-one geographic relationship. The central goal of this work is to eventually construct a working search engine for the structured-data component of the Web.
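The extraction step described above can be illustrated with a minimal sketch. This is not the project's actual implementation. The regex-based extractors, the sample pages, the seed records, and the scoring rule (the fraction of known seed records an extractor recovers) are all hypothetical choices made only for illustration.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Record = Tuple[str, str]
Extractor = Callable[[str], Set[Record]]


def make_regex_extractor(pattern: str) -> Extractor:
    """Build a hypothetical extractor from a regex with two capture groups."""
    compiled = re.compile(pattern)

    def extract(page: str) -> Set[Record]:
        return {(m.group(1), m.group(2)) for m in compiled.finditer(page)}

    return extract


def score_portfolio(
    portfolio: Dict[str, Extractor],
    pages: List[str],
    seed: Set[Record],
) -> Dict[str, float]:
    """Apply every extractor to every page and score it by the fraction
    of seed records it recovers (an assumed scoring rule)."""
    scores: Dict[str, float] = {}
    for name, extractor in portfolio.items():
        found: Set[Record] = set()
        for page in pages:
            found |= extractor(page)
        scores[name] = len(found & seed) / len(seed) if seed else 0.0
    return scores


# Illustrative stand-ins for crawled Web content and an incomplete dataset.
PAGES = [
    "France: Paris\nJapan: Tokyo\n",
    "Capital of Peru is Lima. Capital of Kenya is Nairobi.",
]
SEED = {("France", "Paris"), ("Peru", "Lima")}

# A tiny "portfolio" of candidate extractors (hypothetical patterns).
PORTFOLIO = {
    "colon": make_regex_extractor(r"(\w+): (\w+)"),
    "capital_of": make_regex_extractor(r"Capital of (\w+) is (\w+)"),
    "dash": make_regex_extractor(r"(\w+) - (\w+)"),
}

scores = score_portfolio(PORTFOLIO, PAGES, SEED)
# Keep only the extractors that recovered at least one seed record.
selected = sorted(name for name, score in scores.items() if score > 0)
```

In this sketch, the extractors that survive also return records absent from the seed set, which is how an incomplete dataset could grow. A real system would need a far larger portfolio and a more careful test of extractor quality.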
If successful, this project is likely to increase access to structured datasets for a very broad population of users. It should also help users from scientific, engineering, and policy fields who need to find quantitative data to support their work.
Prof. Cafarella is affiliated with the Software Systems Lab in Computer Science and Engineering. He is also a member of the Database Group. His research interests are in databases, information extraction, and data mining. He is particularly interested in applying data mining techniques to Web data and scientific applications.
Posted: March 10, 2011