Querying Heterogeneous and Uncertain Information

Project Award Number:  IIS-0219513
Principal Investigator:  H. V. Jagadish  

Electrical Engineering and Computer Science Dept.

University of Michigan

1301 Beal Ave

Ann Arbor, MI 48104

734-763-4079

FAX: 734-763-8094

jag@umich.edu

http://www.eecs.umich.edu/~jag                                                                                                                                                                                        

Keywords

Probabilistic data

Uncertainty

Schema heterogeneity

Data Integration

XML

Project Summary

There are many different sources from which relevant information can be

obtained for a task at hand.  The accuracy of this information is frequently

imperfect.  Information obtained from each source is frequently also different in

structure and scope.

 

XML has emerged today as the standard for information exchange.  Its

flexibility permits the representation of information from heterogeneous

sources in a single unified framework.

The goal of this project is to manage uncertain (probabilistic) data,

represented in XML.  While there have been previous efforts to develop

probabilistic relational systems, the need for representing and querying

uncertain information is much greater for loosely structured data not

readily amenable to a relational representation.

 

XML data poses several modeling challenges that are the focus of this

project: due to the data having (an inclusion hierarchy) structure,

probabilities cannot be assigned independently; due to XML elements occurring

at multiple granularities, probabilities may be assigned at multiple levels;

due to the possibility of missing and repeated sub-elements, one may not be

able to obtain complete distributions.

Publications and Products

Cong Yu, Hong Qi, H. V. Jagadish, “Integration of IR into an XML Database” First Annual Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), Schloss Dagstuhl, Wadern, Germany, 2002.

Cong Yu, D. Radev, and H. V. Jagadish, “Querying XML Using Structures and Keywords in Timber,” Proc. ACM-SIGIR Int'l Conference on Information Retrieval, Toronto, ON, July 2003.

Shurug Al-Khalifa, Cong Yu, and H. V. Jagadish, “Querying Structured Text in an XML Database,” Proc. ACM-SIGMOD Int'l Conference on the Management of Data, San Diego, CA, June 2003, pp 4—15.

 

Project Impact

Data integration is an important and difficult topic.  This project is attacking this problem from a new angle, that we hope will produce greater benefit.

Biological data integration is a central application domain objective of this project.  As such, research in the biological sciences is positively impacted by this project.  The biological data integration work being undertaken as an application of the concepts developed in this project is likely to become a resource of value to biological scientists, quite independent of its technical merits in the computer science sense.

Two graduate students are centrally involved with this project.  Several additional graduate students also contribute.

 

Goals, Objectives and Targeted Activities

There are three main thrusts:

1. A probabilistically valid manipulation of uncertainty: A certainty level (probability) can be associated with each data item, and then used in query processing.  However, XML data poses several modeling challenges: due to the data having (an inclusion hierarchy) structure, probabilities cannot be assigned independently; due to XML elements occurring at multiple granularities, probabilities may be assigned at multiple levels; due to the possibility of missing and repeated sub-elements, one may not be able to obtain complete distributions.  These are the challenges we address.

We have developed a probability model for XML data, where the probability associated with a node is interpreted as its conditional probability given its parents.  Incomplete probabilities and probability distributions can be represented conveniently.  However, arbitrary probability dependences cannot be captured.  We believe that the probability framework we have developed is effective in a wide variety of scenarios, and have a significant biological database (having to do with protein interactions) expressed in our probabilistic XML framework.

2. Unstructured Textual Content: Traditional database technology is designed for data that can be neatly structured into specific fields whose values can be queries and manipulated.  Often, we have text fields intermixed with structured data (e.g. 'comments', 'desrciption').  At other times we have multiple data sources to be integrated, some of which are textual and others of which are structured.  Effective query processing in this context requires that we have a uniform representation of unstructured information in a structured context.  To achieve the closure property, so important to be able to build up complex queries, we need to model such data with suitable scoring functions.  We are studying these challenges.

It is important to integrate ``information retrieval style'' query evaluation, which is well-suited for natural language text, with standard ``database style'' query evaluation, which handles structured queries efficiently. Relevance scoring is central to information retrieval. In the case of XML, this operation becomes more complex because the data required for scoring could reside not directly in an element itself but also in its descendant elements.

We propose a bulk-algebra, TIX, and describe how it can be used as a basis for integrating information retrieval techniques into a standard pipelined database query evaluation engine. We develop new evaluation strategies essential to obtaining good performance, including a stack-based TermJoin algorithm for efficiently scoring composite elements. We report results from an extensive experimental evaluation, which show, among other things, that the new TermJoin access method outperforms a direct implementation of the same functionality using standard operators by a large factor.

Using these ideas, we participated in INEX (Initiative for Evaluation for XML Retrieval).  We turned in a creitable performance, coming in 4th amongs the dozens of interational participants.  The primary finding from this participation for us was the importance of correctly selecting the XML granule that is returned as the result of a query.  We are continuing to research this important topic to improve upon our current capabilities.

3. Schematic Heterogeneity:  If the schema of a new data source is known, a schema mapping can be defined and the new source can be integrated with the existing sources.  But what if the schema is not known?  One possibility is to try to learn the schema.  An alternative that we are exploring is to sidestep this entirely and see how far we can push structured queries into systems that have a scheme that the query-writer is not aware of.

As a separate project, we have been developing a native XML database, Timber.  The code for this system provides us with an effective platform on which to build our current project.  All our ideas are being implemented on top of Timber, which provides an ideal platform for deployment and testing.

Area Background

Data integration has been studied extensively, and from many different perspectives.  However, most data integration efforts worry, for obvious reasons, about how to perform the integration rather than about modeling the integration errors.

Probabilistic databases have also been proposed quite some years ago.  There has been a small but continuing stream of research on the topic, but commercial adoption has been absent thus far.  Our premise is that the advent of XML provides an opportunity to change matters.

Area References

There isn’t really any text-book treatment of the topics of interest to us.  Perhaps the best reference in terms of prior work is:

Laks V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian,  “ProbView: A flexible probabilistic database system”.

ACM Transactions on Database Systems, 22(3):419--469, 1997.

Potential Related Projects

There is a probabilistic data project at the Univ. of Maryland, led by Lise Getoor and V. S. Subrahmanian.

Project Website
http://www.eecs.umich.edu/db/timber

Until we accumulate a sufficient mass of results in this project, we have chosen to leverage off of the pre-existing Timber project website, and to show the data integration work done under this project as an extension/application of Timber.