Querying Heterogeneous and Uncertain Information
Project Award
Number: IIS-0219513
Principal Investigator: H. V. Jagadish
Electrical Engineering and Computer Science Dept.
University of Michigan
1301 Beal Ave
Ann Arbor, MI 48104
734-763-4079
FAX: 734-763-8094
jag@umich.edu
http://www.eecs.umich.edu/~jag
Keywords
Probabilistic
data
Uncertainty
Schema
heterogeneity
Data
Integration
XML
Project Summary
There
are many different sources from which relevant information can be
obtained
for a task at hand. The accuracy of
this information is frequently
imperfect. Information obtained from each source is
frequently also different in
structure
and scope.
XML
has emerged today as the standard for information exchange. Its
flexibility
permits the representation of information from heterogeneous
sources
in a single unified framework.
The
goal of this project is to manage uncertain (probabilistic) data,
represented
in XML. While there have been previous
efforts to develop
probabilistic
relational systems, the need for representing and querying
uncertain
information is much greater for loosely structured data not
readily
amenable to a relational representation.
XML
data poses several modeling challenges that are the focus of this
project:
due to the data having (an inclusion hierarchy) structure,
probabilities
cannot be assigned independently; due to XML elements occurring
at
multiple granularities, probabilities may be assigned at multiple levels;
due
to the possibility of missing and repeated sub-elements, one may not be
able
to obtain complete distributions.
Publications and
Products
Cong
Yu, Hong Qi, H. V. Jagadish, “Integration of IR into an XML Database” First
Annual Workshop of the Initiative for the Evaluation of XML Retrieval (INEX),
Schloss Dagstuhl, Wadern, Germany, 2002.
Cong
Yu, D. Radev, and H. V. Jagadish, “Querying XML Using Structures and Keywords
in Timber,” Proc. ACM-SIGIR Int'l Conference on Information Retrieval,
Toronto, ON, July 2003.
Shurug
Al-Khalifa, Cong Yu, and H. V. Jagadish, “Querying Structured Text in an XML
Database,” Proc. ACM-SIGMOD Int'l Conference on the Management of Data,
San Diego, CA, June 2003, pp 4—15.
Project Impact
Data
integration is an important and difficult topic. This project is attacking this problem from a new angle, that we
hope will produce greater benefit.
Biological
data integration is a central application domain objective of this
project. As such, research in the
biological sciences is positively impacted by this project. The biological data integration work being
undertaken as an application of the concepts developed in this project is
likely to become a resource of value to biological scientists, quite
independent of its technical merits in the computer science sense.
Two
graduate students are centrally involved with this project. Several additional graduate students also
contribute.
Goals, Objectives
and Targeted Activities
There
are three main thrusts:
1.
A probabilistically valid manipulation of uncertainty: A certainty level
(probability) can be associated with each data item, and then used in query
processing. However, XML data poses
several modeling challenges: due to the data having (an inclusion hierarchy)
structure, probabilities cannot be assigned independently; due to XML elements occurring
at multiple granularities, probabilities may be assigned at multiple levels; due
to the possibility of missing and repeated sub-elements, one may not be able to
obtain complete distributions. These
are the challenges we address.
We
have developed a probability model for XML data, where the probability associated
with a node is interpreted as its conditional probability given its
parents. Incomplete probabilities and
probability distributions can be represented conveniently. However, arbitrary probability dependences
cannot be captured. We believe that the
probability framework we have developed is effective in a wide variety of
scenarios, and have a significant biological database (having to do with
protein interactions) expressed in our probabilistic XML framework.
2.
Unstructured Textual Content: Traditional database technology is designed for
data that can be neatly structured into specific fields whose values can be
queries and manipulated. Often, we have
text fields intermixed with structured data (e.g. 'comments', 'desrciption'). At other times we have multiple data sources
to be integrated, some of which are textual and others of which are structured. Effective query processing in this context
requires that we have a uniform representation of unstructured information in a
structured context. To achieve the
closure property, so important to be able to build up complex queries, we need
to model such data with suitable scoring functions. We are studying these challenges.
It
is important to integrate ``information retrieval style'' query evaluation,
which is well-suited for natural language text, with standard ``database
style'' query evaluation, which handles structured queries efficiently.
Relevance scoring is central to information retrieval. In the case of XML, this
operation becomes more complex because the data required for scoring could
reside not directly in an element itself but also in its descendant elements.
We
propose a bulk-algebra, TIX, and describe how it can be used as a basis for
integrating information retrieval techniques into a standard pipelined database
query evaluation engine. We develop new evaluation strategies essential to obtaining
good performance, including a stack-based TermJoin algorithm for efficiently scoring
composite elements. We report results from an extensive experimental evaluation,
which show, among other things, that the new TermJoin access method outperforms a
direct implementation of the same functionality using standard operators by a
large factor.
Using
these ideas, we participated in INEX (Initiative for Evaluation for XML Retrieval). We turned in a creitable performance, coming
in 4th amongs the dozens of interational participants. The primary finding from this participation
for us was the importance of correctly selecting the XML granule that is
returned as the result of a query. We
are continuing to research this important topic to improve upon our current
capabilities.
3.
Schematic Heterogeneity: If the schema
of a new data source is known, a schema mapping can be defined and the new
source can be integrated with the existing sources. But what if the schema is not known? One possibility is to try to learn the schema. An alternative that we are exploring is to
sidestep this entirely and see how far we can push structured queries into systems
that have a scheme that the query-writer is not aware of.
As
a separate project, we have been developing a native XML database, Timber. The code for this system provides us with an
effective platform on which to build our current project. All our ideas are being implemented on top
of Timber, which provides an ideal platform for deployment and testing.
Area Background
Data
integration has been studied extensively, and from many different
perspectives. However, most data
integration efforts worry, for obvious reasons, about how to perform the
integration rather than about modeling the integration errors.
Probabilistic
databases have also been proposed quite some years ago. There has been a small but continuing stream
of research on the topic, but commercial adoption has been absent thus
far. Our premise is that the advent of
XML provides an opportunity to change matters.
Area References
There
isn’t really any text-book treatment of the topics of interest to us. Perhaps the best reference in terms of prior
work is:
Laks
V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian, “ProbView: A flexible probabilistic database
system”.
ACM
Transactions on Database Systems, 22(3):419--469, 1997.
Potential Related
Projects
There
is a probabilistic data project at the Univ. of Maryland, led by Lise Getoor
and V. S. Subrahmanian.
Project Website
http://www.eecs.umich.edu/db/timber
Until we accumulate a sufficient mass of results in
this project, we have chosen to leverage off of the pre-existing Timber project
website, and to show the data integration work done under this project as an extension/application
of Timber.