A Uniform Query Processing Approach for Integrating Data from Heterogeneous Resources
Doctoral thesis, 2010
Scientists who need to explore several different databases in their
research can find it difficult and tedious to extract and combine
information from various heterogeneous data sources manually.
This is a particular problem for researchers in the life sciences,
since technical advances in the last decade have resulted in a dramatic
increase in the quantity and variety of data.
Many databases of interest are developed independently by different
research groups, and the database administrators often want to
keep their databases autonomous so that they can develop and maintain
them without being constrained by other database sources.
Therefore, there is a need for software solutions to the problem of
data integration that facilitate combining
up-to-date data from autonomous, heterogeneous databases located at
different sites.
A system for data integration from heterogeneous (relational and RDF/S),
autonomous and distributed data sources has been designed and implemented in
this work. The main aim in the design and implementation of the system has
been to make large parts of query and result processing independent of
the kinds of data resources that are being used. The queries are held
in a resource-independent form through large parts of the query processing.
We refer to this as uniform query and result processing. The user
poses queries, called global queries, against an integrated view of the
underlying data resources. The integrated view does not reveal the
structure of the underlying data sources. A global query is rewritten by
using rules that describe the mapping from concepts in the integrated view to
concepts in the data sources. The rewritten query is then split into sub-queries that
each relate to one of the data sources. Wrappers translate sub-queries
into the query languages of the component databases, send these sub-queries
to the component databases and then retrieve the results. Several small
example federations have been implemented to test the system, one of
which is a federation of biological databases. We have focused
on incorporating data in relational databases and RDF Schema data, since
these are widely used and are becoming increasingly popular for
managing data collections.
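The query processing steps described above (rewrite the global query with mapping rules, split it into per-source sub-queries, let wrappers translate each sub-query) can be illustrated with a minimal sketch. It is not the implementation developed in this work: the rule table, the source names, the concept names and the SQL and SPARQL-style target syntax are all hypothetical and chosen only to make the flow concrete. Namespace prefix declarations are omitted.

```python
from collections import defaultdict

# Hypothetical rewrite rules: concept in the integrated view ->
# (data source, concept in that source). All names are illustrative.
REWRITE_RULES = {
    "protein_name": ("relational_db", "proteins.name"),
    "protein_function": ("rdf_store", "ex:hasFunction"),
}


def rewrite(global_query):
    """Map each global concept to its source-level counterpart."""
    return [REWRITE_RULES[concept] for concept in global_query]


def split_by_source(rewritten):
    """Group rewritten terms into one sub-query per data source."""
    sub_queries = defaultdict(list)
    for source, term in rewritten:
        sub_queries[source].append(term)
    return dict(sub_queries)


def sql_wrapper(terms):
    """Translate a sub-query into SQL (assumes 'table.column' terms)."""
    tables = sorted({term.split(".")[0] for term in terms})
    return f"SELECT {', '.join(terms)} FROM {', '.join(tables)}"


def sparql_wrapper(terms):
    """Translate a sub-query into a SPARQL-style query (assumes property terms)."""
    patterns = " ".join(f"?s {prop} ?o{i} ." for i, prop in enumerate(terms))
    variables = " ".join(f"?o{i}" for i in range(len(terms)))
    return f"SELECT ?s {variables} WHERE {{ {patterns} }}"


# One wrapper per kind of data source; everything before this point
# is independent of the kind of source being queried.
WRAPPERS = {"relational_db": sql_wrapper, "rdf_store": sparql_wrapper}


def plan(global_query):
    """Rewrite, split and translate a global query; return one query per source."""
    sub_queries = split_by_source(rewrite(global_query))
    return {source: WRAPPERS[source](terms) for source, terms in sub_queries.items()}
```

In this sketch only the wrapper functions know about a concrete query language, which mirrors the idea of keeping the query in a resource-independent form for as long as possible. Sending the translated sub-queries to the component databases and combining the results is left out.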
An outcome of this work is a functioning prototype system that
applies a uniform query and result processing approach, and has
a modular system design that is easy to use as a starting point
for modifications and extensions.
Query processing
Functional data model
Rewrite rules
Data integration