
USING AN ONTOLOGY
TO SIMPLIFY DATA ACCESS
BY EDUARD HOVY
When we turn to government for
information, we expect it to be
timely, thorough, and above all
accurate. However, we also demand
that government be split in many
different ways—federal, state, and
local; executive, judicial, and legislative; tax management separated
from pensions and health. Moreover, government data should be
collected at different times by different people. The resulting heterogeneity (especially incompatible data resources) places special
demands on government software systems. A good example of the
data problem appears in the more than 70 U.S. Federal Statistics
(FedStats) agencies that collect information about all aspects of life in
the U.S. Collectively, these agencies have tens of thousands of databases, stored in numerous formats (database software, Web pages,
typewritten tables, and so on), with new ones being added every day.
Frequently, portions of this data overlap, or are semi-complementary
(an individual in one database may be part of a family in another, for
example). Often, the classes of data are related, near-identical, or even
identical (what is termed “salary” in one database, for example, might
be exactly the same as “income” in another and “wages” in a third, but
might be quite different from what some other agency means by
“salary”).
Some mechanism is required to standardize data types, enable
sharing, and facilitate perusal of
others’ data. Ideally, such a mechanism should:
• Include specialized domain
terminology to support the
detailed representation of fine-grained technical distinctions
in the data, which allows
experts to use the system;
• Also include lay terminology,
to enable nonexperts to
quickly locate information
without having to know the
expert word usage; and
• Support automatic inference,
so that computer programs
can help users find and merge
information, find correspondences across data sets in the
same domain, and (semi-)
automatically extend the
mechanism to incorporate
new domains.
These desiderata make up a
very tall order, for which no simple and wholly complete solution might ever be
found. But some progress has been made in recent
work. We describe one variant of such a mechanism,
the ontology. We define an ontology simply as a taxonomized set of terms, ranging from very general
terms at the top (allowing nonexpert users to find
access points) down to very specialized ones at the
bottom (allowing them to be connected to specific
columns in databases). For example, the general term
“wage” may head a small subtaxonomy including
Agency1-wage, Agency2-salary, and so on, each one
with its own definition, associated Web pages, and of
course associated data.
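
As a rough illustration of this definition, the following sketch models a taxonomy whose general term "wage" heads two agency-specific terms, each tied to a database column. The class name `Term`, the method `data_terms`, the column paths, and the placeholder definitions are all assumptions made for this example; they are not taken from the EDC system.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Term:
    """One node in a taxonomized set of terms (hypothetical structure)."""
    name: str
    definition: str = ""
    # Hypothetical pointer to a database column; None for general terms.
    column: Optional[str] = None
    children: List["Term"] = field(default_factory=list)

    def add(self, child: "Term") -> "Term":
        """Attach a more specialized term below this one."""
        self.children.append(child)
        return child

    def data_terms(self) -> Iterator["Term"]:
        """Yield this term and its descendants that point at a column."""
        if self.column is not None:
            yield self
        for child in self.children:
            yield from child.data_terms()


# General term at the top: an access point for nonexpert users.
wage = Term("wage", "placeholder definition of the general term")
# Specialized terms at the bottom, each with its own definition and column.
wage.add(Term("Agency1-wage", "placeholder: Agency1's definition",
              column="agency1.payroll.wage"))
wage.add(Term("Agency2-salary", "placeholder: Agency2's definition",
              column="agency2.staff.salary"))

columns = [t.column for t in wage.data_terms()]
print(columns)
```

A user who starts at "wage" thus reaches both agency columns without knowing either agency's own vocabulary.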
Our ontology is being used in a prototype system called the Energy Data Collection (EDC) system [1]. This system is
being built by members of the Digital
Government Research Center (DGRC;
www.dgrc.org), which consists of faculty, staff, and
students at the Information Sciences Institute (ISI) of
the University of Southern California (USC) and
Columbia University’s Computer Science Department and its Center for Research on
Information Access.
Figure 2. Ontology and domain models (link labels: GAW link mapping, logical mapping).
The EDC project was started in NSF’s Digital Government Program in
1999. We are working with representatives from the Census Bureau, the
Bureau of Labor Statistics, the Department of Energy’s Energy Information
Administration (EIA), and the California Energy Commission. For example,
the EIA’s www.eia.doe.gov provides extensive monthly energy data to the public.
This site receives hundreds of thousands of hits a month, even though most
of its information is available only in standardized HTML Web pages or prepared
PDF documents, and only for the last few years. The current facility thus
supports only limited access to this very rich data source.
In order to support more dynamic yet homogeneous access to multiple energy databases, the EDC
system includes three principal components detailed
as follows:
Figure 1. Fragment of the EDC domain model. Node labels in the figure
include Product, Gasoline, G. Leaded, G. Unleaded, G. Regular, G. Premium,
G. Premium Unleaded, Quality, Area, USA, CA, NY, Time Series, Period, Week,
Month, Year, Date, Measurement, Value, Unit, Point of Sale, Footnote, Tag,
Text, Price, Volume, CPI, and PPI. The legend lists Subclass, Part-of,
General Relation, and Source Mapping.
Interface. The interface allows users to construct data requests
by ontology browsing, natural language type-in, or cascaded menus.
A completed request is dispatched to the query processor, which
returns data tables and graphs to the interface for display.
(Figure 2 layer labels: Large ontology (SENSUS); Domain-specific ontologies (SIMS models); Data sources.)
Information access planner. The query processor employs USC/ISI’s
SIMS system [2], which decomposes data requests into database
queries according to the content and nature of the data sources,
retrieves data from them, and reassembles the results
appropriately. Since we have incorporated over
50,000 energy-related data tables of various kinds,
SIMS uses a model of the data that identifies and
describes their contents. This domain model, which
unifies the various databases’ metadata descriptions,
forms the lowermost portion of the ontology. A fragment of the EDC model (about 500 nodes, manually
defined) is shown in Figure 1. A typical query includes
some type of gasoline (chosen from the top-right cluster), some quality grade (bottom right), some area of
interest (bottom left) and so on.
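
The decompose, retrieve, and reassemble cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SIMS planner: the source names, the `covers` descriptions, the functions `plan` and `execute`, and the placeholder price values are all invented for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical source descriptions: each source declares which domain-model
# values it covers. Rows hold placeholder values, not real energy data.
SOURCES: Dict[str, dict] = {
    "source_a": {
        "covers": {"product": {"G. Regular", "G. Premium"}, "area": {"CA"}},
        "rows": [
            {"product": "G. Regular", "area": "CA", "price": 1.0},
            {"product": "G. Premium", "area": "CA", "price": 2.0},
        ],
    },
    "source_b": {
        "covers": {"product": {"G. Regular"}, "area": {"NY", "USA"}},
        "rows": [
            {"product": "G. Regular", "area": "NY", "price": 3.0},
        ],
    },
}

Request = Dict[str, Set[str]]


def plan(request: Request) -> List[Tuple[str, Request]]:
    """Decompose a domain-level request into one sub-query per useful source."""
    subqueries = []
    for name, source in SOURCES.items():
        sub: Request = {}
        for attr, wanted in request.items():
            overlap = wanted & source["covers"].get(attr, set())
            if not overlap:
                break  # this source cannot contribute to the request
            sub[attr] = overlap
        else:
            subqueries.append((name, sub))
    return subqueries


def execute(subqueries: List[Tuple[str, Request]]) -> List[dict]:
    """Retrieve matching rows from each source and reassemble one table."""
    merged = []
    for name, sub in subqueries:
        for row in SOURCES[name]["rows"]:
            if all(row[attr] in allowed for attr, allowed in sub.items()):
                merged.append({**row, "source": name})
    return sorted(merged, key=lambda r: (r["area"], r["product"]))


# A request phrased in domain-model terms: a gasoline type and two areas.
request = {"product": {"G. Regular"}, "area": {"CA", "NY"}}
subqueries = plan(request)
result = execute(subqueries)
for row in result:
    print(row)
```

The point of the sketch is that the user's request mentions only domain-model terms; which source answers which part is decided from the source descriptions.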
Ontology as metadata. It is not simple to unify
different databases’ metadata and/or domain terms,
and to create a coherent domain model. As Figure 1
shows, the various clusters represent independent and
quite different concepts: gasoline type and quality,
geographic region, units of measurement, and so on.
In order to place these concepts in a single coherent
framework, which will also facilitate the future addition of domains and databases dealing with very different information, we used as overarching ontology
USC/ISI’s 70,000-node terminological taxonomy
SENSUS [8]. SENSUS is a rearrangement and extension of Princeton’s WordNet 1.6 [4], retaxonomized
under USC/ISI’s Penman Upper Model [3] (built to
support natural language processing). SENSUS can
be accessed using the browsers DINO at
edc.isi.edu:8011/dino or its predecessor Ontosaurus
at mozart.isi.edu:8003/sensus/sensus_frame.html.
What makes our approach new is an attempt to
automate much of the ontology and domain model
construction. In order to facilitate model building and
ontologization, we have developed algorithms that:
• Identify and extract from data sources terms likely
to be important for domain modeling;
• Cluster them into mini-taxonomies [6];
• Create mappings/alignments of terms and clusters
into the ontology [5, 6].
In addition, our Columbia University partners have investigated extracting and analyzing terms from online
glossaries [7].
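
The article does not spell out how the alignment step works. Purely as an illustration of what proposing term-to-concept mappings can look like, the sketch below scores candidate pairs by word overlap (Jaccard similarity) between names and definitions. The scoring method, the threshold, the function names, and the sample glosses are all assumptions for this example and should not be read as the algorithms of [5, 6].

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase word set of a name or definition."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_alignments(
    domain_terms: Dict[str, str],
    concepts: Dict[str, str],
    threshold: float = 0.2,
) -> List[Tuple[str, str, float]]:
    """Propose (domain term, ontology concept, score) pairs above a threshold."""
    suggestions = []
    for term, term_def in domain_terms.items():
        term_tokens = tokens(term + " " + term_def)
        for concept, gloss in concepts.items():
            score = jaccard(term_tokens, tokens(concept + " " + gloss))
            if score >= threshold:
                suggestions.append((term, concept, round(score, 2)))
    return sorted(suggestions, key=lambda s: -s[2])


# Invented sample glosses, for illustration only.
domain_terms = {"price": "amount of money charged for gasoline"}
concepts = {
    "cost": "amount of money paid for something",
    "week": "period of seven days",
}

suggestions = suggest_alignments(domain_terms, concepts)
print(suggestions)
```

A heuristic this crude produces suggestions for a person to review rather than links that can be trusted outright.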
The ontology for the EDC project has the
structure shown in Figure 2. To create it, we
identified the principal domain terms,
manually defined the domain model of
approximately 500 nodes to represent the
concepts present in the EDC gasoline domain, and
linked these domain concepts into SENSUS using the
semiautomated alignment algorithms.
We needed two types of links for this work. The
links between data sources and domain model terms
express logical equivalences, as required to ensure the
correctness of SIMS reasoning. They must therefore
be checked manually. To connect concepts in the
upper ontology and the domain model we defined a
new type of link called “generally-associated-with”
(GAW). GAW links enable the user while browsing to
rapidly proceed from high-level concepts to the
domain model terms associated with real data in the
databases. In contrast to domain model links, the
semantics of GAW links are intentionally vague. This
vagueness allows us to link a specifically defined
domain model term (such as price) to very disparate
(though still thematically related) SENSUS concepts
(such as price, cost, money, charge, dollar, amount,
fee, payment, paying, and so on). Clearly, these links
cannot support automated inference. They can, however, help the nonexpert user to start browsing or
forming queries using whatever terms are most familiar. In addition, the vague semantics has a fortunate
side effect, in that it facilitates automated alignment
of concepts from domain model to SENSUS. Since
the alignment techniques are still not very accurate,
we cannot without considerable manual intervention
employ them where logically strict equivalence links
are required. For GAW links, however, they are quite
well suited.
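
The contrast between the two link types can be made concrete with a small sketch. The data structures, column names, and function names below are assumptions for illustration; only the list of SENSUS concepts associated with "price" comes from the text above.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Logical links: data-source column -> domain-model term. These express
# strict equivalence, so in this sketch they are a fixed, hand-checked table.
# The column names are hypothetical.
LOGICAL: Dict[str, str] = {
    "source_a.prices.value": "price",
    "source_b.volumes.value": "volume",
}

# GAW ("generally-associated-with") links: upper-ontology concept ->
# domain-model terms. Deliberately loose; usable for browsing only.
GAW: Dict[str, Set[str]] = defaultdict(set)
for concept in ("price", "cost", "money", "charge", "dollar",
                "amount", "fee", "payment", "paying"):
    GAW[concept].add("price")


def browse(concept: str) -> List[str]:
    """Domain-model terms a user can reach from a familiar word."""
    return sorted(GAW.get(concept, set()))


def columns_for(term: str) -> List[str]:
    """Columns that are logically equivalent to a domain-model term."""
    return sorted(col for col, t in LOGICAL.items() if t == term)


# A nonexpert starts from "fee", follows a GAW link to the domain term,
# and only then crosses the strict links to real data.
terms = browse("fee")
cols = [col for term in terms for col in columns_for(term)]
print(terms, cols)
```

Keeping the two tables separate mirrors the division of labor in the text: loose links may be proposed automatically, while the strict ones are the only links a reasoner relies on.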
References
1. Ambite, J.L., Arens, Y., Bourne, W., Davis, P.T., Hovy, E.H., Klavans,
J.L., Philpot, A., Popper, S., Ross, K., Shih, J-L., Sommer, P.,
Temiyabutr, S., and Zadoff, L. A portal for access to complex distributed
information about energy. In Proceedings of the 2nd National Conference
on Digital Government. (Los Angeles, Calif., 2002).
2. Arens, Y., Knoblock, C.A., and Hsu, C-N. Query processing in the
SIMS Information Mediator. In A. Tate, Ed., Advanced Planning Technology. AAAI Press, Menlo Park, CA, 1996.
3. Bateman, J.A., Kasper, R.T., Moore, J.D., and Whitney, R.A. A General
Organization of Knowledge for Natural Language Processing: The Penman Upper Model. Unpublished research report, 1989. USC/Information Sciences Institute, Marina del Rey, CA.
4. Fellbaum, C., Ed. WordNet: An On-Line Lexical Database and Some of its
Applications. MIT Press, Cambridge, MA, 1998.
5. Hovy, E.H. Combining and standardizing large-scale, practical ontologies
for machine translation and other uses. In Proceedings of the 1st International
Conference on Language Resources and Evaluation. (Granada, Spain, 1998).
6. Hovy, E.H., Philpot, A.G., Ambite, J.L., Arens, Y., Klavans, J.L.,
Bourne, W., and Saroz, D. Data acquisition and integration in the
DGRC’s data collection project. In Proceedings of the 1st National Conference on Digital Government. (Los Angeles, Calif., 2001).
7. Klavans, J.L., Davis, P.T., and Popper, S. Building large ontologies using
Web-crawling and glossary analysis techniques. In Proceedings of the 2nd
National Conference on Digital Government. (Los Angeles, Calif., 2002).
8. Knight, K. and Luk, S.K. Building a large-scale knowledge base for
machine translation. In Proceedings of the 11th National Conference on
Artificial Intelligence (1994).
Eduard Hovy ([email protected]) is a research fellow in the Information
Sciences Institute at the University of Southern California, Marina del Rey.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on
the first page. To copy otherwise, to republish, to post on servers or to redistribute to
lists, requires prior specific permission and/or a fee.
© 2003 ACM 0002-0782/03/0100 $5.00
COMMUNICATIONS OF THE ACM January 2003/Vol. 46, No. 1