USING AN ONTOLOGY TO SIMPLIFY DATA ACCESS
BY EDUARD HOVY

When we turn to government for information, we expect it to be timely, thorough, and above all accurate. However, we also demand that government be split in many different ways: federal, state, and local; executive, judicial, and legislative; tax management separated from pensions and health. Moreover, government data is collected at different times by different people. The resulting heterogeneity (especially incompatible data resources) places special demands on government software systems.

A good example of the data problem appears in the more than 70 U.S. Federal Statistics (FedStats) agencies that collect information about all aspects of life in the U.S. Collectively, these agencies have tens of thousands of databases, stored in numerous formats (database software, Web pages, typewritten tables, and so on), with new ones being added every day. Frequently, portions of this data overlap or are semi-complementary (an individual in one database may be part of a family in another, for example). Often, the classes of data are related, near-identical, or even identical (what is termed “salary” in one database, for example, might be exactly the same as “income” in another and “wages” in a third, but might be quite different from what some other agency means by “salary”). Some mechanism is required to standardize data types, enable sharing, and facilitate perusal of others’ data.
Ideally, such a mechanism should:
• Include specialized domain terminology to support the detailed representation of fine-grained technical distinctions in the data, which allows experts to use the system;
• Also include lay terminology, to enable nonexperts to quickly locate information without having to know the expert word usage; and
• Support automatic inference, so that computer programs can help users find and merge information, find correspondences across data sets in the same domain, and (semi-)automatically extend the mechanism to incorporate new domains.

These desiderata make up a very tall order, for which no simple and wholly complete solution might ever be found. But some progress has been made in recent work. We describe one variant of such a mechanism, the ontology. We define an ontology simply as a taxonomized set of terms, ranging from very general terms at the top (allowing nonexpert users to find access points) down to very specialized ones at the bottom (allowing them to be connected to specific columns in databases). For example, the general term “wage” may head a small subtaxonomy including Agency1-wage, Agency2-salary, and so on, each one with its own definition, associated Web pages, and of course associated data.

COMMUNICATIONS OF THE ACM January 2003/Vol. 46, No. 1 47

[Figure 2. Ontology and domain models. Layers shown: large ontology (SENSUS); domain-specific ontologies (SIMS models); data sources. Legend: GAW link mapping; logical mapping.]

Our ontology is being used in a prototype system called the Energy Data Collection (EDC) system [1]. This system is being built by members of the Digital Government Research Center (DGRC; www.dgrc.org), which consists of faculty, staff, and students at the Information Sciences Institute (ISI) of the University of Southern California (USC) and Columbia University’s Computer Science Department and its Center for Research on Information Access. The EDC project was started in the NSF’s Digital Government Program in 1999.
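As an illustration only, the taxonomized set of terms defined earlier can be sketched in a few lines of Python. This is not the EDC implementation: the class name Term, its fields, and the column identifiers are hypothetical, chosen to mirror the “wage” example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One node in a toy taxonomy: a name, a definition, and optional data links."""
    name: str
    definition: str = ""
    # Hypothetical "table.column" identifiers; these are not real agency schemas.
    columns: list[str] = field(default_factory=list)
    children: list["Term"] = field(default_factory=list)

    def add(self, child: "Term") -> "Term":
        """Attach a more specific term beneath this one and return it."""
        self.children.append(child)
        return child

    def leaves(self) -> list["Term"]:
        """Return the most specific terms under this node, depth-first."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# General lay term at the top, agency-specific terms at the bottom.
wage = Term("wage", "general lay term for money received for work")
wage.add(Term("Agency1-wage", "wage as defined by Agency1", ["agency1_pay.wage"]))
wage.add(Term("Agency2-salary", "salary as defined by Agency2", ["agency2_income.salary"]))

# Starting from the general term, collect every database column it leads to.
reachable = [column for leaf in wage.leaves() for column in leaf.columns]
print(reachable)
```

A nonexpert can enter at “wage” and still arrive at each agency's own column, while every specialized term keeps its own definition.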
We are working with representatives from the Census Bureau, the Bureau of Labor Statistics, the Department of Energy’s Energy Information Administration (EIA), and the California Energy Commission. For example, the EIA’s www.eia.doe.gov site provides extensive monthly energy data to the public. This site receives hundreds of thousands of hits a month, even though most of its information is available only in standardized HTML Web pages or prepared PDF documents, and only for the last few years. The current facility thus supports only limited access to this very rich data source.

[Figure 1. Fragment of the EDC domain model. Labels visible in the figure: Unit, Point of Sale, Gasoline, Time Series, Period, G. Leaded, G. Regular, G. Premium, G. Unleaded, G. Premium Unleaded, Week, Month, Year, Product, Measurement, Footnote, Area, Date, Value, Tag, Text, USA, CA, NY, Quality, Subclass, Part-of, General Relation, Source, Mapping, Price, CPI, Volume, PPI.]

In order to support more dynamic yet homogeneous access to multiple energy databases, the EDC system includes three principal components detailed as follows:

Interface. The interface allows users to construct data requests by ontology browsing, natural language type-in, or cascaded menus. A completed request is dispatched to the query processor, which returns data tables and graphs to the interface for display.

Information access planner. The query processor employs USC/ISI’s SIMS system [2], which decomposes data requests into database queries according to the content and nature of the data sources, retrieves data from them, and reassembles the results appropriately. Since we have incorporated over 50,000 energy-related data tables of various kinds, SIMS uses a model of the data that identifies and describes their contents.
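The decompose, retrieve, and reassemble cycle attributed to the query processor above can be pictured with a toy mediator. The sketch below is not SIMS and does not use its planning methods. The dictionary layout, source names, table names, and values are all invented for illustration.

```python
# Hypothetical model of the data: each concept lists the source tables that
# hold part of it. None of these names come from the real EDC system.
DOMAIN_MODEL = {
    "gasoline_price": [
        {"source": "source_a", "table": "prices_ca", "area": "CA"},
        {"source": "source_b", "table": "prices_ny", "area": "NY"},
    ],
}

# Made-up rows standing in for the remote databases.
FAKE_SOURCES = {
    ("source_a", "prices_ca"): [{"area": "CA", "month": "2002-01", "value": 1.0}],
    ("source_b", "prices_ny"): [{"area": "NY", "month": "2002-01", "value": 2.0}],
}


def decompose(concept, areas):
    """Select one sub-query per source table that can answer part of the request."""
    return [entry for entry in DOMAIN_MODEL.get(concept, []) if entry["area"] in areas]


def execute(subquery):
    """Stand-in for sending a query to a single source."""
    return FAKE_SOURCES[(subquery["source"], subquery["table"])]


def answer(concept, areas):
    """Decompose the request, retrieve from each source, and merge the rows."""
    rows = []
    for subquery in decompose(concept, areas):
        rows.extend(execute(subquery))
    return sorted(rows, key=lambda row: (row["area"], row["month"]))


table = answer("gasoline_price", {"CA", "NY"})
for row in table:
    print(row)
```

Here DOMAIN_MODEL plays the role of the model of the data: it records which source holds which part of a concept, so the request itself never names a database.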
This domain model, which unifies the various databases’ metadata descriptions, forms the lowermost portion of the ontology. A fragment of the EDC model (about 500 nodes, manually defined) is shown in Figure 1. A typical query includes some type of gasoline (chosen from the top-right cluster), some quality grade (bottom right), some area of interest (bottom left), and so on.

Ontology as metadata. It is not simple to unify different databases’ metadata and/or domain terms and to create a coherent domain model. As Figure 1 shows, the various clusters represent independent and quite different concepts: gasoline type and quality, geographic region, units of measurement, and so on. In order to place these concepts in a single coherent framework, which will also facilitate the future addition of domains and databases dealing with very different information, we used as the overarching ontology USC/ISI’s 70,000-node terminological taxonomy SENSUS [8]. SENSUS is a rearrangement and extension of Princeton’s WordNet 1.6 [4], retaxonomized under USC/ISI’s Penman Upper Model [3] (built to support natural language processing). SENSUS can be accessed using the browsers DINO at edc.isi.edu:8011/dino or its predecessor Ontosaurus at mozart.isi.edu:8003/sensus/sensus_frame.html.

What makes our approach new is an attempt to automate much of the ontology and domain model construction. In order to facilitate model building and ontologization, we have developed algorithms that:
• Identify and extract from data sources terms likely to be important for domain modeling;
• Cluster them into mini-taxonomies [6]; and
• Create mappings/alignments of terms and clusters into the ontology [5, 6].
In addition, our Columbia University partners have investigated extracting and analyzing terms from online glossaries [7].

The ontology for the EDC project has the structure shown in Figure 2.
To create it, we identified the principal domain terms, manually defined the domain model of approximately 500 nodes to represent the concepts present in the EDC gasoline domain, and linked these domain concepts into SENSUS using the semiautomated alignment algorithms.

We needed two types of links for this work. The links between data sources and domain model terms express logical equivalences, as required to ensure the correctness of SIMS reasoning. They must therefore be checked manually. To connect concepts in the upper ontology and the domain model, we defined a new type of link called “generally-associated-with” (GAW). GAW links enable the user while browsing to proceed rapidly from high-level concepts to the domain model terms associated with real data in the databases. In contrast to domain model links, the semantics of GAW links are intentionally vague. This vagueness allows us to link a specifically defined domain model term (such as price) to very disparate (though still thematically related) SENSUS concepts (such as price, cost, money, charge, dollar, amount, fee, payment, paying, and so on). Clearly, these links cannot support automated inference. They can, however, help the nonexpert user to start browsing or forming queries using whatever terms are most familiar.

In addition, the vague semantics has a fortunate side effect, in that it facilitates automated alignment of concepts from the domain model to SENSUS. Since the alignment techniques are still not very accurate, we cannot employ them without considerable manual intervention where logically strict equivalence links are required. For GAW links, however, they are quite well suited.

References
1. Ambite, J.L., Arens, Y., Bourne, W., Davis, P.T., Hovy, E.H., Klavans, J.L., Philpot, A., Popper, S., Ross, K., Shih, J-L., Sommer, P., Temiyabutr, S., and Zadoff, L. A portal for access to complex distributed information about energy. In Proceedings of the 2nd National Conference on Digital Government. (Los Angeles, Calif., 2002).
2. Arens, Y., Knoblock, C.A., and Hsu, C-N. Query processing in the SIMS Information Mediator. In A. Tate, Ed., Advanced Planning Technology. AAAI Press, Menlo Park, CA, 1996.
3. Bateman, J.A., Kasper, R.T., Moore, J.D., and Whitney, R.A. A General Organization of Knowledge for Natural Language Processing: The Penman Upper Model. Unpublished research report, USC/Information Sciences Institute, Marina del Rey, CA, 1989.
4. Fellbaum, C., Ed. WordNet: An On-Line Lexical Database and Some of its Applications. MIT Press, Cambridge, MA, 1998.
5. Hovy, E.H. Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the 1st International Conference on Language Resources and Evaluation. (Granada, Spain, 1998).
6. Hovy, E.H., Philpot, A.G., Ambite, J.L., Arens, Y., Klavans, J.L., Bourne, W., and Saroz, D. Data acquisition and integration in the DGRC’s data collection project. In Proceedings of the 1st National Conference on Digital Government. (Los Angeles, Calif., 2001).
7. Klavans, J.L., Davis, P.T., and Popper, S. Building large ontologies using Web-crawling and glossary analysis techniques. In Proceedings of the 2nd National Conference on Digital Government. (Los Angeles, Calif., 2002).
8. Knight, K. and Luk, S.K. Building a large-scale knowledge base for machine translation. In Proceedings of the 11th National Conference on Artificial Intelligence (1994).

Eduard Hovy ([email protected]) is a research fellow in the Information Sciences Institute at the University of Southern California, Marina del Rey.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2003 ACM 0002-0782/03/0100 $5.00