ONTOLOGY-BASED WEB INFORMATICS SYSTEM By WENYANG HU

ONTOLOGY-BASED WEB INFORMATICS SYSTEM
By
WENYANG HU
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ENGINEERING
UNIVERSITY OF FLORIDA
2002
Copyright 2002
by
Wenyang Hu
ACKNOWLEDGMENTS
I express my sincere gratitude to my advisor, Prof. Limin Fu, for giving me the
opportunity to work on this challenging topic and for providing continuous guidance
during my thesis writing. I am thankful to Prof. Joachim Hammer and Prof. Jonathan Liu
for agreeing to be on my supervisory committee.
I would like to take this opportunity to thank my parents, my husband and my
son, for their continued and encouraging support throughout my period of study and
especially in this endeavor.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ................................................................................................. iii
LIST OF FIGURES ........................................................................................................... vi
ABSTRACT...................................................................................................................... vii
CHAPTER
1 INTRODUCTION ............................................................................................................1
1.1 Background and Motivation ..................................................................................... 1
1.2 Organization of This Thesis...................................................................................... 3
2 WHY DEVELOP AN ONTOLOGY? ..............................................................................4
2.1 The Role of Ontology in Information Retrieval ........................................................... 4
2.1.1 Basic Concepts of an Ontology....................................................................... 4
2.1.2 Refined Definition and Categories of Ontologies........................................... 5
2.1.2.1 Redefinition of Ontology ......................................................................5
2.1.2.2 Different kinds of ontologies .................................................................6
2.2 Current Ontology Applications................................................................................. 7
2.2.1 Reusability ...................................................................................................... 7
2.2.2 Search.............................................................................................................. 8
2.2.3 Specification and Knowledge Acquisition...................................................... 9
2.2.4 Reliability and Maintenance ........................................................................... 9
2.3 Building Ontologies – Ontology Editors, Languages, and Platforms..................... 10
2.3.1 Ontobroker .................................................................................................... 10
2.3.2 DAML+OIL .................................................................................................. 11
2.3.3 SHOE ............................................................................................................ 12
2.3.4 OntoEdit ........................................................................................................ 12
2.3.5 Protégé-2000 ................................................................................................. 12
3 BUILDING A MEDICAL SUBJECT HEADING ONTOLOGY WITH PROTEGE-2000 .....14
3.1 Overview of Protégé-2000...................................................................................... 14
3.2 Protégé-2000 Ontology Model ............................................................................... 15
3.2.1 Creating and Editing Classes ........................................................................ 16
3.2.2 Creating and Editing Instances ..................................................................... 17
3.2.3 Storage Models and Persistence.................................................................... 18
3.3 Background of Medical Subject Heading ............................................................... 19
3.4 Motivation for Creating an Ontology in the Medical Subject Heading Domain ...... 21
3.5 From MeSH Thesaurus to MeSH Ontology ........................................................... 22
3.5.1 Ontology and Structure-based Search........................................................... 23
3.5.2 Building MeSH Ontology ............................................................................. 25
3.5.3 Ontology Data Import and Export ................................................................ 29
Importing existing data of MeSH thesaurus to ontology .................................29
Exporting MeSH ontology to an XML document ...........................................29
4 ONTOLOGY, XML AND XQUERY ............................................................................33
4.1 Extensible Markup Language XML ....................................................................... 33
4.2 Ontologies as Conceptual Models for Generating XML Documents ....................... 34
4.2.1 XML Itself Is Not Enough ............................................................................ 34
4.2.2 Add Ontology as Conceptual Model............................................................. 35
4.3 XQuery.................................................................................................................... 36
4.3.1 XML Query Language XQuery .................................................................... 36
4.3.2 XQuery Implementation Quip....................................................................... 38
5 IMPLEMENTATION OF AN ONTOLOGY-BASED WEB APPLICATION SYSTEM .....40
5.1 Building Web Informatics System.......................................................................... 40
5.1.1 Ontology-based Web Application................................................................. 40
5.1.2 Using JSP ...................................................................................................... 41
5.1.3 System Architecture ...................................................................................... 43
5.2 Query the Ontology................................................................................................. 45
5.2.1 Query from Direct Typing ............................................................................ 45
5.2.2 Upload Existing Local Query Files............................................................... 46
5.2.3 Query Ontology or Build Up Ontology Objects with the Ontology Wizard 47
5.2.4 Choose the Query File and Download the Result ......................................... 48
6 CONCLUSIONS AND FUTURE WORK .....................................................................50
LIST OF REFERENCES...................................................................................................52
BIOGRAPHICAL SKETCH .............................................................................................55
LIST OF FIGURES
page
Figure
3-1 Protégé architecture .........................................................................................................15
3-2 Protégé-2000: Class definition in MeSH ontology .........................................................16
3-3 Acquire instances in Protégé-2000..................................................................................17
3-5 Medical subject heading hierarchy..................................................................................21
3-6 MeSH thesaurus browser.................................................................................................24
3-7 MeSH ontology class hierarchy.......................................................................................26
3-8 Detailed structures and relationships of DescriptorRecord, Concept and Terms in
MeSH ontology .........................................................................................................27
3-9 Ontology exported to an XML document .......................................................................32
5-1 Web informatics system architecture ..............................................................................42
5-2 Interface of server program and Quip execution .............................................................44
5-2 (continued) Interface of server program and Quip execution ..........................................45
5-3 Online links point to ontology structures .........................................................................46
5-4 Query the ontology by typing or uploading local query files..........................................47
5-5 Query the ontology under the help of ontology query wizard.........................................48
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Engineering
ONTOLOGY-BASED WEB INFORMATICS SYSTEM
By
Wenyang Hu
May 2002
Chair: Prof. Limin Fu
Department: Computer and Information Science and Engineering
In the context of knowledge sharing, the term ontology means a specification of a
conceptualization. On the application side, the main emphasis is given to the use of
ontologies for knowledge sharing and reuse, electronic commerce, and enterprise
integration. Ontologies also make it possible to add more semantics to web pages to make
information extraction precise and efficient. In this project, I applied a popular ontology
editor, Protégé-2000, to build an ontology of the field Medical Subject Heading (MeSH),
using concepts, relationships, and data from that field to design ontology classes, slots,
facets, proper constraints, and objects. The ontology was later exported to an XML
document. A web informatics system was implemented based on that ontology. A
carefully designed interface makes it possible for the web application system to invoke an
XQuery engine and deliver query output to the end user. The ontology query wizard guides
users with ontology information to form legal and proper queries on the ontology data or
construct particular ontology objects for users. Information retrieval on the more
structured web systems, as with the one constructed in this project, could provide answers
to sophisticated knowledge-based queries.
CHAPTER 1
INTRODUCTION
1.1 Background and Motivation
Imagine that you want to buy the book "The Little Prince" by Antoine de Saint-Exupery online. Searching existing web indices in your favorite search engines yields
thousands of pages. Only a few lead to places where you can actually purchase the
book, while the others lead to a variety of fan sites and baby product sites.
This scenario is common to many people on the World Wide Web. A major
problem with this kind of search (called keyword-based search) on the web today is that
data available on the web have little semantic organization beyond a simple structural
arrangement of text, keywords, titles, or abstracts. As the web expands exponentially in
size, the lack of organization makes it difficult to retrieve useful information out of the
web. Researchers are trying to find efficient ways to let the web answer queries such as:
Find all web pages where
X isA book,
Y isA person,
Title(X) = "The Little Prince",
Name(Y) = "Antoine de Saint-Exupery"
Published-by(X, Y)
An ordinary HTML page is not appropriate for such queries if no semantics has
been added. Previous information retrieval approaches include keyword-based search and
field-based search, both of which have disadvantages. A keyword-based search suffers
because it associates the semantic meaning of web pages with actual lexical or syntactic
content. Hundreds of irrelevant resulting pages are thus unavoidable. The field-based approach describes an item not by a set of keywords, but by a set of attribute-value
pairs. But usually for a specific application domain, this type of search is supported only
by some specially designed browsers. Apart from simple linkage, none of these
approaches allows for inferences about relationships between web pages. Sophisticated
queries are therefore clearly out of reach.
The solution to these problems is to add semantics to HTML pages, but
terms and definitions often differ between groups. Sometimes different groups use
identical terms with different meanings. So there is a need to share the meaning of terms
in a given domain. Achieving a shared understanding is accomplished by agreeing on an
appropriate way to conceptualize the domain and then making that conceptualization explicit in a language.
The result is an ontology, which can be widely applied to a variety of contexts for various
purposes.
By providing a shared and common understanding of a domain, ontologies can be
communicated across people and application systems for facilitating knowledge sharing
and exchange, and they build the conceptual backbone of the Semantic Web. Only in the last few years has interest in ontology research and applications increased significantly.
The Medical Subject Heading (MeSH) is a thesaurus that has been used and
updated for the past 40 years. It has been a success for indexing and searching journal
articles in medical subject databases, books, and so forth. However, MeSH, being a thesaurus, still has some limitations, and it could be updated and extended to achieve better
performance. I chose this thesaurus as a starting point to implement an ontology. The
ontology of MeSH belongs to the same domain, imports all useful structures and data
from the MeSH thesaurus and adds deeper semantics and ontology validation constraints.
Through a web informatics system based on MeSH ontology, users are finally given more
flexibility to form sophisticated queries and obtain satisfactory results.
1.2 Organization of This Thesis
Chapter 1 provides an introduction and overview. Chapter 2 discusses some concepts
related to ontology and its applications, such as definitions, current ontology application
areas, and why people want to develop domain ontologies. The first part of my
implementation is to use Protégé-2000 as a tool to create the ontology. In Chapter 3, I
introduce key features of Protégé-2000, provide background information about the MeSH
thesaurus, and state how and why I use Protégé-2000 to develop a medical subject
heading ontology. The MeSH ontology was created and exported to an XML document in
my project implementation. In Chapter 4, I present some of the basic ideas about
XML and its query language XQuery, describe the relationship between ontology and XML,
and explain why the XML format was chosen to save the ontology. Details of the second part
of my implementation, that is, how I designed a web informatics system based on MeSH
ontology, are introduced in Chapter 5. In Chapter 6, I will mention some of the flaws still
existing in this system and possible future improvements.
CHAPTER 2
WHY DEVELOP AN ONTOLOGY?
2.1 The Role of Ontology in Information Retrieval
2.1.1 Basic Concepts of an Ontology
An ontology is a shared and common understanding of some domain that can be
communicated across users and computers. It can be defined as “a formal, explicit
specification of a shared conceptualization” (Gruber, 1993). Conceptualization refers to
an abstract model of some phenomena in the world. Explicit means that the type of
concepts and relationships between them are explicitly defined. Shared reflects the fact
that an ontology captures consensual knowledge, which is accepted by a group of people.
Formal refers to the fact that an ontology should be machine readable and accessible.
Typically, an ontology is constructed in a collaborative effort of domain experts, end users, and IT specialists.
Research on ontology is becoming increasingly widespread in the computer
science community. The term ontology is actually borrowed from philosophy. Though
this term has been rather confined to the philosophical area in the past, it is now gaining a
specific role in many diverse fields.
In the research field of AI, an ontology refers to an engineering artifact. It is
constituted by a specific vocabulary used to describe a certain reality, plus a set of
assumptions regarding the intended meaning of the vocabulary words. This set of
assumptions usually has the form of a first-order logical theory in which vocabulary
words appear as unary (concept) or binary (relationship) predicate names.
In the context of knowledge sharing, ontology means a specification of a
conceptualization. That is, an ontology is a description of the concepts and relationships
that can exist for an agent or a community of agents.
In the simplest case, an ontology describes a hierarchy of concepts related by
subsumption relationships. In more sophisticated cases, suitable axioms are added in
order to express other relationships between concepts and to constrain their intended
interpretations.
2.1.2 Refined Definition and Categories of Ontologies
2.1.2.1 Redefinition of ontology
The role of an ontology can be considered as a set of logical axioms designed to
account for the intended meaning of a vocabulary. Given a language L with an
ontological commitment K, an ontology for L is a set of axioms designed in a way such
that the set of its models approximates as best as possible the set of intended models of L,
according to K. The following definition of ontology refines Gruber’s definition by
making clear the difference between an ontology and a conceptualization:
An ontology is a logical theory accounting for the intended meaning of a formal
vocabulary, i.e., its ontological commitment to a particular conceptualization of
the world. The intended models of a logical language using such a vocabulary are
constrained by its ontological commitment. An ontology indirectly reflects this
commitment (and the underlying conceptualization) by approximating these
intended models. (Guarino, "Formal Ontology and Information Systems", page 5,
1998)
An ontology is language-dependent, while a conceptualization is language-independent. It is essential to separate the concept of an ontology from that of a conceptualization
when addressing the issues related to ontology sharing, fusion, and translation.
2.1.2.2 Different kinds of ontologies
Ontologies can be classified according to their accuracy in characterizing the
conceptualization to which they commit. An ontology can get closer to a
conceptualization in several possible ways, such as by developing a richer axiomatization
and adopting a richer domain and/or a richer set of relevant conceptual relations. Those that get closer to a conceptualization are called "fine-grained" ontologies, as opposed to
“coarse” ontologies.
A tradeoff exists between a coarse and a fine-grained ontology committing to the
same conceptualization. A fine-grained ontology may be used to establish a consensus about sharing a vocabulary because it comes closer to specifying the intended meaning of that vocabulary. But it may be hard to develop and to reason with, owing to the number of axioms and the expressiveness of the language adopted. Such detailed ontologies are usually built to be consulted only from time to time, rather than continuously at run time. A coarse ontology, on the other hand, may consist of a minimal set of axioms written in a minimally expressive language. It is intended to be shared among users who already agree on the underlying conceptualization and to support only a limited set of specific services, for example, the system's core functionalities.
According to their level of generality, ontologies can also be categorized into top-level ontologies, domain and task ontologies, and application ontologies. Top-level ontologies describe very general concepts, independent of a particular problem or domain. Domain ontologies describe the vocabulary related to a generic domain (such as the medical subject heading domain on which we focus). Task ontologies describe a generic task or activity, such as diagnosing or advertising. Domain and task ontologies inherit and specialize the terms introduced in the top-level ontology. Application
ontologies describe concepts depending on both a particular domain and task. These
concepts often correspond to roles played by domain entities while performing a certain
task. It seems therefore quite obvious and reasonable to have unified top-level ontologies
for large communities of users (Guarino, 1998).
An ontology can be regarded as a particular kind of knowledge base, describing facts assumed to be always true by a group of users in a certain domain, by virtue of the agreed-upon meaning of the vocabulary used. It contains state-independent information, while the "core" knowledge base, on the other hand, contains state-dependent information. So an ontology is a "kind of" knowledge base, but these two
concepts are different.
2.2 Current Ontology Applications
The research and application communities in which ontology has been useful
include software developers, standards organizations, and database communities. They all need to overcome interoperability difficulties brought about by disparate vocabularies, representations, and various tools in their respective contexts.
Fundamentally, ontologies are used to improve communication between humans
or computers. The current applications of ontology can be grouped into the following
areas:
2.2.1 Reusability
Many researchers have been designing ontologies for the purpose of enabling
knowledge sharing and reuse. The ontology is the basis for a formal representation of the
important concepts, processes, and their interrelationships in the domain of interest. This
formal representation may be a reusable and shared component in a system; it may also be translated between different modeling systems and used as an interchange format. The
scenario can be, for example, an author creates an ontology, which different application
developers agree to use. Each pair of translators, for a given application, in effect, defines
an application interface that can be used to read/write data from/to the ontology.
EcoCyc (Karp et al., 1996), for example, is a commercial product that uses a shared ontology to make possible access to various heterogeneous databases in the field of molecular biology. In Stanford Medical Informatics' (SMI) Protégé-2000, various data
formats (XML, Ontolingua, RDF) can be used to import data to the ontology. Ontology
can also be exported to different data formats (RDF, XML, OKBC, relational DBMS) so
as to let various communities share the ontology data in their own application formats.
2.2.2 Search
In recent years, there have been numerous papers and reports announcing
attempts and some successes at applying ontologies, especially in the area of search and
information retrieval. An ontology can be used as metadata, serving as an index into a
repository of systematically ordered relevant concepts in a given domain. A consensual
ontology can assist knowledge workers in identifying concepts in which they are
interested by providing various users with a clean common vocabulary and clearly
defined relationships. The motivation is to improve precision (make sophisticated queries
possible), as well as reduce the overall amount of time spent on searching.
Supporting technologies in this application area include ontology browsers, search
engines, automatic tagging tools, automatic classification of documents, metadata
languages such as XML, and so forth. One variation is to assist in query formulation. An ontology can drive the user interface for creating and refining queries, which is the case in my project: I will show how a sophisticated query becomes possible based on ontology information. Yahoo is another example of a large web ontology, a taxonomy categorizing web sites to facilitate search efficiency.
2.2.3 Specification and Knowledge Acquisition
The ontology can assist the process of identifying requirements and defining a
specification for an IT system. The basic idea of this scenario is to let an ontology model
the application domain, and let it provide a vocabulary for specifying the requirements
for one or multiple target applications. When building knowledge-based systems, using
an existing ontology as the starting point and basis for guiding knowledge acquisition
may also increase the speed and reliability.
A typical example of this scenario is Protégé-2000, which is used to help automate the process of knowledge acquisition and software development. By generating knowledge-acquisition tools from an ontology automatically, Protégé-2000 ensures that the acquisition user interface connects tightly to the ontology. This scenario facilitates the entering of knowledge and also ensures that validation of the ontology can be carried out at the most appropriate time.
2.2.4 Reliability and Maintenance
Using an ontology in system development or as part of the end application can make maintenance easier in various ways. First of all, the formal representation characteristic of an ontology makes automatic software consistency checks possible, and such automatic checks make the system more reliable. Secondly, building
software using explicit ontology data helps to improve the documentation, which reduces
the cost of maintenance.
2.3 Building Ontologies – Ontology Editors, Languages, and Platforms
Ontologies have become common on the World Wide Web. As the importance of ontology in various application fields was recognized, a number of languages for defining ontologies on the web, such as RDF(S) and DAML+OIL, were developed. Also being developed are ontology editors and platforms that help to create ontologies in the most reliable and efficient ways. In this section, I briefly introduce some of the popular tools.
2.3.1 Ontobroker
Ontobroker processes information sources and content descriptions in the HTML,
XML, and RDF format and provides information retrieval, query answering, and
maintenance support. Use of ontologies to explicitly describe background knowledge is
central in Ontobroker. A broker architecture is provided in Ontobroker with four
elements: a query interface, an info agent, an inference engine, and a database manager.
It is an integrated system for collecting knowledge from the web using annotations,
creating and querying ontologies, and deriving additional implicit factual knowledge
automatically.
• The query engine receives queries and answers them by checking the content of
the databases that were filled by the info and inference agents.
• The info agent is responsible for collecting factual knowledge from the web
using various styles of meta annotations and direct annotations. In this part,
annotation can be done manually using an RDF format or using a small extension
of HTML called HTMLA to integrate semantic annotations in HTML documents.
• The inference engine uses facts and ontologies to derive additional factual
knowledge that is only provided implicitly. It frees knowledge providers from the
burden of specifying each fact explicitly.
• The database manager is the backbone of the entire system. It receives facts
from the info agent, exchanges facts as input and output with the inference agent,
and provides facts to the query engine.
A representation language, which is based on Frame logic (Kifer et al., 1995), is
used to formulate an ontology in Ontobroker. Basically the language provides classes,
attributes with domain and range definitions, is-a hierarchies with a set inclusion of
subclasses and multiple attribute inheritance. It also provides logical axioms that can be
used to further characterize relationships between elements of an ontology and its
instances.
2.3.2 DAML+OIL
In order to support the use of ontologies, a number of representational formats
have been proposed, including the Resource Description Framework (RDF) Schema, the
Ontology Interchange Language (OIL) and the Darpa Agent Markup Language (DAML).
DAML+OIL, the language now being proposed as a W3C standard for ontological and
metadata representation, is formed by bringing the last two languages together.
DAML+OIL is written in RDF, which in turn, is written in XML, using XML
namespaces and URIs. Yet it is a language for expressing far more sophisticated
classifications and properties of resources than RDFS. It draws heavily on the original
OIL specification, but has some key differences. OIL is a proposal for a web-based
representation and inference layer for ontologies, including Frame-based representations,
Description logics and Web-based languages. It is compatible with RDF, and presents a
layered approach to a standard ontology language. Each additional layer adds some
functionality and complexity to the previous layer. One example of the difference between DAML+OIL and OIL is that OIL has explicit "OIL" instances, whereas DAML+OIL relies on RDF for instances. The emphasis of the March 2001 edition of DAML+OIL is on the work done to support W3C XML Schema.
2.3.3 SHOE
Compared to DAML+OIL, SHOE is much simpler and less expressive. SHOE is an SGML/XML, HTML-based knowledge representation language which can be regarded as a superset of HTML, as it adds the tags necessary to embed arbitrary semantic data into web pages. The general steps to add semantics to a web page using SHOE are as follows: 1) First, define an ontology describing valid classifications of web objects and valid relationships between web objects. This ontology may also borrow from other ontologies. 2) Annotate HTML pages to describe themselves, other pages, or subsections of themselves, with attributes as described in one or more ontologies.
2.3.4 OntoEdit
OntoEdit is a development environment for design, adaptation, and import of
knowledge models for application systems, using GUI to represent views on concepts,
concepts hierarchy, relations, and axioms. It is a tool that enables inspecting, browsing,
codifying, and modifying ontologies, and it supports in this way an ontology maintenance
task. Modeling ontologies using OntoEdit is done as independently as possible of any concrete representation language. The conceptual model of an ontology is
internally stored using a powerful ontology model, which can be mapped onto different,
concrete representation languages.
2.3.5 Protégé-2000
Protégé-2000 is similar to OntoEdit but is more flexible. It is a platform which
allows the user to construct a domain ontology, customize knowledge-acquisition forms,
and enter domain knowledge. Its flexibility comes from the many plug-ins and widgets available for its GUI. Each plug-in can extend either the system's capability or its functionality. Extended with graphical widgets, Protégé-2000 can have tables, diagrams, and animation components to access other knowledge-based systems and embedded applications; Protégé-2000 can also be regarded as a library that other applications can use to access and display knowledge bases. We will go into detail about Protégé-2000 in
Chapter 3.
There are still a lot of other tools, for example, OntoLingua, OntoSeek, OntoWeb
and so forth. There is no one correct way to model a domain; there are always viable
alternatives. The best solution almost always depends on the application that one has in
mind and the extensions that one anticipates. Among several viable alternatives, we will
need to determine which one would work better for the projected task, be more intuitive,
more extensible, and more maintainable.
CHAPTER 3
BUILDING A MEDICAL SUBJECT HEADING ONTOLOGY WITH PROTEGE-2000
The Knowledge Modeling Group (KMG) at Stanford University has developed a
variety of knowledge-modeling tools as part of the Protégé project for the past 15 years.
The current version of the ontology edit tool Protégé is an extensible, open-source
application that is now available as free software under the open-source Mozilla Public
License and compatible with a wide range of knowledge representation languages. Some
basic features of Protégé were introduced in Chapter 2. More detailed technical background on Protégé, especially how and why I used Protégé-2000 to build an ontology
in the Medical Subject Heading domain, will be covered in this chapter.
3.1 Overview of Protégé-2000
Protégé-2000 is a tool that allows the user to
1. Construct a domain ontology by defining classes and class hierarchy, slots
and slot-value restrictions, relationships between classes, and properties of
these relationships.
2. Customize knowledge-acquisition forms by generating a default form for
acquiring instances, based on the types of the slots that the user specified.
3. Enter domain knowledge. You can use the instances tab in Protégé, which is
a knowledge-acquisition tool to acquire instances of the classes defined in the
ontology.
As shown in Figure 3-1, the Protégé-2000 system architecture includes three parts: the core Protégé framework, the widgets and plug-ins that extend system functionality, and the ontology storage models.
[Figure 3-1 depicts the three parts as connected components. The core Protégé framework is responsible for maintaining the in-memory ontology through its API and for managing Protégé name spaces. Widgets know how to display certain value types, while plug-ins extend system functionality. The storage models save the ontology to more persistent storage; three models are currently available in Protégé: RDF Schema, a JDBC relational database, and the standard text format.]
Figure 3-1 Protégé architecture
3.2 Protégé-2000 Ontology Model
Protégé-2000 is a frame-based system. The main elements of the Protégé ontology
model are frames representing
• Classes, corresponding to concepts in the domain
• Instances of classes
• Slots, the properties of classes and instances
• Facets, the properties of slots
Classes are organized into a "subclass-of" hierarchy with multiple inheritance.
Every instance of a class A is also an instance of any of the super-classes of A. Classes
themselves can be instances of other classes. Slots are first-class objects in Protégé-2000.
Slots are attached to classes and instances either as template slots or as own slots.
Template slots describe the properties of instances of that class. Value-type restrictions
can be defined for template slots.
Template slots for a class become own slots when
instances of that class are created.
3.2.1 Creating and Editing Classes
Protégé-2000 simplifies the task of developing an appropriate class hierarchy for
a given application. The users can easily create or browse class hierarchy and bind slots
to classes in the Protégé-2000 ontology editor.
Figure 3-2 Protégé-2000: Class definition in MeSH ontology
In the above figure, the left-hand pane visualizes the class hierarchy and the right-hand pane summarizes the slots that are attached to the highlighted class. Each slot has
cardinality (single or multiple) and value type defining the types of values. Additional
restrictions on the values can be specified using facets, according to the type of values
defined.
3.2.2 Creating and Editing Instances
The Instances tab in Protégé-2000 provides the interface for creating instances of
classes. Protégé-2000 makes a distinction between classes and instances. Classes
correspond to definitions of concepts (just like schemas in a database) and instances
correspond to specific examples of a concept (just like tuples in a database). In addition,
slots are a third type of modeling abstraction. They are first-class objects that correspond
to attributes of either a class or an instance. A forms interface is used in Protégé-2000 for
acquiring the slot values for instances. Protégé-2000 automatically generates the layout
and content of the instance forms based on the values and cardinalities of slots for the
class. The user can then customize the forms using the Form tab (Fig 3-3).
Figure 3-3 Acquire instances in Protégé-2000
The complete editing cycle is therefore as follows: define a concept, lay out the
associated form, and use the form to acquire instances.
3.2.3 Storage Models and Persistence
The core framework of Protégé-2000, as shown in Figure 3-1, interacts with the ontology storage via a published (and formally defined) API. In this way the widgets and user interface are decoupled from the actual persistent storage mechanism,
thereby enabling Protégé-2000 to save a given ontology to a wide variety of formats. For
example, an RDF storage layer can import RDF files to Protégé, and Protégé ontology
can be stored in RDF. The RDF storage layer also performs the necessary interpretation
and translation. Counting the target formats made possible by some of the Protégé plug-ins, the system can export the ontology and its content knowledge to OKBC, XML, RDF, Ontolingua, a JDBC database, and so forth.
For the special case of exporting ontology to an XML file (we shall use that in our
project), the transfer would include the following rules as shown in Figure 3-4
• Un-referenced instances become top-level elements (cyclic references are handled)
• Classes and slots become tag names
• Objects that are referenced more than once are shared and reused with id or idref
In the first part of my implementation, I created an ontology of the Medical
Subject Heading using Protégé-2000. I will introduce some of the related background
information in that field in the next section.
Figure 3-4 Exporting ontology to XML
3.3 Background of Medical Subject Heading
The MeSH thesaurus has been produced by the National Library of Medicine
(NLM) since 1960. Thesauri, also known as classification structures, controlled vocabularies, and ordering systems, include carefully constructed sets of terms and relationships among the terms. The relationships are usually represented as "broader-than", "narrower-than", and "related" links.
The MeSH thesaurus is NLM’s controlled vocabulary for subject indexing and
searching of journal articles in MEDLINE, books, journal titles, and non-print materials
in NLM’s catalog. Translated into many different languages, MeSH is widely used in
indexing and cataloging by libraries and other institutions around the world. Forty years
of heavy use have led to a significant expansion in the MeSH content and to considerable
evolution in its structure. It is one of the most highly sophisticated thesauri in existence
today. The selection and assignment of thesaurus terms are crucial to an information retrieval system, and MeSH is quite successful in this regard.
MeSH applications
1. It is a vital component of NLM’s computer-based information retrieval system.
2. The MeSH thesaurus is used by NLM for indexing articles from 4,300 of the
world’s leading biomedical journals for the MEDLINE database and for other
NLM-produced databases which include cataloging of the books, documents, and
audiovisuals acquired by the library. Each bibliographic reference is associated
with a set of MeSH terms to describe the content of the item. A retrieval query
can then be formed using MeSH terms to find items on a desired topic.
3. MeSH is the source of the headings used as index terms in NLM’s Index Medicus
and is fundamental to the organization of this monthly guide to articles from more
than 3,400 international journals. [MeSH fact sheet]
An example of a partial MeSH hierarchy is represented in Figure 3-5:
1. Anatomy [A]
   o Body Regions [A01] +
   o Musculoskeletal System [A02] +
   o Digestive System [A03] +
     • Biliary Tract [A03.159] +
     • Esophagus [A03.365] +
     • Gastrointestinal System [A03.492] +
     • Liver [A03.620] +
     • Pancreas [A03.734] +
   o Respiratory System [A04] +
   o Urogenital System [A05] +
   o Animal Structures [A13] +
   o Stomatognathic System [A14] +
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
Figure 3-5 Medical subject heading hierarchy.
3.4 Motivation for Creating an Ontology in the Medical Subject Heading Domain
From the introduction in Chapter 1, we can see that ontologies are used as a solution to sophisticated queries and other issues related to the Semantic Web in many application domains, mainly because of their ability to specify semantics and relations explicitly and to express them in a computer-understandable language. Conventional knowledge organization tools such as the MeSH thesaurus resemble ontologies in that they define concepts and relationships in a systematic manner (MeSH also has hierarchical, associative, and equivalence relationships), but they are less expressive than ontologies when it comes to machine interpretability. The major difference between the two models is the value an ontology adds through deeper semantics in describing objects, both conceptually and relationally.
The MeSH thesaurus is a collection of terms and knowledge that is widely used in medical subject indexing, cataloging, and querying. This successful application area would be a good place to apply ontology techniques, letting domain knowledge be shared and reused more efficiently by people or related software agents and leading to more powerful queries. Huge amounts of useful data coming from the thesaurus can be reused in efficient ways. Ontologies can also help users at run time to build their own knowledge bases or ontology objects in the specific field, even though the user may lack familiarity with some of the special vocabulary used in the thesaurus.
3.5 From MeSH Thesaurus to MeSH Ontology
An ontology includes machine-interpretable definitions of the basic concepts in a domain and of the relations among them. Recall why someone would want to develop an ontology. Some of the reasons include the following:
• To share common understanding of the structure of information among people or software agents
• To enable reuse of domain knowledge
• To make domain assumptions explicit
• To analyze domain knowledge
The MeSH thesaurus is great in many respects, but if we want to create a knowledge-rich description of objects, such as the Semantic Web requires, the thesaurus turns out to provide only part of the knowledge needed. The goal of the Semantic Web
initiative is to annotate large amounts of information resources with knowledge-rich
metadata. Such annotations would achieve much better performance based on a rich
metadata structure in connection with an ontology. In the ontology construction process,
additional knowledge was added to the basic hierarchical structure of the concepts
derived from the thesaurus.
3.5.1 Ontology and Structure-based Search
We are familiar with keyword-based search without any closed vocabulary (the kind we use in Yahoo and Google), and suffering through huge numbers of irrelevant query results is not surprising to most of us. Usually, in order to circumvent the problems of ambiguity in keyword searching, search descriptions should be limited to a fixed set of predefined structures and a closed vocabulary. Thus we come to another solution, namely the field-based approach, which describes or retrieves an item not by a set of keywords but by a
set of attribute-value pairs.
Typically, a metadata system is predefined and describes the elements (fields),
giving some indication of what values can be assigned to a particular field. Many of the field-based initiatives recommend the use of closed vocabularies but do not associate particular parts of a thesaurus with a field. As a consequence, the only support that a human indexer has is the thesaurus browser. The MeSH browser, for example, presents the thesaurus to users, who are then restricted to this specific browser to obtain all the information they need. No other way has been offered to users to create a sub-thesaurus for their own limited purposes, and there has been no flexibility in choosing which fields to use for searching beyond the fixed ones in the browser. Figure 3-6 shows the MeSH browser screen.
Figure 3-6 MeSH thesaurus browser
Compared with the flat structure of attribute-value pairs essentially used in a field-based search, the "structure-based" approach allows a more complex description
involving relations, introducing a large degree of complexity in the indexing process.
Considering the fact that relational descriptions can vary widely between different
objects, we need to find a way to solve the problem of complexity of the indexing and
annotating process. One of the possible solutions is to use contextual information to
constrain the relations and terms presented to the indexer. Suppose the structured
descriptions are created by a human annotator using specialized tools. How can a human
be supported during the annotating process? From previous discussions, we know that a related ontology is the best answer in this situation.
3.5.2 Building MeSH Ontology
The MeSH thesaurus is useful in field- and structure-based approaches. We use it as a
basis for building our ontology. Transforming this thesaurus to an ontology makes it
possible for us to augment it with more semantic information. A number of concepts with
additional slots and fillers can be added to the previous thesaurus structure. Another step
is to add information about the relationships between possible values of fields and nodes
in the ontology. In Protégé-2000, each slot has related facets that specify the constraints
applied to that slot, such as "Does it have multiple values?" or "Is it restricted to a single value?"
We can also find in Protégé-2000 a Protégé Axiom Language (Pal) constraint plug-in. It
could be utilized to validate the whole ontology during ontology construction through restriction and relation-constraint checks applied to the ontology data.
The mapping from the MeSH thesaurus structure to the Medical Subject Heading
ontology works in such a way that the previous databases were built into the ontology as different classes, attributes in specific databases became slots, and constraints became facet restrictions plus an ontology validation mechanism, the "Pal" constraints. Of course, the enormous amount of data in MeSH also helped me a lot when building the ontology.
1. The full MeSH "part/whole" hierarchy was converted into the ontology as a structural hierarchy, where each concept has a class definition corresponding to the main term in MeSH. The MeSH thesaurus developers had certain considerations in mind when defining the MeSH "class/subclass" hierarchy; I followed their definitions here so that articles in MeSH are indexed with the most specific headings available. Hierarchical information about class/subclass instances can be traced through the different levels of MeSH tree numbers.
Both of the hierarchical relationships in MeSH are represented at the level
of the descriptor instead of at the level of the concept. The general hierarchical
structure of MeSH ontology is presented in the following figures.
[Figure 3-7 shows the top-level classes of the MeSH ontology: DescriptorRecord, Concept, Term, EntryCombination, and DescriptorRec (Pharmacological Action), together with the reference classes DescriptorRef, ConceptRef, QualifierRef, and DesQuaCombination.]
Figure 3-7 MeSH ontology class hierarchy
[Figure 3-8 lists the classes MeSHOntology, DescriptorRecord, DescriptorReference, Concept, ConceptReference, and Term together with their slots, among them Annotation, DescriptorClass, DescriptorRef, DateCreated, DateEstablished, EntryCombinationList, PublicMeshNote, OnlineNote, ScopeNote, SeeRelatedList, TermList, PreviousIndexingList, QualifierReference, RelatedRegistryNumberList, CASN1Name, TreeNumberList, PharmacologicalActionList, HistoryNote, PreferredConcept, ConsiderAlso, ConceptList, and AllowableQualifierList; is-instanceOf links connect the record classes to their reference classes.]
Figure 3-8 Detailed structures and relationships of DescriptorRecord, Concept and Terms in MeSH ontology.
2. The next step was to augment a number of concepts with additional slots and fillers. For example, "DescriptorRecord" is a major class in the MeSH ontology which records most of the medical heading information. I added a slot named "DescriptorRef", which is defined to be an instance of the DescriptorReference class. The DescriptorReference class, in turn, includes some unique information about the DescriptorRecord, i.e., the DescriptorUI (unique id) and the DescriptorName. The DescriptorRecord class also includes multiple instances of QualifierRef, which holds the related Qualifier information, that is, the QualifierUI (unique id), the QualifierName, and the Abbreviation of that Qualifier. This makes the relationships between DescriptorRecord and DescriptorRef, and between DescriptorRecord and QualifierRef, explicit (see Figure 3-8).
3. The third step was to add a data validation mechanism. Validation of the ontology
data guarantees the data accurately reflect the real world as long as the constraints
are well defined.
The first level of data validation is provided by the notion of slot value-types.
When a slot is attached to a class, it can be given a value-type (which could be
one of: Any, Boolean, Class, Instance, String, Symbol). Slots also have an
associated cardinality, either single or multiple. Protégé facets let the users define
all these first-level restrictions on slots. Protégé supports about a dozen facets that get
exposed in the user interface on slot forms.
It is not always possible for the user to keep the ontology data consistent while editing; therefore, the user decides when to check the constraints.
"Pal" is the Protégé Axiom Language, which helps to enforce the semantic properties of ontology data encoded in Protégé; it uses the Knowledge Interchange Format (KIF) connectives and the KIF syntax. A sample Pal axiom would look like this:
(defrange ?X1 :FRAME DescriptorReference)
(defrange ?X2 :FRAME DescriptorReference)
(forall ?X1 (not (exists ?X2
  (and (not (= ?X1 ?X2))
       (= (DescriptorUI ?X1) (DescriptorUI ?X2))))))
This constraint means that no two distinct DescriptorReference instances, and hence no two DescriptorRecords, have identical ids.
3.5.3 Ontology Data Import and Export
Importing existing data of MeSH thesaurus to ontology
There are more than 19,000 main headings in the MeSH thesaurus. In addition to
these headings, there are 103,500 headings called Supplementary Concept Records within
a separate chemical thesaurus. The NLM updates thesaurus data at the beginning of each
month. So it would greatly benefit the development of our system if we could reuse the existing data from the MeSH databases.
The task of importing data from the MeSH thesaurus into the MeSH ontology is essentially a kind of metadata transformation that transforms one model (the MeSH thesaurus) into a rendering format (the Protégé data input format). The release of the MeSH thesaurus in various formats made this process easier and reduced the development time spent on this issue. Many interoperability tools can be used to carry out this task, for example XSLT. Based on the MeSH thesaurus structure, and also on what I learned from the Protégé developers about its data import requirements, I used XQuery (discussed in Chapter 4) to develop a program which made the existing data of the MeSH thesaurus reusable by the MeSH ontology. The program read the data from the thesaurus according to its structure and matched those data to the format of the MeSH ontology, based on the class, slot, and facet definitions. The thesaurus data file was then imported successfully into the ontology under validation of the facet and Pal restrictions.
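As a rough illustration of this transformation step, a minimal XQuery in the spirit of that program is sketched below. The input file name (desc2002.xml), the element paths into the MeSH XML release, and the output element names produced for the Protégé import are assumptions made here for illustration only, and the syntax follows the XQuery working drafts; the actual program followed the structure of the MeSH release files and the import format agreed upon with the Protégé developers.
(: Sketch: read descriptor records from the MeSH thesaurus XML release
   and re-tag the fields needed by the MeSH ontology classes and slots. :)
for $rec in document("desc2002.xml")//DescriptorRecord
return
  <OntologyDescriptor>
    <DescriptorUI>{ string($rec/DescriptorUI) }</DescriptorUI>
    <DescriptorName>{ string($rec/DescriptorName/String) }</DescriptorName>
    <TreeNumberList>
      { for $t in $rec/TreeNumberList/TreeNumber
        return <TreeNumber>{ string($t) }</TreeNumber> }
    </TreeNumberList>
  </OntologyDescriptor>
Only a few descriptor-level fields are sketched here; the actual program covered the full record structure, including the concept and term information.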
Exporting MeSH ontology to an XML document
The ontology file was exported to an XML document after finishing the structure
construction and data importation. The reason I used an XML document, instead of a traditional database, to save the ontology is based on the following considerations:
• Relational databases are particularly good at storing highly structured information and not particularly good at managing semi-structured data, which is what we have here: not all data content in Medical Subject Headings is highly structured. The instances of the same slot vary in length, degree of completeness, and cardinality. Document-oriented XML usually has a varying structure to allow for the flexibility inherent in prose.
• The most common use of XML is as a means of integration or data interchange between enterprise applications inside and outside the firewall. We will talk about more features of XML in Chapter 4. Data in the XML format can easily be reused and shared by other users or agents in the same domain of interest.
• In the second part of my project, I implemented a web informatics system. XML is clearly targeted at the web. The many ways and supporting tools for connecting XML documents with web applications would certainly ease the development of my later implementation.
This is actually a widely debated area. We need to distinguish fully structured data from semi-structured data before we can accept a common conclusion. Semi-structured data are data that have some structure but are not rigidly structured. An example of semi-structured data is a health record: one patient might have a list of vaccinations, another might have height and weight, and yet another might have a list of operations he or she has undergone. (Medical Subject Headings can be regarded as semi-structured data.) Semi-structured data are difficult to store in a relational database because you either have many different tables (which means many joins and slow retrieval times) or a single table with many null columns (as is the case in the MeSH thesaurus). Semi-structured data are very easy to store as XML and are a good fit for a native XML database.
Although the details vary depending on the individual RDBMS, generally
speaking, relational DBMSs do not handle semi-structured data very well. On the other
hand, handling semi-structured data is one of the main virtues of the (data-centric) XML
data model. In the real world, one always tries one's best to establish standards, but inevitably requirements and needs change over time: new users enter the arena, the software is expanded to take on a wider scope of jobs, and business practices change slightly over time. It is therefore difficult to keep everything completely fixed and expect schemas to remain the same forever. XML provides a framework in which formats can evolve carefully for backward compatibility. If someone adds a new child element somewhere, my existing XQuery (or XPath) expressions continue working, as long as the addition has not disrupted the semantics too harshly.
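For instance, a path expression of the following form (the file name mesh.xml and the element path are assumptions here, matching the element names shown later in Figure 3-9) keeps returning descriptor names even if new child elements are later added inside DescriptorRecord:
document("mesh.xml")//DescriptorRecord/DescriptorRef/DescriptorReference/DescriptorName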
In Chapter 4, I will discuss the relationship between ontology and XML techniques and the XQuery engine used to query XML documents. In the later description of my implementation, we can see that building such an ontology makes it possible for the ontology to be used and shared by more agents in diverse environments, and it makes queries much more flexible and efficient. Moreover, the ontology can help users create their own ontology objects in XML (which could later be imported into knowledge application systems). Part of the ontology in XML format is shown in Figure 3-9.
<Project>
  <DescriptorRecord>
    <DescriptorRef p_attr="t">
      <DescriptorReference>
        <DescriptorUI>D000029</DescriptorUI>
        <DescriptorName>Abortion, Legal</DescriptorName>
      </DescriptorReference>
    </DescriptorRef>
    <DateCreated>01/01/1999</DateCreated>
    <DateEstablished>01/01/1964</DateEstablished>
    <AllowableQualifierList p_attr="t">
      <QualifierReference>
        <QualifierUI>Q000009</QualifierUI>
        <QualifierName>adverse effects</QualifierName>
        <Abbreviation>AE</Abbreviation>
      </QualifierReference>
      …
    </AllowableQualifierList>
    <HistoryNote>64</HistoryNote>
    <PublicMeSHNote>64</PublicMeSHNote>
    <TreeNumberList>E04.520.050.055</TreeNumberList>
    <ConceptList p_attr="t">
      <Concept PreferredConceptYN="Y">
        <ConceptRef p_attr="t">
          <ConceptReference>
            <ConceptUI>M0000047</ConceptUI>
            <ConceptName>Abortion, Legal</ConceptName>
            <ConceptUMLSUI>C0000812</ConceptUMLSUI>
          </ConceptReference>
        </ConceptRef>
        <ScopeNote>Termination of pregnancy under conditions allowed
          under local laws. (POPLINE Thesaurus, 1991)</ScopeNote>
        <PharmacologicalActionList p_attr="t" />
        <TermList p_attr="t">
          <Term ConceptPreferredTermYN="Y">
            <String>Abortion, Legal</String>
            <TermUI>T000087</TermUI>
          </Term>
          <Term>
            <String>Abortions, Legal</String>
            <TermUI>T000087</TermUI>
          </Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</Project>
Figure 3-9 Ontology exported to an XML document
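To suggest how such an exported document can already be queried in a structure-aware way (the XQuery language itself is introduced in Chapter 4), the sketch below assumes the document of Figure 3-9 is saved in a file named mesh.xml, a name chosen only for illustration; it returns, for every descriptor that allows the qualifier "adverse effects", the descriptor name together with all of its entry terms:
for $d in document("mesh.xml")//DescriptorRecord
where $d/AllowableQualifierList/QualifierReference/QualifierName = "adverse effects"
return
  <Match>
    { $d/DescriptorRef/DescriptorReference/DescriptorName }
    { $d/ConceptList/Concept/TermList/Term/String }
  </Match>
A plain keyword search could not express the requirement that "adverse effects" appear specifically as an allowable qualifier of a descriptor; the structure of the exported ontology document is what makes this distinction possible.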
CHAPTER 4
ONTOLOGY, XML AND XQUERY
4.1 Extensible Markup Language XML
Everyone agrees that XML represents a significant step forward both for electronic commerce and for a number of other types of Internet applications. As it is gradually becoming familiar to everybody interested in its features, I will briefly mention only the concepts that are related to my work.
The Extensible Markup Language (XML) is an extremely flexible method for sharing information in a consistent way over the Internet, on intranets, or anywhere else. XML is a simplified subset of SGML (the Standard Generalized Markup Language). The use of XML for "tagging" data based on content allows a more focused and powerful way to search data. Because XML enables documents to use semantic markup that identifies data elements according to what they are, rather than how they should appear, many diverse applications can also make use of the information in XML documents.
Because authors are free to use whatever tags are appropriate for adding semantic information, XML is designed to describe document types for all conceivable domains and purposes, for example multimedia presentations, HTML pages of arbitrary content (XHTML), and business transactions. Via standardized interfaces such as SAX and DOM, explicitly structured textual XML documents can easily be accessed by application programs. Suppose web developers, database developers, document managers, desktop publishers, programmers, scientists, and other professionals all get involved in a certain project; XML can provide a simple format that is flexible enough to accommodate such diverse needs. Simplicity, extensibility, interoperability, and openness are the most significant features offered by XML.
Thus XML can play an important role as a basic technology in the context of knowledge management and dissemination, and also when it comes to managing large-scale web sites. XML supports corporate design, style sheets (XSL), automatic generation of customized views of documents, consistency between documents, superior linking facilities (XLink, XPointer), and so forth. All these features are based on an individually definable tag set that is tailored to the application's needs. Whereas HTML tags serve purely presentational purposes, XML tags carry semantic purposes, so they can be exploited for several tasks, such as those mentioned above, or serve as metadata that supports intelligent information retrieval.
4.2 Ontologies as Conceptual Models for Generating XML Documents
4.2.1 XML Itself Is Not Enough
In spite of these positive features, it must be understood that XML is solely a description language for specifying the structure of documents and thus their syntactic dimension. It is a widely accepted foundation layer upon which to build, but it is not a cure-all for system interoperability. The document structure can represent some semantic properties, but those properties are understood only by special-purpose applications if there is no way to expose them externally. XML permits us to use tags but gives us no guidance as to which words are appropriate and commonly acceptable (there are no closed vocabularies agreed upon by a group of users in the domain of interest). So two questions arise: How should XML be extended to support the representation of business information? How can the numerous heterogeneous systems be unified to enable the low-friction marketplace of the future? These two questions lead directly to the use of ontologies.
4.2.2 Adding an Ontology as a Conceptual Model
Ontologies establish a joint terminology between members of a community of interest. If we add true semantics to XML documents by relating the document structure to an ontology, we can represent facts that are compatible with the designed domain model, that is, the ontology. This can be done by mapping ontology concepts and attributes to XML elements via a Document Type Definition (DTD). XML documents can then be authored so that the facts they represent conform to that domain model.
An ontology is a “formal specification of a conceptualization” [Gruber 1993] and thus provides a basis for semantics-based processing of XML documents. At the XML level, we have only a sequential order of content with element nesting. Only at the level of an ontology can we speak about concepts (classes) and semantic relationships (class hierarchy, attribute restrictions, and ontology validation constraints), and that should be regarded as the appropriate level for structuring the contents of documents. Of course, concepts and relationships have to be expressed and stored in linear form in documents, but this is pure representation (i.e., DTDs), and document structures alone are not enough to give XML a sound semantics. Using an ontology as the primary source for structuring a set of XML documents in a certain domain of interest makes the ontology act as a kind of mediator between the information seeker and those XML documents. The ontology unifies the different syntaxes and structures of these documents, which can then be accessed in a more semantic way (for example, by using conceptual terms for retrieving facts). In my approach, I took the MeSH ontology (represented in classes, slots, facets, and logical constraints, not in XML) and generated from it an XML file which can be used later for queries or other purposes. (In real application cases, a set of XML files may already be available, and we would need to generate a DTD from the existing ontology to make the XML files compatible with it.)
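For example, a query over the exported document can retrieve facts by conceptual terms rather than by document position. The following is a minimal sketch only (it assumes the exported file is saved as MeSHOntology.xml, as in Chapter 5, and uses the element names of Figure 3-9):

FOR $c IN document("MeSHOntology.xml")//Concept
WHERE $c//ConceptName = "Abortion, Legal"
RETURN $c/ScopeNote

Here the concept is addressed by its name, and the query returns its scope note no matter where the concept happens to appear in the document.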
4.3 XQuery
4.3.1 XML Query Language XQuery
XML is an extremely versatile markup language capable of labeling the information content of diverse data sources (structured and semi-structured documents, relational databases, and so forth). A query language that uses the structure of XML intelligently should be able to express queries across all these kinds of data, whether they are physically stored in XML or viewed as XML via middleware. Most existing proposals for XML query languages are robust only for specific kinds of data. XQuery, on the other hand, is designed to be broadly applicable across all types of XML data sources.
XQuery is the W3C's query language for XML. It is derived from Quilt, an earlier XML query language, which in turn borrowed features from several other languages, including XPath 1.0, XQL, XML-QL, SQL, and OQL.
XQuery is designed to meet the requirements identified in the W3C XML Query Working Group's XML Query 1.0 Requirements and the use cases in XML Query Use Cases. It is designed to be a small, easily implementable language in which queries are concise and easily understood. It is also flexible enough to query a broad spectrum of XML information sources, including both databases and documents. In addition, XQuery is designed to meet the requirement of a human-readable query syntax.
At its simplest, an XQuery expression could be:

document("bookPublished.xml")//book

This is standard XPath and is a complete, valid, self-contained XQuery query. It means: return a list of all book elements in the document "bookPublished.xml". If we assume for the moment that we are getting back serialized XML, the above query pulls <book> elements out of the document and returns content that looks like this:
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
...
<book year="1992">
<title>Advanced Programming in the Unix environment</title>
...
<book year="2000">
<title>Data on the Web</title>
...
The FLWR expression (pronounced "flower") makes queries in XQuery much more interesting and efficient. FLWR is an acronym for four of the possible XQuery subexpressions: FOR, LET, WHERE, and RETURN. The productions in the XQuery grammar that formally define a FLWR expression are as follows:
FlwrExpr     ::= (ForClause | LetClause)+ WhereClause? ReturnClause
ForClause    ::= 'FOR' Variable 'IN' Expr (',' Variable 'IN' Expr)*
LetClause    ::= 'LET' Variable ':=' Expr (',' Variable ':=' Expr)*
WhereClause  ::= 'WHERE' Expr
ReturnClause ::= 'RETURN' Expr
These definitions make the FLWR production very malleable, highly recursive, and capable of generating a large number of possible query instances, including just about any combination of FOR, LET, WHERE, and RETURN clauses imaginable.
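For instance, a minimal FLWR query over the bookPublished.xml sample above might look as follows (a sketch only: it reuses the book, title, and year names from the serialized output shown earlier and follows the uppercase keyword style of the grammar; the constructed recentBook element is a hypothetical name chosen for illustration):

FOR $b IN document("bookPublished.xml")//book
WHERE $b/@year = "2000"
RETURN
    <recentBook>
        { $b/title }
    </recentBook>

The FOR clause binds $b to each book element in turn, the WHERE clause keeps only the books whose year attribute is 2000, and the RETURN clause constructs a new recentBook element around each remaining title.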
Debates still exist on whether XQuery overlaps too much with XSLT. Both XQuery and XSLT will use XPath 2.0, and the two Working Groups are working closely together on this, so the two languages will share a great deal. The main differences between the two languages are differences of culture and perspective, but also that XQuery is more ambitious than XSLT and requires more complex optimizations. At a high level, XSLT is a transformation language with an interchange-centric model, while XQuery is a query language with a storage-centric model. XSLT uses XPath as a "sublanguage" located in attributes of its XML syntax, whereas XQuery is constructed as a superset of XPath. XQuery is oriented toward large sets of XML data, so commercial XQuery implementations are likely to find better acceptance there. Three areas that justify creating XQuery as a language separate from XSLT are 1) ease of use, 2) optimizability, and 3) strong data typing [W3C Working Draft 2001].
4.3.2 XQuery Implementation QuiP
Software AG's QuiP is a prototype implementation of XQuery, the W3C XML Query Language, for 32-bit Windows platforms. QuiP is designed to make it easy to learn and use the language. The following points describe QuiP concisely:
• Graphical user interface for writing queries and viewing results
• Online help includes syntax diagrams for XQuery
• Examples include 76 queries and 51 XML files
• Syntax conforms to the 07 June 2001 Working Draft of XQuery
• Most of the XQuery language has been implemented
• Queries may be made on XML files or on XML stored in a Tamino database
[Jonathan Robie, Software AG, Nov. 2001]
I used the Quip engine in my web project mainly for the following four reasons:
• Ease of use. With the simplicity of Quip plus the help of the MeSH ontology query wizard, users are able to write their own queries in Quip without any difficulty.
• It is the most up-to-date implementation of XQuery, according to the W3C XQuery requirements draft and use cases.
• The software is open source and is freely available online.
• Careful analysis of the Quip engine implementation made it possible for me to write an interface that embeds the Quip query engine in my server program (implemented with Java servlets on Tomcat).
CHAPTER 5
IMPLEMENTATION OF AN ONTOLOGY-BASED WEB APPLICATION SYSTEM
5.1 Building Web Informatics System
5.1.1 Ontology-based Web Application
Sharing a common understanding of the structure of information among users or software agents in the domain of interest is one of the common goals in developing the Medical Subject Headings ontology. Also, as is common in ontology applications, this ontology is akin to defining a set of data and their structure for other programs to use. With the help of the ontology information built earlier, users can not only build their own ontology objects in XML format, which can later be imported into other knowledge bases; the ontology data (instances) themselves can also serve as the basis of a web site that offers its users semantic queries over its rich content.
From the implementation and techniques introduced in previous chapters, we know that 1) the MeSH thesaurus is a good starting point for building an ontology in the same domain, and 2) Protégé-2000 can be used as an efficient tool to construct the ontology and acquire ontology instances through its user API. It is also possible to apply PAL validation constraints to the ontology data. In this chapter, I introduce the implementation of a web informatics system based on the existing MeSH ontology. This web informatics system gives the user the flexibility to query the ontology instances and the option to construct ontology objects in various ways.
The first scenario is as follows: the ontology itself remains on the server side, and user queries can be typed in and sent to the server over HTTP, as in a normal web application. The server accepts the query, searches the ontology by invoking the XQuery engine on the server side, and sends back the query results, as long as the query is a valid one. The second scenario is as follows: users want to query their own documents instead of the default ontology. In this case, they can first upload their own query files (with the file extension .xquery) and document files (with the file extension .xml) to the server; the server then completes the queries on those files for the clients, just like a web agent. These two scenarios require that the users know a good deal about the ontology data model. As they are the ones who write the queries, the users are also required to understand XQuery grammar. Sometimes users are not so familiar with the structures and relationships defined in the ontology model, or they may not be interested in learning XQuery grammar. The third scenario may then be the most appropriate one: an online ontology query wizard helps clients form proper queries with adequate ontology knowledge and query format; clients can then get through all the difficulties of making valid queries on that specific ontology and query it according to their own purposes.
5.1.2 Using JSP
The increasing sophisticated web applications definitely need to present dynamic
information. First-generation solutions of this kind of application included CGI, which is
a mechanism for running external programs through the web server. The problem with
CGI is scalability. A new process is created for every request.
Second-generation solutions included web server programming platforms, plugins and APIs for their servers. Therefore their solutions were specific to their server
products. For example, ASP worked only on Microsoft IIS or a personal web server.
42
Using ASP means lost of freedom of selecting a favorite web server and operation
systems.
JSP pages are a third-generation solution that can be combined easily with some second-generation solutions to create dynamic web content, making it easier and faster to build web-based applications. These web-based applications work with a variety of other technologies: servers, browsers, and other development tools.
I use JSP for the server programming, which includes a server-side security check, file upload and download, query request handling, query result responses, the ontology query wizard program, and so forth.
[Figure 5-1 is an architecture diagram: clients form queries by typing XQuery directly, through the ontology query wizard, or from local XQuery files, and may also upload local XML files; the server passes each query to the Quip XQuery engine, which runs it against the default document MeSHOntology.xml or against user-created XML files (sub-ontologies), and the client may download the results. The diagram includes a sample query,

<Project>
{ FOR $var IN document("MeSHOntology")//DescriptorReference
  WHERE $var/DescriptorUI < "D000019"
  RETURN
      <Descriptor>
          {$var/DescriptorName}
          {$var//Annotation}
          {$var//ScopeNote}
      </Descriptor> }
</Project>

and a fragment of its result:

<Project>
    <Descriptor>
        <DescriptorName>Calcimycin</DescriptorName>
        <ScopeNote>An ionophorous, polyether antibiotic from Streptomyces chartreusensis.</ScopeNote>
    </Descriptor>
    <Descriptor>
    ...
</Project>]
Figure 5-1 Web informatics system architecture
5.1.3 System Architecture
The overall system architecture is presented in Figure 5-1. Queries are formed by clients in one of three ways, according to the scenarios already introduced. The server accepts the query requests from the client side and invokes Quip (the XQuery engine) to query the ontology data (or the document data the user created or uploaded earlier). The query results are then returned to the clients if the queries are valid.
Interface to invoke XQuery in the server program
As introduced in Chapter 4, Quip is a successful implementation of the XQuery standard. It is produced by Software AG, whose other well-known product is Tamino, a leading native XML database. (I might consider importing the MeSH ontology into a Tamino database in the future.)
Quip is implemented in Java, which makes it possible to embed Quip queries in my server JSP program. The class java.lang.Runtime features a static method called getRuntime(), which retrieves the current Java runtime environment and is the only way to obtain a reference to the Runtime object. With that reference, I can run external programs (e.g., Quip) by invoking the Runtime class's exec() method, but I need to pay special attention to redirecting the input/output of that runtime execution. In a JSP server program, both the standard input/output streams and the HTTP request/response streams are in effect handling input/output in the client/server architecture, so I could not simply invoke exec("java quip…") to feed a query file to Quip and capture the output of its execution. The final solution was to create a special interface class to manage the Quip query execution inside the server program, and this approach finally worked. The partial content of this interface class is as follows:
// StreamGobbler drains a process's output or error stream on its own thread,
// optionally copying each line to a redirect stream (it relies on the java.io classes).
class StreamGobbler extends Thread
{
    InputStream is;
    String type;
    OutputStream os;

    StreamGobbler(InputStream is, String type)
    {
        this(is, type, null);
    }

    StreamGobbler(InputStream is, String type, OutputStream redirect)
    {
        this.is = is;
        this.type = type;
        this.os = redirect;
    }

    // Read the stream line by line so the child process cannot block on a full buffer;
    // echo each line to the console and, if a redirect stream was given, write it there too.
    public void run()
    {
        try
        {
            PrintWriter pw = null;
            if (os != null)
                pw = new PrintWriter(os);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);
            String line = null;
            while ((line = br.readLine()) != null)
            {
                if (pw != null)
                    pw.println(line);
                System.out.println(type + ">" + line);
            }
            if (pw != null)
                pw.flush();
        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }
    }
}
…
try
{
    FileOutputStream fos = new FileOutputStream(xml_result);
    Runtime rt = Runtime.getRuntime();
    String inqf = null;
    if (cfile != null) inqf = _filename + cfile;
    else inqf = _filename + pfilenm;
    // launch Quip as an external process on the chosen query file,
    // writing the query result to xml_result
    Process proc = rt.exec("java -classpath quip.jar;crimson.jar;jaxp.jar "
        + "com.softwareag.xtools.quip.Main --quipcmd quip.exe -f " + inqf + " -o " + xml_result);
    // any error message?
    StreamGobbler errorGobbler = new StreamGobbler(proc.getErrorStream(), "ERROR");
    // any output?
    StreamGobbler outputGobbler = new StreamGobbler(proc.getInputStream(), "OUTPUT", fos);
    // kick them off
    errorGobbler.start();
    outputGobbler.start();
    // wait for Quip to finish and report its exit value
    int exitVal = proc.waitFor();
    System.out.println("ExitValue: " + exitVal);
    fos.flush();
    fos.close();
}
catch (Throwable t)
{
    t.printStackTrace();
}
Figure 5-2 Interface of server program and Quip execution
5.2 Query the Ontology
Considering all the possible ways clients may want to query the ontology, the web application was implemented flexibly so that clients can choose the most convenient one. Professionals from the research domain may create their queries ahead of time in XQuery format and save them locally. For those who are not familiar with the ontology structures, or who do not know how to form a legal query for Quip, a wizard presents the structural information of the ontology and guides the users through the query process.
5.2.1 Query from Direct Typing
After the username and password are checked, users can type queries from the client side. As in a normal web application, the JSP server programs display a dynamic web page on the client side, including useful collections of information and descriptions of the ontology. Clients fill in input forms and can type their query content directly into a scrolling text area. Upon submission of the forms, the server JSP programs handle all the transferred content, invoke the Quip query, and inform the clients of their query output when it finishes (see Figure 5-4).
5.2.2 Upload Existing Local Query Files
Clients can also choose to upload their existing query files to the server programs when online input is not appropriate, for example, when the query is quite long. Both this method and the previous one require the users to at least understand the structure of the ontology in order to write queries or construct ontology objects correctly. They should also know how to write legal queries based on XQuery grammar. Useful information describing the ontology structures appears on the web pages (see Figures 5-3 and 5-4).
Figure 5-3 Online links point to ontology structures
Figure 5-4 Query the ontology by typing or uploading local query files
5.2.3 Query Ontology or Build Up Ontology Objects with the Ontology Wizard
We need to consider the situation in which users are not familiar with the ontology structures or do not know how to construct legal XQuery queries. I implemented a MeSH ontology query wizard to give online hints and proper restrictions for querying the ontology. The following steps are designed in the wizard not only to let users form a proper query according to their initial purpose but also to facilitate constructing well-structured ontology objects in the same domain (see Figure 5-5).
• The ontology hierarchy is extracted from the ontology and saved in the ontology configuration file.
• One menu of the wizard page prompts the user with the query files available on the server for that particular user. Every user is able to query the default ontology file. Users who have uploaded their own files in XML format from their local computers can also run queries on them. The same rule applies to ontology objects generated automatically in earlier queries.
• The other menu, itemized by ontology structures, is constructed from the ontology configuration file.
• A JSP program does as much as it can to form legal queries automatically, based on the ontology and some information given by the user. The user chooses the query target file from one menu; selects from the other menu an initial point within the ontology structure from which to start the query; types in any condition constraints to be put on the query; and, to construct a new ontology object, gives the tag name to put in the future ontology object and specifies which fields of data from the initial ontology should appear in the new object. The user then submits the query (a sketch of such a generated query is given after Figure 5-5).
• The JSP program displays the complete query to the user, which helps the user become familiar with the query format, although users do not have to remember everything.
Figure 5-5 Query the ontology with the help of the ontology query wizard
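To make the wizard's output concrete, the following is a sketch of the kind of query it could assemble (illustrative only: the target file MeSHOntology.xml and the DescriptorReference element names come from Chapters 3 and 5, while the condition on DescriptorName and the new tag name MyDescriptor stand in for hypothetical user choices):

<Project>
{ FOR $d IN document("MeSHOntology.xml")//DescriptorReference
  WHERE $d/DescriptorName = "Abortion, Legal"
  RETURN
      <MyDescriptor>
          { $d/DescriptorUI }
          { $d/DescriptorName }
      </MyDescriptor> }
</Project>

The enclosing Project element and the embedded FLWR expression follow the same pattern as the sample query shown in Figure 5-1; the result is itself a well-formed XML document that can be saved as a new ontology object and queried later.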
5.2.4 Choose the Query File and Download the Result
Not only can the default ontology file MeSHOntology.xml be queried; local XML files (or the ontology object files saved as XML documents in previous queries) can also be made the target query files. To do this, the web application first uploads the local XML files to the server and saves them under that user's folder. The user can then query his own XML files in the same way he queries the MeSH ontology. When the query results come back, local users can choose to download the results to their own computers. This also makes it possible for local files of diverse formats to be queried or accessed within a community of people who commit to the same ontology.
CHAPTER 6
CONCLUSIONS AND FUTURE WORK
Ontologies are usually built through the cooperation of computer professionals and domain experts. My lack of medical subject knowledge means the ontology I built is not ideally defined and constructed; the domain knowledge may not be completely captured in the ontology. In the future, if medical subject professionals get involved in improving the system, better performance can be achieved.
This web application system is not restricted to one specific domain, Medical Subject Headings. All the features could easily be adapted to let users from various domains apply this technique to different ontologies and query them in diverse client/server systems.
The most obvious use of an ontology is in connection with a database component: an ontology can be compared with the schema component of a database. An ontology can play an important role in the requirements analysis and conceptual modeling phases. The resulting conceptual model can be represented as an ontology that can be processed by computer, and from there it can be mapped to concrete target platforms, including databases, so that the system gains the various advantages of databases.
This system currently saves the ontology into an XML document, which in some sense lacks many of the benefits found in real databases, such as efficient storage, indices, security, transactions and data integrity, multi-user access, triggers, and so on. Considering the irregular format of MeSH data, importing the ontology into a native XML database (e.g., Tamino) is more reasonable than importing it into a traditional database for getting the most out of a database system as well as an ontology. I might consider this issue in a future improvement.
LIST OF REFERENCES
Abason, J. M., Gomez, M. “MELISA. An Ontology-based Agent for Information
Retrieval in Medicine.” Viewed: Jan. 2002.
http://www.ics.forth.gr/proj/isst/SemWeb/proceedings/session3-1/paper.pdf
Bourret, R. “XML and Databases.” February 2002.
http://www.rpbourret.com/xml/XMLAndDatabases.htm
Clark, J. “XSL Transformations (XSLT) Specification 1.0.” W3C Working
Draft, April 21, 1999.
http://www.w3.org/TR/1999/WD-xslt-19990421.html
Decker, S., Harmelen, F. V., Broekstra, J. “The Semantic Web- on the Respective
Roles of XML and RDF.”
http://www.ontoknowledge.org/oil/downl/IEEE00.pdf
Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D. “A Query Language
for XML.”1998.
http://citeseer.nj.nec.com/correct/138418
Extensible Markup Language (XML) Viewed: Jan. 2002.
http://www.w3.org/XML/
Fensel, D., Angele, J., Decker, S. “On2broker: Semantic-Based Access to Information
Sources at the WWW.” Proceedings of the World Conference on the WWW and
Internet (WebNet 99), Honolulu, Hawaii, 1999.
ftp://ftp.aifb.uni-karlsruhe.de/pub/mike/dfe/paper/webnet.pdf
Gruber, T.R. “A Translation Approach to Portable Ontologies.” Knowledge
Acquisition, 5(2):199-220, 1993a.
http://ksl-web.stanford.edu/KSL_Abstracts/KSL-92-71.html
Gruber, T. R. “Toward Principles for the Design of Ontologies Used for Knowledge
Sharing.” Presented at the Padua Workshop on Formal Ontology, March 1993b.
http://ksl-web.stanford.edu/KSL_Abstracts/KSL-93-04.html
Guarino, N. “Formal Ontology in Information Systems.” Proceedings of FOIS’98,
Trento, Italy, 6-8 June 1998. Amsterdam, IOS Press, pp. 3-15.
Heflin, J. “Towards the Semantic Web: Knowledge Representation in a Dynamic,
Distributed Environment.” Viewed: Feb. 2002.
http://www.cs.umd.edu/projects/plus/SHOE/pubs/heflin-thesis-orig.pdf
Huffman, S. B., Baudin, C. “Toward Structured Retrieval in Semi-structured
Information Spaces.” In Proceedings of the 15th International Joint Conf. on
Artificial Intelligence (IJCAI-97), 1997
Hunter, J. “MetaNet -- A Metadata Term Thesaurus to Enable Semantic
Interoperability Between Metadata Domains.” Journal of Digital Information, 1(8),
2001.
http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Hunter/
Jasper, R., Uschold, M. “A Framework for Understanding and Classifying Ontology
Applications” in Proceedings of the IJCAI-99 Ontology Workshop, Stockholm,
Sweden,1999.
Katz, H. “A Look at the W3C’s Proposed Standard for an XML Query Language.”
June, 2001.
http://www-106.ibm.com/developerworks/xml/library/x-xquery.html
Lawrence, S., Giles, C.L. “Context and Page Analysis for Improved Web Search.”
IEEE Internet Computing, 2(4), 38-46, 1998.
http://www.neci.nec.com/~lawrence/papers/search-ic98/
Mahalingam, K., Huhns, M.N. “An Ontology Tool for Query Formulation in an
Agent-Based Context.” Proceedings of the 2nd IFCIS International Conference on
Cooperative Information Systems (CoopIS '97), 1997.
Mahmoud, Q. H. “Web Application Development with JSP and XML.”
June, 2001.
http://developer.java.sun.com/developer/technicalArticles/xml/WebAppDev/
Nelson, S.J., Johnston, W.D., Humphreys, B.L. “Relationships in Medical Subject
Headings (MeSH).” National Library of Medicine, Viewed: Jan. 2002.
http://www.nlm.nih.gov/mesh/meshrels.html
Noy, N.F., McGuinness, D.L. “Ontology Development 101: A Guide to Creating
Your First Ontology.” Stanford Knowledge Systems Laboratory Technical Report
KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880,
March 2001.
http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinnessabstract.html
Qin, J., Paling, S. “Converting a Controlled Vocabulary into an Ontology: The Case
of GEM.” Information Research, 6(2), 2001.
http://InformationR.net/ir/6-2/paper94.html
Savage, A. “Changes in MeSH Data Structure.” NLM Tech Bull. 2000 Mar-Apr
(313):e2.
Schmidt, A., Kersten, M., Windhouwer, M., Waas, F. “Efficient Relational Storage
and Retrieval of XML Documents.”
http://www.cwi.nl/themes/ins1/publications/docs/ScKeWiWa:WEBDB:00.pdf
Soergel, Dagobert “Functions of a Thesaurus / Classification /Ontological Knowledge
Base.” October 1997.
http://www.clis.umd.edu/faculty/soergel/soergelfctclass.pdf
United States National Library of Medicine. “Medical Subject Headings (MeSH)
Fact Sheet.” Viewed: Jan. 2002.
http://www.nlm.nih.gov/mesh
Volz, R. “OntoServer – Infrastructure for the Semantic Web.” University of
Karlsruhe, Germany, 2001.
http://www.aifb.uni-karlsruhe.de/WBS
Wielinga, B.J., Schreiber, A.Th., Wielemaker, J., Sandberg, J.A.C. “From Thesaurus
to Ontology.” 2001.
http://www.swi.psy.uva.nl/usr/Schreiber/papers/Wielinga01a.pdf
W3C Working Draft “XQuery 1.0: An XML Query Language.” 20 December 2001.
http://www.w3.org/TR/xquery/
W3C Working Draft “XML Query Use Cases.” 20 December 2001.
http://www.w3.org/TR/xmlquery-use-cases
www-rdf-logic “Annotated DAML+OIL (March 2001) Ontology Markup.” Viewed:
March 2002.
http://www.daml.org/2001/03/daml+oil-walkthru.html
BIOGRAPHICAL SKETCH
Wenyang Hu was born in HeFei, Anhui Province in China. She received a
Bachelor of Science degree in computer engineering from Zhejiang University,
Hangzhou, China, in July 1988. Then she entered the graduate program at Zhejiang
University and obtained a Master of Engineering degree in computer engineering in
1991. After graduation, she became a faculty member at Anhui University, teaching
courses in computer science. At the same time she took part in many co-op projects. The
courses she taught included Programming Languages C/C++, DBMS Introduction, Data
Structure, and Operating System Principles. The projects in which she participated
included the “DIV” (Dial in Voice) and the “Multi-Model Interactive Platform.”
She enrolled at the University of Florida in August 2000 in the Department of
Computer and Information Science and Engineering. She worked as a research assistant
with Prof. Limin Fu, and later worked as a teaching assistant for Prof. Beverly A. Sanders.
Her future research interests include ontology-based web applications and web server
programming.