Optimal Training Sets for Bayesian Prediction of MeSH® Assignment

Downloaded from jamia.bmj.com on June 9, 2014 - Published by group.bmj.com
Research Paper
SUNGHWAN SOHN, PHD, WON KIM, PHD, DONALD C. COMEAU, PHD, W. JOHN WILBUR, MD, PHD
Abstract
Objectives: The aim of this study was to improve naïve Bayes prediction of Medical Subject
Headings (MeSH) assignment to documents using optimal training sets found by an active learning inspired
method.
Design: The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH
term, they found an optimal training set, a subset of the whole training set. An optimal training set consists of all
documents including a given MeSH term (C1 class) and those documents not including a given MeSH term (C−1
class) that are closest to the C1 class. These small sets were used to predict MeSH assignments in the MEDLINE®
database.
Measurements: Average precision was used to compare MeSH assignment using the naïve Bayes learner trained
on the whole training set, optimal sets, and random sets. The authors compared 95% lower confidence limits of
average precisions of naïve Bayes with upper bounds for average precisions of a K-nearest neighbor (KNN)
classifier.
Results: For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of
the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically
better than a KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall
naïve Bayes averaged 14% better than a KNN for all 20 MeSH assignments. Using these optimal sets with another
classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes.
Conclusion: Using a smaller optimal training set greatly improved learning with naïve Bayes. The performance is
superior to a KNN. The small training set can be used with other sophisticated learning methods, such as CMLS,
where using the whole training set would not be feasible.
J Am Med Inform Assoc. 2008;15:546–553. DOI 10.1197/jamia.M2431.
Introduction
MEDLINE is a large collection of bibliographic records of
articles in the biomedical literature maintained by the National Library of Medicine (NLM). In late 2006, MEDLINE
included about 16.5 million references, which have been
processed by human indexing. Each MEDLINE reference is
assigned a number of relevant medical subject headings
(MeSH). MeSH is a controlled vocabulary produced by the
NLM and used for indexing, cataloging, and searching
biomedical and health-related information and documents
(see http://www.nlm.nih.gov/mesh/ for details of MeSH).
Affiliation of the authors: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD.
Supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. The authors thank the reviewers for valuable feedback and suggestions.
Correspondence: Dr. Sunghwan Sohn, National Library of Medicine, Building 38A, 6N611C, 8600 Rockville Pike, Bethesda, MD 20894; e-mail: <[email protected]>.
Received for review: 03/09/07; accepted for publication: 02/07/08
Human indexing is costly and requires intensive labor. The
indexing cost at the NLM consists of data entry, NLM staff
indexing and revising, contract indexing, equipment, and
telecommunication costs.1 The annual budget for contracts to
perform MEDLINE indexing including purchase orders at four
foreign centers is several million dollars (James Marcetich,
Head, NLM Index Section, personal communication, August
2007). NLM’s indexers are highly trained in MEDLINE
indexing practice as well as in one or more subject domains covered by the
MEDLINE database. Since 1990, the MEDLINE database has
grown faster than before with more documents available in
electronic form. The cost of human indexing of the biomedical literature is high, so many attempts have been made to
provide automatic indexing.1-8 The NLM Indexing Initiative
is a research effort to explore indexing methodologies for
semiautomated user-assisted indexing as well as fully automated indexing applications.1,2 This project has produced a
system for recommending indexing terms for arbitrary biomedical text, especially titles and abstracts of journal articles.
The system has been in use by library indexers since
September 2002. The system consists of several methods of
discovering MeSH terms that are combined to produce an
ordered list of recommended indexing terms. K-nearest
neighbor (KNN) is used as one method to rank the MeSH
terms that are candidates for indexing a document.5 This
study sought to investigate the naïve Bayes learner as an
alternative to KNN.
For the MEDLINE database, all references have titles and about
half have abstracts. This is more than 16.5 gigabytes of data.
Journal of the American Medical Informatics Association, Volume 15, Number 4, July/August 2008
With this much data, it is not realistic to run the most
sophisticated machine learning algorithms. Only the simplest
and most efficient algorithms can process this data on high-end
commodity servers (2 CPUs, 4 GB of memory). Naïve Bayes
has the efficiency to work with all of MEDLINE. In this
study, the performance of using naïve Bayes for automatic
MeSH assignment was investigated. One challenge is that
many MeSH terms occur in only a very small portion of
the MEDLINE database, and naïve Bayes performs poorly
on these terms. This poor performance is
related to a significantly imbalanced class distribution: very
few documents include a given MeSH term (C1 class) and a
very large number of documents do not include it (C−1
class). In Bayesian learning for binary classification, a large
preponderance of C−1 documents dominates the decision
process and deteriorates classification performance on the
unseen test set. Careful example selection must be considered to solve this problem and improve the classification
performance.
In this article, we perform example selection that starts from
a small training set (STS) and iteratively adds informative
examples selected from the whole training set (WTS) into the
STS until an optimum is reached. Because a given MeSH
term occurs in only a small portion of the WTS, all C1
documents are placed in the initial STS. Then, the C−1
documents most similar to C1 documents are iteratively
added to the STS. As the size of the STS increases, the STS
that produces the best result on the training set by a
leave-one-out cross-validation method is selected as our
optimal training set (OTS). Although we use the word
optimal, the selected set might not be a global optimum
because we find this set by a greedy approach. The detailed
procedure will be explained under Example Selection in the
Methods section. Naïve Bayes using this OTS provides a
superior alternative to KNN, which is one method currently
used in the NLM Indexing Initiative1,2 for MeSH prediction.
Traditionally, example selection has been used for three
major reasons.9 The first reason is to control computational
cost. Standard support vector machines (SVM)10 require
long training times that scale superlinearly with the number
of features and become prohibitive with large data sets.
Pavlov et al.11 used boosting to combine multiple SVMs
trained on small subsets. Boley and Cao12 reduced the
training data by partitioning the training set into disjoint
clusters and replacing any cluster containing only nonsupport vectors with its representative. Quinlan13 developed a
windowing technique to reduce the time for constructing
decision trees for very large training sets. A decision tree
was built on a randomly selected subset from the training set
and tested for the remaining examples that were not included in this subset. Then, the selected misclassified examples were added to the initial subset, and a tree was
constructed from the enlarged training set and tested on the
remaining set. This cycle was repeated until all the remaining examples were correctly classified. For us cost is not a
concern because naïve Bayes can be trained efficiently on all
of MEDLINE.
The second reason for example selection is to reduce labeling cost when labeling all examples is too expensive. Active
learning is a way to deal with this problem. It uses current
knowledge to predict the best choice of unknown data to
label next in an attempt to improve the efficiency of learning.
Active learning starts with a small number of labeled data as
the initial training set. It then repeatedly cycles through
learning from the training set, selecting the most informative
data to be labeled, labeling them by a human expert, and
adding newly labeled data to the training set. These informative documents may be near the decision boundary14,15
or they may be the documents producing maximal disagreement among a committee of classifiers.16,17 The emphasis is
on the best possible learning from the fewest possible
documents.14,18,19 While studying active learning methods,
we saw instances where the results using a small training set
produced better results than using the whole training set.
Others have seen the same effect.15,20,21 However, they
observed this effect only for some limited cases and the
improvement was small (generally <10%). By contrast we
find a large improvement for all cases, nearly 200% on
average. Also, we do not need active learning to avoid
labeling because the entire training set is labeled. We use an
active learning–inspired approach not to minimize labeling
cost, but to maximize performance.
The third reason is to improve learning by focusing on the
most relevant examples. Our example selection belongs to
this category. Boosting22 can also be implemented as a type
of example selection. Wilbur23 used staged Bayesian retrieval on a subset of MEDLINE, and it outperformed a more
standard boosting approach. He initially trained naïve Bayes
on the whole training set and used it to select examples that
had a higher probability of belonging to a small specialty
database. Both the selected and the small specialty data set
were used as a training set for the second stage classifier.
Then he combined the two classifiers to obtain the best
performance. His method seems to be similar to our example
selection, but he did not know about the poor performance
of the naïve Bayes classifier on the whole MEDLINE database. He used a relatively small subset of MEDLINE and
saw only a small (<10%) improvement in performance.
What he did was like the first round of optimization to
obtain the OTS in our method. By contrast, we iteratively
perform example selection to reach an optimum for a single
classifier and find much greater improvement. Example
selection to deal with the imbalanced class problem has
previously been proposed as a method to improve learning.
Various strategies have been suggested to tackle this problem. Sampling to balance examples between the majority
and minority classes is a commonly used method.24 Up-sampling randomly selects examples with replacement from
the minority class until the size of the minority class matches
that of the majority class. It does not gain information about
the minority class, but increases the misclassification cost of
the minority class. Alternatively one can directly assign a
larger misclassification cost for the minority class than that
of the majority class.25,26 Down-sampling eliminates examples from the majority class until it balances with the
minority class. Examples to be eliminated can be selected
randomly or by focusing on those farther away from the minority class.
Others, using clustering and various other algorithms, have
attempted to reflect the character of the majority class in a
fair manner.27 Down-sampling may lose information from
the majority class and risks harming performance. Our
method is conceptually similar to focused down-sampling.
In focused down-sampling the criteria are set a priori and
then examples are selected to balance class size. However,
we do not aim for a balanced class size in our OTS. We
explicitly adjust the subset of focused examples from the
majority set iteratively until the best training is achieved.
Methods
Data Preparation
MEDLINE is a collection of references to articles in the
biomedical literature. We used the titles and, where available, the abstracts. At the time of our experiment, MEDLINE
included 16,534,506 references. For each MeSH term, the
WTS was created by randomly selecting two-thirds of the
documents from the C1 class and two-thirds from the C−1
class. This gives the WTS the same proportion of C1 and C−1
documents as in all of MEDLINE. The remaining documents
served as our test set. Stop words were removed, but no
stemming was performed. Using all single words and two-word phrases in the titles and abstracts provided 56,194,161
features. However, we used feature selection (Appendix A,
online supplement available at www.jamia.org) and so not
all of them were used in a naïve Bayes classifier. The MeSH
terms are not used as features in the actual training and test
process, but are only used to define classes.
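The class-wise two-thirds split described above can be illustrated with a minimal sketch. It assumes labels of +1 (C1) and −1 (C−1); the function name `stratified_split`, the fixed seed, and the toy label list are hypothetical and are not taken from the study.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split document indices into training and test sets, sampling each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in (1, -1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])   # two-thirds of this class go to training
        test.extend(idx[cut:])    # the remainder of this class goes to testing
    return sorted(train), sorted(test)

# Toy data: 30 C1 documents and 300 C-1 documents (hypothetical sizes).
labels = [1] * 30 + [-1] * 300
train_idx, test_idx = stratified_split(labels)
```

Because each class is sampled separately, the training set keeps the same C1 to C−1 ratio as the full collection.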
Classification Tasks
Our classification task was to predict which documents were
assigned a particular MeSH term (C1 class) and which
documents were not assigned that term (C−1 class). We
selected 20 MeSH terms with the number of C1 class articles
covering a wide frequency range: approximately 100,000,
50,000, 30,000, 20,000, 10,000, 5,000, 4,000, 3,000, 2,000, and
1,000 C1 class articles. All but one of these terms are leaf
MeSH terms, and the appropriate documents can be
searched for directly in PubMed. “Myocardial infarction” is
an internal node of the MeSH hierarchy. The proper search
is a union of the results of searching for it directly and
searching for the terms below it in the hierarchy: “myocardial stunning” and “shock, cardiogenic.” A detailed explanation of MeSH can be found at http://www.nlm.nih.gov/
mesh/.
Learning Methods
For our principal learner we used the naïve Bayes Binary
Independence Model (BIM),28 in which a document is represented by a vector of binary attributes indicating presence
or absence of features in the document (for details refer to
Appendix A, online supplement). We made this choice
because BIM can be trained rapidly and can efficiently
handle a large amount of data.
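As a rough illustration of the Binary Independence Model, the sketch below estimates a weight for each feature from its document frequency in the two classes and scores a document by summing the weights of the features it contains. The log-odds-ratio form of the weight, the 0.5 smoothing constant, and the names `bim_weights` and `score` are assumptions made for this example; the definitions actually used in the study are those of Appendix A.

```python
import math

def bim_weights(docs, labels, smoothing=0.5):
    """docs: list of feature sets; labels: +1 (C1) or -1 (C-1). Returns feature -> weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = {}, {}
    for feats, y in zip(docs, labels):
        target = pos_df if y == 1 else neg_df
        for f in feats:
            target[f] = target.get(f, 0) + 1
    weights = {}
    for f in set(pos_df) | set(neg_df):
        # Smoothed probability that the feature is present in each class (assumed smoothing).
        p = (pos_df.get(f, 0) + smoothing) / (n_pos + 2 * smoothing)
        q = (neg_df.get(f, 0) + smoothing) / (n_neg + 2 * smoothing)
        weights[f] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights

def score(feats, weights):
    """Sum the weights of the features present in the document."""
    return sum(weights.get(f, 0.0) for f in feats)

# Toy corpus (hypothetical documents).
docs = [{"pentazocine", "opioid"}, {"opioid", "pain"}, {"rat", "liver"}, {"liver", "enzyme"}]
labels = [1, 1, -1, -1]
weights = bim_weights(docs, labels)
```

Training requires only one pass to count document frequencies, which is why a learner of this kind can be applied to a collection the size of MEDLINE.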
C-modified least squares29 is a wide-margin classifier that
has many properties in common with SVMs. However, its
smooth loss function allows us to apply a gradient search
method to optimize rapidly, and thus it can be applied to
larger data sets. Although CMLS can be trained faster than
an SVM, it is still impractical to apply CMLS to the WTS.
However, we can run CMLS on the smaller OTS. Typically
CMLS performs better than Bayes. The question is, how will
it perform on an example set optimized for Bayesian learning?
Because the NLM Indexing Initiative1,2 currently uses a
KNN method to aid MeSH prediction, it is valuable to
compare our Bayes results on the OTS with the same KNN
method. The standard approach of a KNN classifier would
compare all pairs of documents, one from the test set and
one from the training set. This is a very expensive computation for a huge database such as MEDLINE. To reduce the
computational cost we obtained the upper bounds for KNN
average precision. For details refer to Appendix C (online
supplement). The upper bounds of KNN were compared
with 95% lower confidence limits of Bayes, which were
obtained by Student’s t-test. Because a higher average precision is better, if the lower bound of the naïve Bayes method
using the OTS is higher than the upper bounds we found for
the KNN method, we can safely conclude that the naïve
Bayes method using OTS is better than the KNN method.
Example Selection
To identify the optimal training set (OTS) for Bayesian
learning, we followed a procedure that simulates active
learning (Figure 1). This is not active learning because it
requires the entire training set to be labeled before this
example selection is performed.
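The overall shape of such a greedy search can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Figure 1: the callbacks `rank_negatives` (orders the remaining C−1 documents from most to least similar to C1) and `evaluate` (returns a quality measure such as leave-one-out average precision) are hypothetical, and the sketch scans all batches rather than applying any particular stopping rule.

```python
def select_optimal_training_set(positives, negatives, rank_negatives, evaluate, batch_size):
    """Grow a small training set (STS) in batches and keep the best-scoring version.

    batch_size must be at least 1.
    """
    selected = []                  # C-1 documents added so far
    remaining = list(negatives)    # C-1 documents not yet added
    best_set, best_quality = list(positives), float("-inf")
    while remaining:
        current = list(positives) + selected
        ranked = rank_negatives(current, remaining)
        batch, remaining = ranked[:batch_size], ranked[batch_size:]
        selected = selected + batch
        candidate = list(positives) + selected
        quality = evaluate(candidate)
        if quality > best_quality:
            best_set, best_quality = candidate, quality
    return best_set, best_quality

# Toy usage with made-up closeness values and a quality function that peaks at 4 documents.
closeness = {"n1": 0.8, "n2": 0.1, "n3": 0.9, "n4": 0.3, "n5": 0.2, "n6": 0.05}
best_set, best_quality = select_optimal_training_set(
    positives=["p1", "p2"],
    negatives=list(closeness),
    rank_negatives=lambda sts, rem: sorted(rem, key=closeness.get, reverse=True),
    evaluate=lambda sts: -abs(len(sts) - 4),
    batch_size=1,
)
```

Because the search is greedy, the returned set is the best one encountered along a single growth path and need not be a global optimum.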
Good results have been obtained from random down-sampling of examples in other domains.24 To separate the effect of the size of our OTS from the effect of the specific documents in
that set, we used random sampling of the C−1 documents to
create a random STS (Ran STS) with the same number of
elements as the OTS. We then applied Bayes learning to
this Ran STS.
Figure 1. Example selection algorithm.
*A score is defined in equation (A.5) in Appendix A (online
supplement).
†For M, we used 1% of the number of C1 documents. For
example, if the number of C1 documents is 100,000, then we
use M = 1,000. This percentage was determined experimentally to balance the number of iterations to find the OTS with
the chance of missing the true peak of the curve. If M is too
large, the number of iterations will be reduced, but performance might be slightly worse if the true peak is missed. If
M is too small, the number of iterations will be increased
without significant improvement in performance.
‡Using the same data set for both training and testing would
generally lead to overtraining and give misleading results.
This can be avoided by holding out a distinct validation set.
Instead, we used leave-one-out cross validation on the
training set. Each training set document is scored as if the
model were trained on all other training set documents.
Because of the independence assumptions of Bayes, an
efficient implementation of this scoring is possible. The
details of this implementation are described in Appendix B
(online supplement).
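One way such leave-one-out scoring can be made efficient is to compute class counts once and then subtract each document's own contribution before scoring it, so that no retraining is needed. The sketch below illustrates this idea only; the log-odds-ratio weight, the 0.5 smoothing constant, and the name `loo_scores` are assumptions, and the implementation used in the study is the one described in Appendix B.

```python
import math

def loo_scores(docs, labels, smoothing=0.5):
    """Score each document as if the model had been trained on all other documents."""
    n = {1: sum(1 for y in labels if y == 1)}
    n[-1] = len(labels) - n[1]
    df = {1: {}, -1: {}}
    for feats, y in zip(docs, labels):
        for f in feats:
            df[y][f] = df[y].get(f, 0) + 1
    scores = []
    for feats, y in zip(docs, labels):
        total = 0.0
        for f in feats:
            # Counts as they would be with this document held out of its own class.
            c = {k: df[k].get(f, 0) - (1 if k == y else 0) for k in (1, -1)}
            m = {k: n[k] - (1 if k == y else 0) for k in (1, -1)}
            p = (c[1] + smoothing) / (m[1] + 2 * smoothing)
            q = (c[-1] + smoothing) / (m[-1] + 2 * smoothing)
            total += math.log(p * (1 - q) / (q * (1 - p)))
        scores.append(total)
    return scores

# Toy corpus (hypothetical documents).
docs = [{"pentazocine", "analgesic"}, {"analgesic", "opioid"}, {"opioid", "receptor"},
        {"receptor", "binding"}, {"liver", "enzyme"}]
labels = [1, 1, -1, -1, -1]
scores = loo_scores(docs, labels)
```

The adjustment is possible because each feature's estimate depends only on simple counts, which is a consequence of the independence assumption.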
Feature Selection
Proper feature selection often improves the performance of
a machine learner. For the naïve Bayes classifier we implemented feature selection by setting a threshold and retaining
only those features whose weights are in absolute value
above that threshold. Previous research has shown this to be
a highly effective feature selection method for naïve Bayes
when class size is unbalanced.30 This allows us to reduce the
feature dimensionality and generally see a gain in performance. In most of our work with naïve Bayes reported here,
we use a fixed threshold value of 1. This allows a large
reduction in the number of features and generally does not
degrade performance. To further investigate the effectiveness of feature selection, we estimated optimal threshold
values for each classification task on the WTS (using leave-one-out) and tested them in order to compare the performance with our example selection method.
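The thresholding step described above amounts to a simple filter on the absolute value of each weight. The following sketch assumes the weights are held in a dictionary; the function name `select_features` and the example weights are hypothetical.

```python
def select_features(weights, threshold=1.0):
    """Keep only features whose weight exceeds the threshold in absolute value."""
    return {f: w for f, w in weights.items() if abs(w) > threshold}

# Made-up weights for illustration only.
example_weights = {"pentazocine": 6.2, "patients": 0.3, "the study": -0.1, "rat liver": -2.4}
kept = select_features(example_weights, threshold=1.0)
```

Features with weights near zero carry little evidence for either class, so dropping them shrinks the model while leaving strongly positive and strongly negative features in place.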
Evaluation
To measure classification performance, we used average
precision. This simple measure is well suited to our problem. To calculate average precision,31 the 5,511,502 documents in the test set are ordered by the score from the
machine learner. Precisions are calculated at each rank
where a C1 document is found and these precision values are
averaged. A detailed definition of average precision is
provided in Appendix D (online supplement). We also
present precision-recall curves for a sample of the classification tasks.
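The calculation described above can be sketched in a few lines. The function name `average_precision` and the toy scores are assumptions for illustration; ties in score are broken by input order here, which may differ from the treatment in Appendix D.

```python
def average_precision(scores, labels):
    """Mean of the precision values at each rank where a C1 (+1) document is found."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / hits if hits else 0.0

# Five ranked documents; C1 documents sit at ranks 1 and 3.
ap = average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, -1, 1, -1, -1])
```

In this toy ranking the precisions at the two C1 documents are 1/1 and 2/3, so the average precision is 5/6.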
Previous studies have shown limitations using accuracy and
ROC for imbalanced data sets.32 Accuracy is only meaningful when the cost of misclassification is the same for all
documents. Because we are dealing with cases where the
class C1 is much smaller than C−1, the cost of misclassifying
a C1 document needs to be much higher. One could classify
all documents as C−1 and obtain high accuracy if the cost is
taken to be the same over all documents. For example, with
our largest C1 class, calling all documents C−1 gives an
accuracy over 99%. For the smallest class, the accuracy
would be more than 99.99%. Clearly a more sensitive
measure is needed.
The challenge of our data set is measuring how high the C1
documents are ranked without being unduly influenced by
the large number of low-ranking C−1 documents. Given a
particular set of C1 and C−1 documents and their associated
ROC score, the ROC score can be increased simply by
adding additional irrelevant C−1 documents with scores
lower than any existing C1 documents. In fact, a ROC score
arbitrarily close to 1 can be obtained by adding enough irrelevant low-scoring C−1 documents. In sharp contrast, adding
C−1 documents that score below the lowest C1 document
has no effect on the average precision. Average precision is
much more sensitive to retrieving C1 documents in high
ranks. Because we are only interested in such high-ranking
C1 documents, average precision is well suited to our
purpose.
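The contrast between the two measures can be checked on a toy ranking. The sketch below is illustrative only: the function names, the pairwise definition of the ROC score (area under the ROC curve), and the scores are assumptions chosen for the example.

```python
def average_precision(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def roc_auc(scores, labels):
    """Fraction of (C1, C-1) pairs in which the C1 document scores higher (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A poor ranking: both C1 documents are outranked by a C-1 document.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [-1, 1, -1, 1]

# Pad with 1,000 irrelevant C-1 documents scoring below every C1 document.
padded_scores = scores + [0.1] * 1000
padded_labels = labels + [-1] * 1000

auc_before, auc_after = roc_auc(scores, labels), roc_auc(padded_scores, padded_labels)
ap_before, ap_after = average_precision(scores, labels), average_precision(padded_scores, padded_labels)
```

In this example the ROC score rises from 0.25 to above 0.99 after padding, while the average precision stays at 0.5.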
Results
The numerical results of these experiments appear in Table 1.
For naïve Bayes we used a fixed weight threshold of 1 for
feature selection except for WTS Bayes OptCut, where we
used a customized threshold value for each classification
task. Using a threshold of 1 allowed Bayes to use a much
smaller number of features on the WTS, ranging from 15,261
to 1,080,168 features depending on the classification task
(without a threshold there are 56,194,161 features). The OTS
size varied from 2,508 to 742,540 documents for different
MeSH assignments, which is 0.02% to 6.74% of the WTS size.
Table 1. Average Precisions for Prediction of MeSH Terms in MEDLINE Articles
MeSH Terms                 Number of C1  OTS Size  WTS Bayes  WTS Bayes OptCut*  OTS Bayes  Ran STS Bayes  OTS CMLS
Rats, Wistar                    122,815   742,540      0.160              0.309      0.376          0.154     0.386
Myocardial infarction           101,810   252,131      0.325              0.644      0.674          0.322     0.688
Blood platelets                  51,793   128,286      0.274              0.600      0.599          0.265     0.649
Serotonin                        50,522   124,581      0.175              0.564      0.578          0.163     0.626
State medicine                   31,338   357,993      0.096              0.215      0.262          0.091     0.244
Bladder                          30,572   154,715      0.231              0.474      0.481          0.219     0.514
Drosophila melanogaster          21,695    53,740      0.243              0.684      0.688          0.204     0.689
Tryptophan                       20,391   194,950      0.108              0.514      0.500          0.094     0.557
Laparotomy                       10,284   173,304      0.043              0.218      0.209          0.040     0.289
Crowns                           10,152    51,138      0.178              0.501      0.551          0.168     0.581
Streptococcus mutans              5,105    12,430      0.374              0.716      0.744          0.386     0.752
Infectious mononucleosis          5,040    46,260      0.134              0.537      0.583          0.130     0.614
Blood banks                       4,076    39,494      0.109              0.256      0.315          0.107     0.345
Humeral fractures                 4,087    31,793      0.128              0.450      0.507          0.122     0.569
Tuberculosis, lymph node          3,036    66,584      0.117              0.249      0.343          0.108     0.348
Mentors                           3,275    55,214      0.048              0.367      0.368          0.038     0.419
Tooth discoloration               2,052    19,764      0.108              0.302      0.365          0.094     0.469
Pentazocine                       2,014    12,202      0.041              0.678      0.590          0.030     0.681
Hepatitis E                       1,032     2,508      0.309              0.611      0.675          0.290     0.629
Genes, p16                        1,057    11,847      0.100              0.319      0.286          0.081     0.244
Average                                                0.165              0.460      0.485          0.155     0.515
CMLS = C-modified least squares; MeSH = medical subject headings; OTS = optimal training set; Ran STS = random small training set.
*Used an optimal cutoff for feature selection in Bayes. The other Bayes classification tasks used cutoff value 1.
Whole training set (WTS) = 11,023,004 documents. Optimal training set (OTS) = C1 documents + optimal C−1 documents (details in Figure 1).
Figure 2. A comparison of precision-recall curves of WTS Bayes and OTS Bayes. (A) Drosophila melanogaster, (B) Streptococcus
mutans, (C) Mentors, (D) Pentazocine.
The average precision using Bayes on the OTS (OTS Bayes)
ranged from 0.209 to 0.744, with an average of 0.485.
Compared to the overall average of 0.165 seen on the WTS,
this is nearly a 200% improvement. Figure 2 shows precision-recall curves of WTS Bayes and OTS Bayes for some
classification tasks. Here it is helpful to recall that the
average precision is the area under the precision-recall
curve. In all cases the area under the curve for OTS training
was much larger than for WTS training. Also, the OTS curve
was above the WTS curve for most recall levels except for
recall levels close to 1 in some cases. This is highly preferable
in information retrieval where relevant examples should
appear in the top ranks.
To address the importance of the size of the OTS versus the
specific documents in the OTS, we created a comparable
random small training set (Ran STS in Table 1). It includes
all of the C1 documents, just as in the OTS, and a number of
randomly selected C−1 documents equal to the number of
C−1 documents in the OTS. Thus, Ran STS has the same size
as the OTS. For example, for the MeSH term “Rats, Wistar” the Ran STS has
122,815 C1 documents and 619,725 (= OTS size − C1 size =
742,540 − 122,815) randomly selected C−1 documents from the
WTS. When Bayesian learning was performed on the random STS (Ran STS Bayes), the average precisions were a
little lower than those obtained with the WTS and much lower than those obtained with the
OTS.
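The construction of a size-matched random set can be sketched as follows. The function name `random_sts`, the fixed seed, and the toy document identifiers are hypothetical; the only property taken from the text is that the set contains every C1 document plus as many randomly drawn C−1 documents as the OTS holds.

```python
import random

def random_sts(positives, negatives, ots_size, seed=0):
    """All C1 documents plus enough randomly drawn C-1 documents to match the OTS size."""
    rng = random.Random(seed)
    n_neg = ots_size - len(positives)      # e.g. 742,540 - 122,815 = 619,725
    return list(positives) + rng.sample(list(negatives), n_neg)

# Toy usage: 5 C1 documents, 100 C-1 documents, and an OTS of 30 documents.
positives = list(range(5))
negatives = list(range(100, 200))
ran_sts = random_sts(positives, negatives, ots_size=30)
```

Sampling without replacement keeps the random set free of duplicates, so it differs from the OTS only in which C−1 documents it contains.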
The overall average using Bayes on the WTS with an optimal
threshold (WTS Bayes OptCut in Table 1) was 0.460. The
threshold ranged from 3.6 to 9.4. The performance was
much improved over the WTS with a fixed cutoff value of 1
(WTS Bayes), but not as good as using Bayes on the OTS. In
15 of 20 MeSH assignments, the OTS Bayes performed better
than WTS Bayes OptCut.
The upper bounds of KNN were compared with 95% lower
confidence limits of Bayes. Table 2 shows the 95% lower
confidence limits of the average precisions for Bayes trained
on the OTS. These were obtained by Student’s t-test. It also
shows upper bounds for the KNN average precisions. In 17
of 20 MeSH assignments, Bayes using OTS was statistically
better than KNN. We also performed the sign test:
superiority in 17 of 20 cases yields a p-value of 0.00129.
Therefore, we can safely conclude that naïve Bayes using the
OTS is better than KNN.
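The reported p-value can be reproduced with a short calculation, assuming a one-sided sign test under the null hypothesis that either method is equally likely to win on each task. The function name `sign_test_one_sided` is hypothetical.

```python
from math import comb

def sign_test_one_sided(wins, trials):
    """Probability of at least `wins` successes in `trials` fair coin flips."""
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

p_value = sign_test_one_sided(17, 20)   # 1351 / 1,048,576
```

The tail counts are 1,140 + 190 + 20 + 1 = 1,351 outcomes out of 2^20, which rounds to 0.00129.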
Although it is not feasible to train a complex machine learning
algorithm on the WTS because of its huge size, the much
smaller size of the OTS allows us to use a more sophisticated learner such as CMLS. Using CMLS on the OTS
(OTS CMLS in Table 1) further improved the performance
and produced 6% better results on average than Bayes.
A plot of the average precision versus the size of the STS for
three MeSH terms appears in Figure 3. The OTS occurs at the
peak of each curve, at a much smaller size than the WTS.
Table 2. A Comparison of 95% Lower Confidence Limit of Average Precision for Bayes and Upper Bound to KNN Average Precision
MeSH Terms                 OTS Bayes 95% Lower Confidence Limit  KNN Upper Bound
Rats, Wistar               0.374*                                0.414
Myocardial infarction      0.671                                 0.623
Blood platelets            0.595                                 0.521
Serotonin                  0.574                                 0.473
State medicine             0.258                                 0.216
Bladder                    0.476                                 0.461
Drosophila melanogaster    0.683                                 0.579
Tryptophan                 0.494                                 0.398
Laparotomy                 0.202                                 0.151
Crowns                     0.542                                 0.518
Streptococcus mutans       0.733                                 0.674
Infectious mononucleosis   0.571                                 0.506
Blood banks                0.305                                 0.231
Humeral fractures          0.493*                                0.530
Tuberculosis, lymph node   0.327                                 0.295
Mentors                    0.350                                 0.301
Tooth discoloration        0.348*                                0.366
Pentazocine                0.565                                 0.436
Hepatitis E                0.654                                 0.567
Genes, p16                 0.271                                 0.267
Average                                                          0.426
*95% lower confidence limit of OTS Bayes is less than the KNN upper bound.
KNN = K-nearest neighbor; MeSH = medical subject headings; OTS = optimal training set.
Discussion
Using an optimal training set can greatly improve learning
with naïve Bayes. Although 11 million training documents
can easily be handled, it is not the best option. Much better
results can be obtained using a carefully chosen smaller
training set. In the smallest improvement seen, the average
precision nearly doubled. In the best case, the average precision improved
by a factor of 14. Proper feature selection is
another way to improve the naïve Bayes performance. When
using an optimized weight threshold (WTS Bayes OptCut)
for each classification task, we saw much better performance
than using a fixed weight threshold for all classification
tasks (WTS Bayes). The overall performance, however, was
not as good as using the OTS. As a further benefit of the OTS
approach, the small size of the OTS allows using a more
sophisticated machine learner such as CMLS. CMLS generally performs better than naïve Bayes. In our case, CMLS
trained on the OTS showed a 6% improvement over naïve
Bayes trained on the OTS. In practice, it would make sense
to apply CMLS to OTS, if affordable, because in most cases
it was better than naïve Bayes. A KNN classifier, which is
used as one method to rank MeSH terms by the Indexing
Initiative at the NLM,1,2 was also compared with naïve
Bayes on the OTS. In most cases, naïve Bayes performed
better than KNN. On average CMLS on the OTS was 21%
better than KNN.
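The percentage figures quoted in this article follow directly from the average rows of Tables 1 and 2. The sketch below recomputes them; the helper name `relative_improvement` and the dictionary layout are assumptions for illustration.

```python
def relative_improvement(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return 100.0 * (new - baseline) / baseline

# Averages taken from Table 1 and Table 2.
averages = {"WTS Bayes": 0.165, "OTS Bayes": 0.485, "OTS CMLS": 0.515, "KNN upper bound": 0.426}

ots_vs_wts = relative_improvement(averages["OTS Bayes"], averages["WTS Bayes"])
bayes_vs_knn = relative_improvement(averages["OTS Bayes"], averages["KNN upper bound"])
cmls_vs_bayes = relative_improvement(averages["OTS CMLS"], averages["OTS Bayes"])
cmls_vs_knn = relative_improvement(averages["OTS CMLS"], averages["KNN upper bound"])
```

These come to roughly 194%, 14%, 6%, and 21%, matching the "nearly 200%", 14%, 6%, and 21% stated in the text.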
An important question is why training on the relatively
small OTS produces such a large improvement in the naïve
Bayes classifier’s performance when compared with training
on the WTS. At the most basic level, if the naïve assumption
that all features are independent of each other given the
context of the class were true, arguably, naïve Bayes would
be the ideal classifier. Because this naïve assumption is
generally false, various more complex algorithms such as
support vector machines, CMLS, and decision trees, and
even more sophisticated Bayesian network approaches33
have been developed to deal at some level with dependencies among features. These more complex algorithms often
give an improvement over naïve Bayes but with a higher
computational cost. Because our approach is successful in
improving over naïve Bayes on the WTS, it must also be a
way to deal with dependencies.
To see how our approach deals with dependencies, it is
helpful to consider the following argument. Naïve Bayes on
the WTS learns how to distinguish C1 and C−1 documents.
But C−1 consists of two types of documents, i.e.,
$$C_{-1} = B_1 \cup B_{-1} \qquad (1)$$
where B1 is a very small set of documents that are close in
content to C1 documents, whereas B−1 is most of C−1 and
consists of documents that are distant from C1 and unlikely
to be confused with it. Now training naïve Bayes on the WTS
is essentially teaching it to distinguish C1 and B−1, because B1 will
have almost no influence on the probability estimates used
to compute weights. Such training may be far from optimal
in its ability to distinguish between C1 and B1. Our method
is a way to determine B1 so that OTS = C1 ∪ B1. Then training
on the OTS is optimal naïve Bayesian training to distinguish
C1 and B1, and because B−1 is already distant from C1 we
may expect to see a large improvement in performance. But
the very existence of such a set as B1 is only possible because
features are not independently distributed in C−1. It is the
co-occurrence of a number of features in a single document
in B1 at a rate above chance that gives that document a particular
flavor or topicality that can be very similar to and confused
with a document in C1. Removal of the B−1 set from the
training process alleviates the feature dependency problem
and improves the results. It is a solution somewhat similar to
the support vector machine in which the training generally
focuses on a small set of support vectors that determine the
final training result while a large part of the training set is
ignored. However, our Bayesian approach is much more
efficient to train than a support vector machine.
Figure 3. Average precision versus number of documents for several MeSH terms.
In the final analysis, whatever improvement one sees must
be reflected in a difference in the weights computed. Equation (A.6) in Appendix A (online supplement) gives the
definition of a weight for a Bayesian learner. A document’s
score is the sum of weights for features that appear in the
document. A positive weight value denotes that a feature is
more likely to appear in C1 documents, a negative weight
value denotes that a feature is more likely to appear in C−1
documents, and a weight value close to zero denotes no
preference for either class. Tables 3 and 4 show examples of
features (words) from the “pentazocine” classification task
with significant weight changes between the OTS and the
WTS. The features in Table 3 are about equally likely to appear in C1 and
in the nearby C−1 documents included in the OTS (set B1).
They are not useful for distinguishing the two sets and have
OTS weights near zero. However, when the WTS is used, C−1
includes many distant documents that do not contain these
features (set B−1), so the features now appear relatively more
frequent in C1 documents, leading to positive WTS weights.
Table 4 lists features that are somewhat related to C1
documents but that appear more frequently in set B1
documents, and so have negative OTS weights. They are useful
for recognizing that a document is not a C1 document. When
the many distant B−1 documents are included in the WTS,
these features also emerge as more common in C1 than
in C−1 and receive a positive weight. In both Tables 3 and 4,
the more positive weights obtained with the WTS move the
scores of documents in B1 to more positive values, harming
precision. Careful example selection, which eliminates irrelevant documents from the majority class, C−1, helps to
alleviate this problem.
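The weight shifts described above can be reproduced with a small numerical sketch. Equation (A.6) is given only in the online supplement, so the code assumes the common log-odds form w = log[p(1 − q) / (q(1 − p))], where p and q are smoothed estimates of the probability that a feature occurs in a C1 or C−1 document. All counts are hypothetical and are not taken from the pentazocine data.

```python
import math


def feature_rate(n_with_feature, n_docs, smoothing=1.0):
    """Smoothed estimate of P(feature | class) from document counts."""
    return (n_with_feature + smoothing) / (n_docs + 2.0 * smoothing)


def log_odds_weight(p, q):
    """Assumed weight form: log odds of the feature in C1 versus C-1."""
    return math.log(p * (1.0 - q) / (q * (1.0 - p)))


def ots_and_wts_weights(c1_hits, b1_hits, far_hits,
                        n_c1=100, n_b1=100, n_far=100_000):
    """Weight of one feature when C-1 is B1 only (OTS) or B1 plus B-1 (WTS)."""
    p = feature_rate(c1_hits, n_c1)
    q_ots = feature_rate(b1_hits, n_b1)
    q_wts = feature_rate(b1_hits + far_hits, n_b1 + n_far)
    return log_odds_weight(p, q_ots), log_odds_weight(p, q_wts)


# Table 3 pattern: feature equally frequent in C1 and B1, rare in B-1.
neutral_ots, neutral_wts = ots_and_wts_weights(c1_hits=30, b1_hits=30, far_hits=10)

# Table 4 pattern: feature more frequent in B1 than in C1, rare in B-1.
negative_ots, negative_wts = ots_and_wts_weights(c1_hits=10, b1_hits=30, far_hits=10)

print(f"neutral feature:  OTS {neutral_ots:+.4f}  WTS {neutral_wts:+.4f}")
print(f"negative feature: OTS {negative_ots:+.4f}  WTS {negative_wts:+.4f}")
```

Under these assumed counts, the first feature has an OTS weight of zero and the second a negative OTS weight, yet both receive clearly positive WTS weights, because the large B−1 set drives the C−1 occurrence estimate toward zero.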
The poor results from the Ran STS show that random
down-sampling of examples does not work in our data sets,
even though others observed better performance [24] in a
different setting. The dramatically better results from the
OTS, which is the same size as the Ran STS, demonstrate that
the improvement is due to the particular documents
selected, not just the small size. The nature of the OTS is
more important than its size. In the OTS, C−1 documents are
more likely to be close to C1 documents; they lie near the
decision boundary. This gives more discriminative learning
and better results.
We believe there is a larger lesson in our experience estimating probabilities of word occurrences. In any endeavor in
which probabilities must be estimated, the choice of training
data can be crucial to success. A large number of training
examples that are irrelevant to the issue can seriously dilute
those relevant examples that would otherwise provide useful probability estimates. This phenomenon may be important
not only for naïve Bayes learners, but also for Bayesian
networks, Markov models, and decision trees, which all use
probabilities.

Table 3. “Pentazocine” Classification Task: Neutral Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term         Weight Trained on OTS    Weight Trained on WTS
Antinociception      0.0019                   3.7516
Addictive            0.0115                   3.1606
Anesthetic agents    0.0115                   2.3108
Central action       0.0115                   3.4684

Abbreviations as in Table 1.

Table 4. “Pentazocine” Classification Task: Negative Weighted Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term         Weight Trained on OTS    Weight Trained on WTS
Bupivacaine          −1.0459                  1.2844
Thiopental           −1.0459                  1.2598
Dynorphin            −1.0014                  1.4213
Sphincter of Oddi    −1.0014                  3.6554

Abbreviations as in Table 1.

There is a need for further work. Much work has gone into
feature selection. More consideration should be given to
example selection. This is especially true when the C1
documents are a very small proportion of the available
training set. However, we have seen similar results in a few
cases of balanced training sets. Although we are not doing
active learning, our identification of the optimal training set
for Bayesian learning is still iterative. We would like to
identify this set directly, possibly using information available from learning on the whole training set. Finally, we
would like to investigate optimal training set creation for
other learning methods, such as CMLS.
References
Note: References 34–36 are cited in the online data supplement to
this article at www.jamia.org.
1. Aronson AR, Bodenreider O, Chang HF, et al. The NLM Indexing Initiative. Proc AMIA Symp 2000:17–21.
2. Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative’s Medical Text Indexer. Medinfo 2004:268–72.
3. Cooper GF, Miller RA. An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. J Am Med Inform Assoc 1998;5:62–75.
4. Fowler J, Maram S, Kouramajian V, Devadhar V. Automated MeSH indexing of the World-Wide Web. Proc Annu Symp Comput Appl Med Care 1995:893–7.
5. Kim W, Aronson AR, Wilbur WJ. Automatic MeSH term assignment and quality assessment. Proc AMIA Symp 2001:319–23.
6. Kim W, Wilbur WJ. A strategy for assigning new concepts in the MEDLINE database. AMIA 2005 Symp Proc 2005:395–9.
7. Kouramajian V, Devadhar V, Fowler J, Maram S. Categorization by reference: A novel approach to MeSH term assignment. Proc Annu Symp Comput Appl Med Care 1995:878–82.
8. Ruch P. Automatic assignment of biomedical categories: Toward a generic approach. Bioinformatics 2006;22:658–64.
9. Blum AL, Langley P. Selection of relevant features and examples in machine learning. Art Intell 1997;97:245–71.
10. Burges CJC. A tutorial on support vector machines for pattern recognition. Available electronically from the author: Bell Laboratories, Lucent Technologies, 1999.
11. Pavlov D, Mao J, Dom B. Scaling-up support vector machines using boosting algorithm. 15th International Conference on Pattern Recognition, Barcelona, Spain, September 3–8, 2000. Los Alamitos, CA: IEEE Computer Society; 2000:2219–22. Available at http://doi.ieeecomputersociety.org/10.1109/ICPR.2000.906052. Accessed May 22, 2008.
12. Boley D, Cao D. Training support vector machines using adaptive clustering. In Berry M, Dayal U, Kamath C, Skillicorn D, eds. 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22–24, 2004. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2004:126–37.
13. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
14. Lewis DD, Catlett J. Heterogeneous uncertainty sampling for supervised learning. In Cohen WW, Hirsh H, eds. Eleventh International Conference on Machine Learning, New Brunswick, New Jersey, July 10–13, 1994. San Francisco, CA: Morgan Kaufmann Publishers; 1994:148–56.
15. Lewis DD, Gale WA. A sequential algorithm for training text classifiers. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 3–6, 1994. New York, NY: Springer-Verlag; 1994:3–12.
16. Freund Y, Seung H, Shamir E, Tishby N. Selective sampling using the query by committee algorithm. Mach Learn 1997;28:133–68.
17. Seung HS, Opper M, Sompolinsky H. Query by committee. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, July 27–29, 1992. New York, NY: ACM Press; 1992:287–94. Available at http://doi.acm.org/10.1145/130385.130417. Accessed May 22, 2008.
18. Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res 2001;2:45–66.
19. Roy N, McCallum A. Toward optimal active learning through sampling estimation of error reduction. In Brodley CE, Danyluk AP, eds. Eighteenth International Conference on Machine Learning, Williamstown, MA, June 28–July 01, 2001. San Francisco, CA: Morgan Kaufmann Publishers; 2001.
20. Bordes A, Ertekin S, Weston J, Bottou L. Fast kernel classifiers with online and active learning. J Mach Learn Res 2005;6:1579–619.
21. Schohn M, Cohn D. Less is more: Active learning with support vector machines. In: Langley P (ed). Proceedings of the Seventeenth International Conference on Machine Learning 2000. San Francisco, CA: Morgan Kaufmann, 2000.
22. Schapire RE. The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification; 2002, 2002.
23. Wilbur WJ. Boosting naive Bayesian learning on a large subset of MEDLINE. American Medical Informatics 2000 Annual Symposium; 2000. Los Angeles, CA: American Medical Informatics Association, 2000:918–22.
24. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intell Data Anal 2002;6:429–50.
25. Domingos P. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999. San Diego, CA, August 15–18, 1999. New York, NY: ACM Press, 1999:155–64.
26. Maloof M. Learning when data sets are imbalanced and when costs are unequal and unknown. Proceedings of the ICML-2003 Workshop: Learning with Imbalanced Data Sets II, August 21–24, 2003. Menlo Park, CA: AAAI Press, 2003:73–80.
27. Nickerson AS, Japkowicz N, Milios E. Using unsupervised learning to guide resampling in imbalanced data sets. Proceedings of the Eighth International Workshop on AI and Statistics, January 4–7, 2001. London, UK: Gatsby Computational Neuroscience Unit, 2001:261–5.
28. Lewis DD. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML 1998:4–15.
29. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Inf Retrieval 2001;4:5–31.
30. Mladenic D, Grobelnik M. Feature selection for unbalanced class distribution and naive Bayes. Sixteenth International Conference on Machine Learning, 1999. San Francisco, CA: Morgan Kaufmann, 1999:258–67.
31. Manning CD, Schutze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
32. Visa S, Ralescu A. Issues in Mining Imbalanced Data Sets: A Review Paper. Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, April 16–17, 2005. Cincinnati, OH: University of Cincinnati, 2005:67–73.
33. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997;29(2–3):131–63.
34. Madsen RE, Kauchak D, Elkan C. Modeling word burstiness using the Dirichlet distribution. 22nd International Conference on Machine Learning, 2005. Bonn, Germany: ACM Press, 2005:545–52.
35. Witten IH, Moffat A, Bell TC. Managing Gigabytes. Second edition. San Francisco: Morgan-Kaufmann, 1999.
36. Salton G. Automatic Text Processing. Reading, MA: Addison-Wesley, 1989.
Optimal Training Sets for Bayesian
Prediction of MeSH ® Assignment
Sunghwan Sohn, Won Kim, Donald C Comeau, et al.
J Am Med Inform Assoc 2008 15: 546-553
doi: 10.1197/jamia.M2431