Small Sample Inference in Classification Using the CUD Bound
Eric Laber and Susan A. Murphy
University of Michigan, Ann Arbor
August 7, 2008

Introduction

What is this talk about?
- Constructing faithful confidence sets for the generalization error
- Coping with non-regularity
- Living in the small sample world (no test set)

What is this talk not about?
- Mining Complex and High-Dimensional Data
- Muppets

Setup

Classification in a nutshell:
- We observe iid (feature, label) pairs {(x_i, y_i)}_{i=1}^n from a distribution F (assume x_i ∈ R^p and y_i ∈ {−1, 1})
- Goal in classification: construct a "rule" which predicts the label Y from the feature X with high probability

Given a class of rules F, we would like

    f̂ = arg min_{f ∈ F} P 1{Y f(X) < 0}        (generalization error)

but we settle for

    f̂ = arg min_{f ∈ F} P_n l(x, y, f)         (surrogate loss)

for some smooth surrogate loss function l, e.g. l(x, y, f) = (y − f(x))^2.

Given the classification rule f̂ = arg min_{f ∈ F} P_n l(x, y, f), we want Q_{1−γ} so that

    Pr{ (P_n − P) 1{y f̂(x) < 0} < Q_{1−γ} } ≥ 1 − γ,

where Q_{1−γ} is a function of the training data {(x_i, y_i)}_{i=1}^n (and, of course, of γ).

Things to note:
- We are operating in a data-impoverished setting: no test set
- No distributional assumptions are made about (P_n − P) 1{y f̂(x) < 0} or F
- √n (P_n − P) 1{y f̂(x) < 0} need not converge to a tight limit
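To make the setup concrete, here is a minimal sketch (Python; not part of the original talk). The toy data-generating distribution and all variable names are hypothetical; it fits a linear rule by minimizing the squared surrogate loss and then evaluates the empirical 0-1 loss P_n 1{y f̂(x) < 0}.

```python
# A minimal sketch of the setup above, assuming a toy Gaussian class-conditional
# distribution for F. We fit f(x) = x'beta by minimizing the squared surrogate
# loss Pn (y - f(x))^2, which has the least-squares closed form.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2                                   # small-sample regime, as in the talk

y = rng.choice([-1, 1], size=n)                # labels in {-1, 1}
x = rng.normal(loc=y[:, None] * 0.5, scale=1.0, size=(n, p))
X = np.column_stack([np.ones(n), x])           # design matrix with intercept

# f_hat = argmin_{f in F} Pn l(x, y, f) with l(x, y, f) = (y - f(x))^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Empirical (plug-in) generalization error Pn 1{y f_hat(x) < 0}.
train_err = np.mean(y * (X @ beta_hat) < 0)
print(f"empirical 0-1 error: {train_err:.3f}")
```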
Relevant Work

"But I can think of at least a dozen things to try. Let's see, there's the bootstrap, cross-validation, normal approximation, Bayesian methods, VC bounds..."

A snapshot of the difficulties:
- Bootstrap: non-smooth sampling distribution of √n (P_n − P) 1{y f̂(x) < 0}
- Distributional approximations: the sampling distribution of √n (P_n − P) 1{y f̂(x) < 0} need not follow any particular distribution
- VC bounds: small samples make these bounds too loose

A New Approach

The CUD Bound

Trivially we have the following inequality:

    (P_n − P) 1{y f̂(x) < 0} ≤ sup_{f ∈ F} (P_n − P) 1{y f(x) < 0}

- The above bound is too loose because the sup is taken over a huge space (we call this a uniform deviation bound)
- Taking the sup over a smaller space will lead to tighter bounds
- What space should we choose?
- Solution: take the sup over a contiguous ball centered at

    f* = arg min_{f ∈ F} P l(x, y, f)

Since f̂ minimizes P_n l(x, y, f) we have:

    √n (P_n − P) 1{y f̂(x) < 0} ≤ sup_{f ∈ F} √n (P_n − P) 1{y f(x) < 0} × 1{ P_n( l(x, y, f*) − l(x, y, f) ) ≥ −1/α_n }

where α_n is a sequence of positive numbers tending to ∞, depending on the surrogate loss l and the sample size n.

The non-smoothness of the indicator function is problematic, so we replace it with the smooth surrogate g(x) = (1 + x) ∧ 1 to obtain the CUD bound:

    √n (P_n − P) 1{y f̂(x) < 0} ≤ sup_{f ∈ F} √n (P_n − P) 1{y f(x) < 0} × g{ α_n P_n( l(x, y, f*) − l(x, y, f) ) }

What does CUD stand for? Constrained Uniform Deviation.
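As an illustration only (this is not the authors' implementation), the sketch below evaluates the right-hand side of the CUD bound in a setting where F is known, so that P can be approximated by a large Monte Carlo sample. The sup over F is approximated by a crude random search over linear rules near f̂ rather than the exact computation described later, and α_n = n / log(n) anticipates the choice motivated on the following slides.

```python
# A toy evaluation of the CUD statistic
#   sup_f sqrt(n) (Pn - P) 1{y f(x) < 0} * g( alpha_n * Pn[l(x,y,f*) - l(x,y,f)] )
# with g(t) = (1 + t) ∧ 1 = min(1 + t, 1). Everything here is a hypothetical
# stand-in: F is a toy distribution, and random search replaces the exact sup.
import numpy as np

rng = np.random.default_rng(1)

def draw(m):                                    # sample m points from the toy F
    y = rng.choice([-1, 1], size=m)
    x = rng.normal(y[:, None] * 0.5, 1.0, size=(m, 2))
    return np.column_stack([np.ones(m), x]), y

def err(X, y, beta):                            # empirical 0-1 loss
    return np.mean(y * (X @ beta) < 0)

def sqloss(X, y, beta):                         # squared surrogate loss
    return np.mean((y - X @ beta) ** 2)

def g(t):                                       # smooth surrogate for 1{t >= -1}
    return min(1.0 + t, 1.0)

n = 30
Xn, yn = draw(n)                                # training sample (defines Pn)
Xbig, ybig = draw(100_000)                      # large sample standing in for P

beta_hat, *_ = np.linalg.lstsq(Xn, yn, rcond=None)       # f_hat minimizes Pn l
beta_star, *_ = np.linalg.lstsq(Xbig, ybig, rcond=None)  # approximates f*

alpha_n = n / np.log(n)
cud = -np.inf
for _ in range(2000):                           # crude random search for the sup
    beta = beta_hat + rng.normal(0.0, 0.5, size=3)
    dev = np.sqrt(n) * (err(Xn, yn, beta) - err(Xbig, ybig, beta))
    weight = g(alpha_n * (sqloss(Xn, yn, beta_star) - sqloss(Xn, yn, beta)))
    cud = max(cud, dev * weight)

print(f"approximate CUD statistic: {cud:.3f}")
```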
The CUD Bound

Theorem. If Q_{1−γ} is the 1 − γ quantile of

    sup_{f ∈ F} √n (P_n − P) 1{y f(x) < 0} × g{ α_n P_n( l(x, y, f*) − l(x, y, f) ) },

where α_n is any sequence of positive numbers, then

    Pr{ √n (P_n − P) 1{y f̂(x) < 0} < Q_{1−γ} } ≥ 1 − γ.

Issues:
- How do we choose α_n?
- How do we use this result?

Answers:
- Choose α_n so that the CUD bound converges to a tight limit; with a linear classifier and a strictly convex l(x, y, f) we have α_n = n.
- In practice we must estimate the CUD bound using the bootstrap.

Theorem. If F is the space of linear classifiers and l(x, y, f) is strictly convex (and subject to additional regularity conditions), then

    sup_{f ∈ F} √n (P̂_n − P_n) 1{y f(x) < 0} × g{ (n / log n) P̂_n( l(x, y, f̂) − l(x, y, f) ) }

converges weakly to a tight limit in l^∞(R^p).

Note: so in the bootstrap case α_n should be chosen to be n / log(n).

Computation

How do we compute the sup? At first glance the required optimization problem

    sup_{f ∈ F} √n (P̂_n − P_n) 1{y f(x) < 0} × g{ (n / log n) P̂_n( l(x, y, f̂) − l(x, y, f) ) }

seems infeasible. It turns out that when the classifier is linear, it can be computed by solving a series of 2n MINLP problems; when l is convex, the sup in the preceding display can be computed efficiently and exactly.
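Continuing the toy example above, here is a hedged sketch of the bootstrap estimate of Q_{1−γ}. It reuses draw, err, sqloss, g, n, Xn, yn, beta_hat, and alpha_n from the previous sketch; random search again stands in for the exact MINLP computation, and B is kept small for speed. The printed one-sided bound is just the event in the first theorem; how the quantile is turned into the two-sided confidence sets reported in the simulations follows the talk's construction, which is not reproduced here.

```python
# A sketch of the bootstrap estimate of Q_{1-gamma}: for each resample, compute
#   sup_f sqrt(n) (Pn-hat - Pn) 1{y f(x) < 0}
#           * g( (n / log n) * Pn-hat[ l(x,y,f_hat) - l(x,y,f) ] )
# and take the (1 - gamma) empirical quantile over B resamples.
import numpy as np

rng = np.random.default_rng(2)
B, gamma = 200, 0.05
stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap resample (Pn-hat)
    Xb, yb = Xn[idx], yn[idx]
    sup = -np.inf
    for _ in range(200):                        # random search near f_hat again
        beta = beta_hat + rng.normal(0.0, 0.5, size=3)
        dev = np.sqrt(n) * (err(Xb, yb, beta) - err(Xn, yn, beta))
        weight = g(alpha_n * (sqloss(Xb, yb, beta_hat) - sqloss(Xb, yb, beta)))
        sup = max(sup, dev * weight)
    stats[b] = sup

q = np.quantile(stats, 1 - gamma)
# The theorem's event reads: with probability >= 1 - gamma,
#   P 1{y f_hat(x) < 0}  >  Pn 1{y f_hat(x) < 0}  -  Q / sqrt(n).
bound = err(Xn, yn, beta_hat) - q / np.sqrt(n)
print(f"Q estimate: {q:.3f}; theorem bound on P 1{{y f_hat(x) < 0}}: > {bound:.3f}")
```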
The CUD Bound: Simulations

We compare the performance of the CUD bound with several methods from the literature:
- We use a mix of real datasets (from the UCI repository) and simulated datasets
- We use linear classifiers with squared error loss
- We record coverage over 1000 Monte Carlo generations of each dataset
- As competitors we consider:
  - Two bootstrap methods
  - Two distributional methods
  - One CV method
  - An approximation to Langford's inverse binomial method
- See www.stat.lsa.umich.edu/~laber for details

Simulation Results

Table: Average coverage of the new method and competitors for n = 30. The target coverage level was .950; entries meeting or exceeding this level were highlighted in the original slides.

    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         1.00   .478   .983   .782   .632   .996   .636
    Donut        .999   .880   .908   .631   .633   .937   .620
    Ionosphere   .998   .605   .926   .816   .757   .994   .747
    Heart        .999   .406   .981   .718   .475   .998   .458
    Diabetes     1.00   .654   .900   .912   .986   .997   .984
    Outlier      .988   .884   .736   .808   .838   .910   .961
    3 Pt.        .993   .827   .717   .844   .896   .753   .952
    Abalone      .983   .963   .737   .972   .991   .998   .997
    Liver        .961   .964   .878   .980   1.00   1.00   1.00
    Magic        .998   .918   .920   .961   .982   .991   .992
    χ2           .985   .958   .786   .961   .982   .996   1.00
    Mammogram    1.00   .678   .994   .646   .426   .983   .406

Table: Average diameter of the confidence sets constructed using the new method and competitors for n = 30. The method with the smallest diameter among those with correct coverage was highlighted in the original slides.

    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         .469   .432   .432   .315   .314   .451   .354
    Donut        .485   .590   .590   .324   .324   .414   .364
    Ionosphere   .437   .429   .429   .301   .296   .501   .337
    Heart        .459   .468   .468   .319   .319   .433   .359
    Diabetes     .433   .305   .305   .307   .310   .312   .350
    Outlier      .555   .491   .491   .328   .329   .456   .368
    3 Pt.        .508   .476   .476   .318   .317   .457   .356
    Abalone      .617   .314   .314   .331   .331   .504   .371
    Liver        .559   .372   .372   .327   .326   .485   .366
    Magic        .606   .307   .307   .300   .283   .464   .324
    χ2           .595   .330   .330   .329   .330   .501   .351
    Mammogram    .464   .532   .532   .321   .321   .421   .361

Conclusions

- The CUD bound provided the appropriate coverage on all datasets
- The CUD bound was conservative but not trivial (it had the smallest diameter for two of the datasets)
- Future work includes trying to correct the conservatism and generalizing the approach

Thanks!
References

- B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331, 1983.
- B. Efron and R. Tibshirani. Cross-validation and the bootstrap: Estimating the error of a prediction rule. Technical report, 1995.
- R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137–1145, 1995.
- J. Langford. Combining train set and test set bounds. In ICML, 2002.
- J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
- J. K. Martin and D. S. Hirschberg. Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report ICS-TR-96-22, 1996.
- R. A. Schiavo and D. J. Hand. Ten more years of error rate research. International Statistical Review, 68(3):295–310, 2000.
- Y. Yang. Comparing learning methods for classification. Statistica Sinica, 16(2):635–657, 2006.