
How to be a Bayesian
without believing
Yoav Freund
Joint work with Rob Schapire and Yishay Mansour
1
Motivation
• Statistician: “Are you a Bayesian or a
Frequentist?”
• Yoav: “I don’t know, you tell me…”
• I need a better answer….
2
Toy example
• Computer receives telephone call
• Measures Pitch of voice
• Decides gender of caller
[Figure: human voice → classifier → Male / Female]
3
Generative modeling
[Figure: probability density vs. voice pitch; two Gaussians with parameters (mean1, var1) and (mean2, var2)]
4
Discriminative approach
[Figure: no. of mistakes as a function of the voice-pitch decision threshold]
5
Discriminative Bayesian approach
Conditional probability:
  P(g = male | x) = 1 / (1 + e^{-a x})
Prior:
  P0(a) = (1/Z) e^{-a²}
[Figure: probability vs. voice pitch, showing the prior and the resulting posterior for P(g = male | x)]
6
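A minimal numerical sketch of this discriminative Bayesian computation, assuming the reconstructed model above (a sigmoid likelihood with a single slope parameter a and a Gaussian-type prior P0(a) ∝ e^{-a²}); the toy pitch data and the grid over a are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy labeled pitches: y = +1 for "male", -1 for "female" (illustrative only).
x = np.array([-2.0, -1.5, -1.0, 0.8, 1.2, 2.0])
y = np.array([+1, +1, +1, -1, -1, -1])

# Grid over the slope parameter a, with prior P0(a) proportional to exp(-a^2).
a_grid = np.linspace(-10, 10, 2001)
log_prior = -a_grid ** 2

# Log-likelihood of the labels under P(y = +1 | x, a) = sigmoid(a * x).
log_lik = np.array([np.sum(np.log(sigmoid(y * a * x))) for a in a_grid])

# Posterior over a on the grid, and the posterior-averaged predictive.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

def p_male(x_new):
    """Posterior-averaged P(y = +1 | x_new)."""
    return float(np.sum(post * sigmoid(a_grid * x_new)))

print(p_male(-1.8), p_male(0.0), p_male(1.8))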
Suggested approach
[Figure: no. of mistakes vs. voice pitch; the pitch axis splits into "definitely female", "unsure", and "definitely male" regions]
7
Formal Frameworks
For stating theorems regarding
the dependence of the
generalization error on the size of
the training set.
8
The PAC set-up
1. Learner chooses a classifier set C,
   c ∈ C, c: X → {-1,+1},
   and requests m training examples
2. Nature chooses a target classifier c
   from C and a distribution P over X
3. Nature generates the training set
   (x1,y1), (x2,y2), …, (xm,ym)
4. Learner generates h: X → {-1,+1}
Goal: P(h(x) ≠ c(x)) < ε for all c, P
9
The agnostic set-up
Vapnik’s pattern-recognition problem
1. Learner chooses a classifier set C,
   c ∈ C, c: X → {-1,+1},
   and requests m training examples
2. Nature chooses a distribution D over
   X × {-1,+1}
3. Nature generates the training set according to D:
   (x1,y1), (x2,y2), …, (xm,ym)
4. Learner generates h: X → {-1,+1}
Goal: P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε for all D,
where c* = argmin_{c ∈ C} P_D(c(x) ≠ y)
10
Self-bounding learning
Freund 97
1. Learner selects a concept class C
2. Nature generates the training set
   T = (x1,y1), (x2,y2), …, (xm,ym)
   IID according to a distribution D over X × {-1,+1}
3. Learner generates h: X → {-1,+1}
   and a bound ε_T such that, with high probability
   over the random choice of the training set T,
   P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε_T
11
Learning a region predictor
Vovk 2000
1. Learner selects a concept class C
2. Nature generates the training set
   (x1,y1), (x2,y2), …, (xm,ym)
   IID according to a distribution D over X × {-1,+1}
3. Learner generates
   h: X → { {-1}, {+1}, {-1,+1}, {} }
   such that, with high probability,
   P_D( y ∉ h(x) ) < P_D( c*(x) ≠ y ) + ε1
   and
   P_D( h(x) = {-1,+1} ) < ε2
12
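As a concrete reading of the two criteria above, here is a minimal sketch (my own illustration, not part of the talk) of how the miss rate P(y ∉ h(x)) and the abstention rate P(h(x) = {-1,+1}) would be measured for a region predictor on a finite sample; the predictor and data are placeholders:

from typing import Callable, FrozenSet, List, Tuple

Label = int                      # -1 or +1
Region = FrozenSet[Label]        # {}, {-1}, {+1}, or {-1,+1}

def region_errors(h: Callable[[float], Region],
                  sample: List[Tuple[float, Label]]) -> Tuple[float, float]:
    """Return (miss rate: y not in h(x), abstention rate: h(x) == {-1,+1})."""
    m = len(sample)
    miss = sum(1 for x, y in sample if y not in h(x)) / m
    abstain = sum(1 for x, y in sample if h(x) == frozenset({-1, +1})) / m
    return miss, abstain

# Example region predictor: abstain on a band around 0, otherwise predict sign(x).
def h(x: float) -> Region:
    if abs(x) < 0.5:
        return frozenset({-1, +1})
    return frozenset({+1 if x > 0 else -1})

print(region_errors(h, [(-1.0, -1), (-0.2, -1), (0.3, +1), (2.0, +1)]))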
Intuitions
The rough idea
13
A motivating example
[Figure: scatter plot of positive (+) and negative (-) training examples, with several unlabeled query points marked "?"]
14
Distribution of errors
[Figure: true error vs. empirical error (on the scale 0 to 1/2) for the classifiers in the class, in the worst case and in the typical case]
Contenders for best → predict with their majority vote
Non-contenders → ignore!
15
Main result
Finite concept class
16
Notation
Data distribution: (x, y) ~ D;  y ∈ {-1, +1}
Generalization error: ε(h) ≐ P_{(x,y)~D}[ h(x) ≠ y ]
Training set: T = (x1,y1), (x2,y2), …, (xm,ym);  T ~ D^m
Training error: ε̂(h) ≐ (1/m) Σ_{(x,y)∈T} 1[ h(x) ≠ y ] = P_{(x,y)~T}[ h(x) ≠ y ]
17
The algorithm
Parameters: η > 0, Δ > 0

Hypothesis weight:  w(h) ≐ e^{-η ε̂(h)}

Empirical Log Ratio:
  l̂(x) ≐ (1/η) ln( Σ_{h: h(x)=+1} w(h) / Σ_{h: h(x)=-1} w(h) )

Prediction rule:
  p̂_{η,Δ}(x) =  +1        if l̂(x) > Δ
                {-1,+1}   if |l̂(x)| ≤ Δ
                -1        if l̂(x) < -Δ
18
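A minimal runnable sketch of this rule for a finite class, following the reconstructed formulas above; the hypothesis class (threshold stumps on a 1-D pitch feature), the data, and the values of η and Δ are illustrative assumptions, not the talk's:

import numpy as np

# Toy 1-D data: pitch values with labels in {-1,+1} (illustrative only).
X = np.array([-2.1, -1.3, -0.7, -0.2, 0.1, 0.6, 1.4, 2.2])
Y = np.array([+1, +1, +1, -1, +1, -1, -1, -1])

# Finite hypothesis class: threshold classifiers h_t(x) = +1 if x < t else -1.
thresholds = np.linspace(-3, 3, 61)
def h(t, x):
    return np.where(x < t, +1, -1)

# Training errors and hypothesis weights w(h) = exp(-eta * training error).
eta, delta = 4.0, 0.1
err_hat = np.array([np.mean(h(t, X) != Y) for t in thresholds])
w = np.exp(-eta * err_hat)

def l_hat(x):
    """Empirical log ratio at x."""
    plus = w[h(thresholds, x) == +1].sum()
    minus = w[h(thresholds, x) == -1].sum()
    return np.log(plus / minus) / eta

def predict(x):
    """Predict {+1} or {-1}, or abstain with {-1,+1} when the log ratio is small."""
    l = l_hat(x)
    if l > delta:
        return {+1}
    if l < -delta:
        return {-1}
    return {-1, +1}

for x in (-1.5, 0.0, 1.5):
    print(x, round(l_hat(x), 3), predict(x))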
Suggested tuning
η = (ln(8|H|) · m)^{1/2} / 2,   Δ = ln(8|H|) / (8m)^{1/2}

Yields:
1) P(mistake) ≐ P_{(x,y)~D}( y ∉ p̂(x) ) ≤ 2ε(h*) + O( ((ln(8|H|) + ln m) / m)^{1/2} )
2) for m = Ω( ln(1/ε*) + ln|H| ):
   P(abstain) ≐ P_{(x,y)~D}( p̂(x) = {-1,+1} ) ≤ 5ε(h*) + O( ((ln(1/ε*) + ln|H|) / m)^{1/2} )
(Here h* is the best classifier in H and ε* = ε(h*).)
19
Main properties
1. The ELR is very stable: the probability of
large deviations is independent of the size of
the concept class.
2. The expected value of the ELR is close to the
True Log Ratio (TLR), which uses the true hypothesis
errors instead of their estimates.
3. The TLR is a good proxy for the best concept
in the class.
20
McDiarmid’s theorem
Let f: X^m → R.
If for all x1, …, xm and x′_i ∈ X
  | f(x1, …, xi, …, xm) - f(x1, …, x_{i-1}, x′_i, x_{i+1}, …, xm) | ≤ c_i
and X1, …, Xm are independent random variables,
then
  P( | f(X1, …, Xm) - E[ f(X1, …, Xm) ] | > ε ) ≤ 2 exp( -2ε² / Σ_{i=1}^m c_i² )
21
Empirical log ratio is stable
For K ⊆ H define
  R̂(K) ≐ -(1/η) ln( Σ_{h∈K} e^{-η ε̂(h)} )
so that
  l̂(x) = R̂({h | h(x) = -1}) - R̂({h | h(x) = +1})

Let ε̂′(h) be the training error with one example changed, so that
  | ε̂′(h) - ε̂(h) | ≤ 1/m
and let
  R̂′(K) ≐ -(1/η) ln( Σ_{h∈K} e^{-η ε̂′(h)} )
22
Bounded variation proof
R̂′(K) - R̂(K) = (1/η) ln( Σ_{h∈K} e^{-η ε̂(h)} / Σ_{h∈K} e^{-η ε̂′(h)} )
             ≤ (1/η) ln( max_{h∈K} e^{-η ε̂(h)} / e^{-η ε̂′(h)} )
             = max_{h∈K} ( ε̂′(h) - ε̂(h) )
             ≤ 1/m
23
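Putting the last two slides together (a worked step based on the reconstructed definitions above): changing a single training example moves each of R̂({h: h(x) = -1}) and R̂({h: h(x) = +1}) by at most 1/m, hence moves l̂(x) by at most c_i = 2/m. McDiarmid then gives, for any fixed x,

  P( | l̂(x) - E[ l̂(x) ] | > ε ) ≤ 2 exp( -2ε² / Σ_{i=1}^m (2/m)² ) = 2 exp( -ε² m / 2 ),

a deviation bound that does not depend on |H|, which is property 1 on the "Main properties" slide.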
Infinite concept classes
Geometry of the concept class
24
Infinite concept classes
• The stated bounds are vacuous for infinite classes.
• How can we approximate an infinite class with a
finite class?
• Unlabeled examples give useful
information.
25
A metric space of classifiers
[Figure: classifiers f and g as points in classifier space; the example space induces the distance d]
d(f, g) = P( f(x) ≠ g(x) )
Neighboring classifiers make similar predictions
26
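A minimal sketch of estimating this distance from unlabeled examples (the classifiers and data below are illustrative, not from the talk):

import numpy as np

def disagreement(f, g, unlabeled):
    """Estimate d(f, g) = P( f(x) != g(x) ) by the disagreement rate on unlabeled data."""
    return float(np.mean(f(unlabeled) != g(unlabeled)))

# Illustrative: two threshold classifiers compared on 10,000 unlabeled pitch values.
rng = np.random.default_rng(0)
U = rng.normal(size=10_000)                  # stand-in for unlabeled examples
f = lambda x: np.where(x < 0.0, +1, -1)
g = lambda x: np.where(x < 0.3, +1, -1)
print(disagreement(f, g, U))                 # roughly P(0.0 <= x < 0.3) under N(0,1)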
-covers
[Figure: ε-covers of the classifier class inside classifier space, shown for ε = 1/10 and ε = 1/20]
The number of neighbors increases like 1/ε or like 1/ε², depending on the dimension of the class.
27
Computational issues
• How can we compute the ε-cover?
• We can use unlabeled examples to
generate the cover (see the sketch below).
• Estimate the prediction by ignoring concepts
with high error.
28
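One way to act on these bullets is a greedy cover built from the unlabeled-data disagreement distance: keep a classifier only if it disagrees with every already-kept classifier on more than an ε fraction of the unlabeled examples. This is a sketch under my own assumptions (threshold stumps, greedy selection), not necessarily the construction used in the talk:

import numpy as np

def build_eps_cover(classifiers, unlabeled, eps):
    """Greedy ε-cover: keep h unless some kept g has estimated disagreement <= eps."""
    cover, kept_preds = [], []
    for h in classifiers:
        p = h(unlabeled)
        if all(np.mean(p != q) > eps for q in kept_preds):
            cover.append(h)
            kept_preds.append(p)
    return cover

# Illustrative: cover a fine grid of threshold classifiers using unlabeled data.
rng = np.random.default_rng(1)
U = rng.normal(size=5_000)
stumps = [(lambda t: (lambda x: np.where(x < t, +1, -1)))(t)
          for t in np.linspace(-3, 3, 601)]
print(len(stumps), "->", len(build_eps_cover(stumps, U, eps=0.05)))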
Application: comparing perfect
features
• 45,000 features
• Training examples:
   – 10² negative
   – 2-10 positive
   – 10⁴ unlabeled
• More than one feature has zero training error.
• Which feature(s) should we use?
• How should we combine them?
29
A typical perfect feature
[Figure: histograms of the number of images vs. feature value, for the unlabeled, negative, and positive examples]
30
Pseudo-Bayes for a single
threshold
• The set of possible thresholds is uncountably
infinite.
• Using an ε-cover over thresholds is
equivalent to using the distribution of
unlabeled examples as the
prior distribution over the set of thresholds (a sketch follows below).
31
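A minimal sketch of this pseudo-Bayes rule for one feature, combining the slide above with the finite-class algorithm from earlier: candidate thresholds are placed at the unlabeled feature values (so the unlabeled distribution acts as the prior), each threshold is weighted by e^{-η·(training error)}, and the empirical log ratio decides +1, -1, or abstain. The data, η, and Δ below are illustrative assumptions:

import numpy as np

def pseudo_bayes_threshold(x_lab, y_lab, x_unlab, eta=50.0, delta=0.05):
    """Region predictor built from threshold stumps placed at the unlabeled points."""
    thresholds = np.sort(x_unlab)                    # unlabeled data = prior over thresholds
    # Training error of each stump h_t(x) = +1 if x >= t else -1.
    preds = np.where(x_lab[None, :] >= thresholds[:, None], +1, -1)
    err = np.mean(preds != y_lab[None, :], axis=1)
    w = np.exp(-eta * err)                           # prior weight times error factor

    def predict(x):
        plus = w[thresholds <= x].sum()              # stumps voting +1 at x
        minus = w[thresholds > x].sum()              # stumps voting -1 at x
        if plus == 0.0 or minus == 0.0:
            return {-1, +1}                          # no vote on one side: abstain
        l = np.log(plus / minus) / eta               # empirical log ratio
        if l > delta:
            return {+1}
        if l < -delta:
            return {-1}
        return {-1, +1}

    return predict

# Illustrative use: four labeled examples, many unlabeled feature values.
rng = np.random.default_rng(2)
f = pseudo_bayes_threshold(np.array([0.2, 0.4, 2.5, 3.0]),
                           np.array([-1, -1, +1, +1]),
                           rng.normal(1.5, 1.0, size=2000))
print(f(0.0), f(1.5), f(3.5))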
What it will do
[Figure: the resulting prediction (+1 / 0 / -1) as a function of feature value, shown together with the prior weights, the error factor, and the negative examples]
32
Relation to large margins
SVM and AdaBoost search for a
linear discriminator with a large margin.
[Figure: a large-margin separator and the corresponding neighborhood of good classifiers]
33
Relation to Bagging
• Bagging:
  – Generate classifiers from random subsets of the training set.
  – Predict according to the majority vote among the classifiers.
  (Another possibility: flip the labels of a small random subset
  of the training set.)
• Bagging can be seen as a randomized estimate of the
log ratio (see the sketch below).
34
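A minimal sketch of that last bullet, under my own illustrative assumptions (each bagged classifier is a decision stump refit on a bootstrap sample); the smoothed log of the vote ratio plays the role of a randomized log-ratio estimate:

import numpy as np

def fit_stump(x, y):
    """Return (t, s): the stump x -> s*sign(x - t) with lowest training error."""
    best = (0.0, 1, np.inf)
    for t in np.unique(x):
        for s in (+1, -1):
            err = np.mean(s * np.sign(x - t + 1e-12) != y)
            if err < best[2]:
                best = (t, s, err)
    return best[0], best[1]

def bagged_log_ratio(x_train, y_train, x_new, n_bags=200, seed=0):
    """Randomized log-ratio estimate: log(#votes for +1 / #votes for -1) at x_new."""
    rng = np.random.default_rng(seed)
    votes_plus = 0
    for _ in range(n_bags):
        idx = rng.integers(0, len(x_train), size=len(x_train))   # bootstrap sample
        t, s = fit_stump(x_train[idx], y_train[idx])
        votes_plus += (s * np.sign(x_new - t + 1e-12)) > 0
    votes_minus = n_bags - votes_plus
    return np.log((votes_plus + 1) / (votes_minus + 1))           # smoothed ratio

x = np.array([-2.0, -1.2, -0.5, 0.4, 1.1, 2.3])
y = np.array([-1, -1, -1, +1, +1, +1])
for q in (-1.5, 0.0, 1.5):
    print(q, round(float(bagged_log_ratio(x, y, q)), 2))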
Bias/Variance for classification
• Bias: the error of predicting with the sign of the
True Log Ratio (i.e., with an infinite training set).
• Variance: the additional error from predicting
with the sign of the Empirical Log Ratio,
which is based on a finite training sample.
35
New directions
How a measure of confidence can
help in practice
36
Face Detection
• Paul Viola and Mike Jones developed a face detector that can work
in real time (15 frames per second).
37
Using confidence to save time
The detector combines 6000 simple features using AdaBoost.
In most boxes, only 8-9 features are calculated.
[Figure: cascade — all boxes are checked against Feature 1, Feature 2, …; boxes that might be a face move on, boxes that are definitely not a face are rejected early]
38
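A minimal sketch of this early-rejection idea (my own simplification, not the actual Viola-Jones detector): features are evaluated in order, and a box is dropped as soon as its running score falls below a per-stage rejection threshold, so most boxes are settled after only a few features:

from typing import Callable, Sequence

def cascade_keeps(box: float,
                  features: Sequence[Callable[[float], int]],   # weak scorers, cheapest first
                  alphas: Sequence[float],                      # per-feature weights
                  reject_thresholds: Sequence[float]) -> bool:
    """Return True if the box survives every stage ("might be a face")."""
    score = 0.0
    for f, a, r in zip(features, alphas, reject_thresholds):
        score += a * f(box)                 # f(box) is a weak vote in {-1, +1}
        if score < r:                       # confidently not a face: stop early
            return False
    return True                             # all features evaluated: candidate face

# Illustrative use, with a "box" reduced to a single brightness value.
feats = [lambda b: +1 if b > 0.3 else -1,
         lambda b: +1 if b > 0.6 else -1]
print(cascade_keeps(0.1, feats, alphas=[1.0, 1.0], reject_thresholds=[-0.5, 0.0]))
print(cascade_keeps(0.9, feats, alphas=[1.0, 1.0], reject_thresholds=[-0.5, 0.0]))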
Selective sampling
[Figure: selective-sampling loop — the partially trained classifier selects a sample of unconfident examples from the unlabeled data; these are labeled and added to the labeled examples used for further training]
39
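A minimal sketch of this loop, with deliberately toy ingredients of my own (a threshold-stump learner and, for brevity, signed distance to the learned threshold as the confidence instead of the empirical log ratio); only the unconfident examples get sent for labeling:

import numpy as np

def fit(X, y):
    """Toy learner: best threshold stump (t, s) with s*sign(x - t) minimizing training error."""
    X, y = np.asarray(X, float), np.asarray(y)
    best = (0.0, 1, np.inf)
    for t in np.unique(X):
        for s in (+1, -1):
            err = np.mean(s * np.sign(X - t + 1e-12) != y)
            if err < best[2]:
                best = (t, s, err)
    return best[:2]

def confidence(model, x):
    """Toy confidence: signed distance to the learned threshold."""
    t, s = model
    return s * (x - t)

def selective_sampling(unlabeled, label_oracle, delta=0.3, rounds=5, batch=5):
    """Query labels only where the current model is unconfident (|confidence| <= delta)."""
    pool, X_lab, y_lab, model = list(unlabeled), [], [], None
    for _ in range(rounds):
        if model is None:
            unsure = pool[:batch]            # bootstrap round: query an arbitrary batch
        else:
            unsure = [x for x in pool if abs(confidence(model, x)) <= delta][:batch]
        if not unsure:
            break
        for x in unsure:
            pool.remove(x)
            X_lab.append(x)
            y_lab.append(label_oracle(x))
        model = fit(X_lab, y_lab)
    return model, len(X_lab)

rng = np.random.default_rng(3)
U = rng.uniform(-3, 3, size=200)
oracle = lambda x: +1 if x > 0.7 else -1     # hidden target threshold
print(selective_sampling(U, oracle))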
Co-training
[Figure: co-training — a partially trained color-based classifier and a partially trained shape-based classifier exchange their confident predictions on images that might contain faces]
40
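A minimal sketch of that exchange, using the same toy fit/confidence interface as the selective-sampling sketch above; the two "views" stand in for the color-based and shape-based classifiers, and every name here is illustrative:

def cotrain(view_a, view_b, labeled_idx, labels, fit, confidence, tau=1.0, rounds=5):
    """Each round, each view's model labels the examples it is confident about
    (|confidence| > tau); those labels are handed to the other view's model."""
    lab_a = dict(zip(labeled_idx, labels))      # index -> label, seen by model A
    lab_b = dict(lab_a)                         # index -> label, seen by model B
    model_a = model_b = None
    for _ in range(rounds):
        model_a = fit([view_a[i] for i in lab_a], [lab_a[i] for i in lab_a])
        model_b = fit([view_b[i] for i in lab_b], [lab_b[i] for i in lab_b])
        for i in range(len(view_a)):
            ca = confidence(model_a, view_a[i])
            cb = confidence(model_b, view_b[i])
            if i not in lab_b and abs(ca) > tau:    # A is confident: teach B
                lab_b[i] = +1 if ca > 0 else -1
            if i not in lab_a and abs(cb) > tau:    # B is confident: teach A
                lab_a[i] = +1 if cb > 0 else -1
    return model_a, model_b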
Summary
• Bayesian averaging is justifiable even without
Bayesian assumptions.
• Infinite concept classes: use -covers
• Efficient implementations: Thresholds, SVM,
boosting, bagging… still largely open.
• Calibration (Recent work of Vovk)
• A good measure of confidence is very important
in practice.
• >2 classes (predicting with a subset)
41