How to be a Bayesian without believing
Yoav Freund
Joint work with Rob Schapire and Yishay Mansour

Motivation
• Statistician: “Are you a Bayesian or a Frequentist?”
• Yoav: “I don’t know, you tell me…”
• I need a better answer….

Toy example
• The computer receives a telephone call
• Measures the pitch of the voice
• Decides the gender of the caller
[Figure: a human voice enters the system, which outputs “male” or “female”]

Generative modeling
[Figure: two class-conditional distributions of voice pitch, with parameters mean1, var1 and mean2, var2; probability vs. voice pitch]

Discriminative approach
[Figure: number of mistakes as a function of the decision threshold on voice pitch]

Discriminative Bayesian approach
Conditional probability: P(g = male | x) = 1 / (1 + e^{-a x})
Prior: P_0(a) = (1/Z) e^{-a^2/2}
The posterior over a combines the prior with the likelihood of the training data.
[Figure: prior, posterior, and conditional probability plotted against voice pitch]

Suggested approach
[Figure: number of mistakes vs. voice pitch, with three regions: “definitely female”, “unsure”, “definitely male”]

Formal frameworks
For stating theorems regarding the dependence of the generalization error on the size of the training set.

The PAC set-up
1. Learner chooses a classifier set C, with each c ∈ C a map c: X → {-1,+1}, and requests m training examples
2. Nature chooses a target classifier c ∈ C and a distribution P over X
3. Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym)
4. Learner generates h: X → {-1,+1}
Goal: P(h(x) ≠ c(x)) < ε, for every choice of c and P

The agnostic set-up
Vapnik’s pattern-recognition problem
1. Learner chooses a classifier set C, with each c ∈ C a map c: X → {-1,+1}, and requests m training examples
2. Nature chooses a distribution D over X × {-1,+1}
3. Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) according to D
4. Learner generates h: X → {-1,+1}
Goal: P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε, where c* = argmin_{c ∈ C} P_D(c(x) ≠ y)

Self-bounding learning (Freund 97)
1. Learner selects a concept class C
2. Nature generates a training set T = (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1}
3. Learner generates h: X → {-1,+1} and a bound ε_T such that, with high probability over the random choice of the training set T,
P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε_T

Learning a region predictor (Vovk 2000)
1. Learner selects a concept class C
2. Nature generates a training set (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1}
3. Learner generates h: X → { {-1}, {+1}, {-1,+1}, {} } such that, with high probability,
P_D(y ∉ h(x)) < P_D(c*(x) ≠ y) + ε_1 and P_D(h(x) = {-1,+1}) < ε_2

Intuitions
The rough idea

A motivating example
[Figure: a cloud of + and − training points with three query points marked “?”: one deep inside the + region, one deep inside the − region, and one near the boundary between them]

Distribution of errors
[Figure: empirical error vs. true error (both in [0, 1/2]) for the concepts in the class, in the worst case and in a typical case]
• Contenders for best → predict with their majority vote
• Non-contenders → ignore!

Main result
Finite concept class

Notation
Data distribution: (x,y) ~ D, with y ∈ {-1,+1}
Generalization error of a classifier h: ε(h) := P_{(x,y)~D}[h(x) ≠ y]
Training set: T = (x1,y1), (x2,y2), …, (xm,ym); T ~ D^m
Training error: ε̂(h) := (1/m) Σ_{(x,y)∈T} 1[h(x) ≠ y] = P_{(x,y)~T}[h(x) ≠ y]

The algorithm
Parameters: η > 0, Δ > 0
Hypothesis weight: w(h) := e^{-η ε̂(h)}
Empirical Log Ratio: l̂(x) := (1/η) ln( Σ_{h∈H: h(x)=+1} w(h) / Σ_{h∈H: h(x)=-1} w(h) )
Prediction rule (a code sketch follows below):
p̂_{η,Δ}(x) = +1 if l̂(x) > Δ;  {-1,+1} (unsure) if -Δ ≤ l̂(x) ≤ Δ;  -1 if l̂(x) < -Δ

Suggested tuning
Set η and Δ as functions of m, |H| and the confidence parameter δ: η grows like sqrt(m ln(8|H|/δ)) and Δ shrinks like sqrt(ln(8|H|/δ)/m), up to logarithmic factors in m. With h* := argmin_{h∈H} ε(h), this yields, with high probability:
1) P(mistake) = P_{(x,y)~D}[y ∉ p̂_{η,Δ}(x)] ≤ 2 ε(h*) + O( sqrt( (ln|H| + ln(1/δ)) / m ) )
2) P(abstain) = P_{(x,y)~D}[p̂_{η,Δ}(x) = {-1,+1}] ≤ 5 ε(h*) + O( sqrt( (ln|H| + ln(1/δ)) / m ) )

Main properties
1. The Empirical Log Ratio (ELR) is very stable: the probability of large deviations is independent of the size of the concept class.
2. The expected value of the ELR is close to the True Log Ratio (TLR), computed using the true hypothesis errors instead of their estimates.
3. The TLR is a good proxy for the best concept in the class.
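To make the averaging rule concrete, here is a minimal Python sketch of the weighting and prediction rule from “The algorithm” slide above. The function name, the array-based interface and the use of log-space arithmetic are my own choices, not part of the talk; η and Δ would be chosen as on the “Suggested tuning” slide.

```python
import numpy as np

def elr_predict(hyps, X_train, y_train, X_test, eta, delta):
    """Pseudo-Bayes averaging with an abstention region (a sketch, not the
    authors' code).  hyps is a finite list of classifiers h(x) -> {-1, +1},
    eta > 0 is the weighting parameter, delta > 0 the abstention half-width."""
    # Empirical error of every hypothesis on the labeled training set.
    eps_hat = np.array([np.mean([h(x) != y for x, y in zip(X_train, y_train)])
                        for h in hyps])
    # log w(h) = -eta * eps_hat(h); stay in log space for numerical stability.
    log_w = -eta * eps_hat

    predictions = []
    for x in X_test:
        votes = np.array([h(x) for h in hyps])
        # Log of the total weight of the hypotheses voting +1 / voting -1
        # (initial=-inf makes an empty vote set contribute zero weight).
        log_plus = np.logaddexp.reduce(log_w[votes == +1], initial=-np.inf)
        log_minus = np.logaddexp.reduce(log_w[votes == -1], initial=-np.inf)
        # Empirical log ratio  l(x) = (1/eta) * ln(W_plus / W_minus).
        l_hat = (log_plus - log_minus) / eta
        if l_hat > delta:
            predictions.append(+1)          # definitely +1
        elif l_hat < -delta:
            predictions.append(-1)          # definitely -1
        else:
            predictions.append({-1, +1})    # unsure: abstain from a single label
    return predictions
```

For the toy example, `hyps` could be a small family of pitch thresholds, e.g. `[lambda x, t=t: 1 if x > t else -1 for t in (100, 150, 200)]`; the rule then outputs +1, −1, or “unsure”, depending on how much weight the low-error thresholds place on each side.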
McDiarmid’s theorem
Let f: X^m → R. If for all x1, …, xm ∈ X and all x'_i ∈ X
| f(x1, …, x_i, …, xm) − f(x1, …, x_{i−1}, x'_i, x_{i+1}, …, xm) | ≤ c_i,
and X1, …, Xm are independent random variables, then
P[ | f(X1, …, Xm) − E[f(X1, …, Xm)] | ≥ ε ] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

The empirical log ratio is stable
For K ⊆ H define R̂(K) := (1/η) ln Σ_{h∈K} e^{-η ε̂(h)}, so that
l̂(x) = R̂({h | h(x)=+1}) − R̂({h | h(x)=-1}).
Let ε̂'(h) be the training error with one example changed, so |ε̂'(h) − ε̂(h)| ≤ 1/m, and let
R̂'(K) := (1/η) ln Σ_{h∈K} e^{-η ε̂'(h)}.

Bounded variation proof
R̂'(K) − R̂(K) = (1/η) ln( Σ_{h∈K} e^{-η ε̂'(h)} / Σ_{h∈K} e^{-η ε̂(h)} )
≤ (1/η) ln max_{h∈K} ( e^{-η ε̂'(h)} / e^{-η ε̂(h)} )
= max_{h∈K} ( ε̂(h) − ε̂'(h) ) ≤ 1/m.
Changing one training example therefore changes l̂(x) by at most 2/m, so McDiarmid’s theorem gives a deviation bound that does not depend on |H|.

Infinite concept classes
Geometry of the concept class

Infinite concept classes
• The stated bounds are vacuous.
• How can we approximate an infinite class with a finite class?
• Unlabeled examples give useful information.

A metric space of classifiers
d(f,g) := P( f(x) ≠ g(x) )
Neighboring models make similar predictions.
[Figure: the classifier space and the example space; the distance between classifiers f and g is the probability of the examples on which they disagree]

ε-covers
[Figure: ε-covers of the classifier class within the classifier space at ε = 1/10 and ε = 1/20; the number of neighbors increases like 1/ε and like 1/ε²]

Computational issues
• How can we compute the ε-cover?
• We can use unlabeled examples to generate the cover.
• Estimate the prediction by ignoring concepts with high error.

Application: comparing “perfect” features
• 45,000 features
• Training examples: 10² negative, 2-10 positive, 10⁴ unlabeled
• More than one feature has zero training error.
• Which feature(s) should we use?
• How should we combine them?

A typical “perfect” feature
[Figure: histograms of the number of images vs. feature value for the unlabeled, negative and positive examples]

Pseudo-Bayes for a single threshold
• The set of possible thresholds is uncountably infinite.
• Use an ε-cover over the thresholds.
• This is equivalent to using the distribution of unlabeled examples as the prior distribution over the set of thresholds (see the code sketch after the Summary).

What it will do
[Figure: the resulting prediction (+1 / 0 / −1) as a function of the feature value, shown together with the prior weights, the error factor and the negative examples]

Relation to large margins
SVM and AdaBoost search for a linear discriminator with a large margin.
[Figure: a large margin corresponds to a large neighborhood of good classifiers]

Relation to bagging
• Bagging: generate classifiers from random subsets of the training set, and predict according to the majority vote among the classifiers. (Another possibility: flip the labels of a small random subset of the training set.)
• Bagging can be seen as a randomized estimate of the log ratio.

Bias/variance for classification
• Bias: the error of predicting with the sign of the True Log Ratio (infinite training set).
• Variance: the additional error from predicting with the sign of the Empirical Log Ratio, which is based on a finite training sample.

New directions
How a measure of confidence can help in practice

Face detection
• Paul Viola and Mike Jones developed a face detector that works in real time (15 frames per second).
[Embedded demo clip]

Using confidence to save time
• The detector combines 6,000 simple features using AdaBoost.
• In most boxes, only 8-9 features are calculated.
[Figure: a cascade over all boxes: feature 1 passes the boxes that might be a face on to feature 2, and so on; boxes that are definitely not a face are discarded early]

Selective sampling
[Figure: unlabeled data is fed to a partially trained classifier, which selects a sample of unconfident examples; these are labeled and added to the training set]

Co-training
[Figure: a partially trained color-based classifier and a partially trained shape-based classifier exchange their confident predictions on images that might contain faces]

Summary
• Bayesian averaging is justifiable even without Bayesian assumptions.
• Infinite concept classes: use ε-covers.
• Efficient implementations (thresholds, SVM, boosting, bagging, …) are still largely open.
• Calibration (recent work of Vovk).
• A good measure of confidence is very important in practice.
• More than two classes (predicting with a subset of the labels).
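The “Pseudo-Bayes for a single threshold” slides describe using the unlabeled feature values themselves as the cover (equivalently, the prior) over thresholds. Below is a minimal sketch of that construction under my own naming and interface choices: one candidate threshold per distinct unlabeled value, both orientations of each threshold included, and the same empirical-log-ratio rule as before applied to the resulting finite class.

```python
import numpy as np

def threshold_pseudo_bayes(f_unlabeled, f_train, y_train, f_test, eta, delta):
    """Pseudo-Bayes over single-threshold classifiers on one feature.
    The cover / prior over thresholds is induced by the unlabeled feature
    values (a sketch of the construction, not the authors' code)."""
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    f_train = np.asarray(f_train, dtype=float)
    y_train = np.asarray(y_train)            # labels in {-1, +1}
    thresholds = np.unique(f_unlabeled)      # one candidate threshold per unlabeled value

    # Hypothesis class: for each threshold t, both orientations of sign(x - t).
    base = np.where(f_train[None, :] > thresholds[:, None], 1, -1)
    votes_train = np.vstack([base, -base])                       # shape (2*T, m)
    eps_hat = np.mean(votes_train != y_train[None, :], axis=1)   # training error of each hypothesis
    log_w = -eta * eps_hat                                       # log w(h) = -eta * eps_hat(h)

    predictions = []
    for x in np.asarray(f_test, dtype=float):
        v = np.where(x > thresholds, 1, -1)
        v = np.concatenate([v, -v])
        log_plus = np.logaddexp.reduce(log_w[v == +1], initial=-np.inf)
        log_minus = np.logaddexp.reduce(log_w[v == -1], initial=-np.inf)
        l_hat = (log_plus - log_minus) / eta                     # empirical log ratio
        if l_hat > delta:
            predictions.append(+1)
        elif l_hat < -delta:
            predictions.append(-1)
        else:
            predictions.append({-1, +1})                         # unsure
    return predictions
```

Because every unlabeled example contributes a candidate threshold, regions where the unlabeled data is dense are covered finely and carry more total weight, which is the sense in which the unlabeled distribution plays the role of the prior.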