Classification with generative models

CSE 250B

Classification with parametrized models
Classifiers with a fixed number of parameters can represent a limited set of functions. Learning a model is about picking a good approximation.

Classification with generative models
Typically the x's are points in p-dimensional Euclidean space, R^p.

Two ways to classify:
• Generative: model the individual classes.
• Discriminative: model the decision boundary between the classes.
Quick review of conditional probability
Formula for conditional probability: for any events A, B,

    Pr(A|B) = Pr(A ∩ B) / Pr(B).

Applied twice, this yields Bayes' rule:

    Pr(H|E) = ( Pr(E|H) / Pr(E) ) · Pr(H).

Summation rule
Suppose events A1, . . . , Ak are disjoint events, one of which must occur. Then for any other event E,

    Pr(E) = Pr(E, A1) + Pr(E, A2) + · · · + Pr(E, Ak)
          = Pr(E|A1) Pr(A1) + Pr(E|A2) Pr(A2) + · · · + Pr(E|Ak) Pr(Ak).
Example: Sex bias in graduate admissions. In 1969, there were 12673 applicants for graduate study at Berkeley. 44% of the male applicants were accepted, and 35% of the female applicants.

Over the sample space of applicants, define:
    M = male
    F = female
    A = admitted

So: Pr(A|M) = 0.44 and Pr(A|F) = 0.35.

In every department, the accept rate for female applicants was at least as high as the accept rate for male applicants. How could this be?

Example: Toss ten coins. What is the probability that the first is heads, given that nine of them are heads? Define:
    H = first coin is heads
    E = nine of the ten coins are heads

    Pr(H|E) = ( Pr(E|H) / Pr(E) ) · Pr(H)
            = [ (9 choose 8) (1/2)^9 ] / [ (10 choose 9) (1/2)^10 ] · (1/2)
            = 9/10.
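To sanity-check the coin calculation above (Pr(H|E) = 9/10), here is a quick Monte Carlo simulation in Python; the sample size and random seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    tosses = rng.integers(0, 2, size=(1_000_000, 10))   # 1 = heads, 0 = tails

    nine_heads = tosses.sum(axis=1) == 9                 # condition on event E
    first_heads = tosses[:, 0] == 1                      # event H

    # Empirical estimate of Pr(H | E); should be close to 9/10.
    print(first_heads[nine_heads].mean())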
Generative models
An unknown underlying distribution D over X × Y. Generating a point (x, y) in two steps:
1 Last week: first choose x, then choose y given x.
2 Now: first choose y, then choose x given y.

Example: X = R, Y = {1, 2, 3}, with class weights π1 = 10%, π2 = 50%, π3 = 40% and class-conditional densities P1(x), P2(x), P3(x).

The overall density is a mixture of the individual densities,

    Pr(x) = π1 P1(x) + · · · + πk Pk(x).

The Bayes-optimal prediction
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).

For any x ∈ X and any label j,

    Pr(y = j|x) = Pr(y = j) Pr(x|y = j) / Pr(x) = πj Pj(x) / ( π1 P1(x) + · · · + πk Pk(x) ).

Bayes-optimal prediction: h∗(x) = arg maxj πj Pj(x).
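To make the posterior formula concrete, here is a minimal Python sketch for the 1-d example above (Y = {1, 2, 3} with weights 10%, 50%, 40%). Only the class weights come from the example; the Gaussian class-conditional densities and their parameters are invented purely for illustration.

    import numpy as np

    pi = np.array([0.1, 0.5, 0.4])     # class weights from the example
    # Hypothetical class-conditional densities P1, P2, P3 (parameters invented for illustration):
    means = np.array([-2.0, 0.0, 3.0])
    sds = np.array([1.0, 1.0, 1.5])

    def P(x):
        """Vector of class-conditional densities (P1(x), P2(x), P3(x))."""
        return np.exp(-(x - means) ** 2 / (2 * sds ** 2)) / np.sqrt(2 * np.pi * sds ** 2)

    def posterior(x):
        """Pr(y = j | x) = pi_j P_j(x) / (pi_1 P_1(x) + ... + pi_k P_k(x))."""
        joint = pi * P(x)
        return joint / joint.sum()

    def h_star(x):
        """Bayes-optimal prediction: argmax_j pi_j P_j(x); labels 1, 2, 3."""
        return int(np.argmax(pi * P(x))) + 1

    print(posterior(1.0), h_star(1.0))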
Estimating class-conditional distributions
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).
Estimating the πj is easy. Estimating the Pj is hard.

Estimating an arbitrary distribution in R^p:
• Can be done, e.g. with kernel density estimation.
• But number of samples needed is exponential in p.

Instead: approximate each Pj with a simple, parametric distribution. Some options:
• Product distributions. Assume coordinates are independent: naive Bayes.
• Multivariate Gaussians. Linear and quadratic discriminant analysis.
• More general graphical models.

Naive Bayes
Binarized MNIST:
• k = 10 classes
• X = {0, 1}^784

Assume that within each class, the individual pixel values are independent:

    Pj(x) = Pj1(x1) · Pj2(x2) · · · Pj,784(x784).

Each Pji is a coin flip: trivial to estimate!
Smoothed estimate of coin bias
Pick a class j and a pixel i. We need to estimate

    pji = Pr(xi = 1|y = j).

Out of a training set of size n, let

    nj  = # of instances of class j
    nji = # of instances of class j with xi = 1

Then the maximum-likelihood estimate of pji is p̂ji = nji / nj.

This causes problems if nji = 0. Instead, use "Laplace smoothing":

    p̂ji = (nji + 1) / (nj + 2).

Form of the classifier
Data space X = {0, 1}^p, label space Y = {1, . . . , k}. Estimate:
• {πj : 1 ≤ j ≤ k}
• {pji : 1 ≤ j ≤ k, 1 ≤ i ≤ p}

Then classify point x as

    arg maxj  πj · ∏i=1…p  pji^xi (1 − pji)^(1−xi).

To avoid underflow, take the log:

    arg maxj  log πj + Σi=1…p ( xi log pji + (1 − xi) log(1 − pji) )

The expression being maximized is of the form w · x + b: a linear classifier!
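A minimal sketch of this naive Bayes classifier in Python/NumPy. It assumes the training set arrives as a binary matrix X (one row per point) with integer labels y in {0, . . . , k − 1}; the variable and function names are mine, not from the slides.

    import numpy as np

    def train_naive_bayes(X, y, k):
        """X: (n, p) binary matrix, y: (n,) labels in {0, ..., k-1}."""
        pi = np.array([(y == j).mean() for j in range(k)])             # class weights pi_j
        # Laplace-smoothed coin biases p_ji = (n_ji + 1) / (n_j + 2)
        pji = np.array([(X[y == j].sum(axis=0) + 1) / ((y == j).sum() + 2)
                        for j in range(k)])                            # shape (k, p)
        return pi, pji

    def predict(X, pi, pji):
        # Log-score: log pi_j + sum_i [ x_i log p_ji + (1 - x_i) log(1 - p_ji) ]
        scores = (np.log(pi)[None, :]
                  + X @ np.log(pji).T
                  + (1 - X) @ np.log(1 - pji).T)                       # shape (n, k)
        return scores.argmax(axis=1)

    # Tiny synthetic check (2 classes, 3 binary features):
    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
    y = np.array([0, 0, 1, 1])
    pi, pji = train_naive_bayes(X, y, k=2)
    print(predict(X, pi, pji))   # recovers [0, 0, 1, 1] on this toy data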
Example: MNIST
Result of training: mean vectors for each class.
Test error rate: 15.54%.
(Figure: confusion matrix.)

Other types of data
How would you handle data:
• Whose features take on more than two discrete values (such as ten possible colors)?
• Whose features are real-valued?
• Whose features are positive integers?
• Whose features are mixed: some real, some Boolean, etc.?

How would you handle "missing data": situations in which data points occasionally (or regularly) have missing entries?
• At train time: ???
• At test time: ???
Multinomial naive Bayes
Bag-of-words: vectorial representation of text documents.

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

    despair      1
    evil         1
    happiness    0
    foolishness  1

• Fix V = some vocabulary.
• Treat the words in a document as independent draws from a multinomial distribution over V:

      p = (p1, . . . , p|V|), such that pi ≥ 0 and Σi pi = 1.

Multinomial naive Bayes: each class is modeled by a multinomial.

Improving performance of multinomial naive Bayes
A variety of heuristics that are standard in text retrieval, such as:
1 Compensating for burstiness.
  Problem: Once a word has appeared in a document, it has a much higher chance of appearing again.
  Solution: Instead of the number of occurrences f of a word, use log(1 + f).
2 Downweighting common words.
  Problem: Common words can have an unduly large influence on classification.
  Solution: Weight each word w by inverse document frequency:

      log( # docs / #(docs containing w) )
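A minimal Python sketch of these two heuristics on a toy corpus; the documents, vocabulary, and names are invented for illustration.

    import numpy as np

    docs = [["the", "best", "of", "times"],
            ["the", "worst", "of", "times"],
            ["the", "age", "of", "wisdom"]]          # toy tokenized corpus
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}

    # Raw counts f, then the burstiness correction log(1 + f)
    counts = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            counts[r, index[w]] += 1
    features = np.log1p(counts)

    # Inverse document frequency: log(#docs / #(docs containing w))
    df = (counts > 0).sum(axis=0)
    idf = np.log(len(docs) / df)
    features = features * idf    # words in every doc, like "the" and "of", get weight log(1) = 0

    print(dict(zip(vocab, np.round(idf, 2))))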
Recall: generative model framework
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).
(Figure: the earlier mixture example with weights π1 = 10%, π2 = 50%, π3 = 40% and densities P1(x), P2(x), P3(x).)

Approximate each Pj with a simple, parametric distribution:
• Product distributions. Assume coordinates are independent: naive Bayes.
• Multivariate Gaussians. Linear and quadratic discriminant analysis.
• More general graphical models.

The univariate Gaussian
The Gaussian N(µ, σ^2) has mean µ, variance σ^2, and density function

    p(x) = ( 1 / (2πσ^2)^(1/2) ) exp( −(x − µ)^2 / (2σ^2) ).
The multivariate Gaussian
N(µ, Σ): Gaussian in R^p
• mean: µ ∈ R^p
• covariance: p × p matrix Σ

Density:

    p(x) = ( 1 / ( (2π)^(p/2) |Σ|^(1/2) ) ) exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) )

Let X = (X1, X2, . . . , Xp) be a random draw from N(µ, Σ).
• EX = µ. That is, EXi = µi for all 1 ≤ i ≤ p.
• E(X − µ)(X − µ)^T = Σ. That is, for all 1 ≤ i, j ≤ p,
      cov(Xi, Xj) = E(Xi − µi)(Xj − µj) = Σij.
  In particular, var(Xi) = E(Xi − µi)^2 = Σii.

Special case: spherical Gaussian
The Xi are independent and all have the same variance σ^2. Thus

    Σ = σ^2 Ip

Simplified density:

    p(x) = ( 1 / ( (2π)^(p/2) σ^p ) ) exp( −‖x − µ‖^2 / (2σ^2) )

Density at a point depends only on its distance from µ. Each Xi is an independent univariate Gaussian N(µi, σ^2).
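A minimal Python/NumPy sketch of the density formula above, written as a log-density (the form used later for classification); the helper name and test values are mine.

    import numpy as np

    def gaussian_logpdf(x, mu, Sigma):
        """log p(x) for N(mu, Sigma) in R^p, following the density formula above."""
        p = len(mu)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, numerically stable
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

    # Spherical special case: Sigma = sigma^2 I, so the density depends only on ||x - mu||.
    mu = np.zeros(3)
    print(gaussian_logpdf(np.array([1.0, 0.0, 0.0]), mu, 4.0 * np.eye(3)),
          gaussian_logpdf(np.array([0.0, 1.0, 0.0]), mu, 4.0 * np.eye(3)))   # equal values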
Special case: diagonal Gaussian
The Xi are independent, with variances σi^2. Thus

    Σ = diag(σ1^2, . . . , σp^2)

Simplified density:

    p(x) = ( 1 / ( (2π)^(p/2) σ1 · · · σp ) ) exp( − Σi=1…p (xi − µi)^2 / (2σi^2) )

Contours of equal density are axis-aligned ellipsoids centered at µ. Each Xi is an independent univariate Gaussian N(µi, σi^2).

The general Gaussian
N(µ, Σ):
(Figure: contours of equal density are ellipsoids centered at µ, with axes along the eigenvectors u1, u2.)

Eigendecomposition of Σ:
• Eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp
• Corresponding eigenvectors u1, . . . , up

N(µ, Σ) is simply a rotated version of N(µ, diag(λ1, . . . , λp)).
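The "rotated diagonal Gaussian" view also gives a way to sample from N(µ, Σ): draw from N(0, diag(λ1, . . . , λp)) and rotate by the eigenvector matrix. A minimal Python sketch, with µ and Σ chosen arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[3.0, 1.0],
                      [1.0, 2.0]])                     # illustrative covariance

    lam, U = np.linalg.eigh(Sigma)                     # Sigma = U diag(lam) U^T
    z = rng.normal(size=(100_000, 2)) * np.sqrt(lam)   # samples from N(0, diag(lam))
    x = mu + z @ U.T                                   # rotate: now distributed as N(mu, Sigma)

    print(np.cov(x, rowvar=False))                     # close to Sigma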
Mathematical interlude: Linear algebra
(Blackboard)

Binary classification with Gaussian generative model
Estimate class probabilities π1, π2 and fit a Gaussian to each class:

    P1 = N(µ1, Σ1),   P2 = N(µ2, Σ2)

E.g. if data points x(1), . . . , x(m) ∈ R^p are class 1:

    µ1 = (1/m) ( x(1) + · · · + x(m) )   and   Σ1 = (1/m) Σi=1…m ( x(i) − µ1 )( x(i) − µ1 )^T

Given a new point x, predict class 1 iff:

    π1 P1(x) > π2 P2(x)  ⇔  x^T M x + w^T x ≥ θ,

where

    M = (1/2) ( Σ2^(−1) − Σ1^(−1) ),   w = Σ1^(−1) µ1 − Σ2^(−1) µ2,

and θ is a constant depending on the various parameters.

Σ1 = Σ2: linear decision boundary. Otherwise, quadratic boundary.
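A minimal Python sketch of this two-class rule. The toy data and names are mine, and θ is written out explicitly (the slide leaves it as "a constant depending on the various parameters"); the explicit form follows from taking logs of π1 P1(x) ≥ π2 P2(x).

    import numpy as np

    def fit_gaussian(X):
        mu = X.mean(axis=0)
        d = X - mu
        return mu, d.T @ d / len(X)

    def quadratic_rule(x, pi1, mu1, S1, pi2, mu2, S2):
        """Predict class 1 iff x^T M x + w^T x >= theta, with M and w as above."""
        S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
        M = 0.5 * (S2inv - S1inv)
        w = S1inv @ mu1 - S2inv @ mu2
        theta = (np.log(pi2 / pi1)
                 + 0.5 * (np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
                 + 0.5 * (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2))
        return 1 if x @ M @ x + w @ x >= theta else 2

    # Toy data (illustrative): class 1 tight around the origin, class 2 spread out.
    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], 1.0, size=(300, 2))
    X2 = rng.normal([4, 4], 2.0, size=(300, 2))
    (mu1, S1), (mu2, S2) = fit_gaussian(X1), fit_gaussian(X2)
    print(quadratic_rule(np.array([0.5, 0.5]), 0.5, mu1, S1, 0.5, mu2, S2))   # expect 1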
Common covariance: Σ1 = Σ2 = Σ
Linear decision boundary: choose class 1 iff

    x · Σ^(−1)(µ1 − µ2) ≥ θ.

Call this vector w = Σ^(−1)(µ1 − µ2).

Example 1: Spherical Gaussians with Σ = Ip and π1 = π2.
(Figure: the decision boundary is the bisector of the line joining the means, perpendicular to w.)

Example 2: Again spherical, but now π1 > π2.
(Figure: the boundary shifts toward the mean of the less likely class.)

One-d projection onto w:
(Figure: the two projected class distributions, centered at the projections of µ1 and µ2.)
Rule: w · x ≥ θ, with w = Σ^(−1)(µ1 − µ2).
• w, θ dictated by probability model, assuming it is a perfect fit
• Common practice: choose w as above, but fit θ to minimize training/validation error (see the sketch after the examples below)

Different covariances: Σ1 ≠ Σ2
Quadratic boundary: choose class 1 iff x^T M x + w^T x ≥ θ, where:

    M = (1/2) ( Σ2^(−1) − Σ1^(−1) ),   w = Σ1^(−1) µ1 − Σ2^(−1) µ2.

Example 1: Σ1 = σ1^2 Ip and Σ2 = σ2^2 Ip with σ1 > σ2.
Example 2: Same thing in 1-d, X = R.
Example 3: Non-spherical; a parabolic boundary.
Many other possibilities!
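Following the "common practice" bullet above, here is a minimal Python sketch that keeps w = Σ^(−1)(µ1 − µ2) from the fitted model but chooses θ by minimizing training error; the data, names, and threshold search are illustrative assumptions, not from the slides.

    import numpy as np

    def fit_direction(X1, X2):
        """w = Sigma^{-1} (mu1 - mu2), with Sigma the weighted average covariance."""
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        n1, n2 = len(X1), len(X2)
        S1 = np.cov(X1, rowvar=False, bias=True)
        S2 = np.cov(X2, rowvar=False, bias=True)
        Sigma = (n1 * S1 + n2 * S2) / (n1 + n2)
        return np.linalg.solve(Sigma, mu1 - mu2)

    def fit_threshold(w, X1, X2):
        """Choose theta minimizing training error of the rule: class 1 iff w . x >= theta."""
        scores = np.concatenate([X1 @ w, X2 @ w])
        labels = np.concatenate([np.ones(len(X1)), 2 * np.ones(len(X2))])
        best_theta, best_err = None, np.inf
        for theta in np.unique(scores):                 # candidate thresholds
            pred = np.where(scores >= theta, 1, 2)
            err = (pred != labels).mean()
            if err < best_err:
                best_theta, best_err = theta, err
        return best_theta, best_err

    rng = np.random.default_rng(1)
    X1 = rng.normal([0, 0], 1.0, size=(100, 2))
    X2 = rng.normal([3, 3], 1.0, size=(100, 2))
    w = fit_direction(X1, X2)
    print(fit_threshold(w, X1, X2))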
Multiclass discriminant analysis
k classes: weights πj, class-conditional distributions Pj = N(µj, Σj).

Each class has an associated quadratic function

    fj(x) = log( πj Pj(x) ).

To classify a point x, pick arg maxj fj(x).
If Σ1 = · · · = Σk, the boundaries are linear.

Example: "wine" data set
Data from three wineries from the same region of Italy.
• 13 attributes: hue, color intensity, flavanoids, ash content, ...
• 178 instances in all: split into 118 train, 60 test
Test error using multiclass discriminant analysis: 1/60.
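A minimal Python sketch of multiclass discriminant analysis as described above: fit (πj, µj, Σj) for each class and predict arg maxj fj(x). The toy data and names are invented for illustration.

    import numpy as np

    def fit(X, y, k):
        """Per-class weights pi_j, means mu_j, covariances Sigma_j."""
        params = []
        for j in range(k):
            Xj = X[y == j]
            mu = Xj.mean(axis=0)
            Sigma = (Xj - mu).T @ (Xj - mu) / len(Xj)
            params.append((len(Xj) / len(X), mu, Sigma))
        return params

    def f_j(x, pi, mu, Sigma):
        """f_j(x) = log( pi_j P_j(x) ) for P_j = N(mu_j, Sigma_j)."""
        p = len(mu)
        _, logdet = np.linalg.slogdet(Sigma)
        d = x - mu
        return np.log(pi) - 0.5 * (p * np.log(2 * np.pi) + logdet
                                   + d @ np.linalg.solve(Sigma, d))

    def predict(x, params):
        return int(np.argmax([f_j(x, *prm) for prm in params]))

    # Toy three-class data in R^2 (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.7, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 50)
    params = fit(X, y, k=3)
    print(predict(np.array([2.8, 0.2]), params))   # expect 1 (the class centered at [3, 0])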
Example: MNIST
To each digit, fit:
• class probability πj
• mean µj ∈ R^784
• covariance matrix Σj ∈ R^(784×784)

Problem: the formula for the normal density uses Σj^(−1), which is singular.
• Need to regularize: Σj → Σj + σ^2 I
• This is a good idea even without the singularity issue

Error rate with regularization: ??? (better than naive Bayes).

Fisher's linear discriminant
A framework for linear classification without Gaussian assumptions. Use only first- and second-order statistics of the classes:

    Class 1: mean µ1, cov Σ1, # pts n1
    Class 2: mean µ2, cov Σ2, # pts n2

A linear classifier projects all data onto a direction w. Choose w so that:
• Projected means are well-separated, i.e. (w · µ1 − w · µ2)^2 is large.
• Projected within-class variance is small.
Fisher LDA (linear discriminant analysis)
Two classes: means µ1, µ2; covariances Σ1, Σ2; sample sizes n1, n2.

Project data onto direction (unit vector) w.
• Projected means: w · µ1 and w · µ2
• Projected variances: w^T Σ1 w and w^T Σ2 w
• Average projected variance:

      ( n1 (w^T Σ1 w) + n2 (w^T Σ2 w) ) / (n1 + n2) = w^T Σ w,

  where Σ = (n1 Σ1 + n2 Σ2)/(n1 + n2).

Find w to maximize

    J(w) = (w · µ1 − w · µ2)^2 / (w^T Σ w)

Solution: w ∝ Σ^(−1)(µ1 − µ2). Look familiar?

Fisher LDA: proof
Goal: find w to maximize J(w) = (w · µ1 − w · µ2)^2 / (w^T Σ w).

1 Assume Σ1, Σ2 are full rank; else project.
2 Since Σ1 and Σ2 are p.d., so is their weighted average, Σ.
3 Write u = Σ^(1/2) w. Then

      max_w (w · µ1 − w · µ2)^2 / (w^T Σ w)
        = max_w (w^T (µ1 − µ2))^2 / (w^T Σ w)
        = max_u (u^T Σ^(−1/2) (µ1 − µ2))^2 / (u^T u)
        = max_{u: ‖u‖ = 1} ( u · (Σ^(−1/2) (µ1 − µ2)) )^2

4 Solution: u is the unit vector in direction Σ^(−1/2)(µ1 − µ2).
5 Therefore: w = Σ^(−1/2) u ∝ Σ^(−1)(µ1 − µ2).
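A minimal Python check of this result: compute w ∝ Σ^(−1)(µ1 − µ2) on toy data and confirm that J(w) is at least as large as for randomly chosen directions; the data and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], [1.0, 3.0], size=(200, 2))   # toy class 1
    X2 = rng.normal([2, 1], [1.0, 3.0], size=(200, 2))   # toy class 2

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, bias=True)
    S2 = np.cov(X2, rowvar=False, bias=True)
    Sigma = (n1 * S1 + n2 * S2) / (n1 + n2)              # weighted average covariance

    def J(w):
        return (w @ (mu1 - mu2)) ** 2 / (w @ Sigma @ w)

    w_fisher = np.linalg.solve(Sigma, mu1 - mu2)         # w proportional to Sigma^{-1}(mu1 - mu2)
    random_dirs = rng.normal(size=(1000, 2))
    print(J(w_fisher) >= max(J(w) for w in random_dirs)) # True: the Fisher direction maximizes J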