Classification with generative models

CSE 250B

Classification with parametrized models
Classifiers with a fixed number of parameters can represent a limited set of functions. Learning a model is about picking a good approximation.

Classification with generative models
Typically the x's are points in p-dimensional Euclidean space, R^p.

Two ways to classify:
• Generative: model the individual classes.
• Discriminative: model the decision boundary between the classes.
Quick review of conditional probability
Formula for conditional probability: for any events A, B,

    Pr(A|B) = Pr(A ∩ B) / Pr(B).

Applied twice, this yields Bayes' rule:

    Pr(H|E) = ( Pr(E|H) / Pr(E) ) · Pr(H).

Summation rule
Suppose events A1, . . . , Ak are disjoint events, one of which must occur. Then for any other event E,

    Pr(E) = Pr(E, A1) + Pr(E, A2) + · · · + Pr(E, Ak)
          = Pr(E|A1) Pr(A1) + Pr(E|A2) Pr(A2) + · · · + Pr(E|Ak) Pr(Ak).
Example: Sex bias in graduate admissions. In 1969, there were 12673 applicants for graduate study at Berkeley. 44% of the male applicants were accepted, and 35% of the female applicants.

Over the sample space of applicants, define:
    M = male
    F = female
    A = admitted

So: Pr(A|M) = 0.44 and Pr(A|F) = 0.35.

In every department, the accept rate for female applicants was at least as high as the accept rate for male applicants. How could this be?

Example: Toss ten coins. What is the probability that the first is heads, given that nine of them are heads? Define:
    H = first coin is heads
    E = nine of the ten coins are heads

    Pr(H|E) = ( Pr(E|H) / Pr(E) ) · Pr(H)
            = [ (9 choose 8) (1/2)^9 ] / [ (10 choose 9) (1/2)^10 ] · (1/2)
            = 9/10.
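To sanity-check the coin calculation above (Pr(H|E) = 9/10), here is a quick Monte Carlo simulation in Python; the sample size and random seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    tosses = rng.integers(0, 2, size=(1_000_000, 10))   # 1 = heads, 0 = tails

    nine_heads = tosses.sum(axis=1) == 9                 # condition on event E
    first_heads = tosses[:, 0] == 1                      # event H

    # Empirical estimate of Pr(H | E); should be close to 9/10.
    print(first_heads[nine_heads].mean())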
Generative models
An unknown underlying distribution D over X × Y. Generating a point (x, y) in two steps:
1 Last week: first choose x, then choose y given x.
2 Now: first choose y, then choose x given y.

Example: X = R, Y = {1, 2, 3}, with class weights π1 = 10%, π2 = 50%, π3 = 40% and class-conditional densities P1(x), P2(x), P3(x).

The overall density is a mixture of the individual densities,

    Pr(x) = π1 P1(x) + · · · + πk Pk(x).

The Bayes-optimal prediction
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).

For any x ∈ X and any label j,

    Pr(y = j|x) = Pr(y = j) Pr(x|y = j) / Pr(x) = πj Pj(x) / ( π1 P1(x) + · · · + πk Pk(x) ).

Bayes-optimal prediction: h∗(x) = arg maxj πj Pj(x).
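To make the posterior formula concrete, here is a minimal Python sketch for the 1-d example above (Y = {1, 2, 3} with weights 10%, 50%, 40%). Only the class weights come from the example; the Gaussian class-conditional densities and their parameters are invented purely for illustration.

    import numpy as np

    pi = np.array([0.1, 0.5, 0.4])     # class weights from the example
    # Hypothetical class-conditional densities P1, P2, P3 (parameters invented for illustration):
    means = np.array([-2.0, 0.0, 3.0])
    sds = np.array([1.0, 1.0, 1.5])

    def P(x):
        """Vector of class-conditional densities (P1(x), P2(x), P3(x))."""
        return np.exp(-(x - means) ** 2 / (2 * sds ** 2)) / np.sqrt(2 * np.pi * sds ** 2)

    def posterior(x):
        """Pr(y = j | x) = pi_j P_j(x) / (pi_1 P_1(x) + ... + pi_k P_k(x))."""
        joint = pi * P(x)
        return joint / joint.sum()

    def h_star(x):
        """Bayes-optimal prediction: argmax_j pi_j P_j(x); labels 1, 2, 3."""
        return int(np.argmax(pi * P(x))) + 1

    print(posterior(1.0), h_star(1.0))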
Estimating class-conditional distributions
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).
Estimating the πj is easy. Estimating the Pj is hard.

Estimating an arbitrary distribution in R^p:
• Can be done, e.g. with kernel density estimation.
• But number of samples needed is exponential in p.

Instead: approximate each Pj with a simple, parametric distribution. Some options:
• Product distributions. Assume coordinates are independent: naive Bayes.
• Multivariate Gaussians. Linear and quadratic discriminant analysis.
• More general graphical models.

Naive Bayes
Binarized MNIST:
• k = 10 classes
• X = {0, 1}^784

Assume that within each class, the individual pixel values are independent:

    Pj(x) = Pj1(x1) · Pj2(x2) · · · Pj,784(x784).

Each Pji is a coin flip: trivial to estimate!
Smoothed estimate of coin bias
Pick a class j and a pixel i. We need to estimate

    pji = Pr(xi = 1|y = j).

Out of a training set of size n, let

    nj  = # of instances of class j
    nji = # of instances of class j with xi = 1

Then the maximum-likelihood estimate of pji is p̂ji = nji / nj.

This causes problems if nji = 0. Instead, use "Laplace smoothing":

    p̂ji = (nji + 1) / (nj + 2).

Form of the classifier
Data space X = {0, 1}^p, label space Y = {1, . . . , k}. Estimate:
• {πj : 1 ≤ j ≤ k}
• {pji : 1 ≤ j ≤ k, 1 ≤ i ≤ p}

Then classify point x as

    arg maxj  πj · ∏i=1…p  pji^xi (1 − pji)^(1−xi).

To avoid underflow, take the log:

    arg maxj  log πj + Σi=1…p ( xi log pji + (1 − xi) log(1 − pji) )

The expression being maximized is of the form w · x + b: a linear classifier!
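A minimal sketch of this naive Bayes classifier in Python/NumPy. It assumes the training set arrives as a binary matrix X (one row per point) with integer labels y in {0, . . . , k − 1}; the variable and function names are mine, not from the slides.

    import numpy as np

    def train_naive_bayes(X, y, k):
        """X: (n, p) binary matrix, y: (n,) labels in {0, ..., k-1}."""
        pi = np.array([(y == j).mean() for j in range(k)])             # class weights pi_j
        # Laplace-smoothed coin biases p_ji = (n_ji + 1) / (n_j + 2)
        pji = np.array([(X[y == j].sum(axis=0) + 1) / ((y == j).sum() + 2)
                        for j in range(k)])                            # shape (k, p)
        return pi, pji

    def predict(X, pi, pji):
        # Log-score: log pi_j + sum_i [ x_i log p_ji + (1 - x_i) log(1 - p_ji) ]
        scores = (np.log(pi)[None, :]
                  + X @ np.log(pji).T
                  + (1 - X) @ np.log(1 - pji).T)                       # shape (n, k)
        return scores.argmax(axis=1)

    # Tiny synthetic check (2 classes, 3 binary features):
    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
    y = np.array([0, 0, 1, 1])
    pi, pji = train_naive_bayes(X, y, k=2)
    print(predict(X, pi, pji))   # recovers [0, 0, 1, 1] on this toy data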
Example: MNIST
Result of training: mean vectors for each class.
Test error rate: 15.54%.
(Figure: confusion matrix.)

Other types of data
How would you handle data:
• Whose features take on more than two discrete values (such as ten possible colors)?
• Whose features are real-valued?
• Whose features are positive integers?
• Whose features are mixed: some real, some Boolean, etc.?

How would you handle "missing data": situations in which data points occasionally (or regularly) have missing entries?
• At train time: ???
• At test time: ???
Multinomial naive Bayes
Bag-of-words: vectorial representation of text documents.

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

    despair      1
    evil         1
    happiness    0
    foolishness  1

• Fix V = some vocabulary.
• Treat the words in a document as independent draws from a multinomial distribution over V:

      p = (p1, . . . , p|V|), such that pi ≥ 0 and Σi pi = 1.

Multinomial naive Bayes: each class is modeled by a multinomial.

Improving performance of multinomial naive Bayes
A variety of heuristics that are standard in text retrieval, such as:
1 Compensating for burstiness.
  Problem: Once a word has appeared in a document, it has a much higher chance of appearing again.
  Solution: Instead of the number of occurrences f of a word, use log(1 + f).
2 Downweighting common words.
  Problem: Common words can have an unduly large influence on classification.
  Solution: Weight each word w by inverse document frequency:

      log( # docs / #(docs containing w) )
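A minimal Python sketch of these two heuristics on a toy corpus; the documents, vocabulary, and names are invented for illustration.

    import numpy as np

    docs = [["the", "best", "of", "times"],
            ["the", "worst", "of", "times"],
            ["the", "age", "of", "wisdom"]]          # toy tokenized corpus
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}

    # Raw counts f, then the burstiness correction log(1 + f)
    counts = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            counts[r, index[w]] += 1
    features = np.log1p(counts)

    # Inverse document frequency: log(#docs / #(docs containing w))
    df = (counts > 0).sum(axis=0)
    idf = np.log(len(docs) / df)
    features = features * idf    # words in every doc, like "the" and "of", get weight log(1) = 0

    print(dict(zip(vocab, np.round(idf, 2))))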
Recall: generative model framework
Labels Y = {1, 2, . . . , k}, density Pr(x) = π1 P1(x) + · · · + πk Pk(x).
(Figure: the earlier mixture example with weights π1 = 10%, π2 = 50%, π3 = 40% and densities P1(x), P2(x), P3(x).)

Approximate each Pj with a simple, parametric distribution:
• Product distributions. Assume coordinates are independent: naive Bayes.
• Multivariate Gaussians. Linear and quadratic discriminant analysis.
• More general graphical models.

The univariate Gaussian
The Gaussian N(µ, σ^2) has mean µ, variance σ^2, and density function

    p(x) = ( 1 / (2πσ^2)^(1/2) ) exp( −(x − µ)^2 / (2σ^2) ).
The multivariate Gaussian
N(µ, Σ): Gaussian in R^p
• mean: µ ∈ R^p
• covariance: p × p matrix Σ

Density:

    p(x) = ( 1 / ( (2π)^(p/2) |Σ|^(1/2) ) ) exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) )

Let X = (X1, X2, . . . , Xp) be a random draw from N(µ, Σ).
• EX = µ. That is, EXi = µi for all 1 ≤ i ≤ p.
• E(X − µ)(X − µ)^T = Σ. That is, for all 1 ≤ i, j ≤ p,
      cov(Xi, Xj) = E(Xi − µi)(Xj − µj) = Σij.
  In particular, var(Xi) = E(Xi − µi)^2 = Σii.

Special case: spherical Gaussian
The Xi are independent and all have the same variance σ^2. Thus

    Σ = σ^2 Ip

Simplified density:

    p(x) = ( 1 / ( (2π)^(p/2) σ^p ) ) exp( −‖x − µ‖^2 / (2σ^2) )

Density at a point depends only on its distance from µ. Each Xi is an independent univariate Gaussian N(µi, σ^2).
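A minimal Python/NumPy sketch of the density formula above, written as a log-density (the form used later for classification); the helper name and test values are mine.

    import numpy as np

    def gaussian_logpdf(x, mu, Sigma):
        """log p(x) for N(mu, Sigma) in R^p, following the density formula above."""
        p = len(mu)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, numerically stable
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

    # Spherical special case: Sigma = sigma^2 I, so the density depends only on ||x - mu||.
    mu = np.zeros(3)
    print(gaussian_logpdf(np.array([1.0, 0.0, 0.0]), mu, 4.0 * np.eye(3)),
          gaussian_logpdf(np.array([0.0, 1.0, 0.0]), mu, 4.0 * np.eye(3)))   # equal values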
Special case: diagonal Gaussian
The Xi are independent, with variances σi^2. Thus

    Σ = diag(σ1^2, . . . , σp^2)

Simplified density:

    p(x) = ( 1 / ( (2π)^(p/2) σ1 · · · σp ) ) exp( − Σi=1…p (xi − µi)^2 / (2σi^2) )

Contours of equal density are axis-aligned ellipsoids centered at µ. Each Xi is an independent univariate Gaussian N(µi, σi^2).

The general Gaussian
N(µ, Σ):
(Figure: contours of equal density are ellipsoids centered at µ, with axes along the eigenvectors u1, u2.)

Eigendecomposition of Σ:
• Eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp
• Corresponding eigenvectors u1, . . . , up

N(µ, Σ) is simply a rotated version of N(µ, diag(λ1, . . . , λp)).
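The "rotated diagonal Gaussian" view also gives a way to sample from N(µ, Σ): draw from N(0, diag(λ1, . . . , λp)) and rotate by the eigenvector matrix. A minimal Python sketch, with µ and Σ chosen arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[3.0, 1.0],
                      [1.0, 2.0]])                     # illustrative covariance

    lam, U = np.linalg.eigh(Sigma)                     # Sigma = U diag(lam) U^T
    z = rng.normal(size=(100_000, 2)) * np.sqrt(lam)   # samples from N(0, diag(lam))
    x = mu + z @ U.T                                   # rotate: now distributed as N(mu, Sigma)

    print(np.cov(x, rowvar=False))                     # close to Sigma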
Mathematical interlude: Linear algebra
(Blackboard)

Binary classification with Gaussian generative model
Estimate class probabilities π1, π2 and fit a Gaussian to each class:

    P1 = N(µ1, Σ1),   P2 = N(µ2, Σ2)

E.g. if data points x(1), . . . , x(m) ∈ R^p are class 1:

    µ1 = (1/m) ( x(1) + · · · + x(m) )   and   Σ1 = (1/m) Σi=1…m ( x(i) − µ1 )( x(i) − µ1 )^T

Given a new point x, predict class 1 iff:

    π1 P1(x) > π2 P2(x)  ⇔  x^T M x + w^T x ≥ θ,

where

    M = (1/2) ( Σ2^(−1) − Σ1^(−1) ),   w = Σ1^(−1) µ1 − Σ2^(−1) µ2,

and θ is a constant depending on the various parameters.

Σ1 = Σ2: linear decision boundary. Otherwise, quadratic boundary.
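A minimal Python sketch of this two-class rule. The toy data and names are mine, and θ is written out explicitly (the slide leaves it as "a constant depending on the various parameters"); the explicit form follows from taking logs of π1 P1(x) ≥ π2 P2(x).

    import numpy as np

    def fit_gaussian(X):
        mu = X.mean(axis=0)
        d = X - mu
        return mu, d.T @ d / len(X)

    def quadratic_rule(x, pi1, mu1, S1, pi2, mu2, S2):
        """Predict class 1 iff x^T M x + w^T x >= theta, with M and w as above."""
        S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
        M = 0.5 * (S2inv - S1inv)
        w = S1inv @ mu1 - S2inv @ mu2
        theta = (np.log(pi2 / pi1)
                 + 0.5 * (np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
                 + 0.5 * (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2))
        return 1 if x @ M @ x + w @ x >= theta else 2

    # Toy data (illustrative): class 1 tight around the origin, class 2 spread out.
    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], 1.0, size=(300, 2))
    X2 = rng.normal([4, 4], 2.0, size=(300, 2))
    (mu1, S1), (mu2, S2) = fit_gaussian(X1), fit_gaussian(X2)
    print(quadratic_rule(np.array([0.5, 0.5]), 0.5, mu1, S1, 0.5, mu2, S2))   # expect 1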
Common covariance: Σ1 = Σ2 = Σ
Linear decision boundary: choose class 1 iff

    x · Σ^(−1)(µ1 − µ2) ≥ θ.

Call this vector w = Σ^(−1)(µ1 − µ2).

Example 1: Spherical Gaussians with Σ = Ip and π1 = π2.
(Figure: the decision boundary is the bisector of the line joining the means, perpendicular to w.)

Example 2: Again spherical, but now π1 > π2.
(Figure: the boundary shifts toward the mean of the less likely class.)

One-d projection onto w:
(Figure: the two projected class distributions, centered at the projections of µ1 and µ2.)
Rule: w · x ≥ θ, with w = Σ^(−1)(µ1 − µ2).
• w, θ dictated by probability model, assuming it is a perfect fit
• Common practice: choose w as above, but fit θ to minimize training/validation error (see the sketch after the examples below)

Different covariances: Σ1 ≠ Σ2
Quadratic boundary: choose class 1 iff x^T M x + w^T x ≥ θ, where:

    M = (1/2) ( Σ2^(−1) − Σ1^(−1) ),   w = Σ1^(−1) µ1 − Σ2^(−1) µ2.

Example 1: Σ1 = σ1^2 Ip and Σ2 = σ2^2 Ip with σ1 > σ2.
Example 2: Same thing in 1-d, X = R.
Example 3: Non-spherical; a parabolic boundary.
Many other possibilities!
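Following the "common practice" bullet above, here is a minimal Python sketch that keeps w = Σ^(−1)(µ1 − µ2) from the fitted model but chooses θ by minimizing training error; the data, names, and threshold search are illustrative assumptions, not from the slides.

    import numpy as np

    def fit_direction(X1, X2):
        """w = Sigma^{-1} (mu1 - mu2), with Sigma the weighted average covariance."""
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        n1, n2 = len(X1), len(X2)
        S1 = np.cov(X1, rowvar=False, bias=True)
        S2 = np.cov(X2, rowvar=False, bias=True)
        Sigma = (n1 * S1 + n2 * S2) / (n1 + n2)
        return np.linalg.solve(Sigma, mu1 - mu2)

    def fit_threshold(w, X1, X2):
        """Choose theta minimizing training error of the rule: class 1 iff w . x >= theta."""
        scores = np.concatenate([X1 @ w, X2 @ w])
        labels = np.concatenate([np.ones(len(X1)), 2 * np.ones(len(X2))])
        best_theta, best_err = None, np.inf
        for theta in np.unique(scores):                 # candidate thresholds
            pred = np.where(scores >= theta, 1, 2)
            err = (pred != labels).mean()
            if err < best_err:
                best_theta, best_err = theta, err
        return best_theta, best_err

    rng = np.random.default_rng(1)
    X1 = rng.normal([0, 0], 1.0, size=(100, 2))
    X2 = rng.normal([3, 3], 1.0, size=(100, 2))
    w = fit_direction(X1, X2)
    print(fit_threshold(w, X1, X2))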
Multiclass discriminant analysis
k classes: weights πj, class-conditional distributions Pj = N(µj, Σj).

Each class has an associated quadratic function

    fj(x) = log( πj Pj(x) ).

To classify a point x, pick arg maxj fj(x).
If Σ1 = · · · = Σk, the boundaries are linear.

Example: "wine" data set
Data from three wineries from the same region of Italy.
• 13 attributes: hue, color intensity, flavanoids, ash content, ...
• 178 instances in all: split into 118 train, 60 test
Test error using multiclass discriminant analysis: 1/60.
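A minimal Python sketch of multiclass discriminant analysis as described above: fit (πj, µj, Σj) for each class and predict arg maxj fj(x). The toy data and names are invented for illustration.

    import numpy as np

    def fit(X, y, k):
        """Per-class weights pi_j, means mu_j, covariances Sigma_j."""
        params = []
        for j in range(k):
            Xj = X[y == j]
            mu = Xj.mean(axis=0)
            Sigma = (Xj - mu).T @ (Xj - mu) / len(Xj)
            params.append((len(Xj) / len(X), mu, Sigma))
        return params

    def f_j(x, pi, mu, Sigma):
        """f_j(x) = log( pi_j P_j(x) ) for P_j = N(mu_j, Sigma_j)."""
        p = len(mu)
        _, logdet = np.linalg.slogdet(Sigma)
        d = x - mu
        return np.log(pi) - 0.5 * (p * np.log(2 * np.pi) + logdet
                                   + d @ np.linalg.solve(Sigma, d))

    def predict(x, params):
        return int(np.argmax([f_j(x, *prm) for prm in params]))

    # Toy three-class data in R^2 (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.7, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 50)
    params = fit(X, y, k=3)
    print(predict(np.array([2.8, 0.2]), params))   # expect 1 (the class centered at [3, 0])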
Example: MNIST
To each digit, fit:
• class probability πj
• mean µj ∈ R^784
• covariance matrix Σj ∈ R^(784×784)

Problem: the formula for the normal density uses Σj^(−1), which is singular.
• Need to regularize: Σj → Σj + σ^2 I
• This is a good idea even without the singularity issue

Error rate with regularization: ??? (better than naive Bayes).

Fisher's linear discriminant
A framework for linear classification without Gaussian assumptions. Use only first- and second-order statistics of the classes:

    Class 1: mean µ1, cov Σ1, # pts n1
    Class 2: mean µ2, cov Σ2, # pts n2

A linear classifier projects all data onto a direction w. Choose w so that:
• Projected means are well-separated, i.e. (w · µ1 − w · µ2)^2 is large.
• Projected within-class variance is small.
Fisher LDA (linear discriminant analysis)
Two classes: means µ1, µ2; covariances Σ1, Σ2; sample sizes n1, n2.

Project data onto direction (unit vector) w.
• Projected means: w · µ1 and w · µ2
• Projected variances: w^T Σ1 w and w^T Σ2 w
• Average projected variance:

      ( n1 (w^T Σ1 w) + n2 (w^T Σ2 w) ) / (n1 + n2) = w^T Σ w,

  where Σ = (n1 Σ1 + n2 Σ2)/(n1 + n2).

Find w to maximize

    J(w) = (w · µ1 − w · µ2)^2 / (w^T Σ w)

Solution: w ∝ Σ^(−1)(µ1 − µ2). Look familiar?

Fisher LDA: proof
Goal: find w to maximize J(w) = (w · µ1 − w · µ2)^2 / (w^T Σ w).

1 Assume Σ1, Σ2 are full rank; else project.
2 Since Σ1 and Σ2 are p.d., so is their weighted average, Σ.
3 Write u = Σ^(1/2) w. Then

      max_w (w · µ1 − w · µ2)^2 / (w^T Σ w)
        = max_w (w^T (µ1 − µ2))^2 / (w^T Σ w)
        = max_u (u^T Σ^(−1/2) (µ1 − µ2))^2 / (u^T u)
        = max_{u: ‖u‖ = 1} ( u · (Σ^(−1/2) (µ1 − µ2)) )^2

4 Solution: u is the unit vector in direction Σ^(−1/2)(µ1 − µ2).
5 Therefore: w = Σ^(−1/2) u ∝ Σ^(−1)(µ1 − µ2).
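A minimal Python check of this result: compute w ∝ Σ^(−1)(µ1 − µ2) on toy data and confirm that J(w) is at least as large as for randomly chosen directions; the data and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], [1.0, 3.0], size=(200, 2))   # toy class 1
    X2 = rng.normal([2, 1], [1.0, 3.0], size=(200, 2))   # toy class 2

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, bias=True)
    S2 = np.cov(X2, rowvar=False, bias=True)
    Sigma = (n1 * S1 + n2 * S2) / (n1 + n2)              # weighted average covariance

    def J(w):
        return (w @ (mu1 - mu2)) ** 2 / (w @ Sigma @ w)

    w_fisher = np.linalg.solve(Sigma, mu1 - mu2)         # w proportional to Sigma^{-1}(mu1 - mu2)
    random_dirs = rng.normal(size=(1000, 2))
    print(J(w_fisher) >= max(J(w) for w in random_dirs)) # True: the Fisher direction maximizes J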