Discriminant function analysis (DFA)
● Rationale and use of DFA
● The underlying model (what is a discriminant function anyway?)
● Finding discriminant functions: principles and procedures
● Linear versus quadratic discriminant functions
● Significance testing
● Rotating discriminant functions
● Component retention, significance, and reliability.
Bio 8100s Multivariate biostatistics
L9.1
Université d’Ottawa / University of Ottawa
1999
What is discriminant function analysis?
● Given a set of p variables X1, X2,…, Xp, and a set of N objects belonging to m known groups (classes) G1, G2,…, Gm, we try to construct a set of functions Z1, Z2,…, Zr, where r = min{m-1, p}, that allow us to classify each object correctly.
● The hope (sometimes faint) is that “good” classification results (i.e., low misclassification rate, high reliability) will be obtained through a relatively small set of simple functions.
What is a discriminant function anyway?
● A discriminant function is a function
$$Z_i = f_i(X_1, \ldots, X_p)$$
which maximizes the “separation” between the groups under consideration, or (more technically), maximizes the ratio of between-group to within-group variation.
[Figure: frequency distributions of Group 1 and Group 2 along Z1 (not so good) and along Z2 (better).]
The linear discriminant model
● For a set of p variables X1, X2,…, Xp, the general model is
$$Z_i = \sum_{j=1}^{p} a_{ij} X_j$$
where the Xj are the original variables and the aij are the discriminant function coefficients.
● Note: unlike in PCA and FA, the discriminant functions are based on the raw (unstandardized) variables, since the resulting classifications are unaffected by scale.
● For p variables and m groups, the maximum number of DFs is min{p, m-1}.
The geometry of a single linear discriminant function
● 2 groups with measurements of two variables (X1 and X2) on each object.
● In this case, the linear DF Z* results in no misclassifications, whereas another possible DF (Z) gives two misclassifications.
[Figure: scatterplot of Group 1 and Group 2 in the (X1, X2) plane with two candidate functions Z and Z*; two objects are misclassified under Z but not under Z*.]
Finding discriminant functions: principles
● The first discriminant function is that which maximizes the differences between groups compared to the differences within groups…
● …which is equivalent to maximizing F in a one-way ANOVA:
$$F(Z) = \frac{MS_B(Z)}{MS_W(Z)}, \qquad Z_1 = \max\{F(Z)\}$$
[Figure: F(Z) plotted against the coefficient vector a = (a1,…, ap), with Z1 at the maximum.]
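The F criterion can be evaluated for any candidate coefficient vector. The following is a minimal sketch, assuming NumPy; the function name `f_ratio` and the toy data are hypothetical and not part of the original notes.

```python
import numpy as np

def f_ratio(X, groups, a):
    """One-way ANOVA F statistic of the scores Z = X @ a across groups."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    z = X @ np.asarray(a, dtype=float)       # score of every object on the candidate function
    labels = np.unique(groups)
    n, m = len(z), len(labels)
    grand_mean = z.mean()
    ss_between = sum(np.sum(groups == g) * (z[groups == g].mean() - grand_mean) ** 2
                     for g in labels)
    ss_within = sum(np.sum((z[groups == g] - z[groups == g].mean()) ** 2)
                    for g in labels)
    return (ss_between / (m - 1)) / (ss_within / (n - m))   # MS_B / MS_W

# Hypothetical toy data: two groups of three objects, two variables.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
              [4.0, 5.0], [5.0, 4.0], [6.0, 6.0]])
groups = np.array([1, 1, 1, 2, 2, 2])
f_x1 = f_ratio(X, groups, [1.0, 0.0])   # candidate function using X1 only
f_x2 = f_ratio(X, groups, [0.0, 1.0])   # candidate function using X2 only
```

In this toy data the groups differ only on X1, so the X1 direction gives a large F and the X2 direction gives F = 0; the first discriminant function is the direction for which this ratio is largest.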
Finding discriminant functions: principles
● The second discriminant function is that which maximizes the differences between groups compared to the differences within groups unaccounted for by Z1…
● …which is equivalent to maximizing F in a one-way ANOVA given the constraint that Z1 and Z2 are uncorrelated:
$$F(Z) = \frac{MS_B(Z)}{MS_W(Z)}, \qquad Z_2 = \max\{F(Z) \mid r_{Z_1, Z_2} = 0\}$$
[Figure: F(Z) plotted against the coefficient vector a = (a1,…, ap), with Z2 marked.]
The geometry of several linear discriminant functions
● 2 groups with measurements of two variables (X1 and X2) on each individual.
● Using only Z1, 4 objects are misclassified, whereas using both Z1 and Z2, only one object is misclassified.
[Figure: scatterplot of Group 1 and Group 2 in the (X1, X2) plane with axes Z1 and Z2, marking objects misclassified using only Z1 and objects misclassified using both Z1 and Z2.]
SSCP matrices: within, between, and total
● The total (T) SSCP matrix (based on p variables X1, X2,…, Xp) in a sample of objects belonging to m groups G1, G2,…, Gm with sizes n1, n2,…, nm can be partitioned into within-groups (W) and between-groups (B) SSCP matrices:
$$T = B + W$$
● Notation:
  ■ $x_{ijk}$: value of variable Xk for the ith observation in group j
  ■ $\bar{x}_{jk}$: mean of variable Xk for group j
  ■ $\bar{x}_{k}$: overall mean of variable Xk
  ■ $t_{rc}$, $w_{rc}$: element in row r and column c of the total (T, t) and within (W, w) SSCP matrices
$$t_{rc} = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (x_{ijr} - \bar{x}_{r})(x_{ijc} - \bar{x}_{c})$$
$$w_{rc} = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (x_{ijr} - \bar{x}_{jr})(x_{ijc} - \bar{x}_{jc})$$
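The partition T = B + W can be checked numerically. The sketch below assumes NumPy; the function name `sscp_matrices` and the toy data are hypothetical. B is computed directly from the group means rather than by subtraction, so the identity is a real check.

```python
import numpy as np

def sscp_matrices(X, groups):
    """Return the total (T), within-groups (W) and between-groups (B) SSCP matrices."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean
    T = centered.T @ centered                      # deviations from the overall means
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        group_mean = Xg.mean(axis=0)
        dev = Xg - group_mean                      # deviations from the group means
        W += dev.T @ dev
        d = (group_mean - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)                   # weighted by group size n_j
    return T, W, B

# Hypothetical toy data: two groups of three objects, two variables.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
              [4.0, 5.0], [5.0, 4.0], [6.0, 6.0]])
groups = np.array([1, 1, 1, 2, 2, 2])
T, W, B = sscp_matrices(X, groups)
partition_holds = np.allclose(T, B + W)
```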
Finding discriminant functions: analytic procedures
● Calculate total (T), within (W) and between (B) SSCPs:
$$T = B + W$$
● Determine eigenvalues and eigenvectors of the product W^{-1}B:
$$\lambda(W^{-1}B) = (\lambda_1, \lambda_2, \ldots, \lambda_p)$$
● λi is the ratio of between to within SSs for the ith discriminant function Zi…
$$\lambda_i = \frac{SS_B(Z_i)}{SS_W(Z_i)}$$
● …and the elements of the corresponding eigenvectors are the discriminant function coefficients:
$$\xi_i(W^{-1}B) = (a_{i1}, a_{i2}, \ldots, a_{ip})$$
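The analytic procedure can be sketched in a few lines, assuming NumPy. The function name `discriminant_functions` and the toy data are hypothetical, and eigenvectors are only defined up to scale, so statistical packages typically rescale the coefficients.

```python
import numpy as np

def discriminant_functions(X, groups):
    """Eigenvalues and coefficient vectors of W^{-1} B, largest eigenvalue first."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in labels:
        Xg = X[groups == g]
        group_mean = Xg.mean(axis=0)
        dev = Xg - group_mean
        W += dev.T @ dev
        d = (group_mean - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)
    # Solve W M = B instead of forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    r = min(p, len(labels) - 1)                 # maximum number of DFs
    lam = eigvals.real[order][:r]
    A = eigvecs.real[:, order][:, :r].T         # row i holds the coefficients of Z_i
    return lam, A, W, B

# Hypothetical toy data: two groups of three objects, two variables.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
              [4.0, 5.0], [5.0, 4.0], [6.0, 6.0]])
groups = np.array([1, 1, 1, 2, 2, 2])
lam, A, W, B = discriminant_functions(X, groups)
a1 = A[0]
ratio = (a1 @ B @ a1) / (a1 @ W @ a1)           # between/within SS along Z1
```

Here `ratio` reproduces the first eigenvalue, which illustrates the statement that each eigenvalue is the between-to-within ratio for its function.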
Assumptions
● Equality of within-group covariance matrices (C1 = C2 = ...) implies that each element of C1 is equal to the corresponding element in C2, etc.

Covariance matrix for G1 (variances on the diagonal, covariances below it):
Variable   X1     X2     X3
X1         s1²
X2         c21    s2²
X3         c31    c32    s3²

Covariance matrix for G2 (variances on the diagonal, covariances below it):
Variable   X1     X2     X3
X1         s1²
X2         c21    s2²
X3         c31    c32    s3²
The quadratic discriminant model
● For a set of p variables X1, X2,…, Xp, the general quadratic model is
$$Z_i = \sum_{j=1}^{p} a_{ij} X_j + \sum_{j=1}^{p} \sum_{k=j}^{p} b_{ijk} X_j X_k$$
where the Xj are the original variables, the aij are the linear coefficients and the bijk are the 2nd order coefficients.
● Because the quadratic model involves many more parameters, sample sizes must be considerably larger to get reasonably stable estimates of coefficients.
[Figure: Group 1 and Group 2 in the (X1, X2) plane, separated by a linear Z1 and by a quadratic Z1.]
Fitting discriminant function models: the problems
● Goal: find the “best” model, given the available data
● Problem 1: what is “best”?
● Problem 2: even if “best” is defined, by what method do we find it?
● Possibilities:
  ■ If there are p variables, we might compute DFs using all possible subsets of variables (2^p - 1 models) and choose the best one
  ■ use some procedure for winnowing down the set of possible models.
Criteria for choosing the “best” discriminant model
● Discriminating ability: better models are better able to distinguish among groups
● Implication: better models will have lower misclassification rates.
● N.B. Raw misclassification rates can be very misleading.
● Parsimony: a discriminant model which includes fewer variables is better than one with more variables.
● Implication: if the elimination/addition of a variable does not significantly increase/decrease the misclassification rate, it may not be very useful.
Criteria for choosing the “best” discriminant model (cont’d)
● Model stability: better models have coefficients that are stable as judged through cross-validation.
● Procedure: Judge stability through cross-validation (jackknifing, bootstrapping).
● N.B.1. In general, linear discriminant functions will be more stable than quadratic functions, especially if the sample is small.
● N.B.2. If sample is small, then “outliers” may dramatically decrease model stability.
Analytic procedures: general approach
● Evaluate significance of a variable (Xi) in DF by computing the difference in group resolution between two models, one with the variable included, the other with it excluded.
● Evaluate change in discriminating ability (∆DA) associated with inclusion of the variable in question.
● Unfortunately, change in discriminating ability may depend on what other variables are in model!
[Figure: flow chart comparing Model A (Xi in) with Model B (Xi out) via ∆DA; retain Xi if ∆ is large, delete Xi if ∆ is small.]
Strategy I: computing all possible models
● compute all possible models and choose the “best” one.
● Impractical unless number of variables is relatively small.
[Figure: starting from {X1, X2, X3}, the candidate models are {X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3} and {X1, X2, X3}.]
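To see why this strategy becomes impractical, the candidate models can be enumerated directly. This is a small sketch using only the Python standard library; the helper name `all_subsets` is hypothetical.

```python
from itertools import combinations

def all_subsets(variables):
    """Every non-empty subset of the variables, i.e. every candidate model."""
    subsets = []
    for size in range(1, len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

models = all_subsets(["X1", "X2", "X3"])          # the 7 models for three variables
counts = {p: 2 ** p - 1 for p in (3, 10, 20)}     # number of models grows as 2^p - 1
```

With 20 variables there are already more than a million candidate models, each requiring its own discriminant analysis.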
Strategy II: forward selection
● start with variable for which differences among group means are the largest (largest F-value)
● add others one at a time based on F to enter (p to enter) until no further significant increase in discriminating ability is achieved.
● problem: if Xj is included, it stays in even if it contributes little to discriminating ability once other variables are included.
Example, with all variables (X1, X2, X3, X4) as candidates:
  ■ Step 1: F2 > F1, F3, F4 and F2 > F to enter (p < p to enter), so the model becomes (X2)
  ■ Step 2: F1 > F3, F4 and F1 > F to enter (p < p to enter), so the model becomes (X1, X2)
  ■ Step 3: F4 > F3 but F4 < F to enter (p > p to enter), so the final model is (X1, X2)
What is F to enter/remove (p to enter/remove) anyway?
● When no variables are in the model, F to enter is the F-value from a univariate one-way ANOVA comparing group means with respect to the variable in question, and p to enter is the Type I error probability associated with the null hypothesis that all group means are equal.
● When other variables are in the model, F to enter corresponds to the F-value for an ANCOVA comparing group means with respect to the variable in question, where the covariates are the variables already entered.
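The ANOVA/ANCOVA description of F to enter can be sketched as a comparison of two least-squares fits, assuming NumPy. The function names and toy data are hypothetical, and the sketch assumes a common slope for each covariate across groups, which is the usual ANCOVA assumption rather than something stated in these notes.

```python
import numpy as np

def _residual_ss(y, design):
    """Residual sum of squares from a least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def f_to_enter(X, groups, candidate, entered=()):
    """F for the group effect on column `candidate`, adjusting for the `entered` columns."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, m, q = len(X), len(labels), len(entered)
    y = X[:, candidate]
    covariates = X[:, np.array(entered, dtype=int)]               # shape (n, q), q may be 0
    dummies = (groups[:, None] == labels[None, :]).astype(float)  # one column per group
    reduced = np.hstack([np.ones((n, 1)), covariates])   # no group effect
    full = np.hstack([dummies, covariates])               # separate group intercepts
    ss_reduced = _residual_ss(y, reduced)
    ss_full = _residual_ss(y, full)
    return ((ss_reduced - ss_full) / (m - 1)) / (ss_full / (n - m - q))

# Hypothetical toy data: two groups of three objects, two variables.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
              [4.0, 5.0], [5.0, 4.0], [6.0, 6.0]])
groups = np.array([1, 1, 1, 2, 2, 2])
f_empty_model = f_to_enter(X, groups, candidate=0)                # one-way ANOVA on X1
f_after_x2 = f_to_enter(X, groups, candidate=0, entered=(1,))     # ANCOVA with X2 as covariate
```

With no variables entered the function reduces to the ordinary one-way ANOVA F. A forward-selection loop would call it for every variable not yet in the model and enter the one with the largest F, provided it exceeds the chosen F to enter.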
Strategy III: backward selection
● Start with all variables and drop that for which differences among group means are the smallest (smallest F-value)
● Delete others one at a time based on F to remove (p to remove) until further removal results in a significant reduction in the ability to discriminate groups.
● problem: if Xj is excluded, it stays out even if it contributes substantially to discriminating ability once other variables are excluded.
Example, starting with all variables in (X1, X2, X3, X4):
  ■ Step 1: F2 < F1, F3, F4 and F2 < F to remove (p > p to remove), so the model becomes (X1, X3, X4)
  ■ Step 2: F1 < F3, F4 and F1 < F to remove (p > p to remove), so the model becomes (X3, X4)
  ■ Step 3: F4 < F3 but F4 > F to remove (p < p to remove), so the final model is (X3, X4)
Canonical scores
● Because discriminant functions are functions, we can “plug in” the values for each variable for each observation, and calculate a canonical score for each observation and each discriminant function.

Observation   X1     X2
1             3.7    11.5
2             2.3    10.2

$$a = \begin{pmatrix} 0.27 & 0.97 \\ 0.92 & 0.39 \end{pmatrix}$$
$$Z_{11} = 0.27(3.7) + 0.97(11.5)$$
$$Z_{12} = 0.92(3.7) + 0.39(11.5)$$
$$Z_{21} = 0.27(2.3) + 0.97(10.2)$$
$$Z_{22} = 0.92(2.3) + 0.39(10.2)$$
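The same scores can be obtained with a single matrix product. This is a minimal sketch assuming NumPy, with the coefficient matrix read as having one discriminant function per row and a first coefficient of 0.27.

```python
import numpy as np

# Observations (rows) by variables X1, X2 (columns), as in the table above.
X = np.array([[3.7, 11.5],
              [2.3, 10.2]])
# Coefficient matrix: row i holds the coefficients of discriminant function i.
a = np.array([[0.27, 0.97],
              [0.92, 0.39]])
# Z[k, i] is the canonical score of observation k + 1 on function i + 1.
Z = X @ a.T
```

For these two observations the scores are Z11 = 12.154, Z12 = 7.889, Z21 = 10.515 and Z22 = 6.094.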
Canonical scores plots
● Plots of canonical scores for each object.
● The better the model, the greater the separation between clouds of points representing individual groups, e.g. Fisher’s famous irises.

Canonical scores of group means:
Group   Function 1   Function 2
1       7.608        0.215
2       -1.825       -0.728
3       -5.783       0.513

[Figure: canonical scores plot of FACTOR(2) against FACTOR(1), both axes running from -10 to 10, with a 95% confidence ellipse for each group.]
Priors
● In standard DFA, it is assumed that in the absence of any information, the a priori (prior) probability φi of a given object belonging to one of i = 1,…,m groups is the same for all groups:
$$\phi_i = \frac{1}{m}$$
● But, if each group is not equally likely, then priors should be adjusted so as to reflect this bias.
● E.g. in species with biased sex-ratios, males and females should have unequal priors.
Caveats: unequal priors
● For a given set of discriminant functions, misclassification rates will usually depend on the priors…
● …so that artificially low misclassification rates can be obtained simply by strategically adjusting the priors.
● So, only adjust priors if you are confident that the true frequency of each group in the population is (reasonably) accurately estimated by the group frequencies in the sample.
Significance testing
● Question: which discriminant functions are statistically “significant”?
● For testing significance of all r DFs for m groups based on p variables, calculate Bartlett’s V and compare to χ2 distribution with p(m-1) degrees of freedom:
$$V = \left[ N - 1 - \frac{1}{2}(p + m) \right] \sum_{i=1}^{r} \ln(1 + \lambda_i)$$
where λi is the eigenvalue associated with the ith discriminant function.
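Bartlett’s V is simple to compute once the eigenvalues are known. The sketch below uses only the Python standard library; the function name `bartlett_v` is hypothetical, and the inputs are the eigenvalues reported for the iris example in these notes (N = 150, p = 4, m = 3).

```python
import math

def bartlett_v(eigenvalues, n_objects, n_variables, n_groups):
    """Bartlett's V for all discriminant functions combined, with its degrees of freedom."""
    multiplier = n_objects - 1 - 0.5 * (n_variables + n_groups)
    v = multiplier * sum(math.log(1.0 + lam) for lam in eigenvalues)
    df = n_variables * (n_groups - 1)
    return v, df

# Eigenvalues of the two iris discriminant functions.
v, df = bartlett_v([32.192, 0.285], n_objects=150, n_variables=4, n_groups=3)
```

The resulting V would then be compared with the χ2 distribution on `df` degrees of freedom, for example with a statistical table or a χ2 survival function from a statistics library.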
Significance testing (cont’d)
● Each DF is tested in a hierarchical fashion by first testing significance of all DFs combined:
$$V = \left[ N - 1 - \frac{1}{2}(p + m) \right] \sum_{i=1}^{r} \ln(1 + \lambda_i)$$
● If all DFs combined not significant, then no DF is significant.
● If all DFs combined are significant, then remove first DF and recalculate V (= V1) and test:
$$V_1 = \left[ N - 1 - \frac{1}{2}(p + m - 1) \right] \sum_{i=2}^{r} \ln(1 + \lambda_i)$$
$$V_2 = \left[ N - 1 - \frac{1}{2}(p + m - 2) \right] \sum_{i=3}^{r} \ln(1 + \lambda_i)$$
● Continue until residual Vj no longer significant at df = (p - j)(m - j - 1):
$$V_j = \left[ N - 1 - \frac{1}{2}(p + m - j) \right] \sum_{i=j+1}^{r} \ln(1 + \lambda_i)$$
Caveats/assumptions: tests of significance
● Tests of significance assume that within-group covariance matrices are the same for all groups, and that within groups, observations have a multivariate normal distribution.
● Tests of significance can be very misleading because the jth discriminant function in the population may not appear as the jth discriminant function in the sample due to sampling errors…
● So be careful, especially if the sample is small!
Caveats/assumptions: tests of significance (cont’d)
● If stepwise (forward or backward) procedures are used, significance tests are biased because given enough variables, significant discriminant functions can be produced by chance alone.
● In such cases, it is advisable to (1) test results with more standard analyses or (2) use randomization procedures whereby objects are randomly assigned to groups.
Assessing classification accuracy I. Raw classification results
● The derived discriminant functions are used to classify all objects in the sample, and a classification table is produced.
● Classification accuracy is likely to be overestimated, since the data used to generate the DFs in the first place are themselves being classified.

Group   1     2     Total
1       43    5     48
2       8     14    22
Total   51    19    70

Misclassification (G1) = 5/48
Misclassification (G2) = 8/22
Overall misclassification = 13/70
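The misclassification rates follow directly from the classification table. This is a minimal sketch assuming NumPy, with rows read as the true group and columns as the assigned group (the reading implied by the rates 5/48 and 8/22).

```python
import numpy as np

# Classification table: rows are true groups, columns are assigned groups.
table = np.array([[43, 5],
                  [8, 14]])
row_totals = table.sum(axis=1)                 # number of objects in each true group
correct = np.diag(table)                       # correctly classified objects
per_group = (row_totals - correct) / row_totals
overall = (table.sum() - correct.sum()) / table.sum()
```

Here `per_group` holds 5/48 and 8/22 (about 10.4% and 36.4%) and `overall` is 13/70 (about 18.6%).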
Assessing classification accuracy II. Jackknifed classification
● Discriminant functions are derived using N - 1 objects, and the Nth object is then classified.
● This procedure is repeated for all N objects, each time leaving a different one out, and a classification table produced.
● In general, jackknifed classification results are worse than raw classification results, but more reliable.

Group   1     2     Total
1       41    7     48
2       9     13    22
Total   50    20    70

Misclassification (G1) = 7/48
Misclassification (G2) = 9/22
Overall misclassification = 16/70
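The leave-one-out loop can be sketched as follows, assuming NumPy. The classifier used here assigns each object to the group with the nearest centroid in Mahalanobis distance under a pooled within-group covariance matrix, which corresponds to linear classification with equal priors; that choice, the function names and the toy data are assumptions for illustration.

```python
import numpy as np

def fit_linear_classifier(X, groups):
    """Group centroids and the inverse pooled within-group covariance matrix."""
    labels = np.unique(groups)
    means = np.array([X[groups == g].mean(axis=0) for g in labels])
    W = sum((X[groups == g] - means[k]).T @ (X[groups == g] - means[k])
            for k, g in enumerate(labels))
    pooled_cov = W / (len(X) - len(labels))
    return labels, means, np.linalg.inv(pooled_cov)

def classify(x, model):
    """Assign x to the group with the smallest squared Mahalanobis distance."""
    labels, means, cov_inv = model
    diffs = means - x
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return labels[np.argmin(d2)]

def jackknifed_table(X, groups):
    """Classification table (true group by assigned group) from leave-one-out."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    table = np.zeros((len(labels), len(labels)), dtype=int)
    for i in range(len(X)):
        keep = np.arange(len(X)) != i                 # leave object i out
        model = fit_linear_classifier(X[keep], groups[keep])
        predicted = classify(X[i], model)
        row = np.searchsorted(labels, groups[i])
        col = np.searchsorted(labels, predicted)
        table[row, col] += 1
    return table

# Hypothetical toy data: two well separated groups of four objects.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]])
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2])
table = jackknifed_table(X, groups)
```

Because each object is classified by functions derived without it, the resulting table is not inflated by the object's own contribution to the fit.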
Assessing classification accuracy III. Data splitting
● Use 2/3 of sample data (randomly selected) to generate discriminant functions (learning set).
● Use derived discriminant functions to classify the other 1/3 (test set) and produce classification table.
● In general, data-splitting classification results are worse than both raw and jackknifed classification results, but more reliable.

Group   1     2     Total
1       40    8     48
2       9     13    22
Total   49    21    70

Misclassification (G1) = 8/48
Misclassification (G2) = 9/22
Overall misclassification = 17/70
Assessing classification accuracy IV. Bootstrapped data splitting
● Use 2/3 of sample data (randomly sampled) to generate discriminant functions (learning set).
● Use derived discriminant functions to classify other 1/3 (test set) and produce classification results.
● Repeat a large number (e.g. 1000) times, each time sampling with replacement.
● Generate classification statistics over bootstrapped samples, e.g. mean classification results, standard errors, etc.

Group   1             2             Total
1       41.2 ± 1.7    6.8 ± 0.6     48
2       9.3 ± 0.5     12.7 ± 1.1    22
Total   50.5          19.5          70

Misclassification (G1) = 14.2%
Misclassification (G2) = 42.3%
Overall misclassification = 23.0%
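A repeated data-splitting loop can be sketched as below, assuming NumPy. Several details are assumptions rather than the procedure of these notes: each replicate draws a fresh random 2/3 versus 1/3 split, the classifier is a nearest-centroid rule under a pooled within-group covariance matrix, and the function names and simulated data are hypothetical.

```python
import numpy as np

def fit(X, groups):
    """Group centroids and the inverse pooled within-group covariance matrix."""
    labels = np.unique(groups)
    means = np.array([X[groups == g].mean(axis=0) for g in labels])
    W = sum((X[groups == g] - means[k]).T @ (X[groups == g] - means[k])
            for k, g in enumerate(labels))
    return labels, means, np.linalg.inv(W / (len(X) - len(labels)))

def predict(X, model):
    """Assign every row of X to the group with the nearest centroid."""
    labels, means, cov_inv = model
    diffs = X[:, None, :] - means[None, :, :]
    d2 = np.einsum("nmp,pq,nmq->nm", diffs, cov_inv, diffs)
    return labels[np.argmin(d2, axis=1)]

def repeated_split_error(X, groups, n_repeats=200, learn_fraction=2 / 3, seed=0):
    """Mean and standard deviation of the test-set misclassification rate."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n = len(X)
    n_learn = int(round(learn_fraction * n))
    rates = []
    for _ in range(n_repeats):
        order = rng.permutation(n)
        learn, test = order[:n_learn], order[n_learn:]   # learning set and test set
        model = fit(X[learn], groups[learn])
        rates.append(np.mean(predict(X[test], model) != groups[test]))
    rates = np.array(rates)
    return rates.mean(), rates.std(ddof=1)

# Simulated, well separated data: 30 objects per group, two variables.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(30, 2)),
               data_rng.normal(6.0, 1.0, size=(30, 2))])
groups = np.repeat([1, 2], 30)
mean_rate, sd_rate = repeated_split_error(X, groups)
```

The spread of the rates across replicates gives a rough standard error of the kind shown after the ± signs in the table.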
Interpreting discriminant functions
● Examine standardized coefficients (coefficients of discriminant functions based on standardized values).
● For interpretation, use variables with large absolute standardized coefficients.
● Examine the discriminant-variable correlations.
● For interpretation, use variables with high correlations with important discriminant functions.
Example: Fisher’s famous irises
● Data: four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).
● Problem: find the “best” set of DFs.
[Figure: scatterplot matrix of SEPALLEN, SEPALWID, PETALLEN and PETALWID.]
Example: Fisher’s famous irises: between-groups F matrix
● Matrix entries are F-values from one-way MANOVA comparing group means, and can be considered measures of the distance between group centroids.

Species   1         2        3
1         0.0
2         550.2     0.0
3         1098.3    105.3    0.0

● N.B. do not use probabilities associated with F-tests to determine “significance” unless you correct for multiple tests.
Example: Fisher’s famous irises: canonical discriminant functions
● Four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).

Canonical discriminant functions:
            1         2
Constant    2.105     -6.661
SEPALLEN    0.829     0.024
SEPALWID    1.534     2.165
PETALLEN    -2.201    -0.932
PETALWID    -2.810    2.839

Note: discriminant functions are derived using equal priors.
Example: Fisher’s famous irises: standardized canonical discriminant functions
● Four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).

Standardized canonical discriminant functions:
            1         2
SEPALLEN    0.427     0.012
SEPALWID    0.521     0.735
PETALLEN    -0.942    -0.401
PETALWID    -2.810    0.581

Note: canonical discriminant functions are based on standardized values.
Example: Fisher’s famous irises: eigenvalues, canonical correlations and cumulative dispersion
● Eigenvalues give the amount of difference among groups captured by a particular discriminant function, and the cumulative proportion of dispersion is the corresponding running share of the total.

Parameter                              Function 1   Function 2
Eigenvalues                            32.192       0.285
Canonical correlation                  0.985        0.471
Cumulative proportion of dispersion    0.991        1.000

● Canonical correlation is the correlation between a given canonical variate and a set of two dummy variables representing each group.
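The canonical correlations and cumulative proportions in the table can be recovered from the eigenvalues alone. The sketch below uses only the Python standard library and relies on the standard relation between an eigenvalue of W^{-1}B and its squared canonical correlation, ρ² = λ / (1 + λ), which is not stated explicitly in these notes.

```python
import math

eigenvalues = [32.192, 0.285]

# Canonical correlation from each eigenvalue: rho = sqrt(lambda / (1 + lambda)).
canonical_correlations = [math.sqrt(lam / (1.0 + lam)) for lam in eigenvalues]

# Cumulative proportion of dispersion: running sum of eigenvalues over their total.
total = sum(eigenvalues)
cumulative = []
running = 0.0
for lam in eigenvalues:
    running += lam
    cumulative.append(running / total)
```

Rounded to three decimals, these reproduce the tabulated values 0.985 and 0.471 for the canonical correlations and 0.991 and 1.000 for the cumulative proportions.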
Fisher’s irises: raw and jackknifed classification results
● In this case, results are identical (a relatively rare occurrence!)

Raw classification:
Species   1     2     3     % correct
1         50    0     0     100
2         0     48    2     96
3         0     1     49    98
Total     50    49    51    98

Jackknifed classification:
Species   1     2     3     % correct
1         50    0     0     100
2         0     48    2     96
3         0     1     49    98
Total     50    49    51    98
Discriminant function analysis: caveats and notes
● Unless the ratio of number of objects/number of variables is large (> 20), standardized coefficients and correlations are unstable.
● DFA is unaffected by differences among variables in scale, so standardization is not required (unlike PCA, FA, etc.)
● Linear DFA is quite sensitive to the assumption of equality of covariance matrices among groups. If this assumption is violated, use quadratic classification.
● However, quadratic DFA is more unstable when N is small and normality does not hold.