Logistic

January 6, 2015

0.1 Logistic regression

Binary outcomes
• Most models so far have had response Y as continuous.
• Many responses in practice fall into the YES/NO framework.
• Examples:
  1. medical: presence or absence of cancer
  2. financial: bankrupt or solvent
  3. industrial: passes a quality control test or not

0.1.1 Modelling probabilities
For 0-1 responses we need to model

$$\pi(x_1, \dots, x_p) = P(Y = 1 \mid X_1 = x_1, \dots, X_p = x_p).$$

That is, Y is Bernoulli with a probability that depends on the covariates $\{X_1, \dots, X_p\}$.

Note: $\text{Var}(Y) = \pi(1 - \pi) = E(Y) \cdot (1 - E(Y))$.

In other words, the binary nature of Y forces a relation between the mean and variance of Y.
0.1.2 Flu shot example

• A local health clinic sent fliers to its clients to encourage everyone, but especially older persons at high risk of complications, to get a flu shot in time for protection against an expected flu epidemic.
• In a pilot follow-up study, 50 clients were randomly selected and asked whether they actually received a flu shot: Y = Shot.
• In addition, data were collected on their age, X1 = Age, and their health awareness, X2 = Health.Aware.
0.1.3 A possible model

• Simplest model: $\pi(X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$.
• Problems (see the sketch below):
  - We must have $0 \le E(Y) = \pi(X_1, X_2) \le 1$. OLS will not force this.
  - Ordinary least squares will not work because of the relation between the mean and variance of Y.
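A quick check of the first problem, as a sketch: fit the linear probability model to the flu data and look at the range of fitted values. This assumes flu.table has already been read in (the read.table call appears in section 0.1.9).

flu.lm <- lm(Shot ~ Age + Health.Aware, data=flu.table)
range(fitted(flu.lm))  # nothing constrains these fitted "probabilities" to [0, 1]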
0.1.4 Logistic model

• Logistic model:
  $$\pi(X_1, X_2) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}$$
• This automatically ensures $0 \le E(Y) = \pi(X_1, X_2) \le 1$.
• Note:
  $$\text{logit}(\pi(X_1, X_2)) = \log\left(\frac{\pi(X_1, X_2)}{1 - \pi(X_1, X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
0.1.5 Logistic curve

In [1]: %%R
        # logit link and its inverse
        g <- function(p) {
          return(log(p / (1 - p)))
        }
        g.inv <- function(x) {
          return(exp(x) / (1 + exp(x)))
        }
        # plot the inverse logit (the logistic curve) over a range of log-odds
        x = seq(g(0.01), g(0.99), length=200)
        plot(x, g.inv(x), lwd=2, type='l', col='red')
0.1.6 Logistic transform

In [2]: %%R
        p = seq(0.01, 0.99, length=200)
        plot(p, g(p), lwd=2, type='l', col='red')
0.1.7 Binary regression models

• These models express E(Y) as some increasing function of $\beta_0 + \beta_1 X_1 + \beta_2 X_2$.
• The logistic model uses the function $f(x) = e^x/(1 + e^x)$.
• Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares (IRLS).
• For logistic regression, the coefficients have a nice interpretation in terms of odds ratios.
• What about inference?
0.1.8 Criterion used to fit model

Instead of sum of squares, logistic regression uses deviance:

• $DEV(\mu \mid Y) = -2 \log L(\mu \mid Y) + 2 \log L(Y \mid Y)$, where $\mu$ is a location estimator for $Y$.
• If $Y$ is Gaussian with independent $N(\mu_i, \sigma^2)$ entries,
  $$DEV(\mu \mid Y) = \frac{1}{\sigma^2} \sum_{i=1}^n (Y_i - \mu_i)^2.$$
• If $Y$ is a binary vector with mean vector $\pi$,
  $$DEV(\pi \mid Y) = -2 \sum_{i=1}^n \bigl(Y_i \log(\pi_i) + (1 - Y_i) \log(1 - \pi_i)\bigr).$$
0.1.9 Deviance for logistic regression

• For any binary regression model, the deviance is:
  $$DEV(\beta \mid Y) = -2 \sum_{i=1}^n \bigl(Y_i \,\text{logit}(\pi_i(\beta)) + \log(1 - \pi_i(\beta))\bigr)$$
• For the logistic model, the RHS is:
  $$-2\,(X\beta)^T Y + 2 \sum_{i=1}^n \log\left(1 + \exp\Bigl(\sum_{j=1}^p X_{ij} \beta_j\Bigr)\right)$$
• The logistic model is special in that logit(π(β)) = Xβ. If we used a different transformation, the first
part would not be linear in Xβ.
In [3]: %%R
        flu.table <- read.table('http://stats191.stanford.edu/data/flu.table', header=T)
        flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table, family=binomial())
        print(summary(flu.glm))
Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(),
    data = flu.table)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5522  -0.2962  -0.1124   0.4208   2.3244

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.58458    6.41824  -3.363 0.000771 ***
Age           0.22178    0.07436   2.983 0.002858 **
Health.Aware  0.20351    0.06273   3.244 0.001178 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.416  on 47  degrees of freedom
AIC: 38.416

Number of Fisher Scoring iterations: 6
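As a sanity check on the deviance formula from section 0.1.8, we can evaluate it by hand and compare with what glm reports. A minimal sketch, using the flu.glm fit above:

pi.hat <- fitted(flu.glm)   # estimated probabilities
Y <- flu.table$Shot
dev.by.hand <- -2 * sum(Y * log(pi.hat) + (1 - Y) * log(1 - pi.hat))
c(dev.by.hand, deviance(flu.glm))  # both should equal the residual deviance, about 32.416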
0.1.10 An algorithm to fit the model

1. Initialize $\hat\pi_i = \bar{Y}$, $1 \le i \le n$.
2. Define $Z_i = g(\hat\pi_i) + g'(\hat\pi_i)(Y_i - \hat\pi_i)$.
3. Fit the weighted least squares model $Z_i = \sum_{j=1}^p \beta_j X_{ij}$ with weights $w_i = \hat\pi_i (1 - \hat\pi_i)$.
4. Set $\hat\pi_i = \text{logit}^{-1}\bigl(\hat\beta_0 + \sum_{j=1}^p \hat\beta_j X_{ij}\bigr)$.
5. Repeat steps 2-4 until convergence. This is basically Newton-Raphson to minimize deviance; a sketch in R follows.
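The loop above is short enough to write out directly. A minimal sketch, under the assumption that X contains a column of ones for the intercept; the helper name irls_logistic is ours, not part of any package:

irls_logistic <- function(X, Y, niter=25, tol=1e-8) {
  g <- function(p) log(p / (1 - p))           # logit link
  g.prime <- function(p) 1 / (p * (1 - p))    # derivative of the link
  pi.hat <- rep(mean(Y), length(Y))           # step 1: initialize at Ybar
  beta.old <- rep(Inf, ncol(X))
  for (it in 1:niter) {
    Z <- g(pi.hat) + g.prime(pi.hat) * (Y - pi.hat)   # step 2: working response
    W <- pi.hat * (1 - pi.hat)                        # step 3: weights ...
    fit <- lm(Z ~ X - 1, weights=W)                   #         ... weighted least squares
    eta <- X %*% coef(fit)
    pi.hat <- exp(eta) / (1 + exp(eta))               # step 4: update probabilities
    if (max(abs(coef(fit) - beta.old)) < tol) break   # step 5: stop at convergence
    beta.old <- coef(fit)
  }
  coef(fit)
}

# Usage: should agree with coef(flu.glm)
X <- model.matrix(~ Age + Health.Aware, data=flu.table)
irls_logistic(X, flu.table$Shot)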
0.1.11 Odds Ratios

• One reason logistic models are popular is that the parameters have simple interpretations in terms of odds:
  $$ODDS(A) = \frac{P(A)}{1 - P(A)}.$$
• Logistic model:
  $$OR_{X_j} = \frac{ODDS(Y = 1 \mid \dots, X_j = x_j + 1, \dots)}{ODDS(Y = 1 \mid \dots, X_j = x_j, \dots)} = e^{\beta_j}$$
• If $X_j \in \{0, 1\}$ is dichotomous, then the odds for the group with $X_j = 1$ are $e^{\beta_j}$ times higher, other parameters being equal. (See the sketch below.)
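For the flu model this means the fitted odds ratios can be read off by exponentiating the coefficients. A one-line sketch, using the flu.glm fit from above:

exp(coef(flu.glm))  # for Age: exp(0.22178) is about 1.25, so each extra year multiplies the odds by about 1.25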
0.1.12 Rare disease hypothesis

• When incidence is rare, $P(Y = 0) \approx 1$ no matter what the covariates $X_j$ are.
• In this case, odds ratios are almost ratios of probabilities:
  $$OR_{X_j} \approx \frac{P(Y = 1 \mid \dots, X_j = x_j + 1, \dots)}{P(Y = 1 \mid \dots, X_j = x_j, \dots)}$$
• Hypothetical example: in a lung cancer study, if $X_j$ is an indicator of smoking, a $\beta_j$ of 5 means that smokers are $e^5 \approx 150$ times more likely to develop lung cancer than non-smokers.
• In the flu example, the odds ratio for a 45-year-old with health awareness 50 compared to a 35-year-old with the same health awareness is
  $$e^{-1.429284 + 3.647052} = e^{2.217768} \approx 9.18.$$
In [4]: %%R
        logodds = predict(flu.glm, list(Age=c(35,45), Health.Aware=c(50,50)))
        logodds

        1         2
-3.647052 -1.429284
The estimated probabilities are below, yielding a ratio of 0.1932/0.0254 ≈ 7.61.
In [5]: %%R
        exp(logodds)/(1+exp(logodds))

        1         2
0.0254056 0.1932103
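The odds ratio of 9.18 quoted above is just the exponential of the difference in log-odds; a one-line check using the logodds vector from In [4]:

exp(logodds[2] - logodds[1])  # about 9.18, which also equals exp(10 * beta_Age) = exp(10 * 0.22178)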
0.1.13 Inference

One reason for introducing the IRLS procedure is to see what the approximate limiting distribution is.

• The IRLS procedure suggests using the approximation $\hat\beta \approx N(\beta, (X^T W X)^{-1})$.
• This allows us to construct CIs, test linear hypotheses, etc.
• Intervals formed this way are called Wald intervals.
In [6]: %%R
        center = coef(flu.glm)['Age']
        SE = sqrt(vcov(flu.glm)['Age', 'Age'])
        U = center + SE * qnorm(0.975)
        L = center + SE * qnorm(0.025)
        data.frame(L, center, U)
            L    center         U
Age 0.0760395 0.2217768 0.3675141
These are slightly different from what R will give you if you ask it for confidence intervals:
In [7]: %%R
confint(flu.glm)
Waiting for profiling to be done...
                   2.5 %      97.5 %
(Intercept) -38.0402235 -11.6669218
Age           0.1004533   0.4046856
Health.Aware  0.1026984   0.3595480
0.1.14 Testing in logistic regression

What about comparing full and reduced models?

• For a model $\mathcal{M}$, $DEV(\mathcal{M})$ replaces $SSE(\mathcal{M})$.
• In least squares regression, we use
  $$\frac{1}{\sigma^2}\bigl(SSE(\mathcal{M}_R) - SSE(\mathcal{M}_F)\bigr) \sim \chi^2_{df_R - df_F}$$
• This is replaced with
  $$DEV(\mathcal{M}_R) - DEV(\mathcal{M}_F) \overset{n \to \infty}{\sim} \chi^2_{df_R - df_F}$$
• The resulting tests do not agree with those coming from IRLS (Wald tests). Both are often used.
In [8]: %%R
        anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()), flu.glm)

Analysis of Deviance Table

Model 1: Shot ~ Health.Aware
Model 2: Shot ~ Age + Health.Aware
  Resid. Df Resid. Dev Df Deviance
1        48     49.279
2        47     32.416  1   16.863
We should compare this difference in deviance with a $\chi^2_1$ random variable.
In [9]: %%R
1 - pchisq(16.863,1)
[1] 4.017719e-05
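Equivalently, anova can compute this p-value itself if we ask for a chi-squared test; a usage sketch with the same two fits:

anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()), flu.glm, test='Chisq')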
Let’s compare this with the Wald test:
In [10]: %%R
         summary(flu.glm)

Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(),
    data = flu.table)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5522  -0.2962  -0.1124   0.4208   2.3244

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.58458    6.41824  -3.363 0.000771 ***
Age           0.22178    0.07436   2.983 0.002858 **
Health.Aware  0.20351    0.06273   3.244 0.001178 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.416  on 47  degrees of freedom
AIC: 38.416

Number of Fisher Scoring iterations: 6

The Wald test for Age gives p ≈ 0.0029, noticeably larger than the deviance-based p-value of about 4e-05: the two tests agree only asymptotically.
0.1.15 Diagnostics

• Similar to least squares regression, the only residuals used are deviance residuals (checked by hand below):
  $$r_i = \text{sign}(Y_i - \hat\pi_i)\sqrt{DEV(\hat\pi_i \mid Y_i)}.$$
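These can be computed directly from the formula above; a short sketch that checks against R's built-in deviance residuals:

pi.hat <- fitted(flu.glm)
Y <- flu.table$Shot
r.dev <- sign(Y - pi.hat) * sqrt(-2 * (Y * log(pi.hat) + (1 - Y) * log(1 - pi.hat)))
max(abs(r.dev - residuals(flu.glm, type='deviance')))  # essentially zero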
In [11]: %%R
par(mfrow=c(2,2))
plot(flu.glm)
In [12]: %%R
         influence.measures(flu.glm)

Influence measures of
    glm(formula = Shot ~ Age + Health.Aware, family = binomial(), data = flu.table) :

     dfb.1_  dfb.Age dfb.Hl.A    dffit cov.r   cook.d     hat inf
1 -0.017208 0.01543  0.01564  -0.01763 1.083 3.66e-05 0.01604
2 -0.102897 0.09560  0.10416   0.13491 1.102 2.24e-03 0.05231
3 -0.015043 0.01280  0.01442  -0.01556 1.081 2.85e-05 0.01454
4 -0.058889 0.02418  0.11920   0.28755 1.040 1.18e-02 0.05850
5 -0.103488 0.03853  0.17167   0.23475 1.144 6.94e-03 0.09789
6 -0.130919 0.11898  0.11098  -0.15203 1.109 2.86e-03 0.06040
7 -0.066734 0.06691  0.06003   0.07761 1.104 7.22e-04 0.04249
8 -0.043827 0.04399  0.03438  -0.04704 1.101 2.62e-04 0.03508
9 -0.018883 0.01881  0.01515  -0.01980 1.085 4.62e-05 0.01867
[rows 10-50 of the table are omitted here; in the full output four observations are flagged with '*' in the inf column]
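A large influence table like this is easier to scan through its summary method, which prints only the flagged cases; a usage note:

summary(influence.measures(flu.glm))  # shows only observations marked as potentially influential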
0.1.16 Model selection

As the model is likelihood based, each fitted model has an AIC. Stepwise selection can be used easily...
In [13]: %%R
         step(flu.glm, scope=list(upper=~.^2), direction='both')

Start: AIC=38.42
Shot ~ Age + Health.Aware

                   Df Deviance    AIC
+ Age:Health.Aware  1   24.283 32.283
<none>                  32.416 38.416
- Age               1   49.279 53.279
- Health.Aware      1   56.078 60.078

Step: AIC=32.28
Shot ~ Age + Health.Aware + Age:Health.Aware

                   Df Deviance    AIC
<none>                  24.283 32.283
- Age:Health.Aware  1   32.416 38.416

Call: glm(formula = Shot ~ Age + Health.Aware + Age:Health.Aware, family = binomial(),
    data = flu.table)

Coefficients:
     (Intercept)              Age     Health.Aware Age:Health.Aware
        26.75944         -0.88151         -0.82239          0.02365

Degrees of Freedom: 49 Total (i.e. Null);  46 Residual
Null Deviance:     68.03
Residual Deviance: 24.28   AIC: 32.28

0.1.17 Probit model
• Probit regression model:
  $$\Phi^{-1}(E(Y_i)) = \sum_{j=1}^p \beta_j X_{ij}$$
  where $\Phi$ is the CDF of $N(0, 1)$, i.e. $\Phi(t)$ = pnorm(t).
• Complementary log-log model (cloglog):
  $$-\log(-\log(E(Y_i))) = \sum_{j=1}^p \beta_j X_{ij}.$$
• In the logit, probit and cloglog models, $\text{Var}(Y_i) = \pi_i(1 - \pi_i)$, but the model for the mean is different.
• Coefficients no longer have an odds ratio interpretation.
In [14]: %%R
         summary(glm(Shot ~ Age + Health.Aware, data=flu.table,
                     family=binomial(link='probit')))

Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(link = "probit"),
    data = flu.table)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5471  -0.2883  -0.0648   0.4060   2.2955

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.35039    3.22797  -3.826 0.000130 ***
Age           0.12786    0.03887   3.289 0.001005 **
Health.Aware  0.11642    0.03237   3.596 0.000323 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.076  on 47  degrees of freedom
AIC: 38.076

Number of Fisher Scoring iterations: 7
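The cloglog model from the previous section can be fit the same way; a sketch (this cell is not in the original notes):

summary(glm(Shot ~ Age + Health.Aware, data=flu.table,
            family=binomial(link='cloglog')))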
0.1.18 Generalized linear models

Given a dataset $(Y_i, X_{i1}, \dots, X_{ip})$, $1 \le i \le n$, we consider a model for the distribution of $Y \mid X_1, \dots, X_p$.

• If $\eta_i = g(E(Y_i)) = g(\mu_i) = \sum_{j=1}^p \beta_j X_{ij}$, then $g$ is called the link function for the model.
• If $\text{Var}(Y_i) = \phi \cdot V(E(Y_i)) = \phi \cdot V(\mu_i)$ for $\phi > 0$ and some function $V$, then $V$ is called the variance function for the model.
• Canonical reference: Generalized Linear Models, McCullagh and Nelder.
0.1.19 Binary regression as GLM

• For a logistic model, $g(\mu) = \text{logit}(\mu)$, $V(\mu) = \mu(1 - \mu)$.
• For a probit model, $g(\mu) = \Phi^{-1}(\mu)$, $V(\mu) = \mu(1 - \mu)$.
• For a cloglog model, $g(\mu) = -\log(-\log(\mu))$, $V(\mu) = \mu(1 - \mu)$.
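In R these pieces are exposed directly on the family object, which gives a quick way to check the link and variance functions; a short sketch:

fam <- binomial()   # logit link by default
fam$linkfun(0.5)    # g(0.5) = logit(0.5) = 0
fam$variance(0.5)   # V(0.5) = 0.5 * (1 - 0.5) = 0.25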