Logistic January 6, 2015 0.1 Logistic regression Binary outcomes • Most models so far have had response Y as continuous. • Many responses in practice fall into the Y ES/N O framework. • Examples: • 0.1.1 • • • • 0.1.2 1. medical: presence or absence of cancer 2. financial: bankrupt or solvent 3. industrial: passes a quality control test or not Modelling probabilities For 0 − 1 responses we need to model π(x1 , . . . , xp ) = P (Y = 1|X1 = x1 , . . . , Xp = xp ) That is, Y is Bernoulli with a probability that depends on covariates {X1 , . . . , Xp }. Note: Var(Y ) = π(1 − π) = E(Y ) · (1 − E(Y )) Or, the binary nature forces a relation between mean and variance of Y . Flu shot example • A local health clinic sent fliers to its clients to encourage everyone, but especially older persons at high risk of complications, to get a flu shot in time for protection against an expected flu epidemic. • In a pilot follow-up study, 50 clients were randomly selected and asked whether they actually received a flu shot. Y = Shot • In addition, data were collected on their age X1 = Age and their health awareness X2 = Health.Aware 0.1.3 A possible model • Simplest model π(X1 , X2 ) = β0 + β1 X1 + β2 X2 • Problems: – We must have 0 ≤ E(Y ) = π(X1 , X2 ) ≤ 1. OLS will not force this. – Ordinary least squares will not work because of relation between mean and variance. 0.1.4 Logistic model exp(β0 +β1 X1 +β2 X2 ) • Logistic model π(X1 , X2 ) = 1+exp(β 0 +β1 X1 +β2 X2 ) • This automatically fixes 0 ≤ E(Y ) = π(X1, X2 ) ≤ 1. • Note: logit(π(X1 , X2 )) = log π(X1 ,X2 ) 1−π(X1 ,X2 ) = β0 + β1 X1 + β2 X2 1 0.1.5 Logistic curve In [1]: %%R g <- function(p) { return(log(p / (1 - p))) } g.inv <- function(x) { return(exp(x) / (1 + exp(x))) } x = seq(g(0.01), g(0.99), length=200) plot(x, g.inv(x), lwd=2, type=’l’, col=’red’) 2 0.1.6 Logistic transform In [2]: %%R p = seq(0.01,0.99,length=200) plot(p, g(p), lwd=2, type=’l’, col=’red’) 0.1.7 Binary regression models • Models E(Y ) as some increasing function of β0 + β1 X1 + β2 X2 . • The logistic model uses the function f (x) = ex /(1 + ex ). • Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares. • For logistic regression, coefficients have nice interpretation in terms of odds ratios • What about inference? 3 0.1.8 Criterion used to fit model Instead of sum of squares, logistic regression uses deviance: • DEV (µ|Y ) = −2 log L(µ|Y ) + 2 log L(Y |Y ) where µ is a location estimator for Y . Pn • If Y is Gaussian with independent N (µi , σ 2 ) entries DEV (µ|Y ) = σ12 i=1 (Yi − µi )2 n X (Yi log(πi ) + (1 − Yi ) log(1 − πi )) • If Y is a binary vector, with mean vector π DEV (π|Y ) = −2 i=1 0.1.9 Deviance for logistic regression • For any binary regression model, the deviance is: DEV (β|Y ) = −2 n X (Yi logit(πi (β)) + log(1 − πi (β))) i=1 • For the logistic model, the RHS is: −2 (Xβ)T y + n X p X log 1 + exp Xi jβj i=1 j=1 • The logistic model is special in that logit(π(β)) = Xβ. If we used a different transformation, the first part would not be linear in Xβ. In [3]: %%R flu.table <- read.table(’http://stats191.stanford.edu/data/flu.table’, header=T) flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table, family=binomial()) print(summary(flu.glm)) Call: glm(formula = Shot ~ Age + Health.Aware, family = binomial(), data = flu.table) Deviance Residuals: Min 1Q Median -1.5522 -0.2962 -0.1124 3Q 0.4208 Max 2.3244 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -21.58458 6.41824 -3.363 0.000771 *** Age 0.22178 0.07436 2.983 0.002858 ** Health.Aware 0.20351 0.06273 3.244 0.001178 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 68.029 Residual deviance: 32.416 AIC: 38.416 on 49 on 47 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 4 0.1.10 An algorithm to fit the model 1. Initialize π bi = Y¯ , 1 ≤ i ≤ n 2. Define Zi = g(b πi ) + g 0 (b πi )(Yi − πbi ) Pp 3. Fit weighted least squares model Zi = j=1 βj Xij , Pp 4. Set π bi = logit−1 βb0 + j=1 βbj Xij . wi = πbi (1 − π bi ) 5. Repeat steps 2-4 until convergence. This is basically Newton-Raphson to minimize deviance. 0.1.11 Odds Ratios • One reason logistic models are popular is that the parameters have simple interpretations in terms of P (A) odds ODDS(A) = 1−P (A) . • Logistic model: ORXj = ODDS(Y =1|...,Xj =xj +1,... ) ODDS(Y =1|...,Xj =xj ,... ) = eβj • If Xj ∈ 0, 1 is dichotomous, then odds for group with Xj = 1 are eβj higher, other parameters being equal. 0.1.12 Rare disease hypothesis • When incidence is rare, P (Y = 0) u 1 no matter what the covariates Xj ’s are. • In this case, odds ratios are almost ratios of probabilities: ORXj u P(Y = 1| . . . , Xj = xj + 1, . . . ) P(Y = 1| . . . , Xj = xj , . . . ) • Hypothetical example: in a lung cancer study, if Xj is an indicator of smoking or not, a βj of 5 means for smoking vs. non-smoking means smokers are e5 ≈ 150 times more likely to develop lung cancer • In flu example, the odds for a 45 year old with health awareness 50 compared to a 35 year old with the same health awareness are e−1.429284+3.647052 = 9.18 In [4]: %%R logodds = predict(flu.glm, list(Age=c(35,45),Health.Aware=c(50,50))) logodds 1 2 -3.647052 -1.429284 The estimated probabilities are below, yielding a ratio of 0.1932/0.0254 ≈ 7.61. In [5]: %%R exp(logodds)/(1+exp(logodds)) 1 2 0.0254056 0.1932103 0.1.13 Inference One reason for introducing the IRLS procedure is to see what the approximate limiting distribution is. • The IRLS procedure suggests using approximation βb ≈ N (β, (X 0 W X)−1 ) • This allows us to construct CIs, test linear hypotheses, etc. 5 • Intervals formed this way are called Wald intervals In [6]: %%R center = coef(flu.glm)[’Age’] SE = sqrt(vcov(flu.glm)[’Age’, ’Age’]) U = center + SE * qnorm(0.975) L = center + SE * qnorm(0.025) data.frame(L, center, U) L center U Age 0.0760395 0.2217768 0.3675141 These are slightly different from what R will give you if you ask it for confidence intervals: In [7]: %%R confint(flu.glm) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) -38.0402235 -11.6669218 Age 0.1004533 0.4046856 Health.Aware 0.1026984 0.3595480 0.1.14 Testing in logistic regression What about comparing full and reduced model? • For a model M, DEV (M) replaces SSE(M). • In least squares regression, we use 1 (SSE(MR ) − SSE(MF )) ∼ χ2dfR −dfF σ2 • This is replaced with n→∞ DEV (MR ) − DEV (MF ) ∼ χ2dfR −dfF • Resulting tests do not agree with those coming from IRLS (Wald tests). Both are often used. In [8]: %%R anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()), flu.glm) Analysis of Deviance Table Model 1: Model 2: Resid. 1 2 Shot ~ Health.Aware Shot ~ Age + Health.Aware Df Resid. Dev Df Deviance 48 49.279 47 32.416 1 16.863 We should compare this difference in deviance with a χ21 random variable. In [9]: %%R 1 - pchisq(16.863,1) 6 [1] 4.017719e-05 Let’s compare this with the Wald test: In [10]: %%R summary(flu.glm) Call: glm(formula = Shot ~ Age + Health.Aware, family = binomial(), data = flu.table) Deviance Residuals: Min 1Q Median -1.5522 -0.2962 -0.1124 3Q 0.4208 Max 2.3244 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -21.58458 6.41824 -3.363 0.000771 *** Age 0.22178 0.07436 2.983 0.002858 ** Health.Aware 0.20351 0.06273 3.244 0.001178 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 68.029 Residual deviance: 32.416 AIC: 38.416 on 49 on 47 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 0.1.15 Diagnostics • Similar p to least square regression, only residuals used are deviance residuals ri = sign(Yi − πi |Yi ). π bi ) DEV (b In [11]: %%R par(mfrow=c(2,2)) plot(flu.glm) 7 In [12]: %%R influence.measures(flu.glm) Influence measures of glm(formula = Shot ~ Age + Health.Aware, family = binomial(), 1 2 3 4 5 6 7 8 9 dfb.1_ -0.017208 -0.102897 -0.015043 -0.058889 -0.103488 -0.130919 -0.066734 -0.043827 -0.018883 dfb.Age dfb.Hl.A dffit cov.r cook.d hat inf 0.01543 0.01564 -0.01763 1.083 3.66e-05 0.01604 0.09560 0.10416 0.13491 1.102 2.24e-03 0.05231 0.01280 0.01442 -0.01556 1.081 2.85e-05 0.01454 0.02418 0.11920 0.28755 1.040 1.18e-02 0.05850 0.03853 0.17167 0.23475 1.144 6.94e-03 0.09789 0.11898 0.11098 -0.15203 1.109 2.86e-03 0.06040 0.06691 0.06003 0.07761 1.104 7.22e-04 0.04249 0.04399 0.03438 -0.04704 1.101 2.62e-04 0.03508 0.01881 0.01515 -0.01980 1.085 4.62e-05 0.01867 8 data = flu.table) : 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 -0.112978 -0.009578 0.020212 0.192770 0.372025 -0.038716 -0.057484 -0.046412 -0.029026 0.111762 -0.063893 -0.148884 -0.046451 -0.058938 -0.046668 0.181347 -0.009162 -0.085845 -0.030981 -0.017942 -0.140841 -0.160987 -0.005600 -0.094999 -0.003504 -0.106728 -0.019690 -0.094324 0.652043 -0.019865 -0.114652 -0.094999 -0.019865 -0.069257 0.050764 -0.079968 -0.038716 -0.000879 0.040334 -0.025666 -0.016308 0.1.16 0.09695 0.09584 -0.13373 -0.50776 -0.47261 0.03596 0.06480 0.13914 0.02499 0.19947 0.05199 0.11621 0.03665 0.03695 0.04039 -0.10844 0.00759 0.05982 0.02315 0.01549 0.11498 0.27877 0.00447 0.09844 0.00331 0.08563 0.01858 0.08615 -0.52099 0.16852 0.12911 0.09844 0.16852 -0.23785 -0.06530 0.08479 0.03596 -0.00838 -0.19110 0.02050 0.01570 0.13265 -0.04305 0.05812 0.11790 -0.14549 0.03351 0.03735 -0.10858 0.03004 -0.49074 0.06246 0.14276 0.05220 0.06933 0.04338 -0.17866 0.00903 0.10771 0.03287 0.01840 0.13148 0.03379 0.00572 0.08375 0.00324 0.10366 0.01689 0.09622 -0.63916 -0.19502 0.09175 0.08375 -0.19502 0.34314 0.01674 0.05701 0.03351 0.05023 0.07980 0.02589 0.01369 0.20841 0.36490 -0.43566 -0.93778 0.68814 -0.04035 -0.06892 -0.47973 0.03257 -1.00669 -0.06972 -0.19327 0.05658 -0.07587 -0.04918 0.43049 -0.00953 0.12539 -0.03466 0.01972 -0.17269 0.40526 -0.00592 0.11838 0.00365 -0.12263 -0.02026 0.11987 0.73958 -0.60555 0.15634 0.11838 -0.60555 -0.81279 0.34876 -0.09176 -0.04035 0.31156 -0.50899 -0.02749 -0.01682 1.076 1.004 0.979 1.040 0.902 1.095 1.121 0.993 1.090 0.979 1.105 1.096 1.101 1.116 1.098 0.921 1.077 1.116 1.094 1.083 1.103 1.178 1.074 1.107 1.071 1.111 1.085 1.104 0.630 0.998 1.110 1.107 0.998 1.267 0.982 1.121 1.095 1.001 0.992 1.089 1.083 5.68e-03 2.08e-02 3.20e-02 1.63e-01 1.02e-01 1.93e-04 5.65e-04 3.86e-02 1.25e-04 2.10e-01 5.80e-04 4.76e-03 3.81e-04 6.87e-04 2.87e-04 3.50e-02 1.07e-05 1.91e-03 1.42e-04 4.58e-05 3.74e-03 2.18e-02 4.12e-06 1.71e-03 1.57e-06 1.83e-03 4.84e-05 1.76e-03 2.72e-01 6.42e-02 3.03e-03 1.71e-03 6.42e-02 9.58e-02 1.96e-02 1.01e-03 1.93e-04 1.49e-02 4.40e-02 8.92e-05 3.33e-05 0.05526 0.06318 0.07059 0.18972 0.09827 0.02970 0.05393 0.08404 0.02416 0.18084 0.04169 0.06199 0.03615 0.05121 0.03329 0.05453 0.00994 0.05923 0.02750 0.01681 0.06098 0.14807 0.00685 0.05249 0.00442 0.05562 0.01820 0.05039 0.05275 0.11145 0.06186 0.05249 0.11145 0.25544 0.05344 0.05710 0.02970 0.05090 0.08986 0.02260 0.01600 * * * * Model selection As the model is a likelihood based model, each fitted model has an AIC. Stepwise selection can be used easily . . . In [13]: %%R step(flu.glm, scope=list(upper= ~.^2), direction=’both’) Start: AIC=38.42 Shot ~ Age + Health.Aware 9 + Age:Health.Aware <none> - Age - Health.Aware Df Deviance AIC 1 24.283 32.283 32.416 38.416 1 49.279 53.279 1 56.078 60.078 Step: AIC=32.28 Shot ~ Age + Health.Aware + Age:Health.Aware <none> - Age:Health.Aware Df Deviance AIC 24.283 32.283 1 32.416 38.416 Call: glm(formula = Shot ~ Age + Health.Aware + Age:Health.Aware, family = binomial(), data = flu.table) Coefficients: (Intercept) 26.75944 Age -0.88151 Health.Aware -0.82239 Degrees of Freedom: 49 Total (i.e. Null); Null Deviance: 68.03 Residual Deviance: 24.28 AIC: 32.28 0.1.17 Age:Health.Aware 0.02365 46 Residual Probit model • Probit regression model: Φ −1 (E(Yi )) = p X βj Xij j=1 where Φ is CDF of N (0, 1), i.e. Φ(t) = pnorm(t). • Complementary log-log model (cloglog): −log(−log(E(Yi )) = p X βj Xij . j=1 • In logit, probit and cloglog Var(Yi ) = πi (1 − πi ) but the model for the mean is different. • Coefficients no longer have an odds ratio interpretation. In [14]: %%R summary(glm(Shot ~ Age + Health.Aware, data=flu.table, family=binomial(link=’probit’))) Call: glm(formula = Shot ~ Age + Health.Aware, family = binomial(link = "probit"), data = flu.table) Deviance Residuals: Min 1Q Median -1.5471 -0.2883 -0.0648 3Q 0.4060 Max 2.2955 Coefficients: 10 Estimate Std. Error z value Pr(>|z|) (Intercept) -12.35039 3.22797 -3.826 0.000130 *** Age 0.12786 0.03887 3.289 0.001005 ** Health.Aware 0.11642 0.03237 3.596 0.000323 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 68.029 Residual deviance: 32.076 AIC: 38.076 on 49 on 47 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 7 0.1.18 Generalized linear models Given a dataset (Yi , Xi1 , . . . ,P Xip ), 1 ≤ i ≤ n we consider a model for the distribution of Y |X1 , . . . , Xp . * p If ηi = g(E(Yi )) = g(µi ) = j=1 βj Xij then g is called the link function for the model. * If Var(Yi ) = φ · V (E(Yi )) = φ · V (µi ) for φ > 0 and some function V , then V is the called variance function for the model. * Canonical reference: Generalized Linear Models , McCullagh and Nelder. 0.1.19 Binary regression as GLM • For a logistic model, g(µ) = logit(µ), V (µ) = µ(1 − µ). • For a probit model, g(µ) = Φ−1 (µ), V (µ) = µ(1 − µ). • For a cloglog model, g(µ) = − log(− log(µ)), V (µ) = µ(1 − µ). 11
© Copyright 2025