BIOSTAT 880 (Winter 2014)
Statistical Analysis with Missing Data

Time: Friday 1pm-4pm
Location: Room M4138, SPH II

Instructor
Lu Wang, Assistant Professor of Biostatistics
M4132, SPH II, University of Michigan
Phone: 734-647-6935
Email: [email protected]

Office hours
Friday, 4pm-5pm (location: M4132, SPH II)

Important dates
No class on March 7 (Spring break)
Term paper due: April 18

Prerequisites
Biostat 601, 602, 650, 651, and 653. Requires knowledge of standard statistical models such as the multivariate normal, multiple linear regression, and contingency tables, as well as matrix algebra, Hilbert space theory, calculus, and maximum likelihood theory.

Class website
https://ctools.umich.edu/portal/

Evaluation
Individual presentation in class: 30%
Class participation and homework: 25%
Final paper and presentation (in group): 45%

The final paper should be written with the aim of publication in an appropriate statistical journal. The paper will be evaluated on the basis of originality, scholarship (including appropriate literature citations), clarity, organization, and relevance to class goals. Oral presentation of the project during one of the final class sessions is mandatory.

Textbooks (Required)
• Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. New York: John Wiley.
• Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data. Springer.
• Lecture notes. (Adapted from the lecture notes by Rod Little and Andrea Rotnitzky, as well as my own research)

Additional Reading (highly recommended)
• Newey, W. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, vol 5, 99-135. (This is a GREAT introductory paper on semiparametric theory)
• Bickel, Klaassen, Ritov and Wellner. (1993). Efficient and adaptive inference in semiparametric models. (This book provides a rigorous treatment of semiparametric theory)
• Van der Vaart. (2000). Asymptotic Statistics. (This is a great book. Chapter 24 is on semiparametric theory)
• van der Laan, M. and Robins, J. (2003). Unified methods for censored and longitudinal data. (This book has everything you ever want to know about inference in semiparametric models with missing at random data, but nothing about non-ignorable data. However, the book is a bit chaotic and disorganized, and you may find it a bit hard to read on a first attempt)
• Ibragimov, I. A. and Hasminskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Springer Verlag, New York. (This book has a rigorous treatment of asymptotic efficiency in parametric models)
• Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York. (This is a fabulously clear book that contains all that you need to know about Hilbert space theory for this course)

Description of the Course
This course discusses statistical theory and methodology aimed at addressing missing data problems. We will discuss:
(1) Parametric modeling with missing data, including likelihood-based inference, data augmentation, and multiple imputation. Computational tools include the EM algorithm and its extensions, the Gibbs sampler, and bootstrap methods.
(2) Semiparametric theory with missing data, including the observed data tangent space, characterization of pathwise differentiable parameters in the observed data model, the semiparametric efficiency bound, doubly robust estimators, and locally efficient estimators.
Overall, this course covers both applied and theoretical aspects of statistical analysis with missing data, but will focus mainly on the theory.

Tentative Topics

Introduction, and Naive Methods in Statistical Packages
What is a missing data problem? Missing patterns; Missing data mechanisms; Examples; Complete-case analysis; Available-case analysis; Weighting; Single imputation and multiple imputation; Likelihood-based methods; Estimating-equation based methods; Properties and limitations of each method.

Imputation and Multiple Imputation
Common approaches to single imputation; Imputing means; Imputing draws; Pros and cons; Accounting for imputation uncertainty; Bootstrap imputations; Multiple imputation.

Likelihood-based Method for Missing Data
Review of maximum likelihood and Bayes inference for complete data; Extension to incomplete data; Missing at random and ignorable missingness; Likelihood theory for ignorable missing data; Factored likelihood methods for special patterns; Bayes methods; Data augmentation; Gibbs sampling.

EM Algorithm, and Extensions
The E and M steps of the EM algorithm; EM for exponential families; EM with parameter constraints; Rate of convergence; Generalized EM; Expectation/Conditional Maximization (ECM) algorithm; Other extensions of EM.

More on Multiple Imputation
Link between multiple imputation and modern statistical tools; Bayesian theory of multiple imputation; Gibbs sampling; Software for multiple imputation analysis; Guest lecture on IVEware.

Robustness of Estimation with Missing Data
Robust likelihood-based inference; Robust Bayesian models; More attention to model checks; Flexible modeling against misspecification.

Introduction to Semiparametric Models with Missing Data
Objectives of semiparametric inference; Elementary notions of Hilbert space theory; Inner product, norm and orthogonality; Closed spaces, inner product spaces, Cauchy sequences, Banach spaces, Hilbert spaces; Pythagorean theorem; Projections on a finite dimensional space; The normal equations.

Semiparametric Efficiency Theory
Regular parametric models and regular estimators; The Cramér-Rao bound; Hájek representation theorem; Characterization of regular asymptotically linear estimators; Regular parametric submodels; Tangent space; Semiparametric variance bound; Pathwise differentiable parameters; Gradients; Efficient influence function; Efficient estimators in semiparametric models.

Semiparametric Model with Missing Data
Models with factorized likelihoods; Double robustness; Examples: missing covariates in regression, drop-out in longitudinal studies; Full and observed data models; Identification; Likelihood factorization under missing at random (MAR).

Geometric Structure of Models for MAR
Score operators; Observed data tangent space as the range of the score operator; Characterization of pathwise differentiable parameters in the observed data model.

Inference in Semiparametric Models with Missing Data
Tangent space in a model that imposes only MAR; Linear operator theory; Representation of the tangent space in arbitrary MAR models; The representation theorem for the set of gradients of parameters of an arbitrary semiparametric full data model under the sole assumption of MAR.

Derivation of Semiparametric Efficient Score and Efficiency Bound
Missing data models with probability bounded away from zero of observing full data (an example where there is always positive information for full data pathwise differentiable parameters); Representation of the efficient score in missing data models with probability bounded away from zero of observing full data; Examples: missing outcomes in longitudinal studies with drop-out, missing covariates in regression.

Academic Integrity:
The faculty of the School of Public Health believes that the conduct of a student registered or taking courses in the School should be consistent with that of a professional person. Courtesy, honesty and respect should be shown by students toward faculty members, guest lecturers, administrative support staff and fellow students. Similarly, students should expect faculty to treat them fairly, showing respect for their ideas and opinions and striving to help them achieve maximum benefits from their experience in the School.
Student academic misconduct refers to behavior that may include plagiarism, cheating, fabrication, falsification of records or official documents, intentional misuse of equipment or materials (including library materials), and aiding and abetting the perpetration of such acts. The preparation of reports, papers, and examinations, assigned on an individual basis, must represent each student's own effort. Reference sources should be indicated clearly. The use of assistance from other students or aids of any kind during a written examination, except when the use of aids such as electronic devices, books or notes has been approved by an instructor, is a violation of the standard of academic conduct.

Competencies covered in this course
• Describe the roles biostatistics serves in the discipline of public health. (partially covered)
• Describe basic concepts of probability, random variation and commonly used statistical probability distributions. (partially covered)
• Describe preferred methodological alternatives to commonly used statistical methods when assumptions are not met. (partially covered)
• Distinguish among the different measurement scales and the implications for selection of statistical methods to be used based on these distinctions. (partially covered)
• Apply descriptive techniques commonly used to summarize public health data. (partially covered)
• Apply common statistical methods for inference. (partially covered)
• Apply descriptive and inferential methodologies according to the type of study design for answering a particular research question. (partially covered)
• Interpret results of statistical analyses found in public health studies. (partially covered)
• Develop written and oral presentations based on statistical analyses for both public health professionals and educated lay audiences. (partially covered)
• Apply the basic terminology and definitions of epidemiology. (partially covered)
• Calculate basic epidemiology measures. (partially covered)
• Draw appropriate inferences from epidemiologic data. (partially covered)
• Apply evidence-based principles and the scientific knowledge base to critical evaluation and decision-making in public health. (partially covered)