BIOSTAT 880 (Winter 2014)
Statistical Analysis with Missing Data
Time: Friday 1pm-4pm
Location: Room M4138, SPH II
Instructor
Lu Wang, Assistant Professor of Biostatistics
M4132, SPH II, University of Michigan
Phone: 734-647-6935
Email: [email protected]
Office hours
Friday, 4pm-5pm (location: M4132, SPH II)
Important dates
No class on March 7 (Spring break)
Term paper due: April 18
Prerequisites
Biostat 601, 602, 650, 651, and 653.
Requires knowledge of standard statistical models such as the
multivariate normal, multiple linear regression, and contingency tables,
as well as matrix algebra, Hilbert space theory, calculus, and maximum
likelihood theory.
Class website
https://ctools.umich.edu/portal/
Evaluation
Individual presentation in class: 30%
Class participation and Homework: 25%
Final paper and presentation (in group): 45%
The final paper should be written with the aim of publication in an
appropriate statistical journal. The paper will be evaluated on the basis
of originality, scholarship (including appropriate literature citations),
clarity, organization and relevance to class goals. Oral presentation of
the project during one of the final class sessions is mandatory.
Textbooks
(Required)
• Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with
Missing Data. New York: John Wiley.
• Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data.
Springer.
• Lecture notes. (Adapted from the lecture notes by Rod Little and
Andrea Rotnitzky, as well as my own research)
Additional Reading (highly recommended):
• Newey, W. (1990). Semiparametric efficiency bounds. Journal of Applied
Econometrics, vol 5, 99-135. (This is a GREAT introductory paper on semiparametric
theory)
• Bickel, Klaassen, Ritov and Wellner. (1993). Efficient and adaptive inference in
semiparametric models. (This book provides a rigorous treatment of semiparametric
theory)
• Van der Vaart. (2000). Asymptotic Statistics. (This is a great book. Chapter 24 is on
semiparametric theory)
• van der Laan, M. and Robins, J. (2003). Unified methods for censored and longitudinal
data. (This book has everything you ever want to know about inference in
semiparametric models with missing at random data, but nothing about non-ignorable
data. However, the book is a bit chaotic and disorganized and you may find it a bit
hard to read on a first attempt).
• Ibragimov, I. A. and Hasminskii, R. Z. (1981). Statistical Estimation: asymptotic theory,
Springer-Verlag, New York. (This book has a rigorous treatment of asymptotic
efficiency in parametric models).
• Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York.
(This is a fabulously clear book that contains all that you need to know about Hilbert
space theory for this course).
Description of the Course
This course discusses statistical theory and methodology aimed at addressing
missing data problems. We will discuss
(1) Parametric modeling with missing data, including likelihood-based
inference, data augmentation, and multiple imputation. Computational tools include
the EM algorithm and its extensions, the Gibbs sampler, and bootstrap methods.
(2) Semiparametric theory with missing data, including the observed data tangent
space, characterization of pathwise differentiable parameters in the observed data
model, the semiparametric efficiency bound, doubly robust estimators, and locally
efficient estimators.
Overall, this course covers both applied and theoretical aspects of
statistical analysis with missing data, with the main emphasis on theory.
Tentative Topics
Introduction, and Naive Methods in Statistical Packages
What is a missing data problem? Missing-data patterns; Missing-data mechanisms; Examples;
Complete-case analysis; Available case analysis; Weighting; Single imputation and
multiple imputation; Likelihood-based methods; Estimating-equation based methods;
Properties and limitations of each method.
Imputation and Multiple Imputation
Common approaches to single imputation; Imputing means; Imputing draws; Pros and
cons; Accounting for imputation uncertainty; Bootstrap imputations; Multiple imputation.
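To illustrate how imputation uncertainty can be accounted for, the following minimal sketch pools results from several completed data sets with Rubin's combining rules. It is not taken from the course materials: the function name `pool_rubin` and the input numbers are hypothetical, and the completed-data estimates are assumed to be approximately normal.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates with Rubin's combining rules.

    estimates: point estimates, one per imputed data set
    variances: their estimated (within-imputation) variances
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    # Reference t degrees of freedom; infinite when the estimates do not vary.
    if b == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


# Hypothetical results from m = 3 imputed data sets.
q_bar, total, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, total, df)
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is exactly the uncertainty that single imputation ignores.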
Likelihood-based Method for Missing Data
Review of maximum likelihood and Bayes inference for complete data; Extension to
incomplete data; Missing at Random and ignorable missingness; Likelihood theory for
ignorable missing data; Factored likelihood methods for special patterns; Bayes Methods;
Data Augmentation; Gibbs Sampling.
EM Algorithm and Extensions
The E and M steps of the EM algorithm; EM for exponential families; EM with parameter
constraints; Rate of convergence; Generalized EM; Expectation/Conditional Maximization
(ECM) algorithm; Other extensions of EM.
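As a small illustration of the E and M steps, the sketch below runs EM on the classic genetic linkage multinomial example from Dempster, Laird and Rubin (1977). It is offered only as an example, not as course material; the function name `em_linkage`, the starting value, and the tolerance are assumptions.

```python
import math


def em_linkage(counts, theta=0.5, tol=1e-10, max_iter=1000):
    """EM for a multinomial with cell probabilities
    (1/2 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4).

    The first cell is treated as the sum of two latent cells with
    probabilities 1/2 and theta/4; that split is the missing data.
    """
    y1, y2, y3, y4 = counts
    for _ in range(max_iter):
        # E step: expected count in the latent theta/4 part of cell 1.
        x2 = y1 * (theta / 4) / (0.5 + theta / 4)
        # M step: complete-data MLE, which is a binomial proportion.
        new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


theta_hat = em_linkage((125, 18, 20, 34))
# For these counts the score equation reduces to 197*t**2 - 15*t - 68 = 0,
# so the EM limit can be checked against the positive root.
closed_form = (15 + math.sqrt(15 ** 2 + 4 * 197 * 68)) / (2 * 197)
print(theta_hat, closed_form)
```

Here the M step has a closed form because the complete-data model is an exponential family; when it does not, generalized EM or ECM replaces it with a partial maximization.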
More on Multiple Imputation
Link between multiple imputation and modern statistical tools; Bayesian theory of
multiple imputation; Gibbs Sampling; Software for multiple imputation analysis; guest
lecture on IVEWARE.
Robustness of Estimation with Missing Data
Robust likelihood-based inference; Robust Bayesian models; more attention to model
checks; flexible modeling against misspecification.
Introduction to Semiparametric Models with Missing Data
Objectives of semiparametric inference; Elementary notions of Hilbert space theory;
Inner product, norm and orthogonality; Closed spaces, inner product spaces, Cauchy
sequences, Banach spaces, Hilbert spaces; Pythagorean theorem; Projections on a finite
dimensional space; The normal equations.
Semiparametric Efficiency Theory
Regular parametric models and regular estimators; The Cramér-Rao bound; Hájek
representation theorem; Characterization of regular asymptotically linear estimators;
Regular parametric submodels; Tangent space; Semiparametric variance bound; Pathwise
differentiable parameters; Gradients; Efficient influence function; Efficient estimators in
semiparametric models.
Semiparametric Models with Missing Data
Models with factorized likelihoods; Double robustness; Examples: missing covariates in
regression, drop-out in longitudinal studies; Full and observed data models;
Identification; Likelihood factorization under missing at random (MAR).
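To make double robustness concrete, here is a minimal sketch of the augmented inverse-probability-weighted (AIPW) estimator of a mean when the outcome is missing at random given covariates. It is an illustration under stated assumptions rather than the course's own code: the function name `aipw_mean` is hypothetical, and the fitted response probabilities and outcome regressions are assumed to be supplied by the user.

```python
def aipw_mean(y, r, pi_hat, m_hat):
    """Augmented inverse-probability-weighted estimate of E[Y].

    y:      outcomes (any placeholder, e.g. None, where missing)
    r:      1 if y is observed, 0 if missing
    pi_hat: fitted P(R = 1 | X) for each unit, assumed > 0
    m_hat:  fitted E[Y | X] for each unit
    """
    n = len(r)
    total = 0.0
    for yi, ri, pi, mi in zip(y, r, pi_hat, m_hat):
        ipw_term = yi / pi if ri == 1 else 0.0   # weighted observed outcome
        augmentation = (ri - pi) / pi * mi       # has mean zero when pi is correct
        total += ipw_term - augmentation
    return total / n


# Hypothetical toy data: two observed and two missing outcomes.
est = aipw_mean(
    y=[2.0, 4.0, None, None],
    r=[1, 1, 0, 0],
    pi_hat=[0.5, 0.5, 0.5, 0.5],
    m_hat=[3.0, 3.0, 3.0, 3.0],
)
print(est)
```

Under MAR, this estimator is consistent if either the response model behind `pi_hat` or the outcome model behind `m_hat` is correctly specified, which is the sense in which it is doubly robust.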
Geometric Structure of Models for MAR
Score operators; Observed data tangent space as the range of the score operator;
Characterization of pathwise differentiable parameters in the observed data model.
Inference in Semiparametric Models with Missing Data
Tangent space in a model that imposes only MAR; Linear operator theory;
Representation of the tangent space in arbitrary MAR models; The representation
theorem for the set of gradients of parameters of an arbitrary semiparametric full data
model under the sole assumption of MAR.
Derivation of Semiparametric Efficient Score and Efficiency Bound
Missing data models with probability bounded away from zero of observing full data (an
example where there is always positive information for full data pathwise differentiable
parameters); Representation of the efficient score in missing data models with probability
bounded away from zero of observing full data; Examples: Missing outcomes in
longitudinal studies with drop-out; Missing covariates in regression.
Academic Integrity:
The faculty of the School of Public Health believes that the conduct of a student
registered or taking courses in the School should be consistent with that of a professional
person. Courtesy, honesty and respect should be shown by students toward faculty
members, guest lecturers, administrative support staff and fellow students. Similarly,
students should expect faculty to treat them fairly, showing respect for their ideas and
opinions and striving to help them achieve maximum benefits from their experience in
the School.
Student academic misconduct refers to behavior that may include plagiarism, cheating,
fabrication, falsification of records or official documents, intentional misuse of
equipment or materials (including library materials), and aiding and abetting the
perpetration of such acts. The preparation of reports, papers, and examinations,
assigned on an individual basis, must represent each student's own effort. Reference
sources should be indicated clearly. The use of assistance from other students or aids of
any kind during a written examination, except when the use of aids such as electronic
devices, books or notes has been approved by an instructor, is a violation of the standard
of academic conduct.
Competencies covered in this course
• Describe the roles biostatistics serves in the discipline of public health. (partially
covered)
• Describe basic concepts of probability, random variation and commonly used
statistical probability distributions. (partially covered)
• Describe preferred methodological alternatives to commonly used statistical methods
when assumptions are not met. (partially covered)
• Distinguish among the different measurement scales and the implications for selection
of statistical methods to be used based on these distinctions. (partially covered)
• Apply descriptive techniques commonly used to summarize public health data.
(partially covered)
• Apply common statistical methods for inference. (partially covered)
• Apply descriptive and inferential methodologies according to the type of study design
for answering a particular research question. (partially covered)
• Interpret results of statistical analyses found in public health studies. (partially
covered)
• Develop written and oral presentations based on statistical analyses for both public
health professionals and educated lay audiences. (partially covered)
• Apply the basic terminology and definitions of epidemiology. (partially covered)
• Calculate basic epidemiology measures. (partially covered)
• Draw appropriate inferences from epidemiologic data. (partially covered)
• Apply evidence-based principles and the scientific knowledge base to critical
evaluation and decision-making in public health. (partially covered)