Political Science W4365, Design and Analysis of Sample Surveys
Columbia University, Spring 2013
Class meeting: Mon/Wed 10-11:30, Pupin 425
Section meeting: Time and place to be arranged
Instructor: Andrew Gelman
Teaching assistant: Tiffany Washburn
Course description: Survey sampling is central to modern social science. We discuss
how to design, conduct, and analyze surveys, with a particular focus on public opinion
polls in the United States.
Prerequisites: Basic statistics and regression analysis (for example, Pols 4911, Stat
2024 or 4315, Soc 4075, etc.).
Textbooks:
- Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and
Tourangeau, R. (2009). Survey Methodology, second edition. Wiley.
- Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
- Gelman, A., and Hill, J. (2007). Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press.
Also, readings for each week at http://www.stat.columbia.edu/~gelman/surveys.course/
Final exam: Last year’s final is at
http://www.stat.columbia.edu/~gelman/surveys.course/final2012.pdf
Tentative syllabus:
Weeks 1-2: Statistical background
Class 1a: Analyzing survey data in R
Class 1b: Statistical inference and linear regression
Class 2a: Logistic regression
Class 2b: The challenge of estimating small effects
Topics:
Estimates of proportions
Estimates and standard errors for continuous parameters
Linear regression
Logistic regression
Translating regressions into real-world predictions
Computation in R: proportions and averages, linear and logistic regression, simple
graphs, random sampling
Stories:
The 1/sqrt(N) story
55,000 residents desperately need your help!
Arsenic in drinking water in Bangladesh
Video:
Osorio, F., Gelman, A., and Leeman, L. (2012). Working with the General Social Survey
in R. http://www.youtube.com/watch?v=mu2sEf12Eu4
Readings:
Gelman and Hill, chapters 2-5.
Gelman, A., Lee, D., Dorie, V., and Chan, V. (2011). Statistics: What’s the Difference?,
chapters 1-3.
Gelman, A., and Stern, H. S. (2006). The difference between “significant” and “not
significant” is not itself statistically significant. American Statistician 60, 328-331.
Gelman, A. (2012). 1.5 million people were told that extreme conservatives are happier
than political moderates. Approximately .0001 million Americans learned that the
opposite is true. Statistical Modeling, Causal Inference, and Social Science blog, 14 Aug.
http://andrewgelman.com/2012/08/1-5-million-people-were-told-that-extreme-conservatives-are-happier-than-political-moderates-approximately-0001-million-americans-learned-that-the-opposite-is-true/
Homework due beginning of class 3a (problems 1 and 2) and class 4a (problems 3 and 4):
1. Sample size calculation. In a survey of n people, half are asked if they support “the
health care law recently passed by Congress” and half are asked if they support “the law
known as Obamacare.” The goal is to estimate the effect of the wording on the
proportion of Yes responses. How large must n be for the effect to be estimated within a
standard error of 5 percentage points?
2. Linear regression. The file at
http://www.stat.columbia.edu/~gelman/surveys.course/pew_research_center_june_elect_wknd_data.dta
has data from Pew Research Center polls taken during the 2008 election campaign. You can
read these data into R using the read.dta() function (after first loading the “foreign”
package into R). For this homework set, ignore the survey weights. Fit a linear regression
(using the lm() function in R) to predict political ideology (on a 5-point scale: –2 = very
liberal, –1 = liberal, 0 = moderate, 1 = conservative, 2 = very conservative, with
nonresponses coded as 0’s), given sex, age, and marital status. Use
the display() function (after first loading the “arm” package) to display the result. In a
short paragraph, describe the meaning of each coefficient in the fitted model.
3. Logistic regression. Using this same survey, fit a logistic regression (using the glm()
function in R) to predict whether a person is liberal (that is, responds “liberal” or “very
liberal” to the ideology question, excluding respondents who do not respond to this
question), given sex, age, and marital status. Use the display() function to display the
result. In a short paragraph, describe the meaning of each coefficient in the fitted model.
4. Working with survey data in R. Using this same survey, compute the percentage of
respondents in each state (excluding Alaska and Hawaii) who are liberal.
Make the following three graphs, putting them into a single image, hwk1_4.png, using
the following commands in R:
png ("hwk1_4.png", height=600, width=600)
par (mfrow=c(2,2))
...
dev.off ()
(a) Plot estimated proportion liberal in each state vs. Obama's vote share in 2008 (data
available at
http://www.stat.columbia.edu/~gelman/surveys.course/2008ElectionResult.csv, readable
in R using read.csv()), as a scatterplot using the two-letter state abbreviations (see
state.abb in R).
(b) Plot estimated proportion liberal in each state vs. sample size in each state (again as a
scatterplot using the two-letter state abbreviations).
(c) Map estimated proportion liberal using colors in a U.S. map.
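For parts (a) and (b) of problem 4, one way to plot states by their two-letter abbreviations is to set up empty axes and then place the labels with text(). The following is only a sketch of the plotting mechanics: the numbers are random placeholders rather than Pew or election data, the variable names are hypothetical, and the output goes to a temporary directory instead of the working directory.

```r
# Placeholder data: random numbers standing in for real survey estimates,
# vote shares, and state sample sizes. Nothing here comes from the Pew file.
set.seed(1)
states <- state.abb[!(state.abb %in% c("AK", "HI"))]  # drop Alaska and Hawaii
prop_liberal <- runif(length(states), 0.10, 0.40)
obama_share <- runif(length(states), 0.30, 0.70)
n_state <- sample(50:500, length(states))

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "hwk1_4_sketch.png")
png(out_file, height=600, width=600)
par(mfrow=c(2,2))

# type="n" draws the axes only; text() then writes each state's abbreviation
plot(obama_share, prop_liberal, type="n",
     xlab="Obama vote share, 2008", ylab="Estimated proportion liberal")
text(obama_share, prop_liberal, states, cex=0.8)

plot(n_state, prop_liberal, type="n",
     xlab="Sample size in state", ylab="Estimated proportion liberal")
text(n_state, prop_liberal, states, cex=0.8)

invisible(dev.off())
```

The map in part (c) is not sketched here; it would occupy a third panel of the same 2x2 layout.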
Weeks 3-4: Missing data and adjusting for known differences between sample and
population
Class 3a: Missing-data imputation
Class 3b: Survey nonresponse
Class 4a: Weighting and poststratification
Class 4b: Ratio and regression estimation
Topics:
Imputing missing values
Weighting
Poststratification
Stories:
Alcoholics Anonymous survey
Exit polls and election-night results
Readings:
Gelman and Hill, chapter 25.
Groves et al., chapter 6.
Lumley, chapters 5 and 9.
Gelman, A. (2006). Counting churchgoers. Statistical Modeling, Causal Inference, and
Social Science blog, 11 Jul. http://andrewgelman.com/2006/07/counting_church/
Hadaway, C. K., Marler, P. L., and Chaves, M. (1993). What the polls don’t show: A
closer look at U.S. church attendance. American Sociological Review 58, 741-752.
Smith, T. W. (1983). The hidden 25 percent: An analysis of nonresponse on the 1980
General Social Survey. Public Opinion Quarterly 47, 386-404.
Gelman, A. (2007). Struggles with survey weighting and regression modeling (with
discussion). Statistical Science 22, 153-164.
Lohr, S. (2007). Comment: Struggles with survey weighting and regression modeling.
Statistical Science 22, 175-178.
Gelman, A. (2007). Rejoinder: Struggles with survey weighting and regression
modeling. Statistical Science 22, 184-188.
Homework due beginning of class 5a (problems 1 and 2) and class 6a (problems 3 and 4):
1. Weighted analysis. Using the Pew surveys from the previous homework:
(a) Compute the weighted average proportion liberal in each state and plot vs. the raw
average; this should be a square plot (in R, par (pty="s")) with identical scales on x and y
axes, and each state indicated by its two-letter abbreviation.
(b) Using the “survey” package in R, fit a linear regression (using the svyglm() function
in R) to predict political ideology, given sex, age, and marital status. Compare to the
unweighted results.
2. Poststratification. A survey is taken of 100 undergraduates, 100 graduate students,
and 100 continuing education students at a university. Assume a simple random sample
within each group. Each student is asked to rate his or her satisfaction (on a 1–10 scale)
with his or her experiences. Write the estimate and standard error of the average
satisfaction of all the students at the university. Introduce notation as necessary for all
the information needed to solve the problem.
3. Missing-data imputation. Create a miniature version of the 2010 General Social
Survey (http://www.thearda.com/Archive/Files/Codebooks/GSS10PAN_CB.asp),
including the following variables: sex, age, ethnicity (use four categories),
urban/suburban/rural, education (use five categories), political ideology (on a 7-point
scale from “extremely liberal” to “extremely conservative”), and general happiness.
(a) Fit a logistic regression on whether respondents feel “not too happy,” given the other
variables in the dataset. Display (using display()) the results for the logistic regression fit
to the complete cases (this is the result if you just feed the data including NA’s into R).
(b) Impute the missing values using mi() in the “mi” package in R. Then take one of the
completed datasets and fit and display a logistic regression as above.
(c) Repeat, this time imputing using aregImpute() in the “Hmisc” package.
(e) Briefly discuss the differences between the four inferences above.
4. Ratio and regression estimation. Exercise 5.3 from Lumley: Using the data from
Wave 1 of the 1996 SIPP panel (see Lumley Figure 3.8):
(a) Estimate the ratio of population totals for monthly rent (“tmthrnt”) and total
household income (“thtrninc”) over the whole population and over the subpopulation
who pay rent.
(b) Compute the individual-level ratio, i.e., the proportion of household income paid in
rent, and estimate the population mean over the whole population and the subpopulation
who pay rent.
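For the poststratification problem (problem 2 above), a common estimator weights each group's sample mean by that group's share of the population. The sketch below assumes simple random sampling within groups and ignores finite-population corrections. The function name poststrat_mean, the enrollment counts, and the sample summaries are all made up for illustration.

```r
# Poststratified mean and standard error from per-group summaries.
#   N    population size of each group
#   ybar sample mean in each group
#   s    sample standard deviation in each group
#   n    sample size in each group
poststrat_mean <- function(N, ybar, s, n) {
  W <- N / sum(N)                   # population share of each group
  est <- sum(W * ybar)              # weighted average of group means
  se <- sqrt(sum(W^2 * s^2 / n))    # no finite-population correction
  c(estimate = est, se = se)
}

# Made-up inputs: three groups with 100 respondents each
N <- c(undergrad = 6000, grad = 3000, cont_ed = 1000)
ybar <- c(7.0, 6.0, 8.0)
s <- c(2, 2, 2)
n <- c(100, 100, 100)

res <- poststrat_mean(N, ybar, s, n)
print(res)
```

Because the three groups are sampled equally but differ in population size, the poststratified estimate generally differs from the raw average of all 300 responses.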
Weeks 5-6: Sampling and estimation
Class 5a: Simple and stratified random sampling
Class 5b: Cluster sampling with equal cluster sizes
Class 6a: Cluster sampling with unequal cluster sizes
Class 6b: Inference for regression coefficients
Topics:
Stratified sampling
Cluster sampling
Estimating population averages and totals
Estimating regression models
Stories:
Sampling names and addresses
Postal surveys
Readings:
Groves et al., chapters 3-4.
Lumley, chapters 1-6.
Gelman and Hill, chapters 7-8.
Carlin, J. B., Stevenson, M. R., Roberts, I., Bennett, C. M., Gelman, A., and Nolan, T.
(1997). Walking to school and traffic exposure in Australian children. Australian and
New Zealand Journal of Public Health 21, 286-292.
Homework due beginning of class 7a (problems 1 and 2) and class 8a (problems 3 and 4):
1. Random sampling and regression. Sample 100 random data points x from the normal
distribution with mean 10 and standard deviation 5. Then simulate 100 data points y
from the model, y = 2 + 10x – x^2 + error, where the errors are normally distributed with
mean 0 and standard deviation 1.
(a) Fit a linear regression to the data and fit a quadratic regression to the data. Display
the fitted regressions (using the display() function).
(b) Make a scatterplot showing the data (using plot()) and the fitted linear and quadratic
regression lines (using curve(a+b*x,add=TRUE) and
curve(b0+b1*x+b2*x^2,add=TRUE)).
2. Cluster sampling. Suppose you have a library of 100 books and you want to estimate
the frequency of the different words in this library. So you decide to take a random
sample of 1000 words. Come up with a sampling scheme in which all words are equally
likely to be selected (in proportion to their total number of appearances in the library).
3. Simulation and analysis of stratified sample. Write an R function to take a random
subsample of the 2010 General Social Survey using regions of the country as strata.
(a) Perform a sample of size 100 with each stratum sampled in proportion to its
population size (in this case, the “population” is just the full 2010 GSS). Use this
subsample to estimate the proportion of people who favor a law which would require a
person to obtain a police permit before he or she could buy a gun. Also compute the
standard error for this estimate, first directly using the formula for the standard error of a
stratified sample, then using the “survey” package in R. (These two standard errors should
be identical.)
(b) Put step (a) above in a loop and do it 100 times. Check that your estimate is unbiased
and that its standard deviation is approximately equal to the average standard error
computed in the 100 simulations.
4. Simulation and analysis of cluster sample. Write an R function to take a random
subsample of the 2010 General Social Survey using occupations as clusters.
(a) Take a cluster sample in the following way: first sample 20 occupations at random,
then sample 50% of the respondents from each sampled occupation. From this sample,
estimate the proportion of people in the population who favor a law which would require
a person to obtain a police permit before he or she could buy a gun. Compute the
standard error of this estimate.
(b) Repeat (a), but this time taking the sample as follows: first sample 20 occupations at
random, then sample 5 people from each sampled occupation (or, if there are fewer than
5 people in that occupation category, sample all of them). Again get an estimate and
standard error for the gun control question.
(c) Repeat (a), but this time first sample 20 occupations with probability proportional to
size, then sample 5 from each sampled occupation (or, if there are fewer than 5 people
in that occupation category, sample all of them). Again get an estimate and standard
error for the gun control question.
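For problem 1 of this homework set (random sampling and regression), the simulation and the two fits can be sketched in base R as follows. This is one possible setup, not the required solution: it prints coefficients with coef() so that it runs without the “arm” package, whereas the problem asks for display(). The seed and object names are arbitrary.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Simulate predictors and outcomes from the model stated in the problem
x <- rnorm(100, mean=10, sd=5)
y <- 2 + 10*x - x^2 + rnorm(100, mean=0, sd=1)

# Linear fit and quadratic fit; I() protects x^2 inside the formula
fit_lin <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))
print(coef(fit_lin))
print(coef(fit_quad))

# Scatterplot with both fitted curves overlaid
plot(x, y)
a <- coef(fit_lin)[1]
b <- coef(fit_lin)[2]
curve(a + b*x, add=TRUE)
b0 <- coef(fit_quad)[1]
b1 <- coef(fit_quad)[2]
b2 <- coef(fit_quad)[3]
curve(b0 + b1*x + b2*x^2, add=TRUE, lty=2)
```

The quadratic fit should recover coefficients close to 2, 10, and –1, while the linear fit cannot follow the curvature in the data.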
Weeks 7-8: Measurement
Class 7a: Survey interviewing
Class 7b: Challenges in survey measurement
Class 8a: Using surveys to answer questions in political science
Class 8b: Conducting a survey in the real world
Topics:
Observational measurement
Experimental measurement
Survey interviewing
Statistical models for measurement error
Item response theory
Stories:
Framing effects
The U-shaped pattern on happiness
Measuring gun use
Measuring religiosity
Readings:
Groves et al., chapters 7-9.
Tversky, A., and Kahneman, D. (1981). The framing of decisions and the psychology of
choice. Science 211, 453-458.
Gelman, A. (2010). Age and happiness: The pattern isn't as clear as you might think.
Statistical Modeling, Causal Inference, and Social Science blog, 26 Dec.
http://andrewgelman.com/2010/12/age_and_happine/
Frijters, P., and Beaton, T. (2008). The mystery of the U-shaped relationship between
happiness and age. Working paper #26. National Centre for Econometric Research,
Australia.
Blanchflower, D. G., and Oswald, A. (2008). Is well-being U-shaped over the life cycle?
Social Science & Medicine 66, 1733-1749.
Stone, A. A., Schwartz, J. E., Broderick, J. E., and Deaton, A. (2010). A snapshot of the
age distribution of psychological well-being in the United States. Proceedings of the
National Academy of Sciences USA 107, 9985-9990.
Gelman, A. (2010). God, guns, and gaydar: The laws of probability push you to
overestimate small groups. Statistical Modeling, Causal Inference, and Social Science
blog, 12 Jul. http://andrewgelman.com/2010/07/god_guns_and_ga/
Hemenway, D. (1997). The myth of millions of annual self-defense gun uses: a case
study of survey overestimates of rare events. Chance 10 (3), 6-10.
Homework due beginning of class 9a (problem 1) and class 10a (problem 2):
1. Survey interviewing. Design a survey form and try it out on five friends.
2. Survey measurement. Find a measurement effect in an existing survey.
Weeks 9-10: Surveys in political science
Class 9a: Voting
Class 9b: Public opinion
Class 10a: Political participation
Class 10b: Understanding and displaying data
Topics:
General Social Survey and the National Election Study
Mass-media opinion polls
Other surveys that are available for research
Manipulating data and performing simple analyses using R
Key ideas of sampling and surveys
Stories:
Uniform swing in opinions
Comparing health care attitudes in 1994 and 2009
What’s (not) the matter with Portugal?
Video:
Nanjiani, K. (2008). Cheese Heroin. http://www.youtube.com/watch?v=WVIC2gJTD9s
Readings:
Groves et al., chapters 1 and 2.
Page, B. I., and Shapiro, R. Y. (1982). Changes in Americans’ policy preferences,
1935-1979. Public Opinion Quarterly 46, 24-42.
Page, B. I., Shapiro, R. Y., and Dempsey, G. R. (1987). What moves public opinion?
American Political Science Review 81, 23-43.
Shapiro, R. Y., and Page, B. I. (1988). Foreign policy and the rational public. Journal of
Conflict Resolution 32, 211-247.
Gelman, A., and King, G. (1993). Why are American Presidential election campaign
polls so variable when votes are so predictable? British Journal of Political Science 23,
409-451.
Baldassarri, D., and Gelman, A. (2008). Partisans without constraint: Political
polarization and trends in American public opinion. American Journal of Sociology 114,
408-446.
Gelman, A. (2009). Debunking the so-called Human Development Index of U.S. states.
Statistical Modeling, Causal Inference, and Social Science blog, 20 May.
http://andrewgelman.com/2009/05/20/debunking_the_s/
Gelman, A. (2008). Peeking behind the curtain, or, What’s (not) the matter with
Portugal? Statistical Modeling, Causal Inference, and Social Science blog, 25 Mar.
http://andrewgelman.com/2008/03/peeking_behind/
Gelman, A., and Cai, C. J. (2008). Should the Democrats move to the left on economic
policy? Annals of Applied Statistics 2, 536-549.
Gelman, A. (2012). College football, voting, and the law of large numbers. Statistical
Modeling, Causal Inference, and Social Science blog, 25 Oct.
http://andrewgelman.com/2012/10/25/college-football-voting-and-the-law-of-large-numbers/
Homework due beginning of class 11a (problem 1) and class 12a (problem 2):
1. Survey responses. From 1984 through 2008 (and maybe in other years), the National
Election Study asked attitudes on several issues, and also perceptions of the stances on
these issues held by the major presidential candidates. (For example, in 2004 these issues
included the role of women, gun-control policy, government aid to African Americans,
the level of spending that the government should undertake in the economy, the role of
the government in providing an economic environment where there is job security, and
the level at which the government should spend on defense. Each respondent was asked
how he or she stood on these issues and where they would place George W. Bush and
John Kerry.) It turns out that attitudes about the candidates’ views are strongly correlated
with respondents’ own ideologies; see Figure 5 of Gelman and Cai (2008).
(a) Replicate Figure 5 of Gelman and Cai (2008) using the 2004 NES. Gelman and Cai
(2008) had a serious coding error, so your results should be different from theirs.
(b) Discuss any difficulties you have, and compare your results to the published paper.
2. Social science. Write a three-page mini-paper addressing some interesting social
science question using the NES or GSS. The topic and the analysis do not need to be
deep, but they must be original, and you need to go beyond simple toplines and crosstabs.
Weeks 11-12: More elaborate statistical modeling
Class 11a: Bayesian inference
Class 11b: Ideal-point modeling
Class 12a: Multilevel regression and poststratification
Class 12b: Challenges in multilevel regression and poststratification
Topics:
Bayesian inference
Ideal-point modeling
Multilevel regression for stratified and cluster sampling
Multilevel regression and poststratification
Stories:
Voting
Health care
Gay rights
Readings:
Gelman and Hill, chapters 11-12.
Gelman, A. (2006). Multilevel modeling: What it can and cannot do. Technometrics 48,
432-435.
Gelman, A. (2005). Two-stage regression and multilevel modeling: a commentary.
Political Analysis 13, 459-461.
Gelman, A. (2012). Is it meaningful to talk about a probability of “65.7%” that Obama
will win the election? Statistical Modeling, Causal Inference, and Social Science blog,
22 Oct.
http://andrewgelman.com/2012/10/is-it-meaningful-to-talk-about-a-probability-of-65-7-that-obama-will-win-the-election/
Gelman, A., and Lock, K. (2010). Bayesian combination of state polls and election
forecasts. Political Analysis 18, 337-348.
Gelman, A., Lee, D., and Ghitza, Y. (2010). Public opinion on health care reform. The
Forum 8 (1), article 8.
Shapiro, R. Y., and Arrow, S. A. (2009). Support for health care reform: Is public
opinion more favorable for Obama than it was for Clinton in 1994?
Lax, J., and Phillips, J. (2009). How should we estimate public opinion in the states?
American Journal of Political Science 53, 107-121.
Lax, J., and Phillips, J. (2009). Gay rights in the states: Public opinion and policy
responsiveness. American Political Science Review 103, 367-386.
Gelman, A., and Su, Y. S. (2011). Public opinion on school vouchers.
Ghitza, Y., and Gelman, A. (2013). Deep interactions with MRP: Presidential turnout
and voting patterns among small electoral subgroups. American Journal of Political
Science.
Homework due beginning of class 13a (problems 1 and 2) and class 14a (problems 3 and 4):
1. Bayesian inference. From a survey of 500 people, you estimate the proportion who
support candidate A in the upcoming election to be 60%. From a forecast (not using this
poll) you get a prediction that candidate A will win 51% of the vote. Let X be the
standard error of this forecast. Further suppose that you estimate the nonsampling error
of this poll to be equal to the sampling error.
(a) Suppose that, given the above information, your Bayesian forecast is that A will
receive 54% of the vote. What is X, and what is the standard error of your Bayesian
forecast?
(b) What is your Bayesian probability that candidate A will win the election?
2. Ideal-point modeling. You will create a measure of economic ideology using the
following questions from the 2000 Annenberg survey: Are tax rates a problem (CBB01),
Favor cutting taxes or strengthening social security (CBB05), Federal government should
reduce the top tax rate (CBB10), Federal government should adopt flat tax (CBB13),
Federal government should spend more on social security (CBC01), Favor investing
social security in stock market (CBC05), Is poverty a problem (CBP01), Federal
government should reduce income differences (CBP02), Federal government should
spend more on aid to mothers with young children (CBP03), Federal government should
expend effort to eliminate many business regulations (CBT01).
Fit a hierarchical logistic regression to estimate ideal points for individuals and survey
questions.
(a) Display the estimated ideal points and standard errors of the survey questions (listing
the questions in order of their estimated ideal points)
(b) Display the distribution of estimated ideal points of the survey respondents. On this
same graph, display the distributions for Democrats, independents, and Republicans.
3. Multilevel modeling. From the Pollster data, estimate a time series of support for
Obama and Romney, adjusting for house effects and then smoothing the curve using
some function such as lowess. Compare to the smoothed average of the unadjusted
approval numbers from this series and comment on any differences.
4. Multilevel regression and poststratification. Download the cumulative National
Election Study.
(a) Fit a multilevel logistic regression estimating support for gun control given state, year,
sex, and ethnicity (white/black/Hispanic/other). Use the display() function in R to
display the fitted model. Explain the output in a brief paragraph.
(b) Using your model, get estimates of the proportion of people who support gun control,
for all 8 demographic groups in each state (excluding Alaska and Hawaii) for the year
2012. Using the 2010 census, poststratify to get an estimate for each state.
(c) Make the following five graphs on a 3x2 grid: (i) a map of estimated gun control
support by state in 2012; (ii) a plot of estimated gun control support vs. Obama vote share
in 2012 (indicating each state by its two-letter abbreviation); (iii) a plot of estimated gun
control support in 2012 vs. the raw proportion of respondents in the state from 2012 who
supported gun control; (iv) a plot of estimated gun control support in 2012 vs. the raw
proportion of respondents in the state who supported gun control, pooling data from
all years of NES; (v) a plot of estimated gun control support in 2012 vs. the state-level
“random effects” from the fitted multilevel model.
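For problem 1 of this homework set (Bayesian inference), the usual starting point is the precision-weighted average of two independent normal estimates, here a forecast and a poll. The sketch below shows that calculation with made-up inputs that differ from those in the problem. The function name combine_normal is hypothetical, and the normal approximation is an assumption of the sketch.

```r
# Combine two independent normal estimates by weighting each by its precision
# (1 / standard error^2). Returns the combined mean and standard error.
combine_normal <- function(est1, se1, est2, se2) {
  w1 <- 1 / se1^2
  w2 <- 1 / se2^2
  c(mean = (w1*est1 + w2*est2) / (w1 + w2),
    se = sqrt(1 / (w1 + w2)))
}

# Made-up inputs: a forecast of 50% (s.e. 3 points) and a poll at 56%
# (total s.e. 4 points, sampling and nonsampling error combined)
post <- combine_normal(0.50, 0.03, 0.56, 0.04)
print(post)

# Probability that the vote share exceeds 50%, under the normal approximation
p_win <- 1 - pnorm(0.50, mean=post["mean"], sd=post["se"])
print(unname(p_win))
```

The combined estimate always falls between the two inputs, closer to whichever has the smaller standard error, and its standard error is smaller than either input's.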
Weeks 13-14: Hard-to-reach populations
Class 13a: Low response rates in U.S. surveys
Class 13b: Surveys in less-developed countries
Class 14a: Network sampling
Class 14b: Review
Topics:
Callbacks
Capture-recapture
Respondent-driven sampling
Stories:
Friendsense
How many X’s do you know?
Polarization and perceived polarization
The Iraq death surveys
Census adjustment
“Millionaires for McCain, Billionaires for Obama”
Readings:
U.S. Census Bureau (2012). Census Bureau releases estimates of undercount and
overcount in the 2010 Census.
http://www.census.gov/newsroom/releases/archives/2010_census/cb12-95.html
Gelman, A. (2008). Political attitudes of the super-rich. Red State Blue State blog, 2
Nov. http://redbluerichpoor.com/blog/2008/11/political-attitudes-of-the-super-rich/
Page, B. I., Bartels, L. M., and Seawright, J. (2013). Democracy and the policy
preferences of wealthy Americans. Perspectives on Politics 11, 51-73.
Goel, S., Mason, W., and Watts, D. J. (2010). Real and perceived attitude agreement in
social networks. Journal of Personality and Social Psychology.
Hampton, K. N., Goulet, L. S., Rainie, L., and Purcell, K. (2011). Social networking
sites and our lives. Pew Research Center report, 16 June.
McCormick, T. H., Salganik, M. J., and Zheng, T. (2010). How many people do you
know?: Efficiently estimating personal network size. Journal of the American Statistical
Association 105, 59-70.
Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of
hidden populations. Social Problems 44, 174-199.
Goel, S., and Salganik, M. J. (2010). Assessing respondent-driven sampling.
Proceedings of the National Academy of Sciences USA 107, 6743-6747.
Spagat, M. (2009). The reliability of cluster surveys of conflict mortality: Violent deaths
and non-violent deaths. Presentation given at the conference, International Conference
on Recording and Estimation of Casualties, Carnegie Mellon University and University
of Pittsburgh, 23-24 Oct.
Gelman, A. (2010). Ethical and data-integrity problems in a study of mortality in Iraq.
Statistical Modeling, Causal Inference, and Social Science blog, 27 Apr.
http://andrewgelman.com/2010/04/ethical_and_dat_1/
Spagat, M. (2010). Ethical and data-integrity problems in the second Lancet survey of
mortality in Iraq. Defence and Peace Economics 21, 1-41.
Rothschild, D., and Wolfers, J. (2011). Forecasting elections: Voter intentions versus
expectations. http://assets.wharton.upenn.edu/~rothscdm/RothschildExpectations.pdf