STT200 Chapter 12 KM

Chapter 12 Sample Surveys (please read)

Vocabulary:

Already known: Population, Sample, Parameter, Statistic, Surveys

Notation: We typically use Greek letters to denote parameters and Latin letters to denote statistics.

Sampling Frame - a list of individuals from which a sample is selected
Sampling Variability - each random sample is different
Sampling Design - a method of taking a sample
Sample Size – the number of elements in a sample. Do not sample more than 10% of the population.

If the goal is to estimate population parameters or a distribution, the sample should represent the whole population. To achieve this, individuals should be selected at random, so that each set of the same size has the same chance of being selected. A sample drawn in this way is called a simple random sample (SRS).

Sampling Designs (to produce representative samples)

All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.

• SRS – Simple Random Sample - each set of size n has the same chance of being selected (best!)
• Stratified Random Sample - a sampling design in which the population is divided into groups (strata) and random samples are then drawn from each stratum
• Cluster Sample - a sampling design in which we first select some parts of the population (clusters) at random, and then perform a census within each of them
• Systematic Sample - there is a pattern in selecting the sample, but the starting point is random
• Multistage Sample - a design that combines several methods

Bias – a failure of a sample to represent the population

Sources of bias

• Voluntary Response Sample - a large group is invited to respond, but only those who respond are counted. This sample is not representative.
• Convenience Sample - includes the individuals who are at hand. This sample is not representative.
• Undercoverage - some portion of the population is not sampled
• Nonresponse bias - a large fraction of the sampled individuals fail to respond
• Response bias - the wording, order, etc. of the questions suggests a certain response

Data collected in an inappropriate way is useless.

Example (Gallup Poll).

GEORGE GALLUP (1901-1984) was the father of opinion surveys. He studied journalism and psychology and worked in market research and the analysis of advertising and marketing. He founded the Gallup Institute in 1935. His first big triumph came in 1936, when F.D. Roosevelt ran for a second term in the White House against the Kansas governor Alfred Landon.

LITERARY DIGEST predicted the outcome of the election from an enormous number of responses obtained after sending 10 million (!!!) postcards to American households.

    F.D. ROOSEVELT    A. LANDON
    44%               56%

Literary Digest had an excellent reputation after correctly predicting the 1932 presidential election.

G. GALLUP sent only a few thousand postcards and announced a convincing victory for F.D. Roosevelt.

The real result of the vote

    F.D. ROOSEVELT    A. LANDON
    22 809 638        15 758 901
    63%               37%

On the one hand, common sense suggests that the more people are surveyed, the more accurate the result. On the other hand...

Gallup chose his respondents at random, in such a way that every potential voter had the same chance of being selected. Literary Digest used addresses from telephone books and car registration lists. Clearly, at that time car and telephone owners did not make up even half of the adults in the US.

The GALLUP case showed that what matters most is the way a sample is selected; the sample size is less important.

Exercises:

4. Drug tests. Major League Baseball tests players to see whether they are using performance-enhancing drugs. Officials select a team at random, and a drug-testing crew shows up unannounced to test all 40 players on the team.
Each testing day can be considered a study of drug use in Major League Baseball.
a) What kind of sample is this?
b) Is that choice appropriate?

10. The Gallup Poll interviewed 1007 randomly selected U.S. adults aged 18 and older, March 23–25, 2007. Gallup reports that when asked when (if ever) the effects of global warming will begin to happen, 60% of respondents said the effects had already begun. Only 11% thought that they would never happen.

For this report about a statistical study, identify the following items (if possible). If you can't tell, then say so. This often happens when we read about a survey.
a) The population
b) The population parameter of interest
c) The sampling frame
d) The sample
e) The sampling method, including whether or not randomization was employed
f) Any potential sources of bias you can detect and any problems you see in generalizing to the population of interest
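
Illustration (not part of the original notes). The sampling designs listed in this chapter can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the population of 100 numbered individuals, the two strata "A" and "B", the ten clusters, and all function names are hypothetical and chosen purely for illustration.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS: every set of n individuals has the same chance of being selected."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: draw a separate random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Cluster: pick whole clusters at random, then take a census of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def systematic_sample(population, step, rng):
    """Systematic: random starting point, then every step-th individual."""
    start = rng.randrange(step)
    return population[start::step]


# Hypothetical population: individuals numbered 1..100 (an assumption).
population = list(range(1, 101))
# Hypothetical strata: two groups of 50 individuals each.
strata = {"A": population[:50], "B": population[50:]}
# Hypothetical clusters: ten groups of 10 individuals each.
clusters = {f"cluster{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

rng = random.Random(12)  # fixed seed so the run is repeatable

srs = simple_random_sample(population, 10, rng)    # 10 of 100, i.e. 10%
strat = stratified_sample(strata, 5, rng)          # 5 from each stratum
clus = cluster_sample(clusters, 2, rng)            # 2 whole clusters
syst = systematic_sample(population, 10, rng)      # every 10th individual

print("SRS:       ", sorted(srs))
print("Stratified:", {k: sorted(v) for k, v in strat.items()})
print("Cluster:   ", clus)
print("Systematic:", syst)
```

In every design the random number generator, not a person, decides who is selected. A multistage sample would combine these functions, for example by choosing clusters at random and then drawing an SRS inside each chosen cluster.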