Sample Size Determination for Survey Design¹

One crucial aspect of survey design is the calculation of the sample size. Although that assertion may seem obvious, or even trivial, exactly how one arrives at the correct sample size for a survey is neither obvious nor trivial. All too often, the proper method of arriving at the correct sample size is overlooked. This document illustrates the appropriate means of calculating the correct sample size for a survey.

UNDERSTANDING SAMPLING CONCEPTS

Arriving at the correct sample size for a survey requires first knowing how to calculate a sample size, and knowing how to calculate a sample size requires an understanding of why a sample is needed in the first place.

A sample is a subset of a population. A population can be almost any group. A population can be very large, for example the population of all persons living in the United States or the population of all persons alive on this planet. A population can also be very small, for example the population of all rainy days in the past year in the Sahara, which may number only three. The sample, then, is a subset of the entire group. For instance, an individual living in the United States constitutes a sample of the population of the United States.

Statistics, or the science of statistics, involves examining a sample from a population in order to estimate various characteristics (called parameters) of that population. Such characteristics may include the average, median, mode, variance, and other measures such as the range. A sample is taken because looking at the entire population may not be feasible. This infeasibility usually stems from the cost, the difficulty, or both, of measuring every member of a population. (If a population isn't especially large, you should consider surveying the entire population. By so doing, you'll know the exact parameters of the population, and you won't have to rely on estimates.)
In most instances, though, a sample must be taken.

DEFINING THE POPULATION

Another important step towards arriving at the correct sample size involves carefully defining the population from which the sample will be taken. If, for example, a researcher is interested in knowing the average weight of the American male, then the population is the weights of all American men, not the men themselves. In other words, the population is the actual numbers representing the weight of each American male. Now suppose that this same researcher is also interested in the average height of the American male. In this case, the researcher must consider two populations, even though he or she will be taking measurements from the same men: the population of all weights of American men and the population of all heights of American men. Remembering this point is very important, since each population will yield a different sample size. The relevance of this point will become clear when sample size determination for surveys containing multiple questions is discussed.

¹ November 1, 2011 version, from PC-MDS documentation of MDPREF.

CALCULATING THE SAMPLE SIZE

Two general formulae exist to calculate the correct sample size. One calculates the sample size when estimating averages, and the other is used when estimating proportions or percentages. The formula for proportions should be used any time you're interested in a percentage, e.g. the proportion of registered voters who plan on voting for a certain candidate. The formula for averages, or means, should be used any time you're interested in a number other than a proportion. For example, you would use the formula for averages if you wished to know the average age of those who voted in the last election.

Sometimes deciding which formula to use can be confusing. For instance, suppose you want to know the number of registered voters within a certain state.
In this case, you would use the formula for proportions, since the number of registered voters in that state is a proportion of the entire population of eligible voters. (Incidentally, to arrive at the number of registered voters, the proportion of registered voters would first need to be estimated, and this proportion would then be multiplied by the number of all eligible voters, which is assumed to be known.)

The two general formulae for calculating sample size are as follows.

For Means:

\[ n \ge \frac{N\sigma^2}{(N-1)D + \sigma^2} \quad \text{where} \quad D = \frac{B^2}{z_{\alpha/2}^2} \]

For Proportions:

\[ n \ge \frac{N\,p\,q}{(N-1)D + p\,q} \quad \text{where} \quad D = \frac{B^2}{z_{\alpha/2}^2}, \quad q = 1 - p \]

Here \(z_{\alpha/2}\) represents the number of standard deviations relative to the mean of the standard normal curve corresponding to the level of confidence. In other words, if the level of confidence is 90%, then α = 10% and α/2 = 5%. Therefore the z value is \(z_{0.05}\), or 1.645. B is the Margin of Error, a value added to and subtracted from the estimate to establish an interval that contains the true population parameter at a given level of confidence. σ is the population standard deviation, and N represents the population size.

Consider the following example: A researcher wishes to know the minimum sample size required to ascertain the average monthly wage of all employees who work for the city of Boston, with a Margin of Error of ±$10 and a confidence level of 95%. This researcher knows from a previous survey that the standard deviation of the monthly wage of these employees is $100. This researcher also knows that the total number of employees who work for the city of Boston is approximately 5,000.

Substituting the above values into the equation yields:

\[ D = \frac{(10)^2}{(1.96)^2} = 26.03 \]

\[ n \ge \frac{5000\,(100)^2}{(5000 - 1)(26.03) + (100)^2} = 356.82 \]

The correct sample size in this example is 357. Since the sample size can't be anything other than a whole number, you should round up to the next integer.

Recall the definition of the Margin of Error.
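The worked substitution above follows the pattern n ≥ Nσ² / ((N − 1)D + σ²), with D equal to the squared Margin of Error divided by z². A minimal Python sketch of that calculation, and of its counterpart for proportions, might look like the following. The function and parameter names are illustrative assumptions, not part of the original documentation, and z is passed in as the table value (1.96, 1.645) rather than computed.

```python
import math


def sample_size_mean(N, sigma, margin, z):
    """Unrounded minimum n for estimating a mean.

    N: population size, sigma: population standard deviation,
    margin: Margin of Error in the units of measure, z: z value
    for the chosen confidence level (e.g. 1.96 for 95%).
    """
    D = margin ** 2 / z ** 2
    return N * sigma ** 2 / ((N - 1) * D + sigma ** 2)


def sample_size_proportion(N, p, margin, z):
    """Unrounded minimum n for estimating a proportion.

    p: a priori proportion (0.5 if nothing is known),
    margin: Margin of Error as a fraction (0.03 for 3%).
    """
    pq = p * (1 - p)
    D = margin ** 2 / z ** 2
    return N * pq / ((N - 1) * D + pq)


# Boston monthly wage example: N = 5,000, sigma = $100, margin = $10, 95% confidence.
n_raw = sample_size_mean(N=5000, sigma=100, margin=10, z=1.96)
n = math.ceil(n_raw)  # always round up to the next whole respondent
print(round(n_raw, 2), n)
```

Rounding is kept outside the functions so that the unrounded figure can be compared with the hand calculation before `math.ceil` is applied.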
Referring to the example above, where the Margin of Error is ±$10, and assuming that the estimate of the average monthly wage is $512, the true average monthly wage is estimated to lie somewhere in the range from $512 - $10 to $512 + $10, or between $502 and $522. Remember also that in the example the researcher only wished a confidence level of 95%. With respect to the Margin of Error of $10, this means that an interval constructed in this way will contain the true population parameter 95% of the time. In other words, if this survey were repeated 100 times, in about 95 of those times the interval formed by the estimate ± $10 would contain the true population parameter. Holding all else constant, as the Margin of Error decreases the sample size increases. In a similar fashion, again holding all else constant, to increase the confidence level you must increase the sample size.

Formula 2 (proportions): The number of standard deviations relative to the mean of the standard normal curve corresponding to the level of confidence is represented by \(z_{\alpha/2}\). The Margin of Error is a value added to and subtracted from the estimate to establish an interval that contains the true population parameter at a given level of confidence. p is the a priori assumption of the population proportion. If no information is available, p should be assumed to be 0.5. Why p should be assumed to be 0.5 will be explained later. N represents the population size.

Consider the following example: A newspaper wishes to estimate the proportion of registered voters who plan on voting for a particular candidate for mayor. This newspaper wants a confidence level of 90% with a Margin of Error of ±3%. A prior survey shows that 25% of the registered voters plan on voting for this particular candidate. The population of registered voters in this city is 6,500. What is the minimum sample size that is required?
Substituting the above values into the equation yields:

\[ D = \frac{(0.03)^2}{(1.645)^2} = 0.00033259 \]

\[ n \ge \frac{6500\,(0.25)(0.75)}{(6500 - 1)(0.00033259) + (0.25)(0.75)} = 518.84 \]

The correct sample size in this example is 519 because, as stated previously, the sample size should be rounded up to the next integer.

MORE ON CALCULATING THE SAMPLE SIZE

Notice that the Margin of Error in the example involving averages is expressed in the units of measure, in this case dollars, while the Margin of Error in the example involving proportions is expressed as a percentage. The Margin of Error for proportions can never be expressed as anything other than a percentage. However, the Margin of Error for means can be expressed either in percentage terms or in the units of measure. For example, referring back to the average monthly wage of Boston City employees, the Margin of Error was established at $10, but it could have been established at 10%. This 10% would have represented a range of ±10% around the estimate, a range that would contain the true population parameter at a given level of confidence. If the estimate is $1,000, then the range containing the true parameter at a Margin of Error of ±10% is $1,000 ± (0.1)($1,000), or from $900 to $1,100.

Stating the Margin of Error in percentage terms for sample sizes dealing with averages makes sample size determination more difficult, for the following reasons. One, you must convert the percentage into units of measure in order to use the sample size formula for averages shown above. This conversion, though, presents its own problem. Since you haven't taken a sample yet, you don't know the estimate, and therefore you can't convert to a Margin of Error in terms of the units of measure. This fact precludes you from using the above sample size formula for averages.
Two, you must instead use a version of the sample size formula for averages that takes the Margin of Error as a percentage of the mean, in which:

\(z_{\alpha/2}\) represents the number of standard deviations of the standard normal curve corresponding to the level of confidence.
ε represents the percent of the Margin of Error.
N represents the size of the population.

Refer to the above example of monthly wage: if N = 5,000 and we assume that x̄ = $1,000, ε will be $10/$1,000, or 1%. Remember from this example that the Margin of Error is $10. Also remember that the confidence level is 95% and the standard deviation is $100. Substituting these figures into the formula yields n = 356.75, which is very close to the previous answer of n = 356.82. If we assume x̄ = $1, then ε will be 10, but n will still be 356.75. So you can see that x̄ can take on any value, as long as ε is set so that the Margin of Error in dollars stays the same. Obviously though, you would want to have a pretty good idea of what the value of x̄ should be and then establish a Margin of Error based around that value. The problem is that very often you have no idea of what the value of x̄ should be. The very reason you're taking a sample is to discover the value of x̄.

The corresponding formula for proportions is as follows:⁴

\[ n \ge \frac{z_{\alpha/2}^2\, N\, P_y (1 - P_y)}{(N - 1)\,\varepsilon^2 P_y^2 + z_{\alpha/2}^2\, P_y (1 - P_y)} \]

Where:

\(z_{\alpha/2}\) represents the number of standard deviations of the standard normal curve corresponding to the level of confidence.
Py is the a priori assumption of the population parameter. If no information is available, Py should be assumed to be 0.5.
N represents the population size.
ε represents the percent of the Margin of Error.

Since the Margin of Error in this case is itself a percentage, ε is a percent of a percent. In other words, if the Margin of Error is 3%, then ε is the percentage that 3% represents of the true population parameter. This fact makes working with this formula confusing, which is why you should avoid having to use it, and use the previous sampling formula for proportions instead.
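To make the role of ε concrete, here is a hedged Python sketch of the percentage-based calculations. It assumes the form n ≥ z²NV² / (z²V² + (N − 1)ε²) for means, with V = σ/x̄, and the analogous form for proportions; the function names, and the `use_n_minus_1` switch, are illustrative assumptions rather than anything taken from the cited texts. The switch is there because the quoted figure of 356.75 is reproduced when N replaces N − 1 in the denominator, while the N − 1 form is algebraically the same as the earlier formula for averages and returns 356.82.

```python
def n_mean_relative(N, sigma, mean_guess, eps, z, use_n_minus_1=True):
    """Unrounded n for a mean when the Margin of Error is eps * mean_guess.

    mean_guess is an assumed value for the unknown mean; eps is the
    Margin of Error as a fraction of it (0.01 for 1%).
    """
    v2 = (sigma / mean_guess) ** 2          # squared coefficient of variation
    pop_term = (N - 1) if use_n_minus_1 else N
    return z ** 2 * N * v2 / (z ** 2 * v2 + pop_term * eps ** 2)


def n_proportion_relative(N, p, eps, z):
    """Unrounded n for a proportion when the Margin of Error is eps * p."""
    pq = p * (1 - p)
    return z ** 2 * N * pq / ((N - 1) * eps ** 2 * p ** 2 + z ** 2 * pq)


# Monthly wage example: $10 on an assumed mean of $1,000 is eps = 0.01.
print(round(n_mean_relative(5000, 100, 1000, 0.01, 1.96), 2))
print(round(n_mean_relative(5000, 100, 1000, 0.01, 1.96, use_n_minus_1=False), 2))

# The assumed mean cancels out as long as eps * mean_guess stays at $10.
print(round(n_mean_relative(5000, 100, 1, 10, 1.96, use_n_minus_1=False), 2))

# Registered voter example: 3% on a prior of 25% is eps = 0.12.
print(round(n_proportion_relative(6500, 0.25, 0.12, 1.645), 2))
```

Because only the product of ε and the assumed mean enters the result, the percentage form adds no information beyond the Margin of Error in units of measure; it only changes how that margin is stated.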
Referring back to the example of registered voters above and using the formula immediately above, the sample size n will be 518.83, which is almost identical to the previous answer of 518.84. This is based on the following: a population of 6,500, a priori information of 25%, a confidence level of 90%, and a Margin of Error of 3%, in which case ε is 0.03/0.25, or 0.12. That is, 0.03 is 12% of 0.25, or 12% of the true population parameter.

UNDERSTANDING VARIANCE IN THE CONTEXT OF SAMPLING

Even a casual study of sample size formulae will reveal that the size of the sample depends on the variance (the standard deviation squared), or spread, of the population distribution. A larger variance requires a larger sample size. For example, referring back to the average monthly wage of Boston City employees, if we assumed that the standard deviation was $50 instead of $100, the required sample size would be 95 instead of 357.

The problem is determining the variance of a population distribution. One solution is to take a sample of size 30 from the population. Then calculate the sample variance and, from the sample variance, estimate the population variance. The formula for doing this is as follows:⁵

\[ \hat{\sigma}^2 = \frac{N - 1}{N}\, s^2 \]

Where:

s² = the sample variance
N = the population size
σ̂² = the estimate of the population variance.

(Often the variance of the sample, which divides by n rather than by n - 1, is calculated instead of the sample variance. In this case the estimate of the population variance would be

\[ \hat{\sigma}^2 = \frac{n}{n - 1} \cdot \frac{N - 1}{N}\, s_n^2 \]

where \(s_n^2\) is the variance of the sample and n is the sample size.)

Consider the following example: A car manufacturer wants to know the average gas mileage of its newly produced X-mobile. How large a sample should it select if it wants a confidence level of 95% with a Margin of Error of ±½ gallon? The manufacturer has produced 3,000 X-mobiles. The car maker also has no knowledge about the distribution of this population (that is, the distribution of the population of gas mileage per car).
Since they have no knowledge about the distribution of this population, and hence no knowledge of the population variance, the first step is to take a random sample of 30 cars and measure each of their gas mileages. Let's suppose that the sample variance of their gas mileages turns out to be 3 gallons. Then the population variance will be as follows:

\[ \hat{\sigma}^2 = \frac{3000 - 1}{3000}\,(3) = 2.999 \text{ gallons} \]

As you can see, if N is very large, σ̂² is practically equal to s². Therefore, for all practical purposes, σ² = 3. Hence, σ = 1.73 gallons. Substituting the proper values into the equation:

\[ D = \frac{(0.5)^2}{(1.96)^2} = 0.06508 \qquad n \ge \frac{3000\,(3)}{(3000 - 1)(0.06508) + 3} = 45.41 \]

shows that the approximate minimum sample size should be 45.41, or 46. (Calculation of the exact size using this iterative method may require more complex methods; see Sampling Techniques by Cochran.⁶) Since they have already sampled 30, they only need to sample an additional 16. In this manner, the determination of sample size becomes an iterative exercise.

One other approach to solving the problem of unknown population variance is to consider the range of the population. If you know the range, then dividing the range by 4 will provide an approximate value for the population standard deviation.⁷ (Remember that the standard deviation is the square root of the variance.) The justification for dividing the range by 4 is that, for many population distributions, about 95% of the values fall within a span of 4 standard deviations, that is, within 2 standard deviations on either side of the mean. Referring to the example above, if the car maker knew that the gas mileage would be no higher than 30 and no lower than 20, then the range is 10, meaning the approximate population standard deviation would be 10/4, or 2.5. Using a standard deviation of 2.5, the car maker would need a sample size of 94. As you can see, the car maker would need a larger sample size if they chose to use the range method rather than the iterative method.

Knowing the population variance is not a problem for calculating the sample size for a proportion.
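The pilot-sample and range approaches described above can be compared side by side. The following Python sketch is illustrative only: the function names are assumptions, the pilot variance of 3 and the range of 20 to 30 are the figures quoted in the X-mobile example, and the sample size formula is the one used for averages throughout this document.

```python
import math


def estimate_population_variance(s2, N):
    """Estimate the population variance from the sample variance s2."""
    return (N - 1) / N * s2


def sample_size_mean(N, variance, margin, z):
    """Unrounded minimum n for a mean, given the population variance."""
    D = margin ** 2 / z ** 2
    return N * variance / ((N - 1) * D + variance)


N = 3000          # X-mobiles produced
margin = 0.5      # Margin of Error quoted in the example
z = 1.96          # 95% confidence
pilot_n = 30      # size of the pilot sample

# Approach 1: pilot sample. The text treats the estimate (2.999) as 3.
var_hat = estimate_population_variance(3.0, N)
n_pilot = math.ceil(sample_size_mean(N, 3.0, margin, z))
additional = max(n_pilot - pilot_n, 0)   # cars still to be sampled

# Approach 2: range rule, standard deviation taken as range / 4.
sigma_range = (30 - 20) / 4
n_range = math.ceil(sample_size_mean(N, sigma_range ** 2, margin, z))

print(round(var_hat, 3), n_pilot, additional, n_range)
```

The range rule gives the larger requirement here simply because 2.5 is a more conservative guess at the standard deviation than the 1.73 suggested by the pilot sample.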
You simply assume the maximum possible variance, meaning Py is set to 0.5. However, if you have a priori knowledge about a proportion, setting Py to this value will give you a smaller sample size. For instance, referring back to the example of registered voters, a prior survey indicated that 25% of voters planned on voting for a particular candidate. Therefore, Py equaled 0.25. If no prior survey had been taken, then Py would equal 0.5, which would ensure the largest sample size. At Py = 0.5 the sample size would have been 674 instead of 519. You can, however, when no a priori information is available for proportions, compute the correct sample size iteratively, similarly to the example above.

PROPORTIONAL VARIABLES

So far, in dealing with proportions, only dichotomous situations have been discussed. In the example of the registered voters, the question is whether the voters will vote for a particular candidate or not vote for him. Now, let's suppose that many candidates are running and we are interested in knowing the proportion of voters who will vote for each candidate. Again, if no a priori information exists, then Py should be 0.5.⁸

Another example of a multi-category variable is the Likert scale. Likert scales are scales, e.g. 1 to 5, on which respondents to a survey are asked to rate something. For example:

5 = "very satisfied"
4 = "satisfied"
3 = "neutral"
2 = "unsatisfied"
1 = "very unsatisfied"

These numbers represent qualitative attributes. Therefore, theoretically speaking, only sampling formulae for proportions should be used. The reasoning behind this restriction stems from the conclusion that you really cannot and should not combine the attitudes of respondents.
For instance, if 10 respondents indicate that they are "very unsatisfied" and another 10 indicate that they are "very satisfied," then the average is not "neutral," because not one respondent indicated that he or she is "neutral." However, some researchers do take averages of Likert scales. Since Likert scales are treated both as qualitative and as quantitative attributes, either sampling formulae for proportions or for means may be used. You should use the sampling formulae for proportions if the Likert scales are going to be treated as qualitative attributes. Likewise, you should use the sampling formulae for means if the Likert scales are going to be treated as quantitative attributes.

CALCULATING THE SAMPLE SIZE OF A QUESTIONNAIRE

Calculating the correct sample size for a survey containing multiple questions, or questionnaire, is a simple extension of calculating a sample size for one question. In a survey containing multiple questions, each question involves either an average or a proportion, each with its own set of statistical characteristics. In the case of questions that deal with averages, one such characteristic necessary to arrive at the correct sample size is the population variance. Furthermore, knowing the population variance will help you to determine a Margin of Error. From the Margin of Error and the level of confidence, both of which you choose, a sample size is then calculated for each question in the survey.

For example, let's suppose that you are designing a survey for Boston City employees containing the following three questions:

Q1) What is your monthly wage?
Q2) How many hours a week do you work?
Q3) Do you enjoy your work?

Assuming that you know that the population standard deviation (the square root of the variance) for monthly wage is $100 and that the population is 5,000, you decide on a confidence level of 95% with a Margin of Error of ±$10. The correct sample size for Q1 is 357.
You also know that the population standard deviation for hours worked per week is 2 hours. You decide that you want a confidence level of 95% with a Margin of Error of ±½ an hour. The correct sample size for Q2 is then 61. You don't know anything about the proportion of people who enjoy their work, but from a trial sample of 30 you ascertain that just as many employees don't enjoy their work as enjoy their work. You decide, again, on a confidence level of 95% with a Margin of Error of 5%. The correct sample size for Q3 is then 381.

As you can see, the required sample size for each question is different. The obvious next step is to set the sample size for this survey using the question that yielded the highest sample size, in this case Q3, whose required sample size is 381. Unfortunately, more than 381 people will probably need to be contacted. The problem is that respondents often don't answer all the questions, or give invalid responses such as "don't know." Therefore, before a survey is terminated, each question needs to be checked to ensure that it has the number of valid responses required to fulfill its sample size. In this example, you might expect that over 400 people would need to be contacted. You might also find that the reason for having to contact so many people is that many people refused to answer Q1. Remember that Q1 only required a sample size of 357. In this case, the sample size requirements for Q2 and Q3 would easily be fulfilled. The fact that Q2 and Q3 are "over sampled" is not a problem, though. In fact, over sampling is never a problem (except when considering cost), but under sampling is always a problem. Hence, when in doubt as to the exact minimum sample size, be sure to take a sample that exceeds the minimum sample size.
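The per-question check described above is easy to automate. The sketch below is a hypothetical illustration in Python: the required sizes are the ones quoted for Q1 to Q3, while the counts of valid responses are made-up numbers chosen only to show a case where Q1 falls short after roughly 400 contacts.

```python
def survey_target(required):
    """Initial target: the largest per-question requirement."""
    return max(required.values())


def shortfalls(required, valid_counts):
    """Return, per question, how many valid responses are still missing."""
    missing = {}
    for question, needed in required.items():
        have = valid_counts.get(question, 0)
        if have < needed:
            missing[question] = needed - have
    return missing


# Required sample sizes quoted in the text.
required = {"Q1": 357, "Q2": 61, "Q3": 381}

# Hypothetical tallies of valid responses (invented for illustration).
valid_counts = {"Q1": 340, "Q2": 395, "Q3": 390}

print(survey_target(required))
print(shortfalls(required, valid_counts))
```

An empty result from `shortfalls` would indicate that every question has met its own requirement and that the survey can be closed.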
Although proper and careful calculation of survey sample size may seem time consuming and somewhat complicated, it will ensure that enough data is gathered to yield good estimates and to enable conclusive analytical results.

1. Scheaffer, Mendenhall & Ott, Elementary Survey Sampling, Fifth Edition, page 95.
2. Ibid., page 99.
3. Levy & Lemeshow, Sampling of Populations: Methods and Applications, page 62.
4. Ibid., page 62.
5. Ibid., page 22.
6. Cochran, Sampling Techniques, pages 78 and 79.
7. See Endnote #1, page 94.
8. Narins, The Finite Population Corelation, in the SPSS journal.