SELECTING THE SAMPLE SIZE FOR ESTIMATING POPULATION MEANS AND TOTALS

At some point in the design of the survey, someone must make a decision
about the size of the sample to be selected from the population. So far we have
discussed a sampling procedure (simple random sampling) but have said nothing
about the number of observations to be included in the sample. The implications
of such a decision are obvious. Observations cost money. Hence if the sample is
too large, time and talent are wasted. Conversely, if the number of observations
included in the sample is too small, we have bought inadequate information for
the time and effort expended and have again been wasteful.
The number of observations needed to estimate a population mean µ with a
bound on the error of estimation of magnitude B is found by setting two
standard deviations of the estimator, ȳ, equal to B and solving this expression
for n. That is, we must solve
\[ B = 2\sqrt{V(\bar{y})} = 2\sqrt{\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)} \]
Solving for n gives the required sample size:
\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} \quad\text{where}\quad D = \frac{B^2}{4} \]
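The algebra that leads from the bound to the expression for n is short. One way to fill it in, using only the symbols already defined (B, σ², N, n, and D = B²/4), is sketched here:

```latex
\begin{align*}
B &= 2\sqrt{\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)}
  && \text{(bound set equal to two standard deviations)} \\
\frac{B^2}{4} = D &= \frac{\sigma^2 (N-n)}{n(N-1)}
  && \text{(square both sides and divide by 4)} \\
n(N-1)D &= N\sigma^2 - n\sigma^2
  && \text{(multiply through by } n(N-1)\text{)} \\
n\left[(N-1)D + \sigma^2\right] &= N\sigma^2
  && \text{(collect the terms in } n\text{)} \\
n &= \frac{N\sigma^2}{(N-1)D + \sigma^2}
\end{align*}
```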
Solving for n in a practical situation presents a problem because the population
variance σ² is unknown. Since a sample variance s² is frequently available from
prior experimentation, we can obtain an approximate sample size by replacing σ²
with s². We will illustrate a method for guessing a value of σ² when very little
prior information is available. If N is large, as it usually is, the (N − 1) can be
replaced by N in the denominator of the sample size equation to get
\[ n = \frac{N\sigma^2}{ND + \sigma^2} \]
Example 5
The average amount of money µ for a hospital's accounts receivable must be
estimated. Although no prior data are available to estimate the population
variance σ², it is known that most accounts lie within a $100 range. There are N =
1000 open accounts. Find the sample size needed to estimate µ with a bound on
the error of estimation B = $3.
Solution
We need an estimate of σ², the population variance. Since the range is often
approximately equal to four standard deviations (4σ), one-fourth of the range
will provide an approximate value of σ. Hence σ is taken to be approximately
25 and σ² = 625. Then
\[ D = \frac{B^2}{4} = \frac{3^2}{4} = 2.25 \]
\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} = \frac{1000(625)}{999(2.25) + 625} = 217.56 \]
That is, we need approximately 218 observations to estimate µ, the mean
accounts receivable, with a bound on the error of estimation of $3.00.
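The calculation in Example 5 can be sketched in a few lines of Python. The function name `sample_size_for_mean` and its parameter names are illustrative choices, not notation from the text; the final rounding up to a whole observation follows the example's practice.

```python
import math


def sample_size_for_mean(N, sigma2, B):
    """Unrounded sample size for estimating a population mean to within B.

    Implements n = N*sigma2 / ((N - 1)*D + sigma2) with D = B**2 / 4.
    """
    D = B ** 2 / 4
    return N * sigma2 / ((N - 1) * D + sigma2)


# Example 5: most accounts lie within a $100 range, so sigma is guessed
# as one-fourth of the range.
sigma_guess = 100 / 4
n_exact = sample_size_for_mean(N=1000, sigma2=sigma_guess ** 2, B=3)
n_needed = math.ceil(n_exact)  # round up to a whole number of accounts
print(round(n_exact, 2), n_needed)  # 217.56 218
```

Rounding up rather than to the nearest integer keeps the bound from being exceeded when the computed value falls just above a whole number.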
In like manner, we can determine the number of observations needed to
estimate a population total τ, with a bound on the error of estimation of
magnitude B. The required sample size is found by setting two standard
deviations of the estimator equal to B and solving this expression for n. Proceeding
as we did earlier, we get
\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} \quad\text{where}\quad D = \frac{B^2}{4N^2} \]
Example 6
An investigator is interested in estimating the total weight gain in 0 to 4 weeks
for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would
be time-consuming and tedious. Therefore, determine the number of chicks to be
sampled in this study in order to estimate τ with a bound on the error of
estimation equal to 1000 grams. Many similar studies on chick nutrition have
been run in the past. Using data from these studies, the investigator found that σ²,
the population variance, was approximately equal to 36.00 grams². Determine
the required sample size.
Solution
We can obtain an approximate sample size using the previous equation with σ²
equal to 36.0.
\[ D = \frac{B^2}{4N^2} = \frac{(1000)^2}{4(1000)^2} = 0.25 \]
\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} = \frac{1000(36)}{999(0.25) + 36} = 125.98 \]
The investigator, therefore, needs to weigh n = 126 chicks to estimate τ, the
total weight gain for N = 1000 chickens in 0 to 4 weeks, with a bound on the
error of estimation equal to 1000 grams.
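The only change from the sample size for a mean is the definition of D, which a short sketch makes easy to see. The function name `sample_size_for_total` is an illustrative choice, not notation from the text.

```python
import math


def sample_size_for_total(N, sigma2, B):
    """Unrounded sample size for estimating a population total to within B.

    Same form as the formula for a mean, but with D = B**2 / (4 * N**2).
    """
    D = B ** 2 / (4 * N ** 2)
    return N * sigma2 / ((N - 1) * D + sigma2)


# Example 6: N = 1000 chicks, variance of about 36 grams^2, bound of 1000 grams.
n_exact = sample_size_for_total(N=1000, sigma2=36.0, B=1000)
n_needed = math.ceil(n_exact)  # round up to a whole number of chicks
print(round(n_exact, 2), n_needed)  # 125.98 126
```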
ESTIMATION OF A POPULATION PROPORTION
The investigator conducting a sample survey is frequently interested in
estimating the proportion of the population that possesses a specified characteristic.
For example, a congressional leader investigating the merits of an 18-year-old
voting age may want to estimate the proportion of the potential voters in the
district between the ages of 18 and 21. A marketing research group may be
interested in the proportion of the total sales market in diet preparations that is
attributable to a particular product. That is, what percentage of sales is
accounted for by a particular product? A forest manager may be interested in the
proportion of trees with a diameter of 12 inches or more. Television ratings are
often determined by estimating the proportion of the viewing public that watches
a particular program.
You will recognize that all these examples exhibit a characteristic of the
binomial experiment, that is, an observation either does belong or does not
belong to the category of interest. For example, one can estimate the proportion of
eligible voters in a particular district by examining population census data for
several of the precincts within the district. An estimate of the proportion of voters
between 18 and 21 years of age for the entire district will be the fraction of
potential voters from the precincts sampled that fell into this age range.
In subsequent discussion we denote the population proportion and its estimator
by the symbols p and p̂, respectively. The properties of p̂ for simple random
sampling parallel those of the sample mean ȳ if the response measurements are
defined as follows: Let yᵢ = 0 if the ith element sampled does not possess the
specified characteristic and yᵢ = 1 if it does. Then the total number of elements in a
sample of size n possessing a specified characteristic is
\[ \sum_{i=1}^{n} y_i \]
If we draw a simple random sample of size n, the sample proportion p̂ is the
fraction of the elements in the sample that possess the characteristic of interest.
For example, the estimate p̂ of the proportion of eligible voters between the
ages of 18 and 21 in a certain district is
\[ \hat{p} = \frac{\text{number of voters sampled between the ages of 18 and 21}}{\text{number of voters sampled}} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y} \]
In other words, p̂ is the average of the 0 and 1 values from the sample. Similarly,
we can think of the population proportion as the average of the 0 and 1 values
for the entire population (that is, p = µ).
Estimator of the population proportion p:
\[ \hat{p} = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]
Estimated variance of p̂:
\[ \hat{V}(\hat{p}) = \frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right) \]
where q = 1 − p and q̂ = 1 − p̂.
Bound on the error of estimation:
\[ 2\sqrt{\hat{V}(\hat{p})} = 2\sqrt{\frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right)} \]
Example 7
A simple random sample of n = 100 college seniors was selected to estimate (1)
the fraction of N = 300 seniors going on to graduate school and (2) the fraction
of students that have held part-time jobs during college. Let yᵢ and xᵢ (i = 1, 2,
..., 100) denote the responses of the ith student sampled. We will set yᵢ = 0 if the
ith student does not plan to attend graduate school and yᵢ = 1 if he does.
Similarly, let xᵢ = 0 if he has not held a part-time job sometime during college
and xᵢ = 1 if he has. Using the sample data presented in the accompanying table,
estimate p₁, the proportion of seniors planning to attend graduate school, and p₂,
the proportion of seniors who have had a part-time job sometime during their
college careers (summers included).
Student    y    x
1          1    0
2          0    1
3          0    1
4          1    1
5          0    0
6          0    0
7          0    …
...
96         0    1
97         1    0
98         0    1
99         0    1
100        1    1

Σyᵢ = 15        Σxᵢ = 65
Solution:
The sample proportions are given by
\[ \hat{p}_1 = \frac{15}{100} = 0.15 \qquad \hat{p}_2 = \frac{65}{100} = 0.65 \]
Bounds on the error of estimation for p₁ and p₂, respectively, are
\[ 2\sqrt{\hat{V}(\hat{p}_1)} = 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n-1}\left(\frac{N-n}{N}\right)} = 2\sqrt{\frac{(.15)(.85)}{99}\left(\frac{300-100}{300}\right)} = 2(0.0293) = 0.059 \]
and
\[ 2\sqrt{\hat{V}(\hat{p}_2)} = 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n-1}\left(\frac{N-n}{N}\right)} = 2\sqrt{\frac{(.65)(.35)}{99}\left(\frac{300-100}{300}\right)} = 2(0.0391) = 0.078 \]
Thus we estimate that 15% of the seniors plan to attend graduate school, with
a bound on the error of estimation equal to .059. We estimate that 65% of the
seniors have held a part-time job during college, with a bound on the error of
estimation equal to .078.
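The two estimates in Example 7 use the same recipe, so a small helper covers both. This is a minimal sketch; the function name `proportion_with_bound` and its arguments are illustrative and do not come from the text.

```python
import math


def proportion_with_bound(successes, n, N):
    """Return (p_hat, bound) for a simple random sample of size n from N.

    p_hat is the sample proportion; bound is two estimated standard
    deviations, using the estimated variance
    (p_hat * q_hat / (n - 1)) * ((N - n) / N).
    """
    p_hat = successes / n
    q_hat = 1 - p_hat
    var_hat = (p_hat * q_hat / (n - 1)) * ((N - n) / N)
    return p_hat, 2 * math.sqrt(var_hat)


# Example 7: 15 of 100 sampled seniors plan graduate school,
# 65 of 100 have held a part-time job; N = 300 seniors in all.
p1, b1 = proportion_with_bound(15, 100, 300)
p2, b2 = proportion_with_bound(65, 100, 300)
print(f"{p1:.2f} +/- {b1:.3f}")  # 0.15 +/- 0.059
print(f"{p2:.2f} +/- {b2:.3f}")  # 0.65 +/- 0.078
```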
We have shown that the population proportion p can be regarded as the average
of the 0 and 1 values for the entire population. Hence the problem of determining
the sample size required to estimate p to within B units should be analogous to
determining a sample size for estimating µ with a bound on the error of estimation
B. You will recall that the required sample size for estimating µ is given by
\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} \quad\text{where}\quad D = \frac{B^2}{4} \]
The corresponding sample size needed to estimate p can be found by replacing
σ² with pq.
Sample size required to estimate p with a bound on the error of estimation B:
\[ n = \frac{Npq}{(N-1)D + pq} \quad\text{where}\quad D = \frac{B^2}{4} \]
In a practical situation we do not know p. An approximate sample size can be
found by replacing p with an estimated value. Frequently, such an estimate can be
obtained from similar past surveys. However, if no such prior information is
available, we can take p = .5 to obtain a conservative sample size (one that is
likely to be larger than required). This yields
\[ n = \frac{N}{4(N-1)D + 1} \]
Example 8
Student government leaders at a college want to conduct a survey to determine
the proportion of students that favors a proposed honor code. Since interviewing
N = 2000 students in a reasonable length of time is almost impossible,
determine the sample size (number of students to be interviewed) needed to
estimate p with a bound on the error of estimation of magnitude B = .05.
Assume that no prior information is available to estimate p.
Solution
We can approximate the required sample size when no prior information is
available by setting p = .5. We have
\[ D = \frac{B^2}{4} = \frac{(.05)^2}{4} = .000625 \]
\[ n = \frac{N}{4(N-1)D + 1} = \frac{2000}{4(1999)(.000625) + 1} = 333.47 \]
That is, 334 students must be interviewed to estimate the proportion of students
that favors the proposed honor code with a bound on the error of estimation of B
= .05.
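A short sketch of the sample size calculation for a proportion follows. The function name `sample_size_for_proportion` is illustrative, and the default of p = 0.5 encodes the conservative choice described above for the case where no prior estimate is available.

```python
import math


def sample_size_for_proportion(N, B, p=0.5):
    """Unrounded sample size for estimating a proportion to within B.

    Implements n = N*p*q / ((N - 1)*D + p*q) with D = B**2 / 4 and q = 1 - p.
    With p = 0.5 this is algebraically the same as N / (4*(N - 1)*D + 1).
    """
    D = B ** 2 / 4
    pq = p * (1 - p)
    return N * pq / ((N - 1) * D + pq)


# Example 8: N = 2000 students, B = .05, no prior information about p.
n_exact = sample_size_for_proportion(N=2000, B=0.05)
n_needed = math.ceil(n_exact)  # round up to a whole number of students
print(round(n_exact, 2), n_needed)  # 333.47 334
```

Since pq is largest at p = 0.5, any other guess for p yields a smaller n, which is why the default is the safe choice.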
Example 9
Referring to Example 8, suppose that in addition to estimating the proportion of
students that favors the proposed honor code, student government leaders also
want to estimate the proportion of students who feel the student union building
adequately serves their needs. Determine the combined sample size required for a
survey to estimate p₁, the proportion that favors the proposed honor code, and
p₂, the proportion that believes the student union adequately serves their needs,
with bounds on the errors of estimation of magnitude B₁ = .05 and B₂ = .07.
Although no prior information is available to estimate p₁, approximately 60% of
the students believed the union adequately met their needs in a similar survey run
the previous year.
Solution
In this example we must determine a sample size n that will allow us to
estimate p₁ with a bound B₁ = .05 and p₂ with a bound B₂ = .07. First, we
determine the sample sizes that satisfy each objective separately. The larger of
the two will then be the combined sample size for a survey to meet both
objectives. From Example 8 the sample size required to estimate p₁ with a
bound on the error of estimation of B₁ = .05 was n = 334 students. We can use
data from the survey of the previous year to determine the sample size needed
to estimate p₂. We have
\[ D = \frac{B^2}{4} = \frac{(.07)^2}{4} = .001225 \]
\[ n = \frac{Npq}{(N-1)D + pq} = \frac{2000(.6)(.4)}{(1999)(.001225) + (.6)(.4)} = 178.52 \]
That is, 179 students must be interviewed to estimate p₂. The sample size required
to achieve both objectives in one survey is 334, the larger of the two sample sizes.
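When a survey has several objectives, the rule used in Example 9 is to compute each required sample size separately and take the largest. The sketch below applies that rule; the function name, the `objectives` list, and its keys are illustrative choices, not notation from the text.

```python
import math


def sample_size_for_proportion(N, B, p=0.5):
    """Unrounded n = N*p*q / ((N - 1)*D + p*q), with D = B**2 / 4."""
    D = B ** 2 / 4
    pq = p * (1 - p)
    return N * pq / ((N - 1) * D + pq)


N = 2000
objectives = [
    # No prior information on the honor code question: use p = 0.5.
    {"name": "honor code", "B": 0.05, "p": 0.5},
    # Last year's survey suggests about 60% for the student union question.
    {"name": "student union", "B": 0.07, "p": 0.6},
]

# Required size for each objective on its own, rounded up.
sizes = {
    o["name"]: math.ceil(sample_size_for_proportion(N, o["B"], o["p"]))
    for o in objectives
}
combined = max(sizes.values())  # the largest requirement satisfies every bound
print(sizes, combined)  # {'honor code': 334, 'student union': 179} 334
```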
SAMPLING WITH PROBABILITIES PROPORTIONAL TO SIZE
Previous work in this chapter has depended on the sample being a simple
random sample, according to Definition 1. We will now show that varying the
probabilities with which different sampling units are selected is sometimes
advantageous. Suppose, for example, we wish to estimate the number of job
openings in a city by sampling industrial firms within the city. Typically, many such
firms will be quite small and employ few workers, while some firms will be very
large. In a simple random sample, size of firm is not taken into account, and a
typical sample will contain mostly small firms. But the information desired
(number of job openings) is heavily influenced by the large firms. Thus we
should be able to improve on the simple random sample by giving the large
firms a greater chance to appear in the sample. A method for accomplishing this
sampling is called sampling with probabilities proportional to size, or pps
sampling.
For a sample y₁, y₂, ..., yₙ from a population of size N, let πᵢ = probability that yᵢ
appears in the sample. Unbiased estimators of τ and µ, along with their
estimated variances and bounds on the error of estimation, are as follows:
Estimator of the population total τ:
\[ \hat{\tau}_{pps} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\pi_i} \]
Estimated variance of τ̂_pps:
\[ \hat{V}(\hat{\tau}_{pps}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 \]
Bound on the error of estimation:
\[ 2\sqrt{\hat{V}(\hat{\tau}_{pps})} = 2\sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2} \]
Estimator of the population mean µ:
\[ \hat{\mu}_{pps} = \frac{1}{N}\hat{\tau}_{pps} = \frac{1}{Nn}\sum_{i=1}^{n}\frac{y_i}{\pi_i} \]
Estimated variance of µ̂_pps:
\[ \hat{V}(\hat{\mu}_{pps}) = \frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 \]
Bound on the error of estimation:
\[ 2\sqrt{\hat{V}(\hat{\mu}_{pps})} = 2\sqrt{\frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2} \]
These estimators are unbiased for any choices of πᵢ, but it is clearly in the best
interest of the experimenter to choose these probabilities so that the variances of
the estimators are as small as possible. The best practical way to choose the πᵢ's is
to choose them proportional to a known measurement that is highly correlated
with yᵢ. In the problem of estimating total number of job openings, firms can be
sampled with probabilities proportional to their total work force, which should be
known fairly accurately before the sample is selected. The number of job openings
per firm is not known before sampling, but it should be highly correlated with the
total number of workers in the firm.
Example 10
An investigator wishes to estimate the average number of defects per
board on boards of electronic components manufactured for installation in
computers. The boards contain varying numbers of components, and the
investigator feels that the number of defects should be positively correlated with
the number of components on a board. Thus pps sampling is used, with the
probability of selecting any one board for the sample being proportional to the
number of components on that board. A sample of n = 4 boards is to be selected
from the N = 10 boards of one day's production. The numbers of components on
the 10 boards are, respectively,
10, 12, 22, 8, 16, 24, 9, 10, 8, 31
Show how to select n = 4 boards with probabilities proportional to size.
Solution:
We list the number of components (our measure of size) in a column and list the
cumulative ranges and desired πᵢ's in adjacent columns, as follows:

Board    Number of components    Cumulative range    πᵢ
1        10                      1-10                10/150
2        12                      11-22               12/150
3        22                      23-44               22/150
4        8                       45-52               8/150
5        16                      53-68               16/150
6        24                      69-92               24/150
7        9                       93-101              9/150
8        10                      102-111             10/150
9        8                       112-119             8/150
10       31                      120-150             31/150

There are 150 components in the population to be sampled. We can think of
these components as being numbered from 1 to 150. The cumulative range
column keeps track of the interval of numbered components on each board.
Board number 1 has the first 10 components, board number 2 has components 11
through 22, and so on. The π’s are simply the number of components per board
divided by the total number of components. The boards having greater numbers
of components have larger probabilities of selection.
To choose the sample of n = 4 boards, we enter the random number table and select
four random numbers between 1 and 150. The numbers we selected were 14, 56, 94,
and 25. We locate these numbers in the cumulative range column. The boards
corresponding to those range intervals constitute the sample.
Since 14 lies in the range of board 2, that board enters the sample. Similarly, 56
lies in the range of board 5, 94 lies in the range of board 7, and 25 lies in the
range of board 3. Thus the sample consists of boards 2, 3, 5, and 7. These boards
have been selected with probabilities proportional to their numbers of components.
Note that with this method we could have sampled a particular board more than
once.
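The cumulative-range lookup in Example 10 can be sketched with a running total and a binary search. The helper name `board_for` is illustrative; the four draws are the ones read from the random number table in the example, not freshly generated values.

```python
from bisect import bisect_left
from itertools import accumulate

# Number of components on boards 1 through 10 (the measure of size).
components = [10, 12, 22, 8, 16, 24, 9, 10, 8, 31]

# Upper end of each board's cumulative range: 10, 22, 44, ..., 150.
upper_bounds = list(accumulate(components))


def board_for(number):
    """Return the 1-based board whose cumulative range contains `number`.

    `number` is assumed to lie between 1 and the total component count.
    """
    return bisect_left(upper_bounds, number) + 1


# Random numbers taken from the example.
draws = [14, 56, 94, 25]
sample = [board_for(d) for d in draws]
print(sample)  # [2, 5, 7, 3]
```

To draw new numbers instead of reusing the example's, `random.randint(1, upper_bounds[-1])` could stand in for the random number table. Each draw is independent, so the same board can be selected more than once, as the example notes.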
Example 11
After the sampling of Example 10 was completed, the numbers of
defects found on boards 2, 3, 5, and 7 were, respectively, 1, 3, 2, and 1. Estimate
the average number of defects per board, and place a bound on the error of
estimation.
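Solution
\[ \hat{\mu}_{pps} = \frac{1}{Nn}\sum_{i=1}^{n}\frac{y_i}{\pi_i} = \frac{1}{10(4)}\left[1\left(\frac{150}{12}\right) + 3\left(\frac{150}{22}\right) + 2\left(\frac{150}{16}\right) + 1\left(\frac{150}{9}\right)\right] = 1.71 \]
Since τ̂_pps = N µ̂_pps = 10(1.71) = 17.1, the estimated variance is
\[ \hat{V}(\hat{\mu}_{pps}) = \frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 \]
\[ = \frac{1}{10^2(4)(3)}\left[\left(\frac{150}{12} - 17.1\right)^2 + \left(3\left(\frac{150}{22}\right) - 17.1\right)^2 + \left(2\left(\frac{150}{16}\right) - 17.1\right)^2 + \left(\frac{150}{9} - 17.1\right)^2\right] = 0.0294 \]
\[ 2\sqrt{\hat{V}(\hat{\mu}_{pps})} = 2\sqrt{0.0294} = 0.34 \]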
The estimate of the average number of defects per board, with a bound on the
error of estimation, is then 1.71 ± .34 and the interval (1.37, 2.05) provides an
approximate 95% confidence interval for the average number of defects per
board.
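The estimate and bound in Example 11 can be checked with a short sketch of the pps estimators. The function name `pps_mean_estimate` is illustrative. The code carries full precision for the estimated total rather than the rounded value 17.1, so intermediate figures may differ slightly in the last digit from the hand calculation.

```python
import math


def pps_mean_estimate(y, pi, N):
    """Return (mu_hat, bound) for a pps sample.

    y  : observed values for the sampled units
    pi : selection probabilities of those units
    N  : population size
    """
    n = len(y)
    ratios = [yi / p for yi, p in zip(y, pi)]  # y_i / pi_i for each unit
    tau_hat = sum(ratios) / n                  # estimated population total
    mu_hat = tau_hat / N                       # estimated population mean
    var_tau = sum((r - tau_hat) ** 2 for r in ratios) / (n * (n - 1))
    var_mu = var_tau / N ** 2
    return mu_hat, 2 * math.sqrt(var_mu)


# Example 11: boards 2, 3, 5, and 7 have 12, 22, 16, and 9 components
# out of 150 in total, and showed 1, 3, 2, and 1 defects.
defects = [1, 3, 2, 1]
sizes = [12, 22, 16, 9]
pi = [s / 150 for s in sizes]
mu_hat, bound = pps_mean_estimate(defects, pi, N=10)
print(f"{mu_hat:.2f} +/- {bound:.2f}")  # 1.71 +/- 0.34
```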