A comparison of sample size calculation algorithms

Hyun-Tae Kim
Technology Center for Nuclear Control
Korea Atomic Energy Research Institute
Taejon, Korea
Abstract
When a sample is taken without replacement from a finite population that is suspected to contain defects, the probability that the sample will or will not contain defects is described by the hypergeometric density function. Usually the hypergeometric density function is approximated by one of two binomial density functions, depending on which approximation condition holds, and it is not always known before the sample size is calculated which condition is satisfied. Therefore simultaneous application of the two binomial density functions is often required. This paper compares three kinds of binomial approximation and a hypergeometric algorithm when applied to sample size calculation for various values of q, the over-all probability of classifying a defect as a defect when measured with up to three verification methods of the International Atomic Energy Agency (IAEA). The first approximation is the simply applied standard binomial approximation, which is currently used by the IAEA. The second is the correctly applied standard binomial approximation, with simultaneous application of the two binomial density functions. The third is the improved binomial approximation developed by Mr. J. L. Jaech.
Introduction
MC&A (Material Control and Accountancy) is one of the essential parts of the IAEA's conventional and strengthened safeguards. To verify that there is no diversion of nuclear material, the IAEA sampling plan specifies the number of items in a given stratum to be randomly selected and then measured by means of up to three verification methods [2,3]. Here a stratum usually means a grouping of items or batches having similar physical and chemical characteristics (e.g. volume, weight, isotopic composition, location). To determine the sample size, it is necessary to give the non-detection probability β and the over-all classification probability q that a defect is properly classified as a defect given that it has been measured. The functional form for β, formula (5), involves summing a large number of terms, each containing as one factor a probability calculated by the hypergeometric density function, where N is the number of items in the population, D is the number of defects in the population, n is the sample size, and d is the number of defects in the sample. The calculation of the non-detection probability is greatly simplified by approximating the hypergeometric density function with a binomial density function.

In the IAEA sampling plan, the standard binomial approximation is applied in a somewhat simplified way, and is therefore called here the simply applied standard binomial approximation (SB). On the other hand, simultaneous application of the two binomial density functions gives a more accurate sample size than SB, and is therefore called the correctly applied standard binomial approximation (CB). The improved binomial approximation (IB), developed by Mr. J. L. Jaech, gives a more accurate sample size than CB. The purpose of this paper is to compare the sample sizes given by these approximations [8] with those given by the hypergeometric algorithm [7] (HY) for various values of q.

Hypergeometric density function
The exact probability that a sample of n items, randomly selected without replacement, contains d defects is given by the following hypergeometric density function:

$$h(N, D, n, d) = \frac{\binom{D}{d}\binom{N-D}{n-d}}{\binom{N}{n}} \tag{1}$$

Standard binomial approximation [1]
The standard binomial approximation to h(N, D, n, d) is as follows:

when N > 50, f = n/N ≤ 0.10, D ≤ n

$$h(N, D, n, d) \approx b(D, f, d) = \binom{D}{d} f^{d} (1 - f)^{D-d} \tag{2}$$

when N > 50, p = D/N ≤ 0.10, n ≤ D

$$h(N, D, n, d) \approx b(n, p, d) = \binom{n}{d} p^{d} (1 - p)^{n-d} \tag{3}$$

Improved binomial approximation [4,5]
The improved binomial density function, under certain assumptions, takes the form of formula (3) with p and n replaced by p1 and n1 respectively. These are found by equating the first two moments of the hypergeometric density function to the first two moments of the binomial density function and solving for p1 and n1:

$$p_1 = 1 - \frac{(N - n)(N - D)}{N(N - 1)}, \qquad n_1 = \frac{n D}{N p_1} \tag{4}$$
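The density functions above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the function names are hypothetical, the choice between formulas (2) and (3) assumes that (2) applies when D ≤ n and (3) otherwise, and the conditions N > 50 and f, p ≤ 0.10 are not checked. The values of N, D and n in the demonstration are arbitrary.

```python
from math import comb


def hypergeom_pmf(N, D, n, d):
    """Formula (1): probability of d defects in a sample of n drawn without replacement."""
    if d < 0 or d > min(n, D) or n - d > N - D:
        return 0.0
    return comb(D, d) * comb(N - D, n - d) / comb(N, n)


def binom_pmf(trials, prob, d):
    """Binomial density b(trials, prob, d), the common form of formulas (2) and (3)."""
    if d < 0 or d > trials:
        return 0.0
    return comb(trials, d) * prob**d * (1.0 - prob) ** (trials - d)


def standard_binomial(N, D, n, d):
    """Formula (2) when D <= n, formula (3) otherwise (assumed selection rule)."""
    if D <= n:
        return binom_pmf(D, n / N, d)  # b(D, f, d) with f = n/N
    return binom_pmf(n, D / N, d)      # b(n, p, d) with p = D/N


def improved_binomial_params(N, D, n):
    """Formula (4): p1 and n1 of the improved binomial approximation (needs D >= 1, n >= 1)."""
    p1 = 1.0 - (N - n) * (N - D) / (N * (N - 1))
    n1 = n * D / (N * p1)
    return p1, n1


if __name__ == "__main__":
    N, D, n = 200, 10, 20  # arbitrary illustrative values
    for d in range(4):
        print(d, hypergeom_pmf(N, D, n, d), standard_binomial(N, D, n, d))
    p1, n1 = improved_binomial_params(N, D, n)
    # By construction n1*p1 is the hypergeometric mean n*D/N.
    print(p1, n1, n1 * p1, n * D / N)
```

Because n1 is generally not an integer, the sketch returns only the parameters p1 and n1; evaluating the improved binomial density itself would need a generalized binomial coefficient, which is left out here.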
Non-detection probability
Let q be the probability that a defect is properly classified as a defect given that it has been measured. Since the IAEA uses up to three verification methods in its inspection activities, q is the weighted average of q1, q2 and q3. Here q1 is the probability of classifying a defect as a defect when measured with verification method 1; q2 and q3 are defined similarly. It is quite natural to assume that q < 1. Since the values of q2 from the 13th and the 19th columns of reference [9] are 91% and 85%, and the values of q3 are 86% and 83%, q = 100%, 99%, 98%, 97%, 96%, 95%, 94% and 93% were used in this paper. The non-detection probability is given as follows:

Hypergeometric density function

$$\beta = \sum_{d=0}^{\min(n, D)} h(N, D, n, d)\,(1 - q)^{d} \tag{5}$$

Here min(n, D) is the smaller of n and D.

Standard binomial approximation

when N > 50, f = n/N ≤ 0.10, D ≤ n

$$\beta = \left(1 - \frac{n}{N} q\right)^{D} = (1 - f \cdot q)^{D} \tag{6}$$

when N > 50, p = D/N ≤ 0.10, n ≤ D

$$\beta = \left(1 - \frac{D}{N} q\right)^{n} = (1 - p \cdot q)^{n} \tag{7}$$

Improved binomial approximation

$$\beta = (1 - p_1 \cdot q)^{n_1} \tag{8}$$

Formulas (6), (7) and (8) are approximate expressions of formula (5).

Sample size
Given β, sample sizes are calculated from formulas (5) through (8).

Correctly applied standard binomial approximation (CB)

when N > 50, f = n/N ≤ 0.10, D ≤ n

$$n = \frac{N}{q}\left(1 - \beta^{1/D}\right) \tag{9}$$

when N > 50, p = D/N ≤ 0.10, n ≤ D

$$n = \frac{\ln(\beta)}{\ln\left(1 - \frac{D}{N} q\right)} \tag{10}$$

Usually it is not known in advance whether D ≤ n or n ≤ D. It is therefore necessary to compare D with n after applying formula (9) or (10). If the assumed inequality is not met, the other formula is used to obtain the sample size.

Improved binomial approximation (IB) [4]
Under certain assumptions the improved binomial distribution is approximated as a statistical distribution. Its form, formula (3) with p and n replaced by p1 and n1, is used for sample size calculation with excellent results.

$$n_1 = \frac{\ln(\beta)}{\ln(1 - p_1 \cdot q)} \tag{11}$$

Since formula (4) is symmetric in n and D, formula (11) can be used irrespective of whether D ≤ n or n ≤ D, which is a desirable property.

Simply applied standard binomial approximation (SB) [3,6]
To counteract the possible diversion scenarios, up to three verification methods are used by the IAEA. Formula (12) is used by the IAEA for the calculation of the sample size for verification methods 1, 2 and 3. Formula (10) is used for the calculation of the sample size for verification methods 2 and 3.

$$n = N\left(1 - \beta^{1/D}\right) \tag{12}$$

Comparison of sample size calculation algorithms
Estimating the value of q is the most important issue. The q values, from 100% down to 93%, are located at the upper-left corner of the sub-tables of Table 1. Table 1 shows the sample sizes calculated by the three binomial approximations described above and by the hypergeometric algorithm.
The 1st column β is the non-detection probability, with values 5%, 10%, 50% and 80%.
The 2nd column N is the number of items in the stratum under inspection.
The 3rd column x is the average weight of an item in the stratum under inspection, in the same unit as the goal amount M in the 4th column. Values 1.0 and 0.4 were used.
The 4th column M is the goal amount (generally, 1 significant quantity) with values 8 and 75.
The 5th column D1 is ⌈M/(γ1·x)⌉, the rounded-up number of defects with defect fraction γ1 = 1.0.
The 6th column n has four sub-columns SB, CB, IB and HY. SB was calculated with formula (12). CB was calculated with formulas (9) and (10). IB was calculated with an iterative algorithm for formula (11). HY was calculated with an iterative algorithm for formula (5).
The 7th column diff has three sub-columns S-, C- and I-. Here S- is the value of the first sub-column SB of n minus the fourth sub-column HY of n, C- is the value of the second sub-column CB of n minus HY, and I- is the value of the third sub-column IB of n minus HY. The sub-column S- shows non-negative values when q = 100%, 99% and 98%, but shows negative values when q is equal to or less than 97%. Therefore SB is not a conservative approximation to HY when q is equal to or less than 97%. Since the values of the sub-column S- are greater than those of C- and I-, SB is a poor approximation to HY compared with CB and IB. The sub-column C- shows no negative values for any of the values of q used in Table 1. Therefore CB is a conservative approximation to HY. The sub-column I- also shows no negative values for any of the values of q used in Table 1. Therefore IB is also a conservative approximation to HY. Since the values of CB are always equal to or greater than those of IB, IB is a better approximation to HY than CB.
The 8th column r_diff has three sub-columns S-, C- and I-. Here S-, C- and I- are defined as follows:

$$\text{S-} = \frac{SB - HY}{HY} \tag{13}$$

$$\text{C-} = \frac{CB - HY}{HY} \tag{14}$$

$$\text{I-} = \frac{IB - HY}{HY} \tag{15}$$
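As a small illustration of the diff and r_diff columns, the helper below computes both quantities for one approximate sample size and reports whether the approximation is conservative in the sense used here, that is, whether its difference from HY is non-negative. The function name and the sample sizes in the usage line are hypothetical and are not taken from Table 1.

```python
def compare_to_hy(n_approx, n_hy):
    """Return (diff, r_diff, conservative) for one approximation against HY.

    diff is the 7th-column value (approximation minus HY), r_diff is the
    8th-column value from formulas (13) to (15), and conservative is True
    when diff is non-negative.
    """
    diff = n_approx - n_hy
    r_diff = diff / n_hy
    return diff, r_diff, diff >= 0


if __name__ == "__main__":
    # Hypothetical sample sizes, for illustration only.
    print(compare_to_hy(26, 25))
```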
Table 1 was calculated in Microsoft Excel using Visual Basic for Applications. The columns diff and r_diff of the sub-tables of Table 1 are summarized in Table 2. The sub-column SB of n in the sub-table q = 100% of Table 1 is very close to, and conservative with respect to, the sub-column HY of n in the sub-table q = 98% of Table 1. Therefore, if we can assume that q = 98%, SB is a good approximation to HY. It is nevertheless more desirable to use CB, IB or HY to get a statistically accurate sample size. CB and IB can easily be implemented on a pocket calculator. Although the calculation of HY requires many more steps than IB and CB, with today's powerful personal computers there is no noticeable difference in calculation speed among CB, IB and HY. With a 16-bit or 32-bit personal computer, HY can be used directly for safeguards inspection activities.
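The four sample size calculations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Visual Basic code used for Table 1. The paper says only that IB and HY use an iterative algorithm, so a plain search for the smallest integer n whose non-detection probability does not exceed β is assumed here. Rounding the closed-form results up to the next integer, trying formula (9) before formula (10) in CB, and all function names are likewise assumptions. The sketch expects 1 ≤ D < N and 0 < β < 1.

```python
from math import ceil, comb, log


def beta_hy(N, D, n, q):
    """Formula (5): exact non-detection probability."""
    total = 0.0
    for d in range(min(n, D) + 1):
        h = comb(D, d) * comb(N - D, n - d) / comb(N, n)
        total += h * (1.0 - q) ** d
    return total


def beta_ib(N, D, n, q):
    """Formula (8), with p1 and n1 taken from formula (4)."""
    p1 = 1.0 - (N - n) * (N - D) / (N * (N - 1))
    n1 = n * D / (N * p1)
    return (1.0 - p1 * q) ** n1


def smallest_n(beta_of_n, N, beta):
    """Assumed iteration: first n in 1..N with non-detection probability <= beta."""
    for n in range(1, N + 1):
        if beta_of_n(n) <= beta:
            return n
    raise ValueError("target beta cannot be reached even with n = N")


def sample_size_sb(N, D, beta):
    """Formula (12), rounded up."""
    return ceil(N * (1.0 - beta ** (1.0 / D)))


def sample_size_cb(N, D, beta, q):
    """Formula (9) if its assumption D <= n holds, otherwise formula (10)."""
    n = ceil(N / q * (1.0 - beta ** (1.0 / D)))
    if D <= n:
        return n
    return ceil(log(beta) / log(1.0 - q * D / N))


def sample_size_ib(N, D, beta, q):
    """Improved binomial approximation, solved by the assumed search."""
    return smallest_n(lambda n: beta_ib(N, D, n, q), N, beta)


def sample_size_hy(N, D, beta, q):
    """Hypergeometric algorithm, solved by the assumed search."""
    return smallest_n(lambda n: beta_hy(N, D, n, q), N, beta)


if __name__ == "__main__":
    N, D, beta, q = 100, 10, 0.05, 1.0  # arbitrary illustrative inputs
    print("SB", sample_size_sb(N, D, beta))
    print("CB", sample_size_cb(N, D, beta, q))
    print("IB", sample_size_ib(N, D, beta, q))
    print("HY", sample_size_hy(N, D, beta, q))
```

The closed-form results are not capped at N, so a caller would need to handle values of q and β for which formula (9) or (10) asks for more items than the stratum contains.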
Table 2. Relative degree of approximation to HY

Table 1     SB                       CB                   IB
q = 100%    conservative, poor       conservative, good   conservative, very good
q = 99%     conservative, poor       conservative, good   conservative, very good
q = 98%     conservative, poor       conservative, good   conservative, very good
q = 97%     not conservative, poor   conservative, good   conservative, very good
q = 96%     not conservative, poor   conservative, good   conservative, very good
q = 95%     not conservative, poor   conservative, good   conservative, very good
q = 94%     not conservative, poor   conservative, good   conservative, very good
q = 93%     not conservative, poor   conservative, good   conservative, very good
Conclusion
From Tables 1 and 2, SB (simply applied standard binomial approximation), the approximation algorithm used by the IAEA, is not a conservative approximation to HY (hypergeometric algorithm) when q is equal to or less than 97%, whereas CB (correctly applied standard binomial approximation) and IB (improved binomial approximation) are conservative approximations to HY. IB is a better approximation to HY than CB, but an iterative algorithm is required to calculate it. Furthermore, the improved binomial distribution is approximated as a statistical distribution. With today's powerful personal computers, there is no noticeable difference in calculation speed among CB, IB and HY. Since SB can be regarded as a poor approximation to HY, it is recommended to use CB, IB or HY. To apply these methods, further investigation of the estimation of the value of q is required.
References
1. V. K. Rohatgi, Statistical Inference, New York: John Wiley & Sons, 1984, pp. 341-342.
2. International Atomic Energy Agency, IAEA Safeguards Statistical Concepts and Technique, Vienna, IAEA/SG/SGT/4, IAEA, 1989.
3. J. L. Jaech and M. Russell, Algorithm to Calculate Sample Sizes for Inspection Sampling Plans, IAEA STR-261 Rev. 1, 1991.
4. J. L. Jaech, "An improved binomial approximation to the hypergeometric density function", Journal of Nuclear Material Management: 36-41 (January 1994).
5. W. D. Sellinschegg, "Statistical Analysis employed in IAEA Safeguards", International Nuclear Safeguards 1994: Vision for the Future Vol. 1, IAEA-SM-333/224, IAEA (July 1994).
6. Mingshih Lu, "Detection probabilities for random inspections in variable flow situations", International Nuclear Safeguards 1994: Vision for the Future Vol. 1, IAEA-SM-333/124, IAEA (July 1994).
7. Hyun-Tae Kim, et al., "A Study on the application of hypergeometric distribution to the IAEA inspection sample size allocation algorithm (Korean)", Proceedings of the Korean Nuclear Society Spring Meeting: 1093-1098, Ulsan, Korea (May 1995).
8. Hyun-Tae Kim, "A Comparison between IAEA inspection sample size allocation algorithms (Korean)", Proceedings of the Korean Nuclear Society Autumn Meeting: 1029-1034, Seoul, Korea (October 1995).
9. Hyun-Tae Kim, "A Comparison of sample size allocation between simply applied standard binomial approximation, correctly applied standard binomial approximation, and improved binomial approximation", Proceedings of the 37th Annual Meeting of the Institute of Nuclear Materials Management: 113-118, Naples, FL, U.S.A. (July 1996).
-------------------------------------------------------------------------
Mr. Hyun-Tae Kim is a principal researcher working for the Technology Center for Nuclear Control (TCNC) and is the Secretary of the INMM Korea Chapter. He is in charge of safeguards software development at the TCNC. He received an MBA from ChungNam National University, Korea. His fields of interest are safeguards information processing and fuzzy information processing.
Address: Technology Center for Nuclear Control
Korea Atomic Energy Research Institute
P.O.Box 105, Yusung
Taejon, Korea
Telephone: +82-42-868-8939 FAX: +82-42-861-8819
Internet e-mail: [email protected]