Sampling

[Figure: the same image shown with more pixels and with fewer pixels]
Basic Statistics
Basic Statistics of Universe
Central Tendency
Population Mean and Total
• This statistic is a measure of the “central” value in
the observed Total Population
N - the number of all elements in the Total Population
X_i - the value of each element in the Total Population
Population Mean:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
Population Total:
$$X = \sum_{i=1}^{N} X_i = N\bar{X}$$
Basic Statistics of Universe
Spread
Population Variance ($\sigma^2$)
This statistic is a measure of how variable the Population is
about the MEAN $\bar{X}$,
i.e. what is the average distance of each of the elements
from the mean
Deviation of each element: $X_i - \bar{X}$
$$\sum_{i=1}^{N} (X_i - \bar{X})$$
… but this will be zero by definition
$$\sum_{i=1}^{N} (X_i - \bar{X})^2$$
… removes the negative values
$$S^2 = \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$
… is the population variance
Population Variability – example
Population I (1,2,3,4,5,6,7,8,9,10)
MEAN = 5.5 VARIANCE = 9.17
Population II(1,2,3,4,5,11,12,13,14,15)
MEAN = 8 VARIANCE = 30
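The two results above can be checked with a few lines of Python. This is a minimal sketch that assumes the N - 1 divisor used in the slides' variance formula, which is what `statistics.variance` applies (`statistics.pvariance` would divide by N instead). The variable and function names are illustrative only.

```python
from statistics import mean, variance

population_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
population_2 = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]


def describe(values):
    """Return (mean, variance), with the variance divided by N - 1."""
    return mean(values), variance(values)


mean_1, var_1 = describe(population_1)
mean_2, var_2 = describe(population_2)

print(f"Population I:  mean = {mean_1}, variance = {var_1:.2f}")
print(f"Population II: mean = {mean_2}, variance = {var_2:.2f}")
```

Population II has the wider spread, so its variance is more than three times that of Population I.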
Population I: Mean and Variance
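      | X_i | X_i - \bar{X} | (X_i - \bar{X})^2
      | 1   | -4.5          | 20.25
      | 2   | -3.5          | 12.25
      | 3   | -2.5          | 6.25
      | 4   | -1.5          | 2.25
      | 5   | -0.5          | 0.25
      | 6   | 0.5           | 0.25
      | 7   | 1.5           | 2.25
      | 8   | 2.5           | 6.25
      | 9   | 3.5           | 12.25
      | 10  | 4.5           | 20.25
Sum   | 55  | 0             | 82.5
Mean  | 5.5 | 0             |
Variance:
$$S^2 = \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} = \frac{82.5}{9} = 9.17$$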
Sampling
[Figure: the same image shown with more dots and with fewer dots]
Sample
• A sample is a subset of the universe that is used for making
conclusions or inferences about the universe.
• It reduces the time, effort and cost of estimating parameters
such as market size, a brand’s sales volume or market share.
• Sample size is usually a commercial decision weighing the
cost and benefit
– Small unreliable samples are not meaningful
– Large, overly accurate samples that no one can afford also do not
make commercial sense
Sample size depends on
• Population Variability
– Larger variance - bigger sample is needed
• Product Distribution
• Sample Design
– Optimum stratification and allocation
– Selection
– Projection
• Level of Accuracy
– The higher the required accuracy, the larger the sample
Sample size is not dependent on Universe size.
Why large samples in India and China?
Retail Environment: Highly Variable
Product Distribution: Low
→ Difficult to Measure
→ Need Large Sample
→ Reliability is more Expensive
Central Limit Theorem
• The determination of sample size is based partly on the
Central Limit Theorem, which states that the sampling
distribution of the mean approaches a normal distribution as
the sample size (n) increases, even if the population
distribution is not normally distributed.
• This means that when we choose samples of the same size
from the universe (e.g. retail outlets), each of these
samples will yield a different result for any variable being
measured (such as rate of brand sales per store). It can,
however, be expected with a known probability that a sample
of the same size and design will yield a result that is
within a measured range around the true value
Different samples give different results
Universe size = N
Sample size = n
True value = μ
Measure:
Sample 1: x1
Sample 2: x2
Sample 3: x3
Sample 4: x4
……
Based on the Central Limit Theorem, the frequency
distribution of values x1, x2 … follows a bell-shaped curve
called normal distribution
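The behaviour described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original material: the universe is a hypothetical, strongly skewed set of 10,000 values, and the sample size, number of samples and random seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical universe with a skewed (non-normal) distribution.
universe = [rng.expovariate(1.0) for _ in range(10_000)]
true_mean = mean(universe)

n = 50               # sample size
num_samples = 2_000  # number of repeated samples of size n

# Each sample yields a different estimate of the universe mean.
estimates = [mean(rng.sample(universe, n)) for _ in range(num_samples)]

centre = mean(estimates)
spread = pstdev(estimates)

# Share of estimates that lie within one standard deviation of the centre.
within_1sd = sum(abs(x - centre) <= spread for x in estimates) / num_samples

print(f"true mean         = {true_mean:.3f}")
print(f"mean of estimates = {centre:.3f}")
print(f"share within 1 sd = {within_1sd:.1%}")
```

Although the universe itself is skewed, the estimates cluster symmetrically around the true mean, and roughly two thirds of them fall within one standard deviation of it.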
Normal Distribution
[Figure: distribution density function. Vertical axis: Frequency (%); horizontal axis: Estimated Value x̄, centred on μ]
Frequency distribution of estimates x̄1, x̄2, x̄3, x̄4 … of a
variable X from multiple samples follows the bell-shaped
normal distribution curve.
Samples of the same size yield a result that is within a
measured range around the true value. 68% of all observations
fall within a range of ±1 standard deviation from the mean.
Normal distribution is defined by a function which has two
parameters: mean and standard deviation.
Normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"
Probability that estimated value will lie
within ±3% of actual value is 90%
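[Figure: normal curve of the Estimated Value x̄ centred on μ. Vertical axis: Frequency (%). The central 90% of estimates lie within μ ± 1.65σ, with 5% in each tail]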
Standards for sampling error
• Nielsen standards for Sampling Error* (tolerance
level of error) are:
– National Market: ±3% of sales level
– Major MBDs / Channels: ±6% of sales level
– Minor MBDs: ±6-10% of sales level
• National Market ±3% of sales level …
Probability that estimated value will lie within
±3% of actual value is 90%
* Sampling Error also called Relative Standard Error or RSE
Sample Size for estimating population mean μ
Variance of distribution = $\sigma^2$
$$1.65\,\sigma = 0.03\,\mu = e$$
i.e.
$$Z\sigma = e$$
$$Z^2\sigma^2 = e^2$$
$$\sigma^2 = S^2 / n$$
$S^2$ is the Universe variance, μ is universe average (Central Limit Theorem)
$$Z^2 (S^2 / n) = e^2$$
$$n = \frac{Z^2 S^2}{e^2}$$
Sample Size
(for estimation of population mean)
$$n = \frac{Z^2 S^2}{e^2}$$
n = Sample Size
$S^2$ = Universe variance
Z = Standardized z value associated with the level of confidence
Level of confidence is:
90% … Z=1.65
95% … Z=1.96
99% … Z=2.58
e = Acceptable tolerance level of error
(stated in absolute value)
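The Z values quoted for each level of confidence can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later); the helper name `z_for_confidence` is an illustrative assumption. The slides round the 90% value of about 1.645 up to 1.65.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z value: the central `confidence` share of a standard
    normal distribution lies within +/- z standard deviations of the mean."""
    tail = (1.0 - confidence) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_for_confidence(level):.3f}")
```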
Sample Size
(for estimation of variable)
$$n = \frac{Z^2 S^2}{e^2}$$
Sample size is directly proportional to Universe variance.
In order to halve the sampling error we need to quadruple
the sample size.
Example:
S = 100
μ = 200
Z = 1.65
e = 3% x 200 = 6
n = 1.65² x 100² / (0.03 x 200)² = 756
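The worked example can be reproduced in code. This is a minimal sketch of the formula n = Z²S²/e²; the function name `sample_size_for_mean` is an illustrative assumption, and the inputs are the ones given on the slide.

```python
def sample_size_for_mean(z, s, e):
    """Sample size n = Z^2 * S^2 / e^2, with e stated in absolute terms."""
    return (z ** 2) * (s ** 2) / (e ** 2)


S = 100        # universe standard deviation
mu = 200       # universe mean
Z = 1.65       # 90% level of confidence
e = 0.03 * mu  # 3% of the mean, i.e. an absolute error of 6

n = sample_size_for_mean(Z, S, e)
n_half_error = sample_size_for_mean(Z, S, e / 2)

print(f"n for e = 3%:   {n:.0f}")
print(f"n for e = 1.5%: {n_half_error:.0f}")
```

Halving the tolerated error multiplies the required sample size by four, because e enters the formula squared.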
Sample Size
(for estimation of population proportion)
• Quantitative Research usually deals with estimation of population
proportion, i.e. what % of the population …
– are aware of product X
– say they will buy product X
– agree / strongly agree that product X is superior to product Y
• For a large population, the sample size (n) for these types of
“Yes” / “No” questions is:
$$n = \frac{Z^2\, p(1-p)}{e^2}$$
Where:
p is the probability of a given response and varies from 0 to 1. It reflects
the variability in the data. When p = 1 there is no variability, so no sample
is required.
Z is the standardized value associated with the level of confidence. For
95% confidence level Z = 1.96
e is the desired precision or the margin of error
Sample Size
(for estimation of population proportion)
• 0.5 is the most conservative value for ‘p’. This is the probability for
which we require the largest sample size (n)
• Assuming:
p = 0.5
e = ±5%
Z = 1.96 ≈ 2 … for confidence level of 95%
$$n = \frac{Z^2\, p(1-p)}{e^2} = \frac{2^2 \times 0.5\,(1-0.5)}{0.05^2} = 400$$
(or 384 to be precise)
With p = 0.5 and Z ≈ 2 the formula reduces to:
$$n = \frac{1}{e^2}$$
• To estimate the proportion of population that respond positively to
a question, with confidence level of 95% and 5% margin of error,
we need a sample of 400 respondents
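The two figures quoted here (400 with Z rounded to 2, and 384 with Z = 1.96) follow directly from the formula. The sketch below is illustrative only; the function name `sample_size_for_proportion` is an assumption.

```python
def sample_size_for_proportion(z, p, e):
    """Sample size n = Z^2 * p * (1 - p) / e^2 for a large population."""
    return (z ** 2) * p * (1 - p) / (e ** 2)


n_rounded_z = sample_size_for_proportion(z=2.0, p=0.5, e=0.05)
n_precise_z = sample_size_for_proportion(z=1.96, p=0.5, e=0.05)
n_skewed_p = sample_size_for_proportion(z=1.96, p=0.2, e=0.05)

print(f"Z = 2,    p = 0.5: n = {n_rounded_z:.0f}")
print(f"Z = 1.96, p = 0.5: n = {n_precise_z:.0f}")
print(f"Z = 1.96, p = 0.2: n = {n_skewed_p:.0f}")
```

The last line shows why p = 0.5 is the conservative choice: any other value of p makes p(1 - p) smaller and so needs a smaller sample.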
Sample Size
(for estimation of population proportion)
$$n = \frac{1}{e^2}$$
e    | n
±20% | 25
±10% | 100
±5%  | 400
±4%  | 625
±3%  | 1,100
±2%  | 2,500
±1%  | 10,000
• For small populations, 100 for instance, take a census.
• Finite Population Correction Factor: For medium size populations, the
required (reduced) sample size may be computed using the following
adjustment:
$$n_{adj} = \frac{n}{1 + n/N}$$
• If N = 2000 and e = 5%, then n = 384 (not rounding Z to 2), and
$$n_{adj} = \frac{384}{1 + 384/2000} = 322$$
• Using the FPC factor one can optimize sample size and save cost for
smaller populations. Further savings are possible if it’s known that the
proportions are skewed (i.e. p is not close to 0.5)
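The finite population adjustment can be sketched as follows, using the adjustment formula exactly as given on the slide. The function name is an illustrative assumption, and the second universe size of 1,000,000 is a hypothetical value added only to show the effect of a very large N.

```python
def finite_population_adjustment(n, universe_size):
    """Adjusted sample size n_adj = n / (1 + n / N)."""
    return n / (1 + n / universe_size)


n = 384  # required sample for e = 5% at 95% confidence, large population
n_adj = finite_population_adjustment(n, universe_size=2000)
n_adj_large = finite_population_adjustment(n, universe_size=1_000_000)

print(f"N = 2,000:     n_adj = {n_adj:.0f}")
print(f"N = 1,000,000: n_adj = {n_adj_large:.0f}")
```

For a very large universe the adjustment has almost no effect, which is consistent with sample size not depending on Universe size.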
Sample size for different margins of error
(at confidence level of 95%)
e     | n
±20%  | 25
±14%  | 50
±10%  | 100
±7%   | 200
±5%   | 400
±4%   | 600
±3.5% | 800
±3%   | 1,000
±2%   | 2,500
±1%   | 10,000
[Figure: Margin of error (e) plotted against Sample size (n), for n from 0 to 1000]
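The relationship between margin of error and sample size can be recomputed by inverting the sample size formula to e = Z·sqrt(p(1 - p)/n). This is a sketch assuming p = 0.5 and Z rounded to 2, the simplification used earlier in the deck; the function name is illustrative.

```python
import math


def margin_of_error(n, z=2.0, p=0.5):
    """Margin of error e = Z * sqrt(p * (1 - p) / n) for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (25, 50, 100, 200, 400, 600, 800, 1000, 2500, 10000):
    print(f"n = {n:>6}: e = +/-{margin_of_error(n):.1%}")
```

The gain from each extra respondent shrinks quickly: going from 25 to 100 respondents halves the margin of error, while going from 2,500 to 10,000 is needed to halve it again from 2% to 1%.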
Design issues for tracking surveys
• Many market research surveys are tracking
surveys in which an independent sample is taken
at fixed intervals. The main objective of a tracker
is to measure change between the intervals
• Estimates of change (differences) have margins of
error which are about 40% larger than the corresponding
estimates from the individual surveys (a factor of √2 for
two independent samples of equal size)
– Because both estimates are subject to sample error
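The 40% figure follows from adding the sampling variances of two independent waves. The sketch below assumes equal wave sizes, p = 0.5 in both waves and Z rounded to 2; the function names are illustrative.

```python
import math


def margin_of_error(n, z=2.0, p=0.5):
    """Margin of error of a single survey estimate of a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def margin_of_error_of_difference(n1, n2, z=2.0, p=0.5):
    """Margin of error of the difference between two independent waves:
    the variances of the two estimates add."""
    return z * math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)


n = 400
single = margin_of_error(n)
difference = margin_of_error_of_difference(n, n)

print(f"single wave:   +/-{single:.1%}")
print(f"wave-on-wave:  +/-{difference:.1%}")
print(f"ratio:         {difference / single:.3f}")
```

To measure a change with ±5% precision, each wave therefore needs about 800 respondents rather than 400.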
Tracking Studies - margin of error for differences in
proportions. (at confidence level of 95%)
e    | n
±20% | 50
±14% | 100
±10% | 200
±7%  | 400
±5%  | 800
±4%  | 1,250
±3%  | 2,200
±1%  | 20,000
[Figure: Margin of error (e) for differences plotted against Sample size (n), for n from 0 to 1000]
Design issues for tracking surveys
• The previous charts show a “tracking survey” for
which no change was happening – the population
value did not change over time
– The charts are only showing sample error!
• Many trackers have samples of around 20-50 per
wave and are unable to measure the change
happening in the population
Design issues for tracking surveys
• This problem is often “solved” with rolling data
• With rolling data the result for each wave is
averaged across the current and several previous
waves
– e.g. with a weekly sample size of 25, an 8-week
rolling average would give a sample size of 200
– This improves the standard error (by using more
sample) but an 8-week average flattens the data,
making any change hard to detect
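The flattening effect can be seen in a small sketch. The weekly series below is hypothetical: the true population value is assumed to jump from 10 to 20 in week 9, with no sampling error added, so that only the effect of the rolling average is visible.

```python
def rolling_mean(values, window):
    """Average of the current value and the (window - 1) values before it."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical weekly values: a step change from 10 to 20 after week 8.
weekly = [10.0] * 8 + [20.0] * 8
rolled = rolling_mean(weekly, window=8)

for week, value in enumerate(rolled, start=8):
    print(f"week {week:>2}: 8-week rolling value = {value:.2f}")
```

The underlying change happens in a single week, but the rolling figure needs eight weeks to reflect it in full.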
Types of Sampling
• Probability sampling: Where we know the probability of
a data point being included in the sample … though the
probability of inclusion may not be equal
– Random sampling: equal chance of inclusion
– Systematic sampling: equal chance of inclusion
– Stratified sampling: unequal chance of inclusion
• Non-probability sampling:
– Convenience sampling … such as selecting a sample (of
outlets, as in a retail audit) that is near home
– Purposive sampling … e.g. selecting brand users
– Quota sampling … where the sample matches a predetermined profile
In real life, we usually use a combination of methods
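The three probability sampling methods listed above can be sketched with the standard `random` module. Everything in this example is hypothetical: a universe of 100 outlet ids, two strata of unequal size, and arbitrary sample sizes and seed.

```python
import random

rng = random.Random(42)     # fixed seed so the run is repeatable
outlets = list(range(100))  # hypothetical universe of 100 outlet ids


def simple_random_sample(units, n):
    """Every unit has the same chance of inclusion."""
    return rng.sample(units, n)


def systematic_sample(units, n):
    """Random start, then every k-th unit (k = universe size // n)."""
    step = len(units) // n
    start = rng.randrange(step)
    return units[start::step][:n]


def stratified_sample(strata, sizes):
    """Independent random sample drawn within each stratum."""
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


strata = {"supermarkets": outlets[:20], "mini-markets": outlets[20:]}
sizes = {"supermarkets": 5, "mini-markets": 5}

srs = simple_random_sample(outlets, 10)
systematic = systematic_sample(outlets, 10)
stratified = stratified_sample(strata, sizes)

print("simple random:", sorted(srs))
print("systematic:   ", systematic)
print("stratified:   ", {k: sorted(v) for k, v in stratified.items()})
```

In the stratified draw a supermarket has a 5 in 20 chance of inclusion and a mini-market 5 in 80, which is the unequal chance of inclusion noted in the list.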
[Figure: three panels labelled “Biased”, “Unbiased, random errors” and “Unbiased and accurate”]
Stratified Sampling
• Stratification is a process of
– dividing a Universe into groups (called Strata or Cells) for
the purpose of selecting a sample from each one
• Strata examples: Provision, Supermarkets, mini-markets
– Each group is usually internally homogeneous.
Homogeneity in retail measurement is based on store
characteristics such as store type / retailer chain,
geographical location and shop size.
• A Stratified Sample provides greater precision than a
Simple Random Sample of the same Sample Size
• Most suitable for retail measurement services
Stratification
Population II
(1,2,3,4,5, 11,12,13,14,15)
MEAN = 8
σ² = 30
Stratum I
(1,2,3,4,5)
MEAN = 3
σ² = 2.5
Stratum II
(11,12,13,14,15)
MEAN = 13
σ² = 2.5
Through reduction in strata variance, a Stratified Sample provides
greater precision than a Simple Random Sample of the same size
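The drop in variance from splitting Population II into two strata can be verified directly. As in the earlier slides, this sketch assumes the N - 1 divisor (`statistics.variance`); the variable names are illustrative.

```python
from statistics import mean, variance

population_2 = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
stratum_1 = population_2[:5]  # (1, 2, 3, 4, 5)
stratum_2 = population_2[5:]  # (11, 12, 13, 14, 15)

groups = [
    ("Population II", population_2),
    ("Stratum I", stratum_1),
    ("Stratum II", stratum_2),
]
for name, values in groups:
    print(f"{name:<14} mean = {mean(values)}, variance = {variance(values)}")
```

Each stratum has one twelfth of the variance of the whole population, so far fewer observations per stratum are needed for the same precision.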
Sample Size in Stratified Sampling
n_i = Sample Size for stratum i (k strata in all)
N_i = Stratum i Population (i.e. Universe)
σ_i² = Stratum i Population variance
Z = Standardized z value associated with the level of confidence
e = Acceptable tolerance level of error (%)
$$n_i = \frac{Z^2 \sigma_i^2}{e^2} \times \frac{N_i \sigma_i^2}{\sum_{j=1}^{k} N_j \sigma_j^2}$$
Level of confidence is:
90% … Z=1.65
95% … Z=1.96
99% … Z=2.58
Projection from sample to universe
• Projection - statistical technique used to
estimate Population variables from the
observations gathered from the Sample
• Stratified Samples are projected at the Cell
level
Projection factors
• Cell Projection Factors can be based on
– Number of Shops - Numeric Projection
Projection factor = (# of Shops in Universe) / (# of Shops in Sample)
– Shops ACV (‘All commodity value’ or Turnover) or any
other size indicator - Ratio Estimation
Projection factor = (Universe Shops ACV) / (Sample Shops ACV)
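The two projection factors can be compared on a single cell. All figures below are hypothetical and chosen only to make the arithmetic easy to follow: a cell with 1,000 shops and a total ACV of 60,000, from which 5 shops are sampled.

```python
# Hypothetical cell (stratum) totals for the universe.
universe_shops = 1000
universe_acv = 60_000.0

# Hypothetical sample: (shop ACV, category sales) for each sampled shop.
sample = [(40.0, 4.0), (60.0, 6.0), (50.0, 5.0), (30.0, 3.0), (70.0, 7.0)]

sample_acv = sum(acv for acv, _ in sample)
sample_sales = sum(sales for _, sales in sample)

# Numeric Projection: scale up by the number of shops.
numeric_factor = universe_shops / len(sample)
numeric_estimate = numeric_factor * sample_sales

# Ratio Estimation: scale up by ACV (the size indicator).
ratio_factor = universe_acv / sample_acv
ratio_estimate = ratio_factor * sample_sales

print(f"numeric factor = {numeric_factor:.0f}, estimate = {numeric_estimate:.0f}")
print(f"ratio factor   = {ratio_factor:.0f}, estimate = {ratio_estimate:.0f}")
```

Here the sampled shops are smaller on average (ACV of 50) than the universe average (ACV of 60), so the ratio estimate comes out higher than the numeric one.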
Projection
• Ratio Estimation is better than Numeric Projection
when
– the correlation between given Category sales and
measure of size (ACV sales) used to calculate Ratio
Estimation is greater than 0.5
• Distance between the Numeric and Ratio estimates
reflects Sample Quality