4 RATIO AND REGRESSION METHODS OF ESTIMATION

Hukum Chandra
Indian Agricultural Statistics Research Institute, New Delhi-110012
4.1 INTRODUCTION
In sampling theory, auxiliary information is utilized in the following ways:

(i) At the pre-selection stage, i.e. for stratifying the population.

(ii) At the selection stage, i.e. for selecting the units with probabilities proportional to some suitable measure of size (the size being based on some auxiliary variables).

(iii) At the estimation stage, i.e. in the formulation of ratio-type, regression, difference and product estimators, etc.

Auxiliary information may also be utilized in mixed ways.
Usually the information is available in one of the following forms:

(i) The values of the auxiliary character(s) are known in advance for every sampling unit of the population.

(ii) The population total(s) or mean(s) of the auxiliary character(s) are known in advance.

(iii) If it is desired to stratify the population according to the values of some variate x, its frequency distribution must be known.

The use of auxiliary information at the estimation stage in forming ratio-type and regression estimators, and a sampling scheme providing an unbiased regression estimator, are discussed in the following sections.
In sample surveys, the characteristic y under study is often closely related to an auxiliary characteristic x, and data on x are either readily available or can be easily collected for all the units in the population. In such situations, it is customary to consider estimators of the population mean $\bar{Y}_N$ of the survey variable y that use the data on x and are more efficient than estimators that use data on y alone. The fact that the data on the auxiliary variable can be used even after the sample has been selected encourages such procedures. Two commonly used methods of this type are:

(i) the ratio-type method of estimation

(ii) the regression method of estimation
4.2 RATIO-TYPE METHOD OF ESTIMATION
Let a sample of size n be drawn by simple random sampling without replacement (SRSWOR) from a population of size N. Denote by

$y_i$ = the value of the characteristic under study for the ith unit of the population,
$x_i$ = the value of the auxiliary characteristic for the ith unit of the population,
$Y$ = the total of the y values in the population,
$X$ = the total of the x values in the population,
$r_i = y_i / x_i$, the ratio of y to x for the ith unit,
$\bar{r}_n = \frac{1}{n}\sum_{i=1}^{n} r_i$, the simple arithmetic mean of the ratios for all the units in the sample,
$\bar{r}_N = \frac{1}{N}\sum_{i=1}^{N} r_i$, the simple arithmetic mean of the ratios for all the units in the population,
$R_N = \bar{Y}_N / \bar{X}_N = Y / X$, the ratio of the population mean of y to the population mean of x, and
$R_n = \bar{y}_n / \bar{x}_n = \sum_{i=1}^{n} y_i \big/ \sum_{i=1}^{n} x_i$, the corresponding ratio for the sample.

With this, an estimator of the population mean $\bar{Y}_N$ is given by

$$\bar{y}_R = R_n \bar{X}_N = \frac{\bar{y}_n}{\bar{x}_n}\,\bar{X}_N .$$

This estimator is known as the ratio-type estimator and presupposes knowledge of $\bar{X}_N$. Here, $R_n$ provides an estimator of the population ratio $R_N$. For example, if y is the number of bullocks on a holding and x its area in acres, the ratio $R_n$ is an estimator of the number of bullocks per acre of holding in the population. The product of $R_n$ with $\bar{X}_N$, the average size of a holding in acres, would provide an estimator of $\bar{Y}_N$, the average number of bullocks per holding in the population.
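The computation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `ratio_estimate`, the argument `x_bar_pop` (standing for $\bar{X}_N$) and the small data set of holdings are all hypothetical.

```python
def ratio_estimate(y, x, x_bar_pop):
    """Ratio-type estimate of the population mean of y.

    y, x      : sample values of the study and auxiliary variables
    x_bar_pop : known population mean of x (X-bar_N in the text)
    """
    if len(y) != len(x) or len(y) == 0:
        raise ValueError("y and x must be non-empty and of equal length")
    r_n = sum(y) / sum(x)      # R_n = y-bar_n / x-bar_n = sum(y) / sum(x)
    return r_n * x_bar_pop     # y-bar_R = R_n * X-bar_N


# Hypothetical sample of four holdings: bullocks (y) and area in acres (x).
bullocks = [2, 4, 3, 6]
acres = [1.0, 2.5, 1.5, 3.0]
estimate = ratio_estimate(bullocks, acres, x_bar_pop=2.4)
print(round(estimate, 2))
```

Here $R_n = 15/8$ bullocks per acre, which is then scaled by the assumed average holding size of 2.4 acres.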
4.2.1 Expected value of the ratio estimator
Note that $R_n$ is a biased estimator of $R_N$, and the bias in $R_n$ is given by

$$\text{Bias in } R_n = -\frac{\mathrm{Cov}(R_n, \bar{x}_n)}{\bar{X}_N} .$$

The expected value of the ratio estimator to the first approximation is given by

$$E_1(\bar{y}_R) = \bar{Y}_N\left[1 + \frac{N-n}{Nn}\left(C_x^2 - \rho\,C_y C_x\right)\right],$$

where $C_x = S_x / \bar{X}_N$, $C_y = S_y / \bar{Y}_N$ and $\rho$ is the population correlation coefficient between x and y. It may be noted here that the bias to the first approximation vanishes when the regression of y on x is a straight line passing through the origin.
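A short worked step may make the last remark concrete (a sketch using only the symbols of this section, assuming $\bar{X}_N \neq 0$ and $\bar{Y}_N \neq 0$). If the regression of y on x is a straight line through the origin, its slope $\rho S_y / S_x$ must equal $R_N$, and the first-order bias term is then zero:

```latex
\rho\,\frac{S_y}{S_x} = \frac{\bar{Y}_N}{\bar{X}_N}
\;\Longrightarrow\;
\rho\,\frac{S_y}{\bar{Y}_N} = \frac{S_x}{\bar{X}_N}
\;\Longrightarrow\;
\rho\,C_y = C_x
\;\Longrightarrow\;
C_x^2 - \rho\,C_y C_x = C_x\,(C_x - \rho\,C_y) = 0 .
```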
4.2.2. Variance of the Ratio Estimator
The variance of the sample ratio $R_n$ to a first approximation is given by

$$V_1(R_n) = R_N^2\,\frac{N-n}{Nn}\left(C_y^2 + C_x^2 - 2\rho\,C_y C_x\right),$$

and the variance of the ratio estimator of the population mean to a first approximation is given by

$$V_1(\bar{y}_R) = \frac{N-n}{Nn}\left(S_y^2 + R_N^2 S_x^2 - 2 R_N S_{yx}\right).$$
4.2.3. Estimator of the variance of the ratio estimator
A consistent estimator of the relative variance of the ratio estimator is given by

$$\hat{V}_1\!\left(\frac{R_n}{R_N}\right) = \frac{N-n}{Nn}\left(\frac{s_y^2}{\bar{y}_n^2} + \frac{s_x^2}{\bar{x}_n^2} - \frac{2 s_{yx}}{\bar{y}_n \bar{x}_n}\right),$$

and the estimator of the variance of the ratio estimator of the population mean to a first approximation is given by

$$\hat{V}_1(\bar{y}_R) = \frac{N-n}{Nn}\left(s_y^2 + R_n^2 s_x^2 - 2 R_n s_{yx}\right),$$

where $s_y^2$, $s_x^2$ and $s_{yx}$ are the corresponding sample values.
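The variance estimator of the ratio estimate of the mean can be sketched as follows. The helper names `_mean`, `_cov` and `ratio_variance_estimate`, and the three-unit sample, are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
def _mean(v):
    return sum(v) / len(v)


def _cov(u, v):
    """Sample covariance with divisor n - 1 (the variance when u is v)."""
    ub, vb = _mean(u), _mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v)) / (len(u) - 1)


def ratio_variance_estimate(y, x, N):
    """First-approximation variance estimate of the ratio estimate of the mean."""
    n = len(y)
    r_n = sum(y) / sum(x)                        # R_n
    s_y2, s_x2, s_yx = _cov(y, y), _cov(x, x), _cov(y, x)
    fpc = (N - n) / (N * n)                      # (N - n) / (N n)
    return fpc * (s_y2 + r_n ** 2 * s_x2 - 2 * r_n * s_yx)


# Hypothetical sample of n = 3 units from a population of N = 10.
y = [2.0, 4.0, 9.0]
x = [1.0, 2.0, 3.0]
v_hat = ratio_variance_estimate(y, x, N=10)
print(v_hat)
```

When y is exactly proportional to x in the sample, the bracketed term is zero and so is the estimated variance, which matches the intuition that the ratio estimator is then at its best.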
4.2.4 Efficiency of the Ratio Estimator
In large samples, the ratio estimator will be more efficient than the corresponding estimator based on the simple arithmetic mean if

$$\rho\,\frac{C_y}{C_x} > \frac{1}{2} \quad\text{or}\quad \rho > \frac{1}{2}\,\frac{C_x}{C_y} .$$

If $C_x = C_y$, as may be expected, for example, when y and x denote values of the same variate in two consecutive periods, $\rho$ must be larger than one-half for the ratio estimator to be more efficient than the one based on the simple arithmetic mean.
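The condition follows directly from comparing the first-order variance of the ratio estimator with the variance of the sample mean. The step below is a sketch that assumes $R_N > 0$ and $S_x > 0$ and writes $S_{yx} = \rho S_y S_x$:

```latex
\frac{N-n}{Nn}\left(S_y^2 + R_N^2 S_x^2 - 2 R_N \rho S_y S_x\right) < \frac{N-n}{Nn}\,S_y^2
\iff R_N^2 S_x^2 < 2 R_N \rho S_y S_x
\iff \rho > \frac{1}{2}\,\frac{R_N S_x}{S_y}
     = \frac{1}{2}\,\frac{S_x/\bar{X}_N}{S_y/\bar{Y}_N}
     = \frac{1}{2}\,\frac{C_x}{C_y} .
```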
4.3 RATIO ESTIMATOR IN STRATIFIED SAMPLING
Let there be K strata in the population. Let $N_t$ denote the number of units in the tth stratum and $n_t$ the size of the sample to be selected from it, so that

$$\sum_{t=1}^{K} N_t = N \quad\text{and}\quad \sum_{t=1}^{K} n_t = n .$$

Denote by $R_{n_t}$ the estimate of the population ratio $R_{N_t} = \bar{Y}_{N_t} / \bar{X}_{N_t}$ and by $\bar{y}_{Rt}$ the ratio estimate of the population mean $\bar{Y}_{N_t}$ for the tth stratum. Ratio estimators of the population mean

$$\bar{Y}_N = \sum_{t=1}^{K} \frac{N_t}{N}\,\bar{Y}_{N_t}$$

are discussed in the following sections.
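4.3.1 Separate Ratio Estimator ($\bar{y}_{Rs}$)

$$\bar{y}_{Rs} = \sum_{t=1}^{K} \frac{N_t}{N}\,\bar{y}_{Rt} = \sum_{t=1}^{K} p_t\,\bar{y}_{Rt}, \quad\text{where } p_t = \frac{N_t}{N} \quad (t = 1, \ldots, K).$$

This is a biased but consistent estimator of the population mean $\bar{Y}_N$. The bias to the first approximation is given by

$$\text{Bias in } \bar{y}_{Rs} = E_1(\bar{y}_{Rs}) - \bar{Y}_N = \sum_{t=1}^{K} p_t\,\bar{Y}_{N_t}\,\frac{N_t - n_t}{N_t n_t}\left(C_{tx}^2 - \rho_t\,C_{tx} C_{ty}\right),$$

where $C_{tx} = S_{tx} / \bar{X}_{N_t}$ and $C_{ty} = S_{ty} / \bar{Y}_{N_t}$. The variance of $\bar{y}_{Rs}$ to a first approximation is given by

$$V_1(\bar{y}_{Rs}) = \sum_{t=1}^{K} p_t^2\left(\frac{1}{n_t} - \frac{1}{N_t}\right)\left(S_{ty}^2 + R_{N_t}^2 S_{tx}^2 - 2 R_{N_t} S_{txy}\right),$$

which can equivalently be written as

$$V_1(\bar{y}_{Rs}) = \frac{1}{N}\sum_{t=1}^{K} p_t\,\frac{N_t - n_t}{n_t}\left(S_{ty}^2 + R_{N_t}^2 S_{tx}^2 - 2 R_{N_t} S_{txy}\right)$$

or

$$V_1(\bar{y}_{Rs}) = \frac{1}{N}\sum_{t=1}^{K} p_t\,\frac{N_t - n_t}{n_t}\left(S_{ty}^2 + R_{N_t}^2 S_{tx}^2 - 2 R_{N_t}\,\rho_t\,S_{tx} S_{ty}\right).$$

The above formula is based on the assumption that $n_t$ is large. A consistent estimator of $V_1(\bar{y}_{Rs})$ is given by

$$\hat{V}_1(\bar{y}_{Rs}) = \frac{1}{N}\sum_{t=1}^{K} p_t\,\frac{N_t - n_t}{n_t}\left(s_{ty}^2 + R_{n_t}^2 s_{tx}^2 - 2 R_{n_t} s_{tyx}\right).$$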
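As a rough illustration of the separate ratio estimator, the sketch below forms a ratio estimate within each stratum and then weights by $p_t = N_t/N$. The function name, the tuple layout and the two-stratum data are hypothetical.

```python
def separate_ratio_estimate(strata):
    """Separate ratio estimate of the population mean.

    strata: list of (N_t, x_bar_pop_t, y_sample, x_sample) tuples, where
    x_bar_pop_t is the known stratum mean of x.
    """
    N = sum(N_t for N_t, _, _, _ in strata)
    estimate = 0.0
    for N_t, x_bar_pop_t, y_t, x_t in strata:
        r_t = sum(y_t) / sum(x_t)        # R_{n_t}: ratio within the stratum
        y_rt = r_t * x_bar_pop_t         # stratum ratio estimate of the mean
        estimate += (N_t / N) * y_rt     # weight p_t = N_t / N
    return estimate


# Hypothetical data: two strata with two sampled units each.
strata = [
    (40, 2.0, [4.0, 8.0], [1.0, 2.0]),
    (60, 5.0, [9.0, 15.0], [3.0, 5.0]),
]
y_rs = separate_ratio_estimate(strata)
print(round(y_rs, 2))
```

Note that this version needs the population mean of x in every stratum, not just overall.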
In practice, the assumption that $n_t$ is large is not always true. To get over this difficulty, a combined ratio estimator has been suggested, as below.

4.3.2 Combined Ratio Estimator ($\bar{y}_{Rc}$)

$$\bar{y}_{Rc} = \frac{\sum_{t=1}^{K} p_t\,\bar{y}_{n_t}}{\sum_{t=1}^{K} p_t\,\bar{x}_{n_t}}\;\bar{X}_N .$$

This is again a biased estimator; however, it is consistent. The relative bias to the first approximation is given by

$$\text{Relative bias in } \bar{y}_{Rc} = \frac{E_1(\bar{y}_{Rc}) - \bar{Y}_N}{\bar{Y}_N} = \sum_{t=1}^{K} p_t^2\,\frac{N_t - n_t}{N_t n_t}\left(C_{tx}^2 - \rho_t\,C_{tx} C_{ty}\right).$$

The variance of $\bar{y}_{Rc}$ to a first approximation is given by

$$V_1(\bar{y}_{Rc}) = \frac{1}{N}\sum_{t=1}^{K} p_t\,\frac{N_t - n_t}{n_t}\left(S_{ty}^2 + R_N^2 S_{tx}^2 - 2 R_N\,\rho_t\,S_{ty} S_{tx}\right),$$

and an estimator of the variance is given by

$$\hat{V}_1(\bar{y}_{Rc}) = \frac{1}{N}\sum_{t=1}^{K} p_t\,\frac{N_t - n_t}{n_t}\left(s_{ty}^2 + R_n^2 s_{tx}^2 - 2 R_n s_{tyx}\right),$$

where

$$R_{n_t} = \frac{\bar{y}_{n_t}}{\bar{x}_{n_t}} \quad\text{and}\quad R_n = \frac{\sum_{t=1}^{K} p_t\,\bar{y}_{n_t}}{\sum_{t=1}^{K} p_t\,\bar{x}_{n_t}} .$$
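A minimal sketch of the combined ratio estimator follows. Unlike the separate version, it first pools the stratified means of y and x and forms a single ratio, so only the overall mean $\bar{X}_N$ is needed. The function name and the data are hypothetical.

```python
def combined_ratio_estimate(strata, x_bar_pop):
    """Combined ratio estimate of the population mean.

    strata    : list of (N_t, y_sample, x_sample) tuples
    x_bar_pop : known overall population mean of x
    """
    N = sum(N_t for N_t, _, _ in strata)
    # Stratified (weighted) sample means of y and x.
    y_st = sum(N_t / N * sum(y_t) / len(y_t) for N_t, y_t, _ in strata)
    x_st = sum(N_t / N * sum(x_t) / len(x_t) for N_t, _, x_t in strata)
    return y_st / x_st * x_bar_pop       # R_n * X-bar_N


# Hypothetical data: two strata with two sampled units each.
strata = [
    (40, [4.0, 8.0], [1.0, 2.0]),
    (60, [9.0, 15.0], [3.0, 5.0]),
]
y_rc = combined_ratio_estimate(strata, x_bar_pop=3.8)
print(round(y_rc, 2))
```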
4.4 REGRESSION METHOD OF ESTIMATION
We have seen that the ratio estimate provides an efficient estimate of the population mean if the regression of y, the variable under study, on x, the auxiliary variable, is linear and the regression line passes through the origin. It frequently happens that, even though the regression of y on x is linear, the regression line does not pass through the origin. Under such conditions, it is more appropriate to use the regression method of estimation than the ratio method of estimation.
4.4.1 Simple Regression Estimate
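Since the regression coefficient $\beta$ is generally not known, the usual practice is to use its estimate

$$\hat{\beta} = \frac{s_{xy}}{s_x^2},$$

where

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n) \quad\text{and}\quad s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2,$$

giving the simple regression estimate

$$\bar{y}_{lr} = \bar{y}_n + \hat{\beta}\,(\bar{X}_N - \bar{x}_n).$$

Note: The general form of the estimator is

$$\hat{\bar{Y}} = \bar{y}_n + k\,(\bar{X}_N - \bar{x}_n).$$

(i) If $k = \hat{\beta}$, then $\hat{\bar{Y}} = \bar{y}_n + \hat{\beta}\,(\bar{X}_N - \bar{x}_n)$, i.e. $\hat{\bar{Y}}$ is the regression estimator.

(ii) If $k = \bar{y}_n / \bar{x}_n$, then $\hat{\bar{Y}} = \bar{y}_n + \frac{\bar{y}_n}{\bar{x}_n}(\bar{X}_N - \bar{x}_n) = \frac{\bar{y}_n}{\bar{x}_n}\,\bar{X}_N$, i.e. $\hat{\bar{Y}}$ is the ratio estimator.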
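The note can be illustrated with a small sketch in which one function implements the general form and the choice of k selects the estimator. The function names and the three-unit sample are hypothetical.

```python
def _mean(v):
    return sum(v) / len(v)


def regression_slope(y, x):
    """beta-hat = s_xy / s_x^2 (the common divisor n - 1 cancels)."""
    xb, yb = _mean(x), _mean(y)
    s_xy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    s_xx = sum((a - xb) ** 2 for a in x)
    return s_xy / s_xx


def general_estimate(y, x, x_bar_pop, k):
    """General form: y-bar_n + k * (X-bar_N - x-bar_n)."""
    return _mean(y) + k * (x_bar_pop - _mean(x))


# Hypothetical sample; x_bar_pop stands for the known population mean of x.
y = [2.0, 4.0, 9.0]
x = [1.0, 2.0, 3.0]
x_bar_pop = 2.5

regression_est = general_estimate(y, x, x_bar_pop, k=regression_slope(y, x))
ratio_est = general_estimate(y, x, x_bar_pop, k=_mean(y) / _mean(x))
print(regression_est, ratio_est)
```

With $k = \bar{y}_n/\bar{x}_n$ the result coincides with $R_n \bar{X}_N$, as case (ii) of the note states.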
4.4.2 Expected value of the Simple Regression Estimator
$$E(\bar{y}_{lr}) = \bar{Y}_N - \mathrm{Cov}(\hat{\beta}, \bar{x}_n),$$

showing that the simple regression estimate is biased by an amount $-\mathrm{Cov}(\hat{\beta}, \bar{x}_n)$.

4.4.3 Variance of the Simple Regression Estimate

To a first approximation,

$$V(\bar{y}_{lr}) \approx \left(\frac{1}{n} - \frac{1}{N}\right) S_y^2\,(1 - \rho^2),$$

where $\rho$ is the correlation coefficient between y and x in the population.

4.4.4 Estimator of the variance

$$\hat{V}(\bar{y}_{lr}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_y^2\,(1 - r^2),$$

where $r = \dfrac{s_{xy}}{s_x s_y}$ is the sample correlation coefficient.
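A small sketch of this variance estimator is given below. The function names and the sample are hypothetical; the function also returns r so that the role of the correlation is visible.

```python
import math


def _mean(v):
    return sum(v) / len(v)


def _cov(u, v):
    """Sample covariance with divisor n - 1 (the variance when u is v)."""
    ub, vb = _mean(u), _mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v)) / (len(u) - 1)


def regression_variance_estimate(y, x, N):
    """Return (variance estimate of the regression estimate, sample r)."""
    n = len(y)
    s_y2, s_x2, s_xy = _cov(y, y), _cov(x, x), _cov(x, y)
    r = s_xy / math.sqrt(s_x2 * s_y2)            # sample correlation
    return (1 / n - 1 / N) * s_y2 * (1 - r ** 2), r


# Hypothetical sample of n = 3 units from a population of N = 10.
v_lr, r = regression_variance_estimate([2.0, 4.0, 9.0], [1.0, 2.0, 3.0], N=10)
print(round(v_lr, 4), round(r, 4))
```

The closer r is to plus or minus one, the smaller the estimated variance relative to that of the simple mean, $(1/n - 1/N)\,s_y^2$.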
4.5 REGRESSION ESTIMATORS IN STRATIFIED SAMPLING
At first, we shall consider two difference estimates, namely
(i) the separate difference estimator, and
(ii) the combined difference estimator.

4.5.1 Separate Regression Estimate

When the $\beta_i$'s are not known in the case of the separate difference estimator, we estimate them from the sample, and the resulting estimator is known as the separate regression estimator:

$$\bar{y}_{lrs} = \sum_{i=1}^{K} p_i\left[\bar{y}_{n_i} + \hat{\beta}_i\,(\bar{X}_{N_i} - \bar{x}_{n_i})\right], \quad\text{where } \hat{\beta}_i = \frac{s_{ixy}}{s_{ix}^2} .$$

This estimator is biased, and its variance to the first approximation is given by

$$V(\bar{y}_{lrs}) = \sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_{iy}^2\,(1 - \rho_i^2),$$

where $\rho_i$ is the correlation coefficient between y and x for the i-th stratum. An estimator of this variance is

$$\hat{V}(\bar{y}_{lrs}) = \sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(s_{iy}^2 + \hat{\beta}_i^2 s_{ix}^2 - 2\hat{\beta}_i s_{ixy}\right).$$
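The separate regression estimator can be sketched as follows: a slope is fitted within each stratum and the stratum regression estimates are weighted by $p_i = N_i/N$. The function names, tuple layout and data are hypothetical.

```python
def _mean(v):
    return sum(v) / len(v)


def _slope(y, x):
    """Within-stratum beta-hat_i = s_ixy / s_ix^2."""
    xb, yb = _mean(x), _mean(y)
    num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    den = sum((a - xb) ** 2 for a in x)
    return num / den


def separate_regression_estimate(strata):
    """strata: list of (N_i, x_bar_pop_i, y_sample, x_sample) tuples."""
    N = sum(s[0] for s in strata)
    total = 0.0
    for N_i, x_bar_pop_i, y_i, x_i in strata:
        b_i = _slope(y_i, x_i)
        stratum_est = _mean(y_i) + b_i * (x_bar_pop_i - _mean(x_i))
        total += (N_i / N) * stratum_est     # weight p_i = N_i / N
    return total


# Hypothetical data: two strata with two sampled units each.
strata = [
    (40, 2.0, [5.0, 9.0], [1.0, 2.0]),
    (60, 5.0, [9.0, 15.0], [3.0, 5.0]),
]
y_lrs = separate_regression_estimate(strata)
print(round(y_lrs, 2))
```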
4.5.2 Combined Regression Estimator
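When the pooled regression coefficient $\beta$ is not known, we replace it by its estimate $\hat{\beta}$ and get the combined regression estimator

$$\bar{y}_{lrc} = \sum_{i=1}^{K} p_i\,\bar{y}_{n_i} + \hat{\beta}\left(\bar{X}_N - \sum_{i=1}^{K} p_i\,\bar{x}_{n_i}\right),$$

where

$$\hat{\beta} = \frac{\sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) s_{ixy}}{\sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) s_{ix}^2} .$$

The variance of the estimator and its estimator, to the first approximation, are given by

$$V(\bar{y}_{lrc}) = \sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S_{iy}^2 + \beta^2 S_{ix}^2 - 2\beta S_{ixy}\right)$$

and

$$\hat{V}(\bar{y}_{lrc}) = \sum_{i=1}^{K} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(s_{iy}^2 + \hat{\beta}^2 s_{ix}^2 - 2\hat{\beta} s_{ixy}\right).$$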
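A minimal sketch of the combined regression estimator, with the pooled slope computed using the weights $p_i^2(1/n_i - 1/N_i)$ given above. The function names and the data are hypothetical.

```python
def _mean(v):
    return sum(v) / len(v)


def _cov(u, v):
    """Sample covariance with divisor n - 1 (the variance when u is v)."""
    ub, vb = _mean(u), _mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v)) / (len(u) - 1)


def combined_regression_estimate(strata, x_bar_pop):
    """strata: list of (N_i, y_sample, x_sample); returns (estimate, beta_hat)."""
    N = sum(N_i for N_i, _, _ in strata)
    num = den = y_st = x_st = 0.0
    for N_i, y_i, x_i in strata:
        p_i, n_i = N_i / N, len(y_i)
        w_i = p_i ** 2 * (1 / n_i - 1 / N_i)   # weight in the pooled slope
        num += w_i * _cov(x_i, y_i)
        den += w_i * _cov(x_i, x_i)
        y_st += p_i * _mean(y_i)
        x_st += p_i * _mean(x_i)
    beta_hat = num / den
    return y_st + beta_hat * (x_bar_pop - x_st), beta_hat


# Hypothetical data: two strata with two sampled units each.
strata = [
    (40, [5.0, 9.0], [1.0, 2.0]),
    (60, [9.0, 15.0], [3.0, 5.0]),
]
y_lrc, beta_hat = combined_regression_estimate(strata, x_bar_pop=3.8)
print(round(y_lrc, 4), round(beta_hat, 4))
```

Because a single slope is used for all strata, only the overall mean of x is required, at the cost of ignoring differences in slope between strata.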
4.6 PRACTICAL EXAMPLES

Let $y_i$ $(i = 1, \ldots, N)$ be the variate under study and $x_i$ $(i = 1, \ldots, N)$ be the auxiliary variate. Let N be the population size, from which a sample of size n is drawn. Let $X_N$ be the population total of the auxiliary variate.

STEP-I: Calculate

$$\sum_{i=1}^{n} y_i, \quad \sum_{i=1}^{n} x_i, \quad \sum_{i=1}^{n} y_i^2, \quad \sum_{i=1}^{n} x_i^2 \quad\text{and}\quad \sum_{i=1}^{n} x_i y_i .$$

STEP-II: Calculate

$$s_y^2 = \frac{1}{n-1}\left[\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right], \qquad s_x^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right],$$

$$s_{xy} = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}\right],$$

$$b = \frac{s_{xy}}{s_x^2}, \qquad r = \frac{s_{xy}}{s_x\,s_y}, \qquad \bar{y}_n = \frac{1}{n}\sum y_i, \qquad \bar{x}_n = \frac{1}{n}\sum x_i, \qquad R_n = \frac{\bar{y}_n}{\bar{x}_n}, \qquad \bar{X}_N = \frac{X_N}{N} .$$

STEP-III: Calculate

(a) Ratio estimate

$$\bar{y}_R = \frac{\bar{y}_n}{\bar{x}_n}\,\bar{X}_N .$$

Estimate of its variance

$$\hat{V}(\bar{y}_R) = \left(\frac{1}{n} - \frac{1}{N}\right)\left(s_y^2 + R_n^2 s_x^2 - 2 R_n s_{xy}\right).$$

(b) Regression estimate ($\bar{y}_{lr}$)

$$\bar{y}_{lr} = \bar{y}_n + b\,(\bar{X}_N - \bar{x}_n).$$

Estimate of its variance

$$\hat{V}(\bar{y}_{lr}) = \left(\frac{1}{n} - \frac{1}{N}\right)\left(s_y^2 + b^2 s_x^2 - 2 b\,s_{xy}\right) = \left(\frac{1}{n} - \frac{1}{N}\right)(1 - r^2)\,s_y^2 .$$

(c) Simple mean estimate

$$\bar{y}_{SRS} = \bar{y}_n .$$

Estimate of its variance

$$\hat{V}(\bar{y}_{SRS}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_y^2 .$$

STEP-IV: Calculate the estimates of relative efficiency

(a) Estimate of relative efficiency of the ratio estimate over the simple mean estimate

$$= \frac{\hat{V}(\bar{y}_{SRS})}{\hat{V}(\bar{y}_R)} \times 100$$

(b) Estimate of relative efficiency of the regression estimate over the simple mean estimate

$$= \frac{\hat{V}(\bar{y}_{SRS})}{\hat{V}(\bar{y}_{lr})} \times 100$$

(c) Estimate of relative efficiency of the regression estimate over the ratio estimate

$$= \frac{\hat{V}(\bar{y}_R)}{\hat{V}(\bar{y}_{lr})} \times 100$$

Note: The estimate of the standard error (SE) of an estimate can be worked out by taking the square root of the corresponding estimate of the variance.
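STEP-II to STEP-IV can be sketched as one function that starts from the STEP-I sums. The function name `srs_estimates` and its argument names are hypothetical; the sums passed in are those given in the solution of Practical Exercise 2 of this chapter, so the printed values can be compared with that solution.

```python
import math


def srs_estimates(n, N, x_total, sum_y, sum_x, sum_y2, sum_x2, sum_xy):
    """Follow STEP-II to STEP-IV starting from the STEP-I sums."""
    # STEP-II: sample variances, covariance and derived quantities
    s_y2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
    s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
    b = s_xy / s_x2
    r = s_xy / math.sqrt(s_x2 * s_y2)
    y_bar, x_bar = sum_y / n, sum_x / n
    r_n = y_bar / x_bar
    x_bar_pop = x_total / N
    fpc = 1 / n - 1 / N

    # STEP-III: (estimate, estimated variance) for each method
    est = {
        "ratio": (r_n * x_bar_pop,
                  fpc * (s_y2 + r_n ** 2 * s_x2 - 2 * r_n * s_xy)),
        "regression": (y_bar + b * (x_bar_pop - x_bar),
                       fpc * (1 - r ** 2) * s_y2),
        "mean": (y_bar, fpc * s_y2),
    }

    # STEP-IV: estimated relative efficiencies, in percent
    eff = {
        "ratio over mean": 100 * est["mean"][1] / est["ratio"][1],
        "regression over mean": 100 * est["mean"][1] / est["regression"][1],
        "regression over ratio": 100 * est["ratio"][1] / est["regression"][1],
    }
    return est, eff


# STEP-I sums taken from Practical Exercise 2 of this chapter.
est, eff = srs_estimates(n=17, N=22654, x_total=1087004,
                         sum_y=566, sum_x=681,
                         sum_y2=23450, sum_x2=34617, sum_xy=26879)
for name, (value, var) in est.items():
    print(f"{name:10s} estimate={value:6.2f}  variance={var:6.2f}  "
          f"SE={math.sqrt(var):5.2f}")
for name, value in eff.items():
    print(f"{name:22s} {value:7.2f}")
```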
Practical Exercise 1
A sample survey for the study of yield and cultivation practices of guava was
conducted in Allahabad district. Out of a total of 146 guava growing villages in
Phulpur-Saran tehsil, 13 villages were selected by method of simple random
sampling. The Table below presents total number of guava trees and area under guava
orchards for the selected 13 villages. It is also given that the total area under guava
orchards of 146 villages is 354.78 acres.
Using area under guava orchards as auxiliary variate, estimate the total number of
guava trees in the tehsil along with its standard error, by using
(i) Ratio method of estimation, and
(ii) Regression method of estimation.
(iii) Discuss the efficiency of these estimates relative to the one which does not make use of the information on the auxiliary variate.
Sl. No. of    Total number of        Area under guava orchards
village       guava trees (y_i)      in acres (x_i)
 1             492                    4.80
 2            1008                    5.99
 3             714                    4.27
 4            1265                    8.43
 5            1889                   14.39
 6             784                    6.53
 7             294                    1.88
 8             798                    6.35
 9             780                    6.58
10             619                    9.18
11             403                    2.00
12             467                    2.20
13             197                    1.00
SOLUTION:
$\sum_{i=1}^{n} y_i = 9710$
$\sum_{i=1}^{n} x_i = 73.60$
$\sum_{i=1}^{n} y_i^2 = 9685234$
$\sum_{i=1}^{n} x_i^2 = 579.20$
$\sum_{i=1}^{n} x_i y_i = 72879.72$
$s_y^2 = 202717.60$
$s_x^2 = 13.54$
$s_{xy} = 1492.18$
$b = 110.19$
$r = 0.90$
$\bar{y}_n = 746.92$
$\bar{x}_n = 5.66$
$R_n = 131.93$
$\bar{X}_N = 2.43$

Ratio estimate: $\bar{y}_R = 320.59$, $\hat{V}(\bar{y}_R) = 3132.35$ (Estimate of Standard Error = 55.97)
Regression estimate: $\bar{y}_{lr} = 390.85$, $\hat{V}(\bar{y}_{lr}) = 2683.74$ (Estimate of Standard Error = 51.80)
Simple mean estimate: $\bar{y}_n = 746.92$, $\hat{V}(\bar{y}_n) = 14205.18$ (Estimate of Standard Error = 119.19)

(a) Estimate of Relative Efficiency of Ratio estimate over Simple Mean estimate = 453.50
(b) Estimate of Relative Efficiency of Regression estimate over Simple Mean estimate = 529.31
(c) Estimate of Relative Efficiency of Regression estimate over Ratio estimate = 116.72
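The figures of this exercise can be recomputed from the thirteen sampled villages with the sketch below. The values in the solution are estimates per village; since the exercise asks for the tehsil total, the sketch also multiplies each mean estimate and its standard error by N = 146. The variable names are hypothetical.

```python
import math

# Sample data of Practical Exercise 1: trees (y) and area in acres (x).
y = [492, 1008, 714, 1265, 1889, 784, 294, 798, 780, 619, 403, 467, 197]
x = [4.80, 5.99, 4.27, 8.43, 14.39, 6.53, 1.88, 6.35, 6.58, 9.18,
     2.00, 2.20, 1.00]
N, x_total = 146, 354.78          # villages in the tehsil, total area
n = len(y)

y_bar, x_bar = sum(y) / n, sum(x) / n
s_y2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
s_x2 = sum((v - x_bar) ** 2 for v in x) / (n - 1)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

fpc = 1 / n - 1 / N
r_n = y_bar / x_bar               # R_n
b = s_xy / s_x2                   # regression coefficient
x_bar_pop = x_total / N           # X-bar_N

# (mean estimate per village, estimated variance) for each method.
# s_y2 - b^2 s_x2 equals s_y2 + b^2 s_x2 - 2 b s_xy because b s_x2 = s_xy.
per_village = {
    "ratio": (r_n * x_bar_pop,
              fpc * (s_y2 + r_n ** 2 * s_x2 - 2 * r_n * s_xy)),
    "regression": (y_bar + b * (x_bar_pop - x_bar),
                   fpc * (s_y2 - b ** 2 * s_x2)),
    "simple mean": (y_bar, fpc * s_y2),
}

for name, (mean_est, var_est) in per_village.items():
    se = math.sqrt(var_est)
    print(f"{name:12s} mean={mean_est:8.2f} SE={se:7.2f} "
          f"total={N * mean_est:9.0f} SE(total)={N * se:8.0f}")
```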
Practical Exercise 2
A sample survey was conducted for studying milk yield, feeding and management practices of cattle and buffaloes in the eastern districts of U.P. The whole of the eastern districts of U.P. was divided into four Zones (strata). The Table below presents the total number of milch cows in 17 randomly selected villages of Zone-I as enumerated in the winter season and as per the Livestock Census.
                     Number of Milch Cows
Sl. No. of    Winter Season     Livestock Census
village       (y_i)             (x_i)
 1            29                41
 2            44                44
 3            25                27
 4            38                53
 5            37                17
 6            27                40
 7            63                53
 8            53                46
 9            64                89
10            30                37
11            53                70
12            25                15
13            16                30
14            15                18
15            12                22
16            12                13
17            23                66
Estimate the number of milch cows per village with its standard error for the rural
area of Zone-I in winter season by using (i) Ratio method of estimation, and
(ii) Regression method of estimation. It is given that total number of milch cows in
Zone-I as per Livestock Census was 10,87,004 and number of villages in Zone-I was
22,654. Also compare the efficiency of these estimates with Simple Mean estimate.
SOLUTION:
$\sum_{i=1}^{n} y_i = 566$
$\sum_{i=1}^{n} x_i = 681$
$\sum_{i=1}^{n} y_i^2 = 23450$
$\sum_{i=1}^{n} x_i^2 = 34617$
$\sum_{i=1}^{n} x_i y_i = 26879$
$s_y^2 = 287.85$
$s_x^2 = 458.56$
$s_{xy} = 262.86$
$b = 0.57$
$r = 0.72$
$\bar{y}_n = 33.29$
$\bar{x}_n = 40.06$
$R_n = 0.83$
$\bar{X}_N = 47.98$

Ratio estimate: $\bar{y}_R = 39.88$, $\hat{V}(\bar{y}_R) = 9.86$ (Estimate of Standard Error = 3.14)
Regression estimate: $\bar{y}_{lr} = 37.84$, $\hat{V}(\bar{y}_{lr}) = 8.06$ (Estimate of Standard Error = 2.84)
Simple mean estimate: $\bar{y}_n = 33.29$, $\hat{V}(\bar{y}_n) = 16.92$ (Estimate of Standard Error = 4.11)

(a) Estimate of Relative Efficiency of Ratio estimate over Simple Mean estimate = 171.67
(b) Estimate of Relative Efficiency of Regression estimate over Simple Mean estimate = 209.85
(c) Estimate of Relative Efficiency of Regression estimate over Ratio estimate = 122.24
Practical Exercise 3
A pilot sample survey for estimating the extent of cultivation and production of fresh
fruits was conducted in three districts of Uttar Pradesh State during the agricultural
year 1976-77. The following data were collected:

Stratum    Total number of     Total area under           Number of villages
number     villages (N_m)      orchards in ha. (X_m)      in sample (n_m)
1           985                11253                       6
2          2196                25115                       8
3          1020                18870                      11

Sample values of area under orchards in ha. (x_m) and total number of trees (y_m), listed village by village within each stratum:

Stratum 1   x_m:   10.63    9.90    1.45    3.38    5.17   10.35
            y_m:     747     719      78     201     311     448

Stratum 2   x_m:   14.66    2.61    4.35    9.87    2.42    5.60    4.70   36.75
            y_m:     580     103     316     739     196     235     212    1646

Stratum 3   x_m:   11.60    5.29    7.94    7.29    8.00    1.20   11.50    1.70    2.01    7.96   23.15
            y_m:     488     227     374     491     499      50     455      47     879     115     115
Estimate the total number of trees in the three districts by different methods and compare their
precision.
SOLUTION
The calculations are shown in the Table given below, where $W_m = N_m / N$ and $\hat{R}_m = \bar{y}_m / \bar{x}_m$.

Stratum   $W_m$    $\frac{1}{n_m} - \frac{1}{N_m}$   $\bar{x}_m$   $\bar{y}_m$   $\hat{R}_m$   $W_m \bar{x}_m$   $W_m \bar{y}_m$   $s_{xm}^2$   $s_{ym}^2$    $s_{xym}$
1         0.2345   0.16598                            6.81          417.33        61.28         1.60              97.66             16.03        74778.80      1008.75
2         0.5227   0.12454                           10.07          503.38        49.99         5.26             263.12            129.64       259107.98      5643.81
3         0.2428   0.08902                            7.97          340.00        42.66         1.94              82.55             38.39        65885.60      1403.69
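(A) RATIO ESTIMATORS

(i) Separate Ratio Estimate ($\hat{Y}_{Rs}$)

$$\hat{Y}_{Rs} = \sum_{m=1}^{K} \hat{R}_m X_m = 2750077$$

Estimate of its variance $\hat{V}(\hat{Y}_{Rs})$:

$$\hat{V}(\hat{Y}_{Rs}) = \sum_{m=1}^{K} N_m^2\left(\frac{1}{n_m} - \frac{1}{N_m}\right)\left(s_{ym}^2 + \hat{R}_m^2 s_{xm}^2 - 2\hat{R}_m s_{xym}\right) = 2441137855.48$$

(ii) Combined Ratio Estimate ($\hat{Y}_{Rc}$)

$$\hat{Y}_{Rc} = \frac{\sum_m W_m \bar{y}_m}{\sum_m W_m \bar{x}_m}\,X = 2783995$$

Estimate of its variance $\hat{V}(\hat{Y}_{Rc})$:

$$\hat{V}(\hat{Y}_{Rc}) = \sum_{m=1}^{K} N_m^2\left(\frac{1}{n_m} - \frac{1}{N_m}\right)\left(s_{ym}^2 + \hat{R}^2 s_{xm}^2 - 2\hat{R}\,s_{xym}\right), \quad\text{where } \hat{R} = \frac{\sum_m W_m \bar{y}_m}{\sum_m W_m \bar{x}_m} .$$

(iii) Efficiency of the Separate Ratio Estimate ($\hat{Y}_{Rs}$) over the Combined Ratio Estimate ($\hat{Y}_{Rc}$)

$$\text{Estimate of Relative Precision (R.P.)} = \frac{\hat{V}(\hat{Y}_{Rc})}{\hat{V}(\hat{Y}_{Rs})} \times 100 = 246.58\%$$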
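As a quick check of the separate ratio estimate of the total, the sketch below recomputes the stratum weights and $\sum_m \hat{R}_m X_m$ from the tabulated values of this exercise ($N_m$, $X_m$ and the rounded $\hat{R}_m$). The variable names are hypothetical, and because the inputs are rounded table entries the check covers this one estimate only.

```python
# Per stratum: (N_m, X_m, R_hat_m), as tabulated in Practical Exercise 3.
strata = [
    (985, 11253, 61.28),
    (2196, 25115, 49.99),
    (1020, 18870, 42.66),
]

N = sum(N_m for N_m, _, _ in strata)

# Stratum weights W_m = N_m / N, rounded as in the table.
weights = [round(N_m / N, 4) for N_m, _, _ in strata]

# Separate ratio estimate of the total: sum of R_hat_m * X_m.
y_rs_total = sum(r_hat * x_total for _, x_total, r_hat in strata)

print(N, weights, round(y_rs_total))
```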
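(B) REGRESSION ESTIMATORS

(i) Separate Regression Estimate ($\hat{Y}_{ls}$)

$$\hat{Y}_{ls} = \sum_{m=1}^{K} N_m\left[\bar{y}_m + b_m\,(\bar{X}_m - \bar{x}_m)\right] = 2672911,$$

where $b_m = s_{xym} / s_{xm}^2$ and $\bar{X}_m = X_m / N_m$.

Estimate of its variance $\hat{V}(\hat{Y}_{ls})$:

$$\hat{V}(\hat{Y}_{ls}) = \sum_{m=1}^{K} N_m^2\left(\frac{1}{n_m} - \frac{1}{N_m}\right)\left(s_{ym}^2 - b_m^2 s_{xm}^2\right) = 1870633332$$

(ii) Combined Regression Estimate ($\hat{Y}_{lc}$)

$$\hat{Y}_{lc} = N\left[\bar{y}_{st} + b_c\,(\bar{X} - \bar{x}_{st})\right] = 2643949,$$

where

$$b_c = \frac{\sum_{m=1}^{K}\sum_{j=1}^{n_m} (y_{mj} - \bar{y}_m)(x_{mj} - \bar{x}_m)}{\sum_{m=1}^{K}\sum_{j=1}^{n_m} (x_{mj} - \bar{x}_m)^2}, \qquad \bar{y}_{st} = \frac{1}{N}\sum_{m=1}^{K} N_m \bar{y}_m, \qquad \bar{x}_{st} = \frac{1}{N}\sum_{m=1}^{K} N_m \bar{x}_m$$

and $\bar{X} = X / N$.

Estimate of its variance $\hat{V}(\hat{Y}_{lc})$:

$$\hat{V}(\hat{Y}_{lc}) = \sum_{m=1}^{K} \frac{N_m^2\,(1 - f_m)}{n_m (n_m - 1)} \sum_{j=1}^{n_m}\left[(y_{mj} - \bar{y}_m) - b_c\,(x_{mj} - \bar{x}_m)\right]^2 = 2020917640,$$

where $f_m = n_m / N_m$.

a) The estimate of efficiency of the Separate Regression Estimate ($\hat{Y}_{ls}$) over the Separate Ratio Estimate ($\hat{Y}_{Rs}$) is given by

$$\text{Relative Precision (R.P.)} = \frac{\hat{V}(\hat{Y}_{Rs})}{\hat{V}(\hat{Y}_{ls})} \times 100 = 130.50\%$$

b) The estimate of efficiency of the Combined Regression Estimate ($\hat{Y}_{lc}$) over the Combined Ratio Estimate ($\hat{Y}_{Rc}$) is given by

$$\text{Relative Precision (R.P.)} = \frac{\hat{V}(\hat{Y}_{Rc})}{\hat{V}(\hat{Y}_{lc})} \times 100 = 297.86\%$$

c) The estimate of efficiency of the Separate Regression Estimate ($\hat{Y}_{ls}$) over the Combined Regression Estimate ($\hat{Y}_{lc}$) is given by

$$\text{Relative Precision (R.P.)} = \frac{\hat{V}(\hat{Y}_{lc})}{\hat{V}(\hat{Y}_{ls})} \times 100 = 108.03\%$$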