Normality distribution testing for levelling data obtained

Geonauka
Vol. 3, No. 1 (2015)
UDC: 519.246
DOI: 10.14438/gn.2015.08
Typology: 1.04 Professional Article
Article info: Received 2015-03-01, Accepted 2014-04-06, Published 2015-04-10
Normality distribution testing for levelling data obtained by geodetic control
measurements
Milan TRIFKOVIĆ1*
1
University of Novi Sad, Faculty of Civil Engineering, Subotica, Serbia
Abstract. Normal distribution of data is of crucial importance in data processing and hypothesis testing in
geodesy. Models of geodetic measurements adjustment assume that data are normally distributed. However,
results of measurements could be affected by different influences because geodetic data are obtained under
external conditions and under different limitations such as the deadlines or other processes on the
construction site. These possibilities implicate certain risks that deviations from normal distribution in
geodetic data could appear. Those deviations from normal distribution could spread through the
mathematical and stochastic models and violate conclusions based on geodetic data. To avoid mentioned
risks it is of considerable importance to devote attention to testing normality distribution of geodetic data
obtained from production measurements. In this paper one set of leveling data obtained from production
measurements was considered from aspect of its normal distribution curve.
Keywords: Shapiro-Wilk test, Pearson Chi-square test, skewness, kurtosis
*
Milan Trifković> [email protected]
40
Geonauka
Vol. 3, No. 1 (2015)
normality available in statistical literature but most
powerful test of researched one is Shapiro-Wilk test.
Shapiro and Wilk coined the test for normality
distribution of data in 1965 [4]. The Shapiro-Wilk
statistics is defined as
1 Introduction
The basic assumption for geodetic data
processing especially for their adjustment by least
square method is that they are normally distributed.
The models of data distribution are well researched
and their importance for least square method is
explained in literature [1]. Also the methods of
hypotheses of normally distribution of geodetic data
testing are given in literature [2]. Most common tests
for normality distribution of geodetic data testing are
Shapiro-Wilk test, Pearson’s Chi-square test and
Kolmogorov-Smirnov test but any conclusions about
distribution is suggested to be checked by other
characteristics of normally distributed data such as
skewness and kurtosis [2]. Some authors [3] have
found, on the base of simulated results, that ShapiroWilk test is most powerful normality test, followed by
Anderson-Darling
test,
Lilliefors
test
and
Kolmogorov-Smirnov test and that the power of all
these four tests is still low for small sample size. In
this paper the frequency histograms (as graphical
methods), skewness and kurtosis (as numerical
methods) and Shapiro-Wilk and Pearson’s Chi-square
method (as formal methods) will be used for testing
normality distribution as they are suggested in
literature [2].
For case study the set of n=1324 differences of
heights obtained by levelling for one object
deformation monitoring are analysed. The levelling
was performed by digital level and by using bar coded
levelling rods.
∑
= ∑
(2)
where:
- – ith order statistics,
- – sample mean and
- – coefficients from [4].
In order to accept the null hypothesis the statistics
shall fulfil the condition > ; = ;
where is number of observations and α is
significance level. In that case “there is no evidence,
from the test, of non-normality of this sample” [4].
Pearson’s Chi-square test of distributions is given
as follows [2]
" =
#
-
[2]
%
2 Background
$%∗ % '
∑()*
%
~",# ,. = − 1 (3)
)∗ – number of data in 1 23 interval,
) – theoretical number of data in 1 23 interval,
4) – probability 4) = % ;) = 4) and
",# – chi-square statistics with .-degrees of
freedom.
Determination of interval ends is given as follows
= 4) = 5
6)" − 5
6)8 = 5
9)" − 5
9)8 where (for normal distribution)
9)8
Normality distribution of geodetic data testing is
of considerable importance because it is the base for
correct utilization of least squares method in process
of measured data adjustment. If geodetic data were
not normally distributed it may cause the errors in
hypotheses testing and consequently lead to wrong
decisions in acceptance or rejecting ones. To avoid
these possibilities it is recommended that normality of
data distribution shall be tested before adjustment.
Before starting procedures of normality tests the
data should be grouped and tabled. For large samples
of data optimal number of interval shall be determined
as [2].
≤ 5
=
:% =
∆<%
, 9)88
=
:% >
=
∆<%
(4)
(5)
In spite of results of normality distribution
conclusions obtained by formal tests (1) and (2)
literature [2] proposes careful approach and suggests
utilization of skewness and kurtosis measure as well
as theoretical relationship between mean square error,
average error and probable error as methods for
normality distribution checking for sample of data.
Skewness and kurtosis are given by formulae:
?@( = AB ∑*6 − 6C
(1)
where is optimal number of intervals and is
number of data.
According to [3] there are nearly 40 tests of
where
41
(6)
DE = AF ∑*6 − 6G
(7)
6 = ∑* 6
(8)
Geonauka
H # = ∑*6 − 6#
Vol. 3, No. 1 (2015)
normality by Shapiro-Wilk test and there was no
reason to reject null hypothesis because:
(9)
Even though some modifications of formulae 4
and 5 for skewness and kurtosis exist in
contemporary literature, in this paper we will use only
here showed ones.
Average and probable error are defined as follows
[2]:
Θ = ∑*|Δ |
p|Δ| < O =
= 0.95 > b;c.db = 0.874
Table 1. Ends of intervals, empirical frequencies and
averages for each interval
#
-∞
(11)
The theoretical relationships between errors is
given as
P: Θ: O = 1:1.25:1.48
h∗i
f, g
(10)
(12)
Research in this paper will be mostly provided
according to the models given by formulae (1) to (12).
h∗i
j
k li
h∗i
i*j
-0.198
1
-0.173
0
-0.21
-0.148
1
-0.14
-0.124
2
-0.12
-0.099
30
-0.10
-0.074
48
-0.07
-0.050
142
-0.05
-0.025
206
-0.02
3 Normality testing of levelling data obtained
from production measurements - Case Study
0.000
367
0.00
0.024
225
0.02
In this paper normality distribution of set of
levelling data is tested. Data are obtained by digital
level with bar coded rods. Data were collected for
purpose of deformation monitoring of object.
0.049
212
0.05
0.074
54
0.07
0.098
30
0.10
0.123
4
0.12
3.1 Description of measurement
0.148
2
0.14
+∞
The method of levelling was “back”-“for”-“for”“back” on the each station. Differences of heights
differences obtained on all stations are the object of
analysis.
Heights differences on the each station was
calculated by formulae
Δ3
,UVWVX
= Y − ,
ΔYZ(WVX
= Y# − ,#
3
Consequently, according to Shapiro-Wilk test it is
possible to state that the analysed data follow the
normal distribution. But bearing in mind that the
power of normality distribution tests is low for small
sample size [3] in next step we shall also use the
Pearson’s Chi-square test for normality.
According to optimal number of intervals and
condition )∗ ≥ 5 the table for normality distribution
test of levelling data was formed. Span of intervals
was calculated as follows
(13)
(14)
Differences of height differences for each station
are
[ = Δ3
,UVWVX
− ΔYZ(WVX
3
n=
(15)
Set of data for analysis is consisted of =1324
results of [ i.e. of 1324 measured height differences.
n [o: − [o 0.16 − −.21
=
=
= 0.024667
15
≈ 0.025mm
Optimal number of intervals according to (1) is
15 because:
Table 2 shows the data for normality distribution
testing by Pearson’s Chi-square test.
According to results of Pearson’s Chi-square test
we shall reject the null hypothesis and accept the
alternative one, because
Table 1 shows the ends of intervals, empirical
frequencies and averages for each group of data.
Averages of data for intervals were tested on
Different results suggest further analysis
consisted of skewness, kurtosis and theoretical
3.2 Obtained Results and Discussion
≤ 5 ∗ log 1324 = 15
#
" # = 138.30 > ";c.ddd
= 31.26
42
Geonauka
relationship between characteristic errors. Diagrams 1
and 2 shows the polygon of frequencies and histogram
Vol. 3, No. 1 (2015)
of frequencies related to theoretical frequencies for
normal distribution, respectively.
Table 2. Data tabled for Pearson’s Chi-square test
xj
tj
nj*
pj
npj
nj*-npj
(nj*-npj)^2/npj
-0.98005
0.00997
13.205
20.795
32.748
48
-0.92117
0.02944
38.983
9.017
2.086
-1.188
142
-0.76520
0.07798
103.250
38.750
14.543
-0.619
206
-0.46438
0.15041
199.146
6.854
0.236
0.000
-0.049
367
-0.03917
0.21260
281.488
85.512
25.977
0.024
0.521
225
0.39786
0.21851
289.310
-64.310
14.295
0.049
1.090
212
0.72435
0.16325
216.137
-4.137
0.079
0.074
1.660
54
0.90300
0.08933
118.268
-64.268
34.924
0.098
2.230
30
0.97420
0.03560
47.134
-17.134
6.229
+∞
+∞
6
1.00000
0.01290
17.080
-11.080
7.187
1.00000
1324
0.000
-∞
-∞
-0.099
-2.328
34
-0.074
-1.758
-0.050
-0.025
F(tj)
-1.00000
1324
400
" # =138.30
Polygon of frequences
Fre…
Th…
300
200
100
0
Diagram 1. Polygon of empirical frequencies related to theoretical frequencies
Histogram of frequences
400
Frequ…
Theor…
300
200
100
0
-∞-2.328 -1.758
-1.188
-0.619
-0.049
0.521
1.090
1.660
2.230
+∞
Diagram 2. Histogram of empirical frequencies related to theoretical frequencies
Literature [2] contains detailed analytical models
for skewness and kurtosis analysis which will be
performed next.
Quantiles of distribution ?(; for statistics ?@( ,
where probability 4
?@( < ?(; = 4, are tabulated
and value for = 1324 and 4 = 0.95 is ±0.111.
According to formulae (6) and (7) the estimation
of skewness and kurtosis are
?@( = −0.141
DE = 3.372
sE = 3.372 − 3 = +0.372
43
Geonauka
Because ?@( = −0.141 ∉ −0.111, +0.111 we shall
reject null hypothesis about symmetry of empirical
distribution of analysed data for significance level
v = 0.05. However for
4 = 0.99 we have ?@( =
−0.141 ∈ −0.158, +0.158 meaning that for
significance level v = 0.01 there is no reason for
rejecting null hypothesis about the symmetry of
analysed data.
Criterion for accepting null hypothesis about
kurtosis is given as follows:
Vol. 3, No. 1 (2015)
probability density function of the observables is not
needed to routinely apply a least-squares algorithm
and compute estimates for the parameters of interest.
For the interpretation of the outcomes, and in
particular for the statements on the quality of the
estimator, the probability density has to be known.”
Bearing in mind that geodetic control measurements
are the base for conclusion about the state of an object
or engineering structure it could be stated that
knowledge of probability density is significant.
24
xsE x < 9 y = 0.264
4 Conclusions
Case study showed that it is possible to accept
opposite hypothesis about the distribution of the one
set of data. As set of measured data was quite big it
possible to state that all tests are reliable.
This fact could open different questions about
results of production measurements. For example
during the measurements it could exist some
influences which disturbed the normal distribution of
data or there was some gross errors contaminating the
results of measurements. However, production
measurements are almost impossible to repeat even
some gross errors or some influences have been
detected. This may be caused by the limited time for
measurements, by some technological processes or by
changes of measured object with time. Also
production measurements are the base for certain
decisions which, if not based on reliable data, could
lead to unacceptable losses.
One of possible solution for this problem in these
conditions (limited possibilities for measurements
repetition and opposite hypotheses acceptance) is to
analyse the influence of deviations from normal
distribution to the reliability of final results for every
measurement.
For our set of data:
xsE x = +0.372 > 0.264
meaning that null hypothesis shall be rejected.
Also, comparing the mean square error, average
error and relative error it is obtained:
z = 0.0433
Θ = 0.0303
O = 0.0300
z: Θ:O = 1: 1.43: 1.44
meaning that theoretical relationship is not
satisfied.
According to [5] Jarque-Bera (SkewnessKurtosis) test is given by formula
{
A(|W|AA
}
+
(~V2UAAC
#G
 ~" # 2
(16)
Replacing obtained values for ?@( = −0.141 and
xsE x = +0.372 in formula (16) we have got
1324 €
References
−0.141# 0.372#
#
+
 = 12.05 > "#;c.dd
= 11.34
6
24
[1] Perović, G.: Least Squares. University of Belgrade,
Faculty of Civil Engineering, Belgrade. 2005.
[2] Perović, G.: Adjustment calculus, theory of
measurements error (in SRB:Рачун изравнања,
теорија грешака мерења). University of Belgrade,
Faculty of Civil Engineering, Belgrade.1989.
[3] Razali, N. M., Wah, Y. B.: Power comparisons of
Shapiro-Wilk, Kolmogorov-Smirnov, Lillefors and
Anderson-Darling tests. Journal of Staistical Modeling
and Analytics, Vol.2.No.1, pp 21-33. 2011.
[4] Shapiro, S.S., Wilk, M.B.: An Analysis of Variance Test
for Normality (Complete Samples). Biometrika, Vol.52,
No.3/4, pp 591-611. 1965.
[5] Park, Hun Myoung: Univariate Analysis and Normality
which means that there is no reason to accept null
hypothesis even for significance level v = 0.01.
Summarizing the obtained results only the
Shapiro-Wilk test confirmed the normal distribution
of analysed data, while Pearson’s Chi-square test,
skewness and kurtosis analyses as well as theoretical
relationship between errors could not confirm normal
distribution of analysed data. This situation implicate
that different methods could lead to opposite
conclusions about the distribution of data.
In literature [6] is stated “Knowledge of the
44
Geonauka
Vol. 3, No. 1 (2015)
Test Using SAS, Stata and SPSS. Technical Working
Paper. The University Information technology Services
(UITS) Center for Statistical and Mathematical
Computing, Indiana University. 2008
[6] Tiberius, C.C.J.M., Borre, K.: Are GPS data normally
distributed?
(http://link.springer.com/chapter/10.1007%2F978-3642-59742-8_40#page-2)
45