Modeling Road Accident Data of Saudi Arabia

Modeling Road Accident Data of Saudi
Arabia
A THESIS
SUBMITTED TO THE GRADUATE EDUCATIONAL POLLCIES COUNCIL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
For the degree
MASTER OF SCIENCE
By
Abdulrahman Alshammari
Advisor Dr. Rahmatullah Imon
Ball State University
Muncie, Indiana
May 2015
ACKNOWLEDGEMENTS
Foremost, I would like to express my sincere gratitude to my advisor Professor Dr. Rahmatullah
Imon for the continuous support of my thesis study, for his patience, motivation, enthusiasm, and
immense knowledge. His guidance helped me in all the time during my analysis and writing the
report. I could not have imagined having a better advisor and mentor for my thesis other than
him. Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Munni
Begum and Dr. Yayuan Xiao for their encouragement, insightful comments and patience. I am
thankful to all my classmates for their kind supports. Last but not the least; I would like to thank
my family: my mother, my brothers and sisters, for supporting me throughout my life.
Abdulrahman Alshammari
March 25, 2015
Table of Contents
CHAPTER 1
1
INTRODUCTION
1
1.1 Saudi Arabia Road Accident Data
1
1.3 Outline of the Study
10
11
CHAPTER 2
ESTIMATION OF MISSING VALUES, CROSS VALIDATION AND ARIMA MODELS IN TIME SERIES
2.1 Estimation of Missing Values
11
2.2 Estimation of Missing Values in Time Series
13
2.3 Robust Estimation Methods
16
2.4 Bootstrap
21
2.5 Stochastic Time Series Modeling
25
2.6 Cross Validation in Time Series
39
2.7 Computation
43
44
CHAPTER 3
ESTIMATION OF MISSING VALUES BY ROBUST EM AND BOOTSTRAP ALGORITHM FOR SAUDI
ARABIA ROAD ACCIDENT DATA
3.1 Estimation of Missing Values
3.2 Evaluation of Estimation of Missing Values
44
Error! Bookmark not defined.
50
CHAPTER 4
SELECTION OF ARIMA MODELS FOR SAUDI ARABIA ROAD ACCIDENTS DATA
59
59
4.1 Total Number of Road Accidents in Saudi Arabia
59
4.2 Total Number of Road Accidents in Riyadh
62
4.3 Total Number of Road Accidents in Mecca
64
4.4 Total Number of Road Accidents in Eastern Province
66
4.5 Total Number of Road Accidents in Medina
68
4.6 Total Number of Road Accidents in Al Qassim
70
4.7 Total Number of Road Accidents in Tabuk
73
4.9 Total Number of Road Accidents in Asir
75
4.10 Total Number of Road Accidents in Albahih
77
4.11 Total Number of Road Accidents in Northern Borders
79
4.12 Total Number of Road Accidents in Al Jouf
81
4.13 Total Number of Road Accidents in Ha'il
84
4.14 Total Number of Road Accidents in Najran
86
4.15 Total Number of Road Accidents in Jizan
88
4.17 Result Summary
90
CHAPTER 5
CONCLUSIONS AND DIRECTION OF FUTURE RESEARCH
92
92
5.1 Conclusions
92
5.2 Direction of Future Research
93
REFERENCES
94
List of Tables
Chapter 1
Table 1.1: Total Number of Road Accidents in Saudi Arabia
2
Table 1.2: Total Number of Road Accidents in Riyadh
2
Chapter 2
Table 2.1: Specification of ARIMA Models
39
Chapter 3
Table 3.1: Estimates of Missing Values for the Year 2003 Using Conventional Methods
44
Table 3.2: Estimates of Missing Values for the Year 2004 Using Conventional Methods
44
Table 3.3: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods
47
Table 3.4: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods
48
Table 3.5: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods
48
Table 3.6: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods
49
Table 3.7: A Comparison Between Exact and Estimated Values of the Total Number of
Road Accidents for 2004 in Saudi Arabia
50
Table 3.8: Total Number of Road Accidents in Saudi Arabia after the Missing Value
Estimation
51
Table 3.9: Total Number of Road Accidents in Different Region of Saudi Arabia after the
52
Missing Value Estimation
Chapter 4
Table 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi
Arabia
60
Table 4.2: Significance of ARIMA models fir the Number of Road Accidents in Saudi
Arabia
61
Table 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh
62
Table 4.4: Significance of ARIMA models fir the Number of Road Accidents in Riyadh
63
Table 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Mecca
64
Table 4.6: Significance of ARIMA models fir the Number of Road Accidents in Mecca
65
Table 4.7: The ACF and PACF Values for the Total Number of Road Accidents in E
Province
66
Table 4.8: Significance of ARIMA models fir the Number of Road Accidents in E
Province
67
Table 4.9: The ACF and PACF Values for the Total Number of Road Accidents in
Medina
69
Table 4.10: Significance of ARIMA models fir the Number of Road Accidents in Medina
69
Table 4.11: The ACF and PACF Values for the Total Number of Road Accidents in
Qassim
71
Table 4.12: Significance of ARIMA models fir the Number of Road Accidents in Qassim
72
Table 4.13: The ACF and PACF Values for the Total Number of Road Accidents in
Tabuk
73
Table 4.14: Significance of ARIMA models fir the Number of Road Accidents in Tabuk
74
Table 4.15: The ACF and PACF Values for the Total Number of Road Accidents in Asir
75
Table 4.16: Significance of ARIMA models fir the Number of Road Accidents in Asir
76
Table 4.17: The ACF and PACF Values for the Total Number of Road Accidents in
Albhih
77
Table 4.18: Significance of ARIMA models fir the Number of Road Accidents in Albahih
78
Table 4.19: The ACF and PACF Values for the Total Number of Road Accidents in N
Borders
80
Table 4.20: Significance of ARIMA models fir the Number of Road Accidents in N
Borders
80
Table 4.21: The ACF and PACF Values for the Total Number of Road Accidents in Al
Jouf
82
Table 4.22: Significance of ARIMA models fir the Number of Road Accidents in Al Jouf
83
Table 4.23: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il
84
Table 4.24 Significance of ARIMA models fir the Number of Road Accidents in Ha’il
85
Table 4.25: The ACF and PACF Values for the Total Number of Road Accidents in
Najran
86
Table 4.26: Significance of ARIMA models fir the Number of Road Accidents in Najran
87
Table 4.27: The ACF and PACF Values for the Total Number of Road Accidents in Jizan
88
Table 4.28: Significance of ARIMA models fir the Number of Road Accidents in Jizan
89
Table 4.29 Selected Models for Saudi Arabia Road Accident Data
90
List of Figures
Chapter 1
Figure 1.1: The Political Map of Saudi Arabia with Different Regions
2
Figure 1.2: Time Series Plot of Total Road Accidents in Saudi Arabia
5
Figure 1.3: Time Series Plot of Total Road Accidents in Riyadh
5
Figure 1.4: Time Series Plot of Total Road Accidents in Mecca
5
Figure 1.5: Time Series Plot of Total Road Accidents in Eastern Province
6
Figure 1.6: Time Series Plot of Total Road Accidents in Medina
6
Figure 1.7: Time Series Plot of Total Road Accidents in Al Qassim
6
Figure 1.8: Time Series Plot of Total Road Accidents in Tabuk
7
Figure 1.9: Time Series Plot of Total Road Accidents in Asir
7
Figure 1.10: Time Series Plot of Total Road Accidents in Albahih
7
Figure 1.11: Time Series Plot of Total Car Accident in Northern Borders
8
Figure 1.12: Time Series Plot of Total Car Accident in Al Jouf
8
Figure 1.13: Time Series Plot of Total Car Accident in Ha’il
8
Figure 1.14: Time Series Plot of Total Car Accident in Najran
9
Figure 1.15: Time Series Plot of Total Car Accident in Jizan
9
Chapter 2
Figure 2.1: A Schematic Diagram of Bootstrap and Simulation
23
Chapter 3
Figure 3.1: Time Series Plot of Original and Estimated Road Accidents in Saudi Arabia
54
Figure 3.2: Time Series Plot of Original and Estimated Road Accidents in Riyadh
54
Figure 3.3: Time Series Plot of Original and Estimated Road Accidents in Mecca
54
Figure 3.4: Time Series Plot of Original and Estimated Road Accidents in E Province
55
Figure 3.5: Time Series Plot of Original and Estimated Road Accidents in Medina
55
Figure 3.6: Time Series Plot of Original and Estimated Road Accidents in Al Qassim
55
Figure 3.7: Time Series Plot of Original and Estimated Road Accidents in Tabuk
56
Figure 3.8: Time Series Plot of Original and Estimated Road Accidents in Asir
56
Figure 3.9: Time Series Plot of Original and Estimated Road Accidents in Albahih
56
Figure 3.10: Time Series Plot of Original and Estimated Road Accidents in N Borders
57
Figure 3.11: Time Series Plot of Original and Estimated Road Accidents in Al Jouf
57
Figure 3.12: Time Series Plot of Original and Estimated Road Accidents in Ha’il
57
Figure 3.13: Time Series Plot of Original and Estimated Road Accidents in Najran
58
Figure 3.14: Time Series Plot of Original and Estimated Road Accidents in Jizan
58
Chapter 4
Figure 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi
Arabia
59
Figure 4.2: The ACF and PACF Values for the Total Number of Road Accidents in
Riyadh
62
Figure 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Mecca
64
Figure 4.4: The ACF and PACF Values for the Total Number of Road Accidents in E
Province
66
Figure 4.5: The ACF and PACF Values for the Total Number of Road Accidents in
Medina
68
Figure 4.6: The ACF and PACF Values for the Total Number of Road Accidents in
Qassim
71
Figure 4.7: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk
73
Figure 4.8: The ACF and PACF Values for the Total Number of Road Accidents in Asir
75
Figure 4.9: The ACF and PACF Values for the Total Number of Road Accidents in
Albahih
77
Figure 4.10: The ACF and PACF Values for the Total Number of Road Accidents in N
Borders
79
Figure 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al
Jouf
82
Figure 4.12: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il
84
Figure 4.13: The ACF and PACF Values for the Total Number of Road Accidents in
Najran
86
Figure 4.14: The ACF and PACF Values for the Total Number of Road Accidents in Jizan
88
CHAPTER 1
INTRODUCTION
Road accident has been considered as one of the most serious problems in Saudi Arabia since
1994 [see Abdulrahman (2013)]. Saudi Arabia is ranked 12th in the world in the number of road
fatality rate per 100,000 inhabitants according to a latest survey. The fatality rate for Saudi
Arabia is 24.8 which is more than twice of the fatality rate of the USA (11.2). According to
Abdulrahman (2013), 570 people died in road accident in 1971. In 1994 this number got
increased to 4077. Desperate lots of desperate measures taken after 1994, there were 6205 deaths
in road accident in 2013. More than 68,000 people are injured annually, and the lost amounts of
money are more than 13 billion SR (3.5 billion USD) in every year. This problem is not new in
Saudi Arabia. However, its severity has been increasing over the years since Saudi Arabia is
developing and its road network is growing very fast.
1.1 Saudi Arabia Road Accident Data
In our study, we would like to consider road accident data of the entire Saudi Arabia and it’s
different regions from year 1999 to 2010. This data set is taken from the official website of Saudi
Arabian Ministry of Road Safety department. Here is the link of the data:
http://www.moi.gov.sa/wps/portal/traffic/!ut/p/b1/jc_JDoIwEAbgZEJOmVaU46lLG1MVSSg9GJ6MEbDkhjj84uKR9S5TfL9sxBHmpAiChpRgWRPXO_v55O_n
Yfet8_eLQ5cbZQRFmmumALDaA2l4ohbNoJmBDFAGsVUgtAWwOiiDitbS2PhvzzMlPyZL_
2V7Ih7sXxVl5hrpJAuRlZwvV7yCkyOE4B1aBNaKJlBoVnCI_1ZNIFvf7wnzB60kN3JJ1rM1GaC5NB8ACybrkI/dl4/d5/L2dJQSEvUUt3QS80SmtFL1o2X0dOVlMzR0gzMTB
1
FNkMwSVE1SE9LNVUwSTgx/?WCM_GLOBAL_CONTEXT=/wps/wcm/connect/Traffic/Tra
ffic+AR/Main/Statistics/
At first we present time series plot of the total number of road accidents for the entire Saudi
Arabia and then for the different regions which are Riyadh, Mecca, Eastern Province, Medina,
Al-Qassim, Tabuk, Asir, Albahih, Northern Borders, Al Jouf, Ha’il, Najran, Jizan. These regions
are located in the political map of Saudi Arabia as presented in Figure 1. Data for the entire
Saudi Arabia is given in Table 1.1 and the regional data is given in Table 1.2.
Figure 2.1: The Political Map of Saudi Arabia with Different Regions
2
Table 1.1: Total Number of Road Accidents in Saudi Arabia
Year
Number of accidents
1999
264326
2000
267772
2001
280401
2002
305649
2003
-
2004
261872
2005
293281
2006
296015
2007
283648
2008
435264
2009
485931
2010
484805
Table 1.2: Total Number of Road Accidents in Different Region of Saudi Arabia
Year
Riyadh
Mecca
Eastern
Medina
Province
Al-
Tabuk
Asir
Qassim
2000
80067
94669
56182
747
7213
3591
10242
2001
84313
89520
64381
700
8266
3060
12469
2002
86387
97091
69143
973
8051
4033
18998
2003
-
-
-
-
-
-
-
2004
-
-
-
-
-
-
-
3
2005
47803
95649
76435
13422
12901
5043
18230
2006
49019
93987
71731
14471
11687
5974
22108
2007
48142
86854
62033
18988
13370
6061
19024
2008
134915
96204
107780
21334
14768
7503
20556
2009
143466
107932
127460
20956
17645
11634
22929
2010
141549
112844
112420
18448
18622
15500
25263
Northern_
Al Jouf
Hail
Najran
Jizan
Year Albahih
Borders
2000
1169
3999
3447
3112
2431
903
2001
1716
4174
4248
3686
2412
1456
2002
-
-
-
-
-
-
2003
-
-
-
-
-
-
2004
2754
4930
5960
4878
2008
2620
2005
3293
4034
6889
4502
1823
3257
2006
3515
5668
7259
5835
1710
3051
2007
3208
6805
7170
5274
1963
4756
2008
2843
6160
9446
5884
2631
5240
2009
4245
6799
7284
5714
4215
5652
2010
5117
9449
7038
6069
4340
8146
4
Figure 1.2: Time Series Plot of Total Road Accident in Saudi Arabia
Figure 1.3: Time Series Plot of Total Road Accident in Riyadh
Figure 1.4: Time Series Plot of Total Road Accident in Mecca
5
Figure 1.5: Time Series Plot of Total Road Accident in Eastern Province
Figure 1.6: Time Series Plot of Total Road Accident in Medina
Figure 1.7: Time Series Plot of Total Road Accident in Al-Qassim
6
Figure 1.8: Time Series Plot of Total Road Accident in Tabuk
Figure 1.9: Time Series Plot of Total Road Accident in Asir
Figure 1.10: Time Series Plot of Total Road Accident in Albahih
7
Figure 1.11: Time Series Plot of Total Road Accident in Northern Borders
Figure 1.12: Time Series Plot of Total Road Accident in Al Jouf
Figure 1.13: Time Series Plot of Total Road Accident in Ha’il
8
Figure 1.14: Time Series Plot of Total Road Accident in Najran
Figure 1.15: Time Series Plot of Total Road Accident in Jizan
The time series plots of the variables reveal lots of interesting features of time series. All of the
plots show that there is an increasing trend in the number of road accidents in the entire Saudi
Arabia and all regions of it. Another interesting feature of this data is there are missing
observations for all variables. For the entire Saudi Arabia the total number of accident for the
year 2003 is missing, however for each region total number of accidents for the year 2003 and
2004 are missing.
9
1.2 Outline of the Study
We organize this thesis in the following way. In chapter 2, we introduce different methodologies
we use in our research that include the estimation of missing values by a variety of robust and
bootstrap expectation minimization (EM) algorithm, validation technique for assessing the
goodness of estimation of missing values, and ARIMA models for fitting the data. Since all
variables that we consider in our study have missing observations, in chapter 3, we employ a
variety algorithm for estimating the missing values. Since this is a time series problem, the
conventional techniques which assume independence of observations are not appropriate here.
Lack of normaility could be always to big issue, hence to avoid this issue we consider robust and
nonparametric bootstrap techniques to estimate the missing values. Later we employed the mean
average percentage error (MAPE) to validate which method estimates the missing value in the
best possible way. In Chapter 4 in order to determine the most appropriate ARIMA model we
employed both graphical methods based on autocorrelation function (ACF) and the partial
autocorrelation function and numerical tests such as the t-test and the Ljung-Box test based on
the ACF and the PACF. In Chapter 5, we draw conclusions on our current research and suggest
appropriate ARIMA models for all variables we considered in our study. In this chapter we also
outline our directions for future research.
10
CHAPTER 2
ESTIMATION OF MISSING VALUES, CROSS VALIDATION
AND ARIMA MODELS IN TIME SERIES
In this chapter we discuss different aspects of data analysis techniques useful in time series
analysis. Although the prime topic of our discussion will be ARIMA models but we will start
discussing with the missing value technique. For running the ARIMA model it is essential to
estimate the missing values because if there is any missing value in the series the estimates of
parameters in ARIMA model does not converge. We also talk about the cross validation
technique as a measure of accuracy of the estimates of the missing values.
2.1 Estimation of Missing values
Missing data are a part of almost all research and it has a negative influence on the analysis, such
as information loss and, as a result, a loss of efficiency, loss of unbiasedness of estimated
parameters and loss of power. An excellent review of different aspects of missing values is
available in Little and Rubin (2002). In this section we introduce few commonly used missing
values estimation techniques.
2.1.1 Mean Imputation Technique
Among the different methods for solving the missing value, Imputation methods (Little and
Rubin (2002)) is one of the most widely used technique to solve incomplete data problems.
Therefore, this study stresses on several imputation methods to determine the best methods to
replace missing data.
11
Let us consider n observations x1 , x2 ,..., xn of which m values are missing denoted
by x1* , x2* ,..., xm* . Thus the observed data with missing values are
x1 , x2 ,..., xn1 , x1* , xn1 1 , xn1 2 ,..., xn2 , x2* , xn2 1 , xn2  2 ,..., xm* , xn
(2.1)
Therefore, the first missing value occurs after n1 observations, the second missing value occur
after n 2 observations, and so on. Note that there might be more than one consecutive missing
observation.
Mean-before Technique
The mean-before technique is one of the most popular imputation techniques in handling missing
data. This technique consists of substituting all missing values with the mean of all available data
before missing values. Thus for the data in (2.1), x1* will be replaced by
x1
1 n1
 xi
n1 i 1
(2.2)
and x 2* will be replaced by
x2
n2
1
 xi
(n2  n1  1) i n1 1
(2.3)
and so on.
Mean-before-after Technique
The mean-before-after technique substitutes all missing values with the mean of one datum
before the missing value and one datum after the missing value. Thus for the data in (2.1), x1* will
be replaced by
x1
x n1  x n1 1
(2.4)
2
12
and x 2* will be replaced by
x2
x n2  x n2 1
(2.5)
2
and so on.
2.1.2 Median Imputation Technique
Since mean is highly sensitive to extreme observations and/or outliers, median imputation is
becoming more popular now-a-days [Mamun (2014)]. Instead of mean the missing value for a
data set given in (2.1), the missing value is estimated by the median.
2.1.3 Expectation Maximization Algorithm
Let y is an incomplete data vector whose density function is
parameter. If y were complete, the maximum likelihood of
) where
would be based on the distribution
of y. The log-likelihood function of y,
is required to be
maximized. As y is incomplete we may denote that as (
incomplete data and
is a p-dimensional
) where
is the observed
is the unobserved missing data. Let assume that the missing data is
missing by random, then:
(2.6)
Considering the log-likelihood function:
(2.7)
The EM algorithm focused on maximizing
in each iteration by replacing it by its
conditional expectation given the observed data
. The EM algorithm has an E-step
(estimation step) followed by an M-step (maximization step) as follows:
13
E-step: Compute
where
(2.8)
M-step: Find
such that
(2.9)
The E-step and M-step are repeated alternately until the difference
, where
is less that
is a small quantity.
If the convergence attribute of the likelihood function of the complete data, that is
, is
attainable then convergence of EM algorithm also attainable. The rate of convergence depends
on number of missing observations. Dempster, Laird, and Rubin (1977) show that convergence is
linear with rate proportional to the fraction of information about
2.2
in
that is observed.
Estimation of Missing Values in Time Series
The missing value estimation discussed above are designed for independent observations. But in
regression or in time series we assume a model and that should have a consideration when we try
to estimate missing values. In time series things are even more challenging as the observations
are dependent. Here we consider several commonly used missing value technique useful for time
series.
2.2.1 Estimation with Trend Models
We begin with simple models that can be used to forecast a time series on the basis of its past
behavior. Most of the series we encounter are not continuous in time, instead, they consist of
discrete observations made at regular intervals of time. We denote the values of a time series by
14
{ y t }, t = 1, 2, …, T. Our objective is to model the series y t and use that model to forecast y t
beyond the last observation yT . We denote the forecast l periods ahead by yˆ T l .
We sometimes can describe a time series y t by using a trend model defined as
yt  TR t   t
(2.10)
where TR t is the trend in time period t.
Linear Trend Model:
TR t   0  1t
(2.11)
We can predict y t by
yˆ t  ˆ0  ˆ1t
(2.12)
Polynomial Trend Model of Order p
TR t   0  1t   2 t 2  ...   p t p
(2.13)
If the number of observation is not too large, we can predict y t by
yˆ t  ˆ0  ˆ1t  ˆ2t 2  ...  ˆ pt p
(2.14)
Exponential Trend Model:
The exponential trend model which is often referred to a growth curve model is defined as
y t   0 1  t
t
(2.15)
If  0 > 0 and  1 > 0, applying a logarithmic transformation to the above model yields
y t   0  1 t   t
*
*
*
*
(2.16)
*
where yt  ln yt ,  0  ln  0 , 1  ln 1 and  t  ln  t .
*
*
*
15
2.2.2 Estimation with Smoothing Techniques
Several decomposition and smoothing techniques are available in the literature for estimating the
missing value of a time series data.
Moving Average
Moving average is a technique of updating averages by dropping the earliest observation and
adding the latest observation of a series. Let us suppose we have T = nL observations, where L is
the number of seasons, then the first moving average computed for that season is obtained as
y1   y1  y 2  ...  y L  / L
(2.17)
The second moving average is obtained as
y2   y2  ...  y L  y L1  / L  y1   y L1  y1  / L
(2.18)
In general, the m-th moving average is obtained as
y m  y m1   y m L1  y m1  / L
(2.19)
Centered Moving Average
Instead of moving average we often use the centered moving average (CMA) which is the
average of two successive moving average values.
Some other useful smoothing techniques that can be used for estimating the missing values in
time series are exponentially weighted moving average (EWMA) and the Holt-Winter model.
2.3
Robust Estimation Methods
Robustness is now playing a key role in time series. According to Kadane (1984) ‘Robustness is
a fundamental issue for all statistical analyses; in fact it might be argued that robustness is the
subject of statistics.' The term robustness signifies insensitivity to small deviations from the
16
assumption. That means a robust procedure is nearly as efficient as the classical procedure when
classical assumptions hold strictly but is considerably more efficient over all when there is a
small departure from them. The main application of robust techniques in a time series problem is
to try to devise estimators that are not strongly affected by outliers or departures from the
assumed model. In time series, robust techniques grew up in parallel to diagnostics (Hampel et
al., 1986) and initially they were used to estimate parameters and to construct confidence
intervals in such a way that outliers or departures from the assumptions do not affect them.
A large body of literature is now available [Rousseuw and Leroy (1987), Maronna, Martin, and Yohai
(2006), Hadi, Imon and Werner (2009)] for robust techniques that are readily applicable in linear
regression or in time series.
L – Estimator
A first step toward a more robust time series estimator was the consideration of least absolute values
estimator (often referred to as L – estimator). In the OLS method, outliers may have a very large
influence since the estimated parameters are estimated by minimizing the sum of squared residuals
n
u
t 1
2
t
L estimates are then considered to be less sensitive since they are determined by minimizing the sum of
absolute residuals
n
| u
t 1
t
|
The L estimator was first introduced by Edgeworth in 1887 who argued that the OLS method is over
influenced by outliers, but because of computational difficulties it was not popular and not much used
17
until quite recently. Sometimes we consider the L – estimator as a special case of L p -norm estimator
in the literature where the estimators are obtained by minimizing
n
| u
t 1
t
|p
The L1 -norm estimator is the OLS, while the L2 - norm estimator is the L – estimator. But
unfortunately a single erroneous observation (high leverage point) can still totally offset the Lestimator.
Least Median of Squares
Rousseeuw (1984) proposed Least Median of Squares (LMS) method which is a fitting technique less
sensitive to outliers than the OLS. In OLS, we estimate parameters by
n
minimizing the sum of squared residuals
u
t 1
2
t
Which is obviously the same if we
Minimize the mean of squared residuals
1 n 2
 ut .
n t 1
Sample means are sensitive to outliers, but medians are not. Hence to make it less sensitive we can
replace the mean by a median to obtain median sum of squared residuals
2
MSR ( ˆ ) = Median { uˆ t }
(2.20)
Then the LMS estimate of  is the value that minimizes MSR ( ˆ ). Rousseeuw and Leroy (1987)
have shown that LMS estimates are very robust with respect to outliers and have the highest possible
50% breakdown point.
18
Least Trimmed Squares
The least trimmed (sum of) squares (LTS) estimator is proposed by Rousseeuw (1984). In this method
we try to estimate  in such a way that
LTS ( ˆ ) = minimize
h
 uˆ  
t 1
2
(2.21)
t
Here uˆt  is the t-th ordered residual. For a trimming percentage of  , Rousseeuw and Leroy (1987)
suggested choosing the number of observations h based on which the model is fitted as h = [n (1 –  )]
+ 1. The advantage of using LTS over LMS is that, in the LMS we always fit the regression line based
on roughly 50% of the data, but in the LTS we can control the level of trimming. When we suspect that
the data contains nearly 10% outliers, the LTS with 10% trimming will certainly produce better result
than the LMS. We can increase the level of trimming if we suspect there are more outliers in the data.
M – Estimator
Huber (1973) generalized the estimation of parameters by considering a class of estimators, which
chooses   to
n
n

Minimize   i   Minimize   yi  xi 


i 1
i 1
T

(2.22)
Where   is a symmetric function less sensitive to outliers than squares. An estimator of this type is
called an M – estimator, where M stands for maximum likelihood. It is easy to see from (2.22) that the
function  is related to the likelihood function for an appropriate choice of an error distribution. For
example, if the error distribution is normal, then  z   z 2 / 2 , –∞ < z < ∞, which also yields the OLS
estimator the M – estimator obtained from (2.22) is not scale invariant. To obtain a scale invariant
version of this estimator we solve
n
n
 y  x T 
 
Minimize    i   Minimize    i i 



 
i 1
i 1


19
(2.23)
In most of the practical applications, the value of  is unknown and it is usually estimated before
solving the equation (2.22). A popular choice of  is
~ = MAD (normalized).
(2.23)
To minimize (2.22), we have to equate the first partial derivatives of  w.r.t.  for j = 0, 1, …, p to
zero, yielding a necessary condition for a minimum. This gives a system of k = p + 1 equations
 yi  xi  
 0,

(2.24)
xij =

~ j =0, 1, …, p


i 1


where     . In general the  function is nonlinear and (2.24) must be solved by iterative methods.
T
n
S – Estimator
Rouusseeuw and Yohai (1984) suggested another class of high breakdown estimator based on the
minimization of the dispersion of the residuals:
ˆ 1  , 
ˆ 2  , ..., 
ˆ n  
s 
The above dispersion is defined as the solution of
K is often put equal to E  
1 n
(2.25)
  ˆ i / s   K
n i 1
where  is the standard normal. The function  must satisfy the
following conditions:  is symmetric and continuously differentiable and  (0) = 0. The estimator
thus obtained is called an S – estimator because it is derived from a scale statistic in an implicit way. In
fact s given in the above estimating equation is an M – estimator of scale.
MM – estimator
The MM – estimator was originally proposed by Yohai (1987). The objective was to produce a high
breakdown point estimator that maintained good efficiency. The MM – estimator has three stages:

The initial estimate is an S – estimate, so it has a high breakdown point.

The second stage computed an M – estimate of the error standard deviation using the residuals
from the initial S – estimate.
20

The last step is an M – estimate of the parameters using a hard redescending weight function to
put a very small (often zero) weight to sufficiently large residuals.
In an extensive performance evaluation of several robust regression estimators, Simpson and
Montgomery (1998) report that MM –estimators have high efficiency and work well in most outlier
scenarios.
2.4
Bootstrap
According to Schucany (1988) in sample reuse technique ‘a sample is used to evaluate a statistic and
then may be used again to assess or improve the performance of the basic statistical procedure.‘ This
implies that this concept is a subjective, post data concept. There is a variety of sample reuse or
resampling methods in the literature which have been in use in a regression context. Among them
bootstrap technique proposed by Efron (1979) has become extremely popular with the statisticians.
In this procedure one can create a huge number of sub-samples from a pre-observed data set by a
simple random sampling with replacement. These sub-samples could be later used to investigate the
nature of the population without having any assumption about the population itself.
There are several forms of the bootstrap, and additionally, several other resampling methods that
are related to it, such as jackknifing, cross-validation, randomization tests, and permutation tests.
Suppose that we draw a sample S={x1, x2,...,xn } from a population P={x1, x2,...,xn } imagine
further at least for the time being, that N is very much larger than n, and that S is either a simple
random or an independent random sample, from P. Now suppose that we are interested in some
statistic T = t(S) as an estimate of the corresponding population parameter θ = t(P). Again, θ
could be a vector of parameters and T the corresponding vector of estimates, but for simplicity
assumes that θ is scalar. A traditional approach to statistical inference is to make assumptions
about the structure of the population (e.g., an assumption of normality), and along with the
21
stipulation of random sampling, to use these assumptions to derive the sampling distribution of
T, on which classical inference is based. In certain instances, the exact distribution of T may be
intractable and so we instead derive its asymptotic distribution. This familiar approach has two
potentially important deficiencies:
1. If the assumptions about the population are wrong, then the corresponding sampling
distribution of the statistic may be seriously inaccurate. On the other hand, if asymptotic
results are relied upon these may not hold to the required level of accuracy in a relatively
small sample.
2. The approach requires sufficient mathematical process to derive the sampling distribution
of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.
In contrast, the nonparametric bootstrap allows us to estimate the sampling distribution of an
empirically statistic without making assumptions about the form of the population, without
deriving the sampling distribution explicitly. The essential idea of the nonparametric bootstrap is
as follows:
We proceed to draw a sample of size n from among the elements of S. sampling with
*
*,
, x12
..., x1*n } . It is necessary to sample
replacement. Call the resulting bootstrap sample S1*  {x11
with replacement, because we would otherwise simply reproduce the sample S. In effect, we are
treating the S as an estimate of the population P; that is, each element Xi of S is selected for the
bootstrap sample with probability 1/n, mimicking the original selection of the sample S from the
population P. We repeat this procedure a large number of times, B selecting many bootstrap
*
} .The basic hypothesis is
samples; the bth such bootstrap sample is denoted S b*  {xb*1 , xb*2 ,..., xbn
this: representative and sufficient resample from the original sample would contain information
about original population as the original sample does represent the population.
22
REAL WORLD
Unknown
Probability
Distribution
F
BOOTSTRAP WORLD
EmpiricalT *
b
Distribution
Observed
Random Sample
Fˆ
S={x1,x2,...,x n}
T= t (s)
Statistic of Interest
*
Random
 t ( SObserved
b)
Sample
*
Sb*  {xb*1, xb*2 ,..., xbn
}
Bootstrap Replication
Assumed
SIMULATION WORLD
Known
Probability
Distribution
F
Observed Random
Sample
S={x1,x2,...,xn}
T= t(s)
Statistic of Interest
Figure 2.1: A schematic diagram of bootstrap and simulation [Efron and Tibshirani (1993)].
23
In real world, an unknown distribution F has given the observed data S= (x1, x2,...,xn) by
random sampling. We calculate a statistic of interest T = t(S). In the bootstrap world, the
*
empirical distribution Fˆ gives bootstrap samples Sb*  {xb*1 , xb*2 ,..., xbn
} by random sampling from
which we calculate bootstrap replications of the statistic of interest Tb  t ( Sb ) . The big
*
*
advantage of the bootstrap world is that we can calculate as many replications of Tb* as we want
or at least as many as we can afford. Next, we compute the statistic T for each of the bootstrap
samples; that is Tb  t ( S b ) . Then the distribution of Tb* around the original estimate T is
*
*
analogous to the sampling distribution of the estimator T around the population parameter θ. For
B
example the average of the bootstrapped statistics, T *  Eˆ * (T * ) 
average of the bootstrap statistics; then
Bia sˆ*  T *  T is
T
b 1
B
*
b
. Estimate of the
an estimate of the bias of T.
*
Similarly, the estimated bootstrap variance of T, Vˆ * (T * )   (Tb  T * ) 2 /( B  1) , that estimates
the sampling variance of T. The random selection of bootstrap samples is not an essential aspect
of the nonparametric bootstrap. At least in principle, we could enumerate all bootstrap samples
of size n. Then we could calculate E * (T * ) and V * (T * ) exactly, rather than having to estimate
them. The number of bootstrap samples, however, is astronomically large unless n is tiny. There
are, therefore two sources of error in bootstrap inference: (a) the error induced by using a
particular sample S to represent the population; (b) the sampling error produce by falling to
estimate all bootstrap samples. Making the number of bootstrap replication B sufficiently large
can control the latter sources of error.
24
How large should we take B, the number of bootstrap replications uses to evaluate
different estimate. The possible bootstrap replications is B = nn. To estimate the standard error
we usually use the bootstrap replications between 25 to250. But for other estimate such as
confidence interval or regression estimate B is much bigger. We may increase B if T= t (S) a
very complicated function of X to increasing variability. Bootstrap replications are depend on the
value of X, if n ≤ 100 we generally replicate B ≤ 10000.
Here are two rule of thumb gathered from Efron and Tibshirani's experience

Even a small number of bootstrap replications, say B = 25, is usually informative. B = 50
is often enough to give a good estimate of standard error.

Very seldom are more than B = 200 replications needed for estimating a standard error
For estimating a standard error, the number B will ordinarily be in the range 25-2000. Much
bigger values of B are required for bootstrap confidence interval, cross-validation, randomization
tests and permutation test. Some other suggestions are also available in the literature [see Efron
(1987), Hall (1992), Booth and Sarker (1998)] about the number of replications needed in
bootstrap.
2.5 Stochastic Time Series Modeling
An excellent review of different aspects of stochastic time series modelling is available in
Pyndick and Rubenfield (1998), Bowerman et al. (2005) and Imon (2015). We assume that the
time series models have been generated by a stochastic process. In other words, we assume that
each value y1 , y 2 , …, yT in the series is randomly drawn from a probability distribution. We
could assume that the observed series y1 , y 2 , …, yT is drawn from a set of jointly distributed
random variables. If we could specify the probability distribution function of our series, we could
25
determine the probability of one or another future outcome. Unfortunately, the complete
specification of the probability distribution function for a time series is usually impossible.
However, it usually is possible to construct a simplified model of the time series which explains
its randomness in a manner that is useful for forecasting purpose.
2.5.1 The Box-Jenkins Methodology
The Box-Jenkins methodology consists of a four-step iterative procedure.
Step 1: Tentative Identification: Historical data are used to tentatively identify an appropriate
Box-Jenkins model.
Step 2: Estimation: Historical data are used to estimate the parameters of tentatively identified
model.
Step 3: Diagnostic Checking: Various diagnostics are used to check the adequacy of the
tentatively identified model and, if need be, to suggest an improved model, which is then
regarded as a new tentatively identified model.
Step 4: Forecasting: Once a final model is obtained, it is used to forecast future time series
values.
2.5.2 Stationary and Nonstationary Time Series
It is important to know whether the stochastic process that generates the series can be assumed to
be invariant with respect to time. If the characteristic of the stochastic process changes over time
we call the process nonstationary. If the process is nonstationary, it will often be difficult to
represent the time series over past and future intervals of time by a simple algebraic model. By
contrast, if the process is stationary, one can model the process via an equation with fixed
coefficients that can be estimated from past data.
26
Properties of Stationary Process
We have said that any stochastic time series y1 , y 2 , …, yT can be thought of as having been
generated by a set of jointly distributed random variables; i.e., the set of data points y1 , y 2 , …,
yT represents a particular outcome (also known as a realization) of the joint probability
distribution function p( y1 , y 2 , …, yT ). Similarly, a future observation yT 1 can be thought of
as being generated by a conditional probability distribution function
p( yT 1 | y1 , y 2 , …, yT )
(2.26)
that is, a probability distribution for yT 1 given the past observations y1 , y 2 , …, yT . We define
a stationary process, then, as one whose joint distribution and conditional distribution both are
invariant with respect to displacement in time. In other words, if the series is stationary, then
p( y t , y t 1 , …, y t  k ) = p( yt  m , yt m1 , …, yt  m k )
and
p( y t ) = p( yt  m )
(2.27)
for any t, k, and m.
If the series y t is stationary, the mean of the series, which is defined as
 y = E( y t )
(2.28)
must also be stationary, so that E( y t ) = E( yt  m ), for any t and m. Furthermore, the variance of
the series
 y 2 = E[( y t –  y )] 2
(2.29)
must be stationary, so that
E[( y t –  y )] 2 = E[( yt  m –  y )] 2 .
27
(2.30)
Finally, for any lag k, the covariance of the series
 k = Cov ( y t , y t  k ) = E[( y t –  y ) ( y t  k –  y )]
(2.31)
must be stationary, so that Cov ( y t , y t  k ) = Cov ( yt  m , yt  m k ).
If a stochastic process is stationary, the probability distribution p( y t ) is the same for all
time t and its shape can be inferred by looking at the histogram of the observations y1 , y 2 , …,
yT . An estimate of the mean  y can be obtained from the sample mean
T
y =
y
t 1
t
/T
(2.32)
And an estimate of the variance  y can be obtained from the sample variance
2
ˆ y 2 =
T
 y
t 1
 y  / T.
2
t
(2.33)
Usually it is very difficult to get a complete description of a stochastic process. The
autocorrelation function could be extremely useful because it provides a partial description of the
process for modeling purposes. The autocorrelation function tells us how much correlation there
is between neighboring data points in the series y t . We define the autocorrelation with lag k as
k 
Cov y t , y t  k 
V  yt V  yt k 
(2.34)
For a stationary time series the variance at time t is the same as the variance at time t + k; thus,
the autocorrelation becomes
k 
k
k

0
 y2
and thus  0 = 1 for any stochastic process.
Suppose the stochastic process is simply
28
(2.35)
y t  t
where t is an independently distributed random variable with zero mean. Then it is easy to
show that for this process,  0 = 1 and  k = 0 for k > 0. This particular process is known as
white noise, and there is no model that can provide a forecast any better than yˆ T l = 0 for all l.
Thus, if the autocorrelation function is zero (or close to zero) for all k > 0, there is little or no
value in using a model to forecast the series.
2.5.3 Test for Significance of a White Noise Autocorrelation Function
In practice, we use an estimate of the autocorrelation function, called the sample autocorrelation
(SAC) function
T k
rk 
 y
t 1
t
 y  y t  k  y 
T
 y
t 1
t
 y
(2.36)
2
It is easy to see from their definitions that both the theoretical and the estimated autocorrelation
functions are symmetrical; i.e.,  k =   k and rk = r k .
Bartlett’s test
Here the null hypothesis is H 0 :  k = 0 for k > 0. Bartlett shows that when the time series is
generated by a white noise process, the sample autocorrelation function is distributed
approximately as a normal with mean 0 and variance 1/T. Hence, the test statistic is
|z| =
T | rk |
(2.37)
and we reject the null hypothesis at the 95% level of significance, if |z| is greater than 1.96.
29
The t-test based on SAC
The standard error of rk is given by

1/ T

k 1
SE rk    
2
1

2
ri  / T



i 1

 
if k  1
if k  1
(2.38)
The t-statistic for testing the hypothesis H 0 :  k = 0 for k > 0 is defined as
T = rk /SE( rk )
(2.39)
and this test is significant when |T| > 2.
Box and Pierce Test and Ljung and Box Test
To test the joint hypothesis that all the autocorrelation coefficients are zero we use a test statistic
introduced by Box and Pierce. Here the null hypothesis is
H 0 : 1 =  2 = … =  k = 0.
Box and Pierce show that the appropriate statistic for testing this null hypothesis is
k
Q=T
r
i 1
2
(2.40)
i
is distributed as chi-square with k degrees of freedom.
A slight modification of the Box-Pierce test was suggested by Ljuang and Box, which is
known as the Ljuang-Box Q (LBQ) test defined as
k
Q  T (T  2) (T  k ) 1 ri
i 1
2
(2.41)
Thus, if the calculated value of Q is greater than, say, the critical 5% level, we can be 95% sure
that the true autocorrelation coefficients are not all zero.
30
Stationarity and the Autocorrelation Function
How can we decide that whether a series is stationary or determine the appropriate number of
times a homogenous nonstationary series should be differenced to arrive at a stationary series?
The correlogram (a plot of autocorrelation coefficients against the number of lag periods) could
be a useful indicator of it. For a stationary series, the autocorrelation function drops off as k
becomes large, but this usually is not the case for a nonstationary series. In order to employ the
Box-Jenkins methodology, we must examine the behavior of the SAC. The SAC for a
nonseasonal time series can display a variety of behaviors. First, the SAC for a nonseasonal time
series can cut off at lag k. We say that a spike at lag k exists in the SAC if the SAC at lag k is
statistically significant. Second, we say that the SAC dies down if this function does not cut-off
but rather decreases in a steady fashion. In general, it can be said that
1. If the SAC of the time series values either cuts off fairly quickly or dies down fair
quickly, then the time series values should be considered stationary.
2. If the SAC of the time series values dies down extremely slowly, then the time series
values should be considered nonstationary.
2.5.4 ARIMA Models
Moving Average Models
In the moving average process of order q each observation y t is generated by a weighted average
of random disturbances going back to q periods. We denote this process as MA(q) and write its
equation as
y t =   t 1 t 1  2 t 2 ...   q t q
31
(2.42)
In the moving average model the random disturbances are assumed to be independently
distributed across time, i.e., generated by a white noise process. In particular, each t is
assumed to be normal random variable with mean 0 and variance   , and covariance  k = 0 for
2
k  0. We have E( y t ) =  which shows that a moving average process is independent of time.
The process MA(q) is described by exactly q + 2 parameters, the mean  , the disturbance  
2
and the parameters 1 ,  2 ,...,  q .
Let us now look at the variance of the moving average process of order q.

2
 0 = E  yt   

(2.43)
= E( t 1 t 1  2 t 2 ...   q t q 21 t t 1 y t 1 – …)
2
2
2
2
2
2
2
2
=   (1 + 1   2  ...   q )
2
2
2
The moving average process of order q has autocorrelation function
   k  1 k 1   2 k  2 ...   q  k  q

2
2
2
k = 
1  1   2  ...   q
 0

k  1,2,..., q
kq
(2.44)
Autoregressive Models
In the autoregressive process of order p the current observation y t is generated by a weighted
average of past observations going back p periods, together with a random disturbance in the
current period. We denote this process as AR(p) and write its equation as
y t =   1 yt 1  2 yt 2  ...   p yt  p  t
(2.45)
Properties of Autoregressive Models
If the autoregressive process is stationary, then its mean, which we denote by  , must be
invariant with respect to time; i.e., E( y t ) = E( y t 1 ) = E( y t  p ) =  . The mean  is thus given by
32
 =   1   2   ...   p  or  =

If the process is stationary, then 
1  1   2  ...   p
must be finite. For this, it is necessary that 1  2  ...   p < 1.
For the first-order process AR(1): y t =   1 yt 1  t ,
k =
k
= 1 k
0
(2.46)
In general, for k > 1,
 k = E  yt k 1 yt 1  2 yt 2  t  = 1  k 1 +  2  k 2
(2.47)
Thus the autocorrelation function is given by
1 =
1
1  2
(2.48)
 k = 1  k 1 +  2  k 2 for k > 1
The Partial Autocorrelation Function
The partial autocorrelation function is used to determine the order of an autoregressive process.
For an autoregressive process of order p, the covariance with displacement k is determined from
 k = E yt k 1 yt 1  2 yt 2  ...   p yt  p  t 
(2.49)
which gives
 0 = 1  1 +  2  2 + … +  p  p +   2
 1 = 1  0 +  2  1 + … +  p  p 1
……………………………………
 p = 1  p 1 +  2  p  2 + … +  p  0
(2.50)
The above equations also give a set of p equations, known as Yule-Walker equations, to
determine the first p values of the autocorrelation functions:
33
1 = 1 +  2 1 +  p  p 1
………………………………
 p = 1  p 1 +  2  p 2 + … +  p
(2.51)
The solution of the Yule-Walker equations requires the knowledge of p. Therefore we solve
these equations for successive values of p. We begin by hypothesizing that p = 1. We compute
the sample autocorrelation ˆ1 as an estimate of 1 . If this value is significantly different from 0,
we know that the autoregressive process is at least order 1. Next we consider the hypothesis that
p = 2. We solve the Yule-Walker equations for p = 2 and obtain a new set of estimates for 1
and  2 . If  2 is significantly different from 0, we may conclude that the process is at least order
2. Otherwise we conclude that the process is order 1. We repeat this process for successive
values of p. We call the series 1 ,  2 , …, partial autocorrelation function. If the true order of the
process is p, we should observe that ˆ j  0 for j > p. To test whether a particular  j is zero, we
can use the fact that it is approximately normally distributed with mean 0 and variance 1 / T.
Hence we can check whether it is statistically significant at, say, the 5% level by determining
whether it exceeds 2 /
T in magnitude.
Mixed Autoregressive – Moving Average (ARMA) Models
Many stationary random processes cannot be modeled as purely moving average or as purely
autoregressive, since they have quality of both types of processes. In this situation we can use the
mixed autoregressive – moving average (ARMA) process. An ARMA process of order (p, q) is
defined as
y t =   1 yt 1  2 yt 2  ...   p yt  p  t 1 t 1  2 t 2 ...   q t q
34
(2.52)
We assume that the process is stationary, then its mean, which we denote by  , must be
invariant with respect to time and is given by  =   1   2   ...   p  or
 =

For the general ARMA(p, q) process it is not easy to obtain the
1  1   2  ...   p .
variances, covariances and autocorrelations by solving equations. It can be shown easily,
however, that
 k = 1  k 1 +  2  k 2 + … +  p  k  p , k > q
(2.53)
It is interesting to note that q is the memory of the moving average part of the process, so that for
k > q, the autocorrelation function exhibits the properties of a purely autoregressive process.
The above equations also give a set of p equations, known as Yule-Walker equations, to
determine the first p values of the autocorrelation functions:
1 = 1 +  2 1 +  p  p 1
………………………………
 p = 1  p 1 +  2  p 2 + … +  p
(2.54)
The solution of the Yule-Walker equations requires the knowledge of p. Therefore we solve
these equations for successive values of p. We begin by hypothesizing that p = 1. We compute
the sample autocorrelation ˆ1 as an estimate of 1 . If this value is significantly different from 0,
we know that the autoregressive process is at least order 1. Next we consider the hypothesis that
p = 2. We solve the Yule-Walker equations for p = 2 and obtain a new set of estimates for 1
and  2 . If  2 is significantly different from 0, we may conclude that the process is at least order
2. Otherwise we conclude that the process is order 1. We repeat this process for successive
35
values of p. We call the series 1 ,  2 , …, partial autocorrelation function. If the true order of the
process is p, we should observe that ˆ j  0 for j > p.
To test whether a particular  j is zero, we can use the fact that it is approximately
normally distributed with mean 0 and variance 1 / T. Hence we can check whether it is
statistically significant at, say, the 5% level by determining whether it exceeds 2 /
T in
magnitude.
Homogenous Nonstationary Processes: ARIMA Models
Probably very few of the time series one meets in practice are stationary. Fortunately, however,
many of the nonstationary time series that are encountered have the desirable property that they
are differentiated one or more times, the resulting series will be stationary. Such a nonstationary
series is termed homogenous. The number of times the original series must be differenced before
a stationary series results in is called the order of homogeneity. Thus, if y t is first order
homogenous nonstationary, the series wt = y t – y t 1 = y t is stationary. Here we should
construct models for those nonstationary series, which can be transformed into stationary series
by differencing them one or more times. We say that y t is homogenous nonstationary of order d
if
wt  d yt
(2.55)
is a stationary series. If wt  d yt and wt is an ARMA(p, q) process, then we say that y t is an
integrated autoregressive moving average process of order p, d and q, or simply ARIMA(p, d, q).
36
2.5.5 Estimation and Specification of ARIMA Models
It is often convenient to describe time lags by using the backward shift operator B. The operator
B imposes a one-period time lag each time it is applied to a variable. Thus
B t = t 1 , B 2 t = t  2 , …, B n t = t  n
(2.56)
Using this operator, we can write an MA(q) process as:


y t =   t 1  1 B   2 B 2  ...   q B q =   t  (B)
(2.57)
In a similar way, the AR(p) process can be rewritten as


yt 1  1 B  2 B 2  ...   p B p =   t
(2.58)
Finally, an ARMA(p, q) process can be reexpressed as



yt 1  1 B  2 B 2  ...   p B p =   t 1  1 B   2 B 2  ...   q B q

(2.59)
It is easy to show that any homogenous nonstationary process can be modeled as an ARIMA
process. We can write the equation for an ARIMA(p, d, q) as:
 B d y t =    B t
(2.60)
 B  y t =    B t
(2.61)
If d = 0, we obtain
which is an ARMA(p, q). When q = 0, i.e., when wt  d yt is just AR(p), we call y t an
integrated autoregressive process of order (p, d) and denote it as ARIMA(p, d, 0) or ARI(p, d, 0).
When p = 0, i.e., when wt is just MA(q), we call y t an integrated moving average process of
order (d, q) and denote it as ARIMA(0, d, q) or IMA(0, d, q). In practice, it is crucial to specify
the ARIMA model, i.e., to choose the most appropriate values for p, d and q.
37
Given a series y t , the first problem is to determine the degree of homogeneity d. To do
this one first examines the autocorrelation function of the original series y t and determine
whether it is stationary. If it is not, difference the series and examine the autocorrelation function
for y t . Repeat this process until a value of d is reached such that d yt is stationary, i.e., the
autocorrelation function goes to 0 as k becomes large. After d is determined, one can work with
the stationary series wt  d yt and examine both its autocorrelation and partial autocorrelation
function to determine possible specifications for p and q. The lower order processes like AR(1),
AR(2), MA(1), MA(2), ARMA(1, 1) etc are easy to recognize
Table 2.1: Specification of ARIMA Models
Model
ACF
PACF
White noise
All zero
All zero
MA(1)
Zero after 1 lag
Declining from 1st lag
MA(2)
Zero after 2 lags
Declining from 2nd lag
MA(q)
Zero after q lags
Declining from qth lag
AR(1)
Geometric decline from 1 lag
Zero after 1 lag
AR(2)
Geometric decline from 2 lags
Zero after 2 lags
AR(p)
Geometric decline from pth lag
Zero after p-lags
ARMA(1,1)
Geometric decline from 1 lag
Declining from first lag
ARMA(p,q)
Geometric decline from pth lag
Declining from qth lag
38
The autoregressive and moving average of order p and q respectively are determined by
using the partial autocorrelation function (PACF) and ACF respectively. Table 2.1 summarizes
the characteristics of the ARIMA(p,0,q) or ARMA(p,q) model.
2.5.6 Diagnostic Checking
After a time series model has been estimated, one must test whether the specification was
correct. We assume that the random errors t in the actual process are normally distributed and
ˆ t should resemble a white
independent. Then, if the model is specified correctly, the residuals 
noise process. Consequently a sample autocorrelation function of the residuals
rˆk 
ˆ ˆ
t k
t
t 1
2
ˆ t
(2.62)
t 1
would be close to 0 for k > 0. We can use the Box and Pierce Test for this purpose. Consider the
statistic Q composed of the first K residual autocorrelations
K
Q=T
 rˆ
k 1
2
k
(2.63)
which is distributed as chi-square with K – p – q degrees of freedom.
2.6 Cross Validation in Time Series Models
Cross-validation is a technique for assessing how the results of a statistical analysis will
generalize to an independent data set. It is mainly used in settings where the goal is prediction,
and one wants to estimate how accurately a predictive model will perform in practice. One round
of cross-validation involves partitioning a sample of data into complementary subsets,
performing the analysis on one subset (called the training set), and validating the analysis on the
39
other subset (called the validation set or testing set). An excellent review of different type of
cross validation techniques is available in Izenman (2008). Picard and Cook (1984) developed all
basic fundamentals of applying cross validation technique in regression and time series.
According to Montgomery et al. (2008b), three types of procedures are useful for
validating a time series model.
(i)
Analysis of the model coefficients and predicted values including comparisons with prior
experience, physical theory, and other analytical models or simulation results,
(ii)
Collection of new data with which to investigate the model’s predictive performance,
(iii) Data splitting, that is, setting aside some of the original data and using these observations to
investigate the model’s predictive performance. Since we have a large number of data set, we
prefer the data splitting technique for cross-validation of the fitted model.
In randomly missing data, generally three performance indicators; say, mean absolute error
(MAE), root mean square error (RMSE) and estimated bias (EB) are considered to examine the
accuracy of theses imputation methods. In order to select the best method for estimation missing
values, the predicted and observed data were compared. The mean absolute error is the average
difference between predicted and actual data values, and is given by
MAE 
1
N
N
 P O
i 1
i
i
(2.64)
where N is the number of imputations, Pi and Oi are the imputed and observed data points,
respectively. MAE varies from 0 to infinity and perfect fit is obtained when MAE = 0.
The root mean squared error is one of the most commonly used measure and it is computed by
RMSE 
1 N
Pi  Oi 2

N i 1
40
(2.65)
The smaller is the RMSE value, the better is the performance of the model.
The estimated bias is the absolute difference between the observed and the estimated value of the
respective parameters and defined as
EB  Oi  Ei
(2.66)
where Ei is the estimated value of the parameter that obtained from the imputation methods.
For a regular time series three measures of accuracy of the fitted model: MAPE, MAD, and MSD
for each of the simple forecasting and smoothing methods. For all three measures, the smaller the
value, the better the fit of the model. Use these statistics to compare the fits of the different
methods.
MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values.
It expresses accuracy as a percentage.
MAPE =
|  y
t
 yˆt  / yt |
T
 100
(2.67)
where yt equals the actual value, yˆ t equals the fitted value, and T equals the number of
observations.
MAD (Mean), which stands for Mean Absolute Deviation, measures the accuracy of fitted time
series values. It expresses accuracy in the same units as the data, which helps conceptualize the
amount of error.
MAD (Mean) =
| y
t
T
 yˆt |
(2.68)
where yt equals the actual value, yˆ t equals the fitted value, and T equals the number of
observations.
41
MSD stands for Mean Squared Deviation. MSD is always computed using the same
denominator, T, regardless of the model, so you can compare MSD values across models. MSD
is a more sensitive measure of an unusually large forecast error than MAD.
MSD =
y
 yˆt 
2
t
T
(2.69)
where yt equals the actual value, yˆ t equals the fitted value, and T equals the number of
observations.
In order to find out the best prediction model we usually leave out say, l observations aside as
holdback period. The size of l is usually 10% to 20% of the original data. Suppose that we
tentatively select two models namely, A and B. We fit both the models using (T – l) set of
observations. Then we compute
MSPE A 
1 l
2
e Ai

l t 1
(2.70)
for model A and
MSPE B 
1 l
2
e Bi

l t 1
(2.71)
for model B. Several methods have been devised to determine whether one MSPE is statistically
different from the other. One such popular method of testing is the F-test approach, where Fstatistic is constructed as a ratio between the two MSPEs keeping the larger MSPE in the
numerator of the F-statistic. If the MSPE for model A is larger, this statistic takes the form:
F
MSPE A
MSPE B
(2.72)
This statistic follows an F distribution with (l , l) degrees of freedom under the null hypothesis of
equal forecasting performance. If the F-test is significant we will choose model B for this data
42
otherwise, we would conclude that there is a little bit difference in choosing between these two
models.
2.7 Computation
We have used a number of modern and sophisticated statistical software such as R, S-Plus and
Minitab for computational purposes.
43
CHAPTER 3
ESTIMATION OF MISSING VALUES BY ROBUST EM AND
BOOTSTRAP ALGORITHM FOR SAUDI ARABIA ROAD
ACCIDENT DATA
In this section our main objective is to estimate missing values of Saudi Arabia road accident
data because if any data is missing we cannot fit any ARIMA model to the data. It is now evident
that outliers may have an adverse effect [see Mamun (2014)] on the estimation of missing values
and also in the determination of the ARIMA [see Imon et al. (2007)]. So we would like to apply
a variety of robust and nonparametric approach of missing value estimation. We also investigate
the accuracy of the estimated missing values by cross validation.
3.1 Estimation of Missing Values
We have observed in section 1.2 that the total number of road accidents for the entire Saudi
Arabia is missing in the year 2003 whereas all regions have missing values for 2003 and 2004. In
section 2.3 we discussed different methods of estimating missing values. Here is the list of
methods we apply to estimate the missing values.

Mean Imputation

Median Imputation

Linear Trend Model

Quadratic Trend Model

Exponential Trend Model

Centered Moving Average
44

EM-OLS

EM-LTS

EM-LMS

EM-M

EM-MM

BOOT-OLS

BOOT-LTS

BOOT-LMS

BOOT-M

BOOT-MM
Results of this experiment are presented in Tables 3.1 to 3.6. Table 3.1 gives some traditional
measures of missing value estimation for the year 2013. Values of the same for the year 2004 are
presented in Table 3.2. Tables 3.3 and 3.4 give EM algorithmic vales for all robust time series
methods including the OLS for the year 2003 and 2004 respectively. The same values using the
bootstrap technique are presented in Tables 3.5 and 3.6. It is worth mentioning that each of
bootstrap result is based on 10000 replications.
Table 3.1: Estimates of Missing Values for the Year 2003 Using Conventional Methods
Region
Mean
Median
LT
QT
ET
CMA
Riyadh
90629
84313
77997.1
54242.7
74376.6
85350
Mecca
97194
95649
94098.9
89945.5
94004.2
93305.5
E Province
83063
71731
69690.4
62433.5
68330.4
66762
Medina
12227
19972
6698.81
7915.52
2456.14
836.5
45
Al- Qassim
12503
12901
9908.26
9360.26
9527.53
8158.5
Tabuk
6933
5974
4654.92
2982.60
4373.41
3546.5
Asir
18869
19024
16060.5
16545.9
15411.4
15733.5
Albahih
3025
3208
2287.03
2239.00
2099.98
1917
N Borders
5697
5668
4668.42
4074.99
4573.35
4180.50
Al Jouf
6461
7038
5462.06
6059.46
5202.04
4809.50
Ha’il
5139
5714
4622.64
4837.18
4461.22
4929
Najran
2630
2412
2272.64
1625.21
2233.72
2280.50
Jizan
3715
3257
2207.34
1687.86
1752.07
1216
Total
348085
336861
300629
263949.7
288802.1
293025
Table 3.2: Estimates of Missing Values for the Year 2004 Using Conventional Methods
Region
Mean
Median
LT
QT
ET
CMA
Riyadh
90629
84313
83410.8
51738.2
77821.5
86387
Mecca
97194
95649
95425.6
89887.7
95231.9
97091
E Province
83063
71731
75421.4
65745.6
73032.6
69143
Medina
12227
19972
9067.85
10690.1
3675.92
973
Al- Qassim
12503
12901
11020.1
10289.4
10470.7
8051
Tabuk
6933
5974
5631.33
3401.57
5027.94
4033
Asir
18869
19024
17264.1
17911.3
16558.6
18998
Albahih
3025
3208
2603.26
2539.22
2366.25
2118
N Borders
5697
5668
5109.33
4318.10
4933.70
4187
Al Jouf
6461
7038
5890.32
6686.85
5610.72
5371
46
Ha’il
5139
5714
4843.79
5129.84
4690.08
6172
Najran
2630
2412
2425.99
1562.74
2341.11
2149
Jizan
3715
3257
2853.58
2160.93
2175.30
976
Total
348085
336861
320967.5
272061.6
303936.3
305649
Table 3.3: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods
Region
OLS
LTS
LMS
M
MM
Riyadh
74393
97492
96314
11537
69292
Mecca
94509
26556
71565
94994
99090
E Province
70581
70443
6989
17501
17856
Medina
6649
7792
10361
6557
6589
Al- Qassim
9994
11090
4503
9924
9939
Tabuk
4557
4532
18551
4521
4438
Asir
16442
18597
2426
16509
16482
Albahih
2406
2470
4107
2380
2408
N Borders
4488
4103
5955
4524
4511
Al Jouf
55805
5860
4094
5544
5558
Ha’il
4764
4077
1941
4661
4657
Najran
2179
2052
2336
2188
0102
Jizan
14186
2385
2000
2108
2115
Total
360953
204337
231142
182948
243037
47
Table 3.4: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods
Region
OLS
LTS
LMS
M
MM
Riyadh
80153
104387
96506
11542
63168
Mecca
95786
53139
75616
96125
100466
E Province
76280
74089
9772
35005
35714
Medina
4026
10740
11520
9011
9016
Al- Qassim
11100
12249
4856
11030
11045
Tabuk
5546
4869
18811
5404
4921
Asir
17587
18856
2817
17658
17634
Albahih
2714
2848
4114
2708
2754
N Borders
4948
4100
6315
4932
4930
Al Jouf
5994
6237
4372
5947
5960
Ha’il
4967
4355
1830
4885
4878
Najran
2345
1939
2886
2351
2008
Jizan
28355
2836
2600
2616
2620
Total
339801
194366
246015
209214
265114
Table 3.5: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods
Region
OLS
LTS
LMS
M
MM
Riyadh
73290
97492
97188
18143
67396
Mecca
94470
94872
94911
94685
94687
E Province
70440
76843
67581
17344
17317
Medina
6328
6347
6008
6292
6299
48
Al- Qassim
9830
11066
10289
9775
9792
Tabuk
3924
4424
4113
3961
4008
Asir
15621
15986
15737
15081
15214
Albahih
2244
2430
2426
2297
22791
N Borders
4727
5213
4623
4821
4805
Al Jouf
5439
5398
5483
5411
5406
Ha’il
4960
4077
4094
4956
9801
Najran
2241
2141
2070
2237
2194
Jizan
14719
2470
2361
2201
2231
Total
308233
328759
316884
187204
261941
Table 3.6: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods
Region
OLS
LTS
LMS
M
MM
Riyadh
79345
104387
104083
36288
61543
Mecca
95774
95057
94763
96073
96014
E Province
76036
82467
70012
34691
34638
Medina
8738
9395
8916
8715
8750
Al- Qassim
10964
12225
11448
10909
10925
Tabuk
4984
4830
4691
4965
4753
Asir
16882
17408
17147
16420
16549
Albahih
2573
2821
2817
2632
2610
N Borders
5158
5512
4951
5182
5125
Al Jouf
5871
6035
6144
5826
5824
49
Ha’il
5140
4355
4372
5136
5002
Najran
2399
2014
1930
2395
2159
Jizan
29440
2968
2865
2735
2744
Total
343304
349474
334139
231967
256636
3.2 Evaluation of the Estimation of Missing Values
One of the most interesting features of the number of road accidents in Saudi Arabia data is that
although the total number of accidents in different regions for the year 2004 is missing, but we
know the total for the entire country. So we estimate the total number of accidents for different
regions using the different alternate methods and then add them together to estimate the total
number for the entire country. Once we get the estimate we can compare it with the actual values
what we definitely know. Here we use the mean average percentage error (MAPE) for a variety
of alternative methods and the results are presented in Table 3.7.
Table 3.7: A Comparison Between Exact and Estimated Values of the Total Number of Road
Accidents for 2004 in Saudi Arabia
Measures
Estimated Total
MAPE
(Exact value: 261874)
Mean
348085
32.92
Median
336861
28.64
LT
320967.5
22.57
QT
272061.6
3.90
ET
303936.3
16
50
CMA
305649
16.72
EM-OLS
339801
30
EM-LTS
194366
26
EM-LMS
239415
8.58
EM-M
209214
20
EM-MM
265114
1.24
BOOT-OLS
343304
31
BOOT-LTS
349474
33.45
BOOT-LMS
334139
27.60
BOOT-M
231967
11.42
BOOT-MM
256636
2
From the above results it is clear that the MM method estimates the missing value in the best
possible way. The MAPE values of MM are low for both EM and bootstrap. So we take the
values of EM-MM for the years 2003 and 2004 and the completed data are presented in Tables
3.8 and 3.9. The estimated values are presented in boldfaces.
Table 3.8: Total Number of Road Accidents in Saudi Arabia after the Missing Value Estimation
Year
Number of accidents
1999
264326
2000
267772
2001
280401
2002
305649
51
2003
263037
2004
261872
2005
293281
2006
296015
2007
283648
2008
435264
2009
485931
2010
484805
Table 3.9: Total Number of Road Accidents in Different Region of Saudi Arabia after the
Missing Value Estimation
Year
Riyadh
Mecca
E Province
Medina
Qassim
Tabuk
Asir
2000
80067
94669
56182
747
7213
3591
10242
2001
84313
89520
64381
700
8266
3060
12469
2002
86387
97091
69143
973
8051
4033
18998
2003
69292
99090
37501
6589
9939
4438
16482
2004
63168
100466
35714
9016
11045
4921
17634
2005
47803
95649
76435
13422
12901
5043
18230
2006
49019
93987
71731
14471
11687
5974
22108
2007
48142
86854
62033
18988
13370
6061
19024
2008
134915
96204
107780
21334
14768
7503
20556
2009
143466
107932
127460
20956
17645
11634
22929
2010
141549
112844
112420
18448
18622
15500
25263
52
Year Albahih
N_Borders
Al Jouf
Hail
Najran
Jizan
2000
1169
3999
3447
3112
2431
903
2001
1716
4174
4248
3686
2412
1456
2002
2118
4187
5371
6172
2149
976
2003
2408
4511
5558
4657
2102
2115
2004
2754
4930
5960
4878
2008
2620
2005
3293
4034
6889
4502
1823
3257
2006
3515
5668
7259
5835
1710
3051
2007
3208
6805
7170
5274
1963
4756
2008
2843
6160
9446
5884
2631
5240
2009
4245
6799
7284
5714
4215
5652
2010
5117
9449
7038
6069
4340
8146
Next we construct time series plot for all variables considered in this study with and without the
missing value estimates o get an impression how closely the EM-MM method estimate the
missing values.
53
Figure 3.1: Time Series Plot of Original and Estimated Road Accidents in Saudi Arabia
Figure 3.2: Time Series Plot of Original and Estimated Road Accidents in Riyadh
Figure 3.3: Time Series Plot of Original and Estimated Road Accidents in Mecca
54
Figure 3.4: Time Series Plot of Original and Estimated Road Accidents in Eastern Province
Figure 3.5: Time Series Plot of Original and Estimated Road Accidents in Medina
Figure 3.6: Time Series Plot of Original and Estimated Road Accidents in Al-Qassim
55
Figure 3.7: Time Series Plot of Original and Estimated Road Accidents in Tabuk
Figure 3.8: Time Series Plot of Original and Estimated Road Accidents in Asir
Figure 3.9: Time Series Plot of Original and Estimated Road Accidents in Albahih
56
Figure 3.10: Time Series Plot of Original and Estimated Road Accidents in Northern Borders
Figure 3.11: Time Series Plot of Original and Estimated Road Accidents in Al Jouf
Figure 3.12: Time Series Plot of Original and Estimated Road Accidents in Ha’il
57
Figure 3.13: Time Series Plot of Original and Estimated Road Accidents in Najaran
Figure 3.14: Time Series Plot of Original and Estimated Road Accidents in Jizan
Figures 3.1 – 3.14 clearly show that the estimation of missing values using EM-MM are very
consistent and perfectly match with the rest of the data.
58
CHAPTER 4
SELECTION OF ARIMA MODELS FOR SAUDI ARABIA ROAD
ACCIDENT DATA
In this chapter our main objective is to find appropriate ARIMA models for the total number of
car accidents in the entire Saudi Arabia and it’s all regions. At first we look at the autocorrelation
function (ACF) and partial autocorrelation functions (PACF) and their corresponding t value and
the Ljung-Box test to determine the appropriate ARIMA model. But we know that ACF and
PACF can only give an indication. We need to employ formal test to answer this question. Later
we fit a variety of tentative ARIMA models for each of the variables and by examining the
significance of the coefficients of ARIMA models we select the most appropriate model for each
of the variable.
4.1 Total Number of Accidents in Saudi Arabia
At first we consider the total number of road accidents in the entire Saudi Arabia. The ACF and
PACF values for this data together with the Ljuang-Box t and  2 tests are given in Figure 4.1
and in Table 4.1.
Autocorrelation Function for total
Partial Autocorrelation Function for total
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
0.8
0.8
0.6
0.6
Partial Autocorrelation
Autocorrelation
1.0
0.4
0.2
0.0
-0.2
-0.4
-0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
11
1
Lag
2
3
4
5
6
7
8
9
10
11
Lag
Figure 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia
59
Table 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.650100
2.25201
6.4547
0.650100
2.25201
2
0.254603
0.64927
7.5437
-0.291023
-1.00813
3
-0.009077
-0.02238
7.5453
-0.063523
-0.22005
4
-0.030761
-0.07582
7.5651
0.149157
0.51669
5
-0.184256
-0.45395
8.3799
-0.408400
-1.41474
6
-0.221489
-0.53655
9.7536
0.170052
0.58908
7
-0.193680
-0.45832
11.0140
-0.041897
-0.14514
8
-0.164204
-0.38194
12.1464
-0.274840
-0.95207
9
-0.268235
-0.61647
16.1756
-0.070623
-0.24464
10
-0.220151
-0.49066
20.2468
0.121250
0.42002
The above table and figure clearly show an indication of autoregressive or moving average
pattern. The ACF values show that only the first one is significant, so an MA(1) could be a
possible model for this. However, the first PACF value is also significant that indicates that the
possible model could be AR(1). Now we fit eight different ARIMA model for this data. They are
AR(1), AR(2), AR(3), MA(1), MA(2), MA(3), ARMA(1, 1) and ARIMA(1, 1, 1). Results of the
above models are presented in Table 4.2.
60
Table 4.2: Significance of ARIMA models fir the Number of Road Accidents in Saudi Arabia
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0006
21.27
0.000
AR(2)
AR(1)
1.3449
3.92
0.003
AR(2)
-0,3431
-0.93
0.375
AR(1)
1.3756
3.85
0.004
AR(2)
-0.5562
-1.05
0.321
AR(3)
0.1827
0.48
0.644
MA(1)
MA(1)
-0.8547
-2.49
0.032
MA(2)
MA(1)
-1.5529
-4.45
0.001
MA(2)
-0.6426
-1.36
0.203
MA(1)
-1.6856
-4.85
0.001
MA(2)
-1.4250
-2.46
0.036
MA(3)
-0.5650
-1.08
0.309
AR(1)
1.0010
15.86
0.000
MA(1)
-0.2530
-0.78
0.454
AR(1)
-0.3182
-0.22
0.828
MA(1)
-0.5555
-0.43
0.679
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1) is significant throughout. Although MA(1) is
significant in some cases but AR(1) is more significant than MA(1). ARIMA(1, 1, 1) is not a
possibility at all. Hence the best model for this data is AR(1).
61
4.2 Total Number of Accidents in Riyadh
Now we consider the total number of road accidents in Riyadh. The ACF and PACF values for
this data together with the Ljuang-Box t and  2 tests are given in Figure 4.2 and in Table 4.3.
Autocorrelation Function for Riyadh
Partial Autocorrelation Function for Riyadh
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
0.8
Partial Autocorrelation
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.2: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh
Table 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.639726
2.12173
5.8523
0.639726
2.12173
2
0.310614
0.76394
7.3852
-0.166967
-0.55377
3
0.034943
0.08171
7.4071
-0.156943
-0.52052
4
-0.174029
-0.40672
8.0258
-0.148767
-0.49341
5
-0.354857
-0.81714
11.0269
-0.220778
-0.73224
6
-0.356595
-0.77542
14.6637
0.017352
0.05755
7
-0.257252
-0.53112
17.0296
0.013195
0.04376
8
-0.162444
-0.32710
18.2875
-0.071982
-0.23874
9
-0.117293
-0.23392
19.2711
-0.124647
-0.41341
62
10
-0.062812
-0.12465
19.8353
-0.063589
-0.21090
The above table and figure clearly show an indication that this day may fit an MA(1) or AR(1)
model. For confirmation we fit eight different ARIMA model for this data and the results are
presented in Table 4.4.
Table 4.4: Significance of ARIMA models fir the Number of Road Accidents in Riyadh
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0003
10.00
0.000
AR(2)
AR(1)
1.1851
3.53
0.006
AR(2)
-0.1841
-0.50
0.630
AR(1)
1.1692
3.28
0.011
AR(2)
-0.1717
-0.33
0.753
AR(3)
0.0036
0.01
0.993
MA(1)
MA(1)
-0.854
-2.60
0.029
MA(2)
MA(1)
-1.0942
-4.45
0.001
MA(2)
-0.7256
-1.72
0.119
MA(1)
-1.2305
-3.52
0.008
MA(2)
-1.1393
-2.13
0.065
MA(3)
-0.4821
-0.57
0.583
AR(1)
1.0006
8.18
0.000
MA(1)
-0.0915
-0.26
0.804
AR(1)
0.1078
0.03
0.977
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
63
MA(1)
0.0097
0.00
0.999
The above results clearly show that AR(1) is the most appropriate model for this data.
4.3 Total Number of Accidents in Mecca
Next we consider the total number of road accidents in Mecca and the summary results are given
Autocorrelation Function for Mecca
Partial Autocorrelation Function for Mecca
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
0.8
0.6
0.6
Partial Autocorrelation
Autocorrelation
in Figure 4.3 and in Table 4.5.
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
2
Lag
3
4
5
6
7
8
9
10
Lag
Figure 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Mecca
Table 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Mecca
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.518564
1.71988
3.8454
0.518564
1.71988
2
0.137058
0.36656
4.1439
-0.180348
-0.59815
3
-0.060622
-0.16019
4.2096
-0.072091
-0.23910
4
-0.000961
-0.00253
4.2096
0.129911
0.43087
5
-0.107564
-0.28357
4.4853
-0.232716
-0.77183
6
-0.138906
-0.36355
5.0372
0.003594
0.01192
64
7
-0.281103
-0.72703
7.8621
-0.247915
-0.82224
8
-0.248209
-0.61316
10.7987
-0.031766
-0.10536
9
-0.226961
-0.54244
14.4818
-0.101963
-0.33817
10
-0.091295
-0.21258
15.6737
0.000941
0.00312
Although the ACF and PACF values are not significant here but they are close to the critical
value 2 and their graphs indicate that this data may fit an AR(1) model.
Table 4.6: Significance of ARIMA models fir the Number of Road Accidents in Mecca
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
0.9996
49.35
0.000
AR(2)
AR(1)
1.7810
4.95
0.001
AR(2)
-0.7799
-2.15
0.060
AR(1)
1.6151
4.06
0.004
AR(2)
-0.3786
-0.56
0.592
AR(3)
-0.2361
-0.49
0.640
MA(1)
MA(1)
-0.8838
-2.88
0.016
MA(2)
MA(1)
-1.4742
-4.39
0.002
MA(2)
-0.7438
-2.25
0.051
MA(1)
-1.9245
-4.68
0.002
MA(2)
-1.5421
-2.23
0.056
MA(3)
-0.4883
-1.01
0.310
AR(1)
1.004
38.38
0.000
AR(3)
MA(3)
ARMA(1, 1)
65
ARIMA(1, 1, 1)
MA(1)
-0.2464
-0.75
0.474
AR(1)
0.1159
0.08
0.939
MA(1)
-0.1570
0.11
0.915
The above results clearly show that although MA(1) and MA(2) can be considered for this data,
based on the overall performances AR(1) is considered as the most appropriate model for this
data.
4.4 Total Number of Accidents in Eastern Province
Next we consider the total number of road accidents in Eastern Province and the summary results
are given in Figure 4.4 and in Table 4.7.
Partial Autocorrelation Function for Eastern Province
Autocorrelation Function for Eastern Province
(with 5% significance limits for the partial autocorrelations)
(with 5% significance limits for the autocorrelations)
1.0
1.0
0.8
0.8
Partial Autocorrelation
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.4: The ACF and PACF Values for the Total Number of Road Accidents in E Province
Table 4.7: The ACF and PACF Values for the Total Number of Road Accidents in E Province
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.705947
2.34136
7.1266
0.705947
2.34136
2
0.344712
0.80909
9.0146
-0.306295
-1.01586
66
3
0.060024
0.13318
9.0790
-0.091588
-0.30376
4
-0.226373
-0.50147
10.1259
-0.308181
-1.02212
5
-0.360562
-0.78107
13.2243
0.017203
0.05706
6
-0.427266
-0.87815
18.4454
-0.224781
-0.74551
7
-0.292031
-0.56209
21.4942
0.286353
0.94973
8
-0.133929
-0.25068
22.3492
-0.221803
-0.73564
9
-0.116442
-0.21672
23.3187
-0.154299
-0.51175
10
-0.054081
-0.10023
23.7369
-0.046454
-0.15407
Here both the ACF and PACF values are significant for the first order, hence the model can be
fitted by AR(1) or MA(1) methods. .
Table 4.8: Significance of ARIMA models fir the Number of Road Accidents in E Province
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0006
10.43
0.000
AR(2)
AR(1)
1.0645
3.03
0.014
AR(2)
-0.0620
-0.16
0.880
AR(1)
0.9432
3.17
0.013
AR(2)
-0.5708
-1.34
0.218
AR(3)
0.6306
1.91
0.093
MA(1)
MA(1)
-0.8666
-2.62
0.026
MA(2)
MA(1)
-1.5590
-3.54
0.006
MA(2)
-0.6521
-1.28
0.232
AR(3)
67
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
MA(1)
-2.0450
-6.30
0.000
MA(2)
-1.5474
-2.61
0.031
MA(3)
-0.4313
-0.97
0.358
AR(1)
0.9997
224.27
0.000
MA(1)
-1.0575
-262.20
0.000
AR(1)
0.5252
0.20
0.849
MA(1)
0.6924
0.28
0.787
The above results clearly show that AR(1), MA(1) and ARMA(1, 1) can be considered for this
data, based on the overall performances ARMA(1, 1) is considered as the most appropriate
model for this data.
4.5 Total Number of Accidents in Medina
Next we consider the total number of road accidents in Medina and the summary results are
given in Figure 4.5 and in Table 4.9.
Autocorrelation Function for Medina
Partial Autocorrelation Function for Medina
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
0.8
0.8
0.6
Partial Autocorrelation
1.0
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
1
2
3
4
5
6
7
8
9
10
1
Lag
2
3
4
5
6
7
8
9
10
Lag
Figure 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Medina
68
Table 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Medina
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.788543
2.61530
8.8917
0.788543
2.61530
2
0.493955
1.09373
12.7685
-0.338039
-1.12115
3
0.149332
0.29967
13.1671
-0.314293
-1.04239
4
-0.126859
-0.25252
13.4959
-0.058117
-0.19275
5
-0.383965
-0.75991
17.0096
-0.302652
-1.00378
6
-0.433620
-0.81639
22.3871
0.268468
0.89041
7
-0.382425
-0.67998
27.6155
-0.036391
-0.12069
8
-0.301701
-0.51523
31.9543
-0.268906
-0.89186
9
-0.214677
-0.35807
35.2495
0.034405
0.11411
10
-0.088583
-0.14606
36.3716
-0.014209
-0.04713
Here both the ACF and PACF values are significant for the first order, hence the model can be
fitted by AR(1) or MA(1) methods. .
Table 4.10: Significance of ARIMA models fir the Number of Road Accidents in Medina
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0174
14.55
0.000
AR(2)
AR(1)
1.5854
4.59
0.001
AR(2)
-0.6039
-1.52
0.164
AR(1)
1.3756
3.85
0.004
AR(3)
69
AR(2)
-0.5562
-1.05
0.321
AR(3)
0.1827
0.48
0.644
MA(1)
MA(1)
-0.8578
-2.52
0.030
MA(2)
MA(1)
-1.4742
-4.39
0.002
MA(2)
-0.7438
-2.25
0.051
MA(1)
-1.5348
-3.32
0.010
MA(2)
-1.2944
-1.97
0.085
MA(3)
-0.5512
-1.20
0.265
AR(1)
1.0151
10.91
0.000
MA(1)
-0.2932
-0.82
0.434
AR(1)
0.7457
1.74
0.120
MA(1)
0.2806
0.43
0.679
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1) and MA(1) can be considered for this data, based on
the overall performances AR(1) is considered as the most appropriate model for this data.
4.6 Total Number of Accidents in Al-Qassim
Next we consider the total number of road accidents in Al-Qassim and the summary results are
given in Figure 4.6 and in Table 4.11.
70
Partial Autocorrelation Function for Qassim
Autocorrelation Function for Qassim
(with 5% significance limits for the partial autocorrelations)
(with 5% significance limits for the autocorrelations)
1.0
0.8
1.0
Partial Autocorrelation
0.8
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.6: The ACF and PACF Values for the Total Number of Road Accidents in Al-Qassim
Table 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al-Qassim
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.778861
2.58319
8.6747
0.778861
2.58319
2
0.554764
1.23677
13.5648
-0.131836
-0.43725
3
0.229719
0.45300
14.5080
-0.405259
-1.34409
4
-0.063780
-0.12349
14.5911
-0.196334
-0.65117
5
-0.285927
-0.55284
16.5396
-0.040538
-0.13445
6
-0.421232
-0.79272
21.6143
-0.051637
-0.17126
7
-0.463226
-0.82585
29.2855
-0.062427
-0.20705
8
-0.405921
-0.68260
37.1396
0.019373
0.06425
9
-0.263172
-0.42492
42.0916
0.095264
0.31595
10
-0.160087
-0.25434
45.7564
-0.200343
-0.66446
Here both the ACF and PACF values are significant for the first order, hence the model can be
fitted by AR(1) or MA(1) methods. .
71
Table 4.12: Significance of ARIMA models fir the Number of Road Accidents in Al-Qassim
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0021
24.62
0.000
AR(2)
AR(1)
1.3705
3.60
0.014
AR(2)
-0.3679
-0.74
0.476
AR(1)
1.2183
2.75
0.025
AR(2)
0.2143
0.40
0.702
AR(3)
-0.4303
-0.81
0.442
MA(1)
MA(1)
-0.8585
-2.49
0.032
MA(2)
MA(1)
-1.4380
-3.83
0.004
MA(2)
-0.7188
-1.65
0.133
MA(1)
-1.6287
-3.72
0.006
MA(2)
-1.4056
-2.39
0.044
MA(3)
-0.6016
-1.41
0.197
AR(1)
1.0025
16.34
0.000
MA(1)
-0.1784
-0.43
0.678
AR(1)
1.0001
497.56
0.000
MA(1)
1.0412
61.08
0.000
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1), MA(1) and ARIMA(1, 1, 1) can be considered for
this data, based on the overall performances ARIMA(1, 1, 1) is considered as the most
appropriate model for this data.
72
4.7 Total Number of Accidents in Tabuk
Next we consider the total number of road accidents in Tabuk and the summary results are given
in Figure 4.7 and in Table 4.13.
Autocorrelation Function for Tabuk
Partial Autocorrelation Function for Tabuk
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
Partial Autocorrelation
0.8
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
Lag
2
3
4
5
6
7
8
9
10
Lag
Figure 4.7: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk
Table 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.682488
2.26356
6.6608
0.682488
2.26356
2
0.509195
1.21513
10.7805
0.081250
0.26948
3
0.178187
0.37755
11.3480
-0.370368
-1.22837
4
-0.096803
-0.20250
11.5394
-0.263705
-0.87461
5
-0.248731
-0.51840
13.0139
0.035595
0.11806
6
-0.362890
-0.73850
16.7802
-0.057857
-0.19189
7
-0.419031
-0.81337
23.0575
-0.203492
-0.67491
8
-0.359251
-0.65883
29.2094
0.012469
0.04135
9
-0.236137
-0.41691
33.1962
0.149891
0.49713
73
10
-0.147029
-0.25558
36.2875
-0.143894
-0.47724
Here both the ACF and PACF values are significant for the first order, hence the model can be
fitted by AR(1) or MA(1) methods. .
Table 4.14: Significance of ARIMA models fir the Number of Road Accidents in Tabuk
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0046
10.43
0.000
AR(2)
AR(1)
1.5999
4.19
0.002
AR(2)
-0.5976
-1.30
0.224
AR(1)
1.1692
3.28
0.011
AR(2)
-0.1717
-0.33
0.753
AR(3)
0.0036
0.01
0.993
MA(1)
MA(1)
-0.8468
-2.29
0.045
MA(2)
MA(1)
-1.4721
-4.18
0.002
MA(2)
-0.7436
-2.12
0.063
MA(1)
-1.6287
-5.15
0.001
MA(2)
-1.6140
-2.51
0.036
MA(3)
-0.5256
-1.28
0.237
AR(1)
1.0025
15.86
0.000
MA(1)
-0.2530
-0.78
0.454
AR(1)
1.0043
1.70
0.128
MA(1)
0.0234
0.03
0.976
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
74
The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data,
based on the overall performances AR(1) is considered as the most appropriate model for this
data.
4.8 Total Number of Accidents in Asir
Our variable of interest is now the total number of road accidents in Asir. The summary results
for this data are given in Figure 4.8 and in Table 4.15.
Autocorrelation Function for Asir
Partial Autocorrelation Function for Asir
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
0.8
Partial Autocorrelation
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
2
Lag
3
4
5
6
7
8
9
Lag
Figure 4.8: The ACF and PACF Values for the Total Number of Road Accidents in Asir.
Table 4.15: The ACF and PACF Values for the Total Number of Road Accidents in Asir.
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.583366
1.93481
4.8665
0.583366
1.93481
2
0.321259
0.82189
6.5064
-0.028888
-0.09581
3
0.217749
0.52573
7.3539
0.063379
0.21020
4
0.012084
0.02847
7.3569
-0.204853
-0.67942
5
-0.132835
-0.31292
7.7774
-0.100465
-0.33321
75
10
6
-0.243621
-0.56886
9.4749
-0.157780
-0.52330
7
-0.383432
-0.87010
14.7309
-0.212037
-0.70325
8
-0.337912
-0.71892
20.1737
0.025868
0.08580
9
-0.309691
-0.62994
27.0311
-0.108039
-0.35832
10
-0.226966
-0.44587
34.3976
0.049251
0.16335
Here the ACF and PACF values are not significant but for the first order they are very close to
the critical value 2. Hence the model may be fitted by AR(1) or MA(1) methods. .
Table 4.16: Significance of ARIMA models fir the Number of Road Accidents in Asir.
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0017
19.28
0.000
AR(2)
AR(1)
1.0252
2.84
0.019
AR(2)
-0.0220
-0.06
0.956
AR(1)
1.2183
2.75
0.025
AR(2)
0.2143
0.40
0.702
AR(3)
-0.4303
-0.81
0.442
MA(1)
MA(1)
-0.8729
-2.87
0.017
MA(2)
MA(1)
-1.3698
-4.06
0.003
MA(2)
-0.7454
-2.21
0.055
MA(1)
-1.2305
-3.52
0.008
MA(2)
-1.1393
-2.13
0.065
MA(3)
-0.4821
-0.57
0.583
AR(3)
MA(3)
76
ARMA(1, 1)
ARIMA(1, 1, 1)
AR(1)
1.0022
17.50
0.000
MA(1)
0.1057
0.26
0.801
AR(1)
-0.9997
-15.99
0.000
MA(1)
-0.9857
-1.79
0.111
The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data,
based on the overall performances AR(1) is considered as the most appropriate model for this
data.
4.9 Total Number of Accidents in Albahih
Now we consider the total number of road accidents in Albahih. The summary results for this
data are given in Figure 4.9 and in Table 4.17.
Autocorrelation Function for Albahih
Partial Autocorrelation Function for Albahih
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
Partial Autocorrelation
0.8
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Albahih
Table 4.17: The ACF and PACF Values for the Total Number of Road Accidents in Albahih
Index
1
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
0.637186
2.11331
5.8059
0.637186
2.11331
77
2
0.450798
1.11070
9.0348
0.075409
0.25010
3
0.206319
0.45942
9.7957
-0.181694
-0.60261
4
-0.061150
-0.13363
9.8721
-0.266149
-0.88272
5
-0.143985
-0.31413
10.3662
0.040374
0.13390
6
-0.255075
-0.55156
12.2270
-0.092220
-0.30586
7
-0.422780
-0.88992
18.6171
-0.338786
-1.12363
8
-0.417706
-0.82205
26.9338
-0.059729
-0.19810
9
-0.276939
-0.51434
32.4176
0.285438
0.94669
10
-0.216669
-0.39306
39.1308
-0.109983
-0.36477
Here the ACF and PACF values are significant for the first order. Hence the model can be fitted
by AR(1) or MA(1) methods. .
Table 4.18: Significance of ARIMA models fir the Number of Road Accidents in Albahih
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0046
14.74
0.000
AR(2)
AR(1)
1.3183
3.48
0.007
AR(2)
-0.3128
-0.74
0.481
AR(1)
1.3008
3.44
0.009
AR(2)
-0.6292
-0.92
0.386
AR(3)
0.3336
0.56
0.588
MA(1)
MA(1)
-0.8556
-2.42
0.036
MA(2)
MA(1)
-1.4904
-3.98
0.003
AR(3)
78
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
MA(2)
-0.7411
-1.82
0.102
MA(1)
-2.1498
-12.08
0.000
MA(2)
-1.6801
-2.96
0.018
MA(3)
-0.4746
-1.11
0.299
AR(1)
1.0070
8.19
0.000
MA(1)
-0.6269
-1.54
0.159
AR(1)
-0.0683
-0.04
0.969
MA(1)
-0.6911
-0.40
0.697
The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data,
based on the overall performances AR(1) is considered as the most appropriate model for this
data.
4.10 Total Number of Accidents in Northern Borders
Now we consider the total number of road accidents in Northern Borders. The summary results
for this data are given in Figure 4.10 and in Table 4.19.
Autocorrelation Function for Northern Borders
Partial Autocorrelation Function for Northern Borders
(with 5% significance limits for the partial autocorrelations)
(with 5% significance limits for the autocorrelations)
1.0
1.0
0.8
0.8
Partial Autocorrelation
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.10: The ACF and PACF Values for the Total Number of Road Accidents in N Borders
79
Table 4.19: The ACF and PACF Values for the Total Number of Road Accidents in N Borders
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.478168
1.58591
3.2696
0.478168
1.58591
2
0.518466
1.42444
7.5407
0.375730
1.24615
3
0.204436
0.48006
8.2877
-0.196225
-0.65080
4
-0.084495
-0.19438
8.4336
-0.458333
-1.52012
5
-0.169860
-0.38943
9.1212
-0.085555
-0.28376
6
-0.325861
-0.73699
12.1581
0.057935
0.19215
7
-0.368456
-0.79499
17.0116
-0.107831
-0.35764
8
-0.374139
-0.76452
23.6839
-0.203452
-0.67477
9
-0.200046
-0.38865
26.5453
0.222916
0.73933
10
-0.178213
-0.34157
31.0869
0.049331
0.16361
Here the ACF and PACF values are not significant but for the first order they are very close to
the critical value 2. Hence the model may be fitted by AR(1) or MA(1) methods. .
Table 4.20: Significance of ARIMA models fir the Number of Road Accidents in N Borders
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0016
15.07
0.000
AR(2)
AR(1)
1.2141
2.36
0.043
AR(2)
-0.2102
-0.39
0.708
AR(1)
1.1609
2.10
0.069
AR(3)
80
AR(2)
-0.4018
-0.40
0.563
AR(3)
0.2440
0.42
0.687
MA(1)
MA(1)
-0.8556
-2.42
0.036
MA(2)
MA(1)
-1.4944
-4.35
0.002
MA(2)
-0.7734
-2.25
0.051
MA(1)
-1.8743
-5.40
0.001
MA(2)
-1.4567
-2.24
0.055
MA(3)
-0.4397
-0.85
0.419
AR(1)
1.0021
9.39
0.000
MA(1)
-0.6103
-1.94
0.084
AR(1)
-0.2880
-0.38
0.712
MA(1)
-0.7422
-1.26
0.237
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1), MA(1) and ARMA(1, 1) can be considered for this
data, based on the overall performances we consider AR(1) is as the most appropriate model for
this data.
4.11 Total Number of Accidents in Al Jouf
Now we consider the total number of road accidents in Al Jouf. The summary results for this
data are given in Figure 4.11 and in Table 4.21.
81
Autocorrelation Function for Aljouf
Partial Autocorrelation Function for Aljouf
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
0.8
0.6
0.6
Partial Autocorrelation
Autocorrelation
(with 5% significance limits for the autocorrelations)
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
Lag
2
3
4
5
6
7
8
9
10
Lag
Figure 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf
Table 4.21: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.591436
1.96157
5.0021
0.591436
1.96157
2
0.264240
0.67224
6.1115
-0.131583
-0.43641
3
0.130765
0.31979
6.4171
0.049680
0.16477
4
-0.062931
-0.15249
6.4980
-0.215148
-0.71356
5
-0.368551
-0.89117
9.7353
-0.367613
-1.21923
6
-0.458838
-1.03713
15.7565
-0.097710
-0.32407
7
-0.204077
-0.42187
17.2454
0.300553
0.99682
8
-0.168385
-0.34259
18.5970
-0.182215
-0.60434
9
-0.151072
-0.30414
20.2288
-0.019734
-0.06545
10
-0.072586
-0.14492
20.9822
-0.192690
-0.63908
Here the ACF and PACF values are not significant but for the first order they are very close to
the critical value 2. Hence the model may be fitted by AR(1) or MA(1) methods. .
82
Table 4.22: Significance of ARIMA models fir the Number of Road Accidents in Al Jouf
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0006
18.21
0.000
AR(2)
AR(1)
0.9682
2.87
0.018
AR(2)
0.0364
0.10
0.920
AR(1)
1.1143
3.39
0.010
AR(2)
0.6334
1.23
0.255
AR(3)
-0.8294
-1.63
0.141
MA(1)
MA(1)
-0.9124
-2.30
0.044
MA(2)
MA(1)
-1.1197
-2.57
0.030
MA(2)
-0.7362
-1.67
0.130
MA(1)
-1.2337
-3.16
0.013
MA(2)
-1.0878
-1.88
0.097
MA(3)
-0.5747
-0.77
0.466
AR(1)
1.0012
17.51
0.000
MA(1)
0.1864
0.45
0.662
AR(1)
-0.2724
-0.13
0.899
MA(1)
-0.0548
-0.03
0.981
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data,
based on the overall performances we consider AR(1) as the most appropriate model for this
data.
83
4.12 Total Number of Accidents in Ha’il
Now we consider the total number of road accidents in Ha’il. The summary results for this data
Autocorrelation Function for Hail
Partial Autocorrelation Function for Hail
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
0.8
0.6
0.6
Partial Autocorrelation
Autocorrelation
are given in Figure 4.12 and in Table 4.23.
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
2
Lag
3
4
5
6
7
8
9
10
Lag
Figure 4.12: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il
Table 4.23: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.290442
0.963287
1.2063
0.290442
0.963287
2
-0.012848
-0.039417
1.2089
-0.106160
-0.352093
3
0.192035
0.589062
1.8681
0.250735
0.831592
4
0.123485
0.367375
2.1796
-0.021859
-0.072499
5
-0.120277
-0.353521
2.5244
-0.138686
-0.459968
6
-0.182778
-0.531222
3.4799
-0.150781
-0.500083
7
-0.224613
-0.636681
5.2835
-0.213506
-0.708118
8
-0.166265
-0.454827
6.6012
-0.028472
-0.094432
9
-0.226800
-0.609074
10.2790
-0.161838
-0.536755
84
10
-0.172381
-0.448069
14.5283
0.010759
0.035684
Here the ACF and PACF values are not significant so the appropriate model could be white
noise.
Table 4.24: Significance of ARIMA models fir the Number of Road Accidents in Ha’il
Model
Components
Coefficients
t-statistic
p-value
White Noise
ARIMA (0,0,0)
-
0.00
1.00
AR(1)
AR(1)
1.0005
15.14
0.000
AR(2)
AR(1)
1.0645
3.03
0.014
AR(2)
-0.0620
-0.16
0.880
AR(1)
0.8776
2.37
0.045
AR(2)
0.3962
0.98
0.354
AR(3)
-0.2683
-0.67
0.519
MA(1)
MA(1)
-0.8680
-3.27
0.008
MA(2)
MA(1)
-1.1627
-2.75
0.022
MA(2)
-0.7736
-1.92
0.087
MA(1)
-1.3062
-3.70
0.006
MA(2)
-1.1474
-2.42
0.042
MA(3)
-0.3759
-0.97
0.359
AR(1)
1.0006
25.29
0.000
MA(1)
0.5475
1.38
0.200
AR(1)
-0.0473
-0.07
0.942
AR(3)
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
85
MA(1)
0.5186
0.93
0.379
Although the ACF and the PACF gave an indication that the right model should be a white noise,
but this foot is really poor. Based on the above results we can select AR(1), MA(1) or MA(2) for
this data. Based on the overall performances we consider AR(1) as the most appropriate model
for this data.
4.13 Total Number of Accidents in Najran
Now we consider the total number of road accidents in Najran. The summary results for this data
are given in Figure 4.13 and in Table 4.25.
Autocorrelation Function for Najran
Partial Autocorrelation Function for Najran
(with 5% significance limits for the autocorrelations)
(with 5% significance limits for the partial autocorrelations)
1.0
1.0
0.8
Partial Autocorrelation
0.8
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
10
1
2
Lag
3
4
5
6
7
8
9
10
Lag
Figure 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Najran
Table 4.25: The ACF and PACF Values for the Total Number of Road Accidents in Najran
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.760792
2.52326
8.2769
0.760792
2.52326
2
0.424511
0.95851
11.1402
-0.366322
-1.21495
3
0.078217
0.16348
11.2496
-0.231333
-0.76724
86
4
-0.249643
-0.52051
12.5227
-0.274939
-0.91187
5
-0.369878
-0.75289
15.7834
0.191436
0.63492
6
-0.370222
-0.71752
19.7034
-0.078195
-0.25934
7
-0.336178
-0.62303
23.7437
-0.184142
-0.61073
8
-0.236194
-0.42306
26.4029
-0.042759
-0.14181
9
-0.132954
-0.23436
27.6668
0.022027
0.07306
10
-0.068449
-0.12006
28.3368
-0.079983
-0.26528
Here the ACF and PACF values are significant for the first order. Hence the model can be fitted
by AR(1) or MA(1) methods. .
Table 4.26: Significance of ARIMA models fir the Number of Road Accidents in Najran
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0005
14.41
0.000
AR(2)
AR(1)
1.6108
5.06
0.001
AR(2)
-0.6110
-1.74
0.116
AR(1)
1.7939
2.61
0.031
AR(2)
-1.2905
-0.62
0.554
AR(3)
0.4979
0.35
0.738
MA(1)
MA(1)
-0.8520
-2.62
0.026
MA(2)
MA(1)
-1.4558
-4.04
0.033
MA(2)
-0.7145
-1.82
0.101
MA(1)
-2.1198
-9.33
0.000
AR(3)
MA(3)
87
ARMA(1, 1)
ARIMA(1, 1, 1)
MA(2)
-1.6590
-2.67
0.028
MA(3)
-0.4766
-0.96
0.364
AR(1)
1.0005
8.48
0.000
MA(1)
-0.5178
-1.00
0.345
AR(1)
0.2411
0.10
0.925
MA(1)
-0.2833
-0.09
0.927
The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data,
based on the overall performances we consider AR(1) as the most appropriate model for this
data.
4.14 Total Number of Accidents in Jizan
Now we consider the total number of road accidents in Jizan. The summary results for this data
are given in Figure 4.14 and in Table 4.27.
Partial Autocorrelation Function for Jizan
Autocorrelation Function for Jizan
(with 5% significance limits for the partial autocorrelations)
(with 5% significance limits for the autocorrelations)
1.0
0.8
0.8
0.6
Partial Autocorrelation
1.0
Autocorrelation
0.6
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
0.4
0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0
-1.0
1
2
3
4
5
6
7
8
9
1
10
2
3
4
5
6
7
8
9
10
Lag
Lag
Figure 4.14: The ACF and PACF Values for the Total Number of Road Accidents in Jizan
88
Table 4.27: The ACF and PACF Values for the Total Number of Road Accidents in Jizan
Index
ACF
PACF
Values
TSTAT
LBQ
Values
TSTAT
1
0.639136
2.11977
5.8415
0.639136
2.11977
2
0.569828
1.40205
11.0006
0.272750
0.90461
3
0.198326
0.41884
11.7037
-0.439926
-1.45907
4
-0.055681
-0.11576
11.7671
-0.367870
-1.22009
5
-0.244339
-0.50736
13.1900
0.033945
0.11258
6
-0.406835
-0.82567
17.9237
-0.048714
-0.16157
7
-0.394432
-0.75507
23.4855
-0.013338
-0.04424
8
-0.410516
-0.74804
31.5185
-0.107096
-0.35520
9
-0.216164
-0.37527
34.8595
0.160052
0.53083
10
-0.179322
-0.30740
39.4578
-0.127243
-0.42202
Here the ACF and PACF values are significant for the first order. Hence the model can be fitted
by AR(1) or MA(1) methods. .
Table 4.28: Significance of ARIMA models fir the Number of Road Accidents in Jizan
Model
Components
Coefficients
t-statistic
p-value
AR(1)
AR(1)
1.0094
10.19
0.000
AR(2)
AR(1)
0.7001
1.19
0.265
AR(2)
0.3127
0.46
0.659
AR(1)
0.7347
1.17
0.314
AR(3)
89
AR(2)
0.5415
0.76
0.467
AR(3)
-0.2662
-0.32
0.754
MA(1)
MA(1)
-0.8509
-2.34
0.042
MA(2)
MA(1)
-1.4050
-2.84
0.019
MA(2)
-0.7353
-1.39
0.197
MA(1)
-1.6450
-3.22
0.012
MA(2)
-1.3749
-1.65
0.136
MA(3)
-0.5492
-0.94
0.375
AR(1)
1.0099
7.33
0.000
MA(1)
-0.1260
-0.23
0.824
AR(1)
0.9996
9.28
0.000
MA(1)
0.8134
1.42
0.195
MA(3)
ARMA(1, 1)
ARIMA(1, 1, 1)
The above results clearly show that AR(1) and MA(1) can be considered for this data, based on
the overall performances we consider AR(1) as the most appropriate model for this data.
4.15 Result Summary
In this section we summarize results that we obtain in sections 4.1 – 4.14 and the results are
presented in the following table.
Table 4.29 Selected Models for Saudi Arabia Road Accident Data
Region
Tentative Models
Selected Model
Entire Saudi Arabia
AR(1), MA(1)
AR(1)
Riyadh
AR(1), MA(1)
AR(1)
90
Mecca
AR(1), MA(1), MA(2)
AR(1)
Eastern Province
AR(1), MA(1), ARMA(1, 1)
ARMA(1,1)
Medina
AR(1), MA(1), ARIMA(1, 1, 1)
AR1MA(1, 1, 1)
Al Qassim
AR(1), MA(1), MA(2)
AR(1)
Tabuk
AR(1), MA(1), MA(2)
AR(1)
Asir
AR(1), MA(1), MA(2)
AR(1)
Albahih
AR(1), MA(1), MA(2)
AR(1)
Northern Borders
AR(1), MA(1), ARMA(1, 1)
AR(1)
Al Jouf
AR(1), MA(1), MA(2)
AR(1)
Ha’il
White Noise, AR(1), MA(1), MA(2)
AR(1)
Najran
AR(1), MA(1), MA(2)
AR(1)
Jizan
AR(1), MA(1)
AR(1)
91
CHAPTER 5
CONCLUSIONS AND DIRECTION OF
FUTURE RESEARCH
In this chapter we will summarize the findings of our research to draw some conclusions and
outline ideas for our future research.
5.1 Conclusions
My prime objective was to find the most appropriate models for analyzing the total number of
road accidents in the entire Saudi Arabia and its different regions. Since we have time series data
it is conventional to consider ARIMA models to fit the data. So the question was what type of
ARIMA model best fit the data? Since all of the variables had missing values it was not possible
to fit ARIMA model readily. As a remedy to this problem we estimate missing observations first.
We employed sixteen different methods to estimate the missing values and based on the crossvalidation it turned out that the robust MM estimates based on the EM algorithm provided the
best set of estimates of the missing values. But since our data are time series simple EM
algorithm would not be appropriate for them. After the estimation of missing values we
employed a series of tests such as ACF, PACF, Ljung-Box test and t-test of the ARIMA
components to select the most appropriate ARIMA model(s) for all fourteen variables under
study. Twelve of them matched with AR(1) model, one with ARMA (1, 1) and one with
ARIMA(1, 1, 1) model.
92
5.2 Direction of Future Research
We have to complete this research within a short period of time. So because of time constraints
it was not possible for us to look at every aspect of Saudi Arabia road accident data. We intend to
look at the major causes of road accidents in Saudi Arabia. Unlike the Western countries drunk
driving is not an issue there as Alcohol is strictly prohibited in Saudi Arabia but the other issues
like over-speeding, quality of road, qualification of drivers, texting or talking while driving, but
we could not manage enough information for those. We would like to extend our research in this
direction in future.
93
REFERENCES
1.
2.
3.
4.
5.
6
7.
8.
9.
10
.
11
.
12
.
13
.
14
.
15
.
16
17
.
18
.
19
.
20
.
21
.
22
Bowerman, B. L., O’Connell, R. T., and Koehler, A. B. (2005). Forecasting, Time
Series, and Regression: An Applied Approach, 4th Ed., Duxbury Publishing, Thomson
Books/Cole, New Jersey.
Booth, J.G. and Sarkar, S. (1998). Monte Carlo Approximation of Bootstrap Variances,
The American Statisticians, 52, 354-357.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from
Incomplete Data Via the EM algorithm, Journal of the Royal Statistical Society. Series
B, 39, 1-38
Efron, B. (1979). Bootstrap Method: Another Look at the Jackknife, The Annals of
Statistics, 7, 1-26.
Efron B. (1987). Better Bootstrap Confidence Intervals (With Discussions), Journal of
the American Statistical Association, 82, 171-200.
Efron, B. and Tibshirani, R.J. (1993). An Introduction to Bootstrap, Chapman and Hall,
London.
Efron, B. and Tibshirani, R. (1997), Improvements on cross-validation: The .632 +
Bootstrap Method, Journal of the American Statistical Association, 92, 548–560.
Hadi, A.S., Imon, A.H.M.R. and Werner, M. (2009). Detection of outliers, Wiley
Interdisciplinary Reviews: Computational Statistics, 1, 57 – 70.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W. (1986). Robust
Statistics: The Approach Based on Influence Function, Wiley, New York.
Huber, P.J. (1973). Robust Regression: Asymptotics, Conjectures, and Monte Carlo,
The Annals of Statistics, 1, 799-821.
Imon, A. H. M. R. (2015). An Introduction to Regression, Time Series, and Forecasting
(To Appear).
Izenman, A.J. (2008), Modern Multivariate Statistical Techniques: Regression,
Classification, and Manifold Learning, Springer, New York.
Kadane, J.B. (1984). Robustness of Bayesian Analysis, Elsevier North-Holland,
Amsterdam.
Little, R.J.A. and Rubin D. B. (2002). Statistical Analysis with Missing Data, 2nd ed.,
Wiley, New York.
Mamun, A.S.A. (2014). Robust Statistics in Linear Structural Relationship Model and
Analysis of Missing Values. Unpublished Ph.D. thesis, University of Malaya.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006), Robust Statistics: Theory and
Methods, Wiley, New York.
Picard, R. and Cook, R.D. (1984), Cross-Validation of Regression Models, Journal of
the American Statistical Association, 79, 575–583.
Pindyck, R. S. and Rubenfeld, D. L. (1998), Econometric Models and Economic
Forecasts, 4th Ed. Irwin/McGraw-Hill Boston.
Rousseeuw, P.J. (1984). Least Median of Squares Regression, Journal of the American
Statistical Association, 79, 871 – 880.
Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection, Wiley,
New York.
Rousseeuw, P.J. and Leroy, A.M. (1987). A Fast Algorithm for S-Regression Estimates,
94
.
23
.
24
.
25
.
26
.
Journal of Computational and Graphical Statistics, 15, 414–427.
Schucany, W.R. (1988). Sample Reuse in Encyclopedia of Statistical Science, 8, Kotz,
S. and Johnson, N.L. (Eds.), 235-38.
Simpson, J.R. and Montgomery, D.C. (1998). A Robust Regression Technique Using
Compound Estimation, Naval Research Logistics (NRL), 45, 125–139.
Yohai, V. (1987). High Breakdown-Point and High Efficiency Robust Estimates for
Regression, The Annals of Statistics, 15, 642-656.
Zahir, Abdulrahman (2013). Saudi Arabia First globally in Traffic Accidents. Alarabia
net. http://www.alarabiya.net/ar/sauditoday/2013/03/13/%D8%A7%D9%84%D8%B3%D8%B9%D9%88%D8%AF%D9%8
A%D8%A9-%D8%A7%D9%84%D8%A3%D9%88%D9%84%D9%89%D8%B9%D8%A7%D9%84%D9%85%D9%8A%D8%A7%D9%8B%D9%81%D9%8A%D8%A7%D9%84%D8%AD%D9%88%D8%A7%D8%AF%D8%AB%D8%A7%D9%84%D9%85%D8%B1%D9%88%D8%B1%D9%8A%D8%A9.html
95