Parallel Stochastic Estimation on Multicore Platforms

ACTA UNIVERSITATIS UPSALIENSIS
Uppsala Dissertations from the Faculty of Science and Technology
111
Parallel Stochastic Estimation
on Multicore Platforms
Olov Rosén
Dissertation presented at Uppsala University to be publicly examined in ITC 2347,
Lägerhyddsvägen 2, Uppsala, Tuesday, 12 May 2015 at 13:15 for the degree of Doctor of
Philosophy. The examination will be conducted in English. Faculty examiner: Professor Petar
Djuric (Stony Brook University, New York, USA).
Abstract
Rosén, O. 2015. Parallel Stochastic Estimation on Multicore Platforms. Uppsala Dissertations
from the Faculty of Science and Technology 111. xiv+191 pp. Uppsala: Acta Universitatis
Upsaliensis. ISBN 978-91-554-9191-8.
The main part of this thesis concerns parallelization of recursive Bayesian estimation methods,
both linear and nonlinear. Recursive estimation deals with the problem of extracting
information about parameters or states of a dynamical system, given noisy measurements of the
system output and plays a central role in signal processing, system identification, and automatic
control. Solving the recursive Bayesian estimation problem is known to be computationally
expensive, which often makes the methods infeasible in real-time applications and problems
of large dimension. As the computational power of the hardware is today increased by adding
more processors on a single chip rather than increasing the clock frequency and shrinking the
logic circuits, parallelization is one of the most powerful ways of improving the execution
time of an algorithm. It has been found in the work of this thesis that several of the optimal
filtering methods are suitable for parallel implementation, in certain ranges of problem sizes.
For many of the suggested parallelizations, a linear speedup in the number of cores has been
achieved providing up to 8 times speedup on a double quad-core computer. As the evolution
of the parallel computer architectures is unfolding rapidly, many more processors on the same
chip will soon become available. The developed methods do not, of course, scale infinitely, but
definitely can exploit and harness some of the computational power of the next generation of
parallel platforms, allowing for optimal state estimation in real-time applications.
Keywords: Recursive estimation, Parallelization, Bayesian estimation, Anomaly detection
Olov Rosén, Department of Information Technology, Division of Systems and Control, Box
337, Uppsala University, SE-75105 Uppsala, Sweden.
© Olov Rosén 2015
ISSN 1104-2516
ISBN 978-91-554-9191-8
urn:nbn:se:uu:diva-246859 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-246859)
List of papers
This thesis is based on the following papers, which are referred to in
the text by their Roman numerals.
I. Olov Rosén and Alexander Medvedev. Efficient parallel implementation of state estimation algorithms on multicore platforms. IEEE Transactions on Control Systems Technology, 21(1):107–120, 2013.

II. Olov Rosén, Alexander Medvedev, and Torbjörn Wigren. Parallelization of the Kalman filter on multicore computational platforms. Control Engineering Practice, 21(9):1188–1194, 2013.

III. Olov Rosén, Alexander Medvedev, and Mats Ekman. Speedup and tracking accuracy evaluation of parallel particle filter algorithms implemented on a multicore architecture. In 2010 IEEE International Conference on Control Applications (CCA), pages 440–445. IEEE, 2010.

IV. Olov Rosén and Alexander Medvedev. Efficient parallel implementation of a Kalman filter for single output systems on multicore computational platforms. In 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), pages 3178–3183. IEEE, 2011.

V. Olov Rosén and Alexander Medvedev. Parallelization of the Kalman filter for banded systems on multicore computational platforms. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pages 2022–2027, 2012.

VI. Olov Rosén and Alexander Medvedev. An on-line algorithm for anomaly detection in trajectory data. In American Control Conference (ACC), 2012, pages 1117–1122. IEEE, 2012.

VII. Olov Rosén and Alexander Medvedev. Parallel recursive estimation based on orthogonal series expansions. In American Control Conference (ACC), 2014, pages 622–627, June 2014.

VIII. Olov Rosén and Alexander Medvedev. The recursive Bayesian estimation problem via orthogonal expansions: an error bound. In IFAC World Congress, August 2014.

IX. Daniel Jansson, Alexander Medvedev, and Olov Rosén. Parametric and non-parametric analysis of eye-tracking data by anomaly detection. IEEE Transactions on Control Systems Technology, 2014.

X. Olov Rosén, Alexander Medvedev, and Daniel Jansson. Non-parametric anomaly detection in trajectorial data. Submitted to a journal, 2014.

XI. Olov Rosén and Alexander Medvedev. Orthogonal basis particle filtering: an approach to parallelization of recursive estimation. Submitted to a journal, 2015.

XII. Olov Rosén and Alexander Medvedev. Parallel recursive estimation using Monte Carlo and orthogonal series expansions. In American Control Conference, Palmer House Hilton, Chicago, IL, USA, 2015.

XIII. Olov Rosén, Margarida M. Silva, and Alexander Medvedev. Nonlinear estimation of a parsimonious Wiener model for the neuromuscular blockade in closed-loop anesthesia. In Proc. 19th IFAC World Congress, pages 9258–9264. International Federation of Automatic Control, 2014.

XIV. Olov Rosén and Alexander Medvedev. Nonlinear identification of individualized drug effect models in neuromuscular blockade. Submitted to a journal, 2015.

XV. Daniel Jansson, Olov Rosén, and Alexander Medvedev. Non-parametric analysis of eye-tracking data by anomaly detection. In 2013 European Control Conference (ECC), pages 632–637. IEEE, 2013.
Reprints were made with permission from the publishers.
The following paper has also been published by the author but does not
contain material that is published in this thesis.

• Fredrik Wahlberg, Alexander Medvedev, and Olov Rosén. A LEGO-based mobile robotic platform for evaluation of parallel control and estimation algorithms. In 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), pages 4548–4553. IEEE, 2011.
Acknowledgment
I would like to thank my supervisor Professor Alexander Medvedev for
his support and guidance throughout this work. I am grateful for the
degree of freedom that you have given me in my research, and for always encouraging my ideas and providing helpful feedback for improving them.
I would also like to thank all my colleagues at SysCon for providing
such a pleasant working atmosphere. It has been great to be part of the
SysCon group over these years, where all in the group have contributed,
everyone in their own way, to make the working day enjoyable both
socially and professionally.
I would like to thank my family for their encouragement. A special
thanks goes to Linnéa for her love and support during the work of this
thesis. While not being particularly interested in parallel stochastic
estimation, you have been a great partner in the other parts of life, which
is at least as important as giving feedback about equations during the
work of a thesis.
My co-authors also deserve thanks for their contributions. Mats
Ekman for the collaboration on parallel particle filtering, Daniel Jansson
for the eye movement data used for evaluation of the anomaly detection
method, Margarida Silva for the collaboration on parameter estimation
in a PK/PD model for anesthesia and Torbjörn Wigren for the data and
model for the WCDMA application used for evaluation of the parallel
Kalman filter, thank you!
The thesis covers research within the project “Computationally Demanding Real-Time Applications on Multicore Platforms”, funded by the Swedish Foundation for Strategic Research, whose financial support is
greatly appreciated.
Contents

1 Background
  1.1 Introduction
  1.2 Notation and Nomenclature
    1.2.1 Sub-indexing
    1.2.2 Multi-indexing
  1.3 Series expansions
    1.3.1 Orthogonal functions
    1.3.2 Examples of orthogonal basis functions
    1.3.3 Multivariate orthogonal series
    1.3.4 Shifting and scaling
  1.4 Some probability theory
    1.4.1 Definition of a random variable
    1.4.2 The distribution function and some associated measures
    1.4.3 Bayesian statistics
    1.4.4 Estimation from random samples
    1.4.5 Confidence regions and outliers
  1.5 Recursive Bayesian estimation
    1.5.1 Optimal estimation
    1.5.2 The prediction-update recursion
  1.6 Solution methods for the optimal filtering problem
    1.6.1 Kalman filter
    1.6.2 Static Kalman filter
    1.6.3 Extended Kalman filter
    1.6.4 Unscented Kalman filter
    1.6.5 Monte-Carlo methods
    1.6.6 Grid-based methods
    1.6.7 Computational complexity
  1.7 High-performance computing
    1.7.1 Efficient memory handling
    1.7.2 Hardware mechanisms for efficient code execution
    1.7.3 Some further examples
    1.7.4 Software libraries
  1.8 Multicore architecture
    1.8.1 Evolution of the multicore processor
    1.8.2 Parallel architectures
  1.9 Parallel implementation
    1.9.1 Parallel implementation
    1.9.2 Software
    1.9.3 Performance measures
    1.9.4 Efficient parallelization
    1.9.5 Using Shared Resources
    1.9.6 Data partitioning
  1.10 Short chapter summaries

2 Parallelization of the Kalman Filter
  2.1 Introduction
  2.2 State space model and filtering equations
    2.2.1 State space system description
    2.2.2 Kalman filter equations
  2.3 Banded systems
    2.3.1 Transformation to a banded system form
    2.3.2 Time-invariant case
    2.3.3 Time-varying case
  2.4 MISO System
    2.4.1 Efficient sequential implementation
    2.4.2 Parallel implementation
    2.4.3 Analysis
  2.5 MIMO System
  2.6 Implementation example
    2.6.1 Uplink interference power estimation model
    2.6.2 Results
  2.7 Discussion
  2.8 Static Kalman filter
  2.9 Extended Kalman filter
  2.10 Conclusions

3 Parallel implementation of the Kalman filter as a parameter estimator
  3.1 Introduction
    3.1.1 System model and the Kalman filter
  3.2 Implementation
    3.2.1 Straightforward implementation
    3.2.2 Reordering of the equations for efficient memory utilization
    3.2.3 Utilizing the symmetry of P
    3.2.4 Parallel implementation
  3.3 Analysis of Algorithm 13
    3.3.1 Sequential and parallel work
    3.3.2 Communication and synchronization
    3.3.3 Memory bandwidth
    3.3.4 Cache miss handling
  3.4 Results
    3.4.1 Execution time and speedup
    3.4.2 Memory Bandwidth
  3.5 Discussion
    3.5.1 Conclusions

4 Parallel implementation of the particle filter
  4.1 The particle filter
  4.2 Parallel algorithms
  4.3 Performance evaluation
  4.4 Discussion
  4.5 Conclusions for parallel implementation of the particle filter

5 Solving the RBE via orthogonal series expansions
  5.1 Introduction
  5.2 Solving the RBE via series expansions
    5.2.1 Mean and Covariance
    5.2.2 Truncation
    5.2.3 Computational complexity
  5.3 Parallel implementation
    5.3.1 Analysis
  5.4 Numerical Experiments
    5.4.1 The system
    5.4.2 Solution using Fourier basis functions
    5.4.3 Execution time and speedup
  5.5 Discussion
    5.5.1 Estimation accuracy
    5.5.2 Speedup
    5.5.3 Limitations
  5.6 An error bound
    5.6.1 Numerical experiments
    5.6.2 Discussion

6 Orthogonal basis PF
  6.1 Introduction
  6.2 Background
    6.2.1 The PF algorithm with importance sampling
    6.2.2 Hermite functions basis
  6.3 The Hermitian Particle Filter
  6.4 Parallelization
    6.4.1 Parallelization properties analysis
  6.5 Analysis
  6.6 Computational Experiments
    6.6.1 System model
    6.6.2 Estimation accuracy
    6.6.3 Execution time and speedup
  6.A Appendix for Chapter 6
  6.B Proof of Theorem 4
  6.C Proof of Theorem 3
  6.D Derivation of Eq. (6.16)

7 Anomaly detection
  7.1 Introduction
  7.2 Notation
  7.3 The anomaly detection method
    7.3.1 Context function and reference trajectory
    7.3.2 Probability density function
    7.3.3 Outlier detection
    7.3.4 Anomaly detection method recapitulated
  7.4 Experimental results
    7.4.1 Vessel traffic
    7.4.2 Eye-tracking
  7.5 Limitations
  7.6 Conclusions
  7.A Appendix for Chapter 7
    7.A.1 Evaluation of Eq. (1.20)
    7.A.2 Proof of f̂X(τ, x) being a PDF

8 Application to parameter estimation in PK/PD model
  8.1 Introduction
  8.2 Parsimonious Wiener Model
  8.3 Estimation algorithms
    8.3.1 Filter tuning
  8.4 Data sets and performance evaluation metrics
    8.4.1 Synthetic Data
    8.4.2 Real data
  8.5 Results
    8.5.1 Synthetic data
    8.5.2 Real data
  8.6 Conclusions

9 BLAS based parallelizations of UKF and point mass filter
  9.1 UKF
  9.2 Point mass filter

References

Svensk sammanfattning
Chapter 1

Background
1.1 Introduction
The main part of this thesis is about parallelization of discrete-time recursive estimation and filtering methods, both linear and nonlinear.
Recursive estimation deals with the problem of extracting information
about parameters or states of a dynamical system, given noisy measurements of the system output. It plays a central role in signal processing,
system identification and automatic control.
Signal filters were originally seen as circuits or systems with frequency-selecting behavior, and the most typical area of application was in radio transmission and receiving equipment. The development of filtering techniques went on, and more sophisticated filters were introduced, such as the Chebyshev and Butterworth filters, which gave means of shaping the frequency characteristics of the filter in a more systematic design procedure. During this stage, filtering was mainly considered from this frequency-domain point of view.
By the introduction of the Wiener-Kolmogorov filter [112], [54], statistical ideas were incorporated into the field of filtering and statistical
properties of the signal, rather than the frequency content, were utilized
to select what to filter out. The idea of the Wiener-Kolmogorov filter
is to minimize the mean square error between the estimated signal and
the true signal. An optimality criterion was thus introduced in this case
and it became possible to state whether a filter was optimal in some
specific sense. Further steps in the development of filters were taken by Rudolf E. Kalman with the introduction of the famous Kalman filter
that, in contrast to the Wiener and Kolmogorov filters, applies to nonstationary processes. A conceptual difference is that the Kalman filter
is based on the state-space model framework, rather than the polynomial formalism adopted in the Wiener and Kolmogorov filters. Working
in the time domain with state-space models, the term “filter” can seem somewhat misleading, and the name “observer” or “state estimator” is perhaps more naturally connected to the problem formulations. The term filter has nevertheless been kept and is still widely used.
The Wiener, Kolmogorov, and Kalman filters all assume that the underlying system is linear. Nonlinear non-Gaussian filtering methods do exist, however, that can be applied to general systems with arbitrary noise distributions. Historically, nonlinear filtering has been a relatively sparsely researched area, a main reason being the computational burden of computing the estimates due to the lack of a closed-form expression for the solution. As an example, early versions of the nonlinear particle filter were proposed already in the 1950s, under the name 'Poor Man's Monte Carlo' by Hammersley [37], and in the 1970s further improvements of the method were made in the control community. However, because of the practical limitations imposed by its computational cost, the method was more or less abandoned and forgotten until 1993, when Gordon [32] published the formulation that is commonly used today. Even then there was skepticism toward the method, and its usability was confined to a small set of applications that could tolerate the execution times associated with the filter computations.
It was not until the beginning of the 2000s that the capacity of computational hardware had improved enough to let the methods be used in a wider range of applications, and interest in nonlinear filtering grew rapidly. With the development of parallel hardware, the computational capacity of computers has started increasing at an even faster rate, and the development of parallel versions of the filtering algorithms continues to broaden the application range of nonlinear filters. The situation is the same for other nonlinear non-Gaussian filtering techniques, and even for linear filters of large dimension. In real-time applications, it has been common to employ suboptimal filtering methods because the optimal solution simply requires too much computation to be practically feasible. For example, recursive least squares and least mean squares methods are often used for linear filtering instead of the optimal Kalman filter, which can provide both faster convergence and better mean square error. For nonlinear systems, the (suboptimal) extended Kalman filter is a usual choice, even though other nonlinear methods such as the unscented Kalman filter, grid-based methods and simulation-based methods can provide superior estimation accuracy. With the computational power offered by parallel hardware, new doors are opening for the application of computationally costly optimal methods.
As mentioned, the computational capacity of the hardware is no longer growing by shrinking the logic circuits and increasing the processor operating frequency, but rather by adding more processors on a single chip. This is due to physical limitations as well as power and heat dissipation concerns. All major manufacturers have turned from single-core to multicore designs, and parallel processing is no longer the exclusive domain of supercomputers or clusters. Any computer bought today is likely to have two or more cores, and the number of cores available on a single chip increases steadily. To utilize the computational power provided by parallel hardware, algorithms must be implemented in a way that suits the parallel architecture. Occasionally, the implementation is rather straightforward, but in many cases the algorithm must, at some point, be modified to yield better parallelization properties, where the modification often comes with decreased accuracy, sacrificed in favor of faster execution time.
As parameter and state estimation constitute a key part in automatic
control, system identification and signal processing, estimation quality
can be of utmost importance for the performance of the system. This motivates the interest in designing implementations of recursive estimation methods that can be executed on a parallel architecture and provide real-time feasibility without significant loss of accuracy. Another aspect of parallelization, important e.g. for battery-powered mobile devices and low-power communication systems, is the substantially lower power-per-FLOP ratio of parallel processors compared to sequential processors.
The main part of this thesis deals with the parallelization of recursive
Bayesian estimation problems in discrete time. Another problem, related to the estimation problem by its Bayesian nature, is anomaly detection,
to which a smaller part of the thesis is devoted. Anomaly detection refers
to finding patterns in a given data set that do not conform to a properly
defined normal behavior.
1.2 Notation and Nomenclature
Symbols

A            Matrices are written in bold upper case letters.
x            Vectors are written in bold lower case letters.
A^T, a^T     Transpose of a matrix or vector.
X            Stochastic variable.
det(A)       Determinant of A.
tr(A)        Trace of A.
A^{-1}       Inverse of A.
pX(x)        Probability density function of the stochastic variable X, evaluated at x.
pX,Y(x, y)   Joint density function of the random variables X and Y. When there is no risk of confusion, this is written simply as p(x, y).
pX|Y(x|y)    Conditional density function of X given Y = y.
PX(x)        Cumulative distribution function of X.
Pr(A)        Probability of a random event A.
f(x)         Vector-valued function.
R^n          n-dimensional space of real numbers.
N^n          n-dimensional space of natural numbers.
R^{n×m}      Space of real n × m matrices.
m:n          Set of numbers {m, m+1, ..., n}, m, n ∈ N and m ≤ n.
N(μ, Σ)      Normal distribution with mean μ and covariance Σ.
γ(x; μ, Σ)   Probability density function of the normal distribution N(μ, Σ).
x_{m:n}      The ordered set {x_m, x_{m+1}, ..., x_n}.
L_p(Ω)       The space of functions for which the p-th power of the absolute value is Lebesgue integrable over the domain Ω.

For PDFs and cumulative distribution functions: when there is no risk of confusion, the subscript will be dropped and pX(x) will be written simply as p(x).
Abbreviations

KF       Kalman filter.
EKF      Extended Kalman filter.
UKF      Unscented Kalman filter.
PF       Particle filter.
PDF      Probability density function.
CPU      Central processing unit.
SMC      Shared memory multicore.
MIMO     Multiple input multiple output.
SISO     Single input single output.
MISO     Multiple input single output.
FLOP     Floating point operation.
FLOPS    Floating point operations per second.
MISE     Mean integrated square error.
AMISE    Asymptotic MISE.
BLAS     Basic Linear Algebra Subprograms.
RBE      Recursive Bayesian estimation.
1.2.1 Sub-indexing
Let A denote a matrix of size m × n. The submatrix that lies in the rows indexed by α ⊆ {1, ..., m} and the columns indexed by β ⊆ {1, ..., n} is denoted A(α, β). For example, if α = {1, 2}, β = {1, 3} and

$$ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}, $$

then

$$ \mathbf{A}(\alpha, \beta) = \mathbf{A}(\{1,2\}, \{1,3\}) = \begin{bmatrix} a_{11} & a_{13} \\ a_{21} & a_{23} \end{bmatrix}. $$

The submatrix that consists of all rows and the columns β is denoted A(:, β). Furthermore, 1:n denotes {1, 2, ..., n}. When indexing is out of range, the result is defined as zero, e.g. A(−1, 1) = A(1, 4) = 0. This is to avoid complicated notation for handling indices near the edges of the matrix. In an implementation, the matrix can simply be padded with a frame of zeros.
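For illustration, a minimal Python sketch of this indexing convention (the function name submatrix and the use of NumPy are choices made for this example, not taken from the thesis):

```python
import numpy as np

def submatrix(A, alpha, beta):
    """Return A(alpha, beta) with out-of-range indices treated as zeros.

    alpha and beta are 1-based index sets, matching the thesis notation.
    """
    m, n = A.shape
    out = np.zeros((len(alpha), len(beta)))
    for i, r in enumerate(alpha):
        for j, c in enumerate(beta):
            if 1 <= r <= m and 1 <= c <= n:
                out[i, j] = A[r - 1, c - 1]   # shift to 0-based indexing
    return out

A = np.arange(1, 10).reshape(3, 3)            # [[1,2,3],[4,5,6],[7,8,9]]
print(submatrix(A, [1, 2], [1, 3]))           # rows {1,2}, columns {1,3}
print(submatrix(A, [1], [4]))                 # out of range -> 0
```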
1.2.2 Multi-indexing
Multi-indexing is used to simplify the notation of multivariate expressions and generalizes the concept of a scalar index to an ordered tuple of indices.

A D-dimensional multi-index is a D-tuple α = (α_1, α_2, ..., α_D), i.e. an element of the D-dimensional set of natural numbers N^D.

Let n = (n_1, n_2, ..., n_D) and m = (m_1, m_2, ..., m_D) denote two D-dimensional multi-indices, and let x = [x_1 x_2 ... x_D]^T ∈ R^D be a D-dimensional vector. Multi-index sum, product, power and partial derivative are interpreted in the following way:

$$ \mathbf{n} + \mathbf{m} = (n_1 + m_1,\, n_2 + m_2,\, \dots,\, n_D + m_D), $$
$$ \mathbf{n} \cdot \mathbf{x} = n_1 x_1 + n_2 x_2 + \dots + n_D x_D, $$
$$ \mathbf{x}^{\mathbf{n}} = x_1^{n_1} x_2^{n_2} \cdots x_D^{n_D}, $$
$$ \partial^{\mathbf{n}} = \frac{\partial^{n_1}}{\partial x_1^{n_1}} \frac{\partial^{n_2}}{\partial x_2^{n_2}} \cdots \frac{\partial^{n_D}}{\partial x_D^{n_D}}. $$

Two multi-indices are equal if all their elements are equal, i.e. n = k if and only if n_1 = k_1, n_2 = k_2, ..., n_D = k_D.
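A minimal Python sketch of these multi-index operations (the helper names are illustrative only, not taken from the thesis):

```python
import numpy as np

def mi_add(n, m):
    """Multi-index sum: elementwise addition."""
    return tuple(ni + mi for ni, mi in zip(n, m))

def mi_dot(n, x):
    """Multi-index product n . x = n1*x1 + ... + nD*xD."""
    return float(np.dot(n, x))

def mi_power(x, n):
    """Multi-index power x**n = x1**n1 * ... * xD**nD."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(n)))

n, m = (1, 0, 2), (0, 3, 1)
x = np.array([2.0, 3.0, 0.5])
print(mi_add(n, m))      # (1, 3, 3)
print(mi_dot(n, x))      # 1*2 + 0*3 + 2*0.5 = 3.0
print(mi_power(x, n))    # 2**1 * 3**0 * 0.5**2 = 0.5
```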
1.3 Series expansions
Series expansions have been utilized in several of the constructed methods for PDF estimation and, in particular, orthogonal series expansions have been used. The literature covering the theory of this topic is vast, with its foundations in functional analysis and special functions. Here, only a brief presentation of the facts that are relevant to this particular work is given; see e.g. [59], [24] for a more thorough exposition.

In mathematics, a series expansion is a way of representing a function that cannot be expressed in terms of elementary operations (addition, subtraction, multiplication and division) using a series of other functions with known properties. The series representation is in general infinite, but can be truncated to give an approximation to the function with a guaranteed accuracy.
Suppose a function f(x) is given and it is sought to approximate it by a series over the domain Ω, so that the integrated square error is minimized, using a set of other functions {φ_k(x)}_{k=0}^{K}, i.e.

$$ f(x) \approx \hat{f}(x) = \sum_{k=0}^{K} c_k \phi_k(x), \qquad (1.1) $$

where c_k are the weights or coefficients. Making a least-squares fit, the coefficients can be found by minimizing the integrated square error loss function

$$ Q = \int_{\Omega} \Big[ f(x) - \sum_{k=0}^{K} c_k \phi_k(x) \Big]^2 dx. $$

Differentiating the loss function w.r.t. c_n and setting the derivative to zero to find the extremum gives the set of equations

$$ \frac{\partial Q}{\partial c_n} = -2 \int_{\Omega} \phi_n(x) \Big[ f(x) - \sum_{k=0}^{K} c_k \phi_k(x) \Big] dx = 0 \;\Leftrightarrow\; \sum_{k=0}^{K} c_k \int_{\Omega} \phi_k(x)\phi_n(x)\, dx = \int_{\Omega} \phi_n(x) f(x)\, dx, \quad n = 0, 1, 2, \dots, K. $$

The extremum can be shown to be the minimum by evaluating the second derivative w.r.t. the coefficients. Denoting $a_{nk} = \int_{\Omega} \phi_n(x)\phi_k(x)\, dx$ and $b_n = \int_{\Omega} \phi_n(x) f(x)\, dx$, this can be written in matrix form as

$$ \underbrace{\begin{bmatrix} a_{00} & a_{01} & \cdots & a_{0K} \\ a_{10} & a_{11} & & \vdots \\ \vdots & & \ddots & \\ a_{K0} & \cdots & & a_{KK} \end{bmatrix}}_{\mathbf{A}} \underbrace{\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_K \end{bmatrix}}_{\mathbf{c}} = \underbrace{\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_K \end{bmatrix}}_{\mathbf{b}}, \qquad (1.2) $$

which algebraic system has a unique solution, provided that A is nonsingular,

$$ \mathbf{c} = \mathbf{A}^{-1} \mathbf{b}, \qquad (1.3) $$

and these are hence the coefficients that solve the least-squares fitting problem.
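As a concrete illustration of (1.1)-(1.3), the following Python sketch builds A and b by a simple Riemann-sum quadrature and solves for the coefficients; the target function and the monomial basis are arbitrary choices made for this example, not taken from the thesis.

```python
import numpy as np

# Least-squares series fit on Omega = [0, 1]: f(x) ~ sum_k c_k phi_k(x).
# Basis: monomials phi_k(x) = x**k (deliberately non-orthogonal, so A is dense).
f = lambda x: np.exp(x)
K = 4
xs = np.linspace(0.0, 1.0, 2001)
dx = xs[1] - xs[0]

phi = np.array([xs**k for k in range(K + 1)])      # basis evaluated on the grid
A = (phi @ phi.T) * dx                             # a_nk = int phi_n phi_k dx
b = (phi @ f(xs)) * dx                             # b_n  = int phi_n f dx
c = np.linalg.solve(A, b)                          # solve A c = b, i.e. (1.3)

f_hat = c @ phi                                    # truncated expansion on the grid
print("coefficients:", c)
print("max abs error:", np.max(np.abs(f_hat - f(xs))))
```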
1.3.1 Orthogonal functions
A sequence of functions φ0 (x), φ1 (x), ... is said to be orthogonal over
the domain Ω, if
$$ \int_{\Omega} \phi_n(x)\phi_m(x)\, dx = \begin{cases} 0, & n \neq m \\ q_n, & n = m \end{cases}. \qquad (1.4) $$

Further, if q_n = 1, n = 0, 1, 2, ..., the functions are said to be orthonormal. Owing to the orthogonality of the functions, series expansions in this particular class of functions have some beneficial properties. For instance, A in (1.2) becomes diagonal or, in the case of orthonormal functions, even an identity matrix, and the solution of the least-squares fitting problem (i.e. (1.3)) is given by

$$ c_k = q_k^{-1} \int_{\Omega} \phi_k(x) f(x)\, dx, \qquad k = 0, 1, 2, \dots, K. $$

The cross-couplings between the coefficients vanish, and the coefficients can hence be estimated independently of each other. This is a property of particular interest for parallelization, where mutual independence is a key property to seek, since it typically provides a good basis for partitioning the workload into independent segments. To avoid getting into peculiar mathematics, the approximation in (1.1) is given for a truncated expansion. However, it is fully possible to let K → ∞, in which case it can be shown that, for a continuous f(x) ∈ L2(Ω), the series converges to the function itself, i.e.

$$ f(x) = \sum_{k=0}^{\infty} c_k \phi_k(x). $$
There are some other useful properties of orthogonal series expansions, such as

$$ Q = \int_{-\infty}^{\infty} \Big[ \sum_{k=k_1}^{k_2} c_k \phi_k(x) \Big]^2 dx = \sum_{k=k_1}^{k_2} \sum_{n=k_1}^{k_2} c_k c_n \int_{-\infty}^{\infty} \phi_k(x)\phi_n(x)\, dx = \sum_{k=k_1}^{k_2} c_k^2. $$

From this it follows that

$$ \int_{-\infty}^{\infty} f(x)^2\, dx = \int_{-\infty}^{\infty} \Big[ \sum_{k=0}^{\infty} c_k \phi_k(x) \Big]^2 dx = \sum_{k=0}^{\infty} c_k^2, \qquad (1.5) $$

a result known as Parseval's identity. It also implies the following equality for the truncation error:

$$ \int_{-\infty}^{\infty} e(x)^2\, dx = \int_{-\infty}^{\infty} \Big[ \sum_{k=K+1}^{\infty} c_k \phi_k(x) \Big]^2 dx = \sum_{k=K+1}^{\infty} c_k^2. \qquad (1.6) $$

Another implication of the basis functions' orthogonality is that the truncation error is orthogonal to the truncated expansion, i.e.

$$ \int_{-\infty}^{\infty} \hat{f}(x) e(x)\, dx = 0. \qquad (1.7) $$
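Since A is diagonal for an orthogonal basis, each coefficient is an independent projection, which is the property later exploited for parallelization. The sketch below illustrates this with an orthonormal cosine basis on [0, π] (an illustrative choice, not one of the bases used in the thesis) and checks the truncation-error identity (1.6) numerically.

```python
import numpy as np

# Orthonormal cosine basis on Omega = [0, pi]:
#   phi_0(x) = 1/sqrt(pi),  phi_k(x) = sqrt(2/pi) * cos(k x), k >= 1.
xs = np.linspace(0.0, np.pi, 4001)
dx = xs[1] - xs[0]

def phi(k, x):
    return np.full_like(x, 1/np.sqrt(np.pi)) if k == 0 else np.sqrt(2/np.pi) * np.cos(k*x)

f = lambda x: np.exp(-x) * np.sin(3*x)          # arbitrary square-integrable target
K = 15

# Each coefficient is an independent projection: c_k = int phi_k(x) f(x) dx.
c = np.array([np.sum(phi(k, xs) * f(xs)) * dx for k in range(K + 1)])
f_hat = sum(c[k] * phi(k, xs) for k in range(K + 1))

err_energy = np.sum((f(xs) - f_hat)**2) * dx    # int e(x)^2 dx, cf. (1.6)
print("retained coefficient energy:", np.sum(c**2))
print("residual energy (sum of discarded c_k^2):", err_energy)
```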
1.3.2 Examples of orthogonal basis functions
There are infinitely many sets of functions that form an orthogonal basis on a given domain. Which set of basis functions is suitable to use for the approximation depends on the underlying function being approximated. One should seek to pick a set of basis functions that gives as good an approximation as possible with a low truncation order. Here, some examples of commonly used basis functions are given.
Hermite basis functions
The Hermite functions constitute an orthonormal basis of L2(R). There are two "versions" of the Hermite functions, the probabilist's and the physicist's, which are simply scaled versions of each other. The probabilist's formulation is commonly used in probability theory, and the physicist's formulation is employed mainly in quantum mechanics when working with the Schrödinger equation. In this thesis, the probabilist's Hermite functions are used for PDF estimation and are defined by

$$ \phi_k(x) = \frac{(-1)^k}{\sqrt{2^k k! \sqrt{\pi}}}\, e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2}, \qquad k \in \mathbb{N}_0, $$

or recursively as

$$ \phi_0(x) = \pi^{-1/4} e^{-x^2/2}, \qquad \phi_1(x) = \sqrt{2}\, x\, \phi_0(x), $$
$$ \phi_k(x) = \sqrt{\frac{2}{k}}\, x\, \phi_{k-1}(x) - \sqrt{\frac{k-1}{k}}\, \phi_{k-2}(x), \qquad k = 2, 3, \dots $$

[Figure 1.1. The first five Hermite functions.]
The first five Hermite functions are plotted in Fig. 1.1.
The k-th Hermite function is of the form e^{−x²/2} p_k(x), where p_k(x) is a k-th order polynomial. It can be noted that the first basis function φ_0(x) is a scaled Gaussian bell function, and the factor e^{−x²/2} gives the functions rapidly decaying tails. As PDFs often have the characteristics of a Gaussian bell and rapidly decaying tails, the Hermite basis functions in many cases present a suitable basis for PDF approximation.
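The three-term recursion is cheap and numerically well behaved; a small Python sketch (illustrative only, not the thesis implementation) that evaluates the first Hermite functions and checks their orthonormality is:

```python
import numpy as np

def hermite_functions(K, x):
    """Evaluate the Hermite functions phi_0,...,phi_K at the points x
    using the three-term recursion given above."""
    x = np.asarray(x, dtype=float)
    phi = np.zeros((K + 1, x.size))
    phi[0] = np.pi**(-0.25) * np.exp(-x**2 / 2)
    if K >= 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for k in range(2, K + 1):
        phi[k] = np.sqrt(2.0/k) * x * phi[k-1] - np.sqrt((k-1)/k) * phi[k-2]
    return phi

x = np.linspace(-6, 6, 1201)
phi = hermite_functions(5, x)
dx = x[1] - x[0]
print((phi @ phi.T) * dx)   # approximately the identity matrix (orthonormality)
```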
Fourier basis functions
The Fourier functions constitute an orthogonal basis of L2([−π, π]). The real-valued Fourier basis functions are cosines and sines of different frequencies. The complex-valued Fourier functions, however, are often much more convenient and compact to work with, and are what will be used in this thesis. The complex Fourier basis functions are given by

$$ \phi_k(x) = e^{ikx}, \qquad k \in \mathbb{Z}, $$

where $i = \sqrt{-1}$ is the imaginary unit. Even though they are complex-valued, they can be used to approximate real-valued functions. The coefficients then come in complex-conjugated pairs, $c_{-k} = \overline{c_k}$, where the overline denotes complex conjugation, and the imaginary parts annihilate each other. This is shown by the following computation. Let $c_k = a_k + i b_k$; then

$$ f(x) \approx \sum_{k=-K}^{K} c_k \phi_k(x) = c_0 + \sum_{k=1}^{K} \left[ \overline{c_k}\, e^{-ikx} + c_k\, e^{ikx} \right] $$
$$ = c_0 + \sum_{k=1}^{K} \left[ (a_k - i b_k)(\cos(kx) - i\sin(kx)) + (a_k + i b_k)(\cos(kx) + i\sin(kx)) \right] $$
$$ = c_0 + \sum_{k=1}^{K} \left[ 2a_k \cos(kx) - 2b_k \sin(kx) \right]. $$

The complex-valued description of the Fourier series is thus equivalent to the real-valued one, but is notationally more convenient to work with.
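As a sketch of how the complex coefficients are used in practice (illustrative only, not the thesis implementation; note the 1/(2π) normalization, since the basis e^{ikx} is orthogonal but not orthonormal on [−π, π]):

```python
import numpy as np

xs = np.linspace(-np.pi, np.pi, 4001)
dx = xs[1] - xs[0]
# A smooth example PDF on [-pi, pi] (von Mises with kappa = 1), chosen for illustration.
f = lambda x: np.exp(np.cos(x)) / (2*np.pi*np.i0(1))

K = 10
ks = np.arange(-K, K + 1)
# c_k = (1/2pi) int f(x) exp(-i k x) dx, so that f(x) ~ sum_k c_k exp(i k x).
c = np.array([np.sum(f(xs) * np.exp(-1j*k*xs)) * dx / (2*np.pi) for k in ks])

f_hat = sum(c[j] * np.exp(1j*ks[j]*xs) for j in range(len(ks)))
print("conjugate symmetry:", np.allclose(c[::-1], np.conj(c)))   # c_{-k} = conj(c_k)
print("max |Im f_hat|:", np.max(np.abs(f_hat.imag)))             # imaginary parts cancel
print("max reconstruction error:", np.max(np.abs(f_hat.real - f(xs))))
```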
Legendre basis functions
The Legendre basis is a basis of L2 ([−1, 1]). The k-th Legendre polynomial is given by the formula
$$ P_k(x) = \frac{1}{2^k k!} \frac{d^k}{dx^k} (x^2 - 1)^k, $$
or, alternatively, from Bonnet’s recursion formula
P0 (x) = 1,
P1 (x) = x,
(k + 1)Pk+1 (x) = (2k + 1)xPk (x) − kPk−1 (x).
The first five Legendre basis functions are plotted in Fig. 1.2.
[Figure 1.2. The first five Legendre functions.]

1.3.3 Multivariate orthogonal series
Multivariate orthogonal basis functions can be used to approximate a multivariate function f(x) ∈ R, x ∈ R^D,

$$ f(\mathbf{x}) \approx \sum_{\mathbf{k} \in K} c_{\mathbf{k}}\, \phi_{\mathbf{k}}(\mathbf{x}), \qquad (1.8) $$

where K is some subset of N^D. A set of multivariate basis functions can be constructed from the one-dimensional ones. Assume that $\{\phi_k^{(1)}(x)\}_{k=0}^{\infty}$, $\{\phi_k^{(2)}(x)\}_{k=0}^{\infty}$, ..., $\{\phi_k^{(D)}(x)\}_{k=0}^{\infty}$ are orthogonal bases for L2(Ω_1), L2(Ω_2), ..., L2(Ω_D). Then $\{\phi_{\mathbf{k}}(\mathbf{x})\}_{\mathbf{k} \in \mathbb{N}^D}$ forms an orthogonal basis for L2(Ω), where Ω = Ω_1 × Ω_2 × ... × Ω_D and

$$ \phi_{\mathbf{k}}(\mathbf{x}) = \phi_{k_1}^{(1)}(x_1)\,\phi_{k_2}^{(2)}(x_2) \cdots \phi_{k_D}^{(D)}(x_D) = \prod_{i=1}^{D} \phi_{k_i}^{(i)}(x_i). $$
By the separability of φ_k(x), it follows that the basis functions are orthogonal to one another, by the following computation:

$$ \int_{\Omega} \phi_{\mathbf{n}}(\mathbf{x}) \phi_{\mathbf{k}}(\mathbf{x})\, d\mathbf{x} = \int_{\Omega_1}\!\!\int_{\Omega_2}\!\!\cdots\!\int_{\Omega_D} \prod_{i=1}^{D} \phi_{n_i}^{(i)}(x_i) \prod_{j=1}^{D} \phi_{k_j}^{(j)}(x_j)\, dx_1\, dx_2 \cdots dx_D $$
$$ = \int_{\Omega_1} \phi_{n_1}^{(1)}(x_1)\phi_{k_1}^{(1)}(x_1)\, dx_1 \cdot \int_{\Omega_2} \phi_{n_2}^{(2)}(x_2)\phi_{k_2}^{(2)}(x_2)\, dx_2 \cdot \ldots \cdot \int_{\Omega_D} \phi_{n_D}^{(D)}(x_D)\phi_{k_D}^{(D)}(x_D)\, dx_D = \begin{cases} 1 & \text{if } \mathbf{k} = \mathbf{n} \\ 0 & \text{otherwise} \end{cases}, $$

which follows from the fact that each factor $\int_{\Omega_i} \phi_{n_i}^{(i)}(x_i)\phi_{k_i}^{(i)}(x_i)\, dx_i$ equals one iff n_i = k_i and zero otherwise, i = 1, 2, ..., D. The coefficient with index k is given by

$$ c_{\mathbf{k}} = \int_{\Omega} \phi_{\mathbf{k}}(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}. $$

A proof of completeness for the set of functions is given in [18]. As an example, the multivariate Hermite functions φ_{11}(x), φ_{12}(x) and φ_{22}(x) are plotted in Fig. 1.3.
[Figure 1.3. The multivariate Hermite functions φ11(x), φ12(x) and φ22(x).]
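The tensor-product construction is straightforward to implement. The following Python sketch (illustrative only, reusing the Hermite recursion from Section 1.3.2) evaluates two-dimensional basis functions φ_{(k1,k2)}(x) = φ_{k1}(x1) φ_{k2}(x2) and verifies orthonormality on a grid.

```python
import numpy as np

def hermite_functions(K, x):
    """First K+1 Hermite functions at points x (three-term recursion)."""
    x = np.asarray(x, dtype=float)
    phi = np.zeros((K + 1, x.size))
    phi[0] = np.pi**(-0.25) * np.exp(-x**2 / 2)
    if K >= 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for k in range(2, K + 1):
        phi[k] = np.sqrt(2.0/k) * x * phi[k-1] - np.sqrt((k-1)/k) * phi[k-2]
    return phi

# Two-dimensional tensor-product basis phi_(k1,k2)(x1,x2) = phi_k1(x1) * phi_k2(x2).
grid = np.linspace(-6, 6, 401)
dx = grid[1] - grid[0]
H = hermite_functions(2, grid)

def phi2d(k):
    k1, k2 = k
    return np.outer(H[k1], H[k2])      # values on the (x1, x2) grid

inner = np.sum(phi2d((1, 1)) * phi2d((2, 2))) * dx * dx
norm = np.sum(phi2d((1, 2))**2) * dx * dx
print("inner product of phi_(1,1) and phi_(2,2):", inner)   # approx. 0
print("squared norm of phi_(1,2):", norm)                    # approx. 1
```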
1.3.4 Shifting and scaling
How well the underlying function is approximated by a truncated series expansion depends on how the basis functions are scaled and shifted relative to it. Since the set of basis functions is complete, the series will converge to the true function (provided it is square integrable) when K → ∞, regardless of the scaling and shifting. However, by rescaling and shifting, a better fit can be obtained for a truncated expansion. Assume that the set of basis functions {φ_k(x)}_{k∈K} is orthogonal on the domain Ω. Then a set of basis functions orthogonal on the domain Ω′ = {y | y = Σx + μ, x ∈ Ω} is given by φ′_k(y) = det(Σ)^{−1/2} φ_k(Σ^{−1}(y − μ)), k ∈ K, where Σ is a symmetric positive definite matrix. The orthogonality on the domain Ω′ follows from

$$ \int_{\Omega'} \phi'_k(\mathbf{y}) \phi'_n(\mathbf{y})\, d\mathbf{y} = \int_{\Omega'} \det(\boldsymbol{\Sigma})^{-1} \phi_k(\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}))\, \phi_n(\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}))\, d\mathbf{y} $$
$$ = \left\{ \mathbf{x} = \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}),\; d\mathbf{x} = \det(\boldsymbol{\Sigma})^{-1} d\mathbf{y} \right\} = \int_{\Omega} \phi_k(\mathbf{x}) \phi_n(\mathbf{x})\, d\mathbf{x} = \begin{cases} 1 & \text{if } k = n \\ 0 & \text{otherwise} \end{cases}. $$
1.4 Some probability theory
Probability theory constitutes a whole branch of mathematics and is
one of the foundations that this thesis is built on. Here, some concepts
regarding random variables utilized in the thesis are summarized. Much
of the material is assumed to be well known to the reader and is hence
only briefly explained. The problem of estimating a probability density
function from a random sample is discussed in more detail as it is a more
specialized topic that is typically less known to a wider audience.
1.4.1 Definition of a random variable
A random variable is a mathematical object, developed to represent an
event that has not yet happened and is subject to chance. A common
example is the number of dots in a die throw which, if the die is balanced, takes each of the values {1, 2, 3, 4, 5, 6} with probability 1/6.
To give a more formal definition of a random variable, the concept of a
probability space has first to be introduced.
A probability space is defined as the triplet (Ω, A, P), where Ω = {ω} is the set of all possible outcomes, A = {a} is a set of events with a ⊆ Ω, and P : A → R+ is a function that assigns a probability P(a) ≥ 0 to each event a. A random variable, or stochastic variable, is defined as a real-valued function X : Ω → R on the set Ω, [28].
1.4.2 The distribution function and some associated measures
Assume that X = [X_1, X_2, ..., X_n]^T is an n-dimensional random variable. To every stochastic variable there is an associated distribution G, a relationship written as X ∼ G. To each G there are two commonly associated distribution functions, the cumulative distribution function (CDF) P_X(x) and the probability density function (PDF) p_X(x), defined as

$$ P_X(\mathbf{x}) = \Pr(X \le \mathbf{x}) = \Pr(X_1 \le x_1, X_2 \le x_2, \dots, X_n \le x_n), $$
$$ p_X(\mathbf{x}) = \frac{\partial^n P_X(\mathbf{x})}{\partial x_1\, \partial x_2 \cdots \partial x_n}. $$

The CDF satisfies

$$ 0 \le P_X(\mathbf{x}) \le 1 \qquad (1.9) $$

and is monotonically increasing in each dimension. The PDF satisfies

$$ p_X(\mathbf{x}) \ge 0, \qquad (1.10) $$
$$ \int_{\mathbb{R}^n} p_X(\mathbf{x})\, d\mathbf{x} = 1. \qquad (1.11) $$
When there is no risk for confusion, the index is usually dropped and
the functions are written just as P (x) and p(x). Let Y be a subset of
the random variables X1 , X2 , ..., Xn and Z be the subset that contains
the variables not included in Y . The conditional density function,
$$ p(\mathbf{y} \mid \mathbf{z}) = \frac{p(\mathbf{y}, \mathbf{z})}{p(\mathbf{z})}, $$

specifies the density of Y given that Z = z. The marginal distribution, characterizing the distribution of Y alone, is given by

$$ p(\mathbf{y}) = \int_{\mathbb{R}^{n_z}} p(\mathbf{y}, \mathbf{z})\, d\mathbf{z}. $$

The expected value of g(X), where g is an arbitrary function, is given by

$$ \mathrm{E}[g(X)] = \int_{\mathbb{R}^n} g(\mathbf{x})\, p_X(\mathbf{x})\, d\mathbf{x}. $$

The mean value and covariance of X are defined as

$$ \boldsymbol{\mu} = \mathrm{E}[X] = \int_{\mathbb{R}^n} \mathbf{x}\, p_X(\mathbf{x})\, d\mathbf{x}, $$
$$ \boldsymbol{\Sigma} = \mathrm{E}[(X - \boldsymbol{\mu})(X - \boldsymbol{\mu})^T] = \int_{\mathbb{R}^n} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T p_X(\mathbf{x})\, d\mathbf{x}. $$
1.4.3 Bayesian statistics
Bayesian statistics is a subset of the field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief or, more specifically, Bayesian probabilities. Such an interpretation is only one of a number of interpretations of probability, and there are other statistical techniques that are not based on "degrees of belief". A fundamental equation in Bayesian statistics is Bayes' rule. Assume that A and B are two random events; the conditional probability of the event A given the outcome of B is given by Bayes' rule as

$$ \Pr(A \mid B) = \frac{\Pr(A)\Pr(B \mid A)}{\Pr(B)}, $$

or, in the form of a PDF,

$$ p_{A|B}(a \mid b) = \frac{p_A(a)\, p_{B|A}(b \mid a)}{p_B(b)}. $$

It is a powerful formula that states how the belief in the event A should be updated when new evidence is provided. Without the knowledge of event B, the probability of event A is just Pr(A), which is
usually referred to as the prior probability. When the new information,
or evidence, B, is received, Bayes rule states the formula for how the
belief of the event A should be updated to give the probability P r(A|B).
P r(A|B) is usually known as the posterior probability. The Bayesian
framework provides a powerful and comprehensive angle of attack on
the problem of dealing with uncertainty.
1.4.4 Estimation from random samples
Let {X_i}_{i=1}^{N} be a set of N i.i.d. (independent identically distributed) random variables with distribution G. In statistics, a random i.i.d. sample refers to a set of observations {x^{(i)}}_{i=1}^{N} of some random variable X. In signal processing, a sample usually refers to an observation at some given time instant; the terminologies hence collide and can cause confusion. In this thesis, the terminology that a sample is a set of observations is employed. A sample from G is given by the set of observations {x^{(i)}}_{i=1}^{N}, where x^{(i)} is a realization of X_i. From the sample, information about the underlying distribution can be extracted. For instance, the sample mean and covariance

$$ \hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}, \qquad (1.12) $$
$$ \hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T, \qquad (1.13) $$

are unbiased estimators of the true mean and covariance of the distribution.
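A minimal Python sketch of the estimators (1.12)-(1.13) on synthetic data (illustrative only; the true mean and covariance below are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])

N = 10_000
x = rng.multivariate_normal(mu_true, Sigma_true, size=N)   # row i is one observation x^(i)

mu_hat = x.mean(axis=0)                                     # sample mean, (1.12)
d = x - mu_hat
Sigma_hat = d.T @ d / (N - 1)                               # sample covariance, (1.13)

print("sample mean:", mu_hat)
print("sample covariance:\n", Sigma_hat)
```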
Consistency and efficiency
When computing a point estimate θ̂ of some quantity θ from a random sample, it is interesting to specify a confidence interval for the estimate, to assign some degree of certainty to it. An estimator is said to be consistent if it is unbiased, i.e. E[θ̂] = θ, and its variance approaches zero as N increases, i.e. V[θ̂] → 0 as N → ∞. Furthermore, it is said to be efficient if it is an unbiased estimator that provides the lowest variance for a given N.

As an example, the estimator (1.12) of μ can be shown to be consistent, since

$$ \mathrm{E}[\hat{\boldsymbol{\mu}}] = \mathrm{E}\Big[ \frac{1}{N} \sum_{i=1}^{N} X_i \Big] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}[X_i] = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\mu} = \boldsymbol{\mu} $$

and

$$ \mathrm{V}[\hat{\boldsymbol{\mu}}] = \mathrm{V}\Big[ \frac{1}{N} \sum_{i=1}^{N} X_i \Big] = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{V}[X_i] = \frac{\boldsymbol{\Sigma}_X}{N}, $$

which apparently approaches 0 when N → ∞.
Confidence intervals of point estimates
When a point estimate is made, it is interesting to assign a confidence interval to the estimate, i.e. some interval that covers the true parameter value with some given probability α. The 95% or 99% confidence intervals are often displayed. If the observations are independent and the estimate θ̂ is formed as a sum of functions of the observations,

$$ \hat{\theta} = \sum_{i=1}^{N} f_i(x^{(i)}), \qquad (1.14) $$

which is a commonly encountered case, θ̂ will have a variance of

$$ \Sigma_{\hat{\theta}} = \mathrm{V}[\hat{\theta}] = \sum_{i=1}^{N} \mathrm{V}[f_i(X_i)]. \qquad (1.15) $$

If f_i(·) is linear and the X_i have a Gaussian distribution, the exact confidence interval for θ̂ is given by

$$ I = \Big[ \hat{\theta} - \lambda_{\alpha/2} \sqrt{\Sigma_{\hat{\theta}}},\;\; \hat{\theta} + \lambda_{\alpha/2} \sqrt{\Sigma_{\hat{\theta}}} \Big], \qquad (1.16) $$

where λ_{α/2} equals 1.96 for a 95% confidence interval and 2.58 for a 99% confidence interval. If the f_i(·) are not linear, and/or the X_i are not Gaussian, θ̂ will not be normally distributed and it can be arbitrarily difficult to construct a confidence interval for the estimate. However, if the observations are numerous enough, θ̂ will be approximately normally distributed regardless of the distribution of X_i and the class of f_i(·), according to the central limit theorem. Typically, "large enough" is considered to be approximately N = 30, in which case the approximation holds with high accuracy and (1.16) can be taken as an approximate confidence interval for θ̂. Confidence intervals for more difficult situations are discussed
in Sec. 1.4.5.
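A small numerical sketch of (1.14)-(1.16) under the central-limit-theorem approximation (the exponential distribution and the sample size are arbitrary choices made for this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.exponential(scale=2.0, size=N)      # non-Gaussian observations, true mean = 2

# theta_hat = (1/N) * sum x_i is of the form (1.14) with f_i(x) = x / N.
theta_hat = x.mean()
var_theta = x.var(ddof=1) / N               # estimate of (1.15) for the sample mean
lam = 1.96                                  # lambda_{alpha/2} for a 95% interval

lo = theta_hat - lam * np.sqrt(var_theta)
hi = theta_hat + lam * np.sqrt(var_theta)
print(f"estimate {theta_hat:.3f}, approximate 95% CI [{lo:.3f}, {hi:.3f}]")
```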
PDF estimation
Estimating a PDF from a random sample is a somewhat more complicated problem than extracting point estimates since, in that case, the whole PDF p(x) is estimated from the given observations. What is meant by convergence of the estimate is a more difficult question in this case. Typically, convergence refers to convergence in the mean integrated square error (MISE) sense. The MISE, Q, is defined as

$$ Q = \mathrm{E}\Big[ \int (\hat{p}(x) - p(x))^2\, dx \Big]. $$
Then, if Q → 0 as N → ∞, the estimate is said to be MISE consistent.
There are two subclasses of PDF estimators: parametric and non-parametric ones. A parametric estimator assumes that the underlying stochastic variable comes from some family of parametric distributions, G, characterized by the parameters a_1, a_2, ..., a_K. The parameters are then estimated from the sample to give an estimate of the whole PDF. The most commonly occurring parametric estimator is the Gaussian one. It has the mean and covariance as parameters, which are consistently estimated from (1.12), (1.13). Parametric estimators can be shown to have an O(N^{-1}) convergence rate in the MISE sense in the best case [91].

Non-parametric estimators assume nothing about the underlying distribution and are hence more general than parametric estimators. The price paid is a slower convergence rate. Below, three commonly used non-parametric PDF estimators are briefly presented: the histogram, the kernel density estimator, and the orthogonal series density estimator.
The histogram constitutes a piece-wise constant estimator of the PDF. It
is simply created by dividing the domain Ω into bins bk , k = 1, 2, ..., K,
and assigning them value given by the number of samples fk that belong
to each bin. The PDF p(x) is approximated by a constant value hk =
fk /N over each bin. It is a simple but rather primitive estimator that
requires a relatively large sample size N to yield a good approximation.
Kernel estimation
Another commonly used approximation method is the kernel density estimator. A kernel density approximation, see e.g. [105], of p(x) is given by

$$ \hat{p}(\mathbf{x}) = \frac{1}{N |\mathbf{H}|^{1/2}} \sum_{i=1}^{N} \phi\!\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}^{(i)})\right), $$

where φ(·) is a kernel function that is symmetric and integrates to one. The parameter H ∈ R^{n×n} is known as the bandwidth of the kernel; it is symmetric and positive definite and acts as a smoothing parameter. Assume that H = hI. A high value of h will give a smooth estimate, with a low variance but a high bias. Conversely, a low value of h will give a higher variance but a lower bias of the estimate.
[Figure 1.4. A set of 50 weighted particles (gray stems) and the fitted series expansion (black solid line) using the first 7 Hermite functions.]
Consider the one-dimensional case. The value of h is a user parameter, but there are some guidelines for how it should be chosen. It can be shown [96] that the optimal choice of h, in the sense that it minimizes the asymptotic mean integrated square error, is given by

$$ h = \hat{\sigma}\, C(\nu)\, N^{-\frac{1}{2\nu + 1}}, $$

where σ̂ is the sample standard deviation, and C and ν are kernel-specific constants. With this choice of bandwidth, the asymptotic mean integrated square error (AMISE) converges at an O(N^{-4/5}) rate. This is slower than the O(N^{-1}) rate obtained for a parametric estimator. However, under weak assumptions, it has been shown that the kernel estimator is optimal in the sense that there can be no non-parametric estimator that converges faster to the true density [110].
For computational purposes, the kernel density estimator has the
drawback that the approximation requires a large number of terms,
namely N of them.
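As an illustration, a minimal C sketch of the one-dimensional case with a Gaussian kernel is given below; the scalar bandwidth h corresponds to the special case H = h²I, and the sample and the bandwidth are assumed to be supplied by the user.

#include <math.h>

/* Sketch: 1-D kernel density estimate at x with a Gaussian kernel and
 * bandwidth h (standard deviation of the kernel). */
double kde(double x, const double *xs, int N, double h) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double u = (x - xs[i]) / h;
        sum += exp(-0.5 * u * u) / sqrt(2.0 * M_PI);  /* Gaussian kernel phi(u) */
    }
    return sum / (N * h);
}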
Orthogonal series estimator
An alternative to the kernel estimator is the orthogonal series estimator
[104], [96] that has the capability of capturing the shape of p(x) using
far fewer terms than the kernel estimator. Using an orthogonal series
estimator, the estimate is given by

    p̂(x) = Σ_{k=0}^{K} c_k φ_k(x),
where {φ_k} is a set of basis functions, orthogonal on the domain Ψ w.r.t. the weighted inner product

    ⟨φ_k, φ_n⟩ = ∫_Ψ w(x)² φ_k(x) φ_n(x) dx.

If the function p(x) were known, the coefficients would be computed as

    c_k = ∫_Ψ w(x) φ_k(x) p(x) dx.
Noting that this integral equals E[w(X)φ_k(X)], the coefficients can be unbiasedly estimated from the sample according to

    c_k = E[w(X)φ_k(X)] ≈ (1/N) Σ_{i=1}^{N} w(x^{(i)}) φ_k(x^{(i)}),

and the variance of the estimated coefficient is given by

    V[c_k] = V[ (1/N) Σ_{i=1}^{N} w(X_i) φ_k(X_i) ] = (1/N²) Σ_{i=1}^{N} V[w(X_i) φ_k(X_i)].
For the orthogonal series estimator, the number of terms K in the expansion can somewhat loosely be interpreted as a smoothing parameter. A low value of K gives a low variance of the estimate but a large bias, and vice versa for a high value of K. In Fig. 1.4, an illustration of a fitted series expansion is given.
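A minimal C sketch of such a fit is given below, assuming a weighted particle set with normalized weights and the orthonormal Hermite functions (so that w(x) = 1 in the inner product); the number of terms K = 7 mirrors Fig. 1.4 but is otherwise an arbitrary choice.

#include <math.h>

#define K 7   /* number of expansion terms (illustrative) */

/* Orthonormal Hermite functions psi_0..psi_{K-1} at x via the standard
 * three-term recurrence. */
static void hermite_functions(double x, double psi[K]) {
    psi[0] = exp(-0.5 * x * x) / pow(M_PI, 0.25);
    if (K > 1) psi[1] = sqrt(2.0) * x * psi[0];
    for (int k = 2; k < K; k++)
        psi[k] = sqrt(2.0 / k) * x * psi[k - 1] - sqrt((k - 1.0) / k) * psi[k - 2];
}

/* Coefficients c_k ≈ sum_i w_i psi_k(x_i) for a weighted sample. */
void fit_series(const double *x, const double *w, int N, double c[K]) {
    double psi[K];
    for (int k = 0; k < K; k++) c[k] = 0.0;
    for (int i = 0; i < N; i++) {
        hermite_functions(x[i], psi);
        for (int k = 0; k < K; k++) c[k] += w[i] * psi[k];
    }
}

/* Fitted density estimate p_hat(x) = sum_k c_k psi_k(x). */
double eval_series(double x, const double c[K]) {
    double psi[K], p = 0.0;
    hermite_functions(x, psi);
    for (int k = 0; k < K; k++) p += c[k] * psi[k];
    return p;
}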
One issue with the orthogonal series estimator is that the approximation can take on negative values, i.e. p̂(x) < 0 for some values of x, and hence does not fulfill the positivity property in (1.10) required of a PDF. This is often a main reason for not using the method. However, for the approximation purposes encountered in this thesis, it does not pose an obstacle. In recursive Bayesian estimation, estimation of the PDF is typically just an intermediate step towards the actual goal of extracting a point estimate of the state. The point estimate is typically extracted as the mean value or a maximum of p(x), which are not crucially affected by a potential negativity of p̂(x). Consider the scalar case. The mean value is the point at which p(x) would balance if it were a mechanical structure put on a spike. A negative density to the left of μ thus acts as a negative mass that has the same net effect as a mirrored positive mass to the right of μ. It is therefore no more severe to have a negative estimate to the left of μ than to overestimate p(x) to the right of μ, and vice versa; only |p(x) − p̂(x)| is of importance. In other situations, only an unnormalized estimate of p(x) is sought (this is the case in e.g. Chapter 6), in which case p̂(x) = |p̂(x)| can simply be taken as the estimate.
1.4.5 Confidence regions and outliers
Assume that a set of observations, believed to come from the same distribution, is given. An outlier is an observation that deviates so much from the other observations as to raise suspicion of being generated by a different mechanism than the majority of the observations in the set. If the PDF from which the sample comes is known, it can be used to determine whether an observation is unlikely enough to be classified as anomalous. There are several methods for detection of outliers, but most of them are based on a Gaussian assumption on the distribution. A commonly used outlier detection method is to study the Mahalanobis distance

    DM(x) = (x − μ)^T Σ^{−1} (x − μ).   (1.17)

If it exceeds some given threshold, the observation is classified as anomalous. The Mahalanobis distance is a sensible measure of deviation if the distribution is unimodal and radially symmetric, but makes little sense otherwise.
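For concreteness, a minimal C sketch of a Mahalanobis-distance outlier test in two dimensions is given below; the threshold is left as a user choice, and the explicit 2x2 inverse is used only to keep the example self-contained.

/* Squared Mahalanobis distance (1.17) for n = 2, with known mu and Sigma. */
double mahalanobis2(const double x[2], const double mu[2], const double Sigma[2][2]) {
    double d0 = x[0] - mu[0], d1 = x[1] - mu[1];
    double det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0];
    /* Inverse of the 2x2 covariance written out explicitly. */
    double i00 =  Sigma[1][1] / det, i01 = -Sigma[0][1] / det;
    double i10 = -Sigma[1][0] / det, i11 =  Sigma[0][0] / det;
    return d0 * (i00 * d0 + i01 * d1) + d1 * (i10 * d0 + i11 * d1);
}

/* Classify x as an outlier if the distance exceeds a given threshold. */
int is_outlier(const double x[2], const double mu[2],
               const double Sigma[2][2], double threshold) {
    return mahalanobis2(x, mu, Sigma) > threshold;
}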
Here, a more general approach to classifying outliers is suggested. Let V(Ω) := ∫_Ω 1 dx denote the volume of a set. The inlier region at a confidence level α is then defined as the most dense domain Ω such that

    ∫_Ω p(x) dx = 1 − α,   (1.18)

where the density of the domain is defined as

    δ = ( ∫_Ω p(x) dx ) / V(Ω).
This means that, under the null hypothesis that x comes from the distribution G, the probability of getting an observation x ∉ Ω is lower than α. The criterion (1.18) alone leaves an ambiguity in how to select the inlier set Ω. Having an inlier region that does not cover the region where the probability density of getting an observation is highest under the null hypothesis is not reasonable. By demanding that Ω be the most dense region that satisfies (1.18), this is avoided and the domain Ω will also (if p(x) is not constant over a set of nonzero measure) be uniquely defined. This can be motivated by the following argument.
Assume that a threshold value 0 < γ < sup_x p(x) is chosen, and that Ω is the set Ω(γ) = {x | p(x) ≥ γ} (i.e., the most dense region). Then

    h(γ) = ∫_{Ω(γ)} p(x) dx   (1.19)

is a monotonically increasing function of V(Ω), as p(x) > 0 on Ω.
[Figure 1.5. Inlier region for a Gaussian PDF.]
[Figure 1.6. Inlier region according to (1.18) for a bimodal non-symmetric PDF. Axes: x vs. p(x).]
There is thus a one-to-one mapping between h(γ) and V(Ω(γ)), and also between V(Ω(γ)) and Ω(γ). Ω will hence be uniquely defined if h(γ) = 1 − α has a solution. Unfortunately, it may not have a solution: if p(x) = k, where k ∈ R is a constant, over a set of nonzero measure, then h(γ) has a jump discontinuity at γ = k, and h(γ) = 1 − α lacks a solution if the jump discontinuity covers 1 − α. In that case, any Ω that satisfies (1.18) and Ω⁺ ⊆ Ω ⊆ Ω⁻ makes equal sense and can be taken as the inlier region, where Ω⁻ = Ω(lim_{γ→k⁻} γ) and Ω⁺ = Ω(lim_{γ→k⁺} γ).
In the case of a distribution with a symmetric unimodal PDF, the inlier regions described by (1.17) and by (1.18) coincide, but for multimodal skew distributions they can differ significantly. In Fig. 1.5 and Fig. 1.6, the inlier regions for a unimodal symmetric and a non-symmetric bimodal distribution are shown, respectively.

When using this reasoning for detecting outliers, the p-value q for an observation x0 (the probability of getting an observation at least as extreme as x0) is given by

    q(x0) = 1 − h(p(x0)).   (1.20)
1.5 Recursive Bayesian estimation
Statistical estimation deals with the problem of estimating parameters based on empirical data containing some random component. A sub-field of this is parameter and state estimation in dynamical systems. In recursive statistical estimation, the estimate is updated iteratively as new evidence about the unobserved quantity is acquired. Being the problem underlying all optimal stochastic filtering methods, the recursive Bayesian estimation problem is briefly reviewed in this section. A thorough exposition of the subject can be found in some of the classical textbooks, e.g. [6], [70], [108], [99].
Consider a stochastic process {xt, t = 0, 1, ...} described by the state-space model

    xt+1 = ft(xt, vt),   (1.21)
    yt = ht(xt, et),   (1.22)

with the state xt ∈ R^n and output yt ∈ R^p. The sequences vt ∈ R^{nv} and et ∈ R^{ne} are zero-mean, mutually independent white noise processes characterized by distributions with known PDFs p(vt) and p(et), respectively, where t is discrete time. The functions ft(·) and ht(·) are arbitrary but known vector-valued functions. The aim is to provide an estimate x̂t of the state xt, given the measurements Yt = {y0, y1, ..., yt}.
1.5.1 Optimal estimation
An optimal estimator for the system (1.21)-(1.22) is an estimator that gives an estimate x̂0:t of the random vector x0:t, given observations of the correlated random variable Yt = {y0, y1, ..., yt}, that is optimal in some sense. Consider the matrix-valued criterion

    Q(l(Yt)) = E[(x0:t − l(Yt))(x0:t − l(Yt))^T].   (1.23)

It can be shown that the function l(·) that minimizes any scalar-valued monotonically increasing function of Q is the conditional mean [99], i.e.

    l(Yt) = E[x0:t | Yt].   (1.24)
Examples of scalar-valued monotonically increasing functions of Q are det(Q) and tr(WQ), where W is a real positive-definite weighting matrix. With W taken as the identity matrix, this shows that (1.24) is the optimal estimator in the sense that it minimizes the mean square error of the estimate. For dynamical systems, it is fairly complicated to compute the conditional mean in the general case. One case when the problem can be solved in closed form is linear systems with Gaussian noise, for which the Kalman filter gives the solution. For other system structures there is no closed-form solution to the problem.
To find the (minimum mean square error) estimate E[xt | Yt] of the state at time step t, the PDF p(x0:t | Yt) must then be computed and marginalized over x0:t−1, i.e.

    p(xt | Yt) = ∫ p(x0:t | Yt) dx0:t−1,   (1.25)

and the estimate extracted as

    E[xt | Yt] = ∫ xt p(xt | Yt) dxt.   (1.26)
In general, constructing the mathematical object p(x0:t | Yt) is extremely complex and requires a tremendous amount of computation even for moderate values of t. There are, however, simplifying properties of the model structure that provide a way of computing the marginal distribution p(xt | Yt) recursively, namely that

    p(xt | xt−1, xt−2, ..., x0) = p(xt | xt−1),   (1.27)
    p(yt | xt, Yt−1) = p(yt | xt).   (1.28)

The first equality follows from the Markov property of the state and the whiteness of vt. The second equality follows straightforwardly from (1.22), as it is free from dynamics and et is white. These properties simplify the problem significantly and give means of computing p(xt | Yt) recursively, via the prediction-update recursion.
1.5.2 The prediction-update recursion
Assume that p(xt−1 | Yt−1) is known and that a new measurement yt is obtained. Exploiting the Markov property (1.27) of the system, the predicted PDF p(xt | Yt−1) is obtained from the Chapman-Kolmogorov equation

    p(xt | Yt−1) = ∫_{R^n} p(xt | xt−1) p(xt−1 | Yt−1) dxt−1.   (1.29)

Using (1.28) and applying Bayes' rule, the updated PDF p(xt | Yt) is then found, using the new evidence yt, as

    p(xt | Yt) = p(yt | xt) p(xt | Yt−1) / p(yt | Yt−1).   (1.30)

Hence, given an initial PDF p(x0 | y0) = p(x0) for the initial state x0, the PDF p(xt | Yt) can be computed by applying the prediction and update steps in (1.29), (1.30) recursively to the measurements Yt as they arrive.
[Figure 1.7. An illustration of how the PDF p(xt | Yt) evolves over time, t. Axes: x, k, and p(xk | y1:k).]
Fig. 1.7 shows an example of how the PDF evolves over time.
The PDF p(xt | Yt) gives complete information about the random variable xt, and any required statistical information can be extracted from it. Typically, a point estimate x̂t of the state is to be provided. As discussed in Sec. 1.5.1, the conditional mean

    x̂t = E[xt | Yt]

is the optimal estimator in the sense that it minimizes the variance of the estimation error. However, depending on the intended use of the point estimate, it can be motivated to employ another estimator. For instance, the maximum likelihood estimate

    x̂t = arg sup_{xt} p(xt | Yt)

is a commonly used point estimator. In the recursive Bayesian estimation framework, p(xt | Yt) is calculated by iterating the prediction and update steps.
The PDFs p(xt | xt−1) and p(yt | xt) appearing in (1.29), (1.30) are implicitly given by the state-space model (1.21), (1.22). Via the generalized convolution formula, p(xt | xt−1) is found as

    p(xt | xt−1) = ∫_{R^n} p(xt | xt−1, vt−1) p(vt−1) dvt−1   (1.31)
                 = ∫_{R^n} δ(xt − ft−1(xt−1, vt−1)) p(vt−1) dvt−1.   (1.32)
[Figure 1.8. Representation of a 2-dimensional PDF: (a) Kalman filter, (b) particle filter, (c) grid-based filter.]
An important special case arises when the process noise is additive, i.e. (1.21) takes the form

    xt+1 = ft(xt) + vt,   (1.33)

in which case (1.32) evaluates to

    p(xt | xt−1) = p_{vt−1}(xt − ft−1(xt−1)).   (1.34)

In the same way, it follows that p(yt | xt) is given by

    p(yt | xt) = ∫_{R^p} δ(yt − ht(xt, et)) p(et) det,

which in the case of additive measurement noise becomes

    p(yt | xt) = p_{et}(yt − ht(xt)).   (1.35)
In general, closed-form expressions cannot be obtained for (1.29)-(1.30). As mentioned, a special case arises when ft and ht are linear and vt and et are Gaussian with zero mean, in which case the solution is given by the Kalman filter. For non-Gaussian noise, it can be shown that the Kalman filter is still the best unbiased linear estimator. However, to solve the estimation problem optimally when the system is nonlinear/non-Gaussian, approximation methods have to be used, of which Monte Carlo methods and grid-based methods are commonly used examples. The Kalman filter describes the PDF by a Gaussian function, the Monte Carlo methods provide a sample from the distribution, and the grid-based methods approximate the PDF over a discrete set of grid points. Fig. 1.8 illustrates how the above methods represent the information about the sought PDF, using a 2-dimensional example.

In the following subsections, some of the solution methods to the RBE problem are given.
1.6 Solution methods for the optimal filtering problem
1.6.1 Kalman filter
If the process and measurement noises are white Gaussian sequences and ft, ht are linear functions, (1.21)-(1.22) can be written as

    xt+1 = Ft xt + vt,   (1.36)
    yt = Ht xt + et,   (1.37)

where Ft ∈ R^{n×n} and Ht ∈ R^{p×n}. Under these assumptions, it can be shown that the prior and posterior distributions are Gaussian and hence completely characterized by the mean μ and covariance Σ. The closed-form solution of (1.29), (1.30) that propagates the mean μ (coinciding with the estimated state x̂) and the estimation error covariance

    Pt|t = E([xt − x̂t|t][xt − x̂t|t]^T)

is the Kalman filter [53], for which the prediction and update steps can be formulated as follows.

Prediction:

    x̂t|t−1 = Ft−1 x̂t−1|t−1,   (1.38)
    Pt|t−1 = Ft−1 Pt−1|t−1 F^T_{t−1} + Qt−1.   (1.39)

Update:

    Kt = Pt|t−1 H^T_t (Ht Pt|t−1 H^T_t + Rt)^{−1},   (1.40)
    x̂t|t = x̂t|t−1 + Kt (yt − Ht x̂t|t−1),   (1.41)
    Pt|t = (I − Kt Ht) Pt|t−1,   (1.42)

where Qt = E[vt vt^T] and Rt = E[et et^T]. There is a huge literature devoted to linear estimation and the Kalman filter, with some references given by [52], [99], [10], [34].
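As an illustration of (1.38)-(1.42), a minimal C sketch of one iteration for a single-output system (p = 1) is given below; the innovation covariance is then a scalar and no matrix inversion is needed. The state dimension N = 2 and the model matrices are illustrative assumptions.

#define N 2

/* One Kalman filter iteration for a single-output system. */
void kf_step(double x[N], double P[N][N],
             const double F[N][N], const double H[N],
             const double Q[N][N], double R, double y) {
    double xp[N] = {0}, Pp[N][N] = {{0}}, FP[N][N] = {{0}};
    /* Prediction: xp = F x,  Pp = F P F^T + Q */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            xp[i] += F[i][j] * x[j];
            for (int k = 0; k < N; k++) FP[i][j] += F[i][k] * P[k][j];
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) Pp[i][j] += FP[i][k] * F[j][k];
            Pp[i][j] += Q[i][j];
        }
    /* Update: S = H Pp H^T + R (scalar), K = Pp H^T / S */
    double PH[N] = {0}, S = R, K[N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) PH[i] += Pp[i][j] * H[j];
        S += H[i] * PH[i];
    }
    double innov = y;
    for (int j = 0; j < N; j++) innov -= H[j] * xp[j];
    for (int i = 0; i < N; i++) {
        K[i] = PH[i] / S;
        x[i] = xp[i] + K[i] * innov;
    }
    /* P = (I - K H) Pp, using the symmetry of Pp */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            P[i][j] = Pp[i][j] - K[i] * PH[j];
}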
1.6.2 Static Kalman filter
Assume that the system in (1.36)-(1.37) is time invariant. Under the
mild assumptions of R being positive definite and (F, B) being a stabilizable pair, where Q = BBT , the gain Kt and the error covariance
Pt|t−1 = E([x̂t|t−1 − xt ][x̂t|t−1 − xt ]T )
will converge to the constants K and P respectively, as t → ∞ [99].
The filter gain can then be treated as a static one, with P given by the
solution to the algebraic Riccatti equation
P = FPFT + Q − FPHT (HPHT + R)−1 HPFT ,
27
and K calculated as
K = FPH(HPHT + R)−1 .
The static Kalman filter is then given from
x̂t+1|t = Fx̂t|t−1 + K(yt − Hx̂t|t−1 ).
(1.43)
1.6.3 Extended Kalman filter
A generalization of the KF that applies to nonlinear systems is the extended Kalman filter (EKF) [99], [34]. It is a suboptimal solution to the
recursive estimation problem based on a linearization of the measurement and system equations around estimates of the state.
Assume that the nonlinear functions ft and ht are differentiable. At
each time step they are approximated by a first order Taylor expansion,
i.e.
ft (xt ) ≈ ft (x̂t|t ) + Ft (xt − x̂t|t ),
ht (xt ) ≈ ht (x̂t|t−1 ) + Ht (xt − x̂t|t−1 ),
where

    Ft = ∂ft(x)/∂x |_{x = x̂t−1|t−1},   (1.44)
    Ht = ∂ht(x)/∂x |_{x = x̂t|t−1}.   (1.45)
The filtering is then performed by applying the standard KF equations
(1.38)-(1.42) to the linearized system. If the nonlinearities are severe,
the linearization can be a poor approximation of the system, which can
in the worst case lead to divergence of the filter.
1.6.4 Unscented Kalman filter
Another filtering method that applies to nonlinear systems, and can be shown to be more robust against nonlinearities, is the unscented Kalman filter (UKF) [50]. One iteration of the method can be summarized as follows.

Assume that xt−1|t−1 has mean μt−1|t−1 and covariance Σt−1|t−1, and define the augmented state and covariance as

    x^a_{t−1|t−1} = [ μ^T_{t−1|t−1}  E[v_t^T] ]^T,
    Σ^a_{t−1|t−1} = [ Σ_{t−1|t−1}  0 ; 0  Q_t ].
The UKF picks a deterministic set of sigma points around the mean, which are then propagated and updated to obtain an approximation of the posterior distribution. In the prediction step, the set of weighted sigma points S_{t−1|t−1} = {χ^{(i)}_{t−1|t−1}, w^{(i)}_{t−1|t−1}}_{i=1}^{N} is chosen as

    χ^{(0)}_{t−1|t−1} = x^a_{t−1|t−1},   (1.46)
    χ^{(i)}_{t−1|t−1} = x^a_{t−1|t−1} + (√(n Σ^a_{t−1|t−1}))_i,   i = 1, ..., n,   (1.47)
    χ^{(i)}_{t−1|t−1} = x^a_{t−1|t−1} − (√(n Σ^a_{t−1|t−1}))_{i−n},   i = n + 1, ..., 2n,   (1.48)

where (√(n Σ_{t−1}))_i denotes the i-th row of the Cholesky factorization of n Σ_{t−1}, and the weights are given by

    w_s^{(0)} = λ / (L + λ),   (1.49)
    w_c^{(0)} = λ / (L + λ) + (1 − α² + β),   (1.50)
    w_s^{(i)} = w_c^{(i)} = 1 / (2(L + λ)),   (1.51)-(1.52)

where λ = α²(L + κ) − L. The constants α, β and κ are user parameters used to control the spread of the sigma points.
The sigma points are then propagated through the state transition equation, i.e.

    χ^{(i)}_{t|t−1} = f_{t−1}(χ^{(i)}_{t−1|t−1}),   i = 0, 1, ..., 2n,

which yields the predicted state and covariance as

    x̂_{t|t−1} = Σ_{i=0}^{2L} w_s^{(i)} χ^{(i)}_{t|t−1},
    P_{t|t−1} = Σ_{i=0}^{2L} w_c^{(i)} [χ^{(i)}_{t|t−1} − x̂_{t|t−1}][χ^{(i)}_{t|t−1} − x̂_{t|t−1}]^T.
In the update step, a procedure analogous to the prediction step is carried out, but the state and covariance are augmented with E[e_t^T] and R_t, respectively, and the sigma points are propagated through the measurement equation, yielding measurement sigma points γ^{(i)}_{t|t}, the predicted measurement ŷ_t and its covariance P_{yy}. The state estimate and the error covariance are then updated by

    P_{yx} = Σ_{i=0}^{2L} w_c^{(i)} [χ^{(i)}_{t|t−1} − x̂_{t|t−1}][γ^{(i)}_{t|t} − ŷ_t]^T,   (1.53)
    K_t = P_{yx} P_{yy}^{−1},
    x̂_{t|t} = x̂_{t|t−1} + K_t (y_t − ŷ_t),
    P_{t|t} = P_{t|t−1} − K_t P_{yy} K_t^T.   (1.57)
1.6.5 Monte-Carlo methods
A simple and powerful, though computationally costly, method to perform filtering via (1.29), (1.30) is by means of Monte-Carlo simulation [71], [21]. The Monte-Carlo based framework can handle nonlinear systems with general noise distributions. The method provides a way of obtaining a weighted sample St = {x_t^{(i)}, w_t^{(i)}}_{i=1}^{N} from the distribution with PDF p(xt | Yt), from which the desired information about the random variable xt can be extracted. Assume that at time step t − 1, St−1 = {x_{t−1}^{(i)}, w_{t−1}^{(i)}}_{i=1}^{N} constitutes a weighted sample from the distribution with PDF p(xt−1 | Yt−1), where x_{t−1}^{(i)} is the i-th observation, called a particle, with associated weight w_{t−1}^{(i)} ≥ 0. Given St−1, a sample from p(xt | Yt−1) is obtained by propagating each particle through the system equation (1.21), i.e.

    x_t^{(i)} = f_{t−1}(x_{t−1}^{(i)}, v_{t−1}^{(i)}),   i = 1, ..., N,   (1.58)

where v_{t−1}^{(i)} is a draw from the distribution with PDF p(vt−1). The measurement yt is then used to update the weights by

    w_t^{(i)} = w_{t−1}^{(i)} p(yt | x_t^{(i)}),   i = 1, ..., N.   (1.59)

This yields the particle set St = {x_t^{(i)}, w_t^{(i)}}_{i=1}^{N} at time step t. By iterating (1.58) and (1.59), a sample from p(xt | Yt) is thus recursively obtained. The recursion is initialized by making N draws from an initial distribution with PDF p(x0).
It can be shown that, as formulated so far, the variance of the weights in the particle set can only increase over time, with the consequence that the weights of all particles except one will approach zero as t → ∞ [22].
Figure 1.9. The evolution of a set of particles. The left part of the figure shows the particles as dots, with their weights represented by the dot sizes. The right part of the figure shows the discrete weighted estimate p̂(xt | Yt) of p(xt | Yt) given by the particles. Steps (1), (2), (3) and (4) show the initial set St−1|t−1, the propagated set St|t−1, the updated set St|t, and the resampled (and one step propagated) set, respectively.
When this happens, the filtering has broken down into a pure simulation of the system. To remedy this problem and concentrate the particles in the domain of interest, i.e. where the density of p(xt | Yt) is high, resampling can be performed. In the resampling step, a new set of particles is created by sampling from p̂(xt | Yt), and it replaces the old particle set St. Bootstrapping is a common approach, where the new set is created by making N draws with replacement from the old particle set, such that each new particle equals the old particle x_t^{(j)} with probability w_t^{(j)}, and the new weights are set to 1/N. An illustration of how the particle set evolves during the prediction, update and resampling steps is given in Fig. 1.9.
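A minimal C sketch of one such prediction-update-resampling iteration for a scalar state is given below; the model function f, the likelihood p(y | x), and the noise generator are placeholders assumed to be supplied by the application, and the simple multinomial resampling shown here is chosen for clarity rather than efficiency.

#include <stdlib.h>

extern double f(double x, double v);            /* state transition (1.21) */
extern double likelihood(double y, double x);   /* p(y | x) */
extern double draw_process_noise(void);         /* draw from p(v) */

void pf_step(double *x, double *w, int N, double y) {
    /* Prediction (1.58) and weight update (1.59). */
    double wsum = 0.0;
    for (int i = 0; i < N; i++) {
        x[i] = f(x[i], draw_process_noise());
        w[i] *= likelihood(y, x[i]);
        wsum += w[i];
    }
    for (int i = 0; i < N; i++) w[i] /= wsum;   /* normalize weights */

    /* Bootstrap resampling: N draws with replacement, proportional to weights. */
    double *xnew = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) {
        double u = (double)rand() / RAND_MAX, c = 0.0;
        int j = 0;
        while (j < N - 1 && c + w[j] < u) c += w[j++];
        xnew[i] = x[j];
    }
    for (int i = 0; i < N; i++) { x[i] = xnew[i]; w[i] = 1.0 / N; }
    free(xnew);
}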
1.6.6 Grid-based methods
Grid-based methods, see e.g. [15], solve the recursive Bayesian estimation problem by giving an approximate solution over a discrete set of grid points. The way in which the PDF is approximated over discrete bins makes them closely related to the histogram estimator.

The involved PDFs are approximated by point masses over a discrete set of grid points {x_t^{(i)}}_{i=1}^{N}, x_t^{(i)} ∈ R^n, with associated weights {w_{t|t}^{(i)}}_{i=1}^{N}. The PDF p(xt−1 | Yt−1) is then approximated as

    p(xt−1 | Yt−1) ≈ Σ_{i=1}^{N} w_{t−1|t−1}^{(i)} δ(xt−1 − x_{t−1}^{(i)}),   (1.60)
where the approximation sign should be interpreted as meaning that the weighted set of point masses carries approximately the same statistical information about the state, such as e.g. the mean and variance, as the true PDF. The i-th weight is propagated via the prediction and update equations as

    w_{t|t−1}^{(i)} = Σ_{j=1}^{N} w_{t−1|t−1}^{(j)} p(x_t^{(i)} | x_{t−1}^{(j)}),   (1.61)
    w_{t|t}^{(i)} = w_{t|t−1}^{(i)} p(yt | x_t^{(i)}),   (1.62)

and the predicted PDF p(xt | Yt−1) and updated PDF p(xt | Yt) are approximated by

    p(xt | Yt−1) ≈ Σ_{i=1}^{N} w_{t|t−1}^{(i)} δ(xt − x_t^{(i)}),
    p(xt | Yt) ≈ Σ_{i=1}^{N} w_{t|t}^{(i)} δ(xt − x_t^{(i)}).
A problem with grid-based methods is the large computational burden associated with them. To achieve satisfactory accuracy, a large number of grid points must be used. As the number of grid points grows exponentially with the dimension of the problem, their usability is confined to low-dimensional problems.
1.6.7 Computational complexity
The computational complexity of the optimal solution to the recursive Bayesian estimation problem is a major obstacle in real-time applications as well as in high-dimensional problems. In the linear Gaussian case (KF), the computational complexity grows as O(n³), where n is the dimension of the state space. However, for non-linear, non-Gaussian filtering methods, the computational complexity typically grows as O(N^n), i.e. it increases exponentially with the dimension. This is often referred to as the "curse of dimensionality". Consider for instance a grid-based method. If one dimension requires N grid points, then a two-dimensional approximation requires N² grid points, and an n-dimensional problem requires N^n grid points. The curse of dimensionality poses a severe problem, limiting the applicability of the methods to relatively low-dimensional cases. Basically all non-parametric estimators suffer more or less from this problem, but the factor N in the O(N^n) complexity can vary significantly among different filtering methods, which is of high importance in practice.
1.7 High-performance computing
Since one of the main points in using parallel hardware is to achieve
faster execution times, it is not only important that the computations
are made in parallel, but also that the program is optimized w.r.t. the
execution time.
Today's compilers can automatically perform significant optimizations of the code. However, to achieve high performance, the programmer must still invest effort in optimizing the code, giving the compiler good ground to work on. Some of these optimizations are discussed here and can be found in [62], [31].
1.7.1 Efficient memory handling
One of the most important aspects in achieving fast execution is to handle the memory accesses efficiently. A program, where memory access
has been optimized, can potentially execute several orders of magnitude
faster than an unoptimized program. Two important data access properties exhibited by many programs are spatial and temporal locality.
This is something that the cache memory utilizes, at a hardware level,
in the following manner.
• Spatial locality: If an element at a specific memory address is
accessed, it is likely that data elements at nearby addresses will
be accessed soon. Therefore, neighboring data elements to the one
that are being accessed now will also be brought to the cache.
• Temporal locality: If an element is being accessed now, it is likely
that it will be used soon again. Therefore, the cache will keep the
most recently used data.
Thus, when a data element is brought to the cache, not only that particular element, but also the neighboring data elements will be brought
to the cache. How many elements are brought in depends on the cache
line size.
As it is time consuming to move data in memory, it benefits performance greatly if the code is written in such a manner that the data movement from the main memory to the CPUs is minimized. When a
data element is brought to the cache, it is hence desirable to use the
element to accomplish as many calculations as possible that the element
is involved in, before it is thrown out of the cache. Cache re-use can be
an even more critical issue on multicore processors than on single core
processors due to their larger computational power and more complex
memory hierarchies.
Algorithm 1
N = 10000;
A = randn(N);
B = randn(N);
for j = 1:N
    for i = 1:N
        A(i,j) = A(i,j) + B(i,j);
    end
end
As a simple example of the importance of good memory handling, consider the following case. Alg. 1 and Alg. 2 implement a simple matrix-matrix addition. In Alg. 1, the matrices are read column-wise, while in Alg. 2 they are read row-wise. The code in Alg. 1 executes in about 1.2 seconds, while the code in Alg. 2 executes in about 5.8 seconds on the author's PC. The code performs exactly the same work, but the first version executes about 5 times faster than the second one.¹ The reason for this is that Matlab implements column-major ordering for arrays, which means that when a matrix is stored in memory, elements in neighboring rows within a column are stored at neighboring memory addresses. For instance,
        ⎡ 1  4  7 ⎤
    A = ⎢ 2  5  8 ⎥
        ⎣ 3  6  9 ⎦
will be stored in memory as

    address:  a1  a2  a3  a4  a5  a6  a7  a8  a9
    value:     1   2   3   4   5   6   7   8   9
where ai is the i-th memory address. Since the cache fetches several neighboring data elements when reading a value, the next value will already be available in the cache when requested by the processor if the matrix is read column-wise. However, if it is read row-wise, the next element will not already be in the cache and has to be brought from a level further up in the memory hierarchy, which slows down the execution, as the processor has to idle while waiting for the requested data to be brought to the registers.
¹ Note that this is only an example; in the Matlab language, the code should not be written in either of the two ways but simply as A = A + B; to utilize Matlab's pre-compiled libraries for vector operations.
Algorithm 2
N = 10000;
A = randn(N);
B = randn(N);
for i = 1:N
    for j = 1:N
        A(i,j) = A(i,j) + B(i,j);
    end
end
1.7.2 Hardware mechanisms for efficient code execution
Further considerations when designing high-performance software are hardware-specific optimizations. A programmer aware of such mechanisms can utilize them to improve the performance substantially; some examples are given below.
For instance, a pre-fetcher is a hardware mechanism that tries to predict which data will be used in the near future and fetches this data into the cache so that it is readily available when requested by the processor. The pre-fetcher monitors which memory locations are currently being accessed by the processor, predicts which memory locations will be needed in the future, and issues pre-fetches to those memory locations. Predicting future accesses is of course a very difficult problem. But if the programmer is aware of this and, as far as possible, accesses the memory in regular patterns, the pre-fetcher can make better predictions and improve the performance of the execution.
As another example, many CPUs implement a hardware pipeline to
increase the throughput of the processor. In the pipeline the stages of
code execution: instruction fetch, instruction decode and register fetch,
execute, memory access and register write back, are implemented in
a series of elements, where the output of one element is the input of
the next one. This construction is to allow overlapping execution of
multiple instructions with the same circuitry. To be able to fill the
pipeline the processor must know which code is to be executed next.
Therefore branching, such as "if" statements, can have a negative impact on the execution time, since the processor does not know beforehand which code will be executed next and thus cannot fill the pipeline appropriately. There are several other situations that can cause trouble in the pipeline, known as "hazards", that should be taken into consideration when optimizing the code.
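As a small, hedged illustration, the sketch below removes a data-dependent branch from an inner loop by replacing the "if" statement with arithmetic on the comparison result; whether this actually pays off depends on the compiler and the architecture.

/* Both functions sum the positive elements of x. */
double sum_positive_branch(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0) s += x[i];
    return s;
}

double sum_positive_branchless(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * (x[i] > 0.0);   /* comparison evaluates to 0 or 1 */
    return s;
}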
1.7.3 Some further examples
Loop unrolling, loop fusion, hoisting, avoiding branching, and avoiding redundant computations are some other techniques for improving the execution time.
In loop unrolling, a loop is written as a long sequence of operations rather than in a typical loop format, such as a "for" loop. In this way, the overhead caused by the loop construction, such as counter increments and checking of termination conditions, is avoided, which can provide a substantial gain in execution time if the actual work performed in each iteration of the loop is small. Loop unrolling, though, has the drawback that the program size increases, which can be undesirable in some situations.

In loop fusion, several loops are merged, if possible. In this way, the loop overhead is minimized, which favors the execution time. Also, by merging loops it is often possible to get a better memory access pattern (which is of great importance, as previously discussed).

Hoisting refers to the avoidance of repeated operations, such as e.g. de-referencing a memory address inside a loop.

Avoiding redundant computations is an often overlooked opportunity to improve the efficiency of the code. Computing a required value once and storing it for reuse can speed up a program significantly.
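The following C sketch illustrates loop fusion and hoisting on a made-up example: the fused version traverses the data once and computes the loop-invariant quantity only once. The arrays and sizes are illustrative.

void separate(double *a, double *b, int n, double scale) {
    for (int i = 0; i < n; i++) a[i] = a[i] * (scale / n);  /* scale/n recomputed */
    for (int i = 0; i < n; i++) b[i] = b[i] + a[i];
}

void fused(double *a, double *b, int n, double scale) {
    double s = scale / n;              /* hoisted loop-invariant computation */
    for (int i = 0; i < n; i++) {      /* fused loop: one pass over the data */
        a[i] *= s;
        b[i] += a[i];
    }
}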
1.7.4 Software libraries
For routine operations, there are often highly optimized software libraries available. As linear algebra operations are commonly encountered in high-performance computing, optimized libraries such as BLAS (Basic Linear Algebra Subprograms, www.netlib.org/blas/) have been developed for these types of operations. These libraries are extremely efficiently implemented, utilizing hardware-specific optimizations for particular architectures. The set of routines is, though, limited and covers only the more basic operations.
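As an illustration, a matrix-vector product can be delegated to BLAS through the CBLAS interface as sketched below, assuming a CBLAS implementation is available and linked (e.g. -lopenblas); the dimensions and storage order are illustrative.

#include <cblas.h>

/* y := 1.0 * A * x + 0.0 * y, with A stored in row-major order. */
void matvec(const double *A, const double *x, double *y, int n) {
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);
}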
1.8 Multicore architecture
1.8.1 Evolution of the multicore processor
A multicore processor is a single computing component with two or more
independent actual processors, called cores or central processing units
(CPUs), which are the units that read and execute program instructions.
The material in this section is based on the references [106], [67], [33],
[107], [66].
For decades, it was possible to improve the performance of the single-core CPU by shrinking the area of the integrated circuit and increasing the clock rate at which it operated. In the early 2000s, the rate of increase of the computational power of the single-core processor began to stall, mainly due to three major bottlenecks:
• The memory wall: Over decades, processor speeds have increased at far faster rates than memory speeds. As the memory system cannot deliver data fast enough to keep the processor busy, the memory has become a bottleneck to performance improvement of the sequential processor.
• The power wall: The power consumption of a processor grows steeply with the operating frequency. Hence, due to both power and heat dissipation concerns, it is not possible to keep improving the performance of a single-core processor by increasing the operating frequency.
• The ILP wall: The increasing difficulty of finding enough instruction-level parallelism (ILP) in a single instruction stream to keep a high-performance single-core processor busy.
In the pursuit of improving the computational capacity of a system, more focus was put on parallel architectures, which have since evolved at an increasing rate and become a standard piece of hardware. Any PC, and many mobile phones, bought today will likely have two or more processors on a single integrated circuit.

The purpose of fitting several cores on the same chip is mainly to improve the computational capacity of the system. Another aspect, important to low-power systems such as battery-driven mobile devices and low-power communication systems, is the lower power-per-FLOP ratio provided by parallel hardware. Several cores on the same chip generally consume less power than the same number of cores located on different chips.
1.8.2 Parallel architectures
There is a plethora of parallel processing architectures, with examples such as shared-memory multicores (SMCs), graphical processing units (GPUs), and computer clusters. Roughly speaking, all parallel architectures can be modeled as M processing units with some kind of interconnection, each having a private memory and possibly connected to a shared memory. What differs between the architectures are the sizes of the shared and private memories, the interconnection topology, and the bandwidth of the interconnection. How well a parallelization executes depends very much on how well it maps to the particular architecture used.
In a computer cluster, processors located on different chips are interconnected over a communication network. The bandwidth of the network is typically relatively low and there is no shared memory, but the private memory is relatively large. A graphics processing unit (GPU) is, as its name suggests, mainly constructed to process computer graphics. It is suitable when the same operation is to be performed on several independent data streams. A GPU architecture has a large memory bandwidth, no private memory, and a medium-sized shared memory. A GPU can have hundreds of processors, but it is only suitable for the relatively narrow class of problems that can provide the amount of fine-grained parallelism required for efficient operation.
This thesis is mainly concerned with the shared-memory multicore (SMC) architecture, which is a flexible architecture suited for embedded real-time applications. The term multicore refers to a processor where several CPUs are manufactured on the same integrated circuit die. Fig. 1.10 shows a simplified picture of an SMC. The CPUs are connected to a shared memory (the RAM) via a shared bus. In addition to the shared memory, each processor has a private memory, the cache, to which only that particular CPU has access. In general, most SMCs have several layers of cache, where some levels of the cache are shared among two or more cores. However, to understand the concept and reasoning of a multicore implementation, it often suffices to envisage the simplified description with a single cache per processor. The CPUs can operate independently of each other, and the interprocessor communication is accomplished through reads and writes to the shared memory.
Figure 1.10. A simplified picture of a shared memory multicore architecture.
1.9 Parallel implementation
1.9.1 Parallel implementation
The performance improvement gained by the use of a multicore processor depends very much on the employed software algorithms and their
implementation. Ideally, an implementation may realize speedup factors near the number of cores used. Most applications, however, are not
accelerated so much unless programmers invest an amount of effort in
re-factoring the whole problem. To design well-performing software that
scales well and executes fast, it is important to understand the basics of
the architecture on which the software is intended to execute.
Roughly speaking, when designing a parallel implementation, one seeks to determine a number of tasks that can each execute as large a portion of the work as possible with a minimal amount of interaction. It is, though, important to remember that parallelization is not a goal in itself, but merely a way of improving the execution time or lowering the power consumption of the system. An implementation that runs perfectly in parallel but slower than the sequential version is typically of no interest. An exception is when dealing with low-power applications: parallel processors in general have a lower power-per-FLOP ratio, and it can thus be motivated in that case to accept a slower parallel execution than the sequential one.

Parallelization can provide many benefits for an implementation. The programming and debugging of a parallel program can, though, be much more challenging than for a sequential program, and constructing parallel code whose correctness is ensured can be a difficult task.
Some examples of problems that a parallel programmer must deal with
that are not present in sequential programming are:
Parallel overhead The amount of time required to coordinate parallel threads, as opposed to doing useful work. Parallel overhead
can include factors such as thread start-up time, synchronization,
software overhead imposed by parallel compilers, libraries, tools,
operating system, thread termination time, etc.
Load balancing For an implementation to execute efficiently, the workload of the processing units should be as balanced as possible, i.e.
each processor should have an equal amount of computations to
perform. With an unbalanced workload, one or more processors
will be idle waiting for the more loaded processors to finish, and
thereby wasting computational capacity of the system.
Cache coherency On a multicore computer, several processors can
have the same piece of data in their private caches. If one processor modifies that data, the other processors must be notified about
this to get a consistent view of the memory. How this scheme is
adopted is architecture-dependent.
Synchronization At some points in the execution, two or more of the
processing units must be synchronized, i.e. they must wait at some
point to make sure that the other processors have reached a certain
point in the execution stream.
Communication In order to complete a task, processors must communicate with each other. How and when to communicate must be
specified by the programmer.
Race conditions If two or more processors are accessing the same piece
of data, the outcome of the program can be inconsistent, depending
on in which order the processors happened to read and modify the
shared data. To prevent this, mutual exclusion mechanisms must
be used to ensure correct results.
For a more extensive exposition of parallel programming issues see e.g.
[106], [107], [66].
When designing a parallel program, the procedure can be divided into three stages, see e.g. [68]:
• Partitioning: Opportunities for parallel execution are exposed. A fine-grained decomposition of the problem is created, in which a large number of tasks that can be executed concurrently are identified.
• Communication: The communication required among the fine-grained tasks identified in the partitioning stage is explored.
• Agglomeration: It is determined how to agglomerate the tasks identified in the partitioning phase, so as to provide a smaller number of tasks that can execute concurrently with a small amount of interaction. It is also determined whether data and/or computations should be replicated in order to minimize the interaction between tasks.
Automatic parallelization has been a research topic for several decades. Yet, fully automatic parallelization of sequential programs by compilers still remains a challenge, due to the need for complex program analysis and the dependence on factors unknown at compilation time, such as the input data range. In most cases, to parallelize anything but so-called "embarrassingly parallel" algorithms, insight into and understanding of the underlying algorithm are required.
1.9.2 Software
There are many different languages available for multicore programming,
with such examples as OpenMP, Pthreads, Cilk++, OpenHMPP, FastFlow, Skandium, and MPI.
OpenMP (http://openmp.org) has been the choice for all algorithms developed in this thesis, because of the algorithms' suitable mapping to the fork-join model adopted by OpenMP. OpenMP is a collection of directives, library routines, and environment variables that may be used to parallelize Fortran, C, and C++ programs for execution on shared-memory platforms. A master thread running sequentially forks a specified number of slave threads, with tasks divided among them. The slave threads then run in parallel, and the runtime environment allocates the threads to different processing units. After the execution of the parallelized code, the threads join back into the master thread, which continues onward to the end of the program. See Fig. 1.11 for an illustration of the flow of execution using OpenMP. Both task parallelism and data parallelism can be achieved using OpenMP in this way. When the work partitioning is regular, OpenMP is a suitable, simple, and convenient choice for the programmer. For more irregular partitionings, it is not necessarily a good option; in that case, other alternatives such as e.g. Pthreads can be more efficient to use.
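A minimal sketch of the fork-join model in OpenMP is given below; the loop body is arbitrary and only serves to illustrate the parallel region, the implicit join, and a reduction. Compile with e.g. gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 1000000;
    static double x[1000000];
    double sum = 0.0;

    /* Fork: the master thread spawns a team; iterations are divided among threads. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * i;
        sum += x[i];
    }
    /* Implicit join: execution continues sequentially in the master thread. */
    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}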
To be able to write optimized code, it is important to be aware of the policies adopted at the hardware level, to exploit the underlying mechanisms in a good manner, and to avoid misuse of them that can lead to severe degradation of the execution time.
1.9.3 Performance measures
The execution time of a parallel implementation is often the most important performance measure, as in many cases the parallelization is performed in order to shorten it. How the execution time scales with the number of cores is also of importance, as it specifies how much the execution time can be improved by employing more processing units.
Figure 1.11. Fork join model. A program containing three sequential sections
of work, S1 , S2 , and S3 , and two parallel sections with the tasks A1 , B1 , C1
and D1 in the first section and A2 and B2 in the second section. The master
thread is marked with gray.
This is characterized by the speedup s(M), defined as

    s(M) = t(1) / t(M),
where t(1) is the execution time of the fastest known sequential implementation and t(M) is the execution time of the program using M cores. The efficiency, specifying how well the computational resources are utilized, is calculated as

    e(M) = s(M) / M.   (1.63)

An ideal speedup curve is linear in the number of processors used, i.e. s(M) = M, and has the efficiency e(M) = 1. It is actually possible to achieve a speedup slightly above linear, known as superlinear speedup. This phenomenon can occur because, when more processors are employed, the total amount of private cache increases, allowing better cache performance that can result in an efficiency above 1.

A simplified formula for the speedup can be obtained as follows. Assume that p| and p|| are the portions of the program that are executed sequentially and in parallel, respectively, and denote the parallel overhead by c(M). The execution time on M processors is then given by

    t(M) = p| + p||/M + c(M).

Noting that c(1) = 0, the speedup is obtained from
    s(M) = t(1)/t(M) = (p| + p||) / (p| + p||/M + c(M)) = 1 / (p| + p||/M + c(M)).   (1.64)
Ignoring the overhead term, this formula is known as Amdahl's law [5]. A consequence of it is that the highest achievable speedup, assuming no parallel overhead, is given by

    s(∞) = 1 / p|.
[Figure 1.12. Speedup curves for a program with different portions of sequentially executed code, p| (p| = 0, 0.05, 0.1), and a speedup curve for an implementation that has hit a bottleneck such as saturation of the memory bus. For reference, linear speedup is marked by the dashed line. Axes: number of cores M vs. speedup.]
Hence, a program having e.g. p| = 0.1 can never reach a speedup greater than 1/0.1 = 10 times, no matter how many processors are used. It is therefore of utmost importance to keep the sequentially executed part of an implementation as small as possible. Fig. 1.12 shows speedup curves for different values of p|, where the parallel overhead is given by c(M) = 0.01 + 0.01 · M². As c(M) increases with an increasing number of processors, the speedup curve has a maximum. Obviously, it is not beneficial to increase the number of processors beyond this maximum, since the overhead becomes too large and increases the execution time. The figure also shows the characteristics of a speedup curve when a bottleneck, such as the memory bandwidth, has been hit.
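As a numerical illustration of (1.64), with parameter values chosen only for the example, a program with p| = 0.05 and negligible overhead gives on M = 8 cores

    s(8) = 1 / (0.05 + 0.95/8) ≈ 5.9,

while the asymptotic limit is s(∞) = 1/0.05 = 20.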
1.9.4 Efficient parallelization
When constructing a parallel program, the goal should always be 100% efficiency, i.e. all time is spent on doing useful work and no time is spent on parallel overhead such as communication, synchronization, thread start-up, etc. This must be carefully taken into consideration when designing and implementing the algorithm.
Figure 1.13. Performance visualization for execution of code 1 (a) and code 2
(b). Code execution is marked by green, and overhead/synchronization work
is marked by red.
Algorithm 3 Code for computation of c = A*b
// Compute c = A*b
void mvm(double A[][N], double *b, double *c) {
    // Outer loop parallelization
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            c[i] += A[i][j] * b[j];
    }
}
As a simple example of this, consider two different parallel implementations of a simple matrix-vector multiplication

    c = Ab,   (1.65)

where A ∈ R^{N×N}, b ∈ R^N, c ∈ R^N. In the first implementation, the outer loop is parallelized; in the second implementation, the inner loop is parallelized. Both perform the same job, but the second implementation has a much larger overhead. Since the parallelization is in the inner loop, thread forking and termination occur N times instead of once, as in implementation 1. In Fig. 1.13, the performance is visualized for the two implementations. As can be seen, implementation 1 has an efficiency very close to 100%, while implementation 2 only has an efficiency of roughly 80%.
Algorithm 4 Code for computation of c = A*b
// Compute c = A*b
void mvm(double A[][N], double *b, double *c) {
    // Inner loop parallelization
    for (int i = 0; i < N; i++) {
        double temp = 0.0;
        #pragma omp parallel for reduction(+:temp)
        for (int j = 0; j < N; j++)
            temp += A[i][j] * b[j];
        c[i] = temp;
    }
}
Table 1.1. Two possible orders of execution. R, I and W denote Read value, Increase value and Write back, respectively.

    T1:          R   I   W
    T2:                      R   I   W
    Value of b:  0   0   1   1   1   2

    T1:          R       I       W
    T2:              R       I       W
    Value of b:  0   0   0   0   1   1
1.9.5 Using Shared Resources
Compared to sequential programming, one of the main differences that has to be taken into consideration in parallel programming is that several processors can access and modify shared resources. Precautions must be taken both to ensure correct execution of the algorithm and to avoid time-consuming conflicts in the resource usage.

An example of a situation where random results are obtained if precautions are not taken is a race condition. Consider two different threads, T1 and T2, that compute the sum 1 + 1 = 2 in parallel by reading from and writing to the shared variable b. In Tab. 1.1, two possible outcomes of the execution are shown. In the first case, the correct result is obtained, while an incorrect result is found in the second case. The order in which the events R, I and W occur can be considered random and cannot be controlled by the programmer. Such problems must be handled using mutual exclusion mechanisms. It is then possible to restrict the access to the shared resource so that one processor cannot read the data while another processor is modifying it.
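As a minimal sketch of how the race in Tab. 1.1 can be prevented with a mutual exclusion mechanism, the OpenMP atomic directive can be used so that the read-increase-write sequence executes indivisibly; two threads are used here purely for illustration.

#include <omp.h>

int shared_increment(void) {
    int b = 0;
    #pragma omp parallel num_threads(2)
    {
        /* Without the atomic directive, both threads could read b = 0
         * and the final result could be 1 instead of 2. */
        #pragma omp atomic
        b += 1;
    }
    return b;   /* always 2 */
}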
Another example, which does not produce incorrect results but has a negative impact on the efficiency of the execution, is false sharing. It is a subtle problem, but very important to address in order to achieve performance. Assume that we want to compute the recursion

    yt = yt−1² + cos yt−1 + sin yt−1,   t = 1, 2, ..., 10⁶,

for four different initial values of y0. The code in Alg. 5 achieves this goal. To speed up the computations, the outer for-loop is parallelized, which is perfectly fine as the loop iterations are completely independent of each other. A linear speedup in the number of cores employed could be expected (for up to four cores). However, as can be seen from the speedup plots in Fig. 1.14, the results are disappointing and only a speedup of about 1.2 times is reached using four cores. The problem here is so-called false sharing. Even though the loop iterations are completely independent, the elements of y are stored at neighboring addresses in memory. As explained in Sec. 1.7, when a data element is requested by the processor, several contiguous data elements will be fetched to the cache (in an attempt to exploit data locality). Here, the effect is devastating for the performance. When processor 1 accesses y[0], it will also fetch y[1], y[2] and y[3] into its cache (even though it will not use them). When it then modifies y[0], it will broadcast to the other processors (or the other processors' caches) that the cache line has been changed, which results in the other processors invalidating their cache lines involving y[0] and updating with the new one produced by processor 1. This happens even though they do not actually need this update to perform their work correctly. The processors participating in the parallel execution will thus each modify their own local part of the data in the cache line and constantly force the other processors to update with the changes they have made, even though it is unnecessary. The problem can be resolved by separating the memory addresses at which y[0], y[1], y[2] and y[3] are stored, by padding zeros in between them. The code and corresponding speedup plot, where the storage addresses have been separated, are shown in Alg. 6 and Fig. 1.14, respectively.
1.9.6 Data partitioning
Another important issue in parallel programming is how to partition the
data among the cores. Ideally, each core should touch as small amount
of data as possible, in order to not saturate the memory bus, and also
the data it touches should preferably be local to that core, to minimize
the inter-core communication. Some of the concepts are exemplified by
matrix operations below.
Let A, B, and C denote n × n matrices, and x, y, z denote column
vectors of length n. Assume that a matrix multiplication C = AB is to
46
[Figure 1.14. Speedup curves for the code in Alg. 5 and Alg. 6, in dashed red and solid blue, respectively. Axes: number of processors vs. speedup.]
Algorithm 5 Parallel code with false sharing problems
void recursion() {
    double y[4] = {1.2, 0.8, 5.6, 2.3};

    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 1E6; j++) {
            y[i] = y[i]*y[i] + cos(y[i]) + sin(y[i]);
        }
    }
}
Algorithm 6 Parallel code without false sharing problems
void recursion() {
    int n = 20;
    double y[4 + 3*n];
    y[0*n] = 1.2; y[1*n] = 0.8; y[2*n] = 5.6; y[3*n] = 2.3;

    #pragma omp parallel for
    for (int i = 0; i < 4*n; i += n) {
        for (int j = 0; j < 1E6; j++) {
            y[i] = y[i]*y[i] + cos(y[i]) + sin(y[i]);
        }
    }
}
Let A, B, and C denote n × n matrices, and x, y, z column vectors of length n. Assume that a matrix multiplication C = AB is to be parallelized. Let √M be an integer and consider the partitioning

    AB = [ A1 ; A2 ; ... ; A√M ] [ B1  B2  ···  B√M ] = [ Ai Bj ],   1 ≤ i, j ≤ √M,   (1.66)

where A is split into √M row blocks and B into √M column blocks (semicolons denote stacking of blocks), and one processor computes one of the M blocks Ai Bj. Compare this to the partitioning

    AB = [ A1 ; A2 ; ... ; AM ] B = [ A1 B ; A2 B ; ... ; AM B ],   (1.67)

where one processor computes one of the blocks Ai B, 1 ≤ i ≤ M. For
where one processor computes one of the blocks Ai B, 1 ≤ i ≤ M . For
both partitionings, the workload is perfectly distributed among the processors. All processors will perform an equal amount of computations
and no computations are duplicated. However, for partitioning (1.66),
each processor must touch MM+1 n2 data elements while for partitioning
2 2
n elements. It is clearly ben(1.67) each processor must only touch M
eficial to use partitioning (1.67), for M > 2, in order to touch as small
amount of data as possible. Specifically, it is seen that the total amount
of the data touched is given by (M +1)n2 and 2n2 elements by partitioning as in (1.66) and in (1.67), respectively. The amount of data touched
thus increases linearly with the number of processors, M when using
partitioning in (1.66), while being constant for partitioning in (1.67).
Thus, using partitioning (1.66), one could not expect an implementation to scale well for a large number of processors since the memory bus
will eventually be strained and limit the speedup.
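A minimal OpenMP sketch of the row-block partitioning in (1.67) is given below; each thread computes a block of rows of the product, touching only its own rows of A and C plus the shared matrix B. The matrix size and the static schedule are illustrative assumptions.

#include <omp.h>

#define N 512

void matmul_row_blocks(const double A[N][N], const double B[N][N], double C[N][N]) {
    /* Each thread is assigned a contiguous block of rows i (static schedule),
     * corresponding to one block Ai in (1.67). */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}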
As another example, consider a sequence of matrix computations

    y = Ax,
    C = y z^T,

that are to be parallelized. Compare the partitioning

    [ y1 ; y2 ; ... ; yM ] = [ A1  A2  ···  AM ]^T x,
    [ C1 ; C2 ; ... ; CM ] = [ y1 ; y2 ; ... ; yM ] z^T,   (1.68)

where processor m computes ym = Am^T x and Cm = ym z^T, with the partitioning

    y = y1 + y2 + ... + yM = [ A1  A2  ···  AM ] [ x1 ; x2 ; ... ; xM ],
    [ C1 ; C2 ; ... ; CM ] = [ y1 ; y2 ; ... ; yM ] z^T,   (1.69)
where processor m computes ym = Am xm and Cm = ym z^T. For both partitionings, each processor will perform the same number of computations and touch the same amount of data. However, the partitioning in (1.69) requires a synchronization point between the two lines, since the partial sums must be added to form y, i.e. y = Σ_{m=1}^{M} ym. Partitioning (1.68) thus provides better potential for an efficient implementation, since the processors can perform a larger amount of work independently of each other, without synchronization and communication.
1.10 Short chapter summaries
Here, a short summary of each chapter of the thesis is given.
Chapter 2
In this chapter a parallelization of the Kalman filter is presented. The
content is based on Paper II and Paper V. First a parallelization method
for the MISO case is given, which is based on a banded structure of the
system transition matrix. It is discussed how different systems, both
time invariant and time variant ones, can be realized on a banded form.
The given parallelization method is then extended to cover the MIMO
case, by utilizing sequential filtering of the measurement vector. The
proposed parallelization is evaluated on a load-estimation problem for
mobile networks, and is compared against a BLAS based implementation. The results show that the given parallelization performs significantly better than the BLAS based one, and is capable of achieving
linear speed up in the number of cores used (tests are performed up to
8 cores).
Chapter 3
In this chapter, an important special case of the parallelization given in Chapter 2 is considered. The content is based on the material in Paper IV. The case under consideration is when the Kalman filter is used as a parameter estimator, and it is shown how this case can be implemented especially efficiently. A more detailed discussion of implementation details for optimization of the execution time is also given.
Chapter 4
This chapter is based on the material in Paper I and Paper III. Parallelization of the particle filter is studied. Four different parallelizations:
the globally distributed particle filter, resampling with proportional allocation filter, resampling with non-proportional allocation filter, and the
Gaussian particle filter, are implemented on a multicore computer and
evaluated using up to eight cores. The results show that the Gaussian
particle filter and the resampling with non-proportional allocation filter
are the best suited ones for parallelization on multicore computers and
linear speed up is achieved.
50
Chapter 5
This chapter is based on Paper VII and Paper VIII. A solution method
for the recursive Bayesian estimation problem is presented. The involved
probability density functions are approximated with truncated series expansions in orthogonal basis functions. Via the prediction-update recursion, the coefficients of the expansions are computed and propagated.
The beneficial parallelization properties are demonstrated on a bearings-only tracking problem, where linear speedup is achieved even for small
problem sizes. The drawback of the method is mainly that the state
must be confined to a pre-specified domain. An analysis of the method
is also carried out. It is mainly studied how the error in the estimated
PDF, caused by the truncation of the expansions, is propagated over the
iterations. A bound ensuring that this error does not grow unbounded is
given. A comparison of the solution to a bearings-only tracking problem
using the Fourier and Legendre basis functions is also given.
Chapter 6
In this chapter a method combining particle filtering with orthogonal series expansion is developed and analyzed. The material is taken from Paper XI and Paper XII. The method is based on fitting a series expansion to the particle set when resampling. In this way, the information carried by the particle set can be compressed into a few informative coefficients that can be efficiently communicated between the processing units. This gives the method favorable parallelization properties, making it suitable for multicore and even distributed hardware platforms. An analysis of how well the series expansion captures the underlying PDF is given. Also, an upper bound on the magnitude of the expansion coefficients, when using the Hermite basis functions, is derived and provided.
Chapter 7
This chapter is based on the material in Paper IX, Paper X and Paper VI. A novel method for anomaly detection in reference following
systems is given and discussed. The method is based on a set of observed trajectories collected from the system. From this set, PDFs that specify the probability density of finding the system in a given state are computed. The anomaly detection is then carried out by performing outlier tests with respect to the estimated PDFs, to see how much the system state deviates from the normal one. The method is evaluated, with good results, on vessel traffic data, as well as on eye movement data from an eye tracking application.
Chapter 8
Parameter estimation of a minimally parametrized model for the PK/PD of drug delivery in anesthesia is considered. The content of this chapter is based on the material given in Paper XIV and Paper XIII. Three different estimation methods, the EKF, the PF and the filtering method given in Chapter 7 (OBPF), are tested and compared in this application. It is shown that the EKF is prone to significant bias in the parameter estimates, while the PF and the OBPF do not suffer from such problems. The PF and the OBPF are shown to give similar estimation quality, but the OBPF can provide this quality at a smaller computational cost. As the estimated model is intended to serve as a model for closed-loop control of the anesthesia, it is important to provide as accurate estimates of the parameters as possible.
Chapter 9
This is a short chapter presenting the results from BLAS-based implementations of the UKF and a point mass filter. It is shown that linear speedup is obtainable by a parallel implementation without any modifications for improved parallelization properties.
Chapter 2
Parallelization of the Kalman Filter
2.1 Introduction
The Kalman Filter (KF) still represents the mainstay of linear estimation, even in medium and large-sized systems. Parallel implementations
of the KF have been suggested over the years to improve the execution time. However, many of these schemes are hardware-specific with
respect to such architectures as e.g. the Connection Machine [63], distributed memory machines [60] and systolic arrays [92] and thus are not
directly suitable for a multicore implementation. Other parallelization
solutions suffer from the presence of sequentially executed sections that
prevent significant speedup [117], [103]. Pipelined-by-design algorithms [43] have an input-to-output latency equal to or even greater than that of a sequentially executed filter, a property that is not acceptable in many
real-time applications (e.g. the application in cellular communication
studied in [115]). In [74], a parallel multicore implementation of the KF
for parameter estimation is presented.
This chapter deals with efficient parallel implementation of the Kalman
filter (KF) for state estimation in discrete time-varying linear systems
on shared memory multicore architectures. However, the proposed solution requires only a small amount of inter-processor communication,
which makes it suitable also for a distributed architecture. The KF
algorithm consists of a sequence of matrix-matrix and matrix-vector
multiplications. The parallelization of these kinds of operations on a multicore architecture is a routine matter. It is indeed possible to perform the operations involved in the KF algorithm using a pre-built library such as BLAS in a straightforward manner. However, as is shown in
this chapter, the result will suffer from several drawbacks that include
a large amount of inter-processor communication, synchronization, and
high demand for memory bandwidth. For the case of systems with
banded state-space realizations, the above mentioned drawbacks of parallel KF implementation can be efficiently alleviated.
As the Multiple-Input Multiple-Output (MIMO) estimation problem with p outputs can always be implemented as a sequence of p single-output filter problems [52], a method for the Multiple-Input Single-Output (MISO) case is developed and used as a building block for the
MIMO case. This approach avoids the inversion of a p × p matrix, which
is known to be difficult to parallelize efficiently because of its intricate
data dependencies.
To mention a few, active noise cancellation [58], climate and weather
prediction [64], and mobile broadband load estimation [116] are applications where efforts are made to decrease the execution time of the
filtering step. The method suggested in the present chapter is applied
to a Wideband Code Division Multiple Access (WCDMA) load estimation problem that is deemed critical in mobile broadband. This is a
field where the computational burden is growing rapidly due to the increasing number of smart phones in the system. Since the number of
users is directly affecting the number of estimated states, it is clear that
the computational burden of sequential implementations of the KF becomes prohibitively demanding. The multicore techniques of the present
chapter provide therefore an interesting alternative to an increase in the
potential number of uplink users of the cell, which in turn results in
savings in the required amount of hardware and power consumption of
the system.
The chapter structure is as follows. Sec. 2.2 provides a summary of
the KF equations. A discussion on banded systems is given in Sec. 2.3.
In Sec. 2.4 the main contribution, a parallel implementation of the KF
for MISO systems, is provided. An analysis yielding estimates of the
amount of parallelizable work, the required bandwidth, and amount of
communication is also given, to offer instrumental guidelines for the
choice of implementation hardware. In Sec. 2.5, the parallelization of the KF for MIMO systems based on the MISO implementation is carried out. Finally, in Sec. 2.6, the results of computer experiments are presented, followed by a discussion in Sec. 2.7.
2.2 State space model and filtering equations
2.2.1 State space system description
Consider a MISO discrete time system
xt+1 = Ft xt + Gt ut + wt,    (2.1)
yt = ht xt + jt ut + vt,    (2.2)
with the state vector xt ∈ Rn , the input vector ut ∈ Rm and
the output yt ∈ R at discrete time step t. Generally, Ft ∈ Rn×n and
Gt ∈ Rn×m are time-varying matrices, while ht ∈ R1×n and jt ∈ R1×m
are time-varying vectors. The process and measurement noise sequences
wt ∈ Rn and vt ∈ R are assumed to be independent, white, zero mean,
Gaussian distributed, with the covariance matrices E[wt wtT ]=Qt and
E[vt2 ] = rt , respectively.
2.2.2 Kalman filter equations
The KF equations below are in the so-called standard form. For filtering
problems that require special attention to numerical stability, the square
root formulation is to prefer [10]. Parallelization of the square root form
of the KF is investigated in [63], where mainly the Givens rotation step
is parallelized. However, many systems do not require the square root
form to maintain numerical stability. As it will be shown here, the
implementation and parallelization can be made more efficient for the
KF in the standard form, than the implementation proposed in [63].
As given in Sec. 1.6.1 the KF consists of two steps: prediction and
update. These are recursively applied to the data to calculate the state
estimate x̂ and the error covariance matrix P. For the system (2.1)-(2.2),
the KF [53] is calculated as:
Prediction

x̂t|t−1 = Ft x̂t−1|t−1 + Gt ut,    (2.3)
Pt|t−1 = Ft Pt−1|t−1 Ft^T + Qt,    (2.4)

Update

ỹt = yt − ht x̂t|t−1 − jt ut,    (2.5)
dt = ht Pt|t−1 ht^T + rt,    (2.6)
Kt = Pt|t−1 ht^T dt^(−1),    (2.7)
x̂t|t = x̂t|t−1 + Kt ỹt,    (2.8)
Pt|t = (I − Kt ht) Pt|t−1.    (2.9)
2.3 Banded systems
A matrix A is said to be banded with bandwidth Nb if it is zero everywhere except on the main diagonal and the Nb super- and sub-diagonals, i.e. it is of the form

\[
A = \begin{bmatrix}
a_{00} & \cdots & a_{0N_b} & & & 0\\
a_{10} & a_{11} & & \ddots & &\\
\vdots & & \ddots & & \ddots &\\
a_{N_b 0} & \ddots & & \ddots & & a_{(N-N_b)N}\\
 & \ddots & & & \ddots & \vdots\\
0 & & a_{N(N-N_b)} & \cdots & a_{N(N-1)} & a_{NN}
\end{bmatrix}.
\]
The performance of the suggested parallelization increases with a decreasing bandwidth of the transition matrix Ft . In this section transformation to a banded system form is discussed.
2.3.1 Transformation to a banded system form
Any linear finite-dimensional system can, with more or less effort, be
transformed to a realization with a banded system matrix in a numerically stable manner. This holds true for both time-varying and time-invariant systems. The number of bands in the system matrix is denoted
Nb , where e.g. a diagonal and a tridiagonal matrix have Nb = 0 and
Nb = 1, respectively. Under the state variable transformation xt = Tt zt ,
where Tt is a non-singular matrix, the transformed system is given by
zt+1 = F̄t zt + Ḡt ut + Tt+1^(−1) wt,
yt = h̄t zt + jt ut + vt,

where

F̄t = Tt+1^(−1) Ft Tt,    Ḡt = Tt+1^(−1) Gt,    h̄t = ht Tt.
2.3.2 Time-invariant case
Assume that (2.1)-(2.2) represent a time-invariant system, i.e.
Ft = F, Gt = G, ht = h, ∀t. Then the following holds:
• For F with distinct eigenvectors, a modal form can be used. A
transformation T can be found such that F is block-diagonal with
a 2 × 2 block for each complex conjugated pair of eigenvalues and
a 1 × 1 block for each real eigenvalue of F [42].
• There is also a possibility to bring the system (the F matrix) to a
tri-diagonal form, in a numerically sound manner. Tri-diagonalization
via similarity transforms as well as tri-diagonal realizations obtained directly from the system’s Hankel matrix are discussed in
[65].
• It is always possible to transform F to Jordan (bi-diagonal) form
[42]. However, since the Jordan form is known to exhibit poor
numerical properties, it is not always a suitable option.
Thus, the matrix F belongs to the class of tri-diagonal matrices
(Nb = 1) in the worst case and, in the best case, it is diagonal (Nb = 0).
Both cases make the KF equations highly suitable for parallel implementation. Note that, for time-invariant systems, the transformation T
can be computed beforehand and offline.
2.3.3 Time-varying case
For time-varying systems, there are many possibilities to express the
system in banded form.
• Using a realization method based on a sequence of Markov parameters, Mk,j , it is sometimes possible to find a rank factorization
such that Mk,j = Ct Bj . A realization with a diagonal F matrix is
then simply obtained by taking Ft = I, Gt = Bt , ht = Ct , jt = 0,
[88].
• Assume that a realization is readily available in the form of (2.1)-(2.2). It can then be seen that the transformation Tt = Φt, where Φt = Π_{i=0}^{t} Fi is the transition matrix, will give F̄t = I. Since Ḡt requires the inverse transform Tt+1^(−1) = Φt+1^(−1), it is not always computationally sound. However, in case analytical expressions for Φ and Φ^(−1) are available or can be obtained by a small amount of computation, the transformation can be applied to obtain a diagonal system.
• If Ft is a sparse matrix where the zero elements are located at the
same positions for all t, an optimized band structure of Ft can be
obtained by taking T to be a permutation matrix [29].
• A system that consists of loosely coupled subsystems can often be realized with a block-diagonal matrix F that possesses a few off-block-diagonal elements describing the couplings between the subsystems.
• Matrices arising from finite-element or finite-difference problems in
one or two dimensions are often banded. The bandedness stems
from the fact that the variables are not coupled over arbitrarily
large distances. For instance, discretizations of the differential equation

∂^{n1} T(x1, x2)/∂x1^{n1} = ∂^{n2} T(x1, x2)/∂x2^{n2},    0 ≤ n1, n2 ≤ 2,

encountered in physics as the heat equation, wave equation and Laplace equation, have realizations in a banded form.
• In stochastic setups, where the KF is used as a black-box parameter estimator [100], the parameter vector θ is modeled as a random walk driven by process noise, and the output is required to be a linear combination of the unknown parameters. The system equations are then given by

θt+1 = θt + wt,
yt = ht θt + vt,
which is a special case of (2.1)-(2.2) with Ft = I, Gt = jt = 0. Since
Ft = I, the recursive parameter estimation problem is especially
suitable for parallel implementation. This case is studied in detail
in Chapter 3.
2.4 MISO System
To parallelize a Kalman filter for a MISO system is a simpler problem than doing it for the MIMO case. A parallelization for the MISO case will therefore be studied first and then extended to the MIMO case. Further, a banded structure of F provides a better basis for an efficient parallelization; the presence of a matrix F of general structure in the system equations makes efficient implementation and parallelization of the KF somewhat more difficult. As discussed in Sec. 2.3, it is always possible to obtain a realization, or transform an existing realization, so that the matrix F becomes banded with a low bandwidth. Since the main purpose is to achieve faster execution times, it is of importance to optimize
the implementation of the sequential version, from which the parallel
version will be built. Therefore, an efficient sequential implementation
is presented below in Sec. 2.4.1 and the parallelization of it is handled
in Sec. 2.4.2.
2.4.1 Efficient sequential implementation
The main focus of optimization should be on the computations involving
the matrix P since a majority of the FLOPs and memory accesses are
related to it. In Alg. 7, an implementation of (2.3)-(2.9) in which the accesses to P occur consecutively is given. The gain originates from the fact that, for a banded F, once an element of P is brought to the cache,
it can be used to accomplish all calculations the element is involved in
before it is thrown out. Further, it allows the calculated elements in
Pt+1|t to be stored at the same locations as the elements of Pt|t were
held, giving a substantial reduction in the memory size and bandwidth
needed. In [74], this reordering is shown to execute about twice as
fast as compared to an implementation that does not make use of this
possibility. This kind of optimization is not possible with a dense matrix
F.
The matrix P is a symmetric positive definite matrix. This should
be taken advantage of since approximately half of the computations
and memory storage can be spared due to this fact. However, to avoid
too many technical details, the parallelization principles of KF will be
presented for a version where the whole matrix P is used in the computations. The modifications needed for an implementation using only the
upper triangular part of P are straightforward and minor.
Algorithm 7 Efficient Kalman Filter implementation.

x̂t|t = x̂t|t−1 + dt^(−1) ct [yt − ŷt]    (2.10)
Pt|t = Pt|t−1 − dt^(−1) ct ct^T    (2.11)
Pt+1|t = Ft+1 Pt|t Ft+1^T + Qt+1    (2.12)
ct+1 = Pt+1|t ht+1^T    (2.13)
x̂t+1|t = Ft+1 x̂t|t + Gt+1 ut+1    (2.14)
ŷt+1 = ht+1 x̂t+1|t + jt+1 ut+1    (2.15)
dt+1 = rt+1 + ht+1 ct+1    (2.16)
2.4.2 Parallel implementation
Assume that F is a dense matrix. A parallel implementation of Alg. 7 can be produced by parallelizing each step individually using BLAS or some other highly optimized library for matrix operations. (Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations; BLAS is a de facto application programming interface standard, see netlib.org/blas/.) The calculation of (2.12) is then split as

A = Pt|t Ft+1^T,
Pt+1|t = Ft+1 A + Qt+1,
where each line is parallelized separately. However, such an approach
will have several drawbacks. Each processor must touch a large amount
of data limiting the scalability of the implementation. A synchronization
point between the calculations and a temporary storage for A is also
required. A large amount of inter-processor communication is needed
that will have negative impact on the execution time and as well limit
the algorithm performance in a distributed implementation. In the case
of a banded matrix F, where the number of bands Nb ≪ N, it is possible
to remedy the mentioned drawbacks and thus achieve fast execution and
good scalability.
Assume N/M to be an integer and define (recall that 1 : n = {1, 2, .., n})

ri := (N/M)(i − 1) : (N/M)i − 1,
si := ri(1) − Nb : ri(N/M) + Nb.
A parallelization of the KF over the whole sequence of matrix operations
for a banded matrix F is described in Alg. 8. The algorithm is designed
to make the number of synchronization points, amount of communication, and the amount of data that each processor has to touch, as small
as possible. Notice that (2.12) in Alg. 7 is executed by the i:th CPU as
Pt+1|t (ri , :) = Ft+1 (ri , si )Pt|t (si , :)FTt+1 .
Each processor is given access to the whole matrix F that is banded
and contains a small amount of data, but only a restricted part of P
is touched. Processor i will be responsible for updating of Pt+1|t (ri , :),
which will only require knowledge of Pt|t (si , :). The parts of P that
must be communicated by processor i are thus only the Nb rows that
overlap with the neighboring processors (CPU i − 1 and CPU i + 1).
Algorithm 8 Kalman Filter parallel implementation.
• Parallel (CPU i calculates)
  x̂t|t(ri) = x̂t|t−1(ri) + dt^(−1) ct(ri)[yt − ŷt]
  Pt|t(si, :) = Pt|t−1(si, :) − dt^(−1) ct(si) ct^T
  Pt+1|t(ri, :) = Ft+1(ri, si) Pt|t(si, :) Ft+1^T + Qt+1(ri, :)
  ct+1(ri) = Pt+1|t(ri, :) ht+1^T
  x̂t+1|t(ri) = Ft+1(ri, si) x̂t|t(si) + Gt+1(ri, :) ut+1
  ŷt+1^(i) = ht+1 x̂t+1|t(ri)
  bt+1^(i) = ht+1 ct+1(ri)
• Sequential
  ŷt+1 = Σ_{i=1}^{M} ŷt+1^(i) + jt ut
  dt+1 = rt+1 + Σ_{i=1}^{M} bt+1^(i)
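To make the row-block structure concrete, the following is a minimal C/OpenMP sketch (not the thesis implementation) of the covariance prediction step of Alg. 8, Pt+1|t(ri, :) = Ft+1(ri, si) Pt|t(si, :) Ft+1^T + Qt+1(ri, :), for a banded F stored as a dense matrix. Each row of the result is produced by the thread that owns it, and only the rows of P within the band around that row are read; function and variable names, and the N·N scratch buffer work, are illustrative choices.

#include <omp.h>

/* Pnew = F * P * F^T + Q for a banded F (bandwidth Nb), parallelized over
 * rows. F, P, Q, Pnew are N-by-N, row-major; work is an N*N scratch buffer
 * whose row i is private to the thread that handles output row i. */
void predict_covariance(int N, int Nb, const double *F, const double *P,
                        const double *Q, double *Pnew, double *work)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double *Ai = &work[(size_t)i * N];
        int lo = i - Nb < 0 ? 0 : i - Nb;          /* band of row i of F */
        int hi = i + Nb >= N ? N - 1 : i + Nb;
        for (int j = 0; j < N; j++) {              /* A(i,:) = F(i,si) P(si,:) */
            double a = 0.0;
            for (int k = lo; k <= hi; k++)
                a += F[(size_t)i * N + k] * P[(size_t)k * N + j];
            Ai[j] = a;
        }
        for (int j = 0; j < N; j++) {              /* Pnew(i,:) = A(i,:) F^T + Q(i,:) */
            int jlo = j - Nb < 0 ? 0 : j - Nb;
            int jhi = j + Nb >= N ? N - 1 : j + Nb;
            double p = Q[(size_t)i * N + j];
            for (int k = jlo; k <= jhi; k++)
                p += Ai[k] * F[(size_t)j * N + k]; /* column j of F^T is row j of F */
            Pnew[(size_t)i * N + j] = p;
        }
    }
}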
i
2.4.3 Analysis
An analysis of Alg. 8 is carried out in this section to evaluate the number
of sequential and parallel FLOPs, the required memory bandwidth, the
demand of communication, and synchronization in the implementation.
This provides important guidelines for the choice of hardware to meet a
desired performance of the designed system.
Parallelizable amount of work
Counting the number of FLOPs fs and fp that are executed sequentially
and in parallel in Alg. 8, the following expressions can be obtained:

fs(M, m) = 3M + 2m,    (2.17)
fp(N, Nb, m) = (Nb^2 + 2Nb + 5)N^2 + 2(2 + m)N.    (2.18)
As noted in Sec. 1.9.3, Amdahl's law [5] states that the maximal theoretically obtainable speedup is given by

s(M) = 1 / (p| + p||/M),

where p| and p|| are the sequentially and parallelly executed portions of the program, respectively. Now, if N > M, which should definitely be the case, then fp ≫ fs and p| ≈ 0, p|| ≈ 1 is a reasonable approximation,
yielding s(M ) = M . Thus, regarding the portion of parallelizable work,
the algorithm has the potential of achieving good scalability.
Memory bandwidth
The only variables of considerable size in the KF algorithm are the
matrices P and Q. Let q(P) denote the size of P in bytes. Assuming
that P and Q are transferred from the RAM to the processors at each
iteration will give a required memory bandwidth of
B = n · (q(Q) + q(P)) / T    (2.19)
to perform n iterations in T seconds. If the bandwidth Bh provided
by the hardware satisfies Bh ≥ B, it will not be a bottleneck in the
implementation. In many practical cases, Q is diagonal, or at least
sparse, in which case q(Q) ≪ q(P) and q(Q) can be neglected in (2.19).
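As an illustrative calculation (the iteration rate is a hypothetical figure, not a measurement from this chapter): for N = 400 states in double precision, q(P) = 8 · 400^2 B ≈ 1.28 MB, and with a diagonal Q its contribution can be neglected; performing n = 1000 iterations per second then requires, by (2.19), a bandwidth of roughly B ≈ 1.3 GB/s, well below the approximately 23 GB/s provided by the hardware used in Sec. 2.6.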
Synchronization and Communication
Only one synchronization point at the end of the parallel section is
needed. The data that have to be communicated between the CPUs
are given by the overlapping rows of P, the local parts of c, and the variables ŷt^(i), bt^(i), whose number of elements is given by

C(N, Nb, M) = 2N(M − 1)(2Nb + 2 + 1/M).    (2.20)
2.5 MIMO System
Consider (2.1)-(2.2) as a MIMO system with p outputs. Denote the measurement noise vt = [vt(1) vt(2) · · · vt(p)]^T. If vt(i) is independent of vt(j), j ≠ i, i = 1, 2, ..., p, then, by the Gaussian assumption on vt, this is equivalent to Rt = E[vt vt^T] being a diagonal matrix. The resulting MIMO problem can be treated as a sequence of p MISO filtering problems, where the filtering of the p measurements can be done sequentially, one by one [52]. If Rt is not diagonal but positive definite, a (Cholesky) transformation zt = Lt yt can be applied to render St = E[zt zt^T] diagonal. From the relation

E[zt zt^T] = Lt E[yt yt^T] Lt^T = Lt Rt Lt^T,
it is seen that the choice Lt = Rt^(−1/2) will give E[zt zt^T] = I, which together with the Gaussian assumption on vt establishes the independence of the measurement noise components. By the assumption Rt > 0, Rt^(−1/2) is guaranteed to exist. If there are measurements that are noise-free, R becomes positive semidefinite and such measurements can be handled separately by e.g. a reduced observer. The MIMO filtering problem can thus always be split into a sequence of p MISO filtering problems, and the parallelization can be performed over each MISO filtering problem as proposed in Sec. 2.4.
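As a simple illustration (the dimensions are chosen here only for the example): for a system with p = 2 outputs and a diagonal Rt = diag(rt(1), rt(2)), one filter iteration consists of a single prediction step (2.3)-(2.4) followed by two consecutive scalar update steps (2.5)-(2.9). The first update uses the measurement yt(1) with ht set to the first row of the measurement matrix and rt = rt(1); the second uses yt(2) with the second row and rt = rt(2), starting from the state estimate and covariance produced by the first update. No 2 × 2 matrix inversion appears; only the scalars dt in (2.6) are inverted.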
2.6 Implementation example
In order to quantify and validate the multicore computational gains on
a realistic problem, a simulation study of a WCDMA uplink interference
power estimation system was used.
2.6.1 Uplink interference power estimation model
In this section, a simplified model of the interference power generation
and measurements in the WCDMA uplink is provided. 3G mobile broadband data traffic is based on high speed packet access (HSPA) technology. The estimation of uplink load in this system is an example where
the proposed parallelized KF algorithm may find application.
In the uplink, scheduling is required to assign users to the available
cells. Efficient scheduling requires the interference power from users of
the own cell and from users in neighboring cells to be estimated in real
time, which in practice is a very difficult problem. The reference [115]
therefore proposes a new algorithm for recursive Bayesian estimation of
the noise power floor. Kalman filtering for uplink interference power
estimation was treated in [114]. This solution does however use at least
one state per user and is computationally complex. With the hundreds
of users per cell anticipated in the near future, it is clear that the solution
suggested in [114] becomes practically infeasible.
A state space model
A brief description of the state space model for the WCDMA uplink power estimation problem is provided in this section; see [114] and [41] for a
more extensive exposition. A state space model for the system is given
by:
xt+1 = F xt + G ut + wt,
yt = Ht xt + et,

where

\[
F = \begin{bmatrix}
1-\kappa & & & 0\\
& \ddots & & \vdots\\
& & 1-\kappa & 0\\
0 & \cdots & 0 & 1
\end{bmatrix},
\qquad
G = \mathrm{diag}(\kappa, \ldots, \kappa, 0),
\]
\[
H_t = \begin{bmatrix}
\frac{1}{1+\eta_t(1)} & & & 0\\
& \ddots & & \vdots\\
& & \frac{1}{1+\eta_t(N)} & 0\\
1 & \cdots & 1 & 1
\end{bmatrix},
\]
\[
Q = \mathrm{diag}\bigl(q(1), \ldots, q(N), q^{n+thermal}\bigr),
\qquad
R = \mathrm{diag}\bigl(r(1), \ldots, r(N), r_{RTWP}\bigr),
\]

and

xt = [xt(1) · · · xt(N) xt^{n+thermal}]^T,
ut = [xt^{ref}(1) · · · xt^{ref}(N) 0]^T,
wt = [wt(1) · · · wt(N) wt^{n+thermal}]^T,
yt = [yt(1) · · · yt(N) y_{RTWP,k}]^T,
et = [et(1) · · · et(N) e_{RTWP,k}]^T.
The power consumed by the i:th channel is xt(i), and xt^{n+thermal} is the sum of the neighbor cell interference and thermal noise power, modeled as a random walk

x_{t+1}^{n+thermal} = x_t^{n+thermal} + w_t^{n+thermal},    (2.21)

where w_t^{n+thermal} is the system noise corresponding to this state, and κ is a parameter determined by the radio link quality and set for an inner control loop.
The reference power x_t^{ref}(i) for the i:th channel is controlled by an outer loop controller and is given by

\[
x_t^{ref}(i) = \frac{1+\eta_t(i)}{1+\bigl((C/I)_t^{ref}(i)\bigr)^{-1}}\, x_t^{total},
\]

where (C/I)_t^{ref}(i), i = 1, ..., N, denote the carrier-to-interference levels. Furthermore, x_t^{total} is the total power and ηt(i) is the quotient between the data power offset and the control signal power, see [114] for details. Note that ηt(i), and hence Ht, is time varying.
The control signal power yt (i) is the quantity measured for the i:th
radio link. The additional measurement yRT W P,k available on the uplink
is the total received wideband uplink power. This is simply the sum of
the powers represented by all states. The measurement and process
noise covariance matrices are given by Q and R respectively.
In the simulation, the SINR targets were set 5 dB lower than usually assumed for a conventional RAKE receiver, to be able to run up to 400 radio links. This is motivated since today the uplink is used with a higher block error rate than when the system was standardized. Furthermore,
more advanced receivers than the RAKE are in operation today, among
these interference suppressing [116] and interference canceling receivers
[119].
2.6.2 Results
Alg. 8 was implemented and compared with Alg. 7 parallelized using
Intel’s MKL BLAS library. The execution times for a range of problem
sizes are summarized in Tab. 2.2. The scalability is illustrated by the
speedup plots depicted in Fig. 2.1 and Fig. 2.2, respectively. To verify the
correctness of the parallel implementation, the same data were filtered
with a sequential Matlab implementation confirming that the outputs
were identical. The sum of residuals,

r(Ni) = (1/Ni) Σ_{t=0}^{Ni} |yt(i) − Ht(i)xt(i)|,    (2.22)

for one channel is given for the sequential and parallel filters in Tab. 2.1, where Ni = 1000 is the number of iterations executed.
The code was written in C, and OpenMP (Open Multi-Processing, an application programming interface that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran) was used for parallelization. The hardware used was an octa-core shared memory multicore computer comprised of two Intel Xeon 5520 (quad-core, Nehalem, 2.26 GHz) processors, with an 8 MB cache and a memory bandwidth of approximately 23 GB/s.
Table 2.1. Loss function (Eq. 2.22) values for sequential and parallel implementations of the KF for different problem sizes N.

N      Sequential      Parallel
50     7.2987E−17      7.2583E−17
100    2.3841E−17      2.4012E−17
200    1.0021E−17      1.0531E−17
300    3.5315E−16      3.4714E−16
400    5.2381E−17      5.2190E−17
Table 2.2. Single core execution time in milliseconds for 50 iterations.

N        50     100     200     300     400
BLAS     7.18   10.12   29.89   92.65   167.99
Alg. 8   0.61   2.09    8.08    18.01   31.64
One implementation that parallelizes the steps in Alg. 7 using Intel's MKL BLAS library, and an implementation of Alg. 8, were made. The two implementations will be referred to as implementation 1 and implementation 2, respectively.
Further, to study how the scalability of the parallelization is affected by the bandwidth Nb of the transition matrix, a discrete version of the heat equation, where a rod is heated at the ends, was implemented. Let Tij denote the temperature at discrete position i and time j.
With the state vector xt = [T1,t T2,t · · · TN,t]^T, the realization gets a banded form, with the number of bands Nb determined by the approximation order of the derivatives. For a first order approximation, the F matrix gets the structure

\[
F_t = \begin{bmatrix}
1 & 0 & & & 0\\
\times & \times & \times & &\\
& \ddots & \ddots & \ddots &\\
& & \times & \times & \times\\
0 & & & 0 & 1
\end{bmatrix}.
\]
The impact of the number of bands on the scalability was studied for
this system for a fixed problem size N = 1000 and evaluated for the
number of bands 10, 50, and 100 respectively. The results are presented
in Fig 2.3.
Figure 2.1. Speedup curves for Alg. 8 for N = 50 to N = 400, using up to M = 8 processors. For reference, linear speedup is marked by the dashed line.

Figure 2.2. Speedup curves for the BLAS implementation for N = 50 to N = 400, using up to M = 8 processors. For reference, linear speedup is marked by the dashed line.

Figure 2.3. Speedup curves for a fixed problem size N = 1000 with a varying number of bands Nb.

2.7 Discussion
Even though each subroutine provided by BLAS is extremely efficiently implemented, the implementation of Alg. 8 executes faster on a single
core, as can be seen from Tab. 2.2. This comes from the fact that an
optimization over the sequence of operations can be made for Alg. 8,
whereas the optimization is done over each single operation, one by
one, when employing the BLAS implementation. When optimizing over
sequences of operations, the possibility to make more efficient use of the
memory hierarchies is a main factor in the faster execution.
Regarding scalability, the implementation of Alg. 8 performs far better
than the BLAS implementation. As discussed previously, this is due to
less communication and parallel overhead. The effects are especially
distinct for smaller problem sizes where the overhead constitutes a large
proportion of the execution time. For Alg. 8, almost linear speedup in
the number of cores used is achieved for N ≥ 200. For lower N , the gain
of parallel implementation is less clear, and for N = 50 even a slowdown
can be observed for M = 8, due to the disproportionately large overhead-to-computation ratio. However, for smaller problem sizes, not even 2
times speedup is reached for the BLAS implementation, and a slowdown
can be observed for N ≤ 200.
As expected, the scalability drops for larger Nb, see Fig. 2.3. This is due to the fact that the amount of communication is proportional to Nb, see (2.20). However, for Nb = 100, meaning that for most rows of F, 201 of the 1000 elements are filled, the speedup for M = 8 is still about 5 times, which can be considered fairly good.
The implementations were evaluated on hardware that runs at a very
high CPU clock frequency (2.26 GHz). Embedded hardware, especially
low power systems, typically run at much lower clock frequencies. With
a lower clock frequency, the scalability can be expected to be better
for smaller problem sizes since the computation-to-overhead ratio goes
down.
As mentioned before, Alg. 8 will most likely perform well on a distributed system. This is because of the low amount of communication,
shared data, and the fact that only a restricted part of P must be touched
by each processor. This would definitely not be the case for the BLAS
implementation that would require almost all data to be distributed over
the whole network.
2.8 Static Kalman filter
As the static Kalman filter given by (1.43) simply consists of a single
line made up of matrix and vector multiplications, the most efficient
way of implementing this is by using a multi-threaded optimized linear
algebra library such as e.g. BLAS. A BLAS-based implementation has
been performed and the speedup curves from the execution are presented
in Fig. 2.5. The gained speedup varies significantly depending on the
problem size. A discussion regarding this is given below.
As the work-overhead ratio increases with an increasing problem size,
a better scalability is obtained for the values of n up to n = 1000.
As a consequence of memory bandwidth saturation, the speedup drops
drastically for larger n. The hardware used provides 16 MB of cache. A double precision n × n matrix requires 8n^2 bytes of memory. Solving the equation 16 MB = 8n^2 B yields n ≈ 1400. Hence, for n > 1400,
the matrix will not fit into the cache and has to be brought from the
main memory on every iteration, in which case the memory bandwidth
becomes a bottleneck.
To understand the dip in the speedup curve for n = 1000 in Fig. 2.5, a more detailed model of the architecture has to be employed. The machine on which the code is executed provides 8 cores, but consists of two Nehalem quad-cores, each having an 8 MB cache, as shown in Fig. 2.4. The first 4 threads have been scheduled to run on one quad-core, and threads 5 to 8 have been scheduled to run on the other quad-core. When 4 cores are used, there will be only 8 MB of cache available, in which case the 1000^2 · 8 B = 8 MB of data indeed will not fit, as some space is occupied by other data. However, using more than 4 cores, 16 MB of cache will be available, the data will fit into the cache, and hence the program will not be limited by the memory bandwidth. The situation
can of course be resolved by scheduling threads 1 and 2 to run on one
of the quad cores, and threads 3 and 4 on the other, in which case the dip would disappear. However, the scheduling has been kept as it is to demonstrate a phenomenon that gives insight into the problems that can occur in a parallel implementation. Note, though, that the execution times are very low.

Figure 2.4. Memory connections for an octa-core machine consisting of two quad-core CPUs. Memory is given by the blocks; the CPUs are marked by gray circles.

Figure 2.5. Speedup curves for parallel execution of the static Kalman filter for different problem sizes n. For reference, linear speedup is marked by the dashed line.

Table 2.3. Single core execution time T for different problem sizes N, for execution of the static Kalman filter.

N        100      200      500      1000     1500     2000     5000
T [ms]   0.0038   0.0169   0.1068   0.5250   1.6830   2.9590   18.1081
2.9 Extended Kalman filter
The extended Kalman filter (EKF) is based on the same computations
as the Kalman filter, with an additional linearization step (1.44), (1.45).
The linearization step can be completely parallelized. Assume that n/M
and p/M are integers and let

f_{t,i} = [ f_{(n/M)(i−1)+1}(x)  f_{(n/M)(i−1)+2}(x)  · · ·  f_{(n/M)i}(x) ]^T,
h_{t,i} = [ h_{(p/M)(i−1)+1}(x)  h_{(p/M)(i−1)+2}(x)  · · ·  h_{(p/M)i}(x) ]^T.

Processor i will then compute

F_{t,i} = ∂f_i(x)/∂x evaluated at x = x̂_{t−1|t−1},
H_{t,i} = ∂h_i(x)/∂x evaluated at x = x̂_{t|t−1},

and the complete Jacobians are given by

F_t = [ F_{t,1}^T  F_{t,2}^T  · · ·  F_{t,M}^T ]^T,    H_t = [ H_{t,1}^T  H_{t,2}^T  · · ·  H_{t,M}^T ]^T.
Provided that the matrix Ft possesses a banded structure, the same
method as for parallelization of the original Kalman filter can then be
applied to the linearized system. This might be a restrictive assumption; however, an important special case occurs when the Jacobians are sparse
matrices with zeros located at the same positions at each time step t.
For this case, a transformation that optimizes the band structure of the
matrices can be applied as discussed in Sec. 2.3.
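As an illustration of the parallel linearization, the row blocks of a Jacobian can, for instance, be formed by forward differences with each thread handling its own block of rows. The sketch below is illustrative only: f_component is an assumed user-supplied routine returning the i-th component f_i(x), h is a user-chosen difference step, and the thesis itself does not prescribe numerical differentiation.

#include <stdlib.h>
#include <string.h>
#include <omp.h>

/* Forward-difference Jacobian of f: R^n -> R^n, parallelized over rows so
 * that each thread computes a block of rows of Fjac (row-major, n-by-n). */
void jacobian_rows(int n, const double *x, double h, double *Fjac,
                   double (*f_component)(int, const double *, int))
{
    #pragma omp parallel
    {
        double *xp = malloc(n * sizeof(double));  /* thread-private work copy */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            double fi = f_component(i, x, n);     /* f_i at the nominal point */
            memcpy(xp, x, n * sizeof(double));
            for (int j = 0; j < n; j++) {
                xp[j] = x[j] + h;                 /* perturb one coordinate   */
                Fjac[i * n + j] = (f_component(i, xp, n) - fi) / h;
                xp[j] = x[j];                     /* restore it               */
            }
        }
        free(xp);
    }
}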
2.10 Conclusions
Parallel multicore implementation of the Kalman filter is studied. An
implementation based on parallelization of each step using BLAS is compared to an implementation that exploits a banded structure of the system matrix. It is shown that for systems which can be realized with
a banded system matrix, the KF can be almost completely parallelized
with a very restricted amount of inter-processor communication. Application to a radio interference power estimation problem demonstrated a
linear speedup in the number of cores used, for state numbers that are
becoming relevant in smart-phone dominated traffic scenarios.
A BLAS-based parallelization of the static Kalman filter was also performed, and it could be concluded that a linear speedup is achievable for larger problem sizes (N up to about 1000).
Chapter 3
Parallel implementation of the Kalman filter as a parameter estimator
3.1 Introduction
Many algorithms for real time parameter estimation are based on the
Kalman filter. For instance, in echo cancellation, one seeks to estimate in real time the coefficients of a finite impulse response (FIR)
filter with thousands of taps, to model an acoustic channel [38]. In
such applications, the KF exhibits superior convergence speed, tracking performance and estimation accuracy compared to e.g. Normalized
Least Mean Squares algorithm and Averaged Kalman Filter Algorithm
(AKFA) [113], [25]. Further, the KF outperforms the Recursive Least
Squares algorithm (RLS) in tracking time-varying parameters since the
underlying mathematical model for the latter assumes the estimated
parameters to be constant. The KF also offers, relative to RLS with forgetting factor, the benefit of individually setting time variation of states,
see e.g. [100].
In this section, efficient parallelization of the KF as a parameter estimator, executed on a shared-memory multicore architecture is studied
and exemplified by an adaptive filtering application. The parallelization
is achieved by re-ordering the KF equations so that the data dependencies are broken and allow for a well-parallelized program implementation
that has the potential to exhibit linear speedup in the number of used
cores. Analysis of the resulting algorithm brings about an estimate of
the memory bandwidth necessary for a realization of this potential on a
multicore computer.
3.1.1 System model and the Kalman filter
When employing the Kalman filter as a parameter estimator, the parameters are modelled by the random walk model
θt+1 = θt + εt,
yt = ϕTt θt + et .
Here yt is the scalar measured output, ϕt ∈ R^N is the (known) regressor vector that depends on the data up to time t − 1, θt ∈ R^N is the time-varying vector of N parameters to be estimated, εt is the process
noise, et is the measurement noise and t is discrete time. This description includes parameter estimation for any linear single output system,
but also a broad class of nonlinear systems that are linear in unknown
parameters. An important property of the regressor model that will be
utilized further is that the regressor vector ϕt only contains data from
time t − 1. The Kalman filter equations for estimation of θt (see e.g.
[100]) can be written as:
θ̂t = θ̂t−1 + Kt[yt − ϕt^T θ̂t−1],    (3.1)
Kt = Pt−1 ϕt / (rt + ϕt^T Pt−1 ϕt),    (3.2)
Pt = Pt−1 − (Pt−1 ϕt ϕt^T Pt−1)/(rt + ϕt^T Pt−1 ϕt) + Qt,    (3.3)
where θ̂t ∈ RN is the estimate of θt , Kt ∈ RN is the Kalman gain,
Pt ∈ RN ×N is the error covariance matrix, rt ∈ R is the measurement
noise variance V(et) and Qt ∈ R^{N×N} is the covariance matrix of the process noise V(εt). A priori estimates of θ0 and P0 are taken as initial
conditions, if available. Otherwise it is standard to use θ0 = 0 and
P0 = ρI where ρ is some ”large” number.
3.2 Implementation
In this section, computer implementation of the KF equations (3.1)-(3.3) is discussed. First, a straightforward implementation will be presented and its drawbacks will be explained. Thereafter it will be
shown how these drawbacks can be remedied by a simple reordering
of the equations, allowing for a well-parallelized algorithm suitable for
multicore and, possibly, for distributed systems.
3.2.1 Straightforward implementation
To minimize the computational redundancy in (3.1)-(3.3), the common terms Ct = Pt−1 ϕt, bt = ϕt^T Pt−1 ϕt = ϕt^T Ct and dt = rt + ϕt^T Pt−1 ϕt = rt + bt are first calculated. This results in Alg. 9. The corresponding pseudocode is provided in Alg. 10.

Algorithm 9 Straightforward implementation of (3.1)-(3.3)
• Ct = Pt−1 ϕt
• bt = ϕt^T Ct
• dt = rt + bt
• Pt = Pt−1 − Ct Ct^T/dt + Qt
• ŷt = ϕt^T θ̂t−1
• θ̂t = θ̂t−1 + (Ct/dt)[yt − ŷt]
As mentioned, such an implementation has drawbacks. Assume that
θt is of length N = 2000, a not uncommon size for, say, adaptive filtering in acoustics. P would then require N^2 · (8 B) = 32 MB of storage
(assuming double precision, 8 B per element), which is too large to fit
into the cache (recall that the cache size is typically a few MB). Thus
to calculate C in Alg. 10, the elements of Pt−1 will be brought into
the cache as they are requested. Eventually, the elements of Pt−1 that
were first brought in will be substituted by the elements currently in
use. When the program later arrives at the calculation of Pt , the elements of Pt−1 must be brought in once again. Since P is of considerable
size, bringing it to the cache twice leads to a substantial increase in the
execution time.
3.2.2 Reordering of the equations for efficient memory utilization
The reordering is based on the observation that ϕt+1 depends only on
the data from time t, and can thus be made available at time step t.
This observation enables the reformulation of Alg. 9 as Alg. 11. Why
such a reordering would improve the performance becomes clear from the pseudocode given in Alg. 12, where it can be seen that once an element of P is brought into the memory, it will be used to accomplish all calculations it is involved in. Therefore, squeezing the P matrix through the memory twice at each iteration is no longer needed.
3.2.3 Utilizing the symmetry of P
If P0 is symmetric, it can be seen from (3.3) that P will stay symmetric
through the recursions. This should be taken advantage of, since approximately half of the calculations and memory storage can be spared.
Algorithm 10 Pseudocode for implementation of Alg. 9
• for i = 1 : N
– for j = 1 : N
∗ Ct (i) = Ct (i) + Pt−1 (i, j)ϕt (j)
– end
– bt = bt + ϕt (i)Ct (i)
– ŷt = ŷt + ϕt (i)θ̂t−1 (i)
• end
• dt = rt + bt
• for i = 1 : N
– for j = 1 : N
∗ Pt (i, j) = Pt−1 (i, j) − Ct (i)Ct (j)/dt + Qt (i, j)
– end for
– θ̂t (i) = θ̂t−1 (i) + (Ct (i)/dt )[yt − ŷt ]
• end for
Algorithm 11 Reorganized implementation of Alg. 9
• dt = rt + bt
• θ̂t = θ̂t−1 + (Ct /dt )[yt − ŷt ]
• Pt = Pt−1 − Ct CTt /dt + Qt
• Ct+1 = Pt ϕt+1
• ŷt+1 = ϕTt+1 θ̂t
• bt+1 = ϕTt+1 Ct+1
Algorithm 12 Pseudocode of memory efficient implementation.
• dt = rt + bt
• for i = 1 : N
– θ̂t (i) = θ̂t−1 (i) + (Ct (i)/dt )[yt − ŷt ]
– for j = 1 : N
∗ Pt+1 (i, j) = Pt (i, j) − Ct (i)Ct (j)/dt + Qt (i, j)
∗ Ct+1 (i) = Ct+1 (i) + Pt+1 (i, j)ϕt+1 (j)
– end for
– ŷt+1 = ŷt+1 + ϕt+1 (i)θ̂t (i)
– bt+1 = bt+1 + ϕt+1 (i)Ct+1 (i)
• end for
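For concreteness, a minimal C sketch (not the thesis implementation) of the fused loop of Alg. 12, for the version using the full P matrix, could look as follows; the array names, the in-place update of P, and the calling convention are illustrative assumptions.

#include <string.h>

/* One iteration of the reordered KF parameter estimator (Alg. 12).
 * P and Q are N-by-N, row-major; C holds C_t; C_next, yhat_next and
 * b_next receive the quantities needed at the next time step. */
void kf_param_iteration(int N, double *P, const double *Q, double *theta,
                        const double *C, double r, double b, double yhat,
                        double y, const double *phi_next,
                        double *C_next, double *yhat_next, double *b_next)
{
    double d = r + b;                                   /* d_t = r_t + b_t */
    memset(C_next, 0, N * sizeof(double));
    *yhat_next = 0.0;
    *b_next = 0.0;
    for (int i = 0; i < N; i++) {
        theta[i] += (C[i] / d) * (y - yhat);            /* parameter update */
        for (int j = 0; j < N; j++) {
            /* covariance update; each row of P is touched only once */
            P[i * N + j] += -C[i] * C[j] / d + Q[i * N + j];
            C_next[i] += P[i * N + j] * phi_next[j];    /* C_{t+1}(i)       */
        }
        *yhat_next += phi_next[i] * theta[i];           /* ŷ_{t+1}          */
        *b_next    += phi_next[i] * C_next[i];          /* b_{t+1}          */
    }
}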
Ct(i) can be rewritten to be calculated from only upper triangular elements as

Ct(i) = Σ_{j=i}^{N} Pt(i, j)ϕt(j) + Σ_{j=1}^{i−1} Pt(j, i)ϕt(j).
An implementation making use of only the upper triangular part of P
can thus be obtained by changing the j-loop in Alg. 12 to:
• for j = i : N
– Pt+1 (i, j) = Pt (i, j) − Ct (i)Ct (j)/dt + Qt (i, j)
– Ct+1 (i) = Ct+1 (i) + Pt+1 (i, j)ϕt+1 (j)
– Ct+1 (j) = Ct+1 (j) + Pt+1 (i, j)ϕt+1 (i)
• end for
3.2.4 Parallel implementation
Let M be the number of CPUs used for the implementation. It can be
observed by examining Alg. 12 that there are no dependencies between
i-loop iterations, except for the adding up of ŷt+1, bt+1 and Kt+1. Such dependencies are easily broken by using a reduction. In a reduction, CPU m calculates the local contribution s_m of the sum, which is later added up in a sequential section to give the global sum S = Σ_{m=1}^{M} s_m.
By doing so, a parallelization can be achieved by splitting the i-loop in
equally large chunks of size N/M (assumed to be integer), and letting
each CPU process one of the chunks.
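The reduction mechanism itself can be illustrated with a minimal, self-contained C/OpenMP example (not from the thesis): each thread accumulates a local partial sum, and OpenMP adds the partial sums into the global result at the end of the loop.

#include <omp.h>

/* Dot product with an OpenMP reduction: each thread forms a local
 * contribution s_m of the sum, and the partial sums are combined into
 * the global result s when the parallel loop completes. */
double dot_reduction(int n, const double *a, const double *b)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}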
For the algorithm utilizing only the upper triangular part of P, there
is an issue of splitting the workload among the CPUs. Splitting over
the i-index would result in an unevenly distributed workload since the
j-loop ranges from i to N. Moreover, the splitting shall preferably be
done so that each CPU can hold locally as much of the data as possible.
This can be achieved by the following splitting. First map the upper
diagonal elements of P to a rectangular matrix P′ of size N × (N/2 + 1), where the mapping from an element in P to element (i, j) in P′ is given by

P′(i, j) = P(i, (i + j − 1) mod N),    1 ≤ i ≤ N,  1 ≤ j ≤ (N/2 + 1).

Notice that this matrix contains N/2 elements more than necessary. The upper triangular block of P contains N(N + 1)/2 elements and P′ thus has N(N/2 + 1) − N(N + 1)/2 = N/2 elements extra. This is to avoid the use of if-statements in the implementation and hence allow for better use of the pipeline in the CPU. An example for N = 6 is given below. Notice that P′ can be said to contain only upper diagonal elements since P(i, j) = P(j, i).

\[
P = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} & \cdot & \cdot \\
\cdot & p_{22} & p_{23} & p_{24} & p_{25} & \cdot \\
\cdot & \cdot & p_{33} & p_{34} & p_{35} & p_{36} \\
p_{41} & \cdot & \cdot & p_{44} & p_{45} & p_{46} \\
p_{51} & p_{52} & \cdot & \cdot & p_{55} & p_{56} \\
p_{61} & p_{62} & p_{63} & \cdot & \cdot & p_{66}
\end{bmatrix}
\;\rightarrow\;
P' = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{22} & p_{23} & p_{24} & p_{25} \\
p_{33} & p_{34} & p_{35} & p_{36} \\
p_{44} & p_{45} & p_{46} & p_{41} \\
p_{55} & p_{56} & p_{51} & p_{52} \\
p_{66} & p_{61} & p_{62} & p_{63}
\end{bmatrix}
\]

The redundant elements of P′ are in the last half of the last column, which is equal to the first half of the last column. The same mapping is applied to Q to yield Q′.
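In 0-based C indexing, the same mapping can be sketched as follows (an illustrative snippet, not the thesis code); Pp denotes the packed N × (N/2 + 1) matrix P′, and the symmetry P(i, k) = P(k, i) is what makes the wrapped entries available.

/* Pack the upper triangle of a symmetric N-by-N matrix P (row-major) into
 * an N-by-(N/2+1) matrix Pp, following P'(i,j) = P(i, (i+j) mod N) in
 * 0-based indexing. */
void pack_symmetric(int N, const double *P, double *Pp)
{
    int w = N / 2 + 1;                       /* columns of P'               */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < w; j++) {
            int k = (i + j) % N;             /* column of P being stored    */
            Pp[i * w + j] = P[i * N + k];    /* equals P[k*N + i] by symmetry */
        }
}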
Splitting these calculations over the i-index, so that CPU m will loop from i_{1,m} = (N/M)(m − 1) + 1 to i_{2,m} = (N/M)m, gives a parallel implementation described in Alg. 13, where superscript (m) denotes a variable local to CPU m.
3.3 Analysis of Algorithm 13
3.3.1 Sequential and parallel work
For one iteration of Alg. 13, 2M − 1 + N(M − 1) FLOPs are executed sequentially, which is negligible, assuming that N is of considerable magnitude, compared to the 10(N^2 + N) FLOPs that are executed in parallel. Further, the computational load performed in parallel is perfectly
balanced, i.e. each processor will perform an equal amount of work in
the parallel section.
Algorithm 13
• Sequential
  – ŷt = Σ_{m=1}^{M} ŷt^(m)
  – bt = Σ_{m=1}^{M} bt^(m)
  – Ct = Σ_{m=1}^{M} Ct^(m)
  – dt = rt + bt
• CPU m (in parallel)
  – for i = i_{1,m} : i_{2,m}
    ∗ θ̂t(i) = θ̂t−1(i) + (Ct(i)/dt)[yt − ŷt]
    ∗ for j = 1 : (N/2 + 1 − ⌊2i/N⌋)
      · k = (i + j) mod N
      · Pt+1(i, j) = Pt(i, j) − Ct^(m)(i) Ct^(m)(k)/dt + Qt(i, j)
      · Ct+1^(m)(i) = Ct+1^(m)(i) + Pt+1(i, j) ϕt+1(k)
      · Ct+1^(m)(k) = Ct+1^(m)(k) + Pt+1(i, j) ϕt+1(i)
    ∗ end for
    ∗ ŷt+1^(m) = ŷt+1^(m) + ϕt+1(i) θ̂t(i)
    ∗ bt+1^(m) = bt+1^(m) + ϕt+1(i) Ct+1^(m)(i)
  – end for
3.3.2 Communication and synchronization
The proposed algorithm exhibits a large degree of data locality. Most
importantly, each CPU will only access a part of P, consisting of N(N + 1)/(2M) elements, implying that it can be stored locally and no parts of
P will have to be communicated among the CPUs.
The variables that are involved in a reduction, i.e. C, ŷ and b, which
consist of (N/2 + 1) + N/M + 2 elements, have to be communicated from
the parallel to the sequential section. In the worst case scenario (M = 2),
this becomes (N/2 + 1) + N/2 + 2 = N + 3 elements. Since double
precision is assumed (8 B per element), this means that for N = 2000,
(8 B)(2000 + 3) ≈ 16 kB will need to be communicated, certainly not a
large amount. The data to be communicated from the sequential to the
parallel section are C, ŷ, b and the additional values of ϕt+1 .
Synchronization is required at the end of each iteration. The overhead inflicted by this event is independent of N and depends only on
the number of CPUs used; the more processors are involved, the more
expensive the synchronization is. However, the relative cost of synchronization becomes less for larger N and the synchronization overhead has
smaller influence on the overall execution time.
3.3.3 Memory bandwidth
The memory bandwidth needed by the algorithm to perform niter iterations in ttot seconds can be estimated as follows. The only data
structures of considerable size in the algorithm are P and Q. Studying
how these are transferred from the RAM to the CPU gives a good estimate of the required memory bandwidth. If the matrices P and Q have
a size of s(P) and s(Q) bytes respectively, transferring them from the
RAM to the CPUs at each iteration requires a memory bandwidth of
B = [s(P) + s(Q)] · niter / ttot.    (3.4)
Even though Qt is a matrix of size N × N , it is very often selected
to be diagonal or sparse. This means that in most practical cases the required bandwidth is about half of that stated by (3.4).
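As an illustrative calculation (the iteration rate is a hypothetical figure): for N = 2000, the full P matrix occupies s(P) = 8 · 2000^2 B = 32 MB, as noted in Sec. 3.2.1; with a diagonal Q, performing niter = 100 iterations in ttot = 1 s would, by (3.4), require a memory bandwidth of roughly 3.2 GB/s.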
As for any other parallel algorithm, one could thus not expect the
above algorithm to scale well for a too large or too small problem size
N . For small N , the parallel overhead will become a bottleneck, while
for large N the available memory bandwidth might strangle the performance.
3.3.4 Cache miss handling
In a cache-based system, it is of utmost importance to avoid cache
misses to get good performance. One of the main points in the reorganization yielding Alg. 11 is to minimize the cache misses for P. Because of
the reorganization the optimal strategy for minimizing the cache misses
becomes simple. For the matrix P, each element will only be used once
in each loop iteration. There is thus no reason to store any of it in the
cache. The remaining variables claim a negligible space of (3N + 3) · 8B.
Since they are reused several times in one iteration, they should be stored
in the cache. For instance, with N = 8000, which is considered a large N, they will require 190 kB of storage. This is on the order
of 0.1 % of a cache of a few MB. The strategy is thus to store everything
except P in the cache, unless all data fits in the cache completely. In
the latter case all the data should certainly be kept in the cache.
3.4 Results
All calculations were carried out using double precision. The test data
came from a simulation and were the same for all runs. Program compilation was performed with the PGI compiler, and full compiler optimization was used for all the algorithms. OpenMP [1] was used for
parallelization. This allowed the program to be executed in parallel by
adding a single extra code line telling the compiler to run the outer
i-loop in parallel and perform the required reductions. The matrix Q
was diagonal. To evaluate the improvement gained by reorganizing the
equations, Alg. 10 was compared to Alg. 12. The rest of the experiments
were devoted to the algorithm of main interest, i.e. Alg. 13. The memory bandwidths of the computers Kalkyl and Grad were also evaluated,
to enable further analysis.
3.4.1 Execution time and speedup
Table 3.1 shows execution times for the memory-efficient algorithm, Alg. 12, the memory-inefficient algorithm, Alg. 10, and the parallelizable implementation, Alg. 13, tested on Grad and Kalkyl. Speedup curves for
Alg. 13 are plotted in Fig. 3.1.
3.4.2 Memory Bandwidth
Tab. 3.2 shows estimates of the required memory bandwidth Blin(N, M) needed to achieve linear speedup for problem size N using M processors.
Table 3.1. Execution times in seconds for 50 iterations of Alg. 10, Alg. 12 and Alg. 13, executed on a single core on Grad and Kalkyl.

            Grad                              Kalkyl
N       Alg. 10   Alg. 12   Alg. 13       Alg. 10   Alg. 12   Alg. 13
500     0.12      0.063     0.021         0.12      0.051     0.028
1000    0.22      0.11      0.073         0.20      0.11      0.089
2000    1.06      0.60      0.33          0.99      0.56      0.34
4000    4.42      2.49      1.37          3.92      2.08      1.31
8000    17.55     9.60      5.51          16.52     8.45      5.54
Figure 3.1. Speedup for Alg. 13, executed on Grad (upper) and Kalkyl (lower). For reference, linear speedup is marked by the dashed line.
Table 3.2. Theoretically evaluated bandwidth (in GB/s) required to obtain linear speedup of Alg. 13 executed on Grad and Kalkyl.

Grad
M\N    500       1000      2000      4000      8000
1      2.4095    2.7255    2.4038    2.3229    2.3207
2      4.8190    5.4509    4.8075    4.6458    4.6414
4      9.6381    10.9019   9.6151    9.2915    9.2828
8      19.2762   21.8037   19.2302   18.5831   18.5657

Kalkyl
M\N    500       1000      2000      4000      8000
1      1.7585    2.2470    2.3512    2.2636    2.3086
2      3.5169    4.4940    4.7023    4.5271    4.6173
4      7.0338    8.9881    9.4046    9.0542    9.2345
8      14.0677   17.9761   18.8093   18.1085   18.4690
These values were obtained by applying (3.4) to the data in Tab. 3.1 to calculate Blin(N, 1), with further extrapolation for M ≥ 1, i.e. Blin(N, M) = M · Blin(N, 1).
3.5 Discussion
It can be seen from Tab. 3.1 that the memory-efficient algorithm, Alg. 12,
executes about twice as fast as the memory-inefficient algorithm, Alg. 10,
on both systems (Grad and Kalkyl). Comparing execution times for
Alg. 12 and Alg. 13 in Tab 3.1, it can also be concluded that the execution time for the algorithm utilizing the symmetry of P runs, as
expected, about twice as fast as the algorithm using the whole P matrix.
Speedup curve for Kalkyl
Since linear speedup is obtained for all values of N , there is apparently
neither problem with synchronization overhead for small values of N nor
memory bus saturation for larger values of N . This is further confirmed
by Tab. 3.2 where none of the elements exceeds the available bandwidth
of 23 GB/s. Even super-linear speedup for small values of N can be
observed. This is due to good cache performance. With the work distributed among several cores, each core needs to access a smaller amount
of data, which will fit more easily into the cache and result in a better overall
throughput. For a more extensive explanation of this phenomenon, see
e.g. [33].
Speedup curve for Grad
In the speedup curve for Grad, bad scaling for N = 500 and N = 1000 is
observed. This is due to the synchronization overhead that constitutes
a disproportionally large part of the execution time. Also in Tab. 3.2,
there are indications that the memory bus would be saturated for N =
{500, 1000, 2000} and M = {4, 8} since the available bandwidth of 5.5
GB/s would be exceeded for these entries. However, no saturation can
be seen in the speedup curves and almost linear speedup is obtained
for N = 2000. One possible explanation for this discrepancy is that the analysis in Section 3.3.3 assumes that P is transferred from the RAM
to the CPU at each iteration. For N ≤ 2000, the size of P satisfies
s(P) ≤ 16 MB. Since there are 24 MB cache available running on 8
cores, the whole P matrix will remain in the cache memory between
iterations, avoiding the need of fetching it from the RAM, creating an
illusion of a larger memory bandwidth. For N ≥ 4000, s(P) ≥ 64 MB,
which is larger than the available cache of 24 MB, the whole matrix
must be brought to the cache from the RAM at every iteration. At this
point, the memory bandwidth really becomes a bottleneck. Indeed, the
entries in Tab. 3.2 corresponding to N = {4000, 8000} and M = {4, 8}
do not align with the linear speedup for N ≥ 4000. Therefore, on this
hardware and using the proposed KF algorithm, more bandwidth than
the available 5.5 GB/s is needed to achieve a linear speedup.
3.5.1 Conclusions
Through test runs on two different shared-memory multicore architectures, it is found that a Kalman filter for adaptive filtering can be efficiently implemented in parallel by organizing the calculations so that
the data dependencies are broken. The proposed algorithm executes
about twice as fast on a single core as a straightforward implementation
and is capable of achieving linear speedup in the number of cores used.
However, since the KF involves relatively simple calculations on large
data structures, it is required that the hardware provides enough memory bandwidth to achieve linear speedup. This is an inherent problem of
the KF itself and not caused by the proposed parallelization algorithm.
Chapter 4
Parallel implementation of the particle filter
The PF solves the recursive Bayesian estimation problem approximately
with Monte Carlo simulation and provides a general framework for
nonlinear/non-Gaussian dynamic systems. Navigation, positioning, tracking, communication, economics and also computer vision are some application areas where PFs have been applied and proved to perform well.
A special case in tracking applications is the so-called Bearings-Only
Tracking (BOT) problem. This is a scenario often occurring in defense
applications where only the normalized angle to a maneuvering target,
relative to some given reference angle, is measured. The BOT problem
is inherently nonlinear with observability issues and is typically solved
with respect to a set of constraints representing e.g. a geographical map.
This is an application where the PF is intensively used.
A well-known drawback of the PF is that good estimation accuracy
requires a large number, often thousands, of particles which makes the
algorithm computationally expensive. Further, in tracking applications,
the particle filter often constitutes only a part of a complete tracking system containing interacting multiple model and joint probabilistic data association algorithms, communication of measurements, constraint handling etc. Time limits in such systems are often tight and it
is desirable to optimize each part in terms of execution time.
Since the PF algorithm is to a large extent parallel, it is natural
to turn to parallel implementations to improve execution times so that
real-time feasibility is achieved. It is therefore instructive and motivated
to study the speedup and tracking performance of existing parallel PF
algorithms implemented on a multicore architecture. By implementing a
PF in parallel, a real-time feasible, powerful, energy effective and cheap
85
filter that can handle a broad class of nonlinear dynamic systems with
constraints is obtained.
In this section, four different existing parallel PFs, namely global
distributed PF (GDPF) [9], resampling with non-proportional allocation
filter (RNA) [12], resampling with proportional allocation (RPA) [12]
and the Gaussian PF (GPF) [56] are compared in tracking accuracy and
speedup at solving a testbed BOT problem. The filters are implemented
on a shared memory multicore computer, using up to eight cores.
4.1 The particle filter
The idea of PF is to recursively obtain a weighted sample from p(xt |Yt )
by Monte Carlo simulation and evaluate an estimate x̂t of xt from it.
For the general PF algorithm see for instance [72], [7]. Assume that
(i)
(i)
at time step t − 1 the particle set St−1 = {xt−1 , wt−1 }N
i=1 constitutes
(i)
a weighted sample from p(xt−1 |Yt−1 ), where xt−1 is the i-th particle
(i)
with associated weight wt−1 . Given St−1 , a sample from p(xt |Yt−1 ) is
obtained by propagating each particle through system equation (1.21),
i.e.
(i)
(i)
(i)
xt = ft−1 (xt−1 , vt−1 ),
(4.1)
(i)
where vt−1 is a draw from p(vt−1 ). This corresponds to the prediction
step in recursive Bayesian estimation. The measurement yt is then used
to update the weights by
(i)
(i)
(i)
wt = wt−1 p(yt |xt ),
(4.2)
which corresponds to the update step in recursive Bayesian estimation.
(i)
(i)
The two steps above yield the particle set St = {xt , wt }N
i=1 at time
step t. By iterating (4.1) and (4.2), samples from p(xt |Yt ) are thus
recursively obtained and can be used to produce an estimate x̂t of the
state xt as for instance the mean
x̂t = E(xt ) ≈
N
(i) (i)
w̃t xt ,
(4.3)
i=1
(i)
(i)
where w̃t = wt /
N
(i)
wt , is the i-th normalized weight. The recursion
i=1
is initialized by making N draws from an a priori distribution p(x0 ).
Resampling is used to avoid degeneracy of the algorithm. In resam
(i)
(i)
a
pling, a new set of Na particles St = {xt , wt }N
i=1 , is to be created and
(i)
(i) Nb
replace the old set St = {xt , wt }i=1 of Nb particles. Usually, but not
86
Algorithm 14 SIR algorithm.
[St ] = SIR[St−1 , yt ]
• FOR i = 1 : N
(i)
– (P) Propagate xt−1 (Eq. (4.1)).
(i)
– (U) Set wt according to (4.2).
• END FOR
• (R) Resample St using SR.
a
• Output the resampled set St = {xt , wt }N
i=1
(i)
(i)
necessarily, Na = Nb . Most resampling algorithms obtain the resampled
set by drawing with replacement Na samples from the set St so that
(i)
(i)
(i)
Pr(xt = xt ) = w̃t , where Pr(·) stands for probability. When resampling the information contained by the weights is replaced by particle
density. Therefore, the weights are reset, i.e. w (i) = 1/N, i = 1, .., N .
A popular resampling algorithm is Systematic Resampling (SR) [20].
The PF algorithm used in this chapter is the so-called SIR (Sampling
Importance Resampling) algorithm using SR for resampling is given by
pseudocode in Alg. 14.
Gaussian Particle Filter
Another variant of the PF is the Gaussian Particle Filter (GPF) [56].
The additional assumption made in GPF is that the posterior distribution can be approximated by a Gaussian PDF, i.e.
p(xt |Yt ) ≈ N (xt ; μt , Σt ) where
N (x; μ, Σ) =
1
−1
1
T
e− 2 (x−μ) Σ (x−μ) ,
(2π)n/2 |Σ|1/2
(4.4)
is the n-dimensional normal distribution PDF for the random variable
x, with mean μ ∈ Rn and covariance Σ ∈ Rn×n . The advantage gained
is a simpler resampling scheme and that only the estimated mean μ̂t and
covariance Σ̂t have to be propagated between iterations. These properties make the algorithm highly amenable to parallel implementation.
Estimates of μt and Σt can be obtained as the weighted sample mean
and covariance [102] given by
1 (i) (i)
wt xt ,
Wt
(4.5)
N
Wt
(i) (i)
(i)
w (xt − μ̂t )(xt − μ̂t )T ,
Wt2 − Wt i=1 t
(4.6)
N
μ̂t =
i=1
Σ̂t =
87
Algorithm 15 GPF algorithm.
[μ̂t , Σ̂t ] = GP F [μ̂t−1 , Σ̂t−1 , yt ]
• FOR i = 1 : N
(i)
– (R) Draw xt−1 ∼ N (μ̂t−1 , Σ̂t−1 )
– Perform (P) and (U) steps as in Alg. 14
• END FOR
• Calculate μ̂t and Σ̂t (Eq. (4.5) and (4.6)).
• Output estimated parameters {μ̂t , Σ̂t }.
where
Wt =
Wt =
N
i=1
N
(i)
wt ,
(i)
(wt )2 .
i=1
The GPF algorithm is described by pseudocode in Alg. 15. The drawback of the GPF is that p(xt |Yt ) must be well approximated by a Gaussian PDF which is not generally true.
4.2 Parallel algorithms
In the following the number of parallel processing units will be denoted
with M . Superscript m indicates CPU m, e.g. N (m) is the number of
(m)
particles in the local particle set S (m) = {x(m,i) , w(m,i) }N
i=1 at CPU m.
Common to all algorithms is that each CPU performs the propagation
(P) and weight update (U) steps in Alg. 14 for N (m) = N/M particles
(N/M assumed to be integer). What differs between the algorithms then
is how the resampling step (R) is handled. All described algorithms also
utilize the fact that the global estimate can be calculated from the local
estimates as
x̂ =
N
M
1 (i) (i)
1 (m) (m)
w x =
W x̂ .
W
W
i=1
(4.7)
m=1
The description of the algorithms GDPF, RNA and RPA starts from a
point where it is assumed that the CPUs have a local particle set St−1
for time sample t − 1 and also have access to the measurement yt .
88
Algorithm 16 GDPF
• CPU m (in parallel)
(m)
– Perform (P) and (U) steps to obtain St .
• Sequential (one CPU only)
M
(m)
– Form St = ∪ St
m=1
.
– Calculate x̂t and resample St .
– Redistribute the resampled particles to CPUs.
Global Distributed Particle Filter (GDPF)
GDPF [9] uses a straightforward way to perform resampling. Steps performed within one iteration of GDPF are as given in Alg. 16. Since
GDPF performs exactly the same calculations as the sequential PF, it
exhibits the same accuracy. A drawback is of course a high communication demand inflicted by sending the particles back and forth between
sequential and parallel sections. Furthermore, a large part of the program (resampling) has to be executed sequentially, limiting speedup
possibilities.
Resampling with Non-proportional Allocation (RNA)
In RNA [12], resampling is performed in a suboptimal but parallel manner. Each CPU resamples the local set of particles S (m) with the locally normalized weights w̃(m,i) = w(m,i) /W (m) . To avoid disturbing
the statistics, the weights at each CPU after resampling are set so that
w(m,i) = w(m,i) W (m) /W .
A problem with RNA is that a CPU can starve, i.e. the local sum
of weights, W (m) , gets very small or even turns to machine zero. When
starving occurs, computational resources are wasted on a particle set
that provides little or no contribution to the final estimate. In [12], it is
suggested that the problem can be resolved by at every iteration letting
the CPUs exchange some portion P of their particles. For instance the
CPUs could form a ring and let CPU m send N (m) P particles to CPU
m+1, with the exception that CPU M sends to CPU 1. Steps performed
in chronological order within one iteration, organized to allow for only
one parallel section per iteration are given in Alg. 17.
Resampling with Proportional Allocation (RPA)
In RPA [12], the resampling is done by using an intermediate step in the
resampling procedure called inter-resampling.
89
Algorithm 17 Parallel RNA.
• CPU m (in parallel)
– Exchange some portion P of the particles with neighboring
CPUs.
(m,i)
(m,i)
– Set i-th weight to wt−1 = wt−1 /Wt−1 .
(m)
– Perform (P) and (U) steps to obtain St
(m)
and Wt
(m)
using the locally normalized weights w̃t
– Calculate x̂t
– Resample St
(m,i)
(m)
wt
/Wt .
(m)
.
.
(m,i)
(m,i)
– Set i-th weight to wt
(m)
= Wt
=
.
• Sequentially
– Calculate x̂t (Eq. (4.7)).
– Calculate and distribute Wt to each CPU.
During the inter-resampling stage CPU m calculates the local sum
of weights W (m) . A CPU running sequentially takes W (m) and now
treats each CPU as a particle with weight W (m) and uses the residual
systematic resampling (RSR) algorithm [11] to produce M replication
factors R(m) , m = 1, .., M , specifying how many particles CPU m should
possess after resampling. CPU m will thus produce a number of particles
proportional to W (m) . R(m) is communicated to CPU m which now
performs intra-resampling.
Intra-sampling: At each CPU, the local particle set S (m) is resampled
with N (m) input particles and R(m) output particles using systematic
resampling.
After this step, it is likely that the particles are unequally distributed
among the CPUs. Therefore load balancing is performed. CPUs with
surplus particles send their excess particles to a CPU running sequentially, which distributes them to the CPUs with lack of particles. The
number of particles that should be sent/received by CPU m is given by
D(m) = R(m) − N (m) . Steps performed at one iteration in chronological
order for RPA are given in Alg. 18.
A drawback of RPA is the unpredictability in execution time caused
by the possibly uneven distributed workload among the CPU in the
inter-resampling step, where the execution time is tied to the slowest
CPUs. The fact that there are two parallel sections with intermedi90
Algorithm 18 Parallel RPA.
• CPU m (in parallel)
(m)
– Perform (P) and (U) steps to obtain St .
(m)
– Calculate Wt
(m)
and x̂t
.
• Sequentially (intra-resampling)
– Calculate x̂t (Eq. (4.7)).
– Compute replication factors R(m) using RSR.
• CPU m (In parallel, inter-resampling)
– Resample using SR with N (m) input particles and R(m) output
particles.
– Calculate D(m) .
• Sequentially
– Use D(m) to distribute particles equally among CPUs.
ate sequential sections per iteration also requires extra overhead that
diminishes the speedup potential of the algorithm.
Gaussian Particle Filter (GPF)
The GPF is highly amenable to parallel implementation since it avoids
the sequential resampling required by SIR. To simplify the notation in
the description of the parallel implementation, the following variables
are defined
W
(m)
=
(m)
N
(w(m,i) )2 ,
i=1
W
,
−W
(m)
N
=
w(m,i) x(m,i) (x(m,i) )T ,
α =
σ (m)
W2
i=1
μ̂t−1 and Σ̂t−1 denote the estimated mean and covariance from the sequentially executed section at time t − 1. CPU m creates a local set
(m)
of particles St−1 by making N (m) draws from N (μ̂t−1 , Σ̂t−1 ), and set(m,i)
ting wt−1
= 1/N (m) , i = 1, .., N (m) . Each CPU then performs the
(m)
(m)
(P) and (U) steps for St−1 to obtain St
(m)
{μ̂t ,
(m)
σt ,
(m)
Wt ,
(m)
Wt }
(m)
. From St
(m)
μ̂t
, the quantities
(m)
are calculated, where
= x̂t . A CPU
running sequentially forms the estimated mean μ̂t via (4.7) using the
91
Algorithm 19 Parallel GPF
• CPU m (in parallel)
(m,i)
(m,i)
– Make N (m) draws xt−1 ∼ N (μ̂t−1 , Σ̂t−1 ) and set wt−1 =
(m)
1/N (m) to obtain St−1 .
(m)
– Perform (P) and (U) steps to obtain St
(m)
– Calculate {μ̂t
(m)
, σt
(m)
, Wt
(m)
, Wt
.
(m)
} from St
.
• Sequentially
– Use the obtained data to calculate μ̂t and Σ̂t using (4.7) and
(4.8).
fact that μ̂t = x̂t . An estimate of Σ is obtained by exploiting that (4.6)
could be rewritten as
Σ̂ = α
= α
N
i=1
N
w(i) (x(i) − μ̂)(x(i) − μ̂)T
[w(i) x(i) (x(i) )T − w(i) x(i) μ̂T − μ̂(w(i) x(i) )T
i=1
+w(i) μ̂μ̂T ] = α[(
M
σ (m) ) − W μ̂μ̂T ],
(4.8)
m=1
where the relationship
N
i=1
1 (i) (i)
=W
w x = W μ̂,
W
N
(i) (i)
w x
i=1
is used in the third equality. Note that the final expression in (4.8) only
(m)
(m)
(m)
(m)
makes use of the data contained in {μ̂t , σ t , Wt , Wt }M
m=1 . The
algorithm for one iteration is given in Alg. 19.
4.3 Performance evaluation
The filters were implemented in C++ on a shared memory multicore
computer, using Open MP [1] for parallelization. Tracking accuracy was
evaluated for a bearings-only tracking (BOT) application, where only
the bearings, i.e. the the normalized angles to the target, relative some
given reference angle, were measured. This is a scenario encountered
in e.g. defense applications where passive microphone array sensors are
92
×10 6
6.5031
Sensor
6.5032
y-position [m]
6.5033
Sensor
Start
6.5034
Stop
6.5035
Road
Sensor
Particle
6.5036
Estimated trajectory
Current estimate
Current position
Measurment
6.5037
1.471
1.4711
1.4712
x-position [m]
1.4713
1.4714
1.4715
×10 6
Figure 4.1. Evaluation scenario.
used for measurements. As a performance measure, the position RMSE,
taken over 200 independent simulation runs, was studied. Note that
GDPF performs the exact same calculation as the sequential PF and
can thus be taken as a performance measure for the sequential PF.
Evaluation scenario
An image of the scenario at a time instant, from a representative simulation run is shown in Fig. 4.1. The PF is tracking a target traveling
along a road, the start and stop points are marked in the figure. Two
sensors are taking noisy bearing measurements of the target. The simulated measurement for sensor i at time sample t was obtained as the
true bearing θ(i,t) corrupted by noise, i.e. z(i,t) = θ(i,t) + v (i,t) where,
v (i,t) is zero mean white noise with standard deviation σ v = 0.1 rad.
State space model
The same state space model as in [23] was used. The system equation
is given by
xt+1 =
I 2 I2 T
0 I2
xt +
I2 T 2 /2
I2 T
wt ,
93
RMSE plot, number of CPUs M = 4
GDPF
RNA−0%
RNA−10%
RNA−50%
RPA
GPF
1.5
RMSE [m]
10
1.4
10
2
10
3
10
Number of particles N
4
10
Figure 4.2. RMSE as a function of the total number of particles N . RNA-X%
denotes RNA with X% particle exchange, P . Note the log-log scale.
T
where the state vector, xt = xt , yt , ẋt , ẏt
, consists of the
Cartesian position and velocity components. I2 denotes the 2×2 identity
2 I ) and
matrix. The system noise wt is white with distribution N (0, σw
2
T is the sampling period. A single measurement taken by sensor i is
related to the state vector by
z(i,t) = g(i) (xt ) + v(i,t) ,
where g(i) (·) is the trigonometric function relating the x-y position to
the bearing, i.e. tan−1 (y/x), if the target is in the first or fourth quadrant, considering the position of sensor i as the origin of the coordinate
system, and v(i,n) is zero mean white Gaussian noise with variance σv2 .
In the simulation σw = 10 m/s2 , σv = 0.1 rad and T = 0.5 s were used.
Performance
Fig. 4.2 and Fig. 4.3 show the tracking performance using M = 4 and
M = 8 respectively. Fig. 4.4 shows the achieved speedup, in the figure
P = 0.2 was used for RNA.
4.4 Discussion
As can be seen from Fig. 4.2, GPF provides better tracking accuracy
than other filters at the given scenario, especially for a small number of
particles. It must though be noted that in the given scenario the measurement noise is Gaussian distributed which provides an almost ideal
94
Number of CPUs M = 8
GDPF
RNA−0%
RNA−10%
RNA−50%
RPA
GPF
RMSE [m]
1.5
10
1.4
10
2
3
10
4
10
Number of particles N
10
Figure 4.3. RMSE as a function of the total number of particles N . RNA-X%
denotes RNA with X% particle exchange, P . Note the log-log scale.
Number of particles N = 100
Number of particles N = 500
8
8
7
7
6
5
Speed−up
Speed−up
6
4
3
4
3
2
2
1
0
5
1
2
3
4
5
Number of cores M
6
7
1
8
1
2
Number of particles N = 1000
6
7
8
7
8
8
7
7
6
6
Speed−up
Speed−up
4
5
Number of cores M
Number of particles N = 10000
8
5
4
3
GDPF
RNA
GPF
RPA
Linear speed up
5
4
3
2
1
3
2
1
2
3
4
5
Number of cores M
6
7
8
1
1
2
3
4
5
Number of cores M
6
Figure 4.4. Speed up for N equals 100, 500, 1000 and 10000 particles. For
reference linear speed up is marked by the dashed line.
95
situation for the GPF. However, this Gaussian noise model is probably
not unrealistic in ground target BOT with fixed sensor platforms. RNA
with 0% particle exchange provides significantly lower tracking accuracy
than other filters since it suffers from CPU starvation. All other filters
have comparably the same performance as the sequential SIR filter.
For the RNA algorithm, the tracking accuracy is affected by the
(m)
amount of particle exchange P . For smaller N (m) , St gives a less accurate local approximation to p(xt |z1:t ) and stronger coupling between
local particle sets (larger P ) is required to maintain tracking accuracy.
This effect can be seen comparing RNA 10% and RNA 50 % in Fig. 4.3,
where for a small number of particles (N = 100) RNA 50% provides
better tracking accuracy that RNA 10%. This effect cannot be clearly
seen in Fig. 4.2 since the number of CPUs is less, implying larger local particle set, and less coupling is thus required to maintain tracking
accuracy.
The obtained speedup naturally depends on the number of particles
used, as can be seen from Fig. 4.4. For a small number of particles,
the parallelization becomes too fine-grained, and the benefit of using a
parallel implementation diminishes.
As expected, the speedup of GDPF is quite limited, restricted to
about 3 times, depending on the large amount of work (resampling) that
is carried out sequentially. RNA achieves speedups very close to linear
for large particle sets (N = 104 ). The speedup of RPA is substantially
less than RNA and GPF, mainly depending on the overhead caused by
the two parallel sections per iteration. GPF provides the best speedup,
almost linear in the number of cores used for large particle sets
(N = 104 ).
4.5 Conclusions for parallel implementation of the
particle filter
Simulations performed with four different parallel PF algorithms showed
that in a BOT problem the GPF gave best tracking performance while
GDPF, RNA and RPA demonstrated tracking performance comparable
to that of the sequential SIR algorithm. The drawback of the GPF is
that it requires the posterior distribution to be approximately normally
distributed, which is probably not unrealistic in ground target BOT with
fixed sensor platforms.
The obtained speedups gained on a shared memory multicore computer depend largely on the total number of used particles N . For
particle sets with N > 1000 GPF and RNA can achieve close to linear speedups in the number of cores used. The speedup obtained by
RPA is substantially lower due to less beneficial parallelization poten96
tial. GDPF has a speedup limited to about 3.5 times as a consequence
of the sequentially executed part of the algorithm. For particle sets with
N < 500, the parallelization becomes too fine-grained, and it is hard to
exceed a speedup of about 2 times using GDPF or RPA while GPF and
RNA can achieve a speedup of up to about 4 times.
The final conclusion for a parallel particle filter implemented on a
shared memory multicore computer is thus the following. If the Gaussian assumption made by GPF holds true, it would be the algorithm to
prefer since it provides best tracking accuracy and is capable of achieving close to linear speedups in the number of cores used. If the Gaussian
assumption does not hold, RNA would be the algorithm to prefer since
it as well can, without loss in accuracy compared to the sequential PF,
obtain close to linear speedups in the number of cores used.
97
Chapter
5
Solving the RBE via orthogonal
series expansions
5.1 Introduction
This chapter investigates a method to solve the RBE problem in parallel
using orthogonal basis functions. The problem under consideration is to
provide an estimate of the state vector xt ∈ Rn , given the measurements
Yt = {y1 , y2 , ..., yt }, yt ∈ Rp , of the nonlinear discrete-time system
xt+1 = f (xt , vt ),
yt = h(xt , et ),
(5.1)
(5.2)
with the process and measurement noise vt ∈ Rn , et ∈ Rp , respectively,
and t denoting discrete time. The probability density functions (PDFs)
p(vt ), p(et ) are assumed to be known but are allowed to have arbitrary
form.
Algorithms employing Fourier basis functions and wavelets have been
developed in [14], [39] and applied to filtering problems. The development here is with respect to a general orthogonal basis, and targeting in
particular the amenability to parallelization that is demonstrated and
analyzed. The favorable parallelization properties of the method stem
from the orthogonality of the basis functions. In contrast, the solutions to the RBE problem that employ non-orthogonal basis functions,
e.g. Gaussian sum filters, [4], [98], [46] parallelize poorly because of the
inherent dependencies in the computations.
Orthogonal bases have been used for a long time in statistics to estimate PDFs with general distributions, see e.g. [96],[89],[104],[27]. Using orthogonal expansions, it is possible to estimate the PDFs with a
substantially lower number of variables than e.g. the particle filter or
grid-based methods. Since much fewer variables are required to approx99
imate the PDF, a smaller computational load for the same estimation
accuracy can be expected.
The chapter structure is as follows. The method of solving the recursive Bayesian estimation problem using orthogonal series expansion is
presented in Sec. 5.2. An application to a bearings-only tracking problem as well as a speedup evaluation for a parallel implementation are
given in Sec. 5.4. The results are discussed Sec. 5.5. An analysis of the
impact of the truncation error is given in Sec. 5.6.
5.2 Solving the RBE via series expansions
In this section it is derived how to solve the RBE via orthogonal series
expansions. It will be assumed that all series are absolute convergent,
and that the product of the n-th and m-th basis functions have the
expansion
φn (x)φm (x) =
gnmk φk (x).
(5.3)
k∈Nd
As there is no closed-form analytical solution for the general case of
RBE, numerical methods have to be employed to find an estimate of
the state in (5.1). The idea advocated here is to approximate the involved PDFs by orthogonal series expansions and recursively propagate
the coefficients of the expansion in time via the prediction and update
equations (1.29)-(1.30).
Assume that p(xt |xt−1 ), p(yt |xt ) and p(xt−1 |Yt−1 ) are given by the
expansions
p(xt |xt−1 ) =
anm φn (xt )φm (xt−1 ),
(5.4)
bnm φn (yt )φm (xt ),
(5.5)
n∈Nd m∈Nd
p(yt |xt ) =
p(xt−1 |Yt−1 ) =
n∈
Nd
n∈
Nd
m∈
t−1|t−1
cn
φn (xt−1 ).
Nd
(5.6)
The target of the approximation is to compute and propagate the coeft|t
t|t−1
ficients cn over time where cn
shall be interpreted as the coefficient
with index n at time step t given data up to time t − 1. Inserting (5.4)(5.6) into the prediction and update equations (1.29) and (1.30) yields
the following relationships:
100
Prediction step
p(xt |Yt−1 ) =
p(xt |xt−1 )p(xt−1 |Yt−1 )dxt−1
d
R =
[
anm φn (xt )φm (xt−1 ) ×
Rd n∈Nd m∈Nd
k∈Nd
=
n∈
=
Nd
m∈
Nd
k∈
t−1|t−1
ck
Nd
φk (xt−1 )]dxt−1
t−1|t−1
anm ck
φn (xt ) ×
φm (xt−1 )φk (xt−1 )dxt−1
Rd
t|t−1
cn φn (xt ),
n∈Nd
with
t|t−1
cn
=
t−1|t−1
anm cm
.
(5.7)
m∈Nd
Update step
When the measurement yt becomes available, the PDF p(yt |xt ) is conditionalized to yield
p(yt |xt ) =
bnm φn (yt )φm (xt )
n∈Nd m∈Nd
=
t
fm
φm (xt ),
m∈Nd
where
t
=
fm
n∈
bnm φn (yt ).
(5.8)
Nd
The multiplication in the update step is then carried out as:
p(xt |Yt ) = γt−1 p(yt |xt )p(xt |Yt−1 )
t|t−1
= γt−1
fnt φn (xt )
cm φm (xt )
n∈Nd
= γt−1
m∈Nd
t|t−1
fnt cm φn (xt )φm (xt )
n∈Nd m∈Nd
=
γt−1
=
n∈Nd m∈Nd
t|t
γt−1
ck φk (xt ),
d
k∈N
t|t−1
fnt cm
gnmk φk (xt )
k∈Nd
101
Algorithm 20 The RBE algorithm using orthogonal basis functions.
Initialization:
0|0
φk (x0 )p(x0 )dx0
ck =
Ω
0|0
γ0 =
c n pn
n∈Nd
Recursion, t = 1, 2, ...:
t|t−1
cn
−1
= γt−1
t
fm
=
t−1|t−1
anm cm
m∈Nd
bnm φn (yt )
n∈Nd
t|t
ck
=
t|t−1 t
fn gnmk
cm
n∈Nd m∈Nd
t|t
c n pn
γt =
n∈Nd
where
t|t
ck
=
n∈
Nd
m∈
t|t−1 t
fn gnmk ,
cm
and γt is the normalization constant given by
t|t
γt =
c n pn .
n∈
(5.9)
Nd
(5.10)
Nd
where pn = Ωd φn (xt )dxt . The algorithm for propagation of the coefficients is summarized in Alg. 20.
5.2.1 Mean and Covariance
The mean and covariance for the PDF p(xt |Yt ) are typically of interest
in estimation problems. The expected value in dimension i can be calculated by marginalizing the expansion for the i-th dimension and taking
the expected value of the marginalized distribution, i.e.
E[xt,i |Yt ] =
xt,i p(xt |Yt )dxt
Ωd
t|t =
cn
xi φ(xt )dxt .
(5.11)
n∈Nd
102
Ωd
Let xi denote the i-th element of xt . The covariance between xi and
xj is given by
cov(xi , xj |Yt ) = E[xi xj |Yt ] − E[xi |Yt ]E[xj |Yt ],
where the second term is evaluated using (5.11), while the first term can
be calculated as
E[xi xj |Yt ] =
xi xj p(xt |Yt )dxt
Ωd
t|t =
cn
xi xj φn (xt )dxt .
(5.12)
n∈Nd
Ωd
5.2.2 Truncation
In practice, the infinite series must be truncated to some order N < ∞,
in each dimension.
In the update step, the order of the series expansion is doubled, in
each dimension, due to the multiplication of series. Thus, to keep the
order from growing exponentially, the series have to be truncated at
each iteration. For simplicity, the truncation is made by keeping the
first N terms. It should be noted that the truncation can result in an
approximation p̂(x) that takes on negative values, and is hence not a
PDF. However the purpose of the approximation is to make inference
about the state x, in this sense it is not worse to have e(x) = p̂(x) − p(x)
negative than having e(x) positive but merely |e(x)| is of importance,
as argued in Sec. 1.4.4.
5.2.3 Computational complexity
The expansions of the PDFs p(xt |xt−1 ) and p(yt |xt ) are assumed to be
determined beforehand and offline. The online computational cost can
be found by counting the flops required in Alg. 20, which gives a total
flop demand of
f (N, d) = 3N 3d + 4N 2d + N d − 1,
(5.13)
where is the flop cost of evaluating φn (y). For many basis functions, the
coefficients gnmk are zero, except for a few certain values of n and m.
This property can reduce the computational complexity substantially
(see Sec. 5.4.2 for an example of this, using the Fourier basis functions).
5.3 Parallel implementation
The orthogonality of the basis functions allows for the computational
load to be well separated in independent segments. Assume that M
103
Algorithm 21 Pseudo code for parallel implementation
• CPU m computes (In parallel)
t|t−1 t
t|t
– ck =
cn fm gnmk , k ∈ Nm
n∈Nm∈N
– γt (m) =
t+1|t
– cn
k∈Nm
t|t
c k pk
−1
(m) = γt−1
t (m) =
– fm
n∈Nd
t|t
m∈Nm
anm cm , n ∈ N
bnm φn (yt ), m ∈ Nm
• One CPU (Sequentially)
M
t+1|t
t+1|t
– ck
=
ck (m), k ∈ N
m=1
M
– fkt =
m=1
M
– γt =
fkt (m), k ∈ N
γt (m)
m=1
processing units are available. With N being a set of cardinality N d ,
Nm , m = 1, 2, ..., M being disjoint subsets of N of cardinality N d /M
M
(assumed to be integer) and ∪ Nm = N, pseudo-code of a parallel
m=1
implementation is given in in Alg. 21, where the computations have been
organized to allow for only one synchronization point per iteration.
5.3.1 Analysis
Counting the number of flops in Alg. 21 that are executed sequentially,
f| , and in parallel, f|| , it is found that
f| (N, d, M ) = (M − 1)(2N d + 1),
f|| (N, d) = 3N 3d + 4N 2d + N d − 1.
The sequential portion of the program is thus almost negligible compared
to the parallel one, even for small problem sizes and dimensions.
The data that have to be communicated between the processors at
t+1|t
each iteration are the elements of the local variables of ck (m), fkt (m)
and γt (m), m = 1, 2, ...M resulting in a total communication demand
per iteration b of
b(N, d, M ) = M (N d + 1) + N d .
104
(5.14)
Further, as mentioned before, only one synchronization point per iteration is required.
The parallelization thus possesses a large parallel portion relative the
sequential portion, and a small amount of communication and synchronization relative the total amount of performed computations. These
properties imply that the method have a high potential of performing
well in a parallel environment.
5.4 Numerical Experiments
A nonlinear non-Gaussian bearings-only tracking problem is studied. It
arises in defense and surveillance applications as well as in robotics. It
exhibits a severe non-linearity in the measurement equation and is known
to require nonlinear filtering to avoid divergence of the estimate, [3]. For
comparison, the filtering problem is solved both with the Fourier basis
functions and the Legendre basis functions. Numerical experiments were
conducted in order to experimentally verify the error bound derived in
the previous section, and also to explore its conservatism.
5.4.1 The system
An object traveling along a path is detected within the range xt ∈
[−π, π]. Noisy bearing measurements yt of its position xt are taken by
a sensor stationed at a distance d = 1 from the road, see Fig. 5.1. The
tracking filter employs the model
xt+1 = xt + wt ,
yt = tan−1 (xt /d) + vt ,
(5.15)
(5.16)
where wt is normally distributed with the mean μw = 0 and standard
deviation σw = 0.3. The measurement noise vk obeys the multi-modal
PDF
v−μv1 2
v−μv2 2
p2
p1
−1(
)
−1(
)
√ e 2 σv1
√ e 2 σv2 ,
+
pv (v) =
σv1 2π
σv2 2π
with p1 = 0.5, p2 = 0.5, σv1 = 0.3, σv2 = 0.3, μv1 = 0.45, μv2 = −0.45.
The system was simulated up to time step T = 40.
5.4.2 Solution using Fourier basis functions
This section presents a solution of the bearing-only tracking problem
obtained by applying the Fourier basis functions [109]
1
N −1
φn (x) = √ einx , |n| ≤
,
2
2π
105
Figure 5.1. An object with position xt traveling along a path. Noisy bearing
measurements, Yt , are taken by a sensor (black dot), positioned a distance d
from the path.
that are orthogonal over the interval [−π, π]. To obtain the basis functions that are orthogonal over an arbitrary interval, a linear transformation of x can be applied as discussed in Sec. 1.3.4.
The expected value of the approximated PDF x̂t = E[xt |Yt ] is used as
the point estimate of the state. From (5.11) and (5.12), the mean and
covariance can be calculated as
N/2
E[xt |Yt ] =
ct|t
n ϕn ,
n=−N/2
E[(xt − E[xt ]) |Yt ] = [
2
N/2
n=−N/2
−1 t|t
c ϕn ] − E[xt |Yt ]2 ,
inπ n
where ϕn is defined as
π
ϕn =
xφn (x)dx =
−π
0
(−1)n+1
√
2π
n i
if n = 0,
otherwise.
Since φn (x)φm (x) = φn+m (x) for the Fourier basis, it follows that
gnmk = δ[n+m−k], with δ[·] denoting the multivariate Kronecker delta
function. This fact reduces the computational complexity to f (N, d) =
6N 2d + N d − 1.
Fig. 5.2 depicts the true state, tangent of the measurement and the
estimated state. In Fig. 5.3, the sequence of estimated PDF:s p(xt |Yt )
using N = 15, is shown for t = 1, 2, ..., 10. For N = 15 the root mean
square error R = 0.078 was achieved. For comparison, a bootstrap
particle filter (PF) [69] was also implemented for the same problem.
Using Np = 150 particles, the minimum root mean square error of R =
0.078 was reached, and did not improve for a larger number of particles.
106
2
x
x̂
tan(y)
1.5
1
0.5
0
-0.5
-1
-1.5
-2
0
5
10
15
20
25
30
t
Figure 5.2. True state x and estimated state x̂, for time step t.
p̂(xt |Yt )
2.5
2
1.5
1
0.5
0
10
5
t
0
-3
-2
0
-1
1
2
x
Figure 5.3. p̂(xt |Yt ) plotted for t = 1, .., 10
107
Table 5.1. Single core execution time.
NT
Execution time
100
0.0021
300
0.0568
500
0.2625
1000
2.1467
8
N=100
N=300
N=500
N=1000
7
Speed up
6
5
4
3
2
1
1
2
3
4
5
Number of CPUs
6
7
8
Figure 5.4. Speedup plots for different values of N . Linear speedup is marked
by the dashed line for reference.
5.4.3 Execution time and speedup
Alg. 21 was implemented on a shared memory multicore architecture.
The execution time and scalability for different problem sizes NT = N d
were studied. Tab. 5.1 shows the execution time for single core execution
while Fig. 5.4 depicts the acheived speedup s(M ). The program was
written in C++ using OpenMP for parallelization and execution was
R
performed on a shared memory multicore processor (Quad-core Intel
Xeon 5520, Nehalem 2.26 GHz, 8MB cache). Compilation was performed
with the pgi compiler and full compiler optimization was used for all
compilations.
5.5 Discussion
5.5.1 Estimation accuracy
From the experiments it can be concluded that the method performs well
for the given problem. The RMSE R = 0.078 is reached for N = 15.
The particle filter reaches this RMSE for Np = 150. Using the approximation of 50 flops to evaluate the exponential function, and 10 flops to
generate a pseudorandom number, the PF requires about 29 times the
flops required by the orthogonal series expansion approach to achieve
108
this estimation accuracy. The PF is though less affected by the curse of
dimensionality and the gap in the computational cost thus reduces for
problems of higher dimensions. Yet, for low-dimensional problems, there
is a significant computational benefit in using the proposed method.
5.5.2 Speedup
As can be seen from the speedup plot in Fig. 5.4, the method has good
scalability and is suitable for parallelization, which is one of its main
strengths. In this particular study, a multicore processor with 8 cores
has been used and close to linear speedup is achieved. From Fig. 5.4 and
the analysis in Sec. 5.3.1, it can though be expected that the method
will scale well for more processors than 8. Further, the method has a
good potential of performing well on a computer cluster due to the low
interprocessor communication required.
5.5.3 Limitations
If p(xt |xt−1 ) and p(yt |xt ) are to be determined offline, domain Ω over
which the problem is solved must be small enough relative the variance of p(xt |xt−1 ) and p(yt |xt ). The PDFs p(xt |xt−1 ) and p(yt |xt ) will
otherwise appear as “spikes” and will demand an unreasonably high
approximation order to produce a good fit. If the expansions for the
PDFs p(xt |xt−1 ) and p(yt |xt ) are updated online, this restriction can be
dropped. Doing so will, however, require a large amount of online computation and, therefore, reduce the real-time feasibility of the method.
Similar to most of the estimation techniques, the exponential growth of
the computational complexity with the dimension is a limitation that
confines the possible applications to relatively low-dimensional ones.
5.6 An error bound
One iteraion of the prediction update recursion for the system (5.1),
(5.2) can in one line be written as
p(yt |xt )
p(xt |Yt ) =
p(xt |xt−1 )p(xt−1 |Yt−1 )dxt−1 , t = 1, 2, . . . ,
p(yt |Yt−1 )
(5.17)
where p(xt |Yt ) denotes the probability density for the state xt given the
measurements Yt . When solving the RBE problem via orthogonal series
expansions the posterior PDF p(xt |Yt ) in (5.17) is approximated by a
truncated orthogonal series expansion
109
p(xt |Yt ) ≈ p̂(xt |Yt ) =
k∈K
t|t
ck φk (xt ),
where {φk (x)} are the orthogonal basis functions and the coefficients
t|t
{ck } are recursively computed via the prediction and update equations.
Due to the truncation of the expansion an error is introduced at every
iteration. It is of interest to study how this error propagates over the
iterations, to be able to make sure that the solution obtained maintains
a reasonable approximation to the sought PDF. A worst case scenario
would be that the approximation errors, due to the truncations of the
expansions, would accumulate in such a way that p̂(xt |Yt ) is no longer
a meaningful approximation of p(xt |Yt ).
This section provides a bound on the 1-norm for the approximation
error in the PDF of the state vector conditional on the measurements,
i.e. a bound on
e(xt |Yt )1 = p(xt |Yt ) − p̂(xt |Yt )1 .
The derived bound, although not being sharp, serves as a tool to ensure
that the estimated PDF represents a sensible approximation to the true
PDF throughout the iterations. When solving the RBE with orthogonal
series expansions there is an option of which basis functions to employ.
A second investigation performed in this section is a comparison of the
method performance in a bearings-only tracking problem being solved
with the Fourier and Legendre basis functions.
For a function h approximated with a series expansion the truncated
approximation and the truncation error is denoted with ĥ and eh respectively, i.e.
h(x) =
∞
ck φk (x) =
k=0
K
k=0
∞
ck φk (x) +
ĥ(x)
ck φk (x) .
k=K+1
eh (x)
For notational tractability the recursion expressed by (5.17) will be
written with the notation
g t+1 (z) = v(y|z) f (z|x)g t (x)dx, t = 0, 1, . . . ,
(5.18)
Ω
where v(y|z), f (z|x) and g t (x) are PDFs. In this notation, the PDF
g t (z) corresponds to p(xt |Yt ) in (5.17) and is the main target of the
approximation. When solving the recursion with orthogonal basis expansions, the truncated expansions v̂(y|z), fˆ(z|x) and ĝ t (x) are used in
place of the true PDFs. It is of interest to know how the error caused by
the truncation propagates through the iterations. An expression for the
110
t+1 (z) − ĝ t+1 (z) is therefore sought. Asapproximation error et+1
g (z) = g
suming that g(x) has the same approximation order in the x-dimension
as f (z|x) does, the following two relations hold in virtue of the orthogonality of the basis functions
Ω
fˆ(z|x)eg (x)dx = 0,
ef (z|x)g(x)dx =
ef (z|x)eg (x)dx.
Ω
Ω
Then it follows that
t+1
ĝ (z) = v̂(y|z) fˆ(z|x)ĝ t (x)dx
Ω
= v̂(y|z) fˆ(z|x)[g t (x) − etg (x)]dx
Ω
t
ˆ
= v̂(y|z) f (z|x)g (x)dx − v̂(y|z) fˆ(z|x)etg (x)dx
Ω
Ω
= [v(y|z) − ev (y|z)] [f (z|x) − ef (z|x)]g t (x)dx
Ω
t
= v(y|z) f (z|x)g (x)dx − v(y|z) ef (z|x)g t (x)dx
Ω
Ω
− ev (y|z) [f (z|x) − ef (z|x)]g t (x)dx
Ω
= g t+1 (z) − v(y|z) ef (z|x)etg (x)dx
Ω
t
− ev (y|z) f (z|x)g (x)dx + ev (y|z) ef (z|x)etg (x)dx
Ω
Ω
t+1
t
= g (z) − [v(y|z) − ev (y|z)] ef (z|x)eg (x)dx
Ω
− ev (y|z) f (z|x)g t (x)dx.
Ω
This gives the expression for the approximation error
t+1
(z) − ĝ t+1 (z)
et+1
g (z) = g
= v̂(y|z) ef (z|x)etg (x)dx + ev (y|z) f (z|x)g t (x)dx. (5.19)
Ω
Ω
From (5.19) the following result can be derived:
111
Theorem 1. For etg (z) given by (5.19), it holds that etg (z)1 ≤ γt ,
t = 0, 1, . . . , where
t t
Q
rt Qt e0g 1 + Rq 1−r
if rQ = 1
1−rQ
γt = 0 (5.20)
eg + tRq
if rQ = 1
1
and
Q := max
y
q := max
y
|v̂(y|z)|dz,
|ev (y|z)|dz,
r := max|ef (z|x)|,
x,z
R := maxf (z|x).
x,z
Proof. The triangle inequality yields
t+1 eg (z) =
|et+1
g (z)|dz
1
Ω
=
|v̂(y|z) ef (z|x)etg (x)dx
Ω
Ω
+ ev (y|z) f (z|x)g t (x)dx|dz
Ω
≤ [|v̂(y|z)| |ef (z|x)||etg (x)|dx
Ω
Ω
+ |ev (y|z)| |f (z|x)||g t (x)|dx]dz
Ω
t
≤ [|v̂(y|z)|r
|eg (x)|dx + |ev (y|z)|R |g t (x)|dx]dz
Ω Ω
=
|v̂(y|z)|dz · r
|etg (x)|dx +
|ev (y|z)|dzR
Ω
Ω
Ω
≤ rQ etg (z)1 + Rq,
i.e.
t+1 eg (z) ≤ rQ etg (z) + Rq.
1
1
(5.21)
The
increasing function
right
hand side in (5.21) is amonotonically
in etg (z)1 . An upper bound γt on etg (z)1 hence obeys the recursion
γt+1 = rQγt + Rq, whose closed-form expression is given by (5.20).
Note that q, Q, r and R in (5.20) only depend on constant quantities
that can be computed offline and before the recursion starts.
112
Corollary 1. If rQ ≤ 1, ekg (z)1 is asymptotically bounded from above
Rq
.
by 1−rQ
Proof. If rQ < 1
Rq
1 − r t Qt
lim γt = lim rt Qt e0g (z)1 + Rq
=
.
t→∞
t→∞
1 − rQ
1 − rQ
(5.22)
5.6.1 Numerical experiments
The filtering problem to estimate the PDF p(xt |Yt ) for system (5.15)(5.16) was solved by using the Legendre and Fourier basis functions
(see Sec. 1.3.2). The estimated PDFs obtained by the orthogonal series
method were cross-validated against the results obtained by applying a
particle filter to the same data set, to ensure correct implementation.
The filtering problem was solved for the approximation orders N =
9 + 4k, k = 0, 1, . . . , 14. The upper bound γt (N ) on e(xt |Yt )1 was
computed according to (5.20) for each N using both the Fourier and
Legendre basis while the empirical values of ||e(xt |Yt )||1 were evaluated
as
e(xt |Yt )1 ≈ Et (N ) =
|p̂65 (xt |Yt ) − p̂N (xt |Yt )|dx,
xt ∈Ω
where p̂N (xt |Yt ) denotes the approximation of p(xt |Yt ) of the approximation order N . As p̂65 (xt |Yt ) can be considered a very close approximation to the true PDF p(xt |Yt ), Et (N ) can be deamed a good
approximation to e(xt |Yt )1 .
In Fig. 5.5 and Fig. 5.6, the empirical and theoretical bounds Et (N )
and γt (N ) are shown for N = 25, using the Fourier basis and the Legendre basis respectively, where γt (N ) denotes the theoretical bound for
an approximation order N .
For all N studied, the bound converges to the value given by (5.22)
and the value of γt (N ) is basically constant after time step t = 10 in all
cases. To illustrate the empirical and and theoretical bound for each N ,
the steady-state value γ30 (N ), the mean and maximum of the empirical
value Et (N )
μ(N ) =
1 Et (N ),
30
ρ(N ) =
max Et (N ),
40
t=11
t∈[11,40]
113
E (N)
t
γ (N)
0.5
t
k
||eg||1
0.4
0.3
0.2
0.1
0
0
5
10
15
20
time step k
25
30
35
40
Figure 5.5. Theoretical bound γt (N ) and empirically measured values of the approximation error in 1-norm, Et (N ) for the solution obtained with the Fourier
basis functions and approximation order N = 25.
25
0.6
E (N)
t
0.5
γ (N)
t
k
||eg||1
0.4
0.3
0.2
0.1
0
0
5
10
15
20
time step k
25
30
35
40
Figure 5.6. Theoretical bound γt (N ) and empirically measured values of the
approximation error in 1 norm, Et (N ) for the solution obtained with the Legendre basis functions and approximation order N = 25.
114
2
μ(N)
ρ(N)
γ(N)
||eg ||1
1.5
1
0.5
0
10
15
20
25
30
35
40
45
50
55
60
65
N
Figure 5.7. Theoretical bound γ(N ), the mean μ(N ) and maximum ρ(N ) of the
empirically measured values of Et (N ), when solving the problem with Legendre
basis functions.
were computed on the stationary interval t ∈ [11, 40]. The results are
shown for the Fourier basis and the Legendre basis in Fig. 5.7 and
Fig. 5.8, respectively.
Point estimates, x̂t = E[xt |Yt ] from the approximated PDFs were
computed. To compare and quantify the estimation quality, the root
mean square error,
! T
!1
Ermse (x̂1:T ) = "
(xt − x̂t )2 ,
T
t=1
was calculated for the estimated states and is shown in Fig. 5.9 for
different approximation orders N , for the Fourier and Legendre basis
functions.
For the particular time instant t = 25, the true PDF p(xt |Yt ) and
the estimated PDFs p̂(xt |Yt ) obtained with the Fourier and Legendre
basis functions are shown for N = 9, 25, 33 in Fig. 5.10, Fig. 5.11 and
Fig. 5.12, respectively.
5.6.2 Discussion
In the studied bearings-only tracking problem, it can be concluded that
the Fourier basis functions generally give a better approximation to the
problem than the Legendre basis functions do, which phenomenon is
especially prominent for lower approximation orders N . It can be seen
that for low N , (N = 9, Fig. 5.10), both the Fourier and Legendre
basis functions fail to capture the multi-modal shape of the true den115
2
μ(N)
ρ(N)
γ(N)
||eg ||1
1.5
1
0.5
0
10
15
20
25
30
35
40
45
50
55
60
65
N
Figure 5.8. Theoretical bound γ(N ), the mean μ(N ) and maximum ρ(N ) of the
empirically measured values of Et (N ), when solving the problem with Fourier
basis functions.
0.3
Fourier
Legendre
0.295
0.29
rmse
0.285
0.28
0.275
0.27
0.265
0.26
10
15
20
25
N
30
35
40
Figure 5.9. The root mean square error, for the estimation error as a function
of the approximation order N .
116
N=9
0.8
Legendre
Fourier
True
0.7
0.6
t 1:t
p(x |y )
0.5
0.4
0.3
0.2
0.1
0
−0.1
−3
−2
−1
0
1
2
x
t
Figure 5.10. The true PDF p(xt |Yt ) and p̂9 (xt |Yt ) for t = 25, for the Fourier
and Legendre solutions, N = 9.
N=25
0.8
Legendre
Fourier
True
0.7
0.6
t 1:t
p(x |y )
0.5
0.4
0.3
0.2
0.1
0
−0.1
−3
−2
−1
0
1
2
x
t
Figure 5.11. The true PDF p(xt |Yt ) and p̂25 (xt |Yt ) for t = 25, for the Fourier
and Legendre solutions, N = 25.
117
N=33
0.8
Legendre
Fourier
True
0.7
0.6
t 1:t
p(x |y )
0.5
0.4
0.3
0.2
0.1
0
−0.1
−3
−2
−1
0
1
2
x
t
Figure 5.12. The true PDF p(xt |Yt ) and p̂33 (xt |Yt ) for t = 25, for the Fourier
and Legendre solutions, N = 33.
sity. Yet the Fourier basis based solution yields a closer approximation
than that of Legendre functions, measured in the 1-norm of the approximation error. When N is in the medium range (N = 25, Fig. 5.11),
the Fourier basis solution gives an almost perfect approximation, while
the Legendre functions still show some difficulties in fully capturing the
multi-modality of p(xt |Yt ). For high approximation orders (N = 33,
Fig. 5.12), both the Legendre and Fourier bases produce close to perfect
approximations.
However, as can be seen from Fig. 5.9, a better PDF fit does not necessarily translate into a superior point estimate of the state x̂t . The root
mean square error for the Fourier and Legendre solutions are practically
the same for N ≥ 20, even though the Fourier basis solution provides a
better fit of the actual underlying PDF.
Another aspect that should be taken into account is the numerical
properties of the basis functions. With the Legendre basis functions it
is not possible, in the given implementation, to go above N = 65 due to
numerical problems, while no numerical problems are encountered using
the Fourier basis functions. However, as virtually perfect approximation
is reached already for N = 33, it is not an issue with the Legendre basis
solution in this case.
From Fig. 5.7 and Fig. 5.8, the bound can be seen to be close to
tight for some N values, but more conservative for other N values. For
the Legendre case, the bound is conservative for small values of N as a
118
consequence of the poorly approximated PDFs p(xt |xt−1 ) and p(yt |xt )
in some intervals. The bound accounts for the worst case effects of this
poor approximation, which scenario does not apparently realize in the
final estimate, for the particular problem and implementation at hand.
In the derivation of the bound the inequality
f (z|x)g(x)dx ≤ maxf (z|x)
R
x,z
was used. This relationship holds if f and g are PDFs, but can in some
cases to be a rather conservative bound. By imposing assumptions on
e.g. the smoothness of f and g, this bound can be tightened and hence
bring about an improvement of the final bound.
119
Chapter
6
Orthogonal basis PF
6.1 Introduction
Parallelization of the PF, as given in Chapter 4, is as a way of improving its real-time feasibility. In Chapter 4, four different parallel particle
filters: the globally distributed particle filter [8], sampling with proportional allocation and sampling with non-proportional allocation [12], as
well as the Gaussian particle filter (GPF) [56] were evaluated. It was
found that the GPF was the most suitable option for a fast and scalable
implementation. The GPF makes the approximation
p(xt |Yt ) ≈ γ(xt , μt , Σt ),
where γ(x, μt , Σt ) denotes a multivariate Gaussian PDF with mean
μt ∈ Rnx and covariance Σt ∈ Rnx ×nx . By this approximation, the
information contained in the particle set can be compressed to a few
informative coefficients (the mean and covariance), and hence efficiently
communicated among the parallel processing units. The Gaussian approximation of a posterior is though a rather restrictive assumption that
infringes upon the generality of the nonlinear filtering method.
In this chapter, a method to fit a truncated series expansion
(k)
p(xt+1 |Y t ) ≈
at+1 φk (xt+1 ),
(6.1)
k∈K
to the particle set for some set of basis functions Φ = {φk (x)}k∈ND ,
where K is the index set of the basis functions included in the expansion,
is suggested. In a sense, this method, termed here Orthogonal Basis
Particle Filter (OBPF), can be seen as an extension of the GPF, as it
reduces to the GPF with Φ chosen to be the Hermit functions basis, and
with only the first basis function used in the approximation, i.e. K =
0. By this construction the OBPF enjoys the favorable parallelization
121
properties of the GPF, as only some few coefficients {a(k) }k∈K have to be
communicated, but abolishes the restriction of the posterior distribution
being Gaussian.
The problem of fitting a series expansion to a random sample is discussed in Sec. 1.4. As noted there, a useful property of the series expansion estimator is that it possesses a convergence rate that is independent
of the dimension of the problem [90]. This is in contrast with most other
non-parametric PDF estimators (the kernel density estimator included),
whose convergence rate severely deteriorates with increasing dimension
[90]. Therefore, the orthogonal series estimator constitutes an appealing option for high-dimensional problems. Further, the series expansion
method as well exhibits beneficial interpolation properties so that less
particles are required to give an approximation to the posterior for a
given accuracy.
Modifications of GPF, such as the Gaussian sum particle filter (GSPF)
[55], allow non-Gaussian posteriors to be approximated by a Gaussian
mixture. Yet, the mixands are required to be refitted frequently if the
filter should operate near optimality [6]. This poses a severe obstacle to
an efficient parallelization, as the refitting requires access to the global
posterior distribution and hence parallelizes poorly.
To concretize the proposed method of OBPF and exemplify the developed techniques, the Hermite basis is particularly studied in this chapter
as a suitable choice of the orthogonal functional basis for OBPF. Naturally, there is no principal difference to the method with any other
orthogonal functions basis employed instead.
The chapter is organized as follows. In Sec. 6.2 notation and background material are briefly summarized. The proposed method of OBPF
is explained in Sec. 6.3, followed in Sec. 6.4 by its parallelization. Experiments validating the estimation accuracy and speedup obtained on
a shared-memory multicore processor are described in Sec. 6.6.
6.2 Background
6.2.1 The PF algorithm with importance sampling
The PF solves the recursive estimation problem by providing a weighted
(i)
(i)
sample {xt , wt }N
i=1 from the PDF p(xt |Yt ), from which the desired
information such as e.g. the minimum mean square error or the maxi(i)
mum likelihood point estimate can be extracted. The notation xt shall
(i)
be interpreted as the i-th particle at time step t and wt as the corresponding weight. The method consists of the three steps performed
recursively: prediction, update, and resampling. In the prediction and
update steps, the particles are propagated and the weights are updated
122
via the relationships
(i)
(i)
(i)
xt+1 = ft (xt , vt ),
(6.2)
(i)
wt+1
(6.3)
=
(i)
(i)
wt p(yt+1 |xt+1 ),
(i)
i = 1, 2, ..., N , where vt ∼ pv (v) and the weights are normalized to
sum up to one at the end of the iteration.
In the resampling that is included to avoid depletion of efficiency in
(i)
(i)
the particle set [69], a new set of particles {xt , wt }N
i=1 is created by
making a draw from p̂(xt |Yt ). In general, it is not possible to sample
directly from p̂(xt |Yt ). Different methods however exist to achieve this
goal, importance sampling being one of them, see e.g. [71], [57]. A
(i)
(i)
weighted sample {xt , wt }N
i=1 is then obtained by sampling from some
(i)
easy-to-sample proposal distribution xt ∼ π(xt ), and computing the
corresponding weight via
(i)
wt ∝
(i)
p̂(xt |Yt )
(i)
.
(6.4)
π(xt )
It is required that π(x) satisfies the condition
p(xt |Yt ) > 0 → π(xt ) > 0, except at a zero measure of points.
6.2.2 Hermite functions basis
In the one-dimensional case, the k-th Hermite function is given by
(−1)k x2 /2 dk −x2
φk (x) = e ,
√ e
dxk
2k k! π
k = 0, 1, . . .
For computational purposes, the three-term recurrence relationship
√
2
φ0 (x) = π −1/4 e−x /2 , φ1 (x) = 2xφ0 (x),
2
k−1
xφk−1 (x) −
φk−2 (x),
φk (x) =
k
k
k = 2, 3, . . . is often exploited. The set {φk (x)}∞
k=0 constitutes an orthogonal basis of L2 (R). For later use, the values of
gk = max |φk (x)| and sk =
φ4k (x) dx,
x
R
k = 0, 1, ..., 10 are listed in Tab. 6.1. Note that the k-th basis function
2
has the form φk (x) = rk (x)e−x /2 , where rk (x) is a polynomial of degree k, and that the 0-th basis function is a scaled Gaussian PDF. Due
123
Table 6.1. Values of gk and sk for the Hermitian basis functions.
k
gk
sk
0
.75
.40
1
.64
.30
2
.61
.26
3
.59
.23
4
.57
.21
6
.56
.20
7
.55
.19
8
.55
.18
9
.54
.17
to this characteristics, the Hermitian functions provide a suitable basis
for approximation of PDFs that have exponentially decaying tails, but
could otherwise be a poor option. Actually, it can be shown [13] that
if the approximated function exhibits exponential decay, the coefficients
{a(k) }K
k=0 in (6.1) will also exhibit exponential decay, in the sense that
they decrease faster than the reciprocal of any finite order polynomial.
Hence a good fit can be expected, in that case, for a low value of the
truncation order K.
6.3 The Hermitian Particle Filter
The proposed PF method is detailed in this section. For notational
brevity, the vectors
ϕ(x) = [ φk0 (x) φk1 (x) · · · φkK (x) ]T ,
K)
at = [ at(k0 ) at(k1 ) · · · a(k
]T ,
t
are introduced where the elements of K have been denoted as k0 , k1 , ...., kK .
The number of elements in K is thus K + 1.
The method follows the regular particle filtering algorithm to obtain
a weighted sample from p(xt+1 |Yt ), i.e. the particles are propagated
and updated via (6.2), (6.3). The main difference is how the resampling
is performed. To resample, a series expansion is fitted to the weighted
set
(k)
p̂(xt+1 |Yt ) =
at+1 φk (xt+1 ) = aTt+1 ϕ(xt+1 ),
k∈K
using the method described in Sec. 1.4., i.e.
at+1 =
N
(i)
(i)
wt ϕ(xt+1 ).
i=1
From the fitted PDF, a new set of particles is drawn by importance
sampling, i.e.
(i)
xt+1 ∼ π(x),
(i)
wt+1 =
124
(6.5)
p̂(xt+1 |Yt )
(i)
π(xt+1 )
(i)
=
|aTt+1 ϕ(xt+1 )|
(i)
π(xt+1 )
.
(6.6)
Algorithm 22 Algorithm for one iteration of OBPF.
(i)
(i)
(i)
(U) wt = wt−1 p(yt |xt )
(i)
(i)
(P) xt+1 ∼ p(xt+1 |xt )
(i)
(i)
(R) ât+1 = N
i=1 wt ϕt+1 (xt+1 )
(i)
xt+1 ∼ π(x)
(i)
(i)
(i)
wt = |âTt+1 ϕt+1 (xt+1 )|/π(xt+1 )
For the Hermite functions, the first basis function is a scaled Gaussian
PDF and it is reasonable to take a Gaussian distribution as the proposal
distribution in that case, i.e. π(x) = γt (x) := γ(x, μt , Σt ) where μt , and
Σt are the mean and covariance of the particle set respectively.
The absolute value in (6.6) is inserted since the approximation method
does not guarantee a non-negative approximation of the PDF. The steps
of the algorithm are summarized in Alg. 22. In the description it is
assumed that resampling is carried out at every iteration. To modify for
a different resampling scheme is straightforward.
Remark 1. Note that during the recursion, the PDF is propagated as
p(xt |Yt ) → p(xt+1 |Yt ) → p(xt+1 |Yt+1 )
via the prediction and update equations in (6.2), (6.3). The resampling
can be carried out at any step of the recursion. In the present formulation, the resampling is performed by making a draw from p(xt+1 |Yt )
because p(xt+1 |Yt ) is typically smoother than p(xt |Yt ) and, therefore,
is more suitable for approximation with a series expansion of low order.
This can be expected as p(xt+1 |Yt ) is a prediction and, hence, subject to
greater uncertainty than p(xt |Yt ) (smoother PDF), or more technically
by that p(xt+1 |Yt ) is the outcome of a convolution of two PDFs, while
p(xt |Yt ) is a product.
6.4 Parallelization
The proposed estimation method is designed for straightforward parallelization. The key to the parallizability of the method is the decoupled
(k)
way in which the coefficients at can be computed from the local particle sets. Pseudocode for a parallelization is given in Alg. 23, where
125
Figure 6.1. Illustration of the work flow for parallel execution of the algorithm.
Algorithm 23 One iteration for parallel implementation.
Parallel (processor m do for i ∈ Nm )
(i)
(R) xt ∼ π(x)
(i)
(i)
(i)
wt−1 = |âTt ϕt (xt )|/π(xt )
(i)
(i)
(i)
(U) wt = wt−1 p(yt |xt )
(i)
(i)
(P) xt+1 ∼ p(xt+1 |xt )
(m)
(m)
(i)
ât+1 = ât+1 + ϕt+1 (xt+1 )
Sequentially (one processor)
(m)
1 M
ât+1 = M
m=1 ât+1
Nm , m = 1, 2, ..M are disjoint subsets {1, 2, .., N } of cardinality N/M
(assumed to be integer). The computations have been organized to allow for only one sequential section, and one synchronization point per
iteration.
Each processing unit starts with creating a local particle set by resampling from p̂(xt+1 |Yt ) = aTt+1 ϕt+1 (x) (R). The local particle sets are
propagated (P), and updated (U) by each processing unit. From the local particle set, each processor will then compute the local estimate â(m)
of a that is communicated to a processor forming the global estimate â
in a sequential section. The global estimate of â is then communicated
back to the processing units which can restart the cycle by sampling
from the global estimate of p(xt+2 |Yt+1 ). These execution steps are
illustrated in Fig. 6.1.
126
6.4.1 Parallelization properties analysis
The simple parallelization scheme of the method facilitates analysis of
the parallelization properties, including the required amount of sequential and parallel work as well as interprocessor communication. In the
proposed algorithm, the major part of the computational work is carried
out in the parallel section, and only a small amount of communication
and sequential processing is necessary. This property is crucial to a
scalable parallelization, as discussed in Sec. 1.9.
To give exact numbers for a general system is of course not possible.
Here the numbers for a linear system with uniformly distributed process
and measurement noise are given. This can be considered a challenging
scenario for the parallelization in the sense that it gives a relatively low
amount of computation compared to communication, and can be viewed as a ”close to worst case” scenario. The flops yielding the sequential work $q_|$ and the parallel work $q_{||}$ can then be counted as
$$q_|(M, K) \approx (M - 1)(K + 1), \qquad (6.7)$$
$$q_{||}(N, K) \approx (2 + 2f_r + f_e + 4K)N + 2(K + 1), \qquad (6.8)$$
where $f_r$ and $f_e$ are the numbers of flops required to generate a random number and to evaluate the exponential function, respectively. Transferring the local estimates $a^{(m)}$ and $\mu^{(m)}$, $m = 1, 2, ..., M$, results in a total communication of
$$\kappa(K, M) = (K + 3)(M - 1) \qquad (6.9)$$
elements per iteration. This is a very low amount of communication and can be considered almost negligible compared to the FLOPs performed. The theoretical speedup one could expect can thus be obtained by computing $q_|$ and $q_{||}$ from (6.7) and (6.8) and substituting $p_| = q_|/(q_| + q_{||})$, $p_{||} = q_{||}/(q_| + q_{||})$ into (6.20), together with an estimate of $c(M)$ that takes the amount of communication (6.9) into consideration.
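As an illustration, the sketch below (not thesis code) evaluates the operation counts (6.7)-(6.9) for a given problem size; the values f_r = 10 and f_e = 40 are the empirical per-call costs assumed later in Sec. 6.6.3.

#include <cstdio>

int main() {
  const double f_r = 10.0, f_e = 40.0;  // assumed per-call costs (see Sec. 6.6.3)
  const int M = 8, K = 9;               // cores and truncation order
  const double N = 1e4;                 // number of particles
  const double q_seq = (M - 1.0) * (K + 1.0);                                // (6.7)
  const double q_par = (2.0 + 2.0 * f_r + f_e + 4.0 * K) * N + 2.0 * (K + 1.0); // (6.8)
  const double kappa = (K + 3.0) * (M - 1.0);                                // (6.9)
  std::printf("q_seq=%.0f  q_par=%.0f  kappa=%.0f  parallel fraction=%.4f\n",
              q_seq, q_par, kappa, q_par / (q_seq + q_par));
  return 0;
}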
6.5 Analysis
This section provides an analysis of how good the fit of the expansion can be considered to be. Typically, when fitting a SE to a random sample,
the underlying distribution is assumed completely unknown. The lack of
information or assumptions on the target PDF is limiting in the analysis
of the goodness of the fit. For the recursive Bayesian estimation problem
the underlying PDF is not completely unknown, as it is highly influenced
by the known system model (6.12) and (6.13). This information can be
used to provide performance measures on the goodness of the fit.
In Theorem 3, a bound on the variance of the parameter estimate
is given. It is interesting in its own right as it gives a measure of the
certainty in the estimate of a specific parameter, but it also gives means
of expressing an upper bound, Theorem 7, on the part of the mean
integrated square error (MISE) that is due to the random error caused
by using a finite number of particles N . It provides practical guidance
since if the bound is higher than desired, it indicates that the number
of particles should be increased.
In Theorem 4, a bound on the absolute value of the coefficients, for the
Hermitian basis functions, is given. The bound is decaying and provides
means of ensuring that important coefficients (of large magnitude) are
not neglected when truncating the expansion and can be used as a tool
for selecting a suitable truncation order.
The main theorems are stated below with the proofs given in Appendices.
The q-norm of a function $p(x): \mathbb{R}^D \to \mathbb{R}$ is defined as
$$\|p(x)\|_q = \Big[\int_{\mathbb{R}^D} |p(x)|^q \, dx\Big]^{1/q}.$$
All involved functions are assumed to be Riemann integrable so that $|\int_{\mathbb{R}^D} f(x)\,dx| \le \int_{\mathbb{R}^D} |f(x)|\,dx$ is a valid inequality. The supremum of $|\varphi_k(x)|$ is denoted $g_k$, i.e. $g_k = \sup_x |\varphi_k(x)|$.
Remark 2. The k-coefficient is estimated from the particle set as
$$\hat{a}_{t+1}^{(k)} = \sum_{i=1}^{N} w_t^{(i)}\, \varphi_k(x_{t+1}^{(i)}).$$
By the central limit theorem, $\hat{a}_{t+1}^{(k)}$ will be approximately normally distributed according to $\hat{a}_{t+1}^{(k)} \sim N(a_{t+1}^{(k)}, \sigma_{t+1}(k)^2)$ if $N$ is “large”, where large typically is considered to be $N \ge 30$, and where the variance $\sigma_{t+1}(k)^2$ for coefficient $k$ is bounded as given by the following theorem.
Theorem 3. The variance $\sigma_{t+1}(k)^2$ of the estimate of the k-coefficient $\hat{a}_{t+1}^{(k)}$ is bounded by either of the two bounds
$$\sigma_{t+1}(k)^2 \le g_k^2\, W_t, \qquad (6.10)$$
$$\sigma_{t+1}(k)^2 \le \Big[\int_{\mathbb{R}^D} \varphi_k(x)^4 \, dx\Big]^{1/2} \|p_v(v)\|_2\, W_t, \qquad (6.11)$$
where $W_t = \sum_{i=1}^{N} (w_t^{(i)})^2$.
See Appendix 6.C for proof. Which one of (6.10) and (6.11) is tighter depends on the particular basis function that is used and the 2-norm of $p_v(v)$. The factor $W^{-1} \in [1, N]$ is sometimes termed the efficiency of the particle set. For an unweighted sample, i.e. $w^{(i)} = N^{-1}$, $i = 1, 2, ..., N$, it holds that $W^{-1} = N$, which is the highest possible efficiency.
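For reference, a small helper (an assumption-free sketch, not taken from the thesis code) that computes $W_t = \sum_i (w_t^{(i)})^2$ and the efficiency $W_t^{-1}$ for a normalized weight vector:

#include <vector>

// Particle-set efficiency W^{-1} = 1 / sum_i (w_i)^2; equals N for uniform weights.
double efficiency(const std::vector<double>& w) {  // w assumed normalized, sum(w) = 1
  double W = 0.0;
  for (double wi : w) W += wi * wi;
  return 1.0 / W;
}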
The following theorem provides a bound on $|a_t^{(k)}|$, decaying with increasing values of $\sum_{d=1}^{D} k_d$, and applies to systems of the form
$$x_{t+1} = f(x_t) + v_t, \qquad (6.12)$$
$$y_t = h(x_t) + e_t, \qquad (6.13)$$
where the process noise vt is assumed to be mutually independent in each
dimension, with pv,d (v) denoting the PDF in dimension d. ∇ denotes the
Jacobian of a function and σmin (A) and σmax (A) denote the smallest
and largest singular value of a matrix A.
Theorem 4. Assume that the system is given by (6.12), (6.13) and that $h(x)$ and $f(x)$ are continuous functions with $\sigma_{min}(\nabla h(x)) \ge r$, $\sigma_{max}(\nabla f_d(x)) \le R_d$, $d = 1, 2, ..., D$, for all $x$. Further impose that $p_e(e) \le \epsilon$ for $\|e\|_2 \ge L$. Then $a_t^{(k)}$ for the Hermitian basis functions is bounded in absolute value as $|a_t^{(k)}| \le \lambda_t^{-1} \eta_t^{(k)}$, where
$$\eta_t^{(k)} = m_e\, g_k \prod_{d=1}^{D} \frac{\Xi(d, q)}{\prod_{i=1}^{q} (2k_d + 2i)^{1/2}} + g_k\, \epsilon, \qquad (6.14)$$
where $m_e$ is the supremum of $p_e(e)$, $\lambda_t$ is the normalization constant $\lambda_t = \sum_{i=1}^{N} w_t^{(i)}$, $q$ is a positive integer and
$$\Xi(d, q) = \sup_{|\theta| \le R_d L / r} \int_{\mathbb{R}} \Big|\Big[\frac{\partial^q}{\partial z^q}\big(e^{-z^2/2}\, p_{v,d}(z - \theta)\big)\Big] e^{z^2/2}\Big| \, dz. \qquad (6.15)$$
Proof. See Appendix 6.B.
Remark 5. Suppose that it is decided that only coefficients with index k
that satisfy |ηk |/|η0 | ≥ Q, for some Q ∈ R, are to be kept. Reorganizing
and simplifying the expression |ηk |/|η0 | ≥ Q (see Appendix 6.D for details), it is found that only the coefficients with index k = (k1 , k2 , ..., kD )
satisfying
$$\prod_{d=1}^{D} \prod_{i=1}^{q} \frac{k_d + i}{i} < (Q - \epsilon)^{-2} \qquad (6.16)$$
have to be considered. Eq. (6.16) implicitly specifies which coefficients
shall be kept and which ones can be neglected. In Fig. 6.2, the number
of coefficients satisfying (6.16) is shown for the dimension orders, D =
1, 2, .., 10 and Q values Q = {0.15, 0.10, 0.05}. By inspection of the
slopes of the curves, the growth in the number of coefficients needed to be computed is of order $O(D^2)$ for the different Q values, which should be compared to the growth of $O(\tilde{K}^D)$ that is obtained if no selection is performed and $\tilde{K}$ coefficients per dimension are kept and computed.
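The selection rule (6.16) is easy to evaluate numerically. The sketch below (a hypothetical helper, not thesis code) enumerates and counts the multi-indices k = (k_1, ..., k_D) satisfying (6.16) for the Q and epsilon values used later in Sec. 6.6.2, which is one way to reproduce the kind of growth curves shown in Fig. 6.2.

#include <cmath>
#include <cstdio>

// Recursively count multi-indices whose product prod_d prod_{i=1..q} (k_d+i)/i stays below the bound.
static long count_indices(int d, int D, int q, double factor, double bound) {
  if (factor >= bound) return 0;
  if (d == D) return 1;
  long n = 0;
  for (int kd = 0;; ++kd) {
    double f = factor;
    for (int i = 1; i <= q; ++i) f *= (kd + i) / double(i);
    if (f >= bound) break;                       // larger kd only increase the product
    n += count_indices(d + 1, D, q, f, bound);
  }
  return n;
}

int main() {
  const double Q = 0.10, eps = 0.01;             // threshold Q and epsilon as in Sec. 6.6.2
  const int q = 2;
  const double bound = std::pow(Q - eps, -2.0);  // right-hand side of (6.16)
  for (int D = 1; D <= 10; ++D)
    std::printf("D=%2d  coefficients kept: %ld\n", D, count_indices(0, D, q, 1.0, bound));
  return 0;
}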
The MISE is an important measure of how good the fit is and can be derived as follows. The overall approximation error is given by
$$e(x_{t+1}) = p(x_{t+1}|Y_t) - \hat{p}(x_{t+1}|Y_t) = \sum_{k \in \mathbb{N}^D} a_{t+1}^{(k)} \varphi_k(x_{t+1}) - \sum_{k \in \mathcal{K}} \hat{a}_{t+1}^{(k)} \varphi_k(x_{t+1})$$
$$= \underbrace{\sum_{k \in \mathcal{K}} \big(a_{t+1}^{(k)} - \hat{a}_{t+1}^{(k)}\big)\varphi_k(x_{t+1})}_{e_r(x_{t+1})} + \underbrace{\sum_{k \notin \mathcal{K}} a_{t+1}^{(k)} \varphi_k(x_{t+1})}_{e_T(x_{t+1})},$$
where $e_r$ is the random error caused by the uncertainty in the estimate of $a_t^{(k)}$ due to the finite number of particles $N$, and $e_T$ is the truncation error caused by neglecting the coefficients with index $k \notin \mathcal{K}$. By Parseval's identity (following from the fact that the functional basis is orthogonal), the mean integrated square error (MISE) is given by
$$V(t) = E\Big[\int_{\mathbb{R}^D} e(x_t)^2 \, dx_t\Big] = E\Big[\sum_{k \in \mathcal{K}} \big(a_t^{(k)} - \hat{a}_t^{(k)}\big)^2 + \sum_{k \notin \mathcal{K}} \big(a_t^{(k)}\big)^2\Big] = \underbrace{\sum_{k \in \mathcal{K}} \sigma_t(k)^2}_{V_r(t)} + \underbrace{\sum_{k \notin \mathcal{K}} \big(a_t^{(k)}\big)^2}_{V_T(t)}, \qquad (6.17)$$
where Vr is the random MISE due to the finite number of particles,
and VT is the MISE caused by truncation of the expansion. Note that
as σt (k)2 → 0 when N → ∞, Vr → 0 as N → ∞ and the MISE of
the estimated expansion converges to the MISE of the true truncated
expansion, i.e. VT .
Remark 6. Consider the scalar case. By inspection of (6.17), it can be
noted that if σt2 (k) does not decay more rapidly than 1/k, the truncation
actually is necessary to avoid divergence of the MISE. The fit does hence
Figure 6.2. The number of coefficients L, that have to be computed versus
dimension D, for different Q-values in (6.16).
not necessarily improve as more coefficients are included in the truncated
expansion. This is intuitive as it is impossible to estimate an infinite
number of parameters to an arbitrary accuracy from a finite data set.
Theorem 7. The term $V_r$ is bounded by either of the two inequalities
$$V_r(t) \le W_t^{-1} \sum_{k \in \mathcal{K}} g_k^2, \qquad V_r(t) \le W_t^{-1}\, \|p_v(v)\|_2 \sum_{k \in \mathcal{K}} \Big[\int_{\mathbb{R}^D} \varphi_k(x)^4 \, dx\Big]^{1/2}. \qquad (6.18)$$
Proof. This follows immediately by applying Theorem 3 to the term $V_r$ in (6.17).
By inserting inequality (6.14) into the expression for $V_T$ in (6.17), an upper bound for the MISE caused by the truncation is given by
$$V_T(t) \le \lambda_t^{-2} \sum_{k \notin \mathcal{K}} \eta_k^2. \qquad (6.19)$$
6.6 Computational Experiments
The main purpose of constructing the nonlinear estimation algorithm described above is to achieve parallelizability of its computer implementation. As pointed out before, the method that enjoys parallelizability properties similar to those of the proposed method is the GPF. Therefore, the proposed method is tested against the GPF for comparison, to highlight the benefits of not being restricted to a Gaussian posterior. The estimation accuracy and the parallelizability are investigated in the following
subsections. For brevity, the proposed method will be referred to as the
Hermitian Particle Filter (HPF) in this section, to indicate the selected
orthogonal basis.
6.6.1 System model
To illustrate the method, consider the simple nonlinear system
$$x_{t+1} = \frac{x_t}{|x_t| + 1} + v_t,$$
$$y_t = x_t + e_t,$$
where vt ∈ R and et ∈ R are mutually independent, white noise sequences. The measurement noise, et , follows a Gaussian distribution
with standard deviation $\sigma_e = 0.1$, while $v_t$ is non-Gaussian, with the multimodal PDF
$$p_v(v) = \frac{1}{2\sqrt{2\pi}\,\sigma_v}\Big(e^{-\frac{(v-1)^2}{2\sigma_v^2}} + e^{-\frac{(v+1)^2}{2\sigma_v^2}}\Big),$$
where σv = 1. The system was simulated for t = 0, 1, 2..., T , where
T = 100, with the initial condition x0 = 0.
6.6.2 Estimation accuracy
Eq. (6.14) was used to compute an upper bound on the coefficients. For the given system, $R = r = 1$ and $m_e = 1/\sqrt{2\pi \cdot 0.1^2}$. The threshold $\epsilon = 0.01$ was chosen, yielding $L \approx 0.34$. Evaluating (6.15) for $q = 2$ then yields
$$\Xi(q, d) = \sup_{|\theta| \le 0.34} \int_{\mathbb{R}} \Big| e^{x^2/2}\, \frac{\partial^q}{\partial x^q}\big(e^{-x^2/2}\, p_{v,d}(x - \theta)\big)\Big| \, dx \approx 5,$$
where the optimization problem was solved by gridding over $\theta$ and computing the integral numerically. The upper bounds for the absolute values of $a_t^{(k)}$, $k = 0, 1, 2, ...$, are then given by (6.14) as
$$|a_t^{(k)}| \le \lambda_t^{-1}\, \frac{5\, g_k / \sqrt{2\pi \cdot 0.1^2}}{[(2k + 2)(2k + 4)]^{1/2}}.$$
Figure 6.3. The true PDF $p(x_{t+1}|Y_t)$ and the estimated PDFs $p_{GPF}(x_{t+1}|Y_t)$ and $p_{HPF}(x_{t+1}|Y_t)$ from the Gaussian and Hermitian particle filters, respectively, at time step t = 60.
The values of $g_k$ are provided in Tab. 6.1. In Fig. 6.4, the absolute values of the series coefficients are shown together with the bound (6.14)
for time instant t = 25. Selecting the value Q = 0.1, i.e. only considering
the coefficients that are potentially larger than 0.1η0 , (6.16) states that
K = 9 is the required truncation order.
In Fig. 6.3, the true PDF p(xt+1 |Yt ) and the approximated PDFs
pHP F (xt+1 |Yt ) and pGP F (xt+1 |Yt ) obtained from the HPF (with K = 9)
and GPF, respectively, are shown for time instant t = 60, using N = 800
particles. The “true” PDF has been obtained by executing a regular
bootstrapping particle filter with 106 particles and applying a kernel
density approximation method [97] to the obtained particle set.
Inserting the value $\|p_v(v)\|_2 \approx 0.16$ and the values of $s_k = [\int_{\mathbb{R}^D} \varphi_k(x)^4\,dx]^{1/2}$ given in Tab. 6.1 into (6.18), the variance for each coefficient and time step was computed. In Fig. 6.5, the coefficients with the corresponding upper-bound 95% confidence intervals, computed from (6.18), are shown for time step t = 20.
6.6.3 Execution time and speedup
To evaluate the performance of the method in terms of the speedup obtained when executed on a shared-memory multicore computer, the method was implemented in C++ and run on an AMD Opteron 6220 processor (3.3 GHz, 8 cores, 16 MB cache). Compilation was performed using the PGI compiler, with full optimization for execution speed.
Figure 6.4. Absolute values of the coefficients $a_t^{(k)}$, k = 0, 1, ..., 12, are shown as stems and the dashed line shows the upper bound computed from (6.14) for time step t = 25.
(k)
Figure 6.5. Estimated coefficients at , with upper bound 95 % confidence
intervals marked, k = 1, 2, ..13 at time step t = 20.
Figure 6.6. Speedup curves for execution on a shared-memory multicore.
OpenMP [1] was used for parallelization. The achieved speedup is shown
in Fig. 6.6, for different problem sizes.
On the machine under consideration, empirical testing shows that $f_r \approx 10$ and $f_e \approx 40$ are reasonable approximations for the terms in Eq. (6.8), and that it takes two hundred CPU cycles to communicate an element to the RAM and about 1000 CPU cycles for overhead such as synchronization, thread startup, etc. This results in the overhead term $c(M, K) = 200\kappa(K, M) + 1000$. Inserting this into (1.64) yields a theoretical estimate of the speedup, for K = 9, of
$$s(M, N) = M\,\frac{4M + 65N + 4}{1004M^2 - 504M + 65N + 8}. \qquad (6.20)$$
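For a quick check of the predicted scaling, the expression (6.20) can be evaluated directly; the short sketch below (not thesis code) prints the theoretical speedup for the particle numbers and core counts used in the experiments.

#include <cstdio>

// Theoretical speedup (6.20), valid for K = 9 and the overhead model above.
double speedup(double M, double N) {
  return M * (4.0 * M + 65.0 * N + 4.0) /
         (1004.0 * M * M - 504.0 * M + 65.0 * N + 8.0);
}

int main() {
  const double Ns[] = {1e2, 1e3, 1e4, 1e5, 1e6};
  for (double N : Ns)
    for (int M = 1; M <= 8; ++M)
      std::printf("N=%8.0f  M=%d  s=%.2f\n", N, M, speedup(M, N));
  return 0;
}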
The theoretical speedup curves (6.20) are plotted in Fig. 6.7. Though
being a bit optimistic, the curves resemble the experimentally obtained
ones quite well. Exact figures can of course not be expected from (6.20),
but the expression serves as a guideline for the expected speedup for a
given problem size and number of processors employed. The obtained
speedup values compare well with the ones obtained for the GPF in
Chapter 4. For N ≥ 104 , close-to-linear speedup in the number of cores
used is reached. For N = 100, the benefit of using parallelization is low,
and actually a slowdown can be observed for M > 4. This is due to the fact that the overhead c(M) constitutes a disproportionately large part of the execution time. Using more than $10^4$ particles is actually redundant in this particular example, as the estimation accuracy does not improve with more particles than that. However, larger numbers of particles are
Figure 6.7. Theoretical speedup curves computed from (6.20).
Table 6.2. Single-core execution time for one iteration
N        10^2      10^3     10^4     10^5    10^6
T [s]    0.00047   0.0012   0.0054   0.071   0.52
relevant when it comes to speedup evaluation. The obtained speedup
curves are for hardware that executes on a relatively high clock frequency
(3.3 GHz). In low-power applications, such as e.g. communication, the
clock frequencies are typically much lower, and hence a better scalability
can be expected for smaller problem sizes. The small amount of communication makes the approach suitable for computer cluster execution
which can be relevant for large problem sizes. The method is demonstrated here for a one-dimensional example; it has, though, been tested on other systems with up to eight dimensions and proved to perform well. In Chapter 8, the method is evaluated on a five-dimensional estimation problem regarding parameter estimation in a PK/PD model for closed-loop anesthesia.
6.A Appendix for Chapter 6
It is frequently utilized in the derivations that for a PDF $p(x)$, $x \in \mathbb{R}^D$, it holds that $\int_{\mathbb{R}^D} p(x)\,dx = 1$ and $p(x) \ge 0$.
Lemma 8. Let $\pi(x): \mathbb{R}^D \to \mathbb{R}$ be a function that satisfies $|\pi(x)| \le \gamma$, $x \in \mathbb{R}^D$. Then
$$\Big|\int_{\mathbb{R}^D} \pi(x)\, p(x)\,dx\Big| \le \gamma.$$
Proof.
$$\Big|\int_{\mathbb{R}^D} \pi(x)\, p(x)\,dx\Big| \le \int_{\mathbb{R}^D} |\pi(x)||p(x)|\,dx \le \gamma \int_{\mathbb{R}^D} |p(x)|\,dx = \gamma.$$
Remark 9. Lem. 8 immediately implies that if $p(x) \le q$ for some finite $q \in \mathbb{R}$, then $\int_{\mathbb{R}^D} p(x)^2\,dx \le q$, i.e. $p(x) \in L_2(\mathbb{R}^D)$, and it is hence possible to approximate it with an orthogonal series expansion.
Lemma 10. The 2-norm of $p(x_{t+1}|Y_t)$ is less than or equal to the 2-norm of $p_v(v)$, i.e. $\|p(x_{t+1}|Y_t)\|_2 \le \|p_v(v)\|_2$.
Proof.
$$\|p(x_{t+1}|Y_t)\|_2^2 = \int_{\mathbb{R}^D} p(x_{t+1}|Y_t)^2\,dx_{t+1} = \int_{\mathbb{R}^D} \Big[\int_{\mathbb{R}^D} p_v(x_{t+1} - f(x_t))\, p(x_t|Y_t)\,dx_t\Big]^2 dx_{t+1}$$
$$= \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} p_v(x_{t+1} - f(\xi_1))\, p(\xi_1|Y_t)\,d\xi_1 \int_{\mathbb{R}^D} p_v(x_{t+1} - f(\xi_2))\, p(\xi_2|Y_t)\,d\xi_2\, dx_{t+1}$$
$$= \int_{\mathbb{R}^D}\int_{\mathbb{R}^D} \Big[\int_{\mathbb{R}^D} p_v(x_{t+1} - f(\xi_1))\, p_v(x_{t+1} - f(\xi_2))\,dx_{t+1}\Big] p(\xi_1|Y_t)\, p(\xi_2|Y_t)\,d\xi_1 d\xi_2$$
$$\le \int_{\mathbb{R}^D}\int_{\mathbb{R}^D} \|p_v(v)\|_2^2\, p(\xi_1|Y_t)\, p(\xi_2|Y_t)\,d\xi_1 d\xi_2 = \|p_v(v)\|_2^2 \int_{\mathbb{R}^D} p(\xi_1|Y_t)\,d\xi_1 \int_{\mathbb{R}^D} p(\xi_2|Y_t)\,d\xi_2 = \|p_v(v)\|_2^2,$$
where the inequality holds since the inner integral satisfies
$$\int_{\mathbb{R}^D} p_v(x_{t+1} - f(\xi_1))\, p_v(x_{t+1} - f(\xi_2))\,dx_{t+1} = \int_{\mathbb{R}^D} p_v(x_{t+1} - f(\xi_1) + f(\xi_2))\, p_v(x_{t+1})\,dx_{t+1} \le \int_{\mathbb{R}^D} p_v(x_{t+1})\, p_v(x_{t+1})\,dx_{t+1} = \|p_v(v)\|_2^2$$
by the fact that the autocovariance
$$R(\tau) = \int_{\mathbb{R}^D} f(x - \tau)\, f(x)\,dx$$
for any real-valued function $f(x)$ satisfies $R(\tau) \le R(0)$.
Lemma 11. Let $\varphi_k(x)$ denote the k-th Hermitian function. If $|[\frac{\partial^q}{\partial z^q}(e^{-z^2/2} f(z))]\, e^{z^2/2}| \le K e^{z^2/2}$ for some constant $K < \infty$, then the following equality is valid:
$$\int_{\mathbb{R}} \varphi_k(z)\, f(z)\,dz = \frac{1}{\prod_{i=1}^{q} (2k + 2i)^{1/2}} \int_{\mathbb{R}} \varphi_{k+q}(z)\Big[\frac{\partial^q}{\partial z^q}\big(e^{-z^2/2} f(z)\big)\Big] e^{z^2/2}\,dz.$$
Proof. This follows by repeated partial integration (q times), see e.g. [13].
6.B Proof of Theorem 4
Denote with $x_0$ the value that satisfies $h(x_0) = y$. By the assumption $\sigma_{min}(\nabla h(x)) \ge r$, it follows from the mean value theorem for vector-valued functions that $\|h(x_0) - h(x - x_0)\|_2 = \|y - h(x - x_0)\|_2 \ge r\|x - x_0\|_2$, which by the requirement $p_e(e) \le \epsilon$ for $\|e\|_2 \ge L$ implies that $p_e(h(x_0) - h(x - x_0)) \le \epsilon$ if $x \in \bar{\Omega} := \{x \,|\, \|x - x_0\|_2 \ge L/r\}$. For notational brevity, the following notation will be used: $z = x_t$, $x = x_{t-1}$, and let $\tilde{z} = z - \mu$, where $\mu = f(x_0)$. Splitting the computation of $a_t^{(k)}$ over $\Omega$ and $\bar{\Omega}$ gives
$$|a_t^{(k)}| = \Big|\int_{\mathbb{R}^D} \varphi_k(\tilde{z})\, p(z|Y_{t-1})\,dz\Big| = \lambda_t^{-1} \Big|\int_{\mathbb{R}^D} \int_{\mathbb{R}^D} \varphi_k(\tilde{z})\, p_v(z - f(x))\,dz\; p_e(y - h(x))\, p(x|Y_{t-2})\,dx\Big|$$
$$= \lambda_t^{-1} \Big|\int_{\Omega} \int_{\mathbb{R}^D} \varphi_k(\tilde{z})\, p_v(z - f(x))\,dz\; p_e(y - h(x))\, p(x|Y_{t-2})\,dx + \int_{\bar{\Omega}} \int_{\mathbb{R}^D} \varphi_k(\tilde{z})\, p_v(z - f(x))\,dz\; p_e(y - h(x))\, p(x|Y_{t-2})\,dx\Big|.$$
Denote the first and the second term $T_1$ and $T_2$, respectively. Now, if the process noise is mutually independent in each dimension, i.e. $p_v(w) = p_{v,1}(w_1) \cdots p_{v,D}(w_D)$, the first term can be bounded as
$$|T_1| = \Big|\int_{\Omega} \prod_{d=1}^{D} \int_{\mathbb{R}} \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(z_d - f_d(x))\,dz_d\; p_e(y - g(x))\, p(x|Y_{t-2})\,dx\Big|$$
$$\le m_e \int_{\Omega} \prod_{d=1}^{D} \Big|\int_{\mathbb{R}} \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(z_d - f_d(x))\,dz_d\Big|\, p(x|Y_{t-2})\,dx \le m_e \int_{\Omega} \xi\, p(x|Y_{t-2})\,dx \le m_e\, \xi,$$
where $\xi = \sup_{x \in \Omega} \prod_{d=1}^{D} \big|\int_{\mathbb{R}} \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(z_d - f_d(x))\,dz_d\big|$. To get an upper bound, the supremum can be taken on each factor independently in the product. Further, for each factor $\xi_d = \sup_{x \in \Omega} |\int \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(z_d - f_d(x))\,dz_d|$ in this product, the following bound can be given.
$$\xi_d = \sup_{x \in \Omega}\Big|\int \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(\tilde{z}_d - (f_d(x) - \mu_d))\,dz_d\Big| \le \sup_{|\theta| \le R_d L/r} \Big|\int_{\mathbb{R}} \varphi_{k_d}(\tilde{z}_d)\, p_{v,d}(\tilde{z}_d - \theta)\,dz_d\Big|$$
$$\le g_{k_d}\, \frac{1}{\prod_{i=1}^{q} (2k_d + 2i)^{1/2}}\, \sup_{|\theta| \le R_d L/r} \int_{\mathbb{R}} \Big|\Big[\frac{\partial^q}{\partial \tilde{z}_d^q}\big(e^{-\tilde{z}_d^2/2}\, p_{v,d}(\tilde{z}_d - \theta)\big)\Big] e^{\tilde{z}_d^2/2}\Big|\, d\tilde{z}_d,$$
where the first inequality is a consequence of the assumption $\sigma_{max}(\nabla f_d(x)) \le R_d$, which implies $|f_d(x) - \mu_d| \le R_d L/r$ over $\Omega$, and the second inequality follows from Lem. 11. Hence
$$\xi \le \prod_{d=1}^{D} \frac{\Xi(d, q)}{\prod_{i=1}^{q}(2k_d + 2i)^{1/2}}\, g_{k_d},$$
where $\Xi(d, q)$ is defined according to (6.15), which gives the first term in (6.14). The
second term $T_2$ can be bounded as
$$|T_2| = \Big|\int_{\bar{\Omega}} \int_{\mathbb{R}^D} \varphi_k(\tilde{z})\, p_v(z - f(x))\,dz\; p_e(y - g(x))\, p(x|\mathcal{Y})\,dx\Big| \le \epsilon \int_{\bar{\Omega}} \int_{\mathbb{R}^D} |\varphi_k(\tilde{z})|\, p_v(z - f(x))\,dz\, p(x|\mathcal{Y})\,dx \le g_k\, \epsilon \int_{\bar{\Omega}} p(x|\mathcal{Y})\,dx \le g_k\, \epsilon,$$
which gives the resulting inequality (6.14).
6.C Proof of Theorem 3
The variance $\sigma_{t+1}(k)^2$ of the point estimate of coefficient $k$ is given by
$$\sigma_{t+1}(k)^2 = V[\hat{a}_{t+1}^{(k)}] = V\Big[\sum_{i=1}^{N} w_t^{(i)}\, \varphi_k(x_{t+1}^{(i)})\Big] = V[\varphi_k(X_{t+1})] \sum_{i=1}^{N} \big(w_t^{(i)}\big)^2.$$
$V[\varphi_k(X_{t+1})]$ is bounded by
$$V[\varphi_k(X_{t+1})] = E[\varphi_k(X_{t+1})^2] - E[\varphi_k(X_{t+1})]^2 \le E[\varphi_k(X_{t+1})^2],$$
which in turn can be bounded by the smallest of either
$$E[\varphi_k(X_{t+1})^2] = \int_{\mathbb{R}^D} \varphi_k^2(x_{t+1})\, p(x_{t+1}|Y_t)\,dx_{t+1} \le g_k^2,$$
or
$$E[\varphi_k(X_{t+1})^2] = \int_{\mathbb{R}^D} \varphi_k(x_{t+1})^2\, p(x_{t+1}|Y_t)\,dx_{t+1} \le \Big[\int_{\mathbb{R}^D} \varphi_k(x_{t+1})^4\,dx_{t+1}\Big]^{1/2} \Big[\int_{\mathbb{R}^D} p(x_{t+1}|Y_t)^2\,dx_{t+1}\Big]^{1/2} \le \Big[\int_{\mathbb{R}^D} \varphi_k(x_{t+1})^4\,dx_{t+1}\Big]^{1/2} \|p_v(v)\|_2,$$
where the first inequality follows from the Cauchy-Schwarz inequality and the second one is implied by Lem. 10.
6.D Derivation of Eq. (6.16)
$$\eta_k/\eta_0 \le Q \;\Leftrightarrow$$
$$\eta_k/\eta_0 \le (Q - \epsilon) + \epsilon \;\Leftarrow$$
$$\eta_k/\eta_0 \le (Q - \epsilon) + \epsilon\,\frac{h_k - (Q - \epsilon)h_0}{\eta_0} \;\Leftrightarrow$$
$$\eta_k - \epsilon h_k \le (Q - \epsilon)(\eta_0 - \epsilon h_0) \;\Leftrightarrow$$
$$m_e h_k \prod_{d=1}^{D} \frac{\Xi(q, d)}{\prod_{i=1}^{q}(2k_d + 2i)^{1/2}} \le (Q - \epsilon)\, m_e h_0 \prod_{d=1}^{D} \frac{\Xi(q, d)}{\prod_{i=1}^{q}(2i)^{1/2}} \;\Leftarrow$$
$$(Q - \epsilon)^{-1} \le \prod_{d=1}^{D}\prod_{i=1}^{q} \Big(\frac{2k_d + 2i}{2i}\Big)^{1/2} \;\Leftrightarrow$$
$$(Q - \epsilon)^{-2} \le \prod_{d=1}^{D}\prod_{i=1}^{q} \frac{k_d + i}{i}.$$
Chapter 7
Anomaly detection
7.1 Introduction
Anomaly detection refers to detecting patterns in a given data set that
do not conform to a well-defined normal behavior [17]. It serves as an
automatic method to detect system abnormalities or faults that potentially require a counteraction.
Provided the system under consideration can be appropriately modeled, a plethora of model-based methods can be applied [118], [122],
[45]. However, in many cases, the operation principles of the system are
not sufficiently known to constitute the basis of a first-principles model.
Further, for non-linear and non-Gaussian systems, the computational
burden of estimating a black-box model from data can be prohibitively
high or the exogenous excitation be insufficient.
In this chapter, a non-parametric and (analytical) model-free method
for anomaly detection, applicable to systems observed via trajectorial
data, is developed. The method is computationally light and applies to non-linear as well as non-smooth systems.
The basic idea of the method can be outlined as follows. Assume
that a system S follows a normal (or reference), possibly vector-valued,
trajectory r(τ ), where τ ∈ R is a function of time t and/or the system
state vector x(t), i.e.
τ = c(t, x(t)).
(7.1)
The right-hand side expression in (7.1) will be referred to as the context function. For a given τ , x(t) should thus ideally be equal to some
reference state xr (τ ), but being subject to disturbances and system uncertainty, x can be considered as a random variable X characterized by
the distribution G(τ ).
Assume that a set of N observed repeated system trajectory realizations from S, $\Gamma = \{\gamma_1, \gamma_2, ..., \gamma_N\}$, is available, where
$$\gamma_i = \big[x^{(i)}(t_1^{(i)}) \;\; x^{(i)}(t_2^{(i)}) \;\; \ldots \;\; x^{(i)}(t_{n_i}^{(i)})\big] \qquad (7.2)$$
denotes the i-th realization and $x^{(i)}(t_j^{(i)})$, $1 \le j \le n_i$, are the state values at $n_i$ different, possibly non-uniformly sampled, time instants $t_1^{(i)} < t_2^{(i)} < ... < t_{n_i}^{(i)}$.
Now consider a realization γ0 ∈ Γ. It is sought to determine whether
or not γ0 is produced by the same system (i.e. S) as the set Γ. From
the data in Γ, the distribution G(τ ) and the corresponding probability
density function (PDF) fX (τ, x) can be estimated. To statistically test
whether or not γ0 differs from the data in Γ, an outlier test can be
performed on γ0 w.r.t. G. Depending on the degree of outlyingness,
the hypothesis of γ0 being produced by S can be either rejected or
accepted. The parts comprising the method are constructed using tools
from probability theory which are discussed in Sec. 1.4.
The developed method is applied to a set containing vessel traffic
data from the English channel, with the aim to find deviating activities. Due to the large number of objects in the scene and the need of
prompt response to detected abnormalities, computationally demanding
algorithms are not practically feasible. Further, the method is applied
to eye-tracking data, with the aim to detect individual differences in the
oculomotor system, in particular those that are caused by Parkinson’s
disease. The oculomotor system is inherently difficult to model due to
its complex nature and model-free methods are therefore highly relevant.
Promising results for both applications are obtained.
There are methods bearing similarities to the one developed here.
An extensive survey of anomaly detection methods can be found in e.g.
[17]. However, the present work provides a generalization of the existing
approaches that brings about significant refinements and addresses some
of the shortcomings of the existing algorithms:
A typical problem in trajectory anomaly detection is that the trajectories can be of unequal length and unaligned. In this chapter, the
problem is implicitly solved by introducing the concept of a context
function that provides a framework to cope with irregularly sampled data
sets in a systematic and robust manner.
The idea of constructing a confidence volume in the state space to
which the system trajectories should be confined under normal operation is advocated in [16]. However, the use of rectangular boxes in that
approach can be seen as a histogram approximation of the underlying
PDFs. The algorithm proposed in this chapter is based on confidence
regions, thus yielding a more flexible and less data demanding method.
This can be of particular importance for higher dimensional problems.
An envelope of non-anomalous states is proposed in [51]. Each dimension is, however, treated separately, and the method hence cannot account
for the possible (and probable) correlations between features in different dimensions. Neither does it enable handling non-uniformly sampled
trajectories.
In [44], the behavior of the system is also learned by fitting PDFs to
the training data set. Again, the lack of a context function, or a similar
tool, complicates dealing with non-uniformly sampled trajectories and
trajectories of unequal length. Further, only Gaussian distributions, or
at least parametric ones, can be considered in the described framework.
The chapter is composed as follows. First, in Sec. 7.3 through Sec. 7.3.2,
the individual steps comprising the algorithm are presented. In Sec. 7.3.4,
the steps are brought together and the complete algorithm is summarized. Applications to vessel traffic data and eye-tracking data are presented in Sec. 7.4 and the conclusions are drawn in Sec. 7.6.
7.2 Notation
Let $Z = \{z_i\}_{i=0}^{n}$ be a set of discrete points. The function that linearly interpolates a curve between consecutive points in $Z$ is defined as
$$l(\omega, Z) = (z_{\lceil\omega\rceil} - z_{\lfloor\omega\rfloor})(\omega - \lfloor\omega\rfloor) + z_{\lfloor\omega\rfloor}, \quad 0 \le \omega \le n,$$
where $\lceil\cdot\rceil$ and $\lfloor\cdot\rfloor$ are the ceiling and floor function, respectively. A linearly interpolated trajectory obtained from the discrete trajectory is denoted with an overline, $\bar{Z}(\omega) = l(\omega, Z)$.
$N(\mu, \Sigma)$ is the normal distribution with the mean $\mu \in \mathbb{R}^d$ and the covariance $\Sigma \in \mathbb{R}^{d \times d}$.
Pr(A) is the probability of the random event A. For a random variable
X, the probability density function (PDF) is denoted fX (x). Vectors
and vector valued functions are written in bold.
7.3 The anomaly detection method
The idea of the method is to learn the normal behavior of the system
from data by statistically quantifying its characteristics in a way that
allows for computationally light and (analytical) model-free anomaly
detection.
To provide an overview of the adopted approach, the elemental steps
of it are illustrated by Fig. 7.1, for a three-dimensional state space example. There, Fig. 7.1a shows a set of system trajectories obtained
under normal operation. In the first step (Fig. 7.1b), the reference trajectory r(τ ) (in red) describing a ”mean” normal behavior is calculated
(a) A set of trajectories Γ, in a 3-dimensional state space, produced by the system under normal operation. (b) The either known or fitted reference trajectory r(τ). (c) At the knot points r(τ_k), k = 1, 2, ..., n, the PDFs $\hat{f}_{X,k}(x)$ are estimated and confidence regions are computed, marked here with ellipses. (d) The continuous density function $\hat{f}_X(\tau, x)$ is obtained by interpolation. The confidence volume is given by the interior of the shown tube, to which the system should be confined under normal operation.
Figure 7.1. The steps in constructing the confidence volumes.
and discrete knot points for the analysis of further trajectorial data are
established (black dots). In the second step (Fig. 7.1c), PDFs are fitted
for discrete knot points on the reference trajectory and confidence regions are computed. In the third step (Fig. 7.1d), a confidence volume is
constructed by interpolation of the computed confidence regions at the
knot points. This defines a domain (volume) in the state space to which
the system trajectories are confined under normal operation, according
to the given data. The details of the steps are explained and discussed
in the subsections below.
7.3.1 Context function and reference trajectory
To determine the context in which a certain system behavior is expected
is a major problem in anomaly detection. For this purpose, the notions
of context function and reference trajectory are introduced. First define
a function g that maps the system state x(t) and time t to s(t) that is
the variable determining the behavior of the system
s(t) = g(t, x(t)).
For instance, the behavior of a ground vehicle traveling along routes and
roads depends on its position p. At a certain position on the route, the
vehicle is expected to have a certain speed, position and heading. Hence,
the choice
g(t, x(t)) = p(t)
is a natural one in that case. This kind of system is exemplified in
Sec. 7.4 by the vessel tracking application.
Quite often, a temporal function g is a natural choice. For systems
that follow a reference signal that is a function of time, as in the eye
tracking application described in Sec. 7.4, the function g can be specified
as
g(t, x(t)) = t.
The reference trajectory is introduced to aid the computation of a scalar value τ determining the expected behavior under normal operation, and is constructed in the following way. Define $\xi_i$ as the trajectory $\gamma_i$ under the mapping of $g$, i.e.
$$\xi_i = \{g(t_j^{(i)}, \gamma_j^{(i)})\}_{j=1}^{n_i},$$
and a set of such trajectories $\Xi = \{\xi_i\}$. To fit a trajectory to the points in $\Xi$, a curve fitting method can be utilized, see e.g. [26]. However, it can be impractical in higher-dimensional problems and for non-smooth trajectories. A simple though expedient method to fit the trajectory is offered by the following procedure:
• Pick any trajectory from $\Xi$ and denote it $\xi_k$.
• For each element $s_j^{(k)}$, $j = 1, 2, ..., n_k$, in $\xi_k$, find the point $a_i$ in trajectory $\xi_i$ that minimizes $\|a_i - s_j^{(k)}\|_2$, where $\|\cdot\|_2$ stands for the Euclidean norm. The set $A_j = \{a_i\}_{i=1}^{N}$ will contain the points in $\Xi$ closest to $s_j^{(k)}$ in the context space.
• The j-th point, $r_j$, in the reference trajectory is then calculated as the mean of the points in $A_j$, and the continuous reference trajectory is given by $r(\tau) = l(\tau, \{r_j\}_{j=1}^{n_k})$.
From the reference trajectory, the context function returning the context value $\tau^*$ for a given time and state is then defined as
$$\tau^* = c(t, x(t)) := \arg\inf_{\tau} \|r(\tau) - g(t, x(t))\|_2.$$
Evaluating $\tau_i^* = c(t_i, x^{(0)}(t_i))$ in practice can be accomplished by finding the line segment $d_j = l(\theta, \{r_j, r_{j+1}\})$, $1 \le j \le n - 1$, that is closest to the given point, and on that line segment evaluating
$$\sigma = \frac{(r_{j+1} - r_j)^T (x_0 - r_j)}{\|r_{j+1} - r_j\|_2}, \qquad (7.3)$$
obtaining $\tau^*$ from $\tau^* = j + \sigma$. The context function will thus map a given state $x_t$ to a scalar value $\tau^*$, and for that given $\tau^*$ a certain behavior of the system can be expected.
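In code, the evaluation of τ* amounts to a nearest-segment projection. The following is a minimal sketch (not the thesis implementation) for a 2-D context space; it normalizes σ by the squared segment length so that τ* = j + σ with σ in [0, 1], and the helper names are hypothetical.

#include <array>
#include <cmath>
#include <limits>
#include <vector>

using Pt = std::array<double, 2>;

// Context value tau* of a point x0 with respect to the knot points r of the reference trajectory.
double context_value(const std::vector<Pt>& r, const Pt& x0) {
  double best_d2 = std::numeric_limits<double>::max(), tau = 0.0;
  for (std::size_t j = 0; j + 1 < r.size(); ++j) {
    const double dx = r[j + 1][0] - r[j][0], dy = r[j + 1][1] - r[j][1];
    const double len2 = dx * dx + dy * dy;
    // projection parameter of x0 onto segment [r_j, r_{j+1}], clamped to [0, 1]
    double s = ((x0[0] - r[j][0]) * dx + (x0[1] - r[j][1]) * dy) / len2;
    s = std::fmax(0.0, std::fmin(1.0, s));
    const double px = r[j][0] + s * dx - x0[0], py = r[j][1] + s * dy - x0[1];
    const double d2 = px * px + py * py;       // squared distance to the segment
    if (d2 < best_d2) { best_d2 = d2; tau = double(j) + s; }
  }
  return tau;
}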
7.3.2 Probability density function
For a given context value τ ∗ , the state of the system should ideally be
given by some reference state xr . However, due to system uncertainty
and disturbances, the state can be considered as a stochastic variable
X with the distribution G(τ ∗ ) and the corresponding PDF fX (τ ∗ , x).
The PDF fX (τ ∗ , x) can be estimated from the training data set Γ for a
discrete set of values {τk }nk=1 .
Let Tk be the set of points in the trajectories x̄i , i = 1, 2, ..., N , that
have context value equal to τk , i.e.
Tk = {x̄i (t)|τk = c(t, x̄i (t)), i = 1, 2, .., N }.
A PDF fˆX,k (x) can then be fitted to the data for each k. If the sample is
known to come from a parametric family of distributions, a parametric
estimator is a natural choice. In absence of a priori information about
the distribution from which the sample is generated, non-parametric
methods have to be employed to find fˆX,k (x). Histogram estimators,
kernel estimators, and orthogonal basis function estimators are typical
examples of non-parametric techniques as discussed in Sec. 1.4.
For the applications treated in Sec. 7.4, a Gaussian approximation and an orthogonal series approximation are used.
To handle continuous values of τ in the anomaly detection method, a piecewise linear continuous approximation of $f_X(\tau, x)$, interpolating between the discrete points $\{\tau_k\}_{k=1}^{n}$, is computed as
$$\hat{f}_X(\tau, x) = l(\tau, \{\hat{f}_{X,k}(x)\}_{k=1}^{n}).$$
As proved in Appendix 7.A.2, fˆX (τ, x) is a PDF for a given τ as it is
non-negative and integrates to 1.
7.3.3 Outlier detection
To determine whether the system state is anomalous, outlier detection
(see Sec. 1.4) is applied, aided by the fitted PDF fˆX (τ, x). A p-value
is then computed, specifying how unlikely the observation is under the
null hypothesis, and used to classify if the observation is anomalous or
not. The p-value can be computed from Eq. (1.20). For an estimated PDF that belongs to a parametric family of distributions, an analytic expression for Eq. (1.20) can often be determined. For example, if $X \sim N(\mu, \Sigma)$, (1.20) is given by
$$p(x_0) = \chi_d^2\big((x_0 - \mu)^T \Sigma^{-1} (x_0 - \mu)\big), \qquad (7.4)$$
where $\chi_d^2(z)$ is the z-quantile function for the $\chi^2$ distribution with d degrees of freedom. For a non-parametric estimator, it is generally not possible to evaluate (1.20) analytically, and numerical methods have to be employed. A brief discussion of numerical evaluation of (1.20) is given in Appendix 7.A.1.
7.3.4 Anomaly detection method recapitulated
Assume that a data set Γ comprising system trajectories arising from
tracking of r by system S is given. To determine whether γ0 , defined by
(7.2), is likely to be generated by similar mechanisms as were the data in
Γ, the steps that should be performed are given in Alg. 24. The actual
implementation of the last step depends on the purpose of anomaly
detection. If the aim is to make an immediate detection of a fault, a
warning should be raised directly when an observation achieves a p-value
below the threshold. To scrutinize a trajectory, the cumulative anomaly
score over the whole trajectory can be studied. The proposed method
only provides the ”raw” anomaly values. There are several possible ways
Algorithm 24 Anomaly detection
• For each observation x(0) (ti ), i = 1, 2, ..., n0 :
– Determine the context by τi∗ = c(ti , x(0) (ti )).
– Calculate the p-value pi w.r.t. fˆX (τi∗ , x).
• Based on the obtained p-values pi , i = 1, 2, ..., n0 , decide whether
the null hypothesis H0 w.r.t. γ0 should be accepted or rejected.
to specialize or refine the detection by processing these further, but this is outside the scope of the development here.
7.4 Experimental results
The proposed method is here evaluated on two anomaly detection applications with respect to vessel traffic data and eye-tracking data.
7.4.1 Vessel traffic
Supervision of vessel traffic is of importance to detect potentially dangerous or suspicious activities such as accidents, engine failures, smuggling,
drunkenness etc. Manual supervision is an expensive and tedious task,
due to the rarely occurring anomalies and the typically large number of
objects in a scene. A data set from the English channel was scanned
for abnormalities using the algorithm. A synthetic data set, where the
ground truth is known, was also studied to evaluate the method.
Real data
Data recordings from the Automatic Identification System (AIS) 1 of
freight ships travelling in the English Channel were made for 72 hours.
The state of each vessel is given by x(t) = [x(t), y(t), v(t), φ(t)]T , where
x, y, v and φ denote the longitude, latitude, speed and heading respectively. A total of 182 trajectories were recorded, see Fig. 7.2.
From these, N = 100 trajectories were used as the training data set
Γ. The behavior of the vessel depends on its position x(t), y(t) in the
route. At a given position, it is supposed to have a certain speed and
heading and hence the function g is selected as
$$g(t, x(t)) = [x(t), y(t)]^T. \qquad (7.5)$$
¹Vessels over 300 gross tonnes transmit their longitude, latitude, speed, and course via the AIS system.
Figure 7.2. Recorded trajectories of vessels travelling through the English
channel.
The context trajectory r̂ was estimated from Γ, using the method described in Sec. 7.3.1. For the PDFs at each knot point fX,k (x), the
distribution was approximated as a Gaussian one, i.e. G = N (μ, Σ),
justified by the fact that the Lilliefors normality test supported the assumption of normality at 86 of the 100 knot points. The mean and covariance
were computed as the sample mean and covariance as given by (1.12)
and (1.13), respectively.
The anomaly detection algorithm was then applied to the remaining
82 trajectories, where the p-values were computed from (7.4) w.r.t. the
fitted PDF fˆ(τ, x), revealing aberrations that fall into three types (see
Fig. 7.3):
Type 1: Vessels going into a harbor.
Type 2: Vessels going off the route direction.
Type 3: Vessels that present a clearly abnormal behavior compared to
other vessels at similar positions.
The p-values for Type 2 anomalies were several orders of magnitude lower
than the anomaly scores for the Type 3 anomalies.
Synthetic data
A synthetic data set was produced by simulating vessels controlled by
PD controllers that track the reference trajectory while holding a given
reference speed by exercising a limited force. Random disturbing forces
were added to each simulated vessel. In total 300 ”normal” vessels were
simulated, of which 200 trajectories were used as the training data set Γ.
Figure 7.3. Zoom-in of points classified as anomalous of type 1, 2 and 3 respectively. Trajectories are given by gray lines. r̂ is marked by the thick line.
Points classified as anomalous are marked by plus signs.
Also 3 trajectories, $\gamma_{a_i}$, i = 1, 2, 3, generated by objects with anomalous
behavior, were simulated with the following characteristics:
γa1 - Travels in the opposite direction to the normal one on the route.
γa2 - First behaves normally but then slows down and stops on the
route at time step 20.
γa3 - Controlled by a PD controller with ”bad” parameters, giving it
a somewhat wiggly behavior, though not apparently deviating to
the human perception.
For comparison, the method suggested in [51] was implemented and run
on the same data set. The method does not employ the concept of
context function and therefore faces problems with varying speeds and
accelerations along the route. Further, it does not account for correlation
between the state variables. Fig. 7.4 displays the anomaly score obtained
at τk for each vessel. To facilitate better visual separation, the distance
(x0 − μ)T Σ−1 (x0 − μ) in (7.4), rather than the p-value, is plotted. Thus,
the higher the value, the more anomalous is the point.
Figure 7.4. Anomaly score using proposed method (top) and method [51]
(bottom). ”Normal” trajectories are shown in gray. The anomalous trajectories
γai , i = 1, 2, 3 are marked according to the legend. The threshold value is
marked by the black dashed line. Note the log scale on the y-axis.
Discussion
In the real data set, the underlying causes of the behaviors classified by the algorithm as anomalous are not known to the authors. The
anomalies of Type 3 definitely seem reasonable to raise warnings for since
the behavior is clearly distinguished from the other vessels at similar
positions. Whether the Type 2 anomalies should result in warnings or
not is more difficult to judge. However, the anomaly scores for these
were low and could just be presented as a notification to the operator.
The Type 1 anomalies should not raise warnings since these are not
actual anomalies. Because a ship is supposed to broadcast over the
AIS system what harbor she is heading for, these types of warnings can
easily be suppressed by using that additional information. A closer look at the data also reveals that there are no apparent anomalies that are
not detected by the algorithm. In the simulated data set, it can be seen
that the method reliably detects the anomalous trajectories while giving
a low rate of false alarms. Compared to the method in [51], a better
separation between the anomalous and normal trajectories is obtained.
For instance, from the p-values in the bottom sub-plot, it is not possible
to tell γa3 from the normal trajectories using the method of [51].
Computational complexity
The proposed anomaly detection method boasts low computational complexity. The online computations necessary to judge whether a point x0
is anomalous or not are basically to compute τ ∗ = c(x0 , t) from (7.3),
and evaluate (7.4). This is found to require about 100 FLOPs (floating-point operations). The processor used for execution of the implemented code (Intel Core2 Quad Q9550, 2.83 GHz) can perform about 40 GFLOPS, and hence theoretically process about $4 \cdot 10^7$ anomaly classifications per second. In practice, this number is lower, due to computational
overhead such as data movement etc.
In Matlab, which is interpreted and hence executes substantially slower than compiled code (e.g. in C++), tests show that an implementation can process about $7 \cdot 10^5$ points per second. This is
far more than required to handle the given scene, which at maximum
contained 453 vessels at the same time instant.
7.4.2 Eye-tracking
There are different types of eye movement (the two most commonly mentioned being saccades and smooth pursuit) [19], all of which are governed
by complex neuromuscular systems. Research has shown that various
medical conditions, e.g. Parkinson’s Disease [30], affect the smooth pursuit system negatively, motivating the search for accurate quantification
Figure 7.5. Recording of eye movements.
methods that could then be used as diagnostic or even staging tools. The
oculomotor system is inherently difficult to model due to complex nonlinear dynamics and it is therefore of interest to find a non-parametric
approach to use as a supplement for model-based methods.
Experiment
Four test subjects
• P1 : Healthy man, 26 years old
• P2 : Healthy man, 27 years old
• P3 : Healthy man, 54 years old
• P4 : Parkinson patient, 62 years old
were put in front of a computer screen and asked to follow a gaze reference $r(t_k)$ in the form of a moving dot on the screen, designed to have suitable characteristics as in [47]. Thus, $r(t_k)$ is the x, y coordinates of the dot at time step k. The j-th recording for test subject $P_i$ is denoted
$$\gamma_{P_i}^{(j)} = \big[x_{P_i}^{(j)}(t_1) \;\; x_{P_i}^{(j)}(t_2) \;\; \ldots \;\; x_{P_i}^{(j)}(t_n)\big],$$
where $x_{P_i}^{(j)}(t_k)$ is the x, y position at which the test subject $P_i$ is looking at time sample $t_k$, recorded by a video-based eye tracker from Smart Eye AB, Sweden. Fig. 7.6 shows a picture of the recording of eye movements.
P1 tracked the reference 40 times, while P2 , P3 and P4 tracked the
reference 5 times each.
Figure 7.6. Part of the trajectory for the visual stimuli.
Since the reference is a function of time, the function g was chosen as
$$g(t, x(t)) = t, \qquad (7.6)$$
which implies that the reference trajectory is $r(\tau) = \tau$ and the context function is simply given by
$$c(t, x(t)) := \arg\inf_{\tau} \|r(\tau) - t\|_2 = t. \qquad (7.7)$$
The PDFs $\hat{f}(t_k, x)$ were estimated from the first 35 realizations belonging to $P_1$, i.e. $\{\gamma_{P_1}^{(j)}\}_{j=1}^{35}$, using an orthogonal series estimator and the first 5 Hermite basis functions. The p-values for the data in $\{\gamma_{P_1}^{(j)}\}_{j=36}^{40}$, $\{\gamma_{P_2}^{(j)}\}_{j=1}^{5}$, $\{\gamma_{P_3}^{(j)}\}_{j=1}^{5}$ and $\{\gamma_{P_4}^{(j)}\}_{j=1}^{5}$ were evaluated w.r.t. $\hat{f}(t_k, x)$, $k = 1, 2, ..., 500$. In Fig. 7.7, the cumulative logarithmic p-value
$$p_c(t) = \sum_{k=0}^{t} \log_{10} p(k) \qquad (7.8)$$
is shown, where $p(k)$ denotes the p-value at time step k. The cumulative p-values obtained using a Gaussian distribution are also provided for comparison.
Discussion
From Fig. 7.7, differences in the oculomotor system of the test subjects
can be observed. The Parkinson patient, P4 , is naturally the most distinguished test subject. The 54 year old test subject, P3 is also clearly
154
Figure 7.7. The upper figure shows the p-values for test subjects P1, P2, P3 and P4 for each sampling instant, using Hermite functions to estimate the PDFs; the lower figure shows the p-values when the PDFs are approximated using a Gaussian distribution. Notice the log scale on the y-axis.
separated from P1 and P2 , 26 and 27 years old respectively. This is likely
to be a consequence of age, a factor known to affect the oculomotor system. P1 has the highest p-values, which is to be expected as the training data set, and hence the definition of normal behavior, comes from P1. The PDFs for this application tend to be skewed. Indeed, a better distinction
between the test subjects is achieved using a non-parametric estimator
than a Gaussian estimator, Fig. 7.7. As this study only contains four
test subjects, it is not possible to make more insightful conclusions based
on the available data. Subsequent studies containing more test subjects
will be performed to draw statistically significant conclusions.
7.5 Limitations
The method requires that enough realizations are available to enable
accurate estimation of the involved PDFs. This can be especially problematic for high-dimensional systems since estimation of PDFs of high
dimension requires many observations to achieve accuracy. It has, though, been shown [110] that orthogonal series estimates exhibit a convergence rate that is independent of the dimension, a property which makes them an appealing option for high-dimensional estimation.
7.6 Conclusions
A non-parametric and analytical model-free anomaly detection method
is presented. The method is applicable to systems following a given reference whose trajectory realizations are observed. The method is based
on the estimation of statistical distributions characterizing the trajectory deviations from the reference. With the aid of these distributions
and by utilizing outlier detection methods, it can be concluded whether
or not a given system trajectory is likely to be generated by the same
mechanisms as the training data set. The developed method performs
well in the two considered applications. Being model-free, the method
is suitable for systems that are difficult to model appropriately and/or
highly nonlinear.
7.A Appendix for Chapter 7
7.A.1 Evaluation of Eq. (1.20)
One approach to evaluating Eq. (1.20) is to approximate it using a Riemann sum by the following steps. Let $\{x_i\}_{i=1}^{N}$ denote a set of equidistant grid points and denote the volume element for one grid point $V$. Evaluate $f_X$ over the grid,
$$y_i = f_X(x_i), \qquad (7.9)$$
$i = 1, 2, ..., N$. Let $\{y_{a_k}\}_{k=1}^{N}$ be the ordered set of the points $y_i$, such that $y_{a_1} \le y_{a_2} \le ... \le y_{a_N}$, and denote the cumulative sum
$$c_m = V \sum_{i=1}^{m} y_{a_i}, \quad 1 \le m \le N, \; m \in \mathbb{N}.$$
An approximation of the p-value for an observation $x_0$ is then given by
$$p(x_0) \approx c_n, \qquad (7.10)$$
where $n = \arg\max_i \,(y_{a_i} \le f_X(x_0))$. The approximation can be made arbitrarily accurate by refining the grid. This can be computed off-line and does not influence the on-line performance of the method. To further minimize the on-line computational load, a lookup table for the p-value can be set up. The only computation then required to evaluate the p-value is to compute $f_X(x_0)$ and check the lookup table for the corresponding p-value. More sophisticated methods for numerical integration than the Riemann sum can be applied in a straightforward manner.
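The Riemann-sum approximation (7.9)-(7.10) translates into a few lines of code. The sketch below (a hypothetical helper, not thesis code) shows the scalar case; the multivariate case only changes the grid and the volume element.

#include <algorithm>
#include <functional>
#include <vector>

// Approximate p-value of x0 under the scalar PDF f on a grid of N cells over [lo, hi]:
// the accumulated probability mass of all cells whose density does not exceed f(x0).
double p_value_riemann(const std::function<double(double)>& f,
                       double x0, double lo, double hi, int N) {
  const double V = (hi - lo) / N;                // volume element of one grid cell
  std::vector<double> y(N);
  for (int i = 0; i < N; ++i) y[i] = f(lo + (i + 0.5) * V);
  const double y0 = f(x0);
  std::sort(y.begin(), y.end());                 // y_{a_1} <= ... <= y_{a_N}
  double c = 0.0;
  for (double yi : y) {
    if (yi > y0) break;                          // stop once the density exceeds f(x0)
    c += V * yi;                                 // cumulative sum c_m as in (7.10)
  }
  return c;
}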
7.A.2 Proof of fˆX (τ, x) being a PDF
A function is a PDF if it is non-negative and integrates to 1. For $\hat{f}_X(\tau, x)$ it holds that
$$\int_{\mathbb{R}^d} \hat{f}_X(\tau, x)\,dx = \int_{\mathbb{R}^d} l(\tau, \{\hat{f}_{X,k}(x)\}_{k=0}^{n})\,dx = \Big[\int_{\mathbb{R}^d} \hat{f}_{X,\lceil\tau\rceil}(x)\,dx - \int_{\mathbb{R}^d} \hat{f}_{X,\lfloor\tau\rfloor}(x)\,dx\Big](\tau - \lfloor\tau\rfloor) + \int_{\mathbb{R}^d} \hat{f}_{X,\lfloor\tau\rfloor}(x)\,dx = (1 - 1)(\tau - \lfloor\tau\rfloor) + 1 = 1$$
and that $\hat{f}_X(\tau, x) = \hat{f}_{X,\lceil\tau\rceil}(x)(\tau - \lfloor\tau\rfloor) + (1 - (\tau - \lfloor\tau\rfloor))\hat{f}_{X,\lfloor\tau\rfloor}(x) \ge 0$, since $(\tau - \lfloor\tau\rfloor) \ge 0$, $1 - (\tau - \lfloor\tau\rfloor) \ge 0$, $\hat{f}_{X,\lceil\tau\rceil}(x) \ge 0$ and $\hat{f}_{X,\lfloor\tau\rfloor}(x) \ge 0$; it is hence a PDF.
Chapter 8
Application to parameter estimation in PK/PD model
8.1 Introduction
Nonlinear dynamical models provide a broad framework for biological
and physiological systems and are well suited for the problem of drug
delivery control, [35]. While first-principles pharmacokinetic/pharmacodynamic (PK/PD) models make direct use of insights into the underlying physiological processes, they also usually involve numerous uncertain and individual-specific parameters to be identified. At the same
time, nonlinear dynamics demand sufficient exogenous excitation both
in frequency and amplitude to safeguard identifiability from measured
input-output data. Since the drug in a closed-loop drug delivery system
incorporating the patient is administered by a feedback controller, an accurate and expedient identification of the model parameters is required
in order to guarantee safety of the treatment.
The inter-patient variability in response to administration of drugs
greatly complicates the automatic drug delivery. Due to a huge variation in PK/PD model parameters that can amount to hundreds of percent, it is difficult and often impossible to design a single controller that
performs reasonably well over a broad patient population. Further, the
performance of an individualized feedback controller for drug delivery is
directly influenced by the intra-patient variability, i.e. the uncertainty
incurred by the changes in the PK/PD characteristics of the patient
throughout a clinical event. Patient response to an anesthetic drug can
also alter due to noxious stimuli or hemorrhaging. Intra-patient variability might exceed the robustness margins of a time-invariant controller
design and demand adaptation or online controller re-design as in, e.g.
[120].
Due to the physiologically motivated saturations in the nonlinear
PK/PD, the high uncertainty in the mathematical model may lead,
under closed-loop drug administration, to a limit cycle. The nonlinear oscillations result in alternating under- and overdosing episodes that
compromise the intended therapeutic effect and patient safety.
Simple model structures can capture the dynamics of the system, i.e. the human body, that are the most significant to the closed-loop performance in response to drug administration, allowing at the same time for suitable
model individualization. Minimal parsimonious models for the effect of
drugs in anesthesia were proposed in [95] and [94], followed by [36] and
[40].
In this chapter, the estimation performance of the extended Kalman
filter (EKF) is compared to that of two particle filter (PF) algorithms
in an application to neuromuscular blockade (NMB) nonlinear Wiener
model. Results show that the more computationally intensive PF, making direct use of the nonlinear model, performs better than the EKF that relies on model linearization. For comparison, the OBPF, given in Chapter 6, is also implemented and evaluated in the application. The OBPF
provides regularization to the filtering problem by fitting a truncated
orthogonal series expansion to the particle set. The truncation order
of the expansion is thus a user parameter. It is investigated how the
regularization benefit the filter estimates, and also how the truncation
order affects the filter accuracy. The matter of intra-patient variability
in terms of model parameter estimates is also assessed in this chapter
by a comparison of the tracking capabilities of the EKF, PF, and the
OBPF.
The numerical experiments performed on synthetic and clinical data
show that the EKF is the computationally cheapest option but is prone
to a significant bias. The estimates of both PFs are not biased, and the
PF and OBPF perform similarly when there is no limit to the number
of the particles used. For a moderate number of particles, the OBPF
demonstrates higher accuracy at the same computational price as the
PF.
Recent research has shown that complex nonlinear dynamics may arise
in the closed-loop system of a Wiener model for the NMB controlled by a
PID feedback. According to [121], there exists a region in the parameter
space where the system possesses a single stable equilibrium and, when
varying the parameters, this equilibrium undergoes a bifurcation that
leads to the emergence of self-sustained nonlinear oscillations. Notably,
oscillating PID loops in closed-loop anesthesia have been observed in
clinical experiments, e.g. [2]. A third contribution of this chapter is a
quantification of the distance to bifurcation for the identified models.
This quantification provides insight into how close to a nonlinear oscillation the closed-loop system is and it may be used as a flag in a safety
u(t) → Linear Dynamics → ȳ(t) → Static Nonlinearity → y(t)
Figure 8.1. Block diagram of a Wiener model.
net for PID-controlled anesthesia. Therefore, the considered identification algorithms can be used not only for controller design but also for control-loop monitoring that assesses online the risk for oscillations.
The remainder of this chapter is organized as follows. Section 8.2 describes the parsimonious nonlinear Wiener model that is used to parametrize
the effect of the muscle relaxant atracurium in the NMB. Section 8.3
briefly introduces the EKF, the PF, and the OBPF. Section 8.4 summarizes the data sets and the performance metrics that were used to assess
parameters convergence as well as filtering and tracking capabilities of
the considered parameter estimation techniques. Section 8.5 presents
the estimation results. The conclusions are drawn in Section 8.6.
8.2 Parsimonious Wiener Model
A block diagram of a Wiener model is shown in Fig. 8.1. In the parsimonious Wiener model for the NMB, [95], that is adopted in this chapter,
the model input u(t) [μg kg−1 min−1 ] is the administered atracurium
rate, and the model output y(t) [%] is the NMB level. The continuous-time output of the linear dynamic part, here denoted as $\bar{y}(t)$, is not accessible for measurement.
The transfer function of the linear dynamic part of the Wiener model
is given by
$$G_p(\alpha) = \frac{k_1 k_2 k_3 \alpha^3}{(s + k_1\alpha)(s + k_2\alpha)(s + k_3\alpha)}, \qquad (8.1)$$
that may be realized in state-space form as
$$\dot{x}(t) = A(\alpha)\, x(t) + B(\alpha)\, u(t), \qquad (8.2a)$$
$$\bar{y}(t) = C\, x(t), \qquad (8.2b)$$
$$A(\alpha) = \alpha \begin{bmatrix} -k_3 & 0 & 0 \\ k_2 & -k_2 & 0 \\ 0 & k_1 & -k_1 \end{bmatrix}, \qquad (8.2c)$$
$$B(\alpha) = \alpha \begin{bmatrix} k_3 & 0 & 0 \end{bmatrix}^T, \quad C = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}, \qquad (8.2d)$$
where $0 \le u(t) \le u_{max}$ is the input signal.
The constants ki , {i = 1, 2, 3} are positive, and α [min−1 ] > 0 is
the patient-dependent parameter to be identified in the linear block. In
the analysis that follows, the values chosen in [93], k1 = 1, k2 = 4 and
k3 = 10 are assumed.
The effect of the drug is quantified by the measured NMB y(t) [%]
and modeled by the Hill function as
$$y(t) = \frac{100\, C_{50}^{\gamma}}{C_{50}^{\gamma} + \bar{y}(t)^{\gamma}}, \qquad (8.3)$$
where γ (dimensionless) is the patient-dependent parameter to be identified in the nonlinear block, y(t) is the output of the nonlinearity, and
C50 [μg kg−1 min−1 ] is a normalizing constant that is equal to 3.2435 in
simulations.
In order to implement the model in the estimation algorithms, the
structure in (8.2) and (8.3) was discretized using a zero-order hold
method with sampling rate h = 1/3 min−1 .
A random walk model, [101], for the model parameters is assumed.
With subscripts denoting discrete time, the resulting (sampled) augmented state vector xt is
$$x_t = \begin{bmatrix} x_t^T & \alpha_t & \gamma_t \end{bmatrix}^T. \qquad (8.4)$$
Then the extended state-space model becomes the following:
$$x_{t+1} = \begin{bmatrix} \Phi(\alpha_t) & 0_{3\times 2} \\ 0_{2\times 3} & I \end{bmatrix} \begin{bmatrix} x_t \\ \alpha_t \\ \gamma_t \end{bmatrix} + \begin{bmatrix} \Gamma(\alpha_t) \\ 0_{2\times 1} \end{bmatrix} u_t + v_t \equiv f(x_t, u_t) + v_t, \qquad (8.5a)$$
$$y_t = \frac{100\, C_{50}^{\gamma_t}}{C_{50}^{\gamma_t} + (C x_t)^{\gamma_t}} + e_t \equiv h(x_t) + e_t, \qquad (8.5b)$$
where vt ∈ R5 , et ∈ R are white zero-mean Gaussian noise processes,
with the probability density functions pv (v) and pe (e), respectively. The
system matrices Φ(α), Γ(α) are the discretized versions of A(α), B(α)
in (8.2).
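For illustration, the following is a minimal sketch (not the thesis implementation) of the augmented model (8.5) with the state [x1 x2 x3 alpha gamma]; for brevity it uses a simple forward-Euler step of length h for the linear block instead of the zero-order-hold discretization used in the text, with k1, k2, k3 and C50 as in Sec. 8.2.

#include <array>
#include <cmath>

using State = std::array<double, 5>;
constexpr double k1 = 1.0, k2 = 4.0, k3 = 10.0, C50 = 3.2435, h = 1.0 / 3.0;

// Noise-free state transition x_{t+1} = f(x_t, u_t), cf. (8.5a); alpha and gamma follow a random walk.
State f(const State& x, double u) {
  const double a = x[3];                            // parameter alpha
  State xn = x;                                     // alpha and gamma are kept unchanged
  xn[0] = x[0] + h * a * (-k3 * x[0] + k3 * u);     // dx1/dt = alpha(-k3 x1 + k3 u)
  xn[1] = x[1] + h * a * ( k2 * x[0] - k2 * x[1]);  // dx2/dt = alpha( k2 x1 - k2 x2)
  xn[2] = x[2] + h * a * ( k1 * x[1] - k1 * x[2]);  // dx3/dt = alpha( k1 x2 - k1 x3)
  return xn;
}

// Measurement y_t = h(x_t), cf. (8.5b): Hill function applied to the linear-block output C x = x3.
double h_meas(const State& x) {
  const double g = x[4];                            // parameter gamma
  const double ybar = x[2];
  return 100.0 * std::pow(C50, g) / (std::pow(C50, g) + std::pow(ybar, g));
}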
8.3 Estimation algorithms
The EKF and the PF are widely used in nonlinear state estimation.
The EKF builds on the idea of extending Kalman filtering to nonlinear
models. At each time step, the filter gain is computed by linearizing
the nonlinear model around the previous state estimates. Unlike the
Kalman filter, the EKF is not an optimal filter and assumes both the
process and sensor noise to be Gaussian.
The PF uses Monte Carlo simulation to obtain a sample from the estimated posterior distribution of the state, from which point estimates
can be extracted. It provides a general framework for estimation in nonlinear non-Gaussian systems. The PF exploits the underlying nonlinear
model as it is, but yields an approximation to the true solution of the
filtering problem. The approximation can be made arbitrarily accurate
by increasing the number of particles, but the latter comes with the cost
of an increased computational burden.
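A bare-bones bootstrap particle filter iteration of the kind described above might look as follows (a simplified sketch with a scalar Gaussian measurement noise of variance R; it is not the parallel implementation discussed elsewhere in the thesis):

```python
import numpy as np

def pf_step(particles, u, y, f, h, Q, R, rng):
    """One bootstrap PF iteration: propagate, weight by the likelihood, resample."""
    N, n = particles.shape
    # Propagate every particle through the full nonlinear model plus process noise
    noise = rng.multivariate_normal(np.zeros(n), Q, size=N)
    particles = np.array([f(p, u) for p in particles]) + noise
    # Importance weights from the measurement likelihood p(y_t | x_t)
    innov = y - np.array([h(p) for p in particles])
    weights = np.exp(-0.5 * innov**2 / R)
    weights /= weights.sum()
    # Point estimate before resampling
    x_hat = weights @ particles
    # Multinomial resampling to an equally weighted particle set
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], x_hat

# Usage: rng = np.random.default_rng(0); particles, x_hat = pf_step(...)
```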
The third filtering method under consideration is the OBPF. At the
resampling step, it approximates the posterior by an orthogonal series
expansion. A new set of particles is created by making a draw from
the fitted distribution. Compared to the PF, the OBPF is even more
suitable for parallelization. Through the orthogonal series approximation, it also provides a regularization of the problem, improving the estimation accuracy for a smaller number of particles.
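To convey the idea of the OBPF resampling step, a much simplified one-dimensional sketch is given below: a truncated series in orthonormal Hermite functions is fitted to the weighted particle set, and the new particles are drawn from the fitted density evaluated on a grid. The Hermite basis, the grid-based sampling, and the absence of any data scaling are simplifications relative to the method developed in Chapters 5 and 6.

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, pi, sqrt

def hermite_fn(k, x):
    """Orthonormal Hermite function psi_k(x)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return hermval(x, coef) * np.exp(-0.5 * x**2) / sqrt(2.0**k * factorial(k) * sqrt(pi))

def obpf_resample(particles, weights, K, rng, grid_size=2000):
    """Fit a truncated orthogonal series to a weighted 1-D particle set and
    draw a new, equally weighted particle set from the fitted density."""
    # Series coefficients c_k ~ sum_i w_i psi_k(x_i)
    c = np.array([np.sum(weights * hermite_fn(k, particles)) for k in range(K + 1)])
    # Evaluate the fitted density on a grid; clip small negative ripples to zero
    grid = np.linspace(particles.min() - 1.0, particles.max() + 1.0, grid_size)
    dens = np.clip(sum(c[k] * hermite_fn(k, grid) for k in range(K + 1)), 0.0, None)
    probs = dens / dens.sum()
    new_particles = rng.choice(grid, size=particles.size, p=probs)
    return new_particles
```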
8.3.1 Filter tuning
Following the procedure in [87], the EKF, the PF and the OBPF with
5 × 104 particles were tuned individually over a synthetic database (see
Section 8.5) aiming at the best performance in terms of convergence
speed and bias with reasonable output filtering. For the OBPF, the performance was evaluated for approximation orders K = 0, 1, 2, 3, 4. For
the sake of evaluation consistency, this tuning was used for all simulations in this chapter. Notice that the initial covariance matrix of the
EKF was not increased further, which would have resulted in a reduced
settling time of the estimates. The reason is that, with a more aggressive tuning, the estimates of the nonlinear parameter γ suffered from
divergence for several cases.
The tuned covariance matrices for the EKF are as follows:
$$P_{1|0} = \mathrm{diag}\!\left(10^{-4}\;\; 10^{-4}\;\; 10^{-4}\;\; 10^{-4}\;\; 100\right),$$
$$Q = \mathrm{diag}\!\left(10^{-2}\;\; 10^{-2}\;\; 10^{-2}\;\; 10^{-8}\;\; 10^{-3}\right),\qquad(8.6)$$
$$R = 1,$$
where diag(·) denotes a diagonal matrix with the specified elements of
the main diagonal.
The tuned covariance matrices for the PF and OBPF are as follows:
$$P_{1|0} = \mathrm{diag}\!\left(10^{-4}\;\; 10^{-4}\;\; 10^{-4}\;\; 10^{-2}\;\; 100\right),$$
$$Q = \mathrm{diag}\!\left(10^{-3}\;\; 10^{-3}\;\; 10^{-3}\;\; 10^{-8}\;\; 10^{-3}\right),\qquad(8.7)$$
$$R = 0.7.$$
The initial estimates of the parameters were calculated as the mean
over the synthetic database (see Section 8.4.1), i.e. 0.0378 for α and
2.6338 for γ.
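In code, the tuning in (8.6)-(8.7) and the initial parameter estimates amount to the following (the zero initial values of the linear-block states are an assumption made here for illustration):

```python
import numpy as np

# EKF tuning, cf. (8.6)
P0_ekf = np.diag([1e-4, 1e-4, 1e-4, 1e-4, 100])
Q_ekf  = np.diag([1e-2, 1e-2, 1e-2, 1e-8, 1e-3])
R_ekf  = 1.0

# PF / OBPF tuning, cf. (8.7)
P0_pf = np.diag([1e-4, 1e-4, 1e-4, 1e-2, 100])
Q_pf  = np.diag([1e-3, 1e-3, 1e-3, 1e-8, 1e-3])
R_pf  = 0.7

# Initial augmented state: linear-block states set to zero (an assumption),
# alpha and gamma set to the means over the synthetic database (Section 8.4.1)
x0 = np.array([0.0, 0.0, 0.0, 0.0378, 2.6338])
```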
8.4 Data sets and performance evaluation metrics
The two data sets and the metrics used for the estimation performance
evaluation are described below.
8.4.1 Synthetic Data
The database of synthetic data generated as described in [87] is used
in this chapter. In brief, the data were obtained by simulating system (8.5) with the parameter sets {α^(i), γ^(i)}, i = 1, ..., 48, from [73]. The
input (i.e. drug dose) used to generate the 48 synthetic data sets was
the same as the one administered in the 48 real cases, to guarantee that
the excitatory properties of the real input signals are preserved.
Convergence properties:
In order to assess the convergence properties in terms of bias and settling time, the model parameters α(i) and γ (i) for each case i were kept
constant during the whole simulation.
The settling time for an estimate θ̂t of a scalar parameter θ is here
defined as the time ts = ks h, where ks is the least value for which
$$\max_{t\ge k_s} \hat{\theta}_t - \min_{t\ge k_s} \hat{\theta}_t \le L\qquad(8.8)$$
is satisfied, i.e. the estimate will be confined to a corridor of width L,
for k larger than or equal to ks . If the signal settles, the bias in the
estimate is defined as
$$b_\theta = \theta - \frac{1}{N^* - k_s}\sum_{t=k_s}^{N^*} \hat{\theta}_t,\qquad(8.9)$$
where N ∗ is the number of samples from ks to the end of the case being
evaluated.
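A direct way to compute the settling time (8.8) and the bias (8.9) from an estimate sequence is sketched below (the corridor width L is a user choice, and the averaging over the settled tail corresponds to (8.9) up to the exact normalization):

```python
import numpy as np

def settling_time_and_bias(theta_hat, theta_true, L, h):
    """Settling time (8.8) and bias (8.9) for a scalar parameter estimate sequence.
    theta_hat is the sequence of estimates, h the sampling period."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    for ks in range(theta_hat.size):
        tail = theta_hat[ks:]
        if tail.max() - tail.min() <= L:             # corridor of width L, cf. (8.8)
            return ks * h, theta_true - tail.mean()  # bias, cf. (8.9)
    return np.inf, np.nan                            # the estimate never settles
```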
Tracking properties:
As in [87], to assess the tracking properties of the algorithms, the true
value of γ for the model simulation is made to evolve following a sigmoidal decay of 20% after minute 50, i.e. time step k0 = 150, according
to

$$\gamma_t = \begin{cases} \rho, & k \le k_0, \\[4pt] \rho\left(1 - 0.2\,\dfrac{1}{1 + \left(\dfrac{k_0}{k-k_0}\right)^{3}}\right), & k > k_0, \end{cases}\qquad(8.10)$$
where ρ = γ (i) for case i. This is to simulate slow drifts in the dynamics
that might occur during a general anesthesia episode. The parameter in
the nonlinear block (PD, γ) is chosen for this test over the parameter in
the linear one (PK, α) to highlight the nonlinear estimation performance
of the evaluated algorithms.
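The time profile (8.10) is straightforward to generate; a small sketch (with an arbitrary nominal value ρ for illustration) is:

```python
import numpy as np

def gamma_profile(rho, k0, n_steps):
    """Time-varying gamma according to (8.10): constant up to k0, then a 20% sigmoidal decay."""
    k = np.arange(n_steps)
    gamma = np.full(n_steps, rho, dtype=float)
    after = k > k0
    gamma[after] = rho * (1.0 - 0.2 / (1.0 + (k0 / (k[after] - k0)) ** 3))
    return gamma

# Example: decay starting at time step k0 = 150 (minute 50 with a 1/3 min sampling period)
profile = gamma_profile(rho=2.0, k0=150, n_steps=300)
```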
Distance to bifurcation:
Following [121], the condition for the birth of sustained nonlinear oscillations of the PID closed-loop system is given by a surface that is
nonlinear in the model parameter α and the controller gains R and L,
as defined in the Ziegler-Nichols tuning procedure. The choice of this
tuning procedure follows the work of [61].
The 48 models in the synthetic database were used to obtain the {R(i), L(i)}, i = 1, ..., 48, via Ziegler-Nichols. Considering a nominal model i, the nominal controller gains (R(i), L(i)) define a point in the (R, L) two-dimensional space. The parameter estimates α̂k from the PF estimation give rise to different bifurcation conditions that, in the case of a fixed α̂k at each sampling time k, can be represented by lines in the (R, L) space. To assess how close the nominal closed-loop model defined by (R(i), L(i)) is to the bifurcation
condition at each time instant, the minimum of the Euclidean distance
between this point and the bifurcation line was numerically calculated
by a grid search.
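The bifurcation condition itself is taken from [121] and is not reproduced here; the grid-search distance computation, however, is a generic operation, sketched below with a purely hypothetical placeholder line in the (R, L) plane:

```python
import numpy as np

def distance_to_curve(point, curve_fn, R_range, num=10_000):
    """Minimum Euclidean distance from `point` = (R0, L0) to the curve L = curve_fn(R),
    estimated by a simple grid search over R."""
    R_grid = np.linspace(R_range[0], R_range[1], num)
    L_grid = curve_fn(R_grid)
    return np.hypot(R_grid - point[0], L_grid - point[1]).min()

# Placeholder line standing in for the bifurcation condition of [121]
dist = distance_to_curve(point=(1.0, 0.5),
                         curve_fn=lambda R: 0.4 * R + 0.1,
                         R_range=(0.0, 5.0))
```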
8.4.2 Real data
The database of real cases is the same as in [87] and includes 48 datasets
collected from patients subject to PID-controlled administration of the
muscle relaxant atracurium under general anesthesia.
Real data were used to validate the conclusions drawn from the synthetic data experiments. The output errors obtained in the EKF, PF
and OBPF filtering were compared for the four phases of anesthesia
covered by the data sets, [87]. Phase 1, 0 < t ≤ 10 min, corresponds
to the induction; Phase 2, 10 < t ≤ 30 min, is the time interval when
only a P-controller was used; Phase 3, 30 < t ≤ 75 min, is between the
beginning of the recovery from the initial bolus and the time when the
reference reaches its final value of 10%; Phase 4, 75 < t ≤ tend , corresponds to steady-state. During Phases 3 and 4, drug administration was
PID-controlled for t ≥ 30 min.
8.5 Results
This section presents the results of the EKF, the PF and the OBPF
estimation of the nonlinear Wiener model for the NMB described in
Section 8.2.
8.5.1 Synthetic data
Fig. 8.2 shows the parameter estimates of case #7 in the database of
synthetic cases. As in [87], the estimates obtained by the PF, in solid
blue line, converge faster than the estimates obtained by the EKF, in
dashed green line, and exhibit less bias (8.9). This behavior persists in
most of the cases in the database and the bias is more prominent for
higher values of α and γ. Fig. 8.6 illustrates this by showing the true
α and γ vs. bias (8.9) in the estimates for the PF and EKF for the
48 cases in the database. It is hence evident that the PF, in general,
yields estimates with less bias than the EKF, this effect being especially
prominent for large values of α and γ. The presence of a higher bias in
the estimates of the EKF for higher values of the nominal parameters
may be explained by the fact that the gain of the EKF is calculated
from a linearized version of the nonlinear Wiener model while the PF
performs no linearization at all. The performance of the OBPF is very
similar to the performance of the PF for large particle sets (N ≥ 10^4).
The root mean square error (RMSE)

$$R = \sqrt{\frac{1}{T}\sum_{t=0}^{T}(x_t - \hat{x}_t)^2}\qquad(8.11)$$
for the PF and OBPF is shown in Fig. 8.3 for different particle set sizes
and approximation orders. As can be seen, for smaller N , the OBPF
gives better estimation than the PF, due to the regularization provided
by the fitting of the expansion. No particular difference in RMSE performance of the OBPF can be seen between different approximation
orders K. However, when evaluating the posterior PDF at a given time
instant, the OBPF gives a better fit for higher truncation orders, as
exemplified by Fig. 8.4. The true marginal distribution for α is shown
together with the approximate PDFs obtained by the OBPF of different
orders. Given these results, it is probably not worthwhile to spend the
extra computation required for a higher approximation order since this
has little impact on the final quality of the point estimate, as shown by
the RMSE. Fig. 8.5 depicts the estimates of γ for a case where the
true value, plotted in dotted red, changes obeying a sigmoidal function
after minute 50, according to (8.10). The EKF estimates are plotted in
dashed green, while the PF estimates are plotted in solid blue. This is
Figure 8.2. Estimated α (upper plot) and γ (bottom plot) for the OBPF,
PF and EKF for case number 7 in the synthetic database. The settling time
instants according to (8.8) are marked by the arrows.
[Two panels: RMSE of the α estimate (top) and of the γ estimate (bottom), with curves for OBPF(0), OBPF(2), OBPF(4) and the PF, plotted against the number of particles N.]
Figure 8.3. Root mean square error as a function of the number of particles N
used for filtering.
Figure 8.4. Marginal distribution for α at time t = 5min. The true PDF is
shown in dashed black, while the approximation obtained from the OBPF using
approximation orders from 0 to 4 are shown in colored lines.
Figure 8.5. Estimated γ for the EKF, PF and OBPF. At t = 50, the true γ
starts drifting according to (8.10).
Figure 8.6. The true α and γ vs. estimation bias bα and bγ , respectively, for
the 48 cases in the synthetic database. The results for the EKF are plotted in
green circles and the results for the PF are plotted in blue crosses.
a case representative of the behavior of the estimates in all the 48 cases
in the synthetic database. As in the case of time-invariant parameters, the EKF exhibits a higher bias when tracking the change than the PF does, while no particular difference can be observed between the PF and the OBPF.
8.5.2 Real data
Keeping the tuning unchanged, the EKF, the PF and the OBPF were
applied to the 48 cases of real input-output data. Fig. 8.7 shows the
estimates of α and γ over time for case #39 in the real data database. Since these are real data, the true parameter values are not available. The higher variance of
the estimates of γ, when compared to that of the estimates of α, supports
the choice of only assessing the tracking performance of both estimation
techniques with respect to changes in γ, as argued in Section 8.4.1.
Fig. 8.8 shows the mean of the absolute value of the output error with
the 1σ confidence interval over all 48 cases. Numerical values of the output errors are also given in Table 8.2 for the four different experimental
phases described in Section 8.4.2. The general result is that the PF
exhibits a much lower output error during the induction phase, i.e. for
0 < t < 10 min, when compared with the output error that is obtained
with the EKF estimates. For 10 ≤ t < 30 min, the EKF provides slightly
better output errors, possibly due to less prominent nonlinear dynamics
exhibited in this interval. For t ≥ 30 min, the performance is similar
for the EKF and the PF. The better performance of the PF during the
highly nonlinear induction phase is likely due to the fact that the PF is designed
to handle nonlinear systems without recourse to linearization.
Numerical values of the output errors per phase are given in Table 8.2.
The general result is that the PF exhibits a much lower output error
during Phase 1, 0 < t ≤ 10 min, when compared with the output error
[Upper panel: estimated α; lower panel: estimated γ; curves for OBPF(2), PF and EKF, plotted against time (min).]
Figure 8.7. Estimated model parameters for the EKF, in dashed green, and
the PF, in solid blue, over time for case #39 in the real database.
Figure 8.8. The mean μ_e^{ekf} and μ_e^{pf} of the absolute value of the output error over the 48 cases for the EKF and PF, respectively. The 1σ confidence intervals are given by the transparent areas.
Table 8.1. Mean and standard deviation of the simulation output error, with the parameters θ̂t = {α̂t, γ̂t} obtained at t = 10 min and at t = t_end.

          θ̂_10     θ̂_tend
  mean    2.32     4.13
  stdv    0.13     0.22
that is obtained with the EKF estimates. For Phase 2, 10 < t ≤ 30
min, the EKF provides slightly better output errors, possibly due to less
prominent nonlinear dynamics exhibited in this interval. For Phases 3
and 4, t > 30 min, the performance is similar for the three estimation
algorithms. The better performance of the PF during the highly nonlinear induction phase is attributed to the fact that the algorithm handles the
nonlinear dynamics without recourse to linearization.
In order to get some insight into the need for estimating the model parameters throughout the whole surgery and, consequently, for the development of adaptive control strategies, the system was simulated with the estimates of α and γ obtained after induction (at time t = 10 min), and with the estimates obtained at the last time step of the estimation (at
t = tend ). The mean and standard deviation over the 48 cases of the
output errors are shown in Tab. 8.1. This result shows that, from minute
10 to the end of the surgery, the changes in the model parameters affect the goodness of fit of the simulated model to the real data. It is
therefore plausible that adaptive/re-designed controllers would perform
better during the maintenance phase than non-adaptive ones, especially in longer surgical interventions.
Given the time-varying nature of the patient dynamics, in a PID
control setup, and for safety reasons, it is important to judge whether
the system is driven into a parameter region where a bifurcation might
lead to nonlinear oscillations. The distance to bifurcation is calculated
according to [121] for the 48 cases at t = 40 min and presented in
a histogram in Fig. 8.9. The histogram is representative for all time
instants t > 10 min, as the distance depends only on α̂ which typically
settles before t = 10 min. It can be seen that most of the cases are
further than 10^-2 from the critical surface. Three cases are nevertheless
closer to the surface, which may be of concern in real practice.
It should be noted that the better performance of the PF and OBPF comes at a much higher computational cost than that of the EKF. For this application, the EKF and the PF/OBPF require FLOPs on the order of 10^3 and 10^7 per iteration, respectively. In Fig. 8.10, the
RMSE as a function of the computational complexity is shown for the
PF and OBPF. It can be seen that the OBPF provides better RMSE
results for a given number of FLOPS for the approximation orders K = 0
and K = 2, but is more computationally costly for a higher truncation
order (K ≥ 4). Unoptimized Matlab implementations were clocked to
Figure 8.9. Histogram of the distance to bifurcation, at time t = 40 min,
over the 48 cases in the synthetic database, assuming PID control. Note the
log-scale on the x-axis.
Figure 8.10. Root mean square error (RMSE) as a function of the number of floating-point operations (FLOPs) required for filter execution.
Table 8.2. Output error (absolute value) of estimation for the EKF, the PF
and the OBPF for different approximation orders, during the four phases defined in Section 8.4.2.
  Phase   EKF mean   EKF stdv [min,max]    PF mean   PF stdv [min,max]
  1       4.16       0.62 [2.58, 5.42]     0.95      0.47 [0.24, 2.34]
  2       0.49       0.17 [0.16, 0.85]     0.58      0.39 [0.14, 1.97]
  3       0.31       0.16 [0.08, 0.98]     0.30      0.16 [0.13, 0.77]
  4       0.25       0.16 [0.04, 0.97]     0.25      0.13 [0.07, 0.76]

  Phase   OBPF(0) mean   OBPF(0) stdv [min,max]   OBPF(5) mean   OBPF(5) stdv [min,max]
  1       0.87           0.53 [0.32, 1.98]        0.90           0.44 [0.18, 2.14]
  2       0.52           0.15 [0.15, 1.22]        0.52           0.18 [0.17, 1.52]
  3       0.31           0.18 [0.06, 0.85]        0.28           0.18 [0.05, 0.74]
  4       0.26           0.16 [0.04, 0.78]        0.23           0.14 [0.05, 0.69]
run one filtering iteration in 0.5ms for the EKF and about 2s for the
PF. For the implementations in hand, the execution time for the PF
is hence four orders of magnitude greater than that of the EKF. Since
the sampling period is 20 s, this difference in execution time is, however, not an issue. Importantly, the OBPF has full parallelization potential,
and linear speedup in the number of cores employed can be expected on
a multicore computer. Hence, on an eight-core machine used for filter
computation, the execution time can be brought down to 1/8 of that for
single-core execution.
8.6 Conclusions
The nonlinear estimation algorithms EKF, PF and OBPF were compared on a parsimonious Wiener model for the neuromuscular blockade (NMB) in closed-loop anesthesia. For this application, the PF and the OBPF provide significantly better estimation quality than the EKF, but at a computational cost that is four orders of magnitude higher in terms of FLOPs.
The estimation performance of the OBPF and the PF is similar. However, for a given number of FLOPs, the OBPF with a low truncation order can provide better estimation quality than the PF. Using a truncation order higher than K = 0 did not result in any significant improvement in the point estimates provided by the filter, since the underlying probability distribution is already captured well by a single term, i.e. it is close to a normal distribution. A better fit of the underlying PDF is, however, achieved with a higher truncation order. The improvement in the PDF fit is not considerable enough in this application to justify the increased computational cost of the higher truncation order.
Chapter 9
BLAS-based parallelizations of the UKF and point mass filter
When performing a parallelization, the simplest option should, of course, be investigated first. If a readily available, highly optimized library can be used, that is the way to go. This chapter gives a brief presentation of the results from BLAS-based implementations of the UKF and a point mass filter.
9.1 UKF
In the UKF, the square root of the error covariance matrix has to be computed at every iteration, and this step consumes the vast majority of the computations in the method. Hence, parallelizing the UKF is mainly a matter of parallelizing the Cholesky factorization. A subroutine for parallel Cholesky factorization is available in BLAS. An implementation of the UKF, based on the BLAS routine for Cholesky factorization, has been performed. The execution time and speedup results are summarized in Tab. 9.1 and Fig. 9.1, respectively. As can be seen, the scalability of the parallel UKF is good when the problem size is large (n ≥ 1000). For smaller problem sizes, the parallel overhead constitutes a disproportionately large part of the execution time, which results in poor scalability.
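The thesis implementation calls the library's parallel Cholesky routine directly; as a rough illustration of where that factorization enters the UKF, the sketch below generates the sigma points from the Cholesky factor of the scaled covariance using SciPy, whose cholesky is backed by LAPACK/BLAS and runs multithreaded when linked against a threaded BLAS:

```python
import numpy as np
from scipy.linalg import cholesky

def sigma_points(x, P, kappa=0.0):
    """Generate the 2n+1 UKF sigma points; the Cholesky factorization of the
    scaled covariance is the dominant cost and is delegated to LAPACK/BLAS."""
    n = x.size
    S = cholesky((n + kappa) * P, lower=True)   # S @ S.T = (n + kappa) * P
    pts = np.empty((2 * n + 1, n))
    pts[0] = x
    pts[1:n + 1] = x + S.T                      # x plus each column of S
    pts[n + 1:] = x - S.T                       # x minus each column of S
    return pts

# Example: pts = sigma_points(np.zeros(100), np.eye(100))
```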
Table 9.1. Single-core execution time T for different problem sizes n, for execution of the UKF.

  n        100      200      500      1000      2000
  T [ms]   0.0958   0.4911   6.2881   45.1539   336.1521
Figure 9.1. Speedup curves for parallel implementation of UKF for different
problem sizes n. For reference linear speedup is marked by the dashed line.
9.2 Point mass filter
One iteration of a grid-based method consists of computing (1.61) and (1.62), i.e.

$$w_{t|t-1}^{(i)} = \sum_{j=1}^{N} w_{t-1|t-1}^{(j)}\, p\big(x_t^{(i)} \mid x_{t-1}^{(j)}\big),$$

$$w_{t|t}^{(i)} = w_{t|t-1}^{(i)}\, p\big(y_t \mid x_t^{(i)}\big),$$

for i = 1, 2, ..., N. Defining

$$w_{t|t-1} = \begin{bmatrix} w_{t|t-1}^{(1)} & w_{t|t-1}^{(2)} & \cdots & w_{t|t-1}^{(N)} \end{bmatrix}^T,$$

$$P_t = \begin{bmatrix}
p\big(x_t^{(1)} \mid x_{t-1}^{(1)}\big) & p\big(x_t^{(1)} \mid x_{t-1}^{(2)}\big) & \cdots & p\big(x_t^{(1)} \mid x_{t-1}^{(N)}\big) \\
p\big(x_t^{(2)} \mid x_{t-1}^{(1)}\big) & p\big(x_t^{(2)} \mid x_{t-1}^{(2)}\big) & & \vdots \\
\vdots & & \ddots & \\
p\big(x_t^{(N)} \mid x_{t-1}^{(1)}\big) & \cdots & & p\big(x_t^{(N)} \mid x_{t-1}^{(N)}\big)
\end{bmatrix},$$

$$Q_t = \mathrm{diag}\Big( p\big(y_t \mid x_t^{(1)}\big),\; p\big(y_t \mid x_t^{(2)}\big),\; \ldots,\; p\big(y_t \mid x_t^{(N)}\big) \Big),$$
Algorithm 25 Pseudo code for one iteration of a grid-based method.
  for i = 1:N
      w_{t|t-1}^(i) = 0
      for j = 1:N
          w_{t|t-1}^(i) = w_{t|t-1}^(i) + w_{t-1|t-1}^(j) * p(x_t^(i) | x_{t-1}^(j))
      end
      w_{t|t}^(i) = p(y_t | x_t^(i)) * w_{t|t-1}^(i)
  end
Figure 9.2. Speedup curves for execution of Alg. 25 for different problem sizes
N. For reference linear speedup is marked by the dashed line.
the prediction and update equations (1.61) and (1.62) can be expressed in matrix form as

$$w_{t|t-1} = P_t\, w_{t-1|t-1},\qquad(9.1)$$
$$w_{t|t} = Q_t\, w_{t|t-1}.\qquad(9.2)$$
These matrix equations could then be implemented using, e.g., BLAS, or by the pseudo-code in Alg. 25, where the parallelization is performed over the iterations of the i loop. Tab. 9.2 and Fig. 9.2 show the execution time and the scalability, respectively, of an implementation of Alg. 25 for different problem sizes N. Note that the problem size is given by the number of grid points N and not by the number of states n as for the Kalman filter based methods.
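The matrix form (9.1)-(9.2) maps directly onto one matrix-vector product and one element-wise scaling, which is what makes a BLAS-backed implementation attractive; a minimal sketch with random placeholder densities (and a weight normalization added for numerical convenience) is:

```python
import numpy as np

def grid_filter_step(w_prev, P_t, q_t):
    """One grid-based filter iteration in matrix form, cf. (9.1)-(9.2).
    w_prev : weights w_{t-1|t-1} over the N grid points
    P_t    : N x N matrix with entries p(x_t^(i) | x_{t-1}^(j))
    q_t    : length-N vector with entries p(y_t | x_t^(i))"""
    w_pred = P_t @ w_prev            # prediction (9.1): a single BLAS gemv call
    w_filt = q_t * w_pred            # measurement update (9.2): diagonal scaling
    return w_filt / w_filt.sum()     # normalization (added here for convenience)

# Example with random placeholder densities on N grid points
N = 500
rng = np.random.default_rng(0)
P_t = rng.random((N, N))
P_t /= P_t.sum(axis=0)               # make each column a discrete density over i
q_t = rng.random(N)
w = grid_filter_step(np.full(N, 1.0 / N), P_t, q_t)
```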
Table 9.2. Execution time T for sequential execution of the grid-based estimator for different numbers of grid points N.

  N        100      200      500      5000       20000
  T [ms]   0.1109   0.4389   2.7909   270.7901   4334.2
References
[1] OpenMP. http://www.cs.virginia.edu/stream/, Aug. 2010.
[2] A.R. Absalom and G. N. C. Kenny. Closed loop control of propofol anaesthesia using bispectral index: performance assessment in patients receiving computer-controlled propofol and manually controlled remifentanil for minor surgery. British Journal of Anaesthesia, 90(6):737–741, 2003.
[3] V.J. Aidala. Kalman filter behavior in bearings-only tracking
applications. Aerospace and Electronic Systems, IEEE Transactions on,
AES-15(1):29 –39, jan. 1979.
[4] D.L. Alspach and H.W. Sorenson. Nonlinear bayesian estimation using
gaussian sum approximations. Automatic Control, IEEE Transactions
on, 17(4):439–448, Aug.
[5] Gene M. Amdahl. Validity of the single processor approach to achieving
large scale computing capabilities. In AFIPS ’67 (Spring): Proceedings
of the April 18-20, 1967, Spring Joint Computer Conference, pages
483–485, New York, NY, USA, 1967. ACM.
[6] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Dover Books on
Electrical Engineering Series. Dover Publications, 2005.
[7] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on
particle filters for online nonlinear/non-gaussian bayesian tracking.
Signal Processing, IEEE Transactions on, 50(2):174 –188, February
2002.
[8] Anwer S Bashi, Vesselin P Jilkov, X Rong Li, and Huimin Chen.
Distributed implementations of particle filters. In Proc. of the Sixth Int.
Conf. of Information Fusion, pages 1164–1171, 2003.
[9] A.S. Bashi, V.P. Jilkov, X.R. Li, and Huimin Chen. Distributed
implementations of particle filters. In Information Fusion, 2003.
Proceedings of the Sixth International Conference of, volume 2, pages
1164 – 1171, 2003.
[10] G. J. Bierman. Factorization Methods for Discrete Sequential
Estimation. Academic Press, New York, NY, 1977.
[11] M. Bolic, P.M. Djuric, and Sangjin Hong. New resampling algorithms
for particle filters. In Acoustics, Speech, and Signal Processing, 2003.
Proceedings. (ICASSP ’03). 2003 IEEE International Conference on,
volume 2, pages II – 589–92 vol.2, April 2003.
[12] Miodrag Bolic, Petar M. Djuric, and Sangjin Hong. Resampling
algorithms and architectures for distributed particle filters. IEEE
Transactions on Signal Processing, 53:2442–2450, 2004.
[13] John P Boyd. Asymptotic coefficients of hermite function series.
Journal of Computational Physics, 54(3):382–410, 1984.
[14] D. Brunn, F. Sawo, and U.D. Hanebeck. Nonlinear multidimensional
bayesian estimation with Fourier densities. In Decision and Control,
2006 45th IEEE Conference on, pages 1303 –1308, dec. 2006.
[15] R. S. Bucy and K. D. Senne. Digital synthesis of non-linear filters.
Automatica, 7(3):287–298, May 1971.
[16] Philip K. Chan and Matthew V. Mahoney. Modeling multiple time
series for anomaly detection. In 5th IEEE Interational conference on
data mining, pages 90–97. IEEE Computer Society, 2005.
[17] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly
detection: A survey. ACM Comput. Surv., 41:15:1–15:58, July 2009.
[18] E.W. Cheney. Multivariate Approximation Theory: Selected Topics.
CBMS-NSF Regional Conference Series in Applied Mathematics.
Society for Industrial and Applied Mathematics, 1986.
[19] Raymond Dodge. Five types of eye movement in the horizontal
meridian plane of the field of regard. American Journal of Physiology –
Legacy Content, 8(4):307–329, 1903.
[20] R. Douc and O. Cappe. Comparison of resampling schemes for particle
filtering. In Image and Signal Processing and Analysis, 2005. ISPA
2005. Proceedings of the 4th International Symposium on, pages 64 – 69,
sept. 2005.
[21] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo
Methods in Practice. Statistics for Engineering and Information Science
Series. Springer, 2001.
[22] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential
monte carlo sampling methods for bayesian filtering. Statistics and
Computing, 10(3):197–208, 2000.
[23] M. Ekman. Particle filtering and data association using attribute data.
Information Fusion, 2009. FUSION09.12 th International Conference
on, (10):9–16, July 2009.
[24] A. Erdélyi. Asymptotic Expansions. Dover Books on Mathematics.
Dover Publications, 1956.
[25] Magnus Evestedt, Alexander Medvedev, and Torbjörn Wigren. Windup
properties of recursive parameter estimation algorithms in acoustic echo
cancellation. Control Engineering Practice, 16(11):1372 – 1378, 2008.
[26] Lian Fang and David C Gossard. Multidimensional curve fitting to
unorganized data points by nonlinear minimization. Computer-Aided
Design, 27(1):48 – 58, 1995.
[27] J.E. Gentle. Elements of Computational Statistics. Statistics and
Computing. Springer, 2002.
[28] H.O. Georgii. Stochastics: Introduction to Probability Theory and Statistics. De Gruyter Textbook. De Gruyter, 2008.
[29] Norman E. Gibbs, Jr. Poole, William G., and Paul K. Stockmeyer. An
algorithm for reducing the bandwidth and profile of a sparse matrix.
SIAM Journal on Numerical Analysis, 13(2):pp. 236–250, 1976.
[30] J M Gibson, R Pimlott, and C Kennard. Ocular motor and manual tracking in Parkinson's disease and the effect of treatment. J Neurol Neurosurg Psychiatry, 50(7):853–60, 1987.
[31] Stefan Goedecker and A. Hoisie. Performance Optimization of
Numerically Intensive Codes. Software, Environments and Tools.
Society for Industrial and Applied Mathematics, 2001.
[32] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal
Processing, IEE Proceedings F, 140(2):107–113, April 1993.
[33] A. Grama. Introduction to Parallel Computing. Pearson Education.
Addison-Wesley, 2003.
[34] M.S. Grewal and A.P. Andrews. Kalman Filtering: Theory and Practice
Using MATLAB. Wiley, 2011.
[35] Wassim M. Haddad, Tomohisa Hayakawa, and James M. Bailey.
Adaptive control for nonlinear compartmental dynamical systems with
applications to clinical pharmacology. Systems & Control Letters,
55(1):62 – 70, 2006.
[36] Jin-Oh Hahn, G.A. Dumont, and J.M. Ansermino. A direct dynamic
dose-response model of propofol for individualized anesthesia care.
Biomedical Engineering, IEEE Transactions on, 59(2):571–578, 2012.
[37] J. M. Hammersley and K. W. Morton. Poor man’s monte carlo. Journal
of the Royal Statistical Society. Series B (Methodological), 16(1):pp.
23–38.
[38] Eberhard Hansler. The hands-free telephone problem: an annotated
bibliography update. Annals of Telecommunications, 49:360–367, 1994.
[39] A. Hekler, M. Kiefel, and U.D. Hanebeck. Nonlinear bayesian
estimation with compactly supported wavelets. In Decision and Control
(CDC), 2010 49th IEEE Conference on, pages 5701 –5706, dec. 2010.
[40] Ramona Hodrea, Radu Morar, Ioan Nascu, and Horatiu Vasian.
Modeling of neuromuscular blockade in general anesthesia. In Advanced
Topics in Electrical Engineering, 2013 8th International Symposium on,
pages 1–4, 2013.
[41] H. Holma and A. Toskala. WCDMA for UMTs: Radio Access for Third
Generation Mobile Communications. John Wiley & Sons, 2001.
[42] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University
Press, 1990.
[43] S. Howard, Hak-Lim Ko, and W.E. Alexander. Parallel processing and
stability analysis of the Kalman filter. In Computers and
Communications, 1996., Conference Proceedings of the 1996 IEEE
Fifteenth Annual International Phoenix Conference on, pages 366 –372,
Mar. 1996.
[44] Weiming Hu, Xuejuan Xiao, Zhouyu Fu, Dan Xie, Tieniu Tan, and
Steve Maybank. A system for learning statistical motion patterns.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
28:1450–1464, 2006.
[45] Rolf Isermann. Process fault detection based on modeling and
estimation methods a survey. Automatica, 20(4):387 – 404, 1984.
[46] K. Ito and K. Xiong. Gaussian filters for nonlinear filtering problems.
Automatic Control, IEEE Transactions on, 45(5):910 –927, may 2000.
[47] D. Jansson and A. Medvedev. Visual stimulus design in parameter
estimation of the human smooth pursuit system from eye-tracking data.
Submitted to IEEE American Control Conference, Washington D.C,
2013.
[48] Daniel Jansson, Alexander Medvedev, and Olov Rosén. Parametric and
non-parametric analysis of eye-tracking data by anomaly detection.
IEEE Transactions on Control Systems Technology, 2014.
[49] Daniel Jansson, Olov Rosén, and Alexander Medvedev. Non-parametric
analysis of eye-tracking data by anomaly detection. In Control
Conference (ECC), 2013 European, pages 632–637. IEEE, 2013.
[50] S.J. Julier and J.K. Uhlmann. Unscented filtering and nonlinear
estimation. Proceedings of the IEEE, 92(3):401 – 422, mar 2004.
[51] I.N. Junejo, O. Javed, and M. Shah. Multi feature path modeling for
video surveillance. In Pattern Recognition, 2004. ICPR 2004.
Proceedings of the 17th International Conference on, volume 2, pages
716 – 719 Vol.2, aug. 2004.
[52] T. Kailath, A.H. Sayed, and B. Hassibi. Linear estimation.
Prentice-Hall information and system sciences series. Prentice Hall,
2000.
[53] R. E. Kalman. A New Approach to Linear Filtering and Prediction
Problems. Transactions of the ASME–Journal of Basic Engineering,
82(Series D):35–45, 1960.
[54] A.N. Kolmogorov, W. Doyle, I. Selin, Rand Corporation, and United
States. Air Force. Interpolation and Extrapolation of Stationary Random
Sequences. Memorandum (Rand Corporation). Rand Corporation, 1962.
[55] Jayesh H. Kotecha and P.M. Djuric. Gaussian sum particle filtering.
Signal Processing, IEEE Transactions on, 51(10):2602–2612, Oct 2003.
[56] J.H. Kotecha and P.M. Djuric. Gaussian particle filtering. Signal
Processing, IEEE Transactions on, 51(10):2592 – 2601, oct. 2003.
[57] Jun S Liu. Monte Carlo strategies in scientific computing. springer,
2008.
[58] P.A.C. Lopes and M.S. Piedade. A Kalman filter approach to active
noise control. In Proc. EUSIPCO, volume 3, page 230, 2000.
[59] G.G. Lorentz. Approximation of Functions. AMS Chelsea Publishing
Series. AMS Chelsea, 2005.
[60] P. M. Lyster, C. H. Q. Ding, K. Ekers, R. Ferraro, J. Guo, M. Harber,
D. Lamich, J. W. Larson, R. Lucchesi, R. Rood, S. Schubert,
W. Sawyer, M. Sienkiewicz, A. da Silva, J. Stobie, L. L. Takacs,
R. Todling, and J. Zero. Parallel computing at the nasa data
assimilation office (dao). In Proceedings of the 1997 ACM/IEEE
conference on Supercomputing (CDROM), Supercomputing ’97, pages
1–18, New York, NY, USA, 1997. ACM.
[61] Teresa Mendonça and Pedro Lago. PID control strategies for the
automatic control of neuromuscular blockade. Control Engineering
Practice, 6(10):1225 – 1231, 1998.
[62] S. Oliveira and D.E. Stewart. Writing Scientific Software: A Guide to
Good Style. Cambridge University Press, 2006.
[63] M.A. Palis and D.K. Krecker. Parallel Kalman filtering on the
Connection Machine. In Frontiers of Massively Parallel Computation,
1990. Proceedings., 3rd Symposium on the, pages 55 –58, Oct. 1990.
[64] T. Palmer and R. Hagedorn. Predictability of Weather and Climate.
Cambridge University Press, 2006.
[65] Beresford N. Parlett. Reduction to tridiagonal form and minimal
realizations. SIAM Journal on Matrix Analysis and Applications,
13(2):567–593, 1992.
[66] D.A. Patterson and J.L. Hennessy. Computer Organization and Design,
Revised Fourth Edition: The Hardware/Software Interface. Morgan
Kaufmann Series in Computer Graphics. Elsevier Science, 2011.
[67] B.M. R. Parallel Computing. New Age International (P) Limited, 2009.
[68] T. Rauber and G. Rünger. Parallel Programming: for Multicore and
Cluster Systems. Springer, 2010.
[69] B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter:
Particle Filters for Tracking Applications. Artech House Radar Library.
Artech House, 2004.
[70] C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations
to Computational Implementation. Springer Texts in Statistics.
Springer, 2007.
[71] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer
Texts in Statistics. Springer, 2004.
[72] Christian P. Robert and George Casella. Monte Carlo Statistical
Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc.,
Secaucus, NJ, USA, 2005.
[73] C. Rocha, Teresa Mendonça, and Maria E. Silva. Modelling
neuromuscular blockade: a stochastic approach based on clinical data.
Mathematical and Computer Modelling of Dynamical Systems,
19(6):540–556, 2013.
[74] O. Rosen and A. Medvedev. Efficient parallel implementation of state
estimation algorithms on multicore platforms. Control Systems
Technology, IEEE Transactions on, PP(99):1 –14, 2011.
[75] Olov Rosén and Alexander Medvedev. Parallel recursive estimation,
based on orthogonal series expansions. In American Control Conference
(ACC), 2014, pages 622–627, June 2010.
[76] Olov Rosén and Alexander Medvedev. Efficient parallel implementation
of a Kalman filter for single output systems on multicore computational
platforms. In Decision and Control and European Control Conference
(CDC-ECC), 2011 50th IEEE Conference on, pages 3178–3183. IEEE,
2011.
[77] Olov Rosén and Alexander Medvedev. An on-line algorithm for
anomaly detection in trajectory data. In American Control Conference
(ACC), 2012, pages 1117–1122. IEEE, 2012.
[78] Olov Rosén and Alexander Medvedev. Parallelization of the Kalman
filter for banded systems on multicore computational platforms. In 2012
IEEE 51st Annual Conference on Decision and Control (CDC), pages
2022–2027, 2012.
[79] Olov Rosén and Alexander Medvedev. Efficient parallel implementation
of state estimation algorithms on multicore platforms. Control Systems
Technology, IEEE Transactions on, 21(1):107–120, 2013.
[80] Olov Rosén and Alexander Medvedev. The recursive Bayesian
estimation problem via orthogonal expansions: an error bound. IFAC
WC, Aug, 2014.
[81] Olov Rosén and Alexander Medvedev. Nonlinear identification of
individualized drug effect models in neuromuscular blockade. Submitted
to a journal, 2015.
[82] Olov Rosén and Alexander Medvedev. Orthogonal basis particle
filtering : an approach to parallelization of recursive estimation.
Submitted to a journal, 2015.
[83] Olov Rosén and Alexander Medvedev. Parallel recursive estimation
using Monte Carlo and orthogonal series expansions. In American
Control Conference, Palmer House Hilton, Chicago, IL, USA, 2015.
[84] Olov Rosén, Alexander Medvedev, and Mats Ekman. Speedup and
tracking accuracy evaluation of parallel particle filter algorithms
implemented on a multicore architecture. In Control Applications
(CCA), 2010 IEEE International Conference on, pages 440–445. IEEE,
2010.
[85] Olov Rosén, Alexander Medvedev, and Daniel Jansson. Non-parametric
anomaly detection in trajectorial data. Submitted to a journal, 2014.
[86] Olov Rosén, Alexander Medvedev, and Torbjörn Wigren. Parallelization
of the Kalman filter on multicore computational platforms. Control
Engineering Practice, 21(9):1188–1194, 2013.
[87] Olov Rosén, Margarida M Silva, and Alexander Medvedev. Nonlinear
estimation of a parsimonious Wiener model for the neuromuscular
blockade in closed-loop anesthesia. In Proc. 19th IFAC World Congress,
pages 9258–9264. International Federation of Automatic Control, 2014.
[88] Wilson J. Rugh. Linear System Theory. Prentice Hall, Upper Saddle River, N.J., 2nd edition, 1996.
[89] Stuart C. Schwartz. Estimation of probability density by an orthogonal
series. The Annals of Mathematical Statistics, 38(4):pp. 1261–1265,
1967.
[90] Stuart C. Schwartz. Estimation of probability density by an orthogonal
series. The Annals of Mathematical Statistics, 38(4):1261–1265, 08 1967.
[91] D.W. Scott. Multivariate Density Estimation: Theory, Practice, and
Visualization. Wiley Series in Probability and Statistics. Wiley, 2009.
[92] P. L. Shaffer. Implementation of a parallel extended Kalman filter using
a bit-serial silicon compiler. In ACM ’87: Proceedings of the 1987 Fall
Joint Computer Conference on Exploring technology: today and
tomorrow, pages 327–334, Los Alamitos, CA, USA, 1987. IEEE
Computer Society Press.
[93] M. M. Silva. Prediction error identification of minimally parameterized Wiener models in anesthesia. In Proc. 18th IFAC World Congress, pages 5615–5620, Aug. 28 - Sep. 2, 2011.
[94] M. M. Silva, T. Mendonça, and T. Wigren. Online nonlinear
identification of the effect of drugs in anaesthesia using a minimal
parameterization and bis measurements. In American Control
Conference, pages 4379–4384, 2010.
[95] M.M. Silva, T. Wigren, and T. Mendonça. Nonlinear identification of a
minimal neuromuscular blockade model in anesthesia. Control Systems
Technology, IEEE Transactions on, 20(1):181–188, 2012.
[96] B.W. Silverman. Density Estimation for Statistics and Data Analysis.
Chapman & Hall/CRC Monographs on Statistics & Applied
Probability. Taylor & Francis, 1986.
[97] B.W. Silverman. Density Estimation for Statistics and Data Analysis.
Chapman & Hall/CRC Monographs on Statistics & Applied
Probability. Taylor & Francis, 1986.
[98] T. Söderström. Discrete-time stochastic systems: estimation and
control. Prentice Hall international series in systems and control
engineering. Prentice Hall, 1994.
[99] T. Söderström. Discrete-time Stochastic Systems: Estimation and
Control. Advanced textbooks in control and signal processing. Springer,
2002.
[100] T. Söderström and P. Stoica. System identification. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1988.
[101] Torsten Söderström and Petre Stoica. System Identification.
Prentice-Hall, Hemel Hempstead, UK, 1989.
[102] Alan Stuart and Keith J. Ord. Kendall’s advanced theory of statistics.
Oxford University Press, New York, 5th edition, 1987.
[103] Bo Tang, Pingyuan Cui, and Yangzhou Chen. A parallel processing
Kalman filter for spacecraft vehicle parameters estimation. In
Communications and Information Technology, IEEE International
Symposium on, volume 2, pages 1476 – 1479, Oct. 2005.
[104] Michael Tarter and Richard Kronmal. On multivariate density
estimates based on orthogonal expansions. The Annals of Mathematical
Statistics, 41(2):pp. 718–722, 1970.
[105] J.R. Thompson and P.R.A. Tapia. Non Parametric Function
Estimation, Modeling & Simulation. Miscellaneous Bks. Society for
Industrial and Applied Mathematics (SIAM, 3600 Market Street, Floor
6, Philadelphia, PA 19104), 1990.
[106] O. Tokhi, M.A. Hossain, and H. Shaheed. Parallel Computing for
Real-Time Signal Processing and Control. Advanced Textbooks in
Control and Signal Processing Series. Springer Verlag, 2003.
[107] R. Trobec, M. Vajteršic, and P. Zinterhof. Parallel Computing:
Numerics, Applications, and Trends. Springer London, 2009.
[108] H.L. Van Trees. Detection, Estimation, and Modulation Theory.
Number del 1 in Detection, Estimation, and Modulation Theory. Wiley,
2004.
[109] Anders Vretblad. Fourier Analysis and Its Applications (Graduate Texts
in Mathematics). Springer, November 2010.
[110] G. Wahba. Optimal Convergence Properties of Variable Knot, Kernel,
and Orthogonal Series Methods for Density Estimation. Defense
Technical Information Center, 1972.
[111] Fredrik Wahlberg, Alexander Medvedev, and Olov Rosén. A
LEGO-based mobile robotic platform for evaluation of parallel control
and estimation algorithms. In Decision and Control and European
Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages
4548–4553. IEEE, 2011.
[112] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary
Time Series with Engineering Applications. Technology Press and John
Wiley & Sons, Inc., New York, 1949.
[113] T. Wigren. Fast converging and low complexity adaptive filtering using
an averaged Kalman filter. Signal Processing, IEEE Transactions on,
46(2):515 –518, Feb. 1998.
[114] T. Wigren. Soft uplink load estimation in WCDMA. Vehicular
Technology, IEEE Transactions on, 58(2):760 –772, feb. 2009.
[115] T. Wigren. Recursive Noise Floor Estimation in WCDMA. Vehicular
Technology, IEEE Transactions on, 59(5):2615 –2620, jun 2010.
[116] T. Wigren. WCDMA uplink load estimation with generalized rake
receivers. Vehicular Technology, IEEE Transactions on, 61(5):2394
–2400, jun 2012.
[117] D. Willner, C. B. Chang, and K. P. Dunn. Kalman filter algorithms for
a multi-sensor system. In Decision and Control including the 15th
Symposium on Adaptive Processes, 1976 IEEE Conference on,
volume 15, pages 570 –574, Dec. 1976.
[118] N.E. Wu. Fault Detection, Supervision and Safety of Technical Processes
2003 (SAFEPROCESS 2003): A Proceedings Volume from the 5th
IFAC Symposium, Washington, D.C., USA, 9-11 June 2003. Elsevier.
[119] G. Xingyu, Z. Zhang, S. Grant, T. Wigren, N. Johansson, and
A. Kangas. Load control for multistage interference cancellation. to
appear at PIMRC 2012, Sydney, Australia, sep. 2012.
[120] Zh. Zhusubaliyev, A. Medvedev, and M. M. Silva. Bifurcation analysis
of PID controlled neuromuscular blockade in closed-loop anesthesia.
Journal of Process Control, 25:152–163, January 2015.
[121] Zhanybai Zhusubaliyev, Alexander V. Medvedev, and Margarida M.
Silva. Bifurcation analysis for PID-controller tuning based on a minimal
neuromuscular blockade model in closed-loop anesthesia (I). In Decision
and Control, 2013 IEEE 52nd Annual Conference on, pages 115–120,
2013.
[122] A. Zolghadri, B. Bergeon, and M. Monsion. A two-ellipsoid overlap test
for on-line failure detection. Automatica, 29(6):1517 – 1522, 1993.
Summary in Swedish (Svensk sammanfattning)
This thesis studies stochastic estimation, with special focus on parallelization intended for so-called multicore computers. The main part of the work treats the stochastic problem of recursive optimal filtering and how different solution methods for it can be implemented in parallel. Solving the recursive optimal estimation problem is computationally very demanding, especially for nonlinear non-Gaussian systems and for systems of high order. Since the computational capacity of hardware today is increased mainly by connecting several CPUs in parallel, parallelization of algorithms is the most effective way to shorten execution times so that real-time performance can be achieved. In this work, several well-known methods, such as the Kalman filter, the extended Kalman filter, the unscented Kalman filter, the particle filter, and the point mass filter, have been parallelized. Linear speedup in the number of CPUs used has been achieved over ranges of problem sizes for all of these filtering methods. Two new solution methods for optimal filtering have also been developed. They are based on series expansions in orthogonal basis functions and are very well suited for parallelization, since the computations can easily be divided into relatively independent pieces and only a small amount of communication between the pieces is required.
Optimal filtering has a broad range of applications and can be applied in seemingly completely different areas. In this work, the parallel filtering methods have been evaluated on a number of applications, such as target tracking, load estimation in mobile networks, dosing of anesthetics, and echo cancellation in communication networks.
A multicore computer is a computer whose processor has two or more separate cores (processors). Since such a processor contains several separate, parallel processors, it can perform parallel computations and can thereby achieve a higher computational power than an ordinary single-core processor. Short summaries of the material treated in the respective chapters follow below.
Chapter 2
This chapter presents a parallelization of the Kalman filter. A parallelization method for multiple-input single-output systems is presented. The method is based on the transition matrix of the system having a banded structure. It is discussed how different systems, both time-varying and time-invariant, can be realized in such a banded-matrix form. The parallelization method is then extended to multiple-input multiple-output systems by exploiting sequential filtering of the measurement vector. The proposed parallelization is evaluated on a load estimation problem for mobile networks and is compared against a BLAS implementation. It performs significantly better than the BLAS implementation and achieves linear speedup in the number of cores used for up to 8 cores, to be compared with a maximum speedup of 2 times for the BLAS implementation.
Chapter 3
In this chapter a special case of the material in Chapter 2 is studied: the Kalman filter used for parameter estimation. For this special case a particularly efficient parallelization can be made, which is discussed in the chapter. Implementation details that optimize the execution times are also treated in more depth than in Chapter 2.
Chapter 4
Parallelization of the particle filter is studied. Four different parallelizations, the globally distributed particle filter, the resampling with proportional allocation particle filter, the resampling with non-proportional allocation particle filter, and the Gaussian particle filter, are implemented in parallel and evaluated on a multicore computer with 8 cores. The results show that the Gaussian particle filter and the resampling with proportional allocation particle filter are best suited for parallel implementation on multicore computers, with linear speedup of up to 8 times being achieved.
Chapter 5
A new solution method for the recursive Bayesian estimation problem is presented. The density functions involved are approximated by truncated series expansions in orthogonal bases. The coefficients of the series expansions are computed and propagated through the prediction and update steps of the solution to the recursive Bayesian estimation problem. The method has exceptionally good parallelization properties but has the drawback that the state must lie in a region that is bounded in advance. An analysis of the developed method is also carried out, studying above all how the estimation error is affected by the truncation of the series expansions and how this truncation error propagates between the filter iterations.
Chapter 6
In this chapter a new method for parallelizing the particle filter is developed. The method is based on fitting a series expansion to the particle set at the resampling step. This allows the information in the particle set to be compressed into a few coefficients that can then be efficiently communicated between the processor cores. An analysis of how well the series expansion captures the underlying density function is performed. An upper bound on the magnitude of the coefficients when Hermite basis functions are used is also derived.
Chapter 7
A new method for anomaly detection for systems that follow trajectories in the state space is presented and discussed. The method is based on estimating, from a set of observed trajectories of the system, density functions that describe the probability of finding the system in a certain state. Based on these density functions, tests are performed to determine how much the state of the system deviates from the normal. The method is evaluated on track data from cargo ships, as well as on eye-tracking data from test patients with and without Parkinson's disease.
Chapter 8
State estimation for a minimally parameterized PK/PD model of the effect of an anesthetic drug is performed. Three nonlinear estimation methods, the extended Kalman filter, the particle filter, and the filtering method described in Chapter 7, are implemented for this problem and compared in terms of estimation quality. It is shown that the extended Kalman filter, which is the method previously applied to this problem, gives biased estimates of the parameters, whereas the particle-based methods give unbiased estimates. Since the model is estimated to provide a basis for the control of anesthetic drug dosing during surgery, it is of great importance that the estimates are as good as possible.
Chapter 9
A short presentation of the results from BLAS-based implementations of the UKF and the point mass filter is given.