ACTA UNIVERSITATIS UPSALIENSIS
Uppsala Dissertations from the Faculty of Science and Technology 111

Parallel Stochastic Estimation on Multicore Platforms

Olov Rosén

Dissertation presented at Uppsala University to be publicly examined in ITC 2347, Lägerhyddsvägen 2, Uppsala, Tuesday, 12 May 2015 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Petar Djuric (Stony Brook University, New York, USA).

Abstract
Rosén, O. 2015. Parallel Stochastic Estimation on Multicore Platforms. Uppsala Dissertations from the Faculty of Science and Technology 111. xiv+191 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9191-8.

The main part of this thesis concerns parallelization of recursive Bayesian estimation methods, both linear and nonlinear. Recursive estimation deals with the problem of extracting information about parameters or states of a dynamical system, given noisy measurements of the system output, and plays a central role in signal processing, system identification, and automatic control. Solving the recursive Bayesian estimation problem is known to be computationally expensive, which often makes the methods infeasible in real-time applications and problems of large dimension. As the computational power of hardware is today increased by adding more processors on a single chip, rather than by increasing the clock frequency and shrinking the logic circuits, parallelization is one of the most powerful ways of improving the execution time of an algorithm. It has been found in the work of this thesis that several of the optimal filtering methods are suitable for parallel implementation in certain ranges of problem sizes. For many of the suggested parallelizations, a linear speedup in the number of cores has been achieved, providing up to 8 times speedup on a double quad-core computer. As the evolution of parallel computer architectures is unfolding rapidly, many more processors on the same chip will soon become available. The developed methods do not, of course, scale infinitely, but can definitely exploit and harness some of the computational power of the next generation of parallel platforms, allowing for optimal state estimation in real-time applications.

Keywords: Recursive estimation, Parallelization, Bayesian estimation, Anomaly detection

Olov Rosén, Department of Information Technology, Division of Systems and Control, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Olov Rosén 2015
ISSN 1104-2516
ISBN 978-91-554-9191-8
urn:nbn:se:uu:diva-246859 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-246859)

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Olov Rosén and Alexander Medvedev. Efficient parallel implementation of state estimation algorithms on multicore platforms. Control Systems Technology, IEEE Transactions on, 21(1):107–120, 2013

II Olov Rosén, Alexander Medvedev, and Torbjörn Wigren. Parallelization of the Kalman filter on multicore computational platforms. Control Engineering Practice, 21(9):1188–1194, 2013

III Olov Rosén, Alexander Medvedev, and Mats Ekman. Speedup and tracking accuracy evaluation of parallel particle filter algorithms implemented on a multicore architecture. In Control Applications (CCA), 2010 IEEE International Conference on, pages 440–445. IEEE, 2010
IV Olov Rosén and Alexander Medvedev. Efficient parallel implementation of a Kalman filter for single output systems on multicore computational platforms. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 3178–3183. IEEE, 2011

V Olov Rosén and Alexander Medvedev. Parallelization of the Kalman filter for banded systems on multicore computational platforms. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pages 2022–2027, 2012

VI Olov Rosén and Alexander Medvedev. An on-line algorithm for anomaly detection in trajectory data. In American Control Conference (ACC), 2012, pages 1117–1122. IEEE, 2012

VII Olov Rosén and Alexander Medvedev. Parallel recursive estimation, based on orthogonal series expansions. In American Control Conference (ACC), 2014, pages 622–627, June 2014

VIII Olov Rosén and Alexander Medvedev. The recursive Bayesian estimation problem via orthogonal expansions: an error bound. IFAC WC, Aug, 2014

IX Daniel Jansson, Alexander Medvedev, and Olov Rosén. Parametric and non-parametric analysis of eye-tracking data by anomaly detection. IEEE Transactions on Control Systems Technology, 2014

X Olov Rosén, Alexander Medvedev, and Daniel Jansson. Non-parametric anomaly detection in trajectorial data. Submitted to a journal, 2014

XI Olov Rosén and Alexander Medvedev. Orthogonal basis particle filtering: an approach to parallelization of recursive estimation. Submitted to a journal, 2015

XII Olov Rosén and Alexander Medvedev. Parallel recursive estimation using Monte Carlo and orthogonal series expansions. In American Control Conference, Palmer House Hilton, Chicago, IL, USA, 2015

XIII Olov Rosén, Margarida M Silva, and Alexander Medvedev. Nonlinear estimation of a parsimonious Wiener model for the neuromuscular blockade in closed-loop anesthesia. In Proc. 19th IFAC World Congress, pages 9258–9264. International Federation of Automatic Control, 2014

XIV Olov Rosén and Alexander Medvedev. Nonlinear identification of individualized drug effect models in neuromuscular blockade. Submitted to a journal, 2015

XV Daniel Jansson, Olov Rosén, and Alexander Medvedev. Non-parametric analysis of eye-tracking data by anomaly detection. In Control Conference (ECC), 2013 European, pages 632–637. IEEE, 2013

Reprints were made with permission from the publishers.

The following paper has also been published by the author but does not contain material that is published in this thesis.

• Fredrik Wahlberg, Alexander Medvedev, and Olov Rosén. A LEGO-based mobile robotic platform for evaluation of parallel control and estimation algorithms. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 4548–4553. IEEE, 2011

Acknowledgment

I would like to thank my supervisor Professor Alexander Medvedev for his support and guidance throughout this work. I am grateful for the degree of freedom that you have given me in my research, and for always being encouraging about my ideas and providing helpful feedback for improving them. I would also like to thank all my colleagues at SysCon for providing such a pleasant working atmosphere. It has been great to be part of the SysCon group over these years, where everyone in the group has contributed, each in their own way, to making the working day enjoyable both socially and professionally. I would like to thank my family for their encouragement. A special thanks goes to Linnéa for her love and support during the work of this thesis.
While not being particularly interested in parallel stochastic estimation, you have been a great partner in the other parts of life, which is at least as important as giving feedback about equations during the work on a thesis. My co-authors also deserve thanks for their contributions: Mats Ekman for the collaboration on parallel particle filtering, Daniel Jansson for the eye movement data used for evaluation of the anomaly detection method, Margarida Silva for the collaboration on parameter estimation in a PK/PD model for anesthesia, and Torbjörn Wigren for the data and model for the WCDMA application used for evaluation of the parallel Kalman filter, thank you!

The thesis covers research within the project "Computationally Demanding Real-Time Applications on Multicore Platforms" funded by the Swedish Foundation for Strategic Research, whose financial support is greatly appreciated.

Contents

1 Background
  1.1 Introduction
  1.2 Notation and Nomenclature
    1.2.1 Sub-indexing
    1.2.2 Multi-indexing
  1.3 Series expansions
    1.3.1 Orthogonal functions
    1.3.2 Examples of orthogonal basis functions
    1.3.3 Multivariate orthogonal series
    1.3.4 Shifting and scaling
  1.4 Some probability theory
    1.4.1 Definition of a random variable
    1.4.2 The distribution function and some associated measures
    1.4.3 Bayesian statistics
    1.4.4 Estimation from random samples
    1.4.5 Confidence regions and outliers
  1.5 Recursive Bayesian estimation
    1.5.1 Optimal estimation
    1.5.2 The prediction-update recursion
  1.6 Solution methods for the optimal filtering problem
    1.6.1 Kalman filter
    1.6.2 Static Kalman filter
    1.6.3 Extended Kalman filter
    1.6.4 Unscented Kalman filter
    1.6.5 Monte-Carlo methods
    1.6.6 Grid-based methods
    1.6.7 Computational complexity
  1.7 High-performance computing
    1.7.1 Efficient memory handling
    1.7.2 Hardware mechanisms for efficient code execution
    1.7.3 Some further examples
    1.7.4 Software libraries
  1.8 Multicore architecture
    1.8.1 Evolution of the multicore processor
    1.8.2 Parallel architectures
  1.9 Parallel implementation
    1.9.1 Parallel implementation
    1.9.2 Software
    1.9.3 Performance measures
    1.9.4 Efficient parallelization
    1.9.5 Using Shared Resources
    1.9.6 Data partitioning
  1.10 Short chapter summaries
2 Parallelization of the Kalman Filter
  2.1 Introduction
  2.2 State space model and filtering equations
    2.2.1 State space system description
    2.2.2 Kalman filter equations
  2.3 Banded systems
    2.3.1 Transformation to a banded system form
    2.3.2 Time-invariant case
    2.3.3 Time-varying case
  2.4 MISO System
    2.4.1 Efficient sequential implementation
    2.4.2 Parallel implementation
    2.4.3 Analysis
  2.5 MIMO System
  2.6 Implementation example
    2.6.1 Uplink interference power estimation model
    2.6.2 Results
  2.7 Discussion
  2.8 Static Kalman filter
  2.9 Extended Kalman filter
  2.10 Conclusions
3 Parallel implementation of the Kalman filter as a parameter estimator
  3.1 Introduction
    3.1.1 System model and the Kalman filter
  3.2 Implementation
    3.2.1 Straightforward implementation
    3.2.2 Reordering of the equations for efficient memory utilization
    3.2.3 Utilizing the symmetry of P
    3.2.4 Parallel implementation
  3.3 Analysis of Algorithm 13
    3.3.1 Sequential and parallel work
    3.3.2 Communication and synchronization
    3.3.3 Memory bandwidth
    3.3.4 Cache miss handling
  3.4 Results
    3.4.1 Execution time and speedup
    3.4.2 Memory Bandwidth
  3.5 Discussion
    3.5.1 Conclusions
4 Parallel implementation of the particle filter
  4.1 The particle filter
  4.2 Parallel algorithms
  4.3 Performance evaluation
  4.4 Discussion
  4.5 Conclusions for parallel implementation of the particle filter
5 Solving the RBE via orthogonal series expansions
  5.1 Introduction
  5.2 Solving the RBE via series expansions
    5.2.1 Mean and Covariance
    5.2.2 Truncation
    5.2.3 Computational complexity
  5.3 Parallel implementation
    5.3.1 Analysis
  5.4 Numerical Experiments
    5.4.1 The system
    5.4.2 Solution using Fourier basis functions
    5.4.3 Execution time and speedup
  5.5 Discussion
    5.5.1 Estimation accuracy
    5.5.2 Speedup
    5.5.3 Limitations
  5.6 An error bound
    5.6.1 Numerical experiments
    5.6.2 Discussion
6 Orthogonal basis PF
  6.1 Introduction
  6.2 Background
    6.2.1 The PF algorithm with importance sampling
    6.2.2 Hermite functions basis
  6.3 The Hermitian Particle Filter
  6.4 Parallelization
    6.4.1 Parallelization properties analysis
  6.5 Analysis
  6.6 Computational Experiments
    6.6.1 System model
    6.6.2 Estimation accuracy
    6.6.3 Execution time and speedup
  6.A Appendix for Chapter 6
  6.B Proof of Theorem 4
  6.C Proof of Theorem 3
  6.D Derivation of Eq. (6.16)
7 Anomaly detection
  7.1 Introduction
  7.2 Notation
  7.3 The anomaly detection method
    7.3.1 Context function and reference trajectory
    7.3.2 Probability density function
    7.3.3 Outlier detection
    7.3.4 Anomaly detection method recapitulated
  7.4 Experimental results
    7.4.1 Vessel traffic
    7.4.2 Eye-tracking
  7.5 Limitations
  7.6 Conclusions
  7.A Appendix for Chapter 7
    7.A.1 Evaluation of Eq. (1.20)
    7.A.2 Proof of f̂_X(τ, x) being a PDF
8 Application to parameter estimation in PK/PD model
  8.1 Introduction
  8.2 Parsimonious Wiener Model
  8.3 Estimation algorithms
    8.3.1 Filter tuning
  8.4 Data sets and performance evaluation metrics
    8.4.1 Synthetic Data
    8.4.2 Real data
  8.5 Results
    8.5.1 Synthetic data
    8.5.2 Real data
  8.6 Conclusions
9 BLAS based parallelizations of UKF and point mass filter
  9.1 UKF
  9.2 Point mass filter
References
Svensk sammanfattning
Chapter 1
Background

1.1 Introduction

The main part of this thesis is about parallelization of discrete time recursive estimation and filtering methods, both linear and nonlinear. Recursive estimation deals with the problem of extracting information about parameters or states of a dynamical system, given noisy measurements of the system output. It plays a central role in signal processing, system identification and automatic control.

Signal filters were originally seen as circuits or systems with frequency-selecting behaviors, and the most typical area of implementation was in radio transmission and receiving equipment. The development of filtering techniques went on and more sophisticated filters were introduced, such as e.g. Chebyshev and Butterworth filters, which gave means of shaping the frequency characteristics of the filter in a more systematic design procedure. During this stage, filtering was mainly considered from this frequency-domain point of view.
By the introduction of the Wiener-Kolmogorov filter [112], [54], statistical ideas were incorporated into the field of filtering, and statistical properties of the signal, rather than the frequency content, were utilized to select what to filter out. The idea of the Wiener-Kolmogorov filter is to minimize the mean square error between the estimated signal and the true signal. An optimality criterion was thus introduced, and it became possible to state whether a filter was optimal in some specific sense. Further steps in the development of filters were taken by Rudolf Emil Kalman through the introduction of the famous Kalman filter that, in contrast to the Wiener and Kolmogorov filters, applies to nonstationary processes. A conceptual difference is that the Kalman filter is based on the state-space model framework, rather than the polynomial formalism adopted in the Wiener and Kolmogorov filters. Working in the time domain with state space models, the term "filter" can seem somewhat misleading, and the name "observer" or "state estimator" is perhaps more naturally connected to the problem formulations. The term "filter" has nevertheless been kept and is still widely used.

The Wiener, Kolmogorov, and Kalman filters all assume that the underlying system is linear. Nonlinear non-Gaussian filtering methods do exist, however, that can be applied to general systems with arbitrary noise distributions. Historically, nonlinear filtering has been a relatively sparsely researched area, a main reason being the computational burden required to compute the estimates due to the lack of a closed-form expression for the solution. As an example, the early developments of the nonlinear particle filter were proposed already in the 1950's, under the name 'Poor Man's Monte Carlo' by Hammersley [37], and in the 1970's further improvements of the method were made in the control community. However, because of the practical limitations of the method due to its computational cost, it was more or less abandoned and forgotten until 1993, when Gordon [32] published the formulation of the method that is commonly used today. Even then there was skepticism towards the method, and its usability was confined to a small set of applications that could bear the execution times associated with the filter computations. It was not until the beginning of the 2000s that the capacity of computational hardware had improved enough to let the methods be used in a wider range of applications, and the interest in nonlinear filtering grew rapidly. With the development of parallel hardware, the computational capacity of computers has started increasing at an even faster rate, and the development of parallel versions of the filtering algorithms continues to broaden the application range of nonlinear filters. The situation is the same for other nonlinear non-Gaussian filtering techniques, and even for linear filters of large dimension.

In real-time applications, it has been common that suboptimal filtering methods are employed because the optimal solution simply requires too much computation to be practically feasible. For example, recursive least squares and least mean squares methods are often used for linear filtering, instead of the optimal Kalman filter, which can provide both faster convergence rates and better mean square error.
For nonlinear systems, the (suboptimal) extended Kalman filter is a usual choice, even though other nonlinear methods such as the unscented Kalman filter, grid-based methods and simulation-based methods can provide superior estimation accuracy. With the computational power offered by parallel hardware, new doors are opening for the application of computationally costly optimal methods.

As mentioned, the computational capacity of hardware is no longer growing by shrinking the logic circuits and increasing the processor operating frequency, but rather by adding more processors on a single chip. This is due to physical limitations as well as power and heat dissipation concerns. All major manufacturers have turned from a single-core design to a multicore design, and parallel processing is no longer the exclusive domain of supercomputers or clusters. Any computer bought today is likely to have two or more cores, and the number of cores available on a single chip increases steadily.

To utilize the computational power provided by parallel hardware, algorithms must be implemented in a way that suits the parallel architecture. Occasionally, the implementation is rather straightforward, but in many cases the algorithm must, at some point, be modified to yield better parallelization properties, where the modification often comes with a decreased accuracy, sacrificed in favor of faster execution time. As parameter and state estimation constitute a key part of automatic control, system identification and signal processing, estimation quality can be of utmost importance for the performance of the system. This motivates the interest in designing implementations of recursive estimation methods that can be executed on a parallel architecture and provide real-time feasibility, without significant loss of accuracy. Another aspect of parallelization, important for e.g. mobile devices powered by batteries and low-power communication systems, is the substantially lower power-per-FLOP ratio of parallel processors, compared to sequential processors.

The main part of this thesis deals with the parallelization of recursive Bayesian estimation problems in discrete time. Another problem, by its Bayesian nature related to the estimation problem, is anomaly detection, to which a smaller part of the thesis is devoted. Anomaly detection refers to finding patterns in a given data set that do not conform to a properly defined normal behavior.

1.2 Notation and Nomenclature

Symbols

A             Matrices are written in bold upper case letters.
x             Vectors are written in bold lower case letters.
A^T, a^T      Transpose of a matrix or vector.
X             Stochastic variable.
det(A)        Determinant of A.
tr(A)         Trace of A.
p_X(x)        Probability density function for stochastic variable X evaluated at x.
p_{X,Y}(x,y)  Joint density function for random variables X and Y. When there is no risk for confusion this is written simply as p(x, y).
p_{X|Y}(x|y)  Conditional density function for p(x, y) given Y = y.
P_X(x)        The cumulative density function for X.
Pr(A)         Probability of a random event A.
f(x)          Vector-valued function.
R^n           n-dimensional space of real numbers.
N^n           n-dimensional space of natural numbers.
R^{n×m}       n × m-dimensional space of real numbers.
m:n           Set of numbers {m, m+1, ..., n}, m, n ∈ N and m ≤ n.
N(μ, Σ)       Normal distribution with mean μ and covariance Σ.
γ(x; μ, Σ)    Probability density function for the normal distribution N(μ, Σ).
x_{m:n}       The ordered set {x_m, x_{m+1}, ..., x_n}.
L_p(Ω)        The space of functions for which the p-th power of the absolute value is Lebesgue integrable over the domain Ω.

For PDFs and cumulative density functions: when there is no risk of confusion, the subscript will be dropped and p_X(x) will be written simply as p(x).

Abbreviations

KF      Kalman filter.
EKF     Extended Kalman filter.
UKF     Unscented Kalman filter.
PF      Particle filter.
PDF     Probability density function.
CPU     Central processing unit.
SMC     Shared memory multicore.
MIMO    Multiple input multiple output.
SISO    Single input single output.
MISO    Multiple input single output.
FLOP    Floating point operation.
FLOPS   Floating point operations per second.
MISE    Mean integrated square error.
AMISE   Asymptotic MISE.
BLAS    Basic Linear Algebra Subprograms.
RBE     Recursive Bayesian estimation.

1.2.1 Sub-indexing

Let A denote a matrix of size m × n. The submatrix that lies in the rows of α ⊆ {1, ..., m} and columns of β ⊆ {1, ..., n} is denoted A(α, β). For example, if α = {1, 2}, β = {1, 3} and

A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix},

then

A(α, β) = A({1, 2}, {1, 3}) = \begin{bmatrix} a_{11} & a_{13} \\ a_{21} & a_{23} \end{bmatrix}.

The submatrix that consists of all rows and the columns β is denoted A(:, β). Furthermore, 1:n := {1, 2, ..., n}. When indexing is out of range, the result is defined as zero, e.g. A(−1, 1) = A(1, 4) = 0. This is to avoid complicated notation for handling indices near the edges of the matrix. In an implementation, the matrix can simply be padded with a frame of zeros.

1.2.2 Multi-indexing

Multi-indexing is used to simplify the notation of multivariate expressions and generalizes the concept of a scalar index to an ordered tuple of indices. A D-dimensional multi-index is a D-tuple α = (α_1, α_2, ..., α_D), i.e. an element of the D-dimensional set of natural numbers N^D. Let n = (n_1, n_2, ..., n_D), m = (m_1, m_2, ..., m_D) denote two D-dimensional multi-indices, and x = [x_1  x_2  ...  x_D]^T ∈ R^D be a D-dimensional vector. Multi-index sum, product, power and partial derivative are interpreted in the following way:

n + m = (n_1 + m_1, n_2 + m_2, ..., n_D + m_D),
n \cdot x = n_1 x_1 + n_2 x_2 + ... + n_D x_D,
x^n = x_1^{n_1} x_2^{n_2} \cdots x_D^{n_D},
\partial^n = \frac{\partial^{n_1}}{\partial x_1^{n_1}} \frac{\partial^{n_2}}{\partial x_2^{n_2}} \cdots \frac{\partial^{n_D}}{\partial x_D^{n_D}}.

Two multi-indices are equal if all their elements are equal, i.e. n = k if and only if n_1 = k_1, n_2 = k_2, ..., n_D = k_D.
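To make the zero-padding convention concrete, the following minimal Python sketch (an illustration added here, not code from the thesis; the helper name submatrix is hypothetical) extracts A(α, β) with 1-based index sets and returns zero for out-of-range indices.

```python
import numpy as np

def submatrix(A, alpha, beta):
    """Return A(alpha, beta) for 1-based index sets; out-of-range entries are zero."""
    m, n = A.shape
    S = np.zeros((len(alpha), len(beta)))
    for i, r in enumerate(alpha):
        for j, c in enumerate(beta):
            if 1 <= r <= m and 1 <= c <= n:
                S[i, j] = A[r - 1, c - 1]  # shift to 0-based storage
    return S

A = np.arange(1, 10).reshape(3, 3)
print(submatrix(A, [1, 2], [1, 3]))   # rows {1,2}, columns {1,3}
print(submatrix(A, [-1, 1], [1, 4]))  # indices outside the matrix give zeros
```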
1.3 Series expansions

Series expansions have been utilized in several of the constructed methods for PDF estimation and, in particular, orthogonal series expansions have been used. The literature covering the theory of this topic is vast, with its foundations in functional analysis and special functions. Here only a brief presentation of the facts that are relevant to this particular work is given; see e.g. [59], [24] for a more thorough exposition.

In mathematics, a series expansion is a way of representing a function that cannot be expressed in terms of elementary operations (addition, subtraction, multiplication and division) using a series of other functions with known properties. The series representation is in general infinite, but can be truncated to give an approximation to the function with a guaranteed accuracy.

Suppose a function f(x) is given and it is sought to approximate it by a series over the domain Ω, so that the integrated square error is minimized, using a set of other functions {φ_k(x)}_{k=0}^{K}, i.e.

f(x) ≈ \hat{f}(x) = \sum_{k=0}^{K} c_k \phi_k(x),    (1.1)

where c_k are the weights or coefficients. Making a least-squares fit, the coefficients can be found by minimizing the integrated square error loss function

Q = \int_{\Omega} \Big[ f(x) - \sum_{k=0}^{K} c_k \phi_k(x) \Big]^2 dx.

Differentiating the loss function w.r.t. c_n and setting the derivative to zero to find the extremum gives the set of equations

\frac{\partial Q}{\partial c_n} = -2 \int_{\Omega} \phi_n(x) \Big[ f(x) - \sum_{k=0}^{K} c_k \phi_k(x) \Big] dx = 0 \quad \Leftrightarrow \quad \sum_{k=0}^{K} c_k \int_{\Omega} \phi_k(x) \phi_n(x) dx = \int_{\Omega} \phi_n(x) f(x) dx, \quad n = 0, 1, 2, ..., K.

The extremum can be shown to be a minimum by evaluating the second derivative w.r.t. the coefficients. Denoting

a_{nk} = \int_{\Omega} \phi_n(x) \phi_k(x) dx, \qquad b_n = \int_{\Omega} \phi_n(x) f(x) dx,

this can be written in matrix form as

\begin{bmatrix} a_{00} & a_{01} & \cdots & a_{0K} \\ a_{10} & a_{11} & & \vdots \\ \vdots & & \ddots & \\ a_{K0} & \cdots & & a_{KK} \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_K \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_K \end{bmatrix},    (1.2)

or compactly Ac = b. This algebraic system has a unique solution, provided that A is nonsingular,

c = A^{-1} b,    (1.3)

which gives the coefficients that solve the least-squares fitting problem.

1.3.1 Orthogonal functions

A sequence of functions φ_0(x), φ_1(x), ... is said to be orthogonal over the domain Ω if

\int_{\Omega} \phi_n(x) \phi_m(x) dx = \begin{cases} 0, & n \neq m \\ q_n, & n = m \end{cases}.    (1.4)

Further, if q_n = 1, n = 0, 1, 2, ..., the functions are said to be orthonormal. Owing to the orthogonality of the functions, series expansions in this particular class of functions have some beneficial properties. For instance, A in (1.2) becomes diagonal or, in the case of orthonormal functions, even an identity matrix, and the solution of the least-squares fitting problem (i.e. (1.3)) is given by

c_k = q_k^{-1} \int_{\Omega} \phi_k(x) f(x) dx, \quad k = 0, 1, 2, ..., K.

The cross-couplings between the coefficients vanish, and the coefficients can hence be estimated independently of each other. This property is of particular interest for parallelization, where mutual independence is a keyword to look for, since it typically provides a good basis for partitioning the workload into independent segments.

To avoid getting into peculiar mathematics, the approximation in (1.1) is given for a truncated expansion. However, it is fully possible to let K → ∞, in which case it can be shown that, for a continuous f(x) ∈ L_2(Ω), the series converges to the function itself, i.e.

f(x) = \sum_{k=0}^{\infty} c_k \phi_k(x).

There are some other useful properties of orthogonal series expansions. For an orthonormal basis,

Q = \int_{-\infty}^{\infty} \Big[ \sum_{k=k_1}^{k_2} c_k \phi_k(x) \Big]^2 dx = \sum_{k=k_1}^{k_2} \sum_{n=k_1}^{k_2} c_k c_n \int_{-\infty}^{\infty} \phi_k(x) \phi_n(x) dx = \sum_{k=k_1}^{k_2} c_k^2.

From this it follows that

\int_{-\infty}^{\infty} f(x)^2 dx = \int_{-\infty}^{\infty} \Big[ \sum_{k=0}^{\infty} c_k \phi_k(x) \Big]^2 dx = \sum_{k=0}^{\infty} c_k^2,    (1.5)

a result known as Parseval's identity. It also implies the following equality for the truncation error e(x) = f(x) − \hat{f}(x):

\int_{-\infty}^{\infty} e(x)^2 dx = \int_{-\infty}^{\infty} \Big[ \sum_{k=K+1}^{\infty} c_k \phi_k(x) \Big]^2 dx = \sum_{k=K+1}^{\infty} c_k^2.    (1.6)

Another implication of the orthogonality of the basis functions is that the truncation error is orthogonal to the truncated expansion, i.e.

\int_{-\infty}^{\infty} \hat{f}(x) e(x) dx = 0.    (1.7)
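As a numerical illustration of (1.1)-(1.3) (a sketch under the stated setup, not code from the thesis), the snippet below fits a truncated Legendre expansion to f(x) = e^x on Ω = [-1, 1] by forming A and b with simple trapezoidal quadrature and solving Ac = b. Because the Legendre polynomials are orthogonal, A is (numerically) diagonal and the solution coincides with the direct projections c_k = b_k / q_k.

```python
import numpy as np
from numpy.polynomial import legendre

f = np.exp
K = 5                                          # truncation order
x = np.linspace(-1.0, 1.0, 4001)               # quadrature grid on Omega = [-1, 1]
w = np.full(x.size, x[1] - x[0])               # trapezoidal quadrature weights
w[0] *= 0.5; w[-1] *= 0.5
phi = [legendre.legval(x, np.eye(K + 1)[k]) for k in range(K + 1)]  # P_0, ..., P_K

# Normal equations A c = b of the least-squares fit, cf. (1.2)
A = np.array([[np.sum(w * phi[n] * phi[k]) for k in range(K + 1)]
              for n in range(K + 1)])
b = np.array([np.sum(w * phi[n] * f(x)) for n in range(K + 1)])
c = np.linalg.solve(A, b)                      # coefficients, cf. (1.3)

# A is essentially diagonal with q_k = 2/(2k+1), so the projection agrees with c
print(np.allclose(c, b / np.diag(A), atol=1e-4))
f_hat = sum(ck * pk for ck, pk in zip(c, phi))
print(np.max(np.abs(f_hat - f(x))))            # small truncation error
```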
1.3.2 Examples of orthogonal basis functions

There is an infinite set of functions that form an orthogonal basis on a given domain. Which set of basis functions is suitable to use for the approximation depends on the underlying function being approximated. One should seek to pick a set of basis functions that gives as good an approximation as possible with a low truncation order. Here some examples of commonly used basis functions are given.

Hermite basis functions

The Hermite functions constitute an orthonormal basis of L_2(R). There are two "versions" of the Hermite functions, the probabilist's and the physicist's, which are simply scaled versions of each other. The probabilist's formulation is commonly used in probability theory, and the physicist's formulation is employed mainly in quantum mechanics when working with the Schrödinger equation. In this thesis, the probabilist's Hermite functions are used for PDF estimation and are defined by

\phi_k(x) = \frac{(-1)^k}{\sqrt{2^k k! \sqrt{\pi}}}\, e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2}, \quad k \in N_0,

or recursively as

\phi_0(x) = \pi^{-1/4} e^{-x^2/2},
\phi_1(x) = \sqrt{2}\, x\, \phi_0(x),
\phi_k(x) = x \sqrt{\tfrac{2}{k}}\, \phi_{k-1}(x) - \sqrt{\tfrac{k-1}{k}}\, \phi_{k-2}(x), \quad k = 2, 3, ...

The first five Hermite functions are plotted in Fig. 1.1.

Figure 1.1. The first five Hermite functions.

The k-th Hermite function is of the form e^{-x^2/2} p_k(x), where p_k(x) is a k-th order polynomial. It can be noted that the first basis function φ_0(x) is a scaled Gaussian bell function, and the factor e^{-x^2/2} gives the functions rapidly decaying tails. As PDFs often have the characteristics of a Gaussian bell and have rapidly decaying tails, the Hermite basis functions in many cases present a suitable basis for PDF approximation.

Fourier basis functions

The Fourier functions constitute an orthogonal basis of L_2([-π, π]). The real-valued Fourier basis functions are cosines and sines of different frequencies. The complex-valued Fourier functions, however, are often much more convenient and compact to work with, and are what will be used in this thesis. The complex Fourier basis functions are given by

\phi_k(x) = e^{ikx}, \quad k \in Z,

where i = \sqrt{-1} is the imaginary unit. Even though they are complex-valued, they can be used to approximate real-valued functions. The coefficients then appear in complex-conjugate pairs, c_{-k} = \overline{c_k}, where the overline denotes complex conjugation, and the imaginary parts annihilate each other. This is shown by the following computation. Let c_k = a_k + i b_k, then

f(x) ≈ \sum_{k=-K}^{K} c_k \phi_k(x) = c_0 + \sum_{k=1}^{K} \big[ \overline{c_k}\, e^{-ikx} + c_k e^{ikx} \big]
= c_0 + \sum_{k=1}^{K} \big[ (a_k - i b_k)(\cos(kx) - i\sin(kx)) + (a_k + i b_k)(\cos(kx) + i\sin(kx)) \big]
= c_0 + \sum_{k=1}^{K} \big[ 2 a_k \cos(kx) - 2 b_k \sin(kx) \big].

The complex-valued description of the Fourier series is thus equivalent to the real-valued one, but is notationally more convenient to work with.

Legendre basis functions

The Legendre basis is a basis of L_2([-1, 1]). The k-th Legendre polynomial is given by the formula

P_k(x) = \frac{1}{2^k k!} \frac{d^k}{dx^k} (x^2 - 1)^k,

or, alternatively, from Bonnet's recursion formula

P_0(x) = 1, \quad P_1(x) = x,
(k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x).

The first five Legendre basis functions are plotted in Fig. 1.2.

Figure 1.2. The first five Legendre functions.
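The recursion above translates directly into code. The following sketch (illustrative only; the orthonormality check uses plain trapezoidal quadrature on a truncated grid) evaluates the first Hermite functions and verifies numerically that they are close to orthonormal.

```python
import numpy as np

def hermite_functions(x, K):
    """Evaluate the orthonormal Hermite functions phi_0, ..., phi_K at the points x."""
    phi = np.zeros((K + 1, x.size))
    phi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2.0)
    if K >= 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for k in range(2, K + 1):
        phi[k] = (x * np.sqrt(2.0 / k) * phi[k - 1]
                  - np.sqrt((k - 1.0) / k) * phi[k - 2])
    return phi

x = np.linspace(-10.0, 10.0, 4001)     # tails are negligible beyond |x| = 10
dx = x[1] - x[0]
phi = hermite_functions(x, 6)
G = phi @ phi.T * dx                    # Gram matrix of pairwise inner products
print(np.allclose(G, np.eye(7), atol=1e-5))   # approximately the identity matrix
```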
1.3.3 Multivariate orthogonal series

Multivariate orthogonal basis functions can be used to approximate a multivariate function f(x) ∈ R, x ∈ R^D,

f(x) ≈ \sum_{k \in K} c_k \phi_k(x),    (1.8)

where K is some subset of N^D. A set of multivariate basis functions can be constructed from the one-dimensional ones. Assume that {φ_k^{(1)}(x)}_{k=0}^{\infty}, {φ_k^{(2)}(x)}_{k=0}^{\infty}, ..., {φ_k^{(D)}(x)}_{k=0}^{\infty} are orthonormal bases for L_2(Ω_1), L_2(Ω_2), ..., L_2(Ω_D). Then {φ_k(x)}_{k \in N^D} forms an orthonormal basis for L_2(Ω), where Ω = Ω_1 × Ω_2 × ... × Ω_D and

\phi_k(x) = \phi_{k_1}^{(1)}(x_1)\, \phi_{k_2}^{(2)}(x_2) \cdots \phi_{k_D}^{(D)}(x_D) = \prod_{i=1}^{D} \phi_{k_i}^{(i)}(x_i).

By the separability of φ_k(x), it follows that the basis functions are orthogonal to one another:

\int_{\Omega} \phi_n(x) \phi_k(x) dx = \int_{\Omega_1} \cdots \int_{\Omega_D} \prod_{i=1}^{D} \phi_{n_i}^{(i)}(x_i) \prod_{j=1}^{D} \phi_{k_j}^{(j)}(x_j)\, dx_1 dx_2 \cdots dx_D
= \int_{\Omega_1} \phi_{n_1}^{(1)}(x_1) \phi_{k_1}^{(1)}(x_1) dx_1 \int_{\Omega_2} \phi_{n_2}^{(2)}(x_2) \phi_{k_2}^{(2)}(x_2) dx_2 \cdots \int_{\Omega_D} \phi_{n_D}^{(D)}(x_D) \phi_{k_D}^{(D)}(x_D) dx_D = \begin{cases} 1 & \text{if } k = n \\ 0 & \text{otherwise} \end{cases},

which follows from the fact that each factor \int_{\Omega_i} \phi_{n_i}^{(i)}(x_i) \phi_{k_i}^{(i)}(x_i) dx_i equals one iff n_i = k_i and zero otherwise, i = 1, 2, ..., D. The coefficient with index k is given by

c_k = \int_{\Omega} \phi_k(x) f(x) dx.

A proof of completeness for the set of functions is given in [18]. As an example, the multivariate Hermite functions φ_{11}(x), φ_{12}(x) and φ_{22}(x) are plotted in Fig. 1.3.

Figure 1.3. The multivariate Hermite functions φ_{11}(x), φ_{12}(x) and φ_{22}(x).

1.3.4 Shifting and scaling

How well the underlying function is approximated by a truncated series expansion depends on how the basis functions are scaled and shifted relative to it. Since the set of basis functions is complete, the series will converge to the true function (provided it is square integrable) when K → ∞, regardless of the scaling and shifting. However, by rescaling and shifting, a better fit can be obtained for a truncated expansion. Assume that the set of basis functions {φ_k(x)}_{k \in K} is orthonormal on the domain Ω. Then a set of basis functions orthonormal on the domain Ω' = {y | y = Σx + μ, x ∈ Ω} is given by

\tilde{\phi}_k(x) = \det(\Sigma)^{-1/2} \phi_k(\Sigma^{-1}(x - \mu)), \quad k \in K,

where Σ is a symmetric positive definite matrix. The orthonormality on the domain Ω' follows from

\int_{\Omega'} \tilde{\phi}_k(y) \tilde{\phi}_n(y) dy = \int_{\Omega'} \det(\Sigma)^{-1} \phi_k(\Sigma^{-1}(y - \mu)) \phi_n(\Sigma^{-1}(y - \mu)) dy
= \{ x = \Sigma^{-1}(y - \mu),\; dx = \det(\Sigma)^{-1} dy \} = \int_{\Omega} \phi_k(x) \phi_n(x) dx = \begin{cases} 1 & \text{if } k = n \\ 0 & \text{otherwise} \end{cases}.

1.4 Some probability theory

Probability theory constitutes a whole branch of mathematics and is one of the foundations that this thesis is built on. Here, some concepts regarding random variables utilized in the thesis are summarized. Much of the material is assumed to be well known to the reader and is hence only briefly explained. The problem of estimating a probability density function from a random sample is discussed in more detail, as it is a more specialized topic that is typically less known to a wider audience.

1.4.1 Definition of a random variable

A random variable is a mathematical object, developed to represent an event that has not yet happened and is subject to chance. A common example is the number of dots in a dice throw, which, if the dice is balanced, has a probability of 1/6 of being any of the numbers {1, 2, 3, 4, 5, 6}. To give a more formal definition of a random variable, the concept of a probability space has first to be introduced. A probability space is defined as the triplet (Ω, A, P), where Ω = {ω} is a set of all possible outcomes, A = {a} is a set of events, where a ⊆ Ω, and P : A → R_+ is a function that to each event a assigns a probability P(a) ≥ 0. A random variable, or stochastic variable, is defined as a real-valued function X : Ω → R on the set Ω, [28].

1.4.2 The distribution function and some associated measures

Assume that X = [X_1, X_2, ..., X_n]^T is an n-dimensional random variable. To every stochastic variable there is an associated distribution G; this relationship is written X ∼ G.
To each G there are two commonly associated distribution functions, the cumulative density function (CDF) P_X(x) and the probability density function (PDF) p_X(x), defined as

P_X(x) = Pr(X ≤ x) = Pr(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n),
p_X(x) = \frac{\partial^n P_X(x)}{\partial x_1 \partial x_2 \cdots \partial x_n}.

The CDF satisfies

0 ≤ P_X(x) ≤ 1    (1.9)

and is monotonically increasing in each dimension. The PDF satisfies

p_X(x) ≥ 0,    (1.10)
\int_{R^n} p_X(x) dx = 1.    (1.11)

When there is no risk for confusion, the index is usually dropped and the functions are written just as P(x) and p(x).

Let Y be a subset of the random variables X_1, X_2, ..., X_n and Z be the subset that contains the variables not included in Y. The conditional density function,

p(y|z) = \frac{p(y, z)}{p(z)},

specifies the density of Y given that Z = z. The marginal distribution, characterizing the distribution of Y alone, is given by

p(y) = \int_{R^{n_z}} p(y, z) dz.

The expected value of g(X), where g is an arbitrary function, is given by

E[g(X)] = \int_{R^n} g(x) p_X(x) dx.

The mean value and covariance of X are defined as

\mu = E[X] = \int_{R^n} x\, p_X(x) dx,
\Sigma = E[(X - \mu)(X - \mu)^T] = \int_{R^n} (x - \mu)(x - \mu)^T p_X(x) dx.

1.4.3 Bayesian statistics

Bayesian statistics is a subset of the field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief or, more specifically, Bayesian probabilities. Such an interpretation is only one of a number of interpretations of probability, and there are other statistical techniques that are not based on "degrees of belief". A fundamental equation in Bayesian statistics is Bayes' rule. Assume that A and B are two random events; the conditional probability of the event A given the outcome of B is given by Bayes' rule as

Pr(A|B) = \frac{Pr(A)\, Pr(B|A)}{Pr(B)},

or in the form of a PDF,

p_{A|B}(a|b) = \frac{p_A(a)\, p_{B|A}(b|a)}{p_B(b)}.

It is a powerful formula that states how the belief in the event A should be updated when new evidence is provided. Without the knowledge of event B, the probability of event A is just Pr(A), which is usually referred to as the prior probability. When the new information, or evidence, B, is received, Bayes' rule states how the belief in the event A should be updated to give the probability Pr(A|B), usually known as the posterior probability. The Bayesian framework provides a powerful and comprehensive angle of attack on the problem of dealing with uncertainty.

1.4.4 Estimation from random samples

Let {X_i}_{i=1}^{N} be a set of N i.i.d. (independent identically distributed) random variables with distribution G. In statistics, a random i.i.d. sample refers to a set of observations {x^{(i)}}_{i=1}^{N} of some random variable X. In signal processing, a sample usually refers to an observation at some given time instant; the terminologies hence collide and can cause confusion. In this thesis, the terminology that a sample is a set of observations is employed. A sample from G is given by the set of observations {x^{(i)}}_{i=1}^{N}, where x^{(i)} is a realization of X_i. From the sample, information about the underlying distribution can be extracted. For instance, the sample mean and covariance

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)},    (1.12)

\hat{\Sigma} = \frac{1}{N-1} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T,    (1.13)

are unbiased estimators of the true mean and covariance of the distribution.
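The estimators (1.12) and (1.13) are straightforward to compute. The sketch below (illustrative only; the distribution parameters are arbitrary choices for the example) forms them for a two-dimensional Gaussian sample and checks against numpy's built-in covariance estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=1000)  # N i.i.d. observations

N = X.shape[0]
mu_hat = X.mean(axis=0)                                  # sample mean, cf. (1.12)
D = X - mu_hat
Sigma_hat = D.T @ D / (N - 1)                            # sample covariance, cf. (1.13)

print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # matches numpy's estimator
```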
An estimator is said to be consistent if it is unbiased, i.e. E[θ̂] = θ, and that the variance approaches zero as N increases, i.e. V[θ̂] → 0 as N → ∞. Furthermore, it is said to be efficient if it is an unbiased estimator that provides the lowest variance for a given N . As an example, estimator (1.12) of μ in can be shown to be consistent since E[μ̂] = E[ 16 1 1 1 Xi ] = E[Xi ] = μ=μ N N N N N N i=1 i=1 i=1 and V[μ̂] = V[ V[Xi ] ΣX 1 1 = , Xi ] = 2 V[Xi ] = N N N N N N i=1 i=1 which apparently approaches 0 when N → ∞. Confidence intervals of point estimates When a point estimate is made, it is interesting to assign a confidence interval to the estimate, i.e. some interval that covers the true parameter value with some given probability α. The 95% or 99% confidence intervals are often displayed. If the observations are independent, and the estimate θ̂ is formed as a sum of functions of the observations θ̂ = N fi (x(i) ), (1.14) i=1 which is a commonly encountered case, θ̂ will have a variance of Σθ̂ = V[θ̂] = N V[fi (Xi )]. (1.15) i=1 If fi (·) is linear, and Xi have a Gaussian distribution, the exact confidence interval for θ̂ is given by I = [θ̂ − λα/2 Σθ̂ , θ̂ + λα/2 Σθ̂ ], (1.16) where λα/2 equals 1.96 for a 95% confidence interval and 2.58 for a 99% confidence interval. If fi (·) are not linear, and/or Xi are not Gaussian, θ̂ will not be normally distributed and it can be arbitrarily difficult to construct a confidence interval for the estimate. However, if the observations are many enough, θ̂ will be approximately normally distributed regardless of the distribution of Xi and the class of fi (·), according to the central limit theorem. Typically ”large enough” is considered approximately N = 30 in which case the approximation holds with high accuracy and (1.16) can be taken as an approximative confidence interval for θ̂. Confidence intervals for more difficult situations are discussed in Sec. 1.4.5. PDF estimation Estimating a PDF from a random sample is a somewhat more complicated problem than extracting point estimates since, in that case, the whole PDF p(x) is estimated from the given observations. What is meant by convergence of the estimate is a more difficult question in 17 this case. Typically, convergence refers to convergence in the mean integrated square error sense (MISE). The MISE, Q, is defined as Q = E[ (p̂(x) − p(x))2 dx]. Then, if Q → 0 as N → ∞, the estimate is said to be MISE consistent. There are two sub classes of PDF estimators: parametric and nonparametric ones. A parametric estimator assumes that the underlying stochastic variable comes from some family of parametric distributions, G, characterized by the parameters a1 , a2 , ...aK . The parameters are then estimated from the sample, to give an estimate of the whole PDF. The most commonly occurring parametric estimator is the Gaussian one. It has the mean and covariance as parameters which are consistently estimated from (1.12), (1.13). Parametric estimators can be shown to have a O(N −1 ) convergence rate in the MISE sense, in the best case [91]. Non-parametric estimators assume nothing about the underlying distribution and are hence more general than parametric estimators. The price paid is a slower convergence rate. Below are three commonly used non-parametric PDF estimators briefly presented: the histogram, the kernel density estimator, and the orthogonal series density estimator. Histogram The histogram constitutes a piece-wise constant estimator of the PDF. 
PDF estimation

Estimating a PDF from a random sample is a somewhat more complicated problem than extracting point estimates since, in that case, the whole PDF p(x) is estimated from the given observations. What is meant by convergence of the estimate is a more difficult question in this case. Typically, convergence refers to convergence in the mean integrated square error (MISE) sense. The MISE, Q, is defined as

Q = E\Big[ \int (\hat{p}(x) - p(x))^2 dx \Big].

Then, if Q → 0 as N → ∞, the estimate is said to be MISE consistent. There are two subclasses of PDF estimators: parametric and non-parametric ones. A parametric estimator assumes that the underlying stochastic variable comes from some family of parametric distributions, G, characterized by the parameters a_1, a_2, ..., a_K. The parameters are then estimated from the sample, to give an estimate of the whole PDF. The most commonly occurring parametric estimator is the Gaussian one. It has the mean and covariance as parameters, which are consistently estimated from (1.12), (1.13). Parametric estimators can be shown to have an O(N^{-1}) convergence rate in the MISE sense, in the best case [91]. Non-parametric estimators assume nothing about the underlying distribution and are hence more general than parametric estimators. The price paid is a slower convergence rate. Below, three commonly used non-parametric PDF estimators are briefly presented: the histogram, the kernel density estimator, and the orthogonal series density estimator.

Histogram

The histogram constitutes a piece-wise constant estimator of the PDF. It is simply created by dividing the domain Ω into bins b_k, k = 1, 2, ..., K, and assigning each bin a value given by the number of samples f_k that belong to it. The PDF p(x) is approximated by a constant value h_k = f_k/N over each bin. It is a simple but rather primitive estimator that requires a relatively large sample size N to yield a good approximation.

Kernel estimation

Another commonly used approximation method is the kernel density estimator. A kernel density approximation, see e.g. [105], of p(x) is given by

\hat{p}(x) = \frac{1}{N |H|^{1/2}} \sum_{i=1}^{N} \phi\big( H^{-1/2} (x - x^{(i)}) \big),

where φ(·) is a kernel function that is symmetric and integrates to one. The parameter H ∈ R^{n×n} is known as the bandwidth of the kernel; it is symmetric and positive definite and acts as a smoothing parameter. Assume that H = hI. A high value of h will give a smooth estimate, with a low variance but a high bias. Conversely, a low value of h will give a higher variance but a lower bias of the estimate.

Figure 1.4. A set of 50 weighted particles (gray stems) and the fitted series expansion (black solid line) using the first 7 Hermite functions.

Consider the one-dimensional case. The value of h is a user parameter, but there are some guidelines for how it should be chosen. It can be shown [96] that the optimal choice of h, in the sense that it minimizes the asymptotic mean integrated square error, is given by

h = \hat{\sigma}\, C(\nu)\, N^{-\frac{1}{2\nu + 1}},

where σ̂ is the sample standard deviation, and C and ν are kernel-specific constants. With this choice of bandwidth, the asymptotic mean integrated square error (AMISE) converges at an O(N^{-4/5}) rate. This is slower than the O(N^{-1}) rate that is obtained for a parametric estimator. However, under weak assumptions, it has been shown that the kernel estimator is optimal in the sense that there can be no non-parametric estimator that converges faster to the true density [110]. For computational purposes, the kernel density estimator has the drawback that the approximation requires a large number of terms, namely N of them.
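The following sketch illustrates a one-dimensional Gaussian-kernel density estimate with a rule-of-thumb bandwidth of the above form; the constant C = 1.06 used here is the common Gaussian-kernel (Silverman-type) value and is an assumption for the example, not a value taken from the thesis.

```python
import numpy as np

def kde_gauss(x_eval, sample, h):
    """One-dimensional kernel density estimate with a Gaussian kernel phi."""
    u = (x_eval[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (sample.size * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, size=500)

# Rule-of-thumb bandwidth h = sigma_hat * C * N**(-1/5), with C = 1.06 assumed
h = 1.06 * sample.std(ddof=1) * sample.size ** (-1.0 / 5.0)

x = np.linspace(-4.0, 4.0, 801)
p_hat = kde_gauss(x, sample, h)
print((p_hat * (x[1] - x[0])).sum())   # integrates to approximately one
```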
This is often a main reason for not using the method in different situations. However, for the purposes of approximation encountered in this thesis, it does not pose an obstacle. In recursive Bayesian estimation, estimation of the PDF is typically just an intermediate step, towards the actual goal of extracting a point estimate of the state. The point estimate is typically extracted as the mean value or a maximum of p(x), which are not crucially affected by a potential negativity of p(x). Consider the scalar case. The mean value is the point at which p(x) would balance if it were a mechanical structure put on a spike. A negative density to the left of μ will thus act as a negative mass that has the same net effect as having a mirrored positive mass to the right of μ. It is thus not more severe to have a negative estimate to the left of μ, than overestimate p(x) to the right of μ, and vice verse, but merely |p(x) − p̂(x)| is of importance. In other situations, it is only sought to find a unnormalized estimate of p(x) (this is the case in e.g. Chapter 6) in which case p̂(x) = |p̂(x)| can simply be taken as the estimate. 20 1.4.5 Confidence regions and outliers Assume that a set of observations, believed to come from the same distribution, is given. An outlier is an observation that is so deviating from the other observations as to rise suspicion of being generated by other mechanisms than the majority of the observations in the set. If the PDF from which the sample comes from is known, it can be used to determine whether an observation is unlikely enough to be classified as anomalous. There are several methods for detection of outliers, but most of them are based on a Gaussian assumption of the distribution. A commonly used outlier detection method is to study the Mahalanobis distance DM (x) = (x − μ)T Σ−1 (x − μ). (1.17) If it exceeds some given threshold then the observation is classified as anomalous. The Mahalanobis distance is a sensible measure for deviation if the distribution is unimodal and radially symmetric but makes little sense otherwise. Here a more general approach to classify outliers is suggested. Let V (Ω) := Ω 1dx denote the volume of a set. The inlier region at a confidence level α is then defined as the most dense domain Ω such that p(x)dx = 1 − α, (1.18) Ω where the density of the domain is defined as p(x)dx δ= Ω . V (Ω) This means that under the null hypothesis that x comes from the distribution G, the probability of getting an observation x ∈ / Ω is lower than α. Using the criterion (1.18) only gives an ambiguity of how to select the inlier set Ω. Having an inlier region that does not cover the region where the probability density of getting an observation is highest under the null hypothesis, is not reasonable. By demanding that it should be the most dense region that satisfies (1.18), this is avoided and the domain Ω will also (if p(x) is not constant over a set of nonzero measure) be uniquely defined. This can be motivated by the following arguing. Assume that a threshold value 0 < γ < sup p(x) is chosen, and that Ω x is the set Ω(γ) = {x|p(x) ≥ γ} (i.e. the most dense region). Then p(x)dx (1.19) h(γ) = Ω(γ) is a monotonically increasing function in V (Ω) as p(x) > 0 on Ω. There is thus a one-to-one mapping between h(γ) and V (Ω(γ)) and 21 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Figure 1.5. Inlier region for Gaussian PDF. 0.7 0.6 0.5 p(x) 0.4 0.3 0.2 0.1 0 −5 −4 −3 −2 −1 0 x 1 2 3 4 5 Figure 1.6. 
Inlier region according to (1.18) for a bimodal non-symmetric PDF. there is also a one-to-one mapping between V (Ω(γ)) and Ω(γ). Ω will hence be uniquely defined if h(γ) = 1 − α has a solution. Unfortunately it may not have a solution. If p(x) = k, where k ∈ R is a constant, over a set of nonzero measure h(γ) will have a jump discontinuity at γ = k, and h(γ) = 1−α will lack solution if the jump discontinuity covers 1−α. In that case any Ω, that satisfies (1.18) and Ω+ ⊆ Ω ⊆ Ω− makes equal sense and can be taken as the inlier region, where Ω− = Ω(lim γ ) and γ→k− Ω+ = Ω(lim γ ). γ→k+ In the case of a distribution with symmetric unimodal PDF, the inlier region described by (1.17) and (1.18) will be the same, but in the case of multimodal skew distributions they can differ significantly. In Fig. 1.5 and Fig. 1.6, the inlier region for a unimodal symmtric and a non-symmetric bimodal distribution is shown, respectively. When using this reasoning for computing outliers, the p-value, q, for an observation x0 (the probability of getting an observation as least as extreme as x0 ), is given from q(x0 ) = 1 − h(p(x0 )). 22 (1.20) 1.5 Recursive Bayesian estimation Statistical estimation deals with the problem of estimating parameters, based on empirical data containing some random component. A subfield of this is parameter and state estimation in dynamical systems. In recursive statistical estimation, the estimate is updated in an iterative manner as new evidence about the unobserved quantity is acquired. Being the underlying problem to all optimal stochastic filtering methods, the recursive Bayesian estimation problem is briefly reviewed in this section. A thorough exposition of the subject can be found in some of the classical textbooks, e.g. [6], [70], [108], [99]. Consider a stochastic process {xt , t = 0, 1, ...} described by the state space model xt+1 = ft (xt , vt ), yt = ht (xt , et ), (1.21) (1.22) with the state xt ∈ Rn and output yt ∈ Rp . The sequences vt ∈ Rnv and et ∈ Rne are zero-mean white noise mutually independent processes characterized by the distributions with known PDFs given by p(vt ) and p(et ), respectively, where t is discrete time. The functions ft (·) and ht (·) are arbitrary but known vector-valued functions. The aim is to provide an estimate, x̂t , of the state xt given the measurements Yt = {y0 , y1 , ..., yt }. 1.5.1 Optimal estimation An optimal estimator for the system (1.21)-(1.22) is the estimator that gives an estimate x̂0:t , of the random vector x0:t given observations of the correlated random variable Yt = {y0 , y1 , ..., yt }, that is optimal in some sense. Consider the matrix-valued criterion Q(l(Yt )) = E[(x0:t − l(Yt ))(x0:t − l(Yt ))T ]. (1.23) It can be shown that the function l(·) that minimizes any scalar valued monotonically increasing function of Q, is the conditional mean [99], i.e. l(Yt ) = E[x0:t |Yt ]. (1.24) Examples of scalar valued monotonically increasing functions of Q are det(Q) and tr(WQ), where W is a real positive-definite weighting matrix. With W taken as the identity matrix, this shows that (1.24), is the optimal estimatior in the sense that it minimizes the mean square error of the estimate. For dynamical systems it is fairly complicated to compute the conditional mean, in the general case. One case when this problem is possible to solve in a closed form, is for linear systems with 23 Gaussian noise, in which case the Kalman filter gives the solution. For other system structures there is no closed-form solution to the problem. 
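For completeness, the optimality of the conditional mean in (1.24) can be seen from a short standard argument (stated here for the scalar case, using the tower property of conditional expectation): for any estimator l(Y_t),

E[(x − l(Y_t))²] = E[ E[(x − l(Y_t))² | Y_t] ] = E[ V[x | Y_t] + (E[x | Y_t] − l(Y_t))² ] ≥ E[ V[x | Y_t] ],

with equality if and only if l(Y_t) = E[x | Y_t] (almost surely).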
To find the (minimum mean square) estimate, E[xt |Yt ], of the state at time step t the PDF p(x0:t |Yt ) must then be computed and marginalized over x0:t−1 , i.e. (1.25) p(xt |Yt ) = p(x0:t |Yt )dx0:t−1 and the estimate extracted as E[xt |Yt ] = xt p(xt |Yt )dxt . (1.26) In general, to construct the mathematical object p(x0:t |Yt ), is extremely complex and require tremendous amount of computation even for moderate values of t. There are however simplifying properties in the model structure that provide a way of computing the marginal distribution p(xt |Yt ) recursively, namely that p(xt |xt−1 , xt−2 , ...., x0 ) = p(xt |xt−1 ), p(yt |xt , Yt−1 ) = p(yt |xt ). (1.27) (1.28) The first equality follows from the Markov property of the state the whiteness of vt . The second equality follows straightforwardly from (1.22), as it is free from dynamics, and et is white. These properties simplifies the problem significantly and give means of computing p(xt |Yt ) recursively, via the prediction and update recursion. 1.5.2 The prediction-update recursion Assume that p(xt−1 |Yt−1 ) is known and that a new measurement yt is obtained. Exploiting the Markov property of the system (1.27), the predicted PDF p(xt |Yt−1 ) is obtained from the Kolmogorov-Chapman equation p(xt |xt−1 )p(xt−1 |Yt−1 )dxt−1 . (1.29) p(xt |Yt−1 ) = Rn Using (1.28) and applying Bayes rule, the updated PDF p(xt |Yt ) is then found, using the new evidence yt , as p(xt |Yt ) = p(yt |xt )p(xt |Yt−1 ) . p(yt |Yt−1 ) (1.30) Hence, given an initial PDF p(x0 |y0 ) = p(x0 ) for the initial state x0 , the PDF p(xt |Yt ) can be computed by applying the prediction and 24 15 p(xk|y1:k) 2 10 1 5 0 −6 −4 −2 0 k 2 4 6 8 0 x Figure 1.7. An illustration of how the PDF p(xt |Yt ) evolves over time, t. update steps in (1.29), (1.30) recursively to the measurements Yt as they arrive. Fig. 1.7 shows an example of how it could look like when the PDF evolves over time. The PDF p(xt |Yt ) gives the complete information about the random variable xt , and whatever statistical information required could be extracted from it. Typically a point estimate, x̂t of the state is of interest to provide. As discussed in Sec. 1.5.1 the conditional mean, x̂t = E[xt |Yt ], is the optimal estimator in the sense that it minimizes the variance of the estimation error. However, depending on the intended use of the point estimate, it can be motivated to employ another estimator. For instance, the maximum likelihood estimate x̂t = arg sup p(xt |Yt ). xt is a commonly used point estimator. In the recursive Bayesian estimation framework, p(xt |Yt ) is calculated by iterating the prediction and update steps. The PDFs p(xt |xt−1 ) and p(yt |xt ) defined by (1.29), (1.30), are implicitly given by the state space model in (1.21), (1.22). Via the generalized convolution formula, p(xt |xt−1 ) is found as p(xt |xt−1 ) = p(xt |xt−1 , vt−1 )p(vt−1 )dvt−1 (1.31) Rn = δ(xt − ft−1 (xt−1 , vt−1 ))p(vt−1 )dvt−1 . (1.32) Rn 25 1 1 1 0.5 0.5 0.5 0 4 0 4 2 0 -2 -4 -4 -2 0 2 4 0 4 2 0 -2 -4 -4 -2 0 2 4 2 0 -2 -4 -4 -2 0 2 Figure 1.8. Representation of a 2 dimensional PDF (a) Kalman filter, (b) particle filter, (c) grid-based filter. An important special case arises when the process noise is additive, i.e. (1.21) is given as xt+1 = ft (xt ) + vt , (1.33) in which case (1.32) evaluates to p(xt |xt−1 ) = pvt−1 (xt − ft−1 (xt−1 )). 
(1.34) In the same way, it follows that p(yt |xt ) is given by δ(yt − ht (xt , et ))p(e,t )det , p(yt |xt ) = Rp which in the case of an additive measurement noise becomes p(yt |xt ) = pet (yt − ht (xt )). (1.35) In general, closed-form expressions cannot be obtained for (1.29)-(1.30). As mentioned, a special case arises when ft and ht are linear and vt and et are Gaussian with zero mean, for which case the solution is given by the Kalman filter. For non-Gaussian noise, it can be shown that the Kalman filter is still the best unbiased linear estimator. However, to solve the estimation problem optimally when the system is nonlinear/non-Gaussian, approximation methods have to be used, of which Monte Carlo methods and grid-based are commonly used examples. The Kalman filter describes the PDF by a Gaussian function, the Monte Carlo methods provide a sample from the distribution, and the gridbased methods approximate the PDF over a discrete set of grid points. Fig. 1.8 illustrates how the above methods represent the information about the sought PDF, by a 2-dimensional example. In the following subsections some of the solution methods to the RBE problem are given. 26 1.6 Solution methods for the optimal filtering problem 1.6.1 Kalman filter If the process and measurement noises are white Gaussian noise sequences and ft , ht are linear functions, (1.21)-(1.22) can be written as xt+1 = Ft xt + vt , yt = Ht xt + et , (1.36) (1.37) where Ft ∈ Rn×n and Ht ∈ Rp×n . Under these assumptions, it can be shown that the prior and posterior distributions are Gaussian and hence completely characterized by the mean μ and covariance Σ. The closedform solution of (1.29), (1.30) that propagates the mean μ (coinciding with the estimated state x̂) and the estimation error covariance Pt|t = E([xt − x̂t|t ][xt − x̂t|t ]T ), is the Kalman filter [53], for which the prediction and update steps can be formulated as follows : Prediction x̂t|t−1 = Ft−1 x̂t−1|t−1 , (1.38) Pt|t−1 = Ft−1 Pt−1|t−1 FTt−1 + Qt−1 , (1.39) Kt = Pt|t−1 HTt (Ht Pt|t−1 HTt + Rt )−1 , x̂t|t = x̂t|t−1 + Kt (yt − Ht x̂t|t−1 ), Pt|t = (I − Kt Ht )Pt|t−1 , (1.40) (1.41) (1.42) Update where Qt = E[vt vtT ] and Rt = E[et eTt ]. There is a huge literature devoted to linear estimation and the Kalman filter, with some references given by [52], [99], [10], [34]. 1.6.2 Static Kalman filter Assume that the system in (1.36)-(1.37) is time invariant. Under the mild assumptions of R being positive definite and (F, B) being a stabilizable pair, where Q = BBT , the gain Kt and the error covariance Pt|t−1 = E([x̂t|t−1 − xt ][x̂t|t−1 − xt ]T ) will converge to the constants K and P respectively, as t → ∞ [99]. The filter gain can then be treated as a static one, with P given by the solution to the algebraic Riccatti equation P = FPFT + Q − FPHT (HPHT + R)−1 HPFT , 27 and K calculated as K = FPH(HPHT + R)−1 . The static Kalman filter is then given from x̂t+1|t = Fx̂t|t−1 + K(yt − Hx̂t|t−1 ). (1.43) 1.6.3 Extended Kalman filter A generalization of the KF that applies to nonlinear systems is the extended Kalman filter (EKF) [99], [34]. It is a suboptimal solution to the recursive estimation problem based on a linearization of the measurement and system equations around estimates of the state. Assume that the nonlinear functions ft and ht are differentiable. At each time step they are approximated by a first order Taylor expansion, i.e. 
ft (xt ) ≈ ft (x̂t|t ) + Ft (xt − x̂t|t ), ht (xt ) ≈ ht (x̂t|t−1 ) + Ht (xt − x̂t|t−1 ), where Ft = Ht = ∂ft (x) , ∂x x=x̂t−1|t−1 ∂ht (x) . ∂x (1.44) (1.45) x=x̂t|t−1 The filtering is then performed by applying the standard KF equations (1.38)-(1.42) to the linearized system. If the nonlinearities are severe, the linearization can be a poor approximation of the system, which can in the worst case lead to divergence of the filter. 1.6.4 Unscented Kalman filter Another filtering method that applies to nonlinear system and can be shown to be more robust against nonlinearities is the unscented Kalman filter (UKF) [50]. One iteration of the method can be summarized as follows. Assume that xt−1|t−1 has the mean and covariance given by μt−1|t−1 and Σt−1|t−1 , respectively and define the augmented state and covariance as 28 μTt−1|t−1 E[vtT ] Σt−1|t−1 0 = . 0 Qt xat−1|t−1 = Σat−1|t−1 T , The UKF picks a deterministic set of sigma points around the mean which are then propagated and updated to get an approxmation of the posterior distribution. In the prediction step the set of weighted sigma (i) (i) points St−1|t−1 = {χt−1|t−1 , wt−1|t−1 }N i=1 is chosen as (0) χt−1|t−1 = xat−1|t−1 , (i) χt−1|t−1 = xat−1|t−1 + ( (i) χt−1|t−1 = xat−1|t−1 − ( (1.46) nΣat−1|t−1 )i , i = 1, ..., n, (1.47) nΣat−1|t−1 )i−n , i = n + 1, ..., 2n,(1.48) where ( nΣt−1 )i denotes the i-th row of the Choleskey factorization of nΣt−1 , and the weights are given by λ , L+λ λ + (1 − α2 + β), = L+λ 1 , = wc(i) = 2(L + λ) ws(0) = wc(D0) ws(i) (1.49) (1.50) (1.51) (1.52) where λ = α2 (L+κ)−L. The constants α, β and κ are user parameters used to control the spread of the sigma points. The sigma points are then propagated through the state transition equation, i.e. (i) (i) χt|t−1 = ft−1 (χt−1|t−1 ), i = 0, 1, ..., 2n, which yields the predicted state and covariance as x̂t|t−1 = Pt|t−1 = 2L i=0 2L (i) ws(i) χt|t−1 , (i) (i) wc(i) [χt|t−1 − x̂t|t−1 ][χt|t−1 − x̂t|t−1 ]T . i=0 29 In the update step an analogue procedure as in the prediction step is carried out, but the state and covaraince are augmented with E[eTt ] and Rt respectively and the sigma points are propagated though the measurement equation. The state estimate and the error covariance are then updated by Pyx = 2L (i) (i) wc(i) [χt|t−1 − x̂k|k−1 ][γt|t − ŷk ]T , (1.53) i=0 Kt = Pt|t P−1 yx , xt|t = xt|t−1 + Kt (yt − ŷt ), (1.54) (1.55) (1.56) Pt|t = Pt|t−1 − Kt Pyx KtT . (1.57) 1.6.5 Monte-Carlo methods A simple and powerful, though computationally costly, method to perform filtering via (1.29), (1.30) is by means of Monte-Carlo simulation [71], [21]. The Monte-Carlo based framework can handle nonlinear systems with general noise distributions. The method provides way of ob(i) (i) taining a weighted sample St = {xt , wt }N i=1 from the distribution with PDF p(xt |Yt ) from which the desired information about the random variable xt can be extracted. Assume that at time step t − 1, (i) (i) St−1 = {xt−1 , wt−1 }N i=1 constitutes a weighted sample from the distribu(i) tion with PDF p(xt−1 |Yt−1 ), where xt−1 is the i-th observation, called (i) a particle, with associated weight wt−1 ≥ 0. Given St−1 , a sample from p(xt |Yt−1 ) is obtained by propagating each particle through system equation (1.21), i.e. (i) (i) (i) xt = ft−1 (xt−1 , vt−1 ), i = 1, .., N, (1.58) (i) where vt−1 is a draw from the distribution with PDF p(vt−1 ). The measurement yt is then used to update the weights by (i) (i) (i) wt = wt−1 p(yt |xt ), i = 1, ..., N. 
(i) (i) (1.59) This yields the particle set St = {xt , wt }N i=1 at time step t. By iterating (4.1) and (4.2), a sample from p(xt |Yt ) is thus recursively obtained. The recursion is initialized by making N draws from an a initial distribution with PDF p(x0 ). It can be shown that, as formulated for now, the variance of the weights in the particle set can only increase over time, with the consequence that the weights of all particles except for one will approach 30 Figure 1.9. The evolution of a set of particles. The left part of the figure shows the particles as dots and their weights are represented by the dot sizes. The right part of the figure shows the discrete weighted estimate p̂(xt |Yt ) of p(xt |Yt ) given by the particles. Step (1), (2), (3) and (4) shows the initial set St−1|t−1 , the propagated set St|t−1 , the updated set St|t and the resampled (and one respectively. step propagated) set St|t zero as t → ∞ [22]. When this happens, the filtering has broken down to a pure simulation of the system. To remedy this problem and concentrate the particles to the domain of interest, i.e. where the density of p(xt |Yt ) is high, resampling can be performed. In the resampling (i) (i) step, a new set of particles St = {xt , wt }N i=1 is created by sampling from p(xt |Yt ), and replaces the old particle set St . Bootstrapping is a common approach where a new set of particles St is created by by making N draws with replacement from the old particle set such that (i) (i) (i) (i) P r(xt = xt ) = wt and setting wt = 1/N . An illustration of how the particle set evolves during the prediction, update and resampling step, is given in Fig. 1.9. 1.6.6 Grid-based methods Grid based methods, see e.g. [15], solves the recursive Bayesian estimation problem by giving an approximate solution over a discrete set of grid points. In the way in which the PDF is approximated over discrete bins makes it closely related to the histogram estimator. The involved PDFs are approximated by point masses over a discrete (i) (i) (i) N n set of grid points {xt }N i=1 , xt ∈ R , with associated weights {wt|t }i=1 . The PDF p(xt−1 |Yt−1 ) is then approximated as p(xt−1 |Yt−1 ) ≈ N (i) (i) wt−1|t−1 δ(xt−1 − xt−1 ), (1.60) i=1 31 where the approximation sign should be interpreted as that the weighted set of point masses carries approximately the same statistical information about the state as the true PDF, such as e.g. the mean and variance. The i-th weight is propagated via the prediction and update equations as (i) wt|t−1 = N (j) (i) (j) wt−1|t−1 p(xt |xt−1 ), (1.61) j=1 (i) wt|t (i) (i) = wt|t−1 p(yt |xt ), (1.62) and the predicted PDF p(xt |Yt−1 ) and updated PDF p(xt |Yt ) are approximated by p(xt |Yt−1 ) ≈ N (i) (i) wt|t−1 δ(xt − xt ), i=1 p(xt |Yt ) ≈ N (i) (i) wt|t δ(xt − xt ). i=1 A problem with grid-based methods is a large computational burden associated with it. To achieve satisfactory accuracy, a large number of grid points must be used. As the number of grid points grows exponentially with the dimension of the problem, its usability is confined to low-dimensional problems. 1.6.7 Computational complexity The computational complexity of the optimal solution to the recursive Bayesian estimation problem is a major obstacle in real-time applications as well as in high-dimensional problems. In the linear Gaussian case (KF), the computational complexity grows as O(n3 ) where n is the dimension of the state space. 
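As a rough indication of where the cubic growth comes from (an illustrative operation count): the covariance propagation (1.39) alone involves two products of n × n matrices, each requiring about 2n³ floating point operations, i.e. roughly 4n³ operations per time step; for n = 100 this is already of the order of 4·10⁶ operations per processed measurement.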
However, for non-linear non-Gaussian filtering methods, the computational complexity typically grows as O(N n ), i.e. it increases exponentially with the dimension. This is often referred to as the ”curse of dimensionality”. Consider for instance a grid based method. If one dimension requires N grid points, then a twodimensional approximation requires N 2 grid points, and a n-dimensional problem requires N n grid points. The curse of dimensionality poses a severe problem, limiting the applicability of the methods to relatively low-dimensional cases. Basically all non-parametric estimators suffers more or less from this problem, but the factor N in the O(N n ) complexity can vary significantly among different filtering methods, which is of high importance in practice. 32 1.7 High-performance computing Since one of the main points in using parallel hardware is to achieve faster execution times, it is not only important that the computations are made in parallel, but also that the program is optimized w.r.t. the execution time. Today’s compilers can automatically do significant optimizations to the code. However, to achieve high performance, the programmer must still invest an effort in optimization of the code, giving the compiler a good ground to work on. Some of the optimizations are discussed here and can be found in [62], [31]. 1.7.1 Efficient memory handling One of the most important aspects in achieving fast execution is to handle the memory accesses efficiently. A program, where memory access has been optimized, can potentially execute several orders of magnitude faster than an unoptimized program. Two important data access properties exhibited by many programs are spatial and temporal locality. This is something that the cache memory utilizes, at a hardware level, in the following manner. • Spatial locality: If an element at a specific memory address is accessed, it is likely that data elements at nearby addresses will be accessed soon. Therefore, neighboring data elements to the one that are being accessed now will also be brought to the cache. • Temporal locality: If an element is being accessed now, it is likely that it will be used soon again. Therefore, the cache will keep the most recently used data. Thus, when a data element is brought to the cache, not only that particular element, but also the neighboring data elements will be brought to the cache. How many elements are brought in depends on the cache line size. As it is time consuming to move data in the memory, it benefits performance greatly if the code is written in a such manner that the data movement from the main memory to the CPUs is minimized. When a data element is brought to the cache, it is hence desirable to use the element to accomplish as many calculations as possible that the element is involved in, before it is thrown out of the cache. Cache re-use can be an even more critical issue on multicore processors than on single core processors due to their larger computational power and more complex memory hierarchies. 33 Algorithm 1 N = 10000; A = randn (N) ; B = randn (N) ; f o r j =1:N f o r i =1:N A( i , j ) = A( i , j ) + B( i , j ) ; end end As a simple example of the importance of good memory handling consider the following case. Alg. 1 and Alg. 2 implement a simple matrixmatrix addition. In Alg. 1, the matrices are read column-wise, while in Alg. 2 they are read row-wise. The code in Alg. 1 executes in about 1.2 seconds while the code in Alg. 2 executes in about 5.8 seconds on the author’s PC. 
The two implementations perform exactly the same work, but the code in Alg. 1 executes about five times faster than that in Alg. 2. (Note that this is only an example; in the Matlab language the code should not be written in either of the two ways but simply as A = A + B; to utilize Matlab's pre-compiled libraries for vector operations.) The reason for this is that Matlab implements column-major ordering for its arrays, which means that when a matrix is stored in memory, elements that are adjacent within a column are placed at neighboring memory addresses. For instance,

A = [ 1 4 7
      2 5 8
      3 6 9 ]

will be stored in memory as 1, 2, 3, 4, 5, 6, 7, 8, 9 at the consecutive addresses a_1, a_2, ..., a_9, where a_i is the i-th memory address. Since the cache will fetch several neighboring data elements when reading a value, the next value will already be available in the cache when requested by the processor, if the matrix is read column-wise. However, if it is read row-wise, the next element will not already be in the cache and has to be brought from memory higher up in the hierarchy, which slows down the execution as the processor has to idle while waiting for the requested data to be brought to the registers.

Algorithm 2
N = 10000;
A = randn(N);
B = randn(N);
for i=1:N
    for j=1:N
        A(i,j) = A(i,j) + B(i,j);
    end
end

1.7.2 Hardware mechanisms for efficient code execution
Further considerations when designing high-performance software are hardware-specific optimizations. A programmer aware of such mechanisms can utilize them to improve the performance substantially; some examples are given below. For instance, a pre-fetcher is a hardware mechanism that tries to predict which data will be used in the near future and fetches this data into the cache so that it is readily available when requested by the processor. The pre-fetcher monitors which memory locations are currently being accessed by the processor, predicts which memory locations will be needed in the future, and issues pre-fetches to those locations. Predicting future accesses is of course a very difficult problem to solve. But if the programmer is aware of this, and as far as possible tries to access the memory in regular patterns, the pre-fetcher can make better predictions and improve the performance of the execution. As another example, many CPUs implement a hardware pipeline to increase the throughput of the processor. In the pipeline, the stages of code execution (instruction fetch, instruction decode and register fetch, execute, memory access, and register write-back) are implemented as a series of elements, where the output of one element is the input of the next one. This construction allows overlapping execution of multiple instructions with the same circuitry. To be able to fill the pipeline, the processor must know which code is to be executed next. Therefore branching, such as "if" statements, can have a negative impact on the execution time, since the processor does not know beforehand which code will be executed next and cannot fill the pipeline appropriately. There are several other situations that can cause trouble in the pipeline, known as "hazards", that should be taken into consideration when optimizing the code.

1.7.3 Some further examples
Loop unrolling, loop fusion, hoisting, avoiding branching and avoiding redundant computations are other techniques to improve the execution time, to mention some; a small sketch of two of them is given below, after which the techniques are described in turn.
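The following fragment (for illustration only, assuming n is divisible by four) rewrites a scaled dot product with manual four-way unrolling and with the loop-invariant scale factor hoisted out of the loop.

/* Illustrative sketch: naive scaled dot product versus a version with
 * 4-way manual unrolling and hoisting of the loop-invariant factor. */
double dot_naive(const double *a, const double *b, int n, double scale)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += scale * a[i] * b[i];       /* 'scale' multiplied in every iteration */
    return s;
}

double dot_unrolled(const double *a, const double *b, int n, double scale)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {    /* fewer counter updates and tests */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return scale * (s0 + s1 + s2 + s3); /* hoisted: applied once, not n times */
}

The unrolled version performs the same arithmetic but with a quarter of the loop overhead, and applies the scale factor once instead of in every iteration.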
In loop unrolling a loop is written in a long sequence of operations, rather than in a typical loop format, such as a ”for” loop. In this way the overhead caused by the loop construction, such as counter increment and checking for termination conditions are avoided, and can provide substantial gain for the execution time if the actual work performed in each iteration of the loop is small. Loop unrolling though have the drawback that the program size increases, which can be undesirable in some situation. In loop fusion, several loops are merged, if possible. In this way the loop overhead is minimized, which favors the execution time. Also by merging loops it is often possible to get a better memory access pattern (which is of great importance as previously discussed). Hoisting refers to the avoidance of repeated operations, such as e.g. de-referencing a memory access in a loop. Avoiding redundant computations is often an overlooked opportunity to improve the efficiency of the code. By computing a required value and storing it for reuse, if needed, can speed up a program significantly. 1.7.4 Software libraries For routine operations there are often highly optimized software libraries available. As linear algebra operations are commonly encountered in high-performance computing, optimized libraries such as BLAS (Basic Linear Algebra Subprograms) 2 , have been developed for these type of operations. These libraries are extremely efficiently implemented, utilizing hardware-specific optimizations for particular architectures. The set of routines is though limited and covers only the more basic operations. 2 www.netlib.org/blas/ 36 1.8 Multicore architecture 1.8.1 Evolution of the multicore processor A multicore processor is a single computing component with two or more independent actual processors, called cores or central processing units (CPUs), which are the units that read and execute program instructions. The material in this section is based on the references [106], [67], [33], [107], [66]. For decades, it was possible to improve the performance of the singlecore CPU by shrinking the area of the integrated circuit and increasing the clock rate at which it operated. In the early 2000’s, the rate of increase of the computational power for the single core processor began to stall, mainly due to three major bottlenecks: • The memory wall; Over decades, processor speeds have increased at far faster rates than the memory speeds. As the memory system cannot deliver data fast enough to keep the processor busy, the memory has become a bottleneck to performance improvement of the sequential processor. • The power wall; The power consumption of a processor increases exponentially with each factorial increase of operating frequency. Hence, it is not possible, due to both power and heat dissipation concerns, to improve the performance of a single core processor by increasing the operating frequency. • The ILP wall; An increasing difficulty of finding enough instruction level parallelism (ILP) in a single instructions stream to keep a high-performance single-core processor busy. In the pursue of improving the computational capacity of a system, more focus was put on parallel architectures that have started to evolve at an increasing rate and become a standard piece of hardware. Any PC, and many mobile phones bought today, will likely have two or more processors on an integrated circuit. The purpose of fitting several cores on the same chip is mainly to improve the computational capacity of the system. 
Another aspect, important to low power consumption systems such as e.g. battery driven mobile devices and low power communication systems, is the lower powerper-FLOP ratio provided by parallel hardware. Several cores on the same chip generally consume less power than the same amount of cores located on different chips. 37 1.8.2 Parallel architectures There is a plethora of parallel processing architectures with examples as shared memory multicores (SMCs), graphical processing units (GPUs), and computer clusters. Roughly speaking, all parallel architectures can be modeled as M processing units with some kind of interconnection, each having a private memory and possibly connected to a shared memory. What differs between the architectures are the sizes of the shared and private memory, the interconnection topology, and the bandwidth of the interconnection. How well a parallelization executes depends very much on how well it maps to the particular architecture used. In a computer cluster, processors located at different chips are interconnected over a communication network. The bandwidth of the network is typically relatively low, there is no shared memory, but the private memory is relatively large. A graphics processing unit (GPU) is, as its name suggests, mainly constructed to process computer graphics. It is suitable when the same operation is to be performed on several independent data streams. A GPU architecture has a large memory bandwidth, no private memory, and a medium-sized shared memory. A GPU can have hundreds of processors, but is only suitable for a narrow class of problems that can provide the amount of fine-grained parallelism required for efficient operation. This thesis is mainly concerned with the shared memory multicore architecture (SMC) that is a flexible architecture suited for embedded realtime applications. The term multicore refers to a processor where several CPUs are manufactured on the same integrated circuit die. Fig. 1.10 shows a simplified picture of a SMC. The CPUs are connected to a shared memory (the RAM) via a shared bus. In addition to the shared memory, each processor has a private memory, the cache, to which only that particular CPU has access. In general, most SMCs have several layers of cache where some levels of the cache are shared among two or more cores. However, to understand the concept and reasoning of a multicore implementation, it many times suffices to envisage the simplified description with a single cache per processor. The CPUs can operate independently of each other, and the interprocessor communication is accomplished trough reads and writes to the shared memory. 38 Figure 1.10. A simplified picture of a shared memory multicore architecture. 1.9 Parallel implementation 1.9.1 Parallel implementation The performance improvement gained by the use of a multicore processor depends very much on the employed software algorithms and their implementation. Ideally, an implementation may realize speedup factors near the number of cores used. Most applications, however, are not accelerated so much unless programmers invest an amount of effort in re-factoring the whole problem. To design well-performing software that scales well and executes fast, it is important to understand the basics of the architecture on which the software is intended to execute. Roughly speaking, when designing a parallel implementation, it is sought to determine a number of tasks that can execute as large portions of work as possible with a minimal amount of interaction. 
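As a minimal illustration of this goal (a sketch only), the following OpenMP loop lets each core sum a large contiguous block of an array; the only interaction between the cores is the final reduction of their partial sums.

/* Illustrative sketch: each thread sums a contiguous chunk of x and the
 * partial results are combined once at the end, i.e. much independent
 * work and a single interaction point. Compile with OpenMP enabled. */
double parallel_sum(const double *x, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long i = 0; i < n; i++)
        total += x[i];
    return total;
}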
It is though important to remember that parallelization is not a goal by itself, but merely a way of improving the execution time or lowering the power consumption of the system. Constructing an implementation that runs perfectly in parallel, but slower than the sequential version of the implementation typically presents nothing of interest. An exception is when dealing with low power applications. Parallel processors in general have a lower power per FLOP ratio, and it can thus be motivated in that case to have a slower parallel execution than the sequential one. Parallelization can provide many benefits for an implementation. The programming and debugging of a parallel program can though be much more challenging than for a sequential program. Constructing and debugging parallel code to ensure its correctness can be a difficult task. Some examples of problems that a parallel programmer must deal with that are not present in sequential programming are: Parallel overhead The amount of time required to coordinate parallel threads, as opposed to doing useful work. Parallel overhead can include factors such as thread start-up time, synchronization, 39 software overhead imposed by parallel compilers, libraries, tools, operating system, thread termination time, etc. Load balancing For an implementation to execute efficiently, the workload of the processing units should be as balanced as possible, i.e. each processor should have an equal amount of computations to perform. With an unbalanced workload, one or more processors will be idle waiting for the more loaded processors to finish, and thereby wasting computational capacity of the system. Cache coherency On a multicore computer, several processors can have the same piece of data in their private caches. If one processor modifies that data, the other processors must be notified about this to get a consistent view of the memory. How this scheme is adopted is architecture-dependent. Synchronization At some points in the execution, two or more of the processing units must be synchronized, i.e. they must wait at some point to make sure that the other processors have reached a certain point in the execution stream. Communication In order to complete a task, processors must communicate with each other. How and when to communicate must be specified by the programmer. Race conditions If two or more processors are accessing the same piece of data, the outcome of the program can be inconsistent, depending on in which order the processors happened to read and modify the shared data. To prevent this, mutual exclusion mechanisms must be used to ensure correct results. For a more extensive exposition of parallel programming issues see e.g. [106], [107], [66]. When designing a parallel program, the procedure can be devided into three stages, see e.g. [68]: • Partitioning: Opportunities for parallel execution are exposed. A fine-grained decomposition of the problem is created, where a large number of tasks that can be executed concurrently are identified. • Communication: The communication required among the finegrained tasks identified in the partitioning stage is explored. • Agglomeration: It is determined how to agglomerate tasks identified by the partitioning phase, so as to provide a smaller number of tasks which can execute concurrently with a small amount of interaction. It is also determined if data and/or computation should be replicated in order to minimize interaction between tasks. 
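A small sketch (for illustration; the per-element operation is arbitrary) of how the outcome of these stages can look in code: the fine-grained tasks "process one element" are agglomerated into one contiguous block per thread, so that the interaction reduces to the implicit barrier at the end of the parallel region.

#include <omp.h>

/* Illustrative sketch of partitioning followed by agglomeration: each
 * thread receives one contiguous block of the array and works on it
 * independently; the only synchronization is the implicit barrier at
 * the end of the parallel region. */
void process_all(double *x, long n)
{
    #pragma omp parallel
    {
        int m  = omp_get_num_threads();
        int id = omp_get_thread_num();
        long chunk = (n + m - 1) / m;               /* block partitioning */
        long lo = (long)id * chunk;
        long hi = lo + chunk < n ? lo + chunk : n;  /* clamp the last block */
        for (long i = lo; i < hi; i++)
            x[i] = 0.5 * (x[i] + x[i] * x[i]);      /* independent per-element work */
    }
}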
40 Automatic parallelization has been a research topic for several decades. Yet fully automatic parallelization of sequential programs by compilers still remains a challenge due to its need for complex program analysis and dependance on unknown factors, such as input data range, during compilation. In most cases, to parallelize other than so called ”embarrassingly parallel” algorithms, insight and understanding to the theory of the underlying algorithm is required. 1.9.2 Software There are many different languages available for multicore programming, with such examples as OpenMP, Pthreads, Cilk++, OpenHMPP, FastFlow, Skandium, and MPI. OpenMP 3 has been the choice for all developed algorithms in this thesis because of the algorithms’ suitable mapping to the fork join model adopted by OpenMP. OpenMP is a collection of directives, library routines and environment variables that may be used to parallelize Fortran, C and C++ programs for execution on shared-memory platforms. A master thread running sequentially forks a specified number of slave threads with tasks divided among them. The slave threads then run in parallel and the runtime environment allocates threads to different processing units. After the execution of the parallelized code, the threads join back into the master thread that continues onward to the end of the program. See Fig. 1.11 for an illustration of the flow of execution using OpenMP. Both task parallelism and data parallelism can be achieved using OpenMP in this way. When the work partitioning is regular, OpenMP is a suitable, simple, and convenient choice for the programmer. For more irregular partitionings, it is not necessarily a good option, in that case other languages such as e.g. Pthreads can be more efficient to use. To be able to write optimized code, it it important to be aware of the policy adopted at a hardware level, to exploit the underlying mechanism in a good manner, and avoid misuse of them that can lead to tremendous degradation of the execution time. 1.9.3 Performance measures The execution time of a parallel implementation is often the most important performance measure, as in many cases the parallelization is performed in order to shorten it. How the execution time scales with the number of cores is also of importance as it specifies how much the 3 http://openmp.org 41 Figure 1.11. Fork join model. A program containing three sequential sections of work, S1 , S2 , and S3 , and two parallel sections with the tasks A1 , B1 , C1 and D1 in the first section and A2 and B2 in the second section. The master thread is marked with gray. execution time can be improved by employing more processing units. This is characterized by the speedup s(M ) defined as s(M ) = t(1) , t(M ) where t(1) is the execution time of the fastest possible (known) sequential implementation and t(M ) is the execution time of the program using M cores. The efficiency, specifying how well the computational resources are utilized, is calculated as s(M ) . (1.63) M An ideal speedup curve is linear in the number of processors used, i.e. s(M ) = M , and and has the efficiency e(M ) = 1. It is actually possible to achieve a speedup slightly above linear, known as superlinear speedup. This phenomenon can occur because when more processors are employed, the size of the private cache increases allowing a better cache performance that can result in an efficiency over 1. A simplified formula for the speedup can be obtained as follows. 
Assume that p| and p|| are the portions of the program that are executed sequentially and in parallel, respectively. The parallel overhead is denoted with c(M ). The execution time on M processors is given by e(M ) = p|| + c(M ). M Noting that C(1) = 0, the speedup is obtained from t(M ) = p| + s(M ) = p| + p|| t(1) = = p t(M ) p| + M|| + c(M ) p| + p|| M 1 . + c(M ) (1.64) Ignoring the overhead term, this formula is known as Amdahl’s law [5]. A consequence of it is that the highest achievable speedup, assuming 42 9 p| = 0 8 7 p = 0.05 | speed up 6 5 p| = 0.1 4 3 Point where bottlenck is hit. 2 1 0 2 4 6 8 10 number of cores, M 12 14 16 Figure 1.12. Speedup curves for a program with different portions of sequentially executed code, p| , and speedup curve for an implementation that has hit a bottleneck such as saturation of the memory bus. For reference linear speed up is marked by the dashed line. no parallel overhead, is given by s(∞) = 1 . p| Hence, a program having e.g. p| = 0.1 can never reach a greater speedup than 1/0.1 = 10 times, no matter how many processors used. It is therefore of utmost importance to keep the sequentially executed part of an implementation as small as possible. Fig. 1.12 shows speedup curves for different values of p| , where the parallel overhead is given by c(M ) = 0.01 + 0.01 · M 2 . As c(M ) increases with an increasing number of processors, the speedup curve has a maximum. Obviously, it is not beneficial to increase the number of processors beyond this maximum since the overhead becomes too large and increases the execution time. The figure also shows the characteristics of a speedup curve when a bottleneck, such as memory bandwidth, has been hit. 1.9.4 Efficient parallelization When constructing a parallel program, the goal should always be to have 100% efficiency, i.e. all time is spent on doing useful work and no time is spent on parallel overhead, such as communication, synchronization thread start up etc. This must be carefully taken into consideration when designing and implementing the algorithm. As a simple example 43 Figure 1.13. Performance visualization for execution of code 1 (a) and code 2 (b). Code execution is marked by green, and overhead/synchronization work is marked by red. Algorithm 3 Code for computation of c = A*b // Compute c = A∗b v o i d mvm( d o u b l e ∗∗ A [ ] [ N] , d o u b l e ∗ b , d o u b l e ∗ c ) { // Outer l o o p p a r a l l e l i z a t i o n #pragma omp p a r a l l e l f o r f o r ( i n t i =0; i <N; i ++) { f o r ( i n t j =0; j <N; j ++) c [ i ] += A[ i ] [ j ] ∗ b [ j ] ; } } of this, consider two different parallel implementations of a simple matrix vector multiplication c = Ab, (1.65) where A ∈ RN ×N , b ∈ RN , c ∈ RN . For the first implementation, the outer loop is parallelized, in the second implementation the inner loop is parallelized. Both perform the same job, but the second implementation has a much larger overhead. Since the parallelization is in the inner loop thread, forking and termination will occur N times instead of one time as in implementation 1. In Fig. 1.13, the performance is visualized for the two implementations. As can be seen, implementation 1 has an efficiency of very close to 100% while implementation 2 only has an efficiency of roughly 80 %. 
44 Algorithm 4 Code for computation of c = A*b // Compute c = A∗b v o i d mvm( d o u b l e ∗∗ A [ ] [ N] , d o u b l e ∗ b , d o u b l e ∗ c ) { // I n n e r l o o p p a r a l l e l i z a t i o n f o r ( i n t i =0; i <N; i ++) { #pragma omp p a r a l l e l f o r r e d u c t i o n (+: temp ) f o r ( i n t j =0; j <N; j ++) temp += A[ i ] [ j ] ∗ b [ j ] ; c [ i ] = temp ; } } Table 1.1. Order of execution. R, I and W, denotes Read value, Increase value and Write back respectively. T1 T2 R I W R I W Value of b 0 0 0 1 1 1 2 T1 T2 R R I I W W Value of b 0 0 0 1 1 1 1 1.9.5 Using Shared Resources Compared to sequential programming, one of the main differences that have to be taken into consideration in parallel programming is that several processors can access and modify shared resources. Precaution must be taken to both ensure correct execution of the algorithm and to avoid time consuming conflicts in the resource usage. An example of a situation where random results will be obtained if precaution is not taken is race conditions. Consider that two different threads T1 and T2 will compute the sum 1 + 1 = 2 in parallel by writing and reading from the shared variable b. In Tab. 1.1, two possible outcomes of the execution are shown. In the first case, the correct result is obtained while an incorrect result is found for the second case. The order in which the events R, I and W occur can be considered random and cannot be controlled by the programmer. Such problems must be handled using mutual exclusion mechanisms. It is then possible to restrict the access to the shared resource so that one processor cannot read data while another processor is modifying it. 45 Another example that does not produce incorrect results but has negative impact on the efficiency of the execution is false sharing. It is a subtle problem, but very important to address to achieve performance. Assume that we want to compute the recursion 2 yt = yt−1 + cos yt−1 + sin yt−1 , t = 1, 2, ..., 106 , for four different initial values of y0 . The code in Alg. 5 will achieve this goal. To speed up the computations, the outer for-loop is parallelized, which is perfectly fine as the loops are completely independent of each other. A linear speedup in the number of cores employed could be expected (for up to four cores). However, as can be seen from the speed up plots in Fig. 1.14, the results are disappointing and only a speed up of about 1.2 times is reached using four cores. The problem here is so called false sharing. Even though the loops are completely independent, the elements of y are stored in neighboring addresses of the memory. As explained in Sec. 1.7, when a data element is requested from the processor, several contiguous data elements will be fetched to the cache (in an attempt to exploit data locality). The effect here is though devastating for the performance. When processor 1 accesses y[0], it will also fetch y[1], y[2] and y[3] in to its cache (even though it will not use them). When it then modifies y[0], it will broadcast to the other processors (or the other processors caches), that the cache line have been changed, which results in that the other processors invalidate their cache lines involving y[0] and update with the new one produced by processor 1. This happens even though they actually do not need this update to perform their work correctly. 
The participating in parallel execution processors will thus modify their own local part of the data in the cache line and will all the time force the other processors to update the changes they have made, even though it is unnecessary. The problem can be resolved by separating the memory addresses of where y[1], y[2], y[3] and y[4] are stored, by padding zeros in between them. The code and corresponding speedup plot, where the storage addresses have been separated, is shown in Alg. 6 and Fig. 1.14 respectively. 1.9.6 Data partitioning Another important issue in parallel programming is how to partition the data among the cores. Ideally, each core should touch as small amount of data as possible, in order to not saturate the memory bus, and also the data it touches should preferably be local to that core, to minimize the inter-core communication. Some of the concepts are exemplified by matrix operations below. Let A, B, and C denote n × n matrices, and x, y, z denote column vectors of length n. Assume that a matrix multiplication C = AB is to 46 4 3.5 Speedup 3 2.5 2 1.5 1 1 1.5 2 2.5 3 Number of processors 3.5 4 Figure 1.14. Speed up curves for code in Alg. 5, and Alg. 6, in dashed red and solid blue respectively. Algorithm 5 Parallel code with false sharing problems void r e c u r s i o n ( ) { double y [ 4 ] = { 1 . 2 , 0 . 8 , 5 . 6 , 2 . 3 } ; } #pragma omp p a r a l l e l f o r f o r ( i n t i =0; i <4; i ++) { f o r ( i n t j =0; j <1E6 ; j ++) { y [ i ] = y [ i ]∗ y [ i ] + cos (y [ i ] ) } } + sin (y [ i ]) ; Algorithm 6 Parallel code without false sharing problems void r e c u r s i o n ( ) { int n = 20; d o u b l e y [ 4 + 3∗n ] ; y[0∗n]=1.2; y[1∗n]=0.8; y[2∗n]=5.6; y[3∗n]=2.3; } #pragma omp p a r a l l e l f o r f o r ( i n t i =0; i < 4∗n ; i+= n ) { f o r ( i n t j =0; j <1E6 ; j ++) { y [ i ] = y [ i ]∗ y [ i ] + cos (y [ i ] ) } } + sin (y [ i ]) ; 47 be parallelized. Let √ M be integer and consider a partitioning ⎤ A1 ⎢ A2 ⎥ ⎢ ⎥ AB = ⎢ ⎥ B1 B2 · · · B√M .. ⎣ ⎦ . ⎡ ⎡ A √M A 1 B1 A 1 B√ M , .. . .. .. . . √ · · · A M B√ M A 1 B2 · · · ⎢ ⎢ A 2 B1 A 2 B2 = ⎢ ⎢ .. ⎣ . √ A M B1 ··· ⎤ ⎥ ⎥ ⎥, ⎥ ⎦ (1.66) where one processor computes one of the M blocks Ai Bj , 1 ≤ i, j ≤ √ M . Compare this to the partitioning ⎡ ⎤ ⎡ ⎤ A1 A1 B ⎢ A2 ⎥ ⎢ A2 B ⎥ ⎢ ⎥ ⎢ ⎥ (1.67) AB = ⎢ .. ⎥ B = ⎢ ⎥, .. ⎣ . ⎦ ⎣ ⎦ . AM AM B where one processor computes one of the blocks Ai B, 1 ≤ i ≤ M . For both partitionings, the workload is perfectly distributed among the processors. All processors will perform an equal amount of computations and no computations are duplicated. However, for partitioning (1.66), each processor must touch MM+1 n2 data elements while for partitioning 2 2 n elements. It is clearly ben(1.67) each processor must only touch M eficial to use partitioning (1.67), for M > 2, in order to touch as small amount of data as possible. Specifically, it is seen that the total amount of the data touched is given by (M +1)n2 and 2n2 elements by partitioning as in (1.66) and in (1.67), respectively. The amount of data touched thus increases linearly with the number of processors, M when using partitioning in (1.66), while being constant for partitioning in (1.67). Thus, using partitioning (1.66), one could not expect an implementation to scale well for a large number of processors since the memory bus will eventually be strained and limit the speedup. As another example, consider a sequence of matrix computations y = Ax, C = yzT , that are to be parallelized. 
Compare a partitioning T T y1 y2 · · · yM A1 A2 · · · AM = x, T T T C1 C 2 · · · CM y1 y2 · · · yM = z , (1.68) 48 where processor m computes ym = ATm x, Cm = ym zT with a partitioning ⎡ y1 + y2 + ... + yM C1 C 2 · · · CM T = = A1 A2 · · · AM y1 y2 · · · yM x1 x2 .. . ⎢ ⎢ ⎢ ⎣ T ⎤ ⎥ ⎥ ⎥, ⎦ xM zT , (1.69) where processor m computes ym = Am xm , Cm = ym zT . For both partitionings, each processor will perform the same number of computations and touch the same amount of data. However, the partitioning in (1.69) requires a synchronization point in between the two lines, since M the partial sums must be added to form y, i.e. y = ym . Partitionm=1 ing (1.68) thus provides better potential for an efficient implementation since the processors can perform larger amount of work independently of each other without synchronization and communication. 49 1.10 Short chapter summaries Here a short summary of each chapter presented in the thesis is given. Chapter 2 In this chapter a parallelization of the Kalman filter is presented. The content is based on Paper II and Paper V. First a parallelization method for the MISO case is given, which is based on a banded structure of the system transition matrix. It is discussed how different systems, both time invariant and time variant ones, can be realized on a banded form. The given parallelization method is then extended to cover the MIMO case, by utilizing sequential filtering of the measurement vector. The proposed parallelization is evaluated on a load-estimation problem for mobile networks, and is compared against a BLAS based implementation. The results show that the given parallelization performs significantly better than the BLAS based one, and is capable of achieving linear speed up in the number of cores used (tests are performed up to 8 cores). Chapter 3 In this chapter an important special case of the paralleization given in Chapter 2 is given. The content is based on the material in Paper IV. The case under consideration is when the Kalman filter is used as a parameter estimator, and it is shown how this can be especially efficiently implemented. Some more detailed discussion about implementation details for optimization of the execution time is given. Chapter 4 This chapter is based on the material in Paper I and Paper III. Parallelization of the particle filter is studied. Four different parallelizations: the globally distributed particle filter, resampling with proportional allocation filter, resampling with non-proportional allocation filter, and the Gaussian particle filter, are implemented on a multicore computer and evaluated using up to eight cores. The results show that the Gaussian particle filter and the resampling with non-proportional allocation filter are the best suited ones for parallelization on multicore computers and linear speed up is achieved. 50 Chapter 5 This chapter is based on Paper VII and Paper VIII. A solution method for the recursive Bayesian estimation problem is presented. The involved probability density functions are approximated with truncated series expansions in orthogonal basis functions. Via the prediction-update recursion, the coefficients of the expansions are computed and propagated. The beneficial parallelization properties are demonstrated at a bearings only tracking problem, where linear speedup is achieved for even small problem sizes. The drawback of the method is mainly that the state must be confined to a pre-specified domain. An analysis of the method is also carried out. 
It is mainly studied how the error in the estimated PDF, caused by the truncation of the expansions, is propagated over the iterations. A bound ensuring that this error does not grow unbounded is given. A comparison of the solution to a bearings-only tracking problem using the Fourier and Legendre basis functions is also given. Chapter 6 In this chapter a method combining particle filtering with orthogonal series expansion, is developed and analyzed. The material is taken from Paper XI and Paper XII. The method is based on fitting a series expansion to the particle set when resampling. In this way, the information carried by the particle set can be compressed to a few informative coefficients that can be efficiently communicated between the processing units. This gives the method favorable parallelization properties making it suitable for multicore and even distributed hardware platforms. An analysis of how well the series expansion captures the underlying PDF is given. Also an upper bound on the magnitude on the expansion coefficients, when using the Hermite basis functions, is derived and provided. Chapter 7 This chapter is based on the material in Paper IX, Paper X and Paper VI. A novel method for anomaly detection in reference following systems is given and discussed. The method is based on that a set of observed trajectories from the system has been collected. From this set, PDFs that specify the probability density of finding the system in a given state, are computed. The anomaly detection is then carried out by performing outlier test with respect to the estimated PDFs, to see how deviating the system state is from the normal one. The method is evaluated, with good results, on vessel traffic data, as well as eye movement data from an eye tracking application. 51 Chapter 8 Parameter estimation of a minimally parametrized model for the PK/PD for drug delivery in anesthesia is considered. The content of this chapter is based on the material given in Paper XIV and Paper XIII. Three different estimation methods, EKF, PF and the filtering method given in Chapter 7 (OBPF), are tested and compared at this application. It is shown the the EKF is prone to significant bias in the parameter estimates while the PF and OBPF do not suffer from such problems. The PF and the OBPF are shown to give similar results in the estimation quality, but the OBPF can provide this estimation quality to a smaller computational cost. As the estimated model is intended to serve as a model for closedloop control of the anesthesia, it is of importance to be able to provide as accurate estimates of the parameters as possible. Chapter 9 This is a short chapter presenting the results from BLAS based implementations of the UKF and a point mass filter. It is shown that linear speed up is obtainable by a parallel implementation without any modifications for improved parallelization properties. 52 Chapter 2 Parallelization of the Kalman Filter 2.1 Introduction The Kalman Filter (KF) still represents the mainstay of linear estimation, even in medium and large-sized systems. Parallel implementations of the KF have been suggested over the years to improve the execution time. However, many of these schemes are hardware-specific with respect to such architectures as e.g. the Connection Machine [63], distributed memory machines [60] and systolic arrays [92] and thus are not directly suitable for a multicore implementation. 
Other parallelization solutions suffer from the presence of sequentially executed sections that prevent significant speedup [117], [103]. Pipelined-by-design algorithms [43] have input-to-output latency equal or even greater than that of a sequentially executed filter, which property is not acceptable in many real-time applications (e.g. the application in cellular communication studied in [115]). In [74], a parallel multicore implementation of the KF for parameter estimation is presented. This chapter deals with efficient parallel implementation of the Kalman filter (KF) for state estimation in discrete time-varying linear systems on shared memory multicore architectures. However, the proposed solution requires only a small amount of inter-processor communication, which makes it suitable also for a distributed architecture. The KF algorithm consists of a sequence of matrix-matrix and matrix-vector multiplications. The parallelization of this kind of operations on a multicore architecture is a routine matter. It is indeed possible to perform the operations invovled in the KF algorithm using a pre-built library such as BLAS in a straightforward manner. However, as it is shown in this chapter, the result will suffer from several drawbacks that include 53 a large amount of inter-processor communication, synchronization, and high demand for memory bandwidth. For the case of systems with banded state-space realizations, the above mentioned drawbacks of parallel KF implementation can be efficiently alleviated. As the Multiple-Input Multiple-Output (MIMO) estimation problem with p outputs always can be implemented as a sequence of p singleoutput filter problems [52], a method for the Multiple-Input SingleOutput (MISO) case is developed and used as a building block for the MIMO case. This approach avoids the inversion of a p × p matrix, which is known to be difficult to parallelize efficiently because of its intricate data dependencies. To mention a few, active noise cancellation [58], climate and weather prediction [64], and mobile broadband load estimation [116] are applications where efforts are made to decrease the execution time of the filtering step. The method suggested in the present chapter is applied to a Wideband Code Division Multiple Access (WCDMA) load estimation problem that is deemed critical in mobile broadband. This is a field where the computational burden is growing rapidly due to the increasing number of smart phones in the system. Since the number of users is directly affecting the number of estimated states, it is clear that the computational burden of sequential implementations of the KF becomes prohibitively demanding. The multicore techniques of the present chapter provide therefore an interesting alternative to an increase in the potential number of uplink users of the cell, which in turn results in savings in the required amount of hardware and power consumption of the system. The chapter structure is as follows. Sec. 2.2 provides a summary of the KF equations. A discussion on banded systems is given in Sec. 2.3. In Sec. 2.4 the main contribution, a parallel implementation of the KF for MISO systems, is provided. An analysis yielding estimates of the amount of parallelizable work, the required bandwidth, and amount of communication is also given, to offer instrumental guidelines for the choice of implementation hardware. In Sec. 2.5, the parallalelization of the KF for MIMO system based on the MISO implementation is carried out. Finally, in Sec. 
2.6, the results of computer experiments are presented followed up by a discussion in Sec. 2.7. 2.2 State space model and filtering equations 2.2.1 State space system description Consider a MISO discrete time system 54 xt+1 = Ft xt + Gt ut + wt , yt = ht xt + jt ut + vt , (2.1) (2.2) with the state vector xt ∈ Rn , the input vector ut ∈ Rm and the output yt ∈ R at discrete time step t. Generally, Ft ∈ Rn×n and Gt ∈ Rn×m are time-varying matrices, while ht ∈ R1×n and jt ∈ R1×m are time-varying vectors. The process and measurement noise sequences wt ∈ Rn and vt ∈ R are assumed to be independent, white, zero mean, Gaussian distributed, with the covariance matrices E[wt wtT ]=Qt and E[vt2 ] = rt , respectively. 2.2.2 Kalman filter equations The KF equations below are in the so-called standard form. For filtering problems that require special attention to numerical stability, the square root formulation is to prefer [10]. Parallelization of the square root form of the KF is investigated in [63], where mainly the Givens rotation step is parallelized. However, many systems do not require the square root form to maintain numerical stability. As it will be shown here, the implementation and parallelization can be made more efficient for the KF in the standard form, than the implementation proposed in [63]. As given in Sec. 1.6.1 the KF consists of two steps: prediction and update. These are recursively applied to the data to calculate the state estimate x̂ and the error covariance matrix P. For the system (2.1)-(2.2), the KF [53] is calculated as: Prediction x̂t|t−1 = Ft x̂t−1|t−1 + Gt ut , (2.3) Pt|t−1 = Ft Pt−1|t−1 FTt + Qt , (2.4) Update ỹt = yt − ht x̂t|t−1 − jt ut , dt = ht Pt|t−1 hTt + rt , Pt|t−1 hTt d−1 t , Kt = x̂t|t = x̂t|t−1 + Kt ỹt , Pt|t = (I − Kt ht )Pt|t−1 . (2.5) (2.6) (2.7) (2.8) (2.9) 55 2.3 Banded systems A matrix A is said to be banded with bandwidth Nb if it is zero everywhere except at the Nb super and sub diagonals, i.e. it is of the form ⎡ a00 ··· a01 ⎢ ⎢ a10 a11 ⎢ . ⎢ . ⎢ . A=⎢ ⎢ a ⎢ Nb 0 ⎢ .. ⎣ . 0 .. a0Nb 0 .. . . .. a(N −Nb )N .. . . .. a(N −Nb )N ··· . aN (N −1) a(N −1)N aN N ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎦ The performance of the suggested parallelization increases with a decreasing bandwidth of the transition matrix Ft . In this section transformation to a banded system form is discussed. 2.3.1 Transformation to a banded system form Any linear finite-dimensional system can, with more or less effort, be transformed to a realization with a banded system matrix in a numerically stable manner. This holds true for both time-varying and timeinvariant systems. The number of bands in the system matrix is denoted Nb , where e.g. a diagonal and a tridiagonal matrix have Nb = 0 and Nb = 1, respectively. Under the state variable transformation xt = Tt zt , where Tt is a non-singular matrix, the transformed system is given by zt+1 = Ft zt + Gt uu + T−1 t+1 wt , y t = h t zt + j t u t + v t , where Ft = T−1 t+1 Ft Tt , Gt = T−1 t+1 Gt , h t = h t Tt . 2.3.2 Time-invariant case Assume that (2.1)-(2.2) represent a time-invariant system, i.e. Ft = F, Gt = G, ht = h, ∀t. Then the following holds: 56 • For F with distinct eigenvectors, a modal form can be used. A transformation T can be found such that F is block-diagonal with a 2 × 2 block for each complex conjugated pair of eigenvalues and a 1 × 1 block for each real eigenvalue of F [42]. • There is also a possibility to bring the system (the F matrix) to a tri-diagonal form, in a numerically sound manner. 
Tri-diagonalization via similarity transforms as well as tri-diagonal realizations obtained directly from the system’s Hankel matrix are discussed in [65]. • It is always possible to transform F to Jordan (bi-diagonal) form [42]. However, since the Jordan form is known to exhibit poor numerical properties, it is not always a suitable option. Thus, the matrix F belongs to the class of tri-diagonal matrices (Nb = 1) in the worst case and, in the best case, it is diagonal (Nb = 0). Both cases make the KF equations highly suitable for parallel implementation. Note that, for time-invariant systems, the transformation T can be computed beforehand and offline. 2.3.3 Time-varying case For time-varying systems, there are many possibilities to express the system in banded form. • Using a realization method based on a sequence of Markov parameters, Mk,j , it is sometimes possible to find a rank factorization such that Mk,j = Ct Bj . A realization with a diagonal F matrix is then simply obtained by taking Ft = I, Gt = Bt , ht = Ct , jt = 0, [88]. • Assume that a realization is readily available in the form of (2.1)(2.2). It can then be seen that the transformation Tt = Φt , where t Φt = Fi is the transition matrix, will give Ft = I. Since Gt i=0 requires the inverse transform T−1 = Φ−1 t t+1 , it is not always computationally sound. However, in case analytical expressions for Φ and Φ−1 are available or can be obtained by a small amount of computation, the transformation can be applied to obtain a diagonal system. • If Ft is a sparse matrix where the zero elements are located at the same positions for all t, an optimized band structure of Ft can be obtained by taking T to be a permutation matrix [29]. • A system that consists of loosely coupled subsystems can often be realized with a block diagonal matrix F that possesses a few 57 block-diagonal elements describing the couplings between the subsystems. • Matrices arising from finite-element or finite-difference problems in one or two dimensions are often banded. The bandedness stems from the fact that the variables are not coupled over arbitrarily large distances. For instance, discretizations of the differential equation ∂ n1 T (x1 , x2 ) ∂ n2 T (x1 , x2 ) = , ∂ n1 x1 ∂ n2 x2 0 ≤ n1 , n2 ≤ 2 encountered in physics as the heat equation, wave equation and Laplace equation, have realizations in a banded form. • In stochastic setups, where the KF is used as a black-box parameter estimator [100], the parameter vector θ is modeled as a random walk, and the output is required to be a linear combination of the unknown parameters driven by process noise. The system equations are then given by θt+1 = θt + wt , y t = h t θt + v t , which is a special case of (2.1)-(2.2) with Ft = I, Gt = jt = 0. Since Ft = I, the recursive parameter estimation problem is especially suitable for parallel implementation. This case is studied in detail in Chapter 3. 2.4 MISO System To parallelize a Kalman filter for a MISO system is a simpler problem than doing it for the MIMO case. A parallization for the MISO case will first be studied, and then extended to the MIMO case. Further, a banded structure of F gives a better ground for an efficient parallelization. The presence of a general structure matrix F in the system equations makes efficient implementation and parallelization of the KF somewhat more difficult. As discussed in Sec. 2.3, it is always possible to yield a realization, or transform an existing realization, so that the matrix F becomes banded with a low band width. 
Since the main purpose is to achieve faster execution times, it is of importance to optimize the implementation of the sequential version, from which the parallel version will be built. Therefore, an efficient sequential implementation is presented below in Sec. 2.4.1 and the parallelization of it is handled in Sec. 2.4.2. 58 2.4.1 Efficient sequential implementation The main focus of optimization should be on the computations involving the matrix P since a majority of the FLOPs and memory accesses are related to it. In Alg. 7, an implementation of (2.3)-(2.9) where the accesses to P occur after each other is given. The gain orginates from the fact that, for a banded F, once an element of P is brought to the cache, it can be used to accomplish all calculations the element is involved in before it is thrown out. Further, it allows the calculated elements in Pt+1|t to be stored at the same locations as the elements of Pt|t were held, giving a substantial reduction in the memory size and bandwidth needed. In [74], this reordering is shown to execute about twice as fast as compared to an implementation that does not make use of this possibility. This kind of optimization is not possible with a dense matrix F. The matrix P is a symmetric positive definite matrix. This should be taken advantage of since approximately half of the computations and memory storage can be spared due to this fact. However, to avoid too many technical details, the parallelization principles of KF will be presented for a version where the whole matrix P is used in the computations. The modifications needed for an implementation using only the upper triangular part of P are straightforward and minor. Algorithm 7 Efficient Kalman Filter implementation. x̂t|t = x̂t|t−1 + d−1 t ct [yt − ŷt ] (2.10) T Pt|t = Pt|t−1 − d−1 t ct ct (2.11) Pt+1|t = ct+1 x̂t+1|t ŷt+1 dt+1 Ft+1 Pt|t FTt+1 Pt+1|t hTt+1 + Qt+1 (2.12) = = Ft+1 x̂t|t + Gt+1 ut+1 = ht+1 x̂t+1|t + jt+1 ut+1 = rt+1 + ht+1 ct+1 (2.13) (2.14) (2.15) (2.16) 2.4.2 Parallel implementation Assume that F is a dense matrix. A parallel implementation of Alg. 7 can be produced by parallelizing each step individually using BLAS1 or some 1 Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations. BLAS is a de facto application programming interface standard, see netlib.org/blas/. 59 other highly optimized library for matrix operations. The calculation of (2.12) is then split as A = Pt|t FTt+1 , Pt+1|t = Ft+1 A + Qt+1 , where each line is parallelized separately. However, such an approach will have several drawbacks. Each processor must touch a large amount of data limiting the scalability of the implementation. A synchronization point between the calculations and a temporary storage for A is also required. A large amount of inter-processor communication is needed that will have negative impact on the execution time and as well limit the algorithm performance in a distributed implementation. In the case of a banded matrix F, where the number of bands Nb N , it is possible to remedy the mentioned drawbacks and thus achieve fast execution and good scalability. Assume N/M to be integer and define (recall that 1 : n = {1, 2, .., n}) N N (i − 1) : i − 1, M M := ri (1) − nb : ri (N/M ) + nb . ri := si A parallelization of the KF over the whole sequence of matrix operations for a banded matrix F is described in Alg. 8. 
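To make the band-exploiting access pattern concrete, a minimal sketch of the dominant operation, the covariance time update P_{t+1|t} = F_{t+1} P_{t|t} F_{t+1}^T + Q_{t+1} of (2.12), is given below. It fuses the two matrix products into a single pass over P, so that the thread responsible for a block of output rows only reads the corresponding rows s_i of P_{t|t} and only the band of F is touched. Dense row-major storage and a dense Q are assumed for brevity; this is an illustration of the structure underlying Alg. 7 and Alg. 8, not the code evaluated in the experiments.

#include <vector>
#include <algorithm>

// Sketch: P_{t+1|t} = F P_{t|t} F^T + Q for F with Nb super-/sub-diagonals,
// fused into one pass over P. Each thread handles a contiguous block of output
// rows and therefore only needs those rows of P extended by Nb on each side.
void banded_time_update(int N, int Nb,
                        const std::vector<double>& F,   // N*N, banded transition matrix
                        const std::vector<double>& Q,   // N*N, process noise covariance
                        const std::vector<double>& P,   // N*N, P_{t|t}
                        std::vector<double>& Pnew)      // N*N, P_{t+1|t} (output)
{
    #pragma omp parallel for schedule(static)           // row blocks split over threads
    for (int i = 0; i < N; ++i) {
        const int klo = std::max(0, i - Nb), khi = std::min(N - 1, i + Nb);
        for (int j = 0; j < N; ++j) {
            const int llo = std::max(0, j - Nb), lhi = std::min(N - 1, j + Nb);
            double s = Q[i * N + j];
            for (int k = klo; k <= khi; ++k)            // band of row i of F
                for (int l = llo; l <= lhi; ++l)        // band of row j of F (columns of F^T)
                    s += F[i * N + k] * P[k * N + l] * F[j * N + l];
            Pnew[i * N + j] = s;
        }
    }
}

With this fusion, each output element costs O(Nb^2) multiply-adds rather than O(N), which is consistent with the Nb-dependence of the FLOP count (2.18) given in the analysis below.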
The algorithm is designed to make the number of synchronization points, amount of communication, and the amount of data that each processor has to touch, as small as possible. Notice that (2.12) in Alg. 7 is executed by the i:th CPU as Pt+1|t (ri , :) = Ft+1 (ri , si )Pt|t (si , :)FTt+1 . Each processor is given access to the whole matrix F that is banded and contains a small amount of data, but only a restricted part of P is touched. Processor i will be responsible for updating of Pt+1|t (ri , :), which will only require knowledge of Pt|t (si , :). The parts of P that must be communicated by processor i are thus only the Nb rows that overlap with the neighboring processors (CPU i − 1 and CPU i + 1). 60 Algorithm 8 Kalman Filter parallel implementation. • Parallel (CPU i calculates) x̂t|t (ri ) = x̂t|t−1 (ri ) + d−1 t ct (ri )[yt − ŷt ] T Pt|t (si , :) = Pt|t−1 (si , :) − d−1 t ct (si )ct Pt+1|t (ri , :) = Ft+1 (ri , si )Pt|t (si , :)FTt+1 + Qt+1 (ri , :) ct+1 (ri ) = Pt+1|t (ri , :)hTt+1 x̂t+1|t (ri ) = Ft+1 (ri , si )x̂t|t (si ) + Gt+1 (ri , :)ut+1 (i) ŷt+1 = ht+1 x̂t+1|t (ri ) (i) bt+1 = ht+1 ct+1 (ri ) • Sequential ŷt+1 = M (i) ŷt+1 + jt ut i dt+1 = rt+1 + M (i) bt+1 i 2.4.3 Analysis An analysis of Alg. 8 is carried out in this section to evaluate the number of sequential and parallel FLOPs, the required memory bandwidth, the demand of communication, and synchronization in the implementation. This provides important guidelines for the choice of hardware to meet a desired performance of the designed system. Parallelizable amount of work Counting the number of FLOPs fs and fp that are executed sequentially and in parallel in Alg. 8, the following expressions can be obtained fs (M, m) = 3M + 2m, fp (N, Nb , m) = (Nb2 + 2Nb + 5)N 2 +2(2 + m)N. (2.17) (2.18) As noted in Sec. 1.9.3, Amdahl’s law [5] states that the maximal theoretically obtainable speedup is given by s(M ) = 1 p p| + M|| , 61 where p| and p|| are the sequentially and parallelly executed portion of the program, respectively. Now, if N > M , which should definitely be the case, then fp fs and p| ≈ 0, p|| ≈ 1 is a reasonable approximation, yielding s(M ) = M . Thus, regarding the portion of parallelizable work, the algorithm has the potential of achieving good scalability. Memory bandwidth The only variables of considerable size in the KF algorithm are the matrices P and Q. Let q(P) denote the size of P in bytes. Assuming that P and Q are transferred from the RAM to the processors at each iteration will give a required memory bandwidth of B =n· q(Q) + q(P) T (2.19) to perform n iterations in T seconds. If the bandwidth Bh provided by the hardware satisfies Bh ≥ B, it will not be a bottleneck in the implementation. In many practical cases, Q is diagonal, or at least sparse, in which case q(Q) q(P) and q(Q) can be neglected in (2.19). Synchronization and Communication Only one synchronization point at the end of the parallel section is needed. The data that have to be communicated between the CPUs are given by the overlapping rows of P, the local parts of c, and the (i) (i) variables yt , bt , which number of elements are given by C(N, Nb , M ) = 2N (M − 1)(2Nb + 2 + 1 ). M (2.20) 2.5 MIMO System Consider (2.1)-(2.2) as a MIMO system with p outputs. Denote the T process noise vt = vt (1) vt (2) · · · vt (p) . If vt (i) is independent of vt (j) j = i, i = 1, 2, ..., p, then, by the Gaussian assumption on vt , it is equivalent to Rt = E[vt vtT ] being a diagonal matrix. 
The resulting MIMO problem can be treated as a sequence of p MISO filtering problems where the filtering of the p measurements can be done sequentially, one by one [52]. If Rt is not diagonal but positive definite, a (Cholesky) transformation zt = Lt yt can be applied to render St = E[zt zTt ] diagonal. From the relation E[zt zTt ] = Lt E[yt ytT ]LTt = Lt Rt LTt , 62 −1/2 it is seen that the choice Lt = Rt will give E[zt zTt ] = I, which together with the Gaussian assumption on vt establish the independence of the −1/2 is guaranteed to measurement noise. By the assumption Rt > 0, Rt exist. If there are measurements that are noise-free, R becomes positive semidefinite and such measurements can be handled separately by e.g a reduced observer. The MIMO filtering problem can thus always be split into a sequence of p MISO filtering problems and the parallelization can be performed over each MISO filtering problem as proposed in Sec. 2.4. 2.6 Implementation example In order to quantify and validate the multicore computational gains on a realistic problem, a simulation study of a WCDMA uplink interference power estimation system was used. 2.6.1 Uplink interference power estimation model In this section, a simplified model of the interference power generation and measurements in the WCDMA uplink is provided. 3G mobile broadband data traffic is based on high speed packet access (HSPA) technology. The estimation of uplink load in this system is an example where the proposed parallelized KF algorithm may find application. In the uplink, scheduling is required to assign users to the available cells. Efficient scheduling requires the interference power from users of the own cell and from users in neighboring cells to be estimated in real time, which in practice is a very difficult problem. The reference [115] therefore proposes a new algorithm for recursive Bayesian estimation of the noise power floor. Kalman filtering for uplink interference power estimation was treated in [114]. This solution does however use at least one state per user and is computationally complex. With the hundreds of users per cell anticipated in the near future, it is clear that the solution suggested in [114] becomes practically infeasible. A state space model A brief description of the state space model for WCDMA power link estimation problem is provided in this section, see [114] and [41] for a more extensive exposition. A state space model for the system is given by: xt+1 = Fxt + Gut + wt , yt = Ht xt + et , 63 where ⎤ 0 .. ⎥ ⎢ . ⎥ ⎢ 0 ⎥, ⎢ . ⎣ .. 1−κ 0 ⎦ 0 ··· 0 1 κ ... κ 0 , ⎡ ⎤ 1 0 ··· 0 1+ηt (1) ⎢ .. ⎥ .. ⎢ . 0 . ⎥ ⎢ ⎥, ⎢ ⎥ .. 1 ⎣ ⎦ . 1+ηt (N ) 0 1 ··· 1 1 ⎡ ⎤ q(1) 0 ··· 0 ⎢ ⎥ .. .. ⎢ 0 ⎥ . . ⎢ ⎥, ⎢ .. ⎥ ⎣ . ⎦ q(N ) 0 n+thermal 0 ··· 0 q ⎡ ⎤ r(1) 0 ··· 0 ⎢ ⎥ .. .. ⎢ 0 ⎥ . . ⎢ ⎥, ⎢ .. ⎥ ⎣ . ⎦ r(N ) 0 0 ··· 0 rRT W P ⎡ F = G = Ht = Q = R = 1−κ 0 .. . ··· and T , xt (1) ... xt (N ) xn+thermal t T ref = , xref t (1) ... xt (N ) 0 T = , wt (1) ... wt (N ) wtn+thermal T yt (1) ... xt (N ) yRT W P,k = , T et (1) ... et (N ) eRT W P,k = . xt = ut wt yt et The power consumed by the i:th channel is xt (i), and xn+thermal is the t sum of neighbor cell interference and thermal noise power and modeled as a random walk xn+thermal = xn+thermal + wtn+thermal , t t+1 (2.21) where wtn+thermal is the systems noise corresponding to the state and κ is a parameter determined by the radio link quality and set for an inner control loop. 
64 The reference power xref t (i) for the i:th channel is controlled by an outer loop controller and is given by xref t (i) = 1 + ηt (i) 1+ xtotal , t ref −1 (C/I)t (i) where (C/I)ref i (t), i = 1, ..., N denote the carrier to interference levels. Furthermore, xtotal is the total power and ηt (i) is the quotient between t the data power offset and the control signal power, see [114] for details. Note that ηt (i), and hence Ht , is time varying. The control signal power yt (i) is the quantity measured for the i:th radio link. The additional measurement yRT W P,k available on the uplink is the total received wideband uplink power. This is simply the sum of the powers represented by all states. The measurement and process noise covariance matrices are given by Q and R respectively. In the simulation, the SINR targets were set 5 dBs lower than usually assumed for a conventional RAKE receiver, to be able to run up to 400 radio links. This is motivated since today the uplink is used with higher block error rate than when the system was standardized. Furthermore, more advanced receivers than the RAKE are in operation today, among these interference suppressing [116] and interference canceling receivers [119]. 2.6.2 Results Alg. 8 was implemented and compared with Alg. 7 parallelized using Intel’s MKL BLAS library. The execution times for a range of problem sizes are summarized in Tab. 2.2. The scalability is illustrated by the speedup plots depicted in Fig. 2.1 and Fig. 2.2, respectively. To verify the correctness of the parallel implementation, the same data were filtered with a sequential Matlab implementation confirming that the outputs were identical. The sum of residuals, r(Ni ) = Ni 1 |yt (i) − Ht (i)xt (i)|, Ni (2.22) t=0 for one channel is given for the sequential and parallel filters in Tab. 2.1, where Ni = 1000 is the number of iterations executed. The code was written in C and OpenMP 2 was used for parallelization. The hardware used was an octa-core shared memory multicore computer R Xeon 5520, Quad-core, Nehalem 2.26 Ghz, with comprised of two Intel a 8 MB cache and a memory bandwidth of approximately 23 GB/s. 2 OpenMP (Open Multi-Processing) is an application programming interface that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. 65 Table 2.1. Loss function (Eq. 2.22) values for sequential and parallel implementation of the KF for different problem sizes N . N 50 100 200 300 400 Sequential 7.2987E − 17 2.3841E − 17 1.0021E − 17 3.5315E − 16 5.2381E − 17 Parallel 7.2583E − 17 2.4012E − 17 1.0531E − 17 3.4714E − 16 5.2190E − 17 Table 2.2. Single core execution time in milliseconds for 50 iterations. N BLAS Alg. 8 50 7.18 0.61 100 10.12 2.09 200 29.89 8.08 300 92.65 18.01 400 167.99 31.64 One implementation that parallelizes the steps in Alg. 7 using Intels MKL BLAS library, and an implementation of Alg. 8 were made. The two different implementations will be referred to as implementation 1 and implementation 2 respectively. Further to study how the scalability of the parallelization is affected by the bandwidth, Nb of the transition matrix a the discrete version of the heat equation, where a rod is being heated in the ends was implemented. Let Tij denote the the temperature at discrete position i and time j. T With the state vector xt = T1,t T2,t · · · TN,t the realization gets a banded form, with the number of bands, Nb , determined by the approximation order of the derivatives. 
For a first order approximation the F matrix gets the structure ⎡ ⎤ 1 0 0 ⎢ × × × ⎥ ⎢ ⎥ ⎢ ⎥ . . . .. .. .. Ft = ⎢ ⎥. ⎢ ⎥ ⎣ × × × ⎦ 0 0 1 The impact of the number of bands on the scalability was studied for this system for a fixed problem size N = 1000 and evaluated for the number of bands 10, 50, and 100 respectively. The results are presented in Fig 2.3. 2.7 Discussion Even though each subroutine provided by BLAS is extremely efficiently implemented, the implementation of Alg. 8 executes faster on a single 66 8 N=50 N=100 N=200 N=300 N=400 7 6 Speed up 5 4 3 2 1 0 1 2 3 4 5 Number of CPUs 6 7 8 Figure 2.1. Speed up curves for Alg. 8 for N = 50 to N = 400, using up to M = 8 processors. For reference linear speedup is marked by the dashed line. 8 N=50 N=100 N=200 N=300 N=400 7 6 Speed up 5 4 3 2 1 0 1 2 3 4 5 Number of CPUs 6 7 8 Figure 2.2. Speedup curves for BLAS implementation for N = 50 to N = 400, using up to M = 8 processors. For reference linear speedup is marked by the dashed line. 67 8 Nb=10 Nb=50 Nb=100 7 Speed up 6 5 4 3 2 1 1 2 3 4 5 Number of CPUs 6 7 8 Figure 2.3. Speed up curves for fixed problem size N = 1000 with a varying number of bands Nb . core, as can be seen from Tab. 2.2. This comes from the fact that an optimization over the sequence of operations can be made for Alg. 8, whereas the optimization is done over each single operation, one by one, when employing the BLAS implementation. When optimizing over sequences of operations, the possibility to make more efficient use of the memory hierarchies is a main factor in the faster execution. Regarding scalability, the implementation of Alg. 8 performs far better than the BLAS implementation. As discussed previously, this is due to less communication and parallel overhead. The effects are especially distinct for smaller problem sizes where the overhead constitutes a large proportion of the execution time. For Alg. 8, almost linear speedup in the number of cores used is achieved for N ≥ 200. For lower N , the gain of parallel implementation is less clear, and for N = 50 even a slowdown can be observed for M = 8, due to the disproportionally large overhead to computation ratio. However, for smaller problem sizes, not even 2 times speedup is reached for the BLAS implementation, and a slowdown can be observed for N ≤ 200. As expected the the scalability drops for larger Nb , Fig. 2.3. This is due to the fact that the amount of communication is proportional to Nb , (2.20). However for Nb = 100, meaning that for most rows of F, 201 of the 1000 elements are filled, the speedup for M = 8 is still about 5 times. Which can be considered fairly good. 68 The implementations were evaluated on a hardware that runs at a very high CPU clock frequency (2.26 GHz). Embedded hardware, especially low power systems, typically run at much lower clock frequencies. With a lower clock frequency, the scalability can be expected to be better for smaller problem sizes since the computation-to-overhead ratio goes down. As mentioned before, Alg. 8 will most likely perform well on a distributed system. This is because of the low amount of communication, shared data, and the fact that only a restricted part of P must be touched by each processor. This would definitely not be the case for the BLAS implementation that would require almost all data to be distributed over the whole network. 
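For reference, the banded test system used in the Nb-scalability study (Fig. 2.3) can be illustrated by the following sketch, which assembles the transition matrix of a spatially discretized 1-D heat equation with the end temperatures held fixed. A forward-Euler time step and a second-order central difference in space are assumed here, giving a tridiagonal F (Nb = 1); higher approximation orders of the spatial derivative widen the band, which is how the larger values of Nb arise. The exact discretization used in the experiments may differ.

#include <vector>

// Sketch: transition matrix for T_i(t+1) = T_i(t) + alpha*(T_{i-1} - 2*T_i + T_{i+1}),
// with the end temperatures held fixed (ones in the first and last diagonal entries,
// as in the F matrix shown above). alpha = k*dt/dx^2; alpha <= 0.5 for stability.
std::vector<double> heat_equation_F(int N, double alpha)
{
    std::vector<double> F(N * N, 0.0);
    F[0] = 1.0;                          // boundary rows: states kept at their set values
    F[(N - 1) * N + (N - 1)] = 1.0;
    for (int i = 1; i < N - 1; ++i) {
        F[i * N + (i - 1)] = alpha;
        F[i * N + i]       = 1.0 - 2.0 * alpha;
        F[i * N + (i + 1)] = alpha;
    }
    return F;
}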
2.8 Static Kalman filter As the static Kalman filter given by (1.43) simply consists of a single line made up of matrix and vector multiplications, the most efficient way of implementing this is by using a multi-threaded optimized linear algebra library such as e.g. BLAS. A BLAS-based implementation has been performed and the speedup curves from the execution are presented in Fig. 2.5. The gained speedup varies significantly depending on the problem size. A discussion regrading this is given below. As the work-overhead ratio increases with an increasing problem size, a better scalability is obtained for the values of n up to n = 1000. As a consequence of memory bandwidth saturation, the speedup drops drastically for larger n. The hardware used provides 16MB of cache. A double precision n × n matrix requires 8n2 bytes of memory. Solving the equation 16M B = 8n2 B yields n ≈ 1400. Hence, for n > 1400, the matrix will not fit into the cache and has to be brought from the main memory on every iteration, in which case the memory bandwidth becomes a bottleneck. To understand the dip in the speedup curve for n = 1000 in Fig. 2.4, a more detailed model of the architecture has to be employed. The machine on which the code is executed provides 8 cores, but consists of two Nahalem Quad Core, each having 8 MB cache, as shown in Fig. 2.4. The first 4 threads have been scheduled to run on the one quad core, and threads 5 to 8 have been scheduled to run on the other quad core. When 4 cores are used, there will be only 8M B cache available, in which case the data of 10002 · 8 = 8M B indeed will not fit as some space is occupied by other data. However, using more than 4 cores, 16 MB of data will be available and the data will fit into the cache and hence the program will not be limited by the memory bandwidth. The situation can of course be resolved by scheduling threads 1 and 2 to run on one 69 Figure 2.4. Memory connections for octa-core, consisting of two quad-core CPUs. Memory is given by the blocks, the CPUs are marked by gray circles. Table 2.3. Single core execution time, T for different problem sizes N , for execution of static Kalman filter. N T [ms] 100 0.0038 200 0.0169 500 0.1068 1000 0.5250 1500 1.6830 2000 2.9590 5000 18.1081 of the quad cores, and thread 3 and 4 on the other, in which case the dip would disappear. However, the scheduling has been kept as it is to demonstrate a phenomenon that gives insight into the problems that can occur in a parallel implementation. Note though that the execution times are very low. 2.9 Extended Kalman filter The extended Kalman filter (EKF) is based on the same computations as the Kalman filter, with an additional linearization step (1.44), (1.45). The linearization step can be completely parallelized. Assume that n/M and p/M are integer and let ft,i = ht,i = f Mn (i−1)+1 (x) f Mn (i−1)+2 (x) · · · f Mn i (x) h Mp (i−1)+1 (x) h Mp (i−1)+2 (x) · · · h Mp i (x) Processor i will then compute Ft,i = Ht,i = ∂fi (x) , ∂x x=xt−1|t−1 ∂hi (x) , ∂x x=xt|t−1 and the complete Jacobians are given by 70 T T , . 8 n=100 n=200 n=500 n=1000 n=1500 n=2000 n=5000 7 6 Speedup 5 4 3 2 1 0 1 2 3 4 5 Number of CPUs 6 7 8 Figure 2.5. Speedup curves for parallel execution of static Kalman filter for different problem sizes n. For reference linear speedup is marked by the dashed line. ⎡ ⎢ ⎢ Ft = ⎢ ⎣ Ft,1 Ft,2 .. . Ft,M ⎤ ⎡ ⎥ ⎢ ⎥ ⎢ ⎥ , Ht = ⎢ ⎦ ⎣ Ht,1 Ht,2 .. . ⎤ ⎥ ⎥ ⎥. 
⎦ Ht,M Provided that the matrix Ft possesses a banded structure, the same method as for parallelization of the original Kalman filter can then be applied to the linearized system. This might be a restrictive assumption while an important special case occurs when the Jacobians are sparse matrices with zeros located at the same positions at each time step t. For this case, a transformation that optimizes the band structure of the matrices can be applied as discussed in Sec. 2.3. 2.10 Conclusions Parallel multicore implementation of the Kalman filter is studied. An implementation based on parallelization of each step using BLAS is compared to an implementation that exploits a banded structure of the system matrix. It is shown that for systems which can be realized with a banded system matrix, the KF can be almost completely parallelized 71 with a very restricted amount of inter-processor communication. Application to a radio interference power estimation problem demonstrated a linear speedup in the number of cores used, for state numbers that are becoming relevant in smart-phone dominated traffic scenarios. A BLAS based parallelization of the static Kalman filter was also performed, and it could be concluded that a linear speedup is achievable for larger problem sizes (N 1000). 72 Chapter 3 Parallel implementation of the Kalman filter as a parameter estimator 3.1 Introduction Many algorithms for real time parameter estimation are based on the Kalman filter. For instance, in echo cancellation, one seeks to estimate in real time the coefficients of a finite impulse response (FIR) filter with thousands of taps, to model an acoustic channel [38]. In such applications, the KF exhibits superior convergence speed, tracking performance and estimation accuracy compared to e.g. Normalized Least Mean Squares algorithm and Averaged Kalman Filter Algorithm (AKFA) [113], [25]. Further, the KF outperforms the Recursive Least Squares algorithm (RLS) in tracking time-varying parameters since the underlying mathematical model for the latter assumes the estimated parameters to be constant. The KF also offers, relative to RLS with forgetting factor, the benefit of individually setting time variation of states, see e.g. [100]. In this section, efficient parallelization of the KF as a parameter estimator, executed on a shared-memory multicore architecture is studied and exemplified by an adaptive filtering application. The parallelization is achieved by re-ordering the KF equations so that the data dependencies are broken and allow for a well-parallelized program implementation that has the potential to exhibit linear speedup in the number of used cores. Analysis of the resulting algorithm brings about an estimate of the memory bandwidth necessary for a realization of this potential on a multicore computer. 73 3.1.1 System model and the Kalman filter When employing the Kalman filter as a parameter estimator, the parameters are modelled by the random walk model θt+1 = θt + t , yt = ϕTt θt + et . Here yt is the scalar measured output, ϕt ∈ R is the (known) regressor vector that depends on the data up to time t − 1, θt ∈ R is the time-varying vector of N parameters to be estimated, t is the process noise, et is the measurement noise and t is discrete time. This description includes parameter estimation for any linear single output system, but also a broad class of nonlinear systems that are linear in unknown parameters. 
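As a concrete illustration (hypothetical, for the echo-cancellation application mentioned above), the parameters θ can be taken as the N taps of an FIR model of the acoustic channel and the regressor ϕt as a window of the most recent inputs, so that yt = ϕtT θt + et. A minimal sketch of forming the regressor and the output prediction:

#include <vector>

// Hypothetical adaptive-filtering (FIR) instance of the regressor model:
// theta holds the N filter taps and phi_t stacks the inputs u_{t-1}, ..., u_{t-N}.
std::vector<double> make_regressor(const std::vector<double>& u, int t, int N)
{
    std::vector<double> phi(N, 0.0);
    for (int k = 0; k < N; ++k)
        if (t - 1 - k >= 0) phi[k] = u[t - 1 - k];   // phi_t only uses data up to time t-1
    return phi;
}

double predict_output(const std::vector<double>& phi, const std::vector<double>& theta)
{
    double y = 0.0;                                   // \hat{y}_t = phi_t^T \hat{theta}_{t-1}
    for (std::size_t k = 0; k < phi.size(); ++k) y += phi[k] * theta[k];
    return y;
}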
An important property of the regressor model that will be utilized further is that the regressor vector ϕt only contains data from time t − 1. The Kalman filter equations for estimation of θt (see e.g. [100]) can be written as:

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t\left[y_t - \varphi_t^T \hat{\theta}_{t-1}\right], \qquad (3.1)

K_t = \frac{P_{t-1}\varphi_t}{r_t + \varphi_t^T P_{t-1}\varphi_t}, \qquad (3.2)

P_t = P_{t-1} - \frac{P_{t-1}\varphi_t\varphi_t^T P_{t-1}}{r_t + \varphi_t^T P_{t-1}\varphi_t} + Q_t, \qquad (3.3)

where θ̂t ∈ RN is the estimate of θt, Kt ∈ RN is the Kalman gain, Pt ∈ RN×N is the error covariance matrix, rt ∈ R is the measurement noise variance V(et) and Qt ∈ RN×N is the covariance matrix of the process noise V(εt). A priori estimates of θ0 and P0 are taken as initial conditions, if available. Otherwise it is standard to use θ0 = 0 and P0 = ρI, where ρ is some "large" number.

3.2 Implementation

In this section, computer implementation of the KF equations (3.1)-(3.3) is discussed. First a straightforward implementation is presented and its drawbacks are explained. Thereafter it is shown how these drawbacks can be remedied by a simple reordering of the equations, allowing for a well-parallelized algorithm suitable for multicore and, possibly, distributed systems.

3.2.1 Straightforward implementation

To minimize the computational redundancy in (3.1)-(3.3), the common terms Ct ≜ Pt−1ϕt, bt ≜ ϕtT Pt−1ϕt = ϕtT Ct and dt ≜ rt + ϕtT Pt−1ϕt = rt + bt are first calculated. This results in Alg. 9. The corresponding pseudocode is provided in Alg. 10.

Algorithm 9 Straightforward implementation of (3.1)-(3.3)
• Ct = Pt−1 ϕt
• bt = ϕtT Ct
• dt = rt + bt
• Pt = Pt−1 − Ct CtT /dt + Qt
• ŷt = ϕtT θ̂t−1
• θ̂t = θ̂t−1 + (Ct /dt )[yt − ŷt ]

As mentioned, such an implementation has drawbacks. Assume that θt is of length N = 2000, a not uncommon size for, say, adaptive filtering in acoustics. P would then require N²·(8 B) = 32 MB of storage (assuming double precision, 8 B per element), which is too large to fit into the cache (recall that the cache size is typically a few MB). Thus, to calculate C in Alg. 10, the elements of Pt−1 will be brought into the cache as they are requested. Eventually, the elements of Pt−1 that were brought in first will be evicted by the elements currently in use. When the program later arrives at the calculation of Pt, the elements of Pt−1 must be brought in once again. Since P is of considerable size, bringing it into the cache twice leads to a substantial increase in the execution time.

3.2.2 Reordering of the equations for efficient memory utilization

The reordering is based on the observation that ϕt+1 depends only on the data from time t, and can thus be made available at time step t. This observation enables the reformulation of Alg. 9 as Alg. 11. Why such a reordering improves the performance becomes clear from the pseudocode given in Alg. 12, where it can be seen that once an element of P has been brought into memory, it is used to accomplish all calculations it is involved in. Therefore, squeezing the P matrix through the memory twice at each iteration is no longer needed.

3.2.3 Utilizing the symmetry of P

If P0 is symmetric, it can be seen from (3.3) that P will stay symmetric through the recursions. This should be taken advantage of, since approximately half of the calculations and memory storage can be spared.

Algorithm 10 Pseudocode for implementation of Alg.
9 • for i = 1 : N – for j = 1 : N ∗ Ct (i) = Ct (i) + Pt−1 (i, j)ϕt (j) – end – bt = bt + ϕt (i)Ct (i) – ŷt = ŷt + ϕt (i)θ̂t−1 (i) • end • dt = rt + bt • for i = 1 : N – for j = 1 : N ∗ Pt (i, j) = Pt−1 (i, j) + Ct (i)Ct (j)/dt + Qt (i, j) – end for – θ̂t (i) = θ̂t−1 (i) + Ct (i) dt [yt − ŷt ] • end for Algorithm 11 Reorganized implementation of Alg. 9 • dt = rt + bt • θ̂t = θ̂t−1 + Ct dt [yt − ŷt ] • Pt = Pt−1 + Ct CTt /dt + Qt • Ct+1 = Pt ϕt+1 • ŷt+1 = ϕTt+1 θ̂t • bt+1 = ϕTt+1 Ct+1 76 Algorithm 12 Pseudocode of memory efficient implementation. • dt = rt + bt • for i = 1 : N – θ̂t (i) = θ̂t−1 (i) + Ct (i) dt [yt − ŷt ] – for j = 1 : N ∗ Pt+1 (i, j) = Pt (i, j) + Ct (i)Ct (j)/dt + Qt (i, j) ∗ Ct+1 (i) = Ct+1 (i) + Pt+1 (i, j)ϕt+1 (j) – end for – ŷt+1 = ŷt+1 + ϕTt+1 (i)θ̂t (i) – bt+1 = bt+1 + ϕt+1 (i)Ct+1 (i) • end for Ct (i) can be rewritten to be calculated from only upper triangular elements as Ct (i) = N Pt (i, j)ϕt (j) + j=i i−1 Pt (j, i)ϕt (j). j=1 An implementation making use of only the upper triangular part of P can thus be obtained by changing the j-loop in Alg. 12 to: • for j = i : N – Pt+1 (i, j) = Pt (i, j) + Ct (i)Ct (j)/dt + Qt (i, j) – Ct+1 (i) = Ct+1 (i) + Pt+1 (i, j)ϕt+1 (j) – Ct+1 (j) = Ct+1 (j) + Pt+1 (i, j)ϕt+1 (i) • end for 3.2.4 Parallel implementation Let M be the number of CPUs used for the implementation. It can be observed by examining Alg. 12 that there are no dependencies between i-loop iterations, except for the adding up of ŷt+1 , bt+1 and Kt+1 . Such dependencies are easily broken by using a reduction. In a reduction, CPU M calculates the local contribution, sM , of the sum that is later M sM . added up in a sequential section to give the global sum S = i=1 By doing so, a parallelization can be achieved by splitting the i-loop in 77 equally large chunks of size N/M (assumed to be integer), and letting each CPU process one of the chunks. For the algorithm utilizing only the upper triangular part of P, there is an issue of splitting the workload among the CPUs. Splitting over the i-index would result in an unevenly distributed workload since the j-loop range from i to N . Moreover, the splitting shall preferably be done so that each CPU can hold locally as much of the data as possible. This can be achieved by the following splitting. First map the upper diagonal elements of P to a rectangular matrix P of size N × (N/2 + 1), where the mapping from an element in P to element (i.j) in P is given by P (i, j) = P(i, (i + j − 1) mod N ), 1≤i≤N 1 ≤ j ≤ (N/2 + 1). Notice that this matrix contains N/2 elements more than necessary. The upper triangular block of P contains N (N + 1)/2 elements and P thus has N (N/2 + 1) − N (N + 1)/2 = N/2 elements extra. This is to avoid the use of if-statements in the implementation and hence allow for better use of the pipeline in the CPU. An example for N = 6 is given below. Notice that P can be said to contain only upper diagonal elements since P(i, j) = P(j, i). ⎡ p11 p12 p13 p14 ⎤ ⎡ p11 p12 p13 p14 . . ⎤ . p22 p23 p24 . p33 p34 . . p44 p51 p52 . . p61 p62 p63 . . P = ⎣ p41 p25 p35 p45 p55 . . p36 ⎦ p46 p56 p66 p22 p23 p34 p45 p55 p56 p66 p61 → P = ⎣ pp33 44 p24 p35 p46 p51 p62 p25 p36 ⎦ p41 . p52 p63 The redundant elements of P are in the last half of the last column, which is equal to the first half of the last column. The same mapping is applied to Q to yield Q . Splitting these calculations over the i-index so that CPU m will loop N N from i1,m = M (m − 1) + 1 to i2,m = M m gives a parallel implementation described in Alg. 
13, where superscript (m) denotes a local variable to CPU m. 3.3 Analysis of Algorithm 13 3.3.1 Sequential and parallel work For one iteration of Alg. 13, 2M − 1 + N (M − 1) FLOP’s are executed sequentially which is negligible, assuming that N is of considerable magnitude, compared to the 10(N 2 + N ) FLOP’s that are executed in parallel. Further, the computational load performed in parallel is perfectly balanced, i.e. each processor will perform an equal amount of work in the parallel section. 78 Algorithm 13 • Sequential M (m) yt – ŷt = m=1 M – bt = m=1 – Ct = (m) bt M (m) m=1 Ct – dt = rt + bt • CPU m (in parallel) – for i = i1m : i2m ∗ θ̂t (i) = θ̂t−1 (i) + Ct (i)/dt [yt − ŷt ] 2i ∗ for j = 1 : (N/2 + 1 − N ) · k = (i + j) mod N · Pt+1 (i, j) = Pt (i, j) + Ct (m) (m) (i)Ct (k)/dt + Qt (i, j) · Ct+1 (i) = Ct+1 (i) + Pt+1 (i, j)ϕt+1 (k) (m) (m) · Ct+1 (k) = Ct+1 (k) + Pt+1 (i, j)ϕt+1 (i) (m) (m) ∗ end for (m) (m) (m) (m) ∗ ŷt+1 = ŷt+1 + ϕTt+1 (i)θ̂t (i) (m) ∗ bt+1 = bt+1 + ϕt+1 (i)Ct+1 (i) – end for 79 3.3.2 Communication and synchronization The proposed algorithm exhibits a large degree of data locality. Most importantly, each CPU will only access a part of P, consisting of N (N + 1)/2M elements, implying that it can be stored locally and no parts of P will have to be communicated among the CPUs. The variables that are involved in a reduction, i.e. C, ŷ and b, which consist of (N/2 + 1) + N/M + 2 elements, have to be communicated from the parallel to the sequential section. In the worst case scenario (M = 2), this becomes (N/2 + 1) + N/2 + 2 = N + 3 elements. Since double precision is assumed (8 B per element), this means that for N = 2000, (8 B)(2000 + 3) ≈ 16 kB will need to be communicated, certainly not a large amount. The data to be communicated from the sequential to the parallel section are C, ŷ, b and the additional values of ϕt+1 . Synchronization is required at the end of each iteration. The overhead inflicted by this event is independent of N and depends only on the number of CPUs used; the more processors are involved, the more expensive the synchronization is. However, the relative cost of synchronization becomes less for larger N and the synchronization overhead has smaller influence on the overall execution time. 3.3.3 Memory bandwidth The memory bandwidth needed by the algorithm to perform niter iterations in ttot seconds can be estimated as follows. The only data structures of considerable size in the algorithm are P and Q. Studying how these are transfered from the RAM to the CPU gives a good estimate of the required memory bandwidth. If the matrices P and Q have a size of s(P) and s(Q) bytes respectively, transferring them from the RAM to the CPUs at each iteration requires a memory bandwidth of B= [s(P) + s(Q)] · niter . ttot (3.4) Even though Qt is a matrix of size N × N , it is very often selected to be diagonal or sparse. This means that in most practical cases the required bandwidth needed is about half of that stated by (3.4). As for any other parallel algorithm, one could thus not expect the above algorithm to scale well for a too large or too small problem size N . For small N , the parallel overhead will become a bottleneck, while for large N the available memory bandwidth might strangle the performance. 80 3.3.4 Cache miss handling In a cache-based system, it is of outermost importance to avoid cache misses to get good performance. One of the main points in the reorganization yielding Alg. 11 is to minimize the cache misses for P. 
Because of the reorganization the optimal strategy for minimizing the cache misses becomes simple. For the matrix P, each element will only be used once in each loop iteration. There is thus no reason to store any of it in the cache. The remaining variables claim a negligible space of (3N + 3) · 8B. Since they are reused several times in one iteration, they should be stored in the cache. For instance, with N = 8000, which number is considered to be a large N , they will require 190 kB of storage. This is in the order of 0.1 % of a cache of a few MB. The strategy is thus to store everything except P in the cache, unless all data fits in the cache completely. In the latter case all the data should certainly be kept in the cache. 3.4 Results All calculations were carried out using double precision. The test data came from a simulation and were the same for all runs. Program compilation was performed with the pgi-compiler and full compiler optimization was used for all the algorithms. Open MP [1] was used for parallelization. This allowed the program to be executed in parallel by adding a single extra code line telling the compiler to run the outer i-loop in parallel and perform the required reductions. The matrix Q was diagonal. To evaluate the improvement gained by reorganizing the equations, Alg. 10 was compared to Alg. 12. The rest of the experiments were devoted to the algorithm of main interest, i.e. Alg. 13. Also the memory bandwidth of the computers Kalkyl and Grad were evaluated, to enable further analysis. 3.4.1 Exection time and speedup Table 3.1 shows execution times for the memory efficient algorithm, Alg. 12, the memory inefficient algorithm, Alg. 10, and the parallelizable implementaiton Alg. 13, tested on Grad and Kalkyl. Speedup curves for Alg. 13 are plotted in Fig. 3.1. 3.4.2 Memory Bandwidth Tab. 3.2 show estimates of the required memory bandwidth Blin (N, M ) to achieve linear speedup for problem size N using M processors. These 81 Table 3.1. Execution times in Sec. for 50 iterations of Alg. 10, Alg. 12 and Alg. 13, executed on a single core on Grad and Kalkyl. N Grad Alg. 10 0.12 0.22 1.06 4.42 17.55 500 1000 2000 4000 8000 Alg. 12 0.063 0.11 0.60 2.49 9.60 Alg. 13 0.021 0.073 0.33 1.37 5.51 Kalkyl Alg. 10 0.12 0.20 0.99 3.92 16.52 Alg. 12 0.051 0.11 0.56 2.08 8.45 Alg. 13 0.028 0.089 0.34 1.31 5.54 8 7 6 N=500 N=1000 N=2000 N=4000 N=8000 Speed up 5 4 3 2 1 0 1 2 3 4 5 Number of cores 6 7 8 3 4 5 Number of cores 6 7 8 9 8 7 N=500 N=1000 N=2000 N=4000 N=8000 Speed up 6 5 4 3 2 1 0 1 2 Figure 3.1. Speedup for Alg. 13, executed on Grad (upper) and Kalkyl (lower). For reference linear speedup is marked by the dashed line. 82 Table 3.2. Theoretically evaluated bandwidth to obtain linear speedup of Alg. 13 executed on Grad and Kalkyl in GB/s. M\N 500 1 2 4 8 2.4095 4.8190 9.6381 19.2762 1 2 4 8 1.7585 3.5169 7.0338 14.0677 1000 2000 Grad 2.7255 2.4038 5.4509 4.8075 10.9019 9.6151 21.8037 19.2302 Kalkyl 2.2470 2.3512 4.4940 4.7023 8.9881 9.4046 17.9761 18.8093 4000 8000 2.3229 4.6458 9.2915 18.5831 2.3207 4.6414 9.2828 18.5657 2.2636 4.5271 9.0542 18.1085 2.3086 4.6173 9.2345 18.4690 values were obtained by applying (3.4) to the data in Tab. 3.1, to calculate Blin (N, 1) with further extrapolation for M ≥ 1, i.e. Blin (N, M ) = M · Blin (N, 1). 3.5 Discussion It can be seen from Tab. 3.1 that the memory-efficient algorithm, Alg. 12, executes about twice as fast as the memory-inefficient algorithm, Alg. 10, on both systems (Grad and Kalkyl). Comparing execution times for Alg. 
12 and Alg. 13 in Tab 3.1, it can also be concluded that the execution time for the algorithm utilizing the symmetry of P runs, as expected, about twice as fast as the algorithm using the whole P matrix. Speedup curve for Kalkyl Since linear speedup is obtained for all values of N , there is apparently neither problem with synchronization overhead for small values of N nor memory bus saturation for larger values of N . This is further confirmed by Tab. 3.2 where none of the elements exceeds the available bandwidth of 23 GB/s. Even super-linear speedup for small values of N can be observed. This is due to good cache performance. With the work distributed among several cores, each core needs to access a smaller amount of data that will fit easier into the cache and result in a better overall throughput. For a more extensive explanation of this phenomenon, see e.g. [33]. 83 Speed-up curve for Grad In the speedup curve for Grad, bad scaling for N = 500 and N = 1000 is observed. This is due to the synchronization overhead that constitutes a disproportionally large part of the execution time. Also in Tab. 3.2, there are indications that the memory bus would be saturated for N = {500, 1000, 2000} and M = {4, 8} since the available bandwidth of 5.5 GB/s would be exceeded for these entries. However, no saturation can be seen in the speedup curves and almost linear speedup is obtained for N = 2000. One possible explanation to this discrepancy is that the analysis in Section 3.3.3 assumes that P is transfered from the RAM to the CPU at each iteration. For N ≤ 2000, the size of P satisfies s(P) ≤ 16 MB. Since there are 24 MB cache available running on 8 cores, the whole P matrix will remain in the cache memory between iterations, avoiding the need of fetching it from the RAM, creating an illusion of a larger memory bandwidth. For N ≥ 4000, s(P) ≥ 64 MB, which is larger than the available cache of 24 MB, the whole matrix must be brought to the cache from the RAM at every iteration. At this point, the memory bandwidth really becomes a bottleneck. Indeed, the entries in Tab. 3.2 corresponding to N = {4000, 8000} and M = {4, 8} do not align with the linear speedup for N ≥ 4000. Therefore, on this hardware and using the proposed KF algorithm, more bandwidth than the available 5.5 GB/s is needed to achieve a linear speedup. 3.5.1 Conclusions Through test runs on two different shared-memory multicore architectures, it is found that a Kalman filter for adaptive filtering can be efficiently implemented in parallel by organizing the calculations so that the data dependencies are broken. The proposed algorithm executes about twice as fast on a single core as a straightforward implementation and is capable of achieving linear speedup in the number of cores used. However, since the KF involves relatively simple calculations on large data structures, it is required that the hardware provides enough memory bandwidth to achieve linear speedup. This is an inherent problem of the KF itself and not caused by the proposed parallelization algorithm. 84 Chapter 4 Parallel implementation of the particle filter The PF solves the recursive Bayesian estimation problem approximately with Monte Carlo simulation and provides a general framework for nonlinear/non-Gaussian dynamic systems. Navigation, positioning, tracking, communication, economics and also computer vision are some application areas where PFs have been applied and proved to perform well. 
A special case in tracking applications is the so-called Bearings-Only Tracking (BOT) problem. This is a scenario often occurring in defense applications where only the normalized angle to a maneuvering target, relative some given reference angle, is measured. The BOT problem is inherently nonlinear with observability issues and is typically solved with respect to a set of constraints representing e.g. a geographical map. This is an application where the PF is intensively used. A well-known drawback of the PF is that good estimation accuracy requires a large number, often thousands, of particles which makes the algorithm computationally expensive. Further, in tracking applications, the particle filter often constitutes only a part of a complete tracking system containing interacting multiple model and joint probabilistic data association algorithms, communication of measurements, constraint handling etc. Time limits in such systems are often tight and it is desirable to optimize each part in terms of execution time. Since the PF algorithm is to a large extent parallel, it is natural to turn to parallel implementations to improve execution times so that real-time feasibility is achieved. It is therefore instructive and motivated to study the speedup and tracking performance of existing parallel PF algorithms implemented on a multicore architecture. By implementing a PF in parallel, a real-time feasible, powerful, energy effective and cheap 85 filter that can handle a broad class of nonlinear dynamic systems with constraints is obtained. In this section, four different existing parallel PFs, namely global distributed PF (GDPF) [9], resampling with non-proportional allocation filter (RNA) [12], resampling with proportional allocation (RPA) [12] and the Gaussian PF (GPF) [56] are compared in tracking accuracy and speedup at solving a testbed BOT problem. The filters are implemented on a shared memory multicore computer, using up to eight cores. 4.1 The particle filter The idea of PF is to recursively obtain a weighted sample from p(xt |Yt ) by Monte Carlo simulation and evaluate an estimate x̂t of xt from it. For the general PF algorithm see for instance [72], [7]. Assume that (i) (i) at time step t − 1 the particle set St−1 = {xt−1 , wt−1 }N i=1 constitutes (i) a weighted sample from p(xt−1 |Yt−1 ), where xt−1 is the i-th particle (i) with associated weight wt−1 . Given St−1 , a sample from p(xt |Yt−1 ) is obtained by propagating each particle through system equation (1.21), i.e. (i) (i) (i) xt = ft−1 (xt−1 , vt−1 ), (4.1) (i) where vt−1 is a draw from p(vt−1 ). This corresponds to the prediction step in recursive Bayesian estimation. The measurement yt is then used to update the weights by (i) (i) (i) wt = wt−1 p(yt |xt ), (4.2) which corresponds to the update step in recursive Bayesian estimation. (i) (i) The two steps above yield the particle set St = {xt , wt }N i=1 at time step t. By iterating (4.1) and (4.2), samples from p(xt |Yt ) are thus recursively obtained and can be used to produce an estimate x̂t of the state xt as for instance the mean x̂t = E(xt ) ≈ N (i) (i) w̃t xt , (4.3) i=1 (i) (i) where w̃t = wt / N (i) wt , is the i-th normalized weight. The recursion i=1 is initialized by making N draws from an a priori distribution p(x0 ). Resampling is used to avoid degeneracy of the algorithm. In resam (i) (i) a pling, a new set of Na particles St = {xt , wt }N i=1 , is to be created and (i) (i) Nb replace the old set St = {xt , wt }i=1 of Nb particles. 
Usually, but not 86 Algorithm 14 SIR algorithm. [St ] = SIR[St−1 , yt ] • FOR i = 1 : N (i) – (P) Propagate xt−1 (Eq. (4.1)). (i) – (U) Set wt according to (4.2). • END FOR • (R) Resample St using SR. a • Output the resampled set St = {xt , wt }N i=1 (i) (i) necessarily, Na = Nb . Most resampling algorithms obtain the resampled set by drawing with replacement Na samples from the set St so that (i) (i) (i) Pr(xt = xt ) = w̃t , where Pr(·) stands for probability. When resampling the information contained by the weights is replaced by particle density. Therefore, the weights are reset, i.e. w (i) = 1/N, i = 1, .., N . A popular resampling algorithm is Systematic Resampling (SR) [20]. The PF algorithm used in this chapter is the so-called SIR (Sampling Importance Resampling) algorithm using SR for resampling is given by pseudocode in Alg. 14. Gaussian Particle Filter Another variant of the PF is the Gaussian Particle Filter (GPF) [56]. The additional assumption made in GPF is that the posterior distribution can be approximated by a Gaussian PDF, i.e. p(xt |Yt ) ≈ N (xt ; μt , Σt ) where N (x; μ, Σ) = 1 −1 1 T e− 2 (x−μ) Σ (x−μ) , (2π)n/2 |Σ|1/2 (4.4) is the n-dimensional normal distribution PDF for the random variable x, with mean μ ∈ Rn and covariance Σ ∈ Rn×n . The advantage gained is a simpler resampling scheme and that only the estimated mean μ̂t and covariance Σ̂t have to be propagated between iterations. These properties make the algorithm highly amenable to parallel implementation. Estimates of μt and Σt can be obtained as the weighted sample mean and covariance [102] given by 1 (i) (i) wt xt , Wt (4.5) N Wt (i) (i) (i) w (xt − μ̂t )(xt − μ̂t )T , Wt2 − Wt i=1 t (4.6) N μ̂t = i=1 Σ̂t = 87 Algorithm 15 GPF algorithm. [μ̂t , Σ̂t ] = GP F [μ̂t−1 , Σ̂t−1 , yt ] • FOR i = 1 : N (i) – (R) Draw xt−1 ∼ N (μ̂t−1 , Σ̂t−1 ) – Perform (P) and (U) steps as in Alg. 14 • END FOR • Calculate μ̂t and Σ̂t (Eq. (4.5) and (4.6)). • Output estimated parameters {μ̂t , Σ̂t }. where Wt = Wt = N i=1 N (i) wt , (i) (wt )2 . i=1 The GPF algorithm is described by pseudocode in Alg. 15. The drawback of the GPF is that p(xt |Yt ) must be well approximated by a Gaussian PDF which is not generally true. 4.2 Parallel algorithms In the following the number of parallel processing units will be denoted with M . Superscript m indicates CPU m, e.g. N (m) is the number of (m) particles in the local particle set S (m) = {x(m,i) , w(m,i) }N i=1 at CPU m. Common to all algorithms is that each CPU performs the propagation (P) and weight update (U) steps in Alg. 14 for N (m) = N/M particles (N/M assumed to be integer). What differs between the algorithms then is how the resampling step (R) is handled. All described algorithms also utilize the fact that the global estimate can be calculated from the local estimates as x̂ = N M 1 (i) (i) 1 (m) (m) w x = W x̂ . W W i=1 (4.7) m=1 The description of the algorithms GDPF, RNA and RPA starts from a point where it is assumed that the CPUs have a local particle set St−1 for time sample t − 1 and also have access to the measurement yt . 88 Algorithm 16 GDPF • CPU m (in parallel) (m) – Perform (P) and (U) steps to obtain St . • Sequential (one CPU only) M (m) – Form St = ∪ St m=1 . – Calculate x̂t and resample St . – Redistribute the resampled particles to CPUs. Global Distributed Particle Filter (GDPF) GDPF [9] uses a straightforward way to perform resampling. Steps performed within one iteration of GDPF are as given in Alg. 16. 
Since GDPF performs exactly the same calculations as the sequential PF, it exhibits the same accuracy. A drawback is of course a high communication demand inflicted by sending the particles back and forth between sequential and parallel sections. Furthermore, a large part of the program (resampling) has to be executed sequentially, limiting speedup possibilities. Resampling with Non-proportional Allocation (RNA) In RNA [12], resampling is performed in a suboptimal but parallel manner. Each CPU resamples the local set of particles S (m) with the locally normalized weights w̃(m,i) = w(m,i) /W (m) . To avoid disturbing the statistics, the weights at each CPU after resampling are set so that w(m,i) = w(m,i) W (m) /W . A problem with RNA is that a CPU can starve, i.e. the local sum of weights, W (m) , gets very small or even turns to machine zero. When starving occurs, computational resources are wasted on a particle set that provides little or no contribution to the final estimate. In [12], it is suggested that the problem can be resolved by at every iteration letting the CPUs exchange some portion P of their particles. For instance the CPUs could form a ring and let CPU m send N (m) P particles to CPU m+1, with the exception that CPU M sends to CPU 1. Steps performed in chronological order within one iteration, organized to allow for only one parallel section per iteration are given in Alg. 17. Resampling with Proportional Allocation (RPA) In RPA [12], the resampling is done by using an intermediate step in the resampling procedure called inter-resampling. 89 Algorithm 17 Parallel RNA. • CPU m (in parallel) – Exchange some portion P of the particles with neighboring CPUs. (m,i) (m,i) – Set i-th weight to wt−1 = wt−1 /Wt−1 . (m) – Perform (P) and (U) steps to obtain St (m) and Wt (m) using the locally normalized weights w̃t – Calculate x̂t – Resample St (m,i) (m) wt /Wt . (m) . . (m,i) (m,i) – Set i-th weight to wt (m) = Wt = . • Sequentially – Calculate x̂t (Eq. (4.7)). – Calculate and distribute Wt to each CPU. During the inter-resampling stage CPU m calculates the local sum of weights W (m) . A CPU running sequentially takes W (m) and now treats each CPU as a particle with weight W (m) and uses the residual systematic resampling (RSR) algorithm [11] to produce M replication factors R(m) , m = 1, .., M , specifying how many particles CPU m should possess after resampling. CPU m will thus produce a number of particles proportional to W (m) . R(m) is communicated to CPU m which now performs intra-resampling. Intra-sampling: At each CPU, the local particle set S (m) is resampled with N (m) input particles and R(m) output particles using systematic resampling. After this step, it is likely that the particles are unequally distributed among the CPUs. Therefore load balancing is performed. CPUs with surplus particles send their excess particles to a CPU running sequentially, which distributes them to the CPUs with lack of particles. The number of particles that should be sent/received by CPU m is given by D(m) = R(m) − N (m) . Steps performed at one iteration in chronological order for RPA are given in Alg. 18. A drawback of RPA is the unpredictability in execution time caused by the possibly uneven distributed workload among the CPU in the inter-resampling step, where the execution time is tied to the slowest CPUs. The fact that there are two parallel sections with intermedi90 Algorithm 18 Parallel RPA. • CPU m (in parallel) (m) – Perform (P) and (U) steps to obtain St . 
(m) – Calculate Wt (m) and x̂t . • Sequentially (intra-resampling) – Calculate x̂t (Eq. (4.7)). – Compute replication factors R(m) using RSR. • CPU m (In parallel, inter-resampling) – Resample using SR with N (m) input particles and R(m) output particles. – Calculate D(m) . • Sequentially – Use D(m) to distribute particles equally among CPUs. ate sequential sections per iteration also requires extra overhead that diminishes the speedup potential of the algorithm. Gaussian Particle Filter (GPF) The GPF is highly amenable to parallel implementation since it avoids the sequential resampling required by SIR. To simplify the notation in the description of the parallel implementation, the following variables are defined W (m) = (m) N (w(m,i) )2 , i=1 W , −W (m) N = w(m,i) x(m,i) (x(m,i) )T , α = σ (m) W2 i=1 μ̂t−1 and Σ̂t−1 denote the estimated mean and covariance from the sequentially executed section at time t − 1. CPU m creates a local set (m) of particles St−1 by making N (m) draws from N (μ̂t−1 , Σ̂t−1 ), and set(m,i) ting wt−1 = 1/N (m) , i = 1, .., N (m) . Each CPU then performs the (m) (m) (P) and (U) steps for St−1 to obtain St (m) {μ̂t , (m) σt , (m) Wt , (m) Wt } (m) . From St (m) μ̂t , the quantities (m) are calculated, where = x̂t . A CPU running sequentially forms the estimated mean μ̂t via (4.7) using the 91 Algorithm 19 Parallel GPF • CPU m (in parallel) (m,i) (m,i) – Make N (m) draws xt−1 ∼ N (μ̂t−1 , Σ̂t−1 ) and set wt−1 = (m) 1/N (m) to obtain St−1 . (m) – Perform (P) and (U) steps to obtain St (m) – Calculate {μ̂t (m) , σt (m) , Wt (m) , Wt . (m) } from St . • Sequentially – Use the obtained data to calculate μ̂t and Σ̂t using (4.7) and (4.8). fact that μ̂t = x̂t . An estimate of Σ is obtained by exploiting that (4.6) could be rewritten as Σ̂ = α = α N i=1 N w(i) (x(i) − μ̂)(x(i) − μ̂)T [w(i) x(i) (x(i) )T − w(i) x(i) μ̂T − μ̂(w(i) x(i) )T i=1 +w(i) μ̂μ̂T ] = α[( M σ (m) ) − W μ̂μ̂T ], (4.8) m=1 where the relationship N i=1 1 (i) (i) =W w x = W μ̂, W N (i) (i) w x i=1 is used in the third equality. Note that the final expression in (4.8) only (m) (m) (m) (m) makes use of the data contained in {μ̂t , σ t , Wt , Wt }M m=1 . The algorithm for one iteration is given in Alg. 19. 4.3 Performance evaluation The filters were implemented in C++ on a shared memory multicore computer, using Open MP [1] for parallelization. Tracking accuracy was evaluated for a bearings-only tracking (BOT) application, where only the bearings, i.e. the the normalized angles to the target, relative some given reference angle, were measured. This is a scenario encountered in e.g. defense applications where passive microphone array sensors are 92 ×10 6 6.5031 Sensor 6.5032 y-position [m] 6.5033 Sensor Start 6.5034 Stop 6.5035 Road Sensor Particle 6.5036 Estimated trajectory Current estimate Current position Measurment 6.5037 1.471 1.4711 1.4712 x-position [m] 1.4713 1.4714 1.4715 ×10 6 Figure 4.1. Evaluation scenario. used for measurements. As a performance measure, the position RMSE, taken over 200 independent simulation runs, was studied. Note that GDPF performs the exact same calculation as the sequential PF and can thus be taken as a performance measure for the sequential PF. Evaluation scenario An image of the scenario at a time instant, from a representative simulation run is shown in Fig. 4.1. The PF is tracking a target traveling along a road, the start and stop points are marked in the figure. Two sensors are taking noisy bearing measurements of the target. 
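To make the scenario concrete, a short sketch of how such a bearing measurement can be generated in simulation is shown below; it is a hedged illustration of the measurement model formalized in the next paragraph, and the function name and the use of atan2 (which covers all four quadrants) are assumptions rather than the thesis code.

```cpp
// Hypothetical sketch of a simulated bearing measurement from one sensor:
// the true bearing is the angle from the sensor to the target, corrupted by
// zero-mean Gaussian noise with standard deviation 0.1 rad.
#include <cmath>
#include <random>

double simulate_bearing(double target_x, double target_y,
                        double sensor_x, double sensor_y,
                        std::mt19937& gen)
{
    const double true_bearing =
        std::atan2(target_y - sensor_y, target_x - sensor_x);  // theta_(i,t)
    std::normal_distribution<double> noise(0.0, 0.1);          // sigma_v = 0.1 rad
    return true_bearing + noise(gen);                          // z_(i,t) = theta_(i,t) + v_(i,t)
}
```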
The simulated measurement for sensor i at time sample t was obtained as the true bearing θ(i,t) corrupted by noise, i.e. z(i,t) = θ(i,t) + v(i,t), where v(i,t) is zero-mean white noise with standard deviation σv = 0.1 rad.

State space model
The same state space model as in [23] was used. The system equation is given by
x_{t+1} = \begin{bmatrix} I_2 & I_2 T \\ 0 & I_2 \end{bmatrix} x_t + \begin{bmatrix} I_2 T^2/2 \\ I_2 T \end{bmatrix} w_t,
where the state vector x_t = [x_t, y_t, \dot{x}_t, \dot{y}_t]^T consists of the Cartesian position and velocity components, and I_2 denotes the 2×2 identity matrix. The system noise w_t is white with distribution N(0, σ_w^2 I) and T is the sampling period. A single measurement taken by sensor i is related to the state vector by z(i,t) = g(i)(x_t) + v(i,t), where g(i)(·) is the trigonometric function relating the x-y position to the bearing, i.e. tan^{-1}(y/x) if the target is in the first or fourth quadrant, considering the position of sensor i as the origin of the coordinate system, and v(i,t) is zero-mean white Gaussian noise with variance σ_v^2. In the simulation, σ_w = 10 m/s^2, σ_v = 0.1 rad and T = 0.5 s were used.

Performance
Fig. 4.2 and Fig. 4.3 show the tracking performance using M = 4 and M = 8, respectively. Fig. 4.4 shows the achieved speedup; in the figure, P = 0.2 was used for RNA.

Figure 4.2. RMSE as a function of the total number of particles N for M = 4 CPUs. RNA-X% denotes RNA with X% particle exchange, P. Note the log-log scale.

Figure 4.3. RMSE as a function of the total number of particles N for M = 8 CPUs. RNA-X% denotes RNA with X% particle exchange, P. Note the log-log scale.

Figure 4.4. Speedup for N equal to 100, 500, 1000 and 10000 particles. For reference, linear speedup is marked by the dashed line.

4.4 Discussion
As can be seen from Fig. 4.2, GPF provides better tracking accuracy than the other filters in the given scenario, especially for a small number of particles. It must though be noted that in the given scenario the measurement noise is Gaussian distributed, which provides an almost ideal situation for the GPF. However, this Gaussian noise model is probably not unrealistic in ground target BOT with fixed sensor platforms. RNA with 0% particle exchange provides significantly lower tracking accuracy than the other filters since it suffers from CPU starvation. All other filters have comparably the same performance as the sequential SIR filter. For the RNA algorithm, the tracking accuracy is affected by the amount of particle exchange P. For smaller N^(m), S_t^(m) gives a less accurate local approximation to p(x_t|z_{1:t}), and stronger coupling between the local particle sets (larger P) is required to maintain tracking accuracy. This effect can be seen by comparing RNA 10% and RNA 50% in Fig. 4.3, where for a small number of particles (N = 100) RNA 50% provides better tracking accuracy than RNA 10%. This effect cannot be clearly seen in Fig.
4.2 since the number of CPUs is less, implying larger local particle set, and less coupling is thus required to maintain tracking accuracy. The obtained speedup naturally depends on the number of particles used, as can be seen from Fig. 4.4. For a small number of particles, the parallelization becomes too fine-grained, and the benefit of using a parallel implementation diminishes. As expected, the speedup of GDPF is quite limited, restricted to about 3 times, depending on the large amount of work (resampling) that is carried out sequentially. RNA achieves speedups very close to linear for large particle sets (N = 104 ). The speedup of RPA is substantially less than RNA and GPF, mainly depending on the overhead caused by the two parallel sections per iteration. GPF provides the best speedup, almost linear in the number of cores used for large particle sets (N = 104 ). 4.5 Conclusions for parallel implementation of the particle filter Simulations performed with four different parallel PF algorithms showed that in a BOT problem the GPF gave best tracking performance while GDPF, RNA and RPA demonstrated tracking performance comparable to that of the sequential SIR algorithm. The drawback of the GPF is that it requires the posterior distribution to be approximately normally distributed, which is probably not unrealistic in ground target BOT with fixed sensor platforms. The obtained speedups gained on a shared memory multicore computer depend largely on the total number of used particles N . For particle sets with N > 1000 GPF and RNA can achieve close to linear speedups in the number of cores used. The speedup obtained by RPA is substantially lower due to less beneficial parallelization poten96 tial. GDPF has a speedup limited to about 3.5 times as a consequence of the sequentially executed part of the algorithm. For particle sets with N < 500, the parallelization becomes too fine-grained, and it is hard to exceed a speedup of about 2 times using GDPF or RPA while GPF and RNA can achieve a speedup of up to about 4 times. The final conclusion for a parallel particle filter implemented on a shared memory multicore computer is thus the following. If the Gaussian assumption made by GPF holds true, it would be the algorithm to prefer since it provides best tracking accuracy and is capable of achieving close to linear speedups in the number of cores used. If the Gaussian assumption does not hold, RNA would be the algorithm to prefer since it as well can, without loss in accuracy compared to the sequential PF, obtain close to linear speedups in the number of cores used. 97 Chapter 5 Solving the RBE via orthogonal series expansions 5.1 Introduction This chapter investigates a method to solve the RBE problem in parallel using orthogonal basis functions. The problem under consideration is to provide an estimate of the state vector xt ∈ Rn , given the measurements Yt = {y1 , y2 , ..., yt }, yt ∈ Rp , of the nonlinear discrete-time system xt+1 = f (xt , vt ), yt = h(xt , et ), (5.1) (5.2) with the process and measurement noise vt ∈ Rn , et ∈ Rp , respectively, and t denoting discrete time. The probability density functions (PDFs) p(vt ), p(et ) are assumed to be known but are allowed to have arbitrary form. Algorithms employing Fourier basis functions and wavelets have been developed in [14], [39] and applied to filtering problems. The development here is with respect to a general orthogonal basis, and targeting in particular the amenability to parallelization that is demonstrated and analyzed. 
The favorable parallelization properties of the method stem from the orthogonality of the basis functions. In contrast, the solutions to the RBE problem that employ non-orthogonal basis functions, e.g. Gaussian sum filters, [4], [98], [46] parallelize poorly because of the inherent dependencies in the computations. Orthogonal bases have been used for a long time in statistics to estimate PDFs with general distributions, see e.g. [96],[89],[104],[27]. Using orthogonal expansions, it is possible to estimate the PDFs with a substantially lower number of variables than e.g. the particle filter or grid-based methods. Since much fewer variables are required to approx99 imate the PDF, a smaller computational load for the same estimation accuracy can be expected. The chapter structure is as follows. The method of solving the recursive Bayesian estimation problem using orthogonal series expansion is presented in Sec. 5.2. An application to a bearings-only tracking problem as well as a speedup evaluation for a parallel implementation are given in Sec. 5.4. The results are discussed Sec. 5.5. An analysis of the impact of the truncation error is given in Sec. 5.6. 5.2 Solving the RBE via series expansions In this section it is derived how to solve the RBE via orthogonal series expansions. It will be assumed that all series are absolute convergent, and that the product of the n-th and m-th basis functions have the expansion φn (x)φm (x) = gnmk φk (x). (5.3) k∈Nd As there is no closed-form analytical solution for the general case of RBE, numerical methods have to be employed to find an estimate of the state in (5.1). The idea advocated here is to approximate the involved PDFs by orthogonal series expansions and recursively propagate the coefficients of the expansion in time via the prediction and update equations (1.29)-(1.30). Assume that p(xt |xt−1 ), p(yt |xt ) and p(xt−1 |Yt−1 ) are given by the expansions p(xt |xt−1 ) = anm φn (xt )φm (xt−1 ), (5.4) bnm φn (yt )φm (xt ), (5.5) n∈Nd m∈Nd p(yt |xt ) = p(xt−1 |Yt−1 ) = n∈ Nd n∈ Nd m∈ t−1|t−1 cn φn (xt−1 ). Nd (5.6) The target of the approximation is to compute and propagate the coeft|t t|t−1 ficients cn over time where cn shall be interpreted as the coefficient with index n at time step t given data up to time t − 1. Inserting (5.4)(5.6) into the prediction and update equations (1.29) and (1.30) yields the following relationships: 100 Prediction step p(xt |Yt−1 ) = p(xt |xt−1 )p(xt−1 |Yt−1 )dxt−1 d R = [ anm φn (xt )φm (xt−1 ) × Rd n∈Nd m∈Nd k∈Nd = n∈ = Nd m∈ Nd k∈ t−1|t−1 ck Nd φk (xt−1 )]dxt−1 t−1|t−1 anm ck φn (xt ) × φm (xt−1 )φk (xt−1 )dxt−1 Rd t|t−1 cn φn (xt ), n∈Nd with t|t−1 cn = t−1|t−1 anm cm . (5.7) m∈Nd Update step When the measurement yt becomes available, the PDF p(yt |xt ) is conditionalized to yield p(yt |xt ) = bnm φn (yt )φm (xt ) n∈Nd m∈Nd = t fm φm (xt ), m∈Nd where t = fm n∈ bnm φn (yt ). (5.8) Nd The multiplication in the update step is then carried out as: p(xt |Yt ) = γt−1 p(yt |xt )p(xt |Yt−1 ) t|t−1 = γt−1 fnt φn (xt ) cm φm (xt ) n∈Nd = γt−1 m∈Nd t|t−1 fnt cm φn (xt )φm (xt ) n∈Nd m∈Nd = γt−1 = n∈Nd m∈Nd t|t γt−1 ck φk (xt ), d k∈N t|t−1 fnt cm gnmk φk (xt ) k∈Nd 101 Algorithm 20 The RBE algorithm using orthogonal basis functions. 
Initialization: 0|0 φk (x0 )p(x0 )dx0 ck = Ω 0|0 γ0 = c n pn n∈Nd Recursion, t = 1, 2, ...: t|t−1 cn −1 = γt−1 t fm = t−1|t−1 anm cm m∈Nd bnm φn (yt ) n∈Nd t|t ck = t|t−1 t fn gnmk cm n∈Nd m∈Nd t|t c n pn γt = n∈Nd where t|t ck = n∈ Nd m∈ t|t−1 t fn gnmk , cm and γt is the normalization constant given by t|t γt = c n pn . n∈ (5.9) Nd (5.10) Nd where pn = Ωd φn (xt )dxt . The algorithm for propagation of the coefficients is summarized in Alg. 20. 5.2.1 Mean and Covariance The mean and covariance for the PDF p(xt |Yt ) are typically of interest in estimation problems. The expected value in dimension i can be calculated by marginalizing the expansion for the i-th dimension and taking the expected value of the marginalized distribution, i.e. E[xt,i |Yt ] = xt,i p(xt |Yt )dxt Ωd t|t = cn xi φ(xt )dxt . (5.11) n∈Nd 102 Ωd Let xi denote the i-th element of xt . The covariance between xi and xj is given by cov(xi , xj |Yt ) = E[xi xj |Yt ] − E[xi |Yt ]E[xj |Yt ], where the second term is evaluated using (5.11), while the first term can be calculated as E[xi xj |Yt ] = xi xj p(xt |Yt )dxt Ωd t|t = cn xi xj φn (xt )dxt . (5.12) n∈Nd Ωd 5.2.2 Truncation In practice, the infinite series must be truncated to some order N < ∞, in each dimension. In the update step, the order of the series expansion is doubled, in each dimension, due to the multiplication of series. Thus, to keep the order from growing exponentially, the series have to be truncated at each iteration. For simplicity, the truncation is made by keeping the first N terms. It should be noted that the truncation can result in an approximation p̂(x) that takes on negative values, and is hence not a PDF. However the purpose of the approximation is to make inference about the state x, in this sense it is not worse to have e(x) = p̂(x) − p(x) negative than having e(x) positive but merely |e(x)| is of importance, as argued in Sec. 1.4.4. 5.2.3 Computational complexity The expansions of the PDFs p(xt |xt−1 ) and p(yt |xt ) are assumed to be determined beforehand and offline. The online computational cost can be found by counting the flops required in Alg. 20, which gives a total flop demand of f (N, d) = 3N 3d + 4N 2d + N d − 1, (5.13) where is the flop cost of evaluating φn (y). For many basis functions, the coefficients gnmk are zero, except for a few certain values of n and m. This property can reduce the computational complexity substantially (see Sec. 5.4.2 for an example of this, using the Fourier basis functions). 5.3 Parallel implementation The orthogonality of the basis functions allows for the computational load to be well separated in independent segments. Assume that M 103 Algorithm 21 Pseudo code for parallel implementation • CPU m computes (In parallel) t|t−1 t t|t – ck = cn fm gnmk , k ∈ Nm n∈Nm∈N – γt (m) = t+1|t – cn k∈Nm t|t c k pk −1 (m) = γt−1 t (m) = – fm n∈Nd t|t m∈Nm anm cm , n ∈ N bnm φn (yt ), m ∈ Nm • One CPU (Sequentially) M t+1|t t+1|t – ck = ck (m), k ∈ N m=1 M – fkt = m=1 M – γt = fkt (m), k ∈ N γt (m) m=1 processing units are available. With N being a set of cardinality N d , Nm , m = 1, 2, ..., M being disjoint subsets of N of cardinality N d /M M (assumed to be integer) and ∪ Nm = N, pseudo-code of a parallel m=1 implementation is given in in Alg. 21, where the computations have been organized to allow for only one synchronization point per iteration. 5.3.1 Analysis Counting the number of flops in Alg. 
21 that are executed sequentially, f| , and in parallel, f|| , it is found that f| (N, d, M ) = (M − 1)(2N d + 1), f|| (N, d) = 3N 3d + 4N 2d + N d − 1. The sequential portion of the program is thus almost negligible compared to the parallel one, even for small problem sizes and dimensions. The data that have to be communicated between the processors at t+1|t each iteration are the elements of the local variables of ck (m), fkt (m) and γt (m), m = 1, 2, ...M resulting in a total communication demand per iteration b of b(N, d, M ) = M (N d + 1) + N d . 104 (5.14) Further, as mentioned before, only one synchronization point per iteration is required. The parallelization thus possesses a large parallel portion relative the sequential portion, and a small amount of communication and synchronization relative the total amount of performed computations. These properties imply that the method have a high potential of performing well in a parallel environment. 5.4 Numerical Experiments A nonlinear non-Gaussian bearings-only tracking problem is studied. It arises in defense and surveillance applications as well as in robotics. It exhibits a severe non-linearity in the measurement equation and is known to require nonlinear filtering to avoid divergence of the estimate, [3]. For comparison, the filtering problem is solved both with the Fourier basis functions and the Legendre basis functions. Numerical experiments were conducted in order to experimentally verify the error bound derived in the previous section, and also to explore its conservatism. 5.4.1 The system An object traveling along a path is detected within the range xt ∈ [−π, π]. Noisy bearing measurements yt of its position xt are taken by a sensor stationed at a distance d = 1 from the road, see Fig. 5.1. The tracking filter employs the model xt+1 = xt + wt , yt = tan−1 (xt /d) + vt , (5.15) (5.16) where wt is normally distributed with the mean μw = 0 and standard deviation σw = 0.3. The measurement noise vk obeys the multi-modal PDF v−μv1 2 v−μv2 2 p2 p1 −1( ) −1( ) √ e 2 σv1 √ e 2 σv2 , + pv (v) = σv1 2π σv2 2π with p1 = 0.5, p2 = 0.5, σv1 = 0.3, σv2 = 0.3, μv1 = 0.45, μv2 = −0.45. The system was simulated up to time step T = 40. 5.4.2 Solution using Fourier basis functions This section presents a solution of the bearing-only tracking problem obtained by applying the Fourier basis functions [109] 1 N −1 φn (x) = √ einx , |n| ≤ , 2 2π 105 Figure 5.1. An object with position xt traveling along a path. Noisy bearing measurements, Yt , are taken by a sensor (black dot), positioned a distance d from the path. that are orthogonal over the interval [−π, π]. To obtain the basis functions that are orthogonal over an arbitrary interval, a linear transformation of x can be applied as discussed in Sec. 1.3.4. The expected value of the approximated PDF x̂t = E[xt |Yt ] is used as the point estimate of the state. From (5.11) and (5.12), the mean and covariance can be calculated as N/2 E[xt |Yt ] = ct|t n ϕn , n=−N/2 E[(xt − E[xt ]) |Yt ] = [ 2 N/2 n=−N/2 −1 t|t c ϕn ] − E[xt |Yt ]2 , inπ n where ϕn is defined as π ϕn = xφn (x)dx = −π 0 (−1)n+1 √ 2π n i if n = 0, otherwise. Since φn (x)φm (x) = φn+m (x) for the Fourier basis, it follows that gnmk = δ[n+m−k], with δ[·] denoting the multivariate Kronecker delta function. This fact reduces the computational complexity to f (N, d) = 6N 2d + N d − 1. Fig. 5.2 depicts the true state, tangent of the measurement and the estimated state. In Fig. 
5.3, the sequence of estimated PDF:s p(xt |Yt ) using N = 15, is shown for t = 1, 2, ..., 10. For N = 15 the root mean square error R = 0.078 was achieved. For comparison, a bootstrap particle filter (PF) [69] was also implemented for the same problem. Using Np = 150 particles, the minimum root mean square error of R = 0.078 was reached, and did not improve for a larger number of particles. 106 2 x x̂ tan(y) 1.5 1 0.5 0 -0.5 -1 -1.5 -2 0 5 10 15 20 25 30 t Figure 5.2. True state x and estimated state x̂, for time step t. p̂(xt |Yt ) 2.5 2 1.5 1 0.5 0 10 5 t 0 -3 -2 0 -1 1 2 x Figure 5.3. p̂(xt |Yt ) plotted for t = 1, .., 10 107 Table 5.1. Single core execution time. NT Execution time 100 0.0021 300 0.0568 500 0.2625 1000 2.1467 8 N=100 N=300 N=500 N=1000 7 Speed up 6 5 4 3 2 1 1 2 3 4 5 Number of CPUs 6 7 8 Figure 5.4. Speedup plots for different values of N . Linear speedup is marked by the dashed line for reference. 5.4.3 Execution time and speedup Alg. 21 was implemented on a shared memory multicore architecture. The execution time and scalability for different problem sizes NT = N d were studied. Tab. 5.1 shows the execution time for single core execution while Fig. 5.4 depicts the acheived speedup s(M ). The program was written in C++ using OpenMP for parallelization and execution was R performed on a shared memory multicore processor (Quad-core Intel Xeon 5520, Nehalem 2.26 GHz, 8MB cache). Compilation was performed with the pgi compiler and full compiler optimization was used for all compilations. 5.5 Discussion 5.5.1 Estimation accuracy From the experiments it can be concluded that the method performs well for the given problem. The RMSE R = 0.078 is reached for N = 15. The particle filter reaches this RMSE for Np = 150. Using the approximation of 50 flops to evaluate the exponential function, and 10 flops to generate a pseudorandom number, the PF requires about 29 times the flops required by the orthogonal series expansion approach to achieve 108 this estimation accuracy. The PF is though less affected by the curse of dimensionality and the gap in the computational cost thus reduces for problems of higher dimensions. Yet, for low-dimensional problems, there is a significant computational benefit in using the proposed method. 5.5.2 Speedup As can be seen from the speedup plot in Fig. 5.4, the method has good scalability and is suitable for parallelization, which is one of its main strengths. In this particular study, a multicore processor with 8 cores has been used and close to linear speedup is achieved. From Fig. 5.4 and the analysis in Sec. 5.3.1, it can though be expected that the method will scale well for more processors than 8. Further, the method has a good potential of performing well on a computer cluster due to the low interprocessor communication required. 5.5.3 Limitations If p(xt |xt−1 ) and p(yt |xt ) are to be determined offline, domain Ω over which the problem is solved must be small enough relative the variance of p(xt |xt−1 ) and p(yt |xt ). The PDFs p(xt |xt−1 ) and p(yt |xt ) will otherwise appear as “spikes” and will demand an unreasonably high approximation order to produce a good fit. If the expansions for the PDFs p(xt |xt−1 ) and p(yt |xt ) are updated online, this restriction can be dropped. Doing so will, however, require a large amount of online computation and, therefore, reduce the real-time feasibility of the method. 
Similar to most of the estimation techniques, the exponential growth of the computational complexity with the dimension is a limitation that confines the possible applications to relatively low-dimensional ones. 5.6 An error bound One iteraion of the prediction update recursion for the system (5.1), (5.2) can in one line be written as p(yt |xt ) p(xt |Yt ) = p(xt |xt−1 )p(xt−1 |Yt−1 )dxt−1 , t = 1, 2, . . . , p(yt |Yt−1 ) (5.17) where p(xt |Yt ) denotes the probability density for the state xt given the measurements Yt . When solving the RBE problem via orthogonal series expansions the posterior PDF p(xt |Yt ) in (5.17) is approximated by a truncated orthogonal series expansion 109 p(xt |Yt ) ≈ p̂(xt |Yt ) = k∈K t|t ck φk (xt ), where {φk (x)} are the orthogonal basis functions and the coefficients t|t {ck } are recursively computed via the prediction and update equations. Due to the truncation of the expansion an error is introduced at every iteration. It is of interest to study how this error propagates over the iterations, to be able to make sure that the solution obtained maintains a reasonable approximation to the sought PDF. A worst case scenario would be that the approximation errors, due to the truncations of the expansions, would accumulate in such a way that p̂(xt |Yt ) is no longer a meaningful approximation of p(xt |Yt ). This section provides a bound on the 1-norm for the approximation error in the PDF of the state vector conditional on the measurements, i.e. a bound on e(xt |Yt )1 = p(xt |Yt ) − p̂(xt |Yt )1 . The derived bound, although not being sharp, serves as a tool to ensure that the estimated PDF represents a sensible approximation to the true PDF throughout the iterations. When solving the RBE with orthogonal series expansions there is an option of which basis functions to employ. A second investigation performed in this section is a comparison of the method performance in a bearings-only tracking problem being solved with the Fourier and Legendre basis functions. For a function h approximated with a series expansion the truncated approximation and the truncation error is denoted with ĥ and eh respectively, i.e. h(x) = ∞ ck φk (x) = k=0 K k=0 ∞ ck φk (x) + ĥ(x) ck φk (x) . k=K+1 eh (x) For notational tractability the recursion expressed by (5.17) will be written with the notation g t+1 (z) = v(y|z) f (z|x)g t (x)dx, t = 0, 1, . . . , (5.18) Ω where v(y|z), f (z|x) and g t (x) are PDFs. In this notation, the PDF g t (z) corresponds to p(xt |Yt ) in (5.17) and is the main target of the approximation. When solving the recursion with orthogonal basis expansions, the truncated expansions v̂(y|z), fˆ(z|x) and ĝ t (x) are used in place of the true PDFs. It is of interest to know how the error caused by the truncation propagates through the iterations. An expression for the 110 t+1 (z) − ĝ t+1 (z) is therefore sought. Asapproximation error et+1 g (z) = g suming that g(x) has the same approximation order in the x-dimension as f (z|x) does, the following two relations hold in virtue of the orthogonality of the basis functions Ω fˆ(z|x)eg (x)dx = 0, ef (z|x)g(x)dx = ef (z|x)eg (x)dx. 
Ω Ω Then it follows that t+1 ĝ (z) = v̂(y|z) fˆ(z|x)ĝ t (x)dx Ω = v̂(y|z) fˆ(z|x)[g t (x) − etg (x)]dx Ω t ˆ = v̂(y|z) f (z|x)g (x)dx − v̂(y|z) fˆ(z|x)etg (x)dx Ω Ω = [v(y|z) − ev (y|z)] [f (z|x) − ef (z|x)]g t (x)dx Ω t = v(y|z) f (z|x)g (x)dx − v(y|z) ef (z|x)g t (x)dx Ω Ω − ev (y|z) [f (z|x) − ef (z|x)]g t (x)dx Ω = g t+1 (z) − v(y|z) ef (z|x)etg (x)dx Ω t − ev (y|z) f (z|x)g (x)dx + ev (y|z) ef (z|x)etg (x)dx Ω Ω t+1 t = g (z) − [v(y|z) − ev (y|z)] ef (z|x)eg (x)dx Ω − ev (y|z) f (z|x)g t (x)dx. Ω This gives the expression for the approximation error t+1 (z) − ĝ t+1 (z) et+1 g (z) = g = v̂(y|z) ef (z|x)etg (x)dx + ev (y|z) f (z|x)g t (x)dx. (5.19) Ω Ω From (5.19) the following result can be derived: 111 Theorem 1. For etg (z) given by (5.19), it holds that etg (z)1 ≤ γt , t = 0, 1, . . . , where t t Q rt Qt e0g 1 + Rq 1−r if rQ = 1 1−rQ γt = 0 (5.20) eg + tRq if rQ = 1 1 and Q := max y q := max y |v̂(y|z)|dz, |ev (y|z)|dz, r := max|ef (z|x)|, x,z R := maxf (z|x). x,z Proof. The triangle inequality yields t+1 eg (z) = |et+1 g (z)|dz 1 Ω = |v̂(y|z) ef (z|x)etg (x)dx Ω Ω + ev (y|z) f (z|x)g t (x)dx|dz Ω ≤ [|v̂(y|z)| |ef (z|x)||etg (x)|dx Ω Ω + |ev (y|z)| |f (z|x)||g t (x)|dx]dz Ω t ≤ [|v̂(y|z)|r |eg (x)|dx + |ev (y|z)|R |g t (x)|dx]dz Ω Ω = |v̂(y|z)|dz · r |etg (x)|dx + |ev (y|z)|dzR Ω Ω Ω ≤ rQ etg (z)1 + Rq, i.e. t+1 eg (z) ≤ rQ etg (z) + Rq. 1 1 (5.21) The increasing function right hand side in (5.21) is amonotonically in etg (z)1 . An upper bound γt on etg (z)1 hence obeys the recursion γt+1 = rQγt + Rq, whose closed-form expression is given by (5.20). Note that q, Q, r and R in (5.20) only depend on constant quantities that can be computed offline and before the recursion starts. 112 Corollary 1. If rQ ≤ 1, ekg (z)1 is asymptotically bounded from above Rq . by 1−rQ Proof. If rQ < 1 Rq 1 − r t Qt lim γt = lim rt Qt e0g (z)1 + Rq = . t→∞ t→∞ 1 − rQ 1 − rQ (5.22) 5.6.1 Numerical experiments The filtering problem to estimate the PDF p(xt |Yt ) for system (5.15)(5.16) was solved by using the Legendre and Fourier basis functions (see Sec. 1.3.2). The estimated PDFs obtained by the orthogonal series method were cross-validated against the results obtained by applying a particle filter to the same data set, to ensure correct implementation. The filtering problem was solved for the approximation orders N = 9 + 4k, k = 0, 1, . . . , 14. The upper bound γt (N ) on e(xt |Yt )1 was computed according to (5.20) for each N using both the Fourier and Legendre basis while the empirical values of ||e(xt |Yt )||1 were evaluated as e(xt |Yt )1 ≈ Et (N ) = |p̂65 (xt |Yt ) − p̂N (xt |Yt )|dx, xt ∈Ω where p̂N (xt |Yt ) denotes the approximation of p(xt |Yt ) of the approximation order N . As p̂65 (xt |Yt ) can be considered a very close approximation to the true PDF p(xt |Yt ), Et (N ) can be deamed a good approximation to e(xt |Yt )1 . In Fig. 5.5 and Fig. 5.6, the empirical and theoretical bounds Et (N ) and γt (N ) are shown for N = 25, using the Fourier basis and the Legendre basis respectively, where γt (N ) denotes the theoretical bound for an approximation order N . For all N studied, the bound converges to the value given by (5.22) and the value of γt (N ) is basically constant after time step t = 10 in all cases. 
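As a small illustration of how the bound is evaluated in practice, the closed form (5.20) and the limit (5.22) can be computed as in the sketch below; the function and variable names are illustrative, and the constants q, Q, r and R are assumed to have been obtained offline as defined in Theorem 1.

```cpp
// Minimal sketch evaluating the bound of Theorem 1: the closed form (5.20)
// of the recursion gamma_{t+1} = rQ*gamma_t + Rq, and its asymptote (5.22).
#include <cmath>

double gamma_bound(int t, double e0_norm1, double r, double Q,
                   double R, double q)
{
    const double rQ = r * Q;
    if (rQ == 1.0)                        // degenerate case of (5.20)
        return e0_norm1 + t * R * q;
    const double rQt = std::pow(rQ, t);   // (rQ)^t
    return rQt * e0_norm1 + R * q * (1.0 - rQt) / (1.0 - rQ);
}

double gamma_asymptote(double r, double Q, double R, double q)
{
    return R * q / (1.0 - r * Q);         // limit (5.22), valid for rQ < 1
}
```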
To illustrate the empirical and theoretical bounds for each N, the steady-state value γ30(N) and the mean and maximum of the empirical value Et(N),
\mu(N) = \frac{1}{30}\sum_{t=11}^{40} E_t(N), \qquad \rho(N) = \max_{t\in[11,40]} E_t(N),
were computed on the stationary interval t ∈ [11, 40]. The results are shown for the Fourier basis and the Legendre basis in Fig. 5.7 and Fig. 5.8, respectively.

Figure 5.5. Theoretical bound γt(N) and empirically measured values of the approximation error in 1-norm, Et(N), for the solution obtained with the Fourier basis functions and approximation order N = 25.

Figure 5.6. Theoretical bound γt(N) and empirically measured values of the approximation error in 1-norm, Et(N), for the solution obtained with the Legendre basis functions and approximation order N = 25.

Figure 5.7. Theoretical bound γ(N), the mean μ(N) and maximum ρ(N) of the empirically measured values of Et(N), when solving the problem with Legendre basis functions.

Figure 5.8. Theoretical bound γ(N), the mean μ(N) and maximum ρ(N) of the empirically measured values of Et(N), when solving the problem with Fourier basis functions.

Point estimates x̂t = E[xt|Yt] from the approximated PDFs were computed. To compare and quantify the estimation quality, the root mean square error,
E_{rmse}(\hat{x}_{1:T}) = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{x}_t)^2},
was calculated for the estimated states and is shown in Fig. 5.9 for different approximation orders N, for the Fourier and Legendre basis functions. For the particular time instant t = 25, the true PDF p(xt|Yt) and the estimated PDFs p̂(xt|Yt) obtained with the Fourier and Legendre basis functions are shown for N = 9, 25, 33 in Fig. 5.10, Fig. 5.11 and Fig. 5.12, respectively.

Figure 5.9. The root mean square error of the state estimate as a function of the approximation order N.

Figure 5.10. The true PDF p(xt|Yt) and p̂9(xt|Yt) for t = 25, for the Fourier and Legendre solutions, N = 9.

Figure 5.11. The true PDF p(xt|Yt) and p̂25(xt|Yt) for t = 25, for the Fourier and Legendre solutions, N = 25.

Figure 5.12. The true PDF p(xt|Yt) and p̂33(xt|Yt) for t = 25, for the Fourier and Legendre solutions, N = 33.

5.6.2 Discussion
In the studied bearings-only tracking problem, it can be concluded that the Fourier basis functions generally give a better approximation to the problem than the Legendre basis functions do, a phenomenon that is especially prominent for lower approximation orders N. It can be seen that for low N (N = 9, Fig. 5.10), both the Fourier and Legendre basis functions fail to capture the multi-modal shape of the true density. Yet the Fourier-basis solution yields a closer approximation than that of the Legendre functions, measured in the 1-norm of the approximation error. When N is in the medium range (N = 25, Fig.
5.11), the Fourier basis solution gives an almost perfect approximation, while the Legendre functions still show some difficulties in fully capturing the multi-modality of p(xt |Yt ). For high approximation orders (N = 33, Fig. 5.12), both the Legendre and Fourier bases produce close to perfect approximations. However, as can be seen from Fig. 5.9, a better PDF fit does not necessarily translate into a superior point estimate of the state x̂t . The root mean square error for the Fourier and Legendre solutions are practically the same for N ≥ 20, even though the Fourier basis solution provides a better fit of the actual underlying PDF. Another aspect that should be taken into account is the numerical properties of the basis functions. With the Legendre basis functions it is not possible, in the given implementation, to go above N = 65 due to numerical problems, while no numerical problems are encountered using the Fourier basis functions. However, as virtually perfect approximation is reached already for N = 33, it is not an issue with the Legendre basis solution in this case. From Fig. 5.7 and Fig. 5.8, the bound can be seen to be close to tight for some N values, but more conservative for other N values. For the Legendre case, the bound is conservative for small values of N as a 118 consequence of the poorly approximated PDFs p(xt |xt−1 ) and p(yt |xt ) in some intervals. The bound accounts for the worst case effects of this poor approximation, which scenario does not apparently realize in the final estimate, for the particular problem and implementation at hand. In the derivation of the bound the inequality f (z|x)g(x)dx ≤ maxf (z|x) R x,z was used. This relationship holds if f and g are PDFs, but can in some cases to be a rather conservative bound. By imposing assumptions on e.g. the smoothness of f and g, this bound can be tightened and hence bring about an improvement of the final bound. 119 Chapter 6 Orthogonal basis PF 6.1 Introduction Parallelization of the PF, as given in Chapter 4, is as a way of improving its real-time feasibility. In Chapter 4, four different parallel particle filters: the globally distributed particle filter [8], sampling with proportional allocation and sampling with non-proportional allocation [12], as well as the Gaussian particle filter (GPF) [56] were evaluated. It was found that the GPF was the most suitable option for a fast and scalable implementation. The GPF makes the approximation p(xt |Yt ) ≈ γ(xt , μt , Σt ), where γ(x, μt , Σt ) denotes a multivariate Gaussian PDF with mean μt ∈ Rnx and covariance Σt ∈ Rnx ×nx . By this approximation, the information contained in the particle set can be compressed to a few informative coefficients (the mean and covariance), and hence efficiently communicated among the parallel processing units. The Gaussian approximation of a posterior is though a rather restrictive assumption that infringes upon the generality of the nonlinear filtering method. In this chapter, a method to fit a truncated series expansion (k) p(xt+1 |Y t ) ≈ at+1 φk (xt+1 ), (6.1) k∈K to the particle set for some set of basis functions Φ = {φk (x)}k∈ND , where K is the index set of the basis functions included in the expansion, is suggested. In a sense, this method, termed here Orthogonal Basis Particle Filter (OBPF), can be seen as an extension of the GPF, as it reduces to the GPF with Φ chosen to be the Hermit functions basis, and with only the first basis function used in the approximation, i.e. K = 0. 
By this construction the OBPF enjoys the favorable parallelization 121 properties of the GPF, as only some few coefficients {a(k) }k∈K have to be communicated, but abolishes the restriction of the posterior distribution being Gaussian. The problem of fitting a series expansion to a random sample is discussed in Sec. 1.4. As noted there, a useful property of the series expansion estimator is that it possesses a convergence rate that is independent of the dimension of the problem [90]. This is in contrast with most other non-parametric PDF estimators (the kernel density estimator included), whose convergence rate severely deteriorates with increasing dimension [90]. Therefore, the orthogonal series estimator constitutes an appealing option for high-dimensional problems. Further, the series expansion method as well exhibits beneficial interpolation properties so that less particles are required to give an approximation to the posterior for a given accuracy. Modifications of GPF, such as the Gaussian sum particle filter (GSPF) [55], allow non-Gaussian posteriors to be approximated by a Gaussian mixture. Yet, the mixands are required to be refitted frequently if the filter should operate near optimality [6]. This poses a severe obstacle to an efficient parallelization, as the refitting requires access to the global posterior distribution and hence parallelizes poorly. To concretize the proposed method of OBPF and exemplify the developed techniques, the Hermite basis is particularly studied in this chapter as a suitable choice of the orthogonal functional basis for OBPF. Naturally, there is no principal difference to the method with any other orthogonal functions basis employed instead. The chapter is organized as follows. In Sec. 6.2 notation and background material are briefly summarized. The proposed method of OBPF is explained in Sec. 6.3, followed in Sec. 6.4 by its parallelization. Experiments validating the estimation accuracy and speedup obtained on a shared-memory multicore processor are described in Sec. 6.6. 6.2 Background 6.2.1 The PF algorithm with importance sampling The PF solves the recursive estimation problem by providing a weighted (i) (i) sample {xt , wt }N i=1 from the PDF p(xt |Yt ), from which the desired information such as e.g. the minimum mean square error or the maxi(i) mum likelihood point estimate can be extracted. The notation xt shall (i) be interpreted as the i-th particle at time step t and wt as the corresponding weight. The method consists of the three steps performed recursively: prediction, update, and resampling. In the prediction and update steps, the particles are propagated and the weights are updated 122 via the relationships (i) (i) (i) xt+1 = ft (xt , vt ), (6.2) (i) wt+1 (6.3) = (i) (i) wt p(yt+1 |xt+1 ), (i) i = 1, 2, ..., N , where vt ∼ pv (v) and the weights are normalized to sum up to one at the end of the iteration. In the resampling that is included to avoid depletion of efficiency in (i) (i) the particle set [69], a new set of particles {xt , wt }N i=1 is created by making a draw from p̂(xt |Yt ). In general, it is not possible to sample directly from p̂(xt |Yt ). Different methods however exist to achieve this goal, importance sampling being one of them, see e.g. [71], [57]. A (i) (i) weighted sample {xt , wt }N i=1 is then obtained by sampling from some (i) easy-to-sample proposal distribution xt ∼ π(xt ), and computing the corresponding weight via (i) wt ∝ (i) p̂(xt |Yt ) (i) . 
(6.4) π(xt ) It is required that π(x) satisfies the condition p(xt |Yt ) > 0 → π(xt ) > 0, except at a zero measure of points. 6.2.2 Hermite functions basis In the one-dimensional case, the k-th Hermite function is given by (−1)k x2 /2 dk −x2 φk (x) = e , √ e dxk 2k k! π k = 0, 1, . . . For computational purposes, the three-term recurrence relationship √ 2 φ0 (x) = π −1/4 e−x /2 , φ1 (x) = 2xφ0 (x), 2 k−1 xφk−1 (x) − φk−2 (x), φk (x) = k k k = 2, 3, . . . is often exploited. The set {φk (x)}∞ k=0 constitutes an orthogonal basis of L2 (R). For later use, the values of gk = max |φk (x)| and sk = φ4k (x) dx, x R k = 0, 1, ..., 10 are listed in Tab. 6.1. Note that the k-th basis function 2 has the form φk (x) = rk (x)e−x /2 , where rk (x) is a polynomial of degree k, and that the 0-th basis function is a scaled Gaussian PDF. Due 123 Table 6.1. Values of gk and sk for the Hermitian basis functions. k gk sk 0 .75 .40 1 .64 .30 2 .61 .26 3 .59 .23 4 .57 .21 6 .56 .20 7 .55 .19 8 .55 .18 9 .54 .17 to this characteristics, the Hermitian functions provide a suitable basis for approximation of PDFs that have exponentially decaying tails, but could otherwise be a poor option. Actually, it can be shown [13] that if the approximated function exhibits exponential decay, the coefficients {a(k) }K k=0 in (6.1) will also exhibit exponential decay, in the sense that they decrease faster than the reciprocal of any finite order polynomial. Hence a good fit can be expected, in that case, for a low value of the truncation order K. 6.3 The Hermitian Particle Filter The proposed PF method is detailed in this section. For notational brevity, the vectors ϕ(x) = [ φk0 (x) φk1 (x) · · · φkK (x) ]T , K) at = [ at(k0 ) at(k1 ) · · · a(k ]T , t are introduced where the elements of K have been denoted as k0 , k1 , ...., kK . The number of elements in K is thus K + 1. The method follows the regular particle filtering algorithm to obtain a weighted sample from p(xt+1 |Yt ), i.e. the particles are propagated and updated via (6.2), (6.3). The main difference is how the resampling is performed. To resample, a series expansion is fitted to the weighted set (k) p̂(xt+1 |Yt ) = at+1 φk (xt+1 ) = aTt+1 ϕ(xt+1 ), k∈K using the method described in Sec. 1.4., i.e. at+1 = N (i) (i) wt ϕ(xt+1 ). i=1 From the fitted PDF, a new set of particles is drawn by importance sampling, i.e. (i) xt+1 ∼ π(x), (i) wt+1 = 124 (6.5) p̂(xt+1 |Yt ) (i) π(xt+1 ) (i) = |aTt+1 ϕ(xt+1 )| (i) π(xt+1 ) . (6.6) Algorithm 22 Algorithm for one iteration of OBPF. (i) (i) (i) (U) wt = wt−1 p(yt |xt ) (i) (i) (P) xt+1 ∼ p(xt+1 |xt ) (i) (i) (R) ât+1 = N i=1 wt ϕt+1 (xt+1 ) (i) xt+1 ∼ π(x) (i) (i) (i) wt = |âTt+1 ϕt+1 (xt+1 )|/π(xt+1 ) For the Hermite functions, the first basis function is a scaled Gaussian PDF and it is reasonable to take a Gaussian distribution as the proposal distribution in that case, i.e. π(x) = γt (x) := γ(x, μt , Σt ) where μt , and Σt are the mean and covariance of the particle set respectively. The absolute value in (6.6) is inserted since the approximation method does not guarantee a non-negative approximation of the PDF. The steps of the algorithm are summarized in Alg. 22. In the description it is assumed that resampling is carried out at every iteration. To modify for a different resampling scheme is straightforward. Remark 1. Note that during the recursion, the PDF is propagated as p(xt |Yt ) → p(xt+1 |Yt ) → p(xt+1 |Yt+1 ) via the prediction and update equations in (6.2), (6.3). 
The resampling can be carried out at any step of the recursion. In the present formulation, the resampling is performed by making a draw from p(xt+1 |Yt ) because p(xt+1 |Yt ) is typically smoother than p(xt |Yt ) and, therefore, is more suitable for approximation with a series expansion of low order. This can be expected as p(xt+1 |Yt ) is a prediction and, hence, subject to greater uncertainty than p(xt |Yt ) (smoother PDF), or more technically by that p(xt+1 |Yt ) is the outcome of a convolution of two PDFs, while p(xt |Yt ) is a product. 6.4 Parallelization The proposed estimation method is designed for straightforward parallelization. The key to the parallizability of the method is the decoupled (k) way in which the coefficients at can be computed from the local particle sets. Pseudocode for a parallelization is given in Alg. 23, where 125 Figure 6.1. Illustration of the work flow for parallel execution of the algorithm. Algorithm 23 One iteration for parallel implementation. Parallel (processor m do for i ∈ Nm ) (i) (R) xt ∼ π(x) (i) (i) (i) wt−1 = |âTt ϕt (xt )|/π(xt ) (i) (i) (i) (U) wt = wt−1 p(yt |xt ) (i) (i) (P) xt+1 ∼ p(xt+1 |xt ) (m) (m) (i) ât+1 = ât+1 + ϕt+1 (xt+1 ) Sequentially (one processor) (m) 1 M ât+1 = M m=1 ât+1 Nm , m = 1, 2, ..M are disjoint subsets {1, 2, .., N } of cardinality N/M (assumed to be integer). The computations have been organized to allow for only one sequential section, and one synchronization point per iteration. Each processing unit starts with creating a local particle set by resampling from p̂(xt+1 |Yt ) = aTt+1 ϕt+1 (x) (R). The local particle sets are propagated (P), and updated (U) by each processing unit. From the local particle set, each processor will then compute the local estimate â(m) of a that is communicated to a processor forming the global estimate â in a sequential section. The global estimate of â is then communicated back to the processing units which can restart the cycle by sampling from the global estimate of p(xt+2 |Yt+1 ). These execution steps are illustrated in Fig. 6.1. 126 6.4.1 Parallelization properties analysis The simple parallelization scheme of the method facilitates analysis of the parallelization properties, including the required amount of sequential and parallel work as well as interprocessor communication. In the proposed algorithm, the major part of the computational work is carried out in the parallel section, and only a small amount of communication and sequential processing is necessary. This property is crucial to a scalable parallelization, as discussed in Sec. 1.9. To give exact numbers for a general system is of course not possible. Here the numbers for a linear system with uniformly distributed process and measurement noise are given. This can be considered a challenging scenario for the parallelization in the sense that it gives a relatively low amount of computation compared to communication and be viewed as ”close to worst case scenario”. The flops yielding the sequential work q| and the parallel work q|| can then be counted as q| (M, K) ≈ (M − 1)(K + 1), (6.7) q|| (N, K) ≈ (2 + 2fr + fe + 4K)N + 2(K + 1), (6.8) where fr and fe are the number of flops required to generate a random number and to evaluate the exponential function, respectively. Transfering the local estimates a(m) , and μ(m) m = 1, 2, ..., M results in a total communication of κ(K, M ) = (K + 3)(M − 1), (6.9) elements per iteration. 
This is very low amount of communication, and can be considered almost negligible compared to the FLOPs performed. The theoretical speedup one could expect can thus be obtained by computing q| and q|| from (6.7) and (6.8) and substitute p| = q| /(q| + q|| ), p| = q| /(q| + q|| ) into (6.20) together with an estimate of c(M ) that takes the amount of communication (6.9) into consideration. 6.5 Analysis This section provides an analysis of how god the fit of the expansion can be considered to be. Typically when fitting a SE to a random sample, the underlying distribution is assumed completely unknown. The lack of information or assumptions on the target PDF is limiting in the analysis of the goodness of the fit. For the recursive Bayesian estimation problem the underlying PDF is not completely unknown, as it is highly influenced by the known system model (6.12) and (6.13). This information can be used to provide performance measures on the goodness of the fit. 127 In Theorem 3, a bound on the variance of the parameter estimate is given. It is interesting in it own right as it gives a measure of the certainty in the estimate of a specific parameter, but it also gives means of expressing an upper bound, Theorem 7, on the part of the mean integrated square error (MISE) that is due to the random error caused by using a finite number of particles N . It provides practical guidance since if the bound is higher than desired, it indicates that the number of particles should be increased. In Theorem 4, a bound on the absolute value of the coefficients, for the Hermitian basis functions, is given. The bound is decaying and provides means of ensuring that important coefficients (of large magnitude) are not neglected when truncating the expansion and can be used as a tool for selecting a suitable truncation order. The main theorems are stated below with the proofs given in Appendices. The q-norm of a function p(x) : RD → R, is defined as |p(x)|q dx]1/q . ||p(x)||q = [ RD All are assumed to be Riemann integrable so that involved functions | RD f (x)dx| ≤ RD |f (x)|dx is a valid inequality. The supremum of |φk (x)| is denoted gk , i.e. gk = supx |φk (x)|. Remark 2. The k-coefficient is estimated from the particle set as (k) ât+1 = N (i) (i) wt φk (xt+1 ). i=1 (k) By the central limit theorem, ât+1 will be approximately normally dis(k) (k) tributed according to ât+1 ∼ N (at+1 , σt+1 (k)2 ), if N is “large”, where large typically is considered to be N 30, and where the variance σt+1 (k)2 for coefficient k is bounded as given by the following theorem. Theorem 3. The variance σt+1 (k)2 of the estimate of the k-coefficient (k) ât+1 is bounded by either of the two bounds σt+1 (k)2 ≤ gk2 Wt 2 φk (x)4 dx]1/2 ||pv (v)||2 Wt σt+1 (k) ≤ [ RD where Wt = N i=1 128 (i) (wt )2 . (6.10) (6.11) See Appendix 6.C for proof. Which one of (6.10) and (6.11) is tighter depends on the particular basis function that is used, and the 2-norm of pv (v). The factor W −1 ∈ [1, N ] is sometimes termed as the efficiency of the particle set. For an unweighted sample, i.e. w(i) = N −1 , i = 1, 2, ..., N , it holds that W −1 = N , which is the highest possible efficiency. (k) The following theorem provides a bound on |at | decaying with inD creasing values of d=1 kd and applies to systems of the form xt+1 = f (xt ) + vt , yt = h(xt ) + et , (6.12) (6.13) where the process noise vt is assumed to be mutually independent in each dimension, with pv,d (v) denoting the PDF in dimension d. 
∇ denotes the Jacobian of a function and σmin (A) and σmax (A) denote the smallest and largest singular value of a matrix A. Theorem 4. Assume that the system is given by (6.12), (6.13) and that h(x) and f (x) are continuous functions with σmin (∇h(x)) ≥ r, σmax (∇fd (x)) ≤ Rd , d = 1, 2, .., D for all x. Further impose that pe (e) ≤ (k) for e2 ≥ L. Then at for the Hermitian basis functions is bounded (k) (k) where in absolute value as |at | ≤ λ−1 t ηt (k) ηt = me gk D d=1 Ξ(d, q) q (2kd + + gk (6.14) 2i)1/2 i=1 where me is the supremum of pe (e), λt is the normalization constant N (i) λt = wt , q is a positive integer and i=1 Ξ(d, q) = sup |θ|≤ Rd L r R |[ ∂ q −z 2 /2 2 (e pv,d (z − θ))]ez /2 |dz q ∂z (6.15) Proof. See Appendix 6.B. Remark 5. Suppose that it is decided that only coefficients with index k that satisfy |ηk |/|η0 | ≥ Q, for some Q ∈ R, are to be kept. Reorganizing and simplifying the expression |ηk |/|η0 | ≥ Q (see Appendix 6.D for details), it is found that only the coefficients with index k = (k1 , k2 , ..., kD ) satisfying q D (kd + i) < (Q − )−2 (6.16) i d=1i=1 129 have to be considered. Eq. (6.16) implicitly specifies which coefficients shall be kept and which ones can be neglected. In Fig. 6.2, the number of coefficients satisfying (6.16) is shown for the dimension orders, D = 1, 2, .., 10 and Q values Q = {0.15, 0.10, 0.05}. By inspection of the slopes of the curves, the growth in the number of coefficients needed to be computed is of order O(D 2 ) for the different Q values, which value should be compared to the growth of O(K̃ D ) that is obtained if no selection is performed and K̃ coefficients per dimension are kept and computed. The MISE is an important measure of how good the fit is and can be derived as follows. The overall approximation error is given by e(xt+1 ) = p(xt+1 |Yt ) − p̂(xt+1 |Yt ) (k) (k) = at+1 φk (xt+1 ) − ât+1 φk (xt+1 ) k∈ND = (k) (at+1 − k∈K k∈K (k) ât+1 )φk (xt+1 ) + (k) at+1 φk (x), k∈K / er (xt+1 ) eT (xt+1 ) where er is the random error caused by the uncertainty in the estimate (k) of at due to the finite number of particles N , and eT is the truncation error caused by neglecting the coefficients with index k ∈ / K. By Parseval’s identity (following from the fact that the functional basis is orthogonal), the mean integrated square error (MISE) is given by V (t) = E[ e(xt )2 dxt ] = RD (k) (k) (k) = E[ (at − ât )2 + (at )2 ] k∈K = k∈K 2 σt (k) + Vr (t) k∈K / k∈K / (k) 2 (at ) , (6.17) VT (t) where Vr is the random MISE due to the finite number of particles, and VT is the MISE caused by truncation of the expansion. Note that as σt (k)2 → 0 when N → ∞, Vr → 0 as N → ∞ and the MISE of the estimated expansion converges to the MISE of the true truncated expansion, i.e. VT . Remark 6. Consider the scalar case. By inspection of (6.17), it can be noted that if σt2 (k) does not decay more rapidly than 1/k, the truncation actually is necessary to avoid divergence of the MISE. The fit does hence 130 10 4 Q=0.15 Q=0.10 L 10 3 Q=0.05 10 2 10 1 10 0 10 0 10 1 D Figure 6.2. The number of coefficients L, that have to be computed versus dimension D, for different Q-values in (6.16). not necessarily improve as more coefficients are included in the truncated expansion. This is intuitive as it is impossible to estimate an infinite number of parameters to an arbitrary accuracy from a finite data set. Theorem 7. The term Vr is bounded by either of the two inequalities Vr (t) ≤ Wt−1 gk2 , k∈K [ Vr (t) ≤ Wt−1 ||pv (v)||2 k∈K RD φk (x)4 dx]1/2 . (6.18) Proof. 
This follows immediately by applying Theorem 3 to the term Vr in (6.17). By inserting inequality (6.14) into the expression for VT in (6.17), an upper bound for the MISE caused by the truncation is given by VT (t) ≤ λ−2 ηk2 . (6.19) t k∈K / 6.6 Computational Experiments The main purpose of constructing the nonlinear estimation algorithm described above is to achieve parallelizability of its computer implementation. As pointed out before, the method that enjoys similar to the 131 proposed method parallelizability properties is the GPF. Therefore, the proposed method is tested against the GPF for comparison, to highlight the benefits of not being restricted to a Gaussian posterior. The estimation accuracy and the parallelizability are investigated in the following subsections. For brevity, the proposed method will be referred to as the Hermitian Particle Filter (HPF) in this section, to indicate the selected orthogonal basis. 6.6.1 System model To illustrate the method the simple nonlinear system xt + vt , |xt | + 1 y t = xt + e t , xt+1 = where vt ∈ R and et ∈ R are mutually independent, white noise sequences. The measurement noise, et , follows a Gaussian distribution with standard deviation σe = 0.1 while vt is non Gaussian, with the multimodal PDF (x−1)2 (x+1)2 1 pv (v) = √ (e 2σv2 + e 2σv2 ), 2 2πσv where σv = 1. The system was simulated for t = 0, 1, 2..., T , where T = 100, with the initial condition x0 = 0. 6.6.2 Estimation accuracy Eq. (6.14) was used to compute an upper bound √ on the coefficients. For the given system, R = r = 1 and me = 1/ 2π0.12 . The threshold = 0.01 was chosen, yielding L ≈ 0.34. Evaluating (6.15) for q = 2 then yields 2 ∂ q (e−x /2 pv,d (x − θ)) 2 Ξ(q, d) = sup |(ex /2 |dx ≈ 5, ∂xq |θ|≤0.34 R where the optimization problem was solved by griding over θ and computing the integral numerically. The upper bounds for the absolute (k) values of at , k = 0, 1, 2, ... are then given by (6.14) as √ 5gk / 2π0.12 (k) −1 |at | ≤ λt . [(2k + 2)(2k + 4)]1/2 132 True HPF GPF 0.2 p(xt |Yt ) 0.15 0.1 0.05 0 -4 -2 0 2 4 6 8 10 12 x Figure 6.3. The true PDF p(xt+1 |Yt ) and estimated PDFs pGP F (xt+1 |Yt ), pHP F (xt+1 |Yt ) from the Gaussian and Hermitian particle filters respectively at time step t = 60. The values of gk are provided in Tab. 6.1. In Fig. 6.4, the absolute values of the series coefficients with respect to bound (6.14) are shown for time instant t = 25. Selecting the value Q = 0.1, i.e. only considering the coefficients that are potentially larger than 0.1η0 , (6.16) states that K = 9 is the required truncation order. In Fig. 6.3, the true PDF p(xt+1 |Yt ) and the approximated PDFs pHP F (xt+1 |Yt ) and pGP F (xt+1 |Yt ) obtained from the HPF (with K = 9) and GPF, respectively, are shown for time instant t = 60, using N = 800 particles. The “true” PDF has been obtained by executing a regular bootstrapping particle filter with 106 particles and applying a kernel density approximation method [97] to the obtained particle set. Inserting the value ||pv (v)||2 ≈ 0.16, and the values of sk = [ RD φk (x)4 dx]1/2 given in Tab. 6.1 into (6.18), the variance for each coefficient and time step was computed. In Fig. 6.5, the coefficients with the corresponding upper bound 95% confidence intervals, computed from (6.18), are shown for time step t = 20. 
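To make the fitting step of the method concrete, the following minimal one-dimensional sketch estimates the coefficients from a weighted particle set as in Remark 2 and reconstructs the density as the truncated sum over the normalized Hermite functions. It is written in Python with NumPy; the experiments reported in this section were produced by a C++ implementation, and all names, the toy particle cloud, and the truncation order used here are illustrative.

```python
import numpy as np
from math import factorial, pi, sqrt

def hermite_function(k, x):
    """Normalized Hermite function phi_k(x), orthonormal on the real line."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    Hk = np.polynomial.hermite.hermval(x, coeffs)          # physicists' H_k(x)
    return Hk * np.exp(-x**2 / 2) / sqrt(2.0**k * factorial(k) * sqrt(pi))

def fit_expansion(particles, weights, K):
    """Coefficient estimates a^(k) = sum_i w_i phi_k(x_i), k = 0..K (cf. Remark 2)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                        # weights normalized to sum to one
    return np.array([np.sum(w * hermite_function(k, particles)) for k in range(K + 1)])

def evaluate_expansion(coeffs, x):
    """Reconstruct p_hat(x) = sum_k a^(k) phi_k(x) on a grid."""
    return sum(a * hermite_function(k, x) for k, a in enumerate(coeffs))

# Toy check: fit a bimodal particle cloud with a K = 9 truncation.
rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(-1, 0.5, 400), rng.normal(1, 0.5, 400)])
a_hat = fit_expansion(xs, np.ones_like(xs), K=9)
grid = np.linspace(-4, 4, 9)
print(np.round(evaluate_expansion(a_hat, grid), 3))
```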
6.6.3 Execution time and speedup To evaluate the performance of the method in terms of the speedup obtained when executed on a shared memory multicore computer, the method was implemented in c++ and run on a AMD Opteron 6220 processor (3.3 GHz, 8 cores, 16 MB cache). Compilation was performed using the pgi compiler, with full optimization for execution speed. 133 |a(k)| 1 t ηt 0.8 0.6 0.4 0.2 0 0 2 4 6 8 10 12 k (k) Figure 6.4. Absolute value of the coefficients at , k = 0, 1, ..12 are shown as stems and the dashed line shows the upper bound computed from (6.14) for time step t = 25. 0.6 0.5 0.4 a (k) t 0.3 0.2 0.1 0 -0.1 -0.2 0 2 4 6 8 10 12 14 k (k) Figure 6.5. Estimated coefficients at , with upper bound 95 % confidence intervals marked, k = 1, 2, ..13 at time step t = 20. 134 8 2 N=10 7 3 N=10 4 N=10 6 5 N=10 N=106 Speed up 5 4 3 2 1 0 1 2 3 4 5 Number of CPUs 6 7 8 Figure 6.6. Speedup curves for execution on a shared-memory multicore. OpenMP [1] was used for parallelization. The achieved speedup is shown in Fig. 6.6, for different problem sizes. On the machine under consideration, empirical testing shows that fr ≈ 10, fe ≈ 40, are reasonable approximations for the terms in Eq. (6.8) and that it takes two hundred CPU cycles to communicate an element to the RAM and about 1000 CPU cycles for overhead such as synchronization, thread startup etc. This results in the overhead term of c(M, K) = 200κ(K, M ) + 1000. Inserting this into (1.64) yields a theoretical estimate of the speedup, for K = 9, of s(M, N ) = M 4M + 65N + 4 . 1004M 2 − 504M + 65N + 8 (6.20) The theoretical speedup curves (6.20) are plotted in Fig. 6.7. Though being a bit optimistic, the curves resemble the experimentally obtained ones quite well. Exact figures can of course not be expected from (6.20), but the expression serves as a guideline for the expected speedup for a given problem size and number of processors employed. The obtained speedup values compare well with the ones obtained for the GPF in Chapter 4. For N ≥ 104 , close-to-linear speedup in the number of cores used is reached. For N = 100, the benefit of using parallelization is low, and actually a slow down can be observed for M > 4. This is due to the fact that the overhead c(M ) constitutes a disproportionally large part of the execution time. Using more than 104 particles is actually redundant in this particular example, as the estimation accuracy does not improve with more particles than that. However, lager numbers of particles are 135 8 N=102 N=103 7 N=104 N=105 6 6 N=10 Speedup 5 4 3 2 1 0 1 2 3 4 5 Number of CPUs 6 7 8 Figure 6.7. Theoretical speedup curves computed from (6.20). Table 6.2. Single core execution time for one iteration N T [s] 102 0.00047 103 0.0012 104 0.0054 105 0.071 106 0.52 relevant when it comes to speedup evaluation. The obtained speedup curves are for hardware that executes on a relatively high clock frequency (3.3 GHz). In low-power applications, such as e.g. communication, the clock frequencies are typically much lower, and hence a better scalability can be expected for smaller problem sizes. The small amount of communication makes the approach suitable for computer cluster execution which can be relevant for large problem sizes. The method is demonstrated for a one-dimensional example here, it has though been tested for other systems as well with up to eight dimensions and proved to perform well. 
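For reference, the guideline expression (6.20) can be tabulated with a few lines of code. The sketch below reads (6.20) as s(M, N) = M(4M + 65N + 4)/(1004M^2 − 504M + 65N + 8), with the hardware constants quoted above already folded into the coefficients, and prints the predicted speedup for a few of the particle counts used in Fig. 6.7. It is only a calculator under that reading of the formula, not the benchmarked C++/OpenMP code.

```python
def predicted_speedup(M, N):
    """Theoretical speedup of (6.20) for M cores and N particles,
    with the overhead model c(M) = 200*kappa + 1000 baked into the constants."""
    return M * (4 * M + 65 * N + 4) / (1004 * M**2 - 504 * M + 65 * N + 8)

for N in (10**2, 10**4, 10**6):
    row = [round(predicted_speedup(M, N), 2) for M in range(1, 9)]
    print(f"N = {N:>7}: {row}")
```

For N = 10^4 and M = 8 this evaluates to roughly 7.3, while for N = 100 the predicted speedup drops below one for more than a handful of cores, in line with the measured curves.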
In Chapter 8 the method is evaluated at a five dimensional estimation problem regarding parameter estimation in a PK/PD model for closed loop anesthesia. 6.A Appendix for Chapter 6 It is frequently utilized in the derivations that for a PDF p(x), x ∈ RD it holds that p(x)dx = 1, and p(x) ≥ 0. RD 136 Lemma 8. Let π(x) : RD → R that satisfies |π(x)| ≤ γ, x ∈ RD , then | RD π(x)p(x)dx| ≤ γ. Proof. π(x)p(x)dx| ≤ | RD |π(x)||p(x)|dx ≤ γ RD |p(x)|dx = γ RD Remark 9. Lem. 8 immediately implies that if p(x) ≤ q for some finite q ∈ R then RD p(x)2 dx ≤ q, i.e. p(x) ∈ L2 (RD ), and is hence possible to approximate with an orthogonal series expansion. Lemma 10. The 2-norm of p(xt+1 |Yt ) is less than the 2-norm of pv (v), i.e. ||p(xt+1 |Yt )||2 ≤ ||pv (v)||2 . Proof. ||p(xt+1 |Yt )||22 RD = [ RD = RD RD RD ≤ pv (xt+1 − f (xt ))p(xt |Yt )dxt ]2 dxt+1 RD pv (xt+1 − f (ξ 1 ))p(ξ 1 |Yt )dξ 1 × pv (xt+1 − f (ξ 2 ))p(ξ 2 |Yt )dξ 2 dxt+1 = RD p(xt+1 |Yt )2 dxt+1 = RD RD pv (xt+1 − f (ξ 1 ))pv (xt+1 − f (ξ 2 ))dxt+1 × p(ξ 1 |Yt )p(ξ 2 |Yt )dξ 1 dξ 2 ||pv (v)||22 p(ξ 1 |Yt )p(ξ 2 |Yt )dξ 1 dξ 2 = ||pv (v)||22 p(ξ 1 |Yt )dξ 1 p(ξ 2 |Yt )dξ 2 = ||pv (v)||22 , RD RD RD RD where the inequality holds since the inner integral satisfies 137 R D RD pv (xt+1 − f (ξ 1 ))pv (xt+1 − f (ξ 2 ))dxt+1 = pv (xt+1 − f (ξ 1 ) + f (ξ 2 ))pv (xt+1 )dxt+1 ≤ pv (xt+1 )pv (xt+1 )dxt+1 = ||pv (v)||22 RD by the fact that the autocovariance, f (x − τ )f (x)dx, R(τ ) = RD for any real-valued function f (x) satisfies R(τ ) ≤ R(0). Lemma 11. Let φx (x) denote the k-th Hermitian function. ∂q −z 2 /2 f (z))]ez 2 /2 | ≤ Kez 2 /2 for some constant K < ∞, then If |[ ∂z q (e the following equality is valid 1 ∂q 2 2 φk (z)f (z)dz = q φk+q (z)[ q (e−z /2 f (z))]ez /2 dz. ∂z R (2k + 2i)1/2 R i=1 Proof. This follows by repeated partial integration (q times), see e.g. [13]. 6.B Proof of Theorem 4 Denote with x0 the value that satisfies h(x0 ) = y. By the assumption σmin (∇h(x)) ≥ r it follows from the mean value theorem for vector valued functions that h(x0 ) − h(x − x0 )2 = y − h(x − x0 )2 ≥ r x − x0 2 , which by the requirement pe (e) ≤ for e2 ≥ L implies that pe (h(x0 ) − h(x − x0 )) ≤ if x ∈ Ω̄ := {x| x − x0 2 ≥ L/r}}. For notational brewity, the following notation will be used z = xt , x = xt−1 (k) and let z̃ = z − μ, where μ = f (x0 ). Splitting the computation of at over Ω and Ω̄ gives (k) φk (z̃)p(z|Yt−1 )dz| = |at | = | D R = λ−1 φk (z̃)pv (z − f (x))dzpe (y − h(x))p(x|Yt−2 )dx| t | D D R R = λ−1 φk (z̃)pv (z − f (x))dzpe (y − h(x))p(x|Yt−2 )dx t | D Ω R + φk (z̃)pv (z − f (x))dzpe (y − h(x))p(x|Yt−2 )dx|. Ω 138 RD Denote the first term and second term T1 and T2 respectively. Now if the process noise is mutually independent in each dimension, i.e. pv (w) = pv1 (w1 ) · ... · pvd (wd ), the first term can bounded as D |T1 | = | Ω d=1 R φkd (z̃d )pvd (zd − fd (x))dzd pe (y − g(x))p(x|Yt−2 )dx| D ≤ me | φkd (z̃d )pvd (zd − fd (x))dzd |p(x|Yt−2 )dx Ω d=1 ≤ me R ξp(x|Yt−2 )dx ≤ me ξ, Ω where ξ = sup D | R φkd (z̃d )pvd (zd − fd (x))dzd |. To get an upper x∈Ωd=1 bound the supremum can be taken on eachfactor independently in the product. Further, for each factor, ξd = sup| φkd (z̃d )pvd (zd − fd (x))dzd |, x∈Ω in this product the following bound can be given. 
ξd = sup| φkd (z̃d )pvd (z̃d − (fd (x) − μd ))dzd | sup | φkd (z̃d )pvd (z̃d − θ)dzd | x∈Ω ≤ R |θ|≤RD L/r ≤ g kd R 1 q sup (2kd + 2i)1/2 |θ|≤RD L/r R |[ ∂ q −z 2 /2 2 (e pv (z̃d − θ))]ez /2 |dz̃d , q ∂z i=1 the first inequality is a consequence of the assumption σmax (∇fd (x)) ≤ Rd which implies |fd (x) − μd | ≤ Rd L/r over Ω and the second inequality D Ξ(d,q) follows from Lem. 11. Hence ξ ≤ g kd , where Ξ(d, q) is q (2kd +2i)1/2 d=1 i=1 defined according to (6.15) which gives the first term in (6.14). The second term T2 can be bounded as φk (z̃)pvd (z − fd (x))dzpe (y − g(x))p(x|Y)dx| |T2 | = | Ω RD ≤ |φk (z̃)|pvd (z − fd (x))dzp(x|Y)dx Ω RD ≤ gk p(x|Y)dx ≤ hk , Ω which gives the resulting inequality (6.14). 139 6.C Proof of Theorem 3 The variance σt+1 (k)2 of the point estimate coefficient k is given by (k) σt+1 (k)2 = V[ât+1 ] = V[ N (i) (i) wt φk (xt+1 )] = V[φk (xt+1 )] i=1 N (i) (wt )2 . i=1 V[φk (Xt+1 )] is bounded by V[φk (Xt+1 )] = E[φk (Xt+1 )2 ] − E[φk (Xt+1 )]2 ≤ E[φk (Xt+1 )2 ], which in turn can be bounded by the smallest of either E[φk (Xt+1 )2 ] = φ2k (xt+1 )p(xt+1 |Yt )dxt+1 ≤ gk2 , RD or E[φk (Xt+1 ) ] = 2 RD ≤[ ≤[ R φk (xt+1 )2 p(xt+1 |Yt )dxt+1 φk (xt+1 )4 dxt+1 p(xt+1 |Yt )2 dxt+1 ]1/2 RD D RD φk (xt+1 )4 dxt+1 ]1/2 ||pv (v)||2 , where the first inequality follows from Cauchy-Schwartz inequality and the second one is implied by Lem. 10. 6.D Dervation of Eq. (6.16) ηk /η0 ≤ Q ⇔ ηk /η0 ≤ (Q − ) + ⇐ (hk − (Q − )h0 ) ⇔ ηk /η0 ≤ (Q − ) + η0 ηk − hk ≤ (Q − )(η0 − h0 ) ⇔ me hk D d=1 Ξ(q, d) q (2kd + 2i)1/2 D Ξ(q, d) ≤ (Q − )me h0 ⇐ q d=1 (2i)1/2 i=1 i=1 (Q − )−1 ≤ (Q − )−2 ≤ q D d=1i=1 q D d=1i=1 140 ( 2kd + 2i 1/2 ) ⇔ 2i (kd + i) . i Chapter 7 Anomaly detection 7.1 Introduction Anomaly detection refers to detecting patterns in a given data set that do not conform to a well-defined normal behavior [17]. It serves as an automatic method to detect system abnormalities or faults that potentially require a counteraction. Provided the system under consideration can be appropriately modeled, a plethora of model-based methods can be applied [118], [122], [45]. However, in many cases, the operation principles of the system are not sufficiently known to constitute the basis of a first-principles model. Further, for non-linear and non-Gaussian systems, the computational burden of estimating a black-box model from data can be prohibitively high or the exogenous excitation be insufficient. In this chapter, a non-parametric and (analytical) model-free method for anomaly detection, applicable to systems observed via trajectorial data, is developed. The method is computationally light, applies to non-linear as well as non-smooth systems. The basic idea of the method can be outlined as follows. Assume that a system S follows a normal (or reference), possibly vector-valued, trajectory r(τ ), where τ ∈ R is a function of time t and/or the system state vector x(t), i.e. τ = c(t, x(t)). (7.1) The right-hand side expression in (7.1) will be referred to as the context function. For a given τ , x(t) should thus ideally be equal to some reference state xr (τ ), but being subject to disturbances and system uncertainty, x can be considered as a random variable X characterized by the distribution G(τ ). 141 Assume that a set of N observed repeated system trajectory realizations from S, Γ = {γ1 , γ2 , .., γN }, is available where # $ (i) (t(i) ) . . . 
x(i) (t(i) ) γi = x(i) (t(i) (7.2) ) x n i 1 2 (i) denotes the i-th realization and x(i) (tj ), 1 ≤ j ≤ ni , are the state values at ni different, possibly non-uniformly sampled, time instants, (i) (i) (i) t1 < t2 < ... < tni . Now consider a realization γ0 ∈ Γ. It is sought to determine whether or not γ0 is produced by the same system (i.e. S) as the set Γ. From the data in Γ, the distribution G(τ ) and the corresponding probability density function (PDF) fX (τ, x) can be estimated. To statistically test whether or not γ0 differs from the data in Γ, an outlier test can be performed on γ0 w.r.t. G. Depending on the degree of outlyingness, the hypothesis of γ0 being produced by S can be either rejected or accepted. The parts comprising the method are constructed using tools from probability theory which are discussed in Sec. 1.4. The developed method is applied to a set containing vessel traffic data from the English channel, with the aim to find deviating activities. Due to the large number of objects in the scene and the need of prompt response to detected abnormalities, computationally demanding algorithms are not practically feasible. Further, the method is applied to eye-tracking data, with the aim to detect individual differences in the oculomotor system, in particular those that are caused by Parkinson’s disease. The oculomotor system is inherently difficult to model due to its complex nature and model-free methods are therefore highly relevant. Promising results for both applications are obtained. There are methods bearing similarities to the one developed here. An extensive survey of anomaly detection methods can be found in e.g. [17]. However, the present work provides a generalization of the existing approaches that brings about significant refinements and addresses some of the shortcomings of the existing algorithms: A typical problem in trajectory anomaly detection is that the trajectories can be of unequal length and unaligned. In this chapter, the problem is implicitly solved by introducing the concept of a context function that provides a framework to cope with irregular sampled data sets in a systematic and robust manner. The idea of constructing a confidence volume in the state space to which the system trajectories should be confined under normal operation is advocated in [16]. However, the use of rectangular boxes in that approach can be seen as a histogram approximation of the underlying PDFs. The algorithm proposed in this chapter is based on confidence regions, thus yielding a more flexible and less data demanding method. This can be of particular importance for higher dimensional problems. 142 An envelope of non-anomalous states is proposed in [51]. Each dimension is though treated separately and the method cannot hence account for the possible (and probable) correlations between features in different dimensions. Neither does it enable handling non-uniformly sampled trajectories. In [44], the behavior of the system is also learned by fitting PDFs to the training data set. Again, the lack of a context function, or a similar tool, complicates dealing with non-uniformly sampled trajectories and trajectories of unequal length. Further, only Gaussian distributions, or at least parametric ones, can be considered in the described framework. The chapter is composed as follows. First, in Sec. 7.3 through Sec. 7.3.2, the individual steps comprising the algorithm are presented. In Sec. 7.3.4, the steps are brought together and the complete algorithm is summarized. 
Applications to vessel traffic data and eye-tracking data are presented in Sec. 7.4 and the conclusions are drawn in Sec. 7.6. 7.2 Notation Let Z = {zi }ni=0 be a set of discrete points. The function that linearly interpolates a curve between consecutive points in Z is defined as l(ω, Z) = (zω − zω )(ω − ω) + zω , 0 ≤ ω ≤ n, where · and · are the ceiling and floor function, respectively. A linearly interpolated trajectory obtained from the discrete trajectory is denoted with an overline Z(ω) = l(ω, Z) N (μ, Σ) is the normal distribution with the mean μ ∈ Rd and the covariance Σ ∈ Rd×d . Pr(A) is the probability of the random event A. For a random variable X, the probability density function (PDF) is denoted fX (x). Vectors and vector valued functions are written in bold. 7.3 The anomaly detection method The idea of the method is to learn the normal behavior of the system from data by statistically quantifying its characteristics in a way that allows for computationally light and (analytical) model-free anomaly detection. To provide an overview of the adopted approach, the elemental steps of it are illustrated by Fig. 7.1, for a three-dimensional state space example. There, Fig. 7.1a shows a set of system trajectories obtained under normal operation. In the first step (Fig. 7.1b), the reference trajectory r(τ ) (in red) describing a ”mean” normal behavior is calculated 143 2 1.5 1 x3 0.5 2 0 1.5 8 −0.5 1 6 −1 0.5 4 −1.5 x3 2 −2 0 7 6 5 4 3 2 1 0 −1 0 8 −0.5 x1 −2 6 −1 x2 4 −1.5 2 −2 (a) A set trajectories Γ, in a 3dimensional state space, produced by the system under normal operation. 0 7 6 5 4 3 2 1 0 −1 x1 −2 x2 (b) The either known or fitted reference trajectory r(τ ). 2 τ 1.5 2 1 1.5 0.5 x3 1 0 8 0.5 x3 −0.5 0 8 −0.5 6 −1 6 5 4 3 2 1 0 −1 −2 0 7 0 7 2 −2 2 −2 4 −1.5 4 −1.5 6 −1 x1 x2 (c) A the knot points r(τk ), k = 1, 2, .., n, the PDFs fˆX,k (x) are estimated, and confidence regions are computed, marked here with ellipses. 6 5 4 3 2 1 0 −1 −2 (d) The continuous density function fˆX (τ, x), is obtained by interpolation. The confidence volume is given by the interior to the shown tube, of which the system should be confined under normal operation. Figure 7.1. The steps in constructing the confidence volumes. 144 x1 x2 and discrete knot points for the analysis of further trajectorial data are established (black dots). In the second step (Fig. 7.1c), PDFs are fitted for discrete knot points on the reference trajectory and confidence regions are computed. In the third step (Fig. 7.1d), a confidence volume is constructed by interpolation of the computed confidence regions at the knot points. This defines a domain (volume) in the state space to which the system trajectories are confined under normal operation, according to the given data. The details of the steps are explained and discussed in the subsections below. 7.3.1 Context function and reference trajectory To determine the context in which a certain system behavior is expected is a major problem in anomaly detection. For this purpose, the notions of context function and reference trajectory are introduced. First define a function g that maps the system state x(t) and time t to s(t) that is the variable determining the behavior of the system s(t) = g(t, x(t)). For instance, the behavior of a ground vehicle traveling along routes and roads depends on its position p. At a certain position on the route, the vehicle is expected to have a certain speed, position and heading. 
Hence, the choice g(t, x(t)) = p(t) is a natural one in that case. This kind of systems is exemplified in Sec. 7.4 by the vessel tracking application. Quite often, a temporal function g is a natural choice. For systems that follow a reference signal that is a function of time, as in the eye tracking application described in Sec. 7.4, the function g can be specified as g(t, x(t)) = t. The reference trajectory is introduced to aid the computation of a scalar value, τ determining the expected behavior under normal operation, and is constructed in the following way. Define ξi as the trajectory γi under the mapping of g, i.e. (i) (i) i ξi = {g(tj , γj )}nj=1 and a set of such trajectories Ξ = {ξi }. To fit a trajectory to the points in Ξ, a curve fitting method can be utilized, see e.g. [26]. However, it can be impractical in higher-dimensional problems and for non-smooth trajectories. A simple though expedient method to fit the trajectory is offered by the following procedure: 145 • Pick any trajectory from Ξ and denote it ξk . (k) • For each element sj , j = 1, 2, .., nk in ξk , find the point ai in (k) trajectory ξ i that minimizes ai − sj 2 , where · 2 stands for the Euclidean norm. The set Aj = {ai }N i=1 will contain the points (k) in Ξ closest to sj in the context space. • The j-th point, rj , in the reference trajectory is then calculated as the mean of the points in Aj , and the continuous reference trajeck tory is given by r(τ ) = l(τ, {rj }nj=1 ). From the reference trajectory, the context function returning the context value τ ∗ , for a given time and state, is then defined as τ ∗ = c(t, x(t)) := arg inf r(τ ) − g(t, x(t))2 . τ Evaluating τi∗ = c(ti , x(0) (ti )), in practice, can be accomplished by finding the line segment dj = l(θ, {rj , rj+1 }), 1 ≤ j ≤ n − 1 that is closest to the given one, and on that line segment evaluate σ= rj+1 − rj (x0 − rj ) ||rj+1 − rj ||2 (7.3) obtaining τ ∗ from τ ∗ = j +σ. The context function will thus map a given state xt to a scalar value τ ∗ , and for that given τ ∗ a certain behavior of the system can be expected. 7.3.2 Probability density function For a given context value τ ∗ , the state of the system should ideally be given by some reference state xr . However, due to system uncertainty and disturbances, the state can be considered as a stochastic variable X with the distribution G(τ ∗ ) and the corresponding PDF fX (τ ∗ , x). The PDF fX (τ ∗ , x) can be estimated from the training data set Γ for a discrete set of values {τk }nk=1 . Let Tk be the set of points in the trajectories x̄i , i = 1, 2, ..., N , that have context value equal to τk , i.e. Tk = {x̄i (t)|τk = c(t, x̄i (t)), i = 1, 2, .., N }. A PDF fˆX,k (x) can then be fitted to the data for each k. If the sample is known to come from a parametric family of distributions, a parametric estimator is a natural choice. In absence of a priori information about the distribution from which the sample is generated, non-parametric methods have to be employed to find fˆX,k (x). Histogram estimators, 146 kernel estimators, and orthogonal basis function estimators are typical examples of non-parametric techniques as discussed in Sec. 1.4. For the applications treated in Sec 7.4, a Gaussian approximation and an orthogonal series approximation are used. To handle continuos values of τ in the anomaly detection method, a piecewise linear continuous approximation of fX (τ, x) interpolating in between the discrete points {τk }nk=1 is computed fˆX (τ, x) = l(τ, {fˆX,k (x)}nk=1 ). 
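Evaluating the context function of Sec. 7.3.1 amounts to projecting the mapped state g(t, x(t)) onto the piecewise-linear reference trajectory and returning τ* = j + σ with σ as in (7.3). A minimal sketch of this projection is given below in Python; the names are illustrative, the segment index is zero-based, and the projection is clipped to each segment, which is a practical safeguard rather than part of the stated formula.

```python
import numpy as np

def context_value(r_knots, s):
    """Project a context-space point s onto the polyline r_1,...,r_n
    and return tau* = j + sigma, cf. Eq. (7.3).  r_knots: (n, d) array."""
    r = np.asarray(r_knots, dtype=float)
    s = np.asarray(s, dtype=float)
    best_tau, best_dist = 0.0, np.inf
    for j in range(len(r) - 1):
        seg = r[j + 1] - r[j]
        # Fractional position of the orthogonal projection, clipped to the segment.
        sigma = np.clip(np.dot(seg, s - r[j]) / np.dot(seg, seg), 0.0, 1.0)
        closest = r[j] + sigma * seg
        dist = np.linalg.norm(s - closest)
        if dist < best_dist:
            best_tau, best_dist = j + sigma, dist
    return best_tau

# Toy reference trajectory in a two-dimensional context space.
reference = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
print(context_value(reference, [1.4, 0.2]))   # a value between 1 and 2
```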
As proved in Appendix 7.A.2, fˆX (τ, x) is a PDF for a given τ as it is non-negative and integrates to 1. 7.3.3 Outlier detection To determine whether the system state is anomalous, outlier detection (see Sec. 1.4) is applied, aided by the fitted PDF fˆX (τ, x). A p-value is then computed, specifying how unlikely the observation is under the null hypothesis, and used to classify if the observation is anomalous or not. The p value can be computed from Eq. (1.20). For an estimated PDF that belongs to a parametric family of distributions, an analytic expression for Eq. (1.20) can often be determined. For example, if X ∼ N (μ, Σ), (1.20) is given by p(x0 ) = χ2d ((x0 − μ)T Σ−1 (x0 − μ)), (7.4) where χ2d (z) is the z-quantile function for the χ2 distribution with d degrees of freedom. For a non-parametric estimator the outlier test stated it is generally not possible to evaluate (1.20) analytically and numerical methods have to be employed. A brief discussion of numerical evaluation of (1.20) is given in Appendix 7.A.1. 7.3.4 Anomaly detection method recapitulated Assume that a data set Γ comprising system trajectories arising from tracking of r by system S is given. To determine whether γ0 , defined by (7.2), is likely to be generated by similar mechanisms as were the data in Γ, the steps that should be performed are given in Alg. 24. The actual implementation of the last step depends on the purpose of anomaly detection. If the aim is to make an immediate detection of a fault, a warning should be raised directly when an observation achieves a p-value below the threshold. To scrutinize a trajectory, the cumulative anomaly score over the whole trajectory can be studied. The proposed method only provides the ”raw” anomaly values. There are several possible ways 147 Algorithm 24 Anomaly detection • For each observation x(0) (ti ), i = 1, 2, ..., n0 : – Determine the context by τi∗ = c(ti , x(0) (ti )). – Calculate the p-value pi w.r.t. fˆX (τi∗ , x). • Based on the obtained p-values pi , i = 1, 2, ..., n0 , decide whether the null hypothesis H0 w.r.t. γ0 should be accepted or rejected. to specialize or refine the detection by processing these further but is outside the scope of the development here. 7.4 Experimental results The proposed method is here evaluated on two anomaly detection applications with respect to vessel traffic data and eye-tracking data. 7.4.1 Vessel traffic Supervision of vessel traffic is of importance to detect potentially dangerous or suspicious activities such as accidents, engine failures, smuggling, drunkenness etc. Manual supervision is an expensive and tedious task, due to the rarely occurring anomalies and the typically large number of objects in a scene. A data set from the English channel was scanned for abnormalities using the algorithm. A synthetic data set, where the ground truth is known was also studied to evaluate the method. Real data Data recordings from the Automatic Identification System (AIS) 1 of freight ships travelling in the English Channel were made for 72 hours. The state of each vessel is given by x(t) = [x(t), y(t), v(t), φ(t)]T , where x, y, v and φ denote the longitude, latitude, speed and heading respectively. A total of 182 trajectories were recorded, see Fig. 7.2. From these, N = 100 trajectories were used as the training data set Γ. The behavior of the vessel depends on its position x(t), y(t) in the route. 
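For a Gaussian fit, the outlier test of Sec. 7.3.3 is particularly cheap to evaluate. The sketch below reads (7.4) as the upper χ²_d tail probability of the squared Mahalanobis distance, which is small for outlying observations; the state layout and the numerical values of the mean and covariance are purely illustrative and not taken from the data set.

```python
import numpy as np
from scipy.stats import chi2

def gaussian_p_value(x0, mu, Sigma):
    """Outlier p-value for X ~ N(mu, Sigma), cf. Eq. (7.4):
    upper-tail chi^2_d probability of the squared Mahalanobis distance."""
    x0, mu = np.asarray(x0, dtype=float), np.asarray(mu, dtype=float)
    d2 = (x0 - mu) @ np.linalg.solve(Sigma, x0 - mu)   # (x0-mu)^T Sigma^-1 (x0-mu)
    return chi2.sf(d2, df=len(mu))

# Illustrative knot-point statistics for a [lon, lat, speed, heading] state.
mu = np.array([0.0, 0.0, 10.0, 1.5])
Sigma = np.diag([0.01, 0.01, 4.0, 0.1])
print(gaussian_p_value([0.05, -0.02, 3.0, 1.4], mu, Sigma))   # slow vessel -> small p
```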
At a given position, it is supposed to have a certain speed and heading and hence the function g is selected as g(t, x(t)) = [x(t), y(t)]T . 1 (7.5) Vessels over 300 gross tonnes transmit their longitude, latitude, speed, and course via the AIS system. 148 Figure 7.2. Recorded trajectories of vessels travelling through the English channel. The context trajectory r̂ was estimated from Γ, using the method described in Sec. 7.3.1. For the PDFs at each knot point fX,k (x), the distribution was approximated as a Gaussian one, i.e. G = N (μ, Σ), justified by the fact that Lilliefors normality test supported the assumption of normality at 86 of the 100 knot points. The mean and covariance were computed as the sample mean and covariance as given by (1.12) and (1.13), respectively. The anomaly detection algorithm was then applied to the remaining 82 trajectories, where the p-values were computed from (7.4) w.r.t. the fitted PDF fˆ(τ, x), revealing aberrations that fall into three types (see Fig. 7.3): Type 1: Vessels going into a harbor. Type 2: Vessels going off the route direction. Type 3: Vessels that present a clearly abnormal behavior compared to other vessels at similar positions. The p-value for Type 2 anomalies were several orders of magnitude lower than the anomaly scores for the Type 3 anomalies. Synthetic data A synthetic data set was produced by simulating vessels controlled by PD controllers that track the reference trajectory while holding a given reference speed by exercising a limited force. Random disturbing forces were added to each simulated vessel. In total 300 ”normal” vessels were simulated, of which 200 trajectories were used as the training data set Γ. 149 Type 1 Type 2 50.65 51.22 Lattitude [deg] Lattitude [deg] 51.24 51.2 51.18 51.16 51.14 51.12 1.56 50.6 50.55 50.5 1.58 1.6 1.62 Longitude [deg] 1.64 0.2 0.3 0.35 0.4 Longitude [deg] Type 3 51.18 50.32 51.17 50.3 Lattitude [deg] Lattitude [deg] Type 3 0.25 51.16 51.15 51.14 50.28 50.26 50.24 50.22 51.13 1.64 1.66 1.68 Longitude [deg] 1.7 −1.5 −1.45 −1.4 −1.35 Longitude [deg] Figure 7.3. Zoom-in of points classified as anomalous of type 1, 2 and 3 respectively. Trajectories are given by gray lines. r̂ is marked by the thick line. Points classified as anomalous are marked by plus signs. Also 3 trajectories, γai i = 1, 2, 3, generated by objects with anomalous behavior, were simulated with the following characteristics: γa1 - Travels in the opposite direction to the normal one on the route. γa2 - First behaves normally but then slows down and stops on the route at time step 20. γa3 - Controlled by a PD controller with ”bad” parameters, giving it a somewhat wiggly behavior, though not apparently deviating to the human perception. For comparison, the method suggested in [51] was implemented and run on the same data set. The method does not employ the concept of context function and therefore faces problems with varying speeds and accelerations along the route. Further, it does not account for correlation between the state variables. Fig. 7.4 displays the anomaly score obtained at τk for each vessel. To facilitate better visual separation, the distance (x0 − μ)T Σ−1 (x0 − μ) in (7.4), rather than the p-value, is plotted. Thus, the higher the value, the more anomalous is the point. 150 4 10 2 Anomaly score 10 0 10 γ a1 γ a2 γ a3 −2 10 10 20 30 Sample 40 50 60 4 10 2 Anomaly score 10 0 10 γ a1 γ a2 γ a3 −2 10 10 20 30 Sample 40 50 60 Figure 7.4. Anomaly score using proposed method (top) and method [51] (bottom). 
”Normal” trajectories are shown in gray. The anomalous trajectories γai , i = 1, 2, 3 are marked according to the legend. The threshold value is marked by the black dashed line. Note the log scale on the y-axis. 151 Discussion In the real data set, the underlying causes to the behaviors classified by the algorithm as anomalous are not known to the authors. The anomalies of Type 3 definitely seem reasonable to raise warnings for since the behavior is clearly distinguished from the other vessels at similar positions. Whether the Type 2 anomalies should result in warnings or not is more difficult to judge. However, the anomaly scores for these were low and could just be presented as a notification to the operator. The Type 1 anomalies should not raise warnings since these are not actual anomalies. Because a ship is supposed to broadcast over the AIS system what harbor she is heading for, these types of warnings can easily be suppressed by using that additional information. A closer look on the data also reveals that there are no apparent anomalies that are not detected by the algorithm. In the simulated data set, it can be seen that the method reliably detects the anomalous trajectories while giving a low rate of false alarms. Compared to the method in [51], a better separation between the anomalous and normal trajectories is obtained. For instance, from the p-values in the bottom sub-plot, it is not possible to tell γa3 from the normal trajectories using the method of [51]. Computational complexity The proposed anomaly detection method boasts low computational complexity. The online computations necessary to judge whether a point x0 is anomalous or not are basically to compute τ ∗ = c(x0 , t) from (7.3), and evaluate (7.4). This is found to require about 100 FLOPs (floating point operations). The processor used for execution of the implemented code(Intel Core2 Quad Q9550, 2.83GHz) can perform about 40 GFLOPs, and hence theoretically process about 4·107 anomaly classifications per second. In practice, this number is lower, due to computational overhead such as data movement etc. In Matlab that is an interpreting software and hence executes substantially slower than a compiled code (e.g. in C++), tests show that an implementation can process about 7 · 105 points per second. This is far more than required to handle the given scene, which at maximum contained 453 vessels at the same time instant. 7.4.2 Eye-tracking There are different types of eye movement (the two most commonly mentioned being saccades and smooth pursuit) [19], all of which are governed by complex neuromuscular systems. Research has shown that various medical conditions, e.g. Parkinson’s Disease [30], affect the smooth pursuit system negatively, motivating the search for accurate quantification 152 Figure 7.5. Recording of eye movements. methods that could then be used as diagnostic or even staging tools. The oculomotor system is inherently difficult to model due to complex nonlinear dynamics and it is therefore of interest to find a non-parametric approach to use as a supplement for model-based methods. Experiment Three test subjects • P1 : Healthy man, 26 years old • P2 : Healthy man, 27 years old • P3 : Healthy man, 54 years old • P4 : Parkinson patient, 62 years old were put in front of a computer screen and asked to follow a gaze reference r(tk ) in the form of a moving dot on the screen designed to have suitable characteristics as in [47]. Thus, r(tk ) is the x, y coordinates of the dot at time step k. 
The j-th recording for test subject Pi is denoted # $ (j) (j) (j) , γPi = x(j) Pi (t1 ) xPi (t2 ) . . . xPi (tn ) (j) where xPi (tk ) is the x, y position at which the test subject Pi is looking at time sample tk , recorded by a video-based eye tracker from Smart Eye AB, Sweden. Fig. 7.6 shows a picture of the recording of eye movements. P1 tracked the reference 40 times, while P2 , P3 and P4 tracked the reference 5 times each. 153 0.4 0.2 y position [cm] 0 −0.2 −0.4 −0.6 −0.8 −1 −0.4 −0.2 0 0.2 0.4 x position [cm] 0.6 0.8 1 Figure 7.6. Part of the trajectory for the visual stimuli. Since the reference is a function of time the function g was chosen as g(t, x(t)) = t (7.6) was chosen, which implies that the reference trajectory r(τ ) = τ and the context function is simply given by c(t, x(t)) := arg inf r(τ ) − t)2 = t. (7.7) τ The PDFs fˆ(tk , x) were estimated from the first 35 realizations be(j) longing to P1 , i.e. {γP1 }35 j=1 using an orthogonal series estimator and the (j) first 5 Hermite basis functions. The p-values for the data in {γP1 }40 j=36 , (j) 5 (j) 5 (j) 5 ˆ {γP2 }j=1 , {γP3 }j=1 and {γP4 }j=1 were evaluated w.r.t. f (tk , x), k = 1, 2, ..., 500. In Fig. 7.7, the cumulative logarithmic p-value pc (t) = t log10 p(k) (7.8) k=0 is shown, where p(k) denotes the p-value at time step k. The cumulative p-values obtained using a Gaussian distribution are also provided for comparison. Discussion From Fig. 7.7, differences in the oculomotor system of the test subjects can be observed. The Parkinson patient, P4 , is naturally the most distinguished test subject. The 54 year old test subject, P3 is also clearly 154 0 -100 -200 Cumulative p-value -300 -400 -500 -600 P -700 P P -800 1 2 3 P4 -900 -1000 0 50 100 150 200 250 300 Time step 350 400 450 500 0 -100 Cumulative p-value -200 -300 -400 P1 -500 P2 P -600 P 3 4 -700 0 50 100 150 200 250 300 Time step 350 400 450 500 Figure 7.7. Upper figure shows the p-values for test subject P1 , P2 , P3 and P4 for each sampling instant, using Hermite functions to estimate the PDFs, the lower figure shows the p-values when then PDFs are approximated using a Gaussian distribution. Notice the log scale on the y-axis. 155 separated from P1 and P2 , 26 and 27 years old respectively. This is likely to be a consequence of age, a factor known to affect the oculomotor system. P1 has the highest p-values, which is to expect as the training data set, and hence the definition of normal behavior, come from P1 . The PDFs for this application tend to be skew. Indeed, a better distinction between the test subjects is achieved using a non-parametric estimator than a Gaussian estimator, Fig. 7.7. As this study only contains four test subjects, it is not possible to make more insightful conclusions based on the available data. Subsequent studies containing more test subjects will be performed to draw statistically significant conclusions. 7.5 Limitations The method requires that enough realizations are available to enable accurate estimation of the involved PDFs. This can be especially problematic for high-dimensional systems since estimation of PDFs of high dimension requires many observations to achieve accuracy. It has though been shown [110] that orthogonal series estimates exhibit a convergence rate that is independent of the dimension, which property makes them an appealing option for high-dimensional estimation. 7.6 Conclusions A non-parametric and analytical model-free anomaly detection method is presented. 
The method is applicable to systems following a given reference whose trajectory realizations are observed. The method is based on the estimation of statistical distributions characterizing the trajectory deviations from the reference. With the aid of these distributions and by utilizing outlier detection methods, it can be concluded whether or not a given system trajectory is likely to be generated by the same mechanisms as the training data set. The developed method performs well in the two considered applications. Being model-free, the method is suitable for systems that are difficult to model appropriately and/or highly nonlinear. 7.A Appendix for Chapter 7 7.A.1 Evaluation of Eq. (1.20) One approach to evaluate Eq. (1.20) is by approximating it using a Riemann sum by the following steps. Let {xi }N i=1 denote a set of equidistant 156 grid points and denote the volume element for one grid point V . Evaluate fX over the grid yi = fX (xi ), (7.9) i = 1, 2, ..., N . Let {yak }N k=1 be the ordered set of the points yi , such that ya1 ≤ ya2 ≤ ... ≤ yaN , and denote the cumulative sum cm = V m yai , 1 ≤ m ≤ N, m ∈ N. i=1 An approximation of the p-value, for an observation x0 is then given by p(x0 ) ≈ cn , (7.10) where n = arg max(ymi ≤ fX (x0 )). The approximation can be made i arbitrarily accurate by refining the grid. This can be computed off-line and does not influence the on-line performance of the method. To further minimize the on-line computational load, a lookup table for the p-value can be set. The only computation required then to evaluate the p-value is to compute fX (x0 ) and check the lookup table for the corresponding p-value. More sophisticated methods for numerical integration than the Riemann sum can be applied in a straightforward manner. 7.A.2 Proof of fˆX (τ, x) being a PDF A function is a PDF if it is non-negative and integrates to 1. For fˆX (τ, x) it holds that fˆX (τ, x)dx = l(τ, {fˆX,k (x)}nk=0 ) = d R ˆ [ fX,τ (x)dx − fˆX,τ (x)dx](τ − τ ) Rd Rd + fˆX,τ (x)dx = (1 − 1)(τ − τ ) + 1 = 1 Rd and that fˆX (τ, x) = fˆX,τ (x)(τ − τ ) + (1 − (τ − τ ))fˆX,τ (x) ≥ 0, since (τ − τ ) ≥ 0, 1 − (τ − τ ) ≥ 0, fˆX,τ (x) ≥ 0 and fˆX,τ (x) ≥ 0, and is hence a PDF. 157 Chapter 8 Application to parameter estimation in PK/PD model 8.1 Introduction Nonlinear dynamical models provide a broad framework for biological and physiological systems and are well suited for the problem of drug delivery control, [35]. While first-principles pharmacokinetic/pharmacodynamic (PK/PD) models make direct use of insights into the underlying physiological processes, they also usually involve numerous uncertain and individual-specific parameters to be identified. At the same time, nonlinear dynamics demand sufficient exogenous excitation both in frequency and amplitude to safeguard identifiability from measured input-output data. Since the drug in a closed-loop drug delivery system incorporating the patient is administered by a feedback controller, an accurate and expedient identification of the model parameters is required in order to guarantee safety of the treatment. The inter-patient variability in response to administration of drugs greatly complicates the automatic drug delivery. Due to a huge variation in PK/PD model parameters that can amount to hundreds of percent, it is difficult and often impossible to design a single controller that performs reasonably well over a broad patient population. 
Further, the performance of an individualized feedback controller for drug delivery is directly influenced by the intra-patient variability, i.e. the uncertainty incurred by the changes in the PK/PD characteristics of the patient throughout a clinical event. Patient response to an anesthetic drug can also alter due to noxious stimuli or hemorrhaging. Intra-patient variability might exceed the robustness margins of a time-invariant controller design and demand adaptation or online controller re-design as in, e.g. [120]. 159 Due to the physiologically motivated saturations in the nonlinear PK/PD, the high uncertainty in the mathematical model may lead, under closed-loop drug administration, to a limit cycle. The nonlinear oscillations result in alternating under- and overdosing episodes that compromise the intended therapeutic effect and patient safety. Simple model structures can capture the most significant to the closedloop performance dynamics of the system, i.e. the human body, in response to drug administration, allowing at the same time for suitable model individualization. Minimal parsimonious models for the effect of drugs in anesthesia were proposed in [95] and [94], followed by [36] and [40]. In this chapter, the estimation performance of the extended Kalman filter (EKF) is compared to that of two particle filter (PF) algorithms in an application to neuromuscular blockade (NMB) nonlinear Wiener model. Results shows that the more computationally intensive PF, making direct use of the nonlinear model, performs better than the EKF that relies on model linearization. For comparison the OBPF, given in chapter 6 is also implemented and evaluated at the application. The OBPF provides regularization to the filtering problem by fitting a truncated orthogonal series expansion to the particle set. The truncation order of the expansion is thus a user parameter. It is investigated how the regularization benefit the filter estimates, and also how the truncation order affects the filter accuracy. The matter of intra-patient variability in terms of model parameter estimates is also assessed in this chapter by a comparison of the tracking capabilities of the EKF, PF, and the OBPF. The numerical experiments performed on synthetic and clinical data show that the EKF is the computationally cheapest option but is prone to a significant bias. The estimates of both PF are not biased and the PF and OBPF perform similarly when there is no limit to the number of the particles used. For a moderate number of particles, the OBPF demonstrates higher accuracy at the same computational price as the PF. Recent research has shown that complex nonlinear dynamics may arise in the closed-loop system of a Wiener model for the NMB controlled by a PID feedback. According to [121], there exists a region in the parameter space where the system possesses a single stable equilibrium and, when varying the parameters, this equilibrium undergoes a bifurcation that leads to the emergence of self-sustained nonlinear oscillations. Notably, oscillating PID loops in closed-loop anesthesia have been observed in clinical experiments, e.g. [2]. A third contribution of this chapter is a quantification of the distance to bifurcation for the identified models. This quantification provides insight into how close to a nonlinear oscillation the closed-loop system is and it may be used as a flag in a safety 160 u(t) Linear Dynamics y(t) Static Nonlinearity y(t) Figure 8.1. Block diagram of a Wiener model. net for PID controlled anesthesia. 
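A quick sanity check of the linear block (8.1)-(8.2) is that its static gain equals one, so that the steady-state response to a constant drug rate is shaped by the output nonlinearity alone. The sketch below, in Python with the nominal k1 = 1, k2 = 4, k3 = 10 and an illustrative value of α, builds the realization and verifies the unit gain; it is a check of the model structure only, not the estimation code used later in the chapter.

```python
import numpy as np

def pk_linear_block(alpha, k1=1.0, k2=4.0, k3=10.0):
    """Continuous-time realization (8.2) of the parsimonious PK block (8.1)."""
    A = alpha * np.array([[-k3, 0.0, 0.0],
                          [ k2, -k2, 0.0],
                          [0.0,  k1, -k1]])
    B = alpha * np.array([[k3], [0.0], [0.0]])
    C = np.array([[0.0, 0.0, 1.0]])
    return A, B, C

A, B, C = pk_linear_block(alpha=0.04)              # alpha chosen for illustration
dc_gain = (-C @ np.linalg.solve(A, B)).item()       # -C A^{-1} B
print(dc_gain)                                       # ~1.0: unit static gain of the linear block
```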
Therefore, the considered identification algorithms can not only be used for controller design but as well for control loop monitoring that assesses online the risk for oscillations. The remainder of this chapter is organized as follows. Section 8.2 describes the parsimonious nonlinear Wiener model that is used to parametrize the effect of the muscle relaxant atracurium in the NMB. Section 8.3 briefly introduces the EKF, the PF, and the OBPF. Section 8.4 summarizes the data sets and the performance metrics that were used to assess parameters convergence as well as filtering and tracking capabilities of the considered parameter estimation techniques. Section 8.5 presents the estimation results. The conclusions are drawn in Section 8.6. 8.2 Parsimonious Wiener Model A block diagram of a Wiener model is shown in Fig. 8.1. In the parsimonious Wiener model for the NMB, [95], that is adopted in this chapter, the model input u(t) [μg kg−1 min−1 ] is the administered atracurium rate, and the model output y(t) [%] is the NMB level. The continuoustime output of the linear dynamic part, here denoted as y(t), is not accessible for measurement. The transfer function of the linear dynamic part of the Wiener model is given by k 1 k2 k3 α 3 Gp (α) = , (8.1) (s + k1 α)(s + k2 α)(s + k3 α) that may be realized in state-space form as ẋ(t) = A(α) x(t) + B(α) u(t), y(t) = C x(t), ⎤ −k3 0 0 0 ⎦, A(α) = α ⎣ k2 −k2 0 k1 −k1 T , B(α) = α k3 0 0 C= 0 0 1 , (8.2a) ⎡ (8.2b) (8.2c) (8.2d) where 0 u(t) umax is the input signal. 161 The constants ki , {i = 1, 2, 3} are positive, and α [min−1 ] > 0 is the patient-dependent parameter to be identified in the linear block. In the analysis that follows, the values chosen in [93], k1 = 1, k2 = 4 and k3 = 10 are assumed. The effect of the drug is quantified by the measured NMB y(t) [%] and modeled by the Hill function as y(t) = γ 100 C50 , γ C50 + y(t)γ (8.3) where γ (dimensionless) is the patient-dependent parameter to be identified in the nonlinear block, y(t) is the output of the nonlinearity, and C50 [μg kg−1 min−1 ] is a normalizing constant that is equal to 3.2435 in simulations. In order to implement the model in the estimation algorithms, the structure in (8.2) and (8.3) was discretized using a zero-order hold method with sampling rate h = 1/3 min−1 . A random walk model, [101], for the model parameters is assumed. With subscripts denoting discrete time, the resulting (sampled) augmented state vector xt is T (8.4) xt = xTt αt γt . Then the extended state-space model becomes the following ⎡ ⎤ xt Φ(αt ) 03×2 ⎣ ⎦ Γ(αt ) αt + xt+1 = ut + v t , 02×1 02×3 I γt (8.5a) ≡ f (xt , ut ) + vt , yt = γt 100 C50 + et ≡ h(xt ) + et , γt C50 + (C xt )γt (8.5b) where vt ∈ R5 , et ∈ R are white zero-mean Gaussian noise processes, with the probability density functions pv (v) and pe (e), respectively. The system matrices Φ(α), Γ(α) are the discretized versions of A(α), B(α) in (8.2). 8.3 Estimation algorithms The EKF and the PF are widely used in nonlinear state estimation. The EKF builds on the idea of extending Kalman filtering to nonlinear models. At each time step, the filter gain is computed by linearizing the nonlinear model around the previous state estimates. Unlike the 162 Kalman filter, the EKF is not an optimal filter and assumes both the process and sensor noise to be Gaussian. The PF uses Monte Carlo simulation to obtain a sample from the estimated posterior distribution of the state, from which point estimates can be extracted. 
It provides a general framework for estimation in nonlinear non-Gaussian systems. The PF exploits the underlying nonlinear model as it is, but yields an approximation to the true solution of the filtering problem. The approximation can be made arbitrarily accurate by increasing the number of particles, but the latter comes with the cost of an increased computational burden. The third filtering method under consideration is the OBPF. At the resampling step, it approximates the posterior by an orthogonal series expansion. A new set of particles is created by making a draw from the fitted distribution. Compared to the PF, the OBPF is even more suitable for parallelization. It as well provides, by the orthogonal series approximation, a regularization to the problem, improving the estimation accuracy for a smaller number of particles. 8.3.1 Filter tuning Following the procedure in [87], the EKF, the PF and the OBPF with 5 × 104 particles were tuned individually over a synthetic database (see Section 8.5) aiming at the best performance in terms of convergence speed and bias with reasonable output filtering. For the OBPF, the performance was evaluated for approximation orders K = 0, 1, 2, 3, 4. For the sake of evaluation consistency, this tuning was used for all simulations in this chapter. Notice that the initial covariance matrix of the EKF was not increased further, which would have resulted in a reduced settling time of the estimates. The reason is that, with a more aggressive tuning, the estimates of the nonlinear parameter γ suffered from divergence for several cases. The tuned covariance matrices for the EKF are as follows: % & P1|0 = diag 10−4 10−4 10−4 10−4 100 , % & Q = diag 10−2 10−2 10−2 10−8 10−3 , (8.6) R = 1, where diag(·) denotes a diagonal matrix with the specified elements of the main diagonal. The tuned covariance matrices for the PF and OBPF are as follows: % & P1|0 = diag 10−4 10−4 10−4 10−2 100 , % & Q = diag 10−3 10−3 10−3 10−8 10−3 , (8.7) R = 0.7. 163 The initial estimates of the parameters were calculated as the mean over the synthetic database (see Section 8.4.1), i.e. 0.0378 for α and 2.6338 for γ. 8.4 Data sets and performance evaluation metrics The two data sets and the metrics used for the estimation performance evaluation are described below. 8.4.1 Synthetic Data The database of synthetic data generated as described in [87] is used in this chapter. In brief, the data were& obtained by simulating system % (8.5) with the parameter sets α(i) , γ (i) , {i = 1, . . . , 48} from [73]. The input (i.e. drug dose) used to generate the 48 synthetic data sets was the same as the one administered in the 48 real cases, to guarantee that the excitatory properties of the real input signals are preserved. Convergence properties: In order to assess the convergence properties in terms of bias and settling time, the model parameters α(i) and γ (i) for each case i were kept constant during the whole simulation. The settling time for an estimate θ̂t of a scalar parameter θ is here defined as the time ts = ks h, where ks is the least value for which max θ̂t − min θ̂t ≤ L t≥ks t≥ks (8.8) is satisfied, i.e. the estimate will be confined to a corridor of width L, for k larger than or equal to ks . If the signal settles, the bias in the estimate is defined as ∗ N 1 bθ = θ − ∗ θ̂t , N − ks (8.9) t=ks where N ∗ is the number of samples from ks to the end of the case being evaluated. 
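The convergence metrics (8.8) and (8.9) translate directly into code. The sketch below, with illustrative names and a synthetic estimate trajectory, returns the first index ks after which the estimate stays within a corridor of width L, and then computes the bias over the settled samples; the settling time is recovered as ts = ks h.

```python
import numpy as np

def settling_index(theta_hat, L):
    """Smallest k_s such that max_{t>=k_s} - min_{t>=k_s} of the estimate is <= L,
    cf. Eq. (8.8). Returns None if the estimate never settles.
    Multiply the returned index by the sampling period h to obtain t_s."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    for ks in range(len(theta_hat)):
        tail = theta_hat[ks:]
        if tail.max() - tail.min() <= L:
            return ks
    return None

def estimation_bias(theta_true, theta_hat, ks):
    """Bias (8.9): true value minus the mean of the settled estimates."""
    return theta_true - np.mean(np.asarray(theta_hat, dtype=float)[ks:])

# Toy estimate trajectory converging towards 2.5 with residual noise.
rng = np.random.default_rng(1)
est = 2.5 + 1.5 * np.exp(-0.05 * np.arange(300)) + 0.01 * rng.standard_normal(300)
ks = settling_index(est, L=0.1)
print(ks, estimation_bias(2.6, est, ks))
```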
Tracking properties: As in [87], to assess the tracking properties of the algorithms, the true value of γ for the model simulation is made to evolve following a sigmoidal decay of 20% after minute 50, i.e. time step k0 = 150, according 164 to γt = ⎧ , ⎨ρ * ⎩ρ 1 − 0.2 1+( k ≤ k0 , + 1 k0 k−k0 )3 , k > k0 , (8.10) where ρ = γ (i) for case i. This is to simulate slow drifts in the dynamics that might occur during a general anesthesia episode. The parameter in the nonlinear block (PD, γ) is chosen for this test over the parameter in the linear one (PK, α) to highlight the nonlinear estimation performance of the evaluated algorithms. Distance to bifurcation: Following [121], the condition for the birth of sustained nonlinear oscillations of the PID closed-loop system is given by a surface that is nonlinear in the model parameter α and the controller gains R and L, as defined in the Ziegler-Nichols tuning procedure. The choice of this tuning procedure follows the work of [61]. % The 48 & models in the synthetic database were used to obtain the R(i) , L(i) , {i = 1, . . . , 48} via Ziegler-Nichols. % & Considering a nominal model i, the nominal controller gains R(i) , L(i) define a point in the (R, L) two-dimensional space. The parameter estimates α̂k from the PF estimation give rise to different bifurcation conditions that, in the case of a fixed α̂k at each sampling time k, can be represented by lines in the (R, L) space. To assess how close the nominal closed-loop model defined by (Rj , Lj ) is to the bifurcation condition at each time instant, the minimum of the Euclidean distance between this point and the bifurcation line was numerically calculated by a grid search. 8.4.2 Real data The database of real cases is the same as in [87] and includes 48 datasets collected from patients subject to PID-controlled administration of the muscle relaxant atracurium under general anesthesia. Real data were used to validate the conclusions drawn from the synthetic data experiments. The output errors obtained in the EKF, PF and OBPF filtering were compared for the four phases of anesthesia covered by the data sets, [87]. Phase 1, 0 < t ≤ 10 min, corresponds to the induction; Phase 2, 10 < t ≤ 30 min, is the time interval when only a P-controller was used; Phase 3, 30 < t ≤ 75 min, is between the beginning of the recovery from the initial bolus and the time when the reference reaches its final value of 10%; Phase 4, 75 < t ≤ tend , corresponds to steady-state. During Phases 3 and 4, drug administration was PID-controlled for t ≥ 30 min. 165 8.5 Results This section presents the results of the EKF, the PF and the OBPF estimation of the nonlinear Wiener model for the NMB described in Section 8.2. 8.5.1 Synthetic data Fig. 8.2 shows the parameter estimates of case #7 in the database of synthetic cases. As in [87], the estimates obtained by the PF, in solid blue line, converge faster than the estimates obtained by the EKF, in dashed green line, and exhibit less bias (8.9). This behavior persists in most of the cases in the database and the bias is more prominent for higher values of α and γ. Fig. 8.6 illustrates this by showing the true α and γ vs. bias (8.9) in the estimates for the PF and EKF for the 48 cases in the database. It is hence evident that the PF, in general, yields estimates with less bias than the EKF, this effect being especially prominent for large values of α and γ. 
The presence of a higher bias in the estimates of the EKF for higher values of the nominal parameters may be explained by the fact that the gain of the EKF is calculated from a linearized version of the nonlinear Wiener model while the PF performs no linearization at all. The performance of the OBPF is very similar to the performance of the PF for large particle sets (N ≥ 104 ). The root mean square error (RMSE) ! T !1 (xt − x̂t )2 R=" T (8.11) t=0 for the PF and OBPF is shown in Fig. 8.3 for different particle set sizes and approximation orders. As can be seen, for smaller N , the OBPF gives better estimation than the PF, due to the regularization provided by the fitting of the expansion. No particular difference in RMSE performance of the OBPF can be seen between different approximation orders K. However, when evaluating the posterior PDF at a given time instant, the OBPF gives a better fit for higher truncation orders, as exemplified by Fig. 8.4. The true marginal distribution for α is shown together with the approximate PDFs obtained by the OBPF of different orders. Given these results, it is probably not worthwhile to spend the extra computation required for a higher approximation order since this has little impact on the final quality of the point estimate, as shown by the RMSE. Fig. 8.5 depicts the estimates of γ for a case where the true value, plotted in dotted red, changes obeying a sigmoidal function after minute 50, according to (8.10). The EKF estimates are plotted in dashed green, while the PF estimates are plotted in solid blue. This is 166 0.035 0.03 α true OBPF(2) PF EKF 0.025 0.02 0 20 40 60 80 100 120 140 160 180 200 Time step 4 3.5 γ 3 true OBPF(2) PF EKF 2.5 2 0 20 40 60 80 100 120 140 160 180 200 Time step Figure 8.2. Estimated α (upper plot) and γ (bottom plot) for the OBPF, PF and EKF for case number 7 in the synthetic database. The settling time instants according to (8.8) are marked by the arrows. RMSE (α) 2.5 ×10 -3 OBPF(0) OBPF(2) OBPF(4) PF 2 1.5 1 0.5 10 2 10 3 10 4 10 5 10 6 N RMSE (γ) 0.08 OBPF(0) OBPF(2) OBPF(4) PF 0.06 0.04 0.02 10 2 10 3 10 4 10 5 10 6 N Figure 8.3. Root mean square error as a function of the number of particles N used for filtering. 167 OBPF(0) OBPF(1) OBPF(2) OBPF(3) OBPF(4) true 0.35 0.3 p(α) 0.25 0.2 0.15 0.1 0.05 0 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034 0.035 α Figure 8.4. Marginal distribution for α at time t = 5min. The true PDF is shown in dashed black, while the approximation obtained from the OBPF using approximation orders from 0 to 4 are shown in colored lines. OBPF(2) PF EKF 2.4 2.3 2.2 γ 2.1 2 1.9 1.8 1.7 1.6 20 40 60 80 100 120 Time (min) Figure 8.5. Estimated γ for the EKF, PF and OBPF. At t = 50, the true γ starts drifting according to (8.10). 168 −3 2 x 10 0.5 1 0 bγ bα 0 −1 −2 −0.5 −3 −4 0.02 0.03 0.04 α 0.05 0.06 −1 1 2 3 γ 4 5 6 Figure 8.6. The true α and γ vs. estimation bias bα and bγ , respectively, for the 48 cases in the synthetic database. The results for the EKF are plotted in green circles and the results for the PF are plotted in blue crosses. a case representative of the behavior of the estimates in all the 48 cases in the synthetic database. As for time-invariant parameters, the EKF presents a higher bias at tracking the change than the PF does, while no particular difference can be observed between the PF and the OBPF. 8.5.2 Real data Keeping the tuning unchanged, the EKF, the PF and the OBPF were applied to the 48 cases of real input-output data. Fig. 
8.5.2 Real data
Keeping the tuning unchanged, the EKF, the PF and the OBPF were applied to the 48 cases of real input-output data. Fig. 8.7 shows the estimates of α and γ over time for case #39 in the real data database. Here, the true parameter values are naturally not available. The higher variance of the estimates of γ, when compared to that of the estimates of α, supports the choice of assessing the tracking performance of the estimation techniques with respect to changes in γ only, as argued in Section 8.4.1.

Figure 8.7. Estimated model parameters for the EKF, in dashed green, and the PF, in solid blue, over time for case #39 in the real database.

Fig. 8.8 shows the mean of the absolute value of the output error, with the 1σ confidence interval, over all 48 cases. Numerical values of the output errors are also given in Table 8.2 for the four experimental phases described in Section 8.4.2. The general result is that the PF exhibits a much lower output error during Phase 1 (the induction phase), 0 < t ≤ 10 min, than that obtained with the EKF estimates. For Phase 2, 10 < t ≤ 30 min, the EKF provides slightly better output errors, possibly due to the less prominent nonlinear dynamics exhibited in this interval. For Phases 3 and 4, t > 30 min, the performance is similar for the three estimation algorithms. The better performance of the PF during the highly nonlinear induction phase is attributed to the fact that the algorithm handles the nonlinear dynamics without recourse to linearization.

Figure 8.8. The means μ_e^ekf and μ_e^pf of the absolute value of the output error over the 48 cases for the EKF and PF, respectively. The 1σ confidence intervals are given by the transparent areas.

In order to get some insight into the need of estimating the model parameters throughout the whole surgery and, consequently, into the development of adaptive control strategies, the system was simulated with the estimates of α and γ obtained after induction (at t = 10 min) and with the estimates obtained at the last time step of the estimation (at t = t_end). The mean and standard deviation over the 48 cases of the output errors are shown in Table 8.1. This result shows that, from minute 10 to the end of the surgery, the changes in the model parameters affect the goodness of fit of the simulated model to the real data. It is therefore plausible that adaptive/re-designed controllers would perform better during the maintenance phase than non-adaptive ones, especially under longer surgical interventions.

Table 8.1. Mean and standard deviation of the simulation output error, with the parameters θ̂_t = {α̂_t, γ̂_t} obtained at t = 10 min and at t = t_end.

        θ̂_10   θ̂_tend
mean    2.32    4.13
stdv    0.13    0.22
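To illustrate how per-phase output-error statistics of the kind reported in Table 8.2 can be computed, the following is a minimal sketch that pools the absolute output error over cases and bins it by the four phases of Section 8.4.2; the data layout, the random placeholder errors, and the pooling of all samples (rather than, e.g., per-case aggregation) are assumptions of the example.

import numpy as np

# Phase boundaries in minutes, cf. Section 8.4.2 (Phase 4 runs to t_end).
PHASES = [(0.0, 10.0), (10.0, 30.0), (30.0, 75.0), (75.0, np.inf)]

def per_phase_stats(abs_errors, t_minutes):
    # abs_errors: list of 1-D arrays, one per case, with |output error| samples.
    # t_minutes:  list of matching time vectors in minutes.
    stats = []
    for lo, hi in PHASES:
        vals = np.concatenate([e[(t > lo) & (t <= hi)]
                               for e, t in zip(abs_errors, t_minutes)])
        stats.append((vals.mean(), vals.std(), vals.min(), vals.max()))
    return stats

# Illustrative usage with placeholder errors sampled every 20 s for 100 minutes.
t = np.arange(0, 100, 20 / 60)  # time in minutes
errs = [np.abs(np.random.randn(t.size)) for _ in range(48)]
for phase, (m, s, lo, hi) in enumerate(per_phase_stats(errs, [t] * 48), start=1):
    print(f"Phase {phase}: mean={m:.2f}, stdv={s:.2f}, [min,max]=[{lo:.2f},{hi:.2f}]")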
Given the time-varying nature of the patient dynamics in a PID control setup, and for safety reasons, it is important to judge whether the system is driven into a parameter region where a bifurcation might lead to nonlinear oscillations. The distance to bifurcation is calculated according to [121] for the 48 cases at t = 40 min and presented as a histogram in Fig. 8.9. The histogram is representative for all time instants t > 10 min, as the distance depends only on α̂, which typically settles before t = 10 min. It can be seen that most of the cases are further than 10^−2 from the critical surface. Three cases are nevertheless closer to the surface, which may be of concern in real practice.

Figure 8.9. Histogram of the distance to bifurcation, at time t = 40 min, over the 48 cases in the synthetic database, assuming PID control. Note the log scale on the x-axis.

It should be noted that the better performance of the PF and OBPF comes at a much higher computational cost than that of the EKF. For this application, the EKF and the PF/OBPF require FLOPs in the order of magnitude of 10^3 and 10^7 per iteration, respectively. In Fig. 8.10, the RMSE as a function of the computational complexity is shown for the PF and OBPF. It can be seen that the OBPF provides better RMSE results for a given number of FLOPs for the approximation orders K = 0 and K = 2, but is more computationally costly for a higher truncation order (K ≥ 4).

Figure 8.10. Root mean square error (RMSE) as a function of the number of floating-point operations (FLOPs) required for filter execution.

Table 8.2. Output error (absolute value) of estimation for the EKF, the PF and the OBPF for different approximation orders, during the four phases defined in Section 8.4.2.

         EKF                           PF
Phase    mean   stdv   [min,max]       mean   stdv   [min,max]
1        4.16   0.62   [2.58,5.42]     0.95   0.47   [0.24,2.34]
2        0.49   0.17   [0.16,0.85]     0.58   0.39   [0.14,1.97]
3        0.31   0.16   [0.08,0.98]     0.30   0.16   [0.13,0.77]
4        0.25   0.16   [0.04,0.97]     0.25   0.13   [0.07,0.76]

         OBPF(0)                       OBPF(5)
Phase    mean   stdv   [min,max]       mean   stdv   [min,max]
1        0.87   0.53   [0.32,1.98]     0.90   0.44   [0.18,2.14]
2        0.52   0.15   [0.15,1.22]     0.52   0.18   [0.17,1.52]
3        0.31   0.18   [0.06,0.85]     0.28   0.18   [0.05,0.74]
4        0.26   0.16   [0.04,0.78]     0.23   0.14   [0.05,0.69]

Unoptimized Matlab implementations were clocked to run one filtering iteration in 0.5 ms for the EKF and about 2 s for the PF. For the implementations in hand, the execution time of the PF is hence four orders of magnitude greater than that of the EKF. Since the sampling period is 20 s, this difference in execution time is, however, not an issue. Importantly, the OBPF has full parallelization potential, and linear speedup in the number of cores employed can be expected on a multicore computer. Hence, on an eight-core machine used for filter computation, the execution time can be brought down to 1/8 of that for single-core execution.
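As a small illustration of these timing figures, the sketch below checks whether the measured single-core iteration times fit within the 20 s sampling period under the idealized assumption of linear speedup in the number of cores; the iteration times and the eight-core machine are taken from the text above, while treating the speedup as exactly linear is an assumption of the example.

# Minimal sketch: single-core iteration times from the Matlab timing above,
# scaled by an assumed ideal linear speedup over p cores, compared with the
# 20 s sampling period of the NMB application.
SAMPLING_PERIOD_S = 20.0
single_core_time_s = {"EKF": 0.5e-3, "PF/OBPF": 2.0}

for p in (1, 2, 4, 8):
    for name, t1 in single_core_time_s.items():
        tp = t1 / p  # idealized linear speedup; real speedup is at most this
        ok = "fits" if tp < SAMPLING_PERIOD_S else "does not fit"
        print(f"{name} on {p} core(s): {tp:.4g} s per iteration ({ok} the 20 s budget)")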
8.6 Conclusions
The nonlinear estimation algorithms EKF, PF and OBPF were compared on a parsimonious Wiener model for the neuromuscular blockade (NMB) in closed-loop anesthesia. For this application, the PF and the OBPF provide significantly better estimation quality than the EKF, but at a computational cost that is four orders of magnitude higher in FLOPs. The estimation performance of the OBPF and the PF is similar. However, for a given number of FLOPs, the OBPF with a low truncation order can provide better estimation quality than the PF. Using a truncation order higher than K = 0 did not result in any significant improvement in the point estimates provided by the filter, since the underlying probability distribution is close to a normal one and is thus captured well already by a single term. A better fit of the underlying PDF is, though, achieved with a higher truncation order. The improvement in the PDF fit is, however, not considerable enough in this application to justify the increased computational cost of the higher truncation order.

Chapter 9
BLAS-based parallelizations of the UKF and the point mass filter

When performing a parallelization, the simplest option should, of course, be investigated first. If it is possible to use a readily available, highly optimized library, that is the way to go. In this chapter, a brief presentation of the results from BLAS-based implementations of the UKF and a point mass filter is given.

9.1 UKF
In the UKF, the square root of the error covariance matrix has to be computed at every iteration, which is the step that consumes the vast majority of the computations in the method. Hence, parallelizing the UKF is mainly a matter of parallelizing the Cholesky factorization. A subroutine for parallel Cholesky factorization is available in BLAS. An implementation of the UKF, based on the BLAS routine for Cholesky factorization, has been carried out. The execution time and speedup results are summarized in Table 9.1 and Fig. 9.1, respectively. As can be seen, the scalability of the parallel UKF is good when the problem size is large (n ≥ 1000). For smaller problem sizes, the parallel overhead constitutes a disproportionately large part of the execution time, which results in poor scalability.

Table 9.1. Single-core execution time T for different problem sizes n, for execution of the UKF.

n        100      200      500      1000      2000
T [ms]   0.0958   0.4911   6.2881   45.1539   336.1521

Figure 9.1. Speedup curves for the parallel implementation of the UKF for different problem sizes n. For reference, linear speedup is marked by the dashed line.

9.2 Point mass filter
One iteration of a grid-based method consists of computing (1.61) and (1.62), i.e.

w^(i)_{t|t−1} = Σ_{j=1}^{N} w^(j)_{t−1|t−1} p(x^(i)_t | x^(j)_{t−1}),
w^(i)_{t|t} = w^(i)_{t|t−1} p(y_t | x^(i)_t),   i = 1, 2, ..., N.

Defining

w_{t|t−1} = [w^(1)_{t|t−1}  w^(2)_{t|t−1}  ...  w^(N)_{t|t−1}]^T,

P_t as the N × N matrix with entries [P_t]_{ij} = p(x^(i)_t | x^(j)_{t−1}), and

Q_t = diag( p(y_t | x^(1)_t), ..., p(y_t | x^(N)_t) ),

the prediction and update equations (1.61) and (1.62) can be expressed in matrix form as

w_{t|t−1} = P_t w_{t−1|t−1},   (9.1)
w_{t|t} = Q_t w_{t|t−1}.   (9.2)

These matrix equations can then be implemented using, e.g., BLAS, or by the pseudo-code in Alg. 25, where the parallelization is performed over the iterations of the i loop.

Algorithm 25 Pseudo-code for one iteration of a grid-based method.
• for i = 1:N
  – w^(i)_{t|t−1} = 0
  – for j = 1:N
    ∗ w^(i)_{t|t−1} = w^(i)_{t|t−1} + w^(j)_{t−1|t−1} p(x^(i)_t | x^(j)_{t−1})
  – end
  – w^(i)_{t|t} = p(y_t | x^(i)_t) w^(i)_{t|t−1}
• end
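To make the matrix form concrete, the following is a minimal NumPy sketch of one such iteration; the Gaussian transition and likelihood models, the one-dimensional grid, and the final normalization of the weights are illustrative assumptions of the example, not part of the algorithm statement above. The matrix-vector product in the prediction step is the operation that a threaded BLAS implementation would parallelize.

import numpy as np

def grid_filter_iteration(w_prev, x_grid, y_t, transition_pdf, likelihood_pdf):
    # One iteration of the grid-based (point mass) filter in matrix form:
    #   w_pred = P_t @ w_prev      (prediction, cf. (9.1))
    #   w_upd  = Q_t @ w_pred      (measurement update, cf. (9.2))
    # P_t[i, j] = p(x_t^(i) | x_{t-1}^(j)),  Q_t = diag(p(y_t | x_t^(i))).
    P = transition_pdf(x_grid[:, None], x_grid[None, :])   # N x N
    q = likelihood_pdf(y_t, x_grid)                        # length-N diagonal of Q_t
    w_pred = P @ w_prev
    w_upd = q * w_pred
    return w_upd / w_upd.sum()   # normalize the point-mass weights

# Illustrative placeholders: a Gaussian random walk and a Gaussian measurement.
gauss = lambda d, s: np.exp(-0.5 * (d / s) ** 2)
x_grid = np.linspace(-5, 5, 200)
w = np.full(x_grid.size, 1.0 / x_grid.size)
w = grid_filter_iteration(w, x_grid, y_t=0.3,
                          transition_pdf=lambda x, xp: gauss(x - xp, 0.5),
                          likelihood_pdf=lambda y, x: gauss(y - x, 1.0))
print(w.sum(), x_grid[np.argmax(w)])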
Table 9.2 and Fig. 9.2 show the execution time and the scalability, respectively, of an implementation of Alg. 25, for different problem sizes N. Note that the problem size is given by the number of grid points N and not by the number of states n, as for the Kalman filter based methods.

Figure 9.2. Speedup curves for execution of Alg. 25 for different problem sizes N. For reference, linear speedup is marked by the dashed line.

Table 9.2. Execution time T for sequential execution of the grid-based estimator for different numbers of grid points N.

N        100      200      500      5000       20000
T [ms]   0.1109   0.4389   2.7909   270.7901   4334.2

References
[1] Open mp. http://www.cs.virginia.edu/stream/, Aug. 2010. [2] A.R. Absalom and G. N. C. Kenny. Closed loop control of propofol anaesthesia using bispectral index: performance assessment in patients receiving computer-controlled propofol and manually controlled remifentanil for minor surgery. British Journal of Anaesthesia, 90(6):737–741, 2003. [3] V.J. Aidala. Kalman filter behavior in bearings-only tracking applications. Aerospace and Electronic Systems, IEEE Transactions on, AES-15(1):29–39, Jan. 1979. [4] D.L. Alspach and H.W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. Automatic Control, IEEE Transactions on, 17(4):439–448, Aug. 1972. [5] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS ’67 (Spring): Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485, New York, NY, USA, 1967. ACM. [6] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Dover Books on Electrical Engineering Series. Dover Publications, 2005. [7] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. Signal Processing, IEEE Transactions on, 50(2):174–188, February 2002. [8] Anwer S Bashi, Vesselin P Jilkov, X Rong Li, and Huimin Chen. Distributed implementations of particle filters. In Proc. of the Sixth Int. Conf. of Information Fusion, pages 1164–1171, 2003. [9] A.S. Bashi, V.P. Jilkov, X.R. Li, and Huimin Chen. Distributed implementations of particle filters. In Information Fusion, 2003. Proceedings of the Sixth International Conference of, volume 2, pages 1164–1171, 2003. [10] G. J. Bierman. Factorization Methods for Discrete Sequential Estimation. Academic Press, New York, NY, 1977. [11] M. Bolic, P.M. Djuric, and Sangjin Hong. New resampling algorithms for particle filters. In Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03). 2003 IEEE International Conference on, volume 2, pages II-589–II-592, April 2003. [12] Miodrag Bolic, Petar M. Djuric, and Sangjin Hong. Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing, 53:2442–2450, 2004. [13] John P Boyd. Asymptotic coefficients of Hermite function series. Journal of Computational Physics, 54(3):382–410, 1984. [14] D. Brunn, F. Sawo, and U.D. Hanebeck. Nonlinear multidimensional Bayesian estimation with Fourier densities. In Decision and Control, 2006 45th IEEE Conference on, pages 1303–1308, Dec. 2006. [15] R. S. Bucy and K. D. Senne. Digital synthesis of non-linear filters. Automatica, 7(3):287–298, May 1971. [16] Philip K. Chan and Matthew V. Mahoney. Modeling multiple time series for anomaly detection. In 5th IEEE International conference on data mining, pages 90–97. IEEE Computer Society, 2005.
[17] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41:15:1–15:58, July 2009. [18] E.W. Cheney. Multivariate Approximation Theory: Selected Topics. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1986. [19] Raymond Dodge. Five types of eye movement in the horizontal meridian plane of the field of regard. American Journal of Physiology – Legacy Content, 8(4):307–329, 1903. [20] R. Douc and O. Cappe. Comparison of resampling schemes for particle filtering. In Image and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium on, pages 64 – 69, sept. 2005. [21] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science Series. Springer, 2001. [22] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing, 10(3):197–208, 2000. [23] M. Ekman. Particle filtering and data association using attribute data. Information Fusion, 2009. FUSION09.12 th International Conference on, (10):9–16, July 2009. [24] A. Erdélyi. Asymptotic Expansions. Dover Books on Mathematics. Dover Publications, 1956. 180 [25] Magnus Evestedt, Alexander Medvedev, and Torbjörn Wigren. Windup properties of recursive parameter estimation algorithms in acoustic echo cancellation. Control Engineering Practice, 16(11):1372 – 1378, 2008. [26] Lian Fang and David C Gossard. Multidimensional curve fitting to unorganized data points by nonlinear minimization. Computer-Aided Design, 27(1):48 – 58, 1995. [27] J.E. Gentle. Elements of Computational Statistics. Statistics and Computing. Springer, 2002. [28] H.O. Georgii. Stochastics: Introduction to Probabilty Theroy and Statistics. De Gruyter Textbook. De Gruyter, 2008. [29] Norman E. Gibbs, Jr. Poole, William G., and Paul K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 13(2):pp. 236–250, 1976. [30] J M Gibson, R Pimlott, and C Kennard. Ocular motor and manual tracking in Parkinsons disease and the effect of treatment. J Neurol Neurosurg Psychiatry, 50(7):853–60, 1987. [31] Stefan Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes. Software, Environments and Tools. Society for Industrial and Applied Mathematics, 2001. [32] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107–113, April 1993. [33] A. Grama. Introduction to Parallel Computing. Pearson Education. Addison-Wesley, 2003. [34] M.S. Grewal and A.P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. Wiley, 2011. [35] Wassim M. Haddad, Tomohisa Hayakawa, and James M. Bailey. Adaptive control for nonlinear compartmental dynamical systems with applications to clinical pharmacology. Systems & Control Letters, 55(1):62 – 70, 2006. [36] Jin-Oh Hahn, G.A. Dumont, and J.M. Ansermino. A direct dynamic dose-response model of propofol for individualized anesthesia care. Biomedical Engineering, IEEE Transactions on, 59(2):571–578, 2012. [37] J. M. Hammersley and K. W. Morton. Poor man’s monte carlo. Journal of the Royal Statistical Society. Series B (Methodological), 16(1):pp. 23–38. [38] Eberhard Hansler. The hands-free telephone problem: an annotated bibliography update. 
Annals of Telecommunications, 49:360–367, 1994. 181 [39] A. Hekler, M. Kiefel, and U.D. Hanebeck. Nonlinear bayesian estimation with compactly supported wavelets. In Decision and Control (CDC), 2010 49th IEEE Conference on, pages 5701 –5706, dec. 2010. [40] Ramona Hodrea, Radu Morar, Ioan Nascu, and Horatiu Vasian. Modeling of neuromuscular blockade in general anesthesia. In Advanced Topics in Electrical Engineering, 2013 8th International Symposium on, pages 1–4, 2013. [41] H. Holma and A. Toskala. WCDMA for UMTs: Radio Access for Third Generation Mobile Communications. John Wiley & Sons, 2001. [42] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990. [43] S. Howard, Hak-Lim Ko, and W.E. Alexander. Parallel processing and stability analysis of the Kalman filter. In Computers and Communications, 1996., Conference Proceedings of the 1996 IEEE Fifteenth Annual International Phoenix Conference on, pages 366 –372, Mar. 1996. [44] Weiming Hu, Xuejuan Xiao, Zhouyu Fu, Dan Xie, Tieniu Tan, and Steve Maybank. A system for learning statistical motion patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1450–1464, 2006. [45] Rolf Isermann. Process fault detection based on modeling and estimation methods a survey. Automatica, 20(4):387 – 404, 1984. [46] K. Ito and K. Xiong. Gaussian filters for nonlinear filtering problems. Automatic Control, IEEE Transactions on, 45(5):910 –927, may 2000. [47] D. Jansson and A. Medvedev. Visual stimulus design in parameter estimation of the human smooth pursuit system from eye-tracking data. Submitted to IEEE American Control Conference, Washington D.C, 2013. [48] Daniel Jansson, Alexander Medvedev, and Olov Rosén. Parametric and non-parametric analysis of eye-tracking data by anomaly detection. IEEE Transactions on Control Systems Technology, 2014. [49] Daniel Jansson, Olov Rosén, and Alexander Medvedev. Non-parametric analysis of eye-tracking data by anomaly detection. In Control Conference (ECC), 2013 European, pages 632–637. IEEE, 2013. [50] S.J. Julier and J.K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401 – 422, mar 2004. [51] I.N. Junejo, O. Javed, and M. Shah. Multi feature path modeling for video surveillance. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 716 – 719 Vol.2, aug. 2004. 182 [52] T. Kailath, A.H. Sayed, and B. Hassibi. Linear estimation. Prentice-Hall information and system sciences series. Prentice Hall, 2000. [53] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960. [54] A.N. Kolmogorov, W. Doyle, I. Selin, Rand Corporation, and United States. Air Force. Interpolation and Extrapolation of Stationary Random Sequences. Memorandum (Rand Corporation). Rand Corporation, 1962. [55] Jayesh H. Kotecha and P.M. Djuric. Gaussian sum particle filtering. Signal Processing, IEEE Transactions on, 51(10):2602–2612, Oct 2003. [56] J.H. Kotecha and P.M. Djuric. Gaussian particle filtering. Signal Processing, IEEE Transactions on, 51(10):2592 – 2601, oct. 2003. [57] Jun S Liu. Monte Carlo strategies in scientific computing. springer, 2008. [58] P.A.C. Lopes and M.S. Piedade. A Kalman filter approach to active noise control. In Proc. EUSIPCO, volume 3, page 230, 2000. [59] G.G. Lorentz. Approximation of Functions. AMS Chelsea Publishing Series. AMS Chelsea, 2005. [60] P. M. Lyster, C. H. Q. Ding, K. 
Ekers, R. Ferraro, J. Guo, M. Harber, D. Lamich, J. W. Larson, R. Lucchesi, R. Rood, S. Schubert, W. Sawyer, M. Sienkiewicz, A. da Silva, J. Stobie, L. L. Takacs, R. Todling, and J. Zero. Parallel computing at the nasa data assimilation office (dao). In Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’97, pages 1–18, New York, NY, USA, 1997. ACM. [61] Teresa Mendonça and Pedro Lago. PID control strategies for the automatic control of neuromuscular blockade. Control Engineering Practice, 6(10):1225 – 1231, 1998. [62] S. Oliveira and D.E. Stewart. Writing Scientific Software: A Guide to Good Style. Cambridge University Press, 2006. [63] M.A. Palis and D.K. Krecker. Parallel Kalman filtering on the Connection Machine. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 55 –58, Oct. 1990. [64] T. Palmer and R. Hagedorn. Predictability of Weather and Climate. Cambridge University Press, 2006. [65] Beresford N. Parlett. Reduction to tridiagonal form and minimal realizations. SIAM Journal on Matrix Analysis and Applications, 13(2):567–593, 1992. 183 [66] D.A. Patterson and J.L. Hennessy. Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Morgan Kaufmann Series in Computer Graphics. Elsevier Science, 2011. [67] B.M. R. Parallel Computing. New Age International (P) Limited, 2009. [68] T. Rauber and G. Rünger. Parallel Programming: for Multicore and Cluster Systems. Springer, 2010. [69] B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Radar Library. Artech House, 2004. [70] C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. Springer, 2007. [71] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 2004. [72] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. [73] C. Rocha, Teresa Mendonça, and Maria E. Silva. Modelling neuromuscular blockade: a stochastic approach based on clinical data. Mathematical and Computer Modelling of Dynamical Systems, 19(6):540–556, 2013. [74] O. Rosen and A. Medvedev. Efficient parallel implementation of state estimation algorithms on multicore platforms. Control Systems Technology, IEEE Transactions on, PP(99):1 –14, 2011. [75] Olov Rosén and Alexander Medvedev. Parallel recursive estimation, based on orthogonal series expansions. In American Control Conference (ACC), 2014, pages 622–627, June 2010. [76] Olov Rosén and Alexander Medvedev. Efficient parallel implementation of a Kalman filter for single output systems on multicore computational platforms. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 3178–3183. IEEE, 2011. [77] Olov Rosén and Alexander Medvedev. An on-line algorithm for anomaly detection in trajectory data. In American Control Conference (ACC), 2012, pages 1117–1122. IEEE, 2012. [78] Olov Rosén and Alexander Medvedev. Parallelization of the Kalman filter for banded systems on multicore computational platforms. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pages 2022–2027, 2012. 184 [79] Olov Rosén and Alexander Medvedev. Efficient parallel implementation of state estimation algorithms on multicore platforms. 
Control Systems Technology, IEEE Transactions on, 21(1):107–120, 2013. [80] Olov Rosén and Alexander Medvedev. The recursive Bayesian estimation problem via orthogonal expansions: an error bound. IFAC WC, Aug, 2014. [81] Olov Rosén and Alexander Medvedev. Nonlinear identification of individualized drug effect models in neuromuscular blockade. Submitted to a journal, 2015. [82] Olov Rosén and Alexander Medvedev. Orthogonal basis particle filtering : an approach to parallelization of recursive estimation. Submitted to a journal, 2015. [83] Olov Rosén and Alexander Medvedev. Parallel recursive estimation using Monte Carlo and orthogonal series expansions. In American Control Conference, Palmer House Hilton, Chicago, IL, USA, 2015. [84] Olov Rosén, Alexander Medvedev, and Mats Ekman. Speedup and tracking accuracy evaluation of parallel particle filter algorithms implemented on a multicore architecture. In Control Applications (CCA), 2010 IEEE International Conference on, pages 440–445. IEEE, 2010. [85] Olov Rosén, Alexander Medvedev, and Daniel Jansson. Non-parametric anomaly detection in trajectorial data. Submitted to a journal, 2014. [86] Olov Rosén, Alexander Medvedev, and Torbjörn Wigren. Parallelization of the Kalman filter on multicore computational platforms. Control Engineering Practice, 21(9):1188–1194, 2013. [87] Olov Rosén, Margarida M Silva, and Alexander Medvedev. Nonlinear estimation of a parsimonious Wiener model for the neuromuscular blockade in closed-loop anesthesia. In Proc. 19th IFAC World Congress, pages 9258–9264. International Federation of Automatic Control, 2014. [88] Wilson J. Rugh. Linear system theory / Wilson J. Rugh. Prentice Hall,, Upper Saddle River, N.J., 2nd ed. edition, 1996. [89] Stuart C. Schwartz. Estimation of probability density by an orthogonal series. The Annals of Mathematical Statistics, 38(4):pp. 1261–1265, 1967. [90] Stuart C. Schwartz. Estimation of probability density by an orthogonal series. The Annals of Mathematical Statistics, 38(4):1261–1265, 08 1967. [91] D.W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Statistics. Wiley, 2009. [92] P. L. Shaffer. Implementation of a parallel extended Kalman filter using a bit-serial silicon compiler. In ACM ’87: Proceedings of the 1987 Fall 185 Joint Computer Conference on Exploring technology: today and tomorrow, pages 327–334, Los Alamitos, CA, USA, 1987. IEEE Computer Society Press. [93] M. M. Silva. Prediction error identification of minimally parameterized wiener models in anesthesia. In Proc. 18th IFAC World Congress, pages 5615–5620, aug 28-sep 2 2011. [94] M. M. Silva, T. Mendonça, and T. Wigren. Online nonlinear identification of the effect of drugs in anaesthesia using a minimal parameterization and bis measurements. In American Control Conference, pages 4379–4384, 2010. [95] M.M. Silva, T. Wigren, and T. Mendonça. Nonlinear identification of a minimal neuromuscular blockade model in anesthesia. Control Systems Technology, IEEE Transactions on, 20(1):181–188, 2012. [96] B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1986. [97] B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1986. [98] T. Söderström. Discrete-time stochastic systems: estimation and control. 
Prentice Hall international series in systems and control engineering. Prentice Hall, 1994. [99] T. Söderström. Discrete-time Stochastic Systems: Estimation and Control. Advanced textbooks in control and signal processing. Springer, 2002. [100] T. Söderström and P. Stoica. System identification. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988. [101] Torsten Söderström and Petre Stoica. System Identification. Prentice-Hall, Hemel Hempstead, UK, 1989. [102] Alan Stuart and Keith J. Ord. Kendall’s advanced theory of statistics. Oxford University Press, New York, 5th edition, 1987. [103] Bo Tang, Pingyuan Cui, and Yangzhou Chen. A parallel processing Kalman filter for spacecraft vehicle parameters estimation. In Communications and Information Technology, IEEE International Symposium on, volume 2, pages 1476 – 1479, Oct. 2005. [104] Michael Tarter and Richard Kronmal. On multivariate density estimates based on orthogonal expansions. The Annals of Mathematical Statistics, 41(2):pp. 718–722, 1970. [105] J.R. Thompson and P.R.A. Tapia. Non Parametric Function Estimation, Modeling & Simulation. Miscellaneous Bks. Society for 186 Industrial and Applied Mathematics (SIAM, 3600 Market Street, Floor 6, Philadelphia, PA 19104), 1990. [106] O. Tokhi, M.A. Hossain, and H. Shaheed. Parallel Computing for Real-Time Signal Processing and Control. Advanced Textbooks in Control and Signal Processing Series. Springer Verlag, 2003. [107] R. Trobec, M. Vajteršic, and P. Zinterhof. Parallel Computing: Numerics, Applications, and Trends. Springer London, 2009. [108] H.L. Van Trees. Detection, Estimation, and Modulation Theory. Number del 1 in Detection, Estimation, and Modulation Theory. Wiley, 2004. [109] Anders Vretblad. Fourier Analysis and Its Applications (Graduate Texts in Mathematics). Springer, November 2010. [110] G. Wahba. Optimal Convergence Properties of Variable Knot, Kernel, and Orthogonal Series Methods for Density Estimation. Defense Technical Information Center, 1972. [111] Fredrik Wahlberg, Alexander Medvedev, and Olov Rosén. A LEGO-based mobile robotic platform for evaluation of parallel control and estimation algorithms. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 4548–4553. IEEE, 2011. [112] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. Technology Press and John Wiley & Sons, Inc., New York, 1949. [113] T. Wigren. Fast converging and low complexity adaptive filtering using an averaged Kalman filter. Signal Processing, IEEE Transactions on, 46(2):515 –518, Feb. 1998. [114] T. Wigren. Soft uplink load estimation in WCDMA. Vehicular Technology, IEEE Transactions on, 58(2):760 –772, feb. 2009. [115] T. Wigren. Recursive Noise Floor Estimation in WCDMA. Vehicular Technology, IEEE Transactions on, 59(5):2615 –2620, jun 2010. [116] T. Wigren. WCDMA uplink load estimation with generalized rake receivers. Vehicular Technology, IEEE Transactions on, 61(5):2394 –2400, jun 2012. [117] D. Willner, C. B. Chang, and K. P. Dunn. Kalman filter algorithms for a multi-sensor system. In Decision and Control including the 15th Symposium on Adaptive Processes, 1976 IEEE Conference on, volume 15, pages 570 –574, Dec. 1976. [118] N.E. Wu. Fault Detection, Supervision and Safety of Technical Processes 2003 (SAFEPROCESS 2003): A Proceedings Volume from the 5th IFAC Symposium, Washington, D.C., USA, 9-11 June 2003. Elsevier. 187 [119] G. Xingyu, Z. Zhang, S. Grant, T. Wigren, N. 
Johansson, and A. Kangas. Load control for multistage interference cancellation. To appear at PIMRC 2012, Sydney, Australia, Sep. 2012. [120] Zh. Zhusubaliyev, A. Medvedev, and M. M. Silva. Bifurcation analysis of PID controlled neuromuscular blockade in closed-loop anesthesia. Journal of Process Control, 25:152–163, January 2015. [121] Zhanybai Zhusubaliyev, Alexander V. Medvedev, and Margarida M. Silva. Bifurcation analysis for PID-controller tuning based on a minimal neuromuscular blockade model in closed-loop anesthesia (I). In Decision and Control, 2013 IEEE 52nd Annual Conference on, pages 115–120, 2013. [122] A. Zolghadri, B. Bergeon, and M. Monsion. A two-ellipsoid overlap test for on-line failure detection. Automatica, 29(6):1517–1522, 1993.

Svensk sammanfattning (Summary in Swedish)

In this thesis, stochastic estimation has been studied, with a particular focus on parallelization intended for so-called multicore computers. The major part of the work concerns the stochastic problem of recursive optimal filtering and how different solution methods for it can be implemented in parallel. Solving the recursive optimal estimation problem is computationally very demanding, especially for nonlinear non-Gaussian systems and for systems of high order. Since the computational capacity of hardware today is mainly increased by connecting several CPUs in parallel, parallelization of algorithms is the most effective way of shortening execution times so that real-time performance can be achieved. In this work, several well-known methods, such as the Kalman filter, the extended Kalman filter, the unscented Kalman filter, the particle filter, and the point mass filter, have been parallelized. Linear speedup in the number of CPUs used has been achieved over ranges of problem sizes for all of these filtering methods. Two new solution methods for optimal filtering have also been developed. These are based on series expansions in orthogonal basis functions and are very well suited for parallelization, since the computations can easily be divided into relatively independent pieces and only a small amount of communication between these pieces is required. Optimal filtering has a broad range of applications and can be applied in seemingly completely different fields. In this work, the parallel filtering methods have been evaluated on a number of applications, such as target tracking, load estimation in mobile networks, dosing of anesthetics, and echo cancellation in communication networks. A multicore computer is a computer whose processor has two or more separate cores (processors). Since such a processor contains several separate, parallel processors, it can perform parallel computations and thereby achieve a higher computational capacity than an ordinary single-core processor. Short summaries of the material treated in each chapter follow below.

Chapter 2
This chapter presents a parallelization of the Kalman filter. A parallelization method for multiple-input single-output systems is presented. The method is based on the transition matrix of the system having a banded structure. It is discussed how different systems, both time-varying and time-invariant, can be realized in such a banded-matrix form. The parallelization method is then extended to multiple-input multiple-output systems by exploiting sequential filtering of the measurement vector.
The proposed parallelization is evaluated on a load estimation problem for mobile networks and compared against a BLAS implementation. The proposed parallelization performs significantly better than the BLAS implementation and achieves linear speedup in the number of cores used, for up to 8 cores, which should be compared with a maximum speedup of 2 times for the BLAS implementation.

Chapter 3
In this chapter, a special case of the material in Chapter 2 is studied: the Kalman filter used for parameter estimation. For this special case, a particularly efficient parallelization can be made, which is discussed in the chapter. Implementation details that optimize the execution times are also treated in more depth than in Chapter 2.

Chapter 4
Parallelization of the particle filter is studied. Four different parallelizations, the globally distributed particle filter, the resampling with proportional allocation particle filter, the resampling with non-proportional allocation particle filter, and the Gaussian particle filter, are implemented in parallel and evaluated on a multicore computer with 8 cores. The results show that the Gaussian particle filter and the resampling with proportional allocation particle filter are best suited for parallel implementation on multicore computers, where linear speedup of up to 8 times is achieved.

Chapter 5
A new solution method for the recursive Bayesian estimation problem is presented. The involved density functions are approximated by truncated series expansions in orthogonal bases. Via the prediction and update steps of the solution to the recursive Bayesian estimation problem, the coefficients of the series expansions are computed and propagated. The method has exceptionally good parallelization properties, but has the drawback that the state must lie in a region bounded in advance. An analysis of the developed method is also carried out. What is studied is, above all, how the estimation error is affected by the truncation of the series expansions and how this truncation error propagates between filter iterations.

Chapter 6
In this chapter, a new method for parallelizing the particle filter is developed. The method is based on fitting a series expansion to the particle set at the resampling step. This allows the information in the particle set to be compressed into a few coefficients, which can then be communicated efficiently between the processor cores. An analysis of how well the series expansion captures the underlying density function is performed. An upper bound on the magnitude of the coefficients when Hermite basis functions are used is also derived.

Chapter 7
A new method for anomaly detection for systems that follow trajectories in the state space is presented and discussed. The method is based on fitting and estimating, from a set of observed trajectories of the system, density functions that describe the probability of finding the system in a given state. From these density functions, tests are performed to see how much the state of the system deviates from the normal. The method is evaluated on tracking data from cargo ships, as well as on eye-tracking data from test subjects with and without Parkinson's disease.

Chapter 8
State estimation for a minimally parameterized PK/PD model of anesthetic drug effect is performed. Three nonlinear estimation methods, the extended Kalman filter, the particle filter, and the filtering method described in Chapter 7, are implemented for this problem and compared with respect to estimation quality.
It is shown that the extended Kalman filter, which is the method previously applied to this problem, gives biased estimates of the parameters, whereas the particle-based methods give unbiased estimates. Since the model is estimated to provide a basis for the control of anesthetic dosing during surgery, it is of great importance that the estimates are as good as possible.

Chapter 9
A short presentation of the results from BLAS-based implementations of the UKF and the point mass filter is given.