Probability and Statistics Cookbook
Copyright © Matthias Vallentin, 2015
[email protected]
31st March, 2015

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California in Berkeley, but also influenced by other sources [2, 3]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
. . . 12 10.2 Central Limit Theorem (CLT) . . . . . 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 . . . . 24 . . . . 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 26 26 26 27 27 28 28 . . . . . . . . . . . . . . . . 29 29 29 29 30 1 Distribution Overview 1.1 Discrete Distributions Notation1 Uniform Unif {a, . . . , b} FX (x) 0 1 Bernoulli Binomial Multinomial Hypergeometric Negative Binomial 1 We Bern (p) Bin (n, p) x<a a≤x≤b x>b (1 − p)1−x bxc−a+1 b−a I1−p (n − x, x + 1) Mult (n, p) Hyp (N, m, n) NBin (r, p) Geometric Geo (p) Poisson Po (λ) ≈Φ x − np p np(1 − p) ! Ip (r, x + 1) 1 − (1 − p)x e−λ x ∈ N+ x X λi i! i=0 fX (x) E [X] V [X] MX (s) I(a ≤ x ≤ b) b−a+1 a+b 2 (b − a + 1)2 − 1 12 eas − e−(b+1)s s(b − a) px (1 − p)1−x p p(1 − p) 1 − p + pes np np(1 − p) (1 − p + pes )n ! n x p (1 − p)n−x x k X n! x xi = n px1 1 · · · pkk x1 ! . . . xk ! i=1 m m−x x npi p(1 − p)x−1 λx e−λ x! x ∈ N+ npi (1 − pi ) !n si pi e i=0 nm N n−x N x ! x+r−1 r p (1 − p)x r−1 k X r 1−p p nm(N − n)(N − m) N 2 (N − 1) r 1−p p2 1 p 1−p p2 λ λ p 1 − (1 − p)es r pes 1 − (1 − p)es eλ(e s −1) use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and Ix to refer to the Beta functions (see §22.2). 3 Uniform (discrete) Binomial ● Geometric ● ● ● n = 40, p = 0.3 n = 30, p = 0.6 n = 25, p = 0.9 0.8 Poisson ● ● ● ● ● ● p = 0.2 p = 0.5 p = 0.8 ● ● ● λ=1 λ=4 λ = 10 ● 0.3 0.2 0.6 ● ● ● ● ● ● ● ● ● ● 0.4 ● ● ● ● ● b ● ● 0 ● ● ●●● 10 30 1.00 2.5 1.0 ● ● ● ● 5.0 ● ● ● ● ● ● ● 7.5 ● ● ● 0.0 10.0 ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● ● 10 ● ● ● 1.00 ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.75 ● ● ● 15 Poisson ● ● 0.8 ● ● ● ● ● x ● ● ● ● ● ● Geometric ●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●● ●● ● ● ● ● 0.75 ● ● ● ● ● ● 0.50 0.6 ● ● ● b x 10 ● ● ● ● ● ● ● ● ● ● ●● ●●●● ● ●●●●●●●●●●●●●●●●● ●●●●●●●●●● 0 ● 0.25 ● ● a ● 0.4 ● 0.00 ● ● 0.25 ● ● ● ● ● ● 0.50 ● ● ● ● CDF CDF CDF ● 0 0.0 ● ● ● ● x ● ● i n 40 Binomial ● ● ● x Uniform (discrete) i n ● 0.0 ● ● ● ● ● ● ● ●●●●● ●●●●●●●●●●●●●●●● 20 x 1 ● ● ● ● ● ● ● ● ● 0.1 ● ● ● ●● ●●●●●●● ●●●● ●●●●●●● 0.0 0.2 ● ● ● a ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.1 0.2 PMF ● PMF ● CDF 1 n PMF PMF ● ● 20 x ● ● ● 30 n = 40, p = 0.3 n = 30, p = 0.6 n = 25, p = 0.9 40 0.2 ● ● ● ● 0.0 2.5 5.0 x 7.5 p = 0.2 p = 0.5 p = 0.8 10.0 ● ● 0.00 ● ● ● ● ● ● ● ● ● ● 0 5 10 15 λ=1 λ=4 λ = 10 20 x 4 1.2 Continuous Distributions Notation FX (x) Uniform Unif (a, b) Normal N µ, σ 2 0 x<a a<x<b 1 x>b Z x Φ(x) = φ(t) dt x−a b−a −∞ Log-Normal ln N µ, σ 2 Multivariate Normal MVN (µ, Σ) Student’s t Chi-square Student(ν) χ2k 1 1 ln x − µ + erf √ 2 2 2σ 2 fX (x) E [X] V [X] MX (s) I(a < x < b) b−a (x − µ)2 1 φ(x) = √ exp − 2σ 2 σ 2π (ln x − µ)2 1 √ exp − 2σ 2 x 2πσ 2 a+b 2 (b − a)2 12 µ σ2 esb − esa s(b − a) σ 2 s2 exp µs + 2 1 (2π)−k/2 |Σ|−1/2 e− 2 (x−µ) ν ν Ix , 2 2 k x 1 γ , Γ(k/2) 2 2 T Σ−1 (x−µ) −(ν+1)/2 Γ ν+1 x2 2 1 + √ ν νπΓ ν2 1 xk/2−1 e−x/2 2k/2 Γ(k/2) r eµ+σ 2 2 (eσ − 1)e2µ+σ /2 µ 2 1 exp µT s + sT Σs 2 Σ ν ν−2 ∞ ( 0 ν>2 1<ν≤2 k 2k d2 d2 − 2 2d22 (d1 + d2 − 2) d1 (d2 − 2)2 (d2 − 4) (1 − 2s)−k/2 s < 1/2 d F F(d1 , d2 ) Exponential Exp (β) Gamma Inverse Gamma Dirichlet Beta Weibull Pareto I d1 x d1 x+d2 Gamma (α, β) InvGamma (α, β) d1 d1 , 2 2 Pareto(xm , α) β β2 γ(α, x/β) Γ(α) Γ α, βx Γ (α) 1 xα−1 e−x/β Γ (α) β α αβ αβ 2 β α −α−1 −β/x x e Γ (α) P k k Γ i=1 αi Y α −1 xi i Qk i=1 Γ (αi ) i=1 β α>1 α−1 β2 α>2 (α − 1)2 (α − 2) Γ (α + β) α−1 x (1 − x)β−1 Γ (α) Γ (β) α α+β 1 
λΓ 1 + k αβ (α + β)2 (α + β + 1) 2 λ2 Γ 1 + − µ2 k αxm α>1 α−1 xα m α>2 (α − 1)2 (α − 2) k 1 − e−(x/λ) 1− xB d1 d1 , 2 2 1 −x/β e β Ix (α, β) Weibull(λ, k) (d1 x+d2 )d1 +d2 1 − e−x/β Dir (α) Beta (α, β) (d1 x)d1 d2 2 x α m x x ≥ xm k x k−1 −(x/λ)k e λ λ xα m α α+1 x x ≥ xm αi Pk i=1 αi 1 (s < 1/β) 1 − βs α 1 (s < 1/β) 1 − βs p 2(−βs)α/2 −4βs Kα Γ(α) E [Xi ] (1 − E [Xi ]) Pk i=1 αi + 1 1+ ∞ X k−1 Y k=1 ∞ X n n n=0 r=0 α+r α+β+r ! sk k! s λ n Γ 1+ n! k α(−xm s)α Γ(−α, −xm s) s < 0 5 Uniform (continuous) Normal 2.0 Log−Normal 1.00 µ = 0, σ2 = 0.2 µ = 0, σ2 = 1 µ = 0, σ2 = 5 µ = −2, σ2 = 0.5 1.0 ● ● ● a b 0.00 −5.0 −2.5 0.0 x x χ F 2 k=1 k=2 k=3 k=4 k=5 0.1 0.25 0.0 1.00 0.3 0.2 0.50 0.5 2.5 5.0 ν=1 ν=2 ν=5 ν=∞ PDF ● PDF 1 b−a 0.75 PDF PDF 1.5 Student's t 0.4 µ = 0, σ2 = 3 µ = 2, σ2 = 2 µ = 0, σ2 = 1 µ = 0.5, σ2 = 1 µ = 0.25, σ2 = 1 µ = 0.125, σ2 = 1 0.0 0 1 2 3 −5.0 −2.5 0.0 x Exponential 3 2.0 β=2 β=1 β = 0.4 α = 1, β = 2 α = 2, β = 2 α = 3, β = 2 α = 5, β = 1 α = 9, β = 0.5 1.5 1.5 0.75 5.0 Gamma 2.0 d1 = 1, d2 = 1 d1 = 2, d2 = 1 d1 = 5, d2 = 2 d1 = 100, d2 = 1 d1 = 100, d2 = 100 2.5 x PDF 0.50 PDF PDF PDF 2 1.0 1.0 1 0.5 0.5 0.25 0 0.00 0 2 4 6 8 0.0 0.0 0 1 x 2 3 4 5 0 1 2 x Inverse Gamma 4 5 5 0 5 10 15 20 x Weibull 2.0 α = 0.5, β = 0.5 α = 5, β = 1 α = 1, β = 3 α = 2, β = 2 α = 2, β = 5 4 4 x Beta α = 1, β = 1 α = 2, β = 1 α = 3, β = 1 α = 3, β = 0.5 3 Pareto λ = 1, k = 0.5 λ = 1, k = 1 λ = 1, k = 1.5 λ = 1, k = 5 4 xm = 1, k = 1 xm = 1, k = 2 xm = 1, k = 4 3 1.5 3 2 PDF PDF PDF PDF 3 2 1.0 2 1 0.5 1 1 0 0 0 1 2 3 x 4 5 0 0.0 0.00 0.25 0.50 x 0.75 1.00 0.0 0.5 1.0 1.5 x 2.0 2.5 1.0 1.5 2.0 2.5 x 6 Uniform (continuous) Normal 1 Log−Normal 1.00 Student's t 1.00 µ = 0, σ2 = 3 µ = 2, σ2 = 2 µ = 0, σ2 = 1 µ = 0.5, σ2 = 1 µ = 0.25, σ2 = 1 µ = 0.125, σ2 = 1 0.75 0.75 0.75 CDF CDF CDF CDF 0.50 0.50 0.50 0.25 0.25 0.25 µ = 0, σ = 0.2 µ = 0, σ2 = 1 µ = 0, σ2 = 5 µ = −2, σ2 = 0.5 2 0 0.00 a b −5.0 −2.5 0.0 x 2.5 0.00 5.0 χ 0.00 4 6 Exponential 0.75 0.75 0.50 0.00 1 x 2 3 4 1 2 0.00 3 4 5 0.00 1.00 0.75 0.75 0.25 0.50 x 10 0.75 1.00 15 20 0.50 0.25 λ = 1, k = 0.5 λ = 1, k = 1 λ = 1, k = 1.5 λ = 1, k = 5 0.00 0.00 5 Pareto 1.00 0.25 0.25 α = 1, β = 1 α = 2, β = 1 α = 3, β = 1 α = 3, β = 0.5 0 x 0.50 0.50 0.25 0.00 5 CDF 0.50 4 Weibull α = 0.5, β = 0.5 α = 5, β = 1 α = 1, β = 3 α = 2, β = 2 α = 2, β = 5 CDF 0.75 CDF 0.75 3 x Beta 1.00 x 0 α = 1, β = 2 α = 2, β = 2 α = 3, β = 2 α = 5, β = 1 α = 9, β = 0.5 CDF Inverse Gamma 2 β=2 β=1 β = 0.4 x 1.00 1 0.25 0.00 5 5.0 0.50 0.25 d1 = 1, d2 = 1 d1 = 2, d2 = 1 d1 = 5, d2 = 2 d1 = 100, d2 = 1 d1 = 100, d2 = 100 2.5 CDF 0.75 0 0.0 Gamma 1.00 8 −2.5 x CDF CDF k=1 k=2 k=3 k=4 k=5 −5.0 1.00 0.25 0.25 3 1.00 0.50 0.50 0 2 x CDF 0.75 2 1 F 1.00 0 0.00 0 x 2 ν=1 ν=2 ν=5 ν=∞ 0.0 0.5 1.0 1.5 x 2.0 2.5 xm = 1, k = 1 xm = 1, k = 2 xm = 1, k = 4 0.00 1.0 1.5 2.0 2.5 x 7 2 Probability Theory Law of Total Probability Definitions • • • • P [B] = Sample space Ω Outcome (point or element) ω ∈ Ω Event A ⊆ Ω σ-algebra A P [B|Ai ] P [Ai ] n G Ai i=1 Bayes’ Theorem P [B | Ai ] P [Ai ] P [Ai | B] = Pn j=1 P [B | Aj ] P [Aj ] Inclusion-Exclusion Principle [ X n n Ai = (−1)r−1 • Probability Distribution P i=1 1. P [A] ≥ 0 ∀A 2. P [Ω] = 1 "∞ # ∞ G X 3. P Ai = P [Ai ] 3 r=1 X Ω= n G Ai i=1 \ r A ij i≤i1 <···<ir ≤n j=1 Random Variables Random Variable (RV) i=1 X:Ω→R • Probability space (Ω, A, P) Probability Mass Function (PMF) Properties • • • • • • • • Ω= i=1 1. ∅ ∈ A S∞ 2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A 3. 
A ∈ A =⇒ ¬A ∈ A i=1 n X P [∅] = 0 B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B) P [¬A] = 1 − P [A] P [B] = P [A ∩ B] + P [¬A ∩ B] P [Ω] = 1 P [∅] = 0 S T T S ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan S T P [ n An ] = 1 − P [ n ¬An ] P [A ∪ B] = P [A] + P [B] − P [A ∩ B] =⇒ P [A ∪ B] ≤ P [A] + P [B] • P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B] • P [A ∩ ¬B] = P [A] − P [A ∩ B] fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}] Probability Density Function (PDF) Z P [a ≤ X ≤ b] = b f (x) dx a Cumulative Distribution Function (CDF) FX : R → [0, 1] FX (x) = P [X ≤ x] 1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 ) 2. Normalized: limx→−∞ = 0 and limx→∞ = 1 3. Right-Continuous: limy↓x F (y) = F (x) Continuity of Probabilities • A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A] • A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] S∞ whereA = i=1 Ai T∞ whereA = i=1 Ai Z P [a ≤ Y ≤ b | X = x] = b fY |X (y | x)dy Independence ⊥ ⊥ fY |X (y | x) = A⊥ ⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] f (x, y) fX (x) Independence Conditional Probability P [A | B] = a≤b a P [A ∩ B] P [B] P [B] > 0 1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y] 2. fX,Y (x, y) = fX (x)fY (y) 8 3.1 Z Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y) X,Y Transformation function • E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality) • P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ] • P [X = Y ] = 1 ⇐⇒ E [X] = E [Y ] ∞ X • E [X] = P [X ≥ x] Z = ϕ(X) Discrete X fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) = f (x) x∈ϕ−1 (z) x=1 Sample mean Continuous n X ¯n = 1 Xi X n i=1 Z FZ (z) = P [ϕ(X) ≤ z] = with Az = {x : ϕ(x) ≤ z} f (x) dx Az Special case if ϕ strictly monotone dx d 1 fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x) dz dz |J| Conditional expectation Z • E [Y | X = x] = yf (y | x) dy • E [X] = E [E [X | Y ]] ∞ Z The Rule of the Lazy Statistician • E[ϕ(X, Y ) | X = x] = E [Z] = ϕ(x, y)fY |X (y | x) dx Z −∞ ∞ Z ϕ(x) dFX (x) • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz −∞ Z E [IA (x)] = • E [Y + Z | X] = E [Y | X] + E [Z | X] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X] • E[Y | X] = c =⇒ Cov [X, Y ] = 0 Z dFX (x) = P [X ∈ A] IA (x) dFX (x) = A Convolution Z • Z := X + Y ∞ fX,Y (x, z − x) dx fZ (z) = −∞ Z X,Y ≥0 Z z fX,Y (x, z − x) dx = 0 ∞ • Z := |X − Y | • Z := 4 X Y fZ (z) = 2 fX,Y (x, z + x) dx 0 Z ∞ Z ∞ ⊥ ⊥ fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx −∞ −∞ 5 Variance Definition and properties 2 2 • V [X] = σX = E (X − E [X])2 = E X 2 − E [X] " n # n X X X • V Xi = V [Xi ] + 2 Cov [Xi , Yj ] i=1 Expectation • V Definition and properties Z • E [X] = µX = x dFX (x) = i=1 # Xi = i=1 X xfX (x) x Z xfX (x) dx • P [X = c] = 1 =⇒ E [X] = c • E [cX] = c E [X] • E [X + Y ] = E [X] + E [Y ] " n X X discrete n X i6=j V [Xi ] if Xi ⊥ ⊥ Xj i=1 Standard deviation sd[X] = p V [X] = σX Covariance X continuous • • • • Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ] Cov [X, a] = 0 Cov [X, X] = V [X] Cov [X, Y ] = Cov [Y, X] 9 7 • Cov [aX, bY ] = abCov [X, Y ] • Cov [X + a, Y + b] = Cov [X, Y ] n m n X m X X X • Cov Xi , Yj = Cov [Xi , Yj ] i=1 j=1 Distribution Relationships Binomial • Xi ∼ Bern (p) =⇒ i=1 j=1 n X Xi ∼ Bin (n, p) i=1 Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p) • limn→∞ Bin (n, p) = Po (np) (n large, p small) • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1) Cov [X, Y ] ρ [X, Y ] = p V [X] V [Y ] Independence Negative Binomial X⊥ ⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ] Sample variance n S2 = 1 X ¯ n )2 (Xi − X n − 1 i=1 Conditional variance 2 • V [Y | X] = E (Y − E 
[Y | X])2 | X = E Y 2 | X − E [Y | X] • V [Y ] = E [V [Y | X]] + V [E [Y | X]] 6 X ∼ NBin (1, p) = Geo (p) Pr X ∼ NBin (r, p) = i=1 Geo (p) P P Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p) X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r] Poisson • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ n X Xi ∼ Po i=1 n X ! λi i=1 X n X n λ i Xj ∼ Bin Xj , Pn • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi λ j j=1 j=1 j=1 Inequalities Cauchy-Schwarz • • • • Exponential 2 E [XY ] ≤ E X 2 E Y 2 Markov P [ϕ(X) ≥ t] ≤ Chebyshev P [|X − E [X]| ≥ t] ≤ Chernoff P [X ≥ (1 + δ)µ] ≤ • Xi ∼ Exp (β) ∧ Xi ⊥ ⊥ Xj =⇒ E [ϕ(X)] t δ Xi ∼ Gamma (n, β) i=1 • Memoryless property: P [X > x + y | X > y] = P [X > x] V [X] t2 e (1 + δ)1+δ n X Normal • X ∼ N µ, σ 2 δ > −1 Hoeffding X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n ¯ −E X ¯ ≥ t ≤ e−2nt2 t > 0 P X 2 2 ¯ −E X ¯ | ≥ t ≤ 2 exp − Pn 2n t P |X t>0 2 i=1 (bi − ai ) Jensen E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • • • • X−µ σ ∼ N (0, 1) X ∼ N µ, σ 2 ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2 X ∼ N µ1 , σ12 ∧ Y ∼ N µ2 , σ22 =⇒ X + Y ∼ N µ1 + µ2 , σ12 + σ22 P P P 2 Xi ∼ N µi , σi2 =⇒ X ∼N i µi , i σi i i b−µ a−µ P [a < X ≤ b] = Φ σ − Φ σ =⇒ • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x) • Upper quantile of N (0, 1): zα = Φ−1 (1 − α) Gamma • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1) Pα • Gamma (α, β) ∼ i=1 Exp (β) 10 P • Xi ∼ Gamma (αi , β) ∧ Xi ⊥ ⊥ Xj =⇒ Z ∞ Γ(α) = xα−1 e−λx dx • λα 0 9.2 P Xi ∼ Gamma ( i αi , β) i Bivariate Normal Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 . Beta f (x, y) = 1 Γ(α + β) α−1 • xα−1 (1 − x)β−1 = x (1 − x)β−1 B(α, β) Γ(α)Γ(β) B(α + k, β) α+k−1 • E Xk = = E X k−1 B(α, β) α+β+k−1 • Beta (1, 1) ∼ Unif (0, 1) 8 " z= x − µx σx 2 + Xt • MX (t) = GX (e ) = E e =E # "∞ X (Xt)i i! i=0 ∞ X E Xi = · ti i! i=0 − 2ρ x − µx σx y − µy σy # (Precision matrix Σ−1 ) V [X1 ] · · · Cov [X1 , Xk ] .. .. .. Σ= . . . Covariance matrix Σ (i) GX (0) i! • E [X] = G0X (1− ) (k) • E X k = MX (0) X! (k) • E = GX (1− ) (X − k)! Cov [Xk , X1 ] · · · −1/2 fX (x) = (2π)−n/2 |Σ| 2 d • • • • Multivariate Distributions Standard Bivariate Normal Let X, Y ∼ N (0, 1) ∧ X ⊥ ⊥ Z where Y = ρX + p 10 1 x2 + y 2 − 2ρxy p exp − 2(1 − ρ2 ) 2π 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2 Convergence Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote the cdf of Xn and let F denote the cdf of X. Conditionals and Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ) X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1) X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa 1 − ρ2 Z Joint density 1 exp − (x − µ)T Σ−1 (x − µ) 2 Properties • GX (t) = GY (t) =⇒ X = Y (Y | X = x) ∼ N ρx, 1 − ρ2 V [Xk ] If X ∼ N (µ, Σ), • V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) f (x, y) = 2 Multivariate Normal • P [X = i] = 9.1 y − µy σy σX (Y − E [Y ]) σY p V [X | Y ] = σX 1 − ρ2 9.3 • P [X = 0] = GX (0) • P [X = 1] = G0X (0) 9 z exp − 2 2(1 − ρ2 ) 1−ρ E [X | Y ] = E [X] + ρ |t| < 1 t Conditional mean and variance Probability and Moment Generating Functions • GX (t) = E tX 2πσx σy 1 p Types of convergence D 1. In distribution (weakly, in law): Xn → X Independence X⊥ ⊥ Y ⇐⇒ ρ = 0 lim Fn (t) = F (t) n→∞ ∀t where F continuous 11 P 10.2 2. In probability: Xn → X (∀ε > 0) lim P [|Xn − X| > ε] = 0 Central Limit Theorem (CLT) Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 . n→∞ √ ¯ ¯n − µ n(Xn − µ) D X →Z Zn := q = σ ¯n V X as 3. 
Almost surely (strongly): Xn → X h P i h i lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 n→∞ n→∞ where Z ∼ N (0, 1) lim P [Zn ≤ z] = Φ(z) z∈R n→∞ qm 4. In quadratic mean (L2 ): Xn → X CLT notations Zn ≈ N (0, 1) 2 ¯ n ≈ N µ, σ X n σ2 ¯ Xn − µ ≈ N 0, n √ 2 ¯ n(Xn − µ) ≈ N 0, σ √ ¯ n(Xn − µ) ≈ N (0, 1) σ lim E (Xn − X)2 = 0 n→∞ Relationships qm P D • Xn → X =⇒ Xn → X =⇒ Xn → X as P • Xn → X =⇒ Xn → X D P • Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X • • • • Xn Xn Xn Xn P →X qm →X P →X P →X ∧ Yn ∧ Yn ∧ Yn =⇒ P P → Y =⇒ Xn + Yn → X + Y qm qm → Y =⇒ Xn + Yn → X + Y P P → Y =⇒ Xn Yn → XY P ϕ(Xn ) → ϕ(X) D Continuity correction D • Xn → X =⇒ ϕ(Xn ) → ϕ(X) qm • Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0 qm ¯n → • X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X µ ¯n ≤ x ≈ Φ P X ¯n ≥ x ≈ 1 − Φ P X Slutzky’s Theorem D P D • Xn → X and Yn → c =⇒ Xn + Yn → X + c D P D • Xn → X and Yn → c =⇒ Xn Yn → cX D D D • In general: Xn → X and Yn → Y =⇒ 6 Xn + Yn → X + Y 10.1 Law of Large Numbers (LLN) x + 12 − µ √ σ/ n Yn ≈ N 11 σ2 µ, n =⇒ ϕ(Yn ) ≈ N 11.1 as ¯n → X µ n→∞ Strong (SLLN) σ2 ϕ(µ), (ϕ (µ)) n 0 2 iid Weak (WLLN) n→∞ Statistical Inference Let X1 , · · · , Xn ∼ F if not otherwise noted. ¯n → µ X x − 12 − µ √ σ/ n Delta method Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ. P Point Estimation • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn ) h i • bias(θbn ) = E θbn − θ P • Consistency: θbn → θ 12 • Sampling distribution: F (θbn ) r h i • Standard error: se(θbn ) = V θbn h i h i • Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn Nonparametric 1 − α confidence band for F L(x) = max{Fbn − n , 0} U (x) = min{Fbn + n , 1} s 1 2 = log 2n α • limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent θbn − θ D → N (0, 1) • Asymptotic normality: se • Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consistent estimator σ bn . 11.2 Normal-Based Confidence Interval b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2 Suppose θbn ≈ N θ, se and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then b Cn = θbn ± zα/2 se P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α 11.4 • • • • Statistical Functionals Statistical functional: T (F ) Plug-in estimator of θ = (F ): θbn = T (Fbn ) R Linear functional: T (F ) = ϕ(x) dFX (x) Plug-in estimator for linear functional: Z T (Fbn ) = 11.3 Empirical distribution b 2 =⇒ T (Fbn ) ± zα/2 se b • Often: T (Fbn ) ≈ N T (F ), se Empirical Distribution Function (ECDF) Pn Fbn (x) = i=1 n 1X ϕ(Xi ) ϕ(x) dFbn (x) = n i=1 I(Xi ≤ x) n ( 1 I(Xi ≤ x) = 0 Xi ≤ x Xi > x Properties (for any fixed x) h i • E Fbn = F (x) h i F (x)(1 − F (x)) • V Fbn = n F (x)(1 − F (x)) D • mse = →0 n P • Fbn → F (x) Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F ) • pth quantile: F −1 (p) = inf{x : F (x) ≥ p} ¯n • µ b=X n 1 X ¯ n )2 (Xi − X • σ b2 = n − 1 i=1 Pn 1 b)3 i=1 (Xi − µ n • κ b= b3 Pσ n ¯ ¯ i=1 (Xi − Xn )(Yi − Yn ) qP • ρb = qP n n ¯ 2 ¯ 2 i=1 (Xi − Xn ) i=1 (Yi − Yn ) 12 Parametric Inference Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk and parameter θ = (θ1 , . . . , θk ). 12.1 Method of Moments j th moment 2 P sup F (x) − Fbn (x) > ε = 2e−2nε x αj (θ) = E X j = Z xj dFX (x) 13 j th sample moment Fisher information (exponential family) n 1X j α bj = X n i=1 i ∂ I(θ) = Eθ − s(X; θ) ∂θ Method of moments estimator (MoM) Observed Fisher information α1 (θ) = α b1 α2 (θ) = α b2 .. .. .=. 
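The empirical distribution function and its DKW confidence band (§11.3) translate directly into a few lines of code. The sketch below is illustrative only and assumes NumPy is available; the function name and the simulated data are not part of the cookbook. It evaluates F̂_n at the order statistics and attaches the band L(x) = max{F̂_n − ε_n, 0}, U(x) = min{F̂_n + ε_n, 1} with ε_n = sqrt(log(2/α)/(2n)).

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """Empirical CDF of a sample plus a 1 - alpha DKW confidence band."""
    x = np.sort(np.asarray(x))
    n = x.size
    F_hat = np.arange(1, n + 1) / n             # F-hat_n at the order statistics
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))  # DKW half-width eps_n
    lower = np.clip(F_hat - eps, 0.0, 1.0)      # L(x) = max{F-hat_n - eps_n, 0}
    upper = np.clip(F_hat + eps, 0.0, 1.0)      # U(x) = min{F-hat_n + eps_n, 1}
    return x, F_hat, lower, upper

# Example: band for a standard normal sample of size 200
rng = np.random.default_rng(0)
x, F_hat, lo, hi = ecdf_band(rng.normal(size=200))
```

By the DKW inequality, the band contains the true F(x) simultaneously for all x with probability at least 1 − α.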
Inobs (θ) = − αk (θ) = α bk Properties of the mle Properties of the MoM estimator • • • θbn exists with probability tending to 1 P Consistency: θbn → θ Asymptotic normality: √ D n(θb − θ) → N (0, Σ) where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , ∂ −1 αj (θ) g = (g1 , . . . , gk ) and gj = ∂θ 12.2 P • Consistency: θbn → θ • Equivariance: θbn is the mle =⇒ ϕ(θbn ) ist the mle of ϕ(θ) • Asymptotic normality: p 1. se ≈ 1/In (θ) (θbn − θ) D → N (0, 1) se q b ≈ 1/In (θbn ) 2. se (θbn − θ) D → N (0, 1) b se Maximum Likelihood • Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θen is any other estimator, the asymptotic relative efficiency is Likelihood: Ln : Θ → [0, ∞) Ln (θ) = n ∂2 X log f (Xi ; θ) ∂θ2 i=1 n Y f (Xi ; θ) h i V θbn are(θen , θbn ) = h i ≤ 1 V θen i=1 Log-likelihood `n (θ) = log Ln (θ) = n X log f (Xi ; θ) • Approximately the Bayes estimator i=1 Maximum likelihood estimator (mle) 12.2.1 Ln (θbn ) = sup Ln (θ) θ Score function s(X; θ) = ∂ log f (X; θ) ∂θ Fisher information I(θ) = Vθ [s(X; θ)] In (θ) = nI(θ) Delta Method b where ϕ is differentiable and ϕ0 (θ) 6= 0: If τ = ϕ(θ) (b τn − τ ) D → N (0, 1) b τ) se(b b is the mle of τ and where τb = ϕ(θ) b b b = ϕ0 (θ) b θn ) se se( 14 12.3 13 Multiparameter Models Hypothesis Testing Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle. Hjj ∂ 2 `n = ∂θ2 Hjk H0 : θ ∈ Θ0 ∂ 2 `n = ∂θj ∂θk Definitions Fisher information matrix Eθ [H11 ] · · · .. .. In (θ) = − . . Eθ [Hk1 ] · · · Eθ [H1k ] .. . Eθ [Hkk ] Under appropriate regularity conditions (θb − θ) ≈ N (0, Jn ) with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then (θbj − θj ) D → N (0, 1) bj se h i b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k) where se 12.3.1 • • • • • • • • • • • Null hypothesis H0 Alternative hypothesis H1 Simple hypothesis θ = θ0 Composite hypothesis θ > θ0 or θ < θ0 Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0 One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 Critical value c Test statistic T Rejection region R = {x : T (x) > c} Power function β(θ) = P [X ∈ R] Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ) θ∈Θ1 • Test size: α = P [Type I error] = sup β(θ) θ∈Θ0 Multiparameter delta method Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be ∂ϕ ∂θ1 . ∇ϕ = .. ∂ϕ H0 true H1 true 1−Fθ (T (X)) p-value < 0.01 0.01 − 0.05 0.05 − 0.1 > 0.1 (b τ − τ) D → N (0, 1) b τ) se(b b τ) = se(b b and ∇ϕ b = ∇ϕ b. and Jbn = Jn (θ) θ=θ 12.4 Parametric Bootstrap b ∇ϕ T Type II Error (β) Reject H0 Type √ I Error (α) (power) • p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα • p-value = supθ∈Θ0 Pθ [T (X ? ) ≥ T (X)] = inf α : T (X) ∈ Rα {z } | b Then, Suppose ∇ϕθ=θb 6= 0 and τb = ϕ(θ). r Retain H0 √ p-value ∂θk where H1 : θ ∈ Θ1 versus since T (X ? )∼Fθ evidence very strong evidence against H0 strong evidence against H0 weak evidence against H0 little or no evidence against H0 Wald test b Jbn ∇ϕ • Two-sided test θb − θ0 • Reject H0 when |W | > zα/2 where W = b se • P |W | > zα/2 → α • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|) Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method of moments estimator. Likelihood ratio test (LRT) 15 supθ∈Θ Ln (θ) Ln (θbn ) = supθ∈Θ0 Ln (θ) Ln (θbn,0 ) k X D iid • λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . 
, Zk ∼ N (0, 1) • T (X) = Vector parameter ( fX (x | θ) = h(x) exp = h(x) exp {η(θ) · T (x) − A(θ)} = h(x)g(θ) exp {η(θ) · T (x)} fX (x | η) = h(x) exp {η · T(x) − A(η)} = h(x)g(η) exp {η · T(x)} = h(x)g(η) exp η T T(x) 15 Bayesian Inference Bayes’ Theorem f (θ | x) = Pearson Chi-square Test k X (Xj − E [Xj ])2 where E [Xj ] = np0j under H0 E [Xj ] j=1 D • T → χ2k−1 • p-value = P χ2k−1 > T (x) D 2 • Faster → Xk−1 than LRT, hence preferable for small n f (x | θ)f (θ) f (x | θ)f (θ) =R ∝ Ln (θ)f (θ) f (xn ) f (x | θ)f (θ) dθ Definitions • • • • Independence testing X n = (X1 , . . . , Xn ) xn = (x1 , . . . , xn ) Prior density f (θ) Likelihood f (xn | θ): joint density of the data n Y f (xi | θ) = Ln (θ) In particular, X n iid =⇒ f (xn | θ) = i=1 • I rows, J columns, X multinomial sample of size n = I ∗ J X • mles unconstrained: pbij = nij X • mles under H0 : pb0ij = pbi· pb·j = Xni· n·j PI PJ nX • LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j PI PJ (X −E[X ])2 • PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij D • LRT and Pearson → χ2k ν, where ν = (I − 1)(J − 1) • Posterior density f (θ | xn ) R • Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ • Kernel: part of a density that depends Ron θ R θLn (θ)f (θ) • Posterior mean θ¯n = θf (θ | xn ) dθ = R Ln (θ)f (θ) dθ 15.1 Credible Intervals Posterior interval n 14 ηi (θ)Ti (x) − A(θ) Natural form • The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α • T = ) i=1 i=1 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x) Multinomial LRT Xk X1 ,..., • mle: pbn = n n Xj k Y pbj Ln (b pn ) = • T (X) = Ln (p0 ) p0j j=1 k X pbj D • λ(X) = 2 Xj log → χ2k−1 p 0j j=1 s X Exponential Family b Z f (θ | xn ) dθ = 1 − α P [θ ∈ (a, b) | x ] = a Equal-tail credible interval Z a Z n f (θ | x ) dθ = Scalar parameter fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} = h(x)g(θ) exp {η(θ)T (x)} −∞ ∞ f (θ | xn ) dθ = α/2 b Highest posterior density (HPD) region Rn 16 1. P [θ ∈ Rn ] = 1 − α 2. Rn = {θ : f (θ | xn ) > k} for some k 15.3.1 Continuous likelihood (subscript c denotes constant) Rn is unimodal =⇒ Rn is an interval 15.2 Conjugate Priors Function of parameters Likelihood Conjugate prior Unif (0, θ) Pareto(xm , k) Exp (λ) Gamma (α, β) i=1 2 Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }. Posterior CDF for τ N µ, σc H(r | xn ) = P [ϕ(θ) ≤ τ | xn ] = Z 2 N µ0 , σ0 f (θ | xn ) dθ N µc , σ 2 A Posterior density N µ, σ 2 h(τ | xn ) = H 0 (τ | xn ) Bayesian delta method b b se b ϕ0 (θ) τ | X n ≈ N ϕ(θ), 15.3 Posterior hyperparameters max x(n) , xm , k + n n X xi α + n, β + MVN(µ, Σc ) MVN(µc , Σ) Priors Choice • Subjective bayesianism. • Objective bayesianism. • Robust bayesianism. 
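The posterior f(θ | xⁿ) ∝ Lₙ(θ)f(θ) and the equal-tail credible interval of §15.1 can be evaluated numerically when no conjugate form is used. The following is a minimal sketch under stated assumptions (NumPy available; flat prior, grid size, and the coin-flip data are illustrative choices, not prescriptions from the cookbook), for iid Bernoulli(θ) data.

```python
import numpy as np

def grid_posterior_bernoulli(x, grid_size=1000, alpha=0.05):
    """Posterior f(theta | x^n) on a grid, its mean, and an equal-tail
    1 - alpha credible interval, for iid Bernoulli data with a flat prior."""
    theta = np.linspace(1e-6, 1 - 1e-6, grid_size)
    s, n = np.sum(x), len(x)
    log_lik = s * np.log(theta) + (n - s) * np.log1p(-theta)  # log L_n(theta)
    log_prior = np.zeros_like(theta)                          # flat prior f(theta) ∝ 1
    log_post = log_lik + log_prior
    post = np.exp(log_post - np.max(log_post))                # unnormalized posterior
    post /= np.trapz(post, theta)                             # normalizing constant c_n
    cdf = np.cumsum(post) / np.sum(post)
    lo = theta[np.searchsorted(cdf, alpha / 2)]
    hi = theta[np.searchsorted(cdf, 1 - alpha / 2)]
    post_mean = np.trapz(theta * post, theta)                 # posterior mean
    return theta, post, post_mean, (lo, hi)

# Example: 30 coin flips with 21 heads
theta, post, mean, ci = grid_posterior_bernoulli(np.r_[np.ones(21), np.zeros(9)])
```

The same grid recipe applies to any one-dimensional parameter; only the log-likelihood and the prior line change.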
Scaled Inverse Chisquare(ν, σ02 ) Normalscaled Inverse Gamma(λ, ν, α, β) MVN(µ0 , Σ0 ) InverseWishart(κ, Ψ) Pareto(xmc , k) Gamma (α, β) Pareto(xm , kc ) Pareto(x0 , k0 ) Gamma (αc , β) Gamma (α0 , β0 ) Pn µ0 1 n i=1 xi + / + 2 , σ2 σ2 σ02 σc 0 c−1 1 n + 2 σ02 σc Pn νσ02 + i=1 (xi − µ)2 ν + n, ν+n νλ + n¯ x n , ν + n, α + , ν+n 2 n 2 1X γ(¯ x − λ) β+ (xi − x ¯)2 + 2 i=1 2(n + γ) −1 Σ−1 0 + nΣc −1 −1 Σ−1 x ¯ , 0 µ0 + nΣ −1 −1 Σ−1 0 + nΣc n X (xi − µc )(xi − µc )T n + κ, Ψ + i=1 n X xi x mc i=1 x0 , k0 − kn where k0 > kn n X α0 + nαc , β0 + xi α + n, β + log i=1 Types • Flat: f (θ) ∝ constant R∞ • Proper: −∞ f (θ) dθ = 1 R∞ • Improper: −∞ f (θ) dθ = ∞ • Jeffrey’s prior (transformation-invariant): f (θ) ∝ p I(θ) f (θ) ∝ p det(I(θ)) • Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family 17 log10 BF10 Discrete likelihood Likelihood Conjugate prior Posterior hyperparameters Bern (p) Beta (α, β) α+ Bin (p) Beta (α, β) α+ n X i=1 n X xi , β + n − xi , β + i=1 NBin (p) Beta (α, β) Po (λ) α + rn, β + Gamma (α, β) Multinomial(p) α+ Dir (α) α+ n X i=1 n X Beta (α, β) n X xi i=1 Ni − i=1 n X xi xi , β + n 16 1+ p 1−p BF10 evidence 1 − 1.5 1.5 − 10 10 − 100 > 100 Weak Moderate Strong Decisive where p = P [H1 ] and p∗ = P [H1 | xn ] Sampling Methods 16.1 Inverse Transform Sampling Setup x(i) α + n, β + p∗ = xi i=1 i=1 i=1 Geo (p) n X n X 0 − 0.5 0.5 − 1 1−2 >2 p BF 10 1−p BF10 n X xi i=1 • U ∼ Unif (0, 1) • X∼F • F −1 (u) = inf{x | F (x) ≥ u} Algorithm 15.4 Bayesian Testing 1. Generate u ∼ Unif (0, 1) 2. Compute x = F −1 (u) If H0 : θ ∈ Θ0 : Z Prior probability P [H0 ] = f (θ) dθ 16.2 f (θ | xn ) dθ Let Tn = g(X1 , . . . , Xn ) be a statistic. Θ0 Posterior probability P [H0 | xn ] = Z Θ0 1. Estimate VF [Tn ] with VFbn [Tn ]. 2. Approximate VFbn [Tn ] using simulation: Let H0 , . . . , HK−1 be K hypotheses. Suppose θ ∼ f (θ | Hk ), f (xn | Hk )P [Hk ] , P [Hk | xn ] = PK n k=1 f (x | Hk )P [Hk ] Marginal likelihood f (xn | Hi ) = Z f (xn | θ, Hi )f (θ | Hi ) dθ Θ P [Hi | x ] P [Hj | xn ] 16.2.1 n = f (x | Hi ) f (xn | Hj ) | {z } Bayes Factor BFij Bayes factor ∗ ∗ , . . . , Tn,B , an iid sample from (a) Repeat the following B times to get Tn,1 b the sampling distribution implied by Fn i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn . ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ). (b) Then !2 B B X X 1 1 ∗ bb = vboot = V Tn,b − T∗ Fn B B r=1 n,r b=1 Posterior odds (of Hi relative to Hj ) n The Bootstrap × P [Hi ] P [Hj ] | {z } prior odds Bootstrap Confidence Intervals Normal-based interval b boot Tn ± zα/2 se Pivotal interval 1. Location parameter θ = T (F ) 18 2. Pivot Rn = θbn − θ 3. Let H(r) = P [Rn ≤ r] be the cdf of Rn ∗ ∗ 4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: 2. Generate u ∼ Unif (0, 1) Ln (θcand ) 3. Accept θcand if u ≤ Ln (θbn ) B 1 X ∗ b H(r) = I(Rn,b ≤ r) B 16.4 Sample from an importance function g rather than target density h. Algorithm to obtain an approximation to E [q(θ) | xn ]: b=1 ∗ ∗ 5. θβ∗ = β sample quantile of (θbn,1 , . . . , θbn,B ) iid ∗ ∗ ), i.e., rβ∗ = θβ∗ − θbn 6. rβ∗ = β sample quantile of (Rn,1 , . . . , Rn,B 7. Approximate 1 − α confidence interval Cn = a ˆ, ˆb where a ˆ= ˆb = b −1 1 − α = θbn − H 2 α −1 b = θbn − H 2 Percentile interval 16.3 ∗ θbn − r1−α/2 = ∗ 2θbn − θ1−α/2 ∗ θbn − rα/2 = ∗ 2θbn − θα/2 Rejection Sampling • We can easily sample from g(θ) • We want to sample from h(θ), but it is difficult k(θ) k(θ) dθ • Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ • We know h(θ) up to a proportional constant: h(θ) = R Algorithm 1. 
Draw θcand ∼ g(θ) 2. Generate u ∼ Unif (0, 1) k(θcand ) 3. Accept θcand if u ≤ M g(θcand ) 4. Repeat until B values of θcand have been accepted Example We can easily sample from the prior g(θ) = f (θ) Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ) Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M Algorithm 1. Draw θcand ∼ f (θ) 1. Sample from the prior θ1 , . . . , θn ∼ f (θ) Ln (θi ) ∀i = 1, . . . , B 2. wi = PB i=1 Ln (θi ) PB 3. E [q(θ) | xn ] ≈ i=1 q(θi )wi 17 Decision Theory Definitions ∗ ∗ Cn = θα/2 , θ1−α/2 Setup • • • • Importance Sampling • Unknown quantity affecting our decision: θ ∈ Θ • Decision rule: synonymous for an estimator θb • Action a ∈ A: possible value of the decision rule. In the estimation b context, the action is just an estimate of θ, θ(x). • Loss function L: consequences of taking action a when true state is θ or b L : Θ × A → [−k, ∞). discrepancy between θ and θ, Loss functions • Squared error loss: L(θ, a) = (θ − a)2 ( K1 (θ − a) a − θ < 0 • Linear loss: L(θ, a) = K2 (a − θ) a − θ ≥ 0 • Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 ) • Lp loss: L(θ, a) = |θ − a|p ( 0 a=θ • Zero-one loss: L(θ, a) = 1 a 6= θ 17.1 Risk Posterior risk Z h i b b L(θ, θ(x))f (θ | x) dθ = Eθ|X L(θ, θ(x)) Z h i b b L(θ, θ(x))f (x | θ) dx = EX|θ L(θ, θ(X)) r(θb | x) = (Frequentist) risk b = R(θ, θ) 19 18 Bayes risk ZZ b = r(f, θ) h i b b L(θ, θ(x))f (x, θ) dx dθ = Eθ,X L(θ, θ(X)) h h ii h i b = Eθ EX|θ L(θ, θ(X) b b r(f, θ) = Eθ R(θ, θ) h h ii h i b = EX Eθ|X L(θ, θ(X) b r(f, θ) = EX r(θb | X) Linear Regression Definitions • Response variable Y • Covariate X (aka predictor variable or feature) 18.1 Simple Linear Regression Model 17.2 Yi = β0 + β1 Xi + i Admissibility E [i | Xi ] = 0, V [i | Xi ] = σ 2 Fitted line • θb0 dominates θb if b0 b ∀θ : R(θ, θ ) ≤ R(θ, θ) b ∃θ : R(θ, θb0 ) < R(θ, θ) • θb is inadmissible if there is at least one other estimator θb0 that dominates it. Otherwise it is called admissible. rb(x) = βb0 + βb1 x Predicted (fitted) values Ybi = rb(Xi ) Residuals ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi Residual sums of squares (rss) 17.3 Bayes Rule rss(βb0 , βb1 ) = Bayes rule (or Bayes estimator) n X ˆ2i i=1 b = inf e r(f, θ) e • r(f, θ) θ R b b = r(θb | x)f (x) dx • θ(x) = inf r(θb | x) ∀x =⇒ r(f, θ) Least square estimates βbT = (βb0 , βb1 )T : min rss b0 ,β b1 β Theorems • Squared error loss: posterior mean • Absolute error loss: posterior median • Zero-one loss: posterior mode 17.4 Minimax Rules Maximum risk b = sup R(θ, θ) b ¯ θ) R( ¯ R(a) = sup R(θ, a) θ θ Minimax rule b = inf R( e = inf sup R(θ, θ) e ¯ θ) sup R(θ, θ) θ θe θe θ b =c θb = Bayes rule ∧ ∃c : R(θ, θ) Least favorable prior θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ ¯n βb0 = Y¯n − βb1 X Pn Pn ¯ ¯ ¯ i=1 Xi Yi − nXY i=1 (Xi − Xn )(Yi − Yn ) b Pn = P β1 = n 2 ¯ 2 2 i=1 (Xi − Xn ) i=1 Xi − nX h i β0 E βb | X n = β1 P h i σ 2 n−1 ni=1 Xi2 −X n V βb | X n = 2 1 −X n nsX r Pn 2 σ b i=1 Xi √ b βb0 ) = se( n sX n σ b √ b βb1 ) = se( sX n Pn Pn 2 1 where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−2 ˆi (unbiased estimate). i=1 Further properties: P P • Consistency: βb0 → β0 and βb1 → β1 20 18.3 • Asymptotic normality: βb0 − β0 D → N (0, 1) b βb0 ) se( and βb1 − β1 D → N (0, 1) b βb1 ) se( Multiple Regression Y = Xβ + where • Approximate 1 − α confidence intervals for β0 and β1 : b βb1 ) and βb1 ± zα/2 se( b βb0 ) βb0 ± zα/2 se( • Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where b βb1 ). 
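Since §16.2 and §16.2.1 describe the bootstrap purely as an algorithm, a short runnable sketch may help. It assumes NumPy and SciPy are available; the statistic, sample size, and number of replications B are illustrative, not recommendations from the cookbook.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_ci(x, stat, B=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap: resample from the ECDF F-hat_n, estimate the
    standard error, and form normal-based and percentile 1 - alpha intervals."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    t_hat = stat(x)                                      # T_n = g(X_1, ..., X_n)
    t_star = np.array([stat(rng.choice(x, size=x.size, replace=True))
                       for _ in range(B)])               # T*_{n,1}, ..., T*_{n,B}
    se_boot = t_star.std(ddof=1)
    z = norm.ppf(1 - alpha / 2)                          # z_{alpha/2}
    normal = (t_hat - z * se_boot, t_hat + z * se_boot)
    percentile = tuple(np.quantile(t_star, [alpha / 2, 1 - alpha / 2]))
    return t_hat, se_boot, normal, percentile

# Example: intervals for the median of a skewed sample
rng = np.random.default_rng(1)
print(bootstrap_ci(rng.exponential(size=100), np.median))
```

The pivotal interval of §16.2.1 uses the same replicates T*_{n,b}; only the final quantile arithmetic differs.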
W = βb1 /se( Xn1 Pn b Pn 2 2 ˆ rss i=1 (Yi − Y ) R = Pn = 1 − Pn i=1 i 2 = 1 − 2 tss i=1 (Yi − Y ) i=1 (Yi − Y ) Likelihood L2 = β1 β = ... Xnk n Y i=1 n Y i=1 n Y f (Xi , Yi ) = n Y fX (Xi ) × n Y ( fY |X (Yi | Xi ) ∝ σ 2 1 X exp − 2 Yi − (β0 − β1 Xi ) 2σ i βb = (X T X)−1 X T Y h i V βb | X n = σ 2 (X T X)−1 ) Under the assumption of Normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE βb ≈ N β, σ 2 (X T X)−1 1X 2 ˆ n i=1 i rb(x) = k X βbj xj j=1 Unbiased estimate for σ 2 Prediction Observe X = x∗ of the covariate and want to predict their outcome Y∗ . Yb∗ = βb0 + βb1 x∗ h i h i h i h i V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 Prediction interval Estimate regression function n σ b2 = N X (Yi − xTi β)2 If the (k × k) matrix X T X is invertible, fX (Xi ) −n n i=1 fY |X (Yi | Xi ) = L1 × L2 i=1 i=1 1 .. =. βk rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = i=1 18.2 1 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss 2σ 2 L1 = X1k .. . Likelihood R2 L= ··· .. . ··· X11 .. X= . Pn 2 2 2 i=1 (Xi − X∗ ) b P ξn = σ b ¯ 2j + 1 n i (Xi − X) Yb∗ ± zα/2 ξbn σ b2 = n 1 X 2 ˆ n − k i=1 i ˆ = X βb − Y mle ¯ µ b=X σ b2 = n−k 2 σ n 1 − α Confidence interval b βbj ) βbj ± zα/2 se( 21 18.4 Model Selection Akaike Information Criterion (AIC) Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n. Issues • Underfitting: too few covariates yields high bias • Overfitting: too many covariates yields high variance AIC(S) = `n (βbS , σ bS2 ) − k Bayesian Information Criterion (BIC) k BIC(S) = `n (βbS , σ bS2 ) − log n 2 Procedure 1. Assign a score to each model 2. Search through all models to find the one with the highest score Validation and training bV (S) = R Hypothesis testing m X (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often i=1 n n or 4 2 H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J Leave-one-out cross-validation Mean squared prediction error (mspe) h i mspe = E (Yb (S) − Y ∗ )2 bCV (S) = R n X (Yi − Yb(i) )2 = i=1 Prediction risk R(S) = n X mspei = i=1 n X h i E (Ybi (S) − Yi∗ )2 n X i=1 Yi − Ybi (S) 1 − Uii (S) !2 U (S) = XS (XST XS )−1 XS (“hat matrix”) i=1 Training error btr (S) = R n X (Ybi (S) − Yi )2 19 Non-parametric Function Estimation i=1 R 19.1 2 btr (S) rss(S) R R2 (S) = 1 − =1− =1− tss tss Pn b 2 i=1 (Yi (S) − Y ) P n 2 i=1 (Yi − Y ) The training error is a downward-biased estimate of the prediction risk. h i btr (S) < R(S) E R n h i h i X b b bias(Rtr (S)) = E Rtr (S) − R(S) = −2 Cov Ybi , Yi i=1 Adjusted R2 R2 (S) = 1 − Density Estimation Estimate f (x), where f (x) = P [X ∈ A] = Integrated square error (ise) L(f, fbn ) = Z R A f (x) dx. 
Z 2 f (x) − fbn (x) dx = J(h) + f 2 (x) dx Frequentist risk Z h i Z R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx n − 1 rss n − k tss Mallow’s Cp statistic b btr (S) + 2kb R(S) =R σ 2 = lack of fit + complexity penalty h i b(x) = E fbn (x) − f (x) h i v(x) = V fbn (x) 22 19.1.1 Histograms KDE n x − Xi 1X1 K fbn (x) = n i=1 h h Z Z 1 1 4 00 2 b R(f, fn ) ≈ (hσK ) K 2 (x) dx (f (x)) dx + 4 nh Z Z −2/5 −1/5 −1/5 c c2 c3 2 2 h∗ = 1 c = σ , c = K (x) dx, c = (f 00 (x))2 dx 1 2 3 K n1/5 Z 4/5 Z 1/5 c4 5 2 2/5 ∗ 2 00 2 b R (f, fn ) = 4/5 K (x) dx (f ) dx c4 = (σK ) 4 n | {z } Definitions • • • • Number of bins m 1 Binwidth h = m Bin Bj has νj observations R Define pbj = νj /n and pj = Bj f (u) du Histogram estimator fbn (x) = m X pbj j=1 h C(K) I(x ∈ Bj ) Epanechnikov Kernel pj E fbn (x) = h h i p (1 − p ) j j V fbn (x) = nh2 Z h2 1 2 R(fbn , f ) ≈ (f 0 (u)) du + 12 nh !1/3 1 6 ∗ h = 1/3 R 2 du n (f 0 (u)) 2/3 Z 1/3 C 3 2 ∗ b 0 R (fn , f ) ≈ 2/3 (f (u)) du C= 4 n h i ( K(x) = √ 3 √ 4 5(1−x2 /5) |x| < 0 otherwise 5 Cross-validation estimate of E [J(h)] Z JbCV (h) = n n n 2Xb 2 1 X X ∗ Xi − Xj fbn2 (x) dx − K + K(0) f(−i) (Xi ) ≈ 2 n i=1 hn i=1 j=1 h nh ∗ K (x) = K (2) (x) − 2K(x) K (2) Z (x) = K(x − y)K(y) dy Cross-validation estimate of E [J(h)] Z JbCV (h) = 19.1.2 n m 2Xb 2 n+1 X 2 fbn2 (x) dx − f(−i) (Xi ) = − pb n i=1 (n − 1)h (n − 1)h j=1 j 19.2 Non-parametric Regression Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points (x1 , Y1 ), . . . , (xn , Yn ) related by Kernel Density Estimator (KDE) Yi = r(xi ) + i E [i ] = 0 Kernel K • • • • K(x) ≥ 0 R K(x) dx = 1 R xK(x) dx = 0 R 2 2 x K(x) dx ≡ σK >0 V [i ] = σ 2 k-nearest Neighbor Estimator rb(x) = 1 k X i:xi ∈Nk (x) Yi where Nk (x) = {k values of x1 , . . . , xn closest to x} 23 20 Nadaraya-Watson Kernel Estimator rb(x) = n X Stochastic Processes Stochastic Process wi (x)Yi i=1 wi (x) = x−xi h Pn x−xj K j=1 h K ( {0, ±1, . . . } = Z discrete T = [0, ∞) continuous {Xt : t ∈ T } ∈ [0, 1] Z 4 Z 2 h4 f 0 (x) 2 2 00 0 R(b rn , r) ≈ x K (x) dx r (x) + 2r (x) dx 4 f (x) R Z 2 σ K 2 (x) dx + dx nhf (x) c1 h∗ ≈ 1/5 n c2 ∗ R (b rn , r) ≈ 4/5 n • Notations Xt , X(t) • State space X • Index set T 20.1 Markov Chains Markov chain P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X Cross-validation estimate of E [J(h)] JbCV (h) = n X (Yi − rb(−i) (xi ))2 = i=1 19.3 n X i=1 (Yi − rb(xi ))2 1− K(0) x−x Pn j j=1 K h Smoothing Using Orthogonal Functions Approximation r(x) = ∞ X βj φj (x) ≈ j=1 J X βj φj (x) Multivariate regression where ηi = i ··· .. . φ0 (xn ) · · · !2 pij ≡ P [Xn+1 = j | Xn = i] pij (n) ≡ P [Xm+n = j | Xm = i] Transition matrix P (n-step: Pn ) Chapman-Kolmogorov φJ (x1 ) .. . pij (m + n) = φJ (xn ) βb = (ΦT Φ)−1 ΦT Y 1 ≈ ΦT Y (for equally spaced observations only) n Cross-validation estimate of E [J(h)] 2 n J X X bCV (J) = Yi − R φj (xi )βbj,(−i) j=1 X pij (m)pkj (n) k Pm+n = Pm Pn Least squares estimator i=1 n-step • (i, j) element is pij • pij > 0 P • i pij = 1 i=1 Y = Φβ + η φ0 (x1 ) .. and Φ = . Transition probabilities Pn = P × · · · × P = Pn Marginal probability µn = (µn (1), . . . 
, µn (N )) where µi (i) = P [Xn = i] µ0 , initial distribution µn = µ0 Pn 24 20.2 Poisson Processes Autocorrelation function (ACF) Poisson process • {Xt : t ∈ [0, ∞)} = number of events up to and including time t • X0 = 0 • Independent increments: γ(s, t) Cov [xs , xt ] =p ρ(s, t) = p V [xs ] V [xt ] γ(s, s)γ(t, t) Cross-covariance function (CCV) ∀t0 < · · · < tn : Xt1 − Xt0 ⊥ ⊥ · · · ⊥⊥ Xtn − Xtn−1 γxy (s, t) = E [(xs − µxs )(yt − µyt )] • Intensity function λ(t) Cross-correlation function (CCF) – P [Xt+h − Xt = 1] = λ(t)h + o(h) – P [Xt+h − Xt = 2] = o(h) • Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) = Rt 0 ρxy (s, t) = p λ(s) ds γxy (s, t) γx (s, s)γy (t, t) Homogeneous Poisson process λ(t) ≡ λ =⇒ Xt ∼ Po (λt) Backshift operator λ>0 B k (xt ) = xt−k Waiting times Wt := time at which Xt occurs 1 Wt ∼ Gamma t, λ Difference operator ∇d = (1 − B)d Interarrival times White noise St = Wt+1 − Wt 1 St ∼ Exp λ 2 • wt ∼ wn(0, σw ) St Wt−1 21 Wt iid 2 Gaussian: wt ∼ N 0, σw E [wt ] = 0 t ∈ T V [wt ] = σ 2 t ∈ T γw (s, t) = 0 s 6= t ∧ s, t ∈ T Random walk Time Series Mean function t • • • • Z ∞ µxt = E [xt ] = xft (x) dx −∞ • Drift δ Pt • xt = δt + j=1 wj • E [xt ] = δt Autocovariance function Symmetric moving average γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt γx (t, t) = E (xt − µt )2 = V [xt ] mt = k X j=−k aj xt−j where aj = a−j ≥ 0 and k X j=−k aj = 1 25 21.1 Stationary Time Series 21.2 Estimation of Correlation Sample mean Strictly stationary n x ¯= P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ] 1X xt n t=1 Sample variance n |h| 1 X 1− γx (h) V [¯ x] = n n ∀k ∈ N, tk , ck , h ∈ Z h=−n Weakly stationary Sample autocovariance function ∀t ∈ Z • E x2t < ∞ 2 • E xt = m ∀t ∈ Z • γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z Sample autocorrelation function Autocovariance function • • • • • n−h 1 X (xt+h − x ¯)(xt − x ¯) n t=1 γ b(h) = γ(h) = E [(xt+h − µ)(xt − µ)] γ(0) = E (xt − µ)2 γ(0) ≥ 0 γ(0) ≥ |γ(h)| γ(h) = γ(−h) ∀h ∈ Z ρb(h) = γ b(h) γ b(0) Sample cross-variance function γ bxy (h) = n−h 1 X (xt+h − x ¯)(yt − y) n t=1 Autocorrelation function (ACF) Sample cross-correlation function Cov [xt+h , xt ] γ(t + h, t) γ(h) =p = ρx (h) = p γ(0) V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) Jointly stationary time series γ bxy (h) ρbxy (h) = p γ bx (0)b γy (0) Properties γxy (h) = E [(xt+h − µx )(yt − µy )] ρxy (h) = p γxy (h) γx (0)γy (h) 1 • σρbx (h) = √ if xt is white noise n 1 • σρbxy (h) = √ if xt or yt is white noise n 21.3 Linear process Non-Stationary Time Series Classical decomposition model xt = µ + ∞ X ψj wt−j where j=−∞ 2 γ(h) = σw ∞ X |ψj | < ∞ xt = µt + st + wt j=−∞ ∞ X j=−∞ ψj+h ψj • µt = trend • st = seasonal component • wt = random noise term 26 21.3.1 Detrending Moving average polynomial θ(z) = 1 + θ1 z + · · · + θq zq Least squares Moving average operator 1. Choose trend model, e.g., µt = β0 + β1 t + β2 t2 2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2 3. 
Residuals , noise wt θ(B) = 1 + θ1 B + · · · + θp B p MA (q) (moving average model order q) Moving average xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt • The low-pass filter vt is a symmetric moving average mt with aj = vt = 1 2k + 1 k X 1 2k+1 : E [xt ] = xt−1 ( γ(h) = Cov [xt+h , xt ] = i=−k ARIMA models Autoregressive polynomial z ∈ C ∧ φp 6= 0 2 σw 0 Pq−h j=0 θj θj+h 0≤h≤q h>q xt = wt + θwt−1 2 2 (1 + θ )σw h = 0 2 γ(h) = θσw h=1 0 h>1 ( θ h=1 2 ρ(h) = (1+θ ) 0 h>1 • µt = β0 + β1 t =⇒ ∇xt = β1 ARMA (p, q) xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q Autoregressive operator φ(B)xt = θ(B)wt φ(B) = 1 − φ1 B − · · · − φp B p Partial autocorrelation function (PACF) Autoregressive model order p, AR (p) xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 } • φhh = corr(xh − xh−1 , x0 − xh−1 ) h≥2 0 h • E.g., φ11 = corr(x1 , x0 ) = ρ(1) ARIMA (p, d, q) AR (1) ∇d xt = (1 − B)d xt is ARMA (p, q) • xt = φk (xt−k ) + k−1 X φj (wt−j ) k→∞,|φ|<1 j=0 = ∞ X φj (wt−j ) j=0 | {z φ(B)(1 − B)d xt = θ(B)wt Exponentially Weighted Moving Average (EWMA) xt = xt−1 + wt − λwt−1 } linear process • E [xt ] = P∞ j=0 j φ (E [wt−j ]) = 0 • γ(h) = Cov [xt+h , xt ] = • ρ(h) = θj E [wt−j ] = 0 MA (1) Differencing φ(z) = 1 − φ1 z − · · · − φp zp q X j=0 Pk 1 • If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes without distortion 21.4 z ∈ C ∧ θq 6= 0 γ(h) γ(0) xt = 2 h σw φ 1−φ2 (1 − λ)λj−1 xt−j + wt when |λ| < 1 j=1 = φh • ρ(h) = φρ(h − 1) h = 1, 2, . . . ∞ X x ˜n+1 = (1 − λ)xn + λ˜ xn Seasonal ARIMA 27 • Denoted by ARIMA (p, d, q) × (P, D, Q)s d s • ΦP (B s )φ(B)∇D s ∇ xt = δ + ΘQ (B )θ(B)wt Periodic mixture xt = 21.4.1 Causality and Invertibility ∞ X P∞ j=0 ψj < ∞ such that wt−j = ψ(B)wt j=0 ARMA (p, q) is invertible ⇐⇒ ∃{πj } : π(B)xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t)) k=1 ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : xt = q X • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2 Pq • γ(h) = k=1 σk2 cos(2πωk h) Pq • γ(0) = E x2t = k=1 σk2 Spectral representation of a periodic process P∞ j=0 ∞ X πj < ∞ such that γ(h) = σ 2 cos(2πω0 h) σ 2 −2πiω0 h σ 2 2πiω0 h e + e 2 2 Z 1/2 = e2πiωh dF (ω) = Xt−j = wt j=0 Properties −1/2 • ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle ∞ X θ(z) ψ(z) = ψj z = φ(z) j=0 j Spectral distribution function 0 F (ω) = σ 2 /2 2 σ |z| ≤ 1 ω < −ω0 −ω ≤ ω < ω0 ω ≥ ω0 • ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle ∞ X φ(z) πj z j = π(z) = θ(z) j=0 • F (−∞) = F (−1/2) = 0 • F (∞) = F (1/2) = γ(0) |z| ≤ 1 Spectral density Behavior of the ACF and PACF for causal and invertible ARMA models ACF PACF 21.5 AR (p) tails off cuts off after lag p MA (q) cuts off after lag q tails off q ARMA (p, q) tails off tails off Spectral Analysis Periodic process xt = A cos(2πωt + φ) = U1 cos(2πωt) + U2 sin(2πωt) • • • • Frequency index ω (cycles per unit time), period 1/ω Amplitude A Phase φ U1 = A cos φ and U2 = A sin φ often normally distributed rv’s ∞ X γ(h)e−2πiωh − |γ(h)| < ∞ =⇒ γ(h) = R 1/2 f (ω) = h=−∞ • Needs P∞ h=−∞ 1 1 ≤ω≤ 2 2 −1/2 e2πiωh f (ω) dω h = 0, ±1, . . . 
• f (ω) ≥ 0 • f (ω) = f (−ω) • f (ω) = f (1 − ω) R 1/2 • γ(0) = V [xt ] = −1/2 f (ω) dω 2 • White noise: fw (ω) = σw • ARMA (p, q) , φ(B)xt = θ(B)wt : |θ(e−2πiω )|2 |φ(e−2πiω )|2 Pp Pq where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 2 fx (ω) = σw 28 • I0 (a, b) = 0 I1 (a, b) = 1 • Ix (a, b) = 1 − I1−x (b, a) Discrete Fourier Transform (DFT) d(ωj ) = n−1/2 n X xt e−2πiωj t 22.3 i=1 Fourier/Fundamental frequencies Finite ωj = j/n Inverse DFT xt = n−1/2 n−1 X • d(ωj )e2πiωj t • j=0 • 2 I(j/n) = |d(j/n)| Scaled Periodogram 4 I(j/n) n !2 n 2X = xt cos(2πtj/n + n t=1 P (j/n) = 22.1 n 2X xt sin(2πtj/n n t=1 !2 Math k= k=1 n X n(n + 1) 2 • (2k − 1) = n2 k=1 n X • k=1 n X ck = k=0 cn+1 − 1 c−1 k=0 n X k = 2n r+k r+n+1 = k n k=0 n X k n+1 • = m m+1 • k2 = n X n k=0 • Vandermonde’s Identity: r X m n m+n = k r−k r k=0 c 6= 1 • Binomial Theorem: n X n n−k k a b = (a + b)n k k=0 Infinite Gamma Function Z • ∞ ts−1 e−t dt 0 Z ∞ • Upper incomplete: Γ(s, x) = ts−1 e−t dt x Z x • Lower incomplete: γ(s, x) = ts−1 e−t dt • Ordinary: Γ(s) = 0 • Γ(α + 1) = αΓ(α) α>1 • Γ(n) = (n − 1)! n∈N √ • Γ(1/2) = π 22.2 Binomial n X n(n + 1)(2n + 1) 6 k=1 2 n X n(n + 1) • k3 = 2 Periodogram 22 Series ∞ X k=0 ∞ X pk = 1 , 1−p ∞ X p |p| < 1 1−p k=1 ! ∞ X d 1 1 k p = = dp 1 − p (1 − p)2 pk = d dp k=0 k=0 ∞ X r + k − 1 • xk = (1 − x)−r r ∈ N+ k k=0 ∞ X α k • p = (1 + p)α |p| < 1 , α ∈ C k • kpk−1 = |p| < 1 k=0 Beta Function 1 Γ(x)Γ(y) • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt = Γ(x + y) 0 Z x • Incomplete: B(x; a, b) = ta−1 (1 − t)b−1 dt Z 0 • Regularized incomplete: a+b−1 B(x; a, b) a,b∈N X (a + b − 1)! Ix (a, b) = = xj (1 − x)a+b−1−j B(a, b) j!(a + b − 1 − j)! j=a 29 22.4 Combinatorics Sampling k out of n w/o replacement nk = ordered k−1 Y (n − i) = i=0 w/ replacement n! (n − k)! n nk n! = = k k! k!(n − k)! unordered Stirling numbers, 2nd kind n n−1 n−1 =k + k k k−1 nk n−1+r n−1+r = r n−1 ( 1 n = 0 0 1≤k≤n n=0 else Partitions Pn+k,k = n X Pn,i k > n : Pn,k = 0 n ≥ 1 : Pn,0 = 0, P0,0 = 1 i=1 Balls and Urns |B| = n, |U | = m f :B→U f arbitrary mn B : D, U : D B : ¬D, U : D B : D, U : ¬D f injective ( mn m ≥ n 0 else m+n−1 n m X n k=1 B : ¬D, U : ¬D D = distinguishable, ¬D = indistinguishable. m X k=1 m n m! n m n−1 m−1 ( 1 0 m≥n else n m ( 1 0 m≥n else Pn,m k Pn,k f surjective f bijective ( n! m = n 0 else ( 1 m=n 0 else ( 1 m=n 0 else ( 1 m=n 0 else References [1] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008. [2] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001. [3] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002. 30 31 Univariate distribution relationships, courtesy Leemis and McQueston [1].
© Copyright 2024