Stat 6341, Statistical Computing Stat 6341 Syllabus STAT 6341 Numerical Linear Algebra and Statistical Computing Course Information Instructor: Dr. Larry P. Ammann Office hours: MW 2:30-3:30pm, others by appt. Email: [email protected] Office: FO 2.410C Phone: (972) 883-2161 Text: Modern Applied Statistics with S, 4th Ed. Authors: W.N. Venables and B.D. Ripley Additional resources: Matrix Computations Authors: G. Golub and C. van Loan These notes are copyrighted by their author, Larry P. Ammann, and are intended for the use of students currently registered for Stat 6341. They may not be copied or used for any other purpose without permission of the author. Tentative Schedule Topics Numerical linear algebra Introduction to the S language and statistical programming Simulation QR decomposition and least squares regression Data explorations Statistical models SVD and multivariate data Chapters class notes VR 1-3 class notes class notes VR 4-5 VR 6-7 class notes, VR 11 This course will make use of the statistics programming language R. Pre-compiled binaries for R are freely available for Windows, MacOS, and Linux at http://cran.r-project.org An online introduction to R is located at http://cran.r-project.org/doc/manuals/R-intro.html 1 In addition to homework projects, a final group project will be assigned. Students will form groups of 4-5 each and complete a major simulation project that will be presented to the class at the end of the semester. Student Learning Objectives Understand numerical and computational issues associated with the major matrix decompositions: LU, QR, SVD. Understand how to express basic mathematical and statistical problems in a high-level statistical programming language. Note: the complete syllabus is available here: http://www.utdallas.edu/~ammann/stat6341_syllabus.pdf 2 Simulation of random variables One of R’s strengths is its extensive library of functions for simulation of random variables. This includes the following distributions. Discrete r.v.’s: binomial, geometric, hypergeometric, multinomial, negative binomial, Poisson. Continuaous r.v.’s: beta, Cauchy, chi-squared, exponential, F, gamma, log-normal, normal, t, uniform, Weibull. For each distribution there are functions in R to generate the cdf, density or pmf, quantiles, and random samples. These functions follow the convention dname gives density or pmf, pname gives cdf, qname gives quantiles, and rname gives random samples. For example, n = 200 mu = 100 sig = 20 X = rnorm(n,mu,sig) X.hist = hist(X,plot=FALSE) x0 = seq(mu-4*sig,mu+4*sig,length=250) d0 = dnorm(x0,mu,sig) y0 = d0*n*unique(diff(X.hist$breaks)) y.lim = max(c(y0,X.hist$counts)) png("NormalDens.png",width=480,height=480) hist(X,col="cyan",ylim=c(0,y.lim),xlim=mu+c(-4,4)*sig,main="") title(paste("Histogram of N(",mu,",",sig,")",sep="")) mtext("with Density Function",side=3,line=.25) lines(x0,y0,col="red") graphics.off() 3 Additional notes: simulation of the Gamma distribution The Gamma distribution is a two-parameter family with alternative parameterizations. The density function is f (x) = = 1 β α Γ(α) xα−1 e−x/β , x > 0, α, β > 0 λα α−1 −λx x e , x > 0, α, λ > 0 Γ(α) 4 where α is called the shape parameter, β is called the scale parameter, and λ = 1/β is called the rate parameter. The expected value and standard deviation of the gamma distribution are, E(X) = αβ = SD(X) = √ α λ√ αβ = α . λ It usually is more natural to specify the distribution in terms of its mean and s.d., µ, σ. The relationship between α, β and µ, σ can be inverted to obtain the natural parameters of this distribution in terms of its mean and s.d. That is, α= σ2 µ2 , β = . σ2 µ The plot below shows why α is called the shape parameter. 5 This plot was generated by the following R code. n = 100 # sample size N = 400 # num x-values mu = c(.75,1,5) # means Gam.col = c("red","ForestGreen","blue") sig = 1 # s.d.’s alpha = (mu/sig)^2 beta = (sig^2)/mu x = seq(0.05,8,length=N) 6 y1 = dgamma(x,alpha[1],scale=beta[1]) y2 = dgamma(x,alpha[2],scale=beta[2]) y3 = dgamma(x,alpha[3],scale=beta[3]) Y = cbind(y1,y2,y3) png("GammaDens.png",width=480,height=480) plot(x,y1,xlab="",ylab="Density",ylim=range(Y),type="n") for(k in seq(mu)) { lines(x,Y[,k],col=Gam.col[k],lwd=1.5) } legend(x[N],max(Y),legend=paste("Mean =",mu), lty=1,col=Gam.col,xjust=1) title("Gamma Densities") mtext(paste("SD =",sig[1]),side=1,line=2) graphics.off() Assignments Homework 1 1. Use the iris dataset that is included with R. This is a data frame with measurements on 150 iris blossoms. The columns of this data frame include four measurements of iris blossoms along with the species. See help(iris). a. Plot histograms of petal length for each species separately and put all three histograms on the same page arranged vertically. Use the same limits on the x-axis for each histogram and superimpose a vertical line corresponding to the mean for the respective species. Also include informative titles. References: R functions par (see argument mfrow ), hist, abline. b. Repeat the previous item using each of the other measurements. c. Obtain pairwise plots of the four physical measurements and use different plotting symbols for the different species. References: R function pairs. 2. Show that if A is a square matrix and AT A is nonsingular, then A is nonsingular. Hint: use the SVD. 3. Let S be an n × n matrix with S T = −S. Show that I-S is nonsingular and that (I − S)−1 (I + S) is orthogonal. Hint: first show that (I − S)T (I − S) is nonsingular. Then show and use the fact that (I + S)(I − S) = (I − S)(I + S). 4. Let A be an n × p matrix, n ≥ p, with singular values, σ1 ≥ · · · ≥ σp , and let k · k denote the 2-norm. 7 a. Show that min (kAzk) = σp . kzk=1 b. Show that σp kxk ≤ kAxk ≤ σ1 kxk, ∀x ∈ <p . 5. Suppose that A is a 2 × 2 matrix with singular values σ1 = σ2 = σ > 0. a. Show that if z ∈ <2 with kzk2 = 1, then kAzk2 = σ. b. Let u,v be any pair of orthonormal vectors in <2 . Show that u,v are right singular vectors of A. 6. Simulate Nrep = 2500 random samples of size n from the gamma distribution with mean µ and s.d. = 1. Construct 95% confidence intervals for the mean with each sample using large-sample intervals defined by ¯ ± t √s , X n where t is the 0.975 quantile of the t-distribution with n-1 d.f. Determine the proportion of confidence intervals that do not contain the mean for each combination of n = c(50,200,400) and mu = seq(.5,5,by=.5). Display these proportions as functions of µ in an informative graphic. Also show how these distributions compare to the normal distribution (for which the large-sample intervals were developed). Hints: see R functions, rgamma to generate the random samples, and qt to obtain the t-value. To compare these gamma distributions to the normal distribution, generate one random sample of size 200 from each gamma distribution and apply R function qqnorm to each. Include an informative title and put all qqnorm plots on one page. Note: in R the gamma distribution is parameterized by shape α and scale β. Its density function is given by f (x; α, β) = 1 xα−1 exp{−x/β}, x > 0. Γ(α)β α Its mean and variance are µ = αβ, σ 2 = αβ 2 . Hence α= µ2 σ2 , β = . σ2 µ 8 Scripts Scripts and data used in this course are located listed here. Original heights dataset http://www.utdallas.edu/~ammann/stat6341scripts/heights.txt Modified heights dataset http://www.utdallas.edu/~ammann/stat6341scripts/heights1.txt Simple script for heights data http://www.utdallas.edu/~ammann/stat6341scripts/height0.r Fancier script for heights data http://www.utdallas.edu/~ammann/stat6341scripts/height1.r Crabs example http://www.utdallas.edu/~ammann/stat6341scripts/crabs.r Script for simulation of confidence intervals http://www.utdallas.edu/~ammann/stat6341scripts/ConfIntSim.r Script for simulation of robust confidence intervals http://www.utdallas.edu/~ammann/stat6341scripts/ConfIntSimRob.r 9
© Copyright 2024