Download Report

Stat 6341, Statistical Computing
Stat 6341 Syllabus
STAT 6341
Numerical Linear Algebra and Statistical Computing
Course Information
Instructor:
Dr. Larry P. Ammann
Office hours:
MW 2:30-3:30pm, others by appt.
Email:
[email protected]
Office:
FO 2.410C
Phone:
(972) 883-2161
Text:
Modern Applied Statistics with S, 4th Ed.
Authors:
W.N. Venables and B.D. Ripley
Additional resources: Matrix Computations
Authors:
G. Golub and C. van Loan
These notes are copyrighted by their author, Larry P. Ammann, and are intended for the
use of students currently registered for Stat 6341. They may not be copied or used for any
other purpose without permission of the author.
Tentative Schedule
Topics
Numerical linear algebra
Introduction to the S language and statistical programming
Simulation
QR decomposition and least squares regression
Data explorations
Statistical models
SVD and multivariate data
Chapters
class notes
VR 1-3
class notes
class notes
VR 4-5
VR 6-7
class notes, VR 11
This course will make use of the statistics programming language R. Pre-compiled binaries
for R are freely available for Windows, MacOS, and Linux at
http://cran.r-project.org
An online introduction to R is located at
http://cran.r-project.org/doc/manuals/R-intro.html
1
In addition to homework projects, a final group project will be assigned. Students will form
groups of 4-5 each and complete a major simulation project that will be presented to the
class at the end of the semester.
Student Learning Objectives
Understand numerical and computational issues associated with the major matrix decompositions: LU, QR, SVD. Understand how to express basic mathematical and statistical
problems in a high-level statistical programming language.
Note: the complete syllabus is available here:
http://www.utdallas.edu/~ammann/stat6341_syllabus.pdf
2
Simulation of random variables
One of R’s strengths is its extensive library of functions for simulation of random variables.
This includes the following distributions.
Discrete r.v.’s: binomial, geometric, hypergeometric, multinomial, negative binomial,
Poisson.
Continuaous r.v.’s: beta, Cauchy, chi-squared, exponential, F, gamma, log-normal, normal, t, uniform, Weibull.
For each distribution there are functions in R to generate the cdf, density or pmf, quantiles, and random samples. These functions follow the convention dname gives density or
pmf, pname gives cdf, qname gives quantiles, and rname gives random samples. For example,
n = 200
mu = 100
sig = 20
X = rnorm(n,mu,sig)
X.hist = hist(X,plot=FALSE)
x0 = seq(mu-4*sig,mu+4*sig,length=250)
d0 = dnorm(x0,mu,sig)
y0 = d0*n*unique(diff(X.hist$breaks))
y.lim = max(c(y0,X.hist$counts))
png("NormalDens.png",width=480,height=480)
hist(X,col="cyan",ylim=c(0,y.lim),xlim=mu+c(-4,4)*sig,main="")
title(paste("Histogram of N(",mu,",",sig,")",sep=""))
mtext("with Density Function",side=3,line=.25)
lines(x0,y0,col="red")
graphics.off()
3
Additional notes: simulation of the Gamma distribution
The Gamma distribution is a two-parameter family with alternative parameterizations. The
density function is
f (x) =
=
1
β α Γ(α)
xα−1 e−x/β , x > 0, α, β > 0
λα α−1 −λx
x e , x > 0, α, λ > 0
Γ(α)
4
where α is called the shape parameter, β is called the scale parameter, and λ = 1/β is called
the rate parameter. The expected value and standard deviation of the gamma distribution
are,
E(X) = αβ =
SD(X) =
√
α
λ√
αβ =
α
.
λ
It usually is more natural to specify the distribution in terms of its mean and s.d., µ, σ. The
relationship between α, β and µ, σ can be inverted to obtain the natural parameters of this
distribution in terms of its mean and s.d. That is,
α=
σ2
µ2
,
β
=
.
σ2
µ
The plot below shows why α is called the shape parameter.
5
This plot was generated by the following R code.
n = 100 # sample size
N = 400 # num x-values
mu = c(.75,1,5) # means
Gam.col = c("red","ForestGreen","blue")
sig = 1 # s.d.’s
alpha = (mu/sig)^2
beta = (sig^2)/mu
x = seq(0.05,8,length=N)
6
y1 = dgamma(x,alpha[1],scale=beta[1])
y2 = dgamma(x,alpha[2],scale=beta[2])
y3 = dgamma(x,alpha[3],scale=beta[3])
Y = cbind(y1,y2,y3)
png("GammaDens.png",width=480,height=480)
plot(x,y1,xlab="",ylab="Density",ylim=range(Y),type="n")
for(k in seq(mu)) {
lines(x,Y[,k],col=Gam.col[k],lwd=1.5)
}
legend(x[N],max(Y),legend=paste("Mean =",mu),
lty=1,col=Gam.col,xjust=1)
title("Gamma Densities")
mtext(paste("SD =",sig[1]),side=1,line=2)
graphics.off()
Assignments
Homework 1
1. Use the iris dataset that is included with R. This is a data frame with measurements
on 150 iris blossoms. The columns of this data frame include four measurements of iris
blossoms along with the species. See help(iris).
a. Plot histograms of petal length for each species separately and put all three histograms on the same page arranged vertically. Use the same limits on the x-axis
for each histogram and superimpose a vertical line corresponding to the mean for
the respective species. Also include informative titles. References: R functions
par (see argument mfrow ), hist, abline.
b. Repeat the previous item using each of the other measurements.
c. Obtain pairwise plots of the four physical measurements and use different plotting
symbols for the different species. References: R function pairs.
2. Show that if A is a square matrix and AT A is nonsingular, then A is nonsingular.
Hint: use the SVD.
3. Let S be an n × n matrix with S T = −S. Show that I-S is nonsingular and that
(I − S)−1 (I + S) is orthogonal. Hint: first show that (I − S)T (I − S) is nonsingular.
Then show and use the fact that
(I + S)(I − S) = (I − S)(I + S).
4. Let A be an n × p matrix, n ≥ p, with singular values, σ1 ≥ · · · ≥ σp , and let k · k
denote the 2-norm.
7
a. Show that
min (kAzk) = σp .
kzk=1
b. Show that
σp kxk ≤ kAxk ≤ σ1 kxk, ∀x ∈ <p .
5. Suppose that A is a 2 × 2 matrix with singular values σ1 = σ2 = σ > 0.
a. Show that if z ∈ <2 with kzk2 = 1, then kAzk2 = σ.
b. Let u,v be any pair of orthonormal vectors in <2 . Show that u,v are right singular
vectors of A.
6. Simulate Nrep = 2500 random samples of size n from the gamma distribution with
mean µ and s.d. = 1. Construct 95% confidence intervals for the mean with each
sample using large-sample intervals defined by
¯ ± t √s ,
X
n
where t is the 0.975 quantile of the t-distribution with n-1 d.f. Determine the proportion of confidence intervals that do not contain the mean for each combination of n =
c(50,200,400) and mu = seq(.5,5,by=.5). Display these proportions as functions
of µ in an informative graphic. Also show how these distributions compare to the
normal distribution (for which the large-sample intervals were developed).
Hints: see R functions, rgamma to generate the random samples, and qt to obtain the
t-value. To compare these gamma distributions to the normal distribution, generate
one random sample of size 200 from each gamma distribution and apply R function
qqnorm to each. Include an informative title and put all qqnorm plots on one page.
Note: in R the gamma distribution is parameterized by shape α and scale β. Its
density function is given by
f (x; α, β) =
1
xα−1 exp{−x/β}, x > 0.
Γ(α)β α
Its mean and variance are
µ = αβ, σ 2 = αβ 2 .
Hence
α=
µ2
σ2
,
β
=
.
σ2
µ
8
Scripts
Scripts and data used in this course are located listed here.
Original heights dataset
http://www.utdallas.edu/~ammann/stat6341scripts/heights.txt
Modified heights dataset
http://www.utdallas.edu/~ammann/stat6341scripts/heights1.txt
Simple script for heights data
http://www.utdallas.edu/~ammann/stat6341scripts/height0.r
Fancier script for heights data
http://www.utdallas.edu/~ammann/stat6341scripts/height1.r
Crabs example
http://www.utdallas.edu/~ammann/stat6341scripts/crabs.r
Script for simulation of confidence intervals
http://www.utdallas.edu/~ammann/stat6341scripts/ConfIntSim.r
Script for simulation of robust confidence intervals
http://www.utdallas.edu/~ammann/stat6341scripts/ConfIntSimRob.r
9