Download Report

MAT 4375
Multivariate statistical methods
Summary of R commands (Chapter 4)
Univariate Normal Distribution
1. To generate a random sample of size n from a univariate normal distribution N (µ, σ 2 ) with mean µ and standard deviation σ (say n = 100,
µ = 0 and σ = 1):
rnorm(100,mean=0,sd=1)
or simply
rnorm(100,0,1)
2. To compute the value of the N (µ, σ 2 ) density function:
(x − µ)2
1
exp −
f (x) = √
2σ 2
2πσ 2
at point x: (say µ = 0, σ = 1 and x = 1.5)
dnorm(1.5,mean=0,sd=1)
or simply
dnorm(1.5,0,1)
3. To plot the graph of the density function of N (µ, σ 2 ) between the points
a and b (say we use again µ = 0, σ = 1 and the points a = −3 and
b = 3):
curve(dnorm(x,mean=0,sd=1),-3,3)
or simply
curve(dnorm(x,0,1),-3,3)
1
If we omit writing the values a and b, by default the graph will be
between a = 0 and b = 1.
4. To compute a normal probability, i.e to find the value P (X ≤ x) for
given x, where X ∼ N (µ, σ 2 ): (say µ = 0, σ = 1 and x = 1.96)
pnorm(1.96,mean=0,sd=1)
or simply
pnorm(1.96,0,1)
5. To compute a normal quantile, i.e. to find the value x such that
P (X ≤ x) = p for a given value p ∈ (0, 1), where X ∼ N (µ, σ 2 ):
(say µ = 0, σ = 1 and p = 0.975):
qnorm(0.975,mean=0,sd=1)
or simply
qnorm(0.975,0,1)
6. To do a normal QQ plot of univariate data (saved in a variable x),
there are two methods:
Method 1:
qqnorm(x)
abline(mean(x),sd(x))
# Do the QQ plot
# Plot the line of best fit
Method 2: (assume that n is the sample size)
p=ppoints(n)
# Compute the probabilities p_j=(j-3/8)/(n+1/4)
z=qnorm(p,0,1)
# Compute the normal quantiles corresponding to p_j
y=sort(x)
# Arrange the data in increasing order
plot(z,y)
# Plot the pairs (z_j,y_j) for j=1,...,n
abline(mean(x),sd(x)) #Plot the line of best fit
2
Multivariate Normal distribution
1. To generate a random sample of size n from a multivariate normal
distribution Np (µ, Σ) with mean vector µ and covariance matrix Σ.
Say p = 2, n = 100, µ = (0, 0)0 and
1 0.5
Σ=
0.5 1
We need to install the package MASS:
install.packages("MASS")
library(MASS)
mvrnorm(100,mu=c(0,0),Sigma=matrix(c(1,0.5,0.5,1),2))
In the last line, we could simply type:
mvrnorm(100,c(0,0),matrix(c(1,0.5,0.5,1),2))
2. To plot the Np (µ, Σ) density function:
f (x) =
1
(2π)p/2 |Σ|1/2
1
exp{− (x − µ)0 Σ−1 (x − µ)}
2
In the example below, we take p = 2, µ = (0, 0)0 and Σ as above.
#Specifies the parameters of the bivariate normal
mu1 = 0 # expected value of x
mu2 = 0.5 # expected value of y
sigma1 = 0.5 # variance of x
sigma2 = 2 # variance of y
rho = 0.5 # corr(x, y)
#Gives the range for x-axis and y-axis
xm =-3
xp = 3
ym = -3
yp = 3
3
#Specifies a sequence of values (x,y) for which we will do the plot
x = seq(xm, xp, length= as.integer((xp + abs(xm)) * 10))
y = seq(ym, yp, length= as.integer((yp + abs(ym)) * 10))
#Defines the density function
bivariate = function(x,y){
term1 = 1 / (2 * pi * sigma1 * sigma2 * sqrt(1 - rho^2))
term2 = (x - mu1)^2 / sigma1^2
term3 = -(2 * rho * (x - mu1)*(y - mu2))/(sigma1 * sigma2)
term4 = (y - mu2)^2 / sigma2^2
z = term2 + term3 + term4
term5 = term1 * exp((-z / (2 *(1 - rho^2))))
return (term5)
}
# Computes the density values
z =outer(x,y,bivariate)
# Plot
persp(x, y, z)
3. To compute a quantile corresponding to the χ2k distribution, i.e. to find
a value x such that P (X ≤ x) = p for a given value p ∈ (0, 1), where
X ∼ χ2k : (here k = 7 and p = 0.95)
qchisq(0.95,df=7)
or simply
qchisq(0.95,7)
4. Consider a variable x (which contains observations x1 , . . . , xn ) and a a
fixed value. To compute how many observations xj are smaller than a,
we first create a boolean variable b which takes values TRUE (i.e. 1)
if xj ≤ a and FALSE (i.e. 0) if xj > a, and then sum all the values of
b: (here we take a = 1.5)
b=x<1.5
sum(b)
# Create the boolean variable
# Count how many observations are smaller than 1.5
4
5. Suppose we have data consisting of n p-dimensional vectors x1 , . . . , xn
with sample mean x and sample covariance matrix S. We will assume
here that data is saved in a data frame called “table” whose columns
are called V1 , . . . , Vp . For simplicity we work only with the first two
variables in the table, so that p = 2.
a) To compute the squared Mahalanobis distances
d2j = (xj − x)S−1 (xj − x),
x1=table$V1
#
x2=table$V2
#
x=cbind(x1,x2)
#
m=colMeans(x)
#
S=cov(x)
#
d=mahalanobis(x,m,S)
print(d)
j = 1, . . . , n
Extract variable x1 from table
Extract variable x2 from table
Create a new variable x from x1 and x2
Compute the column means of x
Compute the covariance matrix of x
# Compute the squared Mahalanobis distances
# Print the values of d
b) To compute the number and the proportion of observations which
are smaller than χ22 (0.50) we use the following commands:
chisq.quantile=qchisq(0.50,df=2)
#Calculate the chi-square quantile
print(chisq.quantile)
number.obs=sum(d<chisq.quantile) #Calculate the number of observations
print(number.obs)
proportion=number.obs/n # Calculate the proportion of observations
print(proportion)
b) To see if the data in columns V1 and V2 has a bivariate normal
distribution, we use the following commands:
p=ppoints(n)
# Compute the probabilities p_j=(j-3/8)/(n+1/4)
q=qchisq(p,df=2) # Compute the chi-square quantiles corresponding to p_j
y=sort(d)
# Arrange d in increasing order
plot(q,d)
# Plot the pairs (q_j,d_j) for j=1,...,n
abline(0,1)
# Plot the line of best fit
5