MSU_pigs.R Juan Pedro Steibel Authors:

Manual code: MSU_pigs.R
Authors: Jose Luis Gualdrón Duarte1 and Juan Pedro Steibel2,3
1
Departamento de Producción Animal, Facultad de Agronomía, UBA-CONICET, Buenos Aires, ARG
Department of Animal Science, Michigan State University, East Lansing, Michigan, USA
3
Department of Fisheries and Wildlife, Michigan State University, East Lansing, Michigan, USA
2
Introduction
The R code MSU_pigs.R is based on the paper Gualdrón et al. 2014. This code uses a gpData object called
pigMSU as input file, which contains different files: genotype file, map, pedigree, and phenotype file. For
further details of construction and structure of gpData objects please refer to synbreed package in R
(Wimmer et al. [1]).This package was used for assembling of gpData pigMSU.
Description of Input Files
Each slot in pigMSU contains the following information:
Genotype file (pigMSU$geno)
In this file, animals IDs are row names and SNP names are column names. The genotypes are expressed as
allelic dosage, having elements equal to 0,1, 2, i.e. the count of the allele used as reference, or a decimal
number in the interval [0, 2] for imputed genotypes.:
MARC0076778 MARC0015795 ASGA0086709 ALGA0112852 MARC0058092 M1GA0025062
0
0
1
1
0.00
1
0
0
1
1
0.89
1
0
0
1
1
0.00
1
0
0
1
1
0.00
1
1
1
0
2
1.00
2
1001
1002
1003
1004
1006
Map file (pigMSU$map)
This file contains SNP names as row names and also, chromosome and physical position expressed in Megabases as columns.
MARC0044150
ASGA0000014
ASGA0000021
chr
1
1
1
pos
0.286933
0.342481
0.489855
Pedigree file (pigMSU$pedigree)
The pedigree file contains five columns: animal ID - Sire ID - Dam ID - generation - sex (1=male,
2=female).
ID
6323
6458
160
330
Par1 Par2 gener sex
0
0
0
1
0
0
0
0
6323 6458
1
0
6323 6458
1
1
Phenotype file (pigMSU$pheno)
The trait file contains: animal ID in row names and the trait in column name:
1001
1002
1003
1004
1006
1007
1011
bf10_13wk
8.38
11.94
10.16
7.87
5.59
8.38
8.64
Running MSU_pigs.R:
Load required packages, functions and input files
First of all, code MSU_pigs.R allows to load “Regress” package (version 1.3-10, R packages [2]) which is
used in model fitting and variance components estimation process, performed by using REML algorithm.;
and also, to load “Synbreed” package for data visualization and analysis of gpData "pigMSU".
library(synbreed)
library(regress)
Then, input gpData (pigMSU) and functions (Functions_codes.R)for MSU_pigs.R are uploaded:
load("pigMSU")
source("Functions_codes.R") # Load functions
Filter process by Minor Allele Frequency (MAF)
Genotypes in the F2 haves a second editing process considering MAF < 0.01, as well as fixed SNP (MAF=
0.5). As a result, a filtered matrix "M" that contains information for 44055 SNP is created.
##############################################################################
## Filter by Minor allele frequency (MAF) in F2 genotypes
#
## See Section: "Methods-Genotyping and data editing" in the reference paper #
##############################################################################
## Genotype Matrix #
ge<-pigMSU$geno
IDrowM<-rownames(ge)
IDrow2M<-as.numeric(IDrowM)
genf2<-(IDrow2M>1000)&(IDrow2M<6000)
Mf2<-ge[genf2,]
# Genotype matrix for F2 animals
m1<-colSums(Mf2)
m2<-m1/(2*nrow(Mf2))
# Vector with frequency of second allele in each marker
## Filter by Minor allele frequency (MAF) and SNP fixed in the F2 genotypes
MAF<-0.01
#Note: the addition of the filter "m2<0.99" is because some SNPs are codificated as the major
allele frequency
f1<-(m2>MAF)&(m2!=0.5) &(m2<0.99)
SNPf<-as.matrix(m2[f1])
## New genotype "M" matrix (SNP 44055) with genotypes filtered by MAF###
M1<-colnames(Mf2)%in%rownames(SNPf)
M<-ge[,M1]
# Total SNP 44055 for F0-F1-F2 generations
Construction of Z and G matrices
Following the approach of VanRaden [3], Z matrix for F2 animals (Znf2) is calculated applying the function
"zstandard". Then, matrix G (GTo) for the same animalsis calculated using the Z matrix obtained previously
(Znf2).
################################
##1) Computation of G Matrix ##
################################
IDrow<-rownames(M)
IDrow2<-as.numeric(IDrow)
## Frequency of P="2" in the F0
genof0<-(IDrow2>6000) # logical vector with TRUE in F0 ids
gen0<-M[genof0,] # Genotype for F0 (1st column ID - Genotype)
all_frq_f0<-colMeans(gen0,na.rm=T)/2
## Computation of Z matrix for F2
idgf2<-as.numeric(rownames(M))
idxg2<-(idgf2>1000)&(idgf2<6000)
# index for F2 animals ID
genf2<-M[idxg2,]
# genotype for F2 animals
Znf2<-zstandard(genf2,alfreq=all_frq_f0,procedure="heterogeneous")
## Computation of G matrix: G=ZZ'
GTo<-Znf2%*%t(Znf2)
Fixed effects matrix construction
From object "pigMSU" are extracted: 1) the growth trait 13 week tenth rib backfat (mm) (bf10_13wk) and 2)
sex for animals in the F2, to be used as response variable (y) and incidence matrix fixed effects (X) for the
model respectively.
#######################################
## 2) Input files for funtion "snpe" ##
#######################################
##Phenotype
pheno<-pigMSU$pheno[,"bf10_13wk",1]
indx<-match(rownames(Znf2),names(pheno))
pheno<-pheno[indx]
##sex
indx<-match(rownames(Znf2),pigMSU$pedigree$ID)
sex<-pigMSU$pedigree$sex[indx]
#model.matrix to create matrix X
sex<-as.factor(sex)
x<-model.matrix( ~ sex -1,
contrasts.arg=list(sex=contrasts(sex, contrasts=F)))
#sex ID
psex<-cbind(pigMSU$pedigree$ID,pigMSU$pedigree$sex) #Extract ID and sex for all animals (F0-F1
and F2)
idxsex<-psex[,1]%in%names(pheno)
sexid<-as.numeric(pigMSU$pedigree$ID[idxsex])
rownames(x)<-sexid
colnames(x)<-c("female","male")
Estimation of variance components and breeding values
Once incidence matrix X is constructed, "snpe" function allows to fit the model y = X  + a + e . This
function takes as input the phenotype file (y), X matrix, and also Z and G matrices. Internally, variance
components are estimated by REML using regress package (version 1.3-10 R [2]).
###########################################################################
## 3) Apply function "snpe"
##
#
return:
##
#
g_hat=SNPe,a_hat,E_variance,A_variance,Heredability,#iteratons,Ginv ##
###########################################################################
trout_SNPe<-snpe(pheno,x,Znf2,GTo)
As a result, the function "snpe" gives:
llik:
Estimate of the LogLikelihood for the model
a_hat:
Genome breeding values (GEBVs).
2
E_variance:
Error variance e )
Additive variance A )
Heritability: Heritability of the trait
n_iter:
Number of iterations to converge.
Ginv:
The inverse of G matrix
A_variance:
2
2
Here, a_hat contains the random breeding values, such that a  N (0, G A ) , and e is the random error
2
vector such that e  N (0, I e ) , and I is the identity matrix.
Estimation of SNP effects
The marker effect, variance of the markers effects, and p-values are calculated for each marker or SNP. For
this purpose, the function "GWA" is applied using elements obtained in function "snpe".
###########################################################################
## 4) Apply function "GWA"
##
#
return:
##
#
beta=uh ("g_hat"), snp_variance=vsnp, pvalues=pvalue
##
###########################################################################
gwa_trait<-GWA(trout_SNPe,x,Znf2)
As a result, the function "GWA" gives:
beta:
Estimate of each marker (SNP) effect
snp_variance: Variance of each marker (SNP) effect
pvalues:
p-value for each marker (SNP)
Histogram of p-values and Manhattan plot
Here, the histogram of p-values and the Manhattan plot are displayed. However, for the Manhattan plot is
necessary the absolute marker (SNP) position or “consecutive position”. Then, the function "abmap" is
applied.
##############################
## 5) Map Absolute Position ##
##############################
map<-as.matrix(pigMSU$map)
map1<-rownames(map)%in%colnames(Znf2)
map2<-map[map1,]
#### Final map #####
mapmsu1<-abmap(map2)
idxc<-2-map2[,1]%%2
# Read map
# Final SNP 44055
# Apply funtion ab
# Color index
# histogram of p-values
hist(gwa_trait$pvalues)
# Manhattan Plot
threshold<-0.05/nrow(map2)
#pdf(file="Manhattan_trait.pdf") #option to save the Manhattan plot in format ".pdf" in the
current directory
plot(mapmsu1,-log(gwa_trait$pvalues,10),pch=16,col=ifelse(idxc==2,"red","blue"),abline(h=log(threshold,10),lwd=1.1,col="red"),xlab="Absolute position Mb", ylab="-Log10(p-value)")
#dev.off() #option to save the Manhattan plot in format ".pdf" in the current directory
Definition of candidate segments
Candidate segments are defined by taking SNPs within one Mb upstream and one Mb downstream of the
SNP with smallest p-value in each chromosome (see list "maxlogpv").
##########################
## 4) Candidate Segment ##
##########################
pvalue<-gwa_trait$pvalues
logpv<- -log(pvalue,10)
assoc<-as.data.frame(cbind(map2,mapmsu1,pvalue,logpv))
colnames(assoc)<-c("chr", "pos_Mb","Abspos_Mb","pvalue","logpv")
#Extract max -logpvalues per chromosome
result <- vector("list",18)
for(i in 1:18){
pvch<-assoc[assoc$chr==i,]
maxpv<-which.max(pvch$logpv)
mpv<-pvch[maxpv,]
result[[i]] <-mpv
}
# List "maxlogpv" with the highest -log(p-value) per chromosome
maxlogpv<-do.call(rbind,result)
###########################################
## 4.1) Extract segments per chromosome ##
###########################################
#model_result<- vector("list",nrow(maxlogpv))
model_result<- vector("list",2)
# for(i in 1:nrow(maxlogpv)){
for(i in 1:2){
#Example!!! Chromosome 1 and 2
psnp<-maxlogpv[i,3]
phigh<-as.numeric(psnp)+1
plow<-as.numeric(psnp)-1
#Extract SNP in this range
slxmap<-(assoc[,3]<=phigh)&(assoc[,3]>=plow)
slxmap<-assoc[slxmap,]
Computation of Z and G matrices for candidate segments
Using the markers or SNPs into the segment, a new Z ("Z1") and G ("G1") matrices are calculated, whereas
genomic relationship matrix G2 was built using all remaining SNPs. Next, the model equal to:
y = X  + a1 + a2 + e ("model2"), where a1 is the vector of random effects associated with those SNP
located in the segment, such that
a1


N 0, G 12 1A and a2 is the vector of additive random effects
associated with all SNPs except those involved with a1 , such that a2


N 0, G2 2A2 . The
model2
is
compared with the reduce model y = X  + a + e ("model1"), from results obtained across the regress R
package [2].
###################################################################
## 4.2) compute matrix Z for snp selected (normalized separtely) ##
###################################################################
# Create Gseg for the segement using the columns in Znf2 belong to SNP segment
Zidg<-colnames(Znf2)%in%rownames(slxmap)
Zs<-Znf2[,Zidg]
################## Extrac for the GTo
G2<-GTo-(Zs%*%t(Zs))
### compute matrix Z2 and G2 for snp selected in the segmwnt (normalized separtely)
Zidx<-colnames(genf2)%in%rownames(slxmap)
genof2fil<-genf2[,Zidx]
idxfrq<-names(all_frq_f0)%in%rownames(slxmap) # Select the frequencies for the SNP segement
sum(idxfrq==TRUE)
fil_frq_f0<-all_frq_f0[idxfrq] # Extract the frequencies for the SNP segment
# Matrices Z1 and G1 (use the selected markers)
#Apply the function for the SNP segement
Z1<-zstandard(genof2fil,alfreq=fil_frq_f0,procedure="heterogeneous")
G1<-Z1%*%t(Z1)
# G1 for SNP selected
##########################################
## 4.3) compute models with "regress" ##
##########################################
model1<-regress(pheno~x,~GTo)
# y=Xb+a+e
model2<-regress(pheno~x,~G1+G2) # y=Xb+a1+a2+e
######################################
## 4.4) Performed models comparison ##
######################################
llik<-list(Result=list(l0=model1,l1=model2))
#####################################################
## 4.5) Results from Regress for model1 and model2 ##
#####################################################
model_result[[i]]<-sapply(llik,summary.ll.fit) # Results refer to Additional files in "Table 1"
}
comp_model<-do.call(rbind,model_result)
The function summary.ll.fit list the results obtained by regress R package [2] for "model1" and
"model2", as follow:
Loglikem2:
LogLikem1:
LRT:
LRTseg:
varE1:
LogLikehood "model2"
LogLikehood "model1"
Likehood Ratio Test for "model1" and "model2"
p-value for Likehood Ratio Test for the segment
2
Error variance e ) of "model1"
varA1:
2
Additive variance A ) of "model1"
varE2:
2
Error variance e ) of "model2"
VarA2:
2
Additive variance A ) of "model2"
Varseg:
Additive variance segment A1 ) of "model2"
Varseg_pr:
Proportion in % of the total variance explained by the segment.
2
Results for function summary.ll.fit are stored in the object "comp_model"
To cite this code use:
Gualdrón Duarte JL, Cantet RJC, Bates RO, Ernst CW, Raney NE, Steibel JP. Rapid screening for
phenotype-genotype associations by linear transforming genomic evaluations. BMC Bioinformatics
2014. (Submitted)
References
1. Wimmer V, Albrecht T, Auinger H-J, Schön C-C: synbreed: a framework for the analysis of genomic
prediction data using R. Bioinformatics 2012, 28:2086–7.
2. Clifford D, McCullagh P: The regress function. R News 2006, 6:6–10.
3. VanRaden PM: Efficient methods to compute genomic predictions. J. Dairy Sci. 2008, 91:4414–23.