Heritability and genomics of gene expression in peripheral blood

Paper: http://www.nature.com/ng/journal/v46/n5/full/ng.2951.html
Related news: http://news.ncsu.edu/releases/wrightnatgen/
Presenter: Pak-Kan, WONG (13/05/2014)
1
Contents
• Background
• Method Summary
• Results
• Heritability in peripheral blood transcriptome
• eQTL analysis
• Biomedical relevance
• Discussions
2
Background
3
Expression Quantitative Trait Loci
(eQTL)
• QTL: Stretches of DNA containing or linked to the genes that
underlie a quantitative trait
• eQTL: QTL that regulate expression levels of mRNAs or proteins
[Figure: cis-eQTL, trans-eQTL, and master trans-eQTL]
Image credit: http://www.biostat.jhsph.edu/GenomeCAFE/ExpressionistSeminarSlides/eQTL_review_s.ppt
4
Peripheral Venous Blood
• Blood vessels outside the heart (the peripheral vessels) make up the
peripheral blood system.
• Venous blood is deoxygenated blood that travels from the peripheral
vessels, through the venous system, into the right atrium.
http://en.wikipedia.org/wiki/Peripheral_blood
http://www.circulationfoundation.org.uk/help-advice/vascular-health/the-circulatory-system/
5
Classical Twin Design (CTD)
• Allow the study of varying family environments (across pairs) and
widely differing genetic makeup:
• "Identical" or monozygotic (MZ) twins
• Share nearly 100% of their genes, which means that most differences
between the twins (such as height, susceptibility to boredom,
intelligence, depression, etc.) are due to experiences that one twin has
but the other does not.
• "Fraternal" or dizygotic (DZ) twins
• Share only about 50% of their genes.
• Thus powerful tests of the effects of genes can be made. Twins share
many aspects of their environment (e.g., uterine environment,
parenting style, education, wealth, culture, community) by virtue of
being born in the same time and place.
• The presence of a given genetic trait in only one member of a pair of
identical twins (called discordance) provides a powerful window into
environmental effects.
Ref.: http://en.wikipedia.org/wiki/Twin_study
http://ibg.colorado.edu/cdrom2012/keller/Assumptions/Keller_Coventry_CTD_Indeterminacy_2005.pdf
6
Classical Twin Design Mathematical Model
• Monozygotic (MZ) twins: sharing all of their alleles
• Dizygotic (DZ) twins: sharing on average 50% of their polymorphic
alleles
• Assumption: Equal environments for identical and fraternal twins
• The model assesses the variance of a phenotype in a large group and
estimates how much of it is due to each of the following:
1. Genetic effects (heritability): factors A, D
2. Shared environment, i.e. events that happen to both twins and affect
them in the same way: factor C
3. Unshared, or unique, environment, i.e. events that occur to one twin
but not the other, or events that affect each twin in a different
way: factor E
7
ACE Model
A=h2: additive genetics
C=c2: common environment
E=e2: unique environment
A+C+E=1
• MZ: share 100% of their genes, share all of the environment
• Correlation between identical twins provides an estimate of
rmz = A+C
• DZ: share on average 50% of genes, share all the environment
• Correlation between fraternal twins is a direct estimate of
rdz = 0.5A+C
Expectation
E = 1-rmz
A = 2(rmz-rdz)
C = rmz-A
Ref.: http://onlinelibrary.wiley.com/doi/10.1002/0470013192.bsa002/pdf
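The ACE expectations above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the paper's analysis: the function name `ace_from_twin_correlations` and the example correlations (0.6 and 0.4) are made up for illustration.

```python
def ace_from_twin_correlations(r_mz: float, r_dz: float) -> dict:
    """Moment estimates of A, C, E from MZ and DZ twin correlations."""
    a = 2.0 * (r_mz - r_dz)  # A = 2(rmz - rdz)
    c = r_mz - a             # C = rmz - A
    e = 1.0 - r_mz           # E = 1 - rmz
    return {"A": a, "C": c, "E": e}


# Hypothetical correlations, chosen only to illustrate the formulas.
estimates = ace_from_twin_correlations(r_mz=0.6, r_dz=0.4)
print(estimates)
print(sum(estimates.values()))  # A + C + E should be 1 up to rounding
```

Note that these moment estimates can fall outside [0, 1] when rmz > 2·rdz or rmz < rdz, which is one reason a likelihood-based fit is used later in the talk.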
8
From Netherlands Twin Registry (NTR)
Introduction
9
Quantifying Human
Transcriptomic Heritability
• Although genes with genome-wide significant eQTLs are by
definition 'heritable', additional polygenic variation may be
widespread and fail to reach statistical significance in
standard genotype-expression association tests.
• Genes with substantial polygenic variation may also be subject
to unique selection pressures not apparent from the analysis
of local eQTLs.
10
Association Analysis of Genetical Genomics Data
1. Only a few studies have sample sizes > 1000, but we require > 3000
2. Findings often do not replicate, even when using the same HapMap LCLs under standardized procedures
• Especially for trans-eQTLs (due to tissue type, ancestry, winner’s curse, …)
3. Gene expression in the commonly used LCLs is sensitive to EBV copy number and growth rates
Franke, Lude, and Ritsert C. Jansen. "eQTL analysis in humans." Cardiovascular Genomics. Humana Press, 2009. 311-328.
11
Proposed Method
• Classical twin design (MZ vs. DZ)
• 2752 individual twins
• Cohort study
• Peripheral venous blood samples
12
Goals
1. To describe and evaluate the heritability of all transcripts
measured in peripheral blood
2. To identify a comprehensive list of local and distant eQTLs
and evaluate their characteristics and replicability
3. To assess the biomedical relevance of the identified eQTLs
13
Data Collection and Pre-Processing
• Subjects and biological sampling
• Using harmonized protocols
• Two longitudinal cohort studies (2-year follow-up)
(Callout: examined replication in eQTL analyses)
• Netherlands Twin Registry (NTR): 2752 (out of 3516) samples
• Netherlands Study of Depression and Anxiety (NESDA): 1895 (out of 2783) samples
• 227 controls
• Steady-state transcription in peripheral blood for 43638 probe sets from 18392
genes
• Gene expression assays
• Remove sex-mismatched samples and additional samples of poor quality
• Removal of 19 samples with the lowest D values resulted in the largest number
of significant transcripts (q<0.10)
• Genome-wide SNP assays
• Among 714 monozygotic twin pairs, the intrapair agreement for 686895
autosomal SNPs was 0.9985
• 8.3 million SNPs are used.
14
Demography of 2752 subjects from 1444 twin
pairs for twin-based heritability analyses
15
Twin-Based Transcript Heritability
• Maximize the logarithm of the profile restricted maximum likelihood (REML)
function:

$$\mathrm{plREML}(\sigma_e^2, \rho_a, \rho_c) = -\frac{1}{2}\log\left(|V| \cdot \left|[1, X]' V^{-1} [1, X]\right|\right) - \frac{1}{2\sigma_e^2}\, r' V^{-1} r - \frac{n-p-1}{2}\log \sigma_e^2 - \frac{n-p-1}{2}\log 2\pi$$

where
• $\rho_a = \sigma_a^2 / \sigma_e^2$
• $\rho_c = \sigma_c^2 / \sigma_e^2$
• $r = y - [1, X]\left([1, X]' V^{-1} [1, X]\right)^{-1} [1, X]' V^{-1} y$
• $V = \rho_a A + \rho_c C + I$
• $p$ is the rank of $X$
• $A$: the correlation matrix of zygosity.
• $C$: the correlation matrix of twins.
• $y$: the expression values.
• $X$: the covariates.

Twin-based heritability $a^2 = \sigma_a^2 / (\sigma_a^2 + \sigma_c^2 + \sigma_e^2)$
Shared environmental effects $c^2 = \sigma_c^2 / (\sigma_a^2 + \sigma_c^2 + \sigma_e^2)$
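To make the profile function concrete, here is a minimal NumPy sketch that evaluates it for given parameter values and converts the variance ratios into a² and c². It is an illustration under stated assumptions, not the authors' R implementation: the names `profile_reml` and `h2_c2` are hypothetical, and the code assumes [1, X] has full column rank, so that its rank equals p + 1.

```python
import numpy as np


def profile_reml(sigma_e2, rho_a, rho_c, y, X, A, C):
    """Evaluate the log profile REML function at (sigma_e2, rho_a, rho_c)."""
    n = y.shape[0]
    Z = np.column_stack([np.ones(n), X])       # design matrix [1, X]
    dof = n - np.linalg.matrix_rank(Z)         # n - p - 1 (assumes full column rank)
    V = rho_a * A + rho_c * C + np.eye(n)
    ZtVinvZ = Z.T @ np.linalg.solve(V, Z)      # [1, X]' V^-1 [1, X]
    beta = np.linalg.solve(ZtVinvZ, Z.T @ np.linalg.solve(V, y))
    r = y - Z @ beta                           # generalized least squares residuals
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_Z = np.linalg.slogdet(ZtVinvZ)
    quad = r @ np.linalg.solve(V, r)           # r' V^-1 r
    return (-0.5 * (logdet_V + logdet_Z)
            - quad / (2.0 * sigma_e2)
            - 0.5 * dof * np.log(sigma_e2)
            - 0.5 * dof * np.log(2.0 * np.pi))


def h2_c2(sigma_e2, rho_a, rho_c):
    """Convert the variance ratios back to a^2 and c^2."""
    sigma_a2 = rho_a * sigma_e2
    sigma_c2 = rho_c * sigma_e2
    total = sigma_a2 + sigma_c2 + sigma_e2
    return sigma_a2 / total, sigma_c2 / total
```

In practice the function would be passed (with its sign flipped) to a numerical optimizer over the three parameters; that step is omitted here.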
16
Results
Twin-Based Heritability in the
Peripheral Blood Transcriptome
17
Investigating on Expression Covariates
• To identify a minimal set of covariates
• Increase power for expression heritability calculation and
improve the eQTL mapping
• The covariates can be roughly divided into
1. Covariates related to technical variation
2. Clinical covariates that are subject specific
3. Covariates related to blood counts, which if not properly
accounted for might produce spurious “eQTL” relationships.
18
19
Manhattan plot of heritability P values for the
transcript with the highest h2 estimate
𝑞 = 0.05
18392 genes
h2 for all genes: 0.101 ± 0.142
h2 for expressed genes: 0.138 ±0.153
Max h2 = 0.905
20
K-means clustering of 777 (4.2%) genes
with q<0.05 for h2 estimates
Mean within-cluster expression correlation r ranged from 0.46 to 0.006
21
Tissue relevance
[Figure: panels for clusters 1 to 9]
22
Heritability was strongly
associated with expression
mean and variance.
Values in bold correspond to
P<0.0022, for Bonferroni significance
at α=0.05 for 23 tests in each of
uncorrected and corrected analyses.
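The bold-value threshold quoted above follows directly from the Bonferroni correction; a two-line check (variable names are arbitrary):

```python
ALPHA = 0.05   # family-wise significance level
N_TESTS = 23   # number of tests in each analysis

threshold = ALPHA / N_TESTS
print(f"{threshold:.4f}")  # 0.0022
```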
And numerous KEGG and GO pathways ...
23
Disease Relevance
Using the NHGRI GWAS catalog, the nearest gene (GWAS gene) was identified for each of 3628 significantly
disease-associated SNPs (P ≤ 5×10^-8), for a total of 2343 GWAS genes.
[Figure annotation: "elevated"]
24
Hypothesis “Disease-causing
genes are highly heritable.”
• Given that GWAS genes were designated only on the basis of proximity
to NHGRI-listed SNPs, these results may reflect an even stronger true
tendency of disease-causing genes to be highly heritable.
• These results are complementary to observations that disease-associated SNPs show eQTL enrichment.
• OMIM database shows similar heritability enrichment, even though
NHGRI GWAS and OMIM only partly overlap (of genes in either list, 10%
are in both).
• The OMIM genes with significant heritability (q<0.05) are also quite
diverse, further supporting the potential relevance of peripheral blood
to other tissues and developmental processes.
• Evolutionary associations are consistent with the observation that
heritability is necessary for responsiveness to selection.
• Enrichment of disease-associated heritability may reflect other
underlying sources of commonality but still point to transcription as an
important intermediary in disease risk.
25
Results
Local Genetic Contributions and Bias in
Heritability Estimation
26
Local Genetic Contributions and
Bias in h2 Estimation
• In published studies, estimates have been complicated by bias
and variability in h2 estimation.
27
Definitively Assessing the True Extent
of Transcriptomic Heritability
• Model true h2 as following a gamma distribution with
sampling variation determined by the ACE model
[Figure annotations: 7.9%; similar mean h2; less variation]
For twin-based h2 estimates (n = 2752; 8818 expressed genes shown), subtracting the effects of
sampling variation produces an estimated true distribution (blue curve).
Resimulating from the fitted true assumed distribution closely approximates the observed h2
estimates (black curve).
28
Discrepancy between NTR and MuTHER
• Expressed genes in both skin and LCLs with h2>0.5
• MuTHER report estimated >700
• NTR estimated ~100
• Effect of age?
• NTR mean age was ~20 years younger
• But age is not a covariate
• Effect of sample size?
• The sample size of MuTHER is much smaller.
• Apply the gamma fit and artificially add sampling error to the true
distribution (i.e., inflate the sampling variation)
• Fit the NTR estimated h2 distribution again
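The sample-size argument can be illustrated with a toy simulation: draw "true" h2 values from a gamma distribution, add normally distributed sampling error, and count how many estimates exceed 0.5. Everything numeric here is an assumption for illustration (gamma shape 1.0 and scale 0.14, standard errors of 0.05 and 0.20, 20000 genes); none of these are the fitted values from the paper, and `fraction_above` is a hypothetical helper.

```python
import random


def fraction_above(threshold, se, n_genes=20000, shape=1.0, scale=0.14, seed=1):
    """Fraction of genes whose noisy h2 estimate exceeds the threshold."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_genes):
        true_h2 = min(rng.gammavariate(shape, scale), 1.0)  # assumed true h2
        observed = true_h2 + rng.gauss(0.0, se)             # add sampling error
        if observed > threshold:
            count += 1
    return count / n_genes


# Same assumed true distribution, different amounts of sampling error.
large_study = fraction_above(0.5, se=0.05)  # small standard error
small_study = fraction_above(0.5, se=0.20)  # large standard error
print(large_study, small_study)
```

Under these assumptions the study with the larger standard error reports more genes with h2 > 0.5 even though the true distribution is identical, which is the direction of the NTR versus MuTHER discrepancy discussed above.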
31
How many samples do we need?
Small sample size
Effect of Sample Size
1.0 correlation is not attainable…
32
Results
eQTL Analyses of Peripheral Blood
33
Genotypes as Predictors of Transcription
• Two types of eQTLs (SNP-gene pairs)
• Local: SNP within 1 Mb upstream of the TSS or 1 Mb downstream of the
TES
• Distant: Otherwise
• Genes with at least one local eQTL (q<0.01) had significantly
higher expression levels and heritability (P < 1×10^-200 for both)
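The local/distant rule above can be written as a small classifier. This is a sketch under assumptions: the function name `classify_pair`, the coordinate convention (base-pair positions on one reference build), and the inclusive window boundaries are all choices made here, not details taken from the paper.

```python
WINDOW_BP = 1_000_000  # 1 Mb on each side of the gene


def classify_pair(snp_chrom, snp_pos, gene_chrom, tss, tes, window=WINDOW_BP):
    """Label a SNP-gene pair as 'local' or 'distant'."""
    if snp_chrom != gene_chrom:
        return "distant"
    # min/max makes the rule strand-independent: on the minus strand the
    # TSS has the larger coordinate, but the window is symmetric.
    start, end = min(tss, tes), max(tss, tes)
    if start - window <= snp_pos <= end + window:
        return "local"
    return "distant"


# Hypothetical coordinates, for illustration only.
print(classify_pair("1", 1_500_000, "1", 2_000_000, 2_050_000))  # local
print(classify_pair("1", 500_000, "1", 2_000_000, 2_050_000))    # distant
```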
34
Number of Unique Genes with Evidence of Local
Association
With increasing sample size, it seems that most
expressed genes (>10000) show evidence of
local eQTL influence in peripheral blood.
For NTR, the number of genes with
significant eQTLs (q<0.01) was 11384.
After employing final quality control steps,
9640 significant genes.
Little difference among
the transformations
35
Overlap of local eQTL findings with two
other large blood studies, at q<0.01
Peripheral blood eQTL meta-analysis of
Westra et al.
NTR
NESDA
Local eQTL replication
Annotated Genes
True discovery rate: 59.6% and 59.7%
36
Results
Characteristics of Distant eQTLs
37
Number of unique genes with evidence (q<0.01)
for distant association
Roughly linear in log-log scale
38
Overlap of distant eQTL findings (q<0.001)
with previous studies (within 1 Mb of gene)
Peripheral blood eQTL meta-analysis of
Westra et al.
NTR
NESDA
Distant eQTL replication
39
Properties of Distant eQTLs
Examined using Ensembl Variant Effect Predictor v2.8
Lowest rate of overlap
with regulatory features
or replication in NESDA
40
eQTL Hotspots (SNPs influencing numerous transcripts)
• 304 distant eQTL SNPs → 203 regional clusters
• 160 clusters: 1 SNP
• 43 clusters: 2 kb to 2 Mb of DNA (median 89 kb)
• Potential hotspots: 11 clusters associated with ≥ 6 genes
• Estimated the proportion of associated transcripts influenced by the
304 SNPs, using NESDA data to avoid selection bias.
• eQTL hotspots and significant distant eQTLs influence
relatively few genes.
• Lower than reported in the MuTHER study
41
Putative eQTL Hotspot
• A distant eQTL hotspot on chr19 was associated with the
expression of 12 distant genes and 1 local gene (MYO1F)
• MYO1F expression is independent of the expression of the
other distant genes, given the expression of the transcription
factor SOX13
42
Biomedical Relevance
• NHGRI GWAS catalog + filtering at P < 1×10^-8
• 3415 SNPs, 498 traits and 4167 SNP-trait pairs from 927 reports
Trait or disease: Height; High-density lipoprotein cholesterol; Crohn’s disease; Type 2 diabetes; Ulcerative colitis
SNPs found (values as extracted; their assignment to traits is not recoverable): 248, 92, 98, 81, 155
• Of the 3118 genes in OMIM, 74.4% were part of a SNP-gene
local eQTL pair (q<0.05).
• …
43
Conclusion
1. Assessed gene expression profiles in 2,752 twins
• Classic twin design to quantify expression heritability and eQTLs
in peripheral blood
2. Group ~777 highly heritable genes into 9 clusters
3. Suggest that previous heritability estimates examined in a
replication set have been upwardly biased
4. Provide a new resource toward understanding the genetic
control of transcription
44
Comments
• New resource to support newly identified SNPs
• Computational pipeline for a broad range of twin-based experiments
• Sampling variation with small sample sizes
• Why and how are they correlated?
• Functions of each gene in the cluster; multiple layers of control?
• New things to explore?
45
Data
• Nature Genetics paper + Supplementary Notes
• http://www.nature.com/ng/journal/v46/n5/fig_tab/ng.2951_ft.html
• Expression data and genotypes (Affymetrix 6.0 and U219)
• http://www.ncbi.nlm.nih.gov/gap/?term=phs000486
• Summary results in the seeQTL browser (GWAS results p<5e-8)
• http://gbrowse.csbio.unc.edu/cgi-bin/gb2/gbrowse/seeqtl/
46
Related Links
• Netherlands Twin Register (NTR in Dutch)
• http://www.tweelingenregister.org/en/
• FastFacts about NTR
• http://fastfacts.nl/en/content/netherlands-twin-register
• The Multiple Tissue Human Expression Resource (MuTHER)
• http://www.muther.ac.uk/
47
Correlation Matrix
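• $a_{ij} = \mathrm{cor}(\gamma_i, \gamma_j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are MZ} \\ 0.5 & \text{if } i \text{ and } j \text{ are DZ} \\ 0 & \text{if } i \text{ and } j \text{ are unrelated} \end{cases}$
(Callout on the DZ value: in the GW heritability analysis using DZ
twins, this value was re-estimated by PLINK, with mean 0.501 and standard deviation 0.038.)
• $c_{ij} = \mathrm{cor}(\delta_i, \delta_j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are twins} \\ 0 & \text{if } i \text{ and } j \text{ are unrelated} \end{cases}$
• $y \sim \mathcal{N}(\mu 1 + X\beta, \Sigma)$
• where $\Sigma = \sigma_a^2 A + \sigma_c^2 C + \sigma_e^2 I$
• Re-express $\Sigma = \sigma_e^2 V$, where $V = \frac{\sigma_a^2}{\sigma_e^2} A + \frac{\sigma_c^2}{\sigma_e^2} C + I = \rho_a A + \rho_c C + I$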
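As an illustration of the definitions on this slide, the sketch below builds A and C for a small set of twins and assembles Σ. The helper names `twin_matrices` and `covariance`, the input encoding (a pair label and an "MZ"/"DZ" string per individual), and the variance components in the example are assumptions made here; the fixed value 0.5 for DZ pairs is used rather than a PLINK re-estimate.

```python
import numpy as np


def twin_matrices(pair_ids, zygosity):
    """Build A (genetic) and C (shared environment) correlation matrices.

    pair_ids[i] is the twin-pair label of individual i;
    zygosity[i] is "MZ" or "DZ" for that individual's pair.
    """
    n = len(pair_ids)
    A = np.eye(n)  # each individual has correlation 1 with itself
    C = np.eye(n)
    for i in range(n):
        for j in range(n):
            if i != j and pair_ids[i] == pair_ids[j]:
                A[i, j] = 1.0 if zygosity[i] == "MZ" else 0.5
                C[i, j] = 1.0
    return A, C


def covariance(A, C, sigma_a2, sigma_c2, sigma_e2):
    """Sigma = sigma_a^2 A + sigma_c^2 C + sigma_e^2 I."""
    return sigma_a2 * A + sigma_c2 * C + sigma_e2 * np.eye(A.shape[0])


# One MZ pair, one DZ pair, and one singleton (hypothetical data).
A, C = twin_matrices([1, 1, 2, 2, 3], ["MZ", "MZ", "DZ", "DZ", "MZ"])
Sigma = covariance(A, C, sigma_a2=0.3, sigma_c2=0.2, sigma_e2=0.5)
print(Sigma)
```

Dividing Σ by σe² gives V = ρa A + ρc C + I, the matrix that enters the profile function.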
48
On the Profile Function for Twin-Based Heritability
• REML accounts for the loss in degrees of freedom associated with the
fixed-effect estimates.
• REML estimates are less biased than the corresponding maximum
likelihood estimates and control type I error better.
• The profile function has only three parameters regardless of
the number of fixed effects and is computationally more
efficient than maximizing over the full REML function
• An algorithm for twin-based heritability analysis was developed in R
49