1 DNA was extracted from blood sample ... San Diego, CA, USA). DNA concentration and quality was...

1
Supplementary Materials and methods
Sample preparation
DNA was extracted from blood sample by using Qiagen QIAamp DNA blood mini/midi kit (Qiagen,
San Diego, CA, USA). DNA concentration and quality was determined by measuring light absorbance
at 260 nm and 280 nm. Samples with A260/A280 ratio between 1.7 and 2.0 were included for
genotyping. RNA for expression assays were extracted using TRIzol reagent (Life Technologies,
Rockville, MD, USA) and converted to cDNA using an oligo (dT) 15 primer and Superscript III
(Invitrogen, Carlsbad, CA, USA).
SNP selection
The association region for fine mapping was delimited by recombination hotspots flanking the
candidate genes ADD3 and XPNPEP1, which had accommodated the haplotypes possibly tagged by
rs17095355 and other linked association markers. The region spans 350kb from chr10:111600kb to
chr10:111950kb, and accommodates 1580 SNPs (NCBI dbSNP build37).
We did regional imputation based on the 13 GWAS SNPs genotype data in the fine mapping
region (MACH1.0, based on HapMap Phase II data)[1]. In GWAS, the 13 SNPs were associated with
BA as tagged by rs17095355, but had no independent effect in disease association. The imputation did
not produce higher disease association signals.
2
Next, we ran a dense-mapping strategy to assay the genetic architecture in the region. Multiple
allelic SNPs or SNP with minor allele frequency (MAF) <1% in Asian were excluded, and 624 out of
1580 SNPs were left. Tag SNPs was selected after wise by using r2>0.99 based on the HapMap
Release #22 genotype data of Asian individuals. For the completely tagged SNPs, functional SNPs
were prioritized according to the genomic coordinates
ftp://hgdownload.cse.ucsc.edu/goldenPath/).
(UCSC regulatory track, hg18,
As a result, 107 SNPs in total were selected for
genotyping.
SNP genotyping
SNPs were genotyped on Sequenom platform in 2 batches using the same reaction lot. To assess
genotype reproducibility, we included 28 randomly duplicated samples on the 7 Polymerase Chain
Reaction (PCR) plates. Two SNPs rs17095355 and rs2501577 were genotyped by Sanger sequencing.
The sequencing result was read by two technicians independently.
Phenotype permutation strategy
Correction for the inflated type I error in multiple-testing using the Bonferroni method[2] is too
conservative for our dataset, because these SNPs are highly correlated with each other. Alternatively,
we applied the phenotype permutation test to empirically estimate the study-wide significance level of
3
our association results, by swapping the phenotype identity (105 times) of our samples, testing the
“case/control” association and then examining the possibility of getting a better-than-the-observed (all
variants) association statistics. Since swapping the phenotype labels preserved the linkage structure
among variants, the possibility, or the empirical p (EMP2) value, fully addressed the correlation
among variants, while powerfully deflating the type I error in multiple-comparisons.
Another advantage of the phenotype permutation test is that sample substructures other than
case/control segregation will be exposed by the random “case/control” association test, therefore the
better-than-the-observed (single variant) empirical p value (EMP1) can be applied as a quality control
method to test associations caused by case/control bias. We plotted EMP1 against the observed
association p values of variants in a quantile-quantile (QQ) plot to examine whether there was
deviation was caused by population stratification (Figure S2).
The quality control procedure (Please refer to Figure S1)
Samples with mixed-labeling errors, and those with low genotyping rate (<95%) were excluded from
further analysis. SNPs with genotyping rate <95% (N=14), minor allele frequency (MAF) <1% (N=1),
and SNPs that deviated from Hardy–Weinberg equilibrium (N=2; p≤0.0001) were excluded for
downstream association tests. After QC, to refine the variants mapping, we imputed un-genotyped
SNPs in 10q24.2 by IMPUTE2[3] (Table S2).
4
Imputation
To explore the un-genotyped SNPs in the fine mapping region, we imputed the region encompassing
the genotyped SNPs using IMPUTE2[3]. We used the 1000 Genomes Phase I reference data contains
1,094 individuals from Africa, Asia, Europe, and the Americas. SNPs with a quality score (i.e.
info) >0.30 were considered eligible for downstream analysis (896 SNPs). Additionally, 41 missing
genotypes of the typed SNPs (after genotyped SNPs quality control) using the implemented -pgs_miss
function by IMPUTE 2.0.
Disease association tests of the imputed genotypes were done using logistic regression test
which accounted for the dosage uncertainty of imputation by PLINK [4]. Two hundred and fifty five
SNPs were detected to be associated with disease (p<0.05 by phenotype permutation test; Figure 1b,
Table S2).
Model selection
The rationale of the model selection strategy was developed from Macleod and Xu’s method[5]: for
the generalized linear-regression model (GLM), subsets of genotypes (variables) were randomly
sampled to predict the BA phenotype (response), subsequently the fitness of each prediction was
evaluated by Akaike Information Criterion (AIC)±2 value (a parsimonious model was prioritized), and
the best model was chosen. Since exhaustive sampling is not feasible at the current computational
5
platform, given the great number of variants, we modified the approach by using a Markov Chain
Monte Carlo (MCMC) method with a simple Metropolis-Hasting algorithm 105 times (after 5000 burnin) for each fixed number of variants (ranging from 2 to 10). The algorithm was conducted in R[6] and
the script is attached as supplementary material.
To reduce noise of random fitting, associated variants detected by the phenotype permutation
procedure were included in the model selection procedure. Each SNP had been genotyped with equal
sample size to minimize bias of power in the procedure.
Haplotypes were estimated by an expectation-maximization (EM) algorithm from the genotypes.
The haplotype-specific case-control association test was performed by PLINK using the logistic
regression method and incorporating the S2 covariate to eliminate possible confounding effect.
Phenotype permutations (105 times) were performed to evaluate the empirical haplotype-specific
significance[4].
Independent effect of the tagging SNPs was interrogated by haplotype-based
conditional tests implemented in PLINK[4, 7].
Real-time quantitative PCR
The cDNA products equivalent to 10ng of total RNA were used for quantitative real-time PCR (RTPCR) using TaqMan gene expression assays from Applied Biosystems.
The assays for ADD3
(Hs00249890_m1) and XPNPEP1 (Hs00958026_m1) both target all isoforms of these genes.
Real-time quantitative PCR was performed in triplicate (96-well plates) on an ABI 7900
6
(Applied Biosystems) machine using standard thermal cycling conditions (10min at 95 oC, 40 cycles for
15s at 95oC, 1min at 6oC). A standard curve was constructed for each PCR run with plasmid cDNA of
ADD3 and XPNPEP1 gene. The amount of target gene per sample was interpolated according to the
standard curves. All analyses were performed in a blinded fashion with the laboratory operators
unaware of genotyping data.
ADD3 immunohistochemistry staining
ADD3 immunohistochemistry was performed to examine the expression pattern in BA livers (Adducin
γ Antibody (H-60), Catalog Number: SC-25733, Santa Cruz, CA, U.S.A).
Formalin-fixed liver tissues were processed and 5-mm thick paraffin sections were used for
ADD3 immunohistochemistry. Briefly, deparaffinized liver sections were treated with 3% hydrogen
peroxide followed by antigen retrieval.
The sections were blocked with 10% sheep serum, and
incubated with the primary antibody, which was then linked to the biotinylated secondary antibody
followed by the avidin–biotin complex (ABC Vectastain kit; Vector Laboratories, Burlingame, CA,
USA), and diaminobenzidine, which gave a brown reaction product. The sections were then
counterstained with hematoxylin.
Bioinformatics investigation of SNP functionality
The fine mapping genomic region including the genetic region of XPNPEP1 and ADD3, and
7
upstream/downstream sequence of the genes, therefore was enriched with cis regulatory elements
(http://genome.ucsc.edu/). Because the fine mapping region can have undergone positive genetic
selection, conservation-based functionality prediction from multiple species sequence alignment was
less prioritized in this case. Alternatively, we applied experimental predictions for bioinformatics
investigation of SNP loci functionality.
We used Encode Regulation super-track (including 5 sub-tracks; italic fonts indicate track
names, same as below)[8], regulatory potential prediction (regPotential7x)[9], and microRNA target
scan (TS miRNA sites)[10] for the analysis (Figure S6). The related genomic regions were downloaded
as .bed files from UCSC website (http://genome.ucsc.edu/), except for the layered H3k4me3, layered
H3k4me1, and layered H3k27ac regulatory tracks in the Encode Regulation super-track, for which we
downloaded the .wiggle files (data points at a resolution of 25 base pair per point). Signal intensity in
the .wiggle files, reflected as height in related tracks of Figure S6. To reduce false discoveries, only
loci with signal intensity greater than 30 in 2 or more tracks (layered H3k4me1, layered H3k27ac, and
DNase clusters) were considered as positive for Enhancer prediction. Promoter prediction was mainly
based on layered H3k4me1, the Encode promoter-associated histone mark prediction (signal intensity
greater than 30).
Besides, we also checked expression quantitative trait loci based on liver array data[11], but no
positive loci for either XPNPEP1 or ADD3 was found. Positive predictions are listed in Table S2.
8
Assay the signature of selection: the fixation index (Fst)
HapMap Phase II samples include 60 Utah residents with ancestry from northern and western Europe
(CEU), 45 Han Chinese in Beijing (CHB), 45 Japanese in Tokyo (JPT), and 60 Yoruba in Ibadan,
Nigeria (YRI) were included. We pooled the data of CHB and JPT as one geographical group and
denoted it as ASN(Asian)[12].
A total of 2881972 SNPs Fst values genotyped in all three populations, excluding sex
chromosome SNPs, were analyzed. Fst was calculated by Cheng, et.al[13].
The fractions of highly differentiated SNPs (F) were used to test the enrichment status of
genetic regions: the ADD3 locus, the whole genome gene loci, and the sliding windows. Each gene
locus was defined as the Refseq gene region extending 30 kilo base pairs both upstream and
downstream. Each sliding window include 200 SNPs, slide along each chromosome with one SNP
forward for each step. Haplotype blocks were estimated using Gabriel’s[14] algorithm using default
parameters in Haploview[15], the computing was completed using PLINK[4]; only blocks with more
than 10 SNPs were considered for downstream analysis. The analytical steps are illustrated in Figure
S4.
The definition of ancestral allele is based on chimpanzee/human sequence alignment from
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/)[16].
9
REFERENCES
[1]
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to
estimate haplotypes and unobserved genotypes. Genetic epidemiology 2010;34:816-834.
[2]
Dunn OJ. Multiple Comparisons Among Means. Journal of the American Statistical Association,
1961;56:9.
[3]
Howie B, Marchini J, Stephens M. Genotype imputation with thousands of genomes. G3
2011;1:457-470.
[4]
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set
for whole-genome association and population-based linkage analyses. American journal of human
genetics 2007;81:559-575.
Xu AIMaC. bestglm: Best Subset GLM. R package version 0.33. 2011.
[5]
[6]
Team TRDC. R: A Language and Environment for Statistical Computing. 2010.
[7]
Purcell S, Daly MJ, Sham PC. WHAP: haplotype-based association analysis. Bioinformatics
2007;23:255-256.
[8]
Consortium EP. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS biology
2011;9:e1001046.
[9]
King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC. Evaluation of regulatory
potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome
sequences. Genome research 2005;15:1051-1060.
[10] Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates
that thousands of human genes are microRNA targets. Cell 2005;120:15-20.
[11] Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, et al. Mapping the genetic
architecture of gene expression in human liver. PLoS biology 2008;6:e107.
[12] International HapMap C. The International HapMap Project. Nature 2003;426:789-796.
[13] Cheng F, Chen W, Richards E, Deng L, Zeng C. SNP@Evolution: a hierarchical database of
positive selection on the human genome. BMC evolutionary biology 2009;9:221.
[14] Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, et al. The structure of
haplotype blocks in the human genome. Science 2002;296:2225-2229.
[15] Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and
haplotype maps. Bioinformatics 2005;21:263-265.
[16] Carlson CS, Thomas DJ, Eberle MA, Swanson JE, Livingston RJ, Rieder MJ, et al. Genomic
regions exhibiting positive selection identified from dense genotype data. Genome research
2005;15:1553-1565.
Subset sampling n-script
N <- 100000
dev <- c()
size <- n
steps <- c()
S <- 1:72
S1 <- sample(S,size)
S2 <- setdiff(S,S1)
glmfit <- glm(y~x[,S1],family="binomial")
istep <- 0
while(istep < N){
i1 <- sample(S1,1)
i2 <- sample(S2,1)
S1.new <- c(setdiff(S1,i1),i2)
glmfit1 <- glm(y~x[,S1.new],family="binomial")
if(runif(1) < min(1,exp((glmfit$dev-glmfit1$dev)/2))){
S1 <- S1.new
S2 <- c(setdiff(S2,i2),i1)
glmfit <- glmfit1
}
istep <- istep+1
if(istep%%100==1){print(istep)}
steps <- cbind(steps,sort(S1))
dev <- c(dev,glmfit$dev)
}
bestdevs <- sort((unique(dev)))[1:10]
bestsubsets <- rep(0,10); for(i in 1:10){bestsubsets[i] <which(dev==bestdevs[i])[1]}
bestMarkerSets <- c()
for(i in 1:10){bestMarkerSets <rbind(bestMarkerSets,markerNames[steps[,bestsubsets[i]]])}
data.frame(bestdevs,bestMarkerSets,t(steps[,bestsubsets]))