Supporting Information Parfenov et al. 10.1073/pnas.1416074111

Supporting Information
Parfenov et al. 10.1073/pnas.1416074111
SI Materials and Methods
The presence of HPV and its integration into tumor genome were
detected using WGS (30–60×, n = 28 and 6–8×, n = 122), RNAseq, whole-exome sequencing, and molecular approaches (a PCRbased MassArray Assay, p16 staining, and HPV in situ hybridization). A subset of samples (n = 129), which lacked evidence of
HPV on the basis of RNA sequencing, was not analyzed by WGS.
WGS at 6–8× Target Coverage. Between 500 and 700 ng each genomic DNA (gDNA) sample was sheared using Covaris E220 to
∼250-bp fragments and then converted to a pair-end Illumina
library using KAPA Bio Kits with the Caliper (PerkinElmer)
Robotic NGS Suite according to manufacturers’ protocols. All
libraries were sequenced on HiSeq2000 using one sample per
lane with the paired end 2 × 51-bp setup. A tumor and its corresponding normal were usually loaded on the same flow cell.
Average sequence coverage was 6.49×, read quality was 38.5, and
94.6% of the reads were mapped. Raw data were converted to
the FASTQ format, and BWA alignment was used to generate
BAM files.
WGS at 30× Target Coverage. WGS libraries were prepared as
previously described (1) and analyzed at the Broad Institute to a
coverage of 30× as described at www.broadinstitute.org/cancer/cga.
RNA-Seq and Expression Quantification. Methods for sequencing
and data processing of RNA using the RNA-seq protocol have
been previously described for The Cancer Genome Atlas (TCGA)
(1). Briefly, RNA was extracted, prepared into Illumina TruSeq
mRNA libraries, sequenced by Illumina HiSeq2000 (resulting in
paired 48-nt reads), and subjected to quality control as previously
described (1). RNA reads were aligned to the hg19 genome assembly using Mapsplice (2). Gene expression was quantified for
the transcript models corresponding to the TCGA GAF2.1
(tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/
outputs/TCGA.hg19.June2011.gaf) using RSEM (3) and normalized within sample to a fixed upper quartile of total reads.
For additional details on this processing, refer to the description
file on the DCC data portal under the V2_MapSpliceRSEM
workflow (tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/
anonymous/tumor/luad/cgcc/unc.edu/illuminahiseq_rnaseqv2/
rnaseqv2/unc.edu_LUAD.IlluminaHiSeq_RNASeqV2.mage-tab.1.9.0/
DESCRIPTION.txt and https://cghub.ucsc.edu/docs/tcga/UNC_
mRNAseq_summary.pdf). For gene-level analyses, expression
values of zero were set to the overall minimum value, and
all data were log2-transformed (tcga-data.nci.nih.gov/docs/
publications/hnsc_2013).
Data Availability. All sequence BAM files are available at CGHub
(cghub.ucsc.edu/).
HPV Determination and Identification of Genomic Integration from
WGS Data. The PathSeq (4) algorithm was used to perform
computational subtraction of human reads followed by alignment of residual reads to a combined database of human reference genomes and microbial reference genomes (which includes
but is not limited to HPV genomes), resulting in the identification
of reads mapping with high confidence to HPV genomes in WGS
sample data. Subjects were classified as HPV-positive by WGS if
at least 100 reads mapping to the HPV genome were present;
otherwise, subjects were classified as HPV-negative. In addition to
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
identification of the HPV strain type, WGS data were also analyzed to identify HPV integration sites.
Using PathSeq, human reads were subtracted by first mapping
reads to a database of human genomes using BWA (5), Megablast, and Blastn (6). Only sequences with perfect or near-perfect
matches to the human genome were removed in the subtraction
process. To identify HPV reads, the resultant nonhuman reads
were aligned with Megablast and Blastn to a database of microbial genomes that includes multiple HPV reference genomes.
HPV reference genomes were obtained from the Human Papilloma Virus Episteme (7). An HPV-positive sample was considered integration-positive if there were at least three discordant read pairs and one split read supporting an integration
event. Discordant read pairs were defined as having one end of
the paired end read mapped to the HPV genome and having its
mate pair mapped to the human genome. Split reads were defined as having one end of the paired end read spanning the
integration junction and having its mate pair mapped to either
the human or HPV genome. To call an integration site, these
seven reads together must support integration at the same locus.
After HPV reads were obtained, we extracted all pair mates and
used BWA to map these paired end reads to a combined database containing the human genome and an HPV genome. Next,
soft-clipped reads in the HPV genome were identified and carried forward as potential split reads supporting integration sites.
For each split read, we searched for discordant read pairs with
one pair mate mapping to the human genome, and we verified
that the soft-clipped sequence and the discordant read pairs
supported the same location of integration. Read pairs where
both mates mapped to the human genome were also used to
identify the precise location of the human breakpoint (note that
all read pairs that reached this point of the analysis must have
had at least one read that had homology with an HPV genome).
Finally, if these methods could not allow us to define the precise
location of the human breakpoint, we returned to the original
human genome alignment and searched for soft-clipped reads in
the area of the integration site.
Concordance of HPV Presence and Integration Detection Between
Different Methods. Results of HPV detection with WGS were
concordant with the other sequencing (RNA-seq and wholeexome sequencing) and molecular (a PCR-based MassArray
Assay, p16 staining, and HPV in situ hybridization) approaches.
Two HPV(−), two HPV(+)/integration(−), and two HPV(+)/
integration(+) tumors, as detected by 6–8× WGS, were resequenced using 30× WGS. HPV presence/absence and integration events were concordant between the two techniques.
HPV Genetic Variants and Sublineage Analysis. WGS data for HPV16positive samples was aligned to the HPV16 prototype sequence
(GenBank accession no. AY686584) using BWA. To call variants,
we used GATK v2.5. HPV16 genomes were classified into sublineages based on the combinations of SNPs in the E6 gene and the
long control region (8).
HPV RNA-Seq Expression Analysis. Expression of HPV genes was
quantified by taking reads not aligned to the human genome
and aligning these reads to a database of microbial genomic sequences and gene annotations acquired from GenBank using
MapSplice (2). Read counts per HPV gene were determined
using GenomicFeatures (9) and collapsed to unique reads per
1 of 10
HPV gene (E6, E7, E1, E2, E4, E5, L2, and L1) among variants
of HPV. Mated reads were counted one time.
To analyze patterns of differential HPV gene expression, HPV
gene read counts were divided by total RNA-seq library aligned
read counts and gene lengths and standardized within each tumor
(mean 0, SD 1); then, genes were standardized (mean = 0; SD = 1).
Clustering of these normalized expression values revealed two
clusters that were tightly associated with HPV integration status. (Fig. S2A).
HPV Integration Breakpoint Enrichment Simulation. We identified
the breakpoints of HPV integration sites in the HPV genome, and
we evaluated whether these integration sites were distributed
nonrandomly in genes. We used only samples with HPV16 in this
simulation. First, we took 93 HPV16 integration breakpoint
locations and calculated the number of breakpoints occurring in
each HPV gene. Because the HPV genome has overlapping genes,
breakpoints in regions with multiple genes were assigned to both
genes. Second, we performed 10,000 simulations, in which we chose
93 HPV breakpoints in random locations in the HPV genome and
quantified the number of HPV breakpoints in each HPV gene.
Using the gene-specific distributions formed by the results of these
simulations, we evaluated whether our observed data were significant. Interestingly, we found that there were more E1 breakpoints than would have been expected by chance (P = 0.008) (Fig.
S2B). Notably, 21 of 25 integrated samples contained an integration site in E1. Integration in E1 would be predicted to
separate E2 from the HPV promoter, thus likely decreasing expression of the downstream E2 gene. Decreased expression of E2
is known to allow for increased expression of the E6 and E7 oncoproteins (10).
Identification of CNVs. To characterize somatic copy number
alterations in the tumor genome, we applied Bayesian Information
Criterion (BIC) sequencing (11), an algorithm that we have developed previously. We first counted the uniquely aligned reads in
fixed-size (1,000 bp), nonoverlapping windows along the genome.
Given these bins with read counts for tumor and matched normal
genomes, BIC sequencing attempts to iteratively combine neighboring bins with similar copy numbers. The decision to merge two
neighboring bins is based on the BIC, a statistical criterion measuring both the fit and complexity of a statistical model. Segmentation stops when no merging of windows improves BIC, and
the boundaries of the windows are reported as a final set of copy
number breakpoints. Segments with copy ratio difference smaller
than 0.1 (log2 scale) between tumor and normal genomes were
merged in the postprocessing step to avoid excessive refinement of
altered regions with high read counts.
million reads mapped) across 70% of the patient samples as not
expressed (white with black outline in Fig. 2 and Figs. S4 and
S5). For graphical representation of exon expression levels, we
mean-centered the RPKM values and divided by their SD for
each exon. The lengths of the composite exons were transformed
using the square root of the length.
DNA Methylation and mRNA Expression Analysis. The Infinium
HM450k array was used for TCGA Head and Neck samples
(13). This platform includes probes for more than 480,000 CpG
sites, spanning 99% of Ref-seq. In total, 96% of CpG islands and
92% of CpG shores are represented by at least one probe. The
analysis of DNA methylation and mRNA expression datasets
was performed using R/Bioconductor software with the limma
package and custom routines for data analysis (14–16). In each
dataset separately, we looked for differential methylation and
expression between HPV-positive samples with and without virus
integration. For each CpG locus and gene in DNA methylation
and mRNA expression datasets, we estimated a t statistic and
a P value by fitting a linear model of its differential methylation
and expression, respectively (17). All CpG loci and genes with
P values (adjusted by the Benjamini–Hochberg procedure) less
than 0.05 were considered as significantly differentially methylated or expressed. The sites and corresponding statistics for all
CpG loci and genes can be found in Dataset S1, Tables S3 and S4.
Additionally, we carried out an unsupervised hierarchical clustering with Euclidean distance on DNA methylation data using the
most variable 1.0% of CpG loci. As shown on Fig. S7A, the clustering
separates samples into two groups according to the dendrogram.
Fisher’s exact test was used to test for associations between methylation groups, virus integration status, and other clinical factors as
well as E6/E7 expression (Fig. S7B).
HPV Integration and Clinical Covariates. The log-rank test as
implemented in the survival package (18) of R/Bioconductor
software was applied to test for the association of HPV integration status and survival and showed no statistically significant association (P value = 0.11) (Fig. S1A). Student t tests were
used to test for the association between age, total number of
somatic mutations, and virus integration status (Fig. S1B), and
they also did not show statistically significant association. Additionally, Fisher’s exact test was used to test for associations
between virus integration and other clinical factors as well as
E6/E7 expression (Fig. S1C).
FISH. DNA from two fosmids, G248P80924G11 and G248P86073A9,
Exon Expression Graphs. To summarize the potential impact of
HPV integration on transcription of the host genes, we visualized
RNA-seq–derived exon expression levels for genes involved in
integration events. We first flagged exons that had expression
levels below one RPKM (reads per kilobase of transcript per
was isolated and labeled with 5-Fluorescein (Empire Genomics).
FISH was performed following standard protocols. Briefly, 5-μm
tissue slides were baked overnight at 600 °C, deparaffinized, heated
in pressure cooker for 20 min in citrate buffer (pH 6), treated with
a 150-μg/mL solution of Proteinase K, fixed in 1% neutral-buffered
formalin, and codenatured with the probe for 5 min at 800 °C.
Hybridization was carried out overnight at 37 °C, and posthybridization slide washes were carried out according to the Empire
Genomics protocol. FISH signal evaluation and acquisition were
performed manually using filter sets and software developed by
Applied Spectral Imaging. We attempted simultaneous FISH detection of HPV and RAD51B genomic material using the Pan
Human Papillomavirus Probe (Invitrogen). Unfortunately, we were
unable to detect HPV using the conditions optimized for the
RAD51B probe, perhaps because of significantly shorter length
HPV probe, which requires different posthybridization wash conditions followed by additional steps of tyramide signal amplification.
1. Cancer Genome Atlas Research Network (2012) Comprehensive genomic characterization of squamous cell lung cancers. Nature 489(7417):519–525.
2. Wang K, et al. (2010) MapSplice: Accurate mapping of RNA-seq reads for splice
junction discovery. Nucleic Acids Res 38(18):e178.
Spatial Correlation Analysis of Integration Sites. To assess spatial
correlation of integration sites and human genes, miRNAs, and
somatic CNVs, we used the software GenometriCorr (12). The
conclusion of whether two sets of intervals are spatially correlated across a genome was based on results of the Jaccard
measure and the projection test and estimation of whether we
are in the upper or lower tail of the distribution. The coordinates of the human genes and miRNAs were obtained from
the RefSeq genes, version hg19, and miRBase, version 19,
respectively.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
2 of 10
3. Li B, Dewey CN (2011) RSEM: Accurate transcript quantification from RNA-Seq data
with or without a reference genome. BMC Bioinformatics 12:323.
4. Kostic AD, et al. (2011) PathSeq: Software to identify or discover microbes by deep
sequencing of human tissue. Nat Biotechnol 29(5):393–396.
5. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25(14):1754–1760.
6. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein
database search programs. Nucleic Acids Res 25(17):3389–3402.
7. Van Doorslaer K, et al. (2013) The Papillomavirus Episteme: A central resource for
papillomavirus sequence data and analysis. Nucleic Acids Res 41(Database issue):
D571–D578.
8. Cornet I, et al.; IARC HPV Variant Study Group (2012) Human papillomavirus type 16 genetic variants: Phylogeny and classification based on E6 and LCR. J Virol 86(12):6855–6861.
9. Lawrence M, et al. (2013) Software for computing and annotating genomic ranges.
PLOS Comput Biol 9(8):e1003118.
10. Thierry F (2009) Transcriptional regulation of the papillomavirus oncogenes by
cellular and viral transcription factors in cervical carcinoma. Virology 384(2):
375–379.
a
11. Xi R, et al. (2011) Copy number variation detection in whole-genome sequencing data
using the Bayesian information criterion. Proc Natl Acad Sci USA 108(46):E1128–E1136.
12. Favorov A, et al. (2012) Exploring massive, genome scale datasets with the GenometriCorr package. PLOS Comput Biol 8(5):e1002529.
13. Bibikova M, et al. (2011) High density DNA methylation array with single CpG site
resolution. Genomics 98(4):288–295.
14. R Development Core Team (2004) R: A Language and Environment for Statistical
Computing (R Foundation for Statistical Computing, Vienna).
15. Gentleman RC, et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5(10):R80.
16. Smyth G (2005) Limma: Linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor, eds Gentleman R, Carey VJ,
Dudoit S, Huber W, Irizarry RA (Springer Science+Business Media, Inc., New York),
pp 397–420.
17. Smyth GK (2004) Linear models and empirical bayes methods for assessing differential
expression in microarray experiments. Stat Appl Genet Mol Biol 3(1):Article3.
18. Therneau T (2014) A Package for Survival Analysis in S. R Package Version 2.37-7.
Available at http://CRAN.R-project.org/package=survival. Accessed 30 September 2014.
1.0
Survival by HPV integration
0.0
0.2
0.4
0.6
0.8
HPV integration +
HPV integration −
n = 35, p−value = 0.11
0
500
1000
1500
2000
2500
3000
Days
b
Age
The total number of mutations
2.6
log10(# of mutations)
80
years
70
60
50
2.4
2.2
2.0
1.8
1.6
1.4
40
1.2
Negative
Positive
Negative
Positive
c
HPV integration
Site
0.144
Smoking
0.421
0.588
0.002
E7 expression
0.004
BB−4225
CR−6467
BA−5153
CR−6480
CR−6470
CR−5250
BB−4223
BB−4228
CR−5248
CV−6433
DQ−7591
CV−7100
CV−6939
CR−6471
CV−5443
CR−6473
BA−4077
CR−6482
CR−6487
CR−7385
CN−5374
BA−5559
CR−6472
CR−7404
CR−5243
HD−7754
CR−6481
CV−7406
CN−4741
CV−6961
CV−5971
CV−5442
CQ−5323
CR−7368
CR−7369
Stage
E6 expression
HPV integration:
positive
negative
Site:
Oropharynx
Oral Cavity
Larynx
Smoking:
Yes
No
Stage:
II
III
IVA/B
E6/7 expression:
High
Low
Fig. S1. Correlative analysis of clinical variables and HPV integration. Association of HPV integration status and (A) survival, (B) age and number of somatic
mutations, and (C) anatomic site of tumor, tumor stage, and smoking status of the subject as well as expression level of HPV genes E6 and E7. P values are
shown on the right of each track. One sample, CR-6480, did not have RNA-seq data and is left blank on the E6/E7 expression heat map.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
3 of 10
Fig. S2. (A) Heat map of HPV gene expression. Depending on their integration status, HPV(+) tumors tended to form two clusters. Tumors with integrated
HPV are characterized by increased expression of E6 and E7 and severely reduced levels of E2, whereas nonintegrated tumors have elevated expression levels of
E2, E4, and E5. Mixed expression patterns suggest the presence of both episomal and integrated forms of HPV in the same tumor. (B) HPV integration
breakpoint distribution. Violin plots representing expected and actual (red circles) counts of breakpoints in the corresponding HPV genes. In each violin plot,
the inner portion is a standard box and whisker plot, and the outer portion is the density of the distribution. The E1 gene shows significant enrichment for
integration breakpoints (P = 0.008).
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
4 of 10
Fig. S3. FISH on tissue sample TCGA-BA-4077. FISH probe (green) for the RAD51B fragment involved in HPV integration. Nuclei are counterstained with DAPI
(blue). Tumor cells contain multiple copies of the 42-kb RAD51B segment. WGS data show that this RAD51B segment is present as either a tandem amplification incorporated in the genome in a single place or multiple extra chromosomes. The punctuate pattern of RAD51B amplification suggests an extrachromosomal location.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
5 of 10
a
HPV16
ORI
P97
P670
E7
E2
E6
E1
ORI
L2
E4
E5
L1
TCGA-CV-6961 (hg19)
Events (schematic)
5
6
7
8
AAGTGAGTCAG...[GCATCCTAG]...TTATTTACATCCTAGTTA
chimeric HPV-human read
9
10
TAATAGATTGGTGGTGTTCCATAATAAACCTCTTA
chimeric human-HPV read
ETS2 Exon Expression
5
6
7
8
9
10
40,177,231
40,196,876
Chromosome 21 Copy Number
z-score <-2 -1
0
1
>2 none
4
3
2
1
0
40,180,000
40,190,000
40,200,000
b
HPV16
ORI
P97
P670
E7
E2
E6
E1
ORI
L2
E4
E5
L1
TCGA-CR-6482 (hg19)
Events (schematic)
NR4A2
TGGAAATCCAAACCAAACC
chimeric HPV-human read
ATTTATCAGTTTGTG
chimeric human-HPV read
TTCCTGTTGTTATAAC
GTATCAGGT...CACC
GCTAGAGGTGAGGA
fused human-human read
NR4A2 Exon Expression
8
7
6
5
4
3
2
1
157,180,946
157,188,992
z-score <-2 -1
Chromosome 2 Copy Number
0
1
>2 none
248
186
124
62
0
157,000,000
157,100,000
157,200,000
157,300,000
c
HPV16
ORI
P97
P670
E7
E2
E6
E1
ORI
L2
E4
E5
L1
TCGA-CN-4741 (hg19)
Events (schematic)
A
chr3
B
C
TPRG1
D
E
7
F
G
H
KLF5
I
J
K
L
chr13
discordant read pairs
Chromosome 3 Copy Number
16
14
12
10
B
A
8
6
4
2
0
188,800,000
D
C
F
189,000,000
Chromosome 13 Copy Number
Insertion Detail
16
16
14
14
12
12 H
10
10
8
8
6
6
4
4
2
2
0
0
E
189,200,000
H
I
J
K
G
73,600,000
73,800,000
TPRG1 Exon Expression
188,665,003
L
74,000,000
KLF5 Exon Expression
189,041,270
73,629,114
73,651,675
z-score <-2
-1
0
1
>2 none
Fig. S4. Integrated analysis of HPV integration events. Breakpoint locations and joining patterns are shown with regard to schematic representations of the
HPV and human genomes, regions of copy number alteration, and exon expression of genes involved and/or located near integration sites. Colored regions on
the genome schemes represent sequences included in the integrated structure. Gray regions represent sequences that were replaced or lost as a result of
integration. (A) ETS2 integration led to the heterozygous loss of two exons, which is reflected by copy number loss and decreased expression of the seventh
and eighth exons. (B) NR4A2 double integration was associated with high-level amplification and overexpression of the entire gene. The integrated virus
included its ori and promoters. (C) HPV integration associated with interchromosomal chromosome 3/chromosome 13 translocation (rearrangement F–G) and
four subsequent tandem duplications (rearrangements B–K, C–J, D–I, and E–H) involving the TPRG1 and KLF5 genes that are amplified and overexpressed.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
6 of 10
Fig. S5. Integrated analysis of the HPV integration event. Virus integrated in the fourth intron of the PDL1 gene. The entire PDL1 gene, HPV insertion, and
a part of PLGRKT gene underwent structural rearrangement, amplification, and overexpression of human exons, supporting the model of an extrachromosomal element. The pattern of PDL1 exon expression is changed compared with other tumors.
Fig. S6. DNA methylation and gene expression analysis for tumors with different HPV status. Unsupervised clustering of differentially (A) methylated loci and
(B) expressed genes in four groups: HPV(+)/integration(+), HPV(+)/integration(−) and HPV(−) tumors and adjacent normal tissue.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
7 of 10
a
HD−7754
CN−5374
CQ−5323
CR−7368
CR−7369
CR−6471
CV−7100
CV−6961
BA−4077
CR−6472
CV−7406
CV−6939
CR−6470
CR−7385
CV−5971
CR−6481
CV−5442
CN−4741
BB−4225
BB−4228
CR−5243
CV−5443
CR−5250
BA−5153
CV−6433
CR−6473
CR−6480
BB−4223
CR−6467
DQ−7591
BA−5559
CR−6482
CR−5248
CR−7404
CR−6487
CpG locus
Sample
0
beta value
1
b
DNA methylation
subtype
0.01
HPV integration
Site
0.038
Smoking
0.723
Stage
1
0.082
E7 expression
0.045
HD−7754
CN−5374
CQ−5323
CR−7368
CR−7369
CR−6471
CV−7100
CV−6961
BA−4077
CR−6472
CV−7406
CV−6939
CR−6470
CR−7385
CV−5971
CR−6481
CV−5442
CN−4741
BB−4225
BB−4228
CR−5243
CV−5443
CR−5250
BA−5153
CV−6433
CR−6473
CR−6480
BB−4223
CR−6467
DQ−7591
BA−5559
CR−6482
CR−5248
CR−7404
CR−6487
E6 expression
DNA methylation
subtype:
1
2
HPV integration:
positive
negative
Site:
Oropharynx
Oral Cavity
Larynx
Smoking:
Yes
No
Stage:
II
III
IVA/B
E6/7 expression:
High
Low
Fig. S7. Correlative analysis of clinical variables and methylation status. (A) Unsupervised hierarchical clustering with Euclidean distance from DNA methylation data. (B) Associations between methylation clusters, virus integration status, anatomic site of tumor, tumor stage, smoking status of the subject, and
expression level of HPV genes E6 and E7 using Fisher’s exact test. P values are shown on the right of each track. One sample, CR-6480, did not have RNA-seq
data and is left blank on the E6/E7 expression heat map.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
8 of 10
a
mRNA expression. BARX2
BARX2 expression, DNA methylation, and copy number variation
p−value=2.42e−04
Virus integration +
Virus integration −
Homozygous deletion
Heterozygous deletion
CN gain
CN amp
HPV+/int−
HPV+/int+
Normal
DNA methylation. BARX2_cg17692125
8
4
2
Beta values
0.0 0.2 0.4 0.6 0.8
CpGi relation=Island. Probe location=TSS1500. p−value=1.57e−03
6
HPV−
mRNA expression
2
4
6
10
log2(RSEM)
8 10 12
Spearman correlation = −0.586
HPV−
HPV+/int−
HPV+/int+
0.0
Normal
0.2
0.4
0.6
0.8
1.0
DNA methylation
mRNA expression. IRX4
IRX4 expression, DNA methylation, and copy number variation
p−value=8.68e−04
Spearman correlation = −0.636
Virus integration +
Virus integration −
Homozygous deletion
Heterozygous deletion
CN gain
CN amp
HPV+/int+
Normal
DNA methylation. IRX4_cg07882671
2
0
Beta values
0.0 0.2 0.4 0.6 0.8
CpGi relation=Island. Probe location=TSS200. p−value=1.02e−03
6
HPV+/int−
4
HPV−
mRNA expression
0
8
2
4
6
10
log2(RSEM)
8 10 12
b
HPV−
HPV+/int−
HPV+/int+
Normal
0.0
0.2
0.4
0.6
0.8
1.0
DNA methylation
mRNA expression. SIM2
SIM2 expression, DNA methylation, and copy number variation
p−value=1.02e−11
Spearman correlation = −0.862
Virus integration +
Virus integration −
Homozygous deletion
Heterozygous deletion
CN gain
CN amp
HPV+/int−
HPV+/int+
Normal
DNA methylation. SIM2_cg01065599
10
6
0.8
0.6
0.4
4
0.2
Beta values
CpGi relation=N_Shore. Probe location=Body. p−value=3.89e−06
8
HPV−
mRNA expression
4
6
8
12
log2(RSEM)
10 12
c
HPV−
HPV+/int+
0.0
Normal
0.2
0.4
0.6
0.8
1.0
DNA methylation
mRNA expression. CTSE
CTSE expression, DNA methylation, and copy number variation
p−value=2.01e−05
Spearman correlation = −0.718
10
d
HPV+/int−
HPV+/int+
Normal
DNA methylation. CTSE_cg21478437
2
0.7
0.5
0
0.3
Beta values
0.9
CpGi relation=NA. Probe location=TSS200. p−value=3.99e−06
6
HPV+/int−
mRNA expression
HPV−
4
0
2
8
4
6
log2(RSEM)
8
10
Virus integration +
Virus integration −
Homozygous deletion
Heterozygous deletion
CN gain
CN amp
HPV−
HPV+/int−
HPV+/int+
Normal
0.0
0.2
0.4
0.6
0.8
1.0
DNA methylation
Fig. S8. DNA methylation and expression analysis for chosen genes in tumors with different HPV status. DNA methylation and expression levels for (A) BARX2,
(B) IRX4, (C) SIM2, and (D) CTSE genes in four groups: HPV−, HPV(+)/integration(−), and HPV(+)/integration(+) and adjacent normal tissue. P values were
calculated using t test for HPV(−), HPV(+)/integration(−), and HPV(+)/integration(+) tumors. Tumors with copy number gain or loss are shown in red and blue,
respectively. CN, copy number; int, integration.
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
9 of 10
Dataset S1. Characteristics of HPV-positive tumors
Dataset S1
(Table S1) The dataset contains variant calls for 29 HPV16-positive tumors. Fields include position in the reference genome sequence (GenBank accession no.
AY686584), reference allele, and corresponding genotypes for the tumors. The symbol −/− represents homozygous genotype by reference allele, and the
symbol ./. represents absence of a high-quality call. (Table S2) The dataset contains summary information for 35 HPV-positive tumors, including characterization
of clinical features, HPV type and lineage, and a detailed description of the integration sites. The viral part of a split read is highlighted in red. Underlined bold
nucleotides represent regions of microhomology between viral and human parts of a split read. The asterisk marks samples where integration events had
insufficient discordant read pairs or split reads from WGS and were additionally supported by RNA-seq data. The symbol # marks deceased patients. (Table S3)
The dataset contains a list of CpG loci significantly differentially methylated between HPV(+)/integration(+) and HPV(+)/integration(−) tumors. P values were
adjusted using the Benjamini–Hochberg procedure. (Table S4) The dataset contains a list of genes significantly differentially expressed between HPV(+)/integration(+)
and HPV(+)/integration(−) tumors. P values were adjusted using the Benjamini–Hochberg procedure.
Other Supporting Information Files
SI Appendix (DOCX)
Parfenov et al. www.pnas.org/cgi/content/short/1416074111
10 of 10