Supporting Information Parfenov et al. 10.1073/pnas.1416074111 SI Materials and Methods The presence of HPV and its integration into tumor genome were detected using WGS (30–60×, n = 28 and 6–8×, n = 122), RNAseq, whole-exome sequencing, and molecular approaches (a PCRbased MassArray Assay, p16 staining, and HPV in situ hybridization). A subset of samples (n = 129), which lacked evidence of HPV on the basis of RNA sequencing, was not analyzed by WGS. WGS at 6–8× Target Coverage. Between 500 and 700 ng each genomic DNA (gDNA) sample was sheared using Covaris E220 to ∼250-bp fragments and then converted to a pair-end Illumina library using KAPA Bio Kits with the Caliper (PerkinElmer) Robotic NGS Suite according to manufacturers’ protocols. All libraries were sequenced on HiSeq2000 using one sample per lane with the paired end 2 × 51-bp setup. A tumor and its corresponding normal were usually loaded on the same flow cell. Average sequence coverage was 6.49×, read quality was 38.5, and 94.6% of the reads were mapped. Raw data were converted to the FASTQ format, and BWA alignment was used to generate BAM files. WGS at 30× Target Coverage. WGS libraries were prepared as previously described (1) and analyzed at the Broad Institute to a coverage of 30× as described at www.broadinstitute.org/cancer/cga. RNA-Seq and Expression Quantification. Methods for sequencing and data processing of RNA using the RNA-seq protocol have been previously described for The Cancer Genome Atlas (TCGA) (1). Briefly, RNA was extracted, prepared into Illumina TruSeq mRNA libraries, sequenced by Illumina HiSeq2000 (resulting in paired 48-nt reads), and subjected to quality control as previously described (1). RNA reads were aligned to the hg19 genome assembly using Mapsplice (2). Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.1 (tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/ outputs/TCGA.hg19.June2011.gaf) using RSEM (3) and normalized within sample to a fixed upper quartile of total reads. For additional details on this processing, refer to the description file on the DCC data portal under the V2_MapSpliceRSEM workflow (tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/ anonymous/tumor/luad/cgcc/unc.edu/illuminahiseq_rnaseqv2/ rnaseqv2/unc.edu_LUAD.IlluminaHiSeq_RNASeqV2.mage-tab.1.9.0/ DESCRIPTION.txt and https://cghub.ucsc.edu/docs/tcga/UNC_ mRNAseq_summary.pdf). For gene-level analyses, expression values of zero were set to the overall minimum value, and all data were log2-transformed (tcga-data.nci.nih.gov/docs/ publications/hnsc_2013). Data Availability. All sequence BAM files are available at CGHub (cghub.ucsc.edu/). HPV Determination and Identification of Genomic Integration from WGS Data. The PathSeq (4) algorithm was used to perform computational subtraction of human reads followed by alignment of residual reads to a combined database of human reference genomes and microbial reference genomes (which includes but is not limited to HPV genomes), resulting in the identification of reads mapping with high confidence to HPV genomes in WGS sample data. Subjects were classified as HPV-positive by WGS if at least 100 reads mapping to the HPV genome were present; otherwise, subjects were classified as HPV-negative. In addition to Parfenov et al. www.pnas.org/cgi/content/short/1416074111 identification of the HPV strain type, WGS data were also analyzed to identify HPV integration sites. Using PathSeq, human reads were subtracted by first mapping reads to a database of human genomes using BWA (5), Megablast, and Blastn (6). Only sequences with perfect or near-perfect matches to the human genome were removed in the subtraction process. To identify HPV reads, the resultant nonhuman reads were aligned with Megablast and Blastn to a database of microbial genomes that includes multiple HPV reference genomes. HPV reference genomes were obtained from the Human Papilloma Virus Episteme (7). An HPV-positive sample was considered integration-positive if there were at least three discordant read pairs and one split read supporting an integration event. Discordant read pairs were defined as having one end of the paired end read mapped to the HPV genome and having its mate pair mapped to the human genome. Split reads were defined as having one end of the paired end read spanning the integration junction and having its mate pair mapped to either the human or HPV genome. To call an integration site, these seven reads together must support integration at the same locus. After HPV reads were obtained, we extracted all pair mates and used BWA to map these paired end reads to a combined database containing the human genome and an HPV genome. Next, soft-clipped reads in the HPV genome were identified and carried forward as potential split reads supporting integration sites. For each split read, we searched for discordant read pairs with one pair mate mapping to the human genome, and we verified that the soft-clipped sequence and the discordant read pairs supported the same location of integration. Read pairs where both mates mapped to the human genome were also used to identify the precise location of the human breakpoint (note that all read pairs that reached this point of the analysis must have had at least one read that had homology with an HPV genome). Finally, if these methods could not allow us to define the precise location of the human breakpoint, we returned to the original human genome alignment and searched for soft-clipped reads in the area of the integration site. Concordance of HPV Presence and Integration Detection Between Different Methods. Results of HPV detection with WGS were concordant with the other sequencing (RNA-seq and wholeexome sequencing) and molecular (a PCR-based MassArray Assay, p16 staining, and HPV in situ hybridization) approaches. Two HPV(−), two HPV(+)/integration(−), and two HPV(+)/ integration(+) tumors, as detected by 6–8× WGS, were resequenced using 30× WGS. HPV presence/absence and integration events were concordant between the two techniques. HPV Genetic Variants and Sublineage Analysis. WGS data for HPV16positive samples was aligned to the HPV16 prototype sequence (GenBank accession no. AY686584) using BWA. To call variants, we used GATK v2.5. HPV16 genomes were classified into sublineages based on the combinations of SNPs in the E6 gene and the long control region (8). HPV RNA-Seq Expression Analysis. Expression of HPV genes was quantified by taking reads not aligned to the human genome and aligning these reads to a database of microbial genomic sequences and gene annotations acquired from GenBank using MapSplice (2). Read counts per HPV gene were determined using GenomicFeatures (9) and collapsed to unique reads per 1 of 10 HPV gene (E6, E7, E1, E2, E4, E5, L2, and L1) among variants of HPV. Mated reads were counted one time. To analyze patterns of differential HPV gene expression, HPV gene read counts were divided by total RNA-seq library aligned read counts and gene lengths and standardized within each tumor (mean 0, SD 1); then, genes were standardized (mean = 0; SD = 1). Clustering of these normalized expression values revealed two clusters that were tightly associated with HPV integration status. (Fig. S2A). HPV Integration Breakpoint Enrichment Simulation. We identified the breakpoints of HPV integration sites in the HPV genome, and we evaluated whether these integration sites were distributed nonrandomly in genes. We used only samples with HPV16 in this simulation. First, we took 93 HPV16 integration breakpoint locations and calculated the number of breakpoints occurring in each HPV gene. Because the HPV genome has overlapping genes, breakpoints in regions with multiple genes were assigned to both genes. Second, we performed 10,000 simulations, in which we chose 93 HPV breakpoints in random locations in the HPV genome and quantified the number of HPV breakpoints in each HPV gene. Using the gene-specific distributions formed by the results of these simulations, we evaluated whether our observed data were significant. Interestingly, we found that there were more E1 breakpoints than would have been expected by chance (P = 0.008) (Fig. S2B). Notably, 21 of 25 integrated samples contained an integration site in E1. Integration in E1 would be predicted to separate E2 from the HPV promoter, thus likely decreasing expression of the downstream E2 gene. Decreased expression of E2 is known to allow for increased expression of the E6 and E7 oncoproteins (10). Identification of CNVs. To characterize somatic copy number alterations in the tumor genome, we applied Bayesian Information Criterion (BIC) sequencing (11), an algorithm that we have developed previously. We first counted the uniquely aligned reads in fixed-size (1,000 bp), nonoverlapping windows along the genome. Given these bins with read counts for tumor and matched normal genomes, BIC sequencing attempts to iteratively combine neighboring bins with similar copy numbers. The decision to merge two neighboring bins is based on the BIC, a statistical criterion measuring both the fit and complexity of a statistical model. Segmentation stops when no merging of windows improves BIC, and the boundaries of the windows are reported as a final set of copy number breakpoints. Segments with copy ratio difference smaller than 0.1 (log2 scale) between tumor and normal genomes were merged in the postprocessing step to avoid excessive refinement of altered regions with high read counts. million reads mapped) across 70% of the patient samples as not expressed (white with black outline in Fig. 2 and Figs. S4 and S5). For graphical representation of exon expression levels, we mean-centered the RPKM values and divided by their SD for each exon. The lengths of the composite exons were transformed using the square root of the length. DNA Methylation and mRNA Expression Analysis. The Infinium HM450k array was used for TCGA Head and Neck samples (13). This platform includes probes for more than 480,000 CpG sites, spanning 99% of Ref-seq. In total, 96% of CpG islands and 92% of CpG shores are represented by at least one probe. The analysis of DNA methylation and mRNA expression datasets was performed using R/Bioconductor software with the limma package and custom routines for data analysis (14–16). In each dataset separately, we looked for differential methylation and expression between HPV-positive samples with and without virus integration. For each CpG locus and gene in DNA methylation and mRNA expression datasets, we estimated a t statistic and a P value by fitting a linear model of its differential methylation and expression, respectively (17). All CpG loci and genes with P values (adjusted by the Benjamini–Hochberg procedure) less than 0.05 were considered as significantly differentially methylated or expressed. The sites and corresponding statistics for all CpG loci and genes can be found in Dataset S1, Tables S3 and S4. Additionally, we carried out an unsupervised hierarchical clustering with Euclidean distance on DNA methylation data using the most variable 1.0% of CpG loci. As shown on Fig. S7A, the clustering separates samples into two groups according to the dendrogram. Fisher’s exact test was used to test for associations between methylation groups, virus integration status, and other clinical factors as well as E6/E7 expression (Fig. S7B). HPV Integration and Clinical Covariates. The log-rank test as implemented in the survival package (18) of R/Bioconductor software was applied to test for the association of HPV integration status and survival and showed no statistically significant association (P value = 0.11) (Fig. S1A). Student t tests were used to test for the association between age, total number of somatic mutations, and virus integration status (Fig. S1B), and they also did not show statistically significant association. Additionally, Fisher’s exact test was used to test for associations between virus integration and other clinical factors as well as E6/E7 expression (Fig. S1C). FISH. DNA from two fosmids, G248P80924G11 and G248P86073A9, Exon Expression Graphs. To summarize the potential impact of HPV integration on transcription of the host genes, we visualized RNA-seq–derived exon expression levels for genes involved in integration events. We first flagged exons that had expression levels below one RPKM (reads per kilobase of transcript per was isolated and labeled with 5-Fluorescein (Empire Genomics). FISH was performed following standard protocols. Briefly, 5-μm tissue slides were baked overnight at 600 °C, deparaffinized, heated in pressure cooker for 20 min in citrate buffer (pH 6), treated with a 150-μg/mL solution of Proteinase K, fixed in 1% neutral-buffered formalin, and codenatured with the probe for 5 min at 800 °C. Hybridization was carried out overnight at 37 °C, and posthybridization slide washes were carried out according to the Empire Genomics protocol. FISH signal evaluation and acquisition were performed manually using filter sets and software developed by Applied Spectral Imaging. We attempted simultaneous FISH detection of HPV and RAD51B genomic material using the Pan Human Papillomavirus Probe (Invitrogen). Unfortunately, we were unable to detect HPV using the conditions optimized for the RAD51B probe, perhaps because of significantly shorter length HPV probe, which requires different posthybridization wash conditions followed by additional steps of tyramide signal amplification. 1. Cancer Genome Atlas Research Network (2012) Comprehensive genomic characterization of squamous cell lung cancers. Nature 489(7417):519–525. 2. Wang K, et al. (2010) MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 38(18):e178. Spatial Correlation Analysis of Integration Sites. To assess spatial correlation of integration sites and human genes, miRNAs, and somatic CNVs, we used the software GenometriCorr (12). The conclusion of whether two sets of intervals are spatially correlated across a genome was based on results of the Jaccard measure and the projection test and estimation of whether we are in the upper or lower tail of the distribution. The coordinates of the human genes and miRNAs were obtained from the RefSeq genes, version hg19, and miRBase, version 19, respectively. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 2 of 10 3. Li B, Dewey CN (2011) RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323. 4. Kostic AD, et al. (2011) PathSeq: Software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol 29(5):393–396. 5. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. 6. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. 7. Van Doorslaer K, et al. (2013) The Papillomavirus Episteme: A central resource for papillomavirus sequence data and analysis. Nucleic Acids Res 41(Database issue): D571–D578. 8. Cornet I, et al.; IARC HPV Variant Study Group (2012) Human papillomavirus type 16 genetic variants: Phylogeny and classification based on E6 and LCR. J Virol 86(12):6855–6861. 9. Lawrence M, et al. (2013) Software for computing and annotating genomic ranges. PLOS Comput Biol 9(8):e1003118. 10. Thierry F (2009) Transcriptional regulation of the papillomavirus oncogenes by cellular and viral transcription factors in cervical carcinoma. Virology 384(2): 375–379. a 11. Xi R, et al. (2011) Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci USA 108(46):E1128–E1136. 12. Favorov A, et al. (2012) Exploring massive, genome scale datasets with the GenometriCorr package. PLOS Comput Biol 8(5):e1002529. 13. Bibikova M, et al. (2011) High density DNA methylation array with single CpG site resolution. Genomics 98(4):288–295. 14. R Development Core Team (2004) R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna). 15. Gentleman RC, et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5(10):R80. 16. Smyth G (2005) Limma: Linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor, eds Gentleman R, Carey VJ, Dudoit S, Huber W, Irizarry RA (Springer Science+Business Media, Inc., New York), pp 397–420. 17. Smyth GK (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3(1):Article3. 18. Therneau T (2014) A Package for Survival Analysis in S. R Package Version 2.37-7. Available at http://CRAN.R-project.org/package=survival. Accessed 30 September 2014. 1.0 Survival by HPV integration 0.0 0.2 0.4 0.6 0.8 HPV integration + HPV integration − n = 35, p−value = 0.11 0 500 1000 1500 2000 2500 3000 Days b Age The total number of mutations 2.6 log10(# of mutations) 80 years 70 60 50 2.4 2.2 2.0 1.8 1.6 1.4 40 1.2 Negative Positive Negative Positive c HPV integration Site 0.144 Smoking 0.421 0.588 0.002 E7 expression 0.004 BB−4225 CR−6467 BA−5153 CR−6480 CR−6470 CR−5250 BB−4223 BB−4228 CR−5248 CV−6433 DQ−7591 CV−7100 CV−6939 CR−6471 CV−5443 CR−6473 BA−4077 CR−6482 CR−6487 CR−7385 CN−5374 BA−5559 CR−6472 CR−7404 CR−5243 HD−7754 CR−6481 CV−7406 CN−4741 CV−6961 CV−5971 CV−5442 CQ−5323 CR−7368 CR−7369 Stage E6 expression HPV integration: positive negative Site: Oropharynx Oral Cavity Larynx Smoking: Yes No Stage: II III IVA/B E6/7 expression: High Low Fig. S1. Correlative analysis of clinical variables and HPV integration. Association of HPV integration status and (A) survival, (B) age and number of somatic mutations, and (C) anatomic site of tumor, tumor stage, and smoking status of the subject as well as expression level of HPV genes E6 and E7. P values are shown on the right of each track. One sample, CR-6480, did not have RNA-seq data and is left blank on the E6/E7 expression heat map. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 3 of 10 Fig. S2. (A) Heat map of HPV gene expression. Depending on their integration status, HPV(+) tumors tended to form two clusters. Tumors with integrated HPV are characterized by increased expression of E6 and E7 and severely reduced levels of E2, whereas nonintegrated tumors have elevated expression levels of E2, E4, and E5. Mixed expression patterns suggest the presence of both episomal and integrated forms of HPV in the same tumor. (B) HPV integration breakpoint distribution. Violin plots representing expected and actual (red circles) counts of breakpoints in the corresponding HPV genes. In each violin plot, the inner portion is a standard box and whisker plot, and the outer portion is the density of the distribution. The E1 gene shows significant enrichment for integration breakpoints (P = 0.008). Parfenov et al. www.pnas.org/cgi/content/short/1416074111 4 of 10 Fig. S3. FISH on tissue sample TCGA-BA-4077. FISH probe (green) for the RAD51B fragment involved in HPV integration. Nuclei are counterstained with DAPI (blue). Tumor cells contain multiple copies of the 42-kb RAD51B segment. WGS data show that this RAD51B segment is present as either a tandem amplification incorporated in the genome in a single place or multiple extra chromosomes. The punctuate pattern of RAD51B amplification suggests an extrachromosomal location. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 5 of 10 a HPV16 ORI P97 P670 E7 E2 E6 E1 ORI L2 E4 E5 L1 TCGA-CV-6961 (hg19) Events (schematic) 5 6 7 8 AAGTGAGTCAG...[GCATCCTAG]...TTATTTACATCCTAGTTA chimeric HPV-human read 9 10 TAATAGATTGGTGGTGTTCCATAATAAACCTCTTA chimeric human-HPV read ETS2 Exon Expression 5 6 7 8 9 10 40,177,231 40,196,876 Chromosome 21 Copy Number z-score <-2 -1 0 1 >2 none 4 3 2 1 0 40,180,000 40,190,000 40,200,000 b HPV16 ORI P97 P670 E7 E2 E6 E1 ORI L2 E4 E5 L1 TCGA-CR-6482 (hg19) Events (schematic) NR4A2 TGGAAATCCAAACCAAACC chimeric HPV-human read ATTTATCAGTTTGTG chimeric human-HPV read TTCCTGTTGTTATAAC GTATCAGGT...CACC GCTAGAGGTGAGGA fused human-human read NR4A2 Exon Expression 8 7 6 5 4 3 2 1 157,180,946 157,188,992 z-score <-2 -1 Chromosome 2 Copy Number 0 1 >2 none 248 186 124 62 0 157,000,000 157,100,000 157,200,000 157,300,000 c HPV16 ORI P97 P670 E7 E2 E6 E1 ORI L2 E4 E5 L1 TCGA-CN-4741 (hg19) Events (schematic) A chr3 B C TPRG1 D E 7 F G H KLF5 I J K L chr13 discordant read pairs Chromosome 3 Copy Number 16 14 12 10 B A 8 6 4 2 0 188,800,000 D C F 189,000,000 Chromosome 13 Copy Number Insertion Detail 16 16 14 14 12 12 H 10 10 8 8 6 6 4 4 2 2 0 0 E 189,200,000 H I J K G 73,600,000 73,800,000 TPRG1 Exon Expression 188,665,003 L 74,000,000 KLF5 Exon Expression 189,041,270 73,629,114 73,651,675 z-score <-2 -1 0 1 >2 none Fig. S4. Integrated analysis of HPV integration events. Breakpoint locations and joining patterns are shown with regard to schematic representations of the HPV and human genomes, regions of copy number alteration, and exon expression of genes involved and/or located near integration sites. Colored regions on the genome schemes represent sequences included in the integrated structure. Gray regions represent sequences that were replaced or lost as a result of integration. (A) ETS2 integration led to the heterozygous loss of two exons, which is reflected by copy number loss and decreased expression of the seventh and eighth exons. (B) NR4A2 double integration was associated with high-level amplification and overexpression of the entire gene. The integrated virus included its ori and promoters. (C) HPV integration associated with interchromosomal chromosome 3/chromosome 13 translocation (rearrangement F–G) and four subsequent tandem duplications (rearrangements B–K, C–J, D–I, and E–H) involving the TPRG1 and KLF5 genes that are amplified and overexpressed. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 6 of 10 Fig. S5. Integrated analysis of the HPV integration event. Virus integrated in the fourth intron of the PDL1 gene. The entire PDL1 gene, HPV insertion, and a part of PLGRKT gene underwent structural rearrangement, amplification, and overexpression of human exons, supporting the model of an extrachromosomal element. The pattern of PDL1 exon expression is changed compared with other tumors. Fig. S6. DNA methylation and gene expression analysis for tumors with different HPV status. Unsupervised clustering of differentially (A) methylated loci and (B) expressed genes in four groups: HPV(+)/integration(+), HPV(+)/integration(−) and HPV(−) tumors and adjacent normal tissue. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 7 of 10 a HD−7754 CN−5374 CQ−5323 CR−7368 CR−7369 CR−6471 CV−7100 CV−6961 BA−4077 CR−6472 CV−7406 CV−6939 CR−6470 CR−7385 CV−5971 CR−6481 CV−5442 CN−4741 BB−4225 BB−4228 CR−5243 CV−5443 CR−5250 BA−5153 CV−6433 CR−6473 CR−6480 BB−4223 CR−6467 DQ−7591 BA−5559 CR−6482 CR−5248 CR−7404 CR−6487 CpG locus Sample 0 beta value 1 b DNA methylation subtype 0.01 HPV integration Site 0.038 Smoking 0.723 Stage 1 0.082 E7 expression 0.045 HD−7754 CN−5374 CQ−5323 CR−7368 CR−7369 CR−6471 CV−7100 CV−6961 BA−4077 CR−6472 CV−7406 CV−6939 CR−6470 CR−7385 CV−5971 CR−6481 CV−5442 CN−4741 BB−4225 BB−4228 CR−5243 CV−5443 CR−5250 BA−5153 CV−6433 CR−6473 CR−6480 BB−4223 CR−6467 DQ−7591 BA−5559 CR−6482 CR−5248 CR−7404 CR−6487 E6 expression DNA methylation subtype: 1 2 HPV integration: positive negative Site: Oropharynx Oral Cavity Larynx Smoking: Yes No Stage: II III IVA/B E6/7 expression: High Low Fig. S7. Correlative analysis of clinical variables and methylation status. (A) Unsupervised hierarchical clustering with Euclidean distance from DNA methylation data. (B) Associations between methylation clusters, virus integration status, anatomic site of tumor, tumor stage, smoking status of the subject, and expression level of HPV genes E6 and E7 using Fisher’s exact test. P values are shown on the right of each track. One sample, CR-6480, did not have RNA-seq data and is left blank on the E6/E7 expression heat map. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 8 of 10 a mRNA expression. BARX2 BARX2 expression, DNA methylation, and copy number variation p−value=2.42e−04 Virus integration + Virus integration − Homozygous deletion Heterozygous deletion CN gain CN amp HPV+/int− HPV+/int+ Normal DNA methylation. BARX2_cg17692125 8 4 2 Beta values 0.0 0.2 0.4 0.6 0.8 CpGi relation=Island. Probe location=TSS1500. p−value=1.57e−03 6 HPV− mRNA expression 2 4 6 10 log2(RSEM) 8 10 12 Spearman correlation = −0.586 HPV− HPV+/int− HPV+/int+ 0.0 Normal 0.2 0.4 0.6 0.8 1.0 DNA methylation mRNA expression. IRX4 IRX4 expression, DNA methylation, and copy number variation p−value=8.68e−04 Spearman correlation = −0.636 Virus integration + Virus integration − Homozygous deletion Heterozygous deletion CN gain CN amp HPV+/int+ Normal DNA methylation. IRX4_cg07882671 2 0 Beta values 0.0 0.2 0.4 0.6 0.8 CpGi relation=Island. Probe location=TSS200. p−value=1.02e−03 6 HPV+/int− 4 HPV− mRNA expression 0 8 2 4 6 10 log2(RSEM) 8 10 12 b HPV− HPV+/int− HPV+/int+ Normal 0.0 0.2 0.4 0.6 0.8 1.0 DNA methylation mRNA expression. SIM2 SIM2 expression, DNA methylation, and copy number variation p−value=1.02e−11 Spearman correlation = −0.862 Virus integration + Virus integration − Homozygous deletion Heterozygous deletion CN gain CN amp HPV+/int− HPV+/int+ Normal DNA methylation. SIM2_cg01065599 10 6 0.8 0.6 0.4 4 0.2 Beta values CpGi relation=N_Shore. Probe location=Body. p−value=3.89e−06 8 HPV− mRNA expression 4 6 8 12 log2(RSEM) 10 12 c HPV− HPV+/int+ 0.0 Normal 0.2 0.4 0.6 0.8 1.0 DNA methylation mRNA expression. CTSE CTSE expression, DNA methylation, and copy number variation p−value=2.01e−05 Spearman correlation = −0.718 10 d HPV+/int− HPV+/int+ Normal DNA methylation. CTSE_cg21478437 2 0.7 0.5 0 0.3 Beta values 0.9 CpGi relation=NA. Probe location=TSS200. p−value=3.99e−06 6 HPV+/int− mRNA expression HPV− 4 0 2 8 4 6 log2(RSEM) 8 10 Virus integration + Virus integration − Homozygous deletion Heterozygous deletion CN gain CN amp HPV− HPV+/int− HPV+/int+ Normal 0.0 0.2 0.4 0.6 0.8 1.0 DNA methylation Fig. S8. DNA methylation and expression analysis for chosen genes in tumors with different HPV status. DNA methylation and expression levels for (A) BARX2, (B) IRX4, (C) SIM2, and (D) CTSE genes in four groups: HPV−, HPV(+)/integration(−), and HPV(+)/integration(+) and adjacent normal tissue. P values were calculated using t test for HPV(−), HPV(+)/integration(−), and HPV(+)/integration(+) tumors. Tumors with copy number gain or loss are shown in red and blue, respectively. CN, copy number; int, integration. Parfenov et al. www.pnas.org/cgi/content/short/1416074111 9 of 10 Dataset S1. Characteristics of HPV-positive tumors Dataset S1 (Table S1) The dataset contains variant calls for 29 HPV16-positive tumors. Fields include position in the reference genome sequence (GenBank accession no. AY686584), reference allele, and corresponding genotypes for the tumors. The symbol −/− represents homozygous genotype by reference allele, and the symbol ./. represents absence of a high-quality call. (Table S2) The dataset contains summary information for 35 HPV-positive tumors, including characterization of clinical features, HPV type and lineage, and a detailed description of the integration sites. The viral part of a split read is highlighted in red. Underlined bold nucleotides represent regions of microhomology between viral and human parts of a split read. The asterisk marks samples where integration events had insufficient discordant read pairs or split reads from WGS and were additionally supported by RNA-seq data. The symbol # marks deceased patients. (Table S3) The dataset contains a list of CpG loci significantly differentially methylated between HPV(+)/integration(+) and HPV(+)/integration(−) tumors. P values were adjusted using the Benjamini–Hochberg procedure. (Table S4) The dataset contains a list of genes significantly differentially expressed between HPV(+)/integration(+) and HPV(+)/integration(−) tumors. P values were adjusted using the Benjamini–Hochberg procedure. Other Supporting Information Files SI Appendix (DOCX) Parfenov et al. www.pnas.org/cgi/content/short/1416074111 10 of 10
© Copyright 2025