Poster Abstract - Genome Science 2015

Genome Science: Biology, Technology & Bioinformatics
Department of Zoology & Lady Margaret Hall, Oxford, 1st-3rd September 2014
Poster Abstracts
www.genomescience.org.uk
#ukgs2014

Poster Abstract 1
Silencing of pluripotency-associated genes by DNA methylation in differentiated germ cell tumours
Safiah Alhazmi, Matthew Carr, Dzul Azri Mohamed Noor, Jennie N Jeyapalan, Claire Wallace, Christopher Tan, Martin Cusack, Jaime Hughes, Tom Reader, Janet Shipley, Denise Sheer, Paul J Scotting
School of Life Sciences, Centre for Genetics and Genomics, University of Nottingham, UK

Germ cell tumours (GCTs) are a class of cancer that arises from primordial germ cells, which differentiate during embryogenesis, when DNA methylation reprogramming occurs. GCTs are classified as seminomas and non-seminomas; the former exhibit low levels of DNA methylation and high sensitivity to chemotherapy. Previous studies have shown a close relationship between methylation and gene expression. In cancer biology, DNA hypermethylation is associated with gene silencing. However, the silenced genes and the genomic structures that are methylated have not yet been identified. In this study, we analysed genome-wide methylation and expression in GCT cell lines and combined our results with expression data from primary tumour samples. We identified 17 silenced genes that differ between seminomas and non-seminomas in methylation and expression level. Some of these genes, which are closely associated with pluripotency, could be implicated in progression from chemosensitive seminomas to the more differentiated and aggressive non-seminomas. To confirm that these genes were silenced by methylation, we treated non-seminoma cells (YST) with the demethylating agent 5-aza-2-deoxycytidine. Furthermore, we found that differential methylation of CpG islands between seminomas and non-seminomas correlated with differential gene expression.
In conclusion, hypermethylation of CpG islands could play an important role in silencing the methylated genes and could be related to progression to non-seminoma.

Poster Abstract 2
Assessing HMW DNA for NGS applications using the Argus Opgen Optical Mapper
Dave Baker, Chris Watkins, Chris Attfield, Ben White and Kirsten McLay
The Genome Analysis Centre, Norwich Research Park, Colney Lane, Norwich, NR4 7UH.

For the delivery of many Next-Generation Sequencing (NGS) applications, such as long insert Mate Pair library construction, quick and accurate assessment of the size and distribution of high molecular weight (HMW) DNA is essential. Here at TGAC, we have created a novel use of the Argus Opgen platform, which utilises the Q-card system, for assessing HMW DNA for library construction and sequencing on other NGS platforms, such as Illumina and PacBio. Rather than running labour-intensive overnight pulsed-field gels, TGAC has traditionally run HMW samples on the 12 kb Agilent Bioanalyzer chip or the Perkin Elmer Labchip GX genomic chip. Pipetting 1 or 2 µl of HMW DNA is inaccurate, sometimes difficult, and not representative of the sample as a whole. We have also found that these capillary-based systems struggle to load very HMW samples, so results are usually inconclusive. Here we present data showing how easy it is to assess HMW DNA, and in some applications the final NGS library, on this platform.

Poster Abstract 3
Development of Microfluidic Devices for Single Cell Genomics
Jonathan Beckett1*, Neil Ashley1*, Marie Jensen2*, Rodolphe Marie2, Anders Kristensen2 and Walter Bodmer1
1 Department of Oncology, Cancer and Immunogenetics Laboratory, Weatherall Institute of Molecular Medicine, Oxford, United Kingdom.
2 Department of Micro- and Nanotechnology, Technical University of Denmark, Copenhagen, Denmark.
*Authors contributed equally.

Next generation sequencing is currently revolutionizing medicine and medical research. However, most human genomes are still sequenced from DNA extracted from multiple cells, which misses differences between cells that could be crucial in controlling gene expression, cell behaviour and drug response. Single cell genomics can identify the extent and nature of the genomic, epigenomic and transcriptomic heterogeneity that occurs between cells. However, genomic analysis of single cells remains a major technical challenge, particularly for high throughput applications, and current approaches are costly and unreliable. Our aim, as part of the Cellomatic EU FP7 consortium, is to develop low cost microfluidics to enable high throughput single cell genomics. We present here methods for isolating cells into individual cell traps for genomic analysis using microfluidics. We also present data from a pinch flow based microfluidic device for rare cell enrichment.

Poster Abstract 4
The Use of Random Mutagenesis in the Functional Annotation of the Streptococcus uberis Genome
Adam Blanchard, Richard Emes, Sharon Egan & James Leigh
School of Veterinary Medicine & Science, University of Nottingham, Sutton Bonington Campus, Sutton Bonington, UK, LE12 5RD

Streptococcus uberis (S. uberis) is a major barrier to the eradication of bovine mastitis, with approximately 71% of cows developing the disease every year. S. uberis accounts for almost 64% of all subclinical infections, which often go undiagnosed, resulting in impaired milk yield and poor mammary development. Mastitis has wider implications in addition to animal welfare: to compensate for the expected loss of milk from infected animals, more animals are kept, increasing the requirement for feed, bedding and shelter and the generation of waste and other by-products such as greenhouse gases.
All of these factors raise the cost of dairy farming and subsequently increase the price of dairy products. At present, S. uberis pathogenesis is poorly understood, and the lack of knowledge about the correlation between phenotype and genotype creates an interesting avenue of investigation. Previous research hints that virulence is multifactorial, as some of the molecules required for colonisation and S. uberis-induced infection have been shown to be important but not wholly accountable. Random mutagenesis is an important tool for the functional annotation of a genome; coupled with high throughput sequencing, mutant mapping provides a method for evaluating the importance of each gene in bacterial survival. Current protocols often require multiple molecular techniques and specialised sequencing strategies, in conjunction with complex bioinformatic analysis, to manage the vast amounts of data produced from a single experiment. Difficulties are compounded when investigating Gram-positive bacteria, as they are poorly transformable and have low transposition frequencies, meaning that widely used transposition mechanisms are inadequate and would generate skewed results. Our aim was to develop a complete, pragmatic and accessible protocol for the evaluation of essential bacterial genes, which allows results to be generated simply and confidently. PIMMS (Pragmatic Insertion Mutant Mapping System) exploits the pGh9:ISS1 transposon system, which has been used extensively in S. uberis and other Gram-positive bacteria, and incorporates the Illumina MiSeq, potentially the most cost effective and least error prone sequencing platform, to generate enough data for an informed conclusion on the essentiality of each gene. PIMMS has been used with a clinically isolated strain of S. uberis, mutated with pGh9:ISS1 to create a pool of 100,000 mutants to be evaluated.
Bacterial DNA was extracted, subjected to acoustic fragmentation and self-ligation, and amplified by PCR with primers complementary to the terminal ends of the ISS1 insertion sequence. The samples were then prepared and sequenced following the standard protocols. The data acquired were then passed through our PIMMS bioinformatic analysis pipeline to generate a genetic map of mutations identifying potentially essential genes. These data can now be used as a reference for comparison with phenotypically grown bacteria containing the same mutations, giving greater depth of knowledge about how adaptive S. uberis is within different environments.

Poster Abstract 5
Epigenetic reprogramming during spermatogenesis involves DNA demethylation via the formation of 5-carboxylcytosine
Blythe MJ, Loose MW and Ruzov A

DNA methylation (5-methylcytosine, 5-mC) is a major modification of the mammalian genome, usually associated with gene repression. Alteration of 5-mC patterns during development and differentiation contributes to the regulation of gene expression and cell specification. Although the proteins that introduce 5-mC into the genome are well characterised, the mechanisms of active DNA demethylation remain obscure. Recent studies have shown that 5-mC can be enzymatically oxidised into 5-hydroxymethylcytosine (5-hmC) and further into 5-carboxylcytosine (5-caC), which, in turn, can be recognized and removed from the DNA by thymine DNA glycosylase (Tdg) and the components of the base excision repair machinery. These findings have led to a model in which 5-caC could serve as an intermediate in the DNA demethylation process. Here, employing immunohistochemistry and confocal microscopy, we show that whilst 5-hmC is found in both somatic and germ cells in murine testis, 5-caC can be detected only in cells of the germ cell lineage at the final stages of spermatogenesis (pachytene spermatocytes and round spermatids). We demonstrate that, unlike 5-mC, both 5-hmC and 5-caC are distributed in gene-rich euchromatic regions of round spermatids' nuclei, and that 5-carboxylcytosine is removed from the DNA during spermatogenesis and is not detectable in spermatozoa. Moreover, using DNA immunoprecipitation (DIP) coupled with deep sequencing analysis, we demonstrate that 5-caC is eliminated from a subset of retroposons in a likely Tdg-independent process during spermatogenesis. Our results suggest the importance of 5-hmC oxidation for the epigenetic reprogramming that takes place in germ lineage cells.

Poster Abstract 6
minoTour: A web based platform for rapid and realtime analysis of minION hdf5 sequence files
Martin J. Blythe, Sunir Malla, Matt Loose
DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, NG7 2UH

Technologies such as minION provide read data in the hdf5 file format, generated in almost real time while a sequencing run is active. Such technologies hold great promise for rapid sequencing, but few analysis tools are currently available. Here we present a web based platform, minoTour, which processes hdf5 files as they are generated, aligns reads to a reference using LAST and stores key metrics within a mySQL database. This platform allows users to export reads in fasta/fastq format as individual files, as all reads that align, as all reads that do not align, or as all reads. While a run is in progress, minoTour reports an estimate of the coverage of reference sequences, the total number of reads, and the average, maximum and minimum read lengths. In addition, minoTour reports sequencing rates and read lengths over time, pore activity and read qualities for aligned and unaligned reads. minoTour also reports coverage of reads against each reference sequence and the mapping of the 5’ and 3’ ends of reads. This feature is particularly useful when sequencing specific amplicons. Finally, individual reads can be analysed and exported to fasta/fastq. Additional features under development include alerting the user when specific regions of a reference sequence are covered to a specific depth and automatic binning of reads based on barcodes. minoTour is a secure method of sharing data with individual users who may prefer data in a fasta/fastq format. It also serves as a portal for archiving data and rapidly comparing runs.
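The run-level metrics described above (read count, read-length statistics and estimated coverage of a reference) can be illustrated with a minimal Python sketch. This is not minoTour's code: the function name `run_summary` and the coverage formula (total aligned bases divided by reference length) are assumptions chosen for illustration only.

```python
def run_summary(read_lengths, reference_length):
    """Summarise a list of aligned read lengths against one reference.

    Hypothetical helper for illustration; coverage is approximated as
    total aligned bases divided by the reference length.
    """
    n_reads = len(read_lengths)
    total_bases = sum(read_lengths)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "min_length": min(read_lengths, default=0),
        "max_length": max(read_lengths, default=0),
        "estimated_coverage": total_bases / reference_length,
    }


# Example with made-up read lengths and a made-up 48 kb reference.
summary = run_summary([1000, 3000, 2000], reference_length=48000)
```

Because the summary depends only on the reads seen so far, it could be recomputed each time new reads arrive during a run.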
Currently minoTour and its associated scripts require a Linux/Unix machine with local network access to the computer operating the minION. minoTour can provide real time monitoring of runs or simply provide a rapid archive and analysis platform for hdf5 sequence data.

Poster Abstract 7
Revealing malaria parasite transcriptomes using directional, amplification-free RNA-seq
Lia Chappell1, Lindsey Altenhofen2, Julian Rayner1, Manuel Llinás2, Chris Newbold3 and Matt Berriman1
1 Wellcome Trust Sanger Institute, United Kingdom
2 Penn State, United States of America
3 Weatherall Institute of Molecular Medicine, United Kingdom

We have developed a new directional, amplification-free RNA-seq protocol, which reduces bias against the AT-rich cDNA generated from the malaria parasite transcriptome. Removing the bias introduced in the PCR amplification step has unveiled extensive transcription within the intergenic regions of the genome, where the average AT-content is as much as 90%. This protocol has been applied to multiple time points of the disease-causing blood stages of the human malaria parasite Plasmodium falciparum (genome average AT-content ~80%). Our initial analysis suggests that at least 75% of the genome sequence is transcribed, with >4% transcribed from both strands, far exceeding the protein-coding region of the genome. Much of the newly detected transcription belongs to UTRs, which have been revealed for the first time on a genome-wide scale. These UTRs can be long (>1,000 nt) and are often overlapping, particularly for genes whose 3’ ends point towards each other on opposite strands of DNA (a “tail to tail” orientation). We have used the Cufflinks tool (Trapnell et al., 2010) to generate transcript models and have developed a custom computational pipeline to identify UTR positions. We are analysing the properties of these UTRs to link their length and position to patterns of gene expression. It is difficult to resolve closely spaced or overlapping transcripts on the same strand of DNA, so we are currently developing a new assay that will generate precise positions of the transcriptional start sites and polyadenylation start sites. We have also detected transcripts antisense to current gene models that do not appear to be part of these long UTRs. Insights from these data sets allow us to redefine the boundaries of Plasmodium gene models.
We suggest that this modified RNA-seq method may be applicable to any system where more even sequence coverage is desirable, such as transcript isoform assembly or transcriptome studies of other species with AT-biased or GC-biased genomes.

Poster Abstract 8
Identification and monitoring of mutations in the cfDNA of a patient with metastatic melanoma using Next Generation Sequencing methodologies
Anthony Cutts1, Alexander Dilthey2, Oliver Venn2, Avinash Gupta3, Mark Middleton3, Shirley Henderson4, Joanne Mason4, Anna Schuh1
1 Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, John Radcliffe Hospital, Headley Way, Oxford OX3 9DU, UK
2 NIHR Comprehensive Biomedical Research Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK
3 Department of Medical Oncology, Churchill Hospital, University of Oxford, Old Road, Oxford, OX3 7LJ
4 NGS Core Facility, BRC/NHS Oxford Molecular Diagnostics Centre, Department of Haematology, Oxford Radcliffe Hospitals NHS Trust, Oxford OX3 9DU, UK

Cancer is characterized by the acquisition of multiple genetic aberrations, with clonal and subclonal evolution leading to considerable genetic tumour heterogeneity. In routine diagnostics, biopsy material is used to provide details of the genetic profile of the tumour, and this has taken on greater importance recently with the development of targeted therapies. However, testing biopsy material has numerous limitations, in particular that it does not give an accurate depiction of intratumoral and intermetastatic heterogeneity. To overcome these issues, genetic testing may be carried out on cell free DNA (cfDNA), which is highly fragmented extracellular DNA occurring in plasma. This can itself be problematic, as cfDNA is usually present at low levels and tumour DNA is present along with non-tumour DNA; sensitive sequencing methodologies are therefore required to detect variants.
The aims of this study were to perform whole genome sequencing (WGS) on the cfDNA from a cancer patient, identify variants using a newly developed bioinformatics pipeline, confirm variant calls using a secondary method, and sequence further cfDNA samples to monitor changes in the mutational profile. We developed a low input (50 ng), PCR-free library preparation protocol for whole genome sequencing of cfDNA. The cfDNA was obtained from a patient with metastatic melanoma before treatment commenced, and WGS was carried out on a HiSeq 2500 (Illumina, Inc). Subsequently the newly developed bioinformatics pipeline was used to detect variants down to 1% variant allele frequency (VAF), and after filtering 117 variants were called. To verify these variant calls, a targeted approach was taken using a different NGS platform. An Ion Ampliseq custom panel was designed for sequencing cfDNA using minimal (5 ng) input DNA on the Ion Torrent (Life Technologies). Variants were detected down to as little as 0.2% VAF, and the variants detected on both platforms showed excellent concordance. A further 9 sequential samples were taken from the same patient over the following 13 months, and the cfDNA from each was sequenced on the Ion Torrent to monitor variant levels and look for evidence of clonal and subclonal evolution. Using this method we were able to ascertain the effects that different treatments had on the tumour and to identify the emergence of subclones, which correlated with the patient’s disease progression. Overall we show that WGS can be carried out on cfDNA using a PCR-free protocol with low input DNA and that bioinformatics tools can be developed to accurately detect variants with high sensitivity. Further, using a targeted approach we were able to monitor tumour dynamics over a prolonged period, an approach that could potentially be used in diagnostic laboratories.
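As a hedged illustration of the quantities mentioned in Poster Abstract 8, the Python sketch below computes a variant allele frequency (VAF) from read counts, applies a minimum-VAF filter, and measures concordance between two call sets. It is not the authors' pipeline: the `VariantCall` class, the function names, the coordinates and the read counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    """One variant call; all field values used below are made up."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_reads: int
    total_reads: int

    @property
    def vaf(self) -> float:
        # VAF = reads supporting the alternate allele / total reads at the site.
        return self.alt_reads / self.total_reads if self.total_reads else 0.0

    @property
    def key(self):
        # Identity of the variant, independent of read counts.
        return (self.chrom, self.pos, self.ref, self.alt)


def filter_by_vaf(calls, min_vaf):
    """Keep calls whose VAF is at or above the threshold."""
    return [c for c in calls if c.vaf >= min_vaf]


def concordance(calls_a, calls_b):
    """Fraction of variants in calls_a that also appear in calls_b."""
    if not calls_a:
        return 0.0
    keys_b = {c.key for c in calls_b}
    return sum(c.key in keys_b for c in calls_a) / len(calls_a)


# Hypothetical WGS calls filtered at 1% VAF, then checked against a panel call set.
wgs_calls = [
    VariantCall("chrA", 100, "A", "T", alt_reads=12, total_reads=400),
    VariantCall("chrB", 200, "T", "C", alt_reads=2, total_reads=500),
]
panel_calls = [VariantCall("chrA", 100, "A", "T", alt_reads=30, total_reads=1000)]
kept = filter_by_vaf(wgs_calls, min_vaf=0.01)
```

Keying variants on position and alleles rather than on read counts lets the same variant be matched across platforms even when the measured VAFs differ.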
Poster Abstract 9
Varroa destructor transcriptome assembly and assessment
De Paiva Alves, Eduardo1, Fairley, Susan1, Mcinness, Ewan2 & Bowman, Alan S2
1 Centre for Genome-Enabled Biology and Medicine, University of Aberdeen, UK.
2 Institute of Biological and Environmental Science, University of Aberdeen, UK.

As part of a project investigating differential gene expression in pesticide resistant and susceptible Varroa destructor, a parasitic mite that infests beehives, RNA-Seq data was generated for adult female varroa collected from beehives in Greece. The RNA-seq data comprised a total of 400 million paired-end, 100 bp long reads from six samples of pooled varroa. A genome assembly for V. destructor was published in 2010 (Cornman et al., 2010). Analysis of the V. destructor genome using CEGMA (Parra et al., 2007) indicates the available genome may be incomplete, with 33% of the 248 most highly conserved Core Eukaryotic Genes being returned as complete and 78% as partially complete. For the wider set of 458 core genes, Hmmer located matches for 303 at an E-value threshold of 1.0 (Cornman et al., 2010). Consequently, in an effort to obtain the most complete transcriptome possible for the genes expressed, both genome assisted and de novo transcriptome assembly methods were considered. Scripture (Guttman et al., 2010) and AUGUSTUS (Stanke et al., 2004) were applied to the genome, while Trinity (Grabherr et al., 2011) and Oases (Schultz et al., 2012) were used for de novo assembly. Alterations to various parameters were explored, including methods of normalization and contamination filtering of the data, k-mer length, sample grouping and minimum read coverage filtering. Assemblies were evaluated by a number of measures (O’Neil and Emrich, 2013). In the assessment process, use was made of the gene annotation from ORCAE (Sterck et al., 2012) for Tetranychus urticae (accessed via Ensembl Metazoa releases 19 and 21 (Kersey et al., 2014)), one of the most closely related species for which an annotated genome was available.
The metrics used included transcript number, transcript length, reads mapping back to transcripts, CEGMA comparisons, unigene orthologue hit ratio (OHR) and reciprocal best hits (RBH), the last two relative to the 18,224 T. urticae protein coding genes. Numbers of transcripts produced by de novo assembly ranged from 51,843 to 1,147,959,494. Some assemblies contained hits reported as complete by CEGMA for all 248 highly conserved genes. The percentages of reads input to the assembly process that mapped back to the assemblies ranged from 39-95%, in part influenced by the application of coverage filters. Average unigene OHRs ranged from 48.5-51.7%, while the number of unigenes with a RBH in T. urticae varied between 2,123 and 4,563. No single assembly performed best across all metrics.

Poster Abstract 10
Immune sequencing protocol for complete B-cell and T-cell repertoire sequencing
Eileen T Dimalanta1, Adrian W Briggs2, Chris Clouser2, William Donahue2, Gur Yaari2, Lynne M Apone1, Fiona J Stewart1, Salvatore Russello1, Theodore B Davis1, Francois Vigneault2
1 NEB
2 AbVitro

Immune sequencing, which allows for the study of complex immunological diseases by sequencing B-cell antibodies and T-cell receptors, is gaining in popularity due to recent throughput and read length improvements in next-generation sequencing (NGS) technologies. However, the structural and sequence complexities of antibody genes have made reliable targeting approaches challenging. We have developed and optimized a method for accurate sequencing of the immune gene repertoires of B-cells and T-cells. In contrast to previous studies, our method generates full-length sequences of B-cell antibody and T-cell receptor genes. This allows for exhaustive somatic mutation profiling across complete V, D and J segments, full isotype information analysis (IgM, IgD, IgG, IgA and IgE), and the possibility for synthesis and expression of complete antibody chains for downstream immunological assays. By introduction of a unique barcode ID into every captured mRNA molecule, all PCR copies of each mRNA fragment can be collapsed into a single consensus sequence, making the assay extremely accurate by resolving PCR bias and sequencing errors, as well as allowing quantitative digital molecule counting. The assay can work with as low as sub-nanogram levels of input total RNA and is the first method that allows targeted amplification of all possible antibody heavy and light chains (IGH+IGK+IGL) and T-cell receptors (TCRA+TCRB) in a single reaction simultaneously.
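The barcode-collapsing idea described in Poster Abstract 10 can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the authors' method: reads are taken to be pre-aligned strings of equal length, the consensus is a per-position majority vote, and the names `consensus` and `collapse_by_barcode` are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Per-position majority vote over reads (truncated to the shortest read)."""
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


def collapse_by_barcode(reads):
    """Group (barcode, sequence) pairs by barcode and return one consensus each.

    The number of keys in the result is the digital count of distinct
    barcoded molecules.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)
    return {barcode: consensus(seqs) for barcode, seqs in groups.items()}


# Made-up reads: three PCR copies of one molecule (one carrying an error)
# and a single copy of a second molecule.
reads = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),
    ("GGC", "TTGA"),
]
molecules = collapse_by_barcode(reads)
```

In this toy input the erroneous base in the third copy is outvoted by the other two copies, and the two barcodes yield a count of two molecules regardless of how many PCR copies each produced.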
Poster Abstract 11
Identification and characterization of functional genetic variants in Dupuytren’s Disease
Juanjiangmeng Du1,2, Kerstin Becker1,2, Holger Thiele1, Janie Altmüller1, Sigrid Tinschert3,4, Michael Nothnagel1, The Dupuytren Study Group1, Peter Nürnberg1,2, Hans Christian Hennies1,2,4
1 Cologne Center for Genomics, University of Cologne, Germany
2 Cologne Excellence Cluster on Cellular Stress Responses in Aging-associated Diseases, University of Cologne, Germany
3 Institute of Clinical Genetics, Dresden University of Technology, Germany
4 Dermatogenetics, Div. of Human Genetics, Medical University of Innsbruck, Austria

Dupuytren’s Disease (DD) is a progressive, aging-associated fibromatosis disorder of the palm and fingers, leading to progressive flexion contractures. DD is the most frequent genetic disorder of connective tissue and has a multifactorial etiopathogenesis. Current treatment of DD consists largely of surgical removal of the contracted tissue, which is, however, associated with risk of neurovascular injury and recurrence. Hence, unraveling the molecular etiology of DD is needed to provide insight into potential therapeutic targets for treatment of DD. Therefore, we aim to capture genetic variants that are directly involved in the predisposition to DD. We used genome-wide association studies in a cohort of 800 clinically well characterized patients with DD who underwent surgical treatment to identify susceptibility loci. Validated GWAS-identified candidate regions are being further analysed for functional variants using targeted next-generation genomic sequencing. We are also conducting further bioinformatic and in vitro functional studies using myofibroblasts, the major cell type responsible for development of DD, to test the importance of these variants for gene expression, fibroblast differentiation, myofibroblast functional properties, extracellular matrix deposition, and regulatory pathways involved in DD pathogenesis.
We expect our experiments to pinpoint genetic variants that underlie the manifestation of DD, which peaks around 60 years of age, and gene networks that help unravel the pathomechanisms leading to this complex and disfiguring disorder.

Poster Abstract 12
Genomic Sequencing Unit: an NGS facility at the University of Dundee
Melanie Febrer
Genomic Sequencing Unit, University of Dundee, Dow Street, Dundee, DD1 5EH

The Genomic Sequencing Unit (GSU) is a core next generation sequencing facility based in the College of Life Sciences at the University of Dundee. The facility was established a year ago, primarily to fulfil the research requirements of the Centre for Dermatology and Genetic Medicine for genodermatology research and diagnosis, but GSU services are also available to individuals within and outwith the University of Dundee. The facility is equipped with Illumina NGS instrumentation (HiSeq2000 and MiSeq) and provides services for a wide and ever-increasing range of applications, such as de novo sequencing and re-sequencing, RNA-seq (either random primed or directional), whole exome sequencing, targeted sequencing, ChIP-seq, 16S amplicon sequencing and metagenomics. At present, GSU provides minimal bioinformatics analysis and is developing a system to support selected analyses.

Poster Abstract 13
Optimization of drying times for Agencourt AMPure XP beads for next-generation sequencing library preparation
Pablo Fuentes-Utrilla1, Ewan Grant2 and Richard Talbot1
1 Edinburgh Genomics, University of Edinburgh, Easter Bush Midlothian EH25 9RG, UK
2 Beckman Coulter United Kingdom Ltd, Oakley Court, Kingsmead Business Park, London Road, High Wycombe, Buckinghamshire HP11 1JU, UK

Solid Phase Reversible Immobilization (SPRI) beads have increasingly replaced columns and gels for purification of DNA samples between enzymatic reactions, with Beckman Coulter’s Agencourt AMPure XP beads being commonly used in next-generation sequencing (NGS) library preparation. The main advantages of using AMPure XP beads are their high DNA recovery, the ability to select a low fragment size cut-off and their scalability. The most critical parameter affecting DNA recovery from AMPure XP beads is the drying time after the second ethanol wash. According to the manufacturer, over-drying AMPure XP beads (i.e. bead pellets showing cracks while on the magnet) significantly decreases DNA elution efficiency. For this reason, it is recommended not to exceed 5 minutes of drying time. However, most NGS library protocols for Illumina sequencing recommend either 5 or 15 minutes drying time. In our experience, drying times longer than 5 min frequently result in over-dried beads, making it necessary to adjust the drying time by eye for each batch of libraries. However, a standardized drying time is desirable to minimize technical bias in library preparation. We systematically investigated the effect of AMPure XP drying time on DNA recovery. We tested several drying times ranging from 2 to 15 minutes, for the two most common DNA:AMPure XP volume ratios (1:1 and 1:1.8), using manual and automated purifications. Here we present the results of these tests on DNA recovery, and propose recommendations to improve library preparation protocols.

Poster Abstract 14
Evaluation of Agilent’s D5000 and HS D5000 ScreenTape assays for quality control of next-generation sequencing libraries, and comparison with Bioanalyzer HS DNA Chips, and (HS) D1000 and Genomic DNA ScreenTape assays
Pablo Fuentes-Utrilla1, Sarah White1, Karen Troup1, Adam Inche3, Anna Montazam2 and Richard Talbot1
1 Edinburgh Genomics, University of Edinburgh, Easter Bush Midlothian EH25 9RG, UK
2 Edinburgh Genomics, Ashworth Laboratories, The University of Edinburgh, Edinburgh, EH9 3JT, UK
3 Agilent Technologies UK Ltd., 5 Lochside Avenue, Edinburgh EH12 9DJ, UK

The preparation of next-generation sequencing libraries includes one or more quality control (QC) steps to assess the fragment length distribution of the processed sample. These QC steps are generally performed after DNA fragmentation, after intermediate amplifications (e.g. in capture protocols) and as a final library QC step prior to sequencing. While Agilent’s Bioanalyzer has traditionally been used for this task, Agilent’s TapeStation platform offers an attractive alternative due to its simplicity, speed, higher throughput and lower cost per sample. Comparison of the HS D1000 ScreenTape and the Bioanalyzer HS DNA Chip shows that they offer similar sizing and sensitivity for lower molecular size samples (<1000 bp). For libraries with large insert sizes or over-amplified fragments, however, the wider range of the Bioanalyzer’s High Sensitivity DNA Chip (50-7000 bp) outperforms the TapeStation’s current D1000 ScreenTape (35-1000 bp). Here we tested Agilent’s D5000 and HS D5000 ScreenTape assays. Their larger range (50-7000 bp, with markers ranging 15-10000 bp) is directly comparable with the Bioanalyzer’s HS DNA Chip. We tested the D5000 ScreenTape for a range of applications (TruSeq and Nextera library QC, long range PCR amplicon sizing, mate-pair initial DNA sonication), using peak and region parameters, and compared the data to the Bioanalyzer’s HS DNA Chip and the D1000 and Genomic DNA ScreenTape assays. Here we present the results of our tests and discuss the potential application of the D5000 ScreenTape range for NGS library preparation.
Poster Abstract 15
An Improved cDNA Library Generation Protocol for Transcriptome Analysis from a Single Cell
Rachel Fish, Sally Zhang, Magnolia Bostick, Cynthia Chang, Suvarna Gandlur, Andrew Farmer
Clontech Laboratories, Inc., 1290 Terra Bella Ave., Mountain View, CA 94043

As Next Generation Sequencing (NGS) technologies and transcriptome profiling using NGS mature, they are increasingly being used for more sensitive applications that have only limited sample availability. The ability to analyse the transcriptome of a single cell consistently and meaningfully has only recently been realized. SMART™ technology is a powerful method for cDNA synthesis that enables library preparation from very small amounts of starting material. Indeed, the SMARTer® Ultra™ Low RNA method allows researchers to readily obtain high quality data from a single cell or 10 pg of total RNA (the approximate amount of total RNA in a single cell). Recent studies have used this method to investigate heterogeneity among individual cells based on RNA expression patterns (1, 2). A new SMARTer Ultra Low kit has been developed that is simpler and faster while improving the quality and yield of the cDNA produced. The full-length cDNA from this method may be used as a template for library sample preparation for Ion Torrent and Illumina® NGS platforms. Sequencing results for libraries created from single cells or from equivalent amounts of total RNA demonstrate that approximately 90% of the reads map to RefSeq, less than 0.5% of the total reads map to rRNA, and the average transcript coverage is uniform. Improvements in the protocol following first strand synthesis and during cDNA amplification show higher sensitivity with an increase in gene counts and improved representation from GC-rich genes. These data indicate that the improved SMART cDNA protocol is an ideal choice for single cell transcriptome analysis.
Poster Abstract 16 Development and evaluation of the clinical utility of a next generation sequencing tool for myeloid disorders Angela Hamblin1,2,3, Adam Burns1,3, Christopher Tham4, Adele Timbs1,3, Joanne Mason1,3, Helene Dreau1,3, Andreas Weller1, Jithesh Puthen1, Adam Mead2,3,5, Andy Peniket2,3, Paresh Vyas2,3,5, Richard Barker6, Shirley Henderson1,3, Anna Schuh1,2,3 1. BRC/NHS Translational Molecular Diagnostics Centre, Level 4 Haematology, John Radcliffe Hospital, Oxford, OX3 9DU, UK, 2. Department of Haematology, Oxford University Hospitals NHS Trust, Churchill Hospital, Old Road, Headington, Oxford OX3 7LE, UK, 3. Oxford BRC Blood Theme, Oxford University Hospitals NHS Trust, Churchill Hospital, Old Road, Headington, Oxford OX3 7LE, UK, 4.Oxford University Medical School, John Radcliffe Hospital, Oxford, OX3 9DU, UK, 5.Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Oxford, OX3 9DU, UK, 6.Oxford-­‐UCL Centre for the Advancement of Sustainable Medical Innovation, University of Oxford, Level 2, New Richards Building, Old Road Campus, Oxford, OX3 7LG, UK Historically diagnosis, and by default prognosis, of myeloid disorders has been determined by a combination of morphology, immunophenotype, cytogenetic and more recently single gene, if not single mutation, analysis. The introduction of next generation sequencing (NGS) techniques has resulted in an explosion in the quantity of mutational data available with potential diagnostic or prognostic significance. Evaluating each candidate mutation individually in a diagnostic setting is impractical due to financial and DNA quantity constraints creating a need to develop an assay able to assess multiple targets simultaneously. 
We have therefore developed a targeted resequencing assay using a TruSeq Custom Amplicon panel with the MiSeq platform (both Illumina) consisting of 341 amplicons (~56 kb) designed around exons of genes frequently mutated in myeloid malignancies (ASXL1, ATRX, CBL, CBLB, CBLC, CEBPA, CSF3R, DNMT3a, ETV6, EZH2, FLT3, HRAS, IDH1, IDH2, JAK2, KIT, KRAS, MPL, NPM1, NRAS, PDGFRA, PHF6, PTEN, RUNX1, SETBP1, SF3B1, SRSF2, TET2, TP53, U2AF1, WT1 & ZRSR2). Filtering, variant calling and annotation were performed using Basespace and Variant Studio (both Illumina). Validation of the assay was achieved using a cohort of samples previously characterised with conventional techniques (Sanger sequencing and fragment analysis). The initial concordance in mutations detected was 88% rising to 100% once an alternative aligner (Pindel) was used to identify FLT3 ITDs. The lower limit of detection was established at 1-­‐3% variant allele frequency by using a comparison with qPCR and determination of background noise. Base coverage of >50 reads was ~85% on both intra-­‐ and inter-­‐run analyses. Post-­‐validation, we have analysed samples (blood or bone marrow) from >140 patients with suspected myeloid disorders (myelodysplasia, acute myeloid leukaemia or myeloproliferative neoplasms). In order to gather clinical utility data we have developed a reporting algorithm to feed back information to referring clinicians: Usually only those variants previously described as acquired (in COSMIC or peer-­‐reviewed literature) are reported with the exception being novel mutations predicted to result in a truncated protein (i.e. nonsense or indel mutations causing a frameshift). Using this algorithm 66% of patients who underwent testing had a suspected pathogenic mutation relevant to a myeloid disorder. The median number of reported variants identified per sample was one (range 0-­‐6). 
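The reporting rule described above (report variants already described as acquired, plus novel variants predicted to truncate the protein) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Variant` record, the consequence labels and the example variants are all hypothetical, and a real pipeline would take these fields from its annotation tool.

```python
from dataclasses import dataclass

# Hypothetical consequence labels standing in for "nonsense or indel
# mutations causing a frameshift"; real annotators use their own terms.
TRUNCATING = {"nonsense", "frameshift_indel"}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str
    # True if already described as acquired in COSMIC or the
    # peer-reviewed literature (assumed to be looked up upstream).
    previously_described: bool


def is_reportable(variant: Variant) -> bool:
    """Apply the rule from the text: known acquired OR predicted truncating."""
    return variant.previously_described or variant.consequence in TRUNCATING


def reportable(variants):
    """Return the variants that would be fed back to the referring clinician."""
    return [v for v in variants if is_reportable(v)]


# Made-up example calls, for illustration only (not data from the study).
example_calls = [
    Variant("TET2", "nonsense", previously_described=False),   # novel, truncating
    Variant("JAK2", "missense", previously_described=True),    # known acquired
    Variant("ASXL1", "missense", previously_described=False),  # novel, not truncating
]
reported_genes = [v.gene for v in reportable(example_calls)]
```

Keeping the rule in a single predicate makes it straightforward to audit, and to extend later if a policy for other novel variants is agreed.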
An audit of the clinical care of these patients showed information derived from this assay has confirmed suspected diagnoses (in some cases sparing elderly patients invasive bone marrow sampling) therefore aiding treatment decisions. In addition identification of particular variants has allowed more individualised disease monitoring schedules and improved prognostic stratification. Overall the development of an NGS Myeloid Gene Panel has provided extra information to clinicians helping inform diagnosis, determine further follow-­‐up and support treatment decisions. It has also identified a cohort of patients who, despite a definitive diagnosis of a myeloid disorder, have had no pathogenic mutations detected and could be considered for whole exome or genome sequencing. Further work is focussed on improving the reporting bioinformatic pipelines (particularly the calling of larger indels) and developing an approach to take with previously unreported novel variants. Poster Abstract 17 Optimization of Library Amplification Step For Next Generation Sequencing Katja Heitz, Ioanna Andreou, Peter Hahn, Annika Piotrowski, Holger Wedler, Erika Wedler, Frank Reinecke & Nan Fang QIAGEN GmbH Uniform coverage of all genomic regions during Next Generation Sequencing (NGS) is critical for efficiently utilizing sequencing capacity and preventing loss of important sequence information due to drop-­‐out or under-­‐representation of certain regions. The coverage uniformity is especially important in applications such as microbiome-­‐sequencing, where different microbial strains could have significantly different GC contents. GC content -­‐related sequencing bias could potentially lead to under-­‐representation or even complete loss of the genomic regions or microbial strains with very low or very high percentage of GC bases. The PCR step of the NGS library construction procedure has been shown to be the major source of GC bias in the NGS workflow. 
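To make the notion of GC-related under-representation concrete, the following Python sketch computes the GC fraction of a sequence and a simple representation ratio for each genome in a mixed sample. It is a minimal illustration only: the function names and the ratio (observed read share divided by expected share) are assumptions introduced here, not a metric taken from the abstract.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def representation_ratio(observed_reads: dict, expected_share: dict) -> dict:
    """Observed read share divided by expected share, per genome.

    A value of 1.0 means the genome is represented as expected; values
    below 1.0 indicate under-representation (for example from GC bias).
    """
    observed_total = sum(observed_reads.values())
    expected_total = sum(expected_share.values())
    return {
        name: (observed_reads[name] / observed_total)
        / (expected_share[name] / expected_total)
        for name in expected_share
    }
```

For example, `representation_ratio({"high_gc": 25, "low_gc": 75}, {"high_gc": 1, "low_gc": 1})` gives 0.5 for the high-GC genome and 1.5 for the low-GC genome, i.e. the high-GC genome received half the reads expected from an equal mixture.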
To solve this common problem in the NGS field, we established a test system in which a mixture of high-GC and low-GC bacterial genomes is used to optimize library amplification conditions. We used this system to develop a novel NGS library amplification mix that amplifies genomes of widely different GC content with minimal bias and high fidelity.

Poster Abstract 18 Functional genomics and more of the parasite Schistosoma mansoni Thomas Huckvale, Anna Protasio, Nancy Holroyd, Mandy Sanders Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK The flatworm Schistosoma mansoni is a causative agent of schistosomiasis, a debilitating disease affecting more than 200 million people worldwide, usually in areas with poor sanitation. Via a snail host, free-living freshwater S. mansoni larvae penetrate the skin before developing into adults in the bloodstream, where they can live and produce eggs for many years. In spite of effective drug treatment, reinfection can occur rapidly in endemic areas. The Parasite Genomics group at the Wellcome Trust Sanger Institute is producing high-quality sequence data from helminth and protozoan parasite species of importance to human health, or from their model equivalents. This is achieved using a broad range of methods for NGS library preparation on Illumina HiSeq, Illumina MiSeq and PacBio RS platforms, alongside the development of computational tools for mapping, assembly and analysis. We have produced an optical map from male and female S. mansoni using the OpGen Argus platform, which we have used to significantly improve the genome assembly. Our in-house replication of the entire life cycle of S. mansoni allows us to establish robust protocols for this parasite, from the extraction of nucleic acids through to library construction and data analysis. We have successfully produced genomic libraries from single larvae using whole genome amplification. 
Using transcriptomics, we have made directional amplification-free RNA-seq libraries to improve our view of the mRNA transcriptome, and using ribosome profiling we hope to gain insight into the protein-coding fraction of the transcriptome as well as the non-coding fraction. Here we give an overview of several functional genomic approaches to improve our understanding of S. mansoni genome organization, development, host-parasite and male-female interaction, as well as gene function.

Poster Abstract 19 ngsCAT: a tool to assess the efficiency of targeted enrichment sequencing Javier Santoyo-Lopez 1,2, Francisco J. López-Domingo2, Javier P. Florido2, Antonio Rueda2 and Joaquín Dopazo 2,3. 1
Edinburgh Genomics, University of Edinburgh, Edinburgh, UK 2
Genomics and Bioinformatics Platform of Andalusia (GBPA), Seville, Spain. 3
Computational Genomics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain. NGS technologies have opened new opportunities for inspecting and understanding genomic and transcriptomic sequences providing a wealth of new data. Whole genome sequencing is now technically feasible, however it is still expensive to run and in many cases only specific genomic regions are being sequenced. Thus, targeted NGS has become a common tool for interrogating at once several loci, or all coding regions of the genome, at a relatively low cost, but successful sequencing is highly dependent on the efficiency of this target enrichment procedure. In consequence in targeted sequencing experiments, the efficiency and the lack of bias in the enrichment process need to be assessed as a quality control step before performing downstream analysis of the sequence data. To perform this quality control step and to assess the capture process in terms of sensitivity, specificity and uniformity we have implemented the next-­‐generation sequencing data Capture Assessment Tool (ngsCAT). This is a Linux command application written in Python that can be run efficiently on a standard computer. ngsCAT takes the information of the mapped reads and the coordinates of the targeted regions as input files, and generates a report with metrics and figures that allows the evaluation of the efficiency of the enrichment process. The tool can also take as input the information of two samples allowing the comparison of two different experiments. ngsCAT can help to detect samples not properly hybridized, optimize targeted enrichment protocols and adjust data analysis pipelines. Poster Abstract 20 Annotation of draft assemblies of Globodera pallida pathotypes: emphasis on effector genes Dominik R Laetsch1,2, Peter J Cock2, Vivian C Blok3, Mark L Blaxter1 1
University of Edinburgh, Institute of Evolutionary Biology, Edinburgh, EH9 3JT, United Kingdom 2
James Hutton Institute, Information and Computational Science Group, Dundee, DD2 5DA, United Kingdom 3
James Hutton Institute, Cell and Molecular Science Group, Dundee, DD2 5DA, United Kingdom The pale potato cyst nematode Globodera pallida is an obligate sedentary endoparasite of Solanum tuberosum (potato) crops and is estimated to cause annual crop losses in the region of £50 million in the UK alone. The biology of its life cycle, which makes the control of infestations through crop rotation impractical, paired with recent, tighter legislation regarding the use of nematicides, makes an understanding of the interactions between parasite and host imperative for sustainable and competitive potato production. The Globodera genome project has produced a vast collection of genome and transcriptome data for different pathotypes of G. pallida, which are now being investigated to understand the basis of host and parasite specificities. Here, we present preliminary draft assemblies and structural annotations of seven different pathotypes of G. pallida from five invasive UK and two endemic South American populations. Due to the high degree of bacterial and fungal contamination in the read datasets, an iterative read filtering strategy informed by TAGC (Taxon-annotated GC-coverage) plots was applied to improve genome assemblies. Annotation was achieved through a custom pipeline combining several gene-finding algorithms (SNAP, Genemark-ES, Maker, Augustus) and by incorporating protein and RNAseq evidence. Furthermore, we explore the diversity of putative effector genes. Putative effectors are proteins suspected of being secreted from the parasite into the host in order to manipulate the host cell and establish the permanent feeding site. The effectors are also putative factors recognized by the host in nonspecific or specific (resistance) incompatible responses. Analysis of the “effectorome” of each population was performed by comparing predicted gene models against known effectors in the G. pallida reference genome, as well as by searching for novel effectors bearing signal peptides and lacking transmembrane domains. Preliminary results suggest that some groups of effectors display a high rate of divergence between the populations, which may suggest strong selective pressures on these gene families.

Poster Abstract 21 Pipeline for Quality Checking of Illumina HiSeq/MiSeq sequencing runs Loecherbach J (1), Bridgett SJ (2), Cezard T (2), Trivedi U (2), Turner F (1), Taylor SJ (2), Talbot R (1), Santoyo-Lopez J (2), Watson M (1), Blaxter M (2), Gharbi K (2) (1) Edinburgh Genomics, The Roslin Institute and R(D)SVS, The University of Edinburgh, EH25 9RG, Easter Bush, Scotland, UK (2) Edinburgh Genomics, Ashworth Laboratories, The King’s Buildings, The University of Edinburgh, EH9 3JT, Edinburgh, Scotland, UK Quality control and filtering of NGS data is a crucial step in any sequencing project. It enables a better understanding of the data generated, which in turn provides feedback to improve library preparation for future sequencing runs, highlights sequencer problems, helps optimize assembly and mapping steps, and facilitates more accurate interpretation of the project’s final results. Several potential sources of error in the sequencing data need to be evaluated, including base quality, presence of adapters, and sample contamination. 
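Two of the error sources named above, base quality and adapter presence, can be illustrated with a short Python sketch over FASTQ records. This is a toy example under stated assumptions, not the Edinburgh Genomics pipeline: it assumes four-line FASTQ records with Phred+33 quality encoding, the default adapter string is the commonly used Illumina adapter prefix supplied here as a parameter, and the mean-quality threshold of 20 is an arbitrary illustrative choice.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def qc_summary(lines, adapter="AGATCGGAAGAGC", min_mean_q=20.0):
    """Count reads, low-quality reads and reads containing the adapter."""
    total = low_quality = with_adapter = 0
    for _name, seq, qual in parse_fastq(lines):
        total += 1
        if mean_phred(qual) < min_mean_q:
            low_quality += 1
        if adapter in seq:  # exact match only; no mismatches allowed
            with_adapter += 1
    return {"reads": total, "low_quality": low_quality, "with_adapter": with_adapter}
```

Dedicated tools such as FastQC report far richer per-base statistics; the sketch only shows the kind of per-read check involved.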
Here we describe the pipeline used at Edinburgh Genomics for quality checking of reads from Illumina HiSeq and MiSeq sequencers. A similar approach can be used for other platforms such as 454 and Ion Torrent/PGM. This pipeline uses a combination of several open source packages including FastQC, FASTX-­‐toolkit, Usearch, Fastq_screen, and several in-­‐house Perl and Python scripts. The pipeline checks for completed sequencing runs, initiates demultiplexing, QC, archiving of data, sending notification emails and reporting results via an in-­‐house wiki. Poster Abstract 22 A Method for Selectively Enriching Microbial DNA from Contaminating Vertebrate Host DNA Erbay Yigit1, George R. Feehery1, Samuel O. Oyola2, Yan Wei Lim3, David Hernandez4, Bradley W. Langhorst1, Victor T. Schmidt5,6, J. Kirk Harris7, Charles E. Robertson8 Joanna Bybee1, Laurie Mazzola1, Lynne M. Apone1, Christine L. Chater1, Pingfang Liu1, Daniela B. Munafó1, Vaishnavi Panchapakesa1, Deyra N. Rodriguez1, Christine J Sumner1, Donovan Bailey4, Fiona J Stewart1, Eileen T. Dimalanta1, Linda A. Amaral-­‐Zettler5,9, Theodore Davis1, Michael A. 
Quail2, Sriharsa Pradhan1 1 New England Biolabs Inc., Ipswich, MA, USA 2 Wellcome Trust Sanger Institute, Cambridge, UK 3 San Diego State University, San Diego, CA, USA 4 Department of Biology, New Mexico State University, Las Cruces, NM, USA 5 The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA, USA 6 Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, USA 7 University of Colorado School of Medicine, Department of Pediatrics, Aurora, CO, USA 8 University of Colorado Boulder, Department of Molecular, Cellular, and Developmental Biology, Boulder, CO, USA 9 Department of Geological Sciences, Brown University, Providence, RI, USA Recent discoveries have implicated the human microbiome as playing a role in certain physical conditions and disease states, and these advances have opened up the potential for development of microbiome-­‐based diagnostic and therapeutic tools. The majority of microbiome DNA studies to date have employed 16S analysis, but these provide very little information regarding function. In contrast, sequencing of the total DNA of a microbiome sample provides a broader range of information including genes, variants, polymorphisms, and putative functional information. However, many samples, including those derived from vertebrate skin, bodily cavities, and body fluids, contain both host and microbial DNA. Since a single human cell contains approximately 1,000 times more DNA than a single bacterial cell, even low-­‐level human cell contamination can substantially complicate the analysis of a sample. In some cases, as little as 1% of sequencing reads may pertain to the microbes of interest and a large percentage of sequencing reads must be discarded, making such experiments impractical. To address this issue, we developed a method to enrich for microbial DNA using methyl-­‐CpG binding domain (MBD) to separate methylated host DNA from microbial DNA. 
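The effect of host contamination on read yield follows from simple arithmetic, sketched below in Python. The 1,000-fold DNA ratio per cell is the approximate figure given above; the function name and the assumption that reads are sampled in proportion to DNA mass (ignoring extraction, library and mapping biases) are simplifications introduced for illustration.

```python
def expected_microbial_fraction(
    host_cells: float,
    microbial_cells: float,
    host_to_microbe_dna_ratio: float = 1000.0,
) -> float:
    """Expected fraction of reads from microbes, assuming reads are
    sampled in proportion to the amount of DNA each cell contributes."""
    host_dna = host_cells * host_to_microbe_dna_ratio
    microbial_dna = microbial_cells  # one unit of DNA per microbial cell
    total = host_dna + microbial_dna
    if total == 0:
        raise ValueError("sample contains no cells")
    return microbial_dna / total
```

Under these assumptions, a sample with ten bacterial cells for every human cell would yield only about 1% microbial reads (10 / 1010), which is the order of magnitude mentioned in the text.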
Importantly, microbial diversity and relative abundance are maintained after enrichment. This simple magnetic bead-based method was used to remove human or fish host DNA from bacterial and protistan DNA. We describe the enrichment of DNA samples from human saliva, human blood, a mock malaria-infected blood sample, human cystic fibrosis sputum, and a black molly fish, followed by next generation sequencing on multiple platforms. Sequence reads aligning to host genomes were reduced approximately 50-fold, while the percentage of sequence reads corresponding to microbial sequences increased approximately 10-fold. The host DNA captured in the bead fraction can also be eluted using Proteinase K; after sequencing of this bead-bound fraction, the vast majority (>99%) of the mapped reads aligned to the human reference genome. Beyond microbial DNA analysis, separation on the basis of differential methylation status also enables isolation of DNA from various organelles. For example, while plant genomic DNA is bound readily by MBD, chloroplast DNA is not bound and remains in solution. Similarly, because human and plant mitochondrial DNA have limited CpG methylation, this method successfully enriches such DNA. This new method holds promise for microbiome sequence analysis with a variety of sample types, enabling enrichment while accurately reflecting the diversity of the original sample, and the same simple methodology can be used to efficiently sequence large numbers of organellar genomes.

Poster Abstract 23 High quality NGS sample preparation with Covaris: DNA Shearing and FFPE tissue DNA extraction Rowan Gibson+, Austin Purdy, James Han*, Hamid Khoja*, Ed Rudd*, Guillaume Durin+ and James Laugharn* + Covaris Ltd, Unit 3 Brighton Office Campus, Hunns Mere Way, Woodingdean, Brighton, BN2 6AH, UK * Covaris Inc, 14 Gill Street, Woburn, MA 01801, USA The controlled generation of DNA fragments is a critical sample preparation step required by all sequencing applications. 
The diversity of applications and the higher throughput provided by next-generation sequencers have combined to create a new level of demand for fragmentation systems. Today’s technology must provide high quality, tight control, and reproducibility. Additionally, it must be versatile, providing high quality fragmentation across a wide range of fragment lengths, and scalable, with the ability to process multiple samples. Generating tight distributions of DNA fragments ranging from 100 bp to 20 kb, and scalable to 96-well plates, the Covaris Adaptive Focused Acoustics (AFA) process is considered the industry standard. Based on a high frequency focused acoustic transducer, it generates a truly random mechanical fragmentation, which is key to obtaining a statistically unbiased representation of the original samples. Covaris is now introducing a new solution for low sample volume DNA shearing. The microTUBE LV allows DNA to be fragmented in as little as 15 µl, generating accurate, tunable and unbiased fragments with the same high DNA recovery as with larger volumes. Covaris has also developed truXTRAC™, a solution utilizing its AFA technology for the high efficiency extraction of nucleic acids from FFPE tissues. The highly simplified workflow ensures high yield extraction of nucleic acids and allows seamless integration with NGS and other molecular applications. Paraffin is actively removed from the FFPE tissue sample by the finely controlled and reproducible acoustic energy provided by Covaris Focused-ultrasonicators. The entire process is performed in a single tube without transfer steps, and DNA is typically ready for library preparation or other analytical methods in 3 hours.

Poster Abstract 24 Workflows Incorporating a DNA enzyme repair mix improve NGS library prep from FFPE samples Adam Peltan, Lixin Chen, Laurence Ettwiller, Pingfang Liu, Fiona J Stewart, Eileen T Dimalanta, and Thomas C Evans Jr. 
New England Biolabs, Ipswich, MA, 01938, USA Treating biopsy samples with formalin and embedding them in paraffin is a widely practiced method for preserving and archiving clinical samples. The rise of next generation sequencing (NGS) technologies makes it possible for the billions of unique formalin-­‐fixed, paraffin-­‐embedded (FFPE) samples stored worldwide to provide a wealth of information in retrospective genomic studies of human disease. Although well suited to histopathological studies, it has been challenging to retrieve genetic information from FFPE samples. This is often attributed to DNA damage incurred during fixation, including fragmentation, oxidation, deamination, and protein-­‐DNA crosslinks. The poor quality of DNA extracted from FFPE samples has significantly limited the information that can be generated by NGS technologies. We have developed an enzyme cocktail formulated to repair damaged template DNA prior to its use in subsequent detection technologies. This mix is active on a broad range of DNA damages, including modified bases, nicks and gaps, and a variety of blocking moieties at the 3´end of DNA. In this study, we have investigated the effects of this DNA repair on NGS library preparation from FFPE samples, using a longer library preparation method that has separate DNA repair, end repair, dA tailing and adaptor ligation steps, as well as a more streamlined protocol that combines reactions and reduces bead cleanup steps. This repair of DNA samples was found to increase library yield and library success rates without introducing bias into the sequence data. In conclusion, incorporating this DNA repair treatment into library preparation workflows improves the quantity and quality of NGS libraries from FFPE samples. Furthermore, streamlined protocols that combine reaction steps significantly reduce the turn-­‐around time enabling high throughput processing of samples for clinical analysis and large scale genomic studies. 
Poster Abstract 25 ADLA: a tool to aid adapter design for novel library applications and PacBio-based library interrogation Lawrence Percival-Alwyn, Matt Clark The Genome Analysis Centre All current NGS platforms rely on the addition of platform-specific adapters to fragments generated from a DNA/RNA source of interest. New NGS methods are constantly being designed to answer fresh biological questions; however, adapter design can be complicated and time-consuming, and is expensive to get wrong. Adapter Design and Library Analysis (ADLA) offers a solution to this by providing in silico adapter prototyping for NGS platforms. A hidden obstacle in the development of new NGS methods is the potential to introduce low diversity sequences. For example, Illumina's sequencing technology can be problematic when sequencing low diversity libraries, resulting in low yields and lower per-base quality scores compared to sequencing more diverse libraries. Single molecule real-time (SMRT®) sequencing, developed by Pacific Biosciences, offers a solution for spanning homopolymeric, low-complexity, and highly repetitive regions. In addition to aiding the design of novel adapters, ADLA utilises both the PacBio RSII platform and adapter design information to perform an unbiased, quick and low cost adapter-structure-informed analysis of pilot NGS libraries and indexed library pools.

Poster Abstract 26 Rapid diagnosis for unexplained anaemia using targeted massively parallel sequencing Noémi BA Roy1,2,3, Joanne Mason1, Chris Babbs2, Juliana Teo4, Julie Curtin4, Wale Atoyebi3, Deborah Hay2,3, Jennifer Eglinton1, Georgina Hall6, Veronica Buckle2, Irene Roberts2, David Roberts5, Doug Higgs2, Shirley Henderson1 and Anna Schuh1, 3 1. BRC/NHS Translational Molecular Diagnostics Centre, Level 4 Haematology, John Radcliffe Hospital, Oxford, OX3 9DU, 2. Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Oxford, OX3 9DU, 3. 
Dept of Haematology, Oxford University Hospitals NHS Trust, Churchill Hospital, Old Road, Headington, Oxford OX3 7LE, 4. Dept of Haematology, Sydney Children’s Hospitals Network, Westmead, Australia, 5. NHS Blood and Transplant, NHSBT – John Radcliffe Hospital, Level 2, Oxford OX3 9BQ, 6. Dept of Paediatrics, Oxford University Hospitals NHS Trust, John Radcliffe Hospital, Oxford, OX3 9D 7. Oxford-UCL Centre for the Advancement of Sustainable Medical Innovation, University of Oxford, Level 2, New Richards Building, Old Road Campus, Oxford, OX3 7LG Clinicians faced with unexplained congenital anaemia frequently need to refer patient samples to multiple centres for diagnostic investigations, since specialist centres often focus on the gene(s) associated with particular subtypes of congenital anaemia. This may lead to delays in diagnosis and increased costs and, crucially, a definitive genetic diagnosis is often not achieved. Furthermore, screening exclusively for known mutations misses novel pathogenic changes. Targeted next-generation resequencing (NGS) obviates these problems, offering clinicians a 'one-stop' test screening for rare and common mutations causing congenital anaemias. We have designed a panel for Diamond-Blackfan anaemia (DBA), dyskeratosis congenita, Shwachman-Diamond syndrome, sideroblastic anaemia, congenital dyserythropoietic anaemia (CDA) and the enzyme deficiencies G6PD, PKLR and pyrimidine 5’ nucleotidase. This panel covers 33 genes (target region ~118 kb). Sequence capture and library preparation are carried out using TSCA (Illumina) and sequencing is performed using an Illumina MiSeq. Variant calling and filtering are carried out using the Basespace Custom Amplicon workflow, with Variant Studio used for annotation and filtering. Validation shows excellent coverage, with >95% of known mutations covered by >30 reads. Thirty mutations in known positive controls were correctly identified by the panel and validated by Sanger sequencing. One SNP identified by Sanger sequencing was not detected by the panel. The process, machine, pipeline and assay were further verified using the Illumina Infinium Human OmniExpress Exome v1.2 beadchips. Comparing the gVCF data with the microarray results showed 100% concordance of the calls. This corresponded to 283 calls, of which 16 were variants and 267 were reference calls. Combining the data from Sanger sequencing and from the microarray comparison yields a specificity of 100% and a sensitivity of 99.7% (95% CI 97.9-99.9%). Once the pipeline and process were validated, 57 diagnostic samples were analysed. Overall, a diagnosis was made in 33% of the cases, although this varied by phenotype (DBA 7/9, enzyme deficiency 1/1, CDA 6/10, sideroblastic anaemia 1/4, “unexplained anaemia” 4/33). 
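The overall diagnostic yield quoted above can be checked directly from the per-phenotype counts. The short Python sketch below does this arithmetic; the counts are those given in the text, while the variable and function names are introduced here for illustration.

```python
# (diagnosed, tested) per phenotype, as reported in the text above.
diagnosed_by_phenotype = {
    "DBA": (7, 9),
    "enzyme deficiency": (1, 1),
    "CDA": (6, 10),
    "sideroblastic anaemia": (1, 4),
    "unexplained anaemia": (4, 33),
}


def overall_yield(counts):
    """Return (total diagnosed, total tested, diagnosed fraction)."""
    diagnosed = sum(d for d, _ in counts.values())
    tested = sum(n for _, n in counts.values())
    return diagnosed, tested, diagnosed / tested


diagnosed, tested, fraction = overall_yield(diagnosed_by_phenotype)
# 19 of 57 samples, i.e. one third, consistent with the reported 33%.
```

The per-phenotype fractions range from 4/33 for "unexplained anaemia" to 7/9 for DBA, which is the variation by phenotype noted in the text.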
Furthermore, we detected novel mutations in several of the genes on the panel, including RPL5 mutations in patients with DBA and CDAN1 mutations in patients with CDA type I. Targeted NGS for congenital anaemia using a bespoke panel of genes offers clinicians a fully validated diagnostic test for unexplained congenital anaemias. Patients found not to have mutations in the genes covered on the panel are considered for further genetic analysis such as exome sequencing on a research basis, with the aim of identifying novel genes involved in erythropoiesis.

Poster Abstract 27 Transcriptome assemblies for studying sex-biased gene expression in the guppy Eshita Sharma*§, Axel Künstner*, Bonnie A Fraser*, Christine Dreyer*, Detlef Weigel* *Department 6, Molecular Biology, Max Planck Institute of Developmental Biology, Tübingen 72076, Germany §
Present address: Bioinformatics and Statistical Genetics, WTCHG, Roosevelt Drive, Oxford, OX3 7BN The Trinidadian guppy, Poecilia reticulata, is a model organism in evolution, ecology and behaviour. This freshwater live-bearing fish has long been studied for its rapid evolution of life-history and sexually advantageous traits. Many of these traits are sexually dimorphic, the favourite example being the highly polymorphic colour patterns of the males. In order to identify the molecular mechanisms underlying the guppy’s sexual dimorphism, we studied gene expression differences between three sexually dimorphic tissues in adults. As a first step we built a comprehensive reference transcriptome combining transcripts from genome-guided and genome-independent assemblies. Using this reference we compared gene expression between the sexes in their brains, tails and gonads. As expected, the gonads were the most sexually diverged organs, but we also found several small but significant expression differences in the somatic tissues. We found tissue-associated sex-biased expression in genes related to the tissue-specific phenotypic dimorphism. Since sex-biased genes contribute to the sexually dimorphic traits, we expect them to evolve under similar pressures of sexual selection and sex-specific natural selection. We found enrichment of ovary-biased genes and depletion of testis-biased genes on the nascent sex chromosome of the guppy, indicating sex-specific selection pressures even in the absence of a truly hemizygous state. We also compared rates of nucleotide substitution in sex-biased and unbiased genes. We observed signatures of rapid evolution in sex-biased genes in gonads, but only in female-biased genes in the somatic tissue. These differences reinforce the theory of sex-biased gene evolution under varying selection pressures in the reproductive and non-reproductive tissues. The guppy transcriptome provides a large molecular resource for further research on the adaptive and sexually dimorphic traits of the guppy.

Poster Abstract 28 Investigating network regulation via mouse phenotype data S Kumar1, M Simon1, A-M Mallon1 1
MRC Harwell, Harwell Science and Innovation Campus, Harwell OX11 0RD, UK How genetic variations translate into disease phenotypes is largely unknown. With increasing number of genomic variants potentially associated with diseases being identified through GWAS studies and next-­‐generation sequencing, it is important to work out underlying principles of genotype-­‐to-­‐phenotype relationships. There are intricate molecular networks involved in cellular function to which mutations can cause aberrations of single genes as well as disrupt the broader network and result in observed phenotype. The majority of disease phenotypes arise as an effect of multiple genes and proteins; knowledge of the involved molecular networks provides a basic framework to understand the disease aetiology. Presently two large scale mutagenesis screens are carried out at MRC Harwell. First, a forward genetics screen where mice are ENU mutagenized and their progeny phenotyped [1]. Second, a reverse genetics screen, carried out by International Mouse Phenotyping Consortium (IMPC). IMPC aims to knock out every gene in the mouse genome [2]. Mice from both the screens go through a series of phenotype procedures and the data is collected, disseminated and analysed. Here, I use mutant phenotype data derived from the two screens together with molecular networks and protein data to identify novel associations between specific mouse phenotypes and the underlying biological networks that may be disrupted. Our aim is to show how an allelic series can provide novel insights into disease pathogenesis. 1. Simon MM, Mallon A-­‐M, Howell GR, Reinholdt LG: High throughput sequencing approaches to mutation discovery in the mouse. Mamm Genome 2012, 23:499–513. 2. 
Koscielny G, Yaikhom G, Iyer V, Meehan TF, Morgan H, Atienza-Herrero J, Blake A, Chen C-K, Easty R, Di Fenza A, Fiegel T, Grifiths M, Horne A, Karp NA, Kurbatova N, Mason JC, Matthews P, Oakley DJ, Qazi A, Regnart J, Retha A, Santos LA, Sneddon DJ, Warren J, Westerberg H, Wilson RJ, Melvin DG, Smedley D, Brown SDM, Flicek P, et al.: The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Res 2014, 42(Database issue):D802–9.

Poster Abstract 29
Selective Depletion of Abundant RNAs to Enable Transcriptome Analysis of Low Input and Highly Degraded RNA from FFPE Breast Cancer Samples
Daniela B Munafó1, Bradley W Langhorst1, Joanna Bybee1, Laurie Mazola1, Christine L Chater1, Deyra N Rodriguez1, Christine J Sumner1, Pingfang Liu1, Lynne M Apone1, Erbay Yigit1, Vaishnavi Panchapakesa1, Salvatore Russello1, Fiona J Stewart1, Dominick Sinicropi2, John Morlan2, Kunbin Qu2, Mei-Lan Liu2, Jennie Jeong2, Mylan Pho2, Ranjana Ambannavar2, Eileen T Dimalanta1 and Theodore B Davis1
1
New England Biolabs, Inc. 240 County Road, Ipswich, MA 01938, USA
2 Genomic Health, Inc. 301 Penobscot Drive, Redwood City, CA 94063, USA
Deep sequencing of cDNA prepared from total RNA (RNA-Seq) has become the method of choice for transcript profiling and discovery. The standard whole-transcriptome approach faces a significant challenge, as the vast majority of reads map to ribosomal RNA (rRNA). One solution is to enrich the sample RNA for polyadenylated transcripts using oligo(dT)-based affinity matrices; however, this also eliminates other biologically relevant RNA species, such as microRNAs and noncoding RNAs, and relies on having an RNA sample of high quality and quantity. Here, we present a method to eliminate abundant RNAs from total RNA with different degradation levels, from intact RNA to highly degraded formalin-fixed paraffin-embedded (FFPE) samples. This method is based on hybridization of probes to the targeted abundant RNA, followed by enzymatic degradation. We applied this method to remove cytoplasmic and mitochondrial rRNA from different eukaryotic total RNA samples (human, mouse and rat), as well as degraded (1 year old) and highly degraded (10 year old) FFPE breast tumour biopsy RNA samples. We evaluated the depletion efficiency and off-target effect of this method using strand-specific high-throughput RNA sequencing. Ribosomal RNA depletion resulted in a minimal percentage of total reads mapping to rRNA sequences, regardless of the species, input amount (1 µg or 100 ng), or degradation level.
Additionally, there was very good transcript expression (FPKM) correlation (>0.93) between rRNA depleted and non-depleted libraries. This method offers a robust and simple solution for transcriptome analysis of a variety of samples, including low quality and low quantity clinical samples such as FFPE RNA. Moreover, it is amenable to high-throughput sample preparation and robotic automation. This method is sensitive, specific, and produces increased coverage of less abundant, non-targeted transcripts in RNA-Seq studies.

Poster Abstract 30
Quality control of next-generation sequencing data without a reference
Urmi Trivedi1, Timothée Cézard1, Stephen Bridgett1, Anna Montazam1, Jenna Nichols1, Mark Blaxter1,2 and Karim Gharbi1,2
1
Edinburgh Genomics, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT
2 Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3JT
Next-generation sequencing (NGS) technologies have dramatically expanded the breadth of genomics. Genome-scale data, once restricted to a small number of biomedical model organisms, can now be generated for virtually any species at remarkable speed and low cost. Yet non-model organisms often lack a suitable reference to map sequence reads against, making alignment-based quality control (QC) of NGS data more challenging than in cases where a well-assembled genome is already available. Here we show that by generating a rapid, non-optimized draft assembly of raw reads, it is possible to obtain reliable and informative QC metrics, thus removing the need for a high quality reference. We use benchmark datasets generated from control samples across a range of genome sizes to illustrate that QC inferences made using draft assemblies are broadly equivalent to those made using a well-established reference, and describe QC tools routinely used in our production facility to assess the quality of NGS data from non-model organisms.

Poster Abstract 31
LoRIS RNA-seq reveals plant responses at high resolution
Walter Verweij and Matt Clark
The Genome Analysis Centre (TGAC)
Transcript profiling is a powerful strategy to unravel biological mechanisms in many organisms. However, total RNA is generally extracted from a collection of many heterogeneous cells (e.g. whole leaves), which loses the spatial information of responses within a plant's tissues. In order to study plant and microbial interactions through transcriptional changes, I developed a method that allows us to sequence RNA from small amounts of tissue in a way that preserves the spatial information. Arabidopsis thaliana Columbia-0 leaves were treated with 1) flg22 peptides, 2) green aphids and 3) mechanical wounding.
For each treatment I was able to detect transcriptional changes and identify protein networks that are unique to that treatment, indicating that the LoRIS method has the sensitivity to detect subtle transcriptional changes in amounts of tissue that would otherwise be too small for standard Illumina TruSeq sequencing. Our future goal is to deploy the method and to develop a technology that allows us to detect transcriptional changes at the single-cell level after challenging leaf tissue with various biotic and abiotic stresses.

Poster Abstract 32
AlmostSignificant, a tool for organising and simplifying NGS data QC
Joseph Ward1, Nicholas Schurch2, Geoff Barton2 & Melanie Febrer1
1. Genomic Sequencing Unit, University of Dundee, Dow Street, Dundee, DD1 5EH
2. Division of Computational Biology, College of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH
With the advancement of sequencing technology, the amount and speed of data generated make tracking of metadata a more complex problem that is often overlooked. In addition, quality control of NGS data coming from high throughput instruments is an important and time-consuming stage. AlmostSignificant is a platform for collating metadata from Illumina sequencing runs, fastq files and various QC tools into a single easily navigable interface. AlmostSignificant can also take input from a LIMS or from text files, allowing integration of the sequencing results with project and sample information used by other systems. The platform is searchable either by run number and date or by project and sample ID, allowing run or project performance to be evaluated at a glance. AlmostSignificant can also generate a number of useful statistics arranged per run type and per platform.
The information calculated includes, but is not limited to, number of reads per lane and per run, data per lane and per run, mean Q30 and various plots such as cluster density to reads, cluster density to Q30 and 1st base report versus actual cluster density. AlmostSignificant has been created for use at the Genomic Sequencing Unit, an NGS facility based at the University of Dundee, but has the potential to be implemented in other sequencing centres. It is based on Django and a MySQL database and, because interaction is done through a web interface, is platform independent.

Poster Abstract 33
viRome: an R package for the visualization and analysis of viral small RNA sequence datasets
Mick Watson1, Esther Schnettler2, Alain Kohl2
1
The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian EH25 9RG 2
MRC-University of Glasgow Centre for Virus Research, 8 Church Street, Glasgow G11 5JR, UK
RNA interference (RNAi) is known to play an important part in defence against viruses in a range of species. Second-generation sequencing technologies allow us to assay these systems, and the small RNAs that play a key role in them, with unprecedented depth. However, scientists need access to tools that can condense, analyse and display the resulting data. Here, we present viRome, a package for R that takes aligned sequence data and produces a range of essential plots and reports. viRome is released under the BSD license as a package for R, available for both Windows and Linux, at http://virome.sf.net. Additional information and a tutorial are available on the ARK-Genomics website: http://www.ark-genomics.org/bioinformatics/virome

Poster Abstract 34
An integrated study of the Impact of Genomic Aberrations on miRNA expression and miRNA-mRNA interactions: The Case of miR-210 in Breast Cancer
Laura Winchester1, Antoine De Weck1, Jiannis Ragoussis2, Adrian Harris3 & Francesca Buffa1
1
Applied Computational Genomics, Department of Oncology, University of Oxford. Old Road Campus Research Building, Oxford, OX3 7DQ 2
The Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford (Present Address: McGill University and Genome Quebec Innovation Centre, 740 DR Penfield Ave, Montreal H3A 0G1, Canada) 3
Growth Factor Group, The Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DS
The global landscape of genomic aberration and its effect on gene expression and function have been demonstrated in recent cancer studies. Several microRNAs (miRNAs) have been identified as markers of bad prognosis in cancer. Here, we explore the effect of genomic aberrations, focusing on genomic amplification, on miRNA expression and on the action of miRNAs on their target mRNAs. MiR-210 is expressed at increased levels in hypoxic breast and head and neck cancers with poor prognosis. We have previously detected down-regulation of the expression of miR-210 target genes in cancer samples expressing high levels of this miRNA. This is an important proof of principle, given that miRNA action is mostly translational rather than transcriptional repression, and thus it is difficult to detect down-regulation based solely on mRNA levels without considering protein levels. We analysed an integrated dataset of 198 early breast cancer samples treated in Oxford, characterised by matched full clinical annotation, complete 10-year follow-up, microRNA and mRNA expression, and copy number aberration. Multiple linear regression and survival analyses were used to clarify effects of the genomic aberrations on the miRNA–mRNA expression. A study of amplicon content showed that the proportion of miR-210 targets-to-genes is higher than expected in larger amplicons. Importantly, miR-210 regulatory action on amplified target genes was preserved. Global target down-regulation was detected in analysis of miRNA–mRNA expression alone, showing that genomic aberrations can be a confounder, if not accounted for.
After model correction for genomic aberrations, miR-210 expression has a greater functional effect on target expression than the amplification changes, and the ability to detect down-regulation of target genes is increased. We show that integrated global analysis of genomic and transcriptomic data can be used to study the action of miRNA control on target gene expression despite disruption from other factors.

Poster Abstract 35
StatsDB v1.2: Easier, faster run metric storage and analysis
Neil Pearson, Ricardo Ramirez-Gonzalez, Richard M. Leggett, Ram Shrestha, Anil Thanki, Robert P. Davey
The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH
As sequencing operations continue to generate ever-larger volumes of data, the challenge of handling and interpreting that data becomes increasingly significant. Centres must be able to collate quality and yield metrics of individual runs and samples into clear reports for their customers, and may also wish to monitor changes in their output over longer time intervals. Recently, we published StatsDB, an open-source software package for storage and analysis of next generation sequencing run metrics [1]. The system is designed to be incorporated into the standard primary analysis pipeline of, for example, Illumina sequencers. Metrics are stored in an SQL database and a set of APIs provides the ability to store and access the data while abstracting the underlying database design. This allows simpler, wider querying across multiple fields than is possible by the equivalent manual steps. The open nature of the database schema has allowed new capabilities, through the addition of new support software, and through expansion of the original APIs. Enhancements have targeted speed improvements to database access, ease of use, and incorporation into primary analysis pipelines.
New parser plugins mean that as well as supporting ingest of statistics produced by FastQC [2], a commonly-used tool for the quality control analysis of sequence reads, StatsDB is now capable of storing summarised InterOp data generated by Illumina machines. A parser for equivalent run metrics from PacBio machines is also in development. Consumer tools to dissect individual reports, e.g. "provide metrics about nucleotide bias in libraries using adapter barcode X, across all runs on sequencer A, within the last month", and produce human- and computer-readable output are now provided. Finally, we have developed a Python API to StatsDB to complement the existing APIs in Java and Perl, which makes StatsDB easier to incorporate into Python-based pipelines.
[1] Ramirez-Gonzalez, Leggett, Waite et al., StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics, F1000Res. 2013; 2: 248.
[2] http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Poster Abstract 36
Exploring isoforms and splicing events in ENU mouse mutants using RNA-seq
S Sethi, M Simon, M Parsons, P Nolan, A-M Mallon
MRC Harwell, Harwell Science and Innovation Campus, Harwell OX11 0RD, UK
MRC Harwell conducts two large-scale mutagenesis screens: the International Mouse Phenotyping Consortium (IMPC) [1], which aims to find a phenotype for every gene in the mammalian genome, and an ENU Aging mutagenesis screen to study the genetics of ageing. These screens provide a rich source of functional data which can generate hypotheses about the mutation in question. In addition, we use DNA-Seq to identify the ENU mutations which lie within genes and RNA-Seq to identify differentially expressed genes in the mutants. The emergence of next generation RNA sequencing has provided an exciting new technology to analyse alternative splicing on a large scale.
However, computational methods for analysing differential expression and differential splicing from short-read sequencing are not yet fully established and there are still no standard solutions available for a variety of data analysis tasks. One of the major challenges with RNA-Seq analysis is the identification of differential isoforms and splicing events. Isoforms and aberrant splicing events have previously been implicated in a number of different diseases [2, 3] and associated with different cancer types [4]. Therefore, effective detection of isoforms and splicing events in mouse models of disease is critical to identify novel functional roles of genes relating to the phenotype. Recently, we have sequenced the transcriptome to analyse the gene expression levels between affected and unaffected mice. Here we present our analysis on identifying differential splicing and isoform expression in ENU mutants from RNA-seq data by integrating multiple statistical algorithms. In addition, we will show how regulatory networks and mechanisms contribute to the phenotype by predicting co-expressed molecular complexes in pathways. Overall, we will show how differential splicing events and isoforms can contribute to aberrant phenotypes.
1. Koscielny, G., et al., The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Res, 2014. 42(Database issue): p. D802-9.
2. Kim, E., A. Goren, and G. Ast, Alternative splicing and disease. RNA Biol, 2008. 5(1): p. 17-9.
3. Faustino, N.A. and T.A. Cooper, Pre-mRNA splicing and human disease. Genes Dev, 2003. 17(4): p. 419-37.
4. Christofk, H.R., et al., The M2 splice isoform of pyruvate kinase is important for cancer metabolism and tumour growth. Nature, 2008. 452(7184): p. 230-3.

Poster Abstract 37
Viral/host gene expression profiles in lymphoid and feather follicle epithelial (FFE) cells infected with Marek’s disease virus (MDV)
Deepali Vasoya2, Lydia Kgosana1, William Mwangi1, Mick Watson2, and Venugopal Nair1
1 Avian Viral Disease Programme, The Pirbright Institute, Compton Laboratory, Berkshire RG20 7NN, UK.
2 Division of Genetics and Genomics, The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
Marek’s disease virus (MDV) is a highly infectious herpesvirus that induces rapid-onset malignant T-cell lymphoma in the chicken. While the virus remains latent in the lymphocytes, it replicates in the feather follicle epithelial (FFE) cells and spreads via the chicken respiratory tract after inhalation of contaminated dust to continue the life cycle. Studying gene expression in the lymphoid cells and FFE can provide significant insights into the unique virus-host interactions in these two distinct sites of virus replication. Chickens were infected intra-abdominally at 8 days of age with a modified RB-1B virus expressing enhanced green fluorescent protein (EGFP) and observed for the occurrence of MD. The EGFP marker was used to track the virus infection in different cell types. Tissue samples from spleen, kidney and liver were collected for transcriptome analysis. The skin samples, collected as frozen cryoblocks, were used to make thin cryosections to demonstrate the expression of viral proteins. Laser micro-dissection (LMD) was also used to isolate EGFP-positive infected cells for RNA extraction and transcriptome analysis. All the samples were sequenced using Illumina HiSeq2500. The reads were mapped to host and virus using TopHat and gene expression levels were measured using htseq-count. Differentially expressed genes between infected FFE and control, as well as between infected lymphoid cells and control, were identified using edgeR. The Ingenuity Pathway Analysis (IPA) tool was used to identify the functions and pathways of differentially expressed genes. The differentially expressed host and viral genes between the infected FFE and lymphoid cells, and their potential role in virus-host interactions, will be presented.

Poster Abstract 38
Design and development of exome capture sequencing for the domestic pig (Sus scrofa)
Christelle Robert1, Pablo Fuentes-Utrilla1,2, Karen Troup1,2, Julia Loecherbach1,2, Frances Turner1,2, Richard Talbot1,2, Alan L Archibald1, Alan Mileham3, Nader Deeb4, David A Hume1, Mick Watson1,2
1 The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Edinburgh EH25 9RG, UK
2 Edinburgh Genomics, University of Edinburgh, Easter Bush, Edinburgh EH25 9RG, UK
3 Genus plc, 1525 River Road, DeForest, WI 53532, USA
4 Genus plc, 100 Bluegrass Commons Blvd. Suite 2200, Hendersonville, TN 37075, USA
The domestic pig (Sus scrofa) is both an important livestock species and a model for biomedical research.
Exome sequencing has accelerated identification of protein-coding variants underlying phenotypic traits in human and mouse. We aimed to develop and validate a similar resource for the pig. We developed probe sets to capture pig exonic sequences based upon the current Ensembl pig gene annotation supplemented with mapped expressed sequence tags (ESTs) and demonstrated proof-of-principle capture and sequencing of the pig exome in 96 pigs, encompassing 24 capture experiments. For most of the samples at least 10x sequence coverage was achieved for more than 90% of the target bases. Bioinformatic analysis of the data revealed over 236,000 high confidence predicted SNPs and over 28,000 predicted indels. We have achieved coverage statistics similar to those seen with commercially available human and mouse exome kits. Exome capture in pigs provides a tool to identify coding region variation associated with production traits, including loss of function mutations which may explain embryonic and neonatal losses, and to improve genomic assemblies in the vicinity of protein coding genes in the pig.

Poster Abstract 39
Size selection of Illumina TruSeq Small RNA Libraries using Sage Blue Pippin
Carolyn Riddell1, Urmi Trivedi1, Jenna Nichols1, Richard Talbot1 and Karim Gharbi1,2
1 Edinburgh Genomics, School of Biological Sciences, The University of Edinburgh, EH9 3JT, UK
2 Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3JT, UK
The Illumina small RNA library prep kit promises an easy solution for the targeted extraction and sequencing of a range of small RNA species (145-160bp), through gel-based size selection. The protocol successfully targets these smaller fragments whilst minimising adapter contamination, yet is time-consuming: initial preparation of 24 libraries can take up to 1 day, but size selection, which involves extracting and purifying the fragment, adds another 2 days and is limited to processing around 16 samples each day.
This bottleneck in the workflow hampers the efficiency of an otherwise simple protocol, in turn increasing library preparation costs. Here we set out to test the capability of Sage’s Blue Pippin to accurately size select small DNA fragments. Although limited to processing 5 samples at a time, the method can combine sizing, selection and purification into an 80-minute run, allowing higher throughput processing of small RNA libraries in one day, and reducing total library preparation time by a third. We tested Sage’s 3% cartridge aimed at the 90-200bp fraction, testing a number of size ranges based on Illumina’s standard gel-based protocol, as well as those used in other commercial and custom protocols. Using individual libraries and pools of libraries, we find that the Blue Pippin is capable of targeting a tight size range (43bp) and have identified an optimum that consistently and accurately cuts the 145-160bp region with little or no adapter contamination in the final extraction. We also present data from control samples that demonstrate the equivalence of manually and Blue Pippin size-selected libraries.

Poster Abstract 40
Bioinformatics Training for Genomics
Bert Overduin1, Christiane Hertz-Fowler2, Javier Santoyo-Lopez1, Neil Hall2, Mick Watson1, Karim Gharbi1 & Mark Blaxter1
1
Edinburgh Genomics, The University of Edinburgh, 2
Centre for Genomic Research, University of Liverpool
One of the aims of Edinburgh Genomics and the Centre for Genomic Research (CGR) is to provide bioinformatics training, aimed at scientists with a range of abilities, from beginner to expert, with a focus on the application of bioinformatics tools to big data from genomics experiments and in particular for next-generation sequencing. To this end, we are developing a portfolio of workshops, covering a variety of topics, ranging from “Linux for Genomics” to “RNA-Seq Data Analysis”. To get a better understanding of the immediate training needs within the research community we have conducted a survey in the Edinburgh and Liverpool areas as well as the wider UK. The results of this survey will be used to prioritise the development of workshops on topics for which there is great demand and customise them to the needs of researchers. Edinburgh Genomics and CGR have access to a range of training rooms equipped with PCs and training servers where practical sessions can be run. Courses will be mainly delivered by Edinburgh Genomics and CGR bioinformatics teams, whose members have extensive experience in the analysis of genomic data. Courses about specialised topics will be taught by external teachers who are experts in that particular field.

Poster Abstract 41
Updates to MISO, the open-source NGS LIMS project
Xingdong Bian, Anil Thanki, Robert Davey
The Genome Analysis Centre, Norwich Research Park, UK
MISO ("Managing Information for Sequencing Operations") is a freely available open-source LIMS for recording next-generation sequencing (NGS) metadata for sequencing centres. Based on the common objects (projects, samples, libraries, pools and runs etc.) from the European Bioinformatics Institute (EBI) Sequence Read Archive schemas, MISO stores relevant metadata for typical lab workflows and automatically tracks run information from common NGS platforms (e.g.
Illumina GA, HiSeq and MiSeq, Roche 454, ABI SOLiD and PacBio RS). MISO can also initiate HPC job submission for initial analysis and QC of sequencing data, and automatically generates public repository data submission schemas. Because MISO is modular and designed to be extensible and customisable, it can be used by both large centres characterised by high-throughput data production and smaller scale laboratories with constrained expenditure for IT solutions. We present the recent highlighted updates of MISO: plate support (e.g. 96-well and 384-well), new visualisations in reporting, entity groups (i.e. grouping of objects for easier project management, such as sample groups), more flexible barcode printing, sequencing run QC analysis reporting and visualisation, support for traditional Excel/ODF/CSV input and output for bulk data import and export, continued support of new NGS platforms, and a new workflow system for customised lab processes.

Poster Abstract 42
Nuclear factories and genome organization
Michal R. Gdula1, Catherine C. Green1
1
Wellcome Trust Centre for Human Genetics, University of Oxford
DNA replication and transcription activities are not diffusely spread throughout eukaryotic nuclei but concentrated in discrete foci called “replication factories” and “transcription factories”. A growing body of evidence indicates that this structural organisation of replication and transcription has functional importance, enabling efficient copying and processing of genomic information, and securing genome stability. Although a number of models have been proposed, the internal configuration of these factories is still poorly described; indeed even the existence of functionally organised replication and transcription factories is still debated. Recently developed assays such as 4C, Hi-C, ChIP-seq, ChIA-PET, and Repli-seq allow the direct characterization of chromatin interactions, analyses of the interactions between chromatin and chosen proteins and measurement of replication timing in a genome-wide manner. These techniques have shown that genomic regions which interact frequently with each other (delineated by Hi-C) tend to replicate at the same time (in so called “replication domains”). These data fit well with the concept of transcription and replication factories, but further information regarding the structure of such factories has not yet been extracted from these data sets. Here we present initial results from our analysis of replication and transcription factories based on our own and published data. We hypothesise that replication factories, similarly to transcription factories, affect chromatin conformation, bringing distant loci into close proximity in a non-random fashion. To investigate this we have re-analysed published 4C data comparing the DNA-DNA interactions of 10 specific loci in resting peripheral blood mononuclear cells (PBMCs), and cycling stimulated PBMCs and lymphoblastoid cells.
We detect a number of differences in the frequencies of chromatin interactions between these cycling and resting cells, and propose that these interactions are the result of replication factory-induced reorganisation of the chromatin. We have confirmed several of these replication-specific interactions by our newly developed Repli-3C assay, a version of 3C that specifically reports on DNA proximities in replicating chromatin. We also combine these data with other available high-throughput chromatin conformation capture data sets, Repli-seq and other genome-wide assays, in order to better characterize transcription and replication factories. A fuller understanding of these fascinating cellular structures will be vital because the proximity of loci to each other within the nucleus is a crucial determinant of translocation probability upon DNA break formation. Thus the three-dimensional organisation of the genome during the dynamic processes of transcription and replication will influence the formation of genomic rearrangements, with implications for oncogenic transformation.

Poster Abstract 43
GSVMining: A global approach to detect genomic structural variations using next generation sequencing
Fei Sang, William Brown, Matt Loose
Genomic Structural Variations (GSVs) are usually characterized as alterations of large chromosomal regions including deletions, insertions, duplications, inversions and translocations. GSVs dramatically affect genome structure, which can be important for understanding the genetic variation within a population. Next Generation Sequencing (NGS) technology provides a platform to capture GSVs by high-throughput paired-end reads. Recent computational methods have revealed large numbers of GSVs for various species; however, accurate characterization of GSVs is still difficult and subject to many limitations. Here we introduce a novel global approach to detect GSVs with higher accuracy and lower false positive rates.
GSVMining can predict a wide variety of GSVs of all categories described above. We tested its performance in comparison with other methods, such as GASV and BreakDancer, using both simulated data and reads from a well-characterised rearranged Schizosaccharomyces pombe genome. We identified 4 large translocations and 2 large inversions in the S. pombe dataset, which agree with previously published data. In contrast, GASV only detected 2 large translocations, and BreakDancer discovered 2 large translocations and 1 large inversion. Both these packages suffered from higher false positive rates than GSVMining. However, in common with these other packages, GSVMining is inefficient in detecting GSVs occurring within repetitive regions, an issue which may be addressed with longer reads.

Poster Abstract 44
Copy number variation in Age-related macular degeneration
Stephen J Bridgett and Anne E Hughes
Queen's University Belfast, Centre for Public Health, RVH Site, Belfast, BT12 6BA
Copy number variation (CNV), due to for example duplication or deletion of part or all of a gene, occurs in normal healthy genetic variation, but can sometimes be associated with disease. A CNV can range in size from a kilobase to several megabases. Several programs, such as EXCAVATOR[1] and PatternCNV[2], have been developed for detecting CNVs in whole exome sequencing data, typically for detecting de novo CNVs in cancer samples. We are investigating genetic associations in Age-related macular degeneration (AMD), which is the leading cause of visual impairment in the elderly. It has previously been found that deletion of the Complement factor H related genes CFHR1 and CFHR3 is associated with protection from AMD [3]. From subsequent research it is suspected that deletion or duplication of other complement gene regions also has an effect.
To investigate this, Illumina 100bp paired-end targeted sequencing was performed on individually indexed AMD patient samples, in which 1,520 target regions were sequenced, including the Complement receptor genes (CR1, CR1L, CR2) and Complement factor-H and related genes (CFH, CFHR1 to CFHR5). Reads were aligned to the human reference with BWA and also with Novoalign. We hoped that the EXCAVATOR[1] or PatternCNV[2] programs would confirm CNVs in these complement genes; however, these programs were unsuccessful. From correspondence with the developers, EXCAVATOR is designed for whole exome sequencing data that typically contains 200,000 regions, whereas our data had 1,520 target regions, which they say was not sufficient for the FastCall statistics (the second step of EXCAVATOR) to detect any relevant copy number events. While PatternCNV detected a few CNVs (such as a single copy of X chromosome regions in samples from males), again a known CNV was not detected in the difficult complement gene cluster. Using read coverage alone to detect CNVs is challenging as read depths vary between and within the target regions, due partly to PCR GC bias and specific motifs within the regions. Moreover, these complement genes are complex as they contain a number of variable repeats of several exons. Finally, we extracted mini-genomes each containing only one gene (CFH, CFHR1, CFHR3, etc), then aligned all reads to these mini-genomes using Novoalign with its “-All” option (to enable alignment of reads to all good matches, rather than just the top match). We then used the Integrative Genomics Viewer (IGV) to manually identify alleles from the SNPs, to deduce the numbers of copies of each exon. It may be possible to partly automate this analysis by developing a script to analyse the GATK SNP/InDel calls, or perhaps there is already a tool for this that we're unaware of.
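As a rough illustration of how the manual allele inspection described above might be partly scripted, the following minimal sketch assigns the nearest simple allele copy ratio to a pair of read counts at one SNP. It is not the authors' method: the function name, the list of candidate ratios and the read counts are all hypothetical, and real data would need filtering for depth, mapping quality and the coverage biases noted above.

```python
# Hypothetical helper, not the authors' pipeline: candidate copy ratios
# (copies of allele A, copies of allele B) for total copy numbers 2 to 4.
# Allele balance alone cannot separate 1:1 from 2:2, so only reduced
# ratios are listed.
CANDIDATE_RATIOS = [(1, 1), (2, 1), (1, 2), (3, 1), (1, 3)]


def closest_allele_ratio(count_a, count_b):
    """Return the candidate ratio whose expected allele-A read fraction
    is nearest to the observed allele-A read fraction."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no reads cover this SNP")
    observed = count_a / total
    return min(
        CANDIDATE_RATIOS,
        key=lambda r: abs(observed - r[0] / (r[0] + r[1])),
    )


# Made-up read counts, for illustration only.
for counts in [(50, 48), (66, 34), (24, 76)]:
    print(counts, closest_allele_ratio(*counts))
```

A ratio other than 1:1 at several SNPs in the same exon would only hint at an extra copy; the per-allele counts themselves could in principle be taken from the allele depth values in a variant call file, which is an assumption about the input rather than something stated in the abstract.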
References: [1] Magi et al, "EXCAVATOR: detecting copy number variants from whole-­‐exome sequencing data", Genome Biology 2013, 14:R120. [2] Wang et al, "PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data", Bioinformatics, 2014 May 29. pii: btu363. [3] Hughes AE et al, "A common CFH haplotype, with deletion of CFHR1 and CFHR3, is associated with lower risk of age-­‐related macular degeneration", Nat Genet. 2006 Oct;38(10):1173-­‐7. Poster Abstract 45 High Throughput Deep Sequencing of Viral Blood-­‐Borne RNA Pathogens from Clinical Samples A. Trebes1, P. Piazza1, A. Ansari2, Camilla Ip1, D. Bonsall2, A. Brown2, E. Barnes2, R. Bowden1, D. Buck1 1
Wellcome Trust Centre for Human Genetics, University of Oxford, UK 2
Peter Medawar Building for Pathogen Research, University of Oxford, UK Blood-­‐borne viruses are notoriously difficult to isolate from their host to produce whole genome sequencing data. The potential for primer-­‐bias limits traditional amplicon based sequencing, thus modern viral RNA-­‐seq approaches are free from virus-­‐specific PCR1. Personalised medicine is becoming increasingly popular as an approach for treating human disease and the effective antiviral treatment of both Hepatitis C virus (HCV) and HIV are currently informed by viral genotyping and consensus sequencing to detect mutations associated with antiviral resistance. With high throughput deep sequencing, it is possible to detect resistance-­‐associated variants present at low frequency in the viral quasi-­‐species and to differentiate between multiple genotypes circulating within the same patient. The level of information obtained with high throughput sequencing technologies, will further our understanding of how mixed-­‐genotype infections, the development of novel mutations and viral evolution impact on infectiousness, virulence and levels of resistance to both established and new medicines. At The Wellcome Trust Centre for Human Genetics, with the STOP-­‐HCV Consortium (Peter Medawar Building for Pathogen Research), we have used the principles of a modified RNA-­‐seq workflow1 to sequence RNA viruses such as HCV and HIV from plasma total RNA. This metagenomic approach provides a correlation between the sequencing data and the clinically determined viral load, as well as an overview of the range of bacterial and viral species present in the host. In order to increase the coverage and/or throughput, we validated a custom probe-­‐based capture method to enrich for the viral RNA of interest, obtaining 100-­‐fold more sequencing reads per sample. 
Metagenomic libraries were sequenced as up to 96-­‐plex pools on the Illumina HiSeq2500 Rapid, while post-­‐enrichment libraries were sequenced in single Illumina MiSeq runs (up to 96-­‐plex pool per run). We have developed a pool of IDT xGen® Lockdown® probes targeting a diverse set of HCV genomes from multiple genotypes, along with an algorithm to optimise the choice of additional probes to cover new sequences. Here we compare the results of the metagenomic and probe-­‐based enrichment library preparations, with data analysis from a custom-­‐built pipeline (see poster by Ansari and Ip et al., entitled ‘A pipeline for inferring the diversity of intra-­‐host quasi-­‐species of Hepatitis C virus genomes sequenced with a new probe-­‐capture viral RNA-­‐Seq (Illumina) protocol’). Poster Abstract 46 Oxford Nanopore MinION™ Access Programme: a closer look at library preparation Mariateresa de Cesare1, Amy Trebes1, Camilla Ip1, Elizabeth Batty1, Rory Bowden1, David Buck1, Paolo Piazza1 1
Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, UK The introduction of High Throughput Sequencing (HTS) technologies has accelerated the pace at which genomic studies can be performed. Studies requiring whole genome or targeted re-sequencing
of tens of thousands of individuals are now possible. Similarly, metagenomic studies in both ecological and clinical samples have broadened our understanding of viral and bacterial communities in a variety of environments. Despite the advancements of sequencing technologies in the last decade, a major limitation of HTS remains the relatively short reads that can be produced. Single molecule sequencing platforms such as the PacBio RSII provide a remarkable opportunity to study complex, rearranged or repetitive region-containing genomes due to their capacity to generate reads several kilobases long. However, the size and the cost of this instrument, together with the relatively low amount of data generated and high error rate, limit its applications in areas such as cancer genomics, transcriptomics and population genetics. Oxford Nanopore Technologies Ltd has developed an alternative platform which is capable of very long reads and has a minimal footprint. As part of Cycle 1 of the MinION™ Access Programme (MAP), the small MinION device has been available to the High-Throughput Genomics (HTG) group at the Wellcome Trust Centre for Human Genetics. The instrument is comparable in size to a smartphone and can be operated from a laptop. The costs associated with this technology are not clear at this point but are expected to be lower than existing ones. Although the capacity of the MinION is as limited as that of the PacBio RSII, a higher capacity version of this platform (GridION) has also been developed by Oxford Nanopore. Despite the tremendous potential of the MinION at this early stage, it is easy to see that strategies to use low input material, maximise the median insert size and increase the proportion of library that contains a hairpin (used to reduce error rates) will greatly expand the scope and power of these devices. 
In line with the commitment to support cutting edge research, the HTG group has been investigating ways to improve the library preparation and expand the usability of the MinION. Poster Abstract 47 Comparative genomics of colour-­‐pattern variation in the two-­‐spot and ten-­‐spot ladybirds, Adalia bipunctata and Adalia decempunctata Tamsin M.O. Majerus School of Life Sciences, University of Nottingham Understanding the genetic basis of adaptation is a fundamental aim in evolutionary biology. A small number of well-­‐understood examples suggest that similar genetic changes can result in parallel evolutionary outcomes, even across divergent taxa. Ladybirds are colourful and popular insects, as well as economically important biological control agents. They are fascinating models for studies including invasive species, reproductive strategies, sexually transmitted diseases, host-­‐parasite interactions and male-­‐killing. Yet despite exhibiting such phenotypic and behavioural characteristics that inevitably impact upon their survival and reproductive fitness, their genomes are largely unstudied. Research has shown that colour-­‐patterns can influence survival and reproductive fitness. Ladybirds are an excellent model system to investigate colour-­‐pattern control, with many species exhibiting high levels of polymorphism in the colour-­‐patterns of their wing-­‐cases. The two-­‐spot and ten-­‐spot ladybirds, Adalia bipunctata and Adalia decempunctata, are particularly well-­‐suited, with over 200 different forms being described. Although little is known about the genetic basis of this polymorphism, breeding experiments show its inheritance is consistent with the segregation of alleles (variants) at a single (super)gene. It is also well understood in an ecological context, with several evolutionary forces important in maintaining colour-­‐pattern variation, including mimicry, thermal and industrial melanism, and sexual selection. 
Bright, contrasting colours also serve as warning-colours as ladybirds are toxic or distasteful to potential predators. Restriction-site Associated DNA (RAD)-sequencing is being used to investigate this question. Identifying sequenced markers that segregate with different phenotypes will allow construction of a linkage map of the region controlling colour-pattern and investigation of whether the same genetic regions and changes are involved in both species. In addition, the same data will provide markers linked to sex and will serve as the foundation for future research into the basis of male-killing, a common phenomenon in several insect species which has dramatic consequences for reproductive success. Poster Abstract 48 Searching for imprinted genes in Bombus terrestris using MRE and MeDIP data Kate Lee*, Harindra E. Amarasinghe* and Eamonn B. Mallon* *University of Leicester Methylation has been documented in insects. It has been mainly found within insect genes and has been linked to control of insect social structures. However, while methylation is a known mechanism of gene imprinting, imprinting has not yet been demonstrated in social insects. In this study we aimed to identify genes that could potentially be imprinted in Bombus terrestris (buff-tailed bumblebee). We used a set of data sampled from a single insect including RNA-seq for allele expression, MeDIP-seq to show areas of methylated DNA and MRE-seq to show areas of non-methylated DNA.
GATK coverage files of each of the data sets, CpG islands identified with CpG island searcher and custom Perl scripts were used to identify hemi-methylated, mono-allelically expressed transcripts. 19 potential genes of interest were discovered, many of which are involved in sex differentiation and differential development. These hemi-methylated mono-allelically expressed genes in bumblebee are likely to be imprinted, but further wet-lab studies will be required to confirm this. Poster Abstract 49 The retinoic acid pathway contributes to accelerated jaw growth in halfbeak fish Helen Gunter1, Shaohua Fan2, Arne Jacobs3, Maximilian Haas2 and Axel Meyer2 1
Edinburgh Genomics, University of Edinburgh 2
University of Konstanz 3
University of Glasgow Heterochronic shifts (alterations in developmental timing) can generate novel, adaptive phenotypes as the result of simple developmental switches and are thus ideal models for investigating the molecular basis of evolutionary change. Belonoid fishes, a group that includes needlefish and halfbeaks, achieved a vast array of craniofacial morphologies through a series of heterochronic shifts, contributing to their considerable evolutionary success in comparison to their sister group the medaka. In some species (needlefishes) both the upper and lower jaws are highly elongated, whilst in other species only the lower jaw is elongated (halfbeaks). Using a combination of RNA-seq, qRT-PCR
and functional analyses, we examined the molecular basis of accelerated heterochronic growth in the jaws of the halfbeak Dermogenys pusilla. These analyses identified a range of candidate genes, which are likely to underlie the accelerated jaw growth in the halfbeak. Notably we identified the retinoic acid (RA) pathway as a putative regulator of heterochrony in the jaws, an observation that was confirmed by ectopic application of RA at various developmental timepoints during heterochronic growth of the lower jaw. Analyses of genes that synthesise and degrade RA in the lower jaw of halfbeak indicate that a low RA environment is necessary to permit heterochronic growth in the lower jaw. We postulate that alterations to the expression levels of these genes may have contributed to the stunning diversity in jaw length observed in Belonoid fishes. Poster Abstract 50 Reducing the complexity of Next Generation Phage Display experiments: An automated procedure for processing sequence data and ranking results Andrew Warry1,2, Jonathan Owen3, Ben Maddison3, Richard Emes1,2, Kevin Gough1 1
School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, UK. 2
Advanced Data Analysis Centre, University of Nottingham UK. 3
ADAS UK, School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire UK. The application of next generation sequencing techniques to phage display protocols has greatly improved their robustness and utility in attacking a range of biological problems involving protein-protein
interactions. Here we describe a flexible bioinformatics pipeline, combining custom Perl scripts and publicly available software tools, which is applicable to a variety of next generation phage display (NGPD) panning protocols and library types. The large numbers of reads generated by direct deep sequencing of a phage display experiment are first subjected to a process of quality control and barcode de-multiplexing. The relevant read regions corresponding to the displayed peptide are translated in silico, after which the short peptide sequences obtained are clustered then ranked using statistical assessments (Log-Likelihood/Z-score) of their relative occurrence in different phage panning samples and their corresponding negative controls. The results are then reported in the form of simple colour coded Excel spreadsheets or HTML tables with linked frequency plots summarising the occurrence of each individual peptide across all samples in the particular phage display panning experiment. This allows for the selection of highly ranked peptides, with an increased amount of confidence, for further analysis, synthesis or expression. The pipeline is equally applicable to epitope or scFv (antibody fragment) phage display libraries and is currently being successfully applied to a number of projects concerned with epitope selection or antibody screening in various viral, bacterial and mammalian systems. Poster Abstract 51 IMEter v2.2: a new version of the IMEter now predicts the expression-increasing ability of introns in over 40 plant species Bradnam K.R.1, Korf I.F. 1,2, and Rose A.B.2 1
Genome Center, UC Davis, CA, USA, 2
Dept. of Molecular and Cellular Biology, UC Davis, CA, USA Many introns in a variety of species including plants, animals, and fungi are able to increase the expression of the gene that they are contained in. This process of intron-­‐mediated enhancement (IME) is most thoroughly studied in Arabidopsis thaliana, where it has been shown that enhancing introns preferentially occur near to the transcription start site (TSS) and appear compositionally distinct from downstream introns. Although it is desirable to experimentally test introns to ascertain whether they enhance expression or not, this is not always practical, especially in species for which no resources are available for genetic transformation. We have previously developed a computational tool (the IMEter) that can predict the degree of enhancement for any Arabidopsis intron. IMEter scores in Arabidopsis thaliana are typically highest for those introns that occur in the first 500 nt of the transcript; we previously observed a similar pattern when applying our Arabidopsis-­‐trained IMEter to introns from eight other plant species. In our current work, we have leveraged data from the Phytozome database (http://phytozome.net), and are now able to produce species-­‐specific IMEters that have been trained in over 40 diverse plant species. There are significant challenges when developing an IMEter to work with draft genome sequences. Most notably, gene annotations are often incomplete and lack UTR features. This is important for two reasons: firstly, 5' UTRs often contain introns, and secondly accurate IMEter scores require training from genes in which we know the correct distance of introns from the TSS. Our software pipeline filters out spurious gene annotations to produce a robust data set for training the IMEter. Additionally, we have also made a number of improvements to the algorithm that the IMEter uses. 
These improve the overall accuracy of the IMEter by reducing the conflating effect of certain dinucleotide biases that also appear enriched in regions of the transcript that are closest to the TSS. Such biases appear unrelated to any IME effect on increasing gene expression. We hope that our IMEter v2.2 resource will enable people to more easily discover candidate introns that may enhance gene expression in their species of interest. Software and data relating to the IMEter are available from http://korflab.ucdavis.edu and http://github.com/Korflab/IME. Poster Abstract 52 Using transcript counting to define a gene expression profile through vertebrate development John E. Collins, Ian Sealy, Neha Wali, Christopher M. Dooley, Peter Clarke, James Morris, Jeffrey Barrett, Derek L. Stemple and Elisabeth M. Busch-­‐Nentwich Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom Transcript counting by 3’ mRNA pull down and Illumina sequencing is a convenient method for rapid transcriptome analysis. We have developed a pipeline to convert an RNA sample into a gene list and called it the differential expression transcript counting technique (DeTCT). This high-­‐throughput technique enables us to process many collections of samples from various conditions with numerous biological replicates. Thus we can efficiently produce a list of Ensembl identifiers showing differential transcript abundance which link the biological sample to the wealth of publicly available gene ontology and network information. By extracting total nucleic acid from single zebrafish embryos, 12 mutant and 12 wild type, we are able to establish the genotype of each zebrafish from the DNA and a corresponding transcription profile from the RNA. Using gene knockouts from the Zebrafish Mutation Resource (ZMP) we have produced molecular phenotypes for various gene perturbations. 
To assess the influence of individual embryos in these analyses, for example to identify developmental delay in the mutants, we have used transcript counting to produce wild-type transcriptome signatures for eight zebrafish developmental stages. These data have also given us extensive co-expression information and together with the gene knockout data we have been able to begin assembling gene networks. We will present an update of our DeTCT pipeline and the data produced from the gene expression baseline analysis. Poster Abstract 53 Integrating RNA-seq with whole genome sequencing to investigate the causes of common variable immune deficiency disorders Emma E Davenport1, Pauline van Schouwenberg2, Anne-Kathrin Kienzler2, WGS500 Consortium, Helen Chapel2, Smita Patel2 and Julian C Knight1 1
Wellcome Trust Centre for Human Genetics, University of Oxford 2 Department of Clinical Immunology, University of Oxford Whole genome sequencing (WGS) has the potential to radically advance our understanding of the heritable basis of human disease, but analysis is currently typically focused on variants within the coding exome. Here, we describe how we have sought to leverage the potential of WGS by combining it with global transcriptomic profiling through RNA-seq to understand the key pathways and biological processes involved in disease and to interrogate noncoding variants. We applied this approach to common variable immune deficiency disorders (CVID), a group of typically sporadic diseases in which insufficient quantity and quality of immunoglobulin usually leads to susceptibility to recurrent bacterial infections. Heterogeneity in the disease phenotype in terms of clinical features and complications has made defining the genetic aetiology challenging, with candidate gene studies and one reported genome-wide association study having limited success. To address the underlying immunopathogenesis of CVID, we conducted WGS, as part of the WGS500 project, for a cohort of 35 CVID patients. WGS500 is a collaborative project between the University of Oxford and Illumina which aims to sequence the genomes of 500 individuals with a range of diseases including rare inherited diseases, immunological disorders and cancer. Individuals were initially screened for rare variants in genes previously associated with CVID or other primary immunodeficiency diseases. We identified one individual with a novel variant in the BTK gene which causes X-linked agammaglobulinemia. Further work is being carried out to verify this finding and the patient was removed from further analysis. The cohort included 31 sporadic patients. 14,457,884 variants were identified across these individuals. We employed filters to prioritise high quality calls for rare, non-synonymous
and likely pathogenic variants. In addition we were able to exclude technical artefacts using MAF for samples within WGS500. This resulted in 4,422 variants in 3,524 genes. Ingenuity Variant Analysis revealed enriched biological pathways including CD28 signalling in T helper cells and the role of NFAT in regulation of the immune response. Applying a biological context filter of “immunodeficiency” further reduced the candidates to 964 variants in 813 genes. To resolve potentially disease causing regulatory variants, we generated RNA-seq data for three CVID patients and three healthy controls. The gene expression data were integrated with the WGS to help prioritise non-coding variants. For example, one individual had 8 compound heterozygous variants which passed the biological context filter and had reduced expression of the corresponding gene compared to healthy controls. Additionally, we used the RNA-seq data to identify 262 genes which were differentially expressed between CVID patients and healthy controls. Ingenuity Pathway Analysis revealed a number of enriched pathways which overlapped with those from the WGS, providing further evidence for those important in the underlying immunopathogenesis of CVID. This work illustrates some of the challenges associated with WGS in clinical samples but also the power of combining multiple sequencing techniques to provide a better understanding of complex disease. Poster Abstract 54 PAREnet: a tool for degradome assisted discovery and visualization of small RNA/target interaction networks Leighton Folkes 1, Dominic Smith1, Matthew Stocks 2, David Swarbreck1, Tamas Dalmay3, Vincent Moulton 2,4, Simon Moxon 1,4 1
The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH, United Kingdom. 2 School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, United Kingdom. 3
School of Biological Sciences, University of East Anglia, Norwich, NR4 7TJ, United Kingdom. 4
Joint corresponding authors. Small RNAs (sRNAs) are an important class of short (20-­‐24nt) non-­‐coding RNAs which regulate gene expression in both plants and animals. Recent studies on sRNA interactions have shown that many do not operate independently, but instead can form part of a larger, more complex, regulatory interaction network. However, these studies have been carried out on only a tiny subset of all sRNAs, such as miRNAs, or were based on computational predictions using sequence complementarity and not empirical evidence. A new technique called Parallel Analysis of RNA Ends (PARE) or more commonly known as degradome sequencing can be used to capture a snapshot of the mRNA degradation profile on a genome-­‐wide scale from which we are able to extract clear signals of position specific sRNA mediated mRNA cleavage. Computational methods now exist to rapidly analyse the degradome to identify and validate sRNA/target interactions for all sRNAs obtained from a next-­‐generation sequencing experiment. The resultant sRNA/mRNA interactions evidenced through the peaks in mRNA degradation signal can be used to identify regulatory interaction networks on a genome-­‐wide scale. Several computational methods have been described and used to identify such networks. However, these methods have relied upon in-­‐house computational pipelines that are not publicly available. Here we describe a new publicly available, user-­‐friendly, interactive software tool that allows users to build, visualize and investigate sRNA interaction networks which are supported by genome-­‐wide degradome analysis. sRNAs play important roles in diverse processes such as pathogen response, development, reproduction and stress response and we reason that large scale regulatory networks of sRNA interactions are also involved in such diverse processes. 
By using our approach, we have been able to discover new regulatory interaction networks as well as provide a new tool that requires no computational expertise to use.
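To make the idea of an sRNA interaction network concrete, the following minimal sketch treats each degradome-supported sRNA/target pair as an edge of an undirected graph and groups nodes into connected components, each of which is a candidate regulatory network. It is not PAREnet's implementation. The function names and the identifiers in the example are hypothetical placeholders, and the sketch assumes sRNA and mRNA identifiers never collide.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (sRNA, target) pairs."""
    graph = defaultdict(set)
    for srna, target in interactions:
        graph[srna].add(target)
        graph[target].add(srna)
    return graph


def connected_components(graph):
    """Return a list of node sets, one per connected sub-network."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical validated interactions: two sRNAs share one target.
pairs = [("sRNA_a", "mRNA_1"), ("sRNA_b", "mRNA_1"), ("sRNA_c", "mRNA_2")]
for component in connected_components(build_network(pairs)):
    print(sorted(component))
```

Here sRNA_a and sRNA_b fall into the same component because they share a target, while sRNA_c forms a separate one.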
Poster Abstract 55 Comparison of variant calls from Illumina TruSeq v3 and HiSeq v4 data at different cluster densities Karim Gharbi1,2, Timothée Cézard1, Urmi Trivedi1, Tony Miles1, Richard Talbot1, Javier Santoyo-Lopez1, Mark Blaxter1,2, and Mike Watson1 1
Edinburgh Genomics, School of Biological Sciences, The University of Edinburgh, EH9 3JT, UK 2
Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3JT, UK The recent release of the v4 chemistry for the Illumina HiSeq 2500 promises throughput of up to 1 Tb per run at a reduced price point per gigabase relative to the previous v3 kits. Early validation tests carried out in our facility suggest that the v4 upgrade will deliver on its promise of higher throughput and lower cost. However, there is anecdotal evidence that HiSeq v4 data generated from over-clustered
lanes may behave differently from TruSeq v3 data in variant calling pipelines. Here we systematically compared variant calls from TruSeq v3 and HiSeq v4 data at different cluster densities. We used whole genome re-sequencing data generated from Coriell human samples as control datasets to interrogate the effect of SBS chemistry and cluster density on variant calls using standard data analysis pipelines. We highlight where differences exist between the two chemistries and make recommendations for rigorous validation of new chemistries as they are released. Poster Abstract 56 Identification of novel disease genes in craniosynostosis Miller KA1, Twigg SRF1, Taylor IB1, Fenwick AL1, McGowan SJ2 and Wilkie AOM1,3 1
Clinical Genetics Group, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK. 2
Computational Biology Research Group, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK, 3
Craniofacial Unit, Oxford University Hospitals NHS Trust, Oxford, UK. Congenital malformations, single or multiple structural abnormalities present at birth, are the leading cause of infant morbidity and mortality, affecting 2-3% of live births. Approximately one third of all congenital malformations involve the craniofacial structures, where premature fusion of the bone plates in the skull (craniosynostosis) occurs with a prevalence of 1 in 2,200. This can develop in isolation (non-syndromic) or in combination with a variety of other birth abnormalities (syndromic). Craniosynostosis patients are often born with an irregularly shaped skull that can lead to increased intracranial pressure, often associated with persistent headaches, learning difficulties and vision problems in later years. Patients often have to undergo surgery at a young age to avoid these complications and correct the shape of the head. Craniosynostosis is a clinically and genetically heterogeneous disorder that has the characteristics of a multifactorial trait. Although a number of genes have been implicated in a variety of craniosynostosis-related conditions, approximately 75% of affected individuals do not yet have a genetic diagnosis. We have recruited a cohort of over 1000 craniosynostosis patients from four main centres in the UK and are utilising a combination of sequencing techniques to increase our understanding of the genetic control of congenital cranial malformation. We have trialled a Fluidigm multiplexing and Ion PGM sequencing strategy to identify novel variants in 9 genes identified as interesting candidates by exome or whole genome sequencing. Using this methodology we were able to rapidly re-sequence 205 amplicons, totalling 36,784 base pairs, in 432 affected subjects (in duplicate) with high sensitivity and specificity. We have discovered novel variants in at least 5 of these genes by this method, highlighting the power of this strategy. 
This sequencing approach provides us with the ability to systematically and inexpensively identify putative variations and estimate their pathogenicity across a large dataset, providing a foundation to delineate novel genetic pathways underlying this condition. Poster Abstract 57 Strategies for the assembly of the 1.2 Gb genome of the ruff (Philomachus pugnax; Aves) Judith Risse1, Michael Stocks2, Clemens Kuepper2, Anna Montazam1, Jenna Nichols1, Karim Gharbi1,3, Terry Burke2 and Mark Blaxter1,3 1
Edinburgh Genomics 2
University of Sheffield 3
Institute of Evolutionary Biology, University of Edinburgh We are assembling the genome of a wading bird, the ruff (Philomachus pugnax), as part of a wider study into the genetic basis of its complex mating behaviour. We are particularly interested in the performance of mate-­‐pair and long read data in scaffolding genomes of this size and complexity. We have generated ~100x coverage of the genome from 3 Illumina TruSeq short insert libraries (200 bp: 25x; 400 bp: 27x; 600 bp: 44x) and 2 Nextera mate-­‐pair (MP) libraries (3 kb: 10x; 5 kb: 29x) sequenced using 150 base paired-­‐end (PE) reads on the Illumina HiSeq 2500 in rapid mode, and approximately 10x PacBio using v3 SMRT cells (P4-­‐C3 chemistry). We assessed the performances of CLC, ABySS and Masurca assemblers using PE and MP reads, and then integrated the PacBio data with the assemblies, comparing Cerulean long read scaffolding with PBJelly gap filling followed by Quiver error correction. Assembly quality was assessed using standard contig metrics (i.e. N50, mean contig length, cumulative contig graph), CEGMA scores and mapping of the assembly to the chicken genome using nucmer. ABySS yielded the best MP-­‐scaffolded assembly with an N50 size of 984kb and ~72,000 scaffolds > 100 bp. The number of nonATGC bases was ~85 million. After gap-­‐filling and cleaning, 57,000 scaffolds > 100 bp remained with an N50 size of ~1.2 Mb and the number of nonATGC bases dropped to ~16 million. These excellent statistics are offset by assessment of completeness using CEGMA: only 140 complete genes (55%) and 217 partial (87.9%) were found. However this is close to the values found for the “complete” chicken genome (151 complete and 216 partial). When the ABySS mate-­‐pair assembly was gap-­‐filled with PBJelly, the CEGMA report showed three genes were no longer complete and one partial was gained. 
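For readers unfamiliar with the N50 metric quoted above, the following short sketch shows how it is conventionally computed: sort the contig or scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size. The function name and the example lengths are illustrative only and are not taken from the ruff assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that takes the running total to half the
        # assembly size is the N50.
        if running >= half_total:
            return length


# Toy example: total length 32, so half is 16; 8 + 8 reaches it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why it is paired here with CEGMA completeness and alignment to the chicken genome.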
When comparing the best ABySS assembly to the PBJelly gap-filled assembly using nucmer plots against chicken, we identified a significant number of collapsed repetitive regions in the gap-filled assembly. In order to determine the origin of these misassemblies, we assessed the standalone scaffolder SSPACE for MP scaffolding. One hypothesis is that the contigs of the assembly end in repetitive or low-complexity regions, which subsequent scaffolding then uses as anchors. We are now exploring the effect of masking repetitive and low-complexity regions in the original PE assembly prior to scaffolding. In the future, we will also super-scaffold the genome into linkage groups with RADSeq data, and map the loci underpinning male mating types.

Poster Abstract 58
Rearrangement of the 3D genome by a disease-causing mutation in patient cells
Ian Sudbery1*, Valentina Caputo2*, Joana Rodriguez2, David Sims1, Andreas Heger1, Anastasios Karadimitris2
1 CGAT, MRC Functional Genomics Unit, University of Oxford
2 Center for Haematology, Imperial College London
* These authors contributed equally
Inherited GPI deficiency is a very rare genetic disorder that causes seizures and thrombosis. A single C>G mutation at position -270 within the core promoter of the housekeeping gene PIGM has been identified as the causal change in two independent families. The -270C>G mutation lies within a binding site for the transcription factor SP1. Mutants lose binding of SP1, acetylation of histones in the PIGM promoter and expression of PIGM in B-lymphocytes. This change can be reversed in cell lines as well as in patients by treatment with the broad-spectrum HDAC inhibitor sodium butyrate. We performed 4C-seq on patient-derived lymphoblastoid cell lines to analyze the effect of this change on the 3D arrangement of the genome around the PIGM gene. In wild-type cell lines, the PIGM locus makes a number of strong and reproducible contacts with other genomic loci within 1 Mb of the PIGM gene. These contacts are lost in cell lines derived from patients, where the PIGM locus interacts only with sites directly adjacent to the gene. Thus we demonstrate for the first time that changes in genome architecture are associated with changes in transcription factor binding and histone acetylation resulting from a disease-causing genetic mutation.

Poster Abstract 59
Recent updates to the TGAC Browser project
Anil S. Thanki, Xingdong Bian, Robert P. Davey
The Genome Analysis Centre, Norwich Research Park, Norwich NR4 7UH, UK
TGAC Browser is an open-source web-based genome browser developed at TGAC. Being a web-based client, it utilises JavaScript libraries to provide a fast and intuitive genome browsing experience. We focus on harnessing Internet architectures as well as localised HPC hardware, concentrating on improved, more productive interfaces and analytical capabilities. We present a new updated version of TGAC Browser, with support for more data formats, new visual types and analysis implementations. A notable update sees the browser being able to process and visualise data directly from next-generation sequencing (NGS) data output, supporting formats such as BAM/SAM [1], BigWig/wig [2], GFF [3], and VCF [4]. TGAC Browser chooses how to visualise genomic data based on the type and depth of the data, which is both more informative to the user and more memory-efficient: bar charts and heat maps condense large amounts of information, alongside classical individual tracks with glyphs. TGAC Browser is also able to utilise analytical frameworks to increase its functionality. For example, we have implemented the concept of a BLAST Manager, which can run multiple concurrent BLAST analyses, either locally or on a remote HPC installation. All BLAST analyses are available as a selectable list, allowing previous results to be shown and hidden. A beta version of a new manual annotation feature is now available, whereby a user can annotate features directly in the TGAC Browser instance; these annotations can then be persisted on the server, reloaded, and shared at a later date. They can also be exported as a GFF or JSON [5] file and sent to a project curator for subsequent inclusion into the core database. We are also developing a browser component to address syntenic regions of potentially highly fragmented genome references, allowing visualisation of homologous genes shared between various species based on information loaded into the Ensembl Compara database schema [6]. The synteny browser will be incorporated into TGAC Browser in future.
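As a rough illustration of the export step described above, the sketch below serialises one manually curated feature as a GFF3 line and as JSON. The `Annotation` fields, the `manual_annotation` source tag and the example coordinates are hypothetical and are not taken from TGAC Browser's own code; only the nine-column, tab-separated GFF3 layout follows the published specification [3].

```python
import json
from dataclasses import dataclass, asdict
from urllib.parse import quote


@dataclass
class Annotation:
    """A hypothetical manual annotation; field names are illustrative only."""
    seqid: str
    feature_type: str
    start: int   # 1-based, inclusive (GFF3 convention)
    end: int     # 1-based, inclusive
    strand: str  # "+", "-" or "."
    feature_id: str
    note: str = ""


def to_gff3(a: Annotation, source: str = "manual_annotation") -> str:
    """Format one annotation as a tab-separated, nine-column GFF3 line."""
    # Percent-encode attribute values so ';', '=', ',' and tabs cannot break column 9.
    attrs = [f"ID={quote(a.feature_id, safe=' ')}"]
    if a.note:
        attrs.append(f"Note={quote(a.note, safe=' ')}")
    columns = [
        a.seqid, source, a.feature_type,
        str(a.start), str(a.end),
        ".",          # score: not recorded for a manual annotation
        a.strand,
        ".",          # phase: only meaningful for CDS features
        ";".join(attrs),
    ]
    return "\t".join(columns)


def to_json(a: Annotation) -> str:
    """Serialise the same annotation as a JSON object."""
    return json.dumps(asdict(a), sort_keys=True)


# Hypothetical feature, for illustration only.
example = Annotation("scaffold_12", "gene", 1050, 2300, "+",
                     "curated_0001", "putative kinase; needs review")
gff_line = to_gff3(example)
json_text = to_json(example)
print(gff_line)
print(json_text)
```

Keeping both serialisers on one record type means a curator receives the same fields whichever format is chosen; a real exporter would also need to write the `##gff-version 3` header line once per file.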
Demo: http://tgac-browser.tgac.ac.uk
Source Code: https://github.com/tgac/tgacbrowser
Email: [email protected]
References:
1. http://genome.sph.umich.edu/wiki/SAM
2. http://genome.ucsc.edu/goldenPath/help/bigWig.html
3. http://www.sequenceontology.org/resources/gff3.html
4. http://genome.ucsc.edu/FAQ/FAQformat.html#format10.1
5. http://en.wikipedia.org/wiki/JSON
6. http://www.ensembl.org/info/genome/compara/

Poster Abstract 60
RNAseq of matched oral precancer and cancer highlights candidates with a potential role in malignant transformation
Lucy F. Stead1*, Caroline Conway1,2*, Preetha Chengot1, Catherine Daly1, Rebecca Chalkley1, Lisa Ross1, Alastair Droop1 and Pamela Rabbitts1
* These authors contributed equally to the work
1 Leeds Institute of Cancer and Pathology, University of Leeds, LS9 7TF, UK.
2 Now at School of Biomedical Sciences, University of Ulster, Coleraine, Co. Londonderry, BT52 1SA, UK.
Oral squamous cell carcinoma (OSCC) is one of the top ten most prevalent cancers in the world. Prognosis is poor and quality of life is commonly reduced for patients who survive. Most OSCC progresses via a premalignant stage called dysplasia. Effective treatment of dysplasia prior to malignant transformation, or the ability to predict the 10-20% of dysplasias that will progress to OSCC, is an unmet clinical need. To further understand the biology of dysplasia progression, and to attempt to identify therapeutic targets and markers of early disease, we performed RNA sequencing of 19 matched, HPV-negative patient trios: normal oral mucosa, dysplasia and associated OSCC. Our approach ensured that we captured strand-specific information on both coding and non-coding genes in matched samples for the first time. We performed differential gene expression, principal component and correlated gene network analysis using these data. Alongside novel coding and non-coding candidates for involvement in oral dysplasia development and malignant transformation, our results highlighted potential therapeutic approaches for the treatment of oral dysplasia. Our work demonstrates how a systematic approach to analyzing large, genome-wide datasets can be used to make biological inferences of clinical importance.

Poster Abstract 61
Population resequencing of a highly diverse disease vector – The Anopheles gambiae 1000 genomes project
Alistair Miles and Dominic Kwiatkowski
University of Oxford, on behalf of the Anopheles gambiae 1000 genomes consortium (http://www.malariagen.net/ag1000g)
The closely related mosquito species Anopheles gambiae and Anopheles coluzzii are the predominant vectors of malaria in Africa, where the burden of disease remains high. Controlling malaria continues to be a major public health challenge, and current efforts depend on vector control measures including insecticide-treated bed-nets and indoor spraying of insecticides. These efforts are being undermined by evolutionary changes within mosquito populations, including the emergence and spread of insecticide resistance, and changes in mosquito feeding behaviour. The Anopheles gambiae 1000 genomes project (Ag1000G) has been established to provide a foundation for the next generation of research into malaria vector control, by studying natural genetic variation in mosquito populations spanning sub-Saharan Africa. The Ag1000G consortium comprises members from 13 institutions, and has to date completed deep sequencing of over 1700 whole genomes of mosquitoes sampled from 13 different countries. Work on SNP discovery in phase 1 of the project, comprising an initial cohort of 765 samples from 8 countries, has recently been completed, and has provided the first whole-genome view of the astonishing natural diversity within this species. A total of 39 million autosomal SNPs passed all quality filters, an average of 1 SNP every 5 base pairs.
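The density figure above can be checked with simple arithmetic. The sketch below uses only numbers quoted in this abstract; treating each sample as diploid and pooling all phase 1 samples into one set of chromosomes are simplifying assumptions made here for illustration, and the variable names are ours.

```python
import math

# Figures quoted in the abstract.
n_snps = 39_000_000      # autosomal SNPs passing all quality filters
bp_per_snp = 5           # reported average spacing: 1 SNP every 5 bp
n_samples = 765          # phase 1 cohort
rare_threshold = 0.005   # allele frequency of 0.5%, quoted later in the abstract

# Span of sequence implied by the reported SNP count and density.
implied_span_bp = n_snps * bp_per_snp

# Assumption: diploid samples, pooled across all populations.
n_chromosomes = 2 * n_samples

# Largest whole number of allele copies that still falls below the 0.5% threshold.
max_rare_copies = math.floor(rare_threshold * n_chromosomes)

print(f"Implied span: {implied_span_bp / 1e6:.0f} Mb")
print(f"Sampled autosomal chromosomes: {n_chromosomes}")
print(f"Copies corresponding to <0.5% frequency: at most {max_rare_copies}")
```

Under these assumptions the quoted density corresponds to roughly 195 Mb of sequence, and an allele below 0.5% frequency is one seen in at most 7 of the 1530 sampled chromosomes.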
This map of natural variation will be published shortly as a community resource, under Fort Lauderdale conditions. A preview of this dataset, comprising a SNP call set for 103 Ugandan samples, has very recently been published, and is available via the Ag1000G web site at www.malariagen.net/ag1000g. In addition to genotypes available for bulk download in VCF or HDF5 formats, a novel Web application has been developed, which provides a highly interactive experience for browsing variation data across multiple genome scales, from base-pair resolution up to entire chromosomes, available at www.malariagen.net/apps/ag1000g. Work is now beginning on analysis of the major population-genetic features of these data. Early results indicate that nucleotide diversity is high (~10^-2) and the scale of linkage disequilibrium is short (<500 bp) in all populations except Kenya, consistent with very large effective population size. Most variation is rare, with 72% of SNPs found at an allele frequency below 0.5%. There is evidence for strong population structure, which is primarily driven by the division between sub-populations representing the two incipient species Anopheles gambiae and Anopheles coluzzii, although we confirm that Guinea-Bissau is a region of ongoing hybridisation. We are also able to see for the first time the profound genetic impact of vector control measures. In mosquitoes from coastal Kenya, where a major programme of bed-net distribution has been undertaken, we find high rates of inbreeding, with all samples exhibiting long runs of homozygosity. We also confirm the introgression of an insecticide resistance allele from Anopheles gambiae into Anopheles coluzzii in two populations where hybridisation is rare. These findings demonstrate the immediate public health relevance of these data. The project is ongoing, and current foci include haplotype estimation, population genetic structure and history, and evidence for recent selection.

Poster Abstract 62
Evolution, Health And Disease: Understanding The Role Of Genomic Regulation And Variation Across The Spectrum
Lisa Skipper1, Graham Etherington1, Federica Di Palma1,2, The Cichlid Genome Consortium and the Rabbit Genome Consortium.
1 Vertebrate and Health Genomics, The Genome Analysis Centre (TGAC), Norwich, UK.
2 The Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Proliferation of an organism is dependent on its ability to expertly adapt to environmental challenges; indeed, health and disease in an individual can be viewed as two extremes (success and failure) in the spectrum of life processes. Vertebrate organisms, including humans, share many of the key life processes such as regulation of gene expression, signal transduction pathways and cell cycle regulation. Here, we use comparative genomics and the study of non-traditional model organisms as two powerful and complementary approaches to understand genomes in the context of health and disease. Our projects span 1) the functional annotation of vertebrate genomes (including model organisms and key agricultural species) to identify and interpret non-coding functional elements and the regulation of gene expression by these; 2) the study of model organisms to answer specific biological questions, as well as continuing to shed light on genes, pathways and molecular mechanisms involved in complex traits and how they might have evolved. The diverse East African cichlids from the Great East African Lakes and the humble bunny rabbit are two examples of model organisms that have provided insight into the relationship between natural and artificial selection in the evolution of traits. We sequenced the genomes and transcriptomes of five lineages of African cichlids, which gave rise to all East African cichlid radiations. We describe how a number of molecular mechanisms shaped East African cichlid genomes, and how accumulation of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification. We also generated a high-quality reference genome for rabbit and compared it to re-sequencing data from populations of wild and domestic rabbits.
Our data show a polygenic basis for phenotypic change during rabbit domestication, targeting genes affecting brain and neuronal development.

Poster Abstract 63
Premium Sponsors