conference abstracts - Genome Science 2015

Genome Science
Biology, Technology & Bioinformatics
Department of Zoology & Lady Margaret Hall, Oxford
1st-3rd September 2014
Speaker Abstracts
www.genomescience.org.uk
#ukgs2014

Day 1: Monday 1st September 2014

Session 1: Opening Remarks & Keynote Lecture
Time: Monday 1st September 2014 3.00pm-4.00pm
Location: Lecture theatre A
Chair: Chris Ponting (MRC Functional Genomics Unit & CGAT, University of Oxford)

From genome to phenome – the promise of big data
Professor Jackie Hunter, Chief Executive, BBSRC

The investment in basic research made by funders, such as the BBSRC, to enable the deciphering of genetic information has been generating large amounts of new information, which has the potential to underpin a bioscience-driven industrial revolution. From improved livestock and crops, through biotechnology, to a greater understanding of how biological systems function, the data being generated now and in the future will have a major impact on society. But such a ‘data deluge’ brings challenges:
- how to standardise, collate, annotate and store data, especially phenotypic data, in an accessible format
- what data to store
- how data across scales can be linked to generate new understanding and knowledge
- how to ensure clinical data are appropriately managed and information on relative risks of disease etc. is communicated appropriately
- how to incentivise and reward scientists for making data available and working more collaboratively in areas such as data standardisation
- how to ensure the translation of new knowledge into benefit proceeds as rapidly as possible within an appropriate regulatory environment

BBSRC has played a key role in supporting basic, innovative research in both technologies and experimentation that enabled this data generation in the past. In the future BBSRC will seek new ways to enable the derivation of knowledge from data and its translation into societal benefit.

Parallel session 2a: Emerging Technologies
Time: Monday 1st September 2014 4.05pm-6.45pm
Location: Lecture theatre A
Chair: Mike Quail (Wellcome Trust Sanger Institute)

Methods and Devices for DNA/RNA Sequencing Using Nanopores
Clive Brown, Oxford Nanopore

Nanopore-based molecular analysis platforms are being commercialized in 2014 after 20 years of background academic research and 7 years of commercial development with over $80M spent. Significant technical challenges needed to be solved in order to bring such devices to market. These include reliable ways to make nanopore arrays, custom sensing circuits, efficient capture and processing of DNA molecules through the nanopores at a measurable speed, and the tailoring of nanopores to give enough signal-to-noise for suitable algorithms to generate sequence data. Nanopore sequencing of DNA/RNA easily produces very long reads, with minimal sample preparation, at high throughput and low cost. The technology is already competitive for certain applications, and all of these attributes will improve over time as it is incrementally refined. This presentation will describe these key technical challenges and how they have been solved, with reference to their embodiment in the MinION™ instrument. Insights will be given on possible future developments of nanopore technology and what kind of devices and new applications may result.

Single-molecule DNA analysis with SIMDEQ™
Chas André1, Jimmy Ouellet2, Gordon Hamilton1, David Bensimon2 and Vincent Croquette2
1 PicoSeq SAS, 74 Rue Lecourbe, Paris 75015, France
2 Laboratoire de Physique Statistique, École Normale Supérieure, 24 rue Lhomond, Paris 75005, France

PicoSeq is developing a novel technology capable of high-resolution mapping, full sequencing, and detection of a wide range of epigenetic modifications on individual unamplified molecules of DNA.
This approach is called SIMDEQ (for ‘SIngle-molecule Magnetic DEtection and Quantification’) and is based on the ‘magnetic trap’, an instrument designed to manipulate and visualise individual DNA molecules attached to micron-sized magnetic beads. Although the magnetic trap was originally used to study DNA replication and repair, we are currently exploiting the unique abilities of the system to extract both genetic and epigenetic information from tethered DNA. In addition to powerful mapping and sequencing capabilities, a unique feature of our technology is that it can be used to detect and locate the position of specific binding molecules directly on the DNA. This opens up an exciting range of new applications, including precise mapping of modifications such as 5-methylcytosine without prior chemical conversion of the DNA. SIMDEQ can therefore potentially be used to perform DNA analyses currently available with existing technologies, and to bring a whole new level of understanding to the emerging field of epigenetics. We will discuss recent results demonstrating full sequencing and high-resolution mapping of modified bases.

G&T-Seq: Combined DNA and RNA sequencing from a single cell
Iain C. Macaulay1, Parveen Kumar2, Yang Li3, Tim Xiaoming Hu3, Wilfried Haerty3, Nathalie Saurat4, Rick Livesey4, Mubeen Goolam5, Magdalena Zernicka-Goetz5, Mabel Teng1, Stephan Lorenz1, Chris Ponting1,3, Thierry Voet1,2
1 Wellcome Trust Sanger Institute-EBI Single Cell Genomics Centre, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
2 Laboratory of Reproductive Genomics, Department of Human Genetics, KU Leuven, Belgium.
3 MRC Computational Genomics Analysis and Training Programme, MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK.
4 Gurdon Institute, University of Cambridge, Cambridge, UK.
5 Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, UK.

Advances in genome or transcriptome sequencing from many single cells are offering a unique perspective from which to investigate cellular heterogeneity in development and disease. Here we present a novel method, G&T-seq, which permits simultaneous sequencing of the genome and the transcriptome from the same single cell. G&T-seq provides whole-genome-amplified genomic DNA and full-length transcript sequence and, with automation, 96 samples can be processed in parallel. Using cancer cell lines and other models, we have explored the relationship between DNA copy number and gene expression at the single cell level. From single cells from the breast cancer cell line HCC38 and matched normal control cells, several thousand transcripts were detected per cell, while low coverage genome sequencing demonstrated that copy number variants observed in bulk.

Capture Hi-C (cHi-C) identifies the chromatin interactome of colorectal cancer risk loci
Gabriele Migliorini1*, Roland Jäger1*, Marc Henrion1*, Helen Speedy1, Radhika Kandaswamy1, Andreas Heindl2, Nicola Whiffin1, Maria J Carnicer3, Laura Broome4, Nicola Dryden4, Takashi Nagano5, Stefan Schoenfelder5, Martin Enge6, Yinyin Yuan2, Jussi Taipale6, Peter Fraser5, Olivia Fletcher4, Richard S Houlston1
1 Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, UK
2 Division of Molecular Pathology, Institute of Cancer Research, London, UK
3 Division of Molecular Pathology, HO Research Unit, Institute of Cancer Research, Sutton, UK
4 Breakthrough Breast Cancer Research Centre, Institute of Cancer Research, London, UK
5 Nuclear Dynamics Programme, The Babraham Institute, Cambridge, UK
6 Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, Stockholm, Sweden
* contributed equally

Multiple regulatory elements distant from their targets on the linear genome can influence the expression of a single gene through chromatin looping. Chromosome conformation capture implemented in Hi-C allows for genome-wide agnostic characterization of chromatin contacts; however, detection of functional enhancer-promoter interactions is precluded by its resolution. We have developed a capture Hi-C (cHi-C) approach to agnostically identify these physical interactions at state-of-the-art resolution on a genome-wide scale. Single nucleotide polymorphisms associated with complex diseases often reside within regulatory elements and exert effects through long-range regulation of gene expression. Applying this novel cHi-C approach to 14 colorectal cancer risk loci has allowed us to identify key long-range chromatin interactions in cis and trans involving these loci.

Parallel session 2b: Genome Evolution
Time: Monday 1st September 2014 4.05pm-6.45pm
Location: Lecture theatre B
Chair: Chris Ponting (MRC Functional Genomics Unit & CGAT, University of Oxford)

Sex-specific selection and genome evolution
Judith Mank, University College London

Males and females of most species share nearly the entire genome, and yet they use many of their shared genes in radically different ways, leading to conflict over optimal transcription and ultimately differences in expression levels between males and females (sex-biased gene expression). Differences in male- and female-specific evolutionary pressures affect the sex chromosomes and autosomes in different ways, leading to several explicit predictions about the distribution and evolution of sex-biased genes. First, altering sex-specific selection should elicit a response in sex-biased gene expression, and this response should be more pronounced for genes linked to sex chromosomes. Second, the degree of sex-biased expression on the sex chromosomes should correlate with sex chromosome age. Case studies using both comparative and experimental evolutionary frameworks will be presented to address these predictions.

Understanding the biology of pluripotent stem cells from flatworms
Aziz Aboobaker, Department of Zoology, University of Oxford

Planarian flatworms have an unrivalled capacity to regenerate due to a population of totipotent adult stem cells. Previous studies attempting to understand the biology of these cells have suggested that they have features in common with both embryonic stem cells and germline stem cells from other animals. Here we present an analysis of extant RNAseq-based experiments characterising the expression profile of this cell population. We present our attempts to rationalise and combine datasets from different labs and show that simple comparisons of different de novo transcriptomes and reported expression levels give very low correlation between experiments from different groups. However, by re-analysing raw data, remapping to the same consolidated transcriptome, and basing the comparison of expression in different cellular compartments on transformed and ranked expression values, we find that independent studies with different experimental paradigms implicate a core set of genes as having enriched expression in planarian stem cells. The approaches we use should be of use to other analyses looking to compare datasets with the same biological question in mind but with differing or independent experimental paradigms.

Seeking lncRNA function through experimental and computational genomics
Chris Ponting, CGAT & MRC Functional Genomics Unit, University of Oxford

Thousands of human lncRNA loci have been discovered yet only a handful have had their functions experimentally defined.
A large-scale lncRNA knockout project, exploiting multiple targeting strategies, will be required to determine the full range of contributions that lncRNAs make to human biology and disease. In the meantime, either computational or experimental investigation of individual lncRNA loci can provide insights into lncRNA mechanism. In our computational approaches, we have investigated the level of sequence conservation and constraint on lncRNA sequence. Our experimental approaches demonstrate that many lncRNAs are biologically consequential, regulating gene expression levels in trans by altering either rates of transcription or levels of transcripts. Our results indicate that a large number of lncRNAs act to regulate transcript levels by competing for binding to miRNAs. Computationally, it is striking that: (i) mammalian lncRNA exons appear constrained, yet their level of constraint is far removed from that for protein-coding regions, and that (ii) fruitfly lncRNA exons appear moderately constrained, presumably as a consequence of this species’ much higher effective population size. It is thus clear that only a small fraction of nucleotides in lncRNA transcripts transact their functions. In our experimental approaches, we have investigated the transcriptional and post-transcriptional mechanisms of individual lncRNAs. We have acquired strong evidence from CHART-Seq and other experiments that two lncRNAs (Paupar and Dali) regulate the expression levels of genes in trans by associating with specific genomic locations via their physical association with RNA-binding proteins. This regulation is dependent on lncRNA levels, which demonstrates that these lncRNAs are not just transcribed and associated with proteins and DNA, but that these interactions are biologically consequential.

Efficient recovery of complete organelle genomes and nuclear genes from genomic NGS data by baiting and iterative mapping
Christoph Hahn, Evolutionary Biology Group, School of Biological, Biomedical and Environmental Sciences, University of Hull, Kingston upon Hull, UK

Recent advances in sequencing technologies have vastly enhanced the generation of genomic data for non-model species. Organelle genomes are important tools in evolutionary biology, but their identification and assembly from genomic data is often far from straightforward, especially in the absence of closely related references, as is often the case for non-model organisms. Here, I present an efficient bioinformatics pipeline for the automated reconstruction of mitochondrial genomes directly from genomic NGS data. The approach, dubbed MITObim (MITOchondrial baiting and iterative mapping), uses seed sequences (even from distantly related taxa) to identify an initial putative organelle readpool, which is then assembled and iteratively extended until the number of reads converges due to the circular nature of the organelle genome. I demonstrate the efficiency and sensitivity of the approach using real and simulated Illumina data. The pipeline is freely available at http://github.com/chrishah/MITObim. Phylogenomic analyses are usually based on initial de novo assemblies of nuclear genomes, followed by the identification of suitable gene models for all involved taxa. This is a computationally highly demanding and currently non-automated task representing a pronounced bottleneck for many phylogenomic studies. I describe here the applicability of our baiting and iterative mapping approach for the targeted assembly of nuclear genes, potentially enabling the construction of large-scale multi-gene phylogenomic datasets without the need to initially assemble the complete nuclear genomes of the taxa in question.
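
The baiting and iterative mapping idea described in the abstract above can be illustrated with a toy sketch. This is not the MITObim pipeline: the function names (kmers, bait_and_iterate), the use of shared k-mers as the baiting criterion, the parameter k and the example sequences are all assumptions made for illustration, and the assembly step is replaced by simply reusing the k-mers of the recruited reads as new bait. The loop stops when the number of recruited reads no longer grows, mirroring the convergence criterion mentioned in the abstract.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bait_and_iterate(seed, reads, k=6, max_iter=100):
    """Grow a read pool from a seed until no new reads are recruited."""
    bait = kmers(seed, k)   # k-mers that currently act as bait
    pool = set()            # reads recruited so far
    for _ in range(max_iter):
        # Baiting step: recruit every read sharing at least one k-mer.
        recruited = {r for r in reads if kmers(r, k) & bait}
        if len(recruited) == len(pool):
            break           # the read count has converged
        pool = recruited
        # Stand-in for assembly: the pool's k-mers become the new bait.
        for read in pool:
            bait |= kmers(read, k)
    return pool


# Hypothetical toy data: overlapping reads from one short "organelle"
# sequence, plus two unrelated low-complexity reads.
genome = "ACGTTGCAAGGCTTACCGATAGGCTCAATCGGATCCATGA"
organelle_reads = [genome[i:i + 12] for i in range(0, len(genome) - 11, 4)]
nuclear_reads = ["TTTTTTTTTTTT", "GGGGGGGGGGGG"]

pool = bait_and_iterate(genome[:10], organelle_reads + nuclear_reads)
print(len(pool), "of", len(organelle_reads + nuclear_reads), "reads recruited")
```

In this sketch each round can only reach reads that overlap the current pool, so the pool extends outward from the seed one overlap at a time. A real pipeline would assemble and map reads with tolerance for sequencing errors, which exact k-mer matching does not provide.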

The phylogenetic effect on the abundance of transposable elements in Nematoda
Amir Szitenberg1, Mark Blaxter2 and David H. Lunt1
1 Evolutionary Biology Group, School of Biological, Biomedical & Environmental Sciences, University of Hull, Hull, HU6 7RX, UK
2 Institute of Evolutionary Biology, The University of Edinburgh, EH9 3JT, UK

Comparative genomics is a rich source of biological information where the variability of genomic features such as gene content, gene expression, intron length, synteny, linkage groups and transposable elements can reveal much about species diversification. Traditionally, comparative evolutionary studies of morphological traits have taken account of the phylogenetic effect on character variability. This comparative approach emphasizes that characters in phylogenetically clustered taxa may covary, regardless of any environmental or selective forces shaping their evolution. Phylogenetic controls should also be central to any comparative genomics inference, although this is currently less common. Here we present an approach to comparative genomics which will account for phylogenetic covariance of traits. As a case study, we estimate the importance of phylogenetic effects on transposable element (TE) loads and diversity in Nematoda species. We have quantified and classified the TEs in 37 Nematoda genome assemblies. TE loads appear unrelated to the N50 length of the assembly (a proxy of the assembly quality) and represent a biological signal. TE loads were observed to be highly associated with the phylogenetic relationships among the nematode species as well as with the overall mutation rate differences among the lineages. Since the TE counts and branch length are correlated, we propose a method to correct the TE counts by the branch length of their respective lineages. Our results demonstrate the importance of a well-informed phylogenetic framework in the study of the rapidly evolving transposable elements and genomic features in general.
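
For readers unfamiliar with the N50 statistic used above as a proxy of assembly quality, a minimal sketch follows. It uses the common definition (the contig length at which contigs of that length or longer contain at least half of the assembled bases); the function name n50 and the example lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating assembled bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # at least half of the assembly is covered
            return length


# Hypothetical contig lengths for a tiny assembly.
print(n50([2, 3, 4, 5, 6]))
```

Comparing against twice the running sum keeps the arithmetic in integers and avoids rounding questions when the total is odd.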

Day 2: Tuesday 2nd September 2014

Parallel session 3a: Big Data Analysis
Time: Tuesday 2nd September 2014 9.00am-12.45pm
Location: Lecture theatre A
Chair: Chris Yau (Wellcome Trust Centre for Human Genetics, Oxford)

Quantifying the true extent of heterogeneity in the cancer genome: Advanced computational methods for characterising uncertainty in tumour heterogeneity deconvolution
Chris Yau, Wellcome Trust Centre for Human Genetics, University of Oxford

The existence of multiple genetically distinct tumour sub-populations is a widely recognised confounding problem in the analysis and interpretation of genomic data derived from heterogeneous tumour samples. Whilst a number of computational techniques are available for deconvolving DNA and RNA sequence data, most fail to characterise the statistical uncertainty involved in the deconvolution problem, reporting only a single characterisation of the underlying sub-clonal architecture. This can lead to misleading quantitative measures of tumour heterogeneity. In this talk I will discuss the computational problems involved in tumour heterogeneity deconvolution which lead to a “Big Data, Bigger Models” challenge. I will then describe a novel simulation algorithm (the “Hamming Ball Sampler”) that allows us, for the first time, to do tractable and full Bayesian statistical inference for the tumour deconvolution problem applied to whole genome sequences. I will illustrate, with examples, the utility of the algorithm for quantification of uncertainty in tumour heterogeneity deconvolution and its advantages over methods reporting only point estimates. Joint work with Michalis Titsias and Paul Kirk.

Modeling gene expression heterogeneity between individuals and single cells
Oliver Stegle, EBI, Hinxton

The analysis of large-scale expression datasets is compromised by hidden structure between samples. In the context of genetic studies, this structure is linked to differences between individuals, which can either reflect their genetic makeup (such as population structure) or be traced back to environmental and technical factors. In this talk, I will discuss statistical methods to reconstruct this structure from the observed data and account for it in genetic analyses. These models increase the power and robustness of expression quantitative trait loci studies and yield new insights into the co-expression structure between genes. In the second part of my talk I will extend this class of latent variable models to applications in single cell transcriptomics. We develop and validate an approach to account for confounding heterogeneity between individual cells.
In applications to a Th2 differentiation study, we show how this model allows for dissecting expression patterns of individual genes and reveals new substructure between cells that is linked to cell differentiation.

Reconstructing the 3D architecture of the genome
Jean-Philippe Vert, Institut Curie & Mines ParisTech, Paris

Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA–DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. I will present PASTIS, a new method to infer a consensus model of DNA structure from Hi-C data. I will also discuss how DNA structure is related to gene regulation in Plasmodium falciparum, the parasite responsible for malaria.

Disease, networks and epistasis
Caleb Webber, MRC Functional Genomics Unit, University of Oxford

I will give an overview of our recent work in identifying the pathways and processes underlying complex disorders, illustrating how different functional genomics resources can each provide novel biological insights into the same phenotype-influencing gene network, and test the “same pathway, same phenotype” hypothesis on a large and systematically phenotyped cohort. The topologies of identified networks can identify pathway loading (both additive and epistatic) along with the direction in which the pathway is perturbed, thereby inviting drug repurposing. I will also illustrate some of the novel integrative functional genomics approaches that we’ve been applying in large GWA/exome disease studies, demonstrating that different populations converge on the same pathways despite little overlap in the variant genes.

From genetic association to function: insights from genomic mapping of regulatory variants in immune disease phenotypes
Julian C Knight, Wellcome Trust Centre for Human Genetics, University of Oxford

Genome-wide association studies (GWAS) have highlighted the extent of genetic associations with susceptibility to common immune-mediated diseases. However, understanding the functional basis of these associations and delivering translational utility remains a significant challenge to the field. Recent work has implicated non-coding regulatory variants as responsible for most reported GWAS associations while we, and others, have demonstrated that such variants are major drivers of diversity in the immune response transcriptome. The talk will discuss approaches we are taking to try to establish functional links between immune phenotype-associated regulatory genomic and epigenomic variation, and specific modulated genes and pathways. I will describe insights from the application of expression quantitative trait locus (eQTL) mapping to analyse genomic modulators of the global transcriptomic response in different primary immune cell populations and in response to innate immune stimuli. This work shows how local and distant eQTL can be defined involving immunoregulatory variants showing evidence of disease association, with trans-regulatory loci enabling the discovery and dissection of gene networks informative for disease. Further progress in this area will require interpretation of associated variants in the context-specific epigenomic landscape in which they may act, together with evidence establishing mechanism, for example based on mapping chromatin interactions and application of genome editing techniques.

Bayesian non-parametric methods for modelling transcription and its regulation
Magnus Rattray, University of Manchester

We are using Bayesian non-parametric methods to model omic time course data.
I will present a method for modelling the dynamics of transcription using polymerase (pol-II) ChIP-Seq time course data. This allows us to infer elongation rates and promoter activity profiles. We are also integrating pol-II data with expression data (from RNA-Seq) through a model of transcription and degradation. Finally I will present a method for inferring regulatory networks from expression time series data. We use Gaussian processes as flexible non-parametric models of time-varying quantities which can be fitted to data using Bayesian methods.

Parallel session 3b: Environmental Genomics
Time: Tuesday 2nd September 2014 9.00am-10.30am
Location: Lecture theatre B
Chair: Dawn Field (Centre for Ecology & Hydrology, Wallingford) & Peter Kille (University of Cardiff)

Sequencing the Earth: Genomic Observatories and Ocean Sampling Day
Dawn Field, Centre for Ecology & Hydrology, Wallingford

We are now sequencing the Earth. From the discovery of DNA to the completion of the human genome sequence, genomics projects are now escalating in size and ambition towards giving us a view of DNA on Earth. The number and types of projects in the field of genomics, including metagenomics and DNA barcoding (eDNA), are growing quickly towards the planetary scale. This now includes the formation of an international network of Genomic Observatories. This talk will focus on the first action of this network, Ocean Sampling Day (OSD). This is a DNA sequencing campaign designed to help understand the role of microbes in the ocean. The June 2014 solstice sampling event involved over 180 marine research sites across the globe and represented the first simultaneous sequencing campaign of such size and geographic spread.

Wolbachia in nematodes: unexpected associations and palaeosymbiology
Mark Blaxter1,2, Georgios Koutsovoulos1, Dominic Laetsch1 and Charles Opperman3
1 Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3JT
2 Edinburgh Genomics, University of Edinburgh, Edinburgh EH9 3JT
3 Plant Nematode Genetics Group, Department of Plant Pathology, NC State University, Raleigh, NC 27695, USA

Wolbachia are alpha-proteobacteria, famous as parasitic symbionts of arthropods, especially insects, where they manipulate host reproduction to assure their own transmission. Wolbachia are also found in nematodes, and were first identified in human-parasitic filarial species. The association with filarial nematodes appears to be mutualistic, and the parasitic phenotypes that characterise the insect associations are missing. These Wolbachia are now a promising drug target for treating human filarial disease. We are interested in the origins and current maintenance of this putatively mutualistic symbiosis, and the coevolution of host and symbiont genomes. Using whole genome data we have surveyed additional filarial species for infection with Wolbachia, and identified a closely related outgroup species that lacks the symbiont: this firmly identifies the phylogenetic point-of-origin of the symbiosis. We have also identified Wolbachia infections in other nematodes, including Radopholus, a plant parasite only distantly related to filarial nematodes. We are also using nuclear integrations of Wolbachia DNA to identify "fossil" symbioses in species that no longer have living infections. Several filarial nematodes are Wolbachia-free, but have Wolbachia nuclear insertions, indicating that they have lost their symbionts. Excitingly, we have also identified Wolbachia nuclear insertions in Dictyocaulus viviparus, a strongylid nematode, which reveal that this group was once host to a Wolbachia.
The fossil fragments of Dictyocaulus Wolbachia have features (and a phylogenetic placement) that illuminate the biology of the symbiont that gave rise to the important filarial Wolbachia Towards understanding the functional and taxonomic repertoire of microbial communities using the EBI metagenomics portal Rob Finn, EBI, Hinxton The application of metagenomics, the shot-­‐gun sequencing of DNA extracted from environmental samples, is widespread. With the diminishing cost of sequencing the computational analysis of data is arguably the greatest burden when using metagenomic approaches. The European Bioinformatics Institute (EMBL-­‐EBI) offers a freely available analysis resource for the archiving the characterisation of metagenomic sequences. The analysis pipeline, which integrates tools produced by both the EMBL-­‐EBI and the wider research community, will be outlined, high lighting some of the features that allow us to scale with the ever-­‐increasing demand. The results of the analysis platform will be illustrated using some of the public projects already available from the portal. Parallel session 3c: Proffered Abstracts Time: Tuesday 2nd September 2014 11.15am-­‐12.45pm Location: Lecture theatre B Probabilistic modelling of Carbon Copy Chromatin Conformation Capture (5C) and ChIP-­‐
Seq profiles reveal a high-­‐resolution spatial genomic proximity network controlling epidermal keratinocyte differentiation K.Poterlowicz1, J.Yarker1, N.Naumova2, B.Lajoie2, A.Mardaryev1, A.Sharov3, J.Dekker2, V.Botchkarev1,3, and M.Fessing1 1
University of Bradford 2
University of Massachusetts Medical School 3
Boston University School of Medicine During development, the execution of distinct cell differentiation programs is accompanied by establishing specific higher-­‐order chromatin arrangements between the genes and their regulatory elements. The Epidermal Differentiation Complex (EDC) locus contains multiple co-­‐regulated genes involved in the epidermal keratinocyte (KC) differentiation. Here we applied a probabilistic approach for the investigation of properties of chromatin architecture. Furthermore, we characterise the high-­‐
resolution spatial genomic proximity network of a 5Mb region containing the EDC and its flanking regions in mouse epidermal KCs. This was done by modelling data obtained from the 5C experiments and a set of eighteen ChIP-Seq profiles for histone modifications, chromatin architectural and remodelling proteins. The analysis reveals that a substantial number of the spatial interactions at the EDC overlap with chromatin states involving regulators of chromatin architecture. These include the genome organizer Satb1, the cohesin subunit Rad21, the deacetylase complex subunit Sin3a and the H3K4-specific demethylase Rbbp2, suggesting that these proteins act to control KC-specific folding of chromatin at the EDC locus. We confirmed, using both 5C and 3D FISH, that chromatin at the 5Mb genomic locus spanning the EDC and its flanking regions forms several topologically associated domains (TADs) with similar borders. Moreover, it showed markedly different intra-domain folding in KCs versus thymocytes (TCs), e.g. two adjacent TADs at the EDC central part were more condensed and non-randomly folded in KCs versus TCs. In summary, our probabilistic approach allows us to suggest an involvement of the chromatin architecture and remodelling proteins in the spatial interaction network of gene cis-regulatory regions controlling co-ordinated gene expression at the EDC locus. It provides an important platform for further studies of the higher order chromatin folding at KC-specific genomic loci involved in controlling gene expression programmes in skin epithelia in health and disease.

Epigenomic profiling of the MHC transactivator CIITA using an integrated ChIP-seq and genetical genomics approach
Daniel Wong1, Wanseon Lee1, Peter Humburg1, Seiko Makino1, Evelyn Lau1, Vivek Naranbhai1, Benjamin P Fairfax1, Kenneth Chan2, Katharine Plant1, Julian C Knight1
1Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK; 2William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
Many genes within the Major Histocompatibility Complex (MHC) region of the genome have well characterized roles in immunity and have been implicated in a wide range of diseases including inflammatory arthritis, diabetes and multiple sclerosis. Similarly, genetic diversity within the MHC, as illustrated by the different extended MHC haplotypes, has been associated with predisposition to several of these diseases. Master transactivator molecules CIITA and NLRC5 regulate genes within the MHC and are important constituents of the machinery that shape the epigenomic landscape across different cell types. We have profiled CIITA-mediated activity in primary human B cells and monocytes using chromatin immunoprecipitation in conjunction with high-throughput sequencing (ChIP-seq). Genome-wide analysis indicates a broader role for CIITA than has been described to date, in that CIITA binds not only in proximity to genes within the MHC, but also in proximity to genes outside the MHC with roles in immune function and susceptibility to infectious disease. Our approach included expression quantitative trait locus (eQTL) mapping to resolve differentially expressed genes associated in trans with a CIITA sequence variant, and these were integrated with targets of CIITA. It allowed for an examination of cell type specificity underlying CIITA-mediated regulation, and also identified gene networks through common variants modulating gene expression locally and at a distance through CIITA. We further validated trans association with expression of protein encoded by HLA genes within the MHC. We will extend this approach to profiling NLRC5 and also histones in different immune cell types. The results will be employed as a route-map for subsequent epigenomic (ChIP-seq) and gene expression (RNA-seq) analyses of individuals with distinct, extended MHC haplotypes. We anticipate that this will inform the continued refinement of epigenetic pharmacology as applied to inflammatory and immune diseases.

Population and single cell transcriptomics reveal the Aire-dependency, composition and relief from Polycomb silencing of self-antigen expression in thymic epithelia
Stephen N. Sansom1, Noriko Shikama-Dorn2, Saule Zhanybekova2, Gretel Nusspaumer2, Iain C. Macaulay3, Mary E. Deadman4, Andreas Heger1, Chris P. Ponting1,3, Georg A. Holländer2,4. 1
CGAT, MRC FGU, University of Oxford. 2
University of Basel. 3
Wellcome Trust Sanger Institute-­‐EBI Single Cell Genomics Centre, Wellcome Trust Sanger Institute. 4
Department of Paediatrics, University of Oxford Promiscuous gene expression (PGE) by thymic epithelial cells (TEC) is essential for generating a diverse T cell antigen receptor repertoire tolerant to self-­‐antigens, and thus for avoiding autoimmunity (a process known as central tolerance). Nevertheless, the extent and nature of this unusual expression program within TEC populations and single cells is unknown. Using deep transcriptome sequencing we found TEC populations to be capable of expressing up to 19,293 protein-­‐coding genes, the highest number of genes known to be expressed in any cell type. Remarkably, in mouse mTEC, the Auto Immune Regulator AIRE alone positively regulates 3980 of these promiscuously expressed genes which are otherwise tissue-­‐restricted in expression. Notably, the tissue specificities of these genes include known targets of autoimmunity in human AIRE deficiency. Led by an observation that Aire-­‐induced genes are generally characterized by a repressive chromatin state in somatic tissues, we found Aire up-­‐regulated genes in mTEC to be strongly associated with H3K27me3 marks. Our findings are consistent with AIRE targeting and inducing the promiscuous expression of genes previously epigenetically silenced by Polycomb group proteins. While it has been suggested that the composition of Aire dependent gene expression may be either stochastic or clonal (potentially recapitulating developmental lineages) at the single cell level, this has not been previously assessed at genome scale. We therefore analysed the transcriptomes of hundreds of single mTEC using a quantitative microfluidics approach. We found that Aire-­‐dependent genes are expressed stochastically at low cell frequency and that furthermore, when expressed, Aire-­‐dependent transcript levels were 16-­‐fold higher, on average, in individual TEC than in the mTEC population; findings with important implications for the understanding of central tolerance. 
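The relationship between low expression frequency and the 16-fold difference reported above can be illustrated with a toy calculation. The sketch below is not the authors' analysis: it assumes a simple on/off mixture model in which a gene is either silent (zero transcripts) or expressed at one fixed level in each cell. All function names and all numbers other than the 16-fold ratio are hypothetical. Under that assumption, the ratio of the level in expressing cells to the population mean is the reciprocal of the expressing fraction.

```python
def expressing_fraction(fold_enrichment: float) -> float:
    """Fraction of cells expressing a gene under an on/off mixture model.

    Assumption (hypothetical, not from the abstract): expressing cells
    share one level x and all other cells have zero transcripts, so the
    population mean is f * x and fold_enrichment = x / (f * x) = 1 / f.
    """
    if fold_enrichment < 1:
        raise ValueError("fold enrichment must be >= 1 under this model")
    return 1.0 / fold_enrichment


def population_mean(level_when_on: float, fraction_on: float) -> float:
    """Mean transcript level across all cells in the on/off model."""
    return level_when_on * fraction_on


if __name__ == "__main__":
    f = expressing_fraction(16.0)
    print(f"implied expressing fraction: {f:.4f}")
    # Illustrative level of 32 units in expressing cells (made-up value).
    print(f"population mean for level 32: {population_mean(32.0, f):.1f}")
```

Real single-cell data will deviate from this idealisation (expression levels vary between cells and detection is imperfect), so the implied fraction is only an order-of-magnitude guide.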
Parallel session 4a: Bioinformatics Infrastructure Time: Tuesday 2nd September 2014 2.30pm-­‐5.45pm Location: Lecture theatre A Chair: Mick Watson (Roslin Institute) The ELIXIR bioinformatics infrastructure: Data, Computing and Services to Communities Niklas Blomberg, ELIXIR The mission of ELIXIR is to construct and operate a sustainable infrastructure for the sharing of biological information throughout Europe, to support life science research and drive its translation to medicine and the environment, the bio-­‐ industries and society. The challenges in storing, integrating and analyzing the data from modern biological experiments are real; ELIXIR meets this challenge through a distributed e-­‐infrastructure of bioinformatics services built around established European centers of excellence. This talk will discuss some of the challenges in meeting the transformation of biological research into a big data driven science: handling, analyzing and archiving large and also highly diverse data-­‐sets. As ELIXIR is currently embarking upon its construction phase it has commissioned pilot-­‐actions to look at issues related to accessing very large datasets. Furthermore the talk will discuss experiences in data integration and the need for establishing data-­‐management plans within projects that address the issues of meta-­‐data annotation and long term archiving. Building a Better Medical Genome Michael A. Eberle1, Peter Krusche1, Richard J. Shaw1, Morten Kallberg2, Subramanian S. Ajay2, Zoya Kingsbury1, Carri-­‐Lyn R. Mead2, Zamin Iqbal3, Gil McVean3, Sean Humphrey1, Elliott H. Margulies1, David Bentley1 1
Illumina Cambridge Ltd., Chesterford Research Park, Saffron Walden, Essex, CB10 1XL, UK 2
Illumina Inc., 5200 Illumina Way, San Diego, CA 92122, USA 3
Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN Advances in high-­‐throughput sequencing technology now make it possible to rapidly sequence individuals and call variants ranging from single base substitutions to large structural events. As sequencing increasingly moves into clinical applications, it is important that we systematically assess and improve the accuracy of these variant calls made by standard informatics pipelines and, where needed, develop targeted informatics pipelines to call medically relevant variants. In order to assess and improve existing sequence analysis pipelines, we have developed a high quality truth data set of variant calls based on whole genome sequencing of the parents and eleven children from the CEPH/Utah pedigree 1463. We combined all the variant calls made using a variety of variant calling pipelines along with haplotype transmission information from the pedigree. We identified and phased over 4.8M SNPs, 650k indels (between 1 and 50bp in size) and over 3,400 large indels/CNVs. This comprehensive, pedigree-­‐validated variant catalogue has been used to improve the sensitivity of the Isaac SNP and indel calls by 1.5% and 35%, respectively with effectively no reduction in precision. In parallel with our work to improve our standard informatics pipelines, we are developing targeted informatics approaches to call medically actionable variants. For example, repeat expansions that can extend up to 1000s of bp in length are a challenge for next generation sequencing where current read lengths are not sufficient to span the full length of the repeat. While single physical reads may not span the full length of the expansion, the number of reads that are within the repeat can be used to differentiate pathogenic from non-­‐pathogenic alleles for most repeat expansions. 
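The read-counting idea described above can be illustrated with a back-of-the-envelope estimate. The sketch below is not the authors' algorithm: it assumes uniform coverage, error-free reads and a triplet repeat of 3 bp per unit. The function name, parameters and default values (including a per-allele depth of 15x) are hypothetical. It estimates how many reads are expected to lie entirely within a repeat tract, a count that grows with tract length once the tract exceeds the read length.

```python
def expected_in_repeat_reads(repeat_units: int,
                             unit_bp: int = 3,
                             read_len: int = 100,
                             allele_depth: float = 15.0) -> float:
    """Expected number of reads lying entirely inside a repeat tract.

    Assumes uniform coverage: reads start at a rate of
    allele_depth / read_len per base, and a read is fully inside the
    tract only if it starts within the first
    (tract_bp - read_len + 1) bases of the tract.
    """
    tract_bp = repeat_units * unit_bp
    valid_starts = max(0, tract_bp - read_len + 1)
    return allele_depth / read_len * valid_starts


if __name__ == "__main__":
    # Illustrative repeat counts only; thresholds for pathogenicity are
    # locus-specific and are not modelled here.
    for units in (30, 60, 200, 645):
        n = expected_in_repeat_reads(units)
        print(f"{units:>4} repeats ({units * 3:>5} bp): ~{n:.1f} in-repeat reads")
```

Under these assumptions a tract shorter than one read yields no fully contained reads, which is why the count of in-repeat reads can separate long expansions from short alleles even though no single read spans the expansion.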
Based on this idea, we have developed an algorithm that can detect pathological repeat expansions in one or more specified repeat regions in a given sample using paired 100mer reads from whole genome sequence data. To test our method we sequenced eight Coriell samples with known, quantified FMR1 triplet repeat expansions. Though the repeat counts in these samples range from 76 to 645 repeats (228bp to 1,935bp), our algorithm correctly detected mutations deemed to be pathogenic in all eight of these samples. Furthermore the calls of pathogenic mutations were significantly separate from calls in data from a random, uncharacterised “control” set of 964 samples sequenced to 30x depth and this method was able to reject FMR1 repeat expansion pathogenicity (>=60 repeats) in over 97.5% of the uncharacterised samples. Analyzing large cohorts without losing your mind: GATK's new reference model pipeline for variant discovery Geraldine Van der Auwera, Broad Institute, Cambridge, MA, USA Variant discovery is greatly empowered by the ability to analyse large cohorts of samples rather than single samples taken in isolation, but doing so presents considerable challenges. Variant callers that operate per-­‐locus (such as Samtools and GATK’s UnifiedGenotyper) can handle fairly large cohorts (thousands of samples) and produce good results for SNPs, but they perform poorly on indels. More recently developed callers that operate using assembly graphs (such as Platypus and GATK’s HaplotypeCaller) perform much better on indels, but their runtime and computational requirements tend to increase exponentially with cohort size, limiting their application to cohorts of hundreds at most. In addition, traditional multisample calling workflows suffer from the so-­‐called “N+1 problem”, where full cohort analysis must be repeated each time new samples are added. 
To overcome these challenges, we developed an innovative workflow that decouples the two steps in the multisample variant discovery process: identifying evidence of variation in each sample, and interpreting that evidence in light of the evidence gathered for the entire cohort. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and much faster) on one sample at a time. This decoupling hinges on the use of a novel method for reference confidence estimation that produces a genomic VCF (gVCF) intermediate for each sample. The new workflow enables fast, highly accurate and computationally cheap variant discovery in cohort sizes that were previously intractable: it has already been applied successfully to a cohort of nearly one hundred thousand samples. This replaces previous brute-force approaches and lowers the threshold of accessibility of sophisticated cohort analysis methods for all, including researchers who do not have access to large amounts of computing power.

Maintaining the long-term value of NGS data
Matthew Addis, Arkivum
The use of Next Generation Sequencing is growing in both clinical and research contexts, with whole exome or whole genome sequencing becoming increasingly commonplace as the costs of sequencing, analysis and storage all fall. The extended coverage of this data compared to targeting specific genes means it has longer-term value and wider re-use scenarios. This raises the question of how to protect this value over the long term so the data can be re-used in the future with confidence in its origin, integrity and reliability. This talk will look at the challenges of NGS data preservation, including preserving the context around the data as well as the data itself.
For example, capturing and recording how a sample was extracted, prepared, sequenced and capturing a description of the pipeline that the resulting sequence data is put through will all influence repeatability, reproducibility and reuse of the data. In a clinical setting, QC and validation is an essential part of this record. This talk will explore these issues with suggestions on how information security, digital preservation and risk management can all be brought to bear on the problem. Addressing the most challenging variant calling in next generation sequencing and assessing its performance Marghoob Mohiyuddin1, John Mu2, Jian Li1, Narges Bani Asadi1, Mark Gerstein3, Alexej Abyzov4, Wing Wong5, Hugo Lam1 1
Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065, USA 2
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA 3
Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA 4
Mayo Clinic, Department of Health Sciences Research, Rochester, MN 55905, USA 5
Department of Statistics, Stanford University, Stanford, CA 94305, USA
A key challenge in genomic analysis is accurate structural variant (SV) detection owing to the complexity of SVs. SVs are genomic rearrangements formed by various mechanisms and vary widely in size, making it almost impossible to detect them completely with the relatively short reads from next-generation sequencing (NGS) using any single algorithm. Each algorithm has its own limitations and is sensitive only to certain kinds of SVs, with varying degrees of accuracy and resolution. Nevertheless, to date, SV merging tools such as SVMerge are still limited in accuracy and precision because different SV detection methods are treated uniformly. While a couple of tools such as iSVP might consider different methods, their merging is limited to removing duplicates. Many of the widely used SV detection tools are also not supported by these tools without modification. Here we present MetaSV (Method-aware SV detection), an algorithm for accurate and method-aware SV calling. It merges SVs from VCFs detected by multiple methods and by multiple tools for a method. In contrast to just taking either the inner or outer bounds of the merged SVs, it resolves SV breakpoints based on the resolution of the methods, i.e., it is method-aware. It attempts to recover missing zygosity from alignments and resolve conflicts based on the specificity of the methods in different regions. Local assembly with dynamic programming is used to provide further validation of SVs and to enhance breakpoint precision. One approach to validate the accuracy of our algorithm is simulation. Realistic simulation validation frameworks are essential for an unbiased comparison of the performance of high-throughput sequencing analysis algorithms. To this end, we developed VarSim, an integrated simulation and validation framework that leverages state-of-the-art read simulation tools and vast annotation databases to generate realistic high-throughput sequencing reads and report detailed accuracy statistics. VarSim first generates a phased diploid genome using variants from existing annotations and novel sites; this includes real insertion sequences when simulating structural variants. Next, reads are simulated from this diploid genome using empirical error models derived from publicly available sequencing data. After alignment and variant-calling on the simulated reads, VarSim reports detailed statistics on the accuracy of the results. These statistics include alignment accuracy and variant-calling accuracy for different variant types and sizes, as well as for different categories of genomic regions, e.g., genes and repeats. Since VarSim generates a diploid genome, genotyping accuracy is also reported. VarSim is useful not only for comparing structural variant callers, but also for comparing aligners and SNP/indel detectors. To validate the performance of MetaSV, we constructed a synthetic genome based on the NA12878 variants reported previously and simulated reads from it at high coverage (50x) using VarSim. We also gathered over a thousand experimentally validated, high-quality and high-resolution SVs from the literature. With simulation and experimental data, our results show that MetaSV achieves high accuracy, precision and sensitivity across all SV types and sizes.

Parallel session 4b: Plant Genomics
Time: Tuesday 2nd September 2014 2.30pm-5.45pm Location: Lecture theatre B Chair: Neil Hall (University of Liverpool)

Mutant hunting in complex genomes
Anthony Hall1, Neil Hall1, Rachel Brenchley1, Donal O'Sullivan2 and Laura Gardiner1
1University of Liverpool, 2University of Reading
Next generation sequencing technology is making it possible to rapidly map and identify mutations responsible for specific traits or phenotypes in Arabidopsis. These strategies include simultaneous mapping and mutant identification (mapping-by-sequencing) and direct sequencing of mutants. While these approaches are extremely useful for Arabidopsis and are quickly becoming routine, in crop species genome resources are often poor and the genome sizes are huge. Bread wheat is an allohexaploid with a genome size of 17 Gb. While a draft genome is available, it is fragmented. Current sequencing technologies and computational speeds make direct re-sequencing of a bulk segregating population of an F2 with sufficient sequence depth prohibitively expensive for large genomes. Therefore, our first step in the development of a mapping-by-sequencing approach has been to produce an enrichment array allowing us to sequence just the genic portion of wheat (150 Mb). We have used this in combination with a pseudo wheat genome, constructed based on synteny between Brachypodium and wheat to give long-range order to genes. We will describe how this approach was used first to map a mutant in a diploid wheat (Triticum monococcum L.), a close relative of the bread wheat A genome progenitor. Secondly, we extended this approach to identify a candidate gene involved in yellow rust resistance in hexaploid bread wheat.

Can genomics help forestry?
Richard Buggs, School of Biological and Chemical Sciences, Queen Mary University of London, E1 4NS, United Kingdom
As we increasingly depend upon the world’s forests for renewable energy and building materials, global trade in live plants is spreading pests and pathogens that threaten forest trees. We need tree breeding to boost productivity of forest plantations, and to increase tree resistance to pests and pathogens. The size and generation time of trees make breeding programmes expensive and time consuming. Genetic marker assisted selection at juvenile stages may offer greater efficiency gains for tree breeding than (even) for agricultural crop and animal breeding. However, resources for genetic-trait studies (such as inbred lines, mapping populations, and large phenotyped populations) are hard to generate for trees. In addition, whilst broad-leaved trees typically have genomes of less than 1Gbp in size, conifers typically have very large genomes of over 10Gbp, impeding genome assembly and re-sequencing studies. One efficient way to identify candidate genes for resistance to a pest/pathogen may be by phylogenomic study of convergent evolution within tree genera where some species have co-evolved with that pest/pathogen. This talk will outline the potential and challenges of the application of genomics to forestry. Current work on ash trees in the UK will be reviewed as a case study for the application of genomics to a host-pathogen interaction.

The potato genome: Marker discovery, diversity studies and trait analysis
Glenn Bryan, The James Hutton Institute, Dundee, DD2 5DA, UK
Potato is one of the world's most important food crops. The international Potato Genome Sequencing Consortium (PGSC) published the genome of the homozygous ‘DM’ genotype in 2011. The potato genome is highly duplicated and moderately repetitive, containing some 39,000 coding genes, although the annotation is in need of significant improvement. An updated, much improved version (v4.03) of the genome pseudomolecules was published in 2013. This latter activity involved construction of a de novo genetic map using a range of marker types. In my presentation I will give examples of how we are using the potato genome in combination with new genetic tools, such as single nucleotide polymorphism (SNP) platforms and genotyping by sequencing (GBS) approaches, and novel potato populations to analyse potato traits, including resistances to important pathogens and other important traits, such as tuber shape and dormancy.
Dissecting the regulatory network governing C4 photosynthesis through exploitation of natural variation Steven Kelly, Plant Sciences, University of Oxford With over 60 independent origins, C4 photosynthesis is one of the most remarkable examples of convergent evolution in eukaryotic biology. Multiple studies have identified a large repertoire of genes that are differentially expressed between closely related C3 and C4 species, and within C4 species, thousands of genes have been implicated in biochemical and anatomical development. The extent to which separate lineages of C4 plants use the same genetic networks to maintain C4 photosynthesis is unknown. Here I will discuss progress made to elucidating these genetic networks through comparative transcriptomic and genomic approaches in multiple C3 and C4 species. Back to BACs? A High-­‐Throughput, low cost BAC sequencing pipeline Darren Heavens, Deepali Vasoya, Gawain Bennett, Heather Musk, James Lipscombe, Rachel Piddock, Dharanya Sampath, Jon Wright, Richard Leggett, Kirsten McLay, Sarah Ayling and Matthew D. Clark The Genome Analysis Centre, Norwich Research Park, United Kingdom Sequencing individual BACs from a minimal tile path (MTP) overcomes many assembly problems associated with heterozygosity, repeats, and duplications. This can be a huge benefit when sequencing large, complex, repeat rich and polyploid genomes such as bread wheat. Here we detail a scalable, low cost (currently £5/BAC), high throughput pipeline to construct indexed libraries from 2,304 BACs in a standard working day. The libraries can be pooled into a single Illumina lane, sequenced, demultiplexed and each BAC individually assembled. We have validated this approach by sequencing the barley 2H and wheat 3DL MTPs generating average contig N50 >15kbp and by adding further mate pair data we achieve average scaffold N50 >75kb. 
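For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the example lengths are hypothetical and are not taken from the pipeline described.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for safety


if __name__ == "__main__":
    # Made-up contig lengths in bp, for illustration only.
    contigs_bp = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000]
    print(n50(contigs_bp))
```

Because N50 depends only on the length distribution, adding mate pair data that joins contigs into longer scaffolds raises the statistic without changing the total sequence assembled.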
Session 5: Keynote lecture
Time: Tuesday 2nd September 2014 5.45pm-6.30pm Location: Lecture theatre A Chair: Rory Bowden (Wellcome Trust Centre for Human Genetics, Oxford)

Genomic Studies of the Human Microbiome and Disease
Prof. George Weinstock, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
The microbial communities of the human body affect, and are affected by, the host’s lifestyle, health status, genotype, and most likely other factors. Dysbiosis, an imbalance of a microbial community, is often associated with these aspects of the host, and the study of dysbiosis offers opportunities for diagnostics as well as understanding mechanisms of disease and human biology. This can be manifested as changes in the abundance of one or more organisms, changes in abundance of metabolic pathways or other gene systems, or changes in ensemble properties such as biodiversity. We have been describing such phenomena in a range of clinical samples, and examples of how metagenomics can be informative to medicine will be presented.

Day 3: Wednesday 3rd September 2014

Session 6: Sponsors Session
Time: Wednesday 3rd September 2014 9.00-10.30am Location: Lecture theatre A Chair: David Buck (Oxford Genomics Centre, WTCHG, University of Oxford)

Accelerating the validation of genomic variants with CRISPR/Cas9 genome editing
Scott Brouilette, Sr. Marketing Technical Specialist, UK & Ireland, Illumina
Genome-wide association studies (GWAS) have enabled identification of SNPs and other variants associated with traits and disease, but in many cases have failed to reveal the causative mutations. Next-generation sequencing (NGS) provides the power to identify variants at single-base resolution, thus increasing the power to detect causative, and potentially actionable, mutations.
Traditional methods for validating the impact of these variations are both time-consuming and expensive, but with the emergence of the CRISPR-Cas9 editing technology, derived from the CRISPR adaptive immune system of bacteria, we now have a tool that permits the rapid introduction of targeted modifications to the genome, the epigenome or transcripts. It may even be possible to perturb multiple genes simultaneously to model the additive effects that underlie complex polygenic disorders. Sequencing multiple whole human genomes to 30X coverage and calling variants now takes just days using Illumina sequencing technology; coupling this with CRISPR technology for the rapid generation of cellular and animal models, to take these variants and validate their downstream impact, promises to revolutionise disease research.

Ion Torrent Semiconductor Sequencing Update
Mike Lelivelt, Director of Bioinformatics and Software Products, Ion Torrent, part of Life Technologies/ThermoFisher Scientific
Ion Torrent has invented the first device, a new semiconductor chip, capable of directly translating chemical signals into digital information. The Ion Personal Genome Machine™ Sequencer, launched in December 2010, delivered 1000X scalability improvements in its first year of commercial availability. The PGM can now deliver over 2 GB of data using the 318 chip with 400 bp read lengths. The new HiQ enzyme improves overall accuracy, including troublesome indel errors. Ion Torrent released the Ion Proton™ Sequencer in late 2012. The P1 chip routinely generates 12 GB of data across its 165 million microwells with 200 bp read lengths. Both sequencers generate data for a wide variety of applications, including gene panel sequencing, exome analysis, transcript analysis (including whole message, small message, and targeted message), copy number analysis, 16S analysis, and de novo assembly.
A review of software development resources will be provided so that any interested developer can integrate into the Ion Torrent analysis pipeline.

Tailoring approaches for whole exome and epigenome analysis from a FFPE sample perspective
Sudipto Das, University College Dublin
Recent advances in DNA extraction methodologies have allowed the use of archival material from formalin fixed paraffin embedded (FFPE) tissue samples in studies involving genomic profiling on different next-generation sequencing platforms. However, a major limitation of this material is the fragmented nature and low overall yield of the DNA obtained from it, which makes downstream applications rather challenging. We demonstrate the applicability of the SeqCap EZ plus Exome design from the Roche NimbleGen platform to identify novel mutations and SNPs using FFPE and matched fresh frozen material from a retrospective clinical trial in colorectal cancer. In addition, we illustrate the feasibility of using the SeqCap Epi approach to understand the global DNA methylation profile across these samples. The key focus of this talk will thus be to examine the various strategies and technical aspects of using the above-mentioned approaches, as well as to understand their limitations in large studies involving FFPE samples.

Automated solutions for NGS sample preparation
Berwyn Lloyd, Beckman Coulter UK Ltd
Sample preparation for Next Generation Sequencing involves many steps that can be tedious and error prone. Most of the steps involved are highly amenable to automation, which standardizes the process and provides greater consistency in results. Laboratories adopting automated sample preparation often report a significant reduction in hands-on time, with increased efficiency and throughput, resulting in sequence-ready library samples. Beckman is a leader in automation and SPRI magnetic bead technology for RNA and DNA extraction and purification.
Here we give examples of automated NGS sample prep solutions demonstrated on Biomek platforms and show the sorts of data we can generate in customer laboratories.

Automation of NGS Sample Preparation: From Benchtop NGS to Genome Centres
Paul Butler, Director, Strategic Business Development and Collaborations, PerkinElmer
Next Generation Sequencing sample preparation is a bottleneck that can prevent optimal use of sequencer capacity. Library preparation workflows have numerous manual processes that limit throughput and contribute to process inefficiency. NGS protocols are complex multi-step methods requiring precise execution. PerkinElmer has developed a broad suite of solutions that have been specifically designed to precisely automate all steps of library creation, sample analysis, and fractionation, providing consistent and reliable results with hours of unattended walk-away time. This talk will introduce PerkinElmer's automation solutions for NGS sample preparation.

RNA-Seq sample prep does not need to be complicated
Dalia Daujotyte, Research Scientist, Lexogen
Compared to the great level of achievements in sequencing technology, pre-sequencing steps are still less efficient. Consequently, Lexogen focuses on the development of reliable user-friendly technologies for transcriptome analysis by NGS. There are numerous challenges in RNA-Seq library preparation, and a few of them will be discussed during the talk.

Parallel session 7a: Clinical Genomics
Time: Wednesday 3rd September 2014 11.15am-3.15pm Location: Lecture theatre A Chair: Nazneen Rahman (Institute of Cancer Research)

Implementing genomic medicine – cancer predisposition as an exemplar
Professor Nazneen Rahman BM BCh PhD FRCP FMedSci, Head of Division of Genetics and Epidemiology, Institute of Cancer Research, London and Head of Cancer Genetics Unit, Royal Marsden NHS Foundation Trust.
Genetic testing of cancer predisposition genes is one of the major activities of clinical genetics, and cancer is one of the foremost diseases for which gene testing has transformed medical care in multiple areas, including cancer prevention. Currently, over 100 genes associated with predisposition to cancer are known, but in most countries testing is very restricted with respect to the number of genes and the number of people tested. The Mainstreaming Cancer Genetics (MCG) Programme is a UK national cross-­‐disciplinary initiative to develop the NGS assays, analytical and interpretive pipelines, clinical infrastructure, training, ethical and evaluation processes required for routine genetic testing to be integrated into cancer patient care. In collaboration with Illumina we have developed a NGS panel targeting 97 cancer predisposition genes and bespoke analytical pipelines developed for high-­‐throughput clinical diagnostic data analysis and clinical interpretation. In parallel we have developed a mainstreamed ‘oncogenetic’ cancer gene testing pathway whereby consent for medical testing (i.e. in cancer patients) is undertaken by trained members of the cancer team, with only the mutation carriers typically seen by geneticists. We have developed protocols and e-­‐learning modules to deliver the required training to the cancer team. This system is faster, cheaper and scalable whilst retaining flexibility for increased input from genetics when required. We have completed a 6 month pilot and the pipeline has now been adopted as standard care for ovarian cancer patients at Royal Marsden. Patient and clinician feedback has been very positive. 
www.mcgprogramme.com

Personalised cancer medicine in the era of genome sequencing
Ultan McDermott, Wellcome Trust Sanger Institute, Hinxton
Over the last decade we have witnessed the convergence of two powerful experimental designs towards a common goal of defining the molecular subtypes that underpin the likelihood of a cancer patient responding to treatment in the clinic. The first of these ‘experiments’ has been the systematic sequencing of large numbers of cancer genomes through the International Cancer Genome Consortium and The Cancer Genome Atlas. This endeavour is beginning to yield a complete catalogue of the cancer genes that are critical for tumourigenesis and amongst which we will find tomorrow’s biomarkers and drug targets. The second ‘experiment’ has been the use of large-scale biological models such as cancer cell lines to correlate mutations in cancer genes with drug sensitivity, such that one could begin to develop rational clinical trials to test these hypotheses. It is at this intersection of cancer genome sequencing and biological models that there exists the opportunity to completely transform how we stratify cancer patients in the clinic for treatment.

Genomic testing leads clinical care in neonatal diabetes: a new paradigm
Sian Ellard, Institute of Biomedical and Clinical Science, University of Exeter Medical School, Exeter, UK
Recent years have seen significant progress towards defining the genetic aetiology of neonatal diabetes, with >20 subtypes identified. It is likely that all cases of neonatal diabetes result from a single gene disorder, since markers of autoimmunity associated with type 1 diabetes are rare in patients diagnosed before 6 months. Traditional genetic testing focusses on the analysis of one or a few genes according to clinical features; this testing approach is changing as next-generation sequencing enables simultaneous analysis of many genes.
Neonatal diabetes is the presenting feature of many discrete clinical phenotypes defined by different genetic aetiologies. Genetic subtype determines treatment, with improved glycaemic control on sulfonylurea therapy for most patients with potassium channel mutations. This discovery means that most patients are now referred for genetic testing soon after diabetes is diagnosed. Comprehensive testing of all aetiologies identifies causal mutations in >80% of cases. The genetic result predicts optimal diabetes treatment and development of related features. This represents a new paradigm for clinical care with genetic diagnosis preceding development of clinical features and guiding clinical management.

Clinical microbial genomics and metagenomics: overview and applications
Mark Pallen, University of Warwick
In this talk, I will provide a brief overview of the applications of genome sequencing in clinical microbiology, drawing on illustrative case studies from my own work and from the literature. For nearly two decades, microbial genomics has delivered key insights into pathogen biology. In recent years, microbial whole genome sequencing has moved closer to the clinic and has become an important tool in investigating the emergence, evolution and spread of pathogens, both in hospitals and in the community. High-throughput sequencing has also transformed our understanding of the complex microbial communities that modulate the balance between health and disease even in non-infectious conditions, such as obesity. Direct sequencing of microbial DNA from clinical samples through metagenomics brings the potential for rapid culture-independent diagnosis, while advances in sequencing technologies may bring bedside diagnosis-by-sequencing a step closer.

Accurate and high resolution HLA typing from targeted exome sequencing data
Hang T.T. Phan1 & Gerton Lunter1
1
The Wellcome Trust Centre for Human Genetics, University of Oxford
The Human Leukocyte Antigen (HLA) system is a locus of genes forming the Major Histocompatibility Complex (MHC) in humans, located on chromosome 6p21.3. HLA genes are highly diverse in sequence within a population, e.g. for the HLA-B locus there are 3,455 alleles. This diversity is the major source of complication in procedures such as transplantation or blood transfusion, due to incompatibility of MHC genes. It also makes it challenging to type HLA genes accurately at high resolution in a high-throughput fashion. Existing methods for typing HLA alleles using next generation sequencing data are mostly assembly-alignment based approaches, where the reads are first assembled into contigs and then aligned against the HLA allele database. Here we introduce our pipeline to type HLA genes (class I and class II) from targeted exome sequencing data at high resolution (at least 4-digit) with high accuracy. Our method, in contrast, is an alignment and variant-calling based approach that uses commonly available tools, including BWA for read alignment and Platypus for variant calling. The results of the alignment and variant calling process are then used in a maximum likelihood framework to predict the most likely genotype for the input sample. The framework is also designed to incorporate the contamination level of the sample. Validation results using in-house data show that our pipeline obtained an accuracy of 90-97% for HLA class I genes and 85-98% for HLA class II genes with little ambiguity. We are experimenting with applying the pipeline to more general sequencing data, such as whole genome sequencing data, to widen the use of our approach.

“From bench to bedside” via bioinformatics: Advances in cancer research and diagnostics
Susanne Weller1,2, Anna Schuh2, Ruth Clifford2, Samantha Knight3, Andreas Weller4 & Chris Holmes5
1
Nuffield Department of Clinical Laboratory Sciences 2
Haemato-­‐Molecular Diagnostic Laboratory, John Radcliffe Hospital, Oxford 3
Wellcome Trust Centre for Human Genetics 4
Oxford Nanopore 5
Statistical Genetics, Oxford Centre for Gene Function
The Oxford BRC Haemato-Molecular Diagnostic Laboratory brings state-of-the-art sequencing into practice: we apply targeted sequencing as well as whole genome sequencing to understand cancer and rare diseases. We work towards the integration of phenotypic and genotypic information for diagnosis and treatment of constitutional and acquired disease. One focus of our research is to find unknown small genetic lesions that can potentially cause chronic lymphocytic leukaemia (CLL) or lead to chemotherapy resistance. We analyzed tumours from 209 CLL patients using whole genome SNP arrays and a custom targeted sequencing panel. Extraordinarily high resolution and coverage depth allow for conclusions about small lesions and subclones occurring at low frequencies. We combine our findings with comprehensive clinical data reporting outcome after treatment. Minimal residual disease (MRD) measures minute levels of cancer cells left after chemotherapy and is a good proxy for general outcome. It can also be assessed much sooner than patient survival, which is crucial given today's rapid advances in genome sequencing and drug development. Our approach aims to understand how interactions of small lesions and within-patient evolution of subclones lead to the observed clinical outcome measures. We customize the latest machine learning tools, such as random forests, to the binary outcome variable (MRD) to produce a predictive model. We evaluate how our model stratifies patients with specific lesions and clinical parameters and how our findings can inform hands-on clinical practice.
We recently improved clinical reports for panel sequencing on Illumina MiSeq/HiSeq and Lifetech PGM/Proton. In a routine diagnostic setting, it is crucial to distinguish between high-confidence positions with a reference allele and low-confidence positions where a variant call is impossible.
So far, both cases were reported as “no variant found”. For clinical decision making, though, it makes a considerable difference whether a patient is wild-type at a position or whether we cannot make a statement about it. “CoverageCheck” is a tool that applies user-defined cut-offs for coverage and strand bias and displays the resulting high- and low-confidence positions in easy-to-interpret plots.

Parallel session 7b: Microbial Genomics
Time: Wednesday 3rd September 2014 11.15am-3.15pm
Location: Lecture theatre B
Chair: Nick Loman (University of Birmingham)

Nanopore sequencing in clinical microbiology
Nicholas J. Loman, Institute of Microbiology and Infection, University of Birmingham
Nanopore sequencing is a potentially disruptive genomics technology. Since May 2014, we have been participating in the Oxford Nanopore MinION™ access programme. The MinION™ is a USB-powered and connected, portable single-molecule sequencer, able to detect nucleotide words as they translocate through a biological nanopore. Event data is streamed in real-time from the instrument and base-called in the cloud. Real-time analysis of sequencing data opens up new opportunities for the diagnosis of infectious diseases. In this talk I will relate our experiences in the MinION™ access programme, and characterise the platform's unique strengths and weaknesses. I will demonstrate our use of the MinION™ in a large hospital and community outbreak of Salmonella enterica serovar Enteritidis. Furthermore, I will discuss the potential future use of this platform for metagenomic diagnosis of infectious diseases.

Translation of sequencing from research to the clinic: an example
Zam Iqbal, Wellcome Trust Centre for Human Genetics, University of Oxford
The challenges and goals of research software are not the same as those for more translational applications.
I'll take us through a single example and demonstrate how close we now are to using sequencing in the clinic for infectious disease diagnostics.

The genome deficit for microbial eukaryotes
Holly Bik, UC Davis, USA
Microbial eukaryotes (e.g. nematodes, fungi, protists, and other 'minor' metazoan phyla) are phylogenetically diverse and numerically abundant in most habitats on earth. Yet genome databases continue to be heavily biased towards model species and a small subset of phylogenetic lineages, and oftentimes these public genomes are not maximally useful for studying microbial eukaryotes in the context of environmental genomics. In particular, the interpretation of metagenomic data from diverse ecosystems is still inherently reliant on comparisons to available genome sequences and their corresponding annotations. A large-scale effort is needed to fill in the branches of the eukaryotic tree of life, including generating genome sequences for many underrepresented lineages currently maintained in culture collections, and single-cell approaches to isolate and sequence unculturable taxa from diverse environmental samples.

Rapid bacterial outbreak characterization from high-throughput sequencing data
Torsten Seemann, Monash, Melbourne
High-throughput whole genome sequencing is set to revolutionise how public health microbiology laboratories respond to outbreaks. In this presentation I will outline a software pipeline that turns raw sequence read files into a detailed report, which can then be summarised for health officials to respond appropriately. Features include quality control, species identification, genome assembly and annotation, MLST typing, antibiotic resistance screening, phylogenetic reconstruction including incorporation of historical isolates, and predictions of clonality.
The system goals are to be rapid, accurate and reliable; and the prototype has already been applied successfully to many small outbreaks handled by the Melbourne Diagnostic Unit in Australia. The system is planned to be rolled out nationally to members of the Public Health Laboratory Network in Australia.

A pipeline for inferring the diversity of intra-host quasi-species of Hepatitis C virus genomes sequenced with a new probe-capture viral RNA-Seq (Illumina) protocol
Camilla Ip1, A. Ansari2, D. Bonsall2, P. Piazza1, A. Trebes1, A. Brown2, P. Klenerman2, J. Hurst2, D. Buck1, E. Barnes2, R. Bowden1
1
Wellcome Trust Centre for Human Genetics, University of Oxford, UK 2
Peter Medawar Building for Pathogen Research, University of Oxford, UK
Hepatitis C Virus (HCV) is a human pathogen spread by contact with infected blood or bodily fluids, characterised by extreme levels of between- and within-host genomic diversity. Genome sequencing as a fast, cheap clinical diagnostic test would be an important advance that could guide treatment of individual patients (an example of ‘personalised medicine’). But more generally, a reliable protocol for whole-genome sequencing and recovering quasispecies information from blood-borne viruses like HCV, HBV (Hepatitis B) and HIV would be invaluable for studying viral population dynamics with the aim of finding weaknesses that could be targeted by new treatments or public health interventions. In a collaboration between The Wellcome Trust Centre for Human Genetics and Oxford members of the STOP-HCV Consortium, we have developed a novel pipeline for probe-based enrichment of HCV sequences and sequencing on the Illumina platform. With a robust 100-fold+ enrichment for any HCV sequence, it is now practical to sequence batches of 96 samples on a single Illumina MiSeq run yielding ~10,000x coverage of the 10kb HCV genome. We have constructed a bioinformatics pipeline to assemble and analyse HCV Illumina metagenomic data, which can infer whether a patient’s blood contained more than one genotype of HCV (suggesting multiple infections), estimate the proportion of the sub-genotypes present, infer the consensus genome of the HCV population, and identify known antiviral resistance mutations. Our current aims are to recover the HCV quasispecies present in the RNA, correct for batch cross-sample contamination, and recover all genomic variation greater than our “noise” threshold. We will describe the final form of our pipeline and quantify the resolution and accuracy of the population structure we were able to infer from over 100 UK HCV samples.
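As a rough illustration of two quantities mentioned in the abstract above (mean coverage, and the proportion of genotypes present in a sample), the following minimal Python sketch may help. It is not the authors' pipeline: the function names, the per-read genotype labels and the 1% noise threshold are all hypothetical, and a real analysis would need to model sequencing error and cross-sample contamination.

```python
from collections import Counter


def mean_coverage(on_target_bases, genome_length=10_000):
    """Mean depth = sequenced on-target bases / genome length (10 kb assumed for HCV)."""
    return on_target_bases / genome_length


def genotype_proportions(read_genotypes):
    """Fraction of classified reads assigned to each genotype.

    read_genotypes: iterable of hypothetical per-read labels such as "1a";
    None marks a read that could not be classified and is ignored.
    """
    counts = Counter(g for g in read_genotypes if g is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: n / total for g, n in counts.items()}


def call_mixed_infection(proportions, noise_threshold=0.01):
    """Return (genotypes above the assumed noise threshold, whether more than one is present)."""
    present = sorted(g for g, p in proportions.items() if p > noise_threshold)
    return present, len(present) > 1


# Toy example: 1,000 classified reads plus 10 unclassified ones.
reads = ["1a"] * 900 + ["3a"] * 95 + ["2b"] * 5 + [None] * 10
props = genotype_proportions(reads)
present, mixed = call_mixed_infection(props)
```

In this toy input the "2b" reads fall below the assumed threshold and are treated as noise, so the sample would be flagged as a mixed 1a/3a infection. On the coverage side, 10,000x over a 10 kb genome corresponds to about 100 million on-target bases per sample.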

Molecular epidemiology of Vancomycin-Resistant Enterococcus faecium bacteraemia
Sebastian van Hal1,2, Camilla Ip3, Slade Jensen2, Azim Ansari4, Rory Bowden3
1
Department of Infectious Diseases and Microbiology, Royal Prince Alfred Hospital, Sydney NSW, Australia 2
Antibiotic Resistance & Mobile Elements Group, School of Medicine, University of Western Sydney, Sydney NSW, Australia 3
Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, UK 4
Peter Medawar Building for Pathogen Research, University of Oxford, Oxford, Oxfordshire, UK
Enterococci form part of the normal human gastro-intestinal tract microbiota. Two species predominate (E. faecalis and E. faecium), both of which are also significant opportunistic nosocomial pathogens. The recent emergence of vancomycin resistance in E. faecium is of concern because, when infection is established with these isolates, treatment options are limited, resulting in significant patient morbidity and mortality. Consequently, hospital control strategies are aimed at reducing patient cross-transmission by isolation and contact precautions. However, the effectiveness of this resource-intensive strategy is unknown, as E. faecium is also able to acquire vancomycin resistance through horizontal genetic exchange. To address this critical question, a collaboration between The Wellcome Trust Centre for Human Genetics, Oxford and Royal Prince Alfred Hospital (RPAH), Sydney, Australia was established. One hundred and thirty-nine infection (all bacteraemic) isolates from a single institution (RPAH) between 2005 and 2013 were sequenced on the Illumina platform. Of the 139 E. faecium isolates, 41 were known to be vancomycin-resistant E. faecium (VRE) based on vanB PCR and phenotypic testing. Our current aims are to describe the molecular epidemiology of VRE and its emergence at a single Australian hospital. Phylogenetic analysis will allow us to explore relationships between isolates and whether VRE infections represent clonal dissemination within a hospital (an outbreak) or are secondary to in vivo vanB acquisition events. These data will be able to answer questions not only about the genetic context of emerging VRE but also about the role infection control strategies play in preventing nosocomial cross-transmission.

Premium Sponsors