Document 97295

Accessing
the
1000
Genomes
Data
Paul
Flicek
European
BioinformaMcs
InsMtute
Data
access
• 
• 
• 
• 
• 
General
informaMon
File
access
1000
Genomes
Browser
Tools
Where
to
find
help
www.1000genomes.org
www.1000genomes.org
1000 Genomes Project Resources
L. Clarke, H. Zheng-Bradley, R. Smith, E Kuleshea, I Toneva, B. Vaughan, P. Flicek and
1000 Genomes Consortium
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
Introduction
DATA TYPE FILE FORMAT
SIZE
sequence
FASTQ
alignment
BAM
56 Tbytes of BAM files
variants
VCF
38.9M SNPs
~4.7M short indels
http://browser.1000genomes.org
The 1000 Genomes project utilizes the Ensembl Browser to display our
variant calls. We provide rapid access to project variant calls through the
browser before they become available via dbSNP and DGVa.
Tracks of 1000 genomes variants by population can be viewed in the location
page:
Variant Effect Predictor
The predictor takes a list of variant positions and alleles, and predicts the
effects of each of these on any overlapping features (transcripts, regulatory
features) annotated in Ensembl. An example output is shown below:
Data Slicer
Many of the 1000 Genomes files are large and cumbersome to handle. The
Data Slicer allows users to get data for specific regions of the genome and to
avoid having to download many gigabytes of data they don’t needl samples/
populations you choose. Below is the Data Slicer input interface:
Discoverability
The ftp site has a index updated nightly. This index is searchable from our website.
http://www.1000genome.org/ftpsearch
A list of variants can be obtained for any given transcript. In addition to
basic information about a variant, PolyPhen and SIFT annotation are
displayed to indicate the clinic significance of the variant.
The search allows
users to specify which
ftp site to get paths
to, to get md5
checksums and also
filter out high volume
results like bam and Allele frequency for individual variants in different populations is displayed
fastq files
on the ‘Population Genetics’ page.
We also have various routes for users to discover new data.
Variation Pattern Finder
•  The Variation Pattern Finder (VPF) allows one to look for patterns of
shared variation between individuals in the a VCF file.
•  Within a vcf file different samples have different combination of variation
genotypes. The VPF looks for distinct variation combinations within a user
specifed region, shared by different individuals.
•  The VPF only on variations that functional consequences for protein
coding genes such as non-synonymous coding SNPs and splice site
changes.
Users can Attach remote files as custom tracks. In example below, the
HG00120 track is 1000 Genomes bam file added to the browser.
Website http://www.1000genomes.org/announcements
Twitter @1000genomes
RSS http://www.1000genomes.org/announcements/rss.xml
Email [email protected]
Laura Clarke
EBI
[email protected]
http://browser.1000genomes.org/tools.html
The project provides several tools to help users access and interpret the data provided.
43 Tbases raw sequence
Sequence, alignment and variant data is made available as quickly as possible
through the project ftp site. (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ |
ftp://ftp-trace.ncbi.nih.gov/1000genomes/). With more than 200,000 files though
discovering new data can be difficult.
• 
• 
• 
• 
Accessibility
Visualization
The main goal of the 1000 genomes project is to establish a comprehensive and
detailed catalogue of human genome variations; which in turn will empower
association studies to identify disease-causing genes. The project now has data
and variant genotypes for more than 1000 individuals in 14 populations. The ftp
site contains more than 120Tbytes of data in 200,000 files.
Acknowledgements
We would like to thank the Ensembl variation team for all their help,
particularly Will McLaren and Graham Ritchie.
Funding: The Wellcome Trust
EMBL- EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
UK
T +44 (0) 1223 494 444
F +44 (0) 1223 494 468
http://www.ebi.ac.uk
Data
access
• 
• 
• 
• 
• 
General
informaMon
File
access
1000
Genomes
Browser
Tools
Where
to
find
help
>p://>p.1000genomes.ebi.ac.uk
>p://>p‐trace.ncbi.nih.gov/1000genomes/>p
Site
documentaMon
Sequences
&
alignments
by
sample
ID
Data
sets
to
accompany
the
pilot
data
publicaMon.
Current
and
archive
data
set
releases
Pre‐release
data
sets
and
project
working
materials
BIOINFORMATICS APPLICATIONS NOTE
Sequence analysis
Vol. 25 no. 16 2009, pages 2078–2079
doi:10.1093/bioinformatics/btp352
Data
formats
and
key
tools
The Sequence Alignment/Map format and SAMtools
Heng Li1,† , Bob Handsaker2,† , Alec Wysoker2 , Tim Fennell2 , Jue Ruan3 , Nils Homer4 ,
Gabor Marth5 , Goncalo Abecasis6 , Richard Durbin1,∗ and 1000 Genome Project Data
Processing Subgroup7
1 Wellcome
Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2 Broad Institute of
MIT and Harvard, Cambridge, MA 02141, USA, 3 Beijing Institute of Genomics, Chinese Academy of Science, Beijing
100029, China, 4 Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095,
5 Department of Biology, Boston College, Chestnut Hill, MA 02467, 6 Center for Statistical Genetics, Department of
Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7 http://1000genomes.org
Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009
Advance Access publication June 8, 2009
BAM
alignment
files
Associate Editor: Alfonso Valencia
2 METHODS
2.1 The SAM format
2.1.1 Overview of the SAM format The SAM format consists of one
header section and one alignment section. The lines in the header section
start with character ‘@’, and lines in the alignment section do not. All lines
are TAB delimited. An example is shown in Figure 1b.
In SAM, each alignment line has 11 mandatory fields and a variable
number of optional fields. The mandatory fields are briefly described in
Table 1. They must be present but their value can be a ‘*’ or a zero (depending
on the field) if the corresponding information is unavailable. The optional
fields are presented as key-value pairs in the format of TAG:TYPE:VALUE.
They store extra information from the platform or aligner. For example, the
‘RG’ tag keeps the ‘read group’ information for each read. In combination
Sequence analysis
with the ‘@RG’ header lines, this tag allows each read to be labeled with
metadata about its origin, sequencing center and library. The SAM format
1 INTRODUCTION
specification gives a detailed description of each field and the predefined
With the advent of novel sequencing technologies such as
TAGs.
BIOINFORMATICS APPLICATIONS NOTE
The variant call format and VCFtools
Downloaded from bioinformatics.oxfordjournals.org at Genome Research Ltd on October 13, 2011
ABSTRACT
Summary: The Sequence Alignment/Map (SAM) format is a generic
alignment format for storing read alignments against reference
sequences, supporting short and long reads (up to 128 Mbp)
produced by different sequencing platforms. It is flexible in style,
compact in size, efficient in random access and is the format in which
alignments from the 1000 Genomes Project are released. SAMtools
implements various utilities for post-processing alignments in the
SAM format, such as indexing, variant caller and alignment viewer,
and thus provides universal tools for processing read alignments.
Availability: http://samtools.sourceforge.net
Contact: [email protected]
Vol. 27 no. 15 2011, pages 2156–2158
doi:10.1093/bioinformatics/btr330
Advance Access publication June 7, 2011
VCF
variant
files
Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis,
2008), a 1,† , Adam Auton2,† , Goncalo Abecasis3 , Cornelis A. Albers1 , Eric Banks4 ,
Petr Danecek
variety of new alignment tools (Langmead et al., 2009; Li et al.,
4 , Extended
4 , Gerton
2 , Gabor T. Marth5 ,
2.1.2
The standard CIGAR
descriptionLunter
of pairwise
RobertCIGAR
E. Handsaker
Mark
A.
DePristo
2008) have been designed to realize efficient read mapping against
alignment
defines three operations: 2,7
‘M’ for match/mismatch, ‘I’ for insertion
6 , Gilean
1,∗ and 1000 Genomes Project
McVean
,
Richard
Durbin
Stephen
T. Sherry
large reference sequences, including the human genome.
These tools
compared with the reference and ‘D’ for deletion. The extended CIGAR
Vol. 27 no. 5 2011, pages 718–719
generate alignments in different formats, however,
complicating
proposed in SAM added four more operations: ‘N’ for skipped bases on
Analysis
Group‡
doi:10.1093/bioinformatics/btq671
the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding.
downstream processing. A common alignment format
that supports
1 Wellcome
Trust Sanger
Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2 Wellcome Trust Centre
These support splicing, clipping, multi-part and padded alignments.
Figure 1
all sequence types and aligners creates a well-defined interface
3 Center
for Human Genetics,shows
University
ofof Oxford,
Oxford
OX3 7BN,
UK,
for Statistical Genetics, Department of
examples
CIGAR
strings
for
different
types
of
alignments.
between alignment and downstream analyses, including variant
Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4 Program
in Medical analysis
and Population Genetics, Broad
Sequence
Advance Access publication January 5, 2011
detection, genotyping and assembly.
5 Department of Biology, Boston College, MA 02467, 6 National
Institute
of MITtoand 2.1.3
Harvard,
Cambridge,
MA
02141,
Binary
Alignment/Map
format
To improve
the performance, we
The Sequence Alignment/Map (SAM) format
is designed
7
Institutes
of Health
National
forformat
Biotechnology
Information,
20894,
designed a Center
companion
Binary Alignment/Map
(BAM),MD
which
is the USA and Department of Statistics,
achieve this goal. It supports single- and paired-end
reads
and
representation
of SAM
University
Oxford,binary
Oxford
OX1 3TG,
UK and keeps exactly the same information as
combining reads of different types, including color
space readsoffrom
SAM.
BAM
is
compressed
by
the
BGZF
library,
a
generic
library
developed
Associate
John Quackenbush
AB/SOLiD. It is designed to scale to alignment sets
of 1011Editor:
or more
by us to achieve fast random access in a zlib-compatible compressed file.
base pairs, which is typical for the deep resequencing of one human
An example alignment of 112 Gbp of Illumina GA data requires
116 GBLi
of
Heng
individual.
disk space (1.0 byte per input base), including sequences,Although
base qualities
and feature format (GFF) has recently been extended
ABSTRACT
generic
Program
Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
In this article, we present an overview of the SAM format and
all
the
meta
information
generated
by
MAQ.
Most
of
this
space
is
used
toin Medical
to standardize storage
of variant information in genome variant
Summary: The variant call format (VCF) is a generic format for
briefly introduce the companion SAMtools software package. A
Editor:etDmitrij
Frishman
storedata
the base
format Associate
(GVF) (Reese
al., 2010),
this is not tailored for storing
storing DNA polymorphism
suchqualities.
as SNPs, insertions, deletions
detailed format specification and the complete documentation of
information across many samples. We have designed the VCF
and structural variants, together with rich annotations. VCF is usually
SAMtools are available at the SAMtools web site.
2.1.4 Sorting and indexing A SAM/BAM file can beformat
unsorted,tobut
besorting
scalable so as to encompass millions of sites with
BIOINFORMATICS APPLICATIONS NOTE
Tabix: fast retrieval of sequence features from generic
TAB-delimited files
Downloaded from bioinformatics.oxfordjournals.org at Genome Rese
All
indexed
for
fast
retrieval
stored in a compressed manner and can be indexed for fast data
ABSTRACT
by coordinate
is used
to streamline
processing and
to avoid
loading
genotype
data
and annotations from thousands of samples. We have
retrieval of variants from
a range of
positions
on the data
reference
Tabix is the first generic tool that indexes position sorted
whom correspondence should be addressed.
extra alignments into memory. A position-sorted BAMadopted
file canSummary:
be
indexed.encoding,
a textual
with complementary indexing, to allow
The format was developed for the 1000 Genomes Project,
† The authors wish it to be known that, in their opinion,genome.
the first two authors
TAB-delimited formats such as GFF, BED, PSL, SAM and
We combine the UCSC binning scheme (Kent et al., 2002)files
andinsimple
easy generation of the files while maintaining fast data access.
and has also been adopted by other projects such as UK10K,
should be regarded as Joint First Authors.
linear indexing to achieve fast random retrieval of alignments overlapping
a and quickly retrieves features overlapping specified
SQL export,
In this article, we present an overview of the VCF and briefly
dbSNP and the NHLBI Exome Project. VCFtools is a software suite
regions. Tabix features include few seek function calls per query, data
introduce the companion VCFtools software package. A detailed
that implements various utilities for processing VCF files, including
compression with gzip compatibility and direct FTP/HTTP access.
format specification and the complete documentation of VCFtools
validation, merging, comparing and also provides a general Perl API.
© 2009 The Author(s)
Tabix is implemented as a free command-line tool as well as a library
are available at the VCFtools web site.
Availability:
http://vcftools.sourceforge.net
This is an Open Access article distributed under the terms
of the Creative
Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
in C, Java, Perl and Python. It is particularly useful for manually
Contact:
[email protected]
by-nc/2.0/uk/) which permits unrestricted non-commercial
use, distribution,
and reproduction in any medium, provided the original work is properly cited.
examining local genomic features on the command line and enables
2 METHODS
genome viewers to support huge data files and remote custom tracks
Received on October 28, 2010; revised on May 4, 2011; accepted
∗ To
2 METHODS
Tabix indexing is a generalization of BAM indexing for generic TABdelimited files. It inherits all the advantages of BAM indexing, including
data compression and efficient random access in terms of few seek function
calls per query.
2.1 Sorting and BGZF compression
Before being indexed, the data file needs to be sorted first by sequence name
and then by leftmost coordinate, which can be done with the standard Unix
sort. The sorted file should then be compressed with the bgzip program
1000
Genomes
is
in
the
Amazon
cloud
1KG
pilot
content
(BAM)
is
available
at
s3://1000genomes.s3.amazonaws.com
You
can
see
the
XML
at
hOp://1000genomes.s3.amazonaws.com
Data
access
• 
• 
• 
• 
• 
General
informaMon
File
access
1000
Genomes
Browser
Tools
Where
to
find
help
hOp://browser.1000genomes.org
•  Gene
variaMon
zoom
•  PopulaMon
•  SIFT
•  PolyPhen
File
upload
to
view
with
1000
Genomes
data
•  Supports
popular
file
types:
–  BAM,
BED,
bedGraph,
BigWig,
GBrowse,
Generic,
GFF,
GTF,
PSL,
VCF*,
WIG
*
VCF
must
be
indexed
Uploaded
VCF
Example:
Comparison
of
August
calls
and
/technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz
1000
Genomes
Browser
•  For
further
informaMon
on
the
capabiliMes
of
the
browser
and
its
use,
aOend
the
Ensembl
“New
Users”
Workshop
on
Saturday
at
12:30
hOp://pilotbrowser.1000genomes.org
Data
access
• 
• 
• 
• 
• 
General
informaMon
File
access
1000
Genomes
Browser
Tools
Where
to
find
help
hOp://browser.1000genomes.org
Tools
page
Ensembl
Variant
Effector
Predictor
(VEP)
•  Takes
list
of
variaMon
and
annotates
with
respect
to
Ensembl
features
•  Returns
whether
the
SNP
has
been
seen
in
the
1000
Genomes
and
if
it
has
an
rs
number
(if
one
has
been
assigned)
•  Returns
SIFT,
PolyPhen
and
Condel
scores
•  Extensive
filtering
opMons
by
MAF
and
populaMons
•  Web
and
command
line
versions
Data
slicer
for
subsets
of
the
data
hOp://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer/
Sliced
BAM
to
files
VariaMon
PaOern
Finder
•  hOp://browser.1000genomes.org/
Homo_sapiens/UserData/VariaMonsMapVCF
•  VCF
input
•  Discovers
paOerns
of
Shared
Inheritance
•  Variants
with
funcMonal
consequences
considered
•  Web
output
with
csv
and
excel
downloads
Access
to
backend
Ensembl
databases
•  Public
MySQL
database
at
–  mysql‐db.1000genomes.org
port
4272
•  Full
programmaMc
access
with
Ensembl
API
–  More
informaMon
on
the
use
of
the
Ensembl
API
at
the
Ensembl
“Advanced
Users”
Workshop
tomorrow
Data
access
• 
• 
• 
• 
• 
General
informaMon
File
access
1000
Genomes
Browser
Tools
Where
to
find
help
Credits & Contact •  Eugene Kulesha, Iliana Toneva, Bren Vaughan
•  Will McLaren, Graham Ritchie, Fiona Cunningham •  Laura Clarke, Holly Zheng-Bradley, Rick Smith
•  Steve Sherry, Chunlin Xiao
For more information contact [email protected]