HOW TO FIND GENES WITHIN A DNA SEQUENCE?

HOW TO FIND GENES WITHIN A DNA SEQUENCE?
1. Bacterial genomes
- genes tightly packed, no introns...
Scan for ORFs (open reading frames)
- check all 6 reading frames (both strands)
Fig. 5.2
- look for significant distance between potential start
and stop codon (eg 100 codons)
… but when examining short sequences, start codon (or stop codon)
might be located further upstream (or downstream)
Use computer programs to search for ORFs:
Query: 3 kb sequence
Potential problems?
- if initiation codon other than ATG (relatively rare)
- if overlapping genes (rare)
- if deviation from standard genetic code (can change default)
- if gene contains intron(s)
2. Eukaryotic genomes (such as human)
- genes usually far apart, long introns & short exons
Would an ORF scan work here?
Fig.5.4
Can also use algorithms to look for:
1. Exon-intron boundaries
- “GT-AG” rule, but consensus sequences very short
(see Fig.5.5)
2. Regulatory motifs
- upstream promoters, downstream polyA addition signals…
- but consensus sequences usually very short
3. Codon bias patterns
- synonymous codons are
not all used equally
- patterns differ among
organisms
See Fig. 5.10 which shows
results from various
bioinformatics tools used to
analyze 15 kb of human genome
Table 5.1, Brown1st ed
4. Homologous sequences in databank
BLAST searches www.ncbi.nlm.nih.gov/BLAST/
Basic Local Alignment Search Tool
- search programs to look for similarity between your sequence
of interest (protein or DNA) and entries in global data banks
BLASTN – search at nucleotide level
BLASTP – search at protein level
BLASTX – search nt sequence against protein databases
(automatic 6-reading frame conceptual translation)
tBLASTN – protein query vs. conceptual translation
of DNA database
Fungal
Query = yeast mitochondrial
ribosomal protein L8 (238 aa)
Bacterial
Nomenclature may differ among organisms
- called L17 in Streptococcus but L8 in yeast
E-values: statistical measure of likelihood that sequences with
this degree of similarity occur randomly
ie. reflects number of hits expected by chance
What if this search was done at nucleotide (instead of protein) level?
Query = yeast mitochondrial ribosomal protein L8 gene
(including promoter & UTRs)
Only got “hits” with other yeast entries, in this case
Homologous genes from divergent organisms typically
show greater similarity at amino acid level than at nt level
Degeneracy of genetic code
Codon bias among organisms
Probability of specific stretch of nucleotides occurring by random
chance (“spurious hits”) is higher than for the same length of
amino acids
Fig. 5.18
To illustrate the power of amino acid level searches,
text shows 2 sequences with 76% nt identity
… but only 28% aa identity
But it’s a rather artificial example…
because if 2 DNA stretches of 300 bp or so (normal default
length in ORF Finder) showed 76% nt identity,
it’s very improbable that such similarity occurred by chance
HOMOLOGOUS GENES (share common evolutionary origin)
(p.145)
1. Orthologous
- homologous genes in different organisms
(eg. b-globin genes from mouse and human)
2. Paralogous
- homologous genes in same organism
(eg. multi-gene family members, a-globin and b-globin
from mouse)
Two genes are either evolutionarily related or they are not
…. so instead of “…% homologous”, use “… % identity”
ARE TWO SEQUENCES HOMOLOGOUS OR INDEPENDENT IN ORIGIN?
Factors to consider:
1. Length of sequence
- short sequences more likely to occur by chance
2. Base composition
- highly biased (eg if only AT) more likely to occur by chance
“low complexity regions”
3. Similarity at amino acid level (if protein-coding region)
- high % identity is strong argument for homology
- usually implies common protein function
- nt changes such that minimal effect on aa sequence
Comparison of homologous regions from multiple genomes
Human chr 7
(1.8 Mbp region)
MultiPipMaker program
(percent identity plot)
“Numbered boxes correspond to exons”
- score of % nt sequence similarity (blocks compared vs. reference sequence)
- gives overview of sequence relationships for genomic region shared among organisms
Thomas Nature 424:788, 2003
EXPERIMENTAL TECHNIQUES TO FIND GENES
1. Zoo blot (Southern) analysis
- find regions homologous to DNA from other organisms
- to determine presence/absence of gene among different
organisms
Heterologous hybridization
- use conditions of “reduced
stringency” (eg lower temp)
so that duplex hybrids with
some mismatches are stable
Interpretation of data shown in figure?
Fig. 5.12
2. Northern blot analysis
- to identify expressed regions of genomes
(detect transcripts)
In situ hybridization
- to determine cellular
location
gene X probe
Probe: tagged DNA (eg. PCR
product, restriction fragment,
cDNA clone…) in denatured form
or oligomer or antisense
(synthetic) RNA …
35S-labeled
b-myosin antisense probe
hybridizing to heart ventricle in 13-day
embryonic mouse
Strachan & Read Fig. 5.17
(but note that many identical copies of that
particular mRNA are present on blot)
Fig. 5.11
Some protein genes are constitutively expressed …
“housekeeping gene” products needed at all times
… whereas others are differentially expressed
in specific tissue type
during development
in response to environmental cues
Only a subset of genes are expressed at a given time
and mRNA levels can vary greatly among genes
~10,000 – 15,000 different mRNAs present in “typical” mammalian
cell type under given condition
(may be ~ 20,000 different proteins present)
Aside: RNA-sequencing studies suggest ~ 8000 genes ubiquitously expressed
in human tissues (Ramskold PLoS 2009)
(higher than predicted from microarray analysis, to be discussed in Topic 7)
EXPERIMENTAL TECHNIQUES TO FIND CODING
REGIONS WITHIN GENES
1. Sequencing of cDNA (or EST) clones
... & compare to genomic sequences
to determine positions of introns
eg. for primer can use mixture of
“anchored” oligo(dT)s with A, C or G
in the 3’ position
5’cap
3’
AAAAAAAAAn
5’
ESTs
- short sequences obtained by
sequence analysis of cDNA clones
… but if low abundance mRNA may not
be in bank
…. or cDNA maybe not full-length
If so, which end would you expect
to be missing?
Fig.3.36
Human phosphatidylinositol glycan gene (chromosome 18)
- additional info from RNA level data
RefSeq: gene data agreed
upon by everyone
~60% of RefSeq genes could be extended at 5’ and
3’ ends (based on additional EST data = UTRs)
Nusbaum Nature 437:551, 2005 (Fig S2)
2. To obtain sequence info corresponding
to termini of mRNAs:
RACE – rapid amplification of cDNA ends
“specialized” RT-PCR strategy
5’ RACE
where NNN… might be restriction site
(eg. to aid in cloning RACE product)
- mapping 5’ end of mRNA useful
in locating position of promoter
- promoter immediately upstream
of transcription start site
How would you carry out
3’ RACE (to determine
exact position of 3’end of
mRNA)?
Fig.5.13