Material Sequence Alignment Computational approaches for Systems  Biology and Network Medicine

Material
Computational approaches for Systems Biology and Network Medicine
• Textbook:
 Chapter 4, sections 4.1, 4.2, 4.3, 4.4, 4.5, 4.6
 Chapter 5, sections 5.1, 5.2, 5.4
Sequence Alignments
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
Prof. Alberto Paccanaro
http://www.paccanarolab.org
Figures taken from “Understanding bioinformatics”, Zvelebil & Baum
Sequence Alignment
The task of locating equivalent regions of two or more sequences to maximize their similarity
Why we want to align sequences?
• testing the fit of each alignment generated, giving it a quantitative score
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
• There is never just one possible alignment between any two sequences, and the best one is not always obvious
Similarity vs Homology
• To infer protein function
• To infer protein structure
•Homologous: protein or DNA sequences are said to be homologues if they have been derived from a common ancestor
Homology implies an EVOLUTIONARY relationship.
The relationship between protein function, protein sequence and protein structure
Function
The fundamental question is whether the similarities perceived between two sequences are due to chance, and are thus of little biological significance, or whether they are due to the derivation of the sequences from a common ancestral sequence, and are thus homologous.
© A. Paccanaro, RHUL 2014
Structure
© A. Paccanaro, RHUL 2014
Sequence
•Similarity is a descriptive measure quantifying the degree of matching existing between sequences.
It is easier to detect homology when comparing proteins sequences than nucleic acid sequences
Because homology implies a common ancestor, it can imply a common function or structure for two homologous proteins
1. there are only four letters in the DNA alphabet compared to the 20 letters in the protein alphabet
NOTE THAT sequence similarity does not always imply common function; and there are sequences which are very different and yet have the same structure and function
3. the complex three‐dimensional structure of a protein, and hence its function, is determined by the amino acid sequence
© A. Paccanaro, RHUL 2014
Convergent evolution vs divergent evolution
© A. Paccanaro, RHUL 2014
•
2. the genetic code is redundant
We need 3 parts
Part #2
1. We need algorithms for calculating the alignments
© A. Paccanaro, RHUL 2014
1. We need to discriminate between fortuitously good alignments and those due to a real evolutionary relationship (SIGNIFICANCE of an alignment)
1. Percent identity
© A. Paccanaro, RHUL 2014
Let us imagine we have an alignment.
Let’s look at methods for scoring how good it is
2. We need to compare alignments. Quantify the goodness of an alignment (SCORE of an alignment) These need to take into account:
a) Residues
b) Gaps
The dot‐plot
• Identity describes the degree to which two or more sequences are actually identical at each position
To compare two sequences X and Y, one sequence is written out vertically, with each residue in the sequence represented by a row, while the other is written horizontally, with each residue represented by a column. Each residue of X is compared to each residue of Y (row to column comparison) and a dot is placed where the residues are identical.
Measured by counting the number of identical bases or amino acids matched between the aligned sequences. 11/16=0.68
=> 68%
© A. Paccanaro, RHUL 2014
Example: what is the percent identity?
© A. Paccanaro, RHUL 2014
• percent identity is obtained by dividing the number of identical matches by the total length of the aligned region and multiplying by 100.
Filter: overlapping fixed‐length windows and requires that the comparison achieve some minimum identity score summed over that window before being considered; that is, only diagonals of a certain length will survive the filter.
2. Percent Similarity
• certain amino acids resemble each other closely in their physical and/or chemical properties  they can be substituted
• Below about 25% sequence identity only 10% of the aligned pairs represent structural similarity
• percent similarity: number of similar pairs of amino acids divided by the total length of the aligned region and multiplying by 100.
• Midnight zone: Even lower sequence identity (<20%).
Amino acids
Example: what is the percent similarity?
12/15=0.8
=> 80%
3. Overall alignment score
• sophisticated measures of assessing similarity between amino acids
each aligned pair of amino acids is given a numerical score based on the probability of the relevant change occurring during evolution
© A. Paccanaro, RHUL 2014
• overall alignment score: the sum the values of the amono acid pairs at all positions of the alignment
© A. Paccanaro, RHUL 2014
• Twilight zone: between 30% and 20% sequence identity, where homology may exist but cannot be reliably assumed in the absence of other evidence.
© A. Paccanaro, RHUL 2014
• 90% of sequence pairs with identity at or greater than 30% over their whole length are pairs of structurally similar proteins  likely homologues
© A. Paccanaro, RHUL 2014
Percent identity and homology
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
window with 10 residues, minimum score=3
Dot‐plots can be used to identify repeats within a sequence –
(BRCA2)
Substitution matrices
Which SM
• substitution matrix (SM): defines values for all possible pairs of amino acids
• There is no one correct SM for all circumstances. Good ones use actual evidence of what has happened during evolution, and are based on analysis of alignments of numerous homologs of well‐studied proteins from many different species.
• to identify very distant relationships reliably a wider range of substitutions should be treated favourably.
• BLOSUM‐62
• PAM120
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
• When an alignment is made, each aligned amino acid pair is given a score from the SM. These scores are then summed to give the overall score of the alignment.
• to align and score closely related sequences, the scoring scheme should be strongly biased toward giving high values to perfect matches and highly conserved substitutions Example: what is the overall score for this alignment using the BLOSUM‐62 substitution matrix?
BLOSUM‐62
52
© A. Paccanaro, RHUL 2014
PAM12
© A. Paccanaro, RHUL 2014
Overall score = How SMs are derived [5.1]
• SMs are based on the frequency of particular aminoacid (aa) substitutions observed
Odds ratio:
• Random Model: sequences are unrelated. Every aa
in one sequence is totally independent of the aa in the other sequence. The probability of finding a pair of residues a and b aligned is papb
qa ,b
pa pb
For the entire alignment:

u
 q a ,b 


 pa pb  u
where u ranges over all the residues in the alignment.
 qa ,b 

 a pb  u
 log  p
u
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
It is often practical to use the log‐odds ratio:
• Nonrandom Model: sequences are related. There is a high correlation between aligned aa. The probability of finding a pair of residues a and b aligned is qa,b
Inserting gaps
u
 s 
a ,b u
S
• Homologous sequences are often of different lengths as the result of insertions and deletions (indels)
 insert gaps in the sequences
u
where sa,b is the substitution matrix element associated with the alignment of residues a and b.
© A. Paccanaro, RHUL 2014
S, the score of the alignment, is a measure of the relative likelihood of the whole alignment arising due to the nonrandom model as compared with the random model.
forcing two sequences to match inserting large numbers of gaps will produce a meaningless alignment
• gap penalty: each time a gap is introduced, the score is penalized.
• Insertions tend to be several residues long (gap extension penalty)
• Normally end gaps are not penalized
© A. Paccanaro, RHUL 2014
 q a ,b 
 
 a pb  u
 log  p
Gap penalty assignments
 Linear gap penalty: assign a gap penalty ‐E on aligning any residue with a gap, where E ≥ 0
For a gap of length ngap the penalty is:
g ngap    ngap E
 Gap opening penalty (GOP) [I] {7‐15}
 Gap extension penalty (GEP) [E] {0.5‐2}
g ngap    I  ngap  1 E
© A. Paccanaro, RHUL 2014
High gap penalty (A) vs low gap penalty (B)
© A. Paccanaro, RHUL 2014
 Affine gap penalty:
Dynamic programming
• it refers to simplifying a complicated problem by breaking it down into simpler sub‐problems in a recursive manner
Part #1
Let us look at algorithms for building alignments
 Optimal substructure: the solution to a given optimization problem can be obtained by the combination of optimal solutions to its sub‐
problems
 Overlapping sub‐problems: the space of sub‐problems must be small
[from wikipedia – there is a good introduction there]
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
• 2 keys ingredients:
A classic example: the Fibonacci sequence
Another example: Dijkstra’s algorithm
Edsger Wybe Dijkstra
“He was also known for his low opinion of the GOTO statement in computer programming, culminating in the 1968 article ʺA Case against the GO TO Statementʺ regarded as a major step towards the widespread deprecation of the GOTO statement and its effective replacement by structured control constructs such as the while loop. This methodology was also called Structured programming. “
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
(Rotterdam, May 11, 1930 – Nuenen, August 6, 2002) was a Dutch computer scientist. He received the 1972 A. M. Turing Award for fundamental contributions in the area of programming languages, and was the Schlumberger Centennial Chair of Computer Sciences at The University of Texas at Austin from 1984 until his death in 2002. Why dynamic programming for sequence alignment •
If gaps are allowed, we have many ways to align 2 proteins
THE IDEA:
Because scores for the individual positions are added together, the score of the whole alignment is the sum of the scores of the parts; that is, their contributions to the score are independent.
Thus, the optimal global alignment can be reduced to the problem of determining the optimal alignments of smaller sections.
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
X1 … Xu Xu+1 … Xv Xv+1 … XL
Y1 … Yu Yu+1 … Yv Yv+1 … YL
The Needleman‐Wunsch algorithm for global sequence alignment
Algorithms for aligning protein sequences
• Blosum‐62
• Linear gap penalty g = ‐8ngap
© A. Paccanaro, RHUL 2014
• Local alignment – there are many cases where only parts of sequences are related.
 Smith–Waterman algorithm
© A. Paccanaro, RHUL 2014
• Global alignment – it is for sequences that are similar over their whole length
Two closely related homologous sequences will generally be of approximately the same length, so that their alignment will cover the full range of each sequence.  Needleman–Wunsch algorithm.
• For each matrix element we know the element(s) from which it was directly derived.
• Sm,n gives the optimal alignment score
Thus, having filled the matrix elements from the beginning of the sequences, we determine the alignment from the end of the sequences.
© A. Paccanaro, RHUL 2014
• element Si,j stores the score for the optimal alignment of all residues up to xi of sequence x with all residues up to yj of sequence y.
© A. Paccanaro, RHUL 2014
Traceback: beginning at Sm,n we follow the arrows back through the matrix to the start (S0,0). The Smith‐Waterman algorithm for local sequence alignment
The value of the gap penalty changes the alignment
• Similar to NW, 2 key differences:
Si, j
Si 1, j 1 s  xi  yi 

S i 1, j  g

 max 
S i , j 1 g


0
2. start traceback from the highest‐scoring matrix element wherever it occurs.
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
1. whenever the score of the optimal sub‐
sequence alignment is less than zero it is rejected, and that matrix element is set to zero.
How to find other high scoring alignments
© A. Paccanaro, RHUL 2014
BLOSUM‐62
g = ‐4ngap
• starts by calculating the optimal local alignment.
• set all the matrix elements on this initial path to zero • the matrix needs to be recalculated (be clever about this! right and below only…).
• suboptimal local alignment is identified by locating the highest matrix element. • further suboptimal alignments can be found by repeating the procedure, zeroing every element involved in a previous alignment.
© A. Paccanaro, RHUL 2014
e.g. to detect the presence of repeats in the sequence
BLOSUM‐62
g = ‐8ngap
Part # 3
© A. Paccanaro, RHUL 2014
Alignment score significance
Is it statistically significant?
How likely am I to obtain that alignment by chance?
•
© A. Paccanaro, RHUL 2014
Let us imagine we have got an alignment.
And for that alignment we have a score.
BLOSUM‐62
g = ‐4ngap
it can be proved that the optimal ungapped local alignment score follows the Gumbel extreme‐value distribution
Determine whether the score of an optimal alignment is significantly higher than would be expected for two unrelated sequences
E‐values (BLAST, FASTA)
m = length of query sequence
n = total length of DB entries
, K, constants that depend on the scoring matrix and sequence composition
For gapped local alignments, examination of actual database searches has shown that the scores of optimal alignments also fit an extreme‐value distribution.
© A. Paccanaro, RHUL 2014
probability of obtaining an alignment of score S greater than a value x:
© A. Paccanaro, RHUL 2014
• the score that we have is that of the optimal alignment; that is, it is the best possible score for that particular pair of sequences.  always from the extreme end of the distribution of all alignment scores
Types of Alignment
• The value of P(S ≥ x) is calculated – varies from 0 to 1. • The E‐values are obtained from this by multiplying by the number of sequences in the database.  if there are D database sequences, the E‐values range from 0 to D.
1. Local
Pairwise vs multiple
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
2. Global
The E‐value is the number of database sequences not related to the query sequence that are expected to have alignment scores greater than the observed score
Multiple sequence alignment
CLUSTALW
© A. Paccanaro, RHUL 2014
Searching Databases
Can be used for both DNA and protein sequences
Terminology: false positives, false negatives
searching sequence databases to find sequences that are similar to the query sequence or search sequence that we submit to them
• False positive: results indicates it as POSITIVE, but it is wrong (i.e. false). In other words, it should have been negative but was indicated as positive.
• local alignments first
• top‐scoring database sequences are then candidates for further analysis
© A. Paccanaro, RHUL 2014
• False negative: results indicates it as NEGATIVE, but it is wrong (i.e. false). In other words, it should have been positive but was indicated as negative.
• High sensitivity: few false negatives
• High specificity: few false positives
© A. Paccanaro, RHUL 2014
Extending dynamic programming is not straightforward
© A. Paccanaro, RHUL 2014
 compare all the sequences in a pairwise fashion
 performs a cluster analysis to generate a hierarchy of sequences (tree)  Build a multiple alignment using the tree: align the most similar pairs; then align the other sequences with these pairs until all the sequences have been aligned
• alignment of multiple sequences will give a more reliable assessment of similarity ambiguities in a pairwise comparison can often be resolved when further sequences are compared
NW & SW vs. FASTA & BLAST
Database searching needs to be both sensitive, in order to detect distantly related homologs and avoid false‐negative searches, and also specific, in order to reject unrelated sequences with fortuitous similarity (false‐positive hits). This is not an easy balance to achieve, and search results should be scrutinized with care. [textbook, page 94]
• FASTA and BLAST use dynamic programming, but only for database entries that have a segment sufficiently similar to the query sequences. The methods used to find these entries are purely heuristic; that is, not rigorous.
© A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
• The Needleman–Wunsch and Smith–Waterman methods are rigorous: given a scoring scheme they are guaranteed to find the best‐scoring alignments between two sequences. FASTA
BLAST
K‐tuples: short stretches of k contiguous residues
• relies on finding core similarity, which is defined by a window of preset size (called a “word”) with a certain minimum density of matches (for DNA) or with an amino‐acid similarity score above a given threshold (for proteins) – these amino acid word‐matches The program makes up a dictionary of all possible k‐
tuples within the query sequence (hashing).
do not only include identities and they are scored with a standard substitution matrix.
for each k‐ tuple in the searched sequences, FASTA only has to consult the dictionary to find out if it occurs in the query sequence.
• blastp compares an amino acid query sequence against a protein‐sequence database • blastn compares a nucleotide query sequence against a nucleic acid sequence database
• blastx compares a nucleotide query sequence translated in all reading frames against a protein‐
sequence database • tblastn compares a protein query sequence against a nucleotide‐sequence database dynamically translated in all reading frames
• tblastx compares the six possible translations of a nucleotide query sequence against the six frame translations of a nucleotide‐sequence database
© A. Paccanaro, RHUL 2014
Programs in the BLAST suite
1. all suitable matches are located in each database sequence.
2. matches are extended without including gaps, and on this basis the database sequences are ranked. 3. The highest‐scoring sequences are subjected to dynamic programming to obtain the final alignments and scores. © A. Paccanaro, RHUL 2014
© A. Paccanaro, RHUL 2014
K‐tuple regions are then extended
Main BLAST steps: