| C.Problems. Sequence Analysis I (Pairwise, BLAST)|
| Writer : Seyeon Weon
Updated : 10-14
Hit : 4010
(For those who have taken the courses and want to submit for evaluation, please read the instructions linked on the table of contents page.)
- Here is a part of my own quick and dirty implementation of Smith-Waterman algorithm in python, which has been mailed to you as an example. This function returns two strings with gaps (i.e., dash lines) at appropriate positions in the original sequences. seq1 is the first original sequence and seq2 is the second one, and align is the optimal path obtained from the Smith-Waterman algorithm. k is the number of matched positions between two sequences (Of course, you can obtain it by examining align instead of passing as an argument).
What is the asymptotic complexity of this function with respect to the average length, n, of the original sequences. Can you think of any improvement that can result in better asymptotic complexity?
- Using the Boyer-Moore algorithm, find the so-called TATA box in the following sequence:
Schematically draw how you did it in such a way that can be understood by everybody.
- Which one, DNA sequence or protein sequence, would yield better performance in general for Boyer-Moore string search algorithm?
- Write your own implementation of Boyer-Moore string search algorithm using the extended bad character shift rule. And, go ahead to implement the strong good suffix shift rule and complete it with the maximum possible shift after a match is found.
- The term motif is commonly used for the conserved regions in both protein sequences and DNA sequences. (a) Concisely explain what they are in biological sense. (b) Which one, between protein and DNA, is easier to deal with computationally? Explain why. (7 lines for each)
- Draw the deterministic finite state automata that can identify all codons for the negatively charged amino acids.
- Draw the dynamic programming table for the Smith-Waterman algorithm with O(n^3) time complexity. The sequences to be used are the following:
Use your own taste, though should be reasonable one in biological sense, of match scores, gap opening and gap extension penalties. Also, you are free to choose either global alignment or local alignment as you like. These apply to problem 8 and 9, too.
- Write your own implementation of Smith-Waterman algorithm. Scoring matrices can be obtained from
ftp://bio-mirror.kr.apan.net/pub/software/blast/matrices/. Gotoh's method should be used as long as the affine gap penalty is used.
- Using the same sequences as in problem 7, align them using the so-called Gotoh's improvement in time complexity of O(n^2). Show the data structure and the actual values in it used for the improvement.
- The space saving algorithm of Smith-Waterman algorithm accomplishes O(n) space complexity by sacrificing some time complexity. The essential idea is that recalculation of each value requires the calculation of the diminishing length by the geometric sequence, 1/2, 1/4, 1/8, 1/16, ... Using this fact, describe the overall overhead of the time complexity of the algorithm comparing to that of the standard Smith-Waterman algorithm.
- Using the same sequences as in problem 8, align them using Myers' linear space algorithm. Show the data structure and the actual values in it used for the alignment of the sequences.
- Affine gap penalty is commonly used one. Logarithmic gap penalty has been attempted, too. Why do we try it even if it worsens the computing time? Also, why can't we use more complex gap penalties such as periodic gap penalty?
- You, as the representative bioinformatician of earth, are visiting a planet in our galaxy. The intelligent organisms who invited you need a lot of help from you since the stage of their bioinformatics field is only in its infancy. They, the intelligent organisms in the planet, already have Smith-Waterman algorithm. Of course, it is called differently. However, they haven't realized better way of assigning match scores and they are using a constant value for all matches. It is your job to give them a version of PAM matrix of their own. The multiple alignment of the homologous sequences you gathered is the following:
Notice that their proteins (or something equivalent to protein) has only three kinds of amino acids (or something equivalent). Create PAM1 and PAM2 for their proteins. (They will be easily learning how to make higher PAMs after that.)
- After searching GenBank using BLAST with a protein sequence of your interest, you couldn't find any highly significant match. However, some sequences with marginal significance were detected. When you examined the amino acid compositions of the detected sequences, you found that they were rich in Pro, Gly, Lys, and Leu. Explain why these matches may not be of biological significance.
- It seems that the growth of GenBank size is about to win against what Moore's law can provide us. For several years up to now, due to the pretty impressive development in the PC world, a well furnished but inexpensive PC can serve quite nicely as a BLAST machine for small but common projects. However, we feel a crisis is coming and this happy hour is about to end. (This problem was made during 2001. It seems that this crisis hasn't come yet.) As you might quite often have to explain this changing situation to your boss or your colleagues, write down (a) your own explanation and (b) a possible solution to overcome this problem. Of course, you should use as much computer knowledge as possible since your boss or your colleagues have a quiet good knowledge of computer. (7 lines for each)
- (a) Verbally explain what the number at the right most column of a BLAST output (e.g, 6e-12, etc.) means. (b) Give two examples in real world (or everyday life) that might follow extreme value distribution. (7 lines for each)
- The alignment score between two aligned sequences is simply the sum of corresponding values in weight matrix minus the sum of gap penalties. However, alignments are obtained as a list by matching against a large collection of biological sequences (i.e., GenBank). In the list, the ones with high scores are considered to be due to homology (meaning common ancestry). The problem we are confronting here is that homologous proteins are not always well conserved and two non-homologous proteins can give a quite high score by random chance. Another deteriorating factor is that these sequences code for proteins that have the constraint of being physical entities (e.g., being a certain type of fold, etc.). Also, a worsening effect on this problem is that the distribution of alignment scores obtained from the search against a large collection of sequences follows extreme value distribution. Explain (within 5 lines) why this fact (i.e., following extreme value distribution) worsens the problem? Remember that the amount of biological sequences in the collection is getting ever bigger and bigger, too. Considering above discussions, what would you do to overcome the problem. (within 7 lines)
- (a) What is the effect of different word size in BLAST? (b) Can you think of any situation that may be better to turn off the low complexity filter, which is turned on by default? (7 lines for each)
- In spite of its weaknesses, the original BLAST (i.e., BLAST version 1.x) had been used as the primary sequence search tool for several years. FASTA was another one that was competing with BLAST. Answer the following questions (within 5 lines for each):
- What was their common weakness?
- Why BLAST was more popular than FASTA?
- What is the major improvement of BLAST version 2.x over BLAST version 1.x?
- Even though sequence alignment using dynamic programming strategy may be the most successful computational tool ever developed to deal with biological information, it can only be a computer algorithm that is blind to the majority of uncommon situations. Also, one obvious problem is what is in the question that "how would you choose the proper PAM matrix or gap penalty?" One method to do something for above question is checking the number of suboptimal alignments in varying PAM matrices (or gap-penalties). Explain the relationship between the choice of PAM matrices and the number of suboptimal alignments. Of course, explain the reason for the relationship.
- You are trying to grasp the similarity and rearrangement between the whole genomes of two relatively close microorganisms using a dot matrix software. Describe the situation when the following patterns are observed in the output plot.
(1) diagonal lines
(2) the other rather scarce diagonal lines perpendicular to (1)
(3) short diagonal lines with the same direction as (2), and (1) cross them at the middle
(4) horizontal lines
(5) rectangular shape blobs