Bioinformatics Exercises
  C.Problems. Seqeunce Analysis II (Multiple Alignment, Motif)
  Writer : Seyeon Weon   Updated : 10-14   Hit : 2451   Updates 
(For those who have taken the courses and want to submit for evaluation, please read the instructions linked on the table of contents page.)
  1. Using the progressive pairwise alignment method with the sum of pair scoring scheme, perform multiple alignment of the following 3 sequences:

    AATGC
    AAGC
    AGGC

    Choose appropriate match scores and gap penalties and try to incorporate what you learned from class as much as possible.
  2. A greedy algorithm is usually a clever choice as long as local minima can be avoided. Multiple alignment would be one of those kinds of problem that could be easily solvable by a greedy algorithm if we were allowed to ignore the local minima problem. Of course, we can never ignore it. However, there can be some way to minimize the adverse effect of local minima. Briefly describe what it is (in multiple alignment). Also, this idea can be incorporated in a stochastic framework, and then the results become much better in general. Briefly describe one (or more, if you want) of the most successful currently available ones that use this strategy for multiple alignment.
  3. Below is a multiple alignment of an imaginary promoter region. First, make a consensus sequence for the region. Second, write a regular expression that may describe the region best. (You can either use the common Unix style regular expression or the PROSITE style.) Third, make a position specific weight matrix. Make your own choice for the problem of unobserved nucleotides and describe how you did and why.

    CCGTGTTC
    AGGTCGTG
    GCCTGGTC
    AGATCATT
    GAATGTTC
    AGGTGTTC

    What are the scores obtained using the position specific weight matrix for the following two sequences? Which sequence would be the better candidate for the promoter when using the consensus sequence method, the regular expression method, and the position specific weight matrix method, respectively? Explain the results.

    Sequence 1 : AGGCGTAC
    Sequence 2 : CAGTCATC
  4. Most higher eukaryotes have GC contents not far removed from 50%. However, some prokaryotes have strongly biased GC contents. Noticeable examples are thermophilic bacteria and they require it to withstand high temperature. Calculate the average Shannon's entropy (in bits) for the genome with the GC content of 50%. Also, calculate it for the genome with the GC content of 80%. Considering genomes as the recording media for protein sequences, how can the latter genome overcome the weakness of being poor recording media?
  5. Mammalian genomes are significantly lack of CG dinucleotide. CpG island is the region with higher frequency of CG dinucleotide around the translation starting site of genes. CpG island can be found in most of house keeping genes. Therefore, it is one of the important clues to find a gene. (The structure of mammalian genes are very complex and it is hard to find the starting site of gene.) Using some sequences obtained from GenBank, create two transition matrices for the Markov models to detect CpG islands, one for CpG island regions and the other for non-CpG island regions, respectively. Submit the accession numbers of the sequences used, too.
  6. Using the data set used in problem 5, perform leave-one-out cross-validation. Describe the quality of your model.
  7. A novel DNA sequence, which does not show any significant homology with GenBank sequences, was given to you by someone who is infamous for his ill nature. Since the sequence was from that person, you are starting to doubt that it may not be of biological origin. Devise as many methods as you can think of to confirm that it is not biologically originated.
  8. Choose some fraction of the sequences used in the problem 5 and create a probable HMM for CpG island. You can skip the EM process (i.e., making it a single step by relying on your brain’s power of pattern recognition).

    (a) Draw the model.
    (b) Identify the Viterbi path and calculate the probability for the following sequence:

    ATCGAGTA
  9. A sequence of coin tosses were given to you as the following:

    HHHTHHTHHHHHTHTHTHHT

    What is the maximum likelihood estimator for the probability of head appearing and what is the LOD score comparing to the null hypothesis that the coin is a fair one?
  10. If the sequence in problem 9 is in fact generated by two coins. One is a fair coin and the other one is a loaded coin. Also, the probability of head appearing for the loaded coin is the one calculated in problem 9. Finally, the probabilities of the exchange between two coins are 0.5 for both directions. Then, what would be the Viterbi path of the HMM for the sequence?
  11. To use a Bayesian method, you need to have a prior probability, which is a subjective one. This is the main weakness of the method and it may be the price we have to pay for playing god. However, we can find a "good" prior probability for certain problems from experiences or observations. Briefly explain about these good prior probabilities for protein sequences.
  12. If Baum and Welch algorithm is used to reestimate the parameters of the HMM in problem 10, would the probability of head appearing for the loaded coin increase or decrease? You don't need to do the whole process of Baum and Welch algorithm. You only need to provide a brief rationale for your choice.

Up