Bioinformatics Exercises
  A. Problems. Sequence Analysis
  Writer : Seyeon Weon   Updated : 10-14

(If you have taken the course and want to submit your answers for evaluation, please read the instructions linked on the table of contents page.)

  1. Plot the GC content of sequence 2 from problem 8 using the sliding window method (window size 4, step size 1).
  2. Describe how you would choose the window size and the step size when using the sliding window method.
  3. Suppose that you are choosing a threshold for a sequence analysis program based on the sliding window method. You have some positive examples and some negative examples, meaning that their identities have been determined in advance by outside knowledge. How would you use these examples to choose the threshold?
  4. An impressive-looking result of a novel motif modeling method has just been presented to you. The result looked almost perfect: the method detected all the motif sites in the sequences the authors used and reported no false sites. Now it is your turn to comment on this incredible accomplishment. What would you say?
  5. cDNA sequences are obtained by sequencing mRNA via complementary DNA. Sometimes we want to find the genomic DNA that corresponds to a cDNA sequence. For this purpose, how would you adjust the gap penalties? If you were to use BLASTX (instead of BLASTN) for this purpose, which weight matrix would you use? (Even though better methods are available for this purpose, this problem gives you a chance to think about BLAST in different situations.)
  6. With a protein sequence of your choice, obtain the homologous sequences using BLAST. Run ClustalW to build a multiple alignment of the sequences.
  7. Create a PSSM for a region 10 aa long from the multiple alignment in problem 6. Add a pseudocount of 1 for the amino acids that do not appear in a column. Use the amino acid composition of SWISS-PROT at http://www.expasy.org/sprot/relnotes/ as the background probability of amino acids.
  8. Draw the dynamic programming table for the following two sequences using Needleman-Wunsch style alignment. (Use 1 for a match, -3 for a mismatch, and a constant gap penalty of -3.)
    sequence 1: AGTGTCTGCACGG
    sequence 2: AGGGCTTGCTCCG
  9. Keratin and collagen are the most abundant animal proteins. Study their structures in a biochemistry textbook if you have not done so yet. These structural proteins pose problems for sequence alignment. Explain why.
  10. What property should a pair of amino acids have to receive a large value in the PAM or BLOSUM matrices? Answer by describing in words the meaning of the log-odds ratio used to calculate the matrices.
  11. Calculate the information content of each nucleotide at each site in the following multiple alignment.
    sequence 1: GCTG
    sequence 2: GAAA
    sequence 3: GTTC
    sequence 4: GGTG
  12. Most higher eukaryotes have GC contents not far from 50%. However, some prokaryotes have strongly biased GC contents. A notable example is thermophilic bacteria, which require a high GC content to withstand high temperatures. Calculate the average Shannon entropy (in bits) for a genome with a GC content of 50%. Also calculate it for a genome with a GC content of 80%. Considering genomes as recording media for protein sequences, how can the latter genome overcome the weakness of being a poor recording medium?
  13. Draw a Sequence Logo of the multiple alignment from problem 6. (Since you know what you are looking for, the Google god will guide you.)
  14. Write a regular expression for the DNA sequences that code for the trypsin cleavage sites of proteins.
  15. CDD (Conserved Domain Database) and the search tools related to it are relatively recent additions to the NCBI site. Pick three or more multi-domain proteins of your choice and do the following:

    (a) Search for conserved domains in your proteins and explain the results.
    (b) Using CDART, search for proteins whose domain architectures are similar to those of your proteins and explain the results.
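
For the sliding window method in problems 1 to 3, the following is a minimal Python sketch, not a reference solution. The function name gc_content_windows and the text-bar output are illustrative assumptions; it scores only full windows and reports each window by its start position, which is one of several reasonable conventions.

```python
def gc_content_windows(seq, window=4, step=1):
    """Return (start, gc_fraction) pairs for every full window along seq.

    start is the 0-based index of the first base in the window.
    Partial windows at the end of the sequence are skipped (an assumption).
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        results.append((start, gc / window))
    return results


if __name__ == "__main__":
    # Sequence 2 from problem 8, window size 4, step size 1.
    sequence_2 = "AGGGCTTGCTCCG"
    for start, fraction in gc_content_windows(sequence_2, window=4, step=1):
        bar = "#" * round(fraction * 20)  # crude text plot
        print(f"{start + 1:2d} {sequence_2[start:start + 4]} {fraction:.2f} {bar}")
```

Changing window and step in the call shows how a larger window smooths the profile and a larger step reduces the number of points, which may help when thinking about problem 2.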
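
For the entropy and information content calculations in problems 11 and 12, a small Python sketch may help with checking hand calculations. It rests on assumptions that are not stated in the problems: genome positions are treated as independent, G and C are equally frequent (as are A and T), and the information content of an alignment column is taken as 2 bits minus the column entropy, with no small-sample correction. All function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def genome_entropy(gc_fraction):
    """Per-nucleotide entropy, assuming G = C and A = T (an assumption)."""
    g = c = gc_fraction / 2
    a = t = (1 - gc_fraction) / 2
    return shannon_entropy([a, c, g, t])


def column_information(column):
    """Information content of one DNA alignment column: log2(4) - H."""
    total = len(column)
    frequencies = [n / total for n in Counter(column).values()]
    return 2.0 - shannon_entropy(frequencies)


if __name__ == "__main__":
    for gc in (0.5, 0.8):
        print(f"GC {gc:.0%}: {genome_entropy(gc):.3f} bits per nucleotide")

    # Alignment from problem 11; zip(*rows) yields the columns.
    alignment = ["GCTG", "GAAA", "GTTC", "GGTG"]
    for site, column in enumerate(zip(*alignment), start=1):
        info = column_information(column)
        print(f"site {site} {''.join(column)}: {info:.3f} bits")
```

In a sequence logo, the height of each nucleotide at a site is usually its frequency multiplied by the information content of that site, so the per-nucleotide values follow directly from the column values.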
