Bioinformatics Exercises
  B.Problems. Sequence Analysis I
  Writer : Seyeon Weon   Updated : 10-14

(If you have taken the courses and want to submit your work for evaluation, please read the instructions linked on the table of contents page.)

Install the latest version of bioperl and write perl scripts for the following tasks:

  1. Obtaining Sequences by Entrez Search:

    Write perl scripts for the following tasks. Submit the perl scripts only. Do not submit the results.
    1. Using bioperl, search GenBank with a keyword of your choice and retrieve the matching entries. If you have no preferred keyword, use "leghemoglobin". Store the entries in files for later use. (If you are querying the NCBI site, keep this assignment modest. For real work you would need your own local copy of GenBank. In other words, the number of entries retrieved should not be in the tens of thousands. However, since you will have to do a multiple alignment with them, the number should be more than about 50.)
    2. Parse the files, extract all the protein sequences in the CDS fields, and store them in a single concatenated FASTA format file.
    3. Parse the files from 1.1 and create a summary file which contains "VERSION, GI, protein_id, MEDLINE, DEFINITION". (Use a newline as the field separator. That is, each field occupies one line.)
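
    The output layouts asked for in 1.2 and 1.3 can be sketched as follows. This is only a minimal illustration in plain Python, not the bioperl script the exercise asks for. The helper names `to_fasta` and `summary_block`, the 60-column line width, and the choice to leave an empty line for a missing field are assumptions made for the example.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as one concatenated FASTA string."""
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        # Split the sequence into fixed-width lines (60 columns is assumed here).
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Field order as listed in task 1.3.
FIELDS = ["VERSION", "GI", "protein_id", "MEDLINE", "DEFINITION"]


def summary_block(entry):
    """Return the fields of one entry, one per line, in the listed order.

    `entry` is a plain dict; a missing field becomes an empty line
    (an assumption, the exercise does not say how to handle gaps).
    """
    return "\n".join(str(entry.get(name, "")) for name in FIELDS) + "\n"
```

    In a real script the (identifier, sequence) pairs and the field values would come from the parsed GenBank entries; here they are just arguments.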

  2. Obtaining Sequences by BLAST Search:

    Write perl scripts for the following tasks. Submit the perl scripts only. Do not submit the results.
    1. Select (carefully) one sequence from 1 and run a BLAST search using BLASTP.
    2. Retrieve the entries whose expect values are better (that is, lower) than a threshold of your choice, and perform the same tasks as in 1.1, 1.2, and 1.3, omitting protein_id in 1.3.
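
    In BLAST a smaller expect value means a more significant hit, so a threshold is normally applied as an upper bound. The selection step can be sketched as below, in plain Python rather than the requested perl. The function names `parse_evalue` and `select_hits` and the (hit identifier, expect value text) input shape are assumptions for the example. The sketch also allows for report text that abbreviates an expect value such as 1e-100 to "e-100".

```python
def parse_evalue(text):
    """Convert an expect value written as text into a float."""
    text = text.strip()
    # Some report formats print "e-100" for 1e-100; add the missing mantissa.
    if text.lower().startswith("e"):
        text = "1" + text
    return float(text)


def select_hits(hits, threshold):
    """Keep the identifiers of hits whose expect value is at or below threshold.

    `hits` is a list of (hit_identifier, evalue_text) pairs, a hypothetical
    shape standing in for whatever the BLAST report parser returns.
    """
    return [hit_id for hit_id, evalue in hits
            if parse_evalue(evalue) <= threshold]
```

    For example, with a threshold of 1e-5, hits reported at "2e-35", "e-100", and "0.0" would be kept and a hit at "0.5" would be dropped.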

  3. Redundancy Removal:

    First, install BLAST locally without databases. (Of course, you can install databases too if you can afford to.) Alternatively, you can use other alignment tools such as Smith-Waterman or a local FASTA if you prefer.
    1. Merge the protein sequences obtained from 1 and 2, removing duplicate protein sequences by comparing protein_id and VERSION. Submit the perl script for it.
    2. Do a pairwise alignment for every possible pair of sequences. Write a perl script that automates the alignment jobs and builds the matrix. Each entry in the matrix should hold "Expect value, Score, and three values for Identities" (the matched count, the aligned length, and the percentage, as in a BLAST report line of the form "Identities = 45/120 (37%)"). Therefore, there are 5 values in total per entry.
    3. Choose cutoff values for the alignment length and the percent identity, and remove the nearly identical sequences. Submit a list of the first 80 amino acids of each sequence that remains after the redundancy removal.
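
    The identity values and the cutoff step can be sketched as below. This is a plain Python illustration under stated assumptions, not the requested perl script and not the only reasonable filtering strategy: it assumes the report contains a line of the form "Identities = 45/120 (37%)", and it uses a simple greedy rule (keep a sequence unless it is nearly identical to one already kept). The names `parse_identities`, `remove_redundant`, and `first_residues` are hypothetical.

```python
import re

# Matches the three identity values, e.g. "Identities = 45/120 (37%)".
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_identities(line):
    """Return (matched, aligned_length, percent) or None if the line differs."""
    match = IDENTITIES.search(line)
    if match is None:
        return None
    matched, length, percent = (int(group) for group in match.groups())
    return matched, length, percent


def remove_redundant(ids, pairs, min_length, min_percent):
    """Greedy filter over the pairwise matrix.

    `pairs` maps (id_a, id_b) to (matched, aligned_length, percent).
    A sequence is dropped when its alignment to an already kept sequence
    reaches both cutoffs; pairs missing from the matrix count as dissimilar.
    """
    kept = []
    for candidate in ids:
        redundant = False
        for other in kept:
            values = pairs.get((other, candidate)) or pairs.get((candidate, other))
            if values is None:
                continue
            _matched, length, percent = values
            if length >= min_length and percent >= min_percent:
                redundant = True
                break
        if not redundant:
            kept.append(candidate)
    return kept


def first_residues(sequence, count=80):
    """First `count` amino acids of a sequence, as asked for in task 3.3."""
    return sequence[:count]
```

    Note that a greedy rule like this depends on the order of `ids`: a sequence that resembles only an already removed one is kept.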

  4. Clustering and ClustalW:

    Well, this is not exactly "the" method we usually use, but the pieces fit together perfectly here. So, do the following:
    1. Cluster the final matrix from 3 using bioperl's k-means clustering module. Which one, among Expect value, Score, and Percent Identity, would you use for clustering? Also, submit the perl script.
    2. Briefly describe the disadvantage of using k-means clustering instead of other clustering methods for this purpose. (Hmm... Let's contribute some code to bioperl...)
    3. Explain what you have obtained. That is, the number of clusters, a brief description of each cluster, and so on. Also, if the work you have done so far on obtaining sequences and removing redundancy were for a real research project, what more would you do?
    4. Choose one of the clusters and do a multiple alignment using ClustalW. Of course, you need to install ClustalW locally. Submit the perl script and the resulting alignment.

  5. PubMed:
    1. Retrieve the MEDLINE record (including the abstract) for each sequence in the cluster you used for the multiple alignment. Submit the list of title lines and the python script.
    2. Briefly describe what you could learn about the proteins in the cluster by reading the abstracts.