Bioinformatics Exercises
  A. Hands-on. Sequence Retrieval and Manipulation
  Writer : Seyeon Weon   Updated : 10-14

(For those who have taken the courses and want to submit for evaluation, please read the instructions linked on the table of contents page.)

Please set up your Linux box with all documentation installed, get a good Unix book such as UNIX Power Tools, and do the following:

  1. GenBank:

    Read the GenBank release note. Also, read the GenBank overview page at NCBI and the links within the page. Briefly answer the following questions:
    1. In what interval is GenBank updated?
    2. What other international databases have the same content as GenBank?
    3. If you want to obtain the sequences deposited in GenBank during the past 5 days, where would you go for them?
    4. If you want to search GenBank using BLAST only against the sequences deposited during this month, how should you do it?
    5. In the flat file release of GenBank, which file contains the sequences of chimpanzee? Which one contains the sequences of yeast?
    6. The three biggest classes of files in the flat file release of GenBank are gbest, gbgss, and gbhtg. Briefly explain what they are.
    7. NCBI is doing a great job of maintaining GenBank and supporting biomedical researchers all over the world. However, you should not rely on the services provided by NCBI for your own mainstream genomics work. Briefly explain why.

  2. Inside GenBank:

    Also, look inside a GenBank flat file and briefly answer the following questions:
    1. What is the record separator?
    2. What does CDS stand for?
    3. What unique identifiers are used to identify individual records? What are the differences among them?
    4. What are the cross references used to link to other databases?
    5. For extra points, briefly explain what kinds of software tools you would use to parse the database. What do you think about the structure of the records in the database?
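
As a rough illustration of the parsing question above, the following Python sketch splits flat-file text into records and reads the first token after the LOCUS, ACCESSION, and VERSION keywords. It assumes that each record starts at a LOCUS line and ends at a line containing only "//". The function names and the sample records are made up for illustration; they are not real GenBank entries, and this is only one of many possible approaches.

```python
# Toy flat-file text: two made-up records, not real GenBank entries.
SAMPLE_GENBANK = """\
LOCUS       TOY0001                  12 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Made-up record one.
ACCESSION   X00001
VERSION     X00001.1
ORIGIN
        1 acgtacgtac gt
//
LOCUS       TOY0002                  12 bp    DNA     linear   ROD 01-JAN-2000
DEFINITION  Made-up record two.
ACCESSION   X00002
VERSION     X00002.1
ORIGIN
        1 ttgcaattgc aa
//
"""


def split_genbank_records(text):
    """Return a list of records; each runs from a LOCUS line up to (not including) '//'."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            current = [line]              # a new record begins here
        elif current is not None:
            if line.strip() == "//":      # assumed end-of-record line
                records.append("\n".join(current))
                current = None
            else:
                current.append(line)
    return records                        # an unterminated trailing record is dropped


def record_identifiers(record):
    """Return the first token after LOCUS, ACCESSION, and VERSION in one record."""
    ids = {}
    for line in record.splitlines():
        if not line or line[0].isspace():
            continue                      # indented lines are continuations or features
        fields = line.split()
        if fields[0] in ("LOCUS", "ACCESSION", "VERSION") and len(fields) > 1:
            ids.setdefault(fields[0], fields[1])
    return ids


if __name__ == "__main__":
    for rec in split_genbank_records(SAMPLE_GENBANK):
        print(record_identifiers(rec))
```

A line-oriented format like this is also easy to handle with tools such as awk or Perl, which is worth keeping in mind when you comment on the record structure.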

  3. PubMed:

    Search PubMed for "computational methods for protein coding region (i.e., gene) identification". (Do not use this phrase as your query!) Submit your work as follows:
    1. What are the MeSH terms which are closely related to this concept?
    2. Submit the final query that you used and the total number of items found from the final query.
    3. Obtain your result in MEDLINE format. Submit the first 200 items. (If you found fewer than 200 items, then submit as many as you found.)
    4. Submit only the "author", "title", and "source" fields. (Do not do this by hand in a word processor. Use a Unix tool!)
    5. For extra points, import the result into your favorite bibliographic software and submit the imported file as a separate attached file. State the name and the version of the software you used.

  4. Entrez:

    Use Entrez to obtain as many sequences as possible that belong to the globin (hemoglobin and myoglobin) family. If you are a biology major, substitute something of your own interest for "globin" and do the same things below. If your choice has not been established as a family, then use the result from the PSI-BLAST problem in project 2 and treat it as if it were an established family. Also, if your choice has very few homologous genes, then choose another one with more. Don't forget that you will have to use these sequences to construct a phylogenetic tree later.
    1. Submit the LOCUS lines of the sequences. Be careful to avoid redundancy. That is, if there is more than one entry in GenBank for a gene from a species, you should choose only the one with the most complete information for the gene.
    2. What are the "classes" that have this family according to your result? "Class" here is a taxonomic term.
    3. In the human genome, where is the family located? Where is it located in the mouse genome?
    4. What is the most common genetic disease that is caused by a disorder of this family among Asians? If your choice doesn't have known relationship to any human genetic disease, use "globin" instead to answer this question.
    5. Is there any polymorphism known for the family?

  5. Sequence File Manipulation:

    In the near future, I hope that you will learn how to do the same things with Perl and BioPerl. For now, let's do it with other available tools. Using the sequences obtained in problem 4, do the following:
    1. Save each sequence in a separate file in GenBank format. A better way to do this is to save them all at once as one big concatenated file, a function provided by the NCBI web interface, and then split the file into pieces using Unix tools. Also, it is much more convenient for later use to name the sequence files after their LOCUS names, since these are often meaningful as well as unique. You will get extra points for doing it this way. (Hints: csplit, grep, tr, cut) For the extra points, submit the command line or shell script that you used. However, do not submit the sequence files themselves.
    2. Using readseq, create a concatenated FASTA format file that contains all the sequences from the above files. Of course, you can get the same thing using the web interface provided by NCBI. Submit the command line or shell script you used. However, do not submit the sequence file itself. Also, do not use readseq in its interactive mode. (If you do, you won't have any command line to submit.)
    3. Get the EMBOSS package and install it on your Linux box. If you don't know how to install software on your Linux box, this is a good time to learn. Using coderet from the package, extract all the protein sequences from the CDS fields in the feature tables of your sequence files. Submit the protein sequences in a concatenated FASTA format file.
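
For the file-splitting step in item 1 of this problem, the logic can be sketched in Python as below. The sketch assumes that every record in the concatenated file starts with a LOCUS line whose second token is the LOCUS name and ends with a line containing only "//". The function name, the ".gb" file extension, and the sample records are assumptions made for illustration. The exercise expects the same result from Unix tools such as those named in the hints, so this shows only what the split has to do, not how to do it with them.

```python
import tempfile
from pathlib import Path

# Two made-up records, not real GenBank entries.
SAMPLE_CONCATENATED = """\
LOCUS       TOY0001                  12 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Made-up record one.
ORIGIN
        1 acgtacgtac gt
//
LOCUS       TOY0002                  12 bp    DNA     linear   ROD 01-JAN-2000
DEFINITION  Made-up record two.
ORIGIN
        1 ttgcaattgc aa
//
"""


def split_by_locus(concatenated_path, out_dir):
    """Write each record to <out_dir>/<LOCUS name>.gb and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, current, name = [], [], None
    with open(concatenated_path) as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                name = line.split()[1]        # second token is the LOCUS name
                current = [line]
            elif name is not None:
                current.append(line)
                if line.strip() == "//":      # assumed end-of-record line
                    target = out_dir / f"{name}.gb"
                    target.write_text("".join(current))
                    written.append(target)
                    name = None
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "all.gb"
        source.write_text(SAMPLE_CONCATENATED)
        for path in split_by_locus(source, Path(tmp) / "split"):
            print(path.name)
```

Naming each file after its LOCUS name means that two records with the same LOCUS name would overwrite each other, which is one more reason to remove redundant entries first.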