Bioinformatics Exercises
  B. Problems. Sequence Analysis II
  Writer : Seyeon Weon   Updated : 10-14

(If you have taken the course and want to submit your work for evaluation, please read the instructions linked on the table of contents page.)

Write the following Perl scripts:

  1. Reverse Complement:

    FASTA is a very simple format and the one most commonly used to exchange biological sequences. A record consists of a header line that begins with '>' followed by one or more lines of sequence.

    Write a Perl script that reads a DNA sequence in FASTA format and outputs its reverse complement, also in FASTA format.
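
    As a starting point, the reverse complement can be sketched with Perl's built-in reverse and tr operators. This is a minimal sketch, not a reference solution: the sub names (parse_fasta, revcomp, format_fasta), the 60-column line width, and the example record are all made up for illustration, and ambiguity codes such as N are simply left unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a single FASTA record held in a string; returns (header, sequence).
sub parse_fasta {
    my ($text) = @_;
    my ($header, $seq) = ('', '');
    for my $line (split /\n/, $text) {
        $line =~ s/\r$//;                 # tolerate DOS line endings
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^>(.*)/) { $header = $1; next; }
        $line =~ s/\s+//g;                # drop stray whitespace
        $seq .= $line;
    }
    return ($header, $seq);
}

# Reverse the sequence, then swap each base for its complement.
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Format a record as FASTA, wrapping the sequence at $width columns.
sub format_fasta {
    my ($header, $seq, $width) = @_;
    $width ||= 60;                        # assumed default line width
    my $out = ">$header\n";
    $out .= "$1\n" while $seq =~ /(.{1,$width})/g;
    return $out;
}

# Hypothetical record, invented for this example.
my $record = ">example_seq hypothetical test record\nATGCGTAC\nGGTTAA\n";
my ($header, $seq) = parse_fasta($record);
print format_fasta("$header (reverse complement)", revcomp($seq));
```

    A real script would read the record from a file or standard input instead of the hard-coded string.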
  2. Random Sequence by Shuffling:

    Sometimes you need random sequences to use as inputs that should produce baseline values. The most common way to generate a random sequence is to randomly shuffle the bases or residues of an existing sequence. In this way you automatically obtain a random sequence with the same composition and length as the input sequence. Write a Perl script that reads a DNA sequence in FASTA format and outputs a randomized sequence (using the above method) in FASTA format.
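
    The shuffling step itself can be sketched with the shuffle function from the core List::Util module. The sub names and the input string below are hypothetical; the composition helper is only there to show that shuffling leaves base counts and length unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split the sequence into single characters, shuffle them, and rejoin.
sub shuffle_seq {
    my ($seq) = @_;
    return join '', shuffle(split //, $seq);
}

# Summarize base counts as a string such as "A=4 C=2 G=4 T=4".
sub composition {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, $seq;
    return join ' ', map { "$_=$count{$_}" } sort keys %count;
}

# Hypothetical input sequence, invented for this example.
my $seq      = "ATGCGTACGGTTAA";
my $shuffled = shuffle_seq($seq);

print ">shuffled\n$shuffled\n";
print "original: ", composition($seq), "\n";
print "shuffled: ", composition($shuffled), "\n";
```

    The two composition lines should always match, while the sequence line changes from run to run.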
  3. Random Sequence by de novo Generation:

    Write a Perl script that takes the length and GC content as options and creates a random sequence (in FASTA format) with the specified length and GC content. The composition of bases should follow the rules that were described in our lecture. (For extra points: can you make your script reproduce the local patchiness of base distribution seen in biological sequences?)
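
    One possible shape for such a script is sketched below. The lecture rules are not reproduced on this page, so the sketch assumes the simplest model: each position is drawn independently, G and C each with probability GC/2, and A and T each with probability (1 - GC)/2. The option names --length and --gc and their defaults are also assumptions. Note that under this model the realized GC content only approximates the requested value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# Draw each base independently (assumed model, see the text above):
# with probability $gc pick G or C, otherwise pick A or T.
sub random_seq {
    my ($length, $gc) = @_;
    my $seq = '';
    for (1 .. $length) {
        if (rand() < $gc) { $seq .= rand() < 0.5 ? 'G' : 'C'; }
        else              { $seq .= rand() < 0.5 ? 'A' : 'T'; }
    }
    return $seq;
}

# Hypothetical option names and defaults.
my ($length, $gc) = (60, 0.5);
GetOptions('length=i' => \$length, 'gc=f' => \$gc)
    or die "Usage: $0 [--length N] [--gc FRACTION]\n";
die "--gc must be between 0 and 1\n" if $gc < 0 || $gc > 1;

print ">random length=$length gc=$gc\n", random_seq($length, $gc), "\n";
```

    If an exact GC content is required, an alternative is to build a string with the exact base counts and then shuffle it.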
  4. Sliding Window Method:

    Obtain a biological sequence from GenBank. Generate a shuffled sequence using the script written in 2. Also, using the script written in 3, generate a de novo random sequence with the same GC content as the original biological sequence. Examine and compare some statistical properties of these three sequences using a sliding window method written in Perl. The statistical properties that must be examined are the monomer and dimer distributions. You can also try other properties for extra points. Gnuplot is recommended for visualizing the results, though you may use your own favorite tool. Submit the PostScript files of your visualization, too.
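
    The sliding window idea can be sketched as follows: cut the sequence into windows of a fixed size, advance by a fixed step, and count overlapping k-mers in each window (k = 1 for monomers, k = 2 for dimers). The sub names, the window size of 10, the step of 4, and the input string are all hypothetical choices for illustration. The sketch uses the // operator, so it assumes Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping k-mers within one window; returns a hash reference.
sub kmer_counts {
    my ($window, $k) = @_;
    my %count;
    $count{ substr($window, $_, $k) }++ for 0 .. length($window) - $k;
    return \%count;
}

# Slide a window of $size along $seq in steps of $step.
# Returns a list of [start_position, counts_hashref] pairs.
sub sliding_window {
    my ($seq, $size, $step, $k) = @_;
    my @rows;
    for (my $start = 0; $start + $size <= length($seq); $start += $step) {
        push @rows, [ $start, kmer_counts(substr($seq, $start, $size), $k) ];
    }
    return @rows;
}

# Hypothetical sequence and window parameters, invented for this example.
my $seq = "ATGCGTACGGTTAACCGGATAT";
my ($size, $step) = (10, 4);

# One row per window: start position, then the fraction of each base.
print "# start A C G T\n";
for my $row (sliding_window(uc $seq, $size, $step, 1)) {
    my ($start, $c) = @$row;
    my @fractions = map { ($c->{$_} // 0) / $size } qw(A C G T);
    printf "%d %.2f %.2f %.2f %.2f\n", $start, @fractions;
}
```

    The whitespace-separated columns can be plotted directly in gnuplot; passing 2 as the last argument of sliding_window gives dimer counts instead.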
  5. File Scanner:

    Write a Perl script that examines only the LOCUS lines of a GenBank flat file and extracts the sequence size from each LOCUS line. The output of the script should be (1) the list of LOCUS lines, (2) the arithmetic mean, (3) the standard deviation, (4) the longest sequence, (5) the shortest sequence, and (6) an input file for plotting software that will visualize the actual distribution of sequence sizes. You can also implement (6) as a separate script for later use. Use gbmam.seq file at for the scanning. Submit the PostScript file of your visualization, too. (For extra points, show how the above can be done without Perl scripts, that is, by using standard Unix utilities and other readily available programs.)
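
    The core of the scanner can be sketched as a regular expression that pulls the number in front of "bp" out of each LOCUS line, followed by a few summary statistics. The sub names and the LOCUS lines below are hypothetical, and the sketch computes the sample standard deviation (dividing by n - 1); use n instead if the population form is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Return the sequence length from a LOCUS line, or undef for other lines.
sub locus_length {
    my ($line) = @_;
    return $line =~ /^LOCUS\s+\S+\s+(\d+)\s+bp\b/ ? $1 : undef;
}

# Arithmetic mean and sample standard deviation (n - 1 in the denominator).
sub mean_sd {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $var  = $n > 1 ? sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1) : 0;
    return ($mean, sqrt($var));
}

# Hypothetical lines, invented for this example (not taken from gbmam.seq).
my @lines = (
    "LOCUS       HYPO0001     1200 bp    mRNA    linear   MAM 01-JAN-2000",
    "DEFINITION  not a LOCUS line, so it is ignored.",
    "LOCUS       HYPO0002      800 bp    DNA     linear   MAM 01-JAN-2000",
    "LOCUS       HYPO0003     1000 bp    DNA     linear   MAM 01-JAN-2000",
);

my @sizes = grep { defined } map { locus_length($_) } @lines;
my ($mean, $sd) = mean_sd(@sizes);
printf "n=%d mean=%.1f sd=%.1f min=%d max=%d\n",
    scalar(@sizes), $mean, $sd, min(@sizes), max(@sizes);
```

    For a real flat file, the array of lines would be replaced by a while (<>) loop so that the file is scanned line by line without being loaded into memory.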