Bioinformatics Exercises
  A.Questions. DNA Microarrays I (Expression Profile)
  Writer: Seyeon Weon   Updated: 10-26

(If you have taken the courses and want to submit your answers for evaluation, please read the instructions linked on the table of contents page. Most of the questions below have straightforward answers in the material from the corresponding courses, although a few require some further study, which is still based on the course material.)


  1. Each spot on a DNA microarray can be considered as a Northern blot. However, there are some differences between these two methods. What are they?
  2. You may know well what to do with the data from a Northern blot experiment. Let's assume that you have somehow conducted 20,000 Northern blot experiments with a single sample. (Don't ask me how.) What would you do with the data?
  3. For so-called printed (spotted) arrays, it is a necessity to use two-color differential hybridization. Why is it so?
  4. It is recommended that factory-manufactured arrays, instead of homebrew printed arrays, should be chosen, as long as they are available for the organism you are studying. What is the main reason for this recommendation?
  5. The quality of your DNA microarray data shows a weekly periodicity. That is, the data are better for experiments done on weekends than for those done on weekdays. What would be a possible cause of this phenomenon?
  6. Describe the structure of the TIFF files commonly used for microarray images.
  7. Describe the major steps in the image analysis process of DNA microarrays.
  8. If you are using the intensity-based segmentation method for image analysis, what is the worst-case scenario that could ruin the result?
  9. Some spots in a microarray can be bad, and they must be discarded before data analysis, since they can only hinder it. What is the most commonly used method for this cleaning step?
  10. The term background is used to denote the area where no DNA probes are printed and the term foreground is used to denote the area where DNA probes are printed. The intensity values of background should be subtracted from the foreground values. Why is it necessary?
  11. For Affymetrix arrays, there are no such areas to be used as background. However, we still need some kind of background values. What is the commonly used method for this purpose in the case of Affymetrix arrays and what is the rationale behind it?
  12. It is common practice to present microarray data in spreadsheet format. By convention, columns are the arrays (or samples) and rows are the genes in the organism. Why couldn't we do it the other way around?
  13. The data from tissue microarrays are presented in spreadsheet format, too. However, in this case, it is natural that columns become the genes and rows become the samples. Explain the reason.
  14. [Simple Sorting] With these spreadsheet-style data, you can do many different kinds of analysis with a computer. One obvious method is sorting the columns according to the values in a row (i.e., the expression level of a gene) and vice versa. Explain what you can obtain from these two types of sorting. (Of course, this is not how microarray data are actually analyzed.)
  15. However, the rather naïve approach above does not quite work for microarray experiments, since there are thousands of rows (i.e., genes) to sort by, and there are dozens of columns, too. Of course, you may want to choose, let's say, 3 favorite genes. The problem is that Northern blot experiments would then be a better choice than microarray experiments. Explain how you can make full use of the microarray data instead of taking this naïve approach. (This question is not really meant to be answered but to make you think.)
  16. [Comparing Groups] This is the method actually in use. First, samples (i.e., columns) are divided into two groups. Then, genes (i.e., rows) are sorted according to the differences in intensity between the two groups of samples. In this way, we get a list of genes ordered by how strongly each gene is up-regulated in one condition compared with the other. This makes clear biological sense, and it may be the first thing that comes to mind for many researchers when they think of using DNA microarrays in their research. What is the most important difference between this method and the one in [Simple Sorting]?
  17. It is recommended to use at least 3 arrays for each group in [Comparing Groups], and the more the better, as long as your budget allows and the samples are available. Why is this so?
  18. Unfortunately, the method in [Comparing Groups] has given us many disappointments, even though it appears to be a clean and logical approach. Things are not that easy in biology, as we all know well. Explain why this simple and clean use of microarrays doesn't work so well. (Again, this question is not really meant to be answered but to make you think.)
  19. With the spreadsheet-style data obtained from DNA microarrays, other approaches to analyzing the data are based on the correlation between genes or samples. Since these approaches are of great use in many types of omics research, please make sure you fully understand both the basic knowledge needed to use them and the practical aspects involved. The term correlation has a specific meaning in statistics, and it is used in that sense here, too. Try to explain the meaning of correlation in at least three different ways. Try to make each explanation as short as possible.
  20. The Pearson correlation coefficient is the most commonly used measure of statistical correlation. You need to understand it, since you will encounter it quite often as long as you are doing research in modern biology. You should be able to answer the following questions about the Pearson correlation coefficient:

    (a) If you add a constant to all values in a variable, what would happen to the Pearson correlation coefficient?
    (b) If you multiply all values in a variable by a constant, what would happen to the Pearson correlation coefficient?
    (c) One characteristic of the Pearson correlation coefficient is that values are normalized by the means and the standard deviations of both variables. Because of this, it is not the actual magnitudes but the general pattern that matters. This is a desirable or an undesirable property, depending on what you want. Explain both sides of this property in the context of gene expression.
  21. The Spearman rank-order correlation coefficient is another measure quite often used in DNA microarray analysis. You should be able to answer the following questions about it:

    (a) What is the most important advantage of the Spearman rank-order correlation coefficient compared with Pearson?
    (b) The advantage in (a) becomes a disadvantage depending on what you want. Explain, in the biological context, what information is lost when the Spearman rank-order correlation coefficient is used.
    (c) Compared with Pearson, Spearman tends to exaggerate the correlations between genes, especially when the number of samples is small. Explain the reason.
  22. Euclidean distance is another measure quite often used for the same purpose. Pearson, Spearman, and Euclidean distance are actually only a small part of the several dozen distance measures that can be used to express the distances between genes or samples in DNA microarray data. They are just the most commonly used ones. In contrast to Pearson, Euclidean distance uses the actual values of the expression levels of genes. (In mathematical terms, the lengths of the vectors are not normalized.) It seems that this should make better biological sense. However, Pearson is more widely used, and biologists often say that Pearson makes better biological sense. Explain why this is so.
  23. [Distance Matrix] If you have 10,000 genes in the spreadsheet-style data of microarrays, what would be the size of the matrix obtained by calculating the distances between the genes using one of the measures introduced in the questions above? Would it fit in the RAM of your computer? How about the matrix of the distances between the samples? Also, note that the former is called the gene-to-gene distance matrix and the latter the sample-to-sample distance matrix.
  24. With the matrices in [Distance Matrix], many different types of analysis can be tried. Among the dozens of such approaches that have appeared in journal articles so far, so-called clustering and classification are the most standard types of methods. Can you explain the essence of these two types of methods to your biologist colleagues in 10 minutes? Write down a summary of the explanation in a few sentences.
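
For readers who want to experiment while working through questions 19 to 23, the three measures named above can be sketched in a few lines of Python. This is only a minimal sketch using the standard library, not the course's own code: the function names (pearson, ranks, spearman, euclidean, dense_matrix_bytes) and the two toy expression profiles are hypothetical, and the default of 8 bytes per value assumes double-precision storage of a full square matrix.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov / norm


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for a full n-by-n matrix (assumption: 8-byte doubles)."""
    return n_items * n_items * bytes_per_value


# Hypothetical toy data: expression levels of two genes across five samples.
gene_a = [1.0, 2.0, 4.0, 8.0, 16.0]
gene_b = [10.0, 11.0, 13.0, 20.0, 40.0]

print("Pearson:  ", pearson(gene_a, gene_b))
print("Spearman: ", spearman(gene_a, gene_b))
print("Euclidean:", euclidean(gene_a, gene_b))
print("Bytes for a 10,000 x 10,000 matrix:", dense_matrix_bytes(10_000))
```

Changing gene_b (for example, by adding a constant to every value or multiplying every value by a constant) and re-running the script is one way to check your answers to question 20 empirically. Calling dense_matrix_bytes with the number of samples instead of the number of genes gives the corresponding figure for the sample-to-sample matrix in question 23.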

(More problems on the further steps in DNA microarray data analysis are in "A.Problems. DNA Microarrays".)