Study Guide #1 for Dec. 2 Exam in Bioinformatics I

  • Write a brief essay on the UCSC Human Genome Browser, Ensembl, and NCBI's resources for the human and mouse genome sequences. What kinds of questions can be answered by using them? Which sites are best for which uses?

  • Write a brief essay on computational methods, particularly Web resources such as RepeatMasker, GENSCAN, BLAST and PipMaker, for analyzing a given, unannotated genomic DNA sequence. What specific tools would you recommend? Include brief definitions of the important features that can be found by the methods you discuss.

  • What is an "alignment" of two sequences? What is the difference between a "local" and a "global" alignment? What is the "dynamic programming method" for aligning two sequences? Be able to apply it by hand to two short sequences for either a local or a global alignment.

  • What are the differences among blastn, blastp and psi-blast? Under what conditions can one expect psi-blast to do a better job than blastp? Briefly explain how psi-blast works.

  • Describe how "log odds" approaches can be used to determine amino-acid substitution scores that are appropriate for, say, blastp. Sketch the approaches taken for the BLOSUM matrices.

  • What are "weight matrices" for identifying, say, transcription factor binding sites, and how can they be determined from experimental data?

  • Suppose you have two classes of DNA sequences that you wish to distinguish by computational means, where the two classes show different frequencies of words of length W (fixed; say W=5). For instance, the fraction of AGTTC might be 0.009 in the first set, but 0.001 in the second set. Design a method that will score an arbitrary sequence so that the score of a sequence exceeds 0 if and only if it looks more like sequences in the first set than those in the second set.

  • What is a "hidden Markov model"? What is meant by "the probability of generating a particular observed sequence"? What is meant by "the most-probable state-path for generating a particular observed sequence"?
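
As a study aid for the hidden Markov model question, the two quantities it asks about can be sketched on a toy model. Everything below is an assumption made for illustration: the two state names, the start, transition and emission probabilities, and the function names are invented, and the model has no explicit end state. The first function sums the joint probability over all state paths (the forward recursion); the second keeps only the best path into each state (the Viterbi recursion).

```python
# Toy two-state HMM over DNA. All probabilities are hypothetical,
# chosen only to make the example easy to follow by hand.
states = ["GC_rich", "AT_rich"]
start = {"GC_rich": 0.5, "AT_rich": 0.5}
trans = {
    "GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
    "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9},
}
emit = {
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def forward_probability(seq):
    """Probability of generating seq, summed over all state paths."""
    # f[s] = P(symbols seen so far, current state is s)
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {
            s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


def viterbi_path(seq):
    """Return (probability, path) for the most probable state path."""
    # v[s] = probability of the best path that ends in state s
    v = {s: start[s] * emit[s][seq[0]] for s in states}
    back = []  # one dict of back-pointers per position after the first
    for x in seq[1:]:
        ptr, new_v = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            new_v[s] = v[best] * trans[best][s] * emit[s][x]
        back.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path


if __name__ == "__main__":
    seq = "GGCATA"
    print("P(sequence):", forward_probability(seq))
    prob, path = viterbi_path(seq)
    print("best path probability:", prob)
    print("best path:", path)
```

The only difference between the two recursions is that the forward version uses a sum where the Viterbi version uses a max, so the best-path probability can never exceed the total probability of the sequence. On longer sequences both products underflow, so practical implementations usually work with logarithms or rescale at each step.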