## Study Guide for Bioinformatics I, Fall 2009

First, I promised I would ask you to
use the dynamic programming algorithm to find an optimal global alignment
of two short sequences (say, both of length 4) given particular scores
for all possible alignment
columns, and show your work. Second, the exam may well ask for
short and not overly formal definitions of some terms, such as
"computational problem",
"instance of a computational problem",
"greedy algorithm",
"dynamic programming",
"NP-complete problem",
"performance guarantee",
"alignment",
"global alignment",
"local alignment" and
"progressive multiple alignment".
Finally, you will be asked to write essays on two problems from a list of
topics discussed in class, probably including the following.
Thus, it might be a good idea to pick two of the following subjects and
prepare answers.
(Writing about the easiest topic may not be the best approach.)

Consider the following informal problem: find a motif that appears in
each member of a given set of DNA sequences.
Give two distinct formalizations of the problem in terms of optimizing a
specified score over a set of specified objects.
Compare the computational efficiencies of a brute-force optimization for the
two formalizations. Finally, sketch a greedy method for the same problem.
Sketch how *blast* and *psi-blast* work, describe the
conditions under which *psi-blast* outperforms *blastp*, and
explain the reasons for this difference.
Explain in detail how to compute an optimal alignment of two sequences using
space that is proportional to the length of the shorter sequence.
Show that your procedure works as advertized.
Define the problem of sorting by reversals, and discuss at least two heuristics for its solution, including one with an instance-independent performance
guarantee that you verify. Also, show that the other heuristic has no such
performance guarantee.
Discuss how statistical models provide a framework for developing methods
to discriminate between two sets of sequences, based on training data.
Explain how this works in a few cases, such as recognizing
transcription-factor binding sites or promoters.
Bonus points will be given for discussions of how statistical models can be
used to determine appropriate parameters for scoring pairwise alignments and/or
recognizing donor splice site allowing for dependencies among positions
(as done by GenScan).
Discuss Hidden Markov Models, how they are analysed, and how they have been
used in bioinformatics.
Explain how NP-completeness results and graph algorithms lead to a
formalization of the Sequencing by Hybridization Problem (i.e., reconstructing
a sequence from the set of all its *k*-mers for some *k*)
that can be solved efficiently.
I'll have time Wednesday and before class on Thursday to answer questions.