## Study Guide for Part I of Exam in Bioinformatics I

What are "weight matrices" for identifying transcription factor
binding sites (or any set of highly similar sequences of the same length)?
How can they be determined from experimental data?
Explain how this relates to log-odds methods for discriminating two sets of
DNA sequence, given a probabilistic model for each set.
(Advanced) Suppose you have two classes of DNA sequences that you wish to distinguish
by computational means, where the two classes
show different frequencies of words of length *W* (fixed; say W=5).
For instance, the
fraction of AGTTC might be 0.009 in the first set, but 0.001 in the second set.
Design a method that will score an arbitrary sequence so that
the score of a sequence exceeds 0 if and only if it looks more like sequences
in the first set than those in the second set.
Define
*computational problem*,
*instance* of a computational problem, and
*algorithm* for a computational problem.
Informally, what is a *greedy* algorithm (in general)?
Consider the following informal problem: find a motif of specified length
(say, 15) that appears in each member of a given set of DNA sequences.
Give two distinct formalizations of the problem in terms of optimizing a
specified score over a set of specified objects.
Compare the computational efficiencies of a brute-force optimization for the
two formalizations.
Sketch a greedy method for the same problem.
What is the problem of *sorting by reversals*?
Give a simple greedy algorithm for sorting by reversals and show that
it has no (instance-independent) *performance guarantee*.
(Advanced) Sketch a method for sorting by reversals that has a (constant)
performance guarantee, and verify that guarantee.
What is a "hidden Markov model"?
Discuss two examples of bioinformatics problems where HMMs provide a good
approach.
What is meant by
"the probability of generating a particular observed sequence"?
What is meant by
"the most-probable state-path for generating a particular observed sequence"?
Informally, what is meant by an *NP-complete* problem?
What are the Hamiltonian Cycle Problem and the Eulerian Cycle Problem?
Show how to reformulate the Sequencing by Hybridization Problem
(i.e., reconstructing
a sequence from the set of all its *k*-mers for some *k*)
as either the Hamiltonian Cycle Problem or the Eulerian Cycle Problem.
Which formalization is NP-complete and which is computationally easy to solve?