Study Guide I for Bioinformatics I, Fall 2004

Study Guide for Part I of Exam in Bioinformatics I

What are "weight matrices" for identifying transcription factor binding sites (or any set of highly similar sequences of the same length)? How can they be determined from experimental data? Explain how this relates to log-odds methods for discriminating two sets of DNA sequence, given a probabilistic model for each set.

(Advanced) Suppose you have two classes of DNA sequences that you wish to distinguish by computational means, where the two classes show different frequencies of words of length W (fixed; say W=5). For instance, the fraction of AGTTC might be 0.009 in the first set, but 0.001 in the second set. Design a method that will score an arbitrary sequence so that the score of a sequence exceeds 0 if and only if it looks more like sequences in the first set than those in the second set.

Define computational problem, instance of a computational problem, and algorithm for a computational problem.

Informally, what is a greedy algorithm (in general)?

Consider the following informal problem: find a motif of specified length (say, 15) that appears in each member of a given set of DNA sequences. Give two distinct formalizations of the problem in terms of optimizing a specified score over a set of specified objects. Compare the computational efficiencies of a brute-force optimization for the two formalizations. Sketch a greedy method for the same problem.

What is the problem of sorting by reversals?

Give a simple greedy algorithm for sorting by reversals and show that it has no (instance-independent) performance guarantee.

(Advanced) Sketch a method for sorting by reversals that has a (constant) performance guarantee, and verify that guarantee.

What is a "hidden Markov model"? Discuss two examples of bioinformatics problems where HMMs provide a good approach. What is meant by "the probability of generating a particular observed sequence"? What is meant by "the most-probable state-path for generating a particular observed sequence"?

Informally, what is meant by an NP-complete problem?

What are the Hamiltonian Cycle Problem and the Eulerian Cycle Problem? Show how to reformulate the Sequencing by Hybridization Problem (i.e., reconstructing a sequence from the set of all its k-mers for some k) as either the Hamiltonian Cycle Problem or the Eulerian Cycle Problem. Which formalization is NP-complete and which is computationally easy to solve?