Bioinformatics Review

Study Guide for Bioinformatics I, Fall 2009

First, I promised I would ask you to use the dynamic programming algorithm to find an optimal global alignment of two short sequences (say, both of length 4) given particular scores for all possible alignment columns, and show your work. Second, the exam may well ask for short and not overly formal definitions of some terms, such as "computational problem", "instance of a computational problem", "greedy algorithm", "dynamic programming", "NP-complete problem", "performance guarantee", "alignment", "global alignment", "local alignment" and "progressive multiple alignment". Finally, you will be asked to write essays on two problems from a list of topics discussed in class, probably including the following. Thus, it might be a good idea to pick two of the following subjects and prepare answers. (Writing about the easiest topic may not be the best approach.)

Consider the following informal problem: find a motif that appears in each member of a given set of DNA sequences. Give two distinct formalizations of the problem in terms of optimizing a specified score over a set of specified objects. Compare the computational efficiencies of a brute-force optimization for the two formalizations. Finally, sketch a greedy method for the same problem.

Sketch how blast and psi-blast work, describe the conditions under which psi-blast outperforms blastp, and explain the reasons for this difference.

Explain in detail how to compute an optimal alignment of two sequences using space that is proportional to the length of the shorter sequence. Show that your procedure works as advertized.

Define the problem of sorting by reversals, and discuss at least two heuristics for its solution, including one with an instance-independent performance guarantee that you verify. Also, show that the other heuristic has no such performance guarantee.

Discuss how statistical models provide a framework for developing methods to discriminate between two sets of sequences, based on training data. Explain how this works in a few cases, such as recognizing transcription-factor binding sites or promoters. Bonus points will be given for discussions of how statistical models can be used to determine appropriate parameters for scoring pairwise alignments and/or recognizing donor splice site allowing for dependencies among positions (as done by GenScan).

Discuss Hidden Markov Models, how they are analysed, and how they have been used in bioinformatics.

Explain how NP-completeness results and graph algorithms lead to a formalization of the Sequencing by Hybridization Problem (i.e., reconstructing a sequence from the set of all its k-mers for some k) that can be solved efficiently.

I'll have time Wednesday and before class on Thursday to answer questions.