Some Concepts from Chapters 4 and 5

  • Bioinformatics typically deals with combinatorial (also called discrete) problems, rather than continuous problems as in calculus.
  • For each instance of a problem, the set of possible solutions is finite (though frequently exponentially larger than the size of the data for that instance).
  • Given an informally stated computational problem, such as that of finding the best motif that is shared by a set of given DNA sequences, we can (try to) give a precise formulation of the problem, which frequently takes the form of maximizing or minimizing a specific "score" or "cost" function that associates a numerical quality measure with each possible solution. (Indeed, for the motif-finding problem, Chapter 4 gives two different formalizations.)
  • We can frequently think of the process of "trying every possible solution" as walking around a tree-shaped "solution space", some entire branches can be skipped without sacrificing an optimal solution. (Sometimes called a branch and bound strategy.)
  • For some of these combinatorial optimization problems, it may be that the only algorithms that are guaranteed to give an optimal solution for any problem instance sometimes take an amount of time that is exponentially larger than the size of the data. A large number of combinatorial optimization problems have been shown to be NP-complete, which probably implies that they cannot be solved in time proportional to any fixed polynomial (such as the 17-th power) of the instance size. For a more precise (but longer) discussion, see pp. 49-51 of the book.
  • For many bioinformatics problems, we can devise a "greedy" strategy, i.e., one that builds up a proposed solution by making early decisions that are not later reversed. Typically (but not always) these methods produce sub-optimal answers for many or most instances. The hope is that they "work well in practice", though it may be very difficult to show this convincingly.
  • Computational methods that do not always produce an optimal solution are frequently called heuristic. (Often used as a noun, as in, "We solved it with a heuristic.") Some of these have a performance guarantee, e.g., they can be guaranteed to produce output that is within a fixed percent of optimal.

    Example. Consider the following problem (with no attempted biological motivation.) Given: A set of n positive numbers. Desired output: Divide the numbers into two sets, where the difference between the sums of the sets is as small as possible. (Given a set of blocks of different heights, put them into two piles that are as close as possible to the same height.) Devise a greedy method that attempts to find a solution, and find an instance where it gives a sub-optimal solution.

    Observation: The dissussion in Chapter 5 concerns the unsigned version of the Sorting By Reversals Problem. For some ways of identifying genes (for instance, when the entire genome is sequenced), we also know each gene's orientation (i.e., transcription direction). This leads naturally to the signed version of the problem. Here, we are given a permutation of N integers, but each now has a sign (plus or minus). A reversal operation also flips signs in the reversed interval. Surprisingly, the problem of sorting signed permutations by reversals can be optimally solved in polynomial time. (Incidentally, this was first shown, about 10 years ago, by a Penn State graduate student, Sridhar Hannenhalli.) This is discussed in the book's on-line class notes for Chapter 5.