What do I mean by "probabilistic model"?
Each feature to be modeled (promoter, coding region, splice junction, etc.) may
need its own kind of model, so it is important to know how general these models can be. Here is how I think of them. I'm given a "pseudo-random number generator",
a computational procedure to which I can supply a set of possible output
symbols with their probabilities, such as "either H with probability 0.75 or T
with probability 0.25", and which then picks one of the outputs according to
the specified probabilities.
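
To make the idea of the generator concrete, here is a minimal sketch in Python.
The name pick and the dictionary layout are my own choices for illustration,
and the fixed seed is there only to make the run repeatable.

```python
import random

def pick(distribution, rng):
    """Return one symbol, chosen with the probability listed for it."""
    symbols = list(distribution)
    weights = [distribution[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

# The example from the text: H with probability 0.75, T with probability 0.25.
coin = {"H": 0.75, "T": 0.25}

rng = random.Random(0)  # fixed seed, assumed only for repeatability
tosses = [pick(coin, rng) for _ in range(10_000)]
fraction_heads = tosses.count("H") / len(tosses)
print(fraction_heads)  # should be close to 0.75
```

Over many calls the observed fraction of H settles near 0.75, which is all that
"according to the specified probability" means here.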
A probabilistic model is a recipe (computer program or algorithm)
that repeatedly uses the pseudo-random number generator to generate an output
sequence; the output symbols are specified but the probabilities may not be.
For instance, GenScan models coding regions with a recipe that in essence says:
pick the sequence length according to some estimated distribution of exon sizes,
then generate that many symbols according to a fifth-order Markov model
(in fact, an "inhomogeneous" Markov model, to capture the differences among the
three codon positions in an exon).
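
The recipe for coding regions can be sketched roughly as follows. This is not
GenScan's code: the length distribution and the probability table below are
placeholders I made up (a real model would use estimates from training data),
the function names are hypothetical, and I assume for simplicity that the
sequence starts at the first codon position.

```python
import random

ALPHABET = "ACGT"
ORDER = 5  # fifth-order: each symbol depends on the previous five

def pick(distribution, rng):
    """Return one symbol, chosen with the probability listed for it."""
    symbols = list(distribution)
    weights = [distribution[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

def placeholder_table(context, codon_position):
    """HYPOTHETICAL stand-in for trained probabilities.

    A real table would return a different distribution for each
    (context, codon_position) pair; this one is uniform everywhere.
    """
    return {base: 0.25 for base in ALPHABET}

def generate_coding_region(length_distribution, table, rng):
    # Step 1: pick the sequence length.
    length = pick(length_distribution, rng)
    # Step 2: generate that many symbols, one at a time.
    sequence = []
    for i in range(length):
        context = "".join(sequence[-ORDER:])  # shorter than 5 near the start
        codon_position = i % 3                # "inhomogeneous": table varies with this
        sequence.append(pick(table(context, codon_position), rng))
    return "".join(sequence)

# Made-up exon-size distribution, for illustration only.
toy_lengths = {90: 0.2, 120: 0.5, 150: 0.3}
rng = random.Random(1)
exon = generate_coding_region(toy_lengths, placeholder_table, rng)
print(len(exon), exon[:30])
```

The only thing that makes the model "inhomogeneous" in this sketch is that the
table is consulted with the codon position as well as the preceding symbols.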
Also, GenScan models donor splice sites (those at the
start of the intron relative to the transcription direction) in a way that
handles dependence among positions. Informally, the rule looks like the
following. (See Fig. 2 on page 84 of the GenScan paper.)
The generated sequence has positions called -3, -2, -1, 1, 2, 3, 4, 5 and 6,
where positions 1 and 2 are always GT.
Pick a nucleotide for position 5
according to: A% = 6, C% = 5, G% = 84, T% = 5.
If the letter is not G, then fill in the other positions according to
the distribution (observed frequencies) in the upper right of Fig. 2.
If it is a G, then generate the nucleotide in position -1 according to:
A% = 9, C% = 4, G% = 78, T% = 9.
If the letter is not G, then fill in the remaining positions according to
the second set of frequencies on the right of Fig. 2.
Otherwise, pick the entry in position -2 according to:
A% = 59, C% = 10, G% = 15, T% = 16.
And so on.
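
The first few steps of the rule above can be sketched as a short program. The
three percentage tables are the ones quoted in the text; everything else is an
assumption of mine. In particular, fill_rest is a hypothetical placeholder that
fills the remaining positions uniformly, because the frequency tables from
Fig. 2 are not reproduced here, and the sketch stops branching after position
-2, where the text says "and so on".

```python
import random

POSITIONS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]

# Percentages quoted in the text.
POS_5 = {"A": 6, "C": 5, "G": 84, "T": 5}
POS_MINUS_1 = {"A": 9, "C": 4, "G": 78, "T": 9}
POS_MINUS_2 = {"A": 59, "C": 10, "G": 15, "T": 16}

def pick(distribution, rng):
    """Return one symbol, chosen in proportion to its listed weight."""
    symbols = list(distribution)
    weights = [distribution[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

def fill_rest(site, rng):
    """HYPOTHETICAL placeholder: fill unset positions uniformly.

    The real rule uses the branch-specific frequency tables in Fig. 2.
    """
    for position in POSITIONS:
        if position not in site:
            site[position] = rng.choice("ACGT")
    return "".join(site[p] for p in POSITIONS)

def generate_donor_site(rng):
    site = {1: "G", 2: "T"}            # positions 1 and 2 are always GT
    site[5] = pick(POS_5, rng)
    if site[5] != "G":
        return fill_rest(site, rng)    # first table in Fig. 2
    site[-1] = pick(POS_MINUS_1, rng)
    if site[-1] != "G":
        return fill_rest(site, rng)    # second table in Fig. 2
    site[-2] = pick(POS_MINUS_2, rng)
    return fill_rest(site, rng)        # "and so on": later branches omitted

rng = random.Random(0)
sites = [generate_donor_site(rng) for _ in range(5000)]
print(sites[:5])
```

What matters is the shape of the recipe: which table is consulted for a later
position depends on the letters already chosen, and that is how the model
captures dependence among positions.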