TABLE OF CONTENTS
This package contains the tools TOAST and ROAST for finding orthologous alignments of DNA sequences, plus some associated utility programs.
TOAST -- two way orthologous selection tool
ROAST -- reference dependent multiple alignment tool
Platform: | These tools were developed on a Debian GNU/Linux system, but are written in C and should be compilable on other Linux or Unix platforms with little change. |
Author: | Minmei Hou, <mhou @ cs.niu.edu> |
Date: | March 2008 |
Citation: | [To be added] |
Download the distribution file into an empty source directory, and unpack it using a command like:
zcat toast-roast.tar.gz | tar xvf -Edit the
Makefile
to adjust the settings for your system
(especially INSTALLDIR
); then from the source directory, run:
make make installYou will also need to install the
blastz
program
(which is distributed as a separate package) and make sure that these programs
can find it, e.g. by adjusting your PATH
or by making a symlink.
Here is a small example that aligns the beta-globin gene cluster from four
species, with human as the reference, using sample input files included with
the distribution. The input consists of sequence files in FASTA format for
human
, marmoset
, mouse
, and
monodelphis
, plus an annotation file chr11.genes.txt
which specifies the gene locations in the human sequence. The annotation file
is optional, but is useful when the genes have long introns, and may help TOAST
produce better output.
To run this example for yourself, type:
all_bz + A=2 F=human D=0 T=chr11.genes.txt "(((human marmoset) mouse) monodelphis)" roast + X=2 E=human "(((human marmoset) mouse) monodelphis)" *.toast2.maf example.roast.mafThen the output file
example.roast.maf
will contain the
orthologous multiple alignment.
The all_bz
program automates the creation of preliminary pairwise
alignments for the necessary pairs of species, using blastz
,
toast
, and several utility tools such as single_cov2
and chain
.
The roast
program automates the progressive creation of
multi-species alignments from the pairwise ones using the multiz
or multic
programs, which are tools for aligning alignments.
Required: | sequence files, species tree |
Optional: |
1) annotation file for the reference species 2) specification file for running blastz 3) additional command-line parameters |
all_bz
(blastz
, toast
, etc.)
*.orig.maf | -- | original blastz output (local pairwise alignments) |
*.toast.maf | -- | orthologous alignments |
*.toast2.maf | -- | orthologous alignments, where each position in the reference species is aligned to at most one position in another species |
roast
(multiz
, etc.)Orthologous multi-species alignment in MAF format (filename specified as command-line parameter).
Each species has at most one sequence file, which includes one or more sequences
from the same or different chromosomes or contigs. The sequence file name is
usually the species or assembly name, and it must be consistent with
string1
(below).
Sequences must be supplied in FASTA format. The primary form for the headers is
>string1:string2:int1:char:int2where
string1
is the species or assembly name;
string2
is the chromosome or contig;
int1
is the sequence start position (1-based, inclusive) on
the chromosome/contig or its reverse complement;
char
is +
or -
, denoting the strand;
-
, the input sequence is in reverse complement
with respect to the chromosome/contig, and int1
is specified
from the end of the entire chromosome/contig back to the nearest endpoint
of the supplied sequence.)
int2
is the size of the entire chromosome/contig.
string1.string2
will be used as the source
sequence field in the alignment output. Note that this 1-based coordinate
convention is different from the MAF output, whose sequence positions start at
0 instead of 1.
Alternatively,
MSA
header format is also supported. (This has been wrapped here for easier
readability, but in actual use must appear all on one line.) In this format
an empty field is represented by a dot ".
".
>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID} |${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME} |${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROM_LENGTH}|${STRAND} |${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM} |${TOTAL_NUM_CONTIGS}|${OTHER_COMMENTS}
If there is only one sequence in the file, a header like
>string1is also acceptable. For instance, one of the sample files included with this distribution has a header of simply "
>marmoset
".
In this case the given sequence is treated as the whole chromosome/contig,
so the alignment output will not use genomic coordinates.
In the sequence itself, TOAST and ROAST support the same characters as
blastz
, including lowercase letters and N
to
represent unsequenced positions. Other ambiguity codes such as W
,
R
, -
, etc. are not allowed.
Each line in this file records the location of a gene in the reference species
(which must have only one sequence for this to be supported).
If the FASTA header for the reference sequence specifies its location within
a larger chromosome or contig, these gene positions must be relative to the
entire chromosome/contig; otherwise they must be relative to the given input
sequence. The intervals can be either 0-based or 1-based, and half-open or
inclusive, since 1 bp either way doesn't matter much for this purpose.
Note that start
<= end
regardless of gene
orientation, and the gene intervals cannot overlap.
start endThe supplied utility program
transformAnnotation
helps to
convert files such as the knownGene
and refGene
tables
from UCSC
to this simpler format.
A species-guide-tree
is used as an argument in many of the
programs to indicate the evolutionary relationships among the input species.
It is a binary tree built from the species/assembly names using parentheses,
and is enclosed in double quotes, e.g.
"((human chimp)(rat mouse))"
. Note that the
species/assembly names must be consistent with their sequences' file names,
and also with the string1
field in the sequence header.
Most of these programs produce alignment output in UCSC's
MAF format. Even though
the raw output from blastz
is in another format called LAV,
all_bz
converts this to MAF automatically, using the included
utility tool lav2maf
. If the FASTA header for an input sequence
specifies its location within a larger chromosome or contig, the output
coordinates for that sequence will be relative to the entire chromosome/contig;
otherwise they will be relative to the given input sequence (this is true for
the intermediate pairwise output as well).
Except for the main programs all_bz
(which creates many MAF files
with names of its own choosing) and roast
(which uses a final
output filename that you specify), most of the other, helper programs write
their output to stdout
so it can be piped, redirected, etc.
For all programs below, square brackets []
denote optional
parameters, while question marks ?
and items in angle brackets
<>
represent meta-syntactic variables that should be
replaced with your numbers or names. Don't type any of the brackets or the
question marks. Some of the command lines have been wrapped here for easier
readability, but they should always be typed all on one line; the order of
the arguments is not important. In the parameter
descriptions, default values are given in parentheses ()
.
Not all of the programs are listed here, but in general, running any of them
with no arguments will print a help message with command-line syntax, similar
to the following.
all_bz
--
runs blastz
and post-processing commands for all necessary pairs
of the specified sequences.
all_bz [+-] [b=?] [A=?] [F=<reference-species>] [T=<annotation-file>] [h=?] [q=?] [D=?] [c=?] [f=?] <species-guide-tree> [<blastz-specfile>]
+ | (off) | verbose |
- | (off) | list commands only, without running them (+ must be off) |
b | (2 ) |
0 : run post-processing (toast , single_cov2 ,
chain , etc.) only1 : produce pairwise alignments only2 : run both |
A | (1 ) |
0 : run toast 1 : run single_cov2 2 : run toast , followed by chain on
<reference-species> |
F | (null) | null: each sequence position is aligned at most once to each other species<reference-species> : each position in the reference species
is aligned at most once to each other species, but there is no such
restriction when aligning the non-reference sequences(used by single_cov2 and chain ) |
T | (null) | annotation file (used by toast and
chain ) |
h | (300 ) |
minimum chaining size (used by toast ) |
q | (600 ) |
minimum cluster size (used by toast ) |
D | (1 ) |
0 : run all_bz for roast 1 : run all_bz for TBA (see separate
TBA package) |
c | (500 ) |
distance threshold (used by blastz_clean ) |
f | (2 ) |
percentage for determining in-paralogs (used by toast ) |
<species-guide-tree> |
a binary tree specifying the sequences to be aligned and their evolutionary relationships | |
<blastz-specfile> |
(see the blastz
package distribution) |
toast
--
Two-way Orthologous Assignment Selection Tool; produces orthologous alignments
from blastz
alignments.
toast [q=?] [h=?] [s=?] [f=?] [A=<annotation-file>] <seq1-seq2-maf> <seq1-seq1-maf> <seq2-seq2-maf>
q | (600 ) |
minimum node size (see publication) |
h | (300 ) |
minimum chaining length (see publication) |
s | (5 ) |
minimum score per base; alignments below this threshold are ignored |
f | (2 ) |
inflation rate. A pair of duplicates from the same species whose
self-alignment score is higher than any cross alignment between
either of these duplicates and another species are considered to be
in-paralogs (see publication). This parameter
allows a difference of f % below a cross alignment to
still be considered in-paralogs, i.e. lowering the threshold. |
A | (null) | annotation file with gene locations in seq1 |
<seq1-seq2-maf> |
blastz alignment between seq1 and seq2, in MAF format | |
<seq1-seq1-maf> |
blastz alignment between seq1 and itself, in MAF format | |
<seq2-seq2-maf> |
blastz alignment between seq2 and itself, in MAF format |
roast
--
reference dependent multiple aligner; uses pairwise alignments produced by all_bz
.
roast [+-] [R=?] [M=?] [P=?] [T=<temp-dir>] [X=?] [C=?] E=<reference-species> <species-guide-tree> <maf-source> <destination>
+ | (off) | verbose |
- | (off) | list commands only, without running them (+ must be off) |
R | (30 ) |
dynamic programming radius, used by multiz or multic (see
TBA publication) |
M | (1 ) |
minimum block length of output |
P | (multiz ) |
multiz : use the multiz program, which requires single
coverage for the pairwise alignments' reference rowmultic : use the multic program, which has no
requirement for single coverage |
T | (/tmp ) |
specifies the directory where intermediate alignment files are stored |
X | (0 ) |
utilize pairwise MAF files with particular suffixes, reflecting different post-processing:0 : .sing.maf from single_cov2 1 : .toast.maf from toast 2 : .toast2.maf from toast and
chain projection |
C | (50 ) |
connection threshold, used by multic only (see publication
Controlling size when aligning multiple genomic sequences
with duplications) |
E |
reference species; must be consistent with the source sequence field in the MAF file | |
<species-guide-tree> |
a binary tree specifying the evolutionary relationships among the species being aligned | |
<maf-source> |
the pairwise alignment files for producing the multiple alignment;
typically shell wildcards are used (e.g. *.toast.maf ) | |
<destination> |
the path and name of the multiple alignment file to be created |
blastz_clean
--
screens out close overlapped alignments: whenever two alignments overlap in one
sequence and are nearby in the other sequence, the overlapping part of the shorter
one is removed.
blastz_clean [c=?] <pairwise-maf-file>
c | (500 ) |
distance threshold, in basepairs |
<pairwise-maf-file> |
all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ |
chain
--
groups alignment blocks into chains (see publication) and
projects the chains onto a reference species.
chain [R=<reference-species>] [A=<annotation-file>] <pairwise-maf-file>
R | (null) | null: the alignment blocks are just grouped into chains<reference-species> : the alignment blocks are chained and also
projected according to the reference, so that
any position in the reference species is
aligned to at most one position of another
species |
A | (null) | annotation file for the reference species |
<pairwise-maf-file> |
all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ |
maf_order
--
orders the rows in MAF blocks according to a given list of species/assemblies
(regardless of chromosome/contig); rows not specified are removed.
maf_order <maf-file> <species1> <species2> ... [nohead] [all]
nohead | (off) | removes the MAF file's header from the output |
all | (off) | when this is off, single-row blocks are dropped |
maf_project
--
extracts MAF blocks that include a given reference sequence or species.
maf_project <maf-file> <reference> [<from> <to>] [<filename-for-removed>] [<species-guide-tree>] [nohead]
<reference> |
if only a species name is given, all of its chromosomes/contigs qualify | |
<from> <to> | (entire sequence) | specifies a sub-region of the reference sequence, using positions in the MAF file's coordinate system for this sequence (0-based, inclusive, and with respect to either the entire chromosome/contig or just the input sequence, depending on the format of the FASTA header); note that this option is not meaningful if the reference includes multiple chromosomes/contigs, and blocks that partially overlap the interval are kept |
<filename-for-removed> | (null) | collects alignment blocks that are removed |
<species-guide-tree> | (null) | species not in this tree are removed from the blocks |
nohead | (off) | removes the MAF file's header from the output |
maf2lav
--
converts a MAF file to the format used by
PipMaker and the
Laj alignment viewer. (This is not
needed for the Gmaj alignment viewer,
which reads MAF files directly.)
maf2lav <pairwise-maf-file> <seq1-file> <seq2-file>
<pairwise-maf-file> |
all blocks must have the same first-row species and the same second-row species; chromosomes/contigs can differ only in the second row |
<seq1-file> <seq2-file> |
the sequence files used to create the alignment |
single_cov2
--
screens out all overlapped alignments: whenever two alignments overlap in the
specified species, the overlapping part of the shorter one is removed.
single_cov2 <pairwise-maf-file> [R=<species>] [F=<filename-for-removed>]
<pairwise-maf-file> |
all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ | |
R | (null) | null: single coverage is done for both species<species> : single coverage is done for the specified
species only |
F | (null) | file for collecting the removed alignments |
transformAnnotation
--
transforms a gene annotation file in a format like the knownGene
or
refGene
tables at the
UCSC Table Browser
to the simpler format used by argument A
of
toast
. It removes overlaps in a heuristic fashion while
attempting to have each resulting interval still correspond to a gene.
(Note that this program extracts the transcription start
and
end
fields without doing any coordinate conversion or chromosome
filtering; if you are starting with the full
UCSC table files, you will need to grep
for your chromosome
of interest before running this.)
transformAnnotation <known-gene-file>