TOAST and ROAST

TABLE OF CONTENTS

Introduction
Installation
Example
Data Flow
File and Tree Formats
Program Descriptions

Introduction

This package contains the tools TOAST and ROAST for finding orthologous alignments of DNA sequences, plus some associated utility programs.

TOAST -- two way orthologous selection tool

ROAST -- reference dependent multiple alignment tool

Platform:	These tools were developed on a Debian GNU/Linux system, but are written in C and should be compilable on other Linux or Unix platforms with little change.
Author:	Minmei Hou, <mhou @ cs.niu.edu>
Date:	March 2008
Citation:	[To be added]

Installation

Download the distribution file into an empty source directory, and unpack it using a command like:

    zcat toast-roast.tar.gz | tar xvf -

Edit the Makefile to adjust the settings for your system (especially INSTALLDIR); then from the source directory, run:

    make
    make install

You will also need to install the blastz program (which is distributed as a separate package) and make sure that these programs can find it, e.g. by adjusting your PATH or by making a symlink.

Here is a small example that aligns the beta-globin gene cluster from four species, with human as the reference, using sample input files included with the distribution. The input consists of sequence files in FASTA format for human, marmoset, mouse, and monodelphis, plus an annotation file chr11.genes.txt which specifies the gene locations in the human sequence. The annotation file is optional, but is useful when the genes have long introns, and may help TOAST produce better output.

To run this example for yourself, type:

    all_bz + A=2 F=human D=0 T=chr11.genes.txt "(((human marmoset) mouse) monodelphis)"
    roast + X=2 E=human "(((human marmoset) mouse) monodelphis)" *.toast2.maf example.roast.maf

Then the output file example.roast.maf will contain the orthologous multiple alignment.

Data Flow

The all_bz program automates the creation of preliminary pairwise alignments for the necessary pairs of species, using blastz, toast, and several utility tools such as single_cov2 and chain. The roast program automates the progressive creation of multi-species alignments from the pairwise ones using the multiz or multic programs, which are tools for aligning alignments.

Input (see below for formats)

Required:	sequence files, species tree
Optional:	1) annotation file for the reference species 2) specification file for running `blastz` 3) additional command-line parameters

||
|| all_bz (blastz, toast, etc.)
\/

Pairwise Alignment Output

`*.orig.maf`	--	original `blastz` output (local pairwise alignments)
`*.toast.maf`	--	orthologous alignments
`*.toast2.maf`	--	orthologous alignments, where each position in the reference species is aligned to at most one position in another species

||
|| roast (multiz, etc.)
\/

Multiple Alignment Output

Orthologous multi-species alignment in MAF format (filename specified as command-line parameter).

File and Tree Formats

Input Sequences

Each species has at most one sequence file, which includes one or more sequences from the same or different chromosomes or contigs. The sequence file name is usually the species or assembly name, and it must be consistent with string1 (below).

Sequences must be supplied in FASTA format. The primary form for the headers is

    >string1:string2:int1:char:int2

where

string1 is the species or assembly name;
string2 is the chromosome or contig;
int1 is the sequence start position (1-based, inclusive) on the chromosome/contig or its reverse complement;
char is + or -, denoting the strand;
(If this is -, the input sequence is in reverse complement with respect to the chromosome/contig, and int1 is specified from the end of the entire chromosome/contig back to the nearest endpoint of the supplied sequence.)
int2 is the size of the entire chromosome/contig.

With this format, string1.string2 will be used as the source sequence field in the alignment output. Note that this 1-based coordinate convention is different from the MAF output, whose sequence positions start at 0 instead of 1.

Alternatively, MSA header format is also supported. (This has been wrapped here for easier readability, but in actual use must appear all on one line.) In this format an empty field is represented by a dot ".".

    >${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}
    |${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}
    |${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROM_LENGTH}|${STRAND}
    |${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}
    |${TOTAL_NUM_CONTIGS}|${OTHER_COMMENTS}

If there is only one sequence in the file, a header like

    >string1

is also acceptable. For instance, one of the sample files included with this distribution has a header of simply ">marmoset". In this case the given sequence is treated as the whole chromosome/contig, so the alignment output will not use genomic coordinates.

In the sequence itself, TOAST and ROAST support the same characters as blastz, including lowercase letters and N to represent unsequenced positions. Other ambiguity codes such as W, R, -, etc. are not allowed.

Annotation File

Each line in this file records the location of a gene in the reference species (which must have only one sequence for this to be supported). If the FASTA header for the reference sequence specifies its location within a larger chromosome or contig, these gene positions must be relative to the entire chromosome/contig; otherwise they must be relative to the given input sequence. The intervals can be either 0-based or 1-based, and half-open or inclusive, since 1 bp either way doesn't matter much for this purpose. Note that start <= end regardless of gene orientation, and the gene intervals cannot overlap.

    start end

The supplied utility program transformAnnotation helps to convert files such as the knownGene and refGene tables from UCSC to this simpler format.

Species Guide Tree

A species-guide-tree is used as an argument in many of the programs to indicate the evolutionary relationships among the input species. It is a binary tree built from the species/assembly names using parentheses, and is enclosed in double quotes, e.g. "((human chimp)(rat mouse))". Note that the species/assembly names must be consistent with their sequences' file names, and also with the string1 field in the sequence header.

MAF Alignment Output

Most of these programs produce alignment output in UCSC's MAF format. Even though the raw output from blastz is in another format called LAV, all_bz converts this to MAF automatically, using the included utility tool lav2maf. If the FASTA header for an input sequence specifies its location within a larger chromosome or contig, the output coordinates for that sequence will be relative to the entire chromosome/contig; otherwise they will be relative to the given input sequence (this is true for the intermediate pairwise output as well). Except for the main programs all_bz (which creates many MAF files with names of its own choosing) and roast (which uses a final output filename that you specify), most of the other, helper programs write their output to stdout so it can be piped, redirected, etc.

Program Descriptions

For all programs below, square brackets [] denote optional parameters, while question marks ? and items in angle brackets <> represent meta-syntactic variables that should be replaced with your numbers or names. Don't type any of the brackets or the question marks. Some of the command lines have been wrapped here for easier readability, but they should always be typed all on one line; the order of the arguments is not important. In the parameter descriptions, default values are given in parentheses (). Not all of the programs are listed here, but in general, running any of them with no arguments will print a help message with command-line syntax, similar to the following.

all_bz -- runs blastz and post-processing commands for all necessary pairs of the specified sequences.

    all_bz [+-] [b=?] [A=?] [F=<reference-species>] [T=<annotation-file>]
      [h=?] [q=?] [D=?] [c=?] [f=?] <species-guide-tree> [<blastz-specfile>]

`+`	(off)	verbose
`-`	(off)	list commands only, without running them (`+` must be off)
`b`	(`2`)	`0`: run post-processing (`toast`, `single_cov2`, `chain`, etc.) only `1`: produce pairwise alignments only `2`: run both
`A`	(`1`)	`0`: run `toast` `1`: run `single_cov2` `2`: run `toast`, followed by `chain` on `<reference-species>`
`F`	(null)	null: each sequence position is aligned at most once to each other species `<reference-species>`: each position in the reference species is aligned at most once to each other species, but there is no such restriction when aligning the non-reference sequences (used by `single_cov2` and `chain`)
`T`	(null)	annotation file (used by `toast` and `chain`)
`h`	(`300`)	minimum chaining size (used by `toast`)
`q`	(`600`)	minimum cluster size (used by `toast`)
`D`	(`1`)	`0`: run `all_bz` for `roast` `1`: run `all_bz` for TBA (see separate TBA package)
`c`	(`500`)	distance threshold (used by `blastz_clean`)
`f`	(`2`)	percentage for determining in-paralogs (used by `toast`)
`<species-guide-tree>`		a binary tree specifying the sequences to be aligned and their evolutionary relationships
`<blastz-specfile>`		(see the `blastz` package distribution)

toast -- Two-way Orthologous Assignment Selection Tool; produces orthologous alignments from blastz alignments.

    toast [q=?] [h=?] [s=?] [f=?] [A=<annotation-file>] <seq1-seq2-maf>
      <seq1-seq1-maf> <seq2-seq2-maf>

`q`	(`600`)	minimum node size (see publication)
`h`	(`300`)	minimum chaining length (see publication)
`s`	(`5`)	minimum score per base; alignments below this threshold are ignored
`f`	(`2`)	inflation rate. A pair of duplicates from the same species whose self-alignment score is higher than any cross alignment between either of these duplicates and another species are considered to be in-paralogs (see publication). This parameter allows a difference of `f`% below a cross alignment to still be considered in-paralogs, i.e. lowering the threshold.
`A`	(null)	annotation file with gene locations in seq1
`<seq1-seq2-maf>`		`blastz` alignment between seq1 and seq2, in MAF format
`<seq1-seq1-maf>`		`blastz` alignment between seq1 and itself, in MAF format
`<seq2-seq2-maf>`		`blastz` alignment between seq2 and itself, in MAF format

roast -- reference dependent multiple aligner; uses pairwise alignments produced by all_bz.

    roast [+-] [R=?] [M=?] [P=?] [T=<temp-dir>] [X=?] [C=?] E=<reference-species>
      <species-guide-tree> <maf-source> <destination>

`+`	(off)	verbose
`-`	(off)	list commands only, without running them (`+` must be off)
`R`	(`30`)	dynamic programming radius, used by `multiz` or `multic` (see TBA publication)
`M`	(`1`)	minimum block length of output
`P`	(`multiz`)	`multiz`: use the `multiz` program, which requires single coverage for the pairwise alignments' reference row `multic`: use the `multic` program, which has no requirement for single coverage
`T`	(`/tmp`)	specifies the directory where intermediate alignment files are stored
`X`	(`0`)	utilize pairwise MAF files with particular suffixes, reflecting different post-processing: `0`: `.sing.maf` from `single_cov2` `1`: `.toast.maf` from `toast` `2`: `.toast2.maf` from `toast` and `chain` projection
`C`	(`50`)	connection threshold, used by `multic` only (see publication Controlling size when aligning multiple genomic sequences with duplications)
`E`		reference species; must be consistent with the source sequence field in the MAF file
`<species-guide-tree>`		a binary tree specifying the evolutionary relationships among the species being aligned
`<maf-source>`		the pairwise alignment files for producing the multiple alignment; typically shell wildcards are used (e.g. `*.toast.maf`)
`<destination>`		the path and name of the multiple alignment file to be created

blastz_clean -- screens out close overlapped alignments: whenever two alignments overlap in one sequence and are nearby in the other sequence, the overlapping part of the shorter one is removed.

    blastz_clean [c=?] <pairwise-maf-file>

`c`	(`500`)	distance threshold, in basepairs
`<pairwise-maf-file>`		all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ

chain -- groups alignment blocks into chains (see publication) and projects the chains onto a reference species.

    chain [R=<reference-species>] [A=<annotation-file>] <pairwise-maf-file>

`R`	(null)	null: the alignment blocks are just grouped into chains `<reference-species>`: the alignment blocks are chained and also projected according to the reference, so that any position in the reference species is aligned to at most one position of another species
`A`	(null)	annotation file for the reference species
`<pairwise-maf-file>`		all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ

maf_order -- orders the rows in MAF blocks according to a given list of species/assemblies (regardless of chromosome/contig); rows not specified are removed.

    maf_order <maf-file> <species1> <species2> ... [nohead] [all]

`nohead`	(off)	removes the MAF file's header from the output
`all`	(off)	when this is off, single-row blocks are dropped

maf_project -- extracts MAF blocks that include a given reference sequence or species.

    maf_project <maf-file> <reference> [<from> <to>] [<filename-for-removed>]
      [<species-guide-tree>] [nohead]

`<reference>`		if only a species name is given, all of its chromosomes/contigs qualify
`<from> <to>`	(entire sequence)	specifies a sub-region of the reference sequence, using positions in the MAF file's coordinate system for this sequence (0-based, inclusive, and with respect to either the entire chromosome/contig or just the input sequence, depending on the format of the FASTA header); note that this option is not meaningful if the reference includes multiple chromosomes/contigs, and blocks that partially overlap the interval are kept
`<filename-for-removed>`	(null)	collects alignment blocks that are removed
`<species-guide-tree>`	(null)	species not in this tree are removed from the blocks
`nohead`	(off)	removes the MAF file's header from the output

maf2lav -- converts a MAF file to the format used by PipMaker and the Laj alignment viewer. (This is not needed for the Gmaj alignment viewer, which reads MAF files directly.)

    maf2lav <pairwise-maf-file> <seq1-file> <seq2-file>

`<pairwise-maf-file>`	all blocks must have the same first-row species and the same second-row species; chromosomes/contigs can differ only in the second row
`<seq1-file>` `<seq2-file>`	the sequence files used to create the alignment

single_cov2 -- screens out all overlapped alignments: whenever two alignments overlap in the specified species, the overlapping part of the shorter one is removed.

    single_cov2 <pairwise-maf-file> [R=<species>] [F=<filename-for-removed>]

`<pairwise-maf-file>`		all blocks must have the same first-row species and the same second-row species, but chromosomes/contigs can differ
`R`	(null)	null: single coverage is done for both species `<species>`: single coverage is done for the specified species only
`F`	(null)	file for collecting the removed alignments

transformAnnotation -- transforms a gene annotation file in a format like the knownGene or refGene tables at the UCSC Table Browser to the simpler format used by argument A of toast. It removes overlaps in a heuristic fashion while attempting to have each resulting interval still correspond to a gene. (Note that this program extracts the transcription start and end fields without doing any coordinate conversion or chromosome filtering; if you are starting with the full UCSC table files, you will need to grep for your chromosome of interest before running this.)

    transformAnnotation <known-gene-file>

Minmei Hou and Cathy Riemer, March 2008