Name | Last modified | Size
---|---|---
gmaj_geneconv.images/ | 2011-06-09 23:13 | -
docs/ | 2012-03-30 18:59 | -
annot.d/ | 2012-03-24 23:55 | -
aglobin.example/ | 2012-07-26 12:58 | -
gmaj_geneconv.html | 2011-06-10 14:17 | 26K
CHAP.2012-07-26.fast.tar.gz | 2012-07-26 12:54 | 68M
CHAP.2012-07-26.compact.tar.gz | 2012-07-26 12:54 | 3.3M
CHAP.2012-07-02.fast.tar.gz | 2012-07-02 17:27 | 68M
CHAP.2012-07-02.compact.tar.gz | 2012-07-02 17:27 | 3.3M
CHAP.2012-05-03.fast.tar.gz | 2012-05-03 16:39 | 68M
CHAP.2012-05-03.compact.tar.gz | 2012-05-03 16:39 | 3.3M
CHAP.2012-03-30.fast.tar.gz | 2012-03-30 23:34 | 68M
CHAP.2012-03-30.compact.tar.gz | 2012-03-30 23:34 | 3.3M
CHAP.2011-08-02.fast.tar.gz | 2011-08-02 16:21 | 67M
CHAP.2011-08-02.compact.tar.gz | 2011-08-02 16:21 | 2.3M
CHAP.2011-06-10.fast.tar.gz | 2011-06-10 14:53 | 67M
CHAP.2011-06-10.compact.tar.gz | 2011-06-10 14:54 | 2.3M
CHAP.2011-03-17.fast.tar.gz | 2011-03-17 16:40 | 67M
CHAP.2011-03-17.compact.tar.gz | 2011-03-17 16:40 | 2.4M
Both of these methods rely on the conversion calls from the original pipeline, so the new pipeline always calls the old one automatically. Thus you can run the orthology script and get both orthology and conversion results, or if you are only interested in conversions you can just run the old command as before. (Actually the original pipeline produces orthologs too, because it needs them for detecting conversions, but they are obtained by a different method and are rough and preliminary.) Note that the conversion pipeline runs for all species at once, but the new orthology mapper currently runs only for the single reference species you specify.
Preparation of input data is nearly identical for the two pipelines, except that the new orthology pipeline makes more use of gene annotations, especially for visualizing gene orthology, whereas annotations are recommended but not strictly necessary for the original conversion pipeline. We will discuss the input files for both pipelines together in one section, and then devote separate sections to the commands and output of the two programs.
Compiling the component programs requires make and gcc (though other C compilers could probably be used by adjusting the Makefiles). User commands are provided in the form of Bourne shell scripts, which use various standard utilities such as cat, grep, sed, tr, etc. If you want to get automatic orthology diagrams or use the included Gmaj program to view the results interactively, you will also need a Java runtime environment; for best compatibility Sun's JRE (or JDK) is recommended.
If the RepeatMasker executable is not in your command path, modify the right-hand side of the line
REPEATMASKER=RepeatMasker
near the start of the file conversion.sh to indicate its location on your computer.
Run make to compile the component programs and install them in the bin subdirectory.
By default (i.e., if you simply run make), CHAP is configured to keep its scripts and Java programs in the package directory, while compiled binaries and resource data files are located in the bin and resources subdirectories, respectively. If you want to install it elsewhere (e.g. centrally for multiple users), you can edit the lines for CHAP_SCRIPT_DIR, CHAP_JAVA_DIR, CHAP_BINARY_DIR, and CHAP_RESOURCE_DIR at the top of the Makefile to specify the desired locations, and then run
make install
(it is not necessary to run make first, but it doesn't hurt either). This will configure the installed scripts to look for their programs and resource files in the directories you have specified, instead of relative to the working cluster directory (which then no longer needs to be inside the package directory). However, it also means that users will need to modify the command paths in the examples accordingly.
For each gene cluster that you want to analyze, do the following.
Create a subdirectory called seq.d and put your FastA-formatted sequence files in it, giving each file the appropriate species name, e.g., human, vervet.
Create a subdirectory called annot.d and put your gene annotation files in it. These files use a "coding exons" format that is similar to the exons format supported by our PipMaker server, except that the position endpoints reflect coding regions only (i.e. translation rather than transcription, so UTRs are excluded). The CHAP distribution includes sample files in this format. The file names must consist of the species name followed by a .codex extension, e.g. human.codex, vervet.codex, etc.
In this format, the directionality of a gene (>, <, or |), the start and end positions of its coding sequence, and its name should be on one line, followed by lines specifying the coding start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand (<). All positions are relative to the cluster sequence files you provide (not the entire chromosomes), and use a 1-based, closed-interval coordinate system (i.e., the first nucleotide in your corresponding sequence file is called "1", and the specified ranges include both endpoints). Names ending in _ps indicate pseudogenes (an exception to the "coding only" rule). We recommend limiting each gene name to a single word (i.e. without spaces), but if it has multiple words then the _ps suffix must be on the first word rather than the last one in order to be properly recognized. Thus, the file might begin as follows:
> 12910 14400 HBZ-T1
12910 13004
13892 14096
14272 14400
> 23122 25156 HBZ-T2_ps
23122 25156
> 25998 26708 HBK
25998 26089
26268 26472
26580 26708
... etc.
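As an illustration of the coding-exons layout described above, here is a minimal Python sketch of a reader for such files. It is not part of CHAP; the names Gene and parse_codex are hypothetical, and the sketch assumes the file contains only the two line types described (gene lines and exon lines), with no comments or other content.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    strand: str          # ">", "<", or "|"
    start: int           # 1-based, closed interval
    end: int
    name: str
    exons: list = field(default_factory=list)  # list of (start, end) tuples

    @property
    def is_pseudogene(self):
        # The _ps suffix is recognized on the first word of the name.
        words = self.name.split()
        return bool(words) and words[0].endswith("_ps")


def parse_codex(text):
    """Parse coding-exons text into a list of Gene records (assumed layout)."""
    genes = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] in (">", "<", "|"):
            # Gene line: direction, start, end, then the (possibly multi-word) name.
            genes.append(Gene(parts[0], int(parts[1]), int(parts[2]),
                              " ".join(parts[3:])))
        else:
            # Exon line: coding start and end, belonging to the latest gene.
            if not genes:
                raise ValueError("exon line found before any gene line")
            start, end = int(parts[0]), int(parts[1])
            if genes[-1].exons and start <= genes[-1].exons[-1][1]:
                raise ValueError("exons must be in order of increasing address")
            genes[-1].exons.append((start, end))
    return genes
```

Calling parse_codex on the text of a file such as annot.d/human.codex would return one record per gene, which can then be checked (for example, that each exon lies within its gene's coding range) before running the pipelines.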
The orthology pipeline requires these annotation files for making its gene orthology diagrams; you must supply gene annotations for the reference and at least one other species to get any figures. If you just want orthologous alignments for the sequences, or if you are just running the conversion pipeline, then these files are not strictly necessary but are still recommended for best accuracy. They assist somewhat in the preliminary ortholog detection for finding conversions, and the enhanced orthology mapper uses them to refine its similarity scoring. If present, they are also used by the gc-info summary program to compute conversion statistics for coding regions (with pseudogenes excluded), and by Gmaj to annotate its display.
If you do not know the actual gene locations in some of your sequences, you may be able to estimate them with a program such as Wise2 (Birney et al. 2004), using known protein sequences in, say, human to find gene structures within the DNA sequences of other species. The CHAP package includes a utility script called infer-annot.sh to help with automating this approach; please see the Utility Programs section for information on how to use it.
Put the species tree in a file, e.g. species_tree.txt.
From the cluster directory (containing seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run a command like
../ortho.sh species_tree.txt human
where the last argument is the name of the sequence to use as the reference. The pipeline may take from several minutes to an hour or more, depending on the complexity of the cluster's history and the number of sequences.
To repeat the analysis with a different reference species, run e.g.
../ortho.sh species_tree.txt vervet --no_rm
This is currently rather inefficient because it runs the conversion pipeline again unnecessarily, but at least the --no_rm option avoids re-running RepeatMasker. Since the output files include the reference in their names, your earlier results should coexist peacefully without being overwritten. One small exception, however, is the inferred pseudogenes in the fig_annot.d directory, which are used by default for the PostScript figures and Gmaj viewer. These are computed in a theoretically reference-dependent manner, so their endpoints may change slightly when they are overwritten by a new run. Note that running multiple jobs simultaneously in the same cluster directory is not supported and may produce erroneous output, since they will attempt to use the same temporary scratch files.
The PostScript figures for orthology by context and by content are named human.x-ortho.eps and human.n-ortho.eps respectively, where the first part of the name indicates the reference sequence. They are placed in the figures.d directory.
The files in the ortho.d/events.d directory list the evolutionary events identified in the reference sequence by the orthology mapper; these may be helpful for interpreting the figures. Events are listed in reverse chronological order (i.e., most recent first). Lines beginning with "# sp" represent speciation events where the reference lineage split from the subtree containing the indicated species. The other lines contain six space-separated columns, listed in Table 1.
Note that all position coordinates are 1-based, closed-interval (i.e. the first nucleotide in the FastA sequence is called "1", and the intervals include both endpoints), and are specified relative to the present-day sequence for the reference species (e.g. seq.d/human).
Table 1. Fields in event output files (ortho.d/events.d/*.events).
Field | Description
---|---
col_1 | event type, given as a code
col_2, col_3 | start and end positions of the source region (or deleted region)
col_4, col_5 | start and end positions of the target region (0 0 for deletions)
col_6 | percent identity of the two regions (0 for deletions)
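The layout of the event files can be sketched in Python as follows. This is an illustration rather than CHAP code: the names Speciation, Event and parse_events are hypothetical, the event type code is kept as an opaque string, and the sketch assumes that positions are integers and that any other line starting with "#" can be ignored.

```python
from typing import NamedTuple


class Speciation(NamedTuple):
    detail: str  # text following "# sp" on the line, kept verbatim


class Event(NamedTuple):
    kind: str            # col_1: event type code (not interpreted here)
    src_start: int       # col_2
    src_end: int         # col_3
    tgt_start: int       # col_4 (0 for deletions)
    tgt_end: int         # col_5 (0 for deletions)
    pct_identity: float  # col_6 (0 for deletions)


def parse_events(text):
    """Return Speciation and Event records in file order (most recent first)."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# sp"):
            records.append(Speciation(line[len("# sp"):].strip()))
            continue
        if line.startswith("#"):
            continue  # assumption: other comment lines carry no event data
        cols = line.split()
        if len(cols) != 6:
            raise ValueError(f"expected 6 columns, got {len(cols)}: {line!r}")
        records.append(Event(cols[0], int(cols[1]), int(cols[2]),
                             int(cols[3]), int(cols[4]), float(cols[5])))
    return records
```

Because the records keep the file order, walking the returned list from the end to the beginning traverses the inferred history from oldest to most recent.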
The orthologous alignments are placed in the directories ortho.d/x-ortho.d and ortho.d/n-ortho.d. You can pass these to other tools for further analysis, or examine them visually by running Gmaj with commands like
../gmaj-ortho.sh human vervet context
or
../gmaj-ortho.sh human vervet content
See the Utility Programs section for more information about gmaj-ortho.sh.
The file docs/ortho.html provides examples of how to interpret the PostScript figures and use Gmaj to investigate the orthology results.
Note that the orthology pipeline will run this for you automatically, so you only need to run it manually if you are not interested in the improved orthology calls. Also, the conversion pipeline always runs for all reference sequences, not just the one you specify for orthology.
From the cluster directory (containing seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run the command
../conversion.sh species_tree.txt
The pipeline may run for an hour or more.
CRIT_BOUND=0.8
near the start of the file conversion.sh. Note that values below 60% or above 90% are generally not recommended.
../bin/gc-info non-redundant.gc annot.d self.d
../gmaj-conv.sh human
where the argument is the reference species whose conversions you want to see. The Utility Programs section has more information about gmaj-conv.sh, and the file docs/gmaj_geneconv.html provides a short tour of how to use Gmaj to investigate conversions.
The main output file is all.gc, which contains the details of all of the conversion observations in each species, and can be inspected directly if gc-info and Gmaj do not convey the desired information. Additional output files include non-redundant.gc, which lists only one representative line from all.gc for each distinct conversion event, species_tree_with_index.txt, which simply numbers the tree branches consecutively for reference purposes, and an assortment of MAF alignments, some of which are used by Gmaj.
The all.gc and non-redundant.gc files use the same format. The first line is a copy of the species_tree_with_index.txt file, labeling the tree edges so those associated with each conversion event can be indicated. The next line provides brief headers for the data columns, and subsequent lines contain detailed information for each paralogous pair of intervals where conversion was detected, using the tab-separated fields listed in Table 2.
Note that all position coordinates are 1-based, closed-interval (i.e. the first nucleotide in the FastA sequence is called "1", and the intervals include both endpoints), and are specified relative to the entire given sequence for that species (e.g. conversion regions are not relative to the paralogs in which they are found). If an interval has an orientation (strand) of "−", the endpoints are reported the same as if it were "+".
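To make the coordinate convention concrete, the following Python sketch converts a 1-based, closed interval into the 0-based, half-open slice that Python uses for strings. The function names are hypothetical and not part of CHAP.

```python
def closed_to_slice(start, end):
    """Convert a 1-based closed interval [start, end] to a Python slice."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed interval: {start}..{end}")
    # Shift the start down by one; the end stays, since slices exclude it.
    return slice(start - 1, end)


def interval_length(start, end):
    """Number of nucleotides in a closed interval (both endpoints count)."""
    return end - start + 1


def extract(sequence, start, end):
    """Return the subsequence covered by a 1-based closed interval."""
    return sequence[closed_to_slice(start, end)]
```

For example, under this convention the interval 1..3 of the sequence ACGTACGT covers the three nucleotides ACG, and an interval whose start equals its end covers exactly one nucleotide.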
Table 2. Fields in conversion output files all.gc and non-redundant.gc.
Field | Description
---|---
pair | index for each pair of paralogous sequences within a species
species | name of species containing the conversion
beg1, end1 | start and end positions of the first sequence (i.e., the first paralogous interval of the pair, in the named species)
species | name of species (again)
beg2, end2 | start and end positions of the second sequence (i.e., the second paralogous interval)
orient | orientation (strand) of the second sequence with respect to the first
length | length of the first sequence
identity | fraction of identical nucleotides for the two sequences
gc_len | length of the conversion region (measured in the first sequence)
p-value | P-value for the conversion test
gc_beg1, gc_end1 | start and end positions for the conversion region in the first sequence
gc_beg2, gc_end2 | start and end positions for the conversion region in the second sequence
direction | direction of conversion, given as a code
c1_name, c1_start, c1_end, c1_orient | ortholog of the first sequence in the outgroup species
c2_name, c2_start, c2_end, c2_orient | ortholog of the second sequence in the outgroup species
event_id | identifying number for the conversion event (note that multiple observation lines may reflect the same event)
tree_branch | indication of where the conversion event occurred in the tree topology, specified as a comma-separated list of possible edges
c1_blocks | indices of alignment blocks containing the ortholog of the first sequence
c2_blocks | indices of alignment blocks containing the ortholog of the second sequence
ortholog_status | status of orthologs in the outgroup species, given as a code
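The three-part layout of these files (tree line, header line, tab-separated data lines) could be read with a Python sketch like the one below. The names parse_gc and column are hypothetical, and the sketch assumes the header line is tab-separated in the same way as the data lines. Column names are taken from the file itself rather than hard-coded, and rows are kept as lists because the "species" header occurs twice.

```python
def parse_gc(text):
    """Split .gc text into (tree_line, headers, rows).

    rows is a list of lists of strings, one per conversion observation.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a tree line and a header line")
    tree = lines[0].strip()
    headers = lines[1].split("\t")
    rows = [ln.split("\t") for ln in lines[2:]]
    return tree, headers, rows


def column(headers, rows, name):
    """Return the values of the first column whose header equals name."""
    idx = headers.index(name)  # raises ValueError if the header is absent
    return [row[idx] for row in rows if len(row) > idx]
```

For instance, after tree, headers, rows = parse_gc(open("non-redundant.gc").read()), a call such as column(headers, rows, "species") would give the species name for each listed conversion, provided the file's header line uses that label.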
Note that running these programs without any arguments will typically give you a brief reminder of the usage syntax.
ortho-fig.sh
This script generates the PostScript orthology figures, by first creating the *.fig files describing the diagrams and then running the orthofig.jar program with appropriate parameters to do the actual drawing (*.eps). It is normally called automatically by the main ortho.sh pipeline, but you can rerun it manually if needed (e.g. to use a different set of gene annotations), via a command like
../ortho-fig.sh human annot_dir
The reference species must be one for which you have already run ortho.sh. If you do not specify an annotation directory, the default is to use fig_annot.d, which contains your original annotations from annot.d plus pseudogenes that have been inferred by the pipeline. The colors for the gene boxes are specified in the file ortho-fig.colors, which you can edit if desired. By default this file is located in the package's resources directory.
orthofig.jar
This is the Java program that draws the PostScript figures from the *.fig files. You can rerun it manually to change the drawing parameters, but only after the *.fig files have been created. The ortho-fig.sh script prints the parameters it is using, so you can adjust just the ones you want to change. For an explanation of the available parameters, run the command
java -jar ../orthofig.jar -help
gmaj-ortho.sh
This script runs Gmaj to view the orthology calls between the reference and another species. The orthologous alignments are shown superimposed (in black) on the full set of chained pairwise alignments between the two sequences (brown). Use a command like
../gmaj-ortho.sh human vervet orth_type annot_dir
where orth_type is one of "context", "content", or "cage" (the last specifies the preliminary orthology calls made by the conversion pipeline's CAGE program). Orthology results by context (X-orthology) and by content (N-orthology) are only available for the reference species you specified when running ortho.sh, but the CAGE calls are produced for all reference species. Again, if you don't specify an annotation directory, then fig_annot.d is used by default (unless you only ran conversion.sh instead of ortho.sh, in which case fig_annot.d was not created, so annot.d is used).
Gmaj can draw annotations on the alignment plots in the form of colored background bands called underlays. By default the CHAP scripts build underlays for Gmaj automatically from your gene annotation files, but you can override this by supplying your own underlay files (e.g. to include items other than genes and exons). These files must have names like human.underlays, vervet.underlays, etc., and follow the format specified in the documentation for the main release of Gmaj (except that a new color PaleGray has been added for CHAP). You can put them either in the annotation directory you specify or in annot.d (putting them in fig_annot.d is also possible but not recommended because that directory is wiped out and recreated each time ortho.sh is run). Note that the default underlays are placed in temp_underlays.d; this directory is wiped out and recreated whenever the Gmaj scripts are run, but you can use the files in it as examples or templates for making your own custom underlay files.
gmaj-conv.sh
This script runs Gmaj to view the conversion calls for a particular reference species and examine the evidence for them. It has a number of parameters available for customization, but only the reference species is required.
../gmaj-conv.sh human annot_dir genomic_offset "title" exon_color
Examples:
../gmaj-conv.sh human
../gmaj-conv.sh human my_annot.d 31334805 "Conversions in the human CCL region" LightYellow
../gmaj-conv.sh human "" 0 "" None
The parameters are position-dependent; if you want to keep the default annotations, or if you do not want an offset or a title, then use "", 0, and "" respectively to reach subsequent options. As before, the default annotation directory is fig_annot.d if it exists, otherwise annot.d. The genomic_offset is added to all position labels in the reference sequence, so they can be displayed with respect to e.g. the entire chromosome instead of the provided cluster sequence. The title is applied to the Gmaj window, and exon_color is used for building default underlays if you haven't supplied custom ones for a particular sequence (see the discussion of underlay files above). The list of valid underlay colors is available in the documentation for the main release of Gmaj (except that a new color PaleGray has been added for CHAP), or you can specify "None" to have this script suppress all underlays. The default exon color is LightGray.
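The position-dependent parameters described above can be illustrated with a small Python helper that builds the argument list and fills in the documented placeholders ("", 0, "") for any skipped options. The function gmaj_conv_args is hypothetical and not part of CHAP; it only assembles the command and does not run it.

```python
def gmaj_conv_args(ref, annot_dir=None, genomic_offset=None,
                   title=None, exon_color=None):
    """Build the argument list for ../gmaj-conv.sh.

    Options left as None are omitted if they trail the last supplied option;
    otherwise they are replaced by the documented placeholder for that slot.
    """
    optional = [annot_dir, genomic_offset, title, exon_color]
    # Placeholders for annot_dir, genomic_offset, and title; exon_color is the
    # last slot, so it never needs one.
    placeholders = ["", "0", "", None]
    supplied = [i for i, value in enumerate(optional) if value is not None]
    last = max(supplied, default=-1)
    args = ["../gmaj-conv.sh", ref]
    for i in range(last + 1):
        value = optional[i]
        args.append(placeholders[i] if value is None else str(value))
    return args
```

The resulting list can be passed to subprocess.run from within the cluster directory; passing a list rather than a single string avoids having to quote a title that contains spaces.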
gc-info
This is a compiled C program located with the other binaries (by default in the package's bin directory). It computes some summary statistics about the detected conversions in each of the species, and prints them in a tab-separated format with column headers.
../bin/gc-info non-redundant.gc annot.d self.d
infer-annot.sh
This script aims to help you obtain estimated gene annotations for non-reference species from those of a reference species, using the Wise2 software from EBI (Birney et al. 2004).
First, download and install Wise2 according to the instructions that come with it. If the installed location is not in your command path, modify the right-hand side of the line
GENEWISE=genewise
near the start of CHAP's infer-annot.sh script to specify the path for the genewise executable on your computer.
Next, go to your cluster directory, create the annot.d subdirectory, and put your annotation file for the reference species in it (e.g. human.codex), as described in the Data Preparation section. (You could use any subdirectory name for this inference step, but it needs to be called annot.d in order for the ortho.sh and conversion.sh scripts to find it later.) Put all of your sequences in the seq.d directory as usual.
Finally, from the cluster directory, run the command
../infer-annot.sh human annot.d
This will use the reference annotations to estimate gene and exon locations for all of the other sequences that don't already have annotation files, and put the new *.codex files in the same directory (annot.d). Of course, you may edit these as desired before going on to run the ortho.sh or conversion.sh pipeline.
cleanout.sh
This script is provided to help clean up a specified cluster directory, removing files and subdirectories added by the CHAP pipelines. Any material added by users to pipeline-created directories will be wiped out when the directories are removed, but other user files will generally be left alone.
../cleanout.sh cluster_dir clean_level refseq_name
If you are already in the cluster directory to be cleaned, you can use "." for that parameter. The clean_level controls which files and directories are removed, with higher values specifying increasingly thorough/drastic cleanup, as follows. The levels are cumulative, with each level including all lower ones.
Level | What is removed
---|---
0 | temporary scratch files normally deleted automatically by the pipelines; useful if a script did not finish due to an error
1 | additional intermediate output from pipeline programs (but final results and files needed by Gmaj and the figure generator are kept)
2 | result files for all reference sequences other than the specified one
3 | all output except RepeatMasker results and Gmaj user preferences; useful with the pipelines' --no_rm option to avoid the delay of re-masking
4 | all output; only the original user input should remain
The refseq_name is only used for level 2. It specifies the reference sequence of the results you want to keep; others will be discarded.
Examples:
../cleanout.sh . 1 # good for routine tidying
../cleanout.sh . 2 human # used on aglobin.example to save space
For details on exactly which files are deleted at which levels, please see the comments for the variable assignments in the top section of the script.
Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res. 14:988. PubMed 15123596
Smit AFA, Hubley R, Green P (1996-2010) RepeatMasker Open-3.0. Unpublished; http://www.repeatmasker.org.
Song G, Hsu C-H, Riemer C, Zhang Y, Kim HL, Hoffmann F, Zhang L, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2011) Conversion events in gene clusters. BMC Evol. Biol. 11:226. PubMed 21798034
Song G, Riemer C, Dickins B, Kim HL, Zhang L, Zhang Y, Hsu C-H, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2012) Revealing mammalian evolutionary relationships by comparative analysis of gene clusters. To appear in Genome Biol. Evol.
March 2012