The Cluster History Analysis Package (CHAP 2)

Introduction

Installation

  1. The CHAP pipelines need to run the RepeatMasker program (Smit et al. 1996-2010), which can be obtained from www.repeatmasker.org. When installing RepeatMasker you will need to choose which sequence search engine and which repeat database library to use; we suggest Cross_Match and RepBase respectively, which are both free for academic use. If the RepeatMasker executable is not in your command path, modify the right-hand side of the line
        REPEATMASKER=RepeatMasker
    near the start of the file conversion.sh to indicate its location on your computer.
  2. In the directory containing the unpacked files from the CHAP distribution archive, which we will call the "package directory", type
        make
    to compile the component programs and install them in the bin subdirectory.
  3. For advanced users:  By default (if you just run make), CHAP is configured to keep its scripts and Java programs in the package directory, while compiled binaries and resource data files are located in the bin and resources subdirectories, respectively. If you want to install it elsewhere (e.g. centrally for multiple users), you can edit the lines for CHAP_SCRIPT_DIR, CHAP_JAVA_DIR, CHAP_BINARY_DIR, and CHAP_RESOURCE_DIR at the top of the Makefile to specify the desired locations, and then run
        make install
    (it is not necessary to run make first, but it doesn't hurt either). This will configure the installed scripts to look for their programs and resource files in the directories you have specified, instead of relative to the working cluster directory (which then no longer needs to be inside the package directory). However, it also means that users will need to modify the command paths in the examples accordingly.
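Before running make, it can be useful to confirm that RepeatMasker is actually reachable from your command path (step 1). The following is a minimal sketch of such a check; the function name find_tool is our own illustration and is not part of the CHAP distribution.

```shell
#!/bin/sh
# find_tool NAME: print the location of NAME if it is in the command path;
# otherwise print a hint on stderr and return a non-zero status.
# (find_tool is a hypothetical helper for illustration, not a CHAP program.)
find_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        printf '%s\n' "$tool_path"
    else
        printf '%s not found in the command path; ' "$1" >&2
        printf 'set the REPEATMASKER line in conversion.sh to its location\n' >&2
        return 1
    fi
}

# Example call (commented out so that the sketch has no side effects):
#   find_tool RepeatMasker
```

If the function prints a path, the default line REPEATMASKER=RepeatMasker should work unchanged; otherwise edit that line as described in step 1.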

Data Preparation

For each gene cluster that you want to analyze, do the following.

  1. In the package directory, create a subdirectory for the cluster, which we will call the "cluster directory".
  2. Sequences.  In the cluster directory, create a subdirectory called seq.d and put your FastA-formatted sequence files in it, giving each file the appropriate species name, e.g., human, vervet.
  3. Annotations.  In the cluster directory, create another subdirectory called annot.d and put your gene annotation files in it. These files use a "coding exons" format that is similar to the exons format supported by our PipMaker server, except that the position endpoints reflect coding regions only (i.e. translation rather than transcription, so UTRs are excluded). The CHAP distribution includes sample files in this format. The file names must consist of the species name followed by a .codex extension, e.g. human.codex, vervet.codex, etc.

    In this format, the directionality of a gene (>, <, or |), the start and end positions of its coding sequence, and its name should be on one line, followed by lines specifying the coding start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand (<). All positions are relative to the cluster sequence files you provide (not the entire chromosomes), and use a 1-based, closed-interval coordinate system (i.e., the first nucleotide in your corresponding sequence file is called "1", and the specified ranges include both endpoints). Names ending in _ps indicate pseudogenes (an exception to the "coding only" rule). We recommend limiting each gene name to a single word (i.e. without spaces), but if it has multiple words then the _ps suffix must be on the first word rather than the last one in order to be properly recognized.

    Thus, the file might begin as follows:

         > 12910 14400 HBZ-T1
         12910 13004
         13892 14096
         14272 14400
         > 23122 25156 HBZ-T2_ps
         23122 25156
         > 25998 26708 HBK
         25998 26089
         26268 26472
         26580 26708
         ... etc.
    

    The orthology pipeline requires these annotation files for making its gene orthology diagrams; you must supply gene annotations for the reference and at least one other species to get any figures. If you just want orthologous alignments for the sequences, or if you are just running the conversion pipeline, then these files are not strictly necessary but are still recommended for best accuracy. They assist somewhat in the preliminary ortholog detection for finding conversions, and the enhanced orthology mapper uses them to refine its similarity scoring. If present, they are also used by the gc-info summary program to compute conversion statistics for coding regions (with pseudogenes excluded), and by Gmaj to annotate its display.

    If you do not know the actual gene locations in some of your sequences, you may be able to estimate them with a program such as Wise2 (Birney et al. 2004), using known protein sequences in, say, human to find gene structures within the DNA sequences of other species. The CHAP package includes a utility script called infer-annot.sh to help with automating this approach; please see the Utility Programs section for information on how to use it.

  4. Species tree.  Put a text file containing a binary species tree in the cluster directory. The species names in the tree must match the file names used for your sequences. CHAP uses a simplified version of the Newick format, where branch lengths are omitted, all leaf nodes have names but interior nodes do not, and the tree is rooted at an interior node. Quoted labels are not supported, nor are comments in square brackets. (However, unlike TBA, CHAP does expect the usual commas and ending semicolon.) The tree can include line breaks and extra spaces/tabs, but its maximum total length (excluding whitespace) is currently 1000 characters. This file can have any name; for concreteness, let's suppose it is called species_tree.txt.
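The rules for the "coding exons" format in step 3 lend themselves to an automated sanity check. The sketch below is one possible way to do this with awk; check_codex is a hypothetical helper, not a CHAP utility. It assumes that a file contains only gene lines and exon lines as described above (blank lines are skipped), that each exon lies within its gene's coding range, and that exons do not overlap. These last two conditions are our reading of the format rather than documented requirements.

```shell
#!/bin/sh
# check_codex FILE: report lines that violate the "coding exons" rules.
# Returns 0 if no problems were found, 1 otherwise.
# (check_codex is a hypothetical helper for illustration, not part of CHAP.)
check_codex() {
    awk '
    function fail(msg) { printf "%s line %d: %s\n", FILENAME, NR, msg; bad = 1 }
    NF == 0 { next }                          # skip blank lines
    $1 == ">" || $1 == "<" || $1 == "|" {     # gene line: direction start end name
        if (NF < 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/) {
            fail("expected: direction start end name"); next
        }
        gene_start = $2 + 0; gene_end = $3 + 0; prev_end = 0; in_gene = 1
        # positions are 1-based and closed, so start >= 1 and end >= start
        if (gene_start < 1 || gene_end < gene_start) fail("bad gene range")
        next
    }
    NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {   # exon line: start end
        s = $1 + 0; e = $2 + 0
        if (!in_gene)                            fail("exon line before any gene line")
        else if (s < 1 || e < s)                 fail("bad exon range")
        else if (s < gene_start || e > gene_end) fail("exon outside the gene coding range")
        else if (s <= prev_end)                  fail("exons must be in order of increasing address")
        prev_end = e
        next
    }
    { fail("unrecognized line") }
    END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call from the cluster directory (commented out):
#   for f in annot.d/*.codex; do check_codex "$f"; done
```

The check treats reverse-strand genes (<) the same as forward ones, because exons must be listed in increasing order regardless of strand.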

Orthology Pipeline

  1. In the cluster directory (which contains subdirectories seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run a command like
        ../ortho.sh species_tree.txt human
    where the last argument is the name of the sequence to use as the reference. The pipeline may take from several minutes to an hour or more, depending on the complexity of the cluster's history and the number of sequences.
  2. If desired, you can run the pipeline again for a different reference, e.g.
        ../ortho.sh species_tree.txt vervet --no_rm
    This is currently rather inefficient because it runs the conversion pipeline again unnecessarily, but at least the --no_rm option avoids re-running RepeatMasker. Since the output files include the reference in their names, your earlier results should coexist peacefully without being overwritten. One small exception, however, is the inferred pseudogenes in the fig_annot.d directory, which are used by default for the PostScript figures and Gmaj viewer. These are computed in a theoretically reference-dependent manner, so their endpoints may change slightly when they are overwritten by a new run. Note that running multiple jobs simultaneously in the same cluster directory is not supported and may produce erroneous output, since they will attempt to use the same temporary scratch files.
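Since a pipeline run can take an hour or more, a quick pre-flight check of the cluster directory may save time. The sketch below tests the species tree against the simplified Newick rules from the Data Preparation section and compares its leaf names with the files in seq.d. All three function names are hypothetical (they are not CHAP programs), and the comparison assumes that seq.d contains only your sequence files.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these functions ship with CHAP.

# tree_leaves FILE: print the leaf (species) names of the tree, sorted.
tree_leaves() {
    tr -d ' \t\r\n' < "$1" | tr '(),;' '\n\n\n\n' | grep -v '^$' | sort
}

# check_tree FILE: test the format rules listed under "Species tree".
check_tree() {
    tr -d ' \t\r\n' < "$1" | awk -v sq="'" '
    { s = s $0 }
    END {
        n = length(s)                       # length excluding whitespace
        if (n > 1000)               { print "tree exceeds 1000 characters"; exit 1 }
        if (substr(s, n, 1) != ";") { print "missing ending semicolon"; exit 1 }
        if (substr(s, 1, 1) != "(") { print "tree must be rooted at an interior node"; exit 1 }
        if (index(s, ":") || index(s, "[") || index(s, "]") || index(s, sq) || index(s, "\"")) {
            print "branch lengths, comments and quoted labels are not supported"; exit 1
        }
        depth = 0
        for (i = 1; i < n; i++) {
            c = substr(s, i, 1)
            if (c == "(") { depth++; commas[depth] = 0 }
            else if (c == "," && depth > 0) { commas[depth]++ }
            else if (c == "," || (c == ")" && depth == 0)) { print "unbalanced tree"; exit 1 }
            else if (c == ")") {
                # a binary interior node has exactly two children, hence one comma
                if (commas[depth] != 1) { print "tree is not binary"; exit 1 }
                depth--
                if (depth == 0 && i != n - 1) { print "text after the root node"; exit 1 }
            }
        }
        if (depth != 0) { print "unbalanced parentheses"; exit 1 }
    }'
}

# check_cluster DIR TREE: compare tree leaves with the file names in DIR/seq.d.
check_cluster() {
    cluster=$1; tree=$2
    [ -d "$cluster/seq.d" ] || { echo "missing $cluster/seq.d"; return 1; }
    [ -d "$cluster/annot.d" ] || echo "note: no annot.d, so no orthology figures"
    check_tree "$tree" || return 1
    if [ "$(tree_leaves "$tree")" != "$(ls "$cluster/seq.d" | sort)" ]; then
        echo "species names in $tree do not match the files in seq.d"; return 1
    fi
}
```

From the cluster directory, the check could be chained in front of the pipeline command from step 1, for example

        check_cluster . species_tree.txt && ../ortho.sh species_tree.txt human

Runs for further references should then be started one after another, not simultaneously, for the reason given in step 2.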

Orthology Output

Conversion Pipeline

Note that the orthology pipeline will run this for you automatically, so you only need to run it manually if you are not interested in the improved orthology calls. Also, the conversion pipeline always runs for all reference sequences, not just the one you specify for orthology.

  1. In the cluster directory (which contains subdirectories seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run the command
        ../conversion.sh species_tree.txt
    The pipeline may run for an hour or more.
  2. For advanced users:  The conversion pipeline has a number of internal parameters that have been carefully tuned to reasonable defaults. One of these that is fundamental to our method for detecting conversions is the paralog coverage threshold for choosing whether to use the regular triplet/quadruplet criterion or the alternative "old dup" criterion: if a particular putative conversion covers more than the given fraction of its paralog pair by length, then the alternative criterion is used to test it. The default value for this threshold is 80%, and our simulation study showed that the results are not greatly affected by its exact value. However, if you do want to adjust it (e.g. for an unusual situation), you can edit the line
        CRIT_BOUND=0.8
    near the start of the file conversion.sh. Note that values below 60% or above 90% are generally not recommended.
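If you prefer not to edit conversion.sh by hand, the change to the CRIT_BOUND line can be scripted. The following sketch shows one way to do so; set_crit_bound is a hypothetical helper, not part of CHAP, and it assumes that the assignment occupies a line of its own starting with CRIT_BOUND= (as shown above). It keeps a backup copy and warns when the value falls outside the 60% to 90% range.

```shell
#!/bin/sh
# set_crit_bound FILE VALUE: replace the CRIT_BOUND=... line in FILE with VALUE,
# keeping the previous version as FILE.bak.
# (set_crit_bound is a hypothetical helper for illustration, not part of CHAP.)
set_crit_bound() {
    file=$1; value=$2
    # accept only a plain decimal fraction such as 0.75
    if ! awk -v v="$value" 'BEGIN { exit !(v ~ /^0\.[0-9]+$/) }'; then
        echo "CRIT_BOUND must be a decimal fraction such as 0.8" >&2
        return 1
    fi
    # values below 0.6 or above 0.9 are allowed but generally not recommended
    if awk -v v="$value" 'BEGIN { exit !(v < 0.6 || v > 0.9) }'; then
        echo "warning: values below 0.6 or above 0.9 are generally not recommended" >&2
    fi
    grep -q '^CRIT_BOUND=' "$file" || { echo "no CRIT_BOUND line in $file" >&2; return 1; }
    # write through the existing file so that its permissions are preserved
    cp "$file" "$file.bak" &&
    sed "s/^CRIT_BOUND=.*/CRIT_BOUND=$value/" "$file.bak" > "$file"
}

# Example call from the cluster directory (commented out):
#   set_crit_bound ../conversion.sh 0.75
```

Note that the sed substitution replaces the whole line, so anything else on that line (such as a trailing comment) would be lost.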

Conversion Output

Utility Programs

Note that running these programs without any arguments will typically give you a brief reminder of the usage syntax.

References

Birney E, Clamp M, Durbin R  (2004)  GeneWise and Genomewise.  Genome Res. 14:988.  PubMed 15123596

Smit AFA, Hubley R, Green P  (1996-2010)  RepeatMasker Open-3.0.  Unpublished;  http://www.repeatmasker.org.

Song G, Hsu C-H, Riemer C, Zhang Y, Kim HL, Hoffmann F, Zhang L, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W  (2011)  Conversion events in gene clusters.  BMC Evol. Biol. 11:226.  PubMed 21798034

Song G, Riemer C, Dickins B, Kim HL, Zhang L, Zhang Y, Hsu C-H, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W  (2012)  Revealing mammalian evolutionary relationships by comparative analysis of gene clusters.  To appear in Genome Biol. Evol.


March 2012