PHYLOGENY HOMEWORK – Hemoglobins and reconciled gene trees

Bioinformatics 597 – Claude dePamphilis

November 18, 2004

Your trees and short report are due (cwd3@psu.edu) no later than class time on Tuesday, November 30. I would prefer an emailed report, but paper will be accepted.

To practice phylogenetic analysis, I would like you to perform several analyses with the hemoglobin dataset that Ross Hardison used in the paper you have read and has kindly provided. This dataset includes just a small fraction of the known globin gene sequences, but it is a nicely diverse collection that includes many organisms, several forms, and a set of four of the human globin gene paralogs. Because the sequences are highly diverged, spanning a probable billion years of globin evolution, they are presented here as inferred amino acid sequences.

Several hemoglobin datasets are available on the class website:

Hemoglobin.clustal – the output from several different Clustal W analyses that were performed with different gap weights. This file includes a fair bit of descriptive information: names of organisms, how the alignments were obtained, etc.

Hemoglobin.notaligned – unaligned versions of these sequences (see note about myoglobin), so that you can do additional alignments with additional sequences using Clustal.

Four files with IDENTICAL information are also found there: HemoglobinMAC.nex, HemoglobinPC.nex, HemoglobinMAC.phy, and HemoglobinPC.phy. These files have the same sequences as the first alignment shown in Hemoglobin.clustal. They differ only in the platform they are saved for (Mac or PC) and in whether they are written in NEXUS format (i.e., for PAUP) or in PHYLIP format. Most programs can read one of these, or you can easily edit one of them to get the data into the format you need.
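
If you do need to edit a file by hand, it helps to know what the target format looks like. The following is a minimal sketch, in Python, of writing an alignment in strict sequential PHYLIP format (a header line with the number of sequences and the number of characters, then a 10-character name field followed by the sequence). The function name and the toy alignment are hypothetical and are not taken from the class files.

```python
def to_phylip(seqs):
    """Format aligned sequences as strict sequential PHYLIP text.

    seqs: dict mapping taxon name -> aligned sequence (all the same length).
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all be aligned to the same length")
    nchar = lengths.pop()
    lines = [f" {len(seqs)} {nchar}"]
    for name, seq in seqs.items():
        # Strict PHYLIP reserves exactly 10 characters for the name, so
        # longer names are truncated here; check that they stay unique.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical toy alignment, for illustration only.
example = {"HumanAlpha": "MV-LS", "HumanBeta": "MVHLT", "Lupin": "MGALT"}
print(to_phylip(example))
```

Some programs accept relaxed variants of this layout (longer names, interleaved blocks), so check the documentation of the program you are using.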

HOMEWORK ASSIGNMENT:

1) Take a good look at the various alignments in Hemoglobin.clustal. Get an idea of the sensitivity of the alignments to different Clustal options, and look at the protein sequences for features that may be of interest to you. If you are adventurous, you may want to run one of the inferred protein sequences through a simple program for predicting protein structural features, such as the helices discussed in the paper and annotated on the dataset.

2) Once you have your bearings, try to open one of the datasets in PHYLIP, or make slight modifications to get the program of your choice to open the dataset (any program other than Clustal, which is lame for phylogeny).

3) Using the sequence for human cytochrome C as an “outgroup,” attempt phylogenetic analysis using, at a minimum, parsimony and neighbor joining. Print out a tree with the results of each analysis. Examine a few output options. This is your warmup, covering some of the territory that we covered in class.

4) ADD NEW DATA TO THIS DATASET in the following way: Using BLASTP against GenBank, find FOUR additional globin protein sequences, including a minimum of two additional sequences from a primate OTHER THAN HUMAN and at least two additional sequences from any nonprimate vertebrate. For example, you might add two zebrafish and two chimp sequences. Save these sequences in FASTA format, and add them to the unaligned globin datafile. Now align the sequences again in Clustal.

5) Perform either NJ and NJ bootstrap phylogenetic analysis or parsimony and parsimony bootstrap analysis on the new, larger dataset. If the bootstrap analyses go slowly, you can drop the plant, bacterial, and protist sequences from them to speed things up.

6) Based on these findings and your knowledge of the species relationships, create a RECONCILED GENE TREE for the VERTEBRATE SPECIES in the analysis that indicates the location of likely gene duplications and gene losses (or missing data). Describe your results.

7) Consider how you might improve upon these analyses. Going right back to the alignment, and then through the various analyses that can be performed, how could you obtain a better analysis or a more confident result? Discuss in a paragraph or so.

8) Your report will consist of the alignment (a simple screen capture would be OK), trees, your results, a description of the results, and discussion.
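
For step 6, the core idea of reconciliation can be sketched in a few lines of Python. Each node of the gene tree is mapped to the smallest clade of the species tree that contains all of the species below it, and a gene tree node is called a duplication when it maps to the same clade as one of its children. This is only an illustrative sketch: the nested-tuple tree representation, the "species_gene" leaf naming, the function names, and the toy trees are all assumptions, and the sketch counts duplications only (it does not infer losses or missing data).

```python
def clades(tree):
    """List the leaf set under every node of a nested-tuple tree.

    The last entry of the returned list is the clade of `tree` itself.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, leaves = [], frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]
    result.append(leaves)
    return result


def lca(species_clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in species_clades if species <= c), key=len)


def count_duplications(gene_tree, species_clades, species_of):
    """Return (mapped species clade, number of duplications) for a gene tree."""
    if isinstance(gene_tree, str):
        return frozenset([species_of(gene_tree)]), 0
    child_maps, dups, species = [], 0, frozenset()
    for child in gene_tree:
        mapped_child, d = count_duplications(child, species_clades, species_of)
        child_maps.append(mapped_child)
        dups += d
        species |= mapped_child
    mapped = lca(species_clades, species)
    if mapped in child_maps:
        # The node and one of its children map to the same species clade.
        dups += 1
    return mapped, dups


# Hypothetical toy trees; leaf names follow an assumed "species_gene" pattern.
species_tree = (("human", "chimp"), "zebrafish")
gene_tree = ((("human_alpha", "chimp_alpha"), "zebrafish_alpha"),
             (("human_beta", "chimp_beta"), "zebrafish_beta"))
species_of = lambda leaf: leaf.split("_")[0]

_, n_dups = count_duplications(gene_tree, clades(species_tree), species_of)
print(n_dups)
```

In the toy example the alpha and beta subtrees each contain all three species, so the root of the gene tree is scored as a duplication that predates the three species.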

(nb – Different programs may have some different options. PHYLIP’s protpars, for example, deals with proteins somewhat differently than PAUP’s parsimony. These differences can be important to the final outcome, so learning a little about the choices is important in actual research.)
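
To make the point about parsimony options concrete, here is a minimal Python sketch of the simplest scoring scheme: Fitch's count for a single alignment column on a fixed binary tree, in which any change from one residue to another costs one step. It is an illustration only and is not the code of either program. PROTPARS, for example, scores amino acid changes with reference to the genetic code, so its step counts can differ from this unordered count. The tree shape and residues below are hypothetical.

```python
def fitch(tree, state_of):
    """Return (candidate ancestral states, minimum steps) for one column.

    tree: nested 2-tuples (a binary tree) with leaf names as strings.
    state_of: dict mapping leaf name -> residue at this column.
    """
    if isinstance(tree, str):
        return {state_of[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, state_of)
    right_set, right_steps = fitch(right, state_of)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state here: no extra change needed.
        return common, left_steps + right_steps
    # No shared state: one change is required on one of the two branches.
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical tree and column, for illustration only.
tree = ((("human", "chimp"), "mouse"), "zebrafish")
column = {"human": "K", "chimp": "K", "mouse": "R", "zebrafish": "R"}
states, steps = fitch(tree, column)
print(states, steps)
```

The total parsimony score of a tree under this scheme is the sum of the step counts over all columns of the alignment.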

(nb – If you are using PAUP, I have also included a treefile that should allow you to reconstruct the exact tree that Hardison found, myoglobin excluded. You can use this in a parsimony analysis, for instance when running a heuristic search, by selecting the option for “enforce topological constraint.”)

If you have trouble, feel free to email me (cwd3@psu.edu). Once you get a file to open up in your program, you can explore. Leave plenty of time to do these analyses; you may need several hours to perform all of the bootstrap analyses. Be sure to run through the basic exercise quickly, no later than November 24, so that any problems turn up early.
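
Bootstrap analyses take time because every replicate repeats the whole tree search on a resampled alignment. The resampling step itself is simple: draw alignment columns at random, with replacement, until the replicate has as many columns as the original. A minimal Python sketch follows; the function name, the seed, and the toy alignment are hypothetical, and the phylogeny programs do this step for you.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Build one bootstrap replicate by resampling columns with replacement.

    seqs: dict mapping taxon name -> aligned sequence (all the same length).
    rng: a random.Random instance, so that runs can be reproduced.
    """
    names = list(seqs)
    nchar = len(seqs[names[0]])
    # The same column indices are used for every taxon, so each sampled
    # column stays intact across the alignment.
    cols = [rng.randrange(nchar) for _ in range(nchar)]
    return {name: "".join(seqs[name][c] for c in cols) for name in names}


# Hypothetical toy alignment, for illustration only.
alignment = {"HumanAlpha": "MV-LSPAD", "HumanBeta": "MVHLTPEE"}
rng = random.Random(597)
for _ in range(3):
    print(bootstrap_alignment(alignment, rng))
```

With 100 replicates, the tree-building step runs 100 times, which is why dropping a few distant sequences can shorten a run noticeably.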