This file describes the steps used to create the salmonid phylogenies data set and the contents of the archive file found on the cGrasp website.

Contigs

EST sequences were taken from a variety of sources (Koop lab, Genome Sciences Centre, NCBI) for 6 species: Salmo salar (Atlantic Salmon), Oncorhynchus mykiss (Rainbow Trout), Osmerus mordax (Rainbow Smelt), Coregonus clupeaformis (Lake Whitefish), Salvelinus fontinalis (Brook Trout), and Arctic Grayling (tthy in the table below). For each species a two-stage assembly was performed using Phrap, with Phrap's minscore and repeat_stringency parameters set to 100/0.99 in the first stage and 300/0.96 in the second. The output of this step was a set of contigs for each species, as summarized in the following table.

    Species                   # of ESTs   # of Contigs
    Atlantic Salmon (ssal)       434384          81398
    Rainbow Trout (omyk)         346704          51199
    Rainbow Smelt (omor)          36758          12159
    Lake Whitefish (cclu)         10842           6446
    Brook Trout (sfon)            10051           4946
    Arctic Grayling (tthy)         5930           5408

Clusters

All of the contigs resulting from the two-stage assemblies were then compared against each other with BLAST (evalue=1e-35, hits=100), and the results were used to generate clusters of contigs.

Clustering algorithm:

    for each contig c
        for each blast hit against a different contig
            if alignment good
                a = existing cluster containing contig or new cluster if none
                b = existing cluster containing blast hit or new cluster if none
                join clusters a & b

Here each alignment was an ends-free alignment with scores of 2/-2/-5/-1 for match/mismatch/gap open/gap extend. An alignment was considered good if the overlap was greater than 75% of the length of the shorter sequence, with greater than 70% identity in the overlapping region.

Trimming the Clusters

After the contigs had been grouped into clusters, the individual clusters were further filtered to contain only contigs that had mutually overlapping regions.
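As an illustration of the clustering step from the Clusters section (before any trimming), the join logic can be sketched in Python with a union-find structure. This is only a sketch under stated assumptions, not the code used to build the data set: the function names, the contig IDs in the usage note, and the representation of BLAST hits as pairs of IDs are all hypothetical, and identity is assumed to be given as a fraction between 0 and 1.

```python
from collections import defaultdict


def is_good_alignment(overlap_len, identity, len_a, len_b):
    """Apply the clustering thresholds described in the text.

    Good means: overlap longer than 75% of the shorter sequence and
    identity in the overlap greater than 70% (identity given as 0..1).
    """
    shorter = min(len_a, len_b)
    return overlap_len > 0.75 * shorter and identity > 0.70


def cluster_contigs(good_pairs):
    """Join contigs into clusters, one join per good alignment.

    good_pairs is an iterable of (contig_id, hit_id) tuples that already
    passed is_good_alignment (hypothetical input format). Contigs with no
    good alignment never appear, mirroring the pseudocode, where a cluster
    is only created when a good alignment is found.
    """
    parent = {}

    def find(c):
        # Create a new one-member cluster on first sight of a contig.
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for contig, hit in good_pairs:
        root_a, root_b = find(contig), find(hit)
        if root_a != root_b:
            parent[root_b] = root_a  # join clusters a & b

    clusters = defaultdict(list)
    for c in list(parent):
        clusters[find(c)].append(c)
    return sorted(sorted(members) for members in clusters.values())
```

For example, with made-up IDs, `cluster_contigs([("ssal_1", "omyk_7"), ("omyk_7", "sfon_3"), ("cclu_2", "tthy_9")])` groups the first three contigs into one cluster and the last two into another, because joins are transitive.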
Cluster filtering algorithm:

    for each cluster
        sort contigs in cluster by length (longest to shortest)
        good set = {}
        for each contig in cluster
            if alignment good with all other members of good set
                add contig to good set
        trim all members of good set to largest common alignment
        cluster = good set

Here each alignment was an ends-free alignment with scores of 2/-2/-8/-1 for match/mismatch/gap open/gap extend. An alignment was considered good if the overlap was greater than or equal to 300 base pairs, with greater than 60% identity in the overlapping region.

Removing Gaps

Prior to making distance matrices, all gaps (and the corresponding positions in the other sequences of the cluster) were removed. After this point every cluster consisted of one or more sequences of exactly the same length and with no gaps. Clusters not containing at least one representative from each of the 6 species were discarded, which left 78 clusters.

Bootstraps, Distance Matrices and Consensus Trees

Each cluster was bootstrapped 500 times, and distance matrices were then computed for each cluster using the F84 model of nucleotide substitution and Gamma-distributed rates of variation across sites with a coefficient of variation of 0.5. A neighbor-joining tree was computed from each distance matrix, and a 70% majority consensus tree was constructed for each resulting data set. The trees were then rooted manually using Rainbow Smelt as an outgroup. As a final step, pairs of leaf nodes of the same species were iteratively collapsed to simplify the trees (since duplication events occurring after all speciation events are not of interest to this study).

Salmonid Phylogenies Data File

The salmonid phylogenies data file available on the cGrasp website contains a summary of the resulting data sets. The file is a gzipped tar archive containing one numbered subdirectory for each of the 78 data sets (the numbers are arbitrary and have no specific meaning).
Each subdirectory contains:

- The final alignment for the cluster that was used to create the tree, in interleaved PHYLIP format.
- A tab-separated text file (*.csv) containing detailed information about the sequences that make up the cluster (internal Koop lab sequence IDs, Swissprot annotation information, and accession numbers for the individual ESTs that make up each contig).
- A “raw” tree file, prior to collapsing identical leaf nodes and removing Koop lab ID numbers.
- A “processed” tree file, in which the identical leaf nodes have been collapsed and the leaf node names converted into 4-letter species name abbreviations.
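To get an overview of the archive without unpacking it, the files can be grouped by data set with Python's standard tarfile module. The sketch below assumes that the numbered subdirectories sit at the top level of the archive, which this description does not state explicitly; the function name and the archive path passed to it are hypothetical.

```python
import tarfile
from collections import defaultdict


def files_by_data_set(archive_path):
    """Map each top-level subdirectory name to the sorted files inside it.

    Assumes a gzipped tar archive whose top-level entries are the numbered
    data-set subdirectories (an assumption about the layout).
    """
    data_sets = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directory entries and links
            # Drop empty and "." components such as a leading "./".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) < 2:
                continue  # file not inside any subdirectory
            data_sets[parts[0]].append("/".join(parts[1:]))
    return {name: sorted(files) for name, files in sorted(data_sets.items())}
```

Calling `files_by_data_set("salmonid_phylogenies.tar.gz")` (a placeholder filename) would return a dictionary keyed by subdirectory name. Keys are sorted as strings, which is harmless here because the subdirectory numbers carry no meaning.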