Des Higgins (Higgins@EBI.AC.UK)
European Bioinformatics Institute
Cambridge CB10 1RQ
Please e-mail bug reports/complaints/suggestions (polite if possible) to Toby Gibson or Des Higgins.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
The usage of Clustal W is largely the same as for Clustal V details of which are described in clustalv.doc. Details of the new alignment algorithms are described in the manuscript by Thompson et. al. above, an ascii/text version of which is included (clustalw.ms). This file lists some of the details not covered by either of the above documents.
FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/ MSF.
The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised by the first characters in the file. This is kind of stupid/crude but works most of the time and it is difficult to do reliably, any other way.
Format First non blank word or character in the file. ............................................................... FASTA > NBRF >P1; or >D1; EMBL/SWISS ID GDE protein % GDE nucleotide # CLUSTAL CLUSTAL (blocked multiple alignments) GCG/MSF PILEUP " " "Note, that the only way of spotting that a file is MSF format is if the word PILEUP appears at the very beginning of the file. If you produce this format from software other than the GCG pileup program, then you will have to insert the word PILEUP at the start of the file. Similarly, if you use clustal format, the word CLUSTAL must appear first.
All of these formats can be used to read in AN EXISTING FULL ALIGNMENT. With CLUSTAL format, this is just the same as the output format of this program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences must be the same length, INCLUDING GAPS ("_" in clustal format; "." in MSF). With the other formats, sequences can be gapped with "-" charcters. If you read in any gaps these are kept during any later alignments. You can use this facility to read in an alignment in order to calculate a phylogenetic tree OR to output the same alignment in a different format (from the output format options menu of the multiple alignment menu) e.g. read in a GCG/ MSF format alignment and output a PHYLIP format alignment. This is also useful to read in one reference alignment and to add one or more new sequences to it using the "profile alignment" facilities.
DNA vs. PROTEIN: the program will count the number of A,C,G,T,U and N charcters. If 85% or more of the characters in a sequence are as above, then DNA/RNA is assumed, protein otherwise.
IMPORTANT : Additional note from Toby Gibson :
Clustal W NEVER DELETES gaps in the original alignment. This facility allows you to add new sequences to an alignment that already has gaps in. Then the old alignment stays in good quality (if it was good before....). Parameter "reset gaps between alignments" only deletes NEW gaps just added in an alignment run. This option is for use if you align the same sequences twice without leaving the program eg to try different gap penalties. In fact it is INCORRECT to do a PILEUP alignment first, although Clustal W can read and write these alignments for compatibility. It is better to use the GCG command "etopir @sequences.lis" where sequences.lis is a file of sequence entry names to get your sequences, this uses EGCG etopir command.
The alignment output format can be set to any (or all) of: CLUSTAL (a self explanatory blocked alignment) NBRF/PIR (same as input format but with "-" characters for gaps) MSF (the main GCG package multiple alignment format) PHYLIP (Joe Felsenstein's phylogeny inference package. Gaps are set to "-" characters. For some programs (e.g. PROTPARS/DNAPARS) these should be changed to "?" characters for unknown residues. GDE (Used by Steven Smith's GDE package)
You can also choose between having the sequences in the same order as in the input file or writing them out in an order that more closely matches the order used to carry out the multiple alignment.
The New Hampshire format is only useful if you have software to display or manipulate the trees. The PHYLIP package is highly recommended if you intend to do much work with trees and includes programs for doing this. If you do not have such software, request the trees in the older clustal format and see the documentation for Clustal V (clustalv.doc). WE DO NOT PROVIDE ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY.
acgtacgtacgtacgt acgtacgtacgtacgt a----cgtacgtacgt gets the same score as ----acgtacgtacgtNOW, terminal gaps are free. This is better on average and stops silly effects like single residues jumping to the edge of the alignment. However, it is not perfect. It does mean that if there should be a gap near the end of the alignment, the program may be reluctant to insert it i.e.
cccccgggccccc cccccgggccccc ccccc---ccccc may be considered worse (lower score) than cccccccccc---In the right hand case above, the terminal gap is free and may score higher than the laft hand alignment. This can be prevented by lowering the gap opening and extension penalties. It is difficult to get this right all the time. Please watch the ends of your alignments.
Any gaps that are read in from the input file are always kept, regardless of the setting of this switch. If you read in a full multiple alignment, the "reset gaps" switch has no effect. The old gaps will remain and if you carry out a multiple alignment, any new gaps will be added in. If you wish to carry out a full new alignment of a set of sequences that are already aligned in a file you must input the sequences without gaps.
A second option is to align the sequences from the second profile, one at a time to the first profile. This is done, taking the underlying tree between the sequences into account. This is useful if you have a set of new sequences (not aligned) and you wish to add them all to an older alignment.
In Clustal V we used a simple formula to convert an observed distance to one that is corrected for multiple hits. The observed distance is the mean number of differences per site in an alignment (ignoring sites with a gap) and is therefore always between 0.0 (for ientical sequences) an 1.0 (no residues the same at any site). These distances can be multiplied by 100 to give percent difference values. 100 minus percent difference gives percent identity. The formula we use to correct for multiple hits is from Motoo Kimura (Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983, page 75) and is:
K = -Ln(1 - D - (D.D)/5) where D is the observed distance and K is corrected distance.This formula gives mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1 i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been 2.5 substitutions per site over the course of evolution since the 2 sequences diverged. This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues). The PAM scale of evolution and its derivation/calculation comes from the work of Margaret Dayhoff and co workers (the famous Dayhoff PAM series of weight matrices also came from this work). Dayhoff et al constructed an elaborate model of protein evolution based on observed frequencies of substitution between very closely related proteins. Using this model, they derived a table relating observed distances to predicted PAM distances. Kimura's formula, above, is just a "curve fitting" approximation to this table. It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly unaccurate at high D (>0.75) and fails completely at around D = 0.85.
To circumvent this problem, we calculated all the values for K corresponding to D above 0.75 directly using the Dayhoff model and store these in an internal table, used by Clustal W. This table is declared in the file dayhoff.h and gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but with real sequences, distances of 0.93 (less than 7% identity) are rare. If your data set includes sequences with this degree of divergence, you will have great difficulty getting accurate trees by ANY method; the alignment itself will be very difficult (to construct and to evaluate).
There are some important things to note. Firstly, this formula works well if your sequences are of average amino acid composition and if the amino acids substitute according to the original Dayhoff model. In other cases, it may be misleading. Secondly, it is based only on observed percent distance i.e. it does not DIRECTLY take conservative substitutions into account. Thirdly, the error on the estimated PAM distances may be VERY great for high distances; at very high distance (e.g. over 85%) it may give largely arbitrary corrected distances. In most cases, however, the correction is still worth using; the trees will be more accurate and the branch lengths will be more realistic.
A far more sophisticated distance correction based on a full Dayhoff model which DOES take conservative substitutions and actual amino acid composition into account, may be found in the PROTDIST program of the PHYLIP package. For serious tree makers, this program is highly recommended.
A further problem arises in almost exactly the opposite situation: when you bootstrap a data set which contains 3 or more sequences that are identical or almost identical. Here, the sets of identical sequences should be shown as a multifurcation (several sequences joing at the same part of the tree). Because the Neighbor-Joining method only gives strictly dichotomous trees (never more than 2 sequences join at one time), this cannot be exactly represented. In practice, this is NOT a problem as there will be some internal branches of zero length seperating the sequences. If you display the tree with all branch lengths, you will still see a multifurcation. However, when you bootstrap the tree, only the branching orders are stored and counted. In the case of multifurcations, the exact branching order is arbitrary but the program will always get the same branching order, depending only on the input order of the sequences. In practice, this is only a problem in situations where you have a set of sequences where all of them are VERY similar. In this case, you can find very high support for some groupings which will disappear if you run the analysis with a different input order. Again, the PHYLIP package deals with this by offering a JUMBLE option to shuffle the input order of your sequences between each bootstrap sample.