Codon usage analysis
Included with this distribution of codonW should be a test dataset of sequences (input.dat). We will use this set of sequences as a typical example of a codon usage analysis. This test dataset is derived from the open reading frames (ORFs) of Saccharomyces cerevisiae chromosome III as annotated in the EMBL feature table for the sequence entry SCCHRIII (accession number X59720). In the current EMBL (Release 51 June 1997) the number of annotated ORFs was 172. The file input.dat contains 111 of these ORFs. The rationale and why some ORFs were removed is explained below.
The commandline syntax of codonW will be used in this tutorial; all options selected from the commandline are also selectable using the menu system. For more information please read the command line help (codonw -help) or just type "codonw" and use the menu specific online.
Build your dataset of genes carefully.
Always remember that, as in any analysis, but particularly with codon usage, GIGO (garbage in, garbage out) applies. Examine as many sources of information about the data as possible, particularly the original publication and sequence annotations. It is important that the sequences are a representative sample. Five ORFs were removed from the dataset because they were annotated (and had sequence identity) with genes within the previously identified transposable elements Ty2 and Ty5. These ORFs were annotated at positions 1537-2127, 2118-2558, 2816-3742, 84714-86030, 84714-90384. The codon usage of transposable element genes differs from that of chromosomal genes.
Further checks of sequence annotation were carried out, those sequences which had not been assigned gene names or SwissProt accession numbers were removed. The SwissProt annotation was also checked, genes described as hypothetical but having no sequence identity with other proteins were removed.
Sequences should be checked to confirm that they match some basic gene characteristics. Each sequence might reasonably be expected to have an initiation codon and a translation termination codon, and no internal stop codons. Those sequences that do not match these characteristics, or sequences that have partial codons or untranslatable codons are flagged by codonw with warning messages.
To make a first pass of the input data to check for simple sequence problems:
codonw input.dat -nomenu
By default codonw will report the codon usage of each
gene to the file input.blk. As there are no problems with this dataset
there should be no warning messages. However analysis of a previous version
of this dataset based on EMBL Release 50 where SCCHRIII
had 230 annotated ORFs, generated these typical warning messages.
Warning: Sequence 178 "SCCHRIII.PE178______" does not begin with a recognised start codon
Warning: Sequence 178 "SCCHRIII.PE178______" is not terminated by a stop codon
Warning: Sequence 202 "SCCHRIII.PE202______" does not begin with a recognised start codon
Warning: Sequence 202 "SCCHRIII.PE202______" has 1 internal stop codon(s)
Warning: Sequence 202 "SCCHRIII.PE202______" is not terminated by a stop codon
Each sequence is labelled by its numerical occurrence in the input file (i.e. these are the 178th and 202nd sequences in the input file) and its sequence header line.
Sequences that generate warning messages should be examined closely to ascertain why they do. Some sequences may be annotated as partial sequences and therefore the absence of a start or stop codon or the presence of a 3' partial codon is to be expected. Note the presence of a 5' partial codon would cause a frame shift, hence it is ESSENTIAL that 5' partial codons are removed. Unless the frame shift that they produce results in a (incorrect) reading frame that contains internal stop codons, codonw cannot detect this problem. The codon usage of a frame shifted gene sequence could adversely affect the correspondence analysis (COA), though such genes are often recognisable as being outliers on the COA plots.
If a sequence warning is due to incorrect annotation this should be corrected manually. Sequences that produce warnings that cannot be explained or justified (e.g. a gene with an internal stop codon) should be excluded. These warning are for information only and do not exclude sequences from the analysis.
Once the initial quality checks have been made for the data we can then proceed with the codon usage analysis (strictly speaking we can generate COA and codon usage indices tasks at the same time). Some of the indices of codon usage bias that CodonW calculates (i.e. Fop, CAI and CBI) use information about a preferred set of codons for highly expressed genes. This information is species-specific and does not apply to all species (most eukaryotes and many prokaryotes appear to display no codon preference in highly expressed genes). Therefore care must be taken that the appropriate set of optimal codons are used. For most species the optimal codons are not know and therefore the indices should not be calculated at this stage. However, this information is known for Saccharomyces cerevisiae, so we can immediately calculate these indices of codon usage. Later we will see how codonW identifies optimal codons and can generate this information for your species.
The default optimal codons and codon adaptation values are those of E. coli. To select an alternative choice we use the c_type (for CAI values ) and f_type (for FOP/CBI) commandline arguments. These switches require an integer value, where this value is the same as the option number if we were using the menu system to change the codon information.
Example "-c_type 2" is equivalent to
codonw input.dat -all_indices -c_type 2 -f_type 4 -nomenu
See below for the output of this command
The commandline flag -nomenu by-passes the menu system,
the -all_indices indicates to codonw that you wish to calculate all the
codon and amino acid usage indices. These indices areT3s, C3s, A3s, G3s,
CAI, CBI, Fop, Nc, GC3s, GC, L_sym, L_aa, Gravy and Aromaticity. For a
fuller explanation of what these indices are see indices.
These indices can also be used to check whether there are any identical
or almost identical sequences in the input file. If we sort the result
file "input.out" it is much easier to identify the sequences which are
similar.
sort -k 2n input.out (unix for "sort using the second numerical field")
The sorted output reveals the presence of two pairs of identical sequences (Mating type proteins)
ALPHA2____________63 0.3636 0.2273 0.4939 0.2177 0.109 MATALPHA2_________63 0.3636 0.2273 0.4939 0.2177 0.109and
ALPHA1____________52 0.4361 0.2180 0.4228 0.2589 0.112 MATALPHA1_________52 0.4361 0.2180 0.4228 0.2589 0.112
Sequences which appear to be multiple copies of the same gene are normally removed from our codon usage datasets, even if the sequences are not identical but where the differences can be attributed to sequencing error or allelic polymorphism. However different sequences that appear to be members of the same multigene family are retained, even if identical. As we know these ORFs are from different regions of the chromosome and are not the same sequence they were not removed from the sample dataset.
A common representation of codon usage is a tabular format of the total codon usage of a dataset. CodonW can automatically generate this table for you.
codonw input.dat -cutot
The tabulated total codon usage is stored in the output file input.blk.
See Tabulated codon usage below.
The effective number of codons index (ENc), is a very useful preliminary tool for codon usage analysis (Wright 1990). It is a simple measure of codon bias, analogous to the effective number of alleles measurment used in population genetics. It gives the number of equally used codons that would generate the same codon usage bias as observed, lower values indicating stronger bias. A useful feature of ENc is that the effect that GC biases have on the index can be estimated. This allows the comparison of GC3s and ENc against the theoretical values if codon bias was simply caused by GC mutational bias. A plot of ENc vs. GC3s can be seen at http://codonw.sourceforge.net/ENcVsGC3s.gif. Although the majority of genes in this plot have a degree of codon bias that can be explained in terms of GC mutation, the cluster of genes (six genes with ENc <40) have much stronger codon bias than be simply explained in terms of mutational biases. These genes are good candidates for genes whose codon usage has been determined by natural selection probably selection for translational efficiency.
We are now ready to generate a correspondence analysis of the codon usage of SCCHRIII genes. We have a choice about how much information is generated. In this example we will use the default values.
codonw input.dat -coa_cu -nomenu -silent
the -silent flag stops all prompting
This generates a COA of codon usage. The summary file is "summary.coa" and contains most of the data generated by the COA. One of the first sections is the "Explanation of the variation by axis" also stored in eigen.coa.
The total inertia of the data was 0.263176 Num. Eigenval. R.Iner. R.Sum |Num. Eigenval. R.Iner. R.Sum | 01 +4.5755E-02 +0.1739 +0.1739 |02 +3.2372E-02 +0.1230 +0.2969 | 03 +1.8405E-02 +0.0699 +0.3668 |04 +1.2499E-02 +0.0475 +0.4143 |
The relative inertia explained by the first axis is 17.4%, the 2nd axis explains 12.3%, the 3rd 7.0%, etc. (17.45% is not remarkably high for relative inertia explained by the first axis, but there are ORFs included which are described as hypothetical and these produce random noise into the data if they are not real).
The next two sections report position of each gene and codon on the trends.
label Axis1 Axis2 Axis3 Axis4 1_YCG9_Probable_____ 0.00904 0.13153 0.34028 -0.05372 2_YCG8________573_re 0.07429 -0.24652 -0.05502 -0.39837 3_ALPHA2________633_ 0.30675 0.04259 -0.22864 -0.03878 4_ALPHA1________528_ 0.16444 0.00399 -0.02000 0.00937 5_CHA1_________1083_-0.00322 0.10387 0.07137 0.11896
this information is best viewed graphically, an example of the location of the genes on the two principal axes can be seen here http://codonw.sourceforge.net/axes.gif.
Codonw automatically tries to identify the optimal codons in your data, or more precisely identify the codons which contribute to the major trend (if the main trend is due to selection for optimun translation these will be the optimimun codons). It does this by comparing the codon usage of groups of genes taken from each extreme of the principle trend (axis 1). It identifies the set of genes with the highest bias (using the effective number of codons index) and tests for significant differences in the codon usage between the higher bias set with a two way Chi-squared contingency test. The putative optimal codons are listed in summary.coa and hilo.coa. It is the responsibility of the user to confirm that the major codon usage trend is due to selection for optimal translational, and not some mutational pressure (see GC variation). The number of genes included in the two groups can be selected using the command line switch ( -coa_num ) as an absolute number of genes, or a percentage of the total genes in the dataset (by default 5%).
The analysis of this dataset identified 19 codons that appeared to be optimal. 18 of these agree with optimal codons previously identified using a larger dataset set of 575 genes (Sharp and Cowe 1991). The codon identified in this analysis as being optimal but not in the previous analysis, was GCC. This codon has been previously suggested as being an optimal codon in S. cerevisiae (Bennetzen and Hall 1982). The U ending codons, AUU, GUU and UGU, which have been previously identified as optimal (Sharp and Cowe 1991), were not identified here at p<0.01; although UGU was identified as potentially optimal with a p<0.02. The main reason that the U ending codons were not identified from this dataset was their much higher usage in the lower biased dataset.
Caveats
1) The codons identified by codonw, as being optimal will be dependent on the strength of the trend and the size of the datasets.
2) The composition of the genes from chromosome III is quite different from the 575-gene dataset used by Sharp and Cowe. Only one of the 30 genes they considered to be highly expressed, and none of the genes they considered lowly expressed are present in this dataset. The reader is reminded that there are approximately 15,000 yeast genes, so just a little over 1% are located on chromosome III.
On the assumption that the principle trend identified by codonw is selection for translational optimality, and that the genes assigned to the highly biased codon usage group are highly expressed, codonw outputs files with the "optimal codons" and "CAI adaptation fitness values". These files are fop.coa, cbi.coa and cai.coa; their filenames being related to the index they have been formatted for. These files can be used to calculate the indices in species where the preferred codon usage has not been hardwired into codonW.
codonw input.dat -fop_file fop.coa
codonw input.dat -cai_file cai.coa -cbi_file cbi.coa
Caveats
When we calculate the indexes CAI, CBI and Fop using the "codonw" generated optimal codons and fitness values based on this small dataset, as we would expect they differ from when these indices are calculated using the codonw internal codon usage information for S. cerevisiae. The internal values are more accurate because the datasets used to generate them were larger, and contained experimentally verified gene sequences.
Although the two sets of indices differ, they remain highly correlated, all three indices have correlation coefficients greater than 0.96. Therefore if comparisons between the index values are internally consistent (i.e. they were both calculated using the same optimal codon information) relative comparisons of codon usage and bias can be made. Based on a dataset of 111 genes we have been able to identify optimal codons, which give us some insight into the codon usage of S. cerevisiae.
Alternative datasets could have been chosen that better matched previously published datasets and where the optimal codons identified better matched those previously published). This dataset was specifically chosen as the codon usage variation for genes from this chromosome is know to have a second trend, GC3s varies with chromosomal location in a systematic fashion (Sharp and Lloyd 1993). When we examine correlation coefficients between the first 4 axes the correlation coefficient between axis2 and GC3s is highly significant (r=0.89). If we examine a graph of GC3s vs ENc we can clearly see the characteristic GC rich peaks of the two arms of chromosome three. Interestingly the bias is most strong among the U ending codons and it is possible that the presence of this trend contributed to why the three U ending codons were not identified here as optimal codons. This trend is quite strong, accounting for 12.3% of the relative inertia of the data, where the principle trend (apparently selection for optimun translation) accounted for 17.4%. We therefore see how it is possible that the strongest influence on the choice of codon usage might not be translation optimality but mutation biases.
Last updated on July 14, 1997 by John Peden
For the most up to date version see http://codonw.sourceforge.net/Tutorial.html
Typical output from codonw -all_indices -nomenu
======================= Output ====================================== Genetic code is currently set to Universal Genetic code TGA=* TAA=* TAG=* Welcome to CodonW 1.3 for Help type h Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678) w values to calculate CAI Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678) optimal codons to calculate CBI Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678) optimal codons to calculate Fop .................................................................. Number of sequences: 111 Files: Input file was input.dat Output file was input.out (codon usage indices, e.g. gc3s) Output file was input.blk (bulk output e.g. raw codonusage) CodonW has finished
======================================================
Tabulation of total codon usagePhe UUU 1483 1.14 Ser UCU 1094 1.47 Tyr UAU 1000 1.12 Cys UGU 434 1.18 UUC 1117 0.86 UCC 773 1.04 UAC 789 0.88 UGC 303 0.82 Leu UUA 1349 1.55 UCA 882 1.19 TER UAA 47 1.27 TER UGA 36 0.97 UUG 1549 1.78 UCG 487 0.66 UAG 28 0.76 Trp UGG 665 1.00 CUU 698 0.80 Pro CCU 747 1.27 His CAU 677 1.15 Arg CGU 328 0.86 CUC 364 0.42 CCC 415 0.71 CAC 499 0.85 CGC 171 0.45 CUA 671 0.77 CCA 911 1.55 Gln CAA 1388 1.35 CGA 151 0.39 CUG 604 0.69 CCG 281 0.48 CAG 668 0.65 CGG 103 0.27 Ile AUU 1612 1.35 Thr ACU 1052 1.38 Asn AAU 1778 1.17 Ser AGU 717 0.97 AUC 1018 0.85 ACC 660 0.87 AAC 1262 0.83 AGC 500 0.67 AUA 943 0.79 ACA 883 1.16 Lys AAA 2118 1.13 Arg AGA 1038 2.71 Met AUG 1156 1.00 ACG 444 0.58 AAG 1645 0.87 AGG 504 1.32 Val GUU 1184 1.49 Ala GCU 1055 1.40 Asp GAU 1905 1.25 Gly GGU 1284 1.87 GUC 674 0.85 GCC 765 1.01 GAC 1145 0.75 GGC 552 0.80 GUA 622 0.78 GCA 836 1.11 Glu GAA 2371 1.41 GGA 557 0.81 GUG 690 0.87 GCG 368 0.49 GAG 995 0.59 GGG 355 0.52 53400 codons (used Universal Genetic code)======================================================
References
Bennetzen, J. L., and B. D. Hall, (1982). Codon selection in yeast. Journal of Biological Chemistry 257: 3026-3031.
Sharp, P. M., and E. Cowe, (1991). Synonymous codon usage in Saccharomyces cerevisiae. Yeast 7: 657-678.
Sharp, P. M., and A. T. Lloyd, (1993). Regional base composition variation along yeast chromosome III evolution of chromosome primary structure. Nucleic Acids Research 21: 179-183.
Wright, F., (1990). The effective number of codons used in a gene. Gene 87 : 23-29.