Correspondence Analysis of Codon Usage

Correspondence Analysis of Codon Usage

Tutorial

Codon usage analysis

Included with this distribution of codonW should be a test dataset of sequences (input.dat). We will use this set of sequences as a typical example of a codon usage analysis. This test dataset is derived from the open reading frames (ORFs) of Saccharomyces cerevisiae chromosome III as annotated in the EMBL feature table for the sequence entry SCCHRIII (accession number X59720). In the current EMBL (Release 51 June 1997) the number of annotated ORFs was 172. The file input.dat contains 111 of these ORFs. The rationale and why some ORFs were removed is explained below.

The commandline syntax of codonW will be used in this tutorial; all options selected from the commandline are also selectable using the menu system. For more information please read the command line help (codonw -help) or just type "codonw" and use the menu specific online.

Build your dataset of genes carefully.

Always remember that, as in any analysis, but particularly with codon usage, GIGO (garbage in, garbage out) applies. Examine as many sources of information about the data as possible, particularly the original publication and sequence annotations. It is important that the sequences are a representative sample. Five ORFs were removed from the dataset because they were annotated (and had sequence identity) with genes within the previously identified transposable elements Ty2 and Ty5. These ORFs were annotated at positions 1537-2127, 2118-2558, 2816-3742, 84714-86030, 84714-90384. The codon usage of transposable element genes differs from that of chromosomal genes.

Further checks of sequence annotation were carried out, those sequences which had not been assigned gene names or SwissProt accession numbers were removed. The SwissProt annotation was also checked, genes described as hypothetical but having no sequence identity with other proteins were removed.

Check basic sequence integrity

Sequences should be checked to confirm that they match some basic gene characteristics. Each sequence might reasonably be expected to have an initiation codon and a translation termination codon, and no internal stop codons. Those sequences that do not match these characteristics, or sequences that have partial codons or untranslatable codons are flagged by codonw with warning messages.

To make a first pass of the input data to check for simple sequence problems:

codonw input.dat -nomenu

By default codonw will report the codon usage of each gene to the file input.blk. As there are no problems with this dataset there should be no warning messages. However analysis of a previous version of this dataset based on EMBL Release 50 where SCCHRIII had 230 annotated ORFs, generated these typical warning messages.

Warning: Sequence 178 "SCCHRIII.PE178______" does not begin with a recognised start codon
Warning: Sequence 178 "SCCHRIII.PE178______" is not terminated by a stop codon
Warning: Sequence 202 "SCCHRIII.PE202______" does not begin with a recognised start codon
Warning: Sequence 202 "SCCHRIII.PE202______" has 1 internal stop codon(s)
Warning: Sequence 202 "SCCHRIII.PE202______" is not terminated by a stop codon

Each sequence is labelled by its numerical occurrence in the input file (i.e. these are the 178^th and 202^nd sequences in the input file) and its sequence header line.

Sequences that generate warning messages should be examined closely to ascertain why they do. Some sequences may be annotated as partial sequences and therefore the absence of a start or stop codon or the presence of a 3' partial codon is to be expected. Note the presence of a 5' partial codon would cause a frame shift, hence it is ESSENTIAL that 5' partial codons are removed. Unless the frame shift that they produce results in a (incorrect) reading frame that contains internal stop codons, codonw cannot detect this problem. The codon usage of a frame shifted gene sequence could adversely affect the correspondence analysis (COA), though such genes are often recognisable as being outliers on the COA plots.

If a sequence warning is due to incorrect annotation this should be corrected manually. Sequences that produce warnings that cannot be explained or justified (e.g. a gene with an internal stop codon) should be excluded. These warning are for information only and do not exclude sequences from the analysis.

Codon usage indices

Once the initial quality checks have been made for the data we can then proceed with the codon usage analysis (strictly speaking we can generate COA and codon usage indices tasks at the same time). Some of the indices of codon usage bias that CodonW calculates (i.e. Fop, CAI and CBI) use information about a preferred set of codons for highly expressed genes. This information is species-specific and does not apply to all species (most eukaryotes and many prokaryotes appear to display no codon preference in highly expressed genes). Therefore care must be taken that the appropriate set of optimal codons are used. For most species the optimal codons are not know and therefore the indices should not be calculated at this stage. However, this information is known for Saccharomyces cerevisiae, so we can immediately calculate these indices of codon usage. Later we will see how codonW identifies optimal codons and can generate this information for your species.

The default optimal codons and codon adaptation values are those of E. coli. To select an alternative choice we use the c_type (for CAI values ) and f_type (for FOP/CBI) commandline arguments. These switches require an integer value, where this value is the same as the option number if we were using the menu system to change the codon information.

Example "-c_type 2" is equivalent to

Choose "Main Menu"

Choose "Changes Defaults Menu"

Choose "Change the CAI values"

Choose "(2) Saccharomyces cerevisiae"

Choose "Main Menu"

Choose "Changes Defaults Menu"

Choose "Change the Fop/CBI values"

Choose "(4) Saccharomyces cerevisiae"

Saccharomyces cerevisiae

codonw input.dat -all_indices -c_type 2 -f_type 4 -nomenu

See below for the output of this command

The commandline flag -nomenu by-passes the menu system, the -all_indices indicates to codonw that you wish to calculate all the codon and amino acid usage indices. These indices areT3s, C3s, A3s, G3s, CAI, CBI, Fop, Nc, GC3s, GC, L_sym, L_aa, Gravy and Aromaticity. For a fuller explanation of what these indices are see indices. These indices can also be used to check whether there are any identical or almost identical sequences in the input file. If we sort the result file "input.out" it is much easier to identify the sequences which are similar.

sort -k 2n input.out (unix for "sort using the second numerical field")

The sorted output reveals the presence of two pairs of identical sequences (Mating type proteins)

ALPHA2____________63 0.3636 0.2273 0.4939 0.2177 0.109
MATALPHA2_________63 0.3636 0.2273 0.4939 0.2177 0.109

ALPHA1____________52 0.4361 0.2180 0.4228 0.2589 0.112
MATALPHA1_________52 0.4361 0.2180 0.4228 0.2589 0.112

Sequences which appear to be multiple copies of the same gene are normally removed from our codon usage datasets, even if the sequences are not identical but where the differences can be attributed to sequencing error or allelic polymorphism. However different sequences that appear to be members of the same multigene family are retained, even if identical. As we know these ORFs are from different regions of the chromosome and are not the same sequence they were not removed from the sample dataset.

Tabulation of Overall Codon Usage

A common representation of codon usage is a tabular format of the total codon usage of a dataset. CodonW can automatically generate this table for you.

codonw input.dat -cutot

The tabulated total codon usage is stored in the output file input.blk.

See Tabulated codon usage below.

Preliminary codon usage analysis

The effective number of codons index (ENc), is a very useful preliminary tool for codon usage analysis (Wright 1990). It is a simple measure of codon bias, analogous to the effective number of alleles measurment used in population genetics. It gives the number of equally used codons that would generate the same codon usage bias as observed, lower values indicating stronger bias. A useful feature of EN_c is that the effect that GC biases have on the index can be estimated. This allows the comparison of GC_3sand EN_c against the theoretical values if codon bias was simply caused by GC mutational bias. A plot of EN_c vs. GC_3s can be seen at http://codonw.sourceforge.net/ENcVsGC3s.gif. Although the majority of genes in this plot have a degree of codon bias that can be explained in terms of GC mutation, the cluster of genes (six genes with EN_c <40) have much stronger codon bias than be simply explained in terms of mutational biases. These genes are good candidates for genes whose codon usage has been determined by natural selection probably selection for translational efficiency.

Correspondence Analysis (COA)

We are now ready to generate a correspondence analysis of the codon usage of SCCHRIII genes. We have a choice about how much information is generated. In this example we will use the default values.

codonw input.dat -coa_cu -nomenu -silent

the -silent flag stops all prompting

This generates a COA of codon usage. The summary file is "summary.coa" and contains most of the data generated by the COA. One of the first sections is the "Explanation of the variation by axis" also stored in eigen.coa.

The total inertia of the data was 0.263176

Num. Eigenval.   R.Iner. R.Sum |Num.  Eigenval. R.Iner.  R.Sum |
01 +4.5755E-02 +0.1739 +0.1739 |02 +3.2372E-02 +0.1230 +0.2969 |
03 +1.8405E-02 +0.0699 +0.3668 |04 +1.2499E-02 +0.0475 +0.4143 |

The relative inertia explained by the first axis is 17.4%, the 2^nd axis explains 12.3%, the 3^rd 7.0%, etc. (17.45% is not remarkably high for relative inertia explained by the first axis, but there are ORFs included which are described as hypothetical and these produce random noise into the data if they are not real).

The next two sections report position of each gene and codon on the trends.

label                Axis1     Axis2    Axis3    Axis4
1_YCG9_Probable_____ 0.00904   0.13153  0.34028 -0.05372
2_YCG8________573_re 0.07429  -0.24652 -0.05502 -0.39837
3_ALPHA2________633_ 0.30675   0.04259 -0.22864 -0.03878
4_ALPHA1________528_ 0.16444   0.00399 -0.02000  0.00937
5_CHA1_________1083_-0.00322   0.10387  0.07137  0.11896

this information is best viewed graphically, an example of the location of the genes on the two principal axes can be seen here http://codonw.sourceforge.net/axes.gif.

Automatic Identification of Putative Optimal Codons

Codonw automatically tries to identify the optimal codons in your data, or more precisely identify the codons which contribute to the major trend (if the main trend is due to selection for optimun translation these will be the optimimun codons). It does this by comparing the codon usage of groups of genes taken from each extreme of the principle trend (axis 1). It identifies the set of genes with the highest bias (using the effective number of codons index) and tests for significant differences in the codon usage between the higher bias set with a two way Chi-squared contingency test. The putative optimal codons are listed in summary.coa and hilo.coa. It is the responsibility of the user to confirm that the major codon usage trend is due to selection for optimal translational, and not some mutational pressure (see GC variation). The number of genes included in the two groups can be selected using the command line switch ( -coa_num ) as an absolute number of genes, or a percentage of the total genes in the dataset (by default 5%).

The analysis of this dataset identified 19 codons that appeared to be optimal. 18 of these agree with optimal codons previously identified using a larger dataset set of 575 genes (Sharp and Cowe 1991). The codon identified in this analysis as being optimal but not in the previous analysis, was GCC. This codon has been previously suggested as being an optimal codon in S. cerevisiae (Bennetzen and Hall 1982). The U ending codons, AUU, GUU and UGU, which have been previously identified as optimal (Sharp and Cowe 1991), were not identified here at p<0.01; although UGU was identified as potentially optimal with a p<0.02. The main reason that the U ending codons were not identified from this dataset was their much higher usage in the lower biased dataset.

Caveats

1) The codons identified by codonw, as being optimal will be dependent on the strength of the trend and the size of the datasets.

2) The composition of the genes from chromosome III is quite different from the 575-gene dataset used by Sharp and Cowe. Only one of the 30 genes they considered to be highly expressed, and none of the genes they considered lowly expressed are present in this dataset. The reader is reminded that there are approximately 15,000 yeast genes, so just a little over 1% are located on chromosome III.

Codonw generated personal choice of codons

On the assumption that the principle trend identified by codonw is selection for translational optimality, and that the genes assigned to the highly biased codon usage group are highly expressed, codonw outputs files with the "optimal codons" and "CAI adaptation fitness values". These files are fop.coa, cbi.coa and cai.coa; their filenames being related to the index they have been formatted for. These files can be used to calculate the indices in species where the preferred codon usage has not been hardwired into codonW.

codonw input.dat -fop_file fop.coa

codonw input.dat -cai_file cai.coa -cbi_file cbi.coa

Caveats

The original CAI paper calculated fitness values from experimentally determined highly expressed genes. The fitness values that are internal to codonW were derived from these criteria. CAI indices calculated using fitness values derived from genes identified solely by COA as being highly expressed, should not be regarded as true CAI values.
The optimal codons stored in the files cbi.coa and fop.coa were identified by codonw using a statistical test of significance, and this test is dependent on sample size.
The size of the sample taken from the extremes of the axis will affect the identified optimal codons.
The principle trend in the variation of codon usage may not be translation optimality.

When we calculate the indexes CAI, CBI and F_op using the "codonw" generated optimal codons and fitness values based on this small dataset, as we would expect they differ from when these indices are calculated using the codonw internal codon usage information for S. cerevisiae. The internal values are more accurate because the datasets used to generate them were larger, and contained experimentally verified gene sequences.

Although the two sets of indices differ, they remain highly correlated, all three indices have correlation coefficients greater than 0.96. Therefore if comparisons between the index values are internally consistent (i.e. they were both calculated using the same optimal codon information) relative comparisons of codon usage and bias can be made. Based on a dataset of 111 genes we have been able to identify optimal codons, which give us some insight into the codon usage of S. cerevisiae.

Axis2 is highly correlated with GC3s content

Alternative datasets could have been chosen that better matched previously published datasets and where the optimal codons identified better matched those previously published). This dataset was specifically chosen as the codon usage variation for genes from this chromosome is know to have a second trend, GC_3s varies with chromosomal location in a systematic fashion (Sharp and Lloyd 1993). When we examine correlation coefficients between the first 4 axes the correlation coefficient between axis2 and GC_3s is highly significant (r=0.89). If we examine a graph of GC_3s vs EN_c we can clearly see the characteristic GC rich peaks of the two arms of chromosome three. Interestingly the bias is most strong among the U ending codons and it is possible that the presence of this trend contributed to why the three U ending codons were not identified here as optimal codons. This trend is quite strong, accounting for 12.3% of the relative inertia of the data, where the principle trend (apparently selection for optimun translation) accounted for 17.4%. We therefore see how it is possible that the strongest influence on the choice of codon usage might not be translation optimality but mutation biases.

Last updated on July 14, 1997 by John Peden

For the most up to date version see http://codonw.sourceforge.net/Tutorial.html

Tutorial

Typical output from codonw -all_indices -nomenu

======================= Output ======================================
Genetic code is currently set to Universal Genetic code TGA=* TAA=* TAG=*

Welcome to CodonW 1.3 for Help type h

Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
w values to calculate CAI

Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
optimal codons to calculate CBI

Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
optimal codons to calculate Fop

..................................................................

Number of sequences: 111

Files:
Input  file was input.dat
Output file was input.out (codon usage indices, e.g. gc3s)
Output file was input.blk (bulk output e.g. raw codonusage)

CodonW has finished

======================================================

Tabulation of total codon usage


Phe UUU 1483 1.14 Ser UCU 1094 1.47 Tyr UAU 1000 1.12 Cys UGU  434 1.18 
    UUC 1117 0.86     UCC  773 1.04     UAC  789 0.88     UGC  303 0.82 
Leu UUA 1349 1.55     UCA  882 1.19 TER UAA   47 1.27 TER UGA   36 0.97 
    UUG 1549 1.78     UCG  487 0.66     UAG   28 0.76 Trp UGG  665 1.00 

    CUU  698 0.80 Pro CCU  747 1.27 His CAU  677 1.15 Arg CGU  328 0.86 
    CUC  364 0.42     CCC  415 0.71     CAC  499 0.85     CGC  171 0.45 
    CUA  671 0.77     CCA  911 1.55 Gln CAA 1388 1.35     CGA  151 0.39 
    CUG  604 0.69     CCG  281 0.48     CAG  668 0.65     CGG  103 0.27 

Ile AUU 1612 1.35 Thr ACU 1052 1.38 Asn AAU 1778 1.17 Ser AGU  717 0.97 
    AUC 1018 0.85     ACC  660 0.87     AAC 1262 0.83     AGC  500 0.67 
    AUA  943 0.79     ACA  883 1.16 Lys AAA 2118 1.13 Arg AGA 1038 2.71 
Met AUG 1156 1.00     ACG  444 0.58     AAG 1645 0.87     AGG  504 1.32 

Val GUU 1184 1.49 Ala GCU 1055 1.40 Asp GAU 1905 1.25 Gly GGU 1284 1.87 
    GUC  674 0.85     GCC  765 1.01     GAC 1145 0.75     GGC  552 0.80 
    GUA  622 0.78     GCA  836 1.11 Glu GAA 2371 1.41     GGA  557 0.81 
    GUG  690 0.87     GCG  368 0.49     GAG  995 0.59     GGG  355 0.52

53400 codons (used Universal Genetic code)

References

Bennetzen, J. L., and B. D. Hall, (1982). Codon selection in yeast. Journal of Biological Chemistry 257: 3026-3031.

Sharp, P. M., and E. Cowe, (1991). Synonymous codon usage in Saccharomyces cerevisiae. Yeast 7: 657-678.

Sharp, P. M., and A. T. Lloyd, (1993). Regional base composition variation along yeast chromosome III evolution of chromosome primary structure. Nucleic Acids Research 21: 179-183.

Wright, F., (1990). The effective number of codons used in a gene. Gene 87 : 23-29.