Correspondence Analysis of Codon Usage

CodonW has been hosted at SourceForge since 2005; its project name is codonw, and it can be accessed via this link.

System and Hardware

CodonW

Introduction to Codon Usage

Correspondence Analysis

Input files

The Menu Interface

Correspondence analysis in CodonW

Correspondence analysis output files

License

References

README

Contact

System and Hardware

The programme CodonW is written in standard ANSI C (Kernighan and Ritchie 1988). It compiles cleanly using the GNU C compiler (version 2.7.2.1) with the stringent ANSI and pedantic command line switches. The CodonW source code and Makefile can be downloaded as a compressed UNIX tar archive via SourceForge. The source code consists of seven C files and a common header file. Correspondence analysis as implemented by CodonW is based on the NetMul library (Thioulouse and Chevenet 1996), a subset of the ADE (Analysis of Environmental Data) package (Thioulouse et al. 1995).

CodonW

A biological feature that many codon-based analysis programmes ignore is that calling the genetic code "universal" is an exaggeration: there are several genetic codes. CodonW was designed to work with any genetic code. Whether an amino acid has synonymous codons, how a codon is translated, how many codons a codon family contains, and how many synonyms a codon has are all determined at run time. Seven alternatives to the universal genetic code are built in, and others can be added.
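The idea of deriving codon families at run time from whichever genetic code is selected can be sketched as follows. This is a minimal illustration in Python, not CodonW's C implementation, and the function names are hypothetical. Each code is held as a 64-character string of amino acids with the codons enumerated in T, C, A, G order; the two strings shown are the standard code and the vertebrate mitochondrial code.

```python
from collections import defaultdict

BASES = "TCAG"
# Amino acid for each codon, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map each of the 64 codons to its amino acid ('*' marks a stop)."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


def synonym_families(table):
    """Group the sense codons by the amino acid they encode."""
    families = defaultdict(list)
    for codon, amino_acid in table.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)
    return dict(families)


def synonymous_codons(table):
    """Sense codons that have at least one synonym under this code."""
    return [codon
            for family in synonym_families(table).values()
            if len(family) > 1
            for codon in family]
```

Under the standard code this yields 59 synonymous sense codons; under the vertebrate mitochondrial code, where methionine and tryptophan each have two codons, every sense codon has a synonym.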

The COA output from CodonW is most easily interpreted when presented graphically. CodonW does not try to handle graphics, as there are numerous programs that do this much better and are also better designed to work with numerical data, e.g. Cricket Graph (MAC/PC), StatView (MAC/PC), Excel (MAC/PC), Minitab (MAC/PC/VMS/UNIX), SAS (PC/UNIX/VMS), SPSS (PC/UNIX/VMS), Harvard Graphics (PC) and Gnuplot (X-windows). Graphical presentation of the data is therefore left to external software. All of these programs accept ASCII input files. By default, CodonW creates much of its output in a format designed to be easily read by eye (output from COA is always machine-readable). The command line switch (-machine) or the "Defaults menu" (Menu 3) selection Machine readable changes the output format to machine-readable, delimited ASCII values. The delimiter can be changed using the Change ASCII delimiter in output option under the "Defaults menu" (Menu 3).

Introduction to Codon Usage

The vast majority of prokaryotic and eukaryotic species have non-random codon usage. The major factor in codon choice in many unicellular and some multicellular organisms is Darwinian selection between synonyms, with highly expressed genes using a restricted set of codons (Gouy and Gautier 1982; Ikemura 1985; Sharp and Matassi 1994). This selection is almost certainly for optimal translational efficiency, and is most pronounced in highly expressed genes in species whose effective population size is large (Bulmer 1991; Li 1987). Divergence of codon usage and choice of optimal codons correlates with evolutionary distance, but usage patterns in phylogenetically distant species may converge because similar factors influence the drift in choice of optimal codons.

Analysis of codon usage has been used to identify highly expressed genes (Cancilla et al. 1995; Freirepicos et al. 1994; Gharbia et al. 1995). Atypical codon usage has been used to infer that genes have been acquired by horizontal transfer (Delorme et al. 1994; Groisman et al. 1992; Medigue et al. 1991).

Correspondence Analysis

The purpose of statistics has been described as to summarise, simplify and eventually explain (Greenacre 1984). Since codon usage is by its very nature multivariate, it is necessary to analyse these data with multivariate statistical techniques (e.g. correspondence analysis). If one examines the set of conventional statistical techniques in use today, it is clear that the statistician can rarely proceed without introducing a certain degree of subjectivity into the analysis (Greenacre 1984). It is therefore advantageous to use a method that can examine data without a priori assumptions, and there are several multivariate analysis techniques that satisfy this condition (Greenacre 1984).

Multivariate analyses (MVA) are used to simplify rectangular matrices in which (for our purposes) the columns represent some measurement of codon usage or amino acid usage and the rows represent individual genes. Examples of MVA techniques that have been successfully applied to the analysis of codon usage are cluster analysis and correspondence analysis. Cluster analysis partitions data into discrete groups based on the trends within the data, but has the disadvantage that it sometimes forces arbitrary divisions of a dataset even when presented with continuous variation. Correspondence analysis is an ordination technique that identifies the major trends in the variation of the data and distributes genes along continuous axes in accordance with these trends. Correspondence analysis has the advantage that it does not assume that the data fall into discrete clusters and can therefore represent continuous variation accurately.

Input files

Sequences to be analysed should be in a single file. Sequences must be nucleic acid, protein-encoding, sequential, and separated by at least one header line. A header line is defined as any line whose first character is either a semicolon ‘;’ or a right angle bracket ‘>’. There may be any number of header lines but at least one must precede each sequence; the second and any subsequent header lines are ignored. The first 20 characters after the ‘;’ or ‘>’ should be a unique description of the sequence, because they are used to label the sequences in the output files. To overcome the limitation of 20 characters and to allow for blank sequence descriptors, each descriptor is preceded by the numerical position of that sequence in the input file. Any line that does not start with either ‘;’ or ‘>’ is considered to contain sequence information. Sequences must be in the correct reading frame. CodonW assumes that the first alphabetical character of the sequence is the first position of the first codon. If this is not so, the sequences must be edited to correct the reading frame, i.e. untranslated 5’ or 3’ sequence and 5’ partial codons must be removed.

The format of each line of sequence data is relaxed and sequences can be either DNA or RNA, containing upper or lower case characters. Input lines may be any width and contain spaces and/or numbers. All non-alphabetical characters including white space (spaces, tabs, new-line) are stripped from the input. Each nucleotide in the sequence must be represented by an IUB-IUPAC code. Nucleotides that cannot be identified must be represented by the IUPAC codes X or N. Any codon that does not consist entirely of "A/T/C/G/U" bases is considered non-translatable.
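The input rules described above can be sketched as a small reader. This is an illustrative Python sketch rather than CodonW's own parser: the function name is hypothetical, and the exact label layout (position, underscore, descriptor) is an assumption, since the text only says that each descriptor is preceded by the sequence's position in the file.

```python
def read_sequences(lines):
    """Return (label, sequence) pairs from CodonW-style input lines."""
    records = []          # each entry is [label, list of sequence fragments]
    in_header = False     # True while reading the header lines of one record
    for line in lines:
        if line[:1] in (";", ">"):
            if not in_header:
                # First header line of a new record: keep 20 characters of
                # description and prefix the position in the file
                # (the "N_" layout is an assumption of this sketch).
                label = f"{len(records) + 1}_{line[1:21].strip()}"
                records.append([label, []])
                in_header = True
            # Second and subsequent header lines are ignored.
        else:
            # Strip everything that is not an alphabetical character.
            residues = "".join(ch for ch in line if ch.isascii() and ch.isalpha())
            if residues:
                in_header = False
                if records:
                    records[-1][1].append(residues)
    return [(label, "".join(parts).upper()) for label, parts in records]
```

For example, the lines `>geneA`, `atg GCU 10` and `taa` produce a single record whose sequence is `ATGGCUTAA`; digits and spaces are discarded and case is normalised.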

CodonW performs some basic checks on the integrity of the input data. These include checking whether the sequence contains internal stop codons (usually an indicator that the sequence is out of frame), whether non-standard characters are present, and that the sequence is not an amino acid sequence. Codon usage has been shown to vary with position within a gene in E. coli (Eyre-Walker 1995; Eyre-Walker and Bulmer 1993). Therefore, variation in codon usage may be introduced by comparing partial and full-length sequences. CodonW checks that each sequence contains a valid start and termination codon; the prokaryotic start codons (i.e. NTG, ATN) (Osawa et al. 1992) are accepted as valid start codons. If a problem is found, warning messages are displayed, and these can be redirected to a file. CodonW only warns about problems; sequences that generate warnings are not excluded from the analysis and should therefore be checked carefully.
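A few of these checks can be sketched as below. This is a hedged Python illustration, not CodonW's code: the function name and warning wording are hypothetical, the stop codons are those of the universal code only, and a trailing partial codon is simply ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # universal code; other codes differ


def check_sequence(sequence):
    """Return a list of warnings; the sequence itself is never rejected."""
    seq = sequence.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        return ["no complete codons"]
    warnings = []
    first, last = codons[0], codons[-1]
    # Prokaryotic start codons: NTG or ATN.
    if not (first[1:] == "TG" or first[:2] == "AT"):
        warnings.append("no valid start codon")
    if last not in STOP_CODONS:
        warnings.append("no termination codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        warnings.append("internal stop codon")
    # Codons containing anything other than A/C/G/T(U) cannot be translated.
    if any(set(codon) - set("ACGT") for codon in codons):
        warnings.append("non-translatable codon")
    return warnings
```

Returning warnings instead of raising an error mirrors the behaviour described above: problem sequences are reported but remain in the dataset.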

While only one input file is required to use CodonW, selection of some options will cause CodonW to prompt for additional input files. The indices CAI, CBI and Fop quantify the adaptation of codon usage towards a set of preferred codons and for some species these have been built into CodonW. However, the selection of any of these indices causes CodonW to prompt for a personal choice of optimal codons or CAI adaptiveness values. Before creating a personal choice of CAI adaptiveness values, it is recommended that the original paper (Sharp and Li 1987) be consulted. The input files prompted for, if calculating the CBI/Fop indices, are expected to contain a score for each codon: 1 for a rare codon; 3 for an optimal codon; 2 for other codons. The values can be separated by any white space character, but must occur in the file in a set order.
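How such per-codon scores feed into Fop can be sketched as follows, using the common definition of Fop as the number of optimal codons divided by the number of scored synonymous codons. This is a Python illustration under stated assumptions: the scores are held in a dictionary instead of being read from a file in CodonW's fixed order, only synonymous sense codons carry a score, and rare codons (score 1) are simply counted as non-optimal, which may differ from how CodonW treats them.

```python
def frequency_of_optimal_codons(codon_counts, scores):
    """Fop for one gene.

    codon_counts: {codon: count} for the gene.
    scores: {codon: 1, 2 or 3} with 3 = optimal, 1 = rare, 2 = other;
            codons without a score (e.g. ATG, TGG, stops) are left out.
    """
    optimal = sum(n for codon, n in codon_counts.items()
                  if scores.get(codon) == 3)
    scored = sum(n for codon, n in codon_counts.items() if codon in scores)
    return optimal / scored if scored else 0.0
```

A gene with 14 optimal codons among 20 scored codons has Fop = 0.7; its methionine and tryptophan codons do not affect the value.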

There is an alternative to creating these Fop/CAI input files by hand. During a correspondence analysis (of codon usage) the files "fop.coa", "cbi.coa", and "cai.coa" are generated automatically (fop.coa is also an appropriate file for the CBI index). These files can then be used to calculate their respective indices. However, in order to make these files, CodonW makes important assumptions about the dataset, which must be verified by the researcher. These are:

That the dataset is a representative sample.

That the dataset contains genes that are lowly and highly expressed.

That the subset of genes with the strongest codon bias is the more highly expressed.

That all genes automatically assigned to the highly expressed subset are highly expressed.

That the major trend in the codon usage is selection for optimal translation.

The verification of these assumptions is not trivial. If the files "fop.coa" and "cai.coa" are accepted uncritically, it is very likely that results will be erroneous. When these assumptions are valid, these files greatly simplify the task of calculating these indices.

Correspondence analysis can be adversely affected by genes that have atypical codon or amino acid usage. These genes (e.g. plasmid genes, transposons, etc.) are normally excluded from the analysis. However, it is often interesting to view where these genes lie relative to the genes used to identify the trends. This can be done by identifying the trends using a dataset of "good" genes (the main input file), then using these trends to transform the codon usage of the additional genes into co-ordinates. To do this, select "add additional genes after COA" under the "advanced correspondence analysis" menu. After the initial COA, CodonW will prompt for the second file (containing the additional genes).

The Menu Interface

CodonW was designed to be driven via a series of nested menus, each menu having its own online help. The initial menu allows the user to choose a sub-menu, to start an analysis, or to exit the programme. There are nine sub-menus.

Correspondence analysis in CodonW

The most complex analysis that CodonW performs is correspondence analysis of either amino acid or codon usage. In essence, correspondence analysis creates a series of orthogonal axes to identify trends that explain the data variation, with each subsequent axis explaining a decreasing amount of the variation (Benzecri 1992). Correspondence analysis assigns an ordination to each gene and codon (or amino acid) on these axes, and a very useful property is that the ordinations of the genes and codons are superimposable. When used to analyse codon usage data, we routinely include the 59 synonymous sense codons, which generate up to 58 axes, while the alternative analysis of the usage of the 20 standard amino acids produces 19 axes.
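The core computation can be illustrated with a small sketch that extracts only the first axis of a genes-by-codons count table. This is not the NetMul/ADE code that CodonW uses: it is a textbook formulation (standardised residuals of the table, with the leading axis found by power iteration), written in plain Python with hypothetical names, and it assumes every row and column of the table has a non-zero total.

```python
import math


def first_axis(table, iterations=200):
    """Return (eigenvalue, total_inertia, row_coordinates) for axis 1."""
    total = sum(map(sum, table))
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(col) / total for col in zip(*table)]
    # Standardised residuals: (observed - expected) / sqrt(expected).
    resid = [[(x / total - r * c) / math.sqrt(r * c)
              for x, c in zip(row, col_mass)]
             for row, r in zip(table, row_mass)]
    total_inertia = sum(s * s for row in resid for s in row)

    def times(v):
        # Multiply the residual matrix by a column-space vector.
        return [sum(s * x for s, x in zip(row, v)) for row in resid]

    # Power iteration for the leading right singular vector.
    v = [1.0 + 0.1 * j for j in range(len(col_mass))]
    for _ in range(iterations):
        u = times(v)
        new_v = [sum(resid[i][j] * u[i] for i in range(len(resid)))
                 for j in range(len(col_mass))]
        norm = math.sqrt(sum(x * x for x in new_v))
        if norm == 0:          # table has no structure beyond independence
            break
        v = [x / norm for x in new_v]
    u = times(v)
    eigenvalue = sum(x * x for x in u)
    # Principal coordinates of the rows (genes) on axis 1.
    coords = [x / math.sqrt(r) for x, r in zip(u, row_mass)]
    return eigenvalue, total_inertia, coords
```

The ratio of the eigenvalue to the total inertia is the fraction of the variation explained by the axis. The sign of the coordinates is arbitrary, so which end of the axis holds a given group of genes carries no meaning by itself.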

Correspondence analysis generates a large volume of data, and CodonW writes the core data necessary to interpret the correspondence analysis to the file "summary.coa". To aid analysis, most of this information is duplicated and compartmentalised into separate files. To reduce the memory requirements, additional data from intermediate stages of the analysis are also stored as temporary files. Depending on the precise options selected, more than 20 permanent and temporary files may be created; however, to minimise the amount of user interaction required for a correspondence analysis, these files are created, overwritten and deleted automatically. Permanent correspondence analysis files created by CodonW have the extension ".coa". For a description of their format and contents see the table below.

If the major trend in variation identified in the codon usage or RSCU is correlated with gene expression, it is possible to identify those codons which are used preferentially in highly expressed genes, i.e. the optimal codons. To simplify analysis, CodonW assumes that the principal trend is correlated with gene expression. It then uses this assumption to identify optimal codons, draw tables and create files. It is, however, the researcher's responsibility to establish that this assumption holds before accepting those codons putatively identified as optimal codons. This poses the question of how one can establish that the major trend is correlated with expression. While not definitive, there are a number of very useful indicators, which are as follows:

(1) The location of the genes on the principal axis is very important evidence for the hypothesis that the major trend is correlated with expression. If true, the genes found at one extreme of the ordination scale for the principal axis should be highly expressed or expected to be (e.g. ribosomal proteins and glycolytic genes), while those at the opposite end would be expected to be lowly expressed (e.g. regulatory proteins and cytoplasmic membrane proteins).

(2) The principal axis should explain a large proportion (>15%) of the total variation in the data and should ideally explain approximately two times as much of the variation as the second and subsequent axes.

(3) Several codons (UUC, UAC, AUC, AAC, GAC and GGU) are optimal in E. coli, B. subtilis, S. cerevisiae, S. pombe, and D. melanogaster (Sharp and Devine 1989) and are almost always preferentially used in highly expressed genes; similarly, certain codons (AGG, AGA) are commonly avoided in highly expressed prokaryotic genes. The predicted optimal codons should agree with these observations.

(4) The ordination of the genes on the principal axis should be significantly correlated with some independent measure of codon bias such as the effective number of codons Nc.

(5) The possibility that the major trend is due to some additional mutational bias, such as variation in the GC content (the principal trend in the codon usage of many eukaryotes), must always be considered and eliminated, particularly where there is a significant correlation between the principal axis and some measure of base composition (such as GC3s or GC).

While CodonW assumes that the trend represented by axis 1 is correlated with expression, and attempts to identify optimal codons, the adage GIGO "garbage in, garbage out" must be stressed. If the trend represented by axis 1 does not correlate with expression then codons will be erroneously reported as optimal. So although CodonW automatically generates putative optimal codons these should not be accepted until a correlation between the principal trend in codon usage variation and expression level is established.

To identify the putative optimal codons, the genes are sorted according to their position on the principal axis, and the sorted codon usage of these genes is written to the file "cusort.coa". Then a number of genes, set by the advanced correspondence analysis menu option "number of genes used to identify optimal codons", are read from the start and end of this file (i.e. equivalent to the extremes of the principal axis), and the codon usage of each set of genes is totalled. A prerequisite for the identification of optimal codons is the identification of which group of genes is "more highly" expressed. As the orientation of the principal axis is arbitrary, highly expressed genes can be located at either the positive or negative extreme of the axis. Fortuitously, in those species where codon usage is a function of gene expression, selection for optimal codons in more highly expressed genes has the effect of increasing their codon bias. The codon bias of the totalled codon usage from both sets of genes is estimated using Nc (an index independent of the choice of optimal codons), and the set of genes with the lower Nc (more highly biased) is putatively identified as the more highly expressed.
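The selection step described above can be sketched as follows. This Python illustration makes several assumptions that are not taken from CodonW's source: Nc is computed with the usual homozygosity formulation (2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for the standard code), averaging F over the amino acids that are present and capping the result at 61; a degeneracy class with no usable amino acids is simply skipped; and all function names are hypothetical.

```python
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def effective_number_of_codons(counts):
    """Nc of a codon-count dictionary under the standard genetic code."""
    families = defaultdict(list)
    for codon, amino_acid in CODE.items():
        if amino_acid != "*":
            families[amino_acid].append(counts.get(codon, 0))
    by_size = defaultdict(list)            # degeneracy -> list of families
    for usage in families.values():
        by_size[len(usage)].append(usage)
    nc = 0.0
    for size, group in by_size.items():
        if size == 1:                      # Met and Trp contribute one each
            nc += len(group)
            continue
        f_values = []
        for usage in group:
            n = sum(usage)
            if n > 1:
                f = (n * sum((x / n) ** 2 for x in usage) - 1) / (n - 1)
                if f > 0:
                    f_values.append(f)
        if f_values:                       # families in class / mean F
            nc += len(group) * len(f_values) / sum(f_values)
    return min(nc, 61.0)


def more_highly_expressed_end(genes, axis1, n_genes):
    """Pool n_genes from each end of axis 1; the lower Nc marks the biased end."""
    order = sorted(range(len(genes)), key=lambda i: axis1[i])

    def pooled_nc(indices):
        total = defaultdict(int)
        for i in indices:
            for codon, count in genes[i].items():
                total[codon] += count
        return effective_number_of_codons(total)

    nc_negative = pooled_nc(order[:n_genes])
    nc_positive = pooled_nc(order[-n_genes:])
    end = "negative" if nc_negative < nc_positive else "positive"
    return end, nc_negative, nc_positive
```

Because only the relative Nc of the two pooled sets matters, the result does not depend on the arbitrary orientation of the axis.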

Optimal codons have been defined as those codons which occur more often (relative to their synonyms) in highly expressed genes than in lowly expressed genes (Ikemura 1981). CodonW uses a modification of this definition, insofar as optimal codons are defined as those codons that occur significantly more often in highly expressed genes relative to their frequency in lowly expressed genes (Lloyd and Sharp 1991; Lloyd and Sharp 1993; Sharp and Cowe 1991; Sharp et al. 1988; Shields and Sharp 1987). Significance is assessed by a two-way chi-squared contingency test with the criterion p < 0.01. The advantage of using a test of significance to identify optimal codons is that variation in codon usage between highly and lowly expressed genes that is due to random noise is suppressed; the disadvantage is that the level of significance is related to sample size. After CodonW performs the two-way chi-squared test on the genes taken from the extremes of axis 1, their codon usage and RSCU are output as a table to "summary.coa" and "hilo.coa". Those codons which have been putatively identified as optimal (p < 0.01) are indicated with an asterisk (*). Although not considered optimal by CodonW, codons that occur more frequently in the highly expressed dataset at 0.01 < p < 0.05 are indicated with an "@" symbol.
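The marking scheme can be sketched for a single codon as a 2 x 2 contingency test: the codon versus its synonyms, in the high set versus the low set. This is a Python illustration under assumptions: the plain Pearson statistic is used without a continuity correction (the text does not say which form CodonW uses), the thresholds 3.841 and 6.635 are the 5% and 1% critical values of chi-squared with one degree of freedom, and the function names are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]] (1 d.f.)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def mark_codon(high_codon, high_family, low_codon, low_family):
    """Return '*', '@' or '' for one codon.

    high_codon / low_codon: counts of the codon in the two gene sets.
    high_family / low_family: counts of all codons for the same amino acid.
    """
    a, b = high_codon, high_family - high_codon
    c, d = low_codon, low_family - low_codon
    # Only codons used relatively more often in the high set are candidates.
    if a * (c + d) <= c * (a + b):
        return ""
    chi2 = chi_squared_2x2(a, b, c, d)
    if chi2 > 6.635:      # p < 0.01
        return "*"
    if chi2 > 3.841:      # 0.01 < p < 0.05
        return "@"
    return ""
```

For example, a codon used 90 times out of 100 in the high set and 50 out of 100 in the low set is marked "*", while 60 versus 45 out of 100 is marked "@". With much smaller totals the same proportions would not reach significance, which is the sample-size dependence noted above.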

CodonW calculates two indices that measure the degree to which the codon usage of a gene has adapted towards the usage of optimal codons, namely the frequency of optimal codons (Fop) and the codon adaptation index (CAI). Before either of these indices can be calculated, information about codon usage in the species being analysed is required. The index Fop requires the identification of the optimal codons for the species, while the index CAI requires that a set of highly expressed genes be identified (and that the major trend in codon usage is correlated with expression). For several species where this information is known, it has been included within CodonW (under the "Change Defaults" menu). For other species, either these indices cannot be calculated or the user must supply additional information, for which they are prompted during the calculation of the indices. The format for this data has been discussed previously, but during correspondence analysis of codon usage (or RSCU), based on the assumptions given above, CodonW generates the output files "cai.coa", "cbi.coa" and "fop.coa". These files can be used as input files for their respective indices, as they are already in the correct format. Again, it must be stressed that CodonW must make a number of assumptions to generate this information. These are: that the major trend in the codon usage is correlated with expression level; that the dataset contains highly expressed genes; and that the genes used to identify optimal codons are highly expressed. If these assumptions are valid, then the files "cai.coa", "cbi.coa" and "fop.coa" can be used to calculate the indices CAI, CBI and Fop respectively. To do this, select the index to be calculated and, when prompted for a "personal choice of values", input the relevant filename.
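The adaptiveness values used for CAI can be sketched from the pooled codon usage of a set of highly expressed genes: within each synonym family the most frequent codon is given the value one and the others are expressed relative to it, and CAI is the geometric mean of these values over the codons of a gene (Sharp and Li 1987). This is a Python illustration, not CodonW's implementation: the names are hypothetical, the standard genetic code is assumed, and codons whose value is zero are skipped when averaging, which is a simplification of this sketch.

```python
import math
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def relative_adaptiveness(reference_counts):
    """w value of every synonymous codon, from reference-set codon counts."""
    families = defaultdict(list)
    for codon, amino_acid in CODE.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)
    w = {}
    for codons in families.values():
        if len(codons) < 2:        # Met and Trp carry no information
            continue
        top = max(reference_counts.get(c, 0) for c in codons)
        for c in codons:
            w[c] = reference_counts.get(c, 0) / top if top else 0.0
    return w


def codon_adaptation_index(gene_counts, w):
    """Geometric mean of w over the gene's codons (zero w values skipped)."""
    log_sum, n = 0.0, 0
    for codon, count in gene_counts.items():
        if w.get(codon, 0.0) > 0.0:
            log_sum += count * math.log(w[codon])
            n += count
    return math.exp(log_sum / n) if n else 0.0
```

A gene that uses only the most frequent codon of each family scores 1.0; each use of a less favoured synonym pulls the geometric mean down.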

Description of output files created during a correspondence analysis

Filename	Description of contents
summary.coa	This file contains a summary of all the information generated by correspondence analysis, including all the data written to files listed below, except for the output which is written to cusort.coa.
eigen.coa	Each axis generated in the correspondence analysis is represented by a row of information. Each row consists of four columns, (1) the number of the axis, (2) the axis eigenvalue, (3) the relative inertia of the axis, (4) the sum of the relative inertia.
amino.coa† or codon.coa	Each codon or amino acid included in the correspondence analysis is represented by a row. The first column contains a description of the variable while the subsequent columns contain the coordinate of the codon or amino acid on each axis, the number of axes is user definable.
genes.coa	Each row represents one gene. The first column contains a unique description for each gene; subsequent columns contain the gene's coordinate on each of the recorded axes. If additional genes are added to the correspondence analysis (an "advanced correspondence analysis" option), the coordinates of these genes are appended to this file.
cusort.coa†	Contains the codon usage of each gene, sorted by the gene’s coordinate on the principal axis. This information is used to generate the table in hilo.coa.
hilo.coa†	This file records a two-way chi-squared contingency test between two subsets of genes positioned at the extremes of axis 1 as recorded in cusort.coa. The size of the subsets is determined by an "advanced correspondence analysis" option.
cai.coa†	Contains the relative usage of each codon within each synonym family. The most frequent codon is assigned the value one and all other codons are expressed relative to this. This file can be used to calculate species specific CAI values.
fop.coa†	Contains a list of the optimal codons and non-optimal codons as identified in the file "hilo.coa". The format of this file can be utilised by CodonW to calculate Fop using a personal choice of optimal codons.
inertia.coa	This file is only generated if the "exhaustive output" option is selected under the advanced correspondence analysis menu. It contains four blocks of information. The first two blocks report the absolute contribution of each gene (block 1) and codon or amino acid (block 2) to the inertia explained by each of the axes. The last two blocks report the fraction of variation in each gene (block 3) and codon or amino acid (block 4) explained by the axes.

† These files are not created during the correspondence analysis of amino acids
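The four columns described for eigen.coa follow directly from the eigenvalues. The Python sketch below shows the arithmetic; it assumes the list holds the eigenvalues of all axes, so that their sum is the total inertia, and the function name is hypothetical.

```python
def eigen_rows(eigenvalues):
    """Rows of (axis number, eigenvalue, relative inertia, cumulative inertia)."""
    total = sum(eigenvalues)
    rows, running = [], 0.0
    for axis, value in enumerate(eigenvalues, start=1):
        relative = value / total
        running += relative
        rows.append((axis, value, relative, running))
    return rows
```

The relative inertia of the first row is the proportion of the total variation explained by the principal axis, which is the quantity to compare against the guideline of more than 15% given earlier.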

Readme

For more information about CodonW and a quick tutorial on codon usage, click here. The Readme files included with the distribution are also a good starting place to learn more about CodonW.

License

CodonW is freeware and is distributed under the conditions of the GNU General Public License version 2, which allows free distribution and modification but prevents the use of this code in commercial packages.

References

Benzecri, J. P., (1992). Correspondence analysis handbook. Statistics: textbooks and monographs. Marcel Dekker.

Bulmer, M., (1991). The selection-mutation-drift theory of synonymous codon usage. Genetics 129: 897-907.

Cancilla, M. R., A. J. Hillier and B. E. Davidson, (1995). Lactococcus lactis glyceraldehyde-3-phosphate dehydrogenase gene, gap - further evidence for strongly biased codon usage in glycolytic pathway genes. Microbiology-uk, 141: 1027-1036.

Delorme, C., J. J. Godon, S. D. Ehrlich and P. Renault, (1994). Mosaic structure of large regions of the Lactococcus lactis subsp cremoris chromosome. Microbiology-uk, 140: 3053-3060.

Eyre-Walker, A., (1995). The distance between Escherichia coli genes is related to gene expression levels. Journal of Bacteriology, 177: 5368-5369.

Eyre-Walker, A., and M. Bulmer, (1993). Reduced synonymous substitution rate at the start of Enterobacterial genes. Nucleic Acids Research 21: 4599-4603.

Freirepicos, M. A., M. I. Gonzalezsiso, E. Rodriguezbelmonte, A. M. Rodrigueztorres, E. Ramil et al., (1994). Codon usage in Kluyveromyces lactis and in yeast cytochrome c-encoding genes. Gene 139: 43-49.

Gharbia, S. E., J. C. Williams, D. M. A. Andrews and H. N. Shah, (1995). Genomic clusters and codon usage in relation to gene-expression in oral gram-negative anaerobes. Anaerobe, 1 : 239-262.

Greenacre, M. J., (1984). Theory and applications of correspondence analysis. Academic Press, London.

Groisman, E. A., M. H. J. Saier and H. Ochman, (1992). Horizontal transfer of a phosphatase gene as evidence for mosaic structure of the Salmonella genome. EMBO Journal 11: 1309-1316.

Gouy, M., and C. Gautier, (1982). Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Research 10: 7055-7074.

Ikemura, T., (1985). Codon usage and transfer-RNA content in unicellular and multicellular organisms. Molecular Biology And Evolution 2: 13-34.

Li, W. H., (1987). Models of nearly neutral mutations with particular implications for non-random usage of synonymous codons. Journal of Molecular Evolution 24: 337-345.

Kernighan, B. W., and D. M. Ritchie, (1988). The C programming language. Prentice-Hall, Englewood Cliffs, NJ.

Medigue, C., T. Rouxel, P. Vigier, A. Henaut and A. Danchin, (1991). Evidence for horizontal gene transfer in Escherichia coli speciation. Journal of Molecular Biology 222: 851-856.

Osawa, S., T. H. Jukes, K. Watanabe and A. Muto, (1992). Recent evidence for evolution of the genetic code. Microbiological Reviews 56: 229-264.

Sharp, P. M., and W. H. Li, (1987). The codon adaptation index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research 15: 1281-1295.

Sharp, P. M., and G. Matassi, (1994). Codon usage and genome evolution. Current Opinion in Genetics and Development 4: 851-860.

Thioulouse, J., and F. Chevenet, (1996). Netmul, a world-wide-web user interface for multivariate analysis software. Computational Statistics & Data Analysis, 21: 369-372.

Thioulouse, J., S. Doledec, D. Chessel and J. M. Olivier, (1995). ADE software: multivariate analysis and graphical display of environmental data, pp. 57-61 in Software per l'Ambiente, edited by G. Guariso and A. Rizzoli. Patron Editore, Bologna.

Page maintained by John Peden, last updated 7/May/2005.