Readme file for codonW

CodonW was written by John Peden in the laboratory of Paul Sharp at the University of Nottingham. It is distributed under the terms of the GNU public license; see the file License included with the distribution.

========================================

README.coa

The permanent result files from a COA created by CodonW have the extension ".coa". For a description of their contents, see Table 1.

Table 1

Filename	Description of contents
summary.coa	This file contains a summary of all the information generated by correspondence analysis, including all the data written to files listed below, except for the output written to cusort.coa.
eigen.coa	Each axis generated in the correspondence analysis is represented by a row of information. Each row consists of four columns: (1) the number of the axis, (2) the axis eigenvalue, (3) the relative inertia of the axis, (4) the cumulative sum of the relative inertia.
amino.coa† or codon.coa	Each codon or amino acid included in the correspondence analysis is represented by a row. The first column is a description of the variable; the subsequent columns contain the coordinates of the codon or amino acid on each axis. The number of axes is user definable.
genes.coa	Each row represents one gene. The first column contains a unique description for each gene, and subsequent columns contain the coordinates on each of the recorded axes. If additional genes are added to the correspondence analysis (advanced correspondence analysis option), the coordinates of these genes are appended to this file.
cusort.coa†	Contains the codon usage of each gene, sorted by the gene's coordinate on the principal axis. This information is used to generate the table in hilo.coa.
hilo.coa†	This file records a two-way chi-squared contingency test between two subsets of genes (as defined by the "advanced correspondence analysis options") positioned at the extremes of axis 1 (see cusort.coa).
cai.coa†	Contains the relative usage of each codon within each synonymous codon family: the most frequent codon is assigned the value one and all other codons are expressed relative to this. This file can be used to calculate species-specific CAI values.
fop.coa† and cbi.coa†	Contain a list of the optimal codons and non-optimal codons as identified in the file "hilo.coa". The format of these files can be used by CodonW to calculate Fop and CBI using a specific choice of optimal codons.
inertia.coa	This file is only generated if the exhaustive output option is selected under the advanced correspondence analysis menu. It contains four tables: the first two report the absolute contribution of each gene and codon (or amino acid) to the inertia explained by each axis; the second two report the fraction of the variation in each gene and codon (or amino acid) explained by each axis.

† Files that are not generated during a correspondence analysis of amino acids

========================================

summary.coa

Correspondence analysis generates a large volume of data. CodonW writes the data essential for interpreting the correspondence analysis to the file "summary.coa".

========================================

genes.coa codon.coa amino.coa

The most complex analysis that CodonW performs is correspondence analysis (COA). COA creates a series of orthogonal axes to identify trends that explain the variation in the data, with each subsequent axis explaining a decreasing amount of that variation. COA positions each gene and codon (or amino acid) on these axes. An important property is that the ordinations of the rows (genes) and columns (codons or amino acids) are superimposable.

========================================

eigen.coa

The eigenvalues of the principal trends are recorded in summary.coa and eigen.coa, together with a more accessible measure: the fraction of the total inertia of the data that each axis explains, and the cumulative total of those fractions.
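The relationship between the eigenvalues and the inertia fractions can be sketched in a few lines of Python. This is only an illustration, not CodonW's own code: it assumes that the relative inertia of an axis is its eigenvalue divided by the sum of all eigenvalues, and the function name and the eigenvalues used are invented for the example.

```python
from itertools import accumulate


def inertia_table(eigenvalues):
    """Return rows of (axis, eigenvalue, relative inertia, cumulative inertia).

    Assumes the list holds every eigenvalue of the analysis, so that
    their sum is the total inertia of the data.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    relative = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(relative))
    return [
        (axis, ev, rel, cum)
        for axis, (ev, rel, cum) in enumerate(
            zip(eigenvalues, relative, cumulative), start=1
        )
    ]


# Hypothetical eigenvalues, invented for illustration only.
rows = inertia_table([0.5, 0.25, 0.25])
for axis, ev, rel, cum in rows:
    print(f"{axis}\t{ev:.4f}\t{rel:.4f}\t{cum:.4f}")
```

The four values printed per row mirror the four columns that Table 1 lists for eigen.coa.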

========================================

cusort.coa

To simplify the analysis of codon usage, CodonW assumes that the principal trend is correlated with gene expression, and it uses this assumption to identify putative optimal codons. The adage GIGO ("garbage in, garbage out") must be stressed, though: it is the researcher's responsibility to establish that the principal trend is correlated with gene expression (see the tutorial for some examples of how to do this).

To identify the putative optimal codons, the genes are sorted according to their position on the principal axis, and the sorted codon usage of these genes is written to the file "cusort.coa". Then a number of genes, set by the advanced correspondence analysis menu option "number of genes used to identify optimal codons", are read from the start and from the end of this file (i.e. from the two extremes of the principal axis), and the codon usage of each set of genes is totalled. The set of genes with the lower Nc (more highly biased) is putatively identified as the more highly expressed.
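The sorting and totalling step can be sketched as follows. This is an illustrative outline under stated assumptions, not CodonW's implementation: the tuple layout, the function name split_extremes and the gene data are all hypothetical, and the Nc comparison that decides which end is the more highly expressed is left out.

```python
from collections import Counter


def split_extremes(genes, n):
    """Total the codon usage of the n genes at each end of axis 1.

    genes: list of (name, axis1_coordinate, codon_counts) tuples, where
    codon_counts maps a codon to the number of times it occurs.
    Returns (totals_for_lowest_n, totals_for_highest_n).
    """
    if n < 1 or 2 * n > len(genes):
        raise ValueError("n must be at least 1 and at most half the gene count")
    # Sort by coordinate on the principal axis, as in cusort.coa.
    ordered = sorted(genes, key=lambda gene: gene[1])

    def total(subset):
        counts = Counter()
        for _name, _coordinate, usage in subset:
            counts.update(usage)  # adds the per-gene counts to the running total
        return counts

    return total(ordered[:n]), total(ordered[-n:])


# Hypothetical genes and counts, invented for illustration only.
genes = [
    ("geneA", -0.42, {"CTG": 30, "CTA": 2}),
    ("geneB", 0.05, {"CTG": 10, "CTA": 9}),
    ("geneC", 0.61, {"CTG": 4, "CTA": 15}),
    ("geneD", -0.30, {"CTG": 25, "CTA": 3}),
]
low_end, high_end = split_extremes(genes, 1)
print(dict(low_end), dict(high_end))
```

Here "low" and "high" refer only to the coordinate on axis 1. Which of the two sets is treated as highly expressed would still have to be decided separately, for example by comparing their Nc values as described above.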

========================================

hilo.coa

Optimal codons are defined as those codons that occur significantly more often in highly expressed genes relative to their frequency in lowly expressed genes. Significance is assessed by a two-way chi-squared contingency test with the criterion p < 0.01. The advantage of using a test of significance to identify optimal codons is that variation in codon usage between highly and lowly expressed genes that is due to random noise is suppressed; a disadvantage is that the test is dependent on sample size.

After CodonW performs the two-way chi-squared test on the genes taken from the extremes of axis 1, their codon usage and RSCU (relative synonymous codon usage) are output as a table to "summary.coa" and "hilo.coa". Codons that have been putatively identified as optimal (p < 0.01) are indicated with an asterisk (*). Codons that occur more frequently in the highly expressed dataset at 0.01 < p < 0.05 are not considered optimal by CodonW, but are indicated with an at sign (@).
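A minimal sketch of such a test for a single codon is given below. It makes several assumptions that this file does not confirm: that the contingency table is 2 x 2 (this codon versus the other codons of the same amino acid, in the high set versus the low set), that the Pearson statistic is used without a continuity correction, and that a codon is only flagged when it is relatively more frequent in the high set. The function names and counts are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 d.f.) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def flag(high_codon, high_other, low_codon, low_other):
    """Return '*' (p < 0.01), '@' (p < 0.05) or '' for one codon."""
    _stat, p_value = chi2_2x2(high_codon, high_other, low_codon, low_other)
    # Compare the codon's share of its family in the two sets by cross-multiplying.
    more_in_high = high_codon * (low_codon + low_other) > low_codon * (
        high_codon + high_other
    )
    if not more_in_high:
        return ""
    if p_value < 0.01:
        return "*"
    if p_value < 0.05:
        return "@"
    return ""


# Hypothetical counts, invented for illustration only.
print(flag(85, 15, 40, 60))
print(flag(60, 40, 45, 55))
```

Because the statistic scales with the total count, the same proportions drawn from fewer genes can fall below either threshold, which is the sample size dependence noted above.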

========================================

fop.coa cbi.coa cai.coa

CodonW measures the degree to which the codon usage of a gene has adapted towards the usage of optimal codons. It does this by calculating three indices: the frequency of optimal codons (Fop), the codon bias index (CBI) and the codon adaptation index (CAI). To calculate these indices, information about codon usage in the species being analysed is needed. Fop and CBI use the optimal codons for the species; CAI uses codon adaptiveness values.
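As a rough illustration of two of these indices, the sketch below uses the textbook definitions: Fop as the fraction of counted codons that are optimal, and CAI as the geometric mean of the adaptiveness values of the codons in a gene. It is not CodonW's code and may differ from it in detail, for example in which codons are excluded. The function names, the set of optimal codons and the adaptiveness values are hypothetical.

```python
import math


def fop(codon_counts, optimal_codons):
    """Fraction of the counted codons that belong to the optimal set.

    The caller is assumed to have removed codons that should not be
    counted (for example stop codons and single-codon amino acids).
    """
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("no codons counted")
    optimal = sum(n for codon, n in codon_counts.items() if codon in optimal_codons)
    return optimal / total


def cai(codon_counts, adaptiveness):
    """Geometric mean of the adaptiveness values over the counted codons.

    Codons without an adaptiveness value are skipped; every value that
    is supplied must be greater than zero.
    """
    log_sum = 0.0
    total = 0
    for codon, n in codon_counts.items():
        if codon not in adaptiveness:
            continue
        log_sum += n * math.log(adaptiveness[codon])
        total += n
    if total == 0:
        raise ValueError("no codons with an adaptiveness value")
    return math.exp(log_sum / total)


# Hypothetical gene and species values, invented for illustration only.
gene = {"CTG": 8, "CTA": 2}
print(fop(gene, {"CTG"}))
print(cai(gene, {"CTG": 1.0, "CTA": 0.25}))
```

Working with logarithms keeps the geometric mean numerically stable for long genes, where a direct product of many values below one would underflow.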

For some species this information is known, and for these the optimal codons and codon adaptiveness values are built into CodonW (see the "Change Defaults" menu). For other species these indices cannot be calculated unless the additional information is known. During calculation of these indices the user is prompted for input files.

During a COA CodonW generates the output files "cai.coa", "fop.coa" and "cbi.coa". These files can be used as input files for their respective indices (they are already in the correct format).
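Table 1 describes "cai.coa" as holding the usage of each codon relative to the most frequent codon in its synonymous family. A minimal sketch of that calculation is shown below. It says nothing about the actual layout of the file, the function name is hypothetical, and the two families and their counts are invented for the example (codons are written with T rather than U, which is also an assumption).

```python
def relative_adaptiveness(codon_counts, families):
    """Scale each codon count by the largest count in its synonymous family.

    families maps an amino acid to the list of codons that encode it.
    Families with no observed codons are left out of the result.
    """
    values = {}
    for codons in families.values():
        top = max(codon_counts.get(codon, 0) for codon in codons)
        if top == 0:
            continue  # family not observed, so no relative values can be given
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / top
    return values


# Hypothetical totals from a set of highly expressed genes.
counts = {"TTT": 10, "TTC": 40, "AAA": 30, "AAG": 15}
families = {"Phe": ["TTT", "TTC"], "Lys": ["AAA", "AAG"]}
w = relative_adaptiveness(counts, families)
print(w)
```

A codon that is never observed gets the value zero here. How CodonW treats such codons is not stated in this file, so any substitution of a small non-zero value would be a further assumption.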

Again, it must be stressed that CodonW has to make a number of assumptions to generate these files: that the major trend in codon usage is correlated with expression level; that the dataset contains highly expressed genes; and that the genes used to identify the optimal codons were highly expressed. If these assumptions are valid, the files "cbi.coa", "cai.coa" and "fop.coa" can be used to calculate the indices CBI, CAI and Fop respectively.

========================================

For the most up to date version see http://codonw.sourceforge.net/ReadmeCoa.html