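Recently, a team led by Zhang Zhang, a "Hundred Talents Program" researcher at the Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, designed and developed a new algorithm for detecting codon usage bias (CUB): the Codon Deviation Coefficient (CDC). The work was published in BMC Bioinformatics.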
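The work is original in applying the intersection, union, and complement operations of probability theory to composition analysis. It uses GC content (S) and purine content (R) to express the composition of the four nucleotides, derives codon and amino acid composition from them, and so builds a composition model based on S and R. Applying that model to examine CUB in genes leads to the CDC algorithm. Unlike existing measures such as CAI and ENC, CDC accounts for the sequence-specific background composition through GC content and purine content, uniquely applies bootstrap resampling to test the significance of CUB, and does not require highly expressed genes as prior information.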
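As a rough illustration of how two quantities, S and R, can stand in for four nucleotide frequencies, the sketch below treats "is G or C" and "is a purine (A or G)" as independent events, so each base is the intersection of one event (or its complement) from each pair. The independence assumption, the function names, and the per-position `(s, r)` input are choices made for this example only; it also ignores stop codons, and the published model may differ in these details.

```python
from itertools import product

def nucleotide_composition(s, r):
    """Expected base frequencies from GC content s and purine content r.

    Assumption of this sketch: GC status and purine status are independent.
    """
    return {
        "A": (1 - s) * r,        # not GC, purine
        "T": (1 - s) * (1 - r),  # not GC, pyrimidine
        "G": s * r,              # GC, purine
        "C": s * (1 - r),        # GC, pyrimidine
    }

def expected_codon_frequencies(position_params):
    """Expected frequency of all 64 codons.

    position_params holds three (s, r) pairs, one per codon position.
    Each codon's frequency is the product of its three base frequencies.
    """
    comps = [nucleotide_composition(s, r) for s, r in position_params]
    return {
        a + b + c: comps[0][a] * comps[1][b] * comps[2][c]
        for a, b, c in product("ATGC", repeat=3)
    }

if __name__ == "__main__":
    # Hypothetical per-position values, chosen only for illustration.
    freqs = expected_codon_frequencies([(0.6, 0.5), (0.4, 0.5), (0.7, 0.3)])
    print(round(freqs["GCC"], 4))
```

Because the four base frequencies sum to 1 for any S and R between 0 and 1, the 64 codon frequencies also sum to 1, which makes them usable as an expected distribution to compare observed codon usage against.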
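In validation, CDC outperformed several existing measures on simulated data. On real data, its correlation coefficient with gene expression level was higher than that of the other algorithms, and in Escherichia coli the significance of CUB was found to be closely linked to gene function.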
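The release of this work lets researchers analyze CUB more accurately and quickly, and in turn study in greater depth how gene mutation, gene expression, protein function, and related factors evolve under the pressure of natural selection. (生物谷Bioon.com)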
doi:10.1186/1471-2105-13-43
Codon Deviation Coefficient: a novel measure for estimating codon usage bias and its statistical significance
Zhang Zhang, Jun Li, Peng Cui, Feng Ding, Ang Li, Jeffrey P Townsend and Jun Yu
Background
Genetic mutation, selective pressure for translational efficiency and accuracy, level of gene expression, and protein function through natural selection are all believed to lead to codon usage bias (CUB). Therefore, informative measurement of CUB is of fundamental importance to making inferences regarding gene function and genome evolution. However, extant measures of CUB have not fully accounted for the quantitative effect of background nucleotide composition and have not statistically evaluated the significance of CUB in sequence analysis.

Results
Here we propose a novel measure--Codon Deviation Coefficient (CDC)--that provides an informative measurement of CUB and its statistical significance without requiring any prior knowledge. Unlike previous measures, CDC estimates CUB by accounting for background nucleotide compositions tailored to codon positions and adopts the bootstrapping to assess the statistical significance of CUB for any given sequence. We evaluate CDC by examining its effectiveness on simulated sequences and empirical data and show that CDC outperforms extant measures by achieving a more informative estimation of CUB and its statistical significance.

Conclusions
As validated by both simulated and empirical data, CDC provides a highly informative quantification of CUB and its statistical significance, useful for determining comparative magnitudes and patterns of biased codon usage for genes or genomes with diverse sequence compositions.
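To make the idea of assessing significance by resampling concrete, the following sketch runs a simple simulation-based test. It is not the paper's procedure: the statistic used here (one minus the cosine similarity between observed and expected codon usage), the way background composition is estimated per codon position from the sequence itself, the replicate count, and all function names are assumptions of this example.

```python
import math
import random
from collections import Counter

BASES = "ACGT"

def codon_counts(seq):
    """Count non-overlapping codons; trailing bases are ignored."""
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

def position_composition(seq):
    """Base frequencies at each of the three codon positions."""
    n = len(seq) // 3
    comps = []
    for pos in range(3):
        counts = Counter(seq[i + pos] for i in range(0, n * 3, 3))
        comps.append({b: counts[b] / n for b in BASES})
    return comps

def deviation(seq, comps):
    """1 - cosine similarity between observed and expected codon usage."""
    obs = codon_counts(seq)
    dot = norm_o = norm_e = 0.0
    for a in BASES:
        for b in BASES:
            for c in BASES:
                e = comps[0][a] * comps[1][b] * comps[2][c]
                o = obs[a + b + c]
                dot += o * e
                norm_o += o * o
                norm_e += e * e
    return 1.0 - dot / math.sqrt(norm_o * norm_e)

def bootstrap_p_value(seq, replicates=200, seed=0):
    """Return (observed deviation, empirical p-value).

    Null sequences are drawn position by position from the background
    composition, so they share it but carry no extra codon preference.
    """
    rng = random.Random(seed)
    comps = position_composition(seq)
    observed = deviation(seq, comps)
    n = len(seq) // 3
    weights = [[comps[p][b] for b in BASES] for p in range(3)]
    hits = 0
    for _ in range(replicates):
        sim = "".join(
            rng.choices(BASES, weights=weights[p])[0]
            for _codon in range(n)
            for p in range(3)
        )
        if deviation(sim, comps) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (replicates + 1)

if __name__ == "__main__":
    # Toy sequence that uses only two of the eight codons its composition allows.
    print(bootstrap_p_value("AAA" * 50 + "CCC" * 50))
```

The toy input has equal A and C at every position, so eight codons are equally expected, yet only two occur; simulated sequences spread across all eight and almost never deviate as much, which is what drives the small p-value.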