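Recently, Chuck Cannon, a researcher at the Xishuangbanna Tropical Botanical Garden of the Chinese Academy of Sciences, working with researchers at the Beijing Institute of Genomics and Texas Tech University in the United States, developed a software package that analyzes high-throughput short-read sequence data directly, simplifying comparative genomic and transcriptomic studies based on such data. The results were recently published in PLOS ONE.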
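According to Cannon, high-throughput sequencing, also known as "next-generation" sequencing, can sequence hundreds of thousands to millions of DNA molecules in parallel in a single run. As a result, it allows a more comprehensive analysis of a species' transcriptome and genome than was previously possible.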
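However, because the raw reads produced by "next-generation" sequencing are only a few dozen to one or two hundred bases long, the conventional analysis workflow requires bioinformatics tools to assemble these short reads into longer contigs or a genome scaffold before biologically meaningful results can be obtained. This has limited the use of such data in genomic studies of non-model organisms that lack a reference genome.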
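"The package we developed analyzes high-throughput short-read data directly: it checks whether kmer fragments are present in the data and how often they occur, and uses that information to investigate sequence differences among a given set of target genomes. It can therefore overcome the bioinformatics bottleneck that this kind of data often faces," Cannon told reporters.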
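To make the idea in Cannon's description concrete, the following is a minimal Python sketch of comparing genomes by kmer presence and frequency without assembling the reads. It is not the published package: the function names (`count_kmers`, `kmer_profiles`, `jaccard_distance`) are hypothetical, only the forward strand is counted, and the Jaccard distance is a stand-in similarity measure chosen for illustration rather than the measure the authors use.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping kmer of length k in one sequence.

    Assumption: forward strand only; reverse complements are ignored.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_profiles(reads_by_genome, k):
    """Build one kmer frequency profile per genome from its unassembled reads.

    reads_by_genome maps a genome name to a list of read strings.
    """
    profiles = {}
    for name, reads in reads_by_genome.items():
        total = Counter()
        for read in reads:
            total.update(count_kmers(read, k))
        profiles[name] = total
    return profiles


def jaccard_distance(profile_a, profile_b):
    """Distance based only on kmer presence or absence (illustrative choice)."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Toy reads, invented for this example.
    reads = {
        "genome_a": ["ACGTACG"],
        "genome_b": ["ACGTTCG"],
    }
    profiles = kmer_profiles(reads, k=3)
    print(profiles["genome_a"].most_common(1))
    print(jaccard_distance(profiles["genome_a"], profiles["genome_b"]))
```

The frequency profile keeps the counts (how often each kmer occurs), while the distance uses only presence or absence, mirroring the two kinds of signal mentioned in the quote.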
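In addition, building on their earlier work, the team further improved the assembly-free analysis method and compared the complete genome data of 174 chloroplasts to demonstrate the package's functionality and workflow.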
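The research was funded by a Key Direction Project of the Chinese Academy of Sciences Knowledge Innovation Program and by the Yunnan Province High-End Science and Technology Talent Recruitment Program. (生物谷Bioon.com)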
doi:10.1371/journal.pone.0048995
Reference-Free Comparative Genomics of 174 Chloroplasts
Chai-Shian Kua, Jue Ruan, John Harting, Cheng-Xi Ye, Matthew R. Helmus, Jun Yu, Charles H. Cannon
Direct analysis of unassembled genomic data could greatly increase the power of short read DNA sequencing technologies and allow comparative genomics of organisms without a completed reference available. Here, we compare 174 chloroplasts by analyzing the taxonomic distribution of short kmers across genomes [1]. We then assemble de novo contigs centered on informative variation. The localized de novo contigs can be separated into two major classes: tip = unique to a single genome and group = shared by a subset of genomes. Prior to assembly, we found that ~18% of the chloroplast was duplicated in the inverted repeat (IR) region across a four-fold difference in genome sizes, from a highly reduced parasitic orchid [2] to a massive algal chloroplast [3], including gnetophytes [4] and cycads [5]. The conservation of this ratio between single copy and duplicated sequence was basal among green plants, independent of photosynthesis and mechanism of genome size change, and different in gymnosperms and lower plants. Major lineages in the angiosperm clade differed in the pattern of shared kmers and de novo contigs. For example, parasitic plants demonstrated an expected accelerated overall rate of evolution, while the hemi-parasitic genomes contained a great deal more novel sequence than holo-parasitic plants, suggesting different mechanisms at different stages of genomic contraction. Additionally, the legumes are diverging more quickly and in different ways than other major families. Small duplicated fragments of the rrn23 genes were deeply conserved among seed plants, including among several species without the IR regions, indicating a crucial functional role of this duplication.
Localized de novo assembly of informative kmers greatly reduces the complexity of large comparative analyses by confining the analysis to a small partition of data and genomes relevant to the specific question, allowing direct analysis of next-gen sequence data from previously unstudied genomes and rapid discovery of informative candidate regions.
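The "tip" and "group" classes described in the abstract can be illustrated with a short Python sketch. This is a simplification under stated assumptions: the paper classifies localized de novo contigs, whereas the sketch labels individual kmers by how many genomes contain them; the function name `classify_kmers` and the extra label `shared_by_all` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def classify_kmers(kmers_by_genome):
    """Label each kmer by its distribution across genomes.

    kmers_by_genome maps a genome name to the set of kmers observed in it.
    Returns a dict: kmer -> (label, frozenset of genome names), where
      "tip"           = present in exactly one genome,
      "group"         = shared by a proper subset of two or more genomes,
      "shared_by_all" = present in every genome (label assumed here).
    """
    owners = defaultdict(set)
    for name, kmers in kmers_by_genome.items():
        for kmer in kmers:
            owners[kmer].add(name)

    n_genomes = len(kmers_by_genome)
    classes = {}
    for kmer, names in owners.items():
        if len(names) == 1:
            label = "tip"
        elif len(names) < n_genomes:
            label = "group"
        else:
            label = "shared_by_all"
        classes[kmer] = (label, frozenset(names))
    return classes


if __name__ == "__main__":
    # Toy kmer sets, invented for this example.
    genomes = {
        "a": {"AAA", "CCC", "GGG"},
        "b": {"AAA", "CCC", "TTT"},
        "c": {"AAA", "ACG"},
    }
    for kmer, (label, names) in sorted(classify_kmers(genomes).items()):
        print(kmer, label, sorted(names))
```

Restricting later steps to the kmers labelled "tip" or "group" for the genomes of interest is one way to picture the "small partition of data" that the abstract says localized assembly is confined to.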