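Recently, a genetics group at Los Alamos National Laboratory (LANL) in the United States and an international consortium jointly proposed a set of quality standards intended to clarify what publicly available genome sequencing data actually represent. The new standards could ultimately allow genetic researchers to develop more effective vaccines, or help public health and security personnel respond more quickly to potential public health emergencies.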
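In the latest issue of Science, LANL geneticist Patrick Chain and his colleagues propose six labels for genome sequence data that classify the data by completeness, accuracy, and therefore reliability. The labels would be available in public databases, where only two labels are currently in use. The work matters because researchers rely on such data every day to cross-reference unknown genetic data against the genetic data of known organisms, and the new classification is expected to make retrieving and comparing the data considerably more efficient.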
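The cells of every organism contain DNA, which is built from four molecular building blocks (bases); arranged in particular sequences, these bases form genes. Gene sequences can carry genetic instructions that are beneficial or harmful to the organism. Genome researchers have catalogued thousands of genetic data sets and placed them in public databases for other researchers to use. Because of the complexity of the data, however, the genetic information in public databases ranges from rough to highly refined. Until now these data were classified only as either "draft" or "finished", which left far too much uncertainty about their accuracy.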
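Chain said that gene sequencing technology has advanced dramatically over the past few years and that publicly available genetic data has grown explosively, with the sequence data generated each day now exceeding, by billions of base pairs, what was produced over several years in the past. Different sequencing technologies have different levels of accuracy. A high degree of uncertainty in a sequence can send researchers down a wrong path that costs a year or even several years. A standard is therefore needed that gives researchers a clear assessment of the quality of genetic sequencing data.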
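Chain worked with several genome sequencing centers, large and small, including the U.S. Department of Energy Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, to propose expanding the existing classification of sequence data from two categories to six. The six standards range from "Standard Draft", which represents the minimum requirement for public submission, to "Finished", which represents the highest standard. To be accepted as "Finished", a sequence may contain at most one error per 100,000 base pairs.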
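The "Finished" criterion quoted above is a simple rate check, and a minimal Python sketch may make the arithmetic concrete. The function name, the constant, and the sample figures below are hypothetical illustrations and do not come from the paper; only the threshold of one error per 100,000 base pairs is taken from the article.

```python
from fractions import Fraction

# Threshold quoted in the article: at most 1 error per 100,000 base pairs.
# The constant name is an assumption made for this sketch.
FINISHED_MAX_ERROR_RATE = Fraction(1, 100_000)


def meets_finished_error_threshold(error_count: int, sequence_length_bp: int) -> bool:
    """Return True if the observed error rate is within the quoted threshold.

    Hypothetical helper: it checks only the error-rate criterion mentioned
    in the article, not any other requirement a real standard may impose.
    """
    if sequence_length_bp <= 0:
        raise ValueError("sequence_length_bp must be positive")
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    # Exact rational arithmetic avoids floating-point edge cases at the boundary.
    return Fraction(error_count, sequence_length_bp) <= FINISHED_MAX_ERROR_RATE


# Illustrative numbers only: a 4,600,000 bp genome may carry at most 46 errors.
print(meets_finished_error_threshold(40, 4_600_000))
print(meets_finished_error_threshold(50, 4_600_000))
```

Under these assumed figures, 40 errors falls within the threshold and 50 does not, since 4,600,000 base pairs allow at most 46 errors at a rate of one per 100,000.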
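Chris Detter, leader of LANL's Genome Science Group and director of the Joint Genome Institute's LANL center, said the aim of the work is to give all major genome centers and genome research groups access to classified genome sequence data that fit their needs. To help preserve the integrity of genome sequence records, smaller research centers can also use the classification levels when preparing and submitting their results, helping other scientists understand what work has already been done. (Bioon.com)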
Original source recommended by Bioon:
Science, 9 October 2009. DOI: 10.1126/science.1180614
Genome Project Standards in a New Era of Sequencing
P. S. G. Chain,1,2,3,* D. V. Grafham,4 R. S. Fulton,5 M. G. FitzGerald,6 J. Hostetler,7 D. Muzny,8 J. Ali,9 B. Birren,6 D. C. Bruce,1,10 C. Buhay,8 J. R. Cole,3 Y. Ding,8 S. Dugan,8 D. Field,11 G. M. Garrity,3 R. Gibbs,8 T. Graves,5 C. S. Han,1,10 S. H. Harrison,3,* S. Highlander,8 P. Hugenholtz,1 H. M. Khouri,12 C. D. Kodira,6,* E. Kolker,13,14 N. C. Kyrpides,1 D. Lang,12 A. Lapidus,1 S. A. Malfatti,12 V. Markowitz,15 T. Metha,6 K. E. Nelson,7 J. Parkhill,4 S. Pitluck,1 X. Qin,8 T. D. Read,16 J. Schmutz,17 S. Sozhamannan,18 P. Sterk,11 R. L. Strausberg,7 G. Sutton,7 N. R. Thomson,4 J. M. Tiedje,3 G. Weinstock,5 A. Wollam,5 Genomic Standards Consortium, Human Microbiome Project Jumpstart Consortium, J. C. Detter,10
For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker "draft"; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.
1 U.S. Department of Energy Joint Genome Institute.
2 Lawrence Livermore National Laboratory.
3 Michigan State University.
4 The Sanger Institute.
5 Washington University School of Medicine.
6 The Broad Institute.
7 J. Craig Venter Institute.
8 Baylor College of Medicine.
9 Ontario Institute for Cancer Research.
10 Los Alamos National Laboratory.
11 Natural Environment Research Council Centre for Ecology and Hydrology.
12 National Center for Biotechnology Information.
13 Seattle Children's Hospital and Research Institute.
14 University of Washington School of Medicine.
15 Lawrence Berkeley National Laboratory.
16 Emory GRA (Georgia Research Alliance) Genomics Center.
17 HudsonAlpha Institute.
18 Naval Medical Research Center.