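The cells of every organism contain DNA, which is built from four molecular building blocks, or bases; arranged in particular sequences, these bases form genes. Gene sequences can carry genetic instructions that are either beneficial or harmful to the organism. Genome researchers have catalogued thousands of sets of genetic data and deposited them in public databases for other researchers to use. Because genomic data are complex, however, the genetic information in public databases ranges from very rough to highly polished. Until now, these data have usually been sorted into just two categories, "draft" and "finished," which leaves far too much uncertainty about their accuracy.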
Science, 9 October 2009. DOI: 10.1126/science.1180614
Genome Project Standards in a New Era of Sequencing
P. S. G. Chain,1,2,3,* D. V. Grafham,4 R. S. Fulton,5 M. G. FitzGerald,6 J. Hostetler,7 D. Muzny,8 J. Ali,9 B. Birren,6 D. C. Bruce,1,10 C. Buhay,8 J. R. Cole,3 Y. Ding,8 S. Dugan,8 D. Field,11 G. M. Garrity,3 R. Gibbs,8 T. Graves,5 C. S. Han,1,10 S. H. Harrison,3,* S. Highlander,8 P. Hugenholtz,1 H. M. Khouri,12 C. D. Kodira,6,* E. Kolker,13,14 N. C. Kyrpides,1 D. Lang,12 A. Lapidus,1 S. A. Malfatti,12 V. Markowitz,15 T. Metha,6 K. E. Nelson,7 J. Parkhill,4 S. Pitluck,1 X. Qin,8 T. D. Read,16 J. Schmutz,17 S. Sozhamannan,18 P. Sterk,11 R. L. Strausberg,7 G. Sutton,7 N. R. Thomson,4 J. M. Tiedje,3 G. Weinstock,5 A. Wollam,5 Genomic Standards Consortium, Human Microbiome Project Jumpstart Consortium, J. C. Detter,10
For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker "draft"; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.
1 U.S. Department of Energy Joint Genome Institute.
2 Lawrence Livermore National Laboratory.
3 Michigan State University.
4 The Sanger Institute.
5 Washington University School of Medicine.
6 The Broad Institute.
7 J. Craig Venter Institute.
8 Baylor College of Medicine.
9 Ontario Institute for Cancer Research.
10 Los Alamos National Laboratory.
11 Natural Environment Research Council Centre for Ecology and Hydrology.
12 National Center for Biotechnology Information.
13 Seattle Children's Hospital and Research Institute.
14 University of Washington School of Medicine.
15 Lawrence Berkeley National Laboratory.
16 Emory GRA (Georgia Research Alliance) Genomics Center.
17 HudsonAlpha Institute.
18 Naval Medical Research Center.