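By using supercomputers to compare portions of the human genome with those of other mammals, researchers at Cornell University have discovered 300 previously unidentified human genes and have found extensions of several hundred genes that were already known.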
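The findings rest on a specific idea: as organisms evolve, the sections of genetic code that do something useful for the organism change in different ways from the rest. The researchers reported their results in a recent online edition of Genome Research.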
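The complete human genome was sequenced several years ago, but that only means the order of the bases that make up the genetic code is known. What remains is to pinpoint the exact location of every DNA sequence that codes for a protein or performs a regulatory or other function.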
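Although more than 20,000 protein-coding genes have been identified so far, the Cornell findings show that many genes have been missed by current biological analysis methods. Those methods are very effective at finding widely expressed genes, but they can miss genes that are expressed only in particular organs or only at early stages of embryonic development.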
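The team used an evolutionary perspective to identify these genes. Evolution, the researchers said, has been running this experiment for millions of years, and computation is the "microscope" for observing the results.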
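Siepel, who led the study, and his colleagues set out to find genes that have been conserved in evolution: genes that are essential to all life and that take the same or a very similar form.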
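Using large-scale computer clusters, the researchers ran three different programs to examine alignments, previously produced by other researchers, among the human, mouse, rat and chicken genomes.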
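The idea of scanning a multi-species alignment for conserved stretches can be illustrated with a minimal sketch. This is not one of the three programs the researchers ran, which the article does not name, and it is far simpler than a real gene predictor. The toy sequences, the function names `column_identity` and `conserved_windows`, and the window size and threshold are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy alignment (invented data, not real genomic sequence).
# The first nine columns are identical in all four species.
ALIGNMENT: Dict[str, str] = {
    "human":   "ATGGCCAAGTCTAGGATTCA",
    "mouse":   "ATGGCCAAGACGTTCAGGAT",
    "rat":     "ATGGCCAAGGATCCA-TACG",
    "chicken": "ATGGCCAAGCGAATTCCGTC",
}


def column_identity(alignment: Dict[str, str]) -> List[float]:
    """For each column, the fraction of species sharing the most common base.

    Gap characters ("-") never count as matches.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for i in range(length):
        bases = [s[i] for s in seqs if s[i] != "-"]
        if not bases:
            scores.append(0.0)
            continue
        top = max(bases.count(b) for b in set(bases))
        scores.append(top / len(seqs))
    return scores


def conserved_windows(
    scores: List[float], window: int = 6, threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Merged half-open intervals whose sliding-window mean meets the threshold."""
    hits: List[Tuple[int, int]] = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            if hits and start <= hits[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                hits[-1] = (hits[-1][0], start + window)
            else:
                hits.append((start, start + window))
    return hits


regions = conserved_windows(column_identity(ALIGNMENT))
print(regions)
```

On this toy input the only region reported is columns 0 to 9 (half-open), the stretch shared by all four species. Real comparative gene finders go well beyond percent identity, for example by modeling how substitution patterns differ between coding and noncoding sequence, so this sketch shows only the general principle.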
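The whole project, from building and testing the mathematical models to finally running the programs, took about three years. In the end, the team found 300 new human genes.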
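Earlier, a large research project involving hundreds of scientists from more than 100 institutions in 16 countries sequenced and compared the genomes of 12 fruit fly species. The data from that project greatly advanced researchers' understanding of fruit flies. But human genome biologists should take note as well: the project also exposed considerable shortcomings in the way genes are identified.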
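Thomas Kaufman of Indiana University in the United States said that genome research has made great strides in recent years, but that simply feeding data into a computer to obtain the sequence "truth" leaves many problems unsolved. The new large-scale study tells us this: when you compare many different but related genomes, you are more likely to "see" the genes buried in all the A-C-T-G debris.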
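Two papers from the project, published in Nature, present the results of the four-year genome effort and draw some conclusions about fruit flies from the data. Implicit in the conclusions of both papers is that, when analyzing the genome of any single species, comparing it with related genomes greatly improves the efficiency of gene identification. The researchers said that more than 40 "companion" papers will be published, each analyzing a different aspect of the data from the 12 fruit fly genomes.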
Original source:
Published online before print November 7, 2007
Genome Research, DOI: 10.1101/gr.7128207
Targeted discovery of novel human exons by comparative genomics
Adam Siepel1,9, Mark Diekhans2, Brona Brejová1, Laura Langton3, Michael Stevens3, Charles L.G. Comstock3, Colleen Davis4, Brent Ewing4, Shelly Oommen5, Christopher Lau5, Hung-Chun Yu5, Jianfeng Li5, Bruce A. Roe5, Phil Green4, Daniela S. Gerhard6, Gary Temple7, David Haussler2,8, and Michael R. Brent3
1 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA; 2 Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA; 3 Laboratory for Computational Genomics, Washington University, Saint Louis, Missouri 63130, USA; 4 Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; 5 Departments of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma 73109, USA; 6 National Cancer Institute, Bethesda, Maryland 20892, USA; 7 National Human Genome Research Institute, Bethesda, Maryland 20892, USA; 8 Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA
A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human-genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT–PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds—not thousands—of protein-coding genes are completely missing from the current gene catalogs.