THE IMPRESSIVE Lewis-Sigler Institute for Integrative Genomics at Princeton University features contemporary architecture and a modern art-inspired setting. But can it fulfill the ambitious objectives of its new director, David Botstein?
Botstein's scientific credentials are impeccable. His insight into human gene mapping 25 years ago laid the foundation for the Human Genome Project. Later, at Genentech, he pushed the therapeutic use of humanized antibodies, such as Herceptin, for cancer. With Stanford's Pat Brown, he popularized large-scale microarray studies. And his own lab's research on yeast helped establish the most comprehensive database of any organism. Botstein's goal is to produce bilingual scientists: biologists fluent in computer science, or chemists who can speak evolutionary biology. Whatever the term (integrative genomics, systems biology, or computational biology), a new breed of integrated scientist, one with a complete understanding of genomics, proteomics, and networks, is essential to translate the genome sequence into predictive and preventative medicine.
During a 60-minute conversation in Princeton, IDC life science analyst Zachary Zimmerman questioned the garrulous geneticist on his philosophy, science, his mission for integrative science, and what he deems a success for systems biology.
Q: The 2001 Nature paper on the Human Genome Project listed three key experiments, one being your classic 1980 paper in the American Journal of Human Genetics. What prompted that work, and at the time did you think that your DNA mapping concept would be so influential?
A: In 1978, there was human genetics only by analogy with other organisms. You could not, for example, map a human gene on the basis of its phenotype. In 1978, at a meeting at Alta, Utah, three of us — Mark Skolnick, Ron Davis, and myself — came to a realization ... that using polymorphisms in families in which the disease gene and the polymorphism both segregated was a feasible undertaking, and you could do this in humans. That would change the prospects of finding disease genes, because if you knew they were linked to some piece of DNA on chromosome 9, then you had reduced the search for that gene by twentyfold or something. Then you could break it down to about 1 megabase just by mapping. So this started a very large effort to map many genes ... and many of those mappings have turned [out to be] genes that cause disease ...
This had an effect on the genome project, because right from the very beginning ... I was opposed to thoughtless sequencing of the genome. However, one of the really robust arguments for sequencing the genome was that the physical mapping required to sequence would help you map [disease genes]. So genetic mapping and sequencing reinforced each other.
Q: At Genentech you were involved in the success of Herceptin. What is your opinion on mixing academic science with commercial science?
A: For a couple of years, I was vice president at Genentech ... One of the things that we did was recognize the possibility that an antibody against Her2 might have therapeutic use [for breast cancer] ... Even if it hadn't been [therapeutic], we would have learned that you can humanize antibodies, [so it was also a technology development project] that opened the door to antibody products. I don't know what fraction of total [biotech] income is antibodies, but it's not a small fraction ... The pipeline is full of antibodies at Genentech and elsewhere.
If one has a reasonable ethical compass and a clear idea at what point one is working for whom, [academic/industry collaboration] is a wonderful thing — it keeps your feet on the ground in both venues ... That said, I am very skeptical of people who do the same things in their academic life and in their consulting life ... I have managed to avoid that simply by not doing in my lab what Genentech is doing.
Q: You and Pat Brown started large-scale gene-expression studies. Can you comment on the first microarrays, and what is the future for gene-expression studies?
A: When I took over leadership of the [Stanford University] genetics department, Pat Brown, who was then an assistant professor of pediatrics, came in and started to tell me about one of his many ideas ... He was interested in mapping human genes by a method called genomic mismatch scanning ... One of his students put me on his thesis committee. The student was Dari Shalon, who made the first microarrays at Stanford. At a certain point we realized that there was a huge amount that could be done, [and] the low-hanging fruit was cancer. We decided to throw in together to found a joint lab to explore the possibilities of this technology for cancer, using yeast as a model, not just for the biology but for the technology. Since then, we have many more papers than we should have done, but it turned out to be a fertile collaboration, partly because I was able to provide a certain kind of pedagogic experience and Pat had all of these fantastic ideas ... Going to Princeton was the end of that, but given the difference in our ages it was inevitable; our collaboration couldn't go on forever, anyway!
Gene expression right now is the only truly effective whole-genome high-throughput method, so although one can argue that you would rather know things about the proteins or the DNA sequence, the fact is you can get a very detailed and revealing picture of the physiology of the cell by looking at its transcription program over time or between different [states], like tumors or fibroblasts ... This is the only game in town at that level of productivity. That said, one of my students [Olga Troyanskaya] wrote a paper recently in which she was able to computationally use a Bayesian network to take very diverse kinds of data and put them together — mostly expression data, but by no means all of it. I think in the future, expression patterns may be the glue that puts together more fragmentary data from other sources and tells you that the story is about 'gene i' in an array of 35,000 such genes.
Q: What impact will Princeton have on you?
A: It's a new job — a wonderful opportunity to go back to my roots, and by that I don't mean the East Coast so much as I mean undergraduate and graduate teaching. I did a lot of graduate teaching at Stanford — I flatter myself that I had a positive influence on their teaching methods and attitude — but at the end of the day, Stanford was not interested in the same way that Princeton is in pushing the envelope of teaching at the interface of biology and more quantitative sciences. And I welcome the opportunity to do that. What I hope to achieve here is to produce an alternative curriculum whereby the students who come out are multilingual in physics, computation, math, and biology at the same time, which are the kinds of students that very few universities ever see.
Q: What is integrative genomics, and is it distinguishable from systems biology?
A: I have to make a disclaimer: I did not invent the name 'integrative genomics.' It has history at Princeton, which is only of local significance. Systems biology is also not my name. As far as I know, it arose ... as the name of an institute that Lee Hood built. Whether it's integrated genomics or systems biology or whatever, we're making a program here called quantitative and computational biology, and that would be another name ...
Instead of looking at genes or the best-case pathways one at a time, we want to look at the ensemble and try to understand the behavior of organisms, not just through the lens of a single gene, protein, or pathway ... In order to be sure you are looking at everything that is relevant, at least in the beginning, you have to look at the whole genome, and gene expression is the only truly comprehensive technology that is practical on the ground today. I am sure there will be others in the future.
At the level of many genes' function, many pathways interacting, many proteins touching each other ... [current] analytical methods are just going to be inadequate. You have to collect much more comprehensive data, and that means computation. It means databases. It means analytical technology that really involves a fair amount of higher mathematics. So we all know that in the future, we have to address this system-level, genome-level problem, and in order to do that, we are going to have to be much more quantitative and computational and mathematical than in the past.
Q: What is the range of existing science that fits under the umbrella of systems biology?
A: At Princeton, it appears [to] involve principally [members of] four or five departments: molecular biology, ecology and evolutionary biology, physics, chemistry, computer science. Those are the big players. Chemical engineering — we have a chemical engineer. Miscellaneous people in other departments ... So it's pretty broad across the sciences. Our undergraduate program imagines an introductory two-year sequence that leaves [students] prepared to major in any of those fields.
Q: What does this 'integrative' approach suggest about the traditional method of scientific research, and has there been any resistance to this new type of structure?
A: It says nothing. It simply says it's different, it's new, it's an opportunity that has been brought up by the genome sequences and the technology that allows you to look at many things at once, together with the computational capacity to try to analyze and appreciate everything at once. It's completely complementary to the classical methods; however, there is always a certain amount of friction because the new thing tends to attract a different kind of people, and there's always some kind of understandable competition between the old and the new.
The most difficult thing is the standard interdisciplinary problem: someone who spends half her time doing computation and half doing experiments takes the risk of being fully supported by neither. Of course, she might also discover that she is fully supported by both. But unfortunately in the academic world, it's too easy to have a silo mentality, where if it isn't authentic 'genetics' or authentic 'computer science' then it doesn't belong with 'us.' Princeton has really put its money where its mouth is in building this magnificent building with the idea that it is going to be really interdisciplinary. So I am very optimistic that we will be able to live with whatever remains of that attitude.
Q: Systems biology is becoming almost trendy. So far, only elite academic institutions have established programs — Princeton, Stanford's BioX, the Broad Institute, and the Institute for Systems Biology. Is systems biology going to spread throughout all academic institutions?
A: That depends on how successful it really is. The combination of biology and more quantitative approaches (the modern analogue of biophysics) absolutely is inevitable, but is that systems biology? The four institutions you mentioned are actually not so similar to each other as it would first appear. BioX is real estate ... There is no discernible theme there, although it involves things other than biology in the same building. The Broad Institute is, as far as I can tell, academic pharmaceutical research: they're going to use the genome to find drugs and treat disease, and it is very medical in its approach. I don't think it's the same thing that we are thinking about, which is trying to understand why a glutamine is an important regulator as opposed to a glutamate in a particular circumstance. Neither of the above institutions has any interest in that. We are much more academic in that regard. I do believe that Lee Hood's thoughts about this are more like ours. I would say that in this spectrum, he is somewhere between us and Eric Lander and the Broad folks.
Our main mandate, as worked out between ourselves and [Princeton president] Shirley Tilghman, is in the teaching area — this business of producing an academic program that will produce the kinds of scientists that all of these other places are going to need. Because they are not involved in teaching ... the connection between biology and quantitative stuff needs to happen. But each institution has different ways of looking at it.
Q: What classifies as a success for systems biology?
A: A major success would be a really functional mathematical computer model of the development of an epithelial sheet in some organism. Or a full understanding of why, when you put in a pulse of glucose, a particular subset of enzymes has more or less transcription and the fluxes of carbon go in different ways. We have a very general understanding of those things but nothing specifically quantitative or dynamic about it at all. [Or] a really good explanation, including quantitative parameters, of a signal transduction cascade. Those are the intellectual successes ... Another success for us would be that we produce a steady stream of undergraduates who become professionals in the field and whose education gives them a big advantage because we have educated them in a forward-looking and sensible way.
Q: Do you expect that systems biology will remain limited to the academic setting and a handful of biotechs, or will it become a widely used approach in the average pharma company?
A: That's easy! If it succeeds ... it will be like genetics. Even today, there are a lot of pharmaceutical companies that don't believe in genetics — but they do it anyway! Success is in the arena of ideas. If we are right about this, the answer is, it will spread.
Q: What organizational structures are needed to facilitate interdisciplinary collaborations besides simply locating scientists of diverse disciplines in offices next to each other and making them share a water cooler and coffeepot?
A: That's a really good question, because I think that just placing people in labs next to each other is not likely to do very much except to have people drink coffee and pee together! The place has to have a mandate, a reason to have people work together ... The task that we have is to educate the students in the context of each other's academic expertise and views and by being thrown together with the necessity to provide a whole menu of courses. In that way, we will get to know each other better than I could imagine, except if we were working on the same project.
Q: The diversity of scientific disciplines carries with it a diversity of software tools, many specific to their own domain. What challenges and opportunities does this situation hold for various types of software vendors?
A: We ... have always been writing our own software. The vendors have always lagged significantly behind what we see as the state of the art. Where we do use commercial software, we have already worked out the algorithms and have something that works, and then for various reasons some vendor is motivated to write it to run faster, run on more platforms, etc. The Axon scanner is a perfectly good example: GenePix is a much improved and better engineered version of something that Mike Eisen wrote in our lab. So the vendors make their money not from us but from the pharmas.
To a large extent, their systems are weakened by being closed, not open-source, and unnecessarily restrictive in access and by being expensive. As a result, if you have the kind of tools for writing your own software, [that is] the disincentive to pay a lot of money so that one of your guys can work on one terminal, on the kind of computer that you don't really like to support. So if you have bright students, who in a few days can work up something in Perl and in a few weeks can make something pretty robust in C or Java, why would you use those [expensive commercial] things? So we don't. Most state-of-the-art places don't use the commercial stuff. It's also the same reason we made our own microarrays — because we could do better for less, and the really good stuff we just couldn't afford.
Q: What are the key challenges to applying modeling and simulation to biology?
A: Simulation depends on what you mean — this is very complicated ... First, think about statistics and data analysis. We have a very distinct preference toward nonparametric methods [because] we rarely understand the underlying statistical distribution. So a lot of simulation, bootstrapping, and Monte Carlo goes on simply to generate something credible, so that you know if your results are likely to be the product of chance or whether you actually discovered a gene ... In terms of really working computer models that tell you something you didn't know before, actually the world is not full of such models. There are some in neurobiology, a few in chemotaxis, Harley McAdams' feedback loops for lambda, which are very impressive, but actually I don't think we learned much there that the classical lambdologists didn't know ...
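The resampling Botstein describes can be illustrated with a permutation test, one common nonparametric way to ask whether a difference between two groups could be the product of chance. What follows is a minimal sketch in Python, not the lab's actual code; the function name `permutation_p_value`, the fixed seed, the rounding tolerance, and the sample values are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    No distribution is assumed: group labels are shuffled repeatedly, and we
    count how often a random relabeling gives a difference at least as large
    as the observed one.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical expression levels of one gene in two kinds of samples.
tumor = [5.1, 4.9, 5.3, 5.2, 5.0, 5.4]
normal = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
print(permutation_p_value(tumor, normal))
```

A small value suggests the difference is unlikely to arise from random relabeling alone. When thousands of genes are tested at once, a multiple-testing correction would also be needed before calling any single gene a discovery.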
Q: A full-fledged development of modeling and simulation in biology ultimately leads to issues on how to effectively make simulations interoperate. What steps are being taken to learn from modeling and simulation experts in other fields? For example, designing simulations that implement the simulation standard IEEE 1516.
A: I am deeply suspicious of premature standardization. I call it Stalinism! Once you know what you are doing, then erecting standards and constraints makes sense to me. If there is [even] the suspicion that you might not know what you are doing, then I think the right thing to do is proceed ahead and leave the experimenters as much leeway as possible. That said, the standard I wish everyone to adhere to, which is the only important standard at this stage, would be that people actually make their data available. The standards matter less than the availability — if I have it in an Excel spreadsheet and you have it in an Oracle database, that is much less important than if I have it and you don't.
The fact is, if I have a table in Excel and you have a table in Oracle and we both know Perl, and we have some idea what the rules are about how the different cells of the different tables should relate to each other, we are not going to have any trouble just because we don't both use Excel or Oracle ... I think that a lot of people are pushing standards rather than thinking about science. That makes no sense to me.
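The point about Excel and Oracle can be made concrete with a short script that joins a spreadsheet export to a database table on a shared key. This is only a sketch under stated assumptions: Python stands in for Perl, a CSV string stands in for an Excel export, an in-memory SQLite database stands in for Oracle, and the gene identifiers, descriptions, and function names are all invented for illustration.

```python
import csv
import io
import sqlite3

# Made-up spreadsheet export (CSV) of expression measurements.
SPREADSHEET_CSV = """gene_id,log_ratio
GENE_A,1.8
GENE_B,-0.4
GENE_C,2.3
"""

# Made-up annotation rows held in a relational database.
ANNOTATION_ROWS = [
    ("GENE_A", "hypothetical kinase"),
    ("GENE_C", "hypothetical transporter"),
    ("GENE_D", "hypothetical transcription factor"),
]


def load_spreadsheet(text):
    """Read the CSV text into a dict mapping gene_id to log_ratio."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["gene_id"]: float(row["log_ratio"]) for row in reader}


def build_database(rows):
    """Create an in-memory SQLite database holding the annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (gene_id TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", rows)
    return conn


def merge_on_gene_id(expression, conn):
    """Join the two sources on the shared key, keeping genes found in both."""
    query = "SELECT gene_id, description FROM annotation ORDER BY gene_id"
    return [
        (gene_id, description, expression[gene_id])
        for gene_id, description in conn.execute(query)
        if gene_id in expression
    ]


expression = load_spreadsheet(SPREADSHEET_CSV)
conn = build_database(ANNOTATION_ROWS)
for record in merge_on_gene_id(expression, conn):
    print(record)
conn.close()
```

The only agreement the two sides need is the rule relating the tables, here that both use the same `gene_id` values. The storage format on either side does not matter as long as the data are available.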
Q: Are there places where systems biologists are taking shortcuts because it is too expensive (in time and money) computationally to use a more rigorous method?
A: I don't know of circumstances where we are computationally limited. If we have, let's say, a 16-processor Origin-type computer, the petaFLOP (quadrillion floating-point operations per second) world doesn't seem to me to offer very much right now. However, the protein folders/molecular dynamics folks clearly believe that if they could look at the molecules wiggling for a few milliseconds instead of a few microseconds ... they would learn a lot. That could well be true, in which case I think they ought to have the opportunity to test that.
In genomics, the heaviest issues have to do with multiple alignments of very many genomes at once. That can take a lot of computation, but even there we are talking about a day or two on an Origin; we are not talking about petaFLOPS ... On the other hand, throughout the entire history of computation [we have] underestimated the value of having more. So as long as Moore's Law goes on, I am going along. Would I tell the government to spend billions, or even tens of millions, on increasing computational power? That question was put to us in the BISTI [Biomedical Information Science and Technology Initiative] report a few years ago, and the universal answer was 'no.' The weather guys need that, the bomb builders need that, the molecular dynamicists need that ... but in genomics, at the gene level, not yet. Maybe some day ...
Q: What will be the role of relational databases in systems biology? What alternative data management tools and structures might serve the field better?
A: This is more of a religious question. Relational databases are useful, a good way of keeping the data in order. Atomizing the data is always a good thing; programming styles come and go. For a long time we had object-oriented everything; now we have some object-oriented everything and some just straight code. I think this is an intensely technical question, and to that extent I take an evolutionary view: whatever works!