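The analysis of proteomes, particularly those of lower model organisms, has recently become a research hotspot. Information about a protein includes not only its own function but also its temporal expression pattern and spatial localization within the whole organism, as well as the coordination among proteins. Studying the interactions among all proteins is therefore of great significance for revealing the nature of life processes; this work begins with model organisms and is gradually extending to higher organisms, progressively revealing the characteristics of their life processes.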
Nature 425, 686 - 691 (16 October 2003); doi:10.1038/nature02026
Global analysis of protein localization in budding yeast
WON-KI HUH1,*, JAMES V. FALVO1,*, LUKE C. GERKE1, ADAM S. CARROLL1, RUSSELL W. HOWSON1, JONATHAN S. WEISSMAN1,2 & ERIN K. O'SHEA1
1 Howard Hughes Medical Institute, University of California–San Francisco, Department of Biochemistry and Biophysics, 600 16th Street, San Francisco, California 94143-2240, USA
2 Department of Cellular and Molecular Pharmacology, 600 16th Street, San Francisco, California 94143-2240, USA
* These authors contributed equally to this work
Correspondence and requests for materials should be addressed to E.K.O. (oshea@biochem.ucsf.edu).
A fundamental goal of cell biology is to define the functions of proteins in the context of compartments that organize them in the cellular environment. Here we describe the construction and analysis of a collection of yeast strains expressing full-length, chromosomally tagged green fluorescent protein fusion proteins. We classify these proteins, representing 75% of the yeast proteome, into 22 distinct subcellular localization categories, and provide localization information for 70% of previously unlocalized proteins. Analysis of this high-resolution, high-coverage localization data set in the context of transcriptional, genetic, and protein–protein interaction data helps reveal the logic of transcriptional co-regulation, and provides a comprehensive view of interactions within and between organelles in eukaryotic cells.
Eukaryotic cells are organized into a complex network of membranes and compartments, which are specialized for various biological functions. Comprehensive knowledge of the location of proteins within these cellular microenvironments is critical for understanding their functions and interactions; this requires assaying the cell's full complement of proteins. The complete genome sequence of the budding yeast Saccharomyces cerevisiae1 coupled with high-throughput experimental techniques has made systematic analyses of a eukaryotic proteome feasible. Recent studies have taken a genome-wide approach to analysing messenger RNA abundance and stability2, 3, biochemical activity4, 5, protein–protein interactions6-9, transcriptional regulation10, gene disruption phenotypes11-14 and protein abundance15.
Previous large-scale analyses of protein localization in S. cerevisiae have depended on transposon-mediated random epitope tagging and plasmid-based overexpression of epitope-tagged proteins11, 16. However, epitope tagging of partial open reading frames (ORFs) can interrupt important localization signals, and overexpression of proteins may saturate intracellular transport mechanisms, leading to abnormal subcellular localization. To circumvent these potential problems, we generated a yeast strain collection expressing full-length proteins, tagged at the carboxy terminal end with green fluorescent protein (GFP), from their endogenous promoters by inserting the coding sequence of Aequorea victoria GFP (S65T)17 in-frame immediately preceding the stop codon of each ORF. With this strategy, wild-type levels and patterns of protein expression are minimally perturbed. Furthermore, because GFP fluorescence does not require external cofactors, GFP signal can be monitored in living cells without disrupting cellular integrity. We have analysed this strain collection using fluorescence microscopy to comprehensively characterize protein subcellular localization in a simple eukaryotic cell.
Construction and analysis of a GFP-tagged library
We systematically tagged each ORF in its chromosomal location through oligonucleotide-directed homologous recombination (Fig. 1a). For each of the 6,234 annotated ORFs18 a pair of oligonucleotides was generated that had homology to the desired chromosomal insertion site at the 5' end of each primer and homology to a vector containing the GFP tag at the 3' end. These primers were used to amplify the GFP tag and an auxotrophic marker from a plasmid template19, and the resulting polymerase chain reaction (PCR) products were transformed into a haploid yeast strain. Transformants were assayed by genomic PCR with one primer specific for the GFP tag and a second specific for each ORF, to determine whether the cassette had integrated at the appropriate locus. A total of 6,029 strains with chromosomally GFP-tagged ORFs were grown to mid-logarithmic phase in synthetic medium and analysed by fluorescence microscopy; 4,156 of these showed GFP signals above background levels (Table 1).
Figure 1 Microscopic analysis of yeast strains expressing GFP-tagged proteins.
Micrographs of each GFP-tagged strain (Fig. 1b; see also Supplementary Fig. S1), lacking ORF identifiers, were independently evaluated by two scorers and initially classified into one or more of 12 subcellular localization categories (Table 2). We then refined these categories by performing a series of co-localization experiments. Haploid reference strains expressing monomeric red fluorescent protein (mRFP)20 fusions to proteins whose localization had been characterized previously (Table 2) were mated to approximately 700 GFP strains that were not assigned definitive localizations by GFP microscopy alone, and the resulting diploid cells were analysed by fluorescence microscopy (Fig. 1c). On the basis of this analysis, proteins were assigned to an additional 11 localization categories (Table 2). All information was captured into a database (http://yeastgfp.ucsf.edu).
Subcellular localization of yeast proteins
The 4,156 proteins for which we defined subcellular localizations in the GFP library represent 75% of the yeast proteome15, 21, 22. Our results provide localization data for about 70% of previously unlocalized yeast proteins, constituting about 30% of the proteome (Fig. 2a). Over 90% of the proteins visible in the GFP collection were also detected by western blot analysis of a collection of TAP (tandem affinity purification)-tagged strains15, suggesting that the false-positive rate in this study is extremely low.
Figure 2 Subcellular localization of yeast proteins.
The distribution of protein subcellular localization reveals that, as expected, many proteins are found in the nucleus or cytoplasm, whereas 1,839 proteins, 44% of the total observed, localize to other specific subcellular regions (Fig. 2b). Notably, over 40% of the proteins that we assigned to the cytoplasm, late Golgi/clathrin and lipid particle represent new localization assignments. There are limitations to the subcellular localizations in yeast discernible by fluorescence microscopy; for example, we cannot distinguish kinetochore versus spindle pole body, or membrane versus lumen for mitochondria or the endoplasmic reticulum. However, use of the GFP tag and co-localization with RFP-tagged reference proteins allowed us to resolve many related subcellular compartments with confidence. For example, the nucleus, nuclear periphery and the endoplasmic reticulum are distinct (Fig. 1b, top row), as are the vacuole and vacuolar membrane, and multiple compartments of the secretory pathway. This level of precision greatly facilitates our assignment of protein localization as well as integration with other genome-wide data sets.
Previously published localization data from the Saccharomyces Genome Database (SGD)18, including data from earlier large-scale studies11, 16, were available for a total of 2,526 proteins visible in the GFP library—we found that there was 80% agreement between our data and those of the SGD. We also found that our localization assignments generally agree with those of the pioneering studies of the Snyder laboratory11, 16. However, for those assignments that differ, our results show closer agreement with the SGD (Supplementary Fig. S2). Direct comparison between our data and the results of a mass spectrometric analysis of the nuclear pore complex23 (NPC) revealed that, of 29 identified NPC components, 25 were visible in our study: 23 proteins (92%) were localized to the nuclear periphery and one each was localized to the nucleus/cytoplasm and endoplasmic reticulum. Furthermore, of 16 spindle-pole-body components identified by mass spectrometry24, all 14 of the proteins visible in this study were localized to the spindle pole. We found an additional 20 proteins localized to the nuclear periphery and 14 to the spindle pole; of these, 11 had not been detected previously in the nuclear periphery and 7 had not been detected in the spindle pole (Supplementary Table S1). The strong correlation between the data we obtained by fluorescence microscopy and localization data obtained by other methods supports the reliability of this study in defining new protein localizations.
A potential source of discrepancy between our data and those from other studies is that the C-terminal fusion of the GFP protein (approximately 27 kDa) may cause mislocalization through steric hindrance or interruption of critical C-terminal localization/retention sequences (Supplementary Table S2). For example, the small GTP-binding protein Ras2 was localized to the nucleus and the cytoplasm in this study, but it is known to be localized to the plasma membrane due to modification of its C terminus with palmitoyl and farnesyl groups25. Proteins localized to the cell wall26 and subsets of proteins localized to the peroxisome27 and endoplasmic reticulum28 also contain C-terminal targeting signals, and these were often mislocalized in this study.
Organellar proteomics of the nucleolus
The identification of subsets of proteins in various organelles is an initial step towards the understanding of biological processes at the cellular level. 'Organellar proteomics' studies29 would benefit especially from the comprehensive localization data for yeast proteins provided by this study. For example, we detected 164 proteins in the nucleolus in this study; 82 of these overlap the 127 nucleolar proteins catalogued in the SGD, but another 82 are newly defined (Fig. 2c). Of the remaining 45 nucleolar proteins from the SGD, 28 were not visualized in our study, whereas the others were localized to the nucleus (7 proteins), nucleus/cytoplasm (7 proteins), nuclear periphery (2 proteins) and cytoplasm (1 protein). These proteins may occupy the nucleolus in a transient fashion, at levels not detectable by our methods, or under conditions distinct from those of our study—mislocalization may also result from the GFP tag. A number of the nucleolar proteins found in this study are involved in ribosomal RNA transcription and processing and in ribosome biogenesis, in accordance with the classical role of the nucleolus; for some of these proteins, we provide the first direct demonstration that they reside or are enriched in the nucleolus (Fig. 2d). Given that some nucleolar proteins are involved in cell cycle control and gene regulation30-33, it will be very interesting to investigate the functional roles of nucleolar proteins newly defined in this study.
It has been reported that essential proteins and orthologues are enriched in related protein complexes isolated from yeast and humans8. Of the proteins localized to the nucleolus in this study, 99 proteins (60%) are known to be essential, substantially more than the 20% required for viability in the proteome as a whole12, 14. Recently, mass spectrometric analysis of the human nucleolus identified 271 proteins, 166 of which have homologues in yeast34; 52 of these proteins are classified as nucleolar in this study (Supplementary Table S3). Of the 112 proteins remaining from the 164 proteins that we have detected in the yeast nucleolus, 73 have human homologues and 33 of these are localized to the nucleolus or have biological functions related to transcription and processing of rRNAs and ribosome biogenesis (Supplementary Table S4) according to the Human Proteome Survey Database35. Given the enrichment of essential proteins in the yeast nucleolus and the enrichment of essential proteins and orthologues in related protein complexes from yeast and humans8, we expect that many of the remaining human homologues of yeast proteins detected in the nucleolus in this study will also be nucleolar proteins.
Protein localization and mRNA co-expression
Many genome-wide analyses have demonstrated that mRNA transcript expression patterns are similar for groups of functionally related genes2, 36-38; mRNA abundance is also similar within certain cellular compartments39. However, transcriptional co-regulation has not been directly compared to subcellular protein localization on a proteome-wide scale. To assess the extent of this correlation, we made use of a study that identified 33 transcriptional 'modules' of genes with marked co-regulation based on analysis of over 1,000 microarray data sets reflecting the results of different mutant strain backgrounds or environmental perturbations38, 40. For each module, the fraction of proteins with a given subcellular localization was calculated and divided by that fraction in the whole proteome to generate fold enrichment in each subcellular localization category (Fig. 3a). We obtained statistically significant enrichments (one-sided binomial test with P < 0.05) for 19 of the 22 most highly expressed modules, indicating that co-localization is strongly correlated with transcriptional co-expression and, by extension, with biological function.
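The enrichment calculation and significance test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and all counts are hypothetical, and the P-value is assumed to be the upper tail of a binomial distribution whose success probability is the frequency of the localization in the reference set.

```python
from math import comb


def fold_enrichment(k_module, n_module, k_ref, n_ref):
    """Frequency of a localization in a module divided by its frequency in the reference set."""
    return (k_module / n_module) / (k_ref / n_ref)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), used here as a one-sided enrichment P-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts for illustration only (not taken from the paper).
n_ref, k_ref = 4000, 500       # reference set: proteins scored, proteins in the category
n_module, k_module = 40, 15    # one module: proteins scored, proteins in the category

enrichment = fold_enrichment(k_module, n_module, k_ref, n_ref)
p_value = binomial_upper_tail(k_module, n_module, k_ref / n_ref)
print(f"fold enrichment = {enrichment:.2f}, one-sided P = {p_value:.3g}")
```

Repeating this calculation for each module and each localization category would give the kind of grid summarized in Fig. 3a, with cells below the chosen P-value threshold marked as significant.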
Figure 3 Correlation between transcriptional co-regulation and subcellular localization.
The combination of protein localization and transcriptional co-expression can be used to corroborate or predict the function of unnamed ORFs in a specific module. For example, YGL068W and YNL122C, both of which belong to the mitochondrial ribosomal protein transcriptional module, localize to the mitochondrion in our study, as do 13 other members of this module, strongly supporting the function predicted by the module (Fig. 3b). Indeed, the sequence of YGL068W shows 49% similarity to that of the human mitochondrial ribosomal protein L12 (ref. 41).
Localization and co-regulation data can also be used to gain insight into biological function when proteins in a given transcription module are enriched in more than one localization category. This allows us to subdivide sets of co-expressed proteins, providing a level of information that cannot be gleaned solely from their classification in the same module based on their expression profiles. For example, proteins in the G1 module (representing processes coordinated at the G1/S transition) localize to three basic categories: nucleus, bud/bud neck and spindle pole (Fig. 3c). The basic functions of proteins in the G1 module, where known, can be divided by localization; proteins localized to the bud/bud neck are involved in bud formation, whereas nuclear proteins from this module are involved mainly in chromosome cohesion, transcription, and DNA replication, repair and recombination. Thus, given that the G1 module proteins Hif1, Hsn1, YGR151C, YKR077W and YMR144W are localized to the nucleus, it is likely that they share the functions of nuclear proteins from this module.
Comparison with genetic and physical interactions
Recent genome-wide studies have sought to enumerate all protein–protein interactions that occur in S. cerevisiae6-9. Despite the large scale of these efforts, the lack of agreement between studies42 suggests that total coverage is poor and false-positive rates remain high. To interact physically, proteins must exist in close proximity, at least transiently, suggesting that co-localization may be an effective means for evaluating hypothetical interactions. To assess the relationship between co-localization and interaction, we chose as a reference set the sum of all genetic and protein–protein interactions reported in the GRID database43. Although this set is certain to contain a considerable fraction of false-positive interactions, it was chosen to minimize systematic bias in individual screens that inevitably results from alternative interaction detection methods. We determined the subcellular localizations of each interacting protein pair from this reference set and the fraction of the total number of interactions occurring for each localization pair. A set of randomized protein pairs was also generated from the whole proteome, and localization pair statistics were collected on this set in the same way. We calculated the fold enrichment observed for each localization pair in our reference data set as compared with the randomized data set to generate an interaction matrix (Fig. 4a).
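The construction of the interaction matrix can be sketched roughly as below. This is an illustrative sketch under stated assumptions: the protein names, localizations and interactions are toy data, the function names are hypothetical, and a protein with several localizations is assumed to contribute one count per combination of categories, which the text does not specify.

```python
import random
from collections import Counter
from itertools import product


def pair_fractions(pairs, localization):
    """Fraction of protein pairs that fall into each unordered localization pair."""
    counts = Counter()
    for a, b in pairs:
        # Assumption: multiply localized proteins count once per category combination.
        for loc_a, loc_b in product(localization[a], localization[b]):
            counts[tuple(sorted((loc_a, loc_b)))] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def enrichment_matrix(interactions, localization, n_random=10000, seed=0):
    """Fold enrichment of each localization pair relative to randomly drawn protein pairs."""
    rng = random.Random(seed)
    proteins = sorted(localization)
    random_pairs = [tuple(rng.sample(proteins, 2)) for _ in range(n_random)]
    observed = pair_fractions(interactions, localization)
    expected = pair_fractions(random_pairs, localization)
    return {key: frac / expected[key] for key, frac in observed.items() if key in expected}


# Toy data for illustration only (hypothetical proteins A to F).
localization = {
    "A": {"nucleus"}, "B": {"nucleus"},
    "C": {"cytoplasm"}, "D": {"cytoplasm"},
    "E": {"microtubule"}, "F": {"microtubule"},
}
interactions = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "C")]

matrix = enrichment_matrix(interactions, localization)
for key, value in sorted(matrix.items()):
    print(key, round(value, 2))
```

Diagonal entries of the resulting dictionary correspond to interactions within one compartment, and off-diagonal entries to interactions between two different compartments.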
Figure 4 Relationship between genetic and physical interactions and subcellular localization.
This analysis supports and extends interaction data from other studies. As expected, interactions are strongly enriched between proteins that co-localize (one-sided binomial test with P < 0.001), but the degree of enrichment varies widely by compartment. For example, interactions between cytoplasmic proteins are 1.3-fold enriched above chance, whereas interactions between microtubule proteins are 56-fold enriched above chance, implying that co-localization of two putative interacting proteins to the microtubule cytoskeleton provides better evidence of physical and functional interaction than the simple fact that they do co-localize.
Of particular interest is that enrichment in interactions was observed between distinct localization categories in our study, shown as red off-diagonal circles in the matrix (Fig. 4a). Such off-diagonal circles are indicative of functional relationships between subcellular localizations: they are neither the result of systematic errors from individual interaction data sets (Supplementary Fig. S3) nor of proteins being assigned multiple localizations (Supplementary Fig. S4). An extensive network of interactions occurs between proteins localized to the actin cytoskeleton and those localized to the bud neck (Fig. 4b). Notably, some proteins not previously known to localize to these regions have known or predicted functions consistent with their assigned localizations. The Rho-GTPase activator protein Bem2, for example, has been shown through genetic studies to be involved in bud growth, establishment of cell polarity, and organization and biogenesis of the actin and microtubule cytoskeleton and of the cell wall44. These functions are consistent both with localization to the bud neck and functional interaction with the actin cytoskeleton. Similarly, Chs7 is involved in cell wall chitin biosynthesis45, consistent with the role of certain bud neck proteins46, and Akl1 has a predicted role in the organization of the actin cytoskeleton47.
The biological importance of statistically significant interactions between localization categories in the GFP library is fully revealed when these interactions are considered in the context of a eukaryotic cell (Fig. 4c). A network of interactions connects subcellular regions that are functionally and physically related. Strongly interconnected localizations can reflect dynamic interchange of proteins between compartments; for example, compartments of the secretory pathway (Golgi, early Golgi/COPI (coat protein I) and late Golgi/clathrin). Intercompartmental interactions can also reflect close proximity and extensive physical association between localization categories, as is the case for the bud neck, the bud and the actin cytoskeleton. The interaction matrix provides an overview of communication between subcellular compartments as well as a template for evaluating the validity of protein–protein interactions from large-scale experimental or theoretical data sets.
Discussion
By creating a GFP-tagged yeast strain collection and database that covers three-quarters of the proteome and over two-thirds of previously unlocalized proteins, we have provided an experimental and informational resource to the scientific community. Although we have presented an analysis of the yeast proteome in a nominal resting state, the GFP library serves as a starting point for understanding the complex state of flux in the eukaryotic proteome that underlies the survival and development of an organism. The library provides a tool for analysing the global dynamics of the proteome in response to specific external stimuli or growth conditions over a selected period of time. Similarly, the library can be used in combination with high-throughput strain construction techniques13 to assay the effects of deletion or mutation of a protein of interest on global protein localization. Complex regulatory networks responsible for targeting proteins to specific cellular compartments can thus be systematically dissected.
We have shown that the combination of high-resolution, high-accuracy, proteome-wide localization information with data from other proteomics-scale studies provides an independent dimension of information that reveals patterns not visible within a single data set. The localization data from the GFP library can confirm and extend predictions based on trends within a single data set; if proteins grouped together in a given data set have a common localization, the prediction of common function is strengthened. This is particularly useful in the case of proteins for which little functional data exists. The localization data also make it possible to subdivide groups of proteins related by genome-wide trends in other data sets, indicating that one group may be composed of subsets of proteins with even more specific, separate biological roles. A comparative proteomics approach promises to reveal important features of basic cellular processes, improving our understanding of S. cerevisiae and of the proteins and pathways conserved among eukaryotes.
Methods
Construction of GFP-tagged yeast strains. To construct a chromosomally GFP-tagged library, 6,234 pairs of gene-specific oligonucleotide primers were synthesized, each of which had been designed to share complementary sequences to the GFP tag-marker cassette at the 3' end and contain 40 base pairs (bp) of homology with a specific gene of interest to allow in-frame fusion of the GFP tag at the C-terminal coding region of the gene. Gene-specific cassettes containing a C-terminally positioned GFP tag were then generated by PCR using as a template pFA6a–GFP(S65T)–His3MX, which contains the Schizosaccharomyces pombe his5+ gene and permits selection of transformed strains in histidine-free media19. The haploid parent yeast strain (ATCC 201388: MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0) was transformed with the PCR products, and strains were selected in SD medium (synthetic medium plus dextrose, Difco) lacking histidine. Insertion of the cassette by homologous recombination was verified by genomic PCR of samples from individual colonies with a primer internal to the GFP tag and a separate set of ORF-specific primers designed to produce a product of approximately 500 bp. Strains representing 6,029 ORFs were successfully tagged with GFP (Table 1), and independent strains from two to six selected colonies from each ORF were analysed by fluorescence microscopy.
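The primer layout described above (gene-specific homology at the 5' end, cassette-annealing sequence at the 3' end) can be illustrated with a short sketch. The function names, the toy ORF and the two annealing sequences are hypothetical placeholders rather than the sequences used in the study, and the forward annealing sequence is assumed to keep the tag in frame.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def design_tagging_primers(orf, downstream, fwd_anneal, rev_anneal, homology=40):
    """Return (forward, reverse) primers for C-terminal tagging, both written 5' to 3'.

    orf        -- coding sequence including its stop codon
    downstream -- genomic sequence immediately after the stop codon
    fwd_anneal -- sequence annealing to the start of the tag cassette (placeholder)
    rev_anneal -- sequence annealing to the end of the marker cassette (placeholder)
    """
    orf = orf.upper()
    if orf[-3:] not in {"TAA", "TAG", "TGA"}:
        raise ValueError("ORF must end with a stop codon")
    coding = orf[:-3]  # drop the stop codon so the tag is fused to the last sense codon
    if len(coding) < homology or len(downstream) < homology:
        raise ValueError("not enough sequence for the requested homology length")
    forward = coding[-homology:] + fwd_anneal
    reverse = reverse_complement(downstream[:homology]) + rev_anneal
    return forward, reverse


# Toy sequences for illustration only.
toy_orf = "ATG" + "GCT" * 20 + "TAA"
toy_downstream = "ACGT" * 12
FWD_ANNEAL = "GGTGGTGGTGGTGGTGGTGG"   # hypothetical placeholder
REV_ANNEAL = "CCTCCTCCTCCTCCTCCTCC"   # hypothetical placeholder

forward, reverse = design_tagging_primers(toy_orf, toy_downstream, FWD_ANNEAL, REV_ANNEAL)
print(forward)
print(reverse)
```

The reverse primer is taken from the opposite strand, so its homology arm is the reverse complement of the sequence immediately downstream of the stop codon.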
Microscopic imaging of GFP-tagged strains. Aliquots of strains grown to mid-logarithmic phase in SD medium lacking histidine were analysed in 96-well glass-bottom microscope slides (BD Falcon) pre-treated with concanavalin A (50 µg ml⁻¹) to ensure cell adhesion. Cells were incubated in SD medium containing 1 µg ml⁻¹ 4',6-diamidino-2-phenylindole (DAPI) as a marker for the nucleus and mitochondria, and analysed by multiple wavelength fluorescence and visible light microscopy with a digital imaging-capable Nikon TE200/300 inverted microscope using an oil-immersed objective at 100× magnification. Using a script in MetaMorph version 4.6r8 imaging software (Universal Imaging Corporation), fluorescence microscopy images for GFP, DAPI and Nomarski/DIC (differential interference contrast) images were taken in rapid succession, and the stage was automatically advanced between wells on the 96-well slide.
Localization category refinement by co-localization. Subcellular localizations that could not be assigned readily by GFP fluorescence alone (typically classified as punctate or non-uniform nuclear) were resolved by mating the GFP-tagged strains to strains expressing reference proteins (Table 2) fused to mRFP20. The coding sequence for mRFP was amplified by PCR from its parent vector (pRSET–mRFP1)20 and inserted into pFA6a–KanMX6, which carries the Escherichia coli kanamycin-resistance gene19, to create the plasmid pFA6a–mRFP–KanMX6. This vector was then used to generate gene-specific cassettes to yield reference strains expressing C-terminally mRFP-tagged proteins. The haploid parent strain (ATCC 201389: MATα his3Δ1 leu2Δ0 lys2Δ0 ura3Δ0) was transformed, selected in the presence of G418 sulphate (200 µg ml⁻¹), and analysed for positive RFP signal by fluorescence microscopy as described above. The GFP-tagged strains were then mated with at least one of the mRFP-tagged reference strains in SD medium lacking lysine and methionine, and the resulting diploid strains were analysed by microscopy to generate GFP, RFP, DIC and GFP–RFP merged images. Haploid strains exhibiting potential non-uniform mitochondrial GFP patterns were subjected to the same microscopic analysis using the mitochondrion-specific dye MitoTracker red CMXRos (Molecular Probes).
Database features. We have designed a publicly available web-based user interface to the localization database at http://yeastgfp.ucsf.edu. At this site, users can perform searches using a number of criteria, including ORF name, gene name, subcellular localization, cell cycle, cell morphology, cell–cell brightness variability and subcellular signal heterogeneity. Searches retrieve full-sized, lossless compressed images that were used to assign localizations in this study; specific cells used to justify localizations are indicated in the images.
Comparison with other data sets. The distribution of subcellular localizations exhibited by the test ORF set (groups of transcriptionally co-regulated genes38, 40) was assessed in comparison to a reference set (the localization distribution seen in all ORFs characterized in this study). The identity of genes in the modules can be found at http://barkai-serv.weizmann.ac.il/modules/page/details.html using a threshold cutoff of 4.0. The frequency with which each subcellular localization is observed in the test and reference set was calculated; the ratio of these frequencies is reported as the enrichment. Individual binomial tests were performed for each subcellular localization to accept or reject the null hypothesis that the measured enrichment occurred due to chance. A one-tailed P-value <0.05 is taken to be statistically significant and is indicated by red circles in Fig. 3a. Distribution of subcellular localization of interacting partners was assessed by comparison to that which would occur by random association of ORFs, giving an enrichment of interactions between localizations. Individual binomial tests confirm that enrichment for certain localization pairs, indicated in Fig. 4a by red circles, is not the product of sampling error (P < 0.001).
Supplementary information accompanies this paper.