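This issue of Nature (16 October 2003) publishes two research papers on proteome-wide protein expression and interactions in yeast. As one of the most important model organisms, yeast provides a forward-looking testbed for future studies of more complex systems, particularly protein interactions in higher organisms.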
Nature 425, 737 - 741 (16 October 2003); doi:10.1038/nature02046
Global analysis of protein expression in yeast
SINA GHAEMMAGHAMI1,2, WON-KI HUH1,3, KIOWA BOWER1,2, RUSSELL W. HOWSON1,3, ARCHANA BELLE1,3, NOAH DEPHOURE1,3, ERIN K. O'SHEA1,3 & JONATHAN S. WEISSMAN1,2
1 Howard Hughes Medical Institute, University of California–San Francisco, San Francisco, California 94143-2240, USA
2 Department of Cellular & Molecular Pharmacology, University of California–San Francisco, San Francisco, California 94143-2240, USA
3 Department of Biochemistry & Biophysics, University of California–San Francisco, San Francisco, California 94143-2240, USA
Correspondence and requests for materials should be addressed to J.S.W. (jsw1@itsa.ucsf.edu).
The availability of complete genomic sequences and technologies that allow comprehensive analysis of global expression profiles of messenger RNA1-3 have greatly expanded our ability to monitor the internal state of a cell. Yet biological systems ultimately need to be explained in terms of the activity, regulation and modification of proteins, and the ubiquitous occurrence of post-transcriptional regulation makes mRNA an imperfect proxy for such information. To facilitate global protein analyses, we have created a Saccharomyces cerevisiae fusion library where each open reading frame is tagged with a high-affinity epitope and expressed from its natural chromosomal location. Through immunodetection of the common tag, we obtain a census of proteins expressed during log-phase growth and measurements of their absolute levels. We find that about 80% of the proteome is expressed during normal growth conditions, and, using additional sequence information, we systematically identify misannotated genes. The abundance of proteins ranges from fewer than 50 to more than 10⁶ molecules per cell. Many of these molecules, including essential proteins and most transcription factors, are present at levels that are not readily detectable by other proteomic techniques nor predictable by mRNA levels or codon bias measurements.
The diverse chemical nature of proteins makes the development of globally applicable proteomic assays very challenging. We have overcome this obstacle in the yeast S. cerevisiae by individually tagging each of its annotated open reading frames (ORFs) with a high-affinity epitope tag so that the resulting fusion proteins are expressed under the control of their natural promoters. The fusion library allows the immunodetection and immunopurification of the entire yeast proteome using a single antibody, enabling the development of a range of high-throughput functional assays. To allow for the facile construction of epitope-tagged yeast fusion libraries, we synthesized 6,234 pairs of ORF-specific oligonucleotide primers. Each of the oligonucleotide pairs has shared 3' ends that allow for polymerase chain reaction (PCR) amplification of a common insertion cassette, as well as gene-specific 5' ends that allow for the precise introduction, through homologous recombination, of the amplified insertion cassettes as a perfect in-frame fusion at the carboxy-terminal end of the coding region of each gene4 (Fig. 1a). The insertion cassettes contained the coding region for a modified version of the tandem affinity purification (TAP) tag5, 6, which consists of a calmodulin binding peptide, a TEV cleavage site and two IgG binding domains of Staphylococcus aureus protein A, as well as a selectable marker (see Supplementary Information). In total, we obtained successful integrants for 98% of all ORFs annotated in the Saccharomyces genome database (as of April 2001; http://www-genome.stanford.edu/Saccharomyces), including 93% of all essential ORFs7 in haploid yeast.
Figure 1 Tagging and detection of the yeast proteome.
Western blot analysis, using an antibody that specifically recognizes the TAP tag, demonstrated that the large majority (>95%) of detected fusion proteins migrate predominantly as a single band of the approximate expected molecular mass (Fig. 1b). Furthermore, analysis of two known cell-cycle-regulated proteins, Clb2 and Sic1 (refs 8, 9), indicated that the tagging does not hinder their regulated proteolysis by the ubiquitin/proteasome degradation system and that the TAP tag itself is rapidly destroyed during the targeted degradation of the fusion protein (Fig. 1c). These and other data6 suggest that the function, regulation and stability of most, but not all (see Supplementary Information), of the proteome is uncompromised by the fused tag.
We observed a protein product for 4,251 of the TAP-tagged ORFs by comprehensive western blot analysis. This set of proteins shows excellent overlap (>90%) with the set of green fluorescent protein (GFP) fusion proteins detected by fluorescence microscopy10 (Fig. 2a), and together the two data sets indicate that at least 4,517 proteins are expressed during log-phase growth in rich media. We detect 79% of all essential proteins and 83% of gene products corresponding to ORFs with assigned gene names. By contrast, only 73% of all annotated ORFs expressed a detectable protein product (Fig. 2b). This discrepancy largely results from the presence of spurious ORFs in the annotated yeast genome database stemming from well-known difficulties in distinguishing actual coding regions from fortuitous short ORFs11, 12. For the original annotation of the yeast genome, an arbitrary cut-off of 100 codons was used to qualify ORFs as potential genes, leading to an anomalous peak centred between 100 and 150 amino acids in the sequence length distribution (Fig. 2c, black)13 of the genome that is not present in the length distribution of the subset of named genes (Fig. 2c, green). Importantly, although we tagged and analysed all potential ORFs, the length distribution of the subset of observed proteins did not contain the above artefactual peak (Fig. 2c, red), indicating that our analysis of expressed genes has a very low false-positive rate (see also Supplementary Information).
Figure 2 Analysis of proteins expressed during log-phase growth.
A number of bioinformatics approaches, including recent analyses of the genomic sequences of a number of related yeast species, have been used to distinguish between the real and misannotated ORFs14-17, although the true number and identity of the spurious ORFs remain unclear. Our results offer experimental verification for a large number of hypothetical genes (we observed 1,018 protein products belonging to functionally uncharacterized ORFs), and yield a large, experimentally validated set to evaluate the success of computational methods for identifying falsely annotated genes. By combining our protein expression data with a novel metric, termed the codon enrichment correlation (CEC), which evaluates the patterns of codon usage in potential ORFs, we identified a set of 525 potentially spurious ORFs (listed in Supplementary Information) that have codon compositions not characteristic of genuine genes and did not yield detectable protein products (Fig. 2d, Methods). On the basis of the CEC distribution of genuine ORFs, we estimate that this list is contaminated by 20 genuine coding sequences. Our proteomics-based approach complements the comparative genomics strategy for identifying spurious ORFs16. The large majority (all but seven) of the 496 spurious ORFs suggested by Kellis et al.16 were not observed in our TAP and GFP studies. The set of spurious ORFs that we identified overlaps well with those detected by this cross-species genome study (381 genes were identified as spurious by both studies), and expands the set by 144 ORFs. Among these 144 ORFs are a large number of sequences that overlap with real genes on the opposite strand, and therefore are difficult to distinguish through homology analysis.
After discounting the spurious ORFs, there remain 1,000 genuine coding regions that did not produce a detectable protein product. To determine if the unobserved proteins belong to classes of genes that are not transcribed during normal log-phase growth conditions, we compared our results with global transcriptional array data. A recent analysis of mRNA expression profiles from 1,000 published microarray experiments allowed for the identification of 33 'modules' of transcriptionally co-regulated genes18, 19. For modules that are expressed in log phase (for example, those coding for housekeeping functions, such as ergosterol and amino-acid biosynthesis and cell cycle), we were able to detect the large majority of the protein products (Fig. 3). By contrast, modules composed of genes involved in functions required only under specialized conditions (for example, meiosis/sporulation and alternative nitrogen utilization) generally produced few detectable proteins.
Figure 3 Functional categorization of proteins expressed during log-phase growth in rich medium.
We took advantage of the fact that all gene products were detected using the same epitope/antibody interaction to measure the absolute abundance of each of the tagged proteins using quantitative western blot analyses. This effort was facilitated by the inclusion of internal standards in each gel (Fig. 1b). We find that the levels of different proteins show an enormous dynamic range, varying from fewer than 50 to more than 10⁶ molecules per cell (Fig. 4a, b). The results show that previous efforts to quantify protein levels using two-dimensional gel electrophoresis or mass spectrometry were strongly biased towards the detection of abundant proteins (Fig. 4a, see also Supplementary Fig. S3)20-23. For example, a recent study using mass spectrometry and isotope labelling succeeded in quantitatively monitoring changes in the abundance of 688 yeast proteins22. For the most abundant proteins (>50,000 molecules per cell) the coverage was excellent (60%), whereas for the 75% of the proteome that is present at fewer than 5,000 molecules per cell, only 8% of the proteins were observed. Another mass-spectrometry effort that focused on detecting, without directly quantifying, the complement of proteins in log-phase yeast23 observed a larger number (1,484) of proteins, although it was also biased towards abundant proteins (90% of the proteome present at >50,000 molecules per cell was detected, whereas only 19% of the proteome present at fewer than 5,000 molecules per cell was observed). Our validated list of expressed proteins will help evaluate future advances in mass spectrometry approaches24.
Figure 4 Abundance distribution of the yeast proteome.
Overall, we observe a significant relationship between mRNA levels, as measured by an earlier microarray analysis of log-phase yeast25, and protein levels (Spearman rank correlation coefficient rs = 0.57). Very abundant mRNAs generally encode abundant proteins, and the average protein per mRNA ratio remains remarkably constant throughout the full range of mRNA abundances (Fig. 4c, middle, and Supplementary Fig. S4). The average protein per mRNA ratio is 4,800 using this measure of mRNA levels, and is 4,200 using an alternative mRNA abundance measurement based on a microarray analysis comparing mRNA to genomic DNA levels26 (Supplementary Fig. S4). However, individual genes with equivalent mRNA levels can result in large differences in protein abundances (Fig. 4c, top). To assess if this variability was primarily caused by protein measurement error and/or disruption of protein function by the TAP tag, we performed further triplicate measurements of protein abundances on a subset of 206 essential, soluble proteins (see Supplementary Information); the selected strains grew robustly, showing that the tagged proteins were functional. This subset also shows a high degree of protein to mRNA variability relative to our measurement error, indicating that the large differences in individual protein to mRNA ratios are not due primarily to noise in the protein abundance measurements or disruption of the protein by the tag (Fig. 4c, bottom). However, the correlation between mRNA and protein levels is somewhat greater (rs = 0.66), suggesting that the disruption of protein by the TAP tag or difficulty in analysing membrane proteins may have contributed to some of the variation. We also observed a significant relationship (rs = 0.55) between protein abundance and codon usage as measured by the codon adaptation index (CAI)27. Protein abundances drop rapidly for genes with CAI values <0.2, explaining the difficulty that previous proteomic approaches have typically had in detecting these proteins22. But on an individual gene basis, there is great variability that is also present in the subset of more carefully measured essential, soluble proteins (Fig. 4d).
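The rank correlations quoted above (rs between mRNA and protein levels, and between protein abundance and CAI) can be illustrated with a minimal sketch. It assumes the standard definition of the Spearman coefficient as the Pearson correlation of average ranks; the function names and the five example values are hypothetical and are not data from this study.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values (mRNA copies per cell, protein molecules per cell).
mrna_copies = [0.5, 1.2, 3.0, 8.0, 20.0]
protein_copies = [900, 12000, 4000, 60000, 150000]

rs = spearman(mrna_copies, protein_copies)
protein_per_mrna = [p / m for p, m in zip(protein_copies, mrna_copies)]
print(f"rs = {rs:.2f}")
print("protein per mRNA ratios:", [round(r) for r in protein_per_mrna])
```

Because the coefficient depends only on ranks, it is insensitive to the very wide dynamic range of protein abundances, whereas the per-gene protein per mRNA ratios printed alongside it expose the gene-to-gene variability discussed in the text.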
A number of observations support the argument that the full range of abundances detected in this study, including the very low expression levels, represent functionally significant amounts of the proteins. First, the analysis of transcription modules (Fig. 3) indicates that within groups of genes that are turned off during log-phase growth the corresponding proteins are not observed, even at residual levels. Second, the abundance distribution profile of the entire yeast proteome (Fig. 4b, red) is similar to the profile of the portion of the proteome whose function is required for survival under standard growth conditions (Fig. 4b, purple). This suggests that, in general, functional proteins are not under-represented amongst low-abundance proteins. Third, there are entire classes of functionally important proteins, such as transcription factors (Fig. 4b, line) and cell-cycle proteins (Supplementary Fig. S5), that are present at very low expression levels. Thus the low-abundance proteins detected and quantified in the present study represent a large and functionally important portion of the yeast proteome that is almost entirely invisible to systematic quantitative analysis by other proteomic methods.
The TAP-tagged library now makes it feasible to monitor dynamically the abundance of the yeast proteome through basic cellular events such as the cell cycle and meiosis, and will allow the determination of protein lifetimes. In addition, important subsets of proteins, such as transcription factors, can be readily studied under a more comprehensive set of conditions. This protein-based data will provide critical information for efforts to understand the logic of cellular regulatory circuits, and, by comparison to mRNA levels, the data will give insight into the nature and extent of post-transcriptional regulation.
Methods
Quantification of protein levels
Cultures (1.7 ml) of tagged strains were grown in 96-well format to log phase, and total cell extracts were examined by SDS–polyacrylamide gel electrophoresis (PAGE)/western blot analysis as described in Supplementary Information. The bands corresponding to the tagged proteins were detected using chemiluminescence and a CCD camera (FluorChem 8800, Alpha Innotech). To control for variation in extraction and loading, each blot was probed with an antibody against endogenous hexokinase in addition to the TAP-specific anti-CBP antibody. Extracts whose hexokinase signals varied by greater than a factor of 2 from the expected value were re-grown and re-analysed. A standard containing a mixture of three TAP-tagged proteins (Pgk1, Cdc19, Rpl1A) was included in each gel at one-, ten- and 100-fold dilutions. Proteins whose chemiluminescence signals were approaching saturation were re-examined by performing the western blot analysis using a tenfold dilution of the extract and/or lower exposure times during detection. Before the quantitative SDS–PAGE/western blot analysis, strains were ordered on the basis of estimates of TAP abundance from a preliminary dot-blot analysis. In order to provide a standard for the conversion of western signals to absolute protein levels, a TAP-tagged protein (Escherichia coli initiation factor A, INFA) was overexpressed in E. coli and purified to homogeneity. Yeast extracts containing serial dilutions of INFA ranging from 500 attomoles (which was the limit of detection, see Supplementary Fig. S1) to 25 picomoles were run on a gel along with extracts from 25 different yeast TAP-tagged strains representing the full range of observed protein signals (a second TAP-tagged protein (initiation factor B) was also analysed to ensure that the observed TAP signal was not influenced by the fusion protein). Comparison of the signals generated by these 25 proteins to the known standards allowed the creation of a conversion factor between the observed western blot signals and absolute protein levels. Based on the number of cells (1 × 10⁷) used for the SDS–PAGE/western blot analysis, the protein levels were then converted to measurements of protein molecules per cell.
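The final conversion step can be sketched as simple arithmetic. The example below assumes that the stated amounts of INFA standard refer to the material loaded per lane from 1 × 10⁷ cells, and that the conversion factor is a single slope fitted through the origin; the fitting form and all function names are illustrative assumptions, not the procedure documented in Supplementary Information.

```python
AVOGADRO = 6.02214076e23  # molecules per mole
CELLS_PER_LANE = 1e7      # number of cells analysed, as stated in the Methods


def molecules_per_cell(moles_in_lane, cells=CELLS_PER_LANE):
    """Convert an absolute amount of tagged protein (moles) to molecules per cell."""
    return moles_in_lane * AVOGADRO / cells


def fit_conversion_factor(signals, moles):
    """Least-squares slope through the origin, so that moles ~= factor * signal.

    Assumption: the western signal is proportional to the amount of TAP tag
    over the calibrated range.
    """
    numerator = sum(s * m for s, m in zip(signals, moles))
    denominator = sum(s * s for s in signals)
    return numerator / denominator


# Ends of the INFA dilution series quoted in the text.
lower = molecules_per_cell(500e-18)  # 500 attomoles
upper = molecules_per_cell(25e-12)   # 25 picomoles
print(f"500 amol -> about {lower:.0f} molecules per cell")
print(f"25 pmol  -> about {upper:.2e} molecules per cell")
```

Under these assumptions the 500-attomole detection limit corresponds to roughly 30 molecules per cell and the 25-picomole standard to about 1.5 × 10⁶ molecules per cell, which brackets the reported range of fewer than 50 to more than 10⁶ molecules per cell.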
In order to assess the error in our quantification, a set of 33 proteins with a range of abundances were grown in duplicate cultures, separately extracted and analysed on different gels. The replicate signals showed a linear correlation coefficient of R = 0.94, with the pairs of proteins having a median variation of a factor of 2.0. This error analysis does not account for potential alterations in the endogenous levels of the proteins caused by the fused tag, which may be particularly disruptive for small proteins (Supplementary Information) or difficulty in analysing some polytopic membrane proteins by SDS–PAGE. For dynamic measurements of protein levels (for example, the cell-cycle dependence of Clb2 and Sic1 levels shown in Fig. 1c or triplicate measurements in Fig. 4c, d) much smaller errors can be obtained by running the samples being compared side-by-side on a single gel. For quantification in the triplicate measurements shown at the bottom of Fig. 4c, d, serial dilutions of extracts containing purified TAP-tagged INFA were run on each gel.
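One way to read the replicate error quoted above is as the median fold difference between duplicate measurements. The sketch below takes that interpretation as an assumption, and the duplicate values are hypothetical rather than taken from the 33 proteins analysed.

```python
from statistics import median


def fold_difference(a, b):
    """Fold difference between two positive replicate values (always >= 1)."""
    return max(a, b) / min(a, b)


def median_fold_variation(pairs):
    """Median fold difference across a collection of replicate pairs."""
    return median(fold_difference(a, b) for a, b in pairs)


# Hypothetical duplicate measurements (molecules per cell) for five proteins.
duplicates = [(100, 150), (2000, 1000), (50, 200), (30000, 30000), (800, 2400)]
print(median_fold_variation(duplicates))
```

A fold-based measure treats a twofold over-estimate and a twofold under-estimate symmetrically, which suits abundances spanning several orders of magnitude.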
CEC and identification of spurious ORFs
Codon usage in genuine protein-coding regions deviates systematically from randomly generated ORFs, owing to both preferences in amino-acid composition and biases in the usage of synonymous codons28, and the codon enrichment correlation (CEC) provides a measure of this deviation. To calculate CEC values, we first determined the relative prevalence of the 61 amino-acid-specifying codons in the 3,753 named ORFs (Supplementary Table S1). The codon usage expected in random sequences was then calculated based on the approximate prevalence of 30% T, 30% A, 20% C and 20% G nucleotides in the yeast genome. The enrichment of each codon for the positive set is given by dividing its prevalence among the named ORFs by its expected prevalence in random sequences (Supplementary Table S1). Codon enrichments were similarly calculated for each test ORF. The CEC is the linear correlation coefficient (r) between the codon enrichments of the test ORF and the positive set (for examples, see Supplementary Fig. S2). ORFs were designated as spurious if they failed to be detected by both the TAP and GFP analyses, and they had CEC values below a cut-off of 0.25, 0.16, 0.07 or 0.06 for ORFs of size 0–150, 151–200, 201–250 and 251–300 codons, respectively. For ORFs >150 amino acids, these values were chosen so that <4.5% of the ORFs falling below these cut-offs that are not detected by the GFP or TAP analyses are genuine coding sequences. The number of genuine coding sequences contaminating our list of spurious ORFs was estimated for each size range and CEC cut-off by the following equation: N_real = N_obs × R, where N_obs is the number of detected ORFs that have a CEC value below the cut-off, and R is the ratio of unobserved to observed ORFs, as determined by the probability of detecting named ORFs for the given size range.
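The CEC calculation and the spurious-ORF rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the expected random prevalences are renormalized over the 61 sense codons (an assumption that does not affect the correlation, because it rescales every enrichment by the same constant), and the positive set must be supplied by the caller.

```python
from itertools import product

NUCLEOTIDE_FREQ = {"T": 0.30, "A": 0.30, "C": 0.20, "G": 0.20}
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [c for c in ("".join(p) for p in product("TCAG", repeat=3))
                if c not in STOP_CODONS]

# (maximum ORF length in codons, CEC cut-off), as listed in the Methods.
CEC_CUTOFFS = [(150, 0.25), (200, 0.16), (250, 0.07), (300, 0.06)]


def expected_random_prevalence():
    """Codon prevalence expected from nucleotide frequencies alone."""
    raw = {c: NUCLEOTIDE_FREQ[c[0]] * NUCLEOTIDE_FREQ[c[1]] * NUCLEOTIDE_FREQ[c[2]]
           for c in SENSE_CODONS}
    total = sum(raw.values())
    return {c: p / total for c, p in raw.items()}


def codon_enrichment(orfs):
    """Observed / expected prevalence for each sense codon, in SENSE_CODONS order."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    for seq in orfs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skips stop codons and ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sense codons found")
    expected = expected_random_prevalence()
    return [counts[c] / total / expected[c] for c in SENSE_CODONS]


def pearson(x, y):
    """Linear correlation coefficient (r) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cec(test_orf, positive_set):
    """Correlation between codon enrichments of a test ORF and the positive set."""
    return pearson(codon_enrichment([test_orf]), codon_enrichment(positive_set))


def cec_cutoff(n_codons):
    """Cut-off for an ORF of the given length; None above 300 codons."""
    for max_codons, cutoff in CEC_CUTOFFS:
        if n_codons <= max_codons:
            return cutoff
    return None


def is_spurious(cec_value, n_codons, detected_tap, detected_gfp):
    """Spurious: detected by neither TAP nor GFP, and CEC below the cut-off."""
    cutoff = cec_cutoff(n_codons)
    return (cutoff is not None and not detected_tap and not detected_gfp
            and cec_value < cutoff)


def estimated_genuine(n_obs, ratio_unobserved_to_observed):
    """N_real = N_obs * R for one size range and cut-off."""
    return n_obs * ratio_unobserved_to_observed
```

As a usage example, `cec(orf_sequence, named_orf_sequences)` followed by `is_spurious(value, len(orf_sequence) // 3, detected_tap, detected_gfp)` classifies a single ORF, where `named_orf_sequences` stands in for the 3,753 named ORFs used as the positive set.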
Supplementary information accompanies this paper.