
for their encouragement and for producing this book so beautifully.

      Sriganesh Srihari

      Chern Han Yong

      Limsoon Wong

      May 2017

      1

      Introduction to Protein Complex Prediction

      Unfortunately, the proteome is much more complicated than the genome.

      —Carol Ezzell [Ezzel et al. 2002]

      In an early survey, American biochemist Bruce Alberts described large assemblies of proteins as the protein machines of cells [Alberts et al. 1998]. Protein assemblies are composed of highly specialized parts that coordinate to execute almost all of the biochemical, signaling, and functional processes in cells [Alberts et al. 1998]. It is not hard to see why protein assemblies are more advantageous to cells than individual proteins working in an uncoordinated manner. Compare, for example, the speed and elegance of the DNA replication machinery, which simultaneously replicates both strands of the DNA double helix, with what could ensue if each of the individual components (DNA helicases for separating the double-stranded DNA into single strands, DNA polymerases for assembling nucleotides, DNA primase for generating the primers, and the sliding clamp for holding these enzymes onto the DNA) acted in an uncoordinated manner [Alberts et al. 1998]. Although they might seem like individual parts brought together to perform arbitrary functions, protein assemblies can be very specific and enormously complicated. For example, the spliceosome is composed of 5 small nuclear RNAs (snRNAs or “snurps”) and more than 50 proteins, and is thought to catalyze an ordered sequence of more than 10 RNA rearrangements at a time as it removes an intron from an RNA transcript [Alberts et al. 1998, Baker et al. 1998]. The discovery of this intron-splicing process won Phillip A. Sharp and Richard J. Roberts the 1993 Nobel Prize in Physiology or Medicine.1

      Protein assemblies number in the hundreds even in the simplest of eukaryotic cells. For example, more than 400 protein assemblies have been identified in the single-celled eukaryote Saccharomyces cerevisiae (budding yeast) [Pu et al. 2009]. However, our knowledge of these protein assemblies is still fragmentary, as is our conception of how these assemblies work together to constitute the “higher level” functional architecture of cells. A faithful attempt toward identification and characterization of all protein assemblies is therefore crucial for elucidating the functioning of the cellular machinery.

      To identify the entire complement of protein assemblies, it is important to first crack the proteome, a concept so novel that the word “proteome” first appeared only around 20 years ago [Wilkins et al. 1996, Bryson 2003, Cox and Mann 2007]. The proteome, as defined in the UniProt Knowledgebase, is the entire complement of proteins expressed or derived from protein-coding genes in an organism [Bairoch and Apweiler 1996, UniProt 2015]. With the introduction of high-throughput experimental (proteomics) techniques including mass spectrometric [Cox and Mann 2007, Aebersold and Mann 2003] and protein quantitative trait locus (QTL) technologies [Foss et al. 2007], mapping of proteins on a large scale has become feasible. Just as genomics techniques (including genome sequencing) were first demonstrated in model organisms, proteome mapping has progressed initially and most rapidly for model prokaryotes including Escherichia coli (a bacterium) and model eukaryotes including Saccharomyces cerevisiae (budding or baker’s yeast), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (a nematode), and Arabidopsis thaliana (a flowering plant). Table 1.1 summarizes the numbers of proteins or protein-coding genes identified from these organisms. Of these, the proportions of protein-coding genes that are essential (genes that are thought to be critical for the survival of the cell or organism; “fitness genes”) range from ∼2% in Drosophila to ∼6.5% in Caenorhabditis and ∼18% in Saccharomyces [Cherry et al. 2012, Chen et al. 2012]. Recent landmark studies using large-scale proteomics [Wilhelm et al. 2014, Kim et al. 2014, Uhlén et al. 2010, Uhlén et al. 2015] on Homo sapiens (human) cells have characterized >17,000 (or >90%) putative protein-coding genes from ≥40 tissues and organs in the human body.
An encyclopedic resource on these proteins, covering their levels of expression and abundance in different human tissues, is available from the ProteomicsDB (http://www.proteomicsdb.org/) [Wilhelm et al. 2014], The Human Proteome Map (http://humanproteomemap.org/) [Kim et al. 2014], and The Human Protein Atlas (http://www.proteinatlas.org/) [Uhlén et al. 2010, Uhlén et al. 2015] projects. GeneCards (http://www.genecards.org/) [Safran et al. 2002, Safran et al. 2010] aggregates information on human protein-coding genes from >125 Web sources and presents it in an integrative, user-friendly manner. The expression levels of nearly 200 proteins that are essential for driving different human cancers are available from The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics.mdanderson.org/tcpa/_design/basic/index.html) [Li et al. 2013], measured from more than 3,000 tissue samples across 11 cancer types studied as part of The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). Screens based on short-hairpin RNA (shRNA)-mediated knockdown [Paddison et al. 2002, Lambeth and Smith 2013], clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9-based gene editing [Sanjana et al. 2014, Baltimore et al. 2015, Shalem et al. 2015], and disruptive mutagenesis [Bökel 2008] in MCF-10A (near-normal mammary), MDA-MB-435 (breast cancer), KBM7 (chronic myeloid leukemia), HAP1 (haploid), A375 (melanoma), HCT116 (colorectal cancer), and HUES62 (human embryonic stem) cells have characterized 1,500–1,880 (or 8–10%) “core” protein-coding genes as essential in human cells [Marcotte et al. 2016, Silva et al. 2008, Wang et al. 2014, Hart et al. 2015, Hart et al. 2014, Wang et al. 2015, Blomen et al. 2015].
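
As a rough consistency check on the human figures quoted above, the short Python sketch below back-derives the total number of putative protein-coding genes implied by ">17,000 (or >90%)" and then expresses the 1,500–1,880 essential genes as a share of that total. The function names are hypothetical, and because both quoted figures are lower bounds, the implied total is only an approximation and not a number stated in the text.

```python
def implied_total(count: int, fraction: float) -> float:
    """Whole implied when `count` items make up `fraction` of it."""
    return count / fraction


def percent_of(part: int, whole: float) -> float:
    """`part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole


# Figures quoted in the text. Both are lower bounds (">17,000", ">90%"),
# so the derived total is an assumption-laden estimate, not a reported value.
total_genes = implied_total(17_000, 0.90)

# Range of "core" essential genes reported across the screens.
essential_low = percent_of(1_500, total_genes)
essential_high = percent_of(1_880, total_genes)

print(f"implied total: ~{total_genes:,.0f} putative protein-coding genes")
print(f"essential share: {essential_low:.1f}% to {essential_high:.1f}%")
```

Under these assumptions the essential share comes out close to the quoted 8–10%, which suggests the two sets of figures are mutually consistent.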


      Comparative analyses of proteomes from different species have revealed interesting insights into the evolution and conservation of proteins. For example, it is estimated that the genomes (proteomes) of human and budding yeast diverged about 1 billion years ago from a common ancestor [Douzery et al. 2014], and the two species share several thousand genes, accounting for more than one-third of the yeast genome [O’Brien et al. 2005, Östlund et al. 2010]. Yeast and human orthologs are highly diverged; the amino-acid sequence similarity between human and yeast proteins ranges from 9% to 92%, with a genome-wide average of 32%. But sequence similarity reveals only part of the picture [Sun et al. 2016]. Recent studies [Kachroo et al. 2015, Laurent et al. 2015] have reported that 414 (or nearly half of the) essential protein-coding genes in yeast could be “replaced” by human genes, with replaceability depending on gene (protein) assemblies: genes in the same process tend to be either similarly replaceable (e.g., sterol biosynthesis) or similarly not replaceable (e.g., DNA replication initiation).

      Whether in a lower-order model organism or a higher-order complex organism, a protein has to physically interact with other proteins and biomolecules to remain functional. Estimates in humans suggest that over 80% of proteins do not function alone, but instead interact to function as macromolecular assemblies [Berggård et al. 2007]. This organization of individual proteins into assemblies is tightly regulated in cellular space and time, and is supported by protein conformational changes, posttranslational modifications, and competitive binding [Gibson and Goldberg 2009]. On the basis of their stability (area of interaction surface and duration of interaction) and partner specificity, the interactions between proteins are classified as homo- or hetero-oligomeric, obligate or non-obligate, and permanent or transient [Zhang 2009, Nooren and Thornton 2003]. Proteins in obligate
