Machine Learning for Protein Subcellular Localization Prediction

Book description

This book comprehensively covers protein subcellular localization prediction, from single-label to multi-label prediction, and includes prediction strategies for virus, plant, and eukaryote species. It introduces three machine learning tools to improve classification refinement, feature extraction, and dimensionality reduction.

Table of contents

  1. Cover
  2. Also of Interest
  3. Title Page
  4. Copyright Page
  5. Preface
  6. Contents
  7. List of Abbreviations
  8. 1   Introduction
    1. 1.1    Proteins and their subcellular locations
    2. 1.2    Why computationally predict protein subcellular localization?
      1. 1.2.1 Significance of the subcellular localization of proteins
      2. 1.2.2 Conventional wet-lab techniques
      3. 1.2.3 Computational prediction of protein subcellular localization
    3. 1.3    Organization of this book
  9. 2   Overview of subcellular localization prediction
    1. 2.1    Sequence-based methods
      1. 2.1.1 Composition-based methods
      2. 2.1.2 Sorting signal-based methods
      3. 2.1.3 Homology-based methods
    2. 2.2    Knowledge-based methods
      1. 2.2.1 GO-term extraction
      2. 2.2.2 GO-vector construction
    3. 2.3    Limitations of existing methods
      1. 2.3.1 Limitations of sequence-based methods
      2. 2.3.2 Limitations of knowledge-based methods
  10. 3   Legitimacy of using gene ontology information
    1. 3.1    Direct table lookup?
      1. 3.1.1 Table lookup procedure for single-label prediction
      2. 3.1.2 Table-lookup procedure for multi-label prediction
      3. 3.1.3 Problems of table lookup
    2. 3.2    Using only cellular component GO terms?
    3. 3.3    Equivalent to homologous transfer?
    4. 3.4    More reasons for using GO information
  11. 4   Single-location protein subcellular localization
    1. 4.1    Extracting GO from the Gene Ontology Annotation Database
      1. 4.1.1 Gene Ontology Annotation Database
      2. 4.1.2 Retrieval of GO terms
      3. 4.1.3 Construction of GO vectors
      4. 4.1.4 Multiclass SVM classification
    2. 4.2    FusionSVM: Fusion of gene ontology and homology-based features
      1. 4.2.1 InterProGOSVM: Extracting GO from InterProScan
      2. 4.2.2 PairProSVM: A homology-based method
      3. 4.2.3 Fusion of InterProGOSVM and PairProSVM
    3. 4.3    Summary
  12. 5   From single- to multi-location
    1. 5.1    Significance of multi-location proteins
    2. 5.2    Multi-label classification
      1. 5.2.1 Algorithm-adaptation methods
      2. 5.2.2 Problem transformation methods
      3. 5.2.3 Multi-label classification in bioinformatics
    3. 5.3    mGOASVM: A predictor for both single- and multi-location proteins
      1. 5.3.1 Feature extraction
      2. 5.3.2 Multi-label multiclass SVM classification
    4. 5.4    AD-SVM: An adaptive decision multi-label predictor
      1. 5.4.1 Multi-label SVM scoring
      2. 5.4.2 Adaptive decision for SVM (AD-SVM)
      3. 5.4.3 Analysis of AD-SVM
    5. 5.5    mPLR-Loc: A multi-label predictor based on penalized logistic regression
      1. 5.5.1 Single-label penalized logistic regression
      2. 5.5.2 Multi-label penalized logistic regression
      3. 5.5.3 Adaptive decision for LR (mPLR-Loc)
    6. 5.6    Summary
  13. 6   Mining deeper on GO for protein subcellular localization
    1. 6.1    Related work
    2. 6.2    SS-Loc: Using semantic similarity over GO
      1. 6.2.1 Semantic similarity measures
      2. 6.2.2 SS vector construction
    3. 6.3    HybridGO-Loc: Hybridizing GO frequency and semantic similarity features
      1. 6.3.1 Hybridization of two GO features
      2. 6.3.2 Multi-label multiclass SVM classification
    4. 6.4    Summary
  14. 7   Ensemble random projection for large-scale predictions
    1. 7.1    Random projection
    2. 7.2    RP-SVM: A multi-label classifier with ensemble random projection
      1. 7.2.1 Ensemble multi-label classifier
      2. 7.2.2 Multi-label classification
    3. 7.3    R3P-Loc: A compact predictor based on ridge regression and ensemble random projection
      1. 7.3.1 Limitation of using current databases
      2. 7.3.2 Creating compact databases
      3. 7.3.3 Single-label ridge regression
      4. 7.3.4 Multi-label ridge regression
    4. 7.4    Summary
  15. 8   Experimental setup
    1. 8.1    Prediction of single-label proteins
      1. 8.1.1 Datasets construction
      2. 8.1.2 Performance metrics
    2. 8.2    Prediction of multi-label proteins
      1. 8.2.1 Dataset construction
      2. 8.2.2 Datasets analysis
      3. 8.2.3 Performance metrics
    3. 8.3    Statistical evaluation methods
    4. 8.4    Summary
  16. 9   Results and analysis
    1. 9.1    Performance of GOASVM
      1. 9.1.1 Comparing GO vector construction methods
      2. 9.1.2 Performance of successive-search strategy
      3. 9.1.3 Comparing with methods based on other features
      4. 9.1.4 Comparing with state-of-the-art GO methods
      5. 9.1.5 GOASVM using old GOA databases
    2. 9.2    Performance of FusionSVM
      1. 9.2.1 Comparing GO vector construction and normalization methods
      2. 9.2.2 Performance of PairProSVM
      3. 9.2.3 Performance of FusionSVM
      4. 9.2.4 Effect of the fusion weights on the performance of FusionSVM
    3. 9.3    Performance of mGOASVM
      1. 9.3.1 Kernel selection and optimization
      2. 9.3.2 Term-frequency for mGOASVM
      3. 9.3.3 Multi-label properties for mGOASVM
      4. 9.3.4 Further analysis of mGOASVM
      5. 9.3.5 Comparing prediction results of novel proteins
    4. 9.4    Performance of AD-SVM
    5. 9.5    Performance of mPLR-Loc
      1. 9.5.1 Effect of adaptive decisions on mPLR-Loc
      2. 9.5.2 Effect of regularization on mPLR-Loc
    6. 9.6    Performance of HybridGO-Loc
      1. 9.6.1 Comparing different features
    7. 9.7    Performance of RP-SVM
      1. 9.7.1 Performance of ensemble random projection
      2. 9.7.2 Comparison with other dimension-reduction methods
      3. 9.7.3 Performance of single random-projection
      4. 9.7.4 Effect of dimensions and ensemble size
    8. 9.8    Performance of R3P-Loc
      1. 9.8.1 Performance on the compact databases
      2. 9.8.2 Effect of dimensions and ensemble size
      3. 9.8.3 Performance of ensemble random projection
    9. 9.9    Comprehensive comparison of proposed predictors
      1. 9.9.1 Comparison of benchmark datasets
      2. 9.9.2 Comparison of novel datasets
    10. 9.10    Summary
  17. 10   Properties of the proposed predictors
    1. 10.1    Noise data in the GOA Database
    2. 10.2    Analysis of single-label predictors
      1. 10.2.1 GOASVM vs FusionSVM
      2. 10.2.2 Can GOASVM be combined with PairProSVM?
    3. 10.3    Advantages of mGOASVM
      1. 10.3.1 GO-vector construction
      2. 10.3.2 GO subspace selection
      3. 10.3.3 Capability of handling multi-label problems
    4. 10.4    Analysis for HybridGO-Loc
      1. 10.4.1 Semantic similarity measures
      2. 10.4.2 GO-frequency features vs SS features
      3. 10.4.3 Bias analysis
    5. 10.5    Analysis for RP-SVM
      1. 10.5.1 Legitimacy of using RP
      2. 10.5.2 Ensemble random projection for robust performance
    6. 10.6    Comparing the proposed multi-label predictors
    7. 10.7    Summary
  18. 11   Conclusions and future directions
    1. 11.1    Conclusions
    2. 11.2    Future directions
  19. A   Webservers for protein subcellular localization
    1. A.1    GOASVM webserver
    2. A.2    mGOASVM webserver
    3. A.3    HybridGO-Loc webserver
    4. A.4    mPLR-Loc webserver
  20. B   Support vector machines
    1. B.1    Binary SVM classification
    2. B.2    One-vs-rest SVM classification
  21. C   Proof of no bias in LOOCV
  22. D   Derivatives for penalized logistic regression
  23. Endnotes
  24. Bibliography
  25. Index

Product information

  • Title: Machine Learning for Protein Subcellular Localization Prediction
  • Author(s): Shibiao Wan, Man-Wai Mak
  • Release date: May 2015
  • Publisher(s): De Gruyter
  • ISBN: 9781501501524