Protein subcellular localization prediction based on compartment-specific features and structure conservation

Emily Chia Yu Su, Hua Sheng Chiu, Allan Lo, Jenn Kang Hwang, Ting Yi Sung, Wen Lian Hsu

Research output: Contribution to journalArticle

51 Citations (Scopus)

Abstract

Background: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. Results: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains accurate prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%. Conclusion: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy of combining the structural homology approach is further improved, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.

Original languageEnglish
Article number330
JournalBMC Bioinformatics
Volume8
DOIs
Publication statusPublished - Sep 8 2007
Externally publishedYes

Fingerprint

Conservation
Homology
Proteins
Protein
Prediction
Bacteria
Gram-Negative Bacteria
Translocation
Support vector machines
Pathway
Support Vector Machine
Sequence Homology
Three-way Data
Composition Methods
Proteome
Drug Discovery
Structural Similarity
Design of experiments
Proteomics
Secondary Structure

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Protein subcellular localization prediction based on compartment-specific features and structure conservation. / Su, Emily Chia Yu; Chiu, Hua Sheng; Lo, Allan; Hwang, Jenn Kang; Sung, Ting Yi; Hsu, Wen Lian.

In: BMC Bioinformatics, Vol. 8, 330, 08.09.2007.

Research output: Contribution to journalArticle

Su, Emily Chia Yu ; Chiu, Hua Sheng ; Lo, Allan ; Hwang, Jenn Kang ; Sung, Ting Yi ; Hsu, Wen Lian. / Protein subcellular localization prediction based on compartment-specific features and structure conservation. In: BMC Bioinformatics. 2007 ; Vol. 8.
@article{1df5f19a58dc4f478cc8fd969e39f055,
title = "Protein subcellular localization prediction based on compartment-specific features and structure conservation",
abstract = "Background: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. Results: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7{\%} and 93.2{\%} using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains accurate prediction accuracy of 84.0{\%}, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85{\%} for non-redundant data sets of sequence identity less than 30{\%}. Conclusion: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy of combining the structural homology approach is further improved, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.",
author = "Su, {Emily Chia Yu} and Chiu, {Hua Sheng} and Allan Lo and Hwang, {Jenn Kang} and Sung, {Ting Yi} and Hsu, {Wen Lian}",
year = "2007",
month = "9",
day = "8",
doi = "10.1186/1471-2105-8-330",
language = "English",
volume = "8",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Protein subcellular localization prediction based on compartment-specific features and structure conservation

AU - Su, Emily Chia Yu

AU - Chiu, Hua Sheng

AU - Lo, Allan

AU - Hwang, Jenn Kang

AU - Sung, Ting Yi

AU - Hsu, Wen Lian

PY - 2007/9/8

Y1 - 2007/9/8

N2 - Background: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. Results: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains accurate prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%. Conclusion: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy of combining the structural homology approach is further improved, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.

AB - Background: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. Results: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains accurate prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%. Conclusion: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy of combining the structural homology approach is further improved, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.

UR - http://www.scopus.com/inward/record.url?scp=35448970604&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35448970604&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-8-330

DO - 10.1186/1471-2105-8-330

M3 - Article

VL - 8

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 330

ER -