Predicting protein subnuclear localization using GO-amino-acid composition features

Wen Lin Huang, Chun Wei Tung, Hui Ling Huang, Shinn Ying Ho

Research output: Contribution to journal (Article)

27 Citations (Scopus)

Abstract

The nucleus directs the life processes of cells. Many of the nuclear proteins that participate in these processes concentrate in subnuclear compartments, so knowing the subnuclear localization of nuclear proteins is important for understanding the structure and functions of the nucleus. Recently, Gene Ontology (GO) annotation has been used to predict subnuclear localization. However, using GO terms effectively in sequence-based prediction remains challenging, especially when a query protein sequence has no accession number or annotated GO term. This study uses BLAST to find homologs of query proteins that have known accession numbers, and retrieves the GO terms of those homologs for sequence-based subnuclear localization prediction. A prediction method, PGAC, which mines informative GO terms associated with amino acid composition features, is proposed to design a support vector machine-based classifier. Using the data set SNL_35 (561 proteins in 9 localizations) with 35% sequence identity, PGAC yields 55 informative GO terms with training and test accuracies of 85.7% and 76.3%, respectively. Using the data set SNL_80, PGAC yields a leave-one-out cross-validation accuracy of 81.1%, compared with 67.4% for Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix. The experimental results show that the set of informative GO terms is an effective feature set for predicting protein subnuclear localization. A prediction server based on PGAC has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.
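
As a rough illustration of the kind of feature vector the abstract describes, the sketch below joins a binary GO-term indicator with a 20-dimensional amino acid composition. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the simple concatenation, and the two GO identifiers are hypothetical choices made for illustration, and the abstract does not list the 55 informative terms or say how PGAC encodes them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the frequency of each standard residue in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence (or no standard residues): all frequencies are zero.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def go_indicator(go_terms, informative_terms):
    """Return 1.0 for each informative GO term that is annotated, else 0.0."""
    present = set(go_terms)
    return [1.0 if term in present else 0.0 for term in informative_terms]


def build_features(sequence, go_terms, informative_terms):
    """Concatenate GO indicators with amino acid composition (an assumption)."""
    return go_indicator(go_terms, informative_terms) + amino_acid_composition(sequence)


# Hypothetical inputs: placeholder GO identifiers and a toy sequence.
informative = ["GO:0005730", "GO:0016607"]
features = build_features("MKRA", ["GO:0005730"], informative)
print(len(features))
```

In the workflow the abstract outlines, the GO terms passed in would come from BLAST homologs with known accession numbers, and vectors of this kind would then be given to a support vector machine.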

Original language: English
Pages (from-to): 73-79
Number of pages: 7
Journal: BioSystems
Volume: 98
Issue number: 2
DOI: 10.1016/j.biosystems.2009.06.007
Publication status: Published - Nov 1 2009
Externally published: Yes

Fingerprint

Gene Ontology
Ontology
Amino Acids
Amino acids
Genes
Proteins
Protein
Chemical analysis
Term
Prediction
Nuclear Proteins
Nucleus
Position-Specific Scoring Matrices
Query
Molecular Sequence Annotation
Protein Sequence
Scoring
Cross-validation
Support vector machines
Annotation

Keywords

  • Amino acid composition
  • Gene Ontology
  • Subnuclear localization

ASJC Scopus subject areas

  • Statistics and Probability
  • Modelling and Simulation
  • Biochemistry, Genetics and Molecular Biology(all)
  • Applied Mathematics

Cite this

Predicting protein subnuclear localization using GO-amino-acid composition features. / Huang, Wen Lin; Tung, Chun Wei; Huang, Hui Ling; Ho, Shinn Ying.

In: BioSystems, Vol. 98, No. 2, 01.11.2009, p. 73-79.

@article{d4e9c87dc9644f6082419f266af64d3d,
title = "Predicting protein subnuclear localization using GO-amino-acid composition features",
keywords = "Amino acid composition, Gene Ontology, Subnuclear localization",
author = "Huang, {Wen Lin} and Tung, {Chun Wei} and Huang, {Hui Ling} and Ho, {Shinn Ying}",
year = "2009",
month = "11",
day = "1",
doi = "10.1016/j.biosystems.2009.06.007",
language = "English",
volume = "98",
pages = "73--79",
journal = "BioSystems",
issn = "0303-2647",
publisher = "Elsevier Ireland Ltd",
number = "2",

}

TY - JOUR
T1 - Predicting protein subnuclear localization using GO-amino-acid composition features
AU - Huang, Wen Lin
AU - Tung, Chun Wei
AU - Huang, Hui Ling
AU - Ho, Shinn Ying
PY - 2009/11/1
Y1 - 2009/11/1
KW - Amino acid composition
KW - Gene Ontology
KW - Subnuclear localization
UR - http://www.scopus.com/inward/record.url?scp=70349795723&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349795723&partnerID=8YFLogxK
U2 - 10.1016/j.biosystems.2009.06.007
DO - 10.1016/j.biosystems.2009.06.007
M3 - Article
C2 - 19583993
AN - SCOPUS:70349795723
VL - 98
SP - 73
EP - 79
JO - BioSystems
JF - BioSystems
SN - 0303-2647
IS - 2
ER -