ProLoc-rGO

Using rule-based knowledge with gene ontology terms for prediction of protein subnuclear localization

Wen Lin Huang, Chun Wei Tung, Shih Wen Ho, Shinn Ying Ho

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Gene Ontology (GO) is a controlled vocabulary of terms and phrases describing the functions of genes and gene products, and GO annotation has been used successfully to predict subcellular and subnuclear localization. Generally, each gene product is annotated with very few GO terms out of the more than 25,000 available at present. How a protein sequence is represented using GO terms as features therefore plays an important role in designing prediction systems for protein subnuclear localization. Our previous work, ProLoc-GO, can select a small number m out of a large number n of GO terms, where m ≪ n. However, its off-line training time is long, up to several days, even when running on high-speed PC clusters. This study therefore proposes an efficient system, ProLoc-rGO, which uses the decision tree method to quickly mine m informative GO terms and acquire interpretable rule-based knowledge for predicting subnuclear localization. On SNL9-80 (714 proteins in nine compartments with ≤80% identity), ProLoc-rGO mines m = 17 informative GO terms and 17 interpretable rules, and yields training and test accuracies of 84.9% and 78.2%, respectively. For comparison, ProLoc-rGO obtains an accuracy of 82.6% (Matthews correlation coefficient (MCC) = 0.711) on SNL9-80, which is better than the 67.4% (MCC = 0.50) of Nuc-PLoc, a method that fuses the pseudo-amino acid composition of a protein with its position-specific scoring matrix.
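
The idea of mining a few informative GO terms and readable rules with a decision tree can be illustrated with a minimal sketch. It assumes each protein is represented by the set of GO terms annotating it, and it uses a plain ID3-style information-gain split, which may differ from the decision tree method actually used in ProLoc-rGO. The GO identifiers, compartment labels, and all function names below are hypothetical placeholders, not data or code from the paper.

```python
import math
from collections import Counter

# Hypothetical toy data: (set of GO terms, compartment label).
# The GO IDs and labels are placeholders, not taken from SNL9-80.
PROTEINS = [
    ({"GO:A", "GO:C"}, "nucleolus"),
    ({"GO:A"}, "nucleolus"),
    ({"GO:B", "GO:C"}, "chromatin"),
    ({"GO:B"}, "chromatin"),
    ({"GO:C"}, "nuclear speckle"),
    ({"GO:D", "GO:C"}, "nuclear speckle"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_term(rows, terms):
    """Return the GO term with the highest information gain, or None."""
    base = entropy([y for _, y in rows])
    best, best_gain = None, 0.0
    for t in sorted(terms):  # sorted so that ties are broken deterministically
        yes = [y for x, y in rows if t in x]
        no = [y for x, y in rows if t not in x]
        if not yes or not no:
            continue  # the term does not split these proteins
        gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
        if gain > best_gain + 1e-12:
            best, best_gain = t, gain
    return best

def build_tree(rows, terms):
    """Grow a binary tree: (term, subtree_if_present, subtree_if_absent) or a leaf label."""
    labels = [y for _, y in rows]
    term = best_term(rows, terms) if len(set(labels)) > 1 else None
    if term is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    rest = terms - {term}
    return (term,
            build_tree([r for r in rows if term in r[0]], rest),
            build_tree([r for r in rows if term not in r[0]], rest))

def rules(tree, path=()):
    """Yield one (conditions, label) rule per leaf of the tree."""
    if isinstance(tree, str):
        yield path, tree
        return
    term, yes, no = tree
    yield from rules(yes, path + ((term, True),))
    yield from rules(no, path + ((term, False),))

def predict(tree, go_terms):
    """Follow the tree for one protein's GO-term set and return the leaf label."""
    while not isinstance(tree, str):
        term, yes, no = tree
        tree = yes if term in go_terms else no
    return tree

ALL_TERMS = set().union(*(x for x, _ in PROTEINS))  # the n candidate terms
tree = build_tree(PROTEINS, ALL_TERMS)
rule_list = list(rules(tree))
selected = {term for path, _ in rule_list for term, _ in path}  # the m terms used
accuracy = sum(predict(tree, x) == y for x, y in PROTEINS) / len(PROTEINS)

for path, label in rule_list:
    conds = " AND ".join(("has " if present else "lacks ") + term for term, present in path)
    print(f"IF {conds} THEN {label}")
print(f"m = {len(selected)} of n = {len(ALL_TERMS)} terms; training accuracy = {accuracy:.2f}")
```

In this sketch the GO terms that appear in at least one split play the role of the m informative terms, and each root-to-leaf path reads as one IF-THEN rule. A real system would also need pruning and a held-out test set to estimate accuracy.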

Original language: English
Title of host publication: 2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08
Pages: 201-206
Number of pages: 6
DOI: https://doi.org/10.1109/CIBCB.2008.4675779
Publication status: Published - Dec 1 2008
Externally published: Yes
Event: 2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08 - Sun Valley, ID, United States
Duration: Sep 15 2008 to Sep 17 2008

Fingerprint

Gene Ontology
Ontology
Genes
Proteins
Position-Specific Scoring Matrices
Controlled Vocabulary
Molecular Sequence Annotation
Decision Trees
Thesauri
Amino Acids

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Biomedical Engineering
  • Health Informatics

Cite this

Huang, W. L., Tung, C. W., Ho, S. W., & Ho, S. Y. (2008). ProLoc-rGO: Using rule-based knowledge with gene ontology terms for prediction of protein subnuclear localization. In 2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08 (pp. 201-206). [4675779] (2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08). https://doi.org/10.1109/CIBCB.2008.4675779

@inproceedings{2ab7e4ec475d4d87860910f5759b7864,
title = "ProLoc-rGO: Using rule-based knowledge with gene ontology terms for prediction of protein subnuclear localization",
author = "Huang, {Wen Lin} and Tung, {Chun Wei} and Ho, {Shih Wen} and Ho, {Shinn Ying}",
year = "2008",
month = "12",
day = "1",
doi = "10.1109/CIBCB.2008.4675779",
language = "English",
isbn = "9781424417780",
series = "2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08",
pages = "201--206",
booktitle = "2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '08",

}
