PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

Jia Ming Chang, Emily Chia Yu Su, Allan Lo, Hua Sheng Chiu, Ting Yi Sung, Wen Lian Hsu

Research output: Contribution to journal › Article

34 Citations (Scopus)

Abstract

Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. In recent years, many computational approaches have been proposed for predicting PSL from protein sequences in Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA), to solve this problem. A protein is treated as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. Each gapped-dipeptide is weighted according to a position-specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can be classified into a low- or high-homology data set. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, which compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% precision, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that this feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Moreover, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
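
The first and last steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses raw term counts where PSLDoc uses weights derived from a position-specific scoring matrix, it omits the PLSA reduction and the SVM classifiers entirely, and it assumes that the confidence compared against the threshold is simply the highest site probability. The names `gapped_dipeptides`, `predict_site`, `max_gap`, the term notation `A{d}B`, and the site labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap=3):
    """Count terms 'A{d}B': residue A, then d intervening positions, then residue B.

    Following the abstract's definition, d starts at 1 (one or more positions).
    max_gap is a hypothetical cut-off, not a value taken from the paper.
    """
    counts = Counter()
    for i, first in enumerate(sequence):
        for gap in range(1, max_gap + 1):
            j = i + gap + 1  # index of the second residue
            if j >= len(sequence):
                break
            counts[f"{first}{gap}{sequence[j]}"] += 1
    return counts


def predict_site(site_probabilities, threshold=0.7):
    """Return the most probable site, or None if its probability is below threshold.

    Assumption: the confidence is the top probability among the
    one-versus-rest classifier outputs.
    """
    site, probability = max(site_probabilities.items(), key=lambda item: item[1])
    return site if probability >= threshold else None


# Toy sequence; a real protein would yield a much larger term vector.
terms = gapped_dipeptides("MKKTA", max_gap=2)

# Illustrative probabilities for five sites (labels are assumptions).
confident = predict_site({
    "cytoplasm": 0.90, "inner membrane": 0.04, "periplasm": 0.03,
    "outer membrane": 0.02, "extracellular": 0.01,
})
uncertain = predict_site({
    "cytoplasm": 0.50, "inner membrane": 0.30, "periplasm": 0.10,
    "outer membrane": 0.05, "extracellular": 0.05,
})
```

Returning `None` below the threshold reflects the trade-off the abstract mentions: withholding low-confidence predictions raises precision at the cost of recall.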

Original language: English
Pages (from-to): 693-710
Number of pages: 18
Journal: Proteins: Structure, Function and Genetics
Volume: 72
Issue number: 2
DOIs: 10.1002/prot.21944
Publication status: Published - Aug 1 2008
Externally published: Yes

Fingerprint

Dipeptides
Semantics
Proteins
Molecular Sequence Annotation
Drug Discovery
Proteome
Sequence Homology
Computational Biology
Gram-Negative Bacteria
Bioinformatics
Support vector machines
Bacteria
Classifiers
Servers
Genome
Genes

Keywords

  • Document classification
  • Gapped-dipeptides
  • Probabilistic latent semantic analysis
  • Protein subcellular localization
  • Support vector machines
  • Vector space model

ASJC Scopus subject areas

  • Genetics
  • Structural Biology
  • Biochemistry

Cite this

PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. / Chang, Jia Ming; Su, Emily Chia Yu; Lo, Allan; Chiu, Hua Sheng; Sung, Ting Yi; Hsu, Wen Lian.

In: Proteins: Structure, Function and Genetics, Vol. 72, No. 2, 01.08.2008, p. 693-710.

Chang, Jia Ming; Su, Emily Chia Yu; Lo, Allan; Chiu, Hua Sheng; Sung, Ting Yi; Hsu, Wen Lian. / PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. In: Proteins: Structure, Function and Genetics. 2008; Vol. 72, No. 2. pp. 693-710.
@article{a846430d0f704f5389031565dc5fafda,
title = "PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis",
keywords = "Document classification, Gapped-dipeptides, Probabilistic latent semantic analysis, Protein subcellular localization, Support vector machines, Vector space model",
author = "Chang, {Jia Ming} and Su, {Emily Chia Yu} and Allan Lo and Chiu, {Hua Sheng} and Sung, {Ting Yi} and Hsu, {Wen Lian}",
year = "2008",
month = "8",
day = "1",
doi = "10.1002/prot.21944",
language = "English",
volume = "72",
pages = "693--710",
journal = "Proteins: Structure, Function and Genetics",
issn = "0887-3585",
publisher = "Wiley-Liss Inc.",
number = "2",

}

TY - JOUR

T1 - PSLDoc

T2 - Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

AU - Chang, Jia Ming

AU - Su, Emily Chia Yu

AU - Lo, Allan

AU - Chiu, Hua Sheng

AU - Sung, Ting Yi

AU - Hsu, Wen Lian

PY - 2008/8/1

Y1 - 2008/8/1

KW - Document classification

KW - Gapped-dipeptides

KW - Probabilistic latent semantic analysis

KW - Protein subcellular localization

KW - Support vector machines

KW - Vector space model

UR - http://www.scopus.com/inward/record.url?scp=46449089781&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=46449089781&partnerID=8YFLogxK

U2 - 10.1002/prot.21944

DO - 10.1002/prot.21944

M3 - Article

C2 - 18260102

AN - SCOPUS:46449089781

VL - 72

SP - 693

EP - 710

JO - Proteins: Structure, Function and Genetics

JF - Proteins: Structure, Function and Genetics

SN - 0887-3585

IS - 2

ER -