Predicting RNA-binding sites of proteins using support vector machines and evolutionary information

Cheng Wei Cheng, Emily Chia Yu Su, Jenn Kang Hwang, Ting Yi Sung, Wen Lian Hsu

Research output: Contribution to journal › Article

80 Citations (Scopus)

Abstract

Background: RNA-protein interactions play an essential role in several biological processes, such as protein synthesis, gene expression, posttranscriptional regulation, and viral infectivity. Identifying RNA-binding sites in proteins therefore provides valuable insights for biologists. However, experimental determination of RNA-protein interactions remains time-consuming and labor-intensive, so computational approaches for predicting RNA-binding sites in proteins have become highly desirable. Extensive studies of RNA-binding site prediction have led to the development of several methods, but these methods tend to achieve high specificity at the cost of low sensitivity.
Results: We propose a method, RNAProB, which combines a new smoothed position-specific scoring matrix (PSSM) encoding scheme with a support vector machine model to predict RNA-binding sites in proteins. In addition to the evolutionary information in standard PSSM profiles, the proposed smoothed PSSM encoding scheme also considers the correlation with and dependency on the neighboring residues of each amino acid in a protein. Experimental results show that smoothed PSSM encoding significantly enhances prediction performance, especially sensitivity. Using five-fold cross-validation, our method outperforms the state-of-the-art systems by 4.90% to 6.83% in overall accuracy, 0.88% to 5.33% in specificity, and 0.10 to 0.23 in Matthews correlation coefficient. Most notably, compared to other approaches, RNAProB improves sensitivity by 7.0% to 26.9% on the benchmark data sets. To guard against overfitting, a three-way data split procedure is used to estimate the prediction performance. Moreover, the physicochemical properties and amino acid preferences of RNA-binding proteins are examined and analyzed.
Conclusion: Our results demonstrate that the smoothed PSSM encoding scheme significantly enhances the performance of RNA-binding site prediction in proteins. This also supports our assumption that smoothed PSSM encoding can better resolve the ambiguity in discriminating between interacting and non-interacting residues by modelling the dependency on surrounding residues. The proposed method can also be applied in other research areas, such as DNA-binding site prediction, protein-protein interaction, and prediction of posttranslational modification sites.
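
The abstract names two technical ingredients without spelling them out: smoothing a PSSM over neighboring residues, and scoring predictions by sensitivity, specificity, accuracy, and Matthews correlation coefficient. The sketch below is only an illustration of both ideas, not the authors' implementation. The function names `smooth_pssm` and `binary_metrics`, the window size of 3, the choice to sum (rather than average) neighboring rows, and the truncation of the window at the sequence ends are all assumptions that the abstract does not confirm.

```python
from math import sqrt


def smooth_pssm(pssm, window=3):
    """Return a smoothed copy of a PSSM (one row of scores per residue).

    Assumption: each row is replaced by the element-wise sum of the rows
    inside a window centred on it; the window is truncated at the termini.
    """
    half = window // 2
    n = len(pssm)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # zip(*rows) walks the columns of the rows inside the window
        smoothed.append([sum(col) for col in zip(*pssm[lo:hi])])
    return smoothed


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard per-residue metrics from confusion-matrix counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Toy 3-residue matrix with 2 columns; a real PSSM has 20 columns.
    toy = [[1, 0], [2, 1], [4, -1]]
    print(smooth_pssm(toy, window=3))
    print(binary_metrics(tp=1, fp=1, tn=1, fn=1))
```

In a pipeline like the one described, the smoothed rows around each residue would presumably be concatenated into a feature vector for the support vector machine, and the metrics would be computed on held-out folds; those steps are omitted here.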

Original language: English
Article number: S6
Journal: BMC Bioinformatics
Volume: 9
Issue number: SUPPL. 12
DOI: 10.1186/1471-2105-9-S12-S6
Publication status: Published - Dec 12 2008
Externally published: Yes

Fingerprint

RNA-Binding Proteins
Binding sites
RNA
Position-Specific Scoring Matrices
Support vector machines
Support Vector Machine
Scoring
Proteins
Protein
Encoding
Prediction
Performance Prediction
Specificity
Amino Acids
Viral Gene Expression Regulation
Three-way Data

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Structural Biology
  • Applied Mathematics
  • Agricultural and Biological Sciences (miscellaneous)
  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Software

Cite this

Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. / Cheng, Cheng Wei; Su, Emily Chia Yu; Hwang, Jenn Kang; Sung, Ting Yi; Hsu, Wen Lian.

In: BMC Bioinformatics, Vol. 9, No. SUPPL. 12, S6, 12.12.2008.

@article{c103a56bb2c248e0bc4fbadfc54cc686,
title = "Predicting RNA-binding sites of proteins using support vector machines and evolutionary information",
author = "Cheng, {Cheng Wei} and Su, {Emily Chia Yu} and Hwang, {Jenn Kang} and Sung, {Ting Yi} and Hsu, {Wen Lian}",
year = "2008",
month = "12",
day = "12",
doi = "10.1186/1471-2105-9-S12-S6",
language = "English",
volume = "9",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 12",

}

TY - JOUR

T1 - Predicting RNA-binding sites of proteins using support vector machines and evolutionary information

AU - Cheng, Cheng Wei

AU - Su, Emily Chia Yu

AU - Hwang, Jenn Kang

AU - Sung, Ting Yi

AU - Hsu, Wen Lian

PY - 2008/12/12

Y1 - 2008/12/12

UR - http://www.scopus.com/inward/record.url?scp=57649181721&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57649181721&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-9-S12-S6

DO - 10.1186/1471-2105-9-S12-S6

M3 - Article

VL - 9

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL. 12

M1 - S6

ER -