Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features

Shun Long Weng, Kai Yao Huang, Fergie Joanda Kaunang, Chien Hsun Huang, Hui Ju Kao, Tzu Hao Chang, Hsin Yao Wang, Jang Jih Lu, Tzong Yi Lee

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

Background: Protein carbonylation, an irreversible and non-enzymatic post-translational modification (PTM), is often used as a marker of oxidative stress. When reactive oxygen species (ROS) oxidize amino acid side chains, carbonyl (CO) groups are produced, especially on lysine (K), arginine (R), threonine (T), and proline (P). Because little is known about the substrate specificity of carbonylation, we developed a systematic method for a comprehensive investigation of protein carbonylation sites.

Results: After redundant data from multiple carbonylation-related articles were removed, a total of 226 carbonylated human proteins were used as the training dataset, which consisted of 307, 126, 128, and 129 carbonylation sites for K, R, T, and P residues, respectively. To identify features useful for predicting carbonylation sites, the linear amino acid sequence was adopted not only to build the predictive model from the training dataset but also to compare prediction performance with other types of features, including amino acid composition (AAC), amino acid pair composition (AAPC), position-specific scoring matrix (PSSM), positional weighted matrix (PWM), solvent-accessible surface area (ASA), and physicochemical properties. The investigation of position-specific amino acid composition revealed that the positively charged amino acids (K and R) are remarkably enriched around the carbonylated sites, which may play a functional role in discriminating between carbonylation and non-carbonylation sites. A variety of predictive models were built using various features and three different machine learning methods. Based on five-fold cross-validation, the models trained with the PWM feature provided better sensitivity on the positive training dataset, while the models trained with the AAindex feature achieved higher specificity on the negative training dataset. Additionally, the model trained using hybrid features, including PWM, AAC, and AAindex, obtained the best MCC values of 0.432, 0.472, 0.443, and 0.467 on K, R, T, and P residues, respectively.

Conclusion: When compared with an existing prediction tool, the selected models trained with hybrid features provided promising accuracy on an independent testing dataset. In short, this work not only characterized the carbonylated substrate preference but also demonstrated that the proposed method could provide a feasible means of accelerating the preliminary discovery of protein carbonylation.
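
As a rough illustration of two ideas named in the abstract, the sketch below extracts sequence windows centred on a candidate residue, computes the amino acid composition (AAC) of a window, and evaluates the Matthews correlation coefficient (MCC) from a confusion matrix. It is a minimal sketch, not the authors' implementation: the function names, the flank length, and the `-` padding character are assumptions made for illustration, since the abstract does not state the window size or any implementation detail.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def extract_windows(sequence, residue, flank=10, pad="-"):
    """Yield (1-based position, window) for each occurrence of `residue`.

    `flank` and `pad` are hypothetical choices; the abstract gives no window size.
    """
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa == residue:
            # The candidate residue sits at the centre of the window.
            yield i + 1, padded[i : i + 2 * flank + 1]


def aac(window):
    """Amino acid composition: fraction of each standard residue in the window."""
    counts = Counter(c for c in window if c in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequence, not taken from the paper's dataset.
    for pos, window in extract_windows("MKRAK", "K", flank=2):
        print(pos, window, aac(window)[AMINO_ACIDS.index("K")])
```

An AAC vector like this could be concatenated with other per-window features (for example PWM scores or AAindex values) before training a classifier, which is the general shape of the hybrid-feature models the abstract describes.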
Original language: English
Article number: 66
Journal: BMC Bioinformatics
Volume: 18
DOI: 10.1186/s12859-017-1472-8
Publication status: Published - 14 Mar 2017

Fingerprint

Protein Carbonylation
Carbonylation
Amino Acids
Amino acids
Proteins
Protein
Chemical analysis
Predictive Model
Specificity
Position-Specific Scoring Matrices
Substrate
Oxidative Stress
Reactive Oxygen Species
Prediction
Arginine
Threonine
Post Translational Protein Processing
Amino Acid Sequence
Substrate Specificity
Oxidative stress

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features. / Weng, Shun Long; Huang, Kai Yao; Kaunang, Fergie Joanda; Huang, Chien Hsun; Kao, Hui Ju; Chang, Tzu Hao; Wang, Hsin Yao; Lu, Jang Jih; Lee, Tzong Yi.

In: BMC Bioinformatics, Vol. 18, 66, 14.03.2017.

Research output: Contribution to journal › Article

Weng, Shun Long ; Huang, Kai Yao ; Kaunang, Fergie Joanda ; Huang, Chien Hsun ; Kao, Hui Ju ; Chang, Tzu Hao ; Wang, Hsin Yao ; Lu, Jang Jih ; Lee, Tzong Yi. / Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features. In: BMC Bioinformatics. 2017 ; Vol. 18.
@article{fe44cdafc75e4401bab6c15e9c4ac6ec,
title = "Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features",
abstract = "Background: Protein carbonylation, an irreversible and non-enzymatic post-translational modification (PTM), is often used as a marker of oxidative stress. When reactive oxygen species (ROS) oxidize amino acid side chains, carbonyl (CO) groups are produced, especially on lysine (K), arginine (R), threonine (T), and proline (P). Because little is known about the substrate specificity of carbonylation, we developed a systematic method for a comprehensive investigation of protein carbonylation sites. Results: After redundant data from multiple carbonylation-related articles were removed, a total of 226 carbonylated human proteins were used as the training dataset, which consisted of 307, 126, 128, and 129 carbonylation sites for K, R, T, and P residues, respectively. To identify features useful for predicting carbonylation sites, the linear amino acid sequence was adopted not only to build the predictive model from the training dataset but also to compare prediction performance with other types of features, including amino acid composition (AAC), amino acid pair composition (AAPC), position-specific scoring matrix (PSSM), positional weighted matrix (PWM), solvent-accessible surface area (ASA), and physicochemical properties. The investigation of position-specific amino acid composition revealed that the positively charged amino acids (K and R) are remarkably enriched around the carbonylated sites, which may play a functional role in discriminating between carbonylation and non-carbonylation sites. A variety of predictive models were built using various features and three different machine learning methods. Based on five-fold cross-validation, the models trained with the PWM feature provided better sensitivity on the positive training dataset, while the models trained with the AAindex feature achieved higher specificity on the negative training dataset. Additionally, the model trained using hybrid features, including PWM, AAC, and AAindex, obtained the best MCC values of 0.432, 0.472, 0.443, and 0.467 on K, R, T, and P residues, respectively. Conclusion: When compared with an existing prediction tool, the selected models trained with hybrid features provided promising accuracy on an independent testing dataset. In short, this work not only characterized the carbonylated substrate preference but also demonstrated that the proposed method could provide a feasible means of accelerating the preliminary discovery of protein carbonylation.",
keywords = "Amino acid composition, Physicochemical properties, Protein carbonylation, Reactive Oxygen Species (ROS)",
author = "Weng, {Shun Long} and Huang, {Kai Yao} and Kaunang, {Fergie Joanda} and Huang, {Chien Hsun} and Kao, {Hui Ju} and Chang, {Tzu Hao} and Wang, {Hsin Yao} and Lu, {Jang Jih} and Lee, {Tzong Yi}",
year = "2017",
month = "3",
day = "14",
doi = "10.1186/s12859-017-1472-8",
language = "English",
volume = "18",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features

AU - Weng, Shun Long

AU - Huang, Kai Yao

AU - Kaunang, Fergie Joanda

AU - Huang, Chien Hsun

AU - Kao, Hui Ju

AU - Chang, Tzu Hao

AU - Wang, Hsin Yao

AU - Lu, Jang Jih

AU - Lee, Tzong Yi

PY - 2017/3/14

Y1 - 2017/3/14

N2 - Background: Protein carbonylation, an irreversible and non-enzymatic post-translational modification (PTM), is often used as a marker of oxidative stress. When reactive oxygen species (ROS) oxidize amino acid side chains, carbonyl (CO) groups are produced, especially on lysine (K), arginine (R), threonine (T), and proline (P). Because little is known about the substrate specificity of carbonylation, we developed a systematic method for a comprehensive investigation of protein carbonylation sites. Results: After redundant data from multiple carbonylation-related articles were removed, a total of 226 carbonylated human proteins were used as the training dataset, which consisted of 307, 126, 128, and 129 carbonylation sites for K, R, T, and P residues, respectively. To identify features useful for predicting carbonylation sites, the linear amino acid sequence was adopted not only to build the predictive model from the training dataset but also to compare prediction performance with other types of features, including amino acid composition (AAC), amino acid pair composition (AAPC), position-specific scoring matrix (PSSM), positional weighted matrix (PWM), solvent-accessible surface area (ASA), and physicochemical properties. The investigation of position-specific amino acid composition revealed that the positively charged amino acids (K and R) are remarkably enriched around the carbonylated sites, which may play a functional role in discriminating between carbonylation and non-carbonylation sites. A variety of predictive models were built using various features and three different machine learning methods. Based on five-fold cross-validation, the models trained with the PWM feature provided better sensitivity on the positive training dataset, while the models trained with the AAindex feature achieved higher specificity on the negative training dataset. Additionally, the model trained using hybrid features, including PWM, AAC, and AAindex, obtained the best MCC values of 0.432, 0.472, 0.443, and 0.467 on K, R, T, and P residues, respectively. Conclusion: When compared with an existing prediction tool, the selected models trained with hybrid features provided promising accuracy on an independent testing dataset. In short, this work not only characterized the carbonylated substrate preference but also demonstrated that the proposed method could provide a feasible means of accelerating the preliminary discovery of protein carbonylation.

AB - Background: Protein carbonylation, an irreversible and non-enzymatic post-translational modification (PTM), is often used as a marker of oxidative stress. When reactive oxygen species (ROS) oxidize amino acid side chains, carbonyl (CO) groups are produced, especially on lysine (K), arginine (R), threonine (T), and proline (P). Because little is known about the substrate specificity of carbonylation, we developed a systematic method for a comprehensive investigation of protein carbonylation sites. Results: After redundant data from multiple carbonylation-related articles were removed, a total of 226 carbonylated human proteins were used as the training dataset, which consisted of 307, 126, 128, and 129 carbonylation sites for K, R, T, and P residues, respectively. To identify features useful for predicting carbonylation sites, the linear amino acid sequence was adopted not only to build the predictive model from the training dataset but also to compare prediction performance with other types of features, including amino acid composition (AAC), amino acid pair composition (AAPC), position-specific scoring matrix (PSSM), positional weighted matrix (PWM), solvent-accessible surface area (ASA), and physicochemical properties. The investigation of position-specific amino acid composition revealed that the positively charged amino acids (K and R) are remarkably enriched around the carbonylated sites, which may play a functional role in discriminating between carbonylation and non-carbonylation sites. A variety of predictive models were built using various features and three different machine learning methods. Based on five-fold cross-validation, the models trained with the PWM feature provided better sensitivity on the positive training dataset, while the models trained with the AAindex feature achieved higher specificity on the negative training dataset. Additionally, the model trained using hybrid features, including PWM, AAC, and AAindex, obtained the best MCC values of 0.432, 0.472, 0.443, and 0.467 on K, R, T, and P residues, respectively. Conclusion: When compared with an existing prediction tool, the selected models trained with hybrid features provided promising accuracy on an independent testing dataset. In short, this work not only characterized the carbonylated substrate preference but also demonstrated that the proposed method could provide a feasible means of accelerating the preliminary discovery of protein carbonylation.

KW - Amino acid composition

KW - Physicochemical properties

KW - Protein carbonylation

KW - Reactive Oxygen Species (ROS)

UR - http://www.scopus.com/inward/record.url?scp=85015168370&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85015168370&partnerID=8YFLogxK

U2 - 10.1186/s12859-017-1472-8

DO - 10.1186/s12859-017-1472-8

M3 - Article

C2 - 28361707

AN - SCOPUS:85015168370

VL - 18

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 66

ER -