Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters

Trinh Trung Duong Nguyen, Nguyen Quoc Khanh Le, Quang Thai Ho, Dinh Van Phan, Yu Yen Ou

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, which makes it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, a main driver of the breakthroughs in natural language processing in recent years, to protein sequences of transporters. We represented each protein sequence by the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that make up a protein sequence to find the optimal length, the one that generated the most discriminative feature set. Compared with four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research on applying natural language processing techniques in bioinformatics.
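
The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how embeddings and frequencies are combined, so the frequency-weighted average below is an assumption. The embeddings are random stand-ins for trained ones, and the function names (`biological_words`, `toy_embeddings`, `sequence_features`) and the example sequences are hypothetical.

```python
import random
from collections import Counter


def biological_words(sequence, k):
    """Split a protein sequence into overlapping words (k-mers) of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embeddings(vocabulary, dim, seed=0):
    """Stand-in for trained word embeddings: one random vector per word.

    Assumption: a real pipeline would learn these vectors from a corpus
    of protein sequences instead of drawing them at random.
    """
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in sorted(vocabulary)
    }


def sequence_features(sequence, k, embeddings, dim):
    """Frequency-weighted average of the embeddings of a sequence's words."""
    counts = Counter(biological_words(sequence, k))
    total = sum(counts.values())
    features = [0.0] * dim
    if total == 0:  # sequence shorter than k: no words, return zeros
        return features
    for word, count in counts.items():
        vector = embeddings.get(word)
        if vector is None:  # word has no embedding: skip it
            continue
        weight = count / total  # relative frequency of the word
        for j in range(dim):
            features[j] += weight * vector[j]
    return features


# Hypothetical toy sequences, not data from the study.
sequences = ["MKTLLVAGLL", "MKVLAAGLLK", "GSHMKTLLVA"]
K, DIM = 3, 4  # word length and embedding size, both chosen arbitrarily
vocab = {w for s in sequences for w in biological_words(s, K)}
emb = toy_embeddings(vocab, DIM)
X = [sequence_features(s, K, emb, DIM) for s in sequences]
```

Each row of `X` is a fixed-length vector that could be passed to a classifier such as a support vector machine. Repeating the construction for several values of `K` mirrors the search for the optimal word length mentioned above.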

Original language: English
Pages (from-to): 73-81
Number of pages: 9
Journal: Analytical Biochemistry
Volume: 577
DOIs: 10.1016/j.ab.2019.04.011
Publication status: Published - Jul 15 2019
Externally published: Yes

Fingerprint

Substrate Specificity
Substrates
Natural Language Processing
Proteins
Membrane Transport Proteins
Bioinformatics
Computational Biology
Drug Design
Processing
Area Under Curve
Learning systems
Research Personnel
Membranes
Research
Pharmaceutical Preparations

Keywords

  • Feature extraction
  • Natural language processing
  • Protein function prediction
  • Substrate specificities
  • Support vector machine
  • Transporter
  • Word embeddings

ASJC Scopus subject areas

  • Biophysics
  • Biochemistry
  • Molecular Biology
  • Cell Biology

Cite this

Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters. / Nguyen, Trinh Trung Duong; Le, Nguyen Quoc Khanh; Ho, Quang Thai; Phan, Dinh Van; Ou, Yu Yen.

In: Analytical Biochemistry, Vol. 577, 15.07.2019, p. 73-81.

@article{92a2e7a90e1744378d998f98f6d1ae28,
title = "Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters",
abstract = "Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, thus being an important problem for bioinformatics researchers. In this study, we applied word embedding approach, the main cause for natural language processing breakout in recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features can help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99, respectively on the 5-fold cross validation and the independent test. With this result, our study can help biologists identify transporters based on substrate specificities as well as provides a basis for further research that enriches a field of applying natural language processing techniques in bioinformatics.",
keywords = "Feature extraction, Natural language processing, Protein function prediction, Substrate specificities, Support vector machine, Transporter, Word embeddings",
author = "Nguyen, {Trinh Trung Duong} and Le, {Nguyen Quoc Khanh} and Ho, {Quang Thai} and Phan, {Dinh Van} and Ou, {Yu Yen}",
year = "2019",
month = "7",
day = "15",
doi = "10.1016/j.ab.2019.04.011",
language = "English",
volume = "577",
pages = "73--81",
journal = "Analytical Biochemistry",
issn = "0003-2697",
publisher = "Academic Press Inc.",
}

TY - JOUR

T1 - Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters

AU - Nguyen, Trinh Trung Duong

AU - Le, Nguyen Quoc Khanh

AU - Ho, Quang Thai

AU - Phan, Dinh Van

AU - Ou, Yu Yen

PY - 2019/7/15

Y1 - 2019/7/15

N2 - Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, thus being an important problem for bioinformatics researchers. In this study, we applied word embedding approach, the main cause for natural language processing breakout in recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features can help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99, respectively on the 5-fold cross validation and the independent test. With this result, our study can help biologists identify transporters based on substrate specificities as well as provides a basis for further research that enriches a field of applying natural language processing techniques in bioinformatics.

AB - Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, thus being an important problem for bioinformatics researchers. In this study, we applied word embedding approach, the main cause for natural language processing breakout in recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features can help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99, respectively on the 5-fold cross validation and the independent test. With this result, our study can help biologists identify transporters based on substrate specificities as well as provides a basis for further research that enriches a field of applying natural language processing techniques in bioinformatics.

KW - Feature extraction

KW - Natural language processing

KW - Protein function prediction

KW - Substrate specificities

KW - Support vector machine

KW - Transporter

KW - Word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85064809652&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064809652&partnerID=8YFLogxK

U2 - 10.1016/j.ab.2019.04.011

DO - 10.1016/j.ab.2019.04.011

M3 - Article

C2 - 31022378

AN - SCOPUS:85064809652

VL - 577

SP - 73

EP - 81

JO - Analytical Biochemistry

JF - Analytical Biochemistry

SN - 0003-2697

ER -