iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding

Nguyen Quoc Khanh Le, Edward Kien Yee Yapp, Quang Thai Ho, N. Nagasundaram, Yu Yen Ou, Hui Yuan Yeh

Research output: Contribution to journal › Article

20 Citations (Scopus)

Abstract

An enhancer is a short (50–1500 bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer, disorder, or inflammatory bowel disease. Because of the importance of enhancers in genomics, their classification has become a popular area of research in computational biology. Although a few computational tools have been developed to address this problem, their performance still needs improvement. In this study, we represent enhancers with word embeddings that include sub-word information of their biological words; these embeddings serve as features for a support vector machine classifier. We present iEnhancer-5Step, a web server containing two-layer classifiers that identify enhancers (first layer) and their strength (second layer). We attain independent test accuracies of 79% and 63.5% in the two layers, respectively. On the same dataset, the proposed method outperforms current predictors. Moreover, this study provides a basis for further research on applying natural language processing techniques to biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.
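The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word length `k`, the n-gram range, the boundary markers `<` and `>` (borrowed from the fastText convention for sub-words), and all function names are assumptions made for illustration. The two classifiers are passed in as plain callables; in the paper they are support vector machines trained on the embedding features.

```python
from typing import Callable, List


def kmer_words(seq: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers ("biological words").

    The value of k is an assumption; the abstract does not state it.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def subword_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of a word wrapped in boundary markers.

    This mirrors the sub-word idea only; the n-gram range is hypothetical.
    """
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def two_layer_predict(
    features: object,
    is_enhancer: Callable[[object], bool],
    is_strong: Callable[[object], bool],
) -> str:
    """Layer 1 decides enhancer vs. non-enhancer; layer 2 grades strength.

    The second classifier is consulted only for predicted enhancers.
    """
    if not is_enhancer(features):
        return "non-enhancer"
    return "strong enhancer" if is_strong(features) else "weak enhancer"


if __name__ == "__main__":
    words = kmer_words("ACGTA", k=3)
    print(words)
    print(subword_ngrams(words[0]))
    # Placeholder rules standing in for trained classifiers (not real models).
    print(two_layer_predict(words, lambda f: len(f) > 2, lambda f: "CGT" in f))
```

In a real setting the word and sub-word tokens would be turned into embedding vectors and the two callables replaced by trained classifiers; the cascade structure stays the same.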

Original language: English
Pages (from-to): 53-61
Number of pages: 9
Journal: Analytical Biochemistry
Volume: 571
DOI: 10.1016/j.ab.2019.02.017
Publication status: Published - Apr 15 2019
Externally published: Yes

Fingerprint

DNA sequences
Natural Language Processing
Genomics
Computational Biology
Inflammatory Bowel Diseases
Research
Gene expression
Support vector machines
Classifiers
Servers
RNA
Gene Expression
DNA
Processing
Neoplasms
Proteins
Datasets
Support Vector Machine

Keywords

  • Continuous bag of words
  • Regulatory transcription factor
  • Sequence analysis
  • Skip gram
  • Support vector machine
  • Two-layer classification

ASJC Scopus subject areas

  • Biophysics
  • Biochemistry
  • Molecular Biology
  • Cell Biology

Cite this

iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. / Le, Nguyen Quoc Khanh; Yapp, Edward Kien Yee; Ho, Quang Thai; Nagasundaram, N.; Ou, Yu Yen; Yeh, Hui Yuan.

In: Analytical Biochemistry, Vol. 571, 15.04.2019, p. 53-61.

@article{906d931c4e2840b9bb3623523cbfa56f,
title = "iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding",
keywords = "Continuous bag of words, Regulatory transcription factor, Sequence analysis, Skip gram, Support vector machine, Two-layer classification",
author = "Le, {Nguyen Quoc Khanh} and Yapp, {Edward Kien Yee} and Ho, {Quang Thai} and N. Nagasundaram and Ou, {Yu Yen} and Yeh, {Hui Yuan}",
year = "2019",
month = "4",
day = "15",
doi = "10.1016/j.ab.2019.02.017",
language = "English",
volume = "571",
pages = "53--61",
journal = "Analytical Biochemistry",
issn = "0003-2697",
publisher = "Academic Press Inc.",

}

TY - JOUR
T1 - iEnhancer-5Step
T2 - Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding
AU - Le, Nguyen Quoc Khanh
AU - Yapp, Edward Kien Yee
AU - Ho, Quang Thai
AU - Nagasundaram, N.
AU - Ou, Yu Yen
AU - Yeh, Hui Yuan
PY - 2019/4/15
Y1 - 2019/4/15
KW - Continuous bag of words
KW - Regulatory transcription factor
KW - Sequence analysis
KW - Skip gram
KW - Support vector machine
KW - Two-layer classification
UR - http://www.scopus.com/inward/record.url?scp=85062237812&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062237812&partnerID=8YFLogxK
U2 - 10.1016/j.ab.2019.02.017
DO - 10.1016/j.ab.2019.02.017
M3 - Article
C2 - 30822398
AN - SCOPUS:85062237812
VL - 571
SP - 53
EP - 61
JO - Analytical Biochemistry
JF - Analytical Biochemistry
SN - 0003-2697
ER -