iN6-methylat (5-step): identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

DNA N6-methyladenine is a non-canonical DNA modification that occurs at low levels in different eukaryotes and has been identified as playing an extremely important role in life processes. Moreover, about 0.2% of adenines in the rice genome are marked by DNA N6-methyladenine, a higher proportion than in most other species. Identifying these sites has therefore become an important area of study in biological research. Although a few computational tools have been developed to address this problem, considerable effort is still required to improve their performance. In this study, we represent DNA sequences as continuous bags of nucleobases, including sub-word information of their biological words, which then serve as features fed into a support vector machine to identify the sites. Using this hybrid approach, our model identified DNA N6-methyladenine sites with a jackknife test sensitivity of 86.48%, specificity of 89.09%, accuracy of 87.78%, and MCC of 0.756. Compared with the state-of-the-art predictor and other methods, the proposed model yields superior performance on all metrics. This study also provides a basis for further research applying natural language processing techniques to biological sequences.
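As a rough illustration of the ideas in the abstract, the sketch below splits a DNA sequence into overlapping k-mers ("biological words"), derives fastText-style character n-grams as sub-word units, and computes the four reported metrics from a confusion matrix. The function names, the k-mer length, and the n-gram range are assumptions made for illustration; the abstract does not state the settings the author used, and the embedding training and support vector machine steps are omitted here.

```python
from math import sqrt


def kmers(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (k=3 is an assumed value)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def subword_ngrams(word, min_n=2, max_n=3):
    """Character n-grams of a word wrapped in fastText-style '<' and '>' markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    words = kmers("GATTACA")
    print(words)
    print(subword_ngrams(words[0]))
    # Hypothetical counts, NOT taken from the source: a balanced set of
    # 880 positive and 880 negative samples, chosen only for illustration.
    print(binary_metrics(tp=761, fn=119, tn=784, fp=96))
```

With the hypothetical counts in the usage example, the four metrics round to 86.48%, 89.09%, 87.78%, and 0.756, which shows that the reported figures are mutually consistent for a balanced data set; the actual counts are not given in the abstract.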

Original language: English
Journal: Molecular Genetics and Genomics
DOI: 10.1007/s00438-019-01570-y
Publication status: Accepted/In press - Jan 1 2019
Externally published: Yes

Fingerprint

Genome
DNA
Natural Language Processing
Adenine
Eukaryota
Research
Sensitivity and Specificity
Oryza
6-methyladenine
Support Vector Machine

Keywords

  • Continuous bag of words
  • DNA N6-methyladenine
  • DNA replication
  • FastText
  • Skip gram
  • Support vector machine

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

@article{9c8478aab3d74d2eac942f72a445780f,
title = "iN6-methylat (5-step): identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule",
keywords = "Continuous bag of words, DNA N6-methyladenine, DNA replication, FastText, Skip gram, Support vector machine",
author = "Le, {Nguyen Quoc Khanh}",
year = "2019",
month = "1",
day = "1",
doi = "10.1007/s00438-019-01570-y",
language = "English",
journal = "Molecular Genetics and Genomics",
issn = "1617-4615",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - iN6-methylat (5-step)

T2 - identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule

AU - Le, Nguyen Quoc Khanh

PY - 2019/1/1

Y1 - 2019/1/1

KW - Continuous bag of words

KW - DNA N6-methyladenine

KW - DNA replication

KW - FastText

KW - Skip gram

KW - Support vector machine

UR - http://www.scopus.com/inward/record.url?scp=85065388389&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85065388389&partnerID=8YFLogxK

U2 - 10.1007/s00438-019-01570-y

DO - 10.1007/s00438-019-01570-y

M3 - Article

AN - SCOPUS:85065388389

JO - Molecular Genetics and Genomics

JF - Molecular Genetics and Genomics

SN - 1617-4615

ER -