A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

Nguyen Quoc Khanh Le, Quang Thai Ho, Trinh Trung Duong Nguyen, Yu Yen Ou

Research output: Contribution to journal › Article › peer-review

53 Citations (Scopus)

Abstract

Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts the concept of contextualized word embedding to capture the semantics of words in the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that our BERT-based features yielded improvements of more than 5-10% in sensitivity, specificity, accuracy and Matthews correlation coefficient compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
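The four evaluation measures named in the abstract have standard definitions for a binary classifier. The sketch below is illustrative only and is not taken from the article: the function name `binary_metrics` and the example counts are hypothetical, and it assumes that enhancers are the positive class.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn count positives (assumed here to be enhancers) predicted correctly/incorrectly;
    tn/fp count negatives predicted correctly/incorrectly.
    Any measure whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient: ranges from -1 to 1, with 0 meaning chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=80, tn=70, fp=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative when the positive and negative classes are unbalanced, which may be why it is often reported alongside accuracy.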

Original language: English
Article number: bbab005
Journal: Briefings in Bioinformatics
Volume: 22
Issue number: 5
Publication status: Published - Sept 2021

Keywords

  • BERT
  • biological sequence
  • contextualized word embedding
  • convolutional neural network
  • DNA enhancer
  • NLP transformer

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology
