A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

Nguyen Quoc Khanh Le, Quang Thai Ho, Trinh Trung Duong Nguyen, Yu Yen Ou

Research output: Contribution to journal, Article, peer-reviewed

53 Citations (Scopus)

Abstract

Recently, language representation models have drawn considerable attention in natural language processing due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts contextualized word embeddings to capture the semantics of words and the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information in DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. We observed that our BERT-based features improved sensitivity, specificity, accuracy and Matthews correlation coefficient by more than 5-10% compared to the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (as represented by 2D convolutional neural networks; CNNs) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
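
The four evaluation measures named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/tn/fp/fn are the counts of true positives, true negatives,
    false positives and false negatives. Undefined ratios fall back to 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient: ranges from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not results from the paper).
scores = binary_metrics(tp=80, tn=70, fp=30, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Reporting MCC alongside accuracy is useful when the enhancer and non-enhancer classes are imbalanced, since MCC takes all four confusion-matrix cells into account.
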
Original language: English
Article number: bbab005
Journal: Briefings in Bioinformatics
Volume: 22
Issue number: 5
Publication status: Published - Sep 2021

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology
