Background: Globalization and environmental changes have intensified the emergence or re-emergence of infectious diseases worldwide, such as outbreaks of dengue fever in Southeast Asia. Collaboration on region-wide infectious disease surveillance systems is therefore critical but difficult to achieve, because the transparency of health information systems varies between countries. Although the Program for Monitoring Emerging Diseases (ProMED)-mail is the most comprehensive international expert-curated platform providing rich disease outbreak information on humans, animals, and plants, the unstructured text of its reports makes them difficult to analyze for downstream applications.
Objective: To make monitoring the epidemic situation in Southeast Asia more efficient, this study aims to automatically summarize alert articles from ProMED-mail, a large textual data source. We propose a text summarization method that uses natural language processing to extract important sentences from ProMED-mail alert articles and assemble them into summaries. The method quickly captures crucial information to support decisions in epidemic surveillance.
Methods: Our data, which span the period from 1994 to 2019, come from the ProMED-mail website. We analyzed the collected data to establish a unique Taiwan dengue corpus, which was validated against professionals’ annotations and achieved almost perfect agreement (Cohen κ=90%). To generate a ProMED-mail summary, we developed a dual-channel bidirectional long short-term memory network with an attention mechanism, infused with latent syntactic features, to identify key sentences in an alert article.
Results: Our method outperforms many well-known machine learning and neural network approaches in identifying important sentences, achieving a macroaverage F1 score of 93%. Moreover, it successfully extracts the relevant, correct information on dengue fever from a ProMED-mail alert article, which helps researchers and general users grasp the essence of the article at a glance. In addition to verifying the model, we recruited 3 professional experts and 2 students from related fields to take part in a satisfaction survey on the generated summaries; 84% (63/75) of the summaries received high satisfaction ratings.
Conclusions: The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze the syntactic, semantic, and contextual information in the text, and then exploits the derived information to identify crucial sentences in a ProMED-mail alert article. The experimental results show that the proposed method is effective and outperforms the compared methods. Our approach also demonstrates the potential for generating case summaries from ProMED-mail alert articles. In practical use, when a new alert article arrives, our method can quickly identify the relevant case information, which is the most critical part, for use as a reference or for further analysis.
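The abstract reports two evaluation statistics, Cohen κ for annotator agreement and macroaverage F1 for sentence classification. As a reading aid, the following minimal sketch shows how both are conventionally computed for a binary "key sentence" labeling task. It is not the authors' evaluation code: the function names `cohen_kappa` and `macro_f1` and the toy label lists are hypothetical, chosen only for illustration. Note that κ is usually expressed as a coefficient, so the reported 90% corresponds to 0.90.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (1 = key sentence, 0 = other); not data from the study.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]

print(f"kappa    = {cohen_kappa(annotator_1, annotator_2):.3f}")
print(f"macro F1 = {macro_f1(annotator_1, annotator_2):.3f}")
```

Macroaveraging gives the minority class (key sentences) the same weight as the majority class, which matters when most sentences in an article are not summary-worthy.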
- bidirectional long short-term memory
- dual channel
- natural language processing
ASJC Scopus subject areas
- Health Informatics
- Public Health, Environmental and Occupational Health