A flexible template generation and matching method with applications for publication reference metadata extraction

Ting Hao Yang, Yu Lun Hsieh, Shih Hung Liu, Yung Chun Chang, Wen Lian Hsu

Research output: Contribution to journal › Article › peer-review

Abstract

Conventional rule-based approaches use exact template matching to capture linguistic information and therefore must enumerate every variation. We propose a novel flexible template generation and matching scheme, called the principle-based approach (PBA), that is based on sequence alignment, and we employ it for reference metadata extraction (RME) to demonstrate its effectiveness. The main contributions of this research are threefold. First, we propose an automatic template generation method that captures prominent patterns using the dominating set algorithm. Second, we devise an alignment-based template-matching technique that uses a logistic regression model, which makes it more general and flexible than pure rule-based approaches. Last, we apply PBA to RME on extensive cross-domain corpora and demonstrate its robustness and generality. Experiments reveal that the same set of templates produced by the PBA framework not only delivers consistent performance on various unseen domains, but also surpasses hand-crafted knowledge (templates). We use four independent journal-style test sets and one conference-style test set in the experiments. When compared to well-known machine learning methods, such as conditional random fields (CRF), as well as recent deep learning methods (i.e., bi-directional long short-term memory with a CRF layer, Bi-LSTM-CRF), PBA has the best performance on all datasets.
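The contrast between exact and alignment-based template matching can be illustrated with a minimal sketch. This is not the PBA implementation: it uses a plain Needleman-Wunsch global alignment with fixed, arbitrarily chosen scores, whereas the abstract states that PBA's matching uses a logistic regression model. The field labels (`AUTHOR`, `YEAR`, and so on), the function names, and the score values are all hypothetical and chosen only for illustration.

```python
def exact_match(template, labels):
    """Rigid rule-based matching: succeeds only if the sequences are identical."""
    return list(template) == list(labels)


def align_score(template, labels, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two label sequences.

    The scoring constants are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(template) + 1, len(labels) + 1
    # dp[i][j] = best score for aligning template[:i] with labels[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if template[i - 1] == labels[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # align the two labels
                dp[i - 1][j] + gap,      # skip a template slot
                dp[i][j - 1] + gap,      # skip an observed label
            )
    return dp[-1][-1]


# Hypothetical template for one journal reference style.
TEMPLATE = ["AUTHOR", "YEAR", "TITLE", "JOURNAL", "VOLUME", "PAGES"]
# The same style with an extra field that the template does not cover.
OBSERVED = ["AUTHOR", "YEAR", "TITLE", "JOURNAL", "VOLUME", "ISSUE", "PAGES"]

if __name__ == "__main__":
    print(exact_match(TEMPLATE, OBSERVED))
    print(align_score(TEMPLATE, OBSERVED), align_score(TEMPLATE, TEMPLATE))
```

With these assumed scores, the observed sequence fails the exact match but still aligns with a score of 11 against a maximum of 12, so a single template can tolerate a variation that a rigid rule would have to enumerate separately.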

Original language: English
Pages (from-to): 32-45
Number of pages: 14
Journal: Journal of the Association for Information Science and Technology
Volume: 72
Issue number: 1
Publication status: Published - Jan 2021

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Information Systems and Management
  • Library and Information Sciences
