Collective instance-level gene normalization on the IGN corpus

Hong Jie Dai, Johnny Chi Yang Wu, Richard Tzong Han Tsai

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

A high proportion of life science research is gene-oriented: scientists aim to investigate the roles that genes play in biological processes and their involvement in biological mechanisms. As a result, gene names and their related information are among the main objects of interest in the biomedical literature. Although the recognition of gene mentions has made significant progress, recognition results alone are still insufficient for direct use because gene names are ambiguous. Gene normalization (GN) goes beyond the recognition task by linking a gene mention to a database ID. Unlike most previous work, we approach GN at the instance level and evaluate its overall performance on the recognition and normalization steps in abstracts and full texts. We release the first instance-level gene normalization (IGN) corpus in the BioC format, which includes annotations for the boundaries of all gene mentions and the corresponding IDs for human gene mentions. Species information, along with existing co-reference chains and full name/abbreviation pairs, is also provided for each gene mention. Using the released corpus, we have designed a collective instance-level GN approach that uses not only the contextual information of each individual instance, but also the relations among instances and the inherent characteristics of full-text sections. Our experimental results show that our collective approach can achieve an F-score of 0.743. By exploiting section characteristics in full-text articles, the proposed approach can improve the F-scores of information-lacking sections by up to 1.8%. In addition, the proposed refinement process improved the F-score of gene mention recognition by 0.125 and that of GN by 0.03. Although the current experimental results are limited to the human species, we intend to continue updating the annotations of the IGN corpus and to observe how the proposed approach can be extended to other species.
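To make the instance-level evaluation mentioned in the abstract concrete, the following is a minimal sketch of how precision, recall, and F-score could be computed when every individual gene mention is scored. It assumes that a prediction counts as correct only if its document, character span, and gene ID all match a gold annotation exactly; the abstract does not state the matching criterion, so this rule, the `Mention` type, the function name, and the sample IDs are all illustrative assumptions rather than the authors' evaluation code.

```python
from typing import NamedTuple, Set, Tuple


class Mention(NamedTuple):
    """One gene mention instance (hypothetical representation)."""
    doc_id: str
    start: int    # character offset where the mention begins
    end: int      # character offset where the mention ends
    gene_id: str  # database ID assigned to the mention


def instance_level_scores(
    gold: Set[Mention], predicted: Set[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact span-and-ID matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative data only: offsets and IDs are placeholders.
gold = {
    Mention("doc1", 0, 4, "7157"),
    Mention("doc1", 30, 35, "672"),
    Mention("doc1", 50, 54, "7157"),
}
predicted = {
    Mention("doc1", 0, 4, "7157"),    # correct span and ID
    Mention("doc1", 30, 35, "675"),   # correct span, wrong ID
    Mention("doc1", 50, 54, "7157"),  # repeated mention, scored separately
    Mention("doc1", 80, 84, "1956"),  # spurious mention
}

precision, recall, f_score = instance_level_scores(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

Because each occurrence is scored separately, a gene mentioned twice in one document contributes two instances, which is what distinguishes this setting from scoring one list of gene IDs per document.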

Original language: English
Article number: e79517
Journal: PLoS One
Volume: 8
Issue number: 11
DOI: 10.1371/journal.pone.0079517
Publication status: Published - Nov 25 2013

Fingerprint

Genes
Names
Biological Phenomena
Biological Science Disciplines
Databases

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)

Cite this

Collective instance-level gene normalization on the IGN corpus. / Dai, Hong Jie; Wu, Johnny Chi Yang; Tsai, Richard Tzong Han.

In: PLoS One, Vol. 8, No. 11, e79517, 25.11.2013.

Dai, Hong Jie ; Wu, Johnny Chi Yang ; Tsai, Richard Tzong Han. / Collective instance-level gene normalization on the IGN corpus. In: PLoS One. 2013 ; Vol. 8, No. 11.
@article{8488572a303c4ff592b1ce25a2459bb7,
title = "Collective instance-level gene normalization on the IGN corpus",
author = "Dai, {Hong Jie} and Wu, {Johnny Chi Yang} and Tsai, {Richard Tzong Han}",
year = "2013",
month = "11",
day = "25",
doi = "10.1371/journal.pone.0079517",
language = "English",
volume = "8",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "11",

}

TY - JOUR

T1 - Collective instance-level gene normalization on the IGN corpus

AU - Dai, Hong Jie

AU - Wu, Johnny Chi Yang

AU - Tsai, Richard Tzong Han

PY - 2013/11/25

Y1 - 2013/11/25

UR - http://www.scopus.com/inward/record.url?scp=84894235491&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894235491&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0079517

DO - 10.1371/journal.pone.0079517

M3 - Article

C2 - 24282506

AN - SCOPUS:84894235491

VL - 8

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 11

M1 - e79517

ER -