NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions

Hong Jie Dai, Onkar Singh, Jitendra Jonnagaddala, Emily Chia Yu Su

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
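The prefix idea in the abstract can be illustrated with a small sketch. It is written in Python for brevity, whereas the published modules are .NET libraries, and it is not the authors' algorithm or API. The lookup tables, the regular expression, and the function names `normalize_species` and `prefixed_species` are all hypothetical and chosen only for illustration; the NCBI Taxonomy IDs shown (9606, 10090, 3702) are the standard IDs for human, mouse, and Arabidopsis thaliana.

```python
import re

# Illustrative lexicon only (an assumption, not the authors' dictionary):
# species mentions mapped to NCBI Taxonomy IDs.
SPECIES_TO_TAXID = {
    "homo sapiens": "9606",
    "human": "9606",
    "mus musculus": "10090",
    "mouse": "10090",
    "arabidopsis thaliana": "3702",
}

# Hypothetical table of species abbreviations that may prefix a gene name.
PREFIX_TO_TAXID = {
    "h": "9606",    # human
    "m": "10090",   # mouse
    "At": "3702",   # Arabidopsis thaliana
}

# A known prefix followed by a gene-like symbol that starts with a capital.
_PREFIXED_GENE = re.compile(r"^(At|h|m)([A-Z][A-Za-z0-9]+)$")


def normalize_species(mention):
    """Return the Taxonomy ID for a species mention, or None if unknown."""
    return SPECIES_TO_TAXID.get(mention.strip().lower())


def prefixed_species(gene_mention):
    """Return (taxonomy_id, gene_symbol) if the mention carries a species
    prefix from the table above, otherwise None."""
    match = _PREFIXED_GENE.match(gene_mention)
    if match is None:
        return None
    prefix, symbol = match.groups()
    return PREFIX_TO_TAXID[prefix], symbol


if __name__ == "__main__":
    print(normalize_species("Homo sapiens"))
    print(prefixed_species("hTERT"))
    print(prefixed_species("AtWRKY33"))
    print(prefixed_species("TP53"))
```

A pattern this simple over-generates: it would read the "m" in "mTOR" as a mouse prefix, although that letter does not denote a species. Any practical module would need context to filter such cases, which this sketch does not attempt.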

Original language: English
Journal: Database : the journal of biological databases and curation
Volume: 2016
DOI: 10.1093/database/baw111
Publication status: Published - Jan 1 2016

Fingerprint

Genes
Proteins
Names
genes
proteins
Chemical Databases
Molecular interactions
Taxonomies
Libraries
Research Personnel
Databases
researchers
taxonomy
Processing

ASJC Scopus subject areas

  • Information Systems
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions. / Dai, Hong Jie; Singh, Onkar; Jonnagaddala, Jitendra; Su, Emily Chia Yu.

In: Database : the journal of biological databases and curation, Vol. 2016, 01.01.2016.


@article{922edc73a552449cbbcff0690d966cc9,
title = "NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions",
abstract = "In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.",
author = "Dai, {Hong Jie} and Onkar Singh and Jitendra Jonnagaddala and Su, {Emily Chia Yu}",
year = "2016",
month = "1",
day = "1",
doi = "10.1093/database/baw111",
language = "English",
volume = "2016",
journal = "Database : the journal of biological databases and curation",
issn = "1758-0463",
publisher = "Oxford University Press",

}

TY - JOUR

T1 - NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions

AU - Dai, Hong Jie

AU - Singh, Onkar

AU - Jonnagaddala, Jitendra

AU - Su, Emily Chia Yu

PY - 2016/1/1

Y1 - 2016/1/1

AB - In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.

UR - http://www.scopus.com/inward/record.url?scp=85011021624&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85011021624&partnerID=8YFLogxK

U2 - 10.1093/database/baw111

DO - 10.1093/database/baw111

M3 - Article

C2 - 27465130

AN - SCOPUS:85011021624

VL - 2016

JO - Database : the journal of biological databases and curation

JF - Database : the journal of biological databases and curation

SN - 1758-0463

ER -