Thesaurus-based disambiguation of gene symbols

Background: Massive text mining of the biological literature holds great promise for relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.

Results: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.

Conclusion: The ambiguity of human gene symbols is substantial, not only because one symbol may ...
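
The homonym statistic in the abstract (the share of genes with one or more ambiguous symbols) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy thesaurus, the concept identifiers, the "gene:" prefix convention, and the function names are all assumptions made for illustration. A symbol is treated as ambiguous when it maps to more than one concept, whether the other concept is a second gene or a non-gene meaning.

```python
from collections import defaultdict

# Toy thesaurus, invented for illustration: concept id -> set of symbols.
# Ids starting with "gene:" are genes; any other id is a non-gene meaning.
THESAURUS = {
    "gene:A": {"PSA", "KLK3"},
    "gene:B": {"PSA", "NPEPPS"},
    "gene:C": {"CAT"},
    "gene:D": {"TP53"},
    "other:cat_animal": {"CAT"},
}


def ambiguous_symbols(thesaurus):
    """Return the symbols that map to more than one concept."""
    owners = defaultdict(set)
    for concept, symbols in thesaurus.items():
        for symbol in symbols:
            # Compare case-insensitively (an assumption of this sketch).
            owners[symbol.upper()].add(concept)
    return {symbol for symbol, concepts in owners.items() if len(concepts) > 1}


def ambiguous_gene_fraction(thesaurus):
    """Fraction of genes that have at least one ambiguous symbol."""
    ambiguous = ambiguous_symbols(thesaurus)
    genes = [c for c in thesaurus if c.startswith("gene:")]
    hits = [g for g in genes
            if {s.upper() for s in thesaurus[g]} & ambiguous]
    return len(hits) / len(genes)


if __name__ == "__main__":
    print(sorted(ambiguous_symbols(THESAURUS)))
    print(ambiguous_gene_fraction(THESAURUS))
```

In this toy data, one symbol is shared by two genes and another is shared by a gene and a non-gene concept, mirroring the two sources of ambiguity the abstract names.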

Bob J. A. Schijvenaars, Barend Mons, Marc Weeber, Martijn J. Schuemie, Erik M. van Mulligen, Hester M. Wain, Jan A. Kors

BMCBI 2005 | Gene Symbols | Human Gene | Massive Text Mining

» A Literature Based Method for Identifying Gene-Disease Connections

» A system for finding biological entities that satisfy certain conditions from texts

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2005
Where	BMCBI
Authors	Bob J. A. Schijvenaars, Barend Mons, Marc Weeber, Martijn J. Schuemie, Erik M. van Mulligen, Hester M. Wain, Jan A. Kors
