Programmatic Theme: Informatics Implementation
Abstract: Chemical entity recognition (CER) is essential for indexing scientific literature in the MEDLINE database at the National Library of Medicine. However, the tool currently used to suggest indexing terms, the Medical Text Indexer (MTI), was not originally conceived as a chemical recognition tool; it has instead been adapted to the task through its use of MetaMap and the addition of in-house patterns and rules. To develop a tool more suitable for chemical recognition, we created a collection of 200 MEDLINE titles and abstracts annotated with genes, proteins, inorganic and organic chemicals, and other biological molecules. We use this collection to evaluate eleven CER systems, seeking a tool that both recognizes chemical entities effectively for indexing and performs well on chemical recognition beyond the indexing task. We observe the highest performance with a SciBERT ensemble.
Learning Objective: After reading this paper, the reader should have a better understanding of:
1. Challenges faced by MTI in chemical entity recognition, such as recognition of out-of-vocabulary entities.
2. The challenges a CER tool will face in the indexing pipeline, such as recognition of many classes of entities.
3. Challenges faced by researchers in CER in general.
4. The effectiveness of pre-trained transformers, such as SciBERT, over other CER approaches.
5. The utility of the ChEMFAM corpus for evaluating CER systems.
Max Savery (Presenter)
National Library of Medicine
Willie Rogers, National Library of Medicine
Malvika Pillai, The University of North Carolina at Chapel Hill
James Mork, National Library of Medicine
Dina Demner-Fushman, National Library of Medicine