Domain Information & Vocabulary Extraction (DIVE)

Purpose

This project aims to develop new methods and tools to extract biological entities from publications and other text documents. The project has two main goals. The first is to integrate the technology and software with the existing publication pipeline at the American Society of Plant Biologists as a production service. The second is to develop a tool that can intelligently uncover novel terms and information in the latest publications.

Overview

Domain Information & Vocabulary Extraction (DIVE) aims to enrich digital publications by detecting entities and key informational words and by adding annotations. The project investigates how state-of-the-art methods for standard entity extraction and related natural language processing can be adopted in the biological publication domain. It leverages existing biological ontologies to expand both the depth and the coverage of potentially interesting new key information and words. The system implements multiple strategies for biological entity detection, including regular expression rules, ontologies, and a keyword dictionary. The extracted entities are stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, an author can add annotations and correct the current results. These updates can then be used to improve entity detection in subsequently processed articles. A further goal of the system is to provide a useful tool for domain curators and researchers at large.
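The three detection strategies named above can be illustrated with a minimal sketch. This is not the DIVE implementation: the locus-identifier rule, the ontology entries with their placeholder EX: identifiers, the keyword set, and the function name detect_entities are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical regular expression rule: Arabidopsis-style locus
# identifiers such as AT1G01010 (illustrative, not a DIVE rule).
GENE_ID_RULE = re.compile(r"\bAT[1-5CM]G\d{5}\b")

# Hypothetical ontology slice: term label -> placeholder identifier.
ONTOLOGY_TERMS = {
    "root hair": "EX:0000001",
    "photosynthesis": "EX:0000002",
}

# Hypothetical keyword dictionary.
KEYWORDS = {"auxin", "drought tolerance"}


def _find_phrase(phrase, text):
    """Yield case-insensitive whole-phrase matches of `phrase` in `text`."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return pattern.finditer(text)


def detect_entities(text):
    """Return sorted (start, end, matched_text, source) tuples."""
    found = []
    # Strategy 1: regular expression rules.
    for m in GENE_ID_RULE.finditer(text):
        found.append((m.start(), m.end(), m.group(), "regex"))
    # Strategy 2: ontology term labels.
    for label, term_id in ONTOLOGY_TERMS.items():
        for m in _find_phrase(label, text):
            found.append((m.start(), m.end(), m.group(), "ontology:" + term_id))
    # Strategy 3: keyword dictionary.
    for keyword in KEYWORDS:
        for m in _find_phrase(keyword, text):
            found.append((m.start(), m.end(), m.group(), "keyword"))
    return sorted(found)


if __name__ == "__main__":
    sample = "Loss of AT1G01010 reduced root hair growth and Auxin transport."
    for entity in detect_entities(sample):
        print(entity)
```

Each match records which strategy produced it, so a curation interface could show authors where a suggestion came from before they accept or correct it.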

A major remaining challenge is to identify new terms that appear in a paper. Such terms may be synonyms of existing terms or novel terms for brand-new concepts proposed in the paper. Manual curation already struggles to keep up with existing publications. Automated approaches based on existing domain knowledge are less effective in these cases because too little existing data is available to serve as a training set. Here, we propose to investigate further how deep learning methods can be applied to ontology-based extraction and prediction of biological entities. We will implement and evaluate methods that use the word2vec tool with existing vocabularies from biological ontologies to detect new entities and synonyms. We will also work with domain scientists to extend the model to the biological domain in order to further improve the efficiency and accuracy of these approaches.
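One way word embeddings can support synonym detection is to compare a candidate term's vector against the vectors of known ontology terms. The sketch below assumes that approach and is not the project's method: the three-dimensional vectors are made-up stand-ins for embeddings that a word2vec model would learn from a corpus, and the function names and the 0.8 threshold are hypothetical.

```python
import math

# Toy vectors standing in for trained word2vec embeddings (made up).
TOY_VECTORS = {
    "stomata": [0.90, 0.10, 0.00],
    "stoma": [0.85, 0.15, 0.05],
    "chloroplast": [0.10, 0.90, 0.20],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_known_terms(candidate, known_terms, vectors, threshold=0.8):
    """Return (term, similarity) pairs for known terms close to `candidate`.

    Terms are ordered from most to least similar; candidates or known
    terms without a vector are skipped.
    """
    if candidate not in vectors:
        return []
    scored = [
        (cosine(vectors[candidate], vectors[term]), term)
        for term in known_terms
        if term in vectors and term != candidate
    ]
    return [(term, score) for score, term in sorted(scored, reverse=True)
            if score >= threshold]


if __name__ == "__main__":
    known = ["stomata", "chloroplast"]
    print(nearest_known_terms("stoma", known, TOY_VECTORS))
```

A candidate that scores above the threshold against a known term could be proposed to curators as a possible synonym, while a candidate with no close neighbor might be flagged as a potentially novel term.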

Impact

The project will investigate new methods for entity extraction and provide useful tools for biology communities and publishers. A web service has been established to serve the American Society of Plant Biologists publication pipeline for the two most influential journals in plant biology: Plant Cell and Plant Physiology. The service automatically generates the ten most useful entities from a manuscript, which can be attached to each new publication. The tool and service have the potential to be adopted by other publishers. The extraction results can also be used to provide additional methods of journal content discovery and to help digital librarians add potential new terms to existing content ontologies.

Contributors

Amit Gupta
Research Associate

Publications

Weijia Xu, Amit Gupta, Pankaj Jaiswal, Crispin Taylor and Patti Lockhart (2016) "Web Application for Extracting Key Domain Information for Scientific Publications using Ontology", in Proceedings of the International Conference on Biological Ontology.

Amit Gupta, Pankaj Jaiswal, Crispin Taylor and Weijia Xu (2018) "Improve Accessibility of Biology Papers through Integration of Domain Information Extraction in the Publication Pipeline", Workshop on Cyberinfrastructure and Machine Learning for Digital Libraries, JCDL 2018.

Amit Gupta, Weijia Xu, Pankaj Jaiswal, Crispin Taylor, and Jennifer Regala. 2019. Extracting Domain Information using Deep Learning. In Practice and Experience in Advanced Research Computing (PEARC '19), July 28-August 1, 2019, Chicago, IL, USA. ACM, New York, NY, USA, doi.org/10.1145/3332186.3332255

Funding Source

NSF CyVerse Grant