NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding
Journal article, 2021

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems, in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Authors

Kanix Wang

University of Chicago

Robert Stevens

University of Manchester

Halima Alachram

University of Göttingen

Yu Li

King Abdullah University of Science and Technology (KAUST)

Larisa N. Soldatova

Goldsmiths, University of London

Ross King

Chalmers, Biology and Biological Engineering, Systems and Synthetic Biology

Alan Turing Institute

University of Cambridge

Sophia Ananiadou

University of Manchester

Annika M. Schoene

University of Manchester

Maolin Li

University of Manchester

Fenia Christopoulou

University of Manchester

José Luis Ambite

Information Sciences Institute

Joel Matthew

Information Sciences Institute

Sahil Garg

Information Sciences Institute

Ulf Hermjakob

Information Sciences Institute

Daniel Marcu

Information Sciences Institute

Emily Sheng

Information Sciences Institute

Tim Beißbarth

University of Göttingen

Edgar Wingender

geneXplain GmbH

Aram Galstyan

Information Sciences Institute

Xin Gao

King Abdullah University of Science and Technology (KAUST)

Brendan Chambers

University of Chicago

Weidi Pan

University of Chicago

Bohdan B. Khomtchouk

University of Chicago

James A. Evans

University of Chicago

Andrey Rzhetsky

University of Chicago

npj Systems Biology and Applications

20567189 (eISSN)

Vol. 7, Issue 1, Article 38

Subject Categories

Language Technology (Computational Linguistics)

Embedded Systems

Computer Systems

DOI

10.1038/s41540-021-00200-x

More information

Latest update

1/3/2024 9