LLM-retrieval based scientific knowledge grounding
Paper in proceedings, 2025
The automated high-throughput laboratory offers unprecedented potential for scientific discovery, yet effectively linking new studies to existing knowledge remains a significant challenge. As the general body of scientific knowledge grows, so too does the burden of contextualizing a new experiment. While ontologies and databases serve as structured common repositories, their rigid schemas are often incompatible with the unstructured or semi-structured formats used in most laboratories. In this study, we investigate the integration of large language models (LLMs) with ontology-based vector databases to anchor semi-structured scientific experiments to knowledge bases via automated retrieval. Our approach extracts scientific entities from unstructured experimental texts and grounds them to relevant ontology terms. This automated knowledge grounding eases the integration of unstructured experimental data into established formal scientific languages. We tested our method on a diverse selection of experimental yeast biology papers focused on Saccharomyces cerevisiae, a foundational model system that has driven major discoveries in molecular and cellular biology, and observed strong pipeline performance. We argue that such a knowledge grounding approach is a critical component of the new wave of efficient artificial intelligence (AI) driven automated laboratories that integrate LLMs with high-throughput experimentation and data-driven discovery.
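The grounding step summarized above can be sketched in miniature as follows. This is an illustrative assumption, not the paper's implementation: the ontology, its term IDs, and the character-trigram similarity (standing in for LLM-derived embeddings in a vector database) are all invented for the example.

```python
from collections import Counter
import math

# Hypothetical mini-ontology: illustrative term IDs and labels, not a real export.
ONTOLOGY = {
    "ONT:0001": "cell cycle",
    "ONT:0002": "DNA replication",
    "ONT:0003": "protein folding",
}

def trigram_vector(text):
    """Character-trigram counts as a toy stand-in for a learned embedding."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def ground(entity, ontology=ONTOLOGY):
    """Return the ontology term ID whose label is most similar to the extracted entity mention."""
    scored = {tid: cosine(trigram_vector(entity), trigram_vector(label))
              for tid, label in ontology.items()}
    return max(scored, key=scored.get)

print(ground("cell-cycle arrest"))  # → ONT:0001
```

In the actual pipeline a vector database would replace the exhaustive loop, and embeddings from a language model would replace the trigram vectors, but the retrieve-by-similarity structure of the grounding step is the same.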
Knowledge Engineering
Saccharomyces cerevisiae
Information Extraction for RKGs/SKGs
Large Language Models
Artificial Intelligence
Ontologies