Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
Paper in proceedings, 2023

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art RETRO model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in RETRO with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
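To make the surface-level retrieval idea concrete, below is a minimal, self-contained sketch of BM25 scoring used to re-rank a small set of candidate neighbor chunks, as in the re-ranking scenario the abstract describes. This is an illustrative implementation of the standard Okapi BM25 formula, not the paper's actual code; the function names, tokenization (whitespace lowercasing), and parameter defaults (`k1=1.2`, `b=0.75`) are assumptions for the sketch.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            # Smoothed IDF, as in the common Okapi BM25 variant.
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term-frequency saturation with length normalization.
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

def rerank(query, candidates):
    """Re-rank candidate chunks (e.g., dense-retrieval neighbors) by BM25."""
    q = query.lower().split()
    docs = [c.lower().split() for c in candidates]
    scores = bm25_scores(q, docs)
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]
```

In a re-ranking setup like the one described, the cheap dense retriever would first fetch a candidate pool, and `rerank` would then promote the candidates with the highest token overlap with the query chunk, avoiding a full BM25 search over the whole datastore.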

Authors

Ehsan Doostmohammadi

Linköping University

Tobias Norlund

Recorded Future

Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI

Marco Kuhlmann

Linköping University

Richard Johansson

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Data Science

Association for Computational Linguistics. Annual Meeting Conference Proceedings

0736-587X (ISSN)

Vol. 2, pp. 521–529
9781959429715 (ISBN)

61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Toronto, Canada

Subject Categories

Computer Science

DOI

10.18653/v1/2023.acl-short.45


Latest update

9/23/2024