A Picture is Worth a Thousand Words: Natural Language Processing in Context

Lovisa Hagström

Licentiate thesis, 2023

Modern NLP models learn language from lexical co-occurrences. While this approach has enabled significant breakthroughs, it has also exposed limitations of modern NLP methods. For example, NLP models are prone to hallucination, represent a biased world view, and may learn spurious correlations that solve the dataset rather than the task at hand. This is to some extent a consequence of training the models exclusively on text. In text, concepts are defined only by the words that accompany them, and the information conveyed is incomplete due to reporting bias. In this work, we investigate whether additional context in the form of multimodal information can improve the representations of modern NLP models. Specifically, we consider BERT-based vision-and-language models that receive additional context from images. We hypothesize that visual training should primarily improve the models' visual commonsense knowledge, i.e. obvious knowledge about visual properties. To probe for this knowledge, we develop the evaluation tasks Memory Colors and Visual Property Norms.

Generally, we find that the vision-and-language models considered do not outperform their unimodal counterparts. In addition, we find that the models change their answers depending on the prompt when evaluated for the same type of knowledge. We conclude that more work is needed on understanding and developing vision-and-language models, and that extra focus should be put on how to successfully fuse image and language processing. We also reconsider the usefulness of measuring commonsense knowledge in models that cannot represent factual knowledge.

NLP

Knowledge representation

Neural network

Vision-and-language models

Grounding

BERT

EE-salen, Hörsalsvägen 11.

Opponent: Prof. Desmond Elliott, Department of Computer Science, University of Copenhagen, Denmark.

Author

Lovisa Hagström

Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI


Subject Categories (SSIF 2011)

Other Computer and Information Science

Language Technology (Computational Linguistics)

Computer Science

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

Publisher

Chalmers