A Picture is Worth a Thousand Words: Natural Language Processing in Context
Licentiate thesis, 2023
Generally, we find that the vision-and-language models considered do not outperform unimodal model counterparts. In addition to this, we find that the models switch their answer depending on prompt when evaluated for the same type of knowledge. We conclude that more work is needed on understanding and developing vision-and-language models, and that extra focus should be put on how to successfully fuse image and language processing. We also reconsider the usefulness of measuring commonsense knowledge in models that cannot represent factual knowledge.
NLP
Knowledge representation
Neural network
Vision-and-language models
Grounding
BERT
Author
Lovisa Hagström
Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI
Subject Categories
Other Computer and Information Science
Language Technology (Computational Linguistics)
Computer Science
Infrastructure
C3SE (Chalmers Centre for Computational Science and Engineering)
Publisher
Chalmers
EE-salen, Hörsalsvägen 11.
Opponent: Prof. Desmond Elliott, Department of Computer Science, University of Copenhagen, Denmark.