Controlling for Stereotypes in Multimodal Language Model Evaluation
Paper in proceedings, 2022

We propose a methodology and design two benchmark sets for measuring the extent to which language-and-vision models use the visual signal in the presence or absence of stereotypes. The first benchmark is designed to test for stereotypical colors of common objects, while the second benchmark considers gender stereotypes. The key idea is to compare a model's predictions when the image conforms to a stereotype with its predictions when it does not. Our results show that there is significant variation among multimodal models: the recent Transformer-based FLAVA seems to be more sensitive to the choice of image and less affected by stereotypes than older models such as VisualBERT and LXMERT, which rely on CNN-extracted visual features. This effect is more discernible in this type of controlled setting than in traditional evaluations, where we do not know whether the model relied on the stereotype or the visual signal.
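To make the comparison concrete, below is a minimal sketch of the congruent-versus-incongruent probe for the color benchmark. It uses CLIP's image-text matching as a stand-in scorer (the paper evaluates FLAVA, VisualBERT, and LXMERT, not CLIP), and the image file names and the candidate color set are illustrative assumptions, not the paper's data.

# Hedged sketch of the congruent-vs-incongruent comparison. CLIP stands in
# for the models evaluated in the paper; file paths and candidate colors
# are placeholders for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_colors = ["yellow", "green", "blue", "red"]  # assumed candidate set
prompts = [f"a {c} banana" for c in candidate_colors]

def predicted_color(image_path: str) -> str:
    """Return the color whose caption the model matches best to the image."""
    image = Image.open(image_path)
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
    return candidate_colors[logits.argmax(dim=-1).item()]

# Congruent image matches the stereotype; incongruent contradicts it.
congruent = predicted_color("banana_yellow.jpg")   # placeholder path
incongruent = predicted_color("banana_blue.jpg")   # placeholder path

# If the model truly uses the visual signal, the two predictions differ;
# if it falls back on the stereotype, both come out "yellow".
print(f"congruent image -> {congruent}, incongruent image -> {incongruent}")

Aggregating how often the prediction follows the incongruent image rather than the stereotype across a benchmark would yield a per-model sensitivity score, in the spirit of the paper's comparison across FLAVA, VisualBERT, and LXMERT.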

Keywords

Gender stereotypes, Language model, Model use, Multi-modal, Model evaluation, Visual languages, Computational linguistics

Authors

Manuj Malik

IIIT Bangalore

Richard Johansson

Chalmers University of Technology, Computer Science and Engineering, Data Science

University of Gothenburg

BlackboxNLP 2022: Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Pages 263-271
9781959429050 (ISBN)

5th Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP 2022), hosted by the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu Dhabi, United Arab Emirates

Subject Categories (SSIF 2025)

Natural Language Processing

Software Engineering

Gender Studies

DOI

10.18653/v1/2022.blackboxnlp-1.21
