The ontological politics of synthetic data: Normalities, outliers, and intersectional hallucinations
Artikel i vetenskaplig tidskrift, 2025
Synthetic data is increasingly used as a substitute for real data due to ethical, legal, and logistical reasons. However, the rise of synthetic data also raises critical questions about its entanglement with the politics of classification and the reproduction of social norms and categories. This paper aims to problematize the use of synthetic data by examining how its production is intertwined with the maintenance of certain worldviews and classifications. We argue that synthetic data, like real data, is embedded with societal biases and power structures, leading to the reproduction of existing social inequalities. Through empirical examples, we demonstrate how synthetic data tends to highlight majority elements as the “normal” and minimize minority elements, and that the slight changes to the data structures that create synthetic data will also inevitably result in what we term “intersectional hallucinations.” These hallucinations are inherent to synthetic data and cannot be entirely eliminated without compromising the purpose of creating synthetic datasets. We contend that decisions about synthetic data involve determining which intersections are essential and which can be disregarded, a practice which will imbue these decisions with norms and values. Our study underscores the need for critical engagement with the mathematical and statistical choices in synthetic data production and advocates for careful consideration of the ontological and political implications of these choices during curatorial style production of synthetic structured data.
intersectionality
data ethics
classification
data bias
Synthetic structured data
ontological politics