Comparative analysis of text mining and clustering techniques for assessing functional dependency between manual test cases
Journal article, 2025

Text mining techniques, particularly those leveraging machine learning for natural language processing, have gained significant attention for qualitative data analysis in software testing. However, their complexity and lack of transparency can pose challenges, especially in safety-critical domains where simpler, interpretable solutions are often preferred unless accuracy is heavily compromised. This study investigates the trade-offs between complexity, effort, accuracy, and utility in text mining and clustering techniques, focusing on their application for detecting functional dependencies among manual integration test cases in safety-critical systems. Using empirical data from an industrial testing project at ALSTOM Sweden, we evaluate various string distance methods, NCD compressors, and machine learning approaches. The results highlight the impact of preprocessing techniques, such as tokenization, and intrinsic factors, such as text length, on algorithm performance. Findings demonstrate how text mining and clustering can be optimized for safety-critical contexts, offering actionable insights for researchers and practitioners aiming to balance simplicity and effectiveness in their testing workflows.
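As a rough illustration of the compression-based family named in the abstract, the normalized compression distance (NCD) between two texts can be sketched as below. This is a minimal sketch, not the study's implementation: the choice of `zlib` as the compressor, the helper names `compressed_size` and `ncd`, and the three sample test-case descriptions are all assumptions made here for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical test-case descriptions, invented for this sketch.
tc_a = "Verify that the brake system engages when the emergency signal is received"
tc_b = "Verify that the brake system releases when the emergency signal is cleared"
tc_c = "Check that the passenger display shows the next station name"

if __name__ == "__main__":
    print("a vs b:", ncd(tc_a, tc_b))
    print("a vs c:", ncd(tc_a, tc_c))
```

A lower value suggests the two texts share more content. On short inputs the compressor's fixed overhead distorts the score, which is one way text length can affect results.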

Software testing

Natural language processing

Clustering

Artificial intelligence

Text mining

Authors

Sahar Tahvili

Ericsson AB

Mälardalens högskola

Leo Hatvani

Mälardalens högskola

Michael Felderer

Universität zu Köln

Deutsches Zentrum für Luft- und Raumfahrt (DLR)

Francisco Gomes

Chalmers, Computer Science and Engineering, Interaction Design and Software Engineering

Göteborgs universitet

Wasif Afzal

Mälardalens högskola

Robert Feldt

Chalmers, Computer Science and Engineering, Software Engineering

Göteborgs universitet

Software Quality Journal

0963-9314 (ISSN) 1573-1367 (eISSN)

Vol. 33, no. 2, art. 24

Subject categories (SSIF 2025)

Computer Sciences

Computer Systems

DOI

10.1007/s11219-025-09722-7

More information

Last updated

2025-05-27