Comparative analysis of text mining and clustering techniques for assessing functional dependency between manual test cases
Journal article, 2025

Text mining techniques, particularly those leveraging machine learning for natural language processing, have gained significant attention for qualitative data analysis in software testing. However, their complexity and lack of transparency can pose challenges, especially in safety-critical domains where simpler, interpretable solutions are often preferred unless accuracy is heavily compromised. This study investigates the trade-offs between complexity, effort, accuracy, and utility in text mining and clustering techniques, focusing on their application for detecting functional dependencies among manual integration test cases in safety-critical systems. Using empirical data from an industrial testing project at ALSTOM Sweden, we evaluate various string distance methods, NCD compressors, and machine learning approaches. The results highlight the impact of preprocessing techniques, such as tokenization, and intrinsic factors, such as text length, on algorithm performance. Findings demonstrate how text mining and clustering can be optimized for safety-critical contexts, offering actionable insights for researchers and practitioners aiming to balance simplicity and effectiveness in their testing workflows.
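To illustrate one of the technique families named in the abstract, the following is a minimal sketch of normalized compression distance (NCD) between two test case descriptions. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as compressor, the lowercase and whitespace normalization step, and the example test case texts are all assumptions made for illustration; the abstract does not state which compressors or preprocessing steps the study used.

```python
import zlib


def normalize(text: str) -> bytes:
    # Assumed preprocessing: lowercase and collapse whitespace.
    # The study's actual tokenization is not described in the abstract.
    return " ".join(text.lower().split()).encode("utf-8")


def compressed_size(data: bytes) -> int:
    # zlib is an illustrative compressor choice, not necessarily one from the study.
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    # Standard NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    bx, by = normalize(x), normalize(y)
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical manual test case descriptions, invented for this sketch.
tc_brake = ("Verify that the brake system engages when the emergency "
            "signal is received by the train control unit.")
tc_door = ("Check the passenger door indicator lamp turns green after "
           "the platform sensor confirms alignment.")

same = ncd(tc_brake, tc_brake)
different = ncd(tc_brake, tc_door)
print(f"identical texts: {same:.3f}, unrelated texts: {different:.3f}")
```

Lower values suggest more shared content. On short texts the compressor's fixed overhead distorts the score, which is consistent with the abstract's remark that text length affects algorithm performance.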

Software testing

Natural language processing

Clustering

Artificial intelligence

Text mining

Authors

Sahar Tahvili

Ericsson

Mälardalens högskola

Leo Hatvani

Mälardalens högskola

Michael Felderer

University of Cologne

German Aerospace Center (DLR)

Francisco Gomes

Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering

University of Gothenburg

Wasif Afzal

Mälardalens högskola

Robert Feldt

Chalmers, Computer Science and Engineering (Chalmers), Software Engineering (Chalmers)

University of Gothenburg

Software Quality Journal

0963-9314 (ISSN) 1573-1367 (eISSN)

Vol. 33, Issue 2, Article 24

Subject Categories (SSIF 2025)

Computer Sciences

Computer Systems

DOI

10.1007/s11219-025-09722-7

More information

Latest update

5/27/2025