The Impact of Class Noise-handling on the Effectiveness of Machine Learning-based Methods for Build Outcome and Code Change Request Predictions
Journal article, 2026
Machine learning-based methods are increasingly used to optimize build processes and accelerate the integration of software code. These methods leverage large volumes of historical code changes to train models to predict and prevent issues in the codebase that could delay code integration and feature delivery to end users. The objective of this study is to examine how handling class noise (i.e., incorrect class labels) in code changes collected from Continuous Integration (CI) systems affects the performance of machine learning models that predict the execution outcome of CI builds and negative code reviews. We conduct a series of computational experiments using data from 110 open-source Java projects, examining the effectiveness of two removal-based statistical techniques, Majority Filter (MF) and Consensus Filter (CF), and two corrective techniques, Domain Knowledge-based (DB) and CleanLab. Our results show that removal-based techniques significantly improve predictive performance in both the build outcome and the negative code review prediction tasks. For build outcome prediction, applying MF increased the F1-score from 82% to 97% and MCC from 0.13 to 0.58. For negative code review prediction, MF improved the F1-score from 17% to 53% and MCC from −0.03 to 0.57. The DB technique was effective primarily for code review comments and less so for build outcome prediction. While CleanLab yielded more consistent predictions, its overall impact on model performance was more modest than that of the removal-based techniques. Additionally, our findings show that hyperparameter tuning, applied independently or in combination with CleanLab, can further improve model performance; however, these gains did not surpass those achieved by removal-based techniques alone.
We conclude that applying removal-based techniques to code-change training data is necessary to improve the prediction of build outcomes and negative code review comments.
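To make the removal-based idea concrete, the following is a minimal sketch of a Majority Filter and a Consensus Filter in their commonly described form: several classifiers are trained with cross-validation, each held-out instance collects one vote per classifier that disagrees with its label, and an instance is dropped when more than half (majority) or all (consensus) of the classifiers disagree. The learners (`knn_learner`, `centroid_learner`), the fold assignment, the function name `noise_filter`, and the toy data are assumptions made for illustration; they are not the classifiers, features, or settings used in the study.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, ...]
Predictor = Callable[[Row], int]
Learner = Callable[[List[Row], List[int]], Predictor]


def _dist2(a: Row, b: Row) -> float:
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_learner(k: int) -> Learner:
    """Hypothetical base learner: k-nearest-neighbour voting."""
    def fit(X: List[Row], y: List[int]) -> Predictor:
        def predict(row: Row) -> int:
            nearest = sorted(range(len(X)), key=lambda i: _dist2(X[i], row))[:k]
            return Counter(y[i] for i in nearest).most_common(1)[0][0]
        return predict
    return fit


def centroid_learner(X: List[Row], y: List[int]) -> Predictor:
    """Hypothetical base learner: assign the label of the nearest class centroid."""
    centroids = {}
    for label in sorted(set(y)):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = tuple(sum(col) / len(members) for col in zip(*members))

    def predict(row: Row) -> int:
        return min(centroids, key=lambda lab: _dist2(centroids[lab], row))
    return predict


def noise_filter(X: Sequence[Row], y: Sequence[int], learners: Sequence[Learner],
                 n_folds: int = 5, scheme: str = "majority") -> List[int]:
    """Return the indices of instances kept after filtering suspected label noise."""
    votes = [0] * len(X)  # number of learners that disagree with each label
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for fit in learners:
            predict = fit(X_train, y_train)
            for i in test_idx:
                if predict(X[i]) != y[i]:
                    votes[i] += 1
    if scheme == "consensus":
        threshold = len(learners)            # every learner must disagree
    else:
        threshold = len(learners) // 2 + 1   # more than half must disagree
    return [i for i in range(len(X)) if votes[i] < threshold]


# Toy data: two well-separated clusters; index 4 carries a flipped label.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
learners = [knn_learner(1), knn_learner(3), centroid_learner]

kept_mf = noise_filter(X, y, learners, scheme="majority")
kept_cf = noise_filter(X, y, learners, scheme="consensus")
print("majority filter keeps:", kept_mf)
print("consensus filter keeps:", kept_cf)
```

Because the consensus rule requires every classifier to disagree with a label, it never removes more instances than the majority rule on the same votes. The filtered indices would then select the training rows passed to the downstream prediction model; evaluation data is left unfiltered in this sketch.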