On the experiences of adopting automated data validation in an industrial machine learning project

Lucy Lwakatare; Ellinor Range; Ivica Crnkovic; Jan Bosch

doi:10.1109/ICSE-SEIP52600.2021.00034

Paper in proceeding, 2021

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and avoid training ML models on bad data, research and industrial practice suggest incorporating a data validation process and tool into the ML system development process. Aim: The study investigates the adoption of a data validation process and tool in industrial ML projects. The data validation process demands significant engineering resources for tool development and maintenance. Thus, it is important to identify the best practices for their adoption, especially by development teams that are in the early phases of deploying ML-enabled software systems. Method: Action research was conducted at a large software-intensive organization in telecommunications, specifically within the analytics R&D organization, for an ML use case of classifying faults from returned hardware telecommunication devices. Results: Based on the evaluation results and learning from our action research, we identified three best practices, three benefits, and two barriers to adopting the data validation process and tool in ML projects. We also propose a data validation framework (DVF) for systematizing the adoption of a data validation process. Conclusions: The results show that adopting a data validation process and tool in ML projects is an effective approach to testing ML-enabled software systems. It requires having an overview of the levels of data (feature, dataset, cross-dataset, data stream) at which certain data quality tests can be applied.
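The four data levels named in the conclusions can be illustrated with a small sketch. This is not the paper's DVF or its tooling, which the abstract does not describe. It is a hypothetical example, assuming rows are plain dictionaries, and the field names (`temp`, `fault`), function names, and thresholds are invented purely for illustration.

```python
from statistics import mean

# Hypothetical sketch only: one simple check per data level.
# Rows are assumed to be dicts; all names and thresholds are illustrative.

def check_feature(rows, name, lo, hi):
    """Feature level: every value of one column is present and within [lo, hi]."""
    return all(r.get(name) is not None and lo <= r[name] <= hi for r in rows)

def check_dataset(rows, required):
    """Dataset level: the batch is non-empty and each row has all required columns."""
    return bool(rows) and all(set(required).issubset(r) for r in rows)

def check_cross_dataset(train, serving, name, tolerance):
    """Cross-dataset level: the mean of one feature differs by at most `tolerance`."""
    gap = abs(mean(r[name] for r in train) - mean(r[name] for r in serving))
    return gap <= tolerance

def check_stream(batch_sizes, min_rows):
    """Data-stream level: no arriving batch is smaller than `min_rows`."""
    return all(n >= min_rows for n in batch_sizes)

train = [{"temp": 20.0, "fault": "A"}, {"temp": 30.0, "fault": "B"}]
serving = [{"temp": 24.0, "fault": "A"}, {"temp": 28.0, "fault": "A"}]
```

With these sample rows, `check_cross_dataset(train, serving, "temp", 2.0)` compares the two means (25.0 and 26.0) against the tolerance. Each function returns a plain boolean so that checks at different levels can be combined or reported separately.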

Machine learning

Data validation

Data errors

Software engineering

Data quality

Authors

Lucy Lwakatare

Chalmers, Computer Science and Engineering, Software Engineering

Ellinor Range

Ericsson AB

Ivica Crnkovic

Chalmers, Computer Science and Engineering, Software Engineering

Jan Bosch

Testing, Requirements, Innovation and Psychology

Proceedings - International Conference on Software Engineering

02705257 (ISSN)

248-257
978-0-7381-4669-0 (ISBN)

43rd IEEE/ACM International Conference on Software Engineering - Software Engineering in Practice (ICSE-SEIP) / 43rd ACM/IEEE International Conference on Software Engineering - New Ideas and Emerging Results (ICSE-NIER)

Subject categories (SSIF 2011)

Other Computer and Information Science

Software Engineering

Computer Systems

DOI

10.1109/ICSE-SEIP52600.2021.00034

More information

Last updated

2023-03-21
