Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies
Paper in proceedings, 2020

Labeling is a cornerstone of supervised machine learning. However, in industrial applications, data is often unlabeled, which complicates its use for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these do not yet provide accurate and reliable labels for every industrial machine learning use case. In this context, industry still relies heavily on manually annotating and labeling its data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using semi-structured interviews with data scientists at two companies to explore the problems they face when labeling and annotating their data. This paper provides two contributions: we identify industry challenges in the labeling process, and we propose mitigation strategies for these challenges.

Machine learning

Data labeling

Case study

Authors

Teodor Fredriksson

Chalmers, Computer Science and Engineering (Chalmers), Software Engineering (Chalmers)

David Issa Mattos

Chalmers, Computer Science and Engineering (Chalmers), Software Engineering (Chalmers)

Jan Bosch

Chalmers, Computer Science and Engineering (Chalmers), Software Engineering (Chalmers)

Helena Holmström Olsson

Malmö University

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

0302-9743 (ISSN) 1611-3349 (eISSN)

Vol. 12562 LNCS, pp. 202-216
978-3-030-64147-4 (ISBN)

Product-Focused Software Process Improvement
Turin, Italy

Subject Categories

Other Computer and Information Science

Learning

Information Science

DOI

10.1007/978-3-030-64148-1_13

Latest update

3/10/2021