Opportunities, Challenges and Solutions for Automatic Labeling of Data Using Machine Learning
Licentiate thesis, 2023
Objective: This research aims to identify industry challenges and mitigation strategies for Data Labeling.
Method:
The research took a multidisciplinary approach. We performed a systematic mapping study to identify the main approaches to labeling and their application domains. We then conducted a case study with two companies to understand the challenges and mitigation strategies in industry. The case study consisted of an internship at one of the companies and interviews with data scientists from both companies. We used thematic analysis to formulate challenges from the collected data, and we formulated a mitigation strategy for each challenge. In the rest of the research, we used simulations together with the Bayesian Bradley-Terry Model and Item Response Theory to study which labeling approaches achieve the best accuracy and to evaluate the suitability of the datasets used to evaluate those approaches.
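To illustrate the kind of ranking a Bradley-Terry model produces, the following is a minimal sketch rather than the thesis's implementation. It fits the plain maximum-likelihood Bradley-Terry model with the standard minorization-maximization update, not the Bayesian variant named above. The function name fit_bradley_terry and the win counts are hypothetical and chosen only for illustration. The sketch assumes every approach has at least one win and that all approaches are connected through comparisons.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons in which item i beat item j.
    Returns strengths that sum to 1; a larger value means a stronger item.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Total number of comparisons won by item i.
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Normalize so the strengths stay on a fixed scale.
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical data: how often labeling approach i was more accurate
# than approach j across a set of benchmark runs.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(ranking)
```

Under the model, the probability that approach i outperforms approach j is strengths[i] / (strengths[i] + strengths[j]), so the fitted values give both a ranking and an estimate of how decisive each pairwise difference is. A Bayesian treatment would place priors on the strengths and report posterior distributions instead of point estimates.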
Results:
In this thesis, we present four main findings. First, we present an overview of the most popular data labeling approaches used in different applications, together with an overview of the datasets used to evaluate them. Second, we define and categorize the industry challenges that data scientists face, and we formulate mitigation strategies for these challenges. Third, we present the automated labeling approaches that achieve the best accuracy and how much manual effort these algorithms need to reach that accuracy. Fourth, we present the most suitable benchmark datasets for evaluating automatic labeling approaches.
Outlook:
In future work, we want to examine safe and deep semi-supervised learning and how they are used in practice, as we have noticed that semi-supervised learning based on Deep Learning has become more prevalent in recent years.
Active Learning
Data Labeling
Software Engineering
Semi-Supervised Learning
Author
Teodor Fredriksson
Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering
Machine Learning Algorithms for Labeling: Where and How They are Used?
SysCon 2022 - 16th Annual IEEE International Systems Conference, Proceedings (2022)
Paper in proceeding
Assessing the Suitability of Semi-Supervised Learning Datasets using Item Response Theory
Proceedings - 2021 47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021 (2021), p. 326-333
Paper in proceeding
Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12562 LNCS (2020), p. 202-216
Paper in proceeding
An Empirical Evaluation of Graph-based Semi-Supervised Learning for Data Labeling
Subject Categories
Software Engineering
Publisher
Chalmers