Opportunities, Challenges and Solutions for Automatic Labeling of Data Using Machine Learning
Licentiate thesis, 2023

Context: Supervised learning is the most common machine learning paradigm and requires labeled data. Because much industrial data is unlabeled, data labeling is an essential step in data preparation. Labeling takes time and often requires domain knowledge, and many companies lack the resources to have in-house personnel with domain-specific knowledge perform it. It is therefore relevant for companies to explore cheaper, more accurate, and automated approaches to labeling.

Objective: This research aims to identify industry challenges and mitigation strategies for Data Labeling.

Method:
The research took a multidisciplinary approach. We performed a systematic mapping study to identify the main approaches to labeling and their application domains. We performed a case study with two companies to understand the challenges and mitigation strategies in industry. The case study consisted of an internship at one of the companies and interviews with data scientists from both. We then used thematic analysis to formulate challenges from the collected data, and formulated a mitigation strategy for each challenge. The rest of the research consists of simulations together with the Bayesian Bradley-Terry model and Item Response Theory, used to study which labeling approaches are most accurate and to evaluate the suitability of the datasets used to evaluate those approaches.

Results:
In this thesis, we present four main findings. First, we present an overview of the most popular data labeling approaches used in different applications, together with an overview of the datasets used to evaluate them. Second, we define and categorize the industry challenges that data scientists face, and formulate mitigation strategies for these challenges. Third, we identify the most accurate automated labeling approaches and how much manual effort they need to reach that accuracy. Fourth, we identify the benchmark datasets best suited to evaluating automatic labeling approaches.

Outlook:
In future work, we want to examine safe and deep semi-supervised learning and how they are used in practice, as we have noticed that semi-supervised learning based on Deep Learning has become more prevalent in recent years.

Active Learning

Data Labeling

Software Engineering

Semi-Supervised Learning

CSE Jupiter 473
Opponent: Tommi Mikkonen

Author

Teodor Fredriksson

Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering

Machine Learning Algorithms for Labeling: Where and How They are Used?

SysCon 2022 - 16th Annual IEEE International Systems Conference, Proceedings (2022)

Paper in proceeding

Assessing the Suitability of Semi-Supervised Learning Datasets using Item Response Theory

Proceedings - 2021 47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021 (2021), p. 326-333

Paper in proceeding

Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12562 LNCS (2020), p. 202-216

Paper in proceeding

An Empirical Evaluation of Graph-based Semi-Supervised Learning for Data Labeling

Subject Categories (SSIF 2011)

Software Engineering

Publisher

Chalmers

Online

More information

Latest update

8/21/2023