On Semi-Supervised Learning: Evaluation, Challenges and Mitigation Strategies
Doctoral thesis, 2025

Context: 
Supervised learning requires labeled data, but many real-world datasets contain few or no labeled instances. Companies may therefore need to allocate resources to obtain labels. However, labeling is not always trivial, and it often requires people with domain knowledge. Acquiring suitable personnel may be expensive and time-consuming if new staff need to be hired and trained for labeling.

Objective:
The objective of this thesis is to investigate current challenges in data labeling and the strategies used to mitigate them. Once the challenges and the weaknesses of current mitigation strategies have been identified, the goal is to identify solutions and improve those strategies.

Method:
This thesis employs multiple methods. The first study is a systematic mapping study that presents the AI-based algorithms most commonly used for problems related to data labeling. In addition, the most common applications of these algorithms and the datasets used to evaluate them are presented. The second study reports on data collected during an industrial case study and interviews with practitioners from two companies. Based on the data, three challenges related to data labeling were formulated, together with a mitigation strategy for each challenge. Statistical methods play an important role in the remaining studies and are used to analyze algorithms. In two studies, the Bayesian Bradley-Terry model is used to rank graph-based and deep semi-supervised learning algorithms, respectively. In both studies, Bayesian generalized linear mixed models are used to analyze the probability that an algorithm reaches a certain performance with and without added noise. In two other studies, Bayesian item response theory is used to assess how suitable the datasets are for evaluating graph-based and deep semi-supervised learning algorithms. Lastly, Bayesian linear regression is used to analyze the performance of a deep semi-supervised learning algorithm and its relative improvement over supervised learning on a real-world dataset provided by Saab.
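To make the ranking step more concrete, the following is a minimal sketch of a Bradley-Terry fit. It is an illustration under stated assumptions, not the procedure used in the thesis: it computes a plain maximum-likelihood estimate with the classic minorization-maximization update instead of a Bayesian posterior, and the algorithm names, comparison counts, and function names are all made up for the example.

```python
# Hypothetical sketch: maximum-likelihood Bradley-Terry fit (NOT the Bayesian
# variant used in the thesis). Each item i gets a positive strength s_i, and
# the model assumes P(i beats j) = s_i / (s_i + s_j).

def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """wins[i][j] is the number of comparisons in which item i beat item j.

    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Sum over opponents of (number of comparisons) / (s_i + s_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        updated = [s / norm for s in updated]
        change = max(abs(a - b) for a, b in zip(strengths, updated))
        strengths = updated
        if change < tol:
            break
    return strengths


def win_probability(strengths, i, j):
    """Model probability that item i outperforms item j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Made-up example: how often each algorithm beat each other algorithm
# across 10 benchmark runs per pair.
names = ["A", "B", "C"]
wins = [
    [0, 7, 9],  # A beat B 7 times and C 9 times
    [3, 0, 6],  # B beat A 3 times and C 6 times
    [1, 4, 0],  # C beat A once and B 4 times
]
strengths = fit_bradley_terry(wins)
ranking = sorted(names, key=lambda name: strengths[names.index(name)], reverse=True)
print(ranking)
print(round(win_probability(strengths, 0, 2), 3))
```

A Bayesian treatment would place priors on the (log) strengths and report posterior distributions over the ranking; the point estimate above only conveys how pairwise outcomes translate into a ranking.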

Results:
First, the most common AI-based algorithms for data labeling are presented, along with their application domains and the datasets used to evaluate them. Second, challenges and mitigation strategies are presented, as well as currently available algorithms. Third, the optimal graph-based and deep semi-supervised learning algorithms are presented based on their performance on each data type. In addition, manual effort is analyzed to demonstrate how many labeled instances are required to obtain a certain accuracy. Fourth, optimal datasets for evaluating graph-based and deep semi-supervised learning algorithms are presented. Finally, evidence is presented that deep semi-supervised learning may outperform supervised learning on real-world data collected from industry.

Conclusions:
Many AI-based algorithms may help mitigate problems regarding data labeling. Active learning allows practitioners to reduce manual labeling and improve the performance of supervised learning by choosing the most informative instances to be labeled. Graph-based algorithms are transductive learning algorithms that automatically label data by learning from already labeled data. Deep semi-supervised learning algorithms are inductive algorithms that utilize unlabeled data to improve the performance of supervised learning by adding an unsupervised loss term to the loss function. Empirical evidence indicates that active learning outperforms passive learning, where the instances to be labeled are chosen at random. Theoretical studies demonstrate that machine learning algorithms utilizing unlabeled data may improve performance over supervised learning. On the other hand, there are studies indicating that unlabeled data may degrade performance. These conflicting observations may explain why global companies have yet to incorporate semi-supervised learning and why there is a lack of research in which semi-supervised learning is applied to real-world data. Deep semi-supervised learning has increased in popularity due to its many advantages, such as robustness. The recently developed deep semi-supervised learning algorithms outperform supervised learning. Graph-based semi-supervised learning has the ability to label data with an accuracy above 90%. In addition to performing well on benchmark datasets, both types of algorithms have proven to perform well when noise is present in the dataset, indicating that they are expected to perform well on real-world datasets. Noise may even increase the accuracy. On the other hand, the datasets utilized when evaluating algorithms may be inappropriate in the sense that they may be too easy for the algorithms to learn. This may cause a false sense of security, as the algorithms may perform worse on real-world datasets that are more difficult to learn.
Finally, it is demonstrated that deep semi-supervised learning algorithms based on pseudo-labeling and data augmentation have the ability to outperform supervised learning on real-world data from industry.
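
The pseudo-labeling idea mentioned above can be sketched in a few lines. This is a toy illustration only, under loud assumptions: a nearest-centroid classifier stands in for the deep network, no data augmentation is applied, and the data points, the 0.95 confidence threshold, and all function names are hypothetical choices made for the example.

```python
import math

# Hypothetical sketch of pseudo-labeling (self-training). A nearest-centroid
# classifier replaces the neural network that a deep method would use.


def class_centroids(points, labels):
    """Mean 2-D point per class, computed from the currently labeled data."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict_proba(point, centroids):
    """Softmax over negative squared distances to each class centroid."""
    scores = {
        c: -((point[0] - cx) ** 2 + (point[1] - cy) ** 2)
        for c, (cx, cy) in centroids.items()
    }
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def self_train(labeled, labels, unlabeled, threshold=0.95, max_rounds=10):
    """Repeatedly move confidently predicted unlabeled points into the labeled set."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        centroids = class_centroids(labeled, labels)
        remaining, added = [], False
        for point in pool:
            proba = predict_proba(point, centroids)
            best_class, confidence = max(proba.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                # Accept the model's own prediction as a pseudo-label.
                labeled.append(point)
                labels.append(best_class)
                added = True
            else:
                remaining.append(point)
        pool = remaining
        if not added:
            break
    return labeled, labels, pool


# Made-up data: one labeled point per class and five unlabeled points.
seed_points = [(0.0, 0.0), (4.0, 4.0)]
seed_labels = [0, 1]
unlabeled_points = [(0.5, 0.2), (0.1, 0.8), (3.6, 4.2), (4.3, 3.7), (2.0, 2.0)]

final_points, final_labels, still_unlabeled = self_train(
    seed_points, seed_labels, unlabeled_points
)
print(len(final_labels), still_unlabeled)
```

The threshold is the main design choice in this sketch: a high value keeps ambiguous points, such as the one midway between the two classes, out of the training set, which limits the risk of reinforcing wrong labels at the cost of using less of the unlabeled data.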

software engineering

data labeling

machine learning

active learning

semi-supervised learning

Chalmers Lindholmen
Opponent: Prof. Dr. Georg Herzwurm, University of Stuttgart, Institute of Business Administration (BWI)

Author

Teodor Fredriksson

Chalmers, Computer Science and Engineering, Interaction Design and Software Engineering

Machine Learning Algorithms for Labeling: Where and How They are Used?

SysCon 2022 - 16th Annual IEEE International Systems Conference, Proceedings (2022)

Paper in proceedings

Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12562 LNCS (2020), p. 202-216

Paper in proceedings

An Empirical Evaluation of Graph-based Semi-Supervised Learning Algorithms for Automatic Data Labeling

An empirical evaluation of deep semi-supervised learning

International Journal of Data Science and Analytics, in press (2025)

Journal article

Assessing the Suitability of Semi-Supervised Learning Datasets using Item Response Theory

Proceedings - 2021 47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021 (2021), p. 326-333

Paper in proceedings

Assessing the Suitability of Deep Semi-Supervised Learning Datasets using Item Response Theory

Classification of Complex-Valued Radar Data using Semi-Supervised Learning: a Case Study

Proceedings - 2023 49th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2023 (2023), p. 102-107

Paper in proceedings

Subject categories (SSIF 2025)

Software Engineering

Computer Sciences

Algorithms

Artificial Intelligence

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie

Publisher

Chalmers

Chalmers Lindholmen

Online

More information

Last updated

2025-05-23