An empirical evaluation of algorithms for data labeling

Teodor Fredriksson; David Issa Mattos; Jan Bosch; Helena Holmström Olsson

doi:10.1109/COMPSAC51774.2021.00038

Paper in proceedings, 2021

The lack of labeled data is a major problem in both research and industrial settings, since obtaining labels is often expensive and time-consuming. In recent years, several machine learning algorithms have been developed to assist with and automate labeling in partially labeled datasets. While many of these algorithms are available in open-source packages, little research investigates how they compare to each other across different types of datasets and different percentages of available labels. To address this gap, this paper empirically evaluates and compares seven algorithms for automated labeling in terms of their accuracy. We investigate how these algorithms perform on twelve different, well-known datasets covering three types of data: images, texts, and numerical values. We evaluate the algorithms under two experimental conditions, with 10% and 50% of the labels available in the dataset. Each algorithm, on each dataset and under each experimental condition, is evaluated independently ten times with different random seeds. The results are analyzed and the algorithms are compared using a Bayesian Bradley-Terry model. The results indicate that the active learning algorithms using the query strategies uncertainty sampling, query-by-committee (QBC), and random sampling are always the best algorithms. However, this comes at the expense of increased manual labeling effort. These results help machine learning practitioners choose optimal machine learning algorithms to label their data.
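
To make the experimental setup in the abstract more concrete, the following is a minimal sketch of two of its ingredients: hiding labels so that only a given fraction remains available, and an uncertainty-sampling query that picks the instances a classifier is least sure about. It is not the authors' code. The function names `mask_labels` and `least_confident`, the toy labels, and the class probabilities are all hypothetical and chosen only for illustration.

```python
import random


def mask_labels(labels, labeled_fraction, seed):
    """Keep a random fraction of the labels and hide the rest as None.

    Hypothetical helper: mimics a condition such as "10% of labels available".
    """
    rng = random.Random(seed)
    n_keep = round(len(labels) * labeled_fraction)
    keep = set(rng.sample(range(len(labels)), n_keep))
    return [y if i in keep else None for i, y in enumerate(labels)]


def least_confident(probabilities, candidates, n_queries=1):
    """Least-confidence uncertainty sampling.

    Returns the candidate indices whose highest class probability is lowest,
    i.e. the instances the classifier is least certain about.
    """
    ranked = sorted(candidates, key=lambda i: max(probabilities[i]))
    return ranked[:n_queries]


# Toy data (assumed, not from the paper): ten instances, two classes.
labels = ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"]
partial = mask_labels(labels, labeled_fraction=0.5, seed=0)
unlabeled = [i for i, y in enumerate(partial) if y is None]

# Hypothetical class probabilities, as some classifier might predict them.
probs = [
    [0.90, 0.10], [0.55, 0.45], [0.50, 0.50], [0.80, 0.20], [0.70, 0.30],
    [0.95, 0.05], [0.60, 0.40], [0.85, 0.15], [0.52, 0.48], [0.99, 0.01],
]

# Instances that would be sent to a human annotator next.
to_query = least_confident(probs, unlabeled, n_queries=2)
```

Each queried instance requires a manual label, which is the increased labeling effort the abstract refers to. Repeating such a run with different `seed` values would correspond to the repeated evaluations with different random seeds.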

Active learning

Semi-supervised learning

Data labeling

Automatic labeling

Authors

Teodor Fredriksson

Testing, Requirements, Innovation and Psychology

David Issa Mattos

Chalmers, Computer Science and Engineering, Software Engineering

Jan Bosch

Testing, Requirements, Innovation and Psychology

Helena Holmström Olsson

Malmö universitet

Proceedings - 2021 IEEE 45th Annual Computers, Software, and Applications Conference, COMPSAC 2021

201-209
9781665424639 (ISBN)

45th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2021
Virtual, Online, Spain

Subject categories (SSIF 2011)

Probability Theory and Statistics

Signal Processing

Computer Science

More information

Last updated

2021-10-07
