Assessing the Suitability of Semi-Supervised Learning Datasets using Item Response Theory

Teodor Fredriksson; David Issa Mattos; Jan Bosch; Helena Holmström Olsson

doi:10.1109/SEAA53835.2021.00049

Paper in proceedings, 2021

In practice, supervised learning algorithms require fully labeled datasets to achieve the high accuracy demanded by modern applications. In industrial settings, however, supervised learning algorithms can perform poorly because few labeled instances are available. Semi-supervised learning (SSL) is an automatic labeling approach that uses the existing labels to infer the missing labels in partially labeled datasets. The large number of available SSL algorithms and the lack of systematic comparison between them leave practitioners without guidelines for selecting the appropriate one for their application. Moreover, each SSL algorithm is often validated and evaluated on a small number of common datasets, yet no research examines which datasets are suitable for comparing different SSL algorithms. The purpose of this paper is to empirically evaluate the suitability of the datasets commonly used to evaluate and compare different SSL algorithms. We performed a simulation study using twelve datasets of three different data types (numerical, text, image) on thirteen different SSL algorithms. The contributions of this paper are two-fold. First, we propose the use of a Bayesian congeneric item response theory model to assess the suitability of commonly used datasets. Second, we compare the different SSL algorithms using these datasets. The results show that, with the exception of three datasets, the datasets have very low discrimination factors and are easily solved by the current algorithms. Additionally, the SSL algorithms have overlapping 90% credible intervals, indicating uncertainty in the difference between the accuracy of these SSL models. The paper concludes by suggesting that researchers and practitioners should more carefully consider the choice of datasets used for comparing SSL algorithms.

Congeneric model

Semi-supervised learning

Item Response Theory

Data Labeling

Authors

Teodor Fredriksson

Testing, Requirements, Innovation and Psychology

David Issa Mattos

Chalmers, Computer Science and Engineering, Software Engineering

Jan Bosch

Testing, Requirements, Innovation and Psychology

Helena Holmström Olsson

Malmö University

Proceedings - 2021 47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021

326-333
9781665427050 (ISBN)

47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021
Palermo, Italy

Subject categories (SSIF 2011)

Control Engineering

Signal Processing

Computer Sciences

DOI

10.1109/SEAA53835.2021.00049

More information

Last updated

2021-11-26
