An interpretable method for automated classification of spoken transcripts and written text
Journal article, 2024

We investigate the differences between spoken language (in the form of radio show transcripts) and written language (Wikipedia articles) in the context of text classification. We present a novel, interpretable method for text classification, involving a linear classifier using a large set of n-gram features, and apply it to a newly generated data set with sentences originating either from spoken transcripts or written text. Our classifier reaches an accuracy less than 0.02 below that of a commonly used classifier (DistilBERT) based on deep neural networks (DNNs). Moreover, our classifier has an integrated measure of confidence, for assessing the reliability of a given classification. An online tool is provided for demonstrating our classifier, particularly its interpretable nature, which is a crucial feature in classification tasks involving high-stakes decision-making. We also study the capability of DistilBERT to carry out fill-in-the-blank tasks in either spoken or written text, and find it to perform similarly in both cases. Our main conclusion is that, with careful improvements, the performance gap between classical methods and DNN-based methods may be reduced significantly, such that the choice of classification method comes down to the need (if any) for interpretability.
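To make the general idea of a linear classifier over n-gram features concrete, the following is a minimal sketch only, not the authors' implementation. It assumes a plain logistic-regression model trained by stochastic gradient ascent, whitespace tokenization, and a toy four-sentence data set. The function names (`ngrams`, `train`, `classify`), the hyperparameters, and the confidence definition (the larger of the two class probabilities) are all hypothetical choices made for illustration. The per-feature contributions returned by `classify` show why such a model is easy to inspect: the decision is a sum of one weight per n-gram.

```python
import math
from collections import Counter


def ngrams(sentence, n_max=2):
    """Return all word n-grams (n = 1..n_max) of a lowercased sentence."""
    tokens = sentence.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def train(examples, n_max=2, epochs=20, lr=0.5):
    """Fit one weight per n-gram with logistic-regression SGD.

    examples: list of (sentence, label), label 1 = spoken, 0 = written.
    All hyperparameters here are illustrative assumptions.
    """
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            counts = Counter(ngrams(sentence, n_max))
            score = sum(weights.get(f, 0.0) * c for f, c in counts.items())
            p = 1.0 / (1.0 + math.exp(-score))  # predicted P(spoken)
            for f, c in counts.items():
                weights[f] = weights.get(f, 0.0) + lr * (label - p) * c
    return weights


def classify(weights, sentence, n_max=2):
    """Return (label, confidence, per-feature contributions)."""
    counts = Counter(ngrams(sentence, n_max))
    contributions = {f: weights.get(f, 0.0) * c for f, c in counts.items()}
    score = sum(contributions.values())
    p_spoken = 1.0 / (1.0 + math.exp(-score))
    label = "spoken" if p_spoken >= 0.5 else "written"
    # Assumed confidence measure: probability of the predicted class.
    confidence = max(p_spoken, 1.0 - p_spoken)
    return label, confidence, contributions


# Toy data, invented for this sketch (not taken from the published data set).
examples = [
    ("well you know i think that is right", 1),
    ("yeah um i mean it was great", 1),
    ("the city is located in the northern region", 0),
    ("the species was first described in 1885", 0),
]
weights = train(examples)

label, confidence, contributions = classify(weights, "um you know i think so")
# The n-grams with the largest absolute contribution explain the decision.
top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
print(label, round(confidence, 3), top)
```

Sorting the contributions by absolute value gives a simple per-sentence explanation. Whether the published method computes its confidence or its explanations this way is not stated in the abstract.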

Natural language processing

Interpretable methods

Text classification

Author

Mattias Wahde

Chalmers, Mechanics and Maritime Sciences (M2), Vehicle Engineering and Autonomous Systems

Marco Luigi Della Vedova

Chalmers, Mechanics and Maritime Sciences (M2), Vehicle Engineering and Autonomous Systems

Marco Virgolin

Stichting Centrum voor Wiskunde & Informatica (CWI)

Chalmers, Mechanics and Maritime Sciences (M2), Vehicle Engineering and Autonomous Systems

Minerva Suvanto

Chalmers, Mechanics and Maritime Sciences (M2), Vehicle Engineering and Autonomous Systems

Evolutionary Intelligence

1864-5909 (ISSN) 1864-5917 (eISSN)

Vol. 17, Issue 1, p. 609-621

Subject Categories

Language Technology (Computational Linguistics)

Computer Science

DOI

10.1007/s12065-023-00851-1

Related datasets

Data set for (binary) text classification, involving spoken utterances and written text [dataset]

DOI: 10.5281/zenodo.7694422

More information

Latest update

3/7/2024