Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems

Ranim Khojah; Staffan Larsson; Alexander Berman

doi:10.18653/v1/2022.sigdial-1.54

Paper in proceeding, 2022

A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of 5 NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at model level and rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
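To make the idea of rank-level calibration concrete, the following is a minimal sketch in Python. It assumes expected calibration error (ECE) with equal-width bins as the measure; this record does not state which metrics the paper actually uses, so the metric choice, the function names `expected_calibration_error` and `rank_level_ece`, and the N-best data layout are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence, Tuple

# Assumed layout: an N-best list is a sequence of (intent, confidence)
# pairs, ordered best first, with confidences in [0, 1].
NBest = Sequence[Tuple[str, float]]


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def rank_level_ece(
    nbest_lists: Sequence[NBest],
    gold_intents: Sequence[str],
    rank: int,
    n_bins: int = 10,
) -> float:
    """ECE restricted to the hypothesis at one rank (0 = top hypothesis)."""
    confidences: List[float] = []
    correct: List[bool] = []
    for nbest, gold in zip(nbest_lists, gold_intents):
        if rank < len(nbest):  # skip lists shorter than the requested rank
            intent, conf = nbest[rank]
            confidences.append(conf)
            correct.append(intent == gold)
    return expected_calibration_error(confidences, correct, n_bins)
```

Under these assumptions, a well-calibrated NLU would give a low value at every rank: hypotheses reported with confidence 0.1 at rank 2 should turn out to be correct about 10% of the time, not only the top hypothesis.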

Calibration

Natural Language Understanding

Dialogue Systems

Author

Ranim Khojah

University of Gothenburg

Staffan Larsson

Talkamatic AB

University of Gothenburg

Alexander Berman

University of Gothenburg

Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL 2022

582-594

23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Edinburgh, United Kingdom

Subject Categories (SSIF 2025)

Other Engineering and Technologies

Other Humanities

DOI

10.18653/v1/2022.sigdial-1.54

More information

Latest update

4/4/2025
