Understanding and Evaluating Chatbot Interactions in Software Engineering
Licentiate Thesis, 2025
Chatbots have long been used in software engineering. Initially, they were based on simple commands. Artificial Intelligence then introduced components such as Natural Language Understanding (NLU), making chatbot architectures somewhat more complex and enabling them to automate simple tasks, such as closing issues on GitHub, and to retrieve information and documentation. The emergence of Large Language Models (LLMs), however, unlocked many new possibilities: chatbots gained extensive knowledge and context awareness while being able to perform complex tasks and make decisions during the software development process. Consequently, chatbots could assist in requirements elicitation, code generation, and even the analysis of software monitoring logs. This led software engineers to explore further possibilities, in particular automating complex tasks with LLM chatbots while interacting with them as if they were traditional chatbots. However, this created new challenges that need to be addressed, such as hallucinated requirements or vulnerable generated code. As a result, human factors such as trust began to erode. In this thesis, I argue that to use chatbots for the right use cases, we need to understand the interactions with them, including their usage and conversational flow. In addition, the evaluation of chatbots (both NLU- and LLM-based) should go beyond their performance and focus on the value they bring to software engineers through their interactions. Using empirical methods in four observational and experimental studies, I present an analysis of the characteristics of interactions with NLU and LLM chatbots, in comparison with interactions with human developers. NLU chatbots are used as tools, where reliability is an evaluation criterion that complements performance. Interactions with LLM chatbots, however, are more complex and are affected by many factors, which I capture in a personal experience framework.
In addition, I show how different dimensions of productivity are affected depending on whether the chatbot is used to provide guidance, manipulate artifacts, or learn new concepts. Moreover, since prompt programming is commonly used to enhance the outcome of the interactions, I show how certain prompting techniques improve code generation, although their overall impact remains limited. This thesis therefore guides chatbot designers in enhancing chatbots’ communication abilities to improve the user’s personal experience. It also urges practitioners to adapt their use of chatbots, focusing on collaborating with them rather than treating them as automation tools. Finally, it encourages researchers to investigate effective ways to implement collaboration with chatbots at different stages of the software development lifecycle.
Software Engineering
Human-AI Collaboration
Chatbots
Natural Language Understanding
Large Language Models