An empirical evaluation of pre-trained large language models for repairing declarative formal specifications

Mohannad Alhanahnah; Md Rashedul Hasan; Lisong Xu; Hamid Bagheri

doi:10.1007/s10664-025-10687-1

Journal article, 2025

Automatic Program Repair (APR) has garnered significant attention as a practical research domain focused on automatically fixing bugs in programs. While existing APR techniques primarily target imperative programming languages like C and Java, there is a growing need for effective solutions applicable to declarative software specification languages. This paper systematically investigates the capacity of Large Language Models (LLMs) to repair declarative specifications in Alloy, a declarative formal language used for software specification. We designed six different repair settings, encompassing single-agent and dual-agent paradigms and utilizing various LLMs. These configurations also incorporate different levels of feedback, including an auto-prompting mechanism in which LLMs generate prompts autonomously. Our study reveals that the dual-agent setup with auto-prompting outperforms the other settings, albeit with a marginal increase in the number of iterations and token usage. This dual-agent setup demonstrated superior effectiveness compared to state-of-the-art Alloy APR techniques when evaluated on a comprehensive set of benchmarks. This work is the first to empirically evaluate the capability of LLMs to repair declarative specifications while taking into account recent LLM concepts such as LLM-based agents, feedback, auto-prompting, and tools, thus paving the way for future agent-based techniques in software engineering.

Declarative specification

Formal methods

Automatic program repair

Alloy language

LLMs

Author

Mohannad Alhanahnah

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering

Md Rashedul Hasan

University of Nebraska - Lincoln

Lisong Xu

University of Nebraska - Lincoln

Hamid Bagheri

University of Nebraska - Lincoln

Empirical Software Engineering

1382-3256 (ISSN) 1573-7616 (eISSN)

Vol. 30, Issue 5, Article 149

Subject Categories (SSIF 2025)

Software Engineering

Computer Sciences

DOI

10.1007/s10664-025-10687-1

Related datasets

AlloySpecRepair [dataset]

URI: https://github.com/Mohannadcse/AlloySpecRepair

More information

Latest update

8/5/2025
