Literate Programming with LLMs? - A Study on Rosetta Code and CodeNet
Journal article, 2025
Literate programming, a concept introduced by Knuth in 1984, emphasizes combining human-readable documentation with machine-readable code, on the premise that writing literate programs is a prerequisite for software quality. Our objective in this paper is to evaluate whether generative AI models, Large Language Models (LLMs) such as GPT-4, LLaMA, or Falcon, are capable of literate programming, given their extensive use in software engineering. To truly achieve literate programming, LLMs must generate natural language descriptions and corresponding code with aligned semantics based on user prompts. In addition, their internal representation of programs should allow us to recognize both programming languages and their descriptions. To evaluate these capabilities, we conducted a study using the Rosetta Code and CodeNet repositories. We performed four computational experiments on the Rosetta Code repository, encompassing 1,228 tasks across 926 programming languages, and validated our findings on the larger CodeNet dataset, which includes 55 tasks and 52 languages. Our findings show that LLMs in the trillion-parameter class are capable of literate programming, while models in the million- and billion-parameter classes are better at recognizing programming languages than tasks. Based on these results, we conclude that modern LLMs exhibit a deep ability to encode programming languages and the semantics of programming tasks, bringing us closer to realizing the full potential of literate programming.
Code-related Tasks
Literate Programming
Computational Experiment
Large Language Model (LLM)