Data-Centric AI for Software Performance Engineering - Predicting Workload Dependent and Independent Performance of Software Systems Using Machine Learning Based Approaches
Licentiate thesis, 2023

Context: Machine learning (ML) approaches are widely employed in various software engineering (SE) tasks. Performance is one of the most critical software quality requirements. Performance prediction means estimating the execution time of a software system before it is executed. Performance estimation relies on prediction models, for which ML is a common choice. Two settings are commonly considered for ML-based performance prediction: workload-dependent and workload-independent, depending on whether the specific usage of the system is fed as input to the ML estimator.
Problem: Developers usually study performance behaviour with respect to the workload manually. This process consumes time, effort, and computational resources, since the developer must run the same system under test (e.g. a benchmark) many times, each time with different workload values. In the workload-independent setting, predicting the scalar execution time from the structure of the source code is challenging, because execution time is a function of many factors, including the underlying architecture, the input parameters, and the application’s interactions with the operating system. Consequently, works that have attempted to predict absolute execution time for arbitrary applications from source code generally report poor accuracy.
Goal: The thesis presents a machine learning-based approach for predicting execution time from two angles: (a) workload-independent performance and (b) workload-dependent performance.
Solution Approaches and Research Methodologies: To achieve this goal and tackle the problems described above, we conducted a systematic empirical study to fill the gap concerning workload-dependent performance across five well-known projects that use JMH benchmarking (including RxJava, Log4J2, and the Eclipse Collections framework), covering 126 concrete benchmarks. We generated a dataset of approximately 1.4 million measurements.
To address the poor accuracy, we aim to increase the quality of the data, which in this context is the source code. To that end, we invest in Data-Centric AI. We conduct a systematic literature review and a systematic mapping study of the different approaches to source code representation and the level of information each representation can hold. Based on the findings, we conduct an experimental study to increase the quality of source code representation by establishing a rich hybrid code representation. We then combine this representation with a Graph Neural Network (GNN), an ML approach, to predict the scalar execution time of a functional test.
Results: Our results show that classical ML approaches can predict the performance value of the benchmarks from the workload configuration. Moreover, with our proposed method, developers can easily determine the impact of each workload on the performance measurement. In addition, by increasing data quality through data-centric AI, we achieve high accuracy in predicting absolute execution time based only on the structure of the source code.
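As a rough illustration of how a graph-based code representation can be turned into a scalar prediction, the following minimal sketch runs one round of message passing over a toy code graph and sum-pools the node states. It is not the model used in the thesis: the node names, the mix of syntax-tree and flow edges, and all weights are hypothetical and hand-picked, and a real GNN would learn its weights from data.

```python
# Toy graph: nodes stand for code elements, each with one made-up feature.
features = {"method": 1.0, "loop": 2.0, "call": 0.5}

# Directed edges (source, target). Assumption for illustration only:
# two syntax-tree edges plus one flow edge back into the loop.
edges = [
    ("method", "loop"),  # syntax-tree edge
    ("loop", "call"),    # syntax-tree edge
    ("call", "loop"),    # hypothetical flow edge
]


def relu(x):
    return max(0.0, x)


def message_passing_step(features, edges, w_self, w_neigh):
    """One round: each node mixes its own feature with the sum of incoming neighbour features."""
    incoming = {node: 0.0 for node in features}
    for src, dst in edges:
        incoming[dst] += features[src]
    return {
        node: relu(w_self * features[node] + w_neigh * incoming[node])
        for node in features
    }


def readout(features, w_out, bias):
    """Sum-pool all node states into a single graph-level scalar."""
    return w_out * sum(features.values()) + bias


# Untrained, hand-picked weights (hypothetical).
hidden = message_passing_step(features, edges, w_self=1.0, w_neigh=0.5)
predicted_time = readout(hidden, w_out=2.0, bias=1.0)
print(predicted_time)
```

The flow edge changes the state of the loop node, which is the sense in which a richer representation gives the model more information to work with.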
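The workload-dependent setting can be sketched with a very simple classical model. The example below fits an ordinary least-squares line that maps one workload parameter to execution time. The abstract does not name the models used, so linear regression is an assumption here, and the parameter values and timings are synthetic, not taken from the thesis dataset.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Synthetic measurements: a hypothetical workload parameter (e.g. a
# collection size) and made-up execution times in microseconds.
sizes = [10, 100, 1000, 10000]
times_us = [2.0 * s + 5.0 for s in sizes]

model = fit_line(sizes, times_us)
estimate = predict(model, 500)  # estimate for a workload value that was never run
print(model, estimate)
```

In this sketch the fitted slope plays the role of the impact of the workload parameter on the measurement: it is the estimated change in execution time per unit of workload.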

Representation Learning

Software Performance Prediction

Machine Learning

Source Code Representation

Graph Neural Network

Deep Learning

Data-Centric AI

Analysen Room, EDIT building, Johanneberg
Opponent: Görel Hedin, Lund University, Sweden


Hazem Samoaa

Chalmers, Computer Science and Engineering, Interaction Design and Software Engineering

TEP-GNN: Accurate Execution Time Prediction of Functional Tests Using Graph Neural Networks

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13709 LNCS (2022), p. 464-479

Paper in proceedings

A Pipeline for Measuring Brand Loyalty Through Social Media Mining

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12607 (2021), p. 489-504

Paper in proceedings

An Exploratory Study of the Impact of Parameterization on JMH Measurement Results in Open-Source Projects

ICPE 2021 - Proceedings of the ACM/SPEC International Conference on Performance Engineering (2021), p. 213-224

Paper in proceedings


Computer and Information Sciences




