From Trees to Graphs: Advancing Regression Analysis through Model-Centric AI, Data-Centric AI, and Active Learning

Peter Samoaa; Hazem Samoaa

Doctoral thesis, 2024

Context:
Trees and graphs are fundamental data structures that are extensively utilized for modelling relationships and facilitating efficient data organization and retrieval. In research, these structures underpin a wide variety of algorithms and theories, especially in fields like Artificial Intelligence (AI), where they are crucial for understanding and optimizing learning processes.
In real-world applications, trees and graphs have profound impacts. For instance, trees are at the heart of decision-making processes, from simple decision trees in machine learning to complex game trees in AI strategies for games like chess. Graphs, on the other hand, are vital in networking, whether in social networks, neural networks, or logistical networks, helping to map and optimize connections and flows.
The versatility of these structures in modelling complex systems makes them indispensable in both theoretical research and practical applications, impacting industries from technology to transportation and beyond. This dual significance not only underscores the theoretical importance of our study but also enhances its applicability in solving real-world problems.

Problem:
Regression on trees and graphs remains underexplored in the literature, even though many real-world problems on these structures are regression problems, such as predicting drug efficacy from molecular structures or predicting evolutionary outcomes from evolutionary trees.

Goal:
In this thesis, we aim to improve regression analysis on trees and graphs by proposing and applying AI models for these structures.

Solution Approaches:
To that end, we first analysed the behaviour of different Tree-Based Neural Networks (TBNNs). The analysis covers Graph Neural Networks (GNNs), Tree-Convolutional Neural Networks (TreeCNN), path-based attention models, and transformer-based models. Second, as a model-centric AI approach, we improved the transformer-based model by proposing a dual transformer based on cross-attention, which yields a better representation. Third, we shifted the focus from the model to the data. As a data-centric AI approach, we augment each tree with additional edges that represent more information. The augmented tree thereby becomes a graph, and the same GNN models used in the earlier analytical framework achieve better regression predictions because they work from a richer representation. Finally, also within data-centric AI, we improve the data by acquiring better labels through interactive learning. To do so, we defined a unified active learning framework for labelling graphs for regression tasks, which selects informative, representative, and diverse batches of samples for labelling. To enforce diversity within each batch, the selection methods rely on the Neural Tangent Kernel (NTK) used in a Gaussian process.
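The tree-to-graph augmentation step can be illustrated with a minimal sketch. The abstract does not state which edges are added, so the "sibling" edges below, along with the function name `augment_tree` and the toy tree, are assumptions chosen purely for illustration and are not the thesis's actual augmentation scheme.

```python
from collections import defaultdict


def augment_tree(parent_child_edges):
    """Return the tree edges plus hypothetical sibling edges.

    Assumption: linking consecutive children of the same parent is one
    illustrative way to add information; the abstract does not specify
    the edge types that were actually used.
    """
    children = defaultdict(list)
    for parent, child in parent_child_edges:
        children[parent].append(child)

    # Keep every original tree edge, tagged with its type.
    edges = [(p, c, "child") for p, c in parent_child_edges]

    # Add one edge between each child and its next sibling.
    for kids in children.values():
        for left, right in zip(kids, kids[1:]):
            edges.append((left, right, "sibling"))
    return edges


# Toy tree: node 0 has children 1, 2, 3; node 1 has children 4, 5.
tree = [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5)]
graph = augment_tree(tree)
```

The resulting typed edge list is no longer a tree, because the sibling edges create cycles, so it would be passed to a graph model rather than a tree model.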

Results:
The results show that TBNN models that are effective for classification tasks fail to generalise to regression tasks. In contrast, our proposed model outperforms all other TBNN models, as well as the GNN models, across different settings and experiments. Moreover, the same GNN models used in the tree setting achieve higher Pearson correlation scores when we augment the tree and convert it into a graph, which shows that adding more information improves the prediction. Our results also show that the active learning framework can provide efficient query strategies for labelling regression values at the level of entire graphs.
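For reference, the Pearson correlation score used as the evaluation metric above can be computed as in the following sketch. The function name `pearson` and the sample values are illustrative assumptions and are not results from the thesis.

```python
import math


def pearson(predictions, targets):
    """Pearson correlation between predicted and true regression values."""
    n = len(predictions)
    if n != len(targets) or n < 2:
        raise ValueError("need two equally sized sequences of length >= 2")

    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n

    # Covariance term and the two standard-deviation terms (unnormalised;
    # the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predictions))
    dev_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets))
    if dev_p == 0 or dev_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_p * dev_t)


# Made-up values: predictions that rise with the targets score close to 1.
score = pearson([1.0, 2.1, 2.9, 4.2], [1.0, 2.0, 3.0, 4.0])
```

A score near 1 means the predictions track the true values linearly; the metric does not penalise a constant offset or scale error, so it is often reported alongside an error measure.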

Neural Tangent Kernel (NTK)

Active Learning

Transformers

Model-Centric AI

Tree-Convolutional Neural Networks (TreeCNN)

Tree-Based Neural Networks (TBNNs)

Graph Neural Networks (GNNs)

Data-Centric AI

HB4, Johanneberg campus, https://maps.chalmers.se/#357c6694-3395-4255-86a4-73e8e99ed12e

Opponent: Professor Giovanna Guerrini, University of Genoa, Italy

Online disputation

Authors

Peter Samoaa

Chalmers, Computer Science and Engineering, Data Science and AI

Hazem Samoaa

Chalmers, Computer Science and Engineering, Data Science and AI


In the world of technology and science, trees and graphs are like maps that help us understand and organize complex relationships. Imagine how a family tree shows connections between relatives or how a map of a city's roads shows the best paths to take. Trees and graphs work similarly in the digital world, helping computers make decisions, plan routes, and even predict outcomes in areas like medicine and biology.

Despite their importance, predicting specific outcomes—like how effective a new drug might be—using these trees and graphs is still a challenge that hasn’t been fully solved. My research focuses on making these predictions more accurate by enhancing the way AI models work with trees and graphs.

To tackle this, I explored different types of AI models that learn from tree and graph structures. Some of these models, like Graph Neural Networks (GNNs) and Transformer models, are especially powerful because they can learn from the connections between data points. I then improved these models by creating a new dual-transformer model that better understands these complex relationships.

But I didn’t stop there. I also found a way to give the AI models even more information to work with by adding extra connections to the trees, turning them into richer graphs. This approach significantly improved the AI’s ability to make accurate predictions. Finally, I developed a method to help AI systems learn more efficiently by selecting the most important pieces of information to focus on during training.

The results of my research show that the new models I developed are much better at making predictions compared to existing methods. By giving AI systems more detailed maps (in the form of enhanced graphs) and helping them learn more effectively, we can make better predictions in fields ranging from drug development to understanding complex biological processes.

Subject categories (SSIF 2011)

Computer and Information Sciences

ISBN

978-91-8103-101-0

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5559

Publisher

Chalmers
