Clustering techniques and keyword extraction with large language models for knowledge discovery in building defects data
Journal article, 2025
Purpose
The construction industry is undergoing a digital transformation and now holds large volumes of digital building defects data collected during inspections. This study proposes an artificial intelligence-based method for analysing such data to provide insights and knowledge faster than traditional manual methods.

Design/methodology/approach
This research explores a data set containing over 34,000 defects from hospital projects carried out in Sweden from 2018 to 2021. The data mining uses keyword extraction based on two approaches: TF-IDF vectorisation with k-means clustering, and the Mistral 7B model with KeyLLM. The results are compared with a content analysis using the GPT-3.5 Turbo model. The analysis is performed at both the organisational and the project level.

Findings
The paper presents a combination of methods for analysing building defects data. The results show that the most common problems reported during the inspections concern missing fire sealing, jointing and subceilings. K-means clustering gives fast insights into the main defect categories of the data set but requires domain knowledge to interpret. Keyword extraction using an LLM requires longer computational time but gives a deeper understanding of the subcategories of defects. Finally, GPT-based content analysis complements these methods by providing project-specific insights and allowing user-specific requests.

Research limitations/implications
The study uses data collected digitally in Swedish hospital projects. However, the results and methodology can be applied to other project data, such as safety inspections and warranty data. The analysis focused solely on text data.

Originality/value
The method suggested in this paper uses clustering techniques and large language models (LLMs) to analyse building defect data. The value of the proposed method is a faster process for leveraging knowledge from large amounts of unstructured text data, such as building defect reports, safety and moisture inspections and warranty issues.
Defects
Inspections
Knowledge generation
LLM
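The TF-IDF and k-means step named in the abstract can be illustrated with a minimal sketch. The abstract does not state which implementation the authors used, so everything below is an assumption: the defect notes are invented for illustration, the tokenisation is a plain whitespace split, and the function names (tfidf_vectors, kmeans, top_terms) are hypothetical helpers written with the Python standard library only. A real analysis would more likely rely on an established library and a proper initialisation scheme.

```python
import math
from collections import Counter

# Hypothetical defect notes, invented for illustration (not from the study's data).
NOTES = [
    "fire sealing missing at pipe penetration",
    "missing fire sealing around cable tray",
    "fire sealing missing above door",
    "subceiling tile damaged in corridor",
    "subceiling tile missing in corridor",
    "damaged subceiling tile near entrance",
]


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of notes that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF, so terms present in every note keep a non-zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def kmeans(vectors, init_indices, max_iter=100):
    """Plain Lloyd's algorithm; initial centroids are the vectors at init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each note goes to its nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # converged: no note changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vec for vec, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Highest-weighted terms of a centroid, usable as rough cluster keywords."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:n]]


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(NOTES)
    # Fixed initial centroids (notes 0 and 3) keep this toy run deterministic.
    labels, centroids = kmeans(vectors, init_indices=(0, 3))
    for c, centroid in enumerate(centroids):
        print(c, top_terms(vocab, centroid))
```

Reading the top-weighted centroid terms as cluster keywords mirrors the point made in the findings: the clusters appear quickly, but deciding what each one means still takes domain knowledge.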