Data management and Data Pipelines: An empirical investigation in the embedded systems domain
Licentiate thesis, 2021

Context: Companies are increasingly collecting data from all possible sources to extract insights that support data-driven decision-making. Increased data volume, variety, and velocity, together with the impact of poor-quality data on the development of data products, are leading companies to look for an improved data management approach that can accelerate the development of high-quality data products. Further, AI is being applied in a growing number of fields, and thus it is evolving into a horizontal technology. Consequently, AI components are increasingly being integrated into embedded systems along with electronics and software. We refer to these systems as AI-enhanced embedded systems. Given the strong dependence of AI on data, this expansion also creates a new space for applying data management techniques.
Objective: The overall goal of this thesis is to empirically identify the data management challenges encountered during the development and maintenance of AI-enhanced embedded systems, to propose an improved data management approach, and to empirically validate the proposed approach.
Method: To achieve the goal, we conducted this research in close collaboration with Software Center companies using a combination of different empirical research methods: case studies, literature reviews, and action research.
Results and conclusions: This research provides five main results. First, it identifies key data management challenges specific to Deep Learning models developed at embedded system companies. Second, it examines practices, such as DataOps and data pipelines, that help address data management challenges. We observed that DataOps is the best data management practice for improving data quality and reducing the time to develop data products. The data pipeline is the critical component of DataOps that manages the data life cycle activities. The study also identifies the potential faults at each step of the data pipeline and the corresponding mitigation strategies. Finally, the data pipeline model is realized in a small section of a data pipeline, and the percentage of data dumps saved through the implementation is calculated.
Future work: As future work, we plan to realize the conceptual data pipeline model so that companies can build customized, robust data pipelines. We also plan to analyze the impact and value of data pipelines in cross-domain AI systems and data applications, and to develop an AI-based fault detection and mitigation system suitable for data pipelines.

artificial intelligence

data pipelines

software engineering

empirical investigation

embedded systems

data management

machine learning

CSE Jupiter 473, Jupiter building, Hörselgången 5, floor 4
Opponent: Daniela Soares Cruzes, NTNU, Norway

Author

Aiswarya Raj Munappy

Chalmers, Computer Science and Engineering, Software Engineering, Software Engineering for Testing, Requirements, Innovation and Psychology

Towards automated detection of data pipeline faults

Proceedings - Asia-Pacific Software Engineering Conference, APSEC, Vol. 2020-December (2020), p. 346-355

Paper in proceedings

Modelling Data Pipelines

2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), (2020), p. 13-20

Paper in proceedings

From Ad-Hoc Data Analytics to DataOps

Proceedings of the International Conference on Software and System Processes, (2020), p. 165-174

Paper in proceedings

Data Management Challenges for Deep Learning

Proceedings - 45th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2019, (2019), p. 140-147

Paper in proceedings

Data Pipeline Management in Practice: Challenges and Opportunities

Lecture Notes in Computer Science, Vol. 12562 (2020), p. 168-184

Paper in proceedings

Software Engineering for AI/ML/DL

Chalmers AI Research Centre (CHAIR), 2019-11-01 -- 2022-11-01.

HoliDev - Holistic DevOps Framework

VINNOVA, 2018-01-01 -- 2019-12-31.

Subject categories

Other Computer and Information Science

Software Engineering

Computer Science

Areas of Advance

Information and Communication Technology

Publisher

Chalmers University of Technology

Online

More information

Last updated

2021-04-21