Data management and Data Pipelines: An empirical investigation in the embedded systems domain
Licentiate thesis, 2021

Context: Companies are increasingly collecting data from all possible sources to extract insights that support data-driven decision-making. Increasing data volume, variety, and velocity, together with the impact of poor-quality data on the development of data products, are leading companies to look for an improved data management approach that can accelerate the development of high-quality data products. Further, AI is being applied in a growing number of fields and is thus evolving into a horizontal technology. Consequently, AI components are increasingly being integrated into embedded systems along with electronics and software. We refer to these systems as AI-enhanced embedded systems. Given the strong dependence of AI on data, this expansion also creates a new space for applying data management techniques.
Objective: The overall goal of this thesis is to empirically identify the data management challenges encountered during the development and maintenance of AI-enhanced embedded systems, propose an improved data management approach, and empirically validate the proposed approach.
Method: To achieve the goal, we conducted this research in close collaboration with Software Center companies using a combination of different empirical research methods: case studies, literature reviews, and action research.
Results and conclusions: This research provides five main results. First, it identifies key data management challenges specific to Deep Learning models developed at embedded system companies. Second, it examines practices such as DataOps and data pipelines that help to address data management challenges. We observed that DataOps is the best data management practice for improving data quality and reducing the time to develop data products. The data pipeline is the critical component of DataOps that manages the data life cycle activities. The study also identifies the potential faults at each step of the data pipeline and the corresponding mitigation strategies. Finally, the data pipeline model is realized in a small segment of a data pipeline, and the percentage of data dumps saved through the implementation is calculated.
Future work: As future work, we plan to realize the conceptual data pipeline model so that companies can build customized, robust data pipelines. We also plan to analyze the impact and value of data pipelines in cross-domain AI systems and data applications, and to develop an AI-based fault detection and mitigation system suitable for data pipelines.

data management

empirical investigation

artificial intelligence

data pipelines

embedded systems

software engineering

machine learning

CSE Jupiter 473, Jupiter building, Hörselgången 5, floor 4
Opponent: Daniela Soares Cruzes, NTNU, Norway

Author

Aiswarya Raj Munappy

Chalmers, Computer Science and Engineering (Chalmers), Software Engineering (Chalmers)

Towards automated detection of data pipeline faults

Proceedings - Asia-Pacific Software Engineering Conference, APSEC; Vol. 2020-December (2020), p. 346-355

Paper in proceeding

Modelling Data Pipelines

2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA); (2020), p. 13-20

Paper in proceeding

From Ad-Hoc Data Analytics to DataOps

ICSSP '20: Proceedings of the International Conference on Software and System Processes; (2020), p. 165-174

Paper in proceeding

Data Management Challenges for Deep Learning

Proceedings - 45th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2019; (2019), p. 140-147

Paper in proceeding

Data Pipeline Management in Practice: Challenges and Opportunities

Lecture Notes in Computer Science; Vol. 12562 (2020), p. 168-184

Paper in proceeding

Software Engineering for AI/ML/DL

Chalmers AI Research Centre (CHAIR), 2019-11-01 -- 2022-11-01.

HoliDev - Holistic DevOps Framework

VINNOVA (2017-05218), 2018-01-01 -- 2019-12-31.

Subject Categories

Other Computer and Information Science

Software Engineering

Computer Science

Areas of Advance

Information and Communication Technology

Publisher

Chalmers

Online

More information

Latest update

12/10/2021