Explainable and Resource-Efficient Stream Processing Through Provenance and Scheduling
Doctoral thesis, 2022
This thesis tackles open challenges in making stream processing more user-friendly, customizable, and resource-efficient. The first part outlines our work, providing high-level background information, descriptions of the research problems, and our contributions. The second part presents our three state-of-the-art frameworks for explainable data streaming using data provenance, which help users of streaming queries identify important data points, explain unexpected behaviors, and understand and debug queries. (A) GeneaLog provides backward provenance, allowing users to identify the inputs that contributed to the generation of each output of a streaming query. (B) Ananke is the first framework to provide a duplicate-free graph of live forward provenance, enabling easy bidirectional tracing of input-output relationships in streaming queries and identifying data points that have finished contributing to results. (C) Erebus is the first framework that allows users to define expectations about the results of a streaming query; it validates whether these expectations are met and otherwise provides explanations in the form of why-not provenance. The third part presents techniques for execution efficiency through custom scheduling, introducing our state-of-the-art scheduling frameworks that control resource allocation and achieve user-defined performance goals. (D) Haren is a user-level scheduler that is agnostic to the underlying Stream Processing Engine (SPE) and can efficiently enforce user-defined scheduling policies. (E) Lachesis is a standalone scheduling middleware that requires no changes to SPEs but instead directly guides the scheduling decisions of the underlying operating system. Our extensive evaluations using real-world SPEs and workloads show that our work significantly improves over the state of the art while introducing only small performance overheads.
Data Streaming
Scheduling
Provenance
Stream Processing
Author
Dimitrios Palyvos-Giannas
Chalmers, Computer Science and Engineering (Chalmers), Networks and Systems (Chalmers)
GeneaLog: Fine-grained data streaming provenance in cyber-physical systems
Parallel Computing, Vol. 89 (2019)
Journal article
Ananke: A Streaming Framework for Live Forward Provenance
Proceedings of the VLDB Endowment, Vol. 14 (2020), p. 391-403
Journal article
Erebus: Explaining the Outputs of Data Streaming Queries
Proceedings of the VLDB Endowment, Vol. 16 (2022), p. 230-242
Paper in proceeding
Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications
DEBS 2019 - Proceedings of the 13th ACM International Conference on Distributed and Event-Based Systems (2019), p. 19-30
Paper in proceeding
Lachesis: A Middleware for Customizing OS Scheduling of Stream Processing Queries
Middleware 2021 - Proceedings of the 22nd International Middleware Conference (2021), p. 365-378
Paper in proceeding
The large amounts of data generated by smartphones, social networks, and sensors have enabled cutting-edge applications in areas such as image recognition and self-driving vehicles, affecting many aspects of our lives. Much of this data is generated continuously as streams. For instance, after you request a taxi through a smartphone application, the application continuously sends your location to the driver so the taxi can find you. It is desirable to process such data continuously, which is possible with software tools called Stream Processing Engines (SPEs). SPEs allow analysts to define streaming queries over streaming data, e.g., “what is the average speed of all taxis every hour?”, and get answers as streams.
This thesis proposes techniques that make streaming queries more explainable and efficient. To make queries more explainable, we introduce provenance techniques that allow analysts to get explanations about why some answers were or were not produced by a query (e.g., “the average speed was high because of taxi #5”), with only a small impact on performance. Targeting resource efficiency, we develop scheduling frameworks that allow analysts to control which parts of the queries should be prioritized (e.g., “prioritize billing queries”), making it possible to optimize goals such as throughput and response time. Our work, evaluated in real-world scenarios, is shown to significantly improve over the previous state of the art, opening new possibilities and research fronts.
HARE: Self-deploying and Adaptive Data Streaming Analytics in Fog Architectures
Swedish Research Council (VR) (2016-03800), 2017-01-01 -- 2020-12-31.
Subject Categories
Computer Engineering
Production Engineering, Human Work Science and Ergonomics
Software Engineering
Computer Science
Computer Systems
Areas of Advance
Information and Communication Technology
Energy
ISBN
978-91-7905-692-6
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5158
Publisher
Chalmers
Gustaf Dalénsalen, Chalmers Tvärgata 5, Chalmers (Campus Johanneberg)
Opponent: Prof. Peter Pietzuch, Department of Computing, Imperial College London, United Kingdom