Explainable and Resource-Efficient Stream Processing Through Provenance and Scheduling

Dimitrios Palyvos-Giannas

Doctoral thesis, 2022

In our era of big data, information is captured at unprecedented volumes and velocities, with technologies such as Cyber-Physical Systems making quick decisions based on the processing of streaming, unbounded datasets. In such scenarios, it can be beneficial to process the data in an online manner, using the stream processing paradigm implemented by Stream Processing Engines (SPEs). While SPEs enable high-throughput, low-latency analysis, they face challenges connected to evolving deployment scenarios, such as the increasing use of heterogeneous, resource-constrained edge devices alongside cloud resources, and growing user expectations for usability, control, and resource efficiency on par with the features provided by traditional databases.

This thesis tackles open challenges regarding making stream processing more user-friendly, customizable, and resource-efficient. The first part outlines our work, providing high-level background information, descriptions of the research problems, and our contributions. The second part presents our three state-of-the-art frameworks for explainable data streaming using data provenance, which can help users of streaming queries to identify important data points, explain unexpected behaviors, and aid query understanding and debugging. (A) GeneaLog provides backward provenance allowing users to identify the inputs that contributed to the generation of each output of a streaming query. (B) Ananke is the first framework to provide a duplicate-free graph of live forward provenance, enabling easy bidirectional tracing of input-output relationships in streaming queries and identifying data points that have finished contributing to results. (C) Erebus is the first framework that allows users to define expectations about the results of a streaming query, validating whether these expectations are met or providing explanations in the form of why-not provenance otherwise. The third part presents techniques for execution efficiency through custom scheduling, introducing our state-of-the-art scheduling frameworks that control resource allocation and achieve user-defined performance goals. (D) Haren is an SPE-agnostic user-level scheduler that can efficiently enforce user-defined scheduling policies. (E) Lachesis is a standalone scheduling middleware that requires no changes to SPEs but, instead, directly guides the scheduling decisions of the underlying Operating System. Our extensive evaluations using real-world SPEs and workloads show that our work significantly improves over the state-of-the-art while introducing only small performance overheads.

Data Streaming

Scheduling

Provenance

Stream Processing

Gustaf Dalénsalen, Chalmers Tvärgata 5, Chalmers (Campus Johanneberg)

Opponent: Prof. Peter Pietzuch, Department of Computing, Imperial College London, United Kingdom

Online defence

Author

Dimitrios Palyvos-Giannas

Chalmers, Computer Science and Engineering (Chalmers), Networks and Systems (Chalmers)

GeneaLog: Fine-grained data streaming provenance in cyber-physical systems

Parallel Computing, Vol. 89 (2019)

Journal article

Ananke: A Streaming Framework for Live Forward Provenance

Proceedings of the VLDB Endowment, Vol. 14 (2020), p. 391-403

Journal article

Erebus: Explaining the Outputs of Data Streaming Queries

Proceedings of the VLDB Endowment, Vol. 16 (2022), p. 230-242

Journal article

Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications

DEBS 2019 - Proceedings of the 13th ACM International Conference on Distributed and Event-Based Systems (2019), p. 19-30

Paper in proceeding

Lachesis: A Middleware for Customizing OS Scheduling of Stream Processing Queries

Middleware 2021 - Proceedings of the 22nd International Middleware Conference (2021), p. 365-378

Paper in proceeding

Explainable and Efficient Processing of Streaming Data

The large amounts of data generated by smartphones, social networks, and sensors have resulted in cutting-edge applications in areas such as image recognition and self-driving vehicles, affecting many aspects of our lives. A lot of this data is generated continuously as streams. For instance, after you request a taxi through a smartphone application, it continuously sends your location to the driver so the taxi can find you. It is desirable to process such data continuously, something possible through software tools called Stream Processing Engines (SPEs). SPEs allow analysts to define streaming queries over streaming data, e.g., “what is the average speed of all taxis every hour?” and get answers as streams.

This thesis proposes techniques that make streaming queries more explainable and efficient. To make queries more explainable, we introduce provenance techniques that allow analysts to get explanations of why some answers were or were not produced by a query (e.g., “the average speed was high because of taxi #5”), with only a small impact on performance. Targeting resource efficiency, we develop scheduling frameworks that allow analysts to control which parts of the queries should be prioritized (e.g., “prioritize billing queries”), allowing the optimization of goals such as throughput and response time. Our work, evaluated in real-world scenarios, is shown to significantly improve over the previous state-of-the-art, opening new possibilities and research fronts.
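To make the taxi example concrete, the following is a minimal sketch of an hourly average-speed query that also remembers which inputs produced each answer. It is an illustration only and not how GeneaLog or any SPE implements provenance: the `Reading` and `HourlyAverage` types, the field names, and the sample data are all hypothetical, and the stream is assumed to arrive in timestamp order.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Reading(NamedTuple):
    """One hypothetical taxi report."""
    timestamp: int   # seconds since the start of the stream (assumed unit)
    taxi_id: int
    speed: float     # km/h


class HourlyAverage(NamedTuple):
    """One query answer plus the inputs that contributed to it."""
    window_start: int
    average_speed: float
    contributors: Tuple[Reading, ...]  # naive backward provenance


WINDOW = 3600  # one hour, in seconds


def _close(start: int, buffer: List[Reading]) -> HourlyAverage:
    """Turn the buffered readings of one window into a single answer."""
    average = sum(r.speed for r in buffer) / len(buffer)
    return HourlyAverage(start, average, tuple(buffer))


def hourly_average(stream: Iterable[Reading]) -> Iterator[HourlyAverage]:
    """Emit one average per non-overlapping one-hour window.

    Assumes readings arrive in non-decreasing timestamp order.
    """
    current_start: Optional[int] = None
    buffer: List[Reading] = []
    for reading in stream:
        start = reading.timestamp - reading.timestamp % WINDOW
        if current_start is not None and start != current_start:
            # A reading from a later hour closes the current window.
            yield _close(current_start, buffer)
            buffer = []
        current_start = start
        buffer.append(reading)
    if buffer:
        # Only reached for finite inputs; a real stream never ends.
        yield _close(current_start, buffer)


# Made-up sample data for illustration.
readings = [
    Reading(0, 1, 40.0),
    Reading(1800, 5, 120.0),
    Reading(3600, 2, 50.0),
    Reading(5400, 3, 30.0),
]
results = list(hourly_average(readings))

# "Why was the first average high?": inspect the stored contributors.
fastest = max(results[0].contributors, key=lambda r: r.speed)
```

Storing every contributing reading alongside each answer, as this sketch does, is the simplest possible approach and grows with the window size; it is shown only to illustrate what a provenance answer looks like.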

HARE: Self-deploying and Adaptive Data Streaming Analytics in Fog Architectures

Swedish Research Council (VR) (2016-03800), 2017-01-01 -- 2020-12-31.

Subject Categories (SSIF 2011)

Computer Engineering

Production Engineering, Human Work Science and Ergonomics

Software Engineering

Computer Science

Computer Systems

Areas of Advance

Information and Communication Technology

Energy

ISBN

978-91-7905-692-6

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5158

Publisher

Chalmers

More information

Latest update

11/8/2023
