FEDAMON: A fully automated framework for communication-efficient continuous distributed monitoring with error guarantees
Journal article, 2026

Efficient monitoring of large distributed systems is critical for applications such as data center load balancing, fleet management, and smart grid energy optimization. This paper addresses the continuous distributed monitoring problem, in which a central coordinator tracks statistics from numerous distributed nodes in real time. We present FEDAMON, a novel Forecast-based, Error-bounded, and Data-Aware approach to continuous distributed Monitoring that significantly reduces communication costs while maintaining accuracy. Instead of transmitting all values to the coordinator, our event-based monitoring leverages lightweight forecasting models at the coordinator and the distributed nodes to predict the evolution of observations, communicating only when deviations exceed an error threshold. To adapt to dynamically changing data streams, we introduce a data-aware model selection strategy that optimizes the trade-off between communication and accuracy. Our solution is communication-efficient, fully automated, and equipped with dynamic error control, reducing system parametrization to a single error tolerance with three preset levels. On average across all streams in diverse datasets, FEDAMON reduces communication to 10% of the baseline with less than 2% error. Moreover, the standard parameter solution surpasses even the best calibrated single models across all error bounds, achieving up to 33% improvement in communication efficiency with identical error guarantees. Further accuracy gains of 25% are obtained by tuning the data-aware control factor without extra cost. In addition, our framework generalizes effectively to previously unseen datasets. Finally, our dynamic error control achieves performance comparable to fixed bounds. The results highlight the scalability and robustness of FEDAMON, enabling fully automatic, real-time monitoring with large communication savings and marginal error.
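The event-based idea in the abstract can be illustrated with a minimal sketch. This is not the FEDAMON implementation: it assumes a single node, one hypothetical linear forecaster shared by node and coordinator, and a fixed error threshold `epsilon`. It leaves out the data-aware model selection and the dynamic error control entirely. The function name `monitor` and all variable names are illustrative only.

```python
def monitor(stream, epsilon):
    """Simulate one node and a coordinator that share a linear forecaster.

    Assumption: both sides hold identical model state (anchor point and
    slope), so the coordinator can reproduce the node's prediction without
    any message. The node transmits only when the prediction deviates from
    the true observation by more than epsilon.

    Returns (estimates, messages_sent).
    """
    anchor_t, anchor_v, slope = 0, stream[0], 0.0  # shared model state
    estimates = [stream[0]]                         # first value is always sent
    sent = 1
    for t in range(1, len(stream)):
        predicted = anchor_v + slope * (t - anchor_t)
        if abs(stream[t] - predicted) > epsilon:
            # Deviation too large: the node transmits the observation and
            # both sides refit the line through the last two sent points.
            slope = (stream[t] - anchor_v) / (t - anchor_t)
            anchor_t, anchor_v = t, stream[t]
            estimates.append(stream[t])
            sent += 1
        else:
            # No message: the coordinator uses the forecast as its estimate.
            estimates.append(predicted)
    return estimates, sent


if __name__ == "__main__":
    readings = [10.0, 10.5, 11.0, 11.5, 12.0, 20.0, 20.2, 20.1]
    estimates, sent = monitor(readings, epsilon=1.0)
    print(f"sent {sent} of {len(readings)} values")
    print("max error:", max(abs(e - x) for e, x in zip(estimates, readings)))
```

By construction, every coordinator estimate in this sketch stays within `epsilon` of the true observation, because any larger deviation triggers a transmission. How much communication is saved depends on how well the forecaster matches the stream, which is the trade-off the abstract's model selection strategy targets.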

Distributed tracking

Continuous monitoring

Distributed data streams

Data-aware approaches

Network monitoring

Authors

Yixing Zhang

Chalmers, Computer Science and Engineering, Computer and Network Systems

University of Gothenburg

Romaric Duvignau

Chalmers, Computer Science and Engineering, Computer and Network Systems

University of Gothenburg

Information Systems

0306-4379 (ISSN)

Vol. 141 102751

READY: Rethinking Monitoring for Large Distributed Systems

Computer Science and Engineering, 2024-03-01 -- 2029-03-01.

Areas of Advance

Information and Communication Technology

Subject Categories (SSIF 2025)

Computer Sciences

Networked, Parallel and Distributed Computing

Infrastructure

C3SE (-2020, Chalmers Centre for Computational Science and Engineering)

DOI

10.1016/j.is.2026.102751

Related datasets

Source code [dataset]

URI: https://github.com/yixingzhang11/fedamon/tree/v1.0

DOI: https://doi.org/10.5281/zenodo.15310950

More information

Last updated

2026-06-11