Distributed Model Training Task Migration for Hotspot Management in Intelligent Computing Center Interconnection with Tidal Characteristics
Journal article, 2025

An intelligent computing center (ICC) is a new type of data center built around intelligent computing power, such as graphics processing units (GPUs) and artificial intelligence acceleration cards. The emergence of large models with billions of parameters (e.g., ChatGPT) creates a significant demand for computing power, and a single ICC may struggle to provide the computing power required for large model training. ICC interconnection (ICCI) is therefore expected to become a typical and effective way to supply intensive computing power. Driven by human activity, traditional computing tasks (e.g., transaction processing and online entertainment) exhibit a tidal effect in computing demand, which causes the remaining computing resources to vary tidally as well. Moreover, distributed model training (DMT) tasks are likely to span both the peaks and the valleys of this tidal effect. As a result, DMT tasks can easily turn an ICC into a hotspot (i.e., the computing load in the ICC exceeds a desired threshold), which significantly degrades the reliability and performance of the ICC. This paper proposes DeepHM, a deep reinforcement learning-based hotspot management strategy that relies on task migration in ICCI networks. To also account for the bandwidth metrics of the ICCI network, we further propose a variant with dynamic wavelength allocation, DeepHM-DWA. Simulation results show that DeepHM and DeepHM-DWA reduce the hotspot compute unit time blocks by 19% and 18% with fewer migrated workers, while balancing the computing load among multiple ICCs. DeepHM and DeepHM-DWA also reduce the average completion time ratio of the DMT tasks by 2% and 5%, respectively.
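
The hotspot definition in the abstract (computing load in an ICC exceeding a desired threshold) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the cosine-shaped tidal load, the function names `background_load` and `hotspot_hours`, and all parameter values. It does not implement DeepHM or DeepHM-DWA; it only shows how a constant DMT load stacked on a tidal background load can push an ICC over a threshold at peak hours.

```python
import math


def background_load(hour, base=0.5, amplitude=0.3, peak_hour=14):
    """Hypothetical tidal load of traditional tasks, as a fraction of ICC capacity.

    A cosine with a 24-hour period is assumed purely for illustration.
    """
    return base + amplitude * math.cos(2 * math.pi * (hour - peak_hour) / 24)


def hotspot_hours(dmt_load, threshold=0.8, hours=range(24)):
    """Return the hours at which background load plus DMT load exceeds the threshold."""
    return [h for h in hours if background_load(h) + dmt_load > threshold]


if __name__ == "__main__":
    # A DMT task using 10% of capacity around the clock (assumed value).
    print(hotspot_hours(dmt_load=0.1))
```

Under these assumed values, the hotspot hours cluster around the daily peak of the background load, which is the situation in which migrating DMT workers to a less loaded ICC becomes attractive.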

hotspot management

tidal effect

distributed model training

task migration

Intelligent computing center interconnections

Authors

Yingbo Fan

Beijing University of Posts and Telecommunications (BUPT)

Yajie Li

Beijing University of Posts and Telecommunications (BUPT)

Carlos Natalino Da Silva

Chalmers, Electrical Engineering, Communication, Antennas and Optical Networks

Yahui Wang

Beijing University of Posts and Telecommunications (BUPT)

Jiaxing Guo

Beijing University of Posts and Telecommunications (BUPT)

Wanping Wu

Beijing University of Posts and Telecommunications (BUPT)

Rongrong Ruan

Beijing University of Posts and Telecommunications (BUPT)

Wei Wang

Beijing University of Posts and Telecommunications (BUPT)

Yongli Zhao

Beijing University of Posts and Telecommunications (BUPT)

Jie Zhang

Beijing University of Posts and Telecommunications (BUPT)

IEEE Transactions on Network and Service Management

19324537 (eISSN)

Vol. In Press

Subject categories (SSIF 2025)

Computer Sciences

DOI

10.1109/TNSM.2025.3590011

More information

Last updated

2025-09-23