Adaptiveness, Asynchrony, and Resource Efficiency in Parallel Stochastic Gradient Descent
Doctoral thesis, 2023

Accelerated digitalization and sensor deployment in society in recent years pose critical challenges for the associated data processing and analysis infrastructure to scale, and the field of Big Data, targeting methods for storing, processing, and revealing patterns in huge data sets, has surged. Artificial Intelligence (AI) models are used extensively in standard Big Data pipelines due to their tremendous success across various data analysis tasks. However, the exponential growth in the Volume, Variety, and Velocity of Big Data (known as its three V's) in recent years requires corresponding complexity in the AI models that analyze it, as well as in the Machine Learning (ML) processes required to train them. To cope, parallelism in ML is standard nowadays, aiming to better utilize contemporary computing infrastructure, whether it be shared-memory multi-core CPUs or vast connected networks of IoT devices engaging in Federated Learning (FL).

Stochastic Gradient Descent (SGD) serves as the backbone of many of the most popular ML methods, including in particular Deep Learning. However, SGD has inherently sequential semantics and is not trivially parallelizable without imposing strict synchronization, with its associated bottlenecks. Asynchronous SGD (AsyncSGD), which relaxes the original semantics, has gained significant interest in recent years due to promising results showing speedup in certain contexts. However, the relaxed semantics that asynchrony entails raise fundamental questions about AsyncSGD, particularly regarding its stability and convergence rate in practical applications.
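To make the contrast with sequential SGD concrete, the following is a minimal, hypothetical Python sketch of the generic AsyncSGD pattern described above, not the implementation accompanying this thesis: worker threads read a shared parameter vector, compute stochastic gradients on a toy least-squares problem, and write updates back without synchronization, so some updates are based on stale parameters. All names (X, y, theta, eta) and the toy problem are illustrative assumptions.

```python
# Illustrative AsyncSGD sketch (toy least-squares problem, assumed names):
# several worker threads share one parameter vector and apply SGD updates
# without any locking, so reads may be stale and writes may race.
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                             # toy design matrix
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)   # noisy targets
theta = np.zeros(10)                    # shared parameters (no lock)
eta, batch, steps = 0.01, 32, 500       # step size, batch size, steps per worker

def grad(w, idx):
    """Stochastic gradient of 0.5*||X w - y||^2 on a mini-batch."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def worker(seed):
    global theta
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        snapshot = theta.copy()                           # possibly stale read
        idx = local_rng.integers(0, len(X), size=batch)
        theta = theta - eta * grad(snapshot, idx)         # unsynchronized write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final loss:", 0.5 * np.mean((X @ theta - y) ** 2))
```

Under the relaxed semantics, an update computed from `snapshot` may be applied after other workers have already moved `theta`; this staleness is exactly what the stability and convergence questions above concern.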

This thesis explores vital knowledge gaps of AsyncSGD and contributes in particular to: Theoretical frameworks – Formalization of several key notions related to the impact of asynchrony on convergence, guiding future development of AsyncSGD implementations; Analytical results – Asymptotic convergence bounds under realistic assumptions. Moreover, several technical solutions are proposed, targeting in particular: Stability – Reducing the number of non-converging executions and the associated wasted energy; Speedup – Improving convergence time and reliability with instance-based adaptiveness; Elasticity – Resource-efficiency by avoiding over-parallelism, thereby improving stability and saving computing resources. The proposed methods are evaluated on several standard DL benchmarking applications and compared to relevant baselines from previous literature. Key results include: (i) persistent speedup compared to baselines, (ii) increased stability and reduced risk of non-converging executions, and (iii) reductions in the overall memory footprint (up to 17%) as well as in the consumed computing resources (up to 67%).
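As a hedged illustration of the instance-based adaptiveness mentioned above (a sketch of the general idea only, not necessarily the thesis' specific scheme), the step size of each individual update can be scaled by a function of that update's measured staleness, for example the common 1/(1+τ) damping found in the asynchronous-SGD literature:

```python
def staleness_adaptive_step(base_eta: float, tau: int) -> float:
    """Per-update step size that shrinks with the measured staleness tau,
    i.e. how many model versions elapsed between reading the parameters
    and applying the resulting gradient. Illustrative rule, not the
    thesis' own scheme."""
    return base_eta / (1.0 + tau)

# A fresh gradient keeps the full step size; a stale one is damped:
print(staleness_adaptive_step(0.1, 0))   # 0.1
print(staleness_adaptive_step(0.1, 10))  # ~0.0091
```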

In addition, an open-source implementation is published along with this thesis, connecting high-level ML operations with asynchronous implementations based on fine-grained memory operations, thereby enabling future research on the efficient adaptation of AsyncSGD for practical applications.


Author

Karl Bäckström

Network and Systems

MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent

Proceedings - 2019 IEEE International Conference on Big Data (Big Data 2019), 2019, p. 16-25

Paper in proceeding

Consistent lock-free parallel stochastic gradient descent for fast and stable convergence

Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium (IPDPS 2021), 2021, p. 423-432

Paper in proceeding

ASAP.SGD: Instance-based Adaptiveness to Staleness in Asynchronous SGD

Proceedings of Machine Learning Research, Vol. 162 (2022), p. 1261-1271

Paper in proceeding

Bäckström, K., Papatriantafilou, M., Tsigas, P. Less is more: Elastic Parallelism Control for Asynchronous SGD

Our world grows increasingly digital, and we are using more sensors in everyday life. Think of cars that park themselves, medical devices that monitor our health, and smartphones that recognize our faces. All these gadgets generate massive amounts of data that need to be processed and analyzed efficiently, which is where Big Data and artificial intelligence (AI) come into play.

However, as the data grows in volume, variety, and speed, the AI models that make sense of it all become more complex. To handle this, parallelism is now standard in AI, using multiple computing cores to perform tasks more efficiently. Stochastic Gradient Descent (SGD) is at the heart of many AI applications, including deep learning. However, it has limitations: it is not easily parallelized without causing bottlenecks. To overcome this, researchers have been exploring Asynchronous SGD (AsyncSGD), which offers a more flexible approach.

This thesis investigates the challenges and potential of AsyncSGD, focusing on real-world applications. It proposes several technical solutions to enhance the stability, speed, and efficiency of AsyncSGD. The results show that the proposed solutions not only improve the speed of AsyncSGD but also increase its stability and reduce the risk of failed executions. Moreover, the solutions reduce the computing resources consumed (by up to 67%), making AsyncSGD a more sustainable option for handling the ever-growing data generated by our increasingly connected world.

WASP SAS: Structuring data for continuous processing and ML systems

Wallenberg AI, Autonomous Systems and Software Program, 2018-01-01 -- 2023-01-01.

Driving Forces

Sustainable development

Innovation and entrepreneurship

Subject Categories

Computer and Information Science

Roots

Basic sciences

ISBN

978-91-7905-855-5

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5321

Publisher

Chalmers

EE

Online

Opponent: Assaf Schuster, Technion - Israel Institute of Technology, Israel

More information

Latest update

5/5/2023