A Large-Scale Study of ML-Related Python Projects
Paper in proceedings, 2024

The rise of machine learning (ML) for solving current and future problems has increased the production of ML-enabled software systems. Unfortunately, standardized tool chains for developing, employing, and maintaining such projects are not yet mature, which can mainly be attributed to a lack of understanding of the properties of ML-enabled software. For instance, it is still unclear how to manage and evolve ML-specific assets together with other software-engineering assets. In particular, ML-specific tools and processes, such as those for managing ML experiments, are often perceived as incompatible with practitioners' software-engineering tools and processes. To design new tools for developing ML-enabled software, it is crucial to understand the properties and current problems of developing these projects by eliciting empirical data from real projects, including the evolution of the different assets involved. Moreover, while studies in this direction have recently been conducted, identifying certain types of ML-enabled projects (e.g., experiments, libraries, and software systems) remains a challenge for researchers. We present a large-scale study of over 31,066 ML projects found on GitHub, with an emphasis on their development stages and evolution. Our contributions include a dataset, together with empirical data providing an overview of the existing project types and an analysis of the projects' properties and characteristics, especially regarding the implementation of different ML development stages and their evolution. We believe that our results support researchers, practitioners, and tool builders in conducting follow-up studies and especially in building novel tools for managing ML projects, ideally unified with traditional software-engineering tools.

tensorflow

machine learning

large-scale study

open-source projects

ml-enabled systems

evolution

scikit-learn

mining study

Author

Samuel Idowu

Software Engineering 2

Yorick Sens

Ruhr-Universität Bochum

Thorsten Berger

Ruhr-Universität Bochum

University of Gothenburg

Software Engineering 2

Jacob Krueger

Eindhoven University of Technology

Michael Vierhauser

Ruhr-Universität Bochum

Proceedings of the ACM Symposium on Applied Computing

1272-1281
9798400702433 (ISBN)

39th Annual ACM Symposium on Applied Computing, SAC 2024
Avila, Spain

Subject Categories

Other Computer and Information Science

Business Administration

Software Engineering

Computer Systems

DOI

10.1145/3605098.3636056

More information

Latest update

7/25/2024