Information-Theoretic Generalization Bounds: Tightness and Expressiveness
Doctoral thesis, 2022

Machine learning has achieved impressive feats in numerous domains, largely driven by the emergence of deep neural networks. Due to the high complexity of these models, classical bounds on the generalization error---that is, the difference between training and test performance---fail to explain this success. This discrepancy between theory and practice motivates the search for new generalization guarantees, which must rely on properties other than function complexity. Information-theoretic bounds, which are intimately related to probably approximately correct (PAC)-Bayesian analysis, naturally incorporate a dependence on the relevant data distributions and learning algorithms. Hence, they are a promising candidate for studying generalization in deep neural networks.

In this thesis, we derive and evaluate several such information-theoretic generalization bounds. First, we derive both average and high-probability bounds in a unified way, obtaining new results and recovering several bounds from the literature. We also develop new bounds by using tools from binary hypothesis testing. We extend these results to the conditional mutual information (CMI) framework, leading to results that depend on quantities such as the conditional information density and maximal leakage.
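To fix ideas, a representative average-case bound of this kind, in the spirit of the mutual-information bound of Xu and Raginsky, can be stated as follows; the notation (hypothesis W, training set S of n samples, population loss L_D, training loss L_S, subgaussianity parameter sigma) is illustrative, and the bounds derived in the thesis may differ in their exact assumptions and constants:

\[
\bigl|\mathbb{E}\!\left[ L_{\mathcal{D}}(W) - L_{S}(W) \right]\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}\, I(W;S)}{n}},
\]

where I(W;S) is the mutual information between the output of the learning algorithm and its training data. The weaker the dependence of W on S, the tighter the guarantee.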

While the aforementioned bounds achieve a so-called slow rate with respect to the number of training samples, we extend our techniques to obtain bounds with a fast rate. Furthermore, we show that the CMI framework can be viewed as a way of automatically obtaining data-dependent priors, an important technique for obtaining numerically tight PAC-Bayesian bounds. A numerical evaluation of these bounds demonstrates that they are nonvacuous for deep neural networks, but that they diverge as training progresses.
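The distinction between rates, and the role of the prior, can be illustrated with a standard PAC-Bayesian bound of the McAllester–Maurer type; this is a textbook example rather than the specific bound developed in the thesis. For a posterior Q over hypotheses, a prior P chosen independently of the training data (or via a data-dependent construction of the kind discussed above), and losses bounded in [0,1], with probability at least 1-\delta over the draw of S,

\[
\mathbb{E}_{W\sim Q}\!\left[ L_{\mathcal{D}}(W) \right] \;\le\; \mathbb{E}_{W\sim Q}\!\left[ L_{S}(W) \right] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
\]

The square-root dependence on n is the slow rate; fast-rate bounds instead scale as (\mathrm{KL}(Q\,\|\,P)+\ln\frac{1}{\delta})/n, typically under additional conditions such as a small training loss. A prior P that is closer to the posterior Q makes \mathrm{KL}(Q\,\|\,P) smaller, which is why data-dependent priors are key to numerically tight bounds.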

To obtain numerically tighter results, we strengthen our bounds through the use of the samplewise evaluated CMI, which depends on the information captured by the losses of the neural network rather than its weights. Furthermore, we make use of convex comparator functions, such as the binary relative entropy, to obtain tighter characterizations for low training losses. Numerically, we find that these bounds are nearly tight for several deep neural network settings, and remain stable throughout training. We demonstrate the expressiveness of the evaluated CMI framework by using it to rederive nearly optimal guarantees for multiclass classification, known from classical learning theory.
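For concreteness, the binary relative entropy referred to here is the KL divergence between Bernoulli distributions with parameters p and q:

\[
d(p\,\|\,q) \;=\; p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}.
\]

A comparator bound of the schematic form d(L_S(W)\,\|\,L_{\mathcal{D}}(W)) \le (\iota + c)/n, where \iota stands for an evaluated-CMI-type information measure and c collects constants and logarithmic factors (this schematic form is illustrative rather than the exact statement in the thesis), explains the gain at low training losses: when L_S(W) \approx 0, inverting the binary relative entropy yields a population loss of order (\iota + c)/n, whereas relaxing the bound via Pinsker's inequality only yields the slower \sqrt{(\iota + c)/(2n)}.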

Finally, we study the expressiveness of the evaluated CMI framework for meta learning, where data from several related tasks is used to improve performance on new tasks from the same task environment. Through the use of a one-step derivation and the evaluated CMI, we obtain new information-theoretic generalization bounds for meta learning that improve upon previous results. Under certain assumptions on the function classes used by the learning algorithm, we obtain convergence rates that match known classical results. By extending our analysis to oracle algorithms and considering a notion of task diversity, we obtain excess risk bounds for empirical risk minimizers.
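As a brief clarification of the final claim, the excess risk of an empirical risk minimizer \hat{W} over a hypothesis class \mathcal{W} is its population loss relative to the best hypothesis in the class,

\[
L_{\mathcal{D}}(\hat{W}) - \inf_{w\in\mathcal{W}} L_{\mathcal{D}}(w),
\]

so that, unlike a pure generalization gap, an excess risk bound also accounts for how well the learned hypothesis compares to the best one available. In the meta-learning setting, the corresponding population quantity is averaged over tasks drawn from the task environment; the precise definitions used in the thesis may differ in their details.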

information theory

neural networks

generalization

statistical learning

meta learning

PAC-Bayes

machine learning

Author

Fredrik Hellström

Chalmers, Electrical Engineering, Communication, Antennas and Optical Networks

New Family of Generalization Bounds Using Samplewise Evaluated CMI

Advances in Neural Information Processing Systems, Vol. 35 (2022)

Paper in proceeding

Evaluated CMI Bounds for Meta Learning: Tightness and Expressiveness

Advances in Neural Information Processing Systems, Vol. 35 (2022)

Paper in proceeding

Fast-Rate Loss Bounds via Conditional Information Measures with Applications to Neural Networks

IEEE International Symposium on Information Theory - Proceedings, Vol. 2021-July (2021), p. 952-957

Paper in proceeding

Generalization Bounds via Information Density and Conditional Information Density

IEEE Journal on Selected Areas in Information Theory, Vol. 1 (2020), p. 824-839

Journal article

Information and Understanding — How Can Machines Learn from Data?

Machine learning and artificial intelligence have recently taken huge strides. These techniques are used to create new images, play chess far better than any human, and predict protein structures. To train a computer program to perform such tasks, one typically starts by collecting and labelling vast amounts of data. Then, the program trains to perform its task well on the collected data, before being deployed on new, unlabelled examples. For instance, the data can be amino acid sequences of proteins labelled with their corresponding structures, and the program is trained to predict the structure from the amino acids.

While these programs often perform well on new data—that is, they generalize—this is not always the case. Classical mathematical results for generalization can be intuitively described by Occam's razor: if the program is simple enough, it will generalize. However, the programs that are used in modern applications are too complex for these results to apply.

This thesis presents mathematical results that describe generalization in terms of the information that programs capture about the training data. Essentially: if the program performs well during training without memorizing the data, it will generalize. Our studies indicate that this approach may explain generalization in modern machine learning, while retaining insights about complexity from classical results. Using these results to improve the performance of machine learning in practice is an intriguing area for further research.

INNER: information theory of deep neural networks

Chalmers AI Research Centre (CHAIR), 2019-01-01 -- 2021-12-31.

Subject Categories

Other Computer and Information Science

Communication Systems

Probability Theory and Statistics

Computer Vision and Robotics (Autonomous Systems)

Areas of Advance

Information and Communication Technology

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

ISBN

978-91-7905-782-4

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5248

Publisher

Chalmers

EF-salen, Hörsalsvägen 11

Online

Opponent: Gergely Neu, Universitat Pompeu Fabra, Spain

More information

Latest update

10/27/2023