# Information-Theoretic Generalization Bounds: Tightness and Expressiveness
Doctoral thesis, 2022

In this thesis, we derive and evaluate several information-theoretic generalization bounds. First, we derive both average and high-probability bounds in a unified way, obtaining new results and recovering several bounds from the literature. We also develop new bounds by using tools from binary hypothesis testing. We extend these results to the conditional mutual information (CMI) framework, leading to results that depend on quantities such as the conditional information density and maximal leakage.
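As a concrete instance of the kind of bound discussed here, the well-known average generalization bound of Xu and Raginsky states that, for σ-sub-Gaussian losses, the expected generalization gap is at most √(2σ²·I(W;S)/n), where I(W;S) is the mutual information between the learned weights and the n training samples. A minimal sketch (function name is illustrative; the mutual-information value is a placeholder, since computing it for a real learner is the hard part):

```python
import math

def mi_generalization_bound(mutual_info, n, sigma=1.0):
    """Average-case bound sqrt(2 * sigma^2 * I(W;S) / n) for
    sigma-sub-Gaussian losses (Xu & Raginsky, 2017)."""
    return math.sqrt(2 * sigma**2 * mutual_info / n)

# With I(W;S) bounded, the bound decays at the slow rate 1/sqrt(n):
for n in (100, 10_000, 1_000_000):
    print(n, mi_generalization_bound(mutual_info=5.0, n=n))
```

The 1/√n decay visible here is the "slow rate" referred to below; fast-rate bounds instead decay as 1/n when the training loss is small.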

While the aforementioned bounds achieve a so-called slow rate with respect to the number of training samples, we extend our techniques to obtain bounds with a fast rate. Furthermore, we show that the CMI framework can be viewed as a way of automatically obtaining data-dependent priors, an important technique for obtaining numerically tight PAC-Bayesian bounds. A numerical evaluation of these bounds demonstrates that they are nonvacuous for deep neural networks, but diverge as training progresses.

To obtain numerically tighter results, we strengthen our bounds through the use of the samplewise evaluated CMI, which depends on the information captured by the losses of the neural network rather than its weights. Furthermore, we make use of convex comparator functions, such as the binary relative entropy, to obtain tighter characterizations for low training losses. Numerically, we find that these bounds are nearly tight for several deep neural network settings, and remain stable throughout training. We demonstrate the expressiveness of the evaluated CMI framework by using it to rederive nearly optimal guarantees for multiclass classification, known from classical learning theory.
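The binary relative entropy mentioned above is the KL divergence between two Bernoulli distributions, kl(q‖p) = q·log(q/p) + (1−q)·log((1−q)/(1−p)). Bounds of this form constrain kl(train loss ‖ test loss) by an information budget, and the test-loss bound is recovered by inverting the kl numerically. A minimal sketch of that inversion step, assuming losses in [0, 1] (function names are illustrative):

```python
import math

def binary_kl(q, p):
    """Binary relative entropy kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    val = 0.0
    if q > 0:
        val += q * math.log(q / p)
    if q < 1:
        val += (1 - q) * math.log((1 - q) / (1 - p))
    return val

def kl_inverse(q, budget, tol=1e-10):
    """Largest p >= q with binary_kl(q, p) <= budget, found by bisection.
    kl(q || .) is increasing on [q, 1), so bisection applies."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

For a zero training loss, kl(0‖p) = −log(1−p), so the inverted bound is 1 − e^(−budget) ≈ budget for small budgets; this linear (rather than square-root) dependence on the information term is what makes such comparator-function bounds tighter in the low-training-loss regime.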

Finally, we study the expressiveness of the evaluated CMI framework for meta learning, where data from several related tasks is used to improve performance on new tasks from the same task environment. Through the use of a one-step derivation and the evaluated CMI, we obtain new information-theoretic generalization bounds for meta learning that improve upon previous results. Under certain assumptions on the function classes used by the learning algorithm, we obtain convergence rates that match known classical results. By extending our analysis to oracle algorithms and considering a notion of task diversity, we obtain excess risk bounds for empirical risk minimizers.

information theory

neural networks

generalization

statistical learning

meta learning

PAC-Bayes

machine learning

## Author

### Fredrik Hellström

Chalmers, Electrical Engineering, Communication, Antennas and Optical Networks

### New Family of Generalization Bounds Using Samplewise Evaluated CMI

Advances in Neural Information Processing Systems, Vol. 35 (2022)

**Paper in proceeding**

### Evaluated CMI Bounds for Meta Learning: Tightness and Expressiveness

Advances in Neural Information Processing Systems, Vol. 35 (2022)

**Paper in proceeding**

### Fast-Rate Loss Bounds via Conditional Information Measures with Applications to Neural Networks

IEEE International Symposium on Information Theory - Proceedings, Vol. 2021-July (2021), pp. 952–957

**Paper in proceeding**

### Generalization Bounds via Information Density and Conditional Information Density

IEEE Journal on Selected Areas in Information Theory, Vol. 1 (2020), pp. 824–839

**Journal article**

Machine learning and artificial intelligence have recently taken huge strides. They are used to create new images, play chess far better than any human, and predict protein structures. To train a computer program to perform such tasks, one typically starts by collecting and labelling vast amounts of data. Then, the program trains to perform its task well on the collected data, before being deployed on new, unlabelled examples. For instance, the data can be amino acid sequences of proteins labelled with their corresponding structure, where the program is used to predict the structure based on the amino acids.

While these programs often perform well on new data—that is, they generalize—this is not always the case. Classical mathematical results for generalization can be intuitively described by Occam's razor: if the program is simple enough, it will generalize. However, the programs that are used in modern applications are too complex for these results to apply.

This thesis presents mathematical results that describe generalization in terms of the information that programs capture about the training data. Essentially: if the program performs well during training without memorizing the data, it will generalize. Our studies indicate that this approach may explain generalization in modern machine learning, while retaining insights about complexity from classical results. Using these results to improve the performance of machine learning in practice is an intriguing direction for further research.

### INNER: information theory of deep neural networks

Chalmers AI Research Centre (CHAIR), 2019-01-01 -- 2021-12-31.

### Subject Categories

Other Computer and Information Science

Communication Systems

Probability Theory and Statistics

Computer Vision and Robotics (Autonomous Systems)

### Areas of Advance

Information and Communication Technology

### Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

### ISBN

978-91-7905-782-4

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5248

### Publisher

Chalmers