Optimal Subsampling Designs Under Measurement Constraints
Doctoral thesis, 2023

We consider the problem of optimal subsample selection in an experimental setting where observing, or utilising, the full dataset for statistical analysis is practically infeasible. This may be due to, e.g., computational, economic, or even ethical cost constraints. As a result, statistical analyses must be restricted to a subset of the data. Choosing this subset in a manner that captures as much information as possible is essential.

In this thesis we present a theory and framework for optimal design in general subsampling problems. The methodology is applicable to a wide range of settings and inference problems, including regression modelling, parametric density estimation, and finite population inference. We discuss the use of auxiliary information and sequential optimal design for the implementation of optimal subsampling methods in practice and study the asymptotic properties of the resulting estimators.
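As a minimal sketch of the general idea, unequal probability subsampling combined with inverse probability weighting, the following example estimates a finite population total from a Poisson subsample. It assumes sampling probabilities proportional to a positive auxiliary variable, which is a simple placeholder and not the optimal design derived in the thesis. The function names `poisson_subsample` and `ipw_total`, the simulated data, and all parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def poisson_subsample(aux, expected_size, rng):
    """Poisson sampling with inclusion probabilities proportional to `aux`.

    Assumption: `aux` is positive and known for every unit. Probabilities are
    capped at 1, so the expected sample size can fall slightly below
    `expected_size` if any unit hits the cap.
    """
    aux = np.asarray(aux, dtype=float)
    probs = np.minimum(1.0, expected_size * aux / aux.sum())
    selected = rng.random(aux.size) < probs  # independent Bernoulli draws
    return selected, probs


def ipw_total(y, selected, probs):
    """Inverse probability weighted (Horvitz-Thompson) estimate of sum(y)."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(y[selected] / probs[selected]))


# Simulated finite population (illustrative values only).
rng = np.random.default_rng(0)
N = 10_000
aux = rng.gamma(shape=2.0, scale=1.0, size=N)   # auxiliary variable, known for all units
y = 3.0 * aux + rng.normal(scale=0.5, size=N)   # outcome, measured only on the subsample

selected, probs = poisson_subsample(aux, expected_size=500, rng=rng)
estimate = ipw_total(y, selected, probs)
print(f"subsample size: {selected.sum()}, estimate: {estimate:.1f}, true total: {y.sum():.1f}")
```

Weighting each sampled observation by the inverse of its inclusion probability keeps the estimator of the total unbiased whatever probabilities are used, as long as they are all positive. The choice of probabilities then only affects the variance, which is the quantity an optimal design seeks to reduce.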

The proposed methods are illustrated and evaluated in three problem areas: subsample selection for optimal prediction in active machine learning (Paper I), optimal control sampling in the analysis of safety-critical events in naturalistic driving studies (Paper II), and optimal subsampling in a scenario generation context for virtual safety assessment of an advanced driver assistance system (Paper III). In Paper IV we present a unified theory that encompasses and generalises the methods of Papers I–III and introduce a class of expected-distance-minimising designs with good theoretical and practical properties.

In Papers I–III we demonstrate a sample size reduction of 10–50% with the proposed methods compared to simple random sampling and traditional importance sampling methods, for the same level of performance. We propose a novel class of invariant linear optimality criteria, which in Paper IV are shown to reach 90–99% D-efficiency with 90–95% lower computational demand.

inverse probability weighting

unequal probability sampling

optimal design

active sampling

M-estimation

Pascal
Opponent: Frank Miller, Linköping University and Stockholm University, Sweden

Author

Henrik Imberg

Chalmers, Mathematical Sciences, Applied Mathematics and Statistics

Optimal sampling in unbiased active learning

Proceedings of Machine Learning Research, Vol. 108 (2020), p. 559-569

Paper in proceeding

Optimization of Two-Phase Sampling Designs with Application to Naturalistic Driving Studies

IEEE Transactions on Intelligent Transportation Systems, Vol. 23 (2022), p. 3575-3588

Journal article

Statistical inference for large populations is a central problem in statistics. In many situations, drawing conclusions from a census, i.e. complete enumeration of the entire population, is practically or economically impossible. Owing to the rapid development of IT and digitalisation, this problem has become even more pressing in recent years, as large-scale datasets have become increasingly available. In this thesis we develop statistical methods that use optimal subsampling to draw conclusions about a large database or population when only a subset can be included in the statistical analysis. The methods are applied to problems in, among other areas, machine learning and the evaluation of advanced vehicle safety systems. By selecting samples optimally, we can reduce the number of observations needed to answer a given question by up to 50% compared with existing methods. This has important practical implications and may, for example, mean a lower cost of conducting a study, or faster computations and lower energy consumption in large-scale computer-based experiments.

Statistical inference for large populations is a central problem in statistics. In many cases, complete enumeration is practically or economically infeasible, rendering subsampling inevitable. Stimulated by recent technological developments, this problem has gained renewed attention in the statistics and machine learning communities during the past few years. Some specific problems include analysis of massive datasets, and experiments with incomplete data under constraints on the number of additional measurements that can be taken. In this thesis, we develop methods for optimal subsampling when statistical analyses must be restricted to a subset of a large initial dataset or population. The methodology is applied and evaluated on problems in machine learning and traffic safety, including virtual safety assessment of an advanced driver assistance system. We demonstrate that optimal subsampling may reduce the sample size requirements by up to 50% compared to traditional methods, for the same level of performance. This has important practical implications, as it enables an experiment to run with fewer samples and lower cost.

Supporting the interaction of Humans and Automated vehicles: Preparing for the Environment of Tomorrow (Shape-IT)

European Commission (EC) (EC/H2020/860410), 2019-10-01 -- 2023-09-30.

Statistical methods to assess driving behaviour and causation of accidents from large naturalistic driving studies

Swedish Research Council (VR) (2012-5995), 2012-01-01 -- 2015-12-31.

Statistical sampling in machine learning

Stiftelsen Wilhelm och Martina Lundgrens Vetenskapsfond (2019-3132), 2019-05-01 -- 2019-12-31.

Stiftelsen Wilhelm och Martina Lundgrens Vetenskapsfond (2020-3446), 2020-05-01 -- 2020-12-31.

Improved quantitative driver behavior models and safety assessment methods for ADAS and AD (QUADRIS)

VINNOVA (2020-05156), 2021-04-01 -- 2024-03-31.

Roots

Basic sciences

Subject Categories

Probability Theory and Statistics

ISBN

978-91-7905-826-5

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5292

Publisher

Chalmers

More information

Latest update

7/17/2023