Achieving Privacy in the Adversarial Multi-Armed Bandit
Paper in proceedings, 2017
In this paper, we improve the previously best known regret
University of Lille, France Chalmers University of Technology, Sweden Harvard University, USA email@example.com
movie recommendations can be formalized similarly (Pandey and Olston 2006).
Privacy can be a serious issue in the bandit setting (c.f. (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014)). For example, in clinical trials, we may want to detect and publish results about the best drug without leaking sensitive information, such as the patient’s health condition and genome. Differen- tial privacy (Dwork 2006) formally bounds the amount of information that a third party can learn no matter their power or side information.
Differential privacy has been used before in the stochastic setting (Tossou and Dimitrakakis 2016; Mishra and Thakurta 2015; Jain, Kothari, and Thakurta 2012) where the authors obtain optimal algorithms up to logarithmic factors. In the adversarial setting, (Thakurta and Smith 2013) adapts an algorithm called Follow The Approximate Leader to make it private and obtain a regret bound of O(T2/3). In this work, we show that a number of simple algorithms can satisfy privacy guarantees, while achieving nearly optimal regret (up to logarithmic factors) that scales naturally with the level of privacy desired.
Our work is also of independent interest for non-private multi-armed bandit algorithms, as there are competitive with the current state of the art against switching-cost adversaries (where we recover the optimal bound). Finally, we provide rigorous empirical results against a variety of adversaries.
The following section gives the main background and nota- tions. Section 3.1 describes meta-algorithms that perturb the gain sequence to achieve privacy, while Section 3.2 explains how to leverage the privacy inherent in the EXP3 algorithm by modifying the way gains are used. Section 4 compares our algorithms with EXP3 in a variety of settings. The full proofs of all our main results are in the full version.
2.1 The Multi-Armed Bandit problem
Formally, a bandit game is defined between an adversary and an agent as follows: there is a set of K arms A, and at each round t, the agent plays an arm It ∈ A. Given the choice It, the adversary grants the agent a gain gIt,t ∈ [0,1]. The agent only observes the gain of arm It, and not that of any other arms. The goal of this agent is to maximize its total gain
bound to achieve ε-differential privacy in oblivious adversarial
bandits from O(T2/3/ε) to O(√T lnT/ε). This is achieved
by combining a Laplace Mechanism with EXP3. We show that
though EXP3 is already differentially private, it leaks a linear
amount of information in T . However, we can improve this
privacy by relying on its intrinsic exponential mechanism for
selecting actions. This allows us to reach O(
a regret of O(T2/3) that holds against an adaptive adversary, an improvement from the best known of O(T3/4). This is done by using an algorithm that run EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis.