Algorithms and Regret Bounds for Multi-objective Contextual Bandits with Similarity Information
Contextual bandit algorithms have been shown to be effective in solving sequential decision-making problems in uncertain environments, with applications ranging from cognitive radio networks to recommender systems to medical diagnosis. Many of these real-world applications involve multiple and possibly conflicting objectives. In this thesis, we consider an extension of contextual bandits called multi-objective contextual bandits with similarity information. Unlike single-objective contextual bandits, in which the learner obtains a random scalar reward for each arm it selects, in multi-objective contextual bandits the learner obtains a random reward vector. Each component of the reward vector corresponds to one of the objectives, and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. For this setting, we first propose a new multi-objective contextual multi-armed bandit problem with similarity information that has two objectives, one of which dominates the other. Here, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. We then propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB) and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also evaluate the performance of MOC-MAB on synthetic and real-world datasets. In the second problem, we consider a multi-objective contextual bandit problem with an arbitrary number of objectives and a high-dimensional, possibly uncountable arm set that is endowed with similarity information. We propose an online learning algorithm called Pareto Contextual Zooming (PCZ) and prove that its Pareto regret is sublinear in the number of rounds and near-optimal.
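
The two notions of optimality mentioned above (Pareto optimality, and optimality under a dominant objective) can be illustrated with a minimal Python sketch. It is not MOC-MAB or PCZ, and it does not reproduce the formal regret definitions, which the abstract does not state. The arm names, the expected reward vectors in `MEANS`, and the helper functions are hypothetical and assume a single fixed context with known expected rewards, with the first component treated as the dominant objective.

```python
from typing import Dict, List, Tuple

Reward = Tuple[float, ...]

# Hypothetical expected reward vectors of four arms for one fixed context.
# Component 0 is assumed to be the dominant objective.
MEANS: Dict[str, Reward] = {
    "a": (0.8, 0.2),
    "b": (0.8, 0.6),
    "c": (0.5, 0.9),
    "d": (0.4, 0.4),
}


def dominates(u: Reward, v: Reward) -> bool:
    """True if u is at least as good as v in every objective
    and strictly better in at least one."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def pareto_front(means: Dict[str, Reward]) -> List[str]:
    """Arms whose expected reward vector is not dominated by any other arm."""
    return [
        arm
        for arm, u in means.items()
        if not any(dominates(v, u) for other, v in means.items() if other != arm)
    ]


def lexicographic_best(means: Dict[str, Reward]) -> str:
    """Among arms that maximize the dominant objective (component 0),
    return the one with the largest reward in the other objective."""
    best_dominant = max(u[0] for u in means.values())
    candidates = {arm: u for arm, u in means.items() if u[0] >= best_dominant}
    return max(candidates, key=lambda arm: candidates[arm][1])


print(pareto_front(MEANS))        # arms on the Pareto front
print(lexicographic_best(MEANS))  # optimal arm under the dominant objective
```

In this toy instance the arm that is optimal under the dominant objective also lies on the Pareto front, but the Pareto front contains a second arm that trades the dominant objective for the other one. A learner does not know these expected rewards in advance and must estimate them from observed reward vectors.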