Skip to main content

Shimon Whiteson

University of Oxford, Department of Computer Science, Faculty Member

Followers

56

Following

29

Co-authors

2

Public Views

Address: Irvine, California, United States

less

Interests

Uploads

Papers by Shimon Whiteson

Temporal Difference and Policy Search Methods for Reinforcement Learning: An Empirical Comparison

Proceedings of the 22nd National Conference on Artificial Intelligence Volume 2, 2007

Reinforcement learning (RL) methods have become popular in recent years because of their ability ... more Reinforcement learning (RL) methods have become popular in recent years because of their ability to solve complex tasks with minimal feedback. Both genetic algorithms (GAs) and temporal difference (TD) methods have proven effective at solving difficult RL problems, but few rigorous comparisons have been conducted. Thus, no general guidelines describing the methods' relative strengths and weaknesses are available. This paper summarizes a detailed empirical comparison between a GA and a TD method in Keepaway, a standard RL benchmark domain based on robot soccer. The results from this study help isolate the factors critical to the performance of each learning method and yield insights into their general strengths and weaknesses.

Using Informative Behavior to Increase Engagement in the TAMER Framework

Proceedings of the 2013 International Conference on Autonomous Agents and Multi Agent Systems, May 1, 2013

In this paper, we address a relatively unexplored aspect of designing agents that learn from huma... more In this paper, we address a relatively unexplored aspect of designing agents that learn from human training by investigating how the agent's non-task behavior can elicit human feedback of higher quality and quantity. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent's actions, as the foundation for our investigation. Then, we propose two new training interfaces to increase active involvement in the training process and thereby improve the agent's task performance. One provides information on the agent's uncertainty, the other on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent's performance, however, increases only in response to the addition of performanceoriented information, not by sharing uncertainty levels. Subsequent analysis of our results suggests that the organizational maxim about human behavior, "you get what you measure"-i.e., sharing metrics with people causes them to focus on maximizing or minimizing those metrics while deemphasizing other objectives-also applies to the training of agents, providing a powerful guiding principle for humanagent interface design in general.

Evolutionary Computation for Reinforcement Learning

Algorithms for evolutionary computation, which simulate the process of natural selection to solve... more Algorithms for evolutionary computation, which simulate the process of natural selection to solve optimization problems, are an effective tool for discovering high-performing reinforcement-learning policies. Because they can automatically find good representations, handle continuous action spaces, and cope with partial observability, evolutionary reinforcement-learning approaches have a strong empirical track record, sometimes significantly outperforming temporal-difference methods. This chapter surveys research on the application of evolutionary computation to reinforcement learning, overviewing methods for evolving neural-network topologies and weights, hybrid methods that also use temporal-difference methods, coevolutionary methods for multi-agent settings, generative and developmental systems, and methods for on-line evolutionary reinforcement learning.

Adaptive Job Routing and Scheduling

Eaai, 2005

Computer systems are rapidly becoming so complex that maintaining them with human support staffs ... more Computer systems are rapidly becoming so complex that maintaining them with human support staffs will be prohibitively expensive and inefficient. In response, visionaries have begun proposing that computer systems be imbued with the ability to configure themselves, diagnose failures, and ultimately repair themselves in response to these failures. However, despite convincing arguments that such a shift would be desirable, as of yet there has been little concrete progress made towards this goal. We view these problems as fundamentally machine learning challenges. Hence, this article presents a new network simulator designed to study the application of machine learning methods from a system-wide perspective. We also introduce learning-based methods for addressing the problems of job routing and CPU scheduling in the networks we simulate. Our experimental results verify that methods using machine learning outperform reasonable heuristic and hand-coded approaches on example networks designed to capture many of the complexities that exist in real systems.

Adaptive tile coding for value function approximation

Reinforcement learning problems are commonly tackled by estimating the optimal value function. In... more Reinforcement learning problems are commonly tackled by estimating the optimal value function. In many real-world problems, learning this value function requires a function approximator, which maps states to values via a parameterized function. In practice, the success of function approximators depends on the ability of the human designer to select an appropriate representation for the value function. This paper presents adaptive tile coding, a novel method that automates this design process for tile coding, a popular function approximator, by beginning with a simple representation with few tiles and refining it during learning by splitting existing tiles into smaller ones. In addition to automatically discovering effective representations, this approach provides a natural way to reduce the function approximator's level of generalization over time. Empirical results in multiple domains compare two different criteria for deciding which tiles to split and verify that adaptive tile coding can automatically discover effective representations and that its speed of learning is competitive with the best fixed representations.

Evolving Soccer Keepaway Players Through Task Decomposition

Machine Learning, May 1, 2005

Complex control tasks can often be solved by decomposing them into hierarchies of manageable subt... more Complex control tasks can often be solved by decomposing them into hierarchies of manageable subtasks. Such decompositions require designers to decide how much human knowledge should be used to help learn the resulting components. On one hand, encoding human knowledge requires manual effort and may incorrectly constrain the learner's hypothesis space or guide it away from the best solutions. On the other hand, it may make learning easier and enable the learner to tackle more complex tasks. This article examines the impact of this trade-off in tasks of varying difficulty. A space laid out by two dimensions is explored: 1) how much human assistance is given and 2) how difficult the task is. In particular, the neuroevolution learning algorithm is enhanced with three different methods for learning the components that result from a task decomposition. The first method, coevolution, is mostly unassisted by human knowledge. The second method, layered learning, is highly assisted. The third method, concurrent layered learning, is a novel combination of the first two that attempts to exploit human knowledge while retaining some of coevolution's flexibility. Detailed empirical results are presented comparing and contrasting these three approaches on two versions of a complex task, namely robot soccer keepaway, that differ in difficulty of learning. These results confirm that, given a suitable task decomposition, neuroevolution can master difficult tasks. Furthermore, they demonstrate that the appropriate level of human assistance depends critically on the difficulty of the problem.

Evolving keepaway soccer players through task decomposition

To appear in Adaptive Behavior, 15(1), 2007. Empirical Studies in Action Selection with Reinforcement Learning

MergeRUCB (A Method for Large-Scale Online Ranker Evaluation)

Proceedings of the Eighth Acm International Conference, Feb 2, 2015

Estimating interleaved comparison outcomes from historical click data

Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12, 2012

ABSTRACT Interleaved comparison methods, which compare rankers using click data, are a promising ... more ABSTRACT Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied and find that a recent probabilistic method allows such data reuse, even though it is biased when applied to historical data. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.

A probabilistic method for inferring preferences from clicks

Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011

Evaluating rankers using implicit feedback, such as clicks on documents in a result list, is an i... more Evaluating rankers using implicit feedback, such as clicks on documents in a result list, is an increasingly popular alternative to traditional evaluation methods based on explicit relevance judgments. Previous work has shown that so-called interleaved comparison methods can utilize click data to detect small differences between rankers and can be applied to learn ranking functions online.

Machinelearningforeventselectioninhighenergyphysics

Towards Autonomic Computing: Adaptive Job Routing

Computer systems are rapidly becoming so complex that maintaining them with human support staffs ... more Computer systems are rapidly becoming so complex that maintaining them with human support staffs will be prohibitively expensive and inefficient. In response, visionaries have begun proposing that computer systems be imbued with the ability to configure themselves, diagnose failures, and ultimately repair themselves in response to these failures. However, despite convincing arguments that such a shift would be desirable, as of yet there has been little concrete progress made towards this goal. We view these problems as fundamentally machine learning challenges. Hence, this article presents a new network simulator designed to study the application of machine learning methods from a systemwide perspective. We also introduce learning-based methods for addressing the problems of job routing and scheduling in the networks we simulate. Our experimental results verify that methods using machine learning outperform heuristic and hand-coded approaches on an example network designed to capture many of the complexities that exist in real systems.

Stochastic Optimization for Collision Selection in High Energy Physics

The underlying structure of matter can be deeply probed via precision measurements of the mass of... more The underlying structure of matter can be deeply probed via precision measurements of the mass of the top quark, the most massive observed fundamental particle. Top quarks can be produced and studied only in collisions at high energy particle accelerators. Most collisions, however, do not produce top quarks; making precise measurements requires culling these collisions into a sample that is rich in collisions producing top quarks (signal) and spare in collisions producing other particles (background). Collision selection is typically performed with heuristics or supervised learning methods. However, such approaches are suboptimal because they assume that the selector with the highest classification accuracy will yield a mass measurement with the smallest statistical uncertainty. In practice, however, the mass measurement is more sensitive to some backgrounds than others. Hence, this paper presents a new approach that uses stochastic optimization techniques to directly search for selectors that minimize statistical uncertainty in the top quark mass measurement. Empirical results confirm that stochastically optimized selectors have much smaller uncertainty. This new approach contributes substantially to our knowledge of the top quark's mass, as the new selectors are currently in use selecting real collisions.

Evolutionary Function Approximation for Reinforcement Learning

Journal of Machine Learning Research

Temporal difference methods are theoretically grounded and empirically effective methods for addr... more Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.

Sample-Efficient Evolutionary Function Approximation for Reinforcement Learning

Reinforcement learning problems are commonly tackled with temporal difference methods, which atte... more Reinforcement learning problems are commonly tackled with temporal difference methods, which attempt to estimate the agent's optimal value function. In most real-world problems, learning this value function requires a function approximator, which maps state-action pairs to values via a concise, parameterized function. In practice, the success of function approximators depends on the ability of the human designer to select an appropriate representation for the value function. A recently developed approach called evolutionary function approximation uses evolutionary computation to automate the search for effective representations. While this approach can substantially improve the performance of TD methods, it requires many sample episodes to do so. We present an enhancement to evolutionary function approximation that makes it much more sample-efficient by exploiting the off-policy nature of certain TD methods. Empirical results in a server job scheduling domain demonstrate that the enhanced method can learn better policies than evolution or TD methods alone and can do so in many fewer episodes than standard evolutionary function approximation. * We wish to thank Matt Taylor, Nick Jong and the anonymous reviewers for helpful comments about earlier versions of this paper.

Using informative behavior to increase engagement while learning from human reward

Autonomous Agents and Multi-Agent Systems, 2015

In this paper, we address a relatively unexplored aspect of designing agents that learn from huma... more In this paper, we address a relatively unexplored aspect of designing agents that learn from human training by investigating how the agent's non-task behavior can elicit human feedback of higher quality and quantity. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent's actions, as the foundation for our investigation. Then, we propose two new training interfaces to increase active involvement in the training process and thereby improve the agent's task performance. One provides information on the agent's uncertainty, the other on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent's performance, however, increases only in response to the addition of performanceoriented information, not by sharing uncertainty levels. Subsequent analysis of our results suggests that the organizational maxim about human behavior, "you get what you measure"-i.e., sharing metrics with people causes them to focus on maximizing or minimizing those metrics while deemphasizing other objectives-also applies to the training of agents, providing a powerful guiding principle for humanagent interface design in general.

Automatic feature selection using FS-NEAT

This article describes a series of experiments used to analyze the FS-NEAT method on a double pol... more This article describes a series of experiments used to analyze the FS-NEAT method on a double pole-balancing domain. The FS-NEAT method is compared with regular NEAT to discern its strengths and weaknesses. Both FS-NEAT and regular NEAT find a policy, implemented in a neural network, to solve the pole-balancing task by use of genetic algorithms. FS-NEAT, contrary to regular NEAT, uses a different starting population. Whereas regular NEAT networks start out with links between all the inputs and the output, FS-NEAT networks have only one link between an input and the output. It is believed that this more simple starting topology allows for effective feature (input)-selection.

A Large-Scale Study of Agents Learning from Human Reward

The TAMER framework, which provides a way for agents to learn to solve tasks using human-generate... more The TAMER framework, which provides a way for agents to learn to solve tasks using human-generated rewards, has been examined in several small-scale studies, each with a few dozen subjects. In this paper, we present the results of the first large-scale study of TAMER, which was performed at the NEMO science museum in Amsterdam and involved 561 subjects. Our results show for the first time that an agent using TAMER can successfully learn to play Infinite Mario, a challenging reinforcement-learning benchmark problem based on the popular video game, given feedback from both adult (N = 209) and child (N = 352) trainers. In addition, our study supports prior studies demonstrating the importance of bidirectional feedback and competitive elements in the training interface. Finally, our results also shed light on the potential for using trainers' facial expressions as a reward signal, as well as the role of age and gender in trainer behavior and agent performance.

Challenge balancing for personalised game spaces

Abstract—In this paper we propose an approach for personalising the space in which a game is play... more Abstract—In this paper we propose an approach for personalising the space in which a game is played (ie, levels)–to the end of tailoring the experienced challenge to the individual user during actual play of the game. Our approach specifically considers two design challenges, namely implicit user feedback and high risk of user abandonment. We contribute an approach that acknowledges that for effective online game personalisation, one needs to (1) offline learn a policy that is appropriate in expectation across users–to be used for ...