The idea of implementing reinforcement learning (RL) in a computer was one of the earliest ideas about the possibility of AI. In a 1948 report, Alan Turing described a design for a pleasure-pain system: "When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent" (Turing [1948] 2004, 425). Turing did little to develop this idea, and it was not until the year of his death, 1954, that Wesley Clark and Belmont Farley simulated RL in a neural net on a digital computer (Farley and Clark 1954). In the same year, Marvin Minsky described an analog RL neural net in his Princeton PhD dissertation (Minsky 1954). There were earlier ingenious RL devices, though electromechanical rather than computer implementations, including Claude Shannon's maze-running ...
IEEE Transactions on Autonomous Mental Development, Jun 1, 2010
There is great interest in building intrinsic motivation into artificial systems using the reinforcement learning framework. Yet, what intrinsic motivation may mean computationally, and how it may differ from extrinsic motivation, remains a murky and controversial subject. In this article, we adopt an evolutionary perspective and define a new optimal reward framework that captures the pressure to design good primary reward functions that lead to evolutionary success across environments. The results of two computational experiments show that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation. The evolutionary perspective and the associated optimal reward framework thus lead to the conclusion that there are no hard and fast features distinguishing intrinsic and extrinsic reward computationally. Rather, the directness of the relationship between rewarding behavior and evolutionary success varies along a continuum.
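Under assumed notation (none of the symbols below appear in the article), the optimal reward pressure described above can be written as a single objective:

    % Sketch of the optimal reward objective; all symbols are assumed for illustration.
    % \mathcal{R}: space of candidate primary reward functions
    % \mathcal{E}: distribution over environments the agent may face
    % A(r): the learning agent equipped with reward function r
    % F(A(r), e): evolutionary success (fitness) that agent achieves in environment e
    r^{*} \;=\; \arg\max_{r \in \mathcal{R}} \; \mathbb{E}_{e \sim \mathcal{E}} \bigl[ F\bigl(A(r), e\bigr) \bigr]

On this reading, "intrinsic" and "extrinsic" rewards differ only in how directly the behavior they encourage contributes to F, which is the continuum the abstract describes.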
Humans and other animals often engage in activities for their own sakes rather than as steps toward solving practical problems. Psychologists call these intrinsically motivated behaviors. What we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. At the core of the model are recent theoretical and algorithmic advances in computational reinforcement learning, specifically, new concepts related to skills and new learning algorithms for learning with skill hierarchies.
Machine learning algorithms are everywhere, ranging from simple data analysis and pattern recognition tools used across the sciences to complex systems that achieve super-human performance on various tasks. Ensuring that they are well behaved (that they do not, for example, cause harm to humans or act in a racist or sexist way) is therefore not a hypothetical problem to be dealt with in the future, but a pressing one that we address here. We propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behaviors. To show the viability of this new framework, we use it to create new machine learning algorithms that preclude the sexist and harmful behaviors exhibited by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
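One hedged sketch of how such a framework might be organized in code follows; the candidate-selection/safety-test split and the Hoeffding-style bound are illustrative assumptions, not the authors' exact algorithm.

    # Hypothetical sketch: optimize on one data partition, then return the result
    # only if a high-confidence bound certifies the undesirable behavior is acceptable.
    import numpy as np

    def upper_bound(g_samples, delta):
        """(1 - delta)-confidence upper bound on E[g]; g is assumed bounded in [-1, 1]."""
        n = len(g_samples)
        return np.mean(g_samples) + 2.0 * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

    def train_with_safety_test(data, candidate_selection, g, delta=0.05):
        """candidate_selection and g (a measure of undesirable behavior, with
        E[g] <= 0 meaning acceptable) are user-supplied callables."""
        rng = np.random.default_rng(0)
        idx = rng.permutation(len(data))
        d_cand = [data[i] for i in idx[: len(data) // 2]]
        d_safe = [data[i] for i in idx[len(data) // 2 :]]
        theta = candidate_selection(d_cand, g, delta)        # optimize the primary objective
        g_samples = np.array([g(theta, x) for x in d_safe])  # held-out safety data
        if upper_bound(g_samples, delta) <= 0.0:
            return theta                                     # passes the safety test
        return "No Solution Found"                           # refuse rather than risk harm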
IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021
This retrospective describes the overall research project that gave rise to the authors' paper "Neuronlike adaptive elements that can solve difficult learning control problems" that was published in the 1983 Neural and Sensory Information Processing special issue of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS. This look back explains how this project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of this article that have been eclipsed by its main contributions, followed by commenting on some of the directions and cautions that should inform future research.
One of the primary challenges of developmental robotics is the question of how to learn and represent increasingly complex behavior in a self-motivated, open-ended way. Singh, Barto, and Chentanez (2004) have recently presented an algorithm for intrinsically motivated reinforcement learning that strives to achieve broad competence in an environment in a task-nonspecific manner by incorporating internal reward to build a hierarchical collection of skills. This paper suggests that with its emphasis on task-general, self-motivated, and hierarchical learning, intrinsically motivated reinforcement learning is an obvious choice for organizing behavior in developmental robotics. We present additional preliminary results from a gridworld abstraction of a robot environment and advocate a layered learning architecture for applying the algorithm on a physically embodied system.
An ability to adjust to changing environments and unforeseen circumstances is likely to be an important component of a successful autonomous space robot. This paper shows how to augment reinforcement learning algorithms with a method for automatically discovering certain types of subgoals online. By creating useful new subgoals while learning, the agent is able to accelerate learning on a current task and to transfer its expertise to related tasks through the reuse of its ability to attain subgoals. Subgoals are created based on commonalities across multiple paths to a solution. We cast the task of finding these commonalities as a multiple-instance learning problem and use the concept of diverse density to find solutions. We introduced this approach in [10] and here we present additional results for a simulated mobile robot task.
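The diverse-density computation can be illustrated with a toy sketch; the bag scoring below is a crude stand-in under assumed details, not the paper's exact probabilistic model.

    # Successful trajectories act as positive bags and unsuccessful ones as negative
    # bags; states present in every success and few failures get the highest score.
    def diverse_density_scores(positive_bags, negative_bags, miss_penalty=0.1):
        candidates = {s for bag in positive_bags for s in bag}
        scores = {}
        for s in candidates:
            score = 1.0
            for bag in positive_bags:
                score *= 1.0 if s in bag else miss_penalty   # should lie on every solution path
            for bag in negative_bags:
                score *= miss_penalty if s in bag else 1.0   # should avoid failed paths
            scores[s] = score
        return scores

Peaks of this score that persist across many solution paths are the candidate subgoals.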
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
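A minimal sketch of the batch least-squares TD computation, written in its simplest one-step discounted form with an assumed interface:

    # Accumulate the statistics A and b from observed transitions, then solve A theta = b.
    import numpy as np

    def lstd(transitions, gamma=0.95):
        """transitions: list of (phi, reward, phi_next), where phi are feature vectors."""
        k = len(transitions[0][0])
        A = np.zeros((k, k))
        b = np.zeros(k)
        for phi, r, phi_next in transitions:
            phi = np.asarray(phi, dtype=float)
            phi_next = np.asarray(phi_next, dtype=float)
            A += np.outer(phi, phi - gamma * phi_next)   # per-transition statistic
            b += r * phi
        # theta defines the linear value estimate V(s) ~ phi(s) . theta
        return np.linalg.pinv(A) @ b

The recursive variant (RLS TD) instead maintains the inverse of A incrementally, e.g. with a Sherman-Morrison update, so no linear system has to be re-solved after each transition.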
Learning methods based on dynamic programming (DP) are receiving increasing attention in artificial intelligence. Researchers have argued that DP provides the appropriate basis for compiling planning results into reactive strategies for real-time control, as well as for learning such strategies when the system being controlled is incompletely known. We introduce an algorithm based on DP, which we call Real-Time DP (RTDP), by which an embedded system can improve its performance with experience. RTDP generalizes Korf's Learning-Real-Time-A* algorithm to problems involving uncertainty. We invoke results from the theory of asynchronous DP to prove that RTDP achieves optimal behavior in several different classes of problems. We also use the theory of asynchronous DP to illuminate aspects of other DP-based reinforcement learning methods such as Watkins' Q-Learning algorithm. A secondary aim of this article is to provide a bridge between AI research on real-time planning and learning and relevant concepts and algorithms from control theory.
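A rough sketch of one RTDP trial, under an assumed cost-to-go interface (explicit model P, immediate cost function, absorbing goal); the representation details are illustrative only.

    import random

    def rtdp_trial(start, goal, actions, P, cost, V, gamma=1.0):
        """P[s][a]: list of (prob, next_state); cost(s, a): immediate cost; V: dict of value estimates."""
        s = start
        while s != goal:
            # Bellman backup at the current state only (the defining feature of RTDP).
            q = {a: cost(s, a) + gamma * sum(p * V.get(s2, 0.0) for p, s2 in P[s][a])
                 for a in actions(s)}
            best = min(q, key=q.get)              # minimizing expected cost-to-go
            V[s] = q[best]
            # Act greedily with respect to the current estimates and sample the successor.
            probs, succs = zip(*P[s][best])
            s = random.choices(succs, weights=probs)[0]
        return V

Repeating such trials from representative start states concentrates the backups on states the greedy policy actually visits.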
Adaptive linear quadratic control using policy iteration
Proceedings of the American Control Conference, Baltimore, Maryland, June 1994. Steven J. Bradtke, B. Erik Ydstie, Computer Science Department, University of Massachusetts, Amherst, MA 01003 ...
Proceedings 1992 IEEE International Conference on Robotics and Automation
In this paper, a peg-in-hole insertion task is used as an example to illustrate the utility of direct associative reinforcement learning methods for learning control under real-world conditions of uncertainty and noise. An associative reinforcement learning system has to learn appropriate actions in various situations through search guided by evaluative performance feedback. We used such a learning system, implemented as a connectionist network, to learn active compliant control for peg-in-hole insertion. Our results indicate that direct reinforcement learning can be used to learn a reactive control strategy that works well even in the presence of a high degree of noise and uncertainty.
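A toy sketch of the direct associative reinforcement learning idea in its simplest single-unit form; the update rule and parameter names here are assumptions for illustration, not the controller used in these experiments.

    # Perturb a real-valued action, then move the policy toward actions that earned
    # better-than-expected evaluative feedback (a reinforcement-comparison update).
    import numpy as np

    def associative_rl_step(w, baseline, x, evaluate, sigma=0.1, alpha=0.05, beta=0.1):
        """w: weight vector; x: situation features; evaluate(x, a) -> scalar reward."""
        mean_action = float(w @ x)
        noise = np.random.normal(0.0, sigma)
        action = mean_action + noise                  # exploratory action
        r = evaluate(x, action)                       # evaluative feedback from the task
        w = w + alpha * (r - baseline) * noise * x    # correlate feedback with the perturbation
        baseline = baseline + beta * (r - baseline)   # running reinforcement baseline
        return w, baseline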
While exploring to find better solutions, an agent performing online reinforcement learning (RL) can perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's ...
We present a new method for automatically creating useful temporally-extended actions in reinforcement learning. Our method identifies states that lie between two densely-connected regions of the state space and generates temporally-extended actions (e.g., options) that take the agent efficiently to these states. We search for these states using graph partitioning methods on local views of the transition graph. This local perspective is a key property of our algorithm, one that differentiates it from most of the earlier work in this area and allows it to scale to problems with large state spaces.
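A small illustrative sketch of the idea under assumed details; as a stand-in for the paper's cut criterion it uses a spectral bipartition (Fiedler vector) of a local graph built from recent experience.

    # Build a local transition graph, split it into two densely connected blocks,
    # and rank states by how many of their edges cross the cut.
    import networkx as nx   # nx.fiedler_vector requires scipy; the local graph is assumed connected

    def candidate_bottlenecks(recent_transitions, k=5):
        """recent_transitions: (state, next_state) pairs from recent episodes."""
        g = nx.Graph(recent_transitions)
        nodes = list(g)
        vec = nx.fiedler_vector(g)                         # sign of each entry gives the bipartition
        side = {s: bool(vec[i] >= 0) for i, s in enumerate(nodes)}
        crossing = {s: sum(side[s] != side[t] for t in g[s]) for s in g}
        # Top scorers are candidate access states between the two regions,
        # i.e., natural subgoals for new options.
        return sorted(crossing, key=crossing.get, reverse=True)[:k]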
We present PolicyBlocks, an algorithm by which a reinforcement learning agent can extract useful macro-actions from a set of related tasks. The agent creates macro-actions by finding commonalities in solutions to previous tasks. Using these macro-actions, learning to do future related tasks is accelerated. This increase in performance is illustrated in a "rooms" grid-world, in which the macro-actions found by PolicyBlocks outperform even hand-designed macro-actions, and in a hydroelectric reservoir control task. We provide empirical comparisons of PolicyBlocks with the Reuse options of Bernstein (1999) and the SKILLS algorithm of Thrun and Schwartz (1995), which elucidate conditions under which each algorithm performs well.
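A simplified sketch of the commonality-finding step, under an assumed representation in which each solved task's policy is a mapping from states to actions; the scoring rule is illustrative, not necessarily the paper's.

    # A candidate macro-action ("policy block") is the partial policy on which a
    # group of task solutions agrees; larger, more widely shared blocks score higher.
    from itertools import combinations

    def common_partial_policy(policies):
        """Intersection of several state -> action dicts: the states where all agree."""
        return dict(set.intersection(*(set(p.items()) for p in policies)))

    def policyblocks_candidates(policies, min_tasks=2):
        scored = []
        for r in range(min_tasks, len(policies) + 1):
            for group in combinations(policies, r):
                block = common_partial_policy(group)
                if block:
                    scored.append((len(block) * r, block))   # coverage times number of tasks sharing it
        return [block for _, block in sorted(scored, key=lambda item: -item[0])]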
University of Massachusetts Amherst, Department of Computer Science, Technical Report, 2009
Learning factored transition models of structured environments has been shown to provide significant leverage when computing optimal policies for tasks within those environments. Previous work has focused on learning the structure of factored Markov Decision Processes (MDPs) with finite sets of states and actions. In this work we present an algorithm for online incremental learning of transition models of factored MDPs that have continuous, multi-dimensional state and action spaces. We use incremental density estimation techniques and information-theoretic principles to learn a factored model of the transition dynamics of an FMDP online from a single, continuing trajectory of experience.
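One hedged sketch of the structure-selection step: rank candidate parents of a state variable by an information-theoretic score estimated from recent transitions (the estimator and threshold below are assumptions, not the paper's exact procedure).

    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    def select_parents(inputs, next_value, max_parents=3, threshold=1e-3):
        """inputs: array (T, d) of state-action features; next_value: array (T,) of one
        state variable's next values. Returns the indices chosen as its parents."""
        mi = mutual_info_regression(inputs, next_value)      # nonparametric MI estimate
        ranked = np.argsort(mi)[::-1]
        return [int(i) for i in ranked[:max_parents] if mi[i] > threshold]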
Proceedings of the AAAI Conference on Artificial Intelligence
We study the problem of finding efficient exploration policies for the case in which an agent is momentarily not concerned with exploiting, and instead tries to compute a policy for later use. We first formally define the Optimal Exploration Problem as one of sequential sampling and show that its solutions correspond to paths of minimum expected length in the space of policies. We derive a model-free, local linear approximation to such solutions and use it to construct efficient exploration policies. We compare our model-free approach to other exploration techniques, including one with the best known PAC bounds, and show that ours is both based on a well-defined optimization problem and empirically efficient.
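One hedged way to write down the objective this abstract describes (the symbols are assumed here, not the paper's notation):

    % Choose the exploration policy mu that minimizes the expected number of steps
    % until the policy derived from the gathered data is near-optimal at the start state.
    \mu^{*} \;=\; \arg\min_{\mu} \; \mathbb{E}_{\mu}[\, T \,],
    \qquad
    T \;=\; \min\bigl\{\, t : V^{\hat{\pi}_t}(s_0) \ge V^{*}(s_0) - \epsilon \,\bigr\}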