Academia.edu

Policy Optimization

21 papers · 0 followers

About this topic
Policy optimization is a process in decision-making and control theory that seeks to identify the most effective strategies or policies to achieve specific objectives, often through mathematical modeling and computational techniques. It involves evaluating and refining policies based on performance metrics to enhance outcomes in various contexts, such as economics, engineering, and artificial intelligence.

Key research themes

1. How can policy gradient and actor-critic methods improve sample efficiency and stability in reinforcement learning?

This research area focuses on developing policy optimization algorithms based on policy gradients and actor-critic architectures that achieve better sample efficiency, convergence guarantees, and stability, especially in continuous control domains. The motivation arises because traditional policy gradient methods suffer from high variance and sample inefficiency, and trust region approaches, while effective, can be computationally expensive or incompatible with certain architectures. Natural gradients, compatible function approximation, and novel surrogate objectives are core to advancing these methods.

Key finding: Introduced a new family of policy gradient methods called Proximal Policy Optimization (PPO) that use a clipped surrogate objective enabling multiple minibatch updates on the same data. PPO trades off performance guarantees…
Key finding: Developed natural actor-critic methods that incorporate natural gradients computed via the inverse Fisher information matrix, improving invariance to parameterization and leading to faster and more stable policy updates. The…
Key finding: Proposed an off-policy natural actor-critic algorithm that uses state-action distribution correction to handle off-policy data and compatible features allowing the use of nonlinear function approximators such as neural networks…
Key finding: Presented the first reinforcement learning actor-critic architecture in which the actor is trained via Equilibrium Propagation, a biologically plausible learning algorithm, and the critic via backpropagation. The hybrid…
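
The clipped surrogate objective named in the PPO finding above can be sketched in a few lines (a minimal NumPy illustration of the published objective; the function name and scalar interface here are our own, and the full algorithm additionally uses advantage estimation and minibatch SGD):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio: probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage A(s, a)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the elementwise minimum makes the objective a pessimistic
    # bound: once the ratio leaves [1-eps, 1+eps] in the profitable
    # direction, further movement yields no additional objective value
    # (and hence no gradient), which is what permits several minibatch
    # updates on the same batch of data.
    return np.minimum(unclipped, clipped)

# Ratio 1.5 with positive advantage is capped at the clipped value 1.2;
# ratio 0.5 with negative advantage is floored at the clipped value -0.8.
print(ppo_clip_objective(np.array([1.5, 0.5]), np.array([1.0, -1.0])))
```

The clipping range `epsilon=0.2` is the default reported in the PPO paper; in practice the objective is maximized over network parameters with the ratio computed from stored log-probabilities.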

2. How can policy optimization leverage temporal abstraction and goal-conditioned policies to improve decision making?

This line of research investigates augmenting policy optimization algorithms with temporal abstraction mechanisms such as options and goal-conditioned policies. The aim is to learn policies that operate over multiple time scales or are conditioned on specific goals, capturing environment dynamics more effectively. This facilitates better transferability, hierarchical learning, and improved exploration by structuring the policy space at a more functional level, beyond primitive actions. Methods include deriving option-critic architectures with policy gradient theorems for options and learning actionable latent representations from goal-conditioned policies.

Key finding: Derived policy gradient theorems for options, enabling simultaneous learning of intra-option policies, termination conditions, and the policy over options without additional extrinsic rewards or subgoals. This approach scales…
Key finding: Introduced actionable representations for control (ARC) derived from goal-conditioned policies, where Euclidean distances in the learned representation correspond to expected differences in actions required to reach different…
Key finding: Proposed leveraging stochastic abstract policies, which generalize over previously solved source tasks in relational representations, to accelerate learning in target tasks by transferring policy knowledge in a generalized…
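
The call-and-return execution model that option-critic methods build on can be sketched as follows (the `Option` container, toy environment, and `policy_over_options` interface are hypothetical illustrations; the learning of intra-option policies and termination functions, which is the papers' actual contribution, is omitted):

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    pi: Callable[[int], int]       # intra-option policy: state -> action
    beta: Callable[[int], float]   # termination probability: state -> [0, 1]

def run_option_episode(env_step, state, options, policy_over_options, max_steps=100):
    """Execute options call-and-return style: follow the current option's
    intra-option policy until its termination condition fires, then let the
    policy over options pick the next one."""
    trajectory = []
    option = policy_over_options(state)
    for _ in range(max_steps):
        action = options[option].pi(state)
        state, done = env_step(state, action)
        trajectory.append((option, action, state))
        if done:
            break
        if random.random() < options[option].beta(state):
            option = policy_over_options(state)  # option terminated: re-select
    return trajectory

# Toy chain environment: walk right from state 0; episode ends at state 5.
step = lambda s, a: (s + a, s + a >= 5)
opts = [Option(pi=lambda s: 1, beta=lambda s: 0.0)]  # one "walk right" option
traj = run_option_episode(step, 0, opts, lambda s: 0)
print(traj[-1])  # (0, 1, 5)
```

With a single never-terminating option this degenerates to a flat policy; temporal abstraction pays off when several options with learned terminations cover different regions of the state space.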

3. How can constrained policy learning from observational or batch data achieve minimax optimal regret under practical constraints?

This thematic area addresses policy optimization when learning from observational or batch data under constraints such as budget, fairness, or functional form. The core challenge is learning treatment assignment or decision policies that satisfy these constraints while optimizing expected outcomes. Researchers develop algorithms with regret guarantees that scale favorably with the complexity of the policy class and derive lower bounds for minimax regret, providing sharp theoretical characterizations and practical algorithms applicable beyond randomized trials, including settings with endogenous treatments.

Key finding: Derived minimax lower bounds for regret in constrained policy learning from observational data and proposed algorithms that asymptotically attain these bounds up to constant factors. The methods handle binary and continuous…
Key finding: Discussed the use of reinforcement learning for direct and model-free control from observed rewards without relying on value function approximation, emphasizing practical constraints in policy learning and the importance of…
Key finding: Proposed a policy search algorithm (PSDP) that, given a base distribution over states indicating typical visitation frequency, can efficiently find good non-stationary policies with performance guarantees after a finite…
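
The regret criterion underlying this theme is commonly formalized as follows (a standard formulation from the policy-learning literature for treatment assignment, stated here for context rather than quoted from the papers above). For covariates $X$, potential outcomes $Y(\cdot)$, and a constrained policy class $\Pi$, the regret of a policy $\pi$ is its shortfall against the best policy in the class:

$$
R(\pi) \;=\; \sup_{\pi' \in \Pi} \mathbb{E}\bigl[\,Y\bigl(\pi'(X)\bigr)\bigr] \;-\; \mathbb{E}\bigl[\,Y\bigl(\pi(X)\bigr)\bigr],
$$

and minimax optimality means that a learned policy $\hat\pi_n$ from $n$ observations attains, up to constant factors, the worst-case rate achievable by any procedure, typically on the order of $\sqrt{\mathrm{VC}(\Pi)/n}$ for classes of finite VC dimension. The lower bounds referenced above certify that no algorithm can improve on this rate uniformly over the data-generating distributions considered.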

All papers in Policy Optimization

A new application of our AI Abstract Engineering techniques to the quantum theory of entanglement is considered. We design an AI experiment with Conway's quantum particle equipped with mathematical free will (predicted by Conway's Strong Free…
In recent years, reinforcement learning (RL) has garnered increasing attention for its applications in various domains, including finance, robotics, and healthcare. One critical area in healthcare where RL has shown potential is the…
This paper presents an enhanced least-squares approach for solving reinforcement learning control problems. The model-free least-squares policy iteration (LSPI) method has been successfully used in this learning domain. Although LSPI is a…