Deep Q-Networks

20 papers

4 followers

About this topic

Deep Q-Networks (DQN) are a class of reinforcement learning algorithms that combine Q-learning with deep neural networks to approximate the optimal action-value function. They enable agents to learn effective policies in high-dimensional state spaces by using experience replay and target networks to stabilize training.
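The two stabilizing ingredients named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any listed paper's implementation: `ReplayBuffer`, `td_target`, and the `target_q` callable (a stand-in for a frozen target network that returns one Q-value per action) are hypothetical names, and the discount factor default is arbitrary.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_state, done, target_q, gamma=0.99):
    """Bootstrapped target r + gamma * max_a' Q_target(s', a').

    `target_q` is an assumed callable standing in for the target network;
    terminal transitions use the reward alone.
    """
    if done:
        return reward
    return reward + gamma * max(target_q(next_state))
```

In a full agent, the online network would be regressed toward `td_target` on each sampled batch, and the target network's weights would be refreshed from the online network only occasionally.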

Key research themes

1. How can Deep Q-Networks improve learning efficiency and performance robustness in autonomous navigation and path planning for mobile agents?

This theme explores the application of Deep Q-Networks (DQN) in guiding autonomous agents, such as mobile robots and vehicles, to efficiently navigate complex and dynamic environments. The research focuses on enhancing sample efficiency, overcoming high-dimensional state spaces, and ensuring generalizability and safety in navigation tasks. It studies the integration of DQN with techniques like experience replay, heuristic knowledge, and simulation environments to enable real-time decision-making in unknown or partially known spaces, addressing challenges in path planning, obstacle avoidance, and autonomous driving.

Path Planning for Intelligent Robots Based on Deep Q-learning With Experience Replay and Heuristic Knowledge

by IEEE/CAA J. Autom. Sinica

2020, IEEE/CAA Journal of Automatica Sinica

Key finding: This paper presents an approach combining DQN, experience replay, and heuristic knowledge to efficiently enable smart robot path planning and obstacle avoidance in unknown environments. The use of experience replay addresses... Read more

DQN-Based Deep Reinforcement Learning for Autonomous Driving

by Pedro Revenga

2023

Key finding: The study applies DQN agents to the urban autonomous driving scenario simulated in CARLA, addressing policy learning for lane following and collision avoidance using sensor inputs and path planners. It highlights both... Read more

Short-range Robotic Navigation and Exploration Tasks via Deep Q-Networks for Biomedical Applications

by Joel Disu

2020, 2020 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT)

Key finding: This work demonstrates the effective use of DQN for mobile robot navigation and exploration in a biomedical operating room environment, leveraging rewards reflecting successful task completion and collision avoidance. It... Read more

Autonomous Navigation of Robots: Optimization with DQN

by Edisson Jordan

2023, Applied Sciences

Key finding: This research develops a DQN-based reinforcement learning control system for autonomous mobile robot navigation in dynamic environments simulated in Gazebo. The model integrates state-feedback from sensors and trains policies... Read more

Indoor navigation for mobile robots based on deep reinforcement learning with convolutional neural network

by International Journal of Electrical and Computer Engineering (IJECE)

2025, International Journal of Electrical and Computer Engineering (IJECE)

Key finding: The paper proposes a convolutional neural network-driven DQN model for controlling a four-wheel mobile robot’s line tracking based on camera inputs in a Gazebo simulated environment. The model achieves superior tracking... Read more

2. In what ways can Deep Q-Networks be enhanced or adapted to address challenges in training stability, hyperparameter sensitivity, and time discretization robustness?

This research area investigates methodological innovations and theoretical analyses to improve DQN training efficiency, robustness to environmental and algorithmic parameters, and stability under different time discretizations. It encompasses algorithmic contributions such as dynamic reward mechanisms, capacity reduction strategies for experience replay, and theoretical formalizations about Q-function behavior in continuous or near-continuous time settings. The goal is to enhance the reliability and applicability of DQN in diverse real-world scenarios by addressing known limitations in training procedures and hyperparameter tuning.

Making Deep Q-learning methods robust to time discretization

by Léonard Blier

2022

Key finding: The authors theoretically prove the collapse of traditional Q-learning in the continuous-time limit and propose architectural and algorithmic adjustments called Deep Advantage Updating to maintain learning performance across... Read more

Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning

by Jianbing Shen

2023, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Key finding: This work presents a reinforcement learning framework utilizing continuous Deep Q-Learning techniques (notably NAF) for dynamic hyperparameter optimization in object tracking. By treating hyperparameter selection as a... Read more

Explaining Deep Q-Learning Experience Replay with SHapley Additive exPlanations

by Luca Longo

2023, Machine Learning and Knowledge Extraction

Key finding: By experimentally reducing the capacity of Experience Replay in Deep Q-Learning across Atari games, this paper finds that moderate buffer size reduction (from 10,000 to 5,000) does not significantly impair performance,... Read more

3. How can Deep Q-Learning be applied to complex decision-making problems involving high-dimensional, combinatorial, or multi-agent action spaces such as financial portfolio trading, cloud load balancing, or multi-agent target search?

This theme focuses on the extension of Deep Q-Learning methodologies to domains with sophisticated action and state representations, including multi-asset financial markets, cloud computing infrastructures, and cooperative multi-agent systems. Research contributions include devising specialized discrete or combinatorial action spaces, mapping infeasible actions to feasible alternatives, hybrid learning architectures, and the use of distributed Q-learning to optimize collective decision-making. Such work addresses scaling challenges and practical applicability considerations for DQN-based solutions beyond simple control tasks.

An intelligent financial portfolio trading strategy using deep Q-learning

by Min Kyu Sim

2025, arXiv (Cornell University)

Key finding: This study formulates portfolio trading as a Markov decision process with a discrete combinatorial action space representing buy/hold/sell decisions per asset. It introduces a novel mapping function to convert infeasible... Read more
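As a rough illustration of the discrete combinatorial action space described in this summary, the sketch below enumerates one buy/hold/sell decision per asset. The function name `build_action_space` is hypothetical, and the paper's mapping from infeasible to feasible actions is not described in this excerpt, so it is not reproduced here.

```python
from itertools import product

ACTIONS_PER_ASSET = ("sell", "hold", "buy")  # labels taken from the summary above


def build_action_space(n_assets):
    """Enumerate every joint action: one decision per asset (3**n_assets in total)."""
    return list(product(ACTIONS_PER_ASSET, repeat=n_assets))


actions = build_action_space(3)
print(len(actions))  # 27 joint actions for 3 assets
print(actions[0])    # ('sell', 'sell', 'sell')
```

The count grows as 3 to the power of the number of assets, which is why such formulations are usually limited to small portfolios.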

Simulation of the navigation of a mobile robot by the Q-Learning using artificial neuron networks

by Hatem Mezaache

2024, Citeseer

Key finding: This paper proposes Rough Q-learning, integrating rough set theory with classical Q-learning to address overestimation bias in approximated Q-values. The approach improves algorithm stability and performance by minimizing the... Read more

Hybrid algorithm for optimized clustering and load balancing using deep Q reccurent neural networks in cloud computing

by beei iaes and

2025, Bulletin of Electrical Engineering and Informatics

Key finding: The authors propose a hybrid Deep Q Recurrent Neural Network (DQRNN) combining deep Q-networks and recurrent architectures to manage cloud load balancing, incorporating factors such as supply, demand, capacity, resource... Read more

Detection of Hidden Moving Targets by a Group of Mobile Agents with Deep Q-Learning

by Irad Ben-Gal

2024, Robotics

Key finding: This paper develops a distributed multi-agent search strategy leveraging deep Q-learning with error-prone sensors to maximize expected information gain under statistical detection uncertainties (type I and II errors). It... Read more

All papers in Deep Q-Networks

Indoor navigation for mobile robots based on deep reinforcement learning with convolutional neural network

by International Journal of Electrical and Computer Engineering (IJECE)

2025, International Journal of Electrical and Computer Engineering (IJECE)

The mobile robot is an intelligent device that can achieve many tasks in life. For autonomous, navigation based on the line on the ground is often used because it helps the robot to move along a predefined path, simplifies the path... more

Hindsight Experience Replay with Kronecker Product Approximate Curvature

by Shalabh Bhatnagar

2025, arXiv (Cornell University)

Deep Recurrent Q-Learning vs Deep Q-Learning on a simple Partially Observable Markov Decision Process with Minecraft

by Vincent Beraud

2025, ArXiv

Deep Q-Learning has been successfully applied to a wide variety of tasks in the past several years. However, the architecture of the vanilla Deep Q-Network is not suited to deal with partially observable environments such as 3D video... more

Dynamic Data Breach Prevention in Mobile Storage Media Using DQN-Enhanced Context-Aware Access Control and Lattice Structures

by Vinay Kumar Kasula

2025, INTERNATIONAL JOURNAL OF RESEARCH IN ELECTRONICS AND COMPUTER ENGINEERING

This study proposes an enhanced method for preventing data breaches in mobile storage media by improving access control mechanisms through the integration of Deep Q-Network (DQN) algorithms. Building on attribute-based encryption (ABE)... more

OPTIMIZING AUTONOMOUS VEHICLE NAVIGATION USING DATA SCIENCE AND REINFORCEMENT LEARNING

by Swathi Suddala

2025, International Research Journal of Modernization in Engineering Technology and Science

In response to the complex demands of autonomous vehicle (AV) navigation in urban environments, this study explores a data-driven, reinforcement learning (RL)-based approach to optimize navigation for AVs, enhancing both efficiency and... more

Sociological Analysis of Artificial Intelligence, Benefits, Concerns and it's Future Implications

by Mohd Sultan Rather

2025

This paper conducts a sociological analysis of artificial intelligence (AI), examining its benefits, concerns, and future implications for society. Through a multidimensional lens, it explores how AI technologies shape social structures,... more

Hybrid algorithm for optimized clustering and load balancing using deep Q reccurent neural networks in cloud computing

by beei iaes and

2025, Bulletin of Electrical Engineering and Informatics

Cloud services are among the technologies that are developing the fastest. Additionally, it is acknowledged that load balancing poses a major obstacle to reaching energy efficiency. Distributing the load among several resources in order... more

Predicting demand in changing environments: a review on the use of reinforcement learning in forecasting models

by beei iaes

2025, Bulletin of Electrical Engineering and Informatics

This systematic review, carried out under the PRISMA methodology, aims to identify how reinforcement learning has been used in demand forecasting, distinguishing the problems they are trying to overcome, recognizing the algorithms used,... more

Vol. 7 Issue 1 ETHICAL IMPLICATIONS OF ARTIFICIAL INTELLIGENCE: A REVIEW OF EARLY RESEARCH AND PERSPECTIVES

by Sai Teja Boppiniti

2024, International Journal of Research in Engineering and Applied Sciences(IJREAS)

The rapid advancement of Artificial Intelligence (AI) technologies has significantly transformed various sectors, including healthcare, finance, and transportation. However, these developments raise critical ethical concerns that require... more

Figure 1. Advent of Artificial Intelligence (AI)

... policies to protect individuals’ privacy rights while enabling the beneficial use of AI in healthcare. The challenge remains to balance the utility of data for AI advancements against the fundamental right to privacy. ... at risk, the net effect of AI on employment will depend on various factors, including worker adaptability and ...

Figure 3. The number of articles identified for each ethical theme

In this review, a quantitative analysis was conducted to highlight the prevalence of various ethical concerns ... studies addressing key ethical themes, providing a numerical representation of research interest in each area. This systematic review methodology aims to provide a comprehensive understanding of the ethical implications ...

Evolution of Reinforcement Learning: From Q-Learning to Deep

by Sai Teja Boppiniti

2024, International Journal of Research in Engineering and Applied Sciences(IJREAS)

Reinforcement Learning (RL) has emerged as a pivotal area in artificial intelligence, revolutionizing the way agents learn optimal behaviors through interaction with their environment. This paper explores the evolution of RL techniques,... more

high-dimensional state spaces typical in real-world applications. The advent of Deep Q-Networks (DQN) in 2013 marked a revolutionary step forward in RL. By integrating deep learning with reinforcement learning, DQNs enabled agents to approximate Q-values using deep neural networks, thereby overcoming the limitations of traditional Q-learning. This integration has empowered RL to excel in intricate environments, such as playing video games like Atari, where agents learn to develop strategies by processing high-dimensional visual input. As RL continues to evolve, it has garnered significant attention across various domains, including robotics, finance, healthcare, and autonomous systems. The advancements in algorithmic strategies and architectural innovations have led to substantial improvements in the performance and applicability of RL techniques. This paper aims to provide a comprehensive overview of the evolution of reinforcement learning, highlighting the progression from classical methods like Q-learning to advanced approaches such as DQNs. Additionally, we will discuss the current trends and future directions in this rapidly advancing field, emphasizing the potential for RL to solve increasingly complex real-world problems.

Uncertainty-aware Path Planning using Reinforcement Learning and Deep Learning Methods

by Journal of Computer and Knowledge Engineering and

2024, Journal of Computer and Knowledge Engineering

This paper proposes new algorithms to improve Reinforcement Learning (RL) and Deep Q-Network (DQN) methods for path planning considering uncertainty in the perception of environment. The study aimed to formulate and solve the path... more

Development of deep reinforcement learning for inverted pendulum

by Khoa Đăng

2024, International Journal of Electrical and Computer Engineering (IJECE)

This paper presents a modification of the deep Q-network (DQN) in deep reinforcement learning to control the angle of the inverted pendulum (IP). The original DQN method often uses two actions related to two force states like constant... more

Int J Elec & Comp Eng, Vol. 13, No. 4, August 2023: 3895-3902

The model of an inverted pendulum is described in Figure 1. In general, the IP consists of a pendulum on top of a cart that moves forward and backward along a rail, and the pendulum can rotate freely around the joint between it and the cart. The dynamical equation of the IP system [28] can be presented as (1) and (2), where the parameters are defined in Table 1. To maintain the desired angle between the pendulum and the Y-axis, deep reinforcement learning (DRL) is used to generate the force F for the cart, as presented in the next section.

2.2. Reinforcement learning

Reinforcement learning (RL) is a type of machine learning based on trial and error. An agent using RL learns from its experience in an environment. A typical RL model is described in Figure 2. RL has two main components: the agent and the environment. The environment takes the action a(t) from the agent and generates a state s(t) and a reward r(t), which are sent back to the agent. The agent runs an algorithm that finds the best action under the policy π, whose goal is to maximize the reward received in return. RL thus forms a closed loop between the environment and the agent. Each pass through the loop is called a step, and an episode contains N steps. The training process therefore runs N×M loop iterations, where N is the number of steps per episode and M is the number of episodes, as in the paper's code example.
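The paper's own code example is not reproduced on this page. As a stand-in, the following minimal sketch shows the loop structure the excerpt describes: M episodes of N steps each, with the agent sending an action and the environment returning a state and a reward. `ToyEnv`, `run_training`, and the constant policy are hypothetical and only illustrate the control flow.

```python
class ToyEnv:
    """Hypothetical stand-in environment: the state is a step counter, reward is 1 for action 1."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward


def run_training(env, policy, n_steps, m_episodes):
    """Run M episodes of N steps each; returns the total reward of every episode."""
    episode_rewards = []
    for _ in range(m_episodes):               # M episodes
        state = env.reset()
        total = 0.0
        for _ in range(n_steps):              # N steps per episode
            action = policy(state)            # agent picks a(t) from s(t)
            state, reward = env.step(action)  # environment returns next state and r(t)
            total += reward
        episode_rewards.append(total)
    return episode_rewards


rewards = run_training(ToyEnv(), policy=lambda s: 1, n_steps=5, m_episodes=3)
print(rewards)  # [5.0, 5.0, 5.0]
```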

Figure 3. DQN model using a neural network

Deep Q-networks (DQN) use a neural network (NN) to approximate the Q-value Q(s(t), a(t)) instead of using the Q-table of the RL method. This NN forms a model named the prediction model (PM), which contains three layers: the input layer (the current state of the environment), the hidden layer (computations with an activation function), and the output layer (the predicted Q-value). The DQN model is shown in Figure 3.
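A three-layer prediction model of this kind can be sketched in plain Python. The layer sizes (4 state variables, 8 hidden units, 2 actions), the tanh activation, and the weight range are assumptions made for illustration; the excerpt does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Affine map: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def q_values(state, hidden, output):
    """Input layer -> hidden layer (tanh, assumed) -> output layer (one Q-value per action)."""
    h = [math.tanh(v) for v in dense(state, *hidden)]
    return dense(h, *output)


rng = random.Random(0)
hidden = init_layer(n_in=4, n_out=8, rng=rng)  # assumed sizes
output = init_layer(n_in=8, n_out=2, rng=rng)  # two actions, assumed
q = q_values([0.0, 0.1, -0.05, 0.0], hidden, output)
greedy_action = max(range(len(q)), key=q.__getitem__)  # index of the largest Q-value
```

Unlike a Q-table, the same weights are shared across all states, so the network can return Q-values for states it has never visited.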

Figure 6. Reward and average reward in 50 episodes gained with the training 70,000 steps (a) reward and average reward of model 1, (b) reward and average reward of model 2, and (c) reward and average reward of model 3

Figure 7. Force tracking in each model (a) force control in model 1, (b) force control in model 2, and (c) force control in model 3

Normally, RL uses the epsilon-greedy method [29], [30] as the policy to generate the action. This policy can use a table named the Q-table as the reference, which shows the relationship between the state and the action. Each value in the Q-table is called a Q-value $Q(s_j, a_i)$, where $j = 1, 2, \ldots, m$ and $i = 1, 2, \ldots, n$. For example, a Q-table is shown in Table 2. The selected action is given by the policy as in (3), where $\varepsilon$ is the probability of selecting a random action, ranging from 0 to 1. The main purpose of RL is to find the maximum total reward based on the policy. Updating the policy is essential for selecting the best action for the environment, and it can be done [31] with the Bellman equation in (4).
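Equations (3) and (4) are not reproduced on this page. The sketch below therefore uses the standard textbook forms of epsilon-greedy selection and the tabular Q-learning update, which may differ in notation from the paper; the learning rate `alpha` and discount `gamma` values are arbitrary assumptions.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the one with the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q_table[s_next])
    q_table[s][a] += alpha * (target - q_table[s][a])


q_table = [[0.0, 0.0], [0.0, 1.0]]  # m = 2 states, n = 2 actions
q_update(q_table, s=0, a=1, r=1.0, s_next=1)
print(q_table[0][1])  # approximately 0.19, i.e. 0.1 * (1.0 + 0.9 * 1.0 - 0.0)
```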

Table 4. Neural network configuration in three cases

Fourthly, each model is trained for 70,000 steps. During training, the cart-pole episode is terminated if the angle is greater than 12 degrees or less than -12 degrees. Figure 6 shows the total reward obtained in each episode while training the three models for 70,000 steps. Figures 6(a) to 6(c) correspond to model 1, model 2, and model 3, respectively, which are described in Table 4. For clarity, we computed the average reward over the last 50 episodes (blue line). Because the DQN model learns by trial and error, the total reward can change with the ratio of errors to successes during the trials. All steps are performed on a computer with the following specification: CPU (Intel Core i7-7500U 2.7 GHz (4 CPUs)), GPU (NVIDIA GeForce MX150, 2 GB), RAM (12 GB), HDD (256 GB).
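The termination rule and the 50-episode average mentioned above are simple to express in code. This is an illustrative sketch only; the function names are hypothetical, and treating exactly ±12 degrees as still valid is an assumption read from the "greater than / less than" wording.

```python
ANGLE_LIMIT_DEG = 12.0  # termination threshold stated in the excerpt
WINDOW = 50             # number of most recent episodes averaged


def is_terminated(angle_deg):
    """Episode ends once the pendulum angle leaves the [-12, 12] degree band."""
    return angle_deg > ANGLE_LIMIT_DEG or angle_deg < -ANGLE_LIMIT_DEG


def trailing_average(episode_rewards, window=WINDOW):
    """Mean reward over the last `window` episodes (fewer if not enough episodes yet)."""
    recent = episode_rewards[-window:]
    return sum(recent) / len(recent) if recent else 0.0
```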

A Nonparametric Off-Policy Policy Gradient

by Samuele Tosatto

2023, arXiv (Cornell University)

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient... more

Policy Learning and Evaluation with Randomized Quasi-Monte Carlo

by yi-fan chen

2023, arXiv (Cornell University)

Reinforcement learning constantly deals with hard integrals, for example when computing expectations in policy evaluation and policy iteration. These integrals are rarely analytically solvable and typically estimated with the Monte Carlo... more

Development of deep reinforcement learning for inverted pendulum

by International Journal of Electrical and Computer Engineering (IJECE)

2023, International Journal of Electrical and Computer Engineering (IJECE)

Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

by Imran Usman

2023, Wireless Communications and Mobile Computing

In reinforcement learning (RL), an agent learns an environment through hit and trial. This behavior allows the agent to learn in complex and difficult environments. In RL, the agent normally learns the given environment by exploring or... more

Reinforcement Learning Algorithms: An Overview and Classification

by Katarina Grolinger

2023, 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)

The desire to make applications and machines more intelligent and the aspiration to enable their operation without human interaction have been driving innovations in neural networks, deep learning, and other machine learning techniques.... more

Hierarchical Reinforcement Learning for Air-to-Air Combat

by Henry Diaz

2023, 2021 International Conference on Unmanned Aircraft Systems (ICUAS)

Artificial Intelligence (AI) is becoming a critical component in the defense industry, as recently demonstrated by DARPA's AlphaDogfight Trials (ADT). ADT sought to vet the feasibility of AI algorithms capable of piloting an F-16 in... more

Fig. 1: Rendering of the simulation environment

The environment provided for the dogfighting scenario was developed by the Johns Hopkins University Applied Physics Lab (JHU-APL) as an OpenAI gym environment. The physics of the F-16 aircraft are simulated with JSBSim, a high-fidelity open-source flight dynamics model [48]. A rendering of the environment is shown in Figure 1.

The observation space for each agent includes information about ownship aircraft (fuel load, thrust, control surface deflection, health), aerodynamics (alpha and beta angles), position (local plane coordinates, velocity, and acceleration), and attitude (Euler angles, rates, and accelerations). The agent also gets the position (local plane coordinates and velocity) and attitude (Euler angles and rates) information of its opponent, as well as its opponent's health. All state information from the environment is provided without modeled sensor noise.

Fig. 3: Example of damage calculated over the course of an episode

where $r_t$ is the reward at time step $t$, $d_{self}$ is the damage to the agent, $d_{opp}$ is the damage to the opponent (Figure 3), and $T = 300$, the maximum duration of an engagement.

Fig. 4: High-level architecture of PHANG-MAN agent

Our agent, PHANG-MAN (Policy Hierarchy for Adaptive Novel Generation of MANeuvers), is composed of a 2-layer hierarchy of policies. On the low level, there is an array of policies that have been trained to excel in a particular region of the state space. At the high level, a single policy selects which low-level policy to activate given the current context of the engagement. Our architecture is shown in Figure 4.

Fig. 3: $R_{relative\ position}$, $R_{closure}$, and $R_{gunsnap(blue)}$ components of the CZ policy’s reward function

Fig. 7: Illustration of agent selection during a single episode. Policy selector (PS) with a team of 3 agents: CZ, CS, and AS. Opponents: CS (top) and CZ (bottom)

Fig. 8: Normalized utilization of low-level policies vs. PHANG-MAN (self-play) and Randy (random maneuver agent)

Fig. 9: Day 3 tournament results

Fig. 10: VR enabled cockpit for human vs. AI matchups

view of pertinent information (e.g., track angle, relative distance to opponent, altitude, fuel, etc.). As an additional visual assist, an icon pointing out the opponent's direction to the pilot when outside his vision was provided, along with a red flashing overlay of the entire pilot’s view when he received a gun snap.

TABLE I: List of common hyper-parameters used in the SAC training.

Deep Reinforcement Learning for Tehran Stock Trading

by Neda Yousefi

2023, Indonesian Journal of Data and Science

One of the most interesting topics for research and also for making a profit is stock trading. Artificial intelligence has had a great impact on this path. A lot of research has been done to investigate the application of machine... more

Figure 1. Deep Deterministic Policy Gradient (DDPG) architecture

Figure 2. Deep Deterministic Policy Gradient (DDPG) Algorithm

Figure 3. Advantage Actor Critic (A2C) architecture

Figure 5. Stock price history as per close price

Figure 6. DDPG loss during the learning process in various iterations

Figure 8. Convergence of the A2C trading agent in various iterations

Comparing Figure 7 and Figure 8, almost all iterations of the DDPG trading agent behaved alike and converged to a specific value, whereas different iterations of the A2C trading agent led to different results. In addition, the average results from the A2C trading agent are lower than those of the DDPG trading agent.

Evaluation criteria are needed to measure the efficiency of the proposed deep reinforcement learning algorithms for stock trading and to show what such methods would yield in practice. The main goal of stock trading is to maximize long-term profit. As is usual, criteria such as annualized return (AR) and Sharpe ratio (SR) are used here to evaluate financial strategies and stock trading performance. The annualized return is the geometric average amount of money earned by an investment each year over a given time period [16].
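Both criteria can be computed in a few lines. The excerpt defines the annualized return as a geometric average but does not give a formula for the Sharpe ratio, so the sketch below assumes the common definition (annualized mean excess return over its sample standard deviation), a zero risk-free rate, and 252 trading periods per year. The function names are hypothetical.

```python
import math


def annualized_return(start_value, end_value, years):
    """Geometric average yearly return: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def sharpe_ratio(period_returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation (assumed definition)."""
    excess = [r - risk_free for r in period_returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)
```

For example, a portfolio that grows from 100 to 121 over two years has an annualized return of about 10 percent, since 1.1 squared is 1.21.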

Table 1. Results by applying the test dataset (AR=Annualized Return and SR= Sharpe Ratio)

RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

by Sven Mika

2023

Researchers and practitioners in the field of reinforcement learning (RL) frequently leverage parallel computation, which has led to a plethora of new algorithms and systems in the last few years. In this paper, we re-examine the... more

A review of motion planning algorithms for intelligent robotics

by Jamie Pote and

2023, Journal of Intelligent Manufacturing

We investigate and analyze principles of typical motion planning algorithms. These include traditional planning algorithms, supervised learning, optimal value reinforcement learning, policy gradient reinforcement learning. Traditional... more

Cooperative Multi-agent Control Using Deep Reinforcement Learning

by jayesh gupta

2023, Autonomous Agents and Multiagent Systems

This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy... more

Deep Reinforcement Learning based Local Planner for UAV Obstacle Avoidance using Demonstration Data

by James Whidborne

2022, ArXiv

In this paper, a deep reinforcement learning (DRL) method is proposed to address the problem of UAV navigation in an unknown environment. However, DRL algorithms are limited by the data efficiency problem as they typically require a huge... more

Survivable Hyper-Redundant Robotic Arm with Bayesian Policy Morphing

by apan dastider

2022, 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE)

In this paper we present a Bayesian reinforcement learning framework that allows robotic manipulators to adaptively recover from random mechanical failures autonomously, hence being survivable. To this end, we formulate the framework of... more

Hierarchical Reinforcement Learning for Air-to-Air Combat

by David Rosenbluth

2022, 2021 International Conference on Unmanned Aircraft Systems (ICUAS)

Reinforcement learning for resource management in multi-tenant serverless platforms

by Tamer Basar

2022, Proceedings of the 2nd European Workshop on Machine Learning and Systems

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and... more

Variational Inequalities for Heterogeneous Microstructures Based on Couple-Stress Theory

by Sourish Chakravarty

2022, International Journal for Multiscale Computational Engineering

The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to... more

Path Following Control for UAV Using Deep Reinforcement Learning Approach

by Youmin Zhang

2022, Guidance, Navigation and Control

Unmanned aerial vehicles (UAVs) have been extensively used in civil and industrial applications due to the rapid development of the guidance, navigation and control (GNC) technologies. Especially, using deep reinforcement learning methods... more

During the training phase, there are three kinds of training paths, as shown in Fig. 4. The first one is a circle path, which aims at training the continuous turning ability. The second one is a rectangular path, which aims at training sharp turning ability. The third one is a sine or cosine curve path generated by trigonometric functions, which aims at training the comprehensive turning ability. Since the shape of the training paths has been determined, in order to guarantee the diversity of the training data, the size and magnitude of the training path differ in each training episode. Each training episode consists of three training missions, which stand for the three training paths. Training efficiency is an important measurement, which illustrates
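The three path families can be generated as waypoint lists, with the size redrawn per episode. This is a hypothetical sketch of the idea, not the authors' generator: the function names, the waypoint counts, and the size range of 50 to 200 units are all assumptions.

```python
import math
import random


def circle_path(radius, n=100):
    """Waypoints on a circle: exercises continuous turning."""
    return [(radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
            for k in range(n)]


def rectangle_path(width, height):
    """Corner waypoints of a closed rectangle: exercises sharp turning."""
    return [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height), (0.0, 0.0)]


def sine_path(amplitude, wavelength, n=100):
    """Waypoints along one period of a sine curve: exercises mixed turning."""
    return [(wavelength * k / n, amplitude * math.sin(2 * math.pi * k / n)) for k in range(n + 1)]


def sample_episode_paths(rng):
    """One episode = three missions, with size and magnitude redrawn each episode."""
    scale = rng.uniform(50.0, 200.0)  # hypothetical size range
    return [circle_path(scale), rectangle_path(scale, scale / 2), sine_path(scale / 4, scale)]
```

Calling `sample_episode_paths(random.Random(seed))` once per episode keeps the path shapes fixed while varying their dimensions, which is the diversity mechanism the excerpt describes.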

Fig. 5. Successful rates of DQN, original DDPG and DERB-DDPG (with moving window of 10 episodes).

2.2. Motion control in Markov decision process model

Fig. 2. The RL motion control structure in MDP model.

To solve the path following control problem in this paper, an RL algorithm is used to construct a motion controller for the UAV. The UAV can learn the optimal control policy from experiences that are rewarded through trial-and-error interactions with the environment. Generally, RL algorithms are based on the Markov decision process (MDP) model. As shown in Fig. 2, at each time step $t$, the UAV observes the environment $E_t$ and gets state $s_t$. After that, action $a_t$ is taken and the UAV eventually receives a reward $r_t$. What differs from other RL algorithms is that the action $a_t$ is real valued and the action space is continuous. The inner dynamics of the UAV are represented by the transition probability model $p(s_{t+1} \mid s_t, a_t)$. At each time instant, the actor network $\pi$ determines the action $a_t$, and the reward $r_t$ is calculated by the reward function $r(s_t, a_t)$. The objective of the UAV is to maximize the accumulated discounted reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ from the current time step $t$ to a future time step $T$. The discount factor $\gamma$ ranges between 0 and 1.
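The accumulated discounted reward described in this excerpt is a weighted sum of future rewards, with each reward scaled by the discount factor raised to its distance from the current step. A minimal sketch, with the hypothetical helper `discounted_return` taking the rewards from step t to step T as a list:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] for k = 0, 1, ..., accumulated backward for efficiency."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A discount factor near 0 makes the agent short-sighted, while a value near 1 weights distant rewards almost as much as immediate ones.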

Fig. 3. The framework of DDPG for path following control.

Fig. 6. Accumulated reward of DQN, original DDPG and DERB-DDPG (with moving window of 10 episodes).

The average cross-track error is illustrated in Fig. 7, which is obtained by calculating the average value of the cross-track error at each time step. The cross-track errors of all three algorithms keep decreasing as the number of training episodes increases. This shows the effectiveness of the reward function: with the designed reward function, all the algorithms can minimize the cross-track error. The DERB-DDPG has a smaller cross-track error than the DQN and original DDPG at all times.

Fig. 7. Average cross-track error of DQN, original DDPG and DERB-DDPG (with moving window of 10 episodes).

Fig. 8. Path following performance comparison.

the best overall performance. Although the original DDPG and DQN may perform better than DERB-DDPG at some waypoints, their performances are not stable, especially in a complicated case. This shows the robustness and stability of the proposed DERB-DDPG algorithm.

Fig. 9. Path following performances on another example.

Harnessing Deep Reinforcement Learning to Construct Time-Dependent Optimal Fields for Quantum Control Dynamics

by Bryan M Wong

2022

We present an efficient deep reinforcement learning (DRL) approach to automatically construct time-dependent optimal control fields that enable desired transitions in reduced-dimensional chemical systems. Our DRL approach gives impressive...

An Autonomous Emotional Virtual Character: An Approach with Deep and Goal-Parameterized Reinforcement Learning

by Creto Vidal

2022, Journal on Interactive Systems

We have developed an autonomous virtual character guided by emotions. The agent is a virtual character who lives in a three-dimensional maze world. We found that emotion drivers can induce the behavior of a trained agent. Our approach is...

AWD3: Dynamic Reduction of the Estimation Bias

by Furkan Burak mutlu

2022, 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI)

Value-based deep Reinforcement Learning (RL) algorithms suffer from the estimation bias primarily caused by function approximation and temporal difference (TD) learning. This problem induces faulty state-action value estimates and...

Discrete linear-complexity reinforcement learning in continuous action spaces for Q-learning algorithms

by Lukas Mandrake

2022

In this article, we sketch an algorithm that extends the Q-learning algorithms to the continuous action space domain. Our method is based on the discretization of the action space. Despite the commonly used discretization methods, our...

Benchmarking Deep Reinforcement Learning Algorithms for Vision-based Robotics

by Ardhendu Behera

2022

This paper presents a benchmarking study of some of the state-of-the-art reinforcement learning algorithms used for solving two simulated vision-based robotics problems. The algorithms considered in this study include soft actor-critic...

Drone Deep Reinforcement Learning: A Review

by Bilel Benjdira

2022, Electronics

Unmanned Aerial Vehicles (UAVs) are increasingly being used in many challenging and diversified applications. These applications belong to the civilian and the military fields. To name a few: infrastructure inspection, traffic patrolling,...

Transferring Domain Knowledge with an Adviser in Continuous Tasks

by Alex Xavier

2022

Recent advances in Reinforcement Learning (RL) have surpassed human-level performance in many simulated environments. However, existing reinforcement learning techniques are incapable of explicitly incorporating already known...

Hindsight Experience Replay with Kronecker Product Approximate Curvature

by Dhuruva Priyan Gowri Mariyappan

2022, ArXiv

Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks in sparse-reward environments. But due to its reduced sample efficiency and slower convergence, HER fails to perform...

TEAC: Intergrating Trust Region and Max Entropy Actor Critic for Continuous Control

by hongyu zang

2022

Trust region methods and maximum entropy methods are two state-of-the-art branches used in reinforcement learning (RL) for the benefits of stability and exploration in continuous environments, respectively. This paper proposes to...

Resources Sharing in 5G Networks: Learning-Enabled Incentives and Coalitional Games

by sakti Winoto

2022, IEEE Systems Journal

Smart systems are often battery-constrained, and compete for resources from remote clouds, which results in high delay. Collaboratively sharing resources among neighbors in proximity is promising to control such delay for time-sensitive...

Reinforcement learning control of robot manipulator

by Lucas Pereira Cotrim

2022, Revista Brasileira de Computação Aplicada

Since the establishment of robotics in industrial applications, industrial robot programming involves the repetitive and time-consuming process of manually specifying a fixed trajectory, which results in machine idle time in terms of...

Robust Deep Reinforcement Learning for Quadcopter Control

by Ali Minai

2022

Deep reinforcement learning (RL) has made it possible to solve complex robotics problems using neural networks as function approximators. However, the policies trained on stationary environments suffer in terms of generalization when...

Keywords: Reinforcement learning control; Robust adaptive control; Robotics; Flying robots

Fig. 2. Neural Network Architectures: (a) The network setup used for training the robust policy, where $\mu_t$ is the action by policy $\pi_\theta$ and $\bar{\mu}_t$ is the action by the adversary $\bar{\pi}$; (b) Architecture of the neural networks used for policy $\pi_\theta$ and for adversary $\bar{\pi}$; (c) Architecture of the neural network used as the critic (action-value function) $Q(s_t, u_t)$.

where $R \in SO(3)$ is the rotational matrix from body to world frame of reference and is given by (3). In this section, we briefly discuss the dynamics of the quadcopter. A quadcopter of 'X'-configuration is used in the work. Figure 1 shows the schematic diagram of the quadcopter along with its physics engine simulated model. The body frame origin of the drone is placed at its center of mass. Thrust motors of this system rotate only in one direction and produce a positive thrust along the $Z_B$ axis. The translational motion of the UAV in the world frame and rotational motion in the body frame are represented by equations (1) and (2), respectively.

Fig. 3. Comparison of policies trained using DDPG (left) and AR-DDPG (right); the color maps represent the average episode return over 10 episodes for corresponding tuple of (Mass Perturbation, Action Perturbation).

Table 1. Quadcopter physics model parameters

Table 2. AR-DDPG Hyperparameters

we chose $\alpha = 0.1$. For higher values of $\alpha$, the adversary gets more control and training a policy becomes difficult. This was also reported by Tessler et al. (2019). To check the robustness of the control policy trained using the proposed method of AR-DDPG for the UAV, we measured the performance of the drone for waypoint navigation using the reward function (9) to evaluate the cumulative reward per episode. While testing this policy, both the external and internal parameters of the quadcopter were changed while the trained policy weights were kept unchanged. We did not use the adversary network or the critic network in the testing phase.
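How the scalar controls the adversary's influence can be shown with a small sketch. It assumes the noisy action-robust formulation described by Tessler et al. (2019), in which the executed action is a convex combination of the policy's action and the adversary's action; the excerpt does not state which variant AR-DDPG uses, so this is an assumption. The function name and the example actions are hypothetical.

```python
import numpy as np


def mix_actions(policy_action, adversary_action, alpha):
    """Blend the two actions; alpha is the adversary's share (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    policy_action = np.asarray(policy_action, dtype=float)
    adversary_action = np.asarray(adversary_action, dtype=float)
    # alpha = 0 leaves the policy in full control; alpha = 1 hands
    # control entirely to the adversary.
    return (1.0 - alpha) * policy_action + alpha * adversary_action


# Hypothetical two-dimensional actions.
executed = mix_actions([1.0, 0.0], [-1.0, 1.0], alpha=0.1)
print(executed)
```

Under this assumed form, raising the scalar shifts the executed action toward the adversary's choice, which is consistent with the observation that training becomes harder for larger values.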

Table 3. Test parameters

value of the external and internal parameters, the parameters of the policies trained using AR-DDPG and DDPG were kept frozen. The drone was then initialized at a state sampled using the strategy described in Section 3.2 and allowed to perform the task of reaching the goal location. This test was performed 10 times for each pair of internal and external parameters. These results show that the robust policy trained using AR-DDPG acquires higher rewards than the DDPG policy when the environment parameters differ from those seen by the policy during training. This shows the robustness of the trained neural policy to unseen environmental factors. The robust policy also performed comparatively better than the policy trained using DDPG when no internal or external perturbations were introduced in the environment. Another interesting point to note from these results is that the training of the robust policy did not require any knowledge about the model uncertainties in set P except the scalar value $\alpha$. The adversary $\bar{\pi}$, trained for taking the worst possible action, allowed the controller to be implicitly prepared for the range of uncertainties that may arise in the drone model.

Variational Inequalities for Heterogeneous Microstructures Based on Couple-Stress Theory

by Sonjoy Das

2022, International Journal for Multiscale Computational Engineering

Jerry’s New Year

by Ricardo Bueno

2022

Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles for two...

Generating Electrical Energy from Tyne and Wear Metro

by AHMET GORGULU

2022, International Journal of Mechanical Engineering

Utilizing the collected experience tuples in the replay buffer (RB) is the primary way of exploiting the experiences in the off-policy reinforcement learning (RL) algorithms, and, therefore, the sampling scheme for the experience tuples...

RLINK: Deep reinforcement learning for user identity linkage

by Guandong Xu

2022, World Wide Web

User identity linkage is a task of recognizing the identities of the same user across different social networks (SN). Previous works tackle this problem via estimating the pairwise similarity between identities from different SN,...

adjusted the linkage strategy after each matching step due to the high complexity. In order to model this long-term influence effectively, we treat UIL as a Markov Decision Process, a novel formulation, and propose a deep reinforcement learning framework (RLink) to automatically match identities in two different social networks. Figure 2 illustrates the overall RLink process. One state consists of three components, i.e. two social network structures and previously matched identity pairs. According to the current state, the agent performs an action. After the action is performed, the state changes at the next time step and a reward is fed to the agent to adjust its policy. Because the action space is large and dynamic in the UIL process, we adopt an Actor-Critic framework [32], in which the actor network generates a deterministic action based on the current state and the critic network evaluates the quality of this state-action pair. Concretely, for

Figure 2 The Procedure of the proposed RLink of Reinforcement Learning based User Identity Linkage. The blue link represents friend relation in social network, and orange line at state $S_i$ represents matched identity pair. At time $i$, agent generates a pair of matching identities as an action according to current state. After agent performs this action, the state would be changed at time $i + 1$ and next action can be generated based on $S_{i+1}$

Figure 3 The framework of Actor-Critic network, where the actor is comprised of an Encoder-Decoder architecture and the critic is DQN. The inputs of Actor are history identity pairs and network structure, where history identity pairs were generated by our Actor-Critic network before current epoch (See (4)). Remarkably, $h_0$ is a zero vector. Then this action and current state are input into the Critic to evaluate the quality of this action

Figure 4 Detailed Performance Comparison on Twitter-Foursquare Dataset

Figure 5 Comparison between DDPG and DQN: The x-axis shows the training episodes. The y-axis shows the total reward of each episode. The red line represents DDPG and the green line represents DQN

Figure 6 Q-value performance with different reward on Last.fm-Myspace and Linkedin-Aminer. The x-axis shows the training episodes. The y-axis shows the normalized Q-value of each session. The value of immediate reward is equal to 1/−1, while long-term reward is } / el

Table 3 Performance comparison on user identity linkage

3. Fairbairns Theorie und ausgewählte philosophische Freud-Interpretationen

by Graham S Clarke

2022, Theorie persönlicher Beziehungen

This paper addresses the question of how a previously available control policy πs can be used as a supervisor to more quickly and safely train a new learned control policy πL for a robot. A weighted average of the supervisor and learned...

Partial Policy-based Reinforcement Learning for Anatomical Landmark Localization in 3D Medical Images

by WALID ABDULLAH AL

2022, IEEE Transactions on Medical Imaging

Utilizing the idea of long-term cumulative return, reinforcement learning (RL) has shown remarkable performance in various fields. We propose a formulation of the landmark localization in 3D medical images as a reinforcement learning...

Count-Based Exploration in Feature Space for Reinforcement Learning

by Tom Everitt

2022, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence

We introduce a new count-based optimistic exploration algorithm for Reinforcement Learning (RL) that is feasible in environments with high-dimensional state-action spaces. The success of RL algorithms in these domains depends crucially on...

Target Driven Visual Navigation with Hybrid Asynchronous Universal Successor Representations

by Rivindu Weerasekera

2022, ArXiv

Being able to navigate to a target with minimal supervision and prior knowledge is critical to creating human-like assistive agents. Prior work on map-based and map-less approaches has limited generalizability. In this paper, we present...

Action Branching Architectures for Deep Reinforcement Learning

by Fabio Pardo

2022

Discrete-action algorithms have been central to numerous recent successes of deep reinforcement learning. However, applying these algorithms to high-dimensional action tasks requires tackling the combinatorial increase of the number of...

Figure 1: A conceptual illustration of the proposed action branching network architecture. The shared network module computes a latent representation of the input state that is then passed forward to the several action branches. Each action branch is responsible for controlling an individual degree of freedom and the concatenation of the selected sub-actions results in a joint-action tuple.

Figure 2: A visualization of the specific action branching network implemented for the proposed BDQ agent. When a state is provided at the input, the shared decision module computes a latent representation that is then used for evaluation of the state value and the factorized (state-dependent) action advantages on the subsequent independent branches. The state value and the factorized advantages are then combined, via a special aggregation layer, to output the Q-values for each action dimension. These factorized Q-values are then queried for the generation of a joint-action tuple. The weights of the fully connected neural layers are denoted by the gray trapezoids and the size of each layer is indicated.
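The combination of a shared state value with per-branch advantages can be sketched as follows. The caption only mentions "a special aggregation layer", so the mean-subtraction form used here (common in dueling-style networks) is an assumption, and the function names and numbers are hypothetical.

```python
import numpy as np


def aggregate_branch_q_values(state_value, advantages):
    """Combine V(s) with per-branch advantages of shape (N, n).

    Row d holds the advantages of the n sub-actions of action
    dimension d. Each row is centered on its own mean before the
    shared state value is added (assumed aggregation form).
    """
    advantages = np.asarray(advantages, dtype=float)
    centered = advantages - advantages.mean(axis=1, keepdims=True)
    return state_value + centered


def greedy_joint_action(q_values):
    """Pick the best sub-action independently on every branch."""
    return tuple(int(i) for i in np.argmax(q_values, axis=1))


# Hypothetical example: N = 2 action dimensions, n = 3 sub-actions each.
q = aggregate_branch_q_values(1.0, [[0.0, 2.0, 4.0], [3.0, 1.0, 2.0]])
print(q)
print(greedy_joint_action(q))
```

Selecting the maximizing sub-action on each branch separately is what lets the joint-action tuple be formed without enumerating every combination of sub-actions.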

Figure 3: Illustrations of the custom physical reaching tasks. From left: Reacher3DOF, Reacher4DOF, and Reacher5DOF domains with 3, 4, and 5 degrees of freedom, respectively.

Figure 4: Performance in sum of rewards during evaluation on the y-axis and training episodes on the x-axis. The solid lines represent smoothed (window size of 20 episodes) averages over 3 runs with random initialization seeds, while shaded areas show the standard deviations. Evaluations were conducted every 50 episodes of training for 30 episodes with a greedy policy.

Figure 5: Illustrations of the domains from the OpenAI’s MuJoCo Gym that were used in our experiments. From left: Reacher-v1, Hopper-v1, Walker2d-v1, and Humanoid-v1 featuring 2, 3, 6, and 17 degrees of freedom, respectively.

Figure 6: Learning curves for the OpenAI’s MuJoCo Gym manipulation and locomotion benchmark domains. The solid lines represent smoothed (window size of 20 episodes) averages over 6 runs with random initialization seeds, while shaded areas show the standard deviations. Evaluations were conducted every 50 episodes of training for 30 episodes with a greedy policy.

Table 1: Dimensionality of the OpenAI’s MuJoCo Gym benchmark domains: dim(o) denotes the observation dimensions, $N$ is the number of action dimensions, and $n^N$ indicates the number of possible actions in the combinatorial action space, with $n$ denoting the fixed number of discrete sub-actions per action dimension. The rightmost column indicates the total number of network outputs required for the proposed action branching architecture. The values provided are for the most fine-grained discretization case of $n = 33$.

Table 1 states the dimensionality information of the standard benchmark domains from the OpenAI’s MuJoCo Gym collection that were used in our experiments. The values provided are calculated for the specific case of $n = 33$, the finest granularity that we experimented with.
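The gap between the combinatorial action count and the branching output count can be made concrete with a short sketch. It assumes each branch emits one Q-value per sub-action, so the branching architecture needs N times n outputs; that formula is inferred from the architecture description and is not quoted from the table. The degrees of freedom are those listed in the Figure 5 caption, and the function names are hypothetical.

```python
def combinatorial_actions(num_dims: int, bins: int) -> int:
    """Size of the joint discrete action space, n ** N."""
    return bins ** num_dims


def branching_outputs(num_dims: int, bins: int) -> int:
    """Outputs with one branch per dimension (assumed N * n)."""
    return num_dims * bins


# Degrees of freedom as listed in the Figure 5 caption.
domains = {"Reacher-v1": 2, "Hopper-v1": 3, "Walker2d-v1": 6, "Humanoid-v1": 17}
bins = 33  # finest discretization mentioned in the text

for name, dims in domains.items():
    joint = combinatorial_actions(dims, bins)
    branched = branching_outputs(dims, bins)
    print(f"{name}: joint actions = {joint}, branching outputs = {branched}")
```

The joint count grows exponentially with the number of action dimensions, while the branching count grows only linearly, which is the motivation stated for the architecture.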
