
pursue the enemy to the end. The two non-reacting navigation bots were included to give the RL bots practice targets for learning combat; these bots navigated the environment at all times. Two different enemy types were also included to compel the RL bot to learn a generalized combat strategy.

Three different RL setups were used in the experiments: HierarchicalRL, RuleBasedRL and RL. The HierarchicalRL setup uses hierarchical RL to combine a combat controller and a navigation controller that were trained in previous work using RL [1]. The HierarchicalRL controller learns when to use the combat or navigation controller by receiving the rewards from Table II. Collision and death penalties were not included, as preliminary runs showed better performance without them. Including the penalties would have introduced conflicting objectives: maximizing kills and exploration while simultaneously minimizing deaths and collisions. In practice, maximizing kills and exploration inadvertently minimized deaths and collisions, so the penalties were unnecessary. A range of parameters was explored, with varying results; in general, medium values for the discount factor and eligibility traces performed best.
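To make the hierarchical arrangement concrete, the sketch below shows one way such a meta-controller could be structured: a tabular Sarsa(λ) learner (with medium discount-factor and trace-decay values, in line with the parameter findings above) that only decides which pre-trained sub-controller, combat or navigation, to run at each step. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and function names, the state abstraction, and the reward values are placeholders rather than the Table II rewards, and the pre-trained controllers from [1] are stubbed out.

```python
import random
from collections import defaultdict

# Hypothetical option indices for the two pre-trained sub-controllers.
COMBAT, NAVIGATE = 0, 1


class HierarchicalController:
    """Tabular Sarsa(lambda) over the choice of sub-controller.

    The combat and navigation policies themselves are assumed to be
    pre-trained (as in [1]) and are passed in as callables; only the
    decision of which one to run is learned here.
    """

    def __init__(self, sub_policies, alpha=0.1, gamma=0.6, lam=0.5, epsilon=0.1):
        self.sub_policies = sub_policies          # {COMBAT: fn, NAVIGATE: fn}
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.q = defaultdict(float)               # Q[(state, option)]
        self.e = defaultdict(float)               # eligibility traces

    def choose_option(self, state):
        # Epsilon-greedy selection between the combat and navigation controllers.
        if random.random() < self.epsilon:
            return random.choice([COMBAT, NAVIGATE])
        return max([COMBAT, NAVIGATE], key=lambda o: self.q[(state, o)])

    def act(self, state, option):
        # Delegate the low-level action to the chosen pre-trained controller.
        return self.sub_policies[option](state)

    def update(self, state, option, reward, next_state, next_option):
        # Sarsa(lambda) update with replacing traces over (state, option) pairs.
        delta = (reward
                 + self.gamma * self.q[(next_state, next_option)]
                 - self.q[(state, option)])
        self.e[(state, option)] = 1.0
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] *= self.gamma * self.lam


if __name__ == "__main__":
    # Dummy sub-controllers standing in for the pre-trained policies from [1].
    policies = {COMBAT: lambda s: "fire", NAVIGATE: lambda s: "move_forward"}
    agent = HierarchicalController(policies)

    state = "enemy_visible"
    option = agent.choose_option(state)
    for _ in range(10):
        action = agent.act(state, option)
        # Placeholder transition and reward; the actual rewards come from Table II.
        next_state = random.choice(["enemy_visible", "no_enemy"])
        reward = 1.0 if action == "fire" and next_state == "no_enemy" else 0.0
        next_option = agent.choose_option(next_state)
        agent.update(state, option, reward, next_state, next_option)
        state, option = next_state, next_option
```

Keeping the option-level learner tabular reflects the idea that the high-level decision space (which controller to run) is far smaller than the low-level action space already handled by the pre-trained combat and navigation controllers.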
