
pursue the enemy to the end. The two non-reacting navigation bots were included to give the RL bots practice targets for learning combat; these bots navigated the environment at all times. Two different enemy types were also included to compel the RL bot to learn a generalized combat strategy.

Three different RL setups were used in the experiments: HierarchicalRL, RuleBasedRL and RL. The HierarchicalRL setup uses hierarchical RL to combine a combat controller and a navigation controller that were trained in previous work using RL [1]. The HierarchicalRL controller learns when to use the combat or navigation controller by receiving the rewards from Table II. Collision and death penalties were not included, as preliminary runs showed better performance without them. Including the penalties would have introduced conflicting objectives: maximizing kills and exploration while simultaneously minimizing deaths and collisions. In practice, maximizing kills and exploration inadvertently minimized deaths and collisions, so the penalties were unnecessary. A range of parameters was explored, with varying results; in general, medium values for the discount factor and eligibility traces performed best.
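To make the hierarchical arrangement concrete, the sketch below shows one way such a meta-controller could be structured: a tabular Sarsa(λ) learner (with medium discount-factor and trace-decay values, in line with the parameter findings above) that only decides which pre-trained sub-controller, combat or navigation, to run at each step. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and function names, the state abstraction, and the reward values are placeholders rather than the Table II rewards, and the pre-trained controllers from [1] are stubbed out.

```python
import random
from collections import defaultdict

# Hypothetical option indices for the two pre-trained sub-controllers.
COMBAT, NAVIGATE = 0, 1


class HierarchicalController:
    """Tabular Sarsa(lambda) over the choice of sub-controller.

    The combat and navigation policies themselves are assumed to be
    pre-trained (as in [1]) and are passed in as callables; only the
    decision of which one to run is learned here.
    """

    def __init__(self, sub_policies, alpha=0.1, gamma=0.6, lam=0.5, epsilon=0.1):
        self.sub_policies = sub_policies          # {COMBAT: fn, NAVIGATE: fn}
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.q = defaultdict(float)               # Q[(state, option)]
        self.e = defaultdict(float)               # eligibility traces

    def choose_option(self, state):
        # Epsilon-greedy selection between the combat and navigation controllers.
        if random.random() < self.epsilon:
            return random.choice([COMBAT, NAVIGATE])
        return max([COMBAT, NAVIGATE], key=lambda o: self.q[(state, o)])

    def act(self, state, option):
        # Delegate the low-level action to the chosen pre-trained controller.
        return self.sub_policies[option](state)

    def update(self, state, option, reward, next_state, next_option):
        # Sarsa(lambda) update with replacing traces over (state, option) pairs.
        delta = (reward
                 + self.gamma * self.q[(next_state, next_option)]
                 - self.q[(state, option)])
        self.e[(state, option)] = 1.0
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] *= self.gamma * self.lam


if __name__ == "__main__":
    # Dummy sub-controllers standing in for the pre-trained policies from [1].
    policies = {COMBAT: lambda s: "fire", NAVIGATE: lambda s: "move_forward"}
    agent = HierarchicalController(policies)

    state = "enemy_visible"
    option = agent.choose_option(state)
    for _ in range(10):
        action = agent.act(state, option)
        # Placeholder transition and reward; the actual rewards come from Table II.
        next_state = random.choice(["enemy_visible", "no_enemy"])
        reward = 1.0 if action == "fire" and next_state == "no_enemy" else 0.0
        next_option = agent.choose_option(next_state)
        agent.update(state, option, reward, next_state, next_option)
        state, option = next_state, next_option
```

Keeping the option-level learner tabular reflects the idea that the high-level decision space (which controller to run) is far smaller than the low-level action space already handled by the pre-trained combat and navigation controllers.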
