

# Train and evaluate AWS DeepRacer models
<a name="create-deepracer-project"></a>

 When your AWS DeepRacer vehicle drives itself along a track, it captures environmental states with the camera mounted on the front and takes actions in response to the observations. Your AWS DeepRacer model is a function that maps the observations and actions to the expected reward. To train your model is to find or learn the function that maximizes the expected reward, so that the optimized model prescribes which actions (speed and steering angle pairs) your vehicle can take to move itself along the track from start to finish. 

In practice, the function is represented by a neural network, and training the network involves finding the optimal network weights given sequences of observed environmental states and the vehicle's corresponding actions. The underlying criteria of optimality are described by the model's reward function, which encourages the vehicle to make legal and productive moves without causing traffic accidents or infractions. A simple reward function could return a reward of 0 if the vehicle is on the track, -1 if it's off the track, and 1 if it reaches the finish line. With this reward function, the vehicle is penalized for going off the track and rewarded for reaching the destination. This can be a good reward function if time or speed is not an issue.

 Suppose that you're interested in having the vehicle drive as fast as it can without going off a straight track. As the vehicle speeds up and slows down, it may steer left or right to avoid obstacles or to stay on the track. Making too big a turn at high speed can easily take the vehicle off the track. Making too small a turn may not avoid a collision with an obstacle or another vehicle. Generally speaking, optimal actions are to make a bigger turn at a lower speed or to steer less along a sharper curve. To encourage this behavior, your reward function should assign a positive score to reward smaller turns at higher speeds and/or a negative score to punish bigger turns at higher speeds. Similarly, the reward function can return a positive reward for speeding up along a straighter course or for slowing down near an obstacle.
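One way to encode this incentive is sketched below, using the `steering_angle` and `speed` input parameters that the reward function receives. The thresholds and multipliers here are illustrative assumptions for a starting point, not tuned values:

```python
def reward_function(params):
    """Illustrative sketch: discourage sharp turns at high speed and
    reward gentle steering at speed. Thresholds are arbitrary starting
    points, not tuned values."""
    steering = abs(params["steering_angle"])  # degrees; ignore direction
    speed = params["speed"]                   # meters per second

    reward = 1.0
    if steering > 20.0 and speed > 2.0:
        # Big turn at high speed: likely to slide off the track.
        reward *= 0.5
    elif steering < 10.0 and speed > 2.0:
        # Nearly straight at speed: the behavior we want on straightaways.
        reward *= 1.5
    return float(reward)
```

You would refine the thresholds iteratively by watching whether training converges and how the vehicle behaves in evaluation.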

The reward function is an important part of your AWS DeepRacer model. You must provide it when training your AWS DeepRacer model. The training involves repeated episodes along the track from start to finish. In each episode, the agent interacts with the track to learn the optimal course of actions by maximizing the expected cumulative reward. At the end, the training produces a reinforcement learning model. After the training, the agent executes autonomous driving by running inference on the model to take an optimal action in any given state. This can be done in either the simulated environment with a virtual agent or a real-world environment with a physical agent, such as an AWS DeepRacer scale vehicle. 

 To train a reinforcement learning model in practice, you must choose a learning algorithm. Currently, the AWS DeepRacer console supports only the proximal policy optimization ([PPO](https://arxiv.org/pdf/1707.06347.pdf)) and soft actor critic (SAC) algorithms. You can then choose a deep-learning framework supporting the chosen algorithm, unless you want to write one from scratch. AWS DeepRacer integrates with SageMaker AI to make some popular deep-learning frameworks, such as [TensorFlow](https://www.tensorflow.org/), readily available in the AWS DeepRacer console. Using a framework simplifies configuring and executing training jobs and lets you focus on creating and enhancing reward functions specific to your problems. 

 Training a reinforcement learning model is an iterative process. First, it's challenging to define a reward function to cover all important behaviors of an agent in an environment at once. Second, hyperparameters are often tuned to ensure satisfactory training performance. Both require experimentation. A prudent approach is to start with a simple reward function and then progressively enhance it. AWS DeepRacer facilitates this iterative process by enabling you to clone a trained model and then use it to jump-start the next round of training. At each iteration you can introduce one or a few more sophisticated treatments to the reward function to handle previously ignored variables or you can systematically adjust hyperparameters until the result converges. 

 As with general practice in machine learning, you must evaluate a trained reinforcement learning model to ascertain its efficacy before deploying it to a physical agent for running inference in a real-world situation. For autonomous driving, the evaluation can be based on how often a vehicle stays on a given track from start to finish or how fast it can finish the course without getting off the track. The AWS DeepRacer simulation lets you run the evaluation and post the performance metrics for comparison with models trained by other AWS DeepRacer users on a [leaderboard](deepracer-racing-series.md). 

**Topics**
+ [Understanding racing types and enabling sensors supported by AWS DeepRacer](deepracer-choose-race-type.md)
+ [Train and evaluate AWS DeepRacer models using the AWS DeepRacer console](deepracer-console-train-evaluate-models.md)
+ [AWS DeepRacer reward function reference](deepracer-reward-function-reference.md)

# Understanding racing types and enabling sensors supported by AWS DeepRacer
<a name="deepracer-choose-race-type"></a>

In AWS DeepRacer League, you can participate in the following types of racing events: 
+ **Time trial**: race against the clock on an unobstructed track and aim to get the fastest lap time possible.
+ **Object avoidance**: race against the clock on a track with stationary obstacles and aim to get the fastest lap time possible.
+ **Head-to-bot racing**: race against one or more other vehicles on the same track and aim to cross the finish line before other vehicles.

AWS DeepRacer community races currently support time trials only. 

You should experiment with different sensors on your AWS DeepRacer vehicle to provide it with sufficient capabilities to observe its surroundings for a given race type. The next section describes the [AWS DeepRacer-supported sensors](#deepracer-how-it-works-autonomous-driving-sensors) that can enable the supported types of autonomous racing events.

**Topics**
+ [Choose sensors for AWS DeepRacer racing types](#deepracer-how-it-works-autonomous-driving-sensors)
+ [Configure agent for training AWS DeepRacer models](#deepracer-configure-agent)
+ [Tailor AWS DeepRacer training for time trials](#deepracer-get-started-training-simple-time-trial)
+ [Tailor AWS DeepRacer training for object avoidance races](#deepracer-get-started-training-object-avoidance)
+ [Tailor AWS DeepRacer training for head-to-bot races](#deepracer-get-started-training-h2h-racing)

## Choose sensors for AWS DeepRacer racing types
<a name="deepracer-how-it-works-autonomous-driving-sensors"></a>

Your AWS DeepRacer vehicle comes with a front-facing monocular camera as the default sensor. You can add another front-facing monocular camera to form front-facing stereo cameras, or supplement either the monocular camera or the stereo cameras with a LiDAR unit. 

The following list summarizes the functional capabilities of AWS DeepRacer-supported sensors, together with brief cost-and-benefit analyses:

**Front-facing camera**  <a name="term-deepracer-sensor-front-facing-monocular-camera"></a>
 A single-lens front-facing camera can capture images of the environment in front of the host vehicle, including track borders and shapes. It's the least expensive sensor and is suitable for handling simpler autonomous driving tasks, such as obstacle-free time trials on well-marked tracks. With proper training, it can avoid stationary obstacles at fixed locations on the track. However, the obstacle location information is built into the trained model and, as a result, the model is likely to be overfitted and may not generalize to other obstacle placements. With stationary objects placed at random locations or other moving vehicles on the track, the model is unlikely to converge.   
In the real world, the AWS DeepRacer vehicle comes with a single-lens front-facing camera as the default sensor. The camera has a 120-degree wide-angle lens and captures RGB images that are then converted to grayscale images of 160 x 120 pixels at 15 frames per second (fps). These sensor properties are preserved in the simulator to maximize the chance that the trained model transfers well from simulation to the real world.

**Front-facing stereo camera**  <a name="term-deepracer-sensor-front-facing-stereo-cameras"></a>
A stereo camera has two or more lenses that capture images with the same resolution and frequency. Images from both lenses are used to determine the depth of observed objects. The depth information from a stereo camera is valuable for the host vehicle to avoid crashing into obstacles or other vehicles ahead, especially in more dynamic environments. However, the added depth information makes training converge more slowly.   
 On the AWS DeepRacer physical vehicle, the double-lens stereo camera is constructed by adding another single-lens camera and mounting one camera on each of the left and right sides of the vehicle. The AWS DeepRacer software synchronizes image captures from both cameras. The captured images are converted into grayscale, stacked, and fed into the neural network for inference. The same mechanism is duplicated in the simulator in order to train the model to generalize well to a real-world environment.

**LiDAR sensor**  <a name="term-deepracer-sensor-rear-mount-lidar"></a>
 A LiDAR sensor uses rotating lasers to send out pulses of light outside the visible spectrum and time how long it takes each pulse to return. The direction of and distance to the objects that a specific pulse hits are recorded as a point in a large 3D map centered around the LiDAR unit.   
For example, LiDAR helps detect blind spots of the host vehicle to avoid collisions while the vehicle changes lanes. By combining LiDAR with mono or stereo cameras, you enable the host vehicle to capture sufficient information to take appropriate actions. However, a LiDAR sensor costs more than cameras, and the neural network must learn how to interpret the LiDAR data, so training takes longer to converge.   
 On the AWS DeepRacer physical vehicle, a LiDAR sensor is mounted on the rear and tilted down by 6 degrees. It rotates at an angular velocity of 10 rotations per second and has a range of 15 cm to 2 m. It can detect objects behind and beside the host vehicle as well as tall objects unobstructed by the vehicle parts in the front. The angle and range are chosen to make the LiDAR unit less susceptible to environmental noise.

 You can configure your AWS DeepRacer vehicle with the following combinations of the supported sensors: 
+ Front-facing single-lens camera only.

  This configuration is good for time trials, as well as obstacle avoidance with objects at fixed locations.
+ Front-facing stereo camera only.

  This configuration is good for obstacle avoidance with objects at fixed or random locations. 
+ Front-facing single-lens camera with LiDAR.

  This configuration is good for obstacle avoidance or head-to-bot racing.
+ Front-facing stereo camera with LiDAR.

  This configuration is good for obstacle avoidance or head-to-bot racing, but probably not the most economical for time trials.

As you add more sensors to take your AWS DeepRacer vehicle from time trials to object avoidance to head-to-bot racing, the vehicle collects more data about the environment to feed into the underlying neural network during training. This makes training more challenging because the model must handle the increased complexity. In the end, your task of learning to train models becomes more demanding. 

To learn progressively, you should start training for time trials first before moving on to object avoidance and then to head-to-bot racing. You'll find more detailed recommendations in the next section.

## Configure agent for training AWS DeepRacer models
<a name="deepracer-configure-agent"></a>

 To train a reinforcement learning model for the AWS DeepRacer vehicle to race in obstacle avoidance or head-to-bot racing, you need to configure the agent with appropriate sensors. For simple time trials, you could use the default agent configured with a single-lens camera. In configuring the agent you can customize the action space and choose a neural network topology so that they work better with the selected sensors to meet the intended driving requirements. In addition, you can change the agent's appearance for visual identification during training.

After you configure it, the agent configuration is recorded as part of the model's metadata for training and evaluation. For evaluation, the agent automatically retrieves the recorded configuration to use the specified sensors, action space, and neural network topology.

This section walks you through the steps to configure an agent in the AWS DeepRacer console. 

**To configure an AWS DeepRacer agent in the AWS DeepRacer console**

1. Sign in to the [AWS DeepRacer console](https://console.aws.amazon.com/deepracer).

1. On the primary navigation pane, choose **Garage**.

1. The first time you use **Garage**, you're presented with the **WELCOME TO THE GARAGE** dialog box. Choose **>** or **<** to browse through the introduction to the various sensors supported for the AWS DeepRacer vehicle, or choose **X** to close the dialog box. You can also find this introductory information on the help panel in **Garage**.

1. On the **Garage** page, choose **Build new vehicle**.

1.  On the **Mod your own vehicle** page, under **Mod specifications**, choose one or more sensors, experimenting to learn which combination best meets your intended racing types. 

   To train your AWS DeepRacer vehicle for time trials, choose **Camera**. For obstacle avoidance or head-to-bot racing, you want to use other sensor types. To choose **Stereo camera**, make sure you have acquired an additional single-lens camera. AWS DeepRacer makes the stereo camera out of two single-lens cameras. You can have either a single-lens camera or a double-lens stereo camera on one vehicle. In either case, you can add a LiDAR sensor to the agent if you want the trained model to be able to detect and avoid blind spots in obstacle avoidance or head-to-bot racing. 

1. On the **Garage** page and under **Neural network topologies**, choose a supported network topology.

   In general, a deeper neural network (with more layers) is more suitable for driving on more complicated tracks with sharp curves and numerous turns, for racing to avoid stationary obstacles, or for competing against other moving vehicles. But a deeper neural network is more costly to train and the model takes longer to converge. On the other hand, a shallower network (with fewer layers) costs less and takes a shorter time to train. The trained model is capable of handling simpler track conditions or driving requirements, such as time trials on an obstacle-free track without competitors. 

   Specifically, AWS DeepRacer supports **3-layer CNN** or **5-layer CNN**. 

1. On the **Garage** page, choose **Next** to proceed to setting up the agent's action space.

1. On the **Action space** page, leave the default settings for your first training. For subsequent trainings, experiment with different settings for the steering angle, top speed, and their granularities. Then, choose **Next**.

1. On the **Color your vehicle to stand out in the crowd** page, enter a name in **Name your DeepRacer** and then choose a color for the agent from the **Vehicle color** list. Then, choose **Submit**. 

1. On the **Garage** page, examine the settings of the new agent. To make further modifications, choose **Mod vehicle** and repeat the previous steps starting at **Step 4**. 

Now, your agent is ready for training. 

## Tailor AWS DeepRacer training for time trials
<a name="deepracer-get-started-training-simple-time-trial"></a>

If this is your first time to use AWS DeepRacer, you should start with a simple time trial to become familiar with how to train AWS DeepRacer models to drive your vehicle. This way, you get a gentler introduction to basic concepts of reward function, agent, environment, etc. Your goal is to train a model to make the vehicle stay on the track and finish a lap as fast as possible. You can then deploy the trained model to your AWS DeepRacer vehicle to test driving on a physical track without any additional sensors.

To train a model for this scenario, you can choose the default agent from **Garage** on the AWS DeepRacer console. The default agent has been configured with a single front-facing camera, a default action space and a default neural network topology. It is helpful to start training an AWS DeepRacer model with the default agent before moving on to more sophisticated ones.

To train your model with the default agent, follow the recommendations below. 

1. Start training your model on a simple track with regular shapes and gentle turns. Use the default reward function, and train the model for 30 minutes. After the training job completes, evaluate your model on the same track to see whether the agent can finish a lap.

1. Read about [the reward function parameters](deepracer-reward-function-input.md). Continue the training with different incentives to reward the agent for going faster. Lengthen the training time for the next model to 1-2 hours. Compare the reward graph between the first training and this second one. Keep experimenting until the reward graph stops improving.

1. Read more about [action space](deepracer-how-it-works-action-space.md). Train the model a third time after increasing the top speed (for example, to 1 m/s). To modify the action space, you must build a new agent in **Garage**. When updating your agent's top speed, be aware that the higher the top speed, the faster the agent can complete the track in evaluation and the faster your AWS DeepRacer vehicle can finish a lap on a physical track. However, a higher top speed often means a longer time for the training to converge, because the agent is more likely to overshoot on a curve and thus go off track. You may want to decrease the granularities to give the agent more room to accelerate or decelerate, and further tweak the reward function in other ways to help training converge faster. After the training converges, evaluate the third model to see if the lap time improves. Keep exploring until there is no more improvement.

1. Choose a more complicated track and repeat **Step 1** to **Step 3**. Evaluate your model on a track different from the one you trained on to see how well the model generalizes to different virtual tracks and [to real-world environments](deepracer-how-it-works-virtual-to-physical.md). 

1. (Optional) Experiment with different values of the [hyperparameters](deepracer-console-train-evaluate-models.md#deepracer-iteratively-adjust-hyperparameters) to improve the training process and repeat **Step 1** to **Step 3**.

1. (Optional) Examine and analyze the AWS DeepRacer logs. For sample code that you can use to analyze the logs, see [https://github.com/aws-samples/aws-deepracer-workshops/tree/master/log-analysis](https://github.com/aws-samples/aws-deepracer-workshops/tree/master/log-analysis).
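As a concrete example of the faster-lap incentive in step 2 above, the reward can be tied to progress made per step. This is a minimal sketch using the standard `progress` (percentage of track completed) and `steps` input parameters; the off-track penalty value is an illustrative assumption:

```python
def reward_function(params):
    """Sketch of a time-trial speed incentive: reward progress per step.

    A faster lap covers more progress in fewer steps, so this ratio grows
    as the agent completes the lap more quickly."""
    if not params["all_wheels_on_track"]:
        return 1e-3  # near-zero reward when off the track

    steps = params["steps"]
    if steps <= 0:
        return 1e-3  # guard against the very first step

    return float(params["progress"] / steps)
```

Because the ratio is small in absolute terms, you may want to scale it (for example, multiply by 100) when combining it with other reward terms.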

## Tailor AWS DeepRacer training for object avoidance races
<a name="deepracer-get-started-training-object-avoidance"></a>


After you become familiar with time trials and have trained a few converged models, move on to the next more demanding challenge—obstacle avoidance. Here, your goal is to train a model that can complete a lap as fast as possible without going off track, while avoiding crashing into the objects placed on the track. This is obviously a harder problem for the agent to learn, and training takes longer to converge. 

The AWS DeepRacer console supports two types of obstacle avoidance training: obstacles can be placed at fixed or random locations along the track. With fixed locations, the obstacles remain fixed to the same place throughout the training job. With random locations, the obstacles change their respective places at random from episode to episode. 

It is easier for training to converge with obstacles at fixed locations because the system has fewer degrees of freedom. However, because the location information is built into the trained model, the model may be overfitted and may not generalize well. With obstacles at random locations, it's harder for training to converge because the agent must keep learning to avoid crashing into obstacles at locations it hasn't seen before. However, models trained with this option tend to generalize better and transfer well to real-world races. To begin, place obstacles at fixed locations, get familiar with the behaviors, and then tackle the random locations. 

In the AWS DeepRacer simulator, the obstacles are cuboid boxes with the same dimensions (9.5" (L) x 15.25" (W) x 10.5" (H)) as the AWS DeepRacer vehicle's package box. This makes it simpler to transfer the trained model from the simulator to the real world if you place the packaging box as an obstacle on the physical track.

To experiment with obstacle avoidance, follow the recommended practice outlined in the steps below:

1. Use the default agent or experiment with new sensors and action spaces by customizing an existing agent or building a new one. You should limit the top speed to below 0.8 m/s and the speed granularity to 1 or 2 levels.

   Start training a model for around 3 hours with 2 objects at fixed locations. Use the example reward function and train the model on the track that you will be racing on, or a track that closely resembles that track. The **AWS DeepRacer Smile Speedway (Intermediate)** track is a simple track, which makes it a good choice for summit race preparation. Evaluate the model on the same track with the same number of obstacles. Watch how the total expected reward converges, if at all. 

1. Read about [the reward function parameters](deepracer-reward-function-input.md). Experiment with variations of your reward function. Increase the obstacle number to 4. Train the agent to see if the training converges in the same amount of training time. If it doesn't, tweak your reward function again, lower the top speed or reduce the number of obstacles, and then train the agent again. Repeat experimenting until there is no more significant improvement. 

1. Now, move on to training to avoid obstacles at random locations. You'll need to configure the agent with additional sensors, which are available from **Garage** in the AWS DeepRacer console. You can use a stereo camera. Or you can combine a LiDAR unit with either a single-lens camera or a stereo camera, but you should expect a longer training time. Set the action space with a relatively low top speed (for example, 2 m/s) so that the training converges more quickly. For the network architecture, use a shallow neural network, which has been found sufficient for obstacle avoidance.

1. Start training for 4 hours the new agent for obstacle avoidance with 4 randomly placed objects on an simple track. Then evaluate your model on the same track to see if it can finish laps with randomly positioned obstacles. If not, you may want to tweak your reward function, try different sensors and have longer training time. As another tip, you can try cloning an existing model to continue training to leverage previously learned experience. 

1. (Optional) Choose a higher top speed for the action space or have more obstacles randomly placed along the track. Experiment with different combinations of sensors, and tweak the reward function and hyperparameter values. Experiment with the **5-layer CNN** network topology. Then retrain the model to determine how these changes affect the convergence of the training. 
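The steps above can be paired with an object-aware reward. This is a minimal sketch, assuming the object-related input parameters (`closest_objects`, `objects_distance`, `objects_left_of_center`, `is_left_of_center`) described in the reward function reference; the 0.8 m threshold and 0.3 penalty factor are illustrative, not tuned:

```python
def reward_function(params):
    """Sketch of an obstacle-avoidance incentive: keep a base reward for
    staying on the track, but shrink it when the agent closes in on the
    next object while in the same lane."""
    if not params["all_wheels_on_track"]:
        return 1e-3

    reward = 1.0
    # closest_objects holds the indices of the objects just behind and
    # just ahead of the agent along the track.
    _, next_object = params["closest_objects"]
    gap = params["objects_distance"][next_object]
    same_lane = (params["objects_left_of_center"][next_object]
                 == params["is_left_of_center"])
    if same_lane and gap < 0.8:
        # Discourage tailgating the object; a lane change is safer.
        reward *= 0.3
    return float(reward)
```

For randomly placed obstacles, an incentive like this matters more, because the agent cannot simply memorize where the objects are.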

## Tailor AWS DeepRacer training for head-to-bot races
<a name="deepracer-get-started-training-h2h-racing"></a>

Having gone through obstacle avoidance training, you're now ready to tackle the next level of challenge: training models for head-to-bot races. Unlike obstacle avoidance events, head-to-bot racing has a dynamic environment with moving vehicles. Your goal is to train models for your AWS DeepRacer vehicle to compete against other moving vehicles so that it reaches the finish line first without going off track or crashing into any of the other vehicles. In the AWS DeepRacer console, you can train a head-to-bot racing model by having your agent compete against 1 to 4 bot vehicles. Generally speaking, you should place more bot vehicles on a longer track.

Each bot vehicle follows a predefined path at a constant speed. You can enable it to change lanes or have it remain in its starting lane. As with training for obstacle avoidance, you can have the bot vehicles evenly distributed across both lanes of the track. The console limits you to up to 4 bot vehicles on the track. Having more competing vehicles on the track gives the learning agent more opportunities to encounter varied situations with the other vehicles. This way, it learns more in one training job and gets trained faster. However, each training is likely to take longer to converge. 

To train an agent with bot vehicles, you should set the top speed of the agent's action space higher than the (constant) speed of the bot vehicles so that the agent has more passing opportunities during training. As a good starting point, you should set the agent's top speed at 0.8 m/s and the bot vehicles' moving speed at 0.4 m/s. If you enable the bots to change lanes, the training becomes more challenging because the agent must learn not only how to avoid crashing into a moving vehicle ahead in the same lane, but also how to avoid crashing into another moving vehicle ahead in the other lane. You can set the bots to change lanes at random intervals. The length of an interval is randomly selected from a range of time (for example, 1s to 5s) that you specify before starting the training job. This lane-changing behavior is more similar to real-world head-to-bot racing behaviors, and the trained agent should generalize better. However, it takes longer for the model to converge in training. 

Follow these suggested steps to iterate your training for head-to-bot racing: 

1. In **Garage** of the AWS DeepRacer console, build a new training agent configured with both stereo cameras and a LiDAR unit. It is possible to train a relatively good model against bot vehicles using only a stereo camera. LiDAR helps reduce blind spots when the agent changes lanes. Do not set the top speed too high. A good starting point is 1 m/s.

1. To train for head-to-bot racing, start with two bot vehicles. Set the bots' moving speed lower than your agent's top speed (for example, 0.5 m/s if the agent's top speed is 1 m/s). Disable the lane-changing option, and then choose the training agent you just created. Use one of the reward function examples or make minimally necessary modifications, and then train for 3 hours. Use the track that you will be racing on, or a track that closely resembles that track. The **AWS DeepRacer Smile Speedway (Intermediate)** track is a simple track, which makes it a good choice for summit race preparation. After the training is complete, evaluate the trained model on the same track. 

1. For more challenging tasks, clone your trained model to train a second head-to-bot racing model. Proceed to experiment with more bot vehicles or enable the lane-changing option. Start with slow lane-changing operations at random intervals longer than 2 seconds. You may also want to experiment with custom reward functions. In general, your custom reward function logic can be similar to that for obstacle avoidance if you don't need to balance passing other vehicles against staying on track. Depending on how good your previous model is, you may need to train for another 3 to 6 hours. Evaluate your models to see how they perform.
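A custom reward for this scenario can combine the obstacle-avoidance pattern with a passing incentive. This is a minimal sketch, assuming the same object-related input parameters (`closest_objects`, `objects_distance`, `objects_left_of_center`, `is_left_of_center`) plus `speed`; the 1.0 m gap, 0.2 penalty, and 0.5 speed weight are illustrative assumptions:

```python
def reward_function(params):
    """Sketch for head-to-bot racing: keep a speed incentive when the lane
    ahead is clear, and back off when a bot is close ahead in the same lane."""
    if not params["all_wheels_on_track"]:
        return 1e-3

    reward = 1.0
    # Index of the bot vehicle just ahead of the agent along the track.
    _, next_bot = params["closest_objects"]
    gap = params["objects_distance"][next_bot]
    same_lane = (params["objects_left_of_center"][next_bot]
                 == params["is_left_of_center"])
    if same_lane and gap < 1.0:
        # Too close behind a bot; encourage a lane change instead.
        reward *= 0.2
    else:
        # The lane ahead is clear: reward keeping speed up to pass bots.
        reward += 0.5 * params["speed"]
    return float(reward)
```

How heavily to weight the speed bonus against the proximity penalty is exactly the passing-versus-safety balance mentioned above, and it usually takes several training iterations to tune.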

# Train and evaluate AWS DeepRacer models using the AWS DeepRacer console
<a name="deepracer-console-train-evaluate-models"></a>

 To train a reinforcement learning model, you can use the AWS DeepRacer console. In the console, create a training job, choose a supported framework and an available algorithm, add a reward function, and configure training settings. You can also watch training proceed in a simulator. You can find the step-by-step instructions in [Train your first AWS DeepRacer model](deepracer-get-started-training-model.md). 

This section explains how to train and evaluate an AWS DeepRacer model. It also shows how to create and improve a reward function, how an action space affects model performance, and how hyperparameters affect training performance. You can also learn how to clone a training model to extend a training session, how to use the simulator to evaluate training performance, and how to address some of the simulation to real-world challenges. 

**Topics**
+ [Create your reward function](#deepracer-train-models-define-reward-function)
+ [Explore action space to train a robust model](#deepracer-define-action-space-for-training)
+ [Systematically tune hyperparameters](#deepracer-iteratively-adjust-hyperparameters)
+ [Examine AWS DeepRacer training job progress](#deepracer-examine-training-progress)
+ [Clone a trained model to start a new training pass](#deepracer-clone-trained-model)
+ [Evaluate AWS DeepRacer models in simulations](#deepracer-evaluate-models-in-simulator)
+ [Optimize training AWS DeepRacer models for real environments](#deepracer-evaluate-model-test-approaches)

## Create your reward function
<a name="deepracer-train-models-define-reward-function"></a>

A [reward function](deepracer-reward-function-reference.md) describes immediate feedback (as a reward or penalty score) when your AWS DeepRacer vehicle moves from one position on the track to a new position. The function's purpose is to encourage the vehicle to make moves along the track to reach a destination quickly without accident or infraction. A desirable move earns a higher score for the action or its target state. An illegal or wasteful move earns a lower score. When training an AWS DeepRacer model, the reward function is the only application-specific part.

In general, you design your reward function to act like an incentive plan. Different incentive strategies can result in different vehicle behaviors. To make the vehicle drive faster, the function should give rewards for the vehicle to follow the track. The function should dispense penalties when the vehicle takes too long to finish a lap or goes off the track. To avoid zig-zag driving patterns, it could reward the vehicle for steering less on straighter portions of the track. The reward function might give positive scores when the vehicle passes certain milestones, as measured by [`waypoints`](deepracer-reward-function-input.md). This could discourage stalling or driving in the wrong direction. It is also likely that you would change the reward function to account for the track conditions. However, the more your reward function takes into account environment-specific information, the more likely your trained model is to be overfitted and less general. To make your model more generally applicable, you can explore [action space](#deepracer-define-action-space-for-training).
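For example, a zig-zag deterrent can be layered onto a base reward by shaving the reward whenever the steering input is aggressive. The 15-degree threshold and 0.8 multiplier below are illustrative starting points, not tuned values:

```python
def reward_function(params):
    """Sketch of a zig-zag deterrent: penalize aggressive steering so the
    agent favors smooth lines on straighter portions of the track."""
    reward = 1.0
    if abs(params["steering_angle"]) > 15.0:
        reward *= 0.8  # steering too hard; damp the reward
    return float(reward)
```

A multiplicative penalty like this composes cleanly with other reward terms, which is why it is a common building block in incentive plans.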

If an incentive plan is not carefully considered, it can lead to [unintended consequences of the opposite effect](https://en.wikipedia.org/wiki/Cobra_effect). This is possible because immediate feedback is a necessary but not sufficient condition for reinforcement learning. An individual immediate reward by itself can't determine whether a move is desirable. At a given position, a move can earn a high reward, while a subsequent move could go off the track and earn a low score. In such a case, the vehicle should avoid the high-scoring move at that position. Only when all future moves from a given position yield a high score on average should the move to the next position be deemed desirable. Future feedback is discounted at a rate that allows only a small number of future moves or positions to be included in the average reward calculation.
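The discounting described above can be made concrete. The cumulative discounted return of a move is G = r₀ + γ·r₁ + γ²·r₂ + …, where γ is the discount rate: with γ close to 1, many future moves count; with a smaller γ, only a few do. The helper below is a hypothetical illustration, not part of the AWS DeepRacer API:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return G = sum(gamma**t * r_t) over a
    sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, `discounted_return([1, 1, -10], gamma=0.99)` is about -7.81: the high early rewards do not redeem the later off-track penalty, so the initial move is deemed undesirable despite its positive immediate reward.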

A good practice to create a [reward function](deepracer-reward-function-reference.md) is to start with a simple one that covers basic scenarios. You can enhance the function to handle more actions. Let's now look at some simple reward functions.

**Topics**
+ [Simple reward function examples](#deepracer-reward-function-simple-examples)
+ [Enhance your reward function](#deepracer-iteratively-enhance-reward-functions)

### Simple reward function examples
<a name="deepracer-reward-function-simple-examples"></a>

We can start building the reward function by first considering the most basic situation: driving on a straight track from start to finish without going off the track. In this scenario, the reward function logic depends only on `all_wheels_on_track` and `progress`. As a trial, you could start with the following logic:

```
def reward_function(params):
    if not params["all_wheels_on_track"]:
        reward = -1
    elif params["progress"] == 100:
        reward = 10
    else:
        reward = 0
    return float(reward)
```

This logic penalizes the agent when it drives itself off the track. It rewards the agent when it drives to the finishing line. It's reasonable for achieving the stated goal. However, the agent roams freely between the starting point and the finishing line, including driving backwards on the track. Not only could the training take a long time to complete, but also the trained model would lead to less efficient driving when deployed to a real-world vehicle. 

In practice, an agent learns more effectively if it can do so bit-by-bit throughout the course of training. This implies that a reward function should give out smaller rewards step by step along the track. For the agent to drive on the straight track, we can improve the reward function as follows: 

```
def reward_function(params):
    if not params["all_wheels_on_track"]:
        reward = -1
    else:
        reward = params["progress"]
    return float(reward)
```

With this function, the agent gets more reward the closer it gets to the finish line. This should reduce or eliminate unproductive trials of driving backwards. In general, we want the reward function to distribute the reward more evenly over the action space. Creating an effective reward function can be a challenging undertaking. You should start with a simple one and progressively enhance or improve it. With systematic experimentation, the function can become more robust and efficient.

### Enhance your reward function
<a name="deepracer-iteratively-enhance-reward-functions"></a>

After you have successfully trained your AWS DeepRacer model for the simple straight track, the AWS DeepRacer vehicle (virtual or physical) can drive itself without going off the track. If you let the vehicle run on a looped track, however, it won't stay on the track, because the reward function ignores the turning actions needed to follow the track.

To make your vehicle handle those actions, you must enhance the reward function. The function must give a reward when the agent makes a permissible turn and produce a penalty if the agent makes an illegal turn. Then, you're ready to start another round of training. To take advantage of the prior training, you can start the new training by cloning the previously trained model, passing along the previously learned knowledge. You can follow this pattern to gradually add more features to the reward function to train your AWS DeepRacer vehicle to drive in increasingly more complex environments.
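A hedged sketch of such an enhancement might extend the earlier progress-based reward with a steering penalty, so the agent is discouraged from making overly sharp turns. The threshold and penalty factor below are hypothetical tuning values, not official AWS settings:

```python
# Sketch: progress reward plus a penalty for sharp steering.
# ABS_STEERING_THRESHOLD and the 0.8 penalty factor are hypothetical
# tuning values to adjust for your track.

def reward_function(params):
    if not params["all_wheels_on_track"]:
        return float(-1)               # penalize going off the track
    reward = params["progress"]        # base reward: percent of track completed
    ABS_STEERING_THRESHOLD = 15.0      # degrees
    if abs(params["steering_angle"]) > ABS_STEERING_THRESHOLD:
        reward *= 0.8                  # penalize sharp turns
    return float(reward)
```

Cloning a model trained on the straight track and retraining it with a function like this lets the agent keep its earlier knowledge while learning to turn.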

For more advanced reward functions, see the following examples:
+ [Example 1: Follow the center line in time trials](deepracer-reward-function-examples.md#deepracer-reward-function-example-0)
+ [Example 2: Stay inside the two borders in time trials](deepracer-reward-function-examples.md#deepracer-reward-function-example-1)
+ [Example 3: Prevent zig-zag in time trials](deepracer-reward-function-examples.md#deepracer-reward-function-example-2)
+ [Example 4: Stay in one lane without crashing into stationary obstacles or moving vehicles](deepracer-reward-function-examples.md#deepracer-reward-function-example-3)

## Explore action space to train a robust model
<a name="deepracer-define-action-space-for-training"></a>

As a general rule, train your model to be as robust as possible so that you can apply it to as many environments as possible. A robust model is one that can be applied to a wide range of track shapes and conditions. Generally speaking, a robust model is not "smart," because its reward function cannot contain explicit environment-specific knowledge. Otherwise, your model is likely to be applicable only to environments similar to the one it was trained in.

Explicitly incorporating environment-specific information into the reward function amounts to feature engineering. Feature engineering helps reduce training time and can be useful in solutions tailor-made for a particular environment. To train a model of general applicability, though, you should refrain from attempting a lot of feature engineering.

For example, when training a model on a circular track, you can't expect to obtain a trained model applicable to any non-circular track if you have such geometric properties explicitly incorporated into the reward function. 

How would you go about training a model as robust as possible while keeping the reward function as simple as possible? One way is to explore the action space spanning the actions your agent can take. Another is to experiment with the [hyperparameters](#deepracer-iteratively-adjust-hyperparameters) of the underlying training algorithm. Often, you do both. Here, we focus on how to explore the action space to train a robust model for your AWS DeepRacer vehicle.

In training an AWS DeepRacer model, an action (`a`) is a combination of speed (`v` in meters per second) and steering angle (`s` in degrees). The action space of the agent defines the ranges of speed and steering angle the agent can take. For a discrete action space of `m` speeds, `(v1, .., vm)`, and `n` steering angles, `(s1, .., sn)`, there are `m*n` possible actions in the action space: 

```
a1:           (v1, s1)
...
an:           (v1, sn)

...
a(i-1)*n+j:   (vi, sj)
...

a(m-1)*n+1:   (vm, s1)
...
am*n:         (vm, sn)
```

The actual values of `(vi, sj)` depend on the ranges of `vmax` and `|smax|` and are not uniformly distributed. 

Each time you begin training or iterating your AWS DeepRacer model, you must first specify `n`, `m`, `vmax`, and `|smax|`, or agree to use their default values. Based on your choices, the AWS DeepRacer service generates the available actions your agent can choose from in training. The generated actions are not uniformly distributed over the action space.
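To make the `m*n` enumeration concrete, the following sketch builds a uniform discrete action grid. Note that this is only illustrative: the AWS DeepRacer service generates its own grid, and as stated above, its values are not uniformly distributed.

```python
# Sketch: enumerate a discrete action space of m speeds x n steering angles.
# A uniform grid is used here purely for illustration; assumes n >= 2.

def build_action_space(m, n, v_max, s_max):
    speeds = [v_max * (i + 1) / m for i in range(m)]               # (0, v_max]
    angles = [-s_max + 2 * s_max * j / (n - 1) for j in range(n)]  # [-s_max, s_max]
    return [(v, s) for v in speeds for s in angles]

actions = build_action_space(m=3, n=5, v_max=3.0, s_max=30.0)
print(len(actions))  # 15 = m * n
```

With `m=3` and `n=5`, the agent can choose among 15 speed/steering pairs, matching the `a1 .. am*n` indexing shown above.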

In general, a larger number of actions and larger action ranges give your agent more room or options to react to more varied track conditions, such as a curved track with irregular turning angles or directions. The more options available to the agent, the more readily it can handle track variations. As a result, you can expect the trained model to be more widely applicable, even when using a simple reward function.

For example, your agent can quickly learn to handle a straight track using a coarse-grained action space with a small number of speeds and steering angles. On a curved track, however, this coarse-grained action space is likely to cause the agent to overshoot and go off the track while it turns, because it doesn't have enough options at its disposal to adjust its speed or steering. If you increase the number of speeds, the number of steering angles, or both, the agent should become more capable of maneuvering the curves while staying on the track. Similarly, if your agent moves in a zig-zag fashion, you can try increasing the number of steering angles to reduce drastic turns at any given step.

When the action space is too large, training performance may suffer, because it takes longer to explore the action space. Be sure to balance the benefits of a model's general applicability against its training performance requirements. This optimization involves systematic experimentation.

## Systematically tune hyperparameters
<a name="deepracer-iteratively-adjust-hyperparameters"></a>

One way to improve your model's performance is to enact a better or more effective training process. For example, to obtain a robust model, training must provide your agent with more or less evenly distributed sampling over the agent's action space. This requires a sufficient mix of exploration and exploitation. Variables affecting this include the amount of training data used (`number of episodes between each training` and `batch size`), how fast the agent can learn (`learning rate`), and the portion of exploration (`entropy`). To make training practical, you may want to speed up the learning process. Variables affecting this include the `learning rate`, `batch size`, `number of epochs`, and `discount factor`.

The variables affecting the training process are known as the hyperparameters of the training. These algorithm attributes are not properties of the underlying model. Unfortunately, hyperparameters are empirical in nature. Their optimal values are not known for all practical purposes and must be derived through systematic experimentation.

Before discussing the hyperparameters that can be adjusted to tune the performance of training your AWS DeepRacer model, let's define the following terminology.

Data point  
A data point, also known as an *experience*, is a tuple of (*s,a,r,s’*), where *s* stands for an observation (or state) captured by the camera, *a* for an action taken by the vehicle, *r* for the reward incurred by that action, and *s’* for the new observation after the action is taken.

Episode  
An episode is a period in which the vehicle starts from a given starting point and ends up completing the track or going off the track. It embodies a sequence of experiences. Different episodes can have different lengths. 

Experience buffer  
An experience buffer consists of a number of ordered data points collected over a fixed number of episodes of varying lengths during training. For AWS DeepRacer, it corresponds to images captured by the camera mounted on your AWS DeepRacer vehicle and the actions taken by the vehicle, and serves as the source from which input is drawn for updating the underlying (policy and value) neural networks.

Batch  
A batch is an ordered list of experiences, representing a portion of simulation over a period of time, used to update the policy network weights. It is a subset of the experience buffer. 

Training data  
Training data is a set of batches sampled at random from an experience buffer and used to train the policy network weights.
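The terminology above can be expressed as a short sketch. The field names and toy values below are illustrative, not AWS DeepRacer internals; the point is the relationship between data points, the experience buffer, and a randomly sampled batch:

```python
# Sketch: data points, an experience buffer, and a random batch.
# Field names and values are illustrative only.
import random
from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# An experience buffer: ordered data points collected over several episodes
experience_buffer = [
    Experience(state=i, action=i % 3, reward=1.0, next_state=i + 1)
    for i in range(200)
]

# A batch: a random subset of the buffer used to update the policy network
BATCH_SIZE = 32
batch = random.sample(experience_buffer, BATCH_SIZE)
print(len(batch))  # 32
```

Random sampling, rather than taking consecutive experiences, reduces the correlations inherent in sequential driving data.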

**Algorithmic hyperparameters and their effects**  

| Hyperparameters | Description | 
| --- | --- | 
|  **Gradient descent batch size**   | The number of recent vehicle experiences sampled at random from an experience buffer and used for updating the underlying deep-learning neural network weights. Random sampling helps reduce correlations inherent in the input data. Use a larger batch size to promote more stable and smooth updates to the neural network weights, but be aware that the training may be longer or slower.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html) | 
|  **Number of epochs**  | The number of passes through the training data to update the neural network weights during gradient descent. The training data corresponds to random samples from the experience buffer. Use a larger number of epochs to promote more stable updates, but expect slower training. When the batch size is small, you can use a smaller number of epochs.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html) | 
|  **Learning rate**  |  During each update, a portion of the new weight can come from the gradient-descent (or ascent) contribution and the rest from the existing weight value. The learning rate controls how much a gradient-descent (or ascent) update contributes to the network weights. Use a higher learning rate to include more gradient-descent contributions for faster training, but be aware that the expected reward may not converge if the learning rate is too large.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html)  | 
| Entropy |  A degree of uncertainty used to determine when to add randomness to the policy distribution. The added uncertainty helps the AWS DeepRacer vehicle explore the action space more broadly. A larger entropy value encourages the vehicle to explore the action space more thoroughly. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html)  | 
| Discount factor |  A factor that specifies how much future rewards contribute to the expected reward. The larger the **Discount factor** value, the farther out the contributions the vehicle considers when making a move, and the slower the training. With a discount factor of 0.9, the vehicle includes rewards from on the order of 10 future steps when making a move. With a discount factor of 0.999, the vehicle considers rewards from on the order of 1,000 future steps. The recommended discount factor values are 0.99, 0.999, and 0.9999. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html)  | 
| Loss type |  The type of objective function used to update the network weights. A good training algorithm should make incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions to increase reward. But if it makes too big a change, the training becomes unstable and the agent ends up not learning. The [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) and [mean squared error loss](https://en.wikipedia.org/wiki/Mean_squared_error) types behave similarly for small updates. But as the updates become larger, **Huber loss** takes smaller increments compared to **Mean squared error loss**. When you have convergence problems, use the **Huber loss** type. When convergence is good and you want to train faster, use the **Mean squared error loss** type. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html)  | 
| Number of experience episodes between each policy-updating iteration | The size of the experience buffer used to draw training data from for learning policy network weights. An experience episode is a period in which the agent starts from a given starting point and ends up completing the track or going off the track. It consists of a sequence of experiences. Different episodes can have different lengths. For simple reinforcement-learning problems, a small experience buffer may be sufficient and learning is fast. For more complex problems that have more local maxima, a larger experience buffer is necessary to provide more uncorrelated data points. In this case, training is slower but more stable. The recommended values are 10, 20 and 40.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html) | 
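The **Loss type** comparison in the table can be made concrete with a minimal sketch of the two losses. The `delta` transition point of 1.0 is an illustrative choice, not an AWS DeepRacer internal value:

```python
# Sketch: Huber loss vs. mean squared error. For small errors the two behave
# similarly; for large errors Huber loss grows linearly while MSE grows
# quadratically, producing much bigger (and potentially unstable) updates.

def mse_loss(error):
    return error ** 2

def huber_loss(error, delta=1.0):
    if abs(error) <= delta:
        return 0.5 * error ** 2          # quadratic near zero
    return delta * (abs(error) - 0.5 * delta)  # linear for large errors

for err in (0.5, 5.0):
    print(err, mse_loss(err), huber_loss(err))
```

At an error of 5.0, MSE yields 25.0 while Huber yields only 4.5, which is why Huber loss can help when training fails to converge.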

## Examine AWS DeepRacer training job progress
<a name="deepracer-examine-training-progress"></a>

After starting your training job, you can examine the training metrics of rewards and track completion per episode to ascertain the training job's performance of your model. On the AWS DeepRacer console, the metrics are displayed in the **Reward graph**, as shown in the following illustration.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/best-model-bar-reward-graph2.png)


You can choose to view the reward gained per episode, the averaged reward per iteration, the progress per episode, the averaged progress per iteration, or any combination of them. To do so, toggle the **Reward (Episode, Average)** or **Progress (Episode, Average)** switches at the bottom of the **Reward graph**. The reward and progress per episode are displayed as scatter plots in different colors. The averaged reward and track completion are displayed as line plots and start after the first iteration.

The range of rewards is shown on the left side of the graph and the range of progress (0-100) on the right side. To read the exact value of a training metric, move the mouse near the data point on the graph.

The graphs are automatically updated every 10 seconds while training is under way. You can choose the refresh button to manually update the metric display.

A training job is good if the averaged reward and track completion show trends toward convergence. In particular, the model has likely converged if the progress per episode continuously reaches 100% and the reward levels out. If not, clone the model and retrain it.
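The per-iteration averages plotted in the **Reward graph** are simply means over the episodes in each iteration. The following sketch shows the idea with hypothetical numbers (the episode rewards and episodes-per-iteration count are made up):

```python
# Sketch: averaging per-episode rewards into per-iteration values, as the
# Reward graph does. All numbers here are hypothetical.

def average_per_iteration(episode_rewards, episodes_per_iteration):
    averages = []
    for start in range(0, len(episode_rewards), episodes_per_iteration):
        chunk = episode_rewards[start:start + episodes_per_iteration]
        averages.append(sum(chunk) / len(chunk))
    return averages

print(average_per_iteration([10, 20, 30, 40, 50, 60], 2))  # [15.0, 35.0, 55.0]
```

A steadily rising sequence of averages, rather than the noisy per-episode scatter, is the signal of convergence to look for.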

## Clone a trained model to start a new training pass
<a name="deepracer-clone-trained-model"></a>

If you clone a previously trained model as the starting point of a new round of training, you can improve training efficiency: the new training builds on the already learned knowledge while you modify the hyperparameters.

In this section, you learn how to clone a trained model using the AWS DeepRacer console.

**To iterate training the reinforcement learning model using the AWS DeepRacer console**

1. Sign in to the AWS DeepRacer console, if you're not already signed in.

1. On the **Models** page, choose a trained model and then choose **Clone** from the **Action** drop-down menu list.

1. For **Model details**, do the following:

   1. Type `RL_model_1` in **Model name** if you don't want a name to be generated for the cloned model.

   1. Optionally, give a description for the to-be-cloned model in **Model description - optional**.

1. For **Environment simulation**, choose another track option.

1. For **Reward function**, choose one of the available reward function examples, then modify it to suit your needs. For example, you could take steering into account. 

1. Expand **Algorithm settings** and try different options. For example, change the **Gradient descent batch size** value from 32 to 64 or increase the **Learning rate** to speed up the training. 

1. Experiment with different choices of the **Stop conditions**.

1. Choose **Start training** to begin a new round of training.

As with training a robust machine learning model in general, it is important that you conduct systematic experimentation to come up with the best solution. 

## Evaluate AWS DeepRacer models in simulations
<a name="deepracer-evaluate-models-in-simulator"></a>

To evaluate a model is to test the performance of a trained model. In AWS DeepRacer, the standard performance metric is the average time of finishing three consecutive laps. Using this metric, for any two models, one model is better than the other if it can make the agent go faster on the same track. 

In general, evaluating a model involves the following tasks:

1. Configure and start an evaluation job. 

1. Observe the evaluation in progress while the job is running. This can be done in the AWS DeepRacer simulator. 

1. Inspect the evaluation summary after the evaluation job is done. You can terminate an evaluation job in progress at any time.
**Note**  
The evaluation time depends on the criteria you select. If your model doesn't meet the evaluation criteria, the evaluation keeps running until it reaches the 20-minute cap.

1. Optionally, submit the evaluation result to an eligible [AWS DeepRacer leaderboard](deepracer-racing-series.md). The ranking on the leaderboard lets you know how well your model performs against other participants.

To test an AWS DeepRacer model with an AWS DeepRacer vehicle driving on a physical track, see [Operate your AWS DeepRacer vehicle](operate-deepracer-vehicle.md).

## Optimize training AWS DeepRacer models for real environments
<a name="deepracer-evaluate-model-test-approaches"></a>

Many factors affect the real-world performance of a trained model, including the choice of the [action space](#deepracer-define-action-space-for-training), [reward function](#deepracer-train-models-define-reward-function), [hyperparameters](#deepracer-iteratively-adjust-hyperparameters) used in the training, and [vehicle calibration](deepracer-calibrate-vehicle.md), as well as [real-world track](deepracer-build-your-track.md) conditions. In addition, the simulation is only an (often crude) approximation of the real world. These factors make it challenging to train a model in simulation, apply it to the real world, and achieve satisfactory performance.

Training a model to give a solid real-world performance often requires numerous iterations of exploring the [ reward function](#deepracer-train-models-define-reward-function), [action spaces](#deepracer-define-action-space-for-training), [hyperparameters](#deepracer-iteratively-adjust-hyperparameters), and [evaluation](#deepracer-evaluate-models-in-simulator) in simulation and [testing](deepracer-drive-your-vehicle.md) in a real environment. The last step involves the so-called *simulation-to-real world* (*sim2real*) transfer and can feel unwieldy. 

To help tackle the *sim2real* challenges, heed the following considerations:
+ Make sure that your vehicle is well calibrated. 

  This is important because the simulated environment is most likely a partial representation of the real environment. In addition, the agent takes an action based on the current track condition, as captured in an image from the camera, at each step. It can't see far enough to plan its route at high speed. To accommodate this, the simulation imposes limits on the speed and steering. To ensure that the trained model works in the real world, the vehicle must be properly calibrated to match this and other simulation settings. For more information about calibrating your vehicle, see [Calibrate your AWS DeepRacer vehicle](deepracer-calibrate-vehicle.md).
+ Test your vehicle with the default model first.

  Your AWS DeepRacer vehicle comes with a pre-trained model loaded into its inference engine. Before testing your own model in the real world, verify that the vehicle performs reasonably well with the default model. If not, check the physical track setup. Testing a model on an incorrectly built physical track is likely to lead to poor performance. In such cases, reconfigure or repair your track before starting or resuming testing.
**Note**  
When running your AWS DeepRacer vehicle, actions are inferred according to the trained policy network without invoking the reward function.
+ Make sure the model works in simulation.

  If your model doesn't work well in the real world, it's possible that either the model or the track is defective. To sort out the root causes, you should first [evaluate the model in simulations](#deepracer-evaluate-models-in-simulator) to check if the simulated agent can finish at least one loop without getting off the track. You can do so by inspecting the convergence of the rewards while observing the agent's trajectory in the simulator. If the reward reaches the maximum when the simulated agent completes a loop without faltering, the model is likely a good one.
+ Do not over train the model.

  Continuing training after the model has consistently completed the track in simulation will cause overfitting in the model. An over-trained model won't perform well in the real world because it can't handle even minor variations between the simulated track and the real environment. 
+ Use multiple models from different iterations.

  A typical training session produces a range of models that fall between being underfitted and being overfitted. Because there are no a priori criteria to determine a model that is just right, you should pick a few model candidates from the time when the agent completes a single loop in the simulator to the point where it performs loops consistently. 
+ Start slow and increase the driving speed gradually in testing.

  When testing the model deployed to your vehicle, start with a small maximum speed value. For example, you can set the testing speed limit to be <10% of the training speed limit. Then gradually increase the testing speed limit until the vehicle starts moving. You set the testing speed limit when calibrating the vehicle using the device control console. If the vehicle goes too fast, for example if the speed exceeds those seen during training in simulator, the model is not likely to perform well on the real track.
+ Test a model with your vehicle in different starting positions.

  The model learns to take a certain path in simulation and can be sensitive to its position within the track. You should start the vehicle tests with different positions within the track boundaries (from left to center to right) to see if the model performs well from certain positions. Most models tend to make the vehicle stay close to either side of one of the white lines. To help analyze the vehicle's path, plot the vehicle's positions (x, y) step by step from the simulation to identify likely paths to be taken by your vehicle in a real environment.
+ Start testing with a straight track.

  A straight track is much easier to navigate compared to a curved track. Starting your test with a straight track is useful to weed out poor models quickly. If a vehicle cannot follow a straight track most of the time, the model will not perform well on curved tracks, either.
+ Watch out for the behavior where the vehicle takes only one type of action.

  When your vehicle manages to take only one type of action, for example, steering to the left only, the model is likely over-fitted or under-fitted. With given model parameters, too many iterations in training could make the model over-fitted, while too few could make it under-fitted.
+ Watch out for the vehicle's ability to correct its path along a track border.

  A good model makes the vehicle correct itself when nearing the track borders. Most well-trained models have this capability. If the vehicle can correct itself along both track borders, the model is considered more robust and of higher quality.
+ Watch out for inconsistent behaviors exhibited by the vehicle.

  A policy model represents a probability distribution for taking an action in a given state. With the trained model loaded into its inference engine, a vehicle picks the most probable action, one step at a time, according to the model's prescription. If the action probabilities are evenly distributed, the vehicle can take any of the actions with equal or closely similar probabilities. This leads to erratic driving behavior. For example, when the vehicle follows a straight path sometimes (for example, half the time) and makes unnecessary turns at other times, the model is either under-fitted or over-fitted.
+ Watch out for only one type of turn (left or right) made by the vehicle.

  If the vehicle takes left turns very well but fails to manage steering right, or, similarly, takes only right turns well but not left, you need to carefully calibrate or recalibrate your vehicle's steering. Alternatively, you can try using a model that was trained with settings close to the physical settings under testing.
+ Watch out for the vehicle making sudden turns and going off-track.

  If the vehicle follows the path correctly most of the way, but suddenly veers off the track, it is likely due to distractions in the environment. Most common distractions include unexpected or unintended light reflections. In such cases, use barriers around the track or other means to reduce glaring lights. 
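The inference behavior described in the list above, where the vehicle picks the most probable action from the policy distribution, can be sketched as follows. The action names and probabilities are made up for illustration:

```python
# Sketch: inference picks the most probable action from the policy's
# probability distribution. When probabilities are nearly equal, the chosen
# action can flip between steps, which shows up as erratic driving.

action_probs = {"left": 0.05, "straight": 0.90, "right": 0.05}  # confident policy
best_action = max(action_probs, key=action_probs.get)
print(best_action)  # straight
```

With a near-uniform distribution such as `{"left": 0.34, "straight": 0.33, "right": 0.33}`, the argmax is barely better than the alternatives, which is the under-fitted or over-fitted symptom described above.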

# AWS DeepRacer reward function reference
<a name="deepracer-reward-function-reference"></a>

 The following is the technical reference of the AWS DeepRacer reward function. 

**Topics**
+ [Input parameters of the AWS DeepRacer reward function](deepracer-reward-function-input.md)
+ [AWS DeepRacer reward function examples](deepracer-reward-function-examples.md)

# Input parameters of the AWS DeepRacer reward function
<a name="deepracer-reward-function-input"></a>

The AWS DeepRacer reward function takes a dictionary object as the input. 

```
def reward_function(params) :
    
    reward = ...

    return float(reward)
```

The `params` dictionary object contains the following key-value pairs:

```
{
    "all_wheels_on_track": Boolean,        # flag to indicate if the agent is on the track
    "x": float,                            # agent's x-coordinate in meters
    "y": float,                            # agent's y-coordinate in meters
    "closest_objects": [int, int],         # zero-based indices of the two closest objects to the agent's current position of (x, y).
    "closest_waypoints": [int, int],       # indices of the two nearest waypoints.
    "distance_from_center": float,         # distance in meters from the track center 
    "is_crashed": Boolean,                 # Boolean flag to indicate whether the agent has crashed.
    "is_left_of_center": Boolean,          # Flag to indicate if the agent is on the left side to the track center or not. 
    "is_offtrack": Boolean,                # Boolean flag to indicate whether the agent has gone off track.
    "is_reversed": Boolean,                # flag to indicate if the agent is driving clockwise (True) or counter clockwise (False).
    "heading": float,                      # agent's yaw in degrees
    "objects_distance": [float, ],         # list of the objects' distances in meters between 0 and track_length in relation to the starting line.
    "objects_heading": [float, ],          # list of the objects' headings in degrees between -180 and 180.
    "objects_left_of_center": [Boolean, ], # list of Boolean flags indicating whether elements' objects are left of the center (True) or not (False).
    "objects_location": [(float, float),], # list of object locations [(x,y), ...].
    "objects_speed": [float, ],            # list of the objects' speeds in meters per second.
    "progress": float,                     # percentage of track completed
    "speed": float,                        # agent's speed in meters per second (m/s)
    "steering_angle": float,               # agent's steering angle in degrees
    "steps": int,                          # number of steps completed
    "track_length": float,                 # track length in meters.
    "track_width": float,                  # width of the track
    "waypoints": [(float, float), ]        # list of (x,y) as milestones along the track center

}
```

A more detailed technical reference of the input parameters is as follows. 

## all\_wheels\_on\_track
<a name="reward-function-input-all_wheels_on_track"></a>

**Type**: `Boolean`

**Range**: `(True:False)`

A `Boolean` flag to indicate whether the agent is on-track or off-track. It's off-track (`False`) if any of its wheels are outside of the track borders. It's on-track (`True`) if all of the wheels are inside the two track borders. The following illustration shows that the agent is on-track. 

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-all_wheels_on_track-true.png)


The following illustration shows that the agent is off-track.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-all_wheels_on_track-false.png)


**Example: ** *A reward function using the `all_wheels_on_track` parameter*

```
def reward_function(params):
    #############################################################################
    '''
    Example of using all_wheels_on_track and speed
    '''

    # Read input variables
    all_wheels_on_track = params['all_wheels_on_track']
    speed = params['speed']

    # Set the speed threshold based on your action space
    SPEED_THRESHOLD = 1.0

    if not all_wheels_on_track:
        # Penalize if the car goes off track
        reward = 1e-3
    elif speed < SPEED_THRESHOLD:
        # Penalize if the car goes too slow
        reward = 0.5
    else:
        # High reward if the car stays on track and goes fast
        reward = 1.0

    return float(reward)
```

## closest_waypoints
<a name="reward-function-input-closest_waypoints"></a>

**Type**: `[int, int]`

**Range**: `[(0:Max-1),(1:Max-1)]`

The zero-based indices of the two neighboring `waypoint`s closest to the agent's current position of `(x, y)`. The distance is measured by the Euclidean distance from the center of the agent. The first element refers to the closest waypoint behind the agent and the second element refers to the closest waypoint in front of the agent. `Max` is the length of the waypoints list. In the illustration shown in [waypoints](#reward-function-input-waypoints), the `closest_waypoints` would be `[16, 17]`. 

**Example**: A reward function using the `closest_waypoints` parameter.

The following example reward function demonstrates how to use `waypoints` and `closest_waypoints` as well as `heading` to calculate immediate rewards.

AWS DeepRacer supports the following libraries: math, random, NumPy, SciPy, and Shapely. To use one, add an import statement (for example, `import math`) above your function definition, `def function_name(parameters)`.

```
# Place import statement outside of function (supported libraries: math, random, numpy, scipy, and shapely)
# Example imports of available libraries
#
# import math
# import random
# import numpy
# import scipy
# import shapely

import math

def reward_function(params):
    ###############################################################################
    '''
    Example of using waypoints and heading to make the car point in the right direction
    '''

    # Read input variables
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Initialize the reward with typical value
    reward = 1.0

    # Calculate the direction of the center line based on the closest waypoints
    next_point = waypoints[closest_waypoints[1]]
    prev_point = waypoints[closest_waypoints[0]]

    # Calculate the direction in radians, arctan2(dy, dx), the result is (-pi, pi) in radians
    track_direction = math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0])
    # Convert to degree
    track_direction = math.degrees(track_direction)

    # Calculate the difference between the track direction and the heading direction of the car
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize the reward if the difference is too large
    DIRECTION_THRESHOLD = 10.0
    if direction_diff > DIRECTION_THRESHOLD:
        reward *= 0.5

    return float(reward)
```

## closest_objects
<a name="reward-function-input-closest_objects"></a>

**Type**: `[int, int]`

**Range**: `[(0:len(objects_location)-1), (0:len(objects_location)-1)]`

 The zero-based indices of the two closest objects to the agent's current position of (x, y). The first index refers to the closest object behind the agent, and the second index refers to the closest object in front of the agent. If there is only one object, both indices are 0. 
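The `closest_objects` indices can be combined with the agent's `(x, y)` position and `objects_location` to keep a safe distance from the object ahead. The following sketch is illustrative only; the 0.5-meter safety margin is an assumed value, not a tuned one.

```
import math

def reward_function(params):
    '''
    Sketch: penalize getting close to the object ahead, using closest_objects.
    The 0.5 m safety margin is an illustrative assumption.
    '''
    objects_location = params['objects_location']
    agent_x = params['x']
    agent_y = params['y']
    # The second element is the index of the closest object in front of the agent
    _, next_object_index = params['closest_objects']

    reward = 1.0

    # Euclidean distance from the agent center to the next object
    next_object_loc = objects_location[next_object_index]
    distance = math.sqrt((agent_x - next_object_loc[0])**2 +
                         (agent_y - next_object_loc[1])**2)

    if distance < 0.5:
        reward = 1e-3  # Too close; likely to crash

    return float(reward)
```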

## distance_from_center
<a name="reward-function-input-distance_from_center"></a>

**Type**: `float`

**Range**: `0:~track_width/2`

Displacement, in meters, between the agent center and the track center. The observable maximum displacement occurs when any of the agent's wheels are outside a track border and, depending on the width of the track border, can be slightly smaller or larger than half the `track_width`.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-distance_from_center.png)


**Example:** *A reward function using the `distance_from_center` parameter*

```
def reward_function(params):
    #################################################################################
    '''
    Example of using distance from the center
    '''

    # Read input variable
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Penalize if the car is too far away from the center
    marker_1 = 0.1 * track_width
    marker_2 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    else:
        reward = 1e-3  # likely crashed/ close to off track

    return float(reward)
```

## heading
<a name="reward-function-input-heading"></a>

**Type**: `float`

**Range**: `-180:+180`

Heading direction, in degrees, of the agent with respect to the x-axis of the coordinate system.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-heading.png)


**Example:** *A reward function using the `heading` parameter*

For more information, see [`closest_waypoints`](#reward-function-input-closest_waypoints).

## is_crashed
<a name="reward-function-input-crashed"></a>

**Type**: `Boolean`

**Range**: `(True:False)`

A Boolean flag to indicate whether the agent has crashed into another object (`True`) or not (`False`) as a termination status. 
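A minimal sketch that uses `is_crashed` to give a near-zero reward on a crash termination; the reward values are illustrative.

```
def reward_function(params):
    '''
    Sketch: near-zero reward when the episode ends in a crash
    '''
    is_crashed = params['is_crashed']

    # 1e-3 on crash, 1.0 otherwise (illustrative values)
    reward = 1e-3 if is_crashed else 1.0

    return float(reward)
```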

## is_left_of_center
<a name="reward-function-input-is_left_of_center"></a>

**Type**: `Boolean`

**Range**: `(True:False)`

A `Boolean` flag to indicate whether the agent is on the left side of the track center (`True`) or on the right side (`False`). 
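A minimal sketch using `is_left_of_center`. The preference for the left lane is purely an illustrative assumption; choose the lane that suits your racing line.

```
def reward_function(params):
    '''
    Sketch: reward the agent for keeping to one side of the track.
    The left-lane preference is an illustrative assumption.
    '''
    is_left_of_center = params['is_left_of_center']

    # Prefer the left lane (illustrative choice only)
    reward = 1.0 if is_left_of_center else 0.5

    return float(reward)
```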

## is_offtrack
<a name="reward-function-input-offtrack"></a>

**Type**: `Boolean`

**Range**: `(True:False)`

A `Boolean` flag to indicate whether the agent has gone off the track (`True`) or not (`False`) as a termination status. 
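A minimal sketch combining `is_offtrack` with [progress](#reward-function-input-progress); scaling the reward by progress is an illustrative choice.

```
def reward_function(params):
    '''
    Sketch: near-zero reward on an off-track termination, otherwise
    a reward proportional to progress (illustrative scaling)
    '''
    if params['is_offtrack']:
        return float(1e-3)

    # Scale progress (0-100) into a 0-1 reward
    return float(params['progress'] / 100.0)
```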

## is_reversed
<a name="reward-function-input-is_reversed"></a>

**Type**: `Boolean`

**Range**: `(True:False)`

A `Boolean` flag to indicate whether the agent is driving clockwise (`True`) or counterclockwise (`False`). 

It's used when you enable direction change for each episode. 
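A sketch that reads `is_reversed` to adapt a steering heuristic to the lap direction. The heuristic and the 0.9 penalty factor are illustrative assumptions only.

```
def reward_function(params):
    '''
    Sketch: direction-aware steering heuristic for direction-change training.
    The expected-sign heuristic and 0.9 factor are illustrative assumptions.
    '''
    is_reversed = params['is_reversed']          # True: clockwise lap
    steering_angle = params['steering_angle']

    reward = 1.0

    # Clockwise laps are dominated by right turns (negative steering),
    # counterclockwise laps by left turns (positive steering)
    expected_sign = -1.0 if is_reversed else 1.0
    if steering_angle * expected_sign < 0:
        reward *= 0.9

    return float(reward)
```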

## objects_distance
<a name="reward-function-input-objects_distance"></a>

**Type**: `[float, … ]`

**Range**: `[(0:track_length), … ]`

A list of the distances between objects in the environment in relation to the starting line. The ith element measures the distance in meters between the ith object and the starting line along the track center line. 

**Note**  
`abs((var1) - (var2))` = how close the car is to an object, when `var1 = params["objects_distance"][index]` and `var2 = (params["progress"] / 100) * params["track_length"]`.  
To get the index of the closest object in front of the vehicle and the closest object behind the vehicle, use the `closest_objects` parameter.
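The distance formula in the note above can be sketched as a reward function. The 1.0-meter safety margin is an illustrative assumption.

```
def reward_function(params):
    '''
    Sketch: gap between the agent and the next object, measured along
    the track center line. The 1.0 m margin is an illustrative assumption.
    '''
    objects_distance = params['objects_distance']
    _, next_object_index = params['closest_objects']

    # progress is a percentage, so scale it to meters with track_length
    agent_distance = (params['progress'] / 100.0) * params['track_length']

    # How close the car is to the next object along the center line
    gap = abs(objects_distance[next_object_index] - agent_distance)

    reward = 1.0 if gap > 1.0 else 1e-3
    return float(reward)
```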

## objects_heading
<a name="reward-function-input-objects_heading"></a>

**Type**: `[float, … ]`

**Range**: `[(-180:180), … ]`

List of the headings of objects in degrees. The ith element measures the heading of the ith object. For stationary objects, their headings are 0. For a bot vehicle, the corresponding element's value is the vehicle's heading angle.
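A sketch comparing the agent's heading with the next object's heading; the 30-degree caution threshold is an illustrative assumption.

```
def reward_function(params):
    '''
    Sketch: be cautious when the object ahead points across the agent's
    path. The 30-degree threshold is an illustrative assumption.
    '''
    heading = params['heading']
    objects_heading = params['objects_heading']
    _, next_object_index = params['closest_objects']

    reward = 1.0

    # Smallest angle between the agent's heading and the object's heading
    heading_diff = abs(objects_heading[next_object_index] - heading)
    if heading_diff > 180:
        heading_diff = 360 - heading_diff

    if heading_diff > 30.0:
        reward *= 0.5

    return float(reward)
```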

## objects_left_of_center
<a name="reward-function-input-objects_left_of_center"></a>

**Type**: `[Boolean, … ]`

**Range**: `[True|False, … ]`

List of Boolean flags. The ith element value indicates whether the ith object is to the left (True) or right (False) side of the track center. 
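A sketch that pairs `objects_left_of_center` with `is_left_of_center` and `closest_objects` to favor the lane opposite the next object; the reward values are illustrative.

```
def reward_function(params):
    '''
    Sketch: encourage moving to the lane opposite the next object.
    Reward values are illustrative assumptions.
    '''
    objects_left_of_center = params['objects_left_of_center']
    is_left_of_center = params['is_left_of_center']
    _, next_object_index = params['closest_objects']

    # Higher reward when the agent is in the opposite lane to the next object
    same_lane = objects_left_of_center[next_object_index] == is_left_of_center
    reward = 0.5 if same_lane else 1.0

    return float(reward)
```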

## objects_location
<a name="reward-function-input-objects_location"></a>

**Type**: `[(x,y), … ]`

**Range**: `[(0:N,0:N), … ]`

List of all object locations, each location is a tuple of ([x, y](#reward-function-input-x_y)). 

The size of the list equals the number of objects on the track. An object can be a stationary obstacle or a moving bot vehicle. 
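A sketch that scans `objects_location` for any object within a safety radius of the agent; the 0.3-meter radius is an illustrative assumption.

```
import math

def reward_function(params):
    '''
    Sketch: near-zero reward if any object is within a safety radius.
    The 0.3 m radius is an illustrative assumption.
    '''
    agent_x = params['x']
    agent_y = params['y']

    reward = 1.0
    for obj_x, obj_y in params['objects_location']:
        # Euclidean distance from the agent center to this object
        if math.hypot(agent_x - obj_x, agent_y - obj_y) < 0.3:
            reward = 1e-3  # Likely crashed
            break

    return float(reward)
```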

## objects_speed
<a name="reward-function-input-objects_speed"></a>

**Type**: `[float, … ]`

**Range**: `[(0:12.0), … ]`

List of speeds (meters per second) for the objects on the track. For stationary objects, their speeds are 0. For a bot vehicle, the value is the speed you set in training. 
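A sketch that scales the required gap behind the next object with that object's speed; the one-second headway and 0.5-meter floor are illustrative assumptions.

```
def reward_function(params):
    '''
    Sketch: leave a larger gap behind faster bot vehicles.
    The headway and floor values are illustrative assumptions.
    '''
    objects_speed = params['objects_speed']
    objects_distance = params['objects_distance']
    _, next_object_index = params['closest_objects']

    # Agent's distance from the starting line along the center line
    agent_distance = (params['progress'] / 100.0) * params['track_length']
    gap = abs(objects_distance[next_object_index] - agent_distance)

    # Roughly one second of headway at the object's speed,
    # with a 0.5 m floor for stationary obstacles
    required_gap = max(0.5, objects_speed[next_object_index] * 1.0)

    reward = 1.0 if gap > required_gap else 1e-3
    return float(reward)
```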

## progress
<a name="reward-function-input-progress"></a>

**Type**: `float`

**Range**: `0:100`

Percentage of track completed.

**Example:** *A reward function using the `progress` parameter*

For more information, see [steps](#reward-function-input-steps).

## speed
<a name="reward-function-input-speed"></a>

**Type**: `float`

**Range**: `0.0:5.0`

The observed speed of the agent, in meters per second (m/s).

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-speed.png)


**Example:** *A reward function using the `speed` parameter*

For more information, see [all_wheels_on_track](#reward-function-input-all_wheels_on_track).

## steering_angle
<a name="reward-function-input-steering_angle"></a>

**Type**: `float`

**Range**: `-30:30`

Steering angle, in degrees, of the front wheels from the center line of the agent. The negative (-) sign means steering to the right and the positive (+) sign means steering to the left. The agent center line is not necessarily parallel with the track center line, as shown in the following illustration.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-steering.png)


**Example:** *A reward function using the `steering_angle` parameter*

```
def reward_function(params):
    '''
    Example of using steering angle
    '''

    # Read input variable
    abs_steering = abs(params['steering_angle']) # We don't care whether it is left or right steering

    # Initialize the reward with typical value
    reward = 1.0

    # Penalize if the car steers too much, to prevent zigzagging
    ABS_STEERING_THRESHOLD = 20.0
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
```

## steps
<a name="reward-function-input-steps"></a>

**Type**: `int`

**Range**: `0:Nstep`

Number of steps completed. A step corresponds to an action taken by the agent following the current policy.

**Example:** *A reward function using the `steps` parameter*

```
def reward_function(params):
    #############################################################################
    '''
    Example of using steps and progress
    '''

    # Read input variable
    steps = params['steps']
    progress = params['progress']

    # Total number of steps we expect the car to take to finish the lap; it varies with the track length
    TOTAL_NUM_STEPS = 300

    # Initialize the reward with typical value
    reward = 1.0

    # Give an additional reward every 100 steps if the car is ahead of the expected pace
    if (steps % 100) == 0 and progress > (steps / TOTAL_NUM_STEPS) * 100 :
        reward += 10.0

    return float(reward)
```

## track_length
<a name="reward-function-input-track_len"></a>

**Type**: `float`

**Range**: `[0:Lmax]`

The track length in meters. `Lmax` is track-dependent.

## track_width
<a name="reward-function-input-track_width"></a>

**Type**: `float`

**Range**: `0:Dtrack`

Track width in meters.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-track_width.png)


**Example:** *A reward function using the `track_width` parameter*

```
def reward_function(params):
    #############################################################################
    '''
    Example of using track width
    '''

    # Read input variable
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate the distance from each border
    distance_from_border = 0.5 * track_width - distance_from_center

    # Reward higher if the car stays inside the track borders
    if distance_from_border >= 0.05:
        reward = 1.0
    else:
        reward = 1e-3 # Low reward if too close to the border or goes off the track

    return float(reward)
```

## x, y
<a name="reward-function-input-x_y"></a>

**Type**: `float`

**Range**: `0:N`

Location, in meters, of the agent center along the x and y axes of the simulated environment containing the track. The origin is at the lower-left corner of the simulated environment.
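A sketch that uses the agent's `(x, y)` location to measure the distance to the nearest waypoint on the center line; the 0.2-meter threshold is an illustrative assumption.

```
import math

def reward_function(params):
    '''
    Sketch: reward staying near the center line, measured as the distance
    from (x, y) to the nearest waypoint. The 0.2 m threshold is illustrative.
    '''
    x = params['x']
    y = params['y']
    waypoints = params['waypoints']

    # Distance from the agent center to the closest waypoint
    nearest = min(math.hypot(x - wx, y - wy) for wx, wy in waypoints)

    reward = 1.0 if nearest < 0.2 else 0.5
    return float(reward)
```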

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-x-y.png)


## waypoints
<a name="reward-function-input-waypoints"></a>

**Type**: `list` of `[float, float]`

**Range**: `[[xw,0, yw,0] … [xw,Max-1, yw,Max-1]]`

An ordered list of track-dependent `Max` milestones along the track center. Each milestone is described by a coordinate of (xw,i, yw,i). For a looped track, the first and last waypoints are the same. For a straight or other non-looped track, the first and last waypoints are different.

![\[\]](http://docs.aws.amazon.com/deepracer/latest/developerguide/images/deepracer-reward-function-input-waypoints.png)


**Example** *A reward function using the `waypoints` parameter*

For more information, see [`closest_waypoints`](#reward-function-input-closest_waypoints).

# AWS DeepRacer reward function examples
<a name="deepracer-reward-function-examples"></a>

The following are some examples of AWS DeepRacer reward functions.

**Topics**
+ [Example 1: Follow the center line in time trials](#deepracer-reward-function-example-0)
+ [Example 2: Stay inside the two borders in time trials](#deepracer-reward-function-example-1)
+ [Example 3: Prevent zig-zag in time trials](#deepracer-reward-function-example-2)
+ [Example 4: Stay in one lane without crashing into stationary obstacles or moving vehicles](#deepracer-reward-function-example-3)

## Example 1: Follow the center line in time trials
<a name="deepracer-reward-function-example-0"></a>

 This example determines how far away the agent is from the center line, and gives higher reward if it is closer to the center of the track, encouraging the agent to closely follow the center line. 

```
def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''
    
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are increasingly further away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed/ close to off track

    return float(reward)
```

## Example 2: Stay inside the two borders in time trials
<a name="deepracer-reward-function-example-1"></a>

 This example simply gives high rewards if the agent stays inside the borders, and lets the agent figure out the best path to finish a lap. It's easy to program and understand, but likely takes longer to converge. 

```
def reward_function(params):
    '''
    Example of rewarding the agent to stay inside the two borders of the track
    '''
    
    # Read input parameters
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    
    # Give a very low reward by default
    reward = 1e-3

    # Give a high reward if no wheels go off the track and 
    # the car is somewhere in between the track borders 
    if all_wheels_on_track and (0.5*track_width - distance_from_center) >= 0.05:
        reward = 1.0

    # Always return a float value
    return float(reward)
```

## Example 3: Prevent zig-zag in time trials
<a name="deepracer-reward-function-example-2"></a>

 This example incentivizes the agent to follow the center line but penalizes with lower reward if it steers too much, which helps prevent zig-zag behavior. The agent learns to drive smoothly in the simulator and likely keeps the same behavior when deployed to the physical vehicle. 

```
def reward_function(params):
    '''
    Example of penalize steering, which helps mitigate zig-zag behaviors
    '''
    
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    abs_steering = abs(params['steering_angle']) # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed/ close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15 

    # Penalize reward if the car is steering too much
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
```

## Example 4: Stay in one lane without crashing into stationary obstacles or moving vehicles
<a name="deepracer-reward-function-example-3"></a>

This reward function rewards the agent for staying inside the track borders and penalizes it for getting too close to objects in front of it. The agent can move from lane to lane to avoid crashes. The total reward is a weighted sum of the reward and penalty. The example gives more weight to the penalty in an effort to avoid crashes. Experiment with different weights to train for different behavior outcomes.

```
import math
def reward_function(params):
    '''
    Example of rewarding the agent to stay inside two borders
    and penalizing getting too close to the objects in front
    '''
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    objects_location = params['objects_location']
    agent_x = params['x']
    agent_y = params['y']
    _, next_object_index = params['closest_objects']
    objects_left_of_center = params['objects_left_of_center']
    is_left_of_center = params['is_left_of_center']
    # Initialize reward with a small number but not zero
    # because zero means off-track or crashed
    reward = 1e-3
    # Reward if the agent stays inside the two borders of the track
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward_lane = 1.0
    else:
        reward_lane = 1e-3
    # Penalize if the agent is too close to the next object
    reward_avoid = 1.0
    # Distance to the next object
    next_object_loc = objects_location[next_object_index]
    distance_closest_object = math.sqrt((agent_x - next_object_loc[0])**2 + (agent_y - next_object_loc[1])**2)
    # Decide if the agent and the next object is on the same lane
    is_same_lane = objects_left_of_center[next_object_index] == is_left_of_center
    if is_same_lane:
        if 0.5 <= distance_closest_object < 0.8:
            reward_avoid *= 0.5
        elif 0.3 <= distance_closest_object < 0.5:
            reward_avoid *= 0.2
        elif distance_closest_object < 0.3:
            reward_avoid = 1e-3  # Likely crashed
    # Calculate reward by putting different weights on
    # the two aspects above
    reward += 1.0 * reward_lane + 4.0 * reward_avoid
    return float(reward)
```