Defining an agent starts by creating a new class that inherits from navsim.agents.abstract_agent.AbstractAgent.
Let’s dig deeper into this class. It has to implement the following methods:
- __init__(): The constructor of the agent.
- name(): Has to return the name of the agent. The name is used to define the filename of the evaluation csv. You can set this to an arbitrary value.
- initialize(): Called before the agent is inferred for the first time. If multiple workers are used, every worker calls this method for its own instance of the agent. If you need to load a state dict etc., do it here instead of in __init__().
- get_sensor_config(): Has to return a SensorConfig (see navsim.common.dataclasses.SensorConfig) that defines which sensor modalities should be loaded for the agent in each frame. The SensorConfig is a dataclass that stores, for each sensor, a list of indices of the history frames for which that sensor should be loaded. Alternatively, a boolean can be used per sensor if all available frames should be loaded. Moreover, you can return SensorConfig.build_all_sensors() if you want access to all available sensors. Details on the available sensors can be found below. Loading sensors has a big impact on runtime: if you don't need a sensor, consider setting it to False.
- compute_trajectory(): The main function of the agent. Given the AgentInput, which contains the ego state as well as the sensor modalities, it has to compute and return a future trajectory for the agent. Details on the output format can be found below. The future trajectory has to be returned as an object of type navsim.common.dataclasses.Trajectory. For examples, see the constant velocity agent, the human agent, or the minimal sketch after this list.
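Putting these methods together, a minimal non-learned agent could look roughly like the sketch below, in the spirit of the constant velocity agent. Treat it as a sketch: the AgentInput attribute names (ego_statuses, ego_velocity), the Trajectory and TrajectorySampling constructor arguments, and the SensorConfig.build_no_sensors() helper are assumptions that should be checked against navsim.common.dataclasses and the shipped constant velocity agent.

```python
import numpy as np

from navsim.agents.abstract_agent import AbstractAgent
from navsim.common.dataclasses import AgentInput, SensorConfig, Trajectory
from nuplan.planning.simulation.trajectory.trajectory_sampling import TrajectorySampling


class MyConstantVelocityAgent(AbstractAgent):
    """Sketch of a non-learned agent that drives straight at the current speed."""

    def __init__(self):
        super().__init__()
        # 4 s horizon sampled every 0.5 s (8 poses); the evaluation interpolates to 10 Hz.
        self._trajectory_sampling = TrajectorySampling(time_horizon=4, interval_length=0.5)

    def name(self) -> str:
        # Used as the filename of the evaluation csv; any string works.
        return self.__class__.__name__

    def initialize(self) -> None:
        # Nothing to load here; a learned agent would load its checkpoint in this method.
        pass

    def get_sensor_config(self) -> SensorConfig:
        # This agent needs no sensors, which keeps evaluation fast.
        # Assumption: a build_no_sensors() helper exists; otherwise construct a
        # SensorConfig with every sensor set to False.
        return SensorConfig.build_no_sensors()

    def compute_trajectory(self, agent_input: AgentInput) -> Trajectory:
        # Assumption: the most recent ego status is the last entry and exposes a velocity vector.
        speed = float(np.linalg.norm(agent_input.ego_statuses[-1].ego_velocity))
        timesteps = np.arange(1, self._trajectory_sampling.num_poses + 1) * self._trajectory_sampling.interval_length
        # Straight line along the local x-axis with zero heading change: (x, y, heading) per step.
        poses = np.stack([speed * timesteps, np.zeros_like(timesteps), np.zeros_like(timesteps)], axis=-1)
        return Trajectory(poses.astype(np.float32), self._trajectory_sampling)
```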
Most likely, your agent will involve learning-based components. NAVSIM provides a lightweight and easy-to-use interface for training. To use it, your agent has to implement some further functionality. In addition to the methods mentioned above, you have to implement the methods below. Have a look at navsim.agents.ego_status_mlp_agent.EgoStatusMLPAgent for an example.
- get_feature_builders(): Has to return a list of feature builders (of type navsim.planning.training.abstract_feature_target_builder.AbstractFeatureBuilder). Feature builders take the AgentInput object and compute the feature tensors used for agent training and inference. One feature builder can compute multiple feature tensors. They have to be returned in a dictionary, which is then provided to the model in the forward pass. Currently, we provide the following feature builders:
  - EgoStatusFeatureBuilder (returns a tensor containing the current velocity, acceleration, and driving command)
  - TransfuserFeatureBuilder (returns a dictionary containing the current front image, LiDAR BEV map, and the ego status)
- get_target_builders(): Similar to get_feature_builders(), returns the target builders (of type navsim.planning.training.abstract_feature_target_builder.AbstractTargetBuilder) used in training. In contrast to feature builders, they have access to the Scene object, which contains ground-truth information (instead of just the AgentInput).
- forward(): The forward pass through the model. Features are provided as a dictionary which contains all the features generated by the feature builders. All tensors are already batched and on the same device as the model. The forward pass has to output a dictionary in which one entry has to be "trajectory" and contain a tensor representing the future trajectory, i.e., of shape [B, T, 3], where B is the batch size, T is the number of future timesteps, and 3 refers to x, y, and heading.
- compute_loss(): Given the features, the targets, and the model predictions, this function computes the loss used for training. The loss has to be returned as a single tensor.
- get_optimizers(): Use this function to define the optimizers used for training. Depending on whether you want to use a learning-rate scheduler or not, this function needs to either return just an optimizer (of type torch.optim.Optimizer) or a dictionary that contains the optimizer (key: "optimizer") and the learning-rate scheduler of type torch.optim.lr_scheduler.LRScheduler (key: "lr_scheduler").
- get_training_callbacks(): In this function, you can return a list of pl.Callback objects to monitor or visualize the training process of the learned model. We implemented a callback for TransFuser in navsim.agents.transfuser.transfuser_callback.TransfuserCallback, which can serve as a starting point.
- compute_trajectory(): In contrast to the non-learning-based agent, you don't have to implement this function. At inference, the trajectory is automatically computed using the feature builders and the forward method. A sketch combining these training methods can be found after this list.
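As a rough illustration of how these training hooks fit together, the sketch below wires a tiny MLP to the provided EgoStatusFeatureBuilder, similar in spirit to EgoStatusMLPAgent. The import path of the builders, their constructor arguments, the feature/target dictionary keys ("ego_status", "trajectory"), and the input dimension are assumptions for illustration only; the actual agent in navsim.agents.ego_status_mlp_agent is the authoritative reference.

```python
from typing import Dict, List

import torch

from navsim.agents.abstract_agent import AbstractAgent
# Assumption: the provided builders live next to the MLP agent; check
# navsim.agents.ego_status_mlp_agent for the actual import path, names, and constructor arguments.
from navsim.agents.ego_status_mlp_agent import EgoStatusFeatureBuilder, TrajectoryTargetBuilder


class TinyMLPAgent(AbstractAgent):
    """Sketch of a learned agent: ego status in, T future poses (x, y, heading) out.

    name(), initialize(), and get_sensor_config() would be implemented as in the
    non-learned sketch above; compute_trajectory() is not needed here.
    """

    def __init__(self, hidden_dim: int = 256, num_poses: int = 8, lr: float = 1e-4):
        super().__init__()
        self._num_poses = num_poses
        self._lr = lr
        # Assumption: the ego-status feature is 8-dimensional (velocity 2 + acceleration 2 + driving command 4).
        self._mlp = torch.nn.Sequential(
            torch.nn.Linear(8, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, num_poses * 3),
        )

    def get_feature_builders(self) -> List:
        return [EgoStatusFeatureBuilder()]

    def get_target_builders(self) -> List:
        return [TrajectoryTargetBuilder()]  # assumption: a builder that produces the "trajectory" target

    def forward(self, features: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # "ego_status" is an assumed key; it must match what the feature builder returns.
        poses = self._mlp(features["ego_status"]).reshape(-1, self._num_poses, 3)
        return {"trajectory": poses}  # [B, T, 3] with (x, y, heading)

    def compute_loss(
        self,
        features: Dict[str, torch.Tensor],
        targets: Dict[str, torch.Tensor],
        predictions: Dict[str, torch.Tensor],
    ) -> torch.Tensor:
        # A single scalar loss tensor; "trajectory" is the assumed target key.
        return torch.nn.functional.l1_loss(predictions["trajectory"], targets["trajectory"])

    def get_optimizers(self) -> torch.optim.Optimizer:
        # Optimizer only; return {"optimizer": ..., "lr_scheduler": ...} to add a scheduler.
        return torch.optim.Adam(self._mlp.parameters(), lr=self._lr)
```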
get_sensor_config() can be overridden to determine which sensors are accessible to the agent.
The available sensors depend on the dataset. For OpenScene, this includes 9 sensor modalities: 8 cameras and a merged point cloud (from 5 LiDARs). Each modality is available for a duration of 2 seconds into the past, at a frequency of 2Hz (i.e., 4 frames). Only this data will be released for the test frames (no maps, tracks, occupancy, etc., which you may use during training but will not have access to for leaderboard submissions).
You can configure the set of sensor modalities to use, and how much history you need for each sensor, with the navsim.common.dataclasses.SensorConfig dataclass.
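For instance, a configuration along these lines would load only the front camera for the current frame and the LiDAR point cloud for all history frames. The attribute names (cam_f0, lidar_pc, ...) and the frame-index convention are assumptions based on the OpenScene sensor layout; check navsim.common.dataclasses.SensorConfig for the exact fields.

```python
from navsim.common.dataclasses import SensorConfig

# Assumed attribute names; verify against navsim.common.dataclasses.SensorConfig.
sensor_config = SensorConfig(
    cam_f0=[3],  # front camera, most recent frame only (assuming index 3 is the current frame)
    cam_l0=False, cam_l1=False, cam_l2=False,
    cam_r0=False, cam_r1=False, cam_r2=False,
    cam_b0=False,
    lidar_pc=True,  # merged LiDAR point cloud for all available history frames
)

# Or simply load everything while prototyping (slow):
sensor_config = SensorConfig.build_all_sensors()
```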
Why LiDAR? Recent literature on open-loop planning has opted away from LiDAR in favor of using surround-view high-resolution cameras. This has significantly strained the compute requirements for training and testing SoTA planners. We hope that the availability of the LiDAR modality enables more computationally efficient submissions that use fewer (or low-resolution) camera inputs.
Ego Status. Besides the sensor data, an agent also receives the ego pose, velocity, and acceleration in local coordinates. Finally, to disambiguate driver intention, we provide a discrete driving command, indicating whether the intended route goes to the left, straight, or to the right. Importantly, the driving command in NAVSIM is based solely on the desired route and does not entangle information regarding obstacles and traffic signs (as was the case in prior benchmarks such as nuScenes). Note that the left and right driving commands cover turns, lane changes, and sharp curves.
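In code, this information arrives on the AgentInput. The helper below only illustrates how the ego status might be read; the attribute names (ego_statuses, ego_velocity, ego_acceleration, driving_command) are assumptions to be verified against navsim.common.dataclasses.

```python
import numpy as np

from navsim.common.dataclasses import AgentInput


def summarize_ego_state(agent_input: AgentInput) -> dict:
    """Collects the kinematic inputs available to the agent for the current frame."""
    current = agent_input.ego_statuses[-1]  # assumption: last entry = most recent frame
    return {
        "speed": float(np.linalg.norm(current.ego_velocity)),  # (vx, vy) in local coordinates
        "acceleration": np.asarray(current.ego_acceleration),  # (ax, ay) in local coordinates
        "driving_command": current.driving_command,            # discrete intent: left / straight / right
    }
```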
Given this input, you will need to override the compute_trajectory() method and output a Trajectory. This consists of an array of BEV poses (with x, y, and heading in local coordinates), as well as a TrajectorySampling config object that indicates the duration and frequency of the trajectory. The PDM score is evaluated over a horizon of 4 seconds at a frequency of 10Hz. The TrajectorySampling config facilitates interpolation when the output frequency differs from the one used during evaluation.
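For example, an agent that predicts 8 poses at 2 Hz can declare this via the TrajectorySampling, and the evaluation interpolates to 10 Hz. The constructor arguments below follow the nuplan-devkit conventions that NAVSIM builds on and are assumptions to double-check against navsim.common.dataclasses.Trajectory.

```python
import numpy as np

from navsim.common.dataclasses import Trajectory
from nuplan.planning.simulation.trajectory.trajectory_sampling import TrajectorySampling

# 8 future poses covering 4 seconds at 2 Hz; the PDM score is still computed at 10 Hz via interpolation.
sampling = TrajectorySampling(time_horizon=4, interval_length=0.5)

# Placeholder poses: (x, y, heading) per future step in local coordinates.
poses = np.zeros((sampling.num_poses, 3), dtype=np.float32)
trajectory = Trajectory(poses, sampling)
```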
Check out the baselines below for implementations of agents!
NAVSIM provides several baselines, which serve as points of comparison or as starting points for new end-to-end driving models. We provide model weights for all learned baselines on Hugging Face.
The ConstantVelocityAgent is a naive baseline that follows the simplest possible driving logic: the agent maintains a constant speed and a constant heading angle, resulting in a straight-line output trajectory. You can use this agent to familiarize yourself with the AbstractAgent interface or to analyze samples that have a trivial solution for achieving a high PDM score.
Link to the implementation.
The EgoStatusMLPAgent is a blind baseline that ignores all sensors perceiving the environment. The agent applies a multilayer perceptron (MLP) to the state of the ego vehicle (i.e., its velocity, acceleration, and driving command). Thereby, the EgoStatusMLP serves as an upper bound on the performance that can be achieved by merely extrapolating the kinematic state of the ego vehicle. The EgoStatusMLP is a lightweight learned example, showcasing the procedure of creating feature caches and training an agent in NAVSIM.
Link to the implementation.
Transfuser is an example of a sensor agent that utilizes both camera and LiDAR inputs. The backbone of Transfuser applies CNNs to a front-view camera image and a discretized LiDAR BEV grid. The features from the camera and LiDAR branches are fused over several convolution stages with Transformers into a combined feature representation. The Transfuser architecture combines several auxiliary tasks with imitation learning and has demonstrated strong closed-loop performance in end-to-end driving with the CARLA simulator.
In NAVSIM, we implement the Transfuser backbone from CARLA Garage and use BEV semantic segmentation and DETR-style bounding-box detection as auxiliary tasks. To provide the wide-angle camera view expected by Transfuser, we stitch together patches of the three front-facing cameras. Transfuser is a good starting point for sensor agents and provides pre-processing for image and LiDAR sensors, training visualizations with callbacks, and more advanced loss functions (e.g., Hungarian matching for detection).
Link to the implementation.
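The stitching of the front-facing cameras is conceptually simple: crop the overlapping regions and concatenate the three images along the width axis. The sketch below is only illustrative (the actual crop sizes and image handling of the Transfuser baseline differ; see the TransfuserFeatureBuilder for the real pre-processing):

```python
import numpy as np


def stitch_front_view(cam_l0: np.ndarray, cam_f0: np.ndarray, cam_r0: np.ndarray, overlap: int = 0) -> np.ndarray:
    """Concatenates left-front, front, and right-front images (H x W x 3) into one wide view."""
    if overlap:
        # Drop the inner edges that overlap with the front camera's field of view
        # (which side overlaps depends on the camera layout; purely illustrative here).
        cam_l0 = cam_l0[:, :-overlap]
        cam_r0 = cam_r0[:, overlap:]
    return np.concatenate([cam_l0, cam_f0, cam_r0], axis=1)
```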