Oriented bounding box detection
In this project, we work with the Waymo Open Dataset, which provides hours of recordings from LiDAR and camera sensors mounted on a vehicle, together with well-annotated 3D oriented bounding boxes. The goal is to detect surrounding vehicles from the LiDAR data. Note that we don't rely on the 5 cameras for detection; we only use them to reproject the 3D bounding boxes for visualization purposes.
Pedestrians, signs and cyclists could be detected in a similar way. LiDAR data makes it possible to accurately detect oriented bounding boxes up to 50 m away. The point cloud is converted to a Bird's Eye View (BEV) image with 10 cm pixels, and the CNN is fed 3 channels: the maximum height, the minimum height and the number of points falling in each 10 cm x 10 cm vertical column. The CNN is inspired by CenterNet: the (x, y, z) position, the (width, height, length) dimensions and the heading angle are regressed.
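As an illustration, here is a minimal sketch of this BEV rasterization with NumPy. The grid extent and function name are assumptions for the example, not the project's actual code.

```python
import numpy as np

def rasterize_bev(points, extent=50.0, pixel_size=0.1):
    """Turn an (N, 3) point cloud (x forward, y left, z up) into a 3-channel
    BEV image: max height, min height and point count per 10 cm x 10 cm cell.
    The +/- `extent` grid size is an assumption, not the project's exact value."""
    size = int(2 * extent / pixel_size)               # e.g. 1000 x 1000 pixels at 10 cm
    max_h = np.full((size, size), -np.inf, dtype=np.float32)
    min_h = np.full((size, size), np.inf, dtype=np.float32)
    count = np.zeros((size, size), dtype=np.float32)

    # Keep only points that fall inside the grid.
    keep = (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    x, y, z = points[keep].T

    # Map metric coordinates to pixel indices.
    ix = ((x + extent) / pixel_size).astype(np.int32)
    iy = ((y + extent) / pixel_size).astype(np.int32)

    np.maximum.at(max_h, (iy, ix), z)                 # channel 0: maximum height
    np.minimum.at(min_h, (iy, ix), z)                 # channel 1: minimum height
    np.add.at(count, (iy, ix), 1.0)                   # channel 2: number of points

    empty = count == 0                                # cells hit by no point
    max_h[empty] = 0.0
    min_h[empty] = 0.0
    return np.stack([max_h, min_h, count], axis=-1)
```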
See the YouTube video
This network is trained on the Waymo Open Dataset. I used the latest version (March 2024) of the Perception Dataset.
In the individual files, you will find .tfrecord segments split into training/validation/testing folders. I suggest downloading them (manually or with gsutil) into tfrecords/training and tfrecords/validation.
After installing the required Python packages (like waymo-open-dataset-tf-2-12-0), use this command to extract the PLY point clouds, images, labels and camera intrinsic/extrinsic poses:
python3 extract_tfrecord.py tfrecords/training/individual_files_training_segment_[...]_with_camera_labels.tfrecord dataset/training/
This will put your data inside dataset/training. Note that this worked (for me) on Ubuntu 22.04, but not on macOS.
This will write to dataset/training/individual_files_training_segment_[...]_with_camera_labels:
- pointclouds: the LiDAR scan at each timestamp (PLY)
- labels: the annotated objects as oriented bounding boxes with their classes
- cameras.json: intrinsics/extrinsics of the 5 cameras surrounding the vehicle
- images: the camera images (only when image extraction is enabled, see below)
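All of this is read from the per-frame protos stored inside each segment. For reference, here is a minimal sketch of iterating over a segment with the standard waymo-open-dataset API; it is only an illustration, not the project's extract_tfrecord.py.

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

# Any downloaded segment works here; the path follows the pattern used above.
filename = "tfrecords/training/individual_files_training_segment_[...]_with_camera_labels.tfrecord"

dataset = tf.data.TFRecordDataset(filename, compression_type="")
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    # Each frame holds the LiDAR returns, the 5 camera images,
    # the 3D laser labels and the camera calibrations.
    print(frame.context.name, len(frame.laser_labels), "labeled objects")
```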
To enable image extraction at the same time, you can add --front, --front-left, --front-right, --side-left or --side-right at the end of the previous command.
(Example extracted images from the front-left, front-right, side-left and side-right cameras.)
If you want to do this for many tfrecords (which is a good idea), you can use this basic script:
sh extract_tfrecords.sh
There are usually ~200 frames per segment. That corresponds to ~30 seconds, meaning that we are running at ~6.7 fps.
The LiDAR scans are fused into a single point cloud surrounding the vehicle. Each fused cloud contains >100k points and covers up to 70 meters around the car. The ground is at z = 0 and the car is located at the origin (0, 0, 0). The forward direction is +X, +Y points to the left of the car, and +Z points upwards.
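To sanity-check this convention, you can load one fused scan and look at its extents. Here is a minimal sketch with Open3D and NumPy; the file name is only an example.

```python
import numpy as np
import open3d as o3d

# Example path; any PLY written by the extraction step will do.
pcd = o3d.io.read_point_cloud(
    "dataset/training/individual_files_training_segment_[...]_with_camera_labels/pointclouds/0000.ply")
pts = np.asarray(pcd.points)

print(pts.shape[0], "points")                               # typically > 100k
print("x (forward):", pts[:, 0].min(), pts[:, 0].max())
print("y (left)   :", pts[:, 1].min(), pts[:, 1].max())
print("z (up)     :", pts[:, 2].min(), pts[:, 2].max())     # ground around z = 0
```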
To visualize the point clouds and display the bounding boxes of the vehicles:
python3 viz.py dataset/training/individual_files_training_segment-etc
Then, you can press SPACE to pause/play the processing. In pause mode, you can go frame by frame by hitting the N(ext) key.
I intentionally only kept the vehicle bounding boxes, but you can also display the other classes (pedestrians, signs, cyclists). There is also a filter on the number of points per box; this is crucial because we can't expect our CNN to detect every object when only a very limited number of points hit the vehicle.
Show bounding boxes with Open3D
You can project the bounding boxes onto any camera you like.
python3 viz.py dataset/training/individual_files_training_segment-etc --front
Show bounding boxes projected on the front camera
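The projection uses the intrinsics/extrinsics stored in cameras.json, and viz.py takes care of it. Purely as an illustration, a generic pinhole projection looks like the sketch below; the exact matrix convention stored in cameras.json (camera-from-vehicle vs. vehicle-from-camera, axis ordering) is an assumption here.

```python
import numpy as np

def project_points(points_vehicle, extrinsic, intrinsic):
    """Generic pinhole projection of 3D points (vehicle frame) into an image.

    Assumed (may differ from cameras.json): `extrinsic` is a 4x4
    camera-from-vehicle transform, `intrinsic` a 3x3 K matrix with the
    usual z-forward camera convention."""
    pts_h = np.c_[points_vehicle, np.ones(len(points_vehicle))]  # homogeneous coordinates
    pts_cam = (extrinsic @ pts_h.T)[:3]                          # move points into the camera frame
    in_front = pts_cam[2] > 0                                    # only keep points in front of the camera
    uv = intrinsic @ pts_cam
    uv = uv[:2] / uv[2]                                          # perspective divide
    return uv.T, in_front                                        # (N, 2) pixel coords + visibility mask
```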
The network is inspired by CenterNet and uses a ResNet backbone. The point cloud is converted to a BEV image with a "pixel" size of 10 cm. The 3 channels are not RGB, but the maximum height, the minimum height and the number of points contained in each 10 cm x 10 cm vertical column.
Then we predict the center of the bounding box, its width, its length, its height and its orientation. We also predict the z position since we cannot assume that the ground is flat.
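For intuition, here is a hedged sketch of how such outputs could be decoded back into boxes; the head layout, names and threshold are assumptions for the example, not the notebook's actual code.

```python
import numpy as np

def decode_detections(heatmap, regression, pixel_size=0.1, extent=50.0, score_thresh=0.3):
    """Decode CenterNet-style BEV outputs into oriented boxes.

    heatmap:    (H, W) center confidence map.
    regression: (H, W, 6) per-pixel (width, length, height, z, cos(yaw), sin(yaw)).
    Shapes and channel order are illustrative assumptions."""
    boxes = []
    ys, xs = np.where(heatmap > score_thresh)         # candidate center pixels
    for y, x in zip(ys, xs):                          # (a real decoder keeps only local maxima)
        w, l, h, z, cos_t, sin_t = regression[y, x]
        boxes.append({
            "x": x * pixel_size - extent,             # pixel -> meters in the vehicle frame
            "y": y * pixel_size - extent,
            "z": float(z),
            "width": float(w), "length": float(l), "height": float(h),
            "heading": float(np.arctan2(sin_t, cos_t)),
            "score": float(heatmap[y, x]),
        })
    return boxes
```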
Check the training-and-inference.ipynb notebook. After training, you will be able to extract the inference results on the validation dataset.
At the end of the notebook, you can run the network on a complete sequence. This will write to inference/ and store the results of each frame in inference_***.json, with the exact same format as the labels JSON.
python3 viz.py dataset/validation/individual_files_validation_segment-etc --inference
You can also visualize how the 3D bounding boxes are projected onto the camera:
python3 viz.py dataset/validation/individual_files_validation_segment-etc --inference --front