Let's suppose that you want to detect all the vehicles on a highway from a camera located on a bridge. The goal of this project is to detect them and estimate their dimensions, regardless of their type (car, truck, van, etc.).
The main difficulty of doing it with a monocular camera, compared to a stereo camera or a LiDAR, is to regress 3D dimensions from a 2D image. Moreover, obtaining a ground truth for a real-life recording requires significant resources. This is why I decided to use the CARLA simulator to build my dataset, where the bounding box dimensions are perfectly known.
Once the training is done, you can track vehicles individually and generate statistics (average speed, dimension distributions, ...).
My contribution has 4 components:
- a camera calibration algorithm to find the camera pose above the highway
- a Python script to interact with the CARLA simulator and generate a ground truth
- a CenterNet-based network to regress positions and dimensions
- a tracking algorithm based on a Kalman filter
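To illustrate the last component, here is a minimal sketch of a constant-velocity Kalman filter for a single track. This is an illustration, not the project's actual implementation: the state layout `[x, y, vx, vy]` and the noise values are assumptions.

```python
import numpy as np

class KalmanTracker:
    """Constant-velocity Kalman filter for one tracked vehicle.

    Assumed state: [x, y, vx, vy] in ground-plane coordinates (m, m/s).
    """
    def __init__(self, x0, y0, dt=1.0, q=1.0, r=0.5):
        self.x = np.array([x0, y0, 0.0, 0.0])   # state estimate
        self.P = np.eye(4) * 10.0               # state covariance
        self.F = np.array([[1, 0, dt, 0],       # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],        # we only observe the position
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q                  # process noise (assumed)
        self.R = np.eye(2) * r                  # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

# Track a car moving at ~1 m per frame along y, with noiseless measurements:
trk = KalmanTracker(0.0, 0.0, dt=1.0)
for k in range(1, 6):
    trk.predict()
    trk.update((0.0, float(k)))
print(trk.x)   # the estimated vy converges towards 1.0
```

In the real pipeline, the measurement would be the position regressed by the network, and one such filter would be maintained per detected vehicle.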
For detection only:

```shell
python multi_object_detection.py path/to/video.mp4 centernet-3d-bbox.pth --conf 0.2
```

For full tracking:

```shell
python multi_object_tracking.py path/to/video.mp4 centernet-3d-bbox.pth --conf 0.2
```
Let's suppose that your intrinsic camera matrix $K$ has already been estimated with a checkerboard.
Now we want to estimate the camera pose (position + orientation) in the world frame by using the driving lanes (parallel lines with a known interdistance). Regarding the orientation, we assume zero roll, so we only estimate the pitch $\theta$ and the yaw $\psi$ of the camera. We also assume that the yaw is quite small.
Camera pose estimation with lanes
The rotation matrix can be derived by multiplying the 2 Euler matrices (pitch $\theta$ about the camera $x$-axis and yaw $\psi$ about the camera $y$-axis), composed with a fixed matrix $R_0$ that aligns the world axes ($X$ lateral, $Y$ along the road, $Z$ up) with the camera axes ($x$ right, $y$ down, $z$ forward):

$$R = R_x(\theta)\,R_y(\psi)\,R_0 = \begin{pmatrix}1&0&0\\0&\cos\theta&-\sin\theta\\0&\sin\theta&\cos\theta\end{pmatrix}\begin{pmatrix}\cos\psi&0&\sin\psi\\0&1&0\\-\sin\psi&0&\cos\psi\end{pmatrix}\begin{pmatrix}1&0&0\\0&0&-1\\0&1&0\end{pmatrix}$$
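A sketch of this rotation composition in NumPy. The Euler conventions and the world-to-camera axis alignment are assumptions, since they depend on the chosen frames:

```python
import numpy as np

def camera_rotation(pitch, yaw):
    """World-to-camera rotation: two Euler matrices times a fixed
    axis alignment (assumed: world X lateral, Y along road, Z up
    -> camera x right, y down, z forward). Roll is taken to be zero."""
    ct, st = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    R_x = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
    R_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    R_0 = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=float)
    return R_x @ R_y @ R_0

R = camera_rotation(np.deg2rad(30), np.deg2rad(2))
forward = R @ np.array([0.0, 1.0, 0.0])   # the road direction in camera frame
print(forward)   # z component is positive: the road lies in front of the camera
```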
Then we can form the transformation matrix (reminder: a world point $P = (X, Y, Z)^\top$ projects to the pixel $(u, v)$ through $s\,(u, v, 1)^\top = K\,[R \mid t]\,(X, Y, Z, 1)^\top$, where $t = -R\,C$ and $C$ is the camera position):

$$T = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}$$

So if you want to transform a 3D point from world coordinates that lies on the ground ($Z = 0$), the projection reduces to a homography between the ground plane and the image:

$$s\,(u, v, 1)^\top = K\,[\,r_1 \;\; r_2 \;\; t\,]\,(X, Y, 1)^\top$$

where $r_1$ and $r_2$ are the first two columns of $R$.
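A minimal sketch of projecting a ground-plane point to pixels, under assumed conventions (world $Z$ up, camera pitched down, zero yaw); every numeric value below is a made-up placeholder, not a value from the project:

```python
import numpy as np

# Placeholder intrinsics (fx = fy = 1000, principal point at image centre).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0,    0.0,   1.0]])

def project_ground_point(K, R, t, X, Y):
    """Project a world point lying on the ground plane (Z = 0) to pixels."""
    P_cam = R @ np.array([X, Y, 0.0]) + t   # world -> camera frame
    p = K @ P_cam                           # camera frame -> homogeneous pixels
    return p[:2] / p[2]                     # divide by the depth s

# Camera 10 m above the ground, pitched 30 degrees down, no yaw.
theta = np.deg2rad(30)
# Assumed axis alignment: world X -> cam x, world Z -> cam -y, world Y -> cam z.
R0 = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=float)
Rx = np.array([[1, 0, 0],
               [0, np.cos(theta), -np.sin(theta)],
               [0, np.sin(theta),  np.cos(theta)]])
R = Rx @ R0
C = np.array([0.0, 0.0, 10.0])   # camera centre in world coordinates
t = -R @ C
u, v = project_ground_point(K, R, t, 0.0, 40.0)
print(u, v)   # a point 40 m down the road, laterally on the camera axis
```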
If we consider a point very far away where the lanes intersect, we have a pure direction: the lane direction $d = (0, 1, 0)^\top$ in the world frame, whose projection does not depend on the translation $t$.

Now when we project this point on the image, we get the pixel coordinates of the lane intersection (vanishing) point:

$$s\,(u_{vp}, v_{vp}, 1)^\top = K\,R\,d \;\;\Rightarrow\;\; u_{vp} = c_x + f_x\,\frac{\tan\psi}{\cos\theta}, \qquad v_{vp} = c_y - f_y\,\tan\theta$$

This is how we can estimate the pitch and yaw angles:

$$\theta = \arctan\!\left(\frac{c_y - v_{vp}}{f_y}\right), \qquad \psi = \arctan\!\left(\frac{(u_{vp} - c_x)\,\cos\theta}{f_x}\right)$$
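In code, recovering pitch and yaw from the lane vanishing point might look like this (a sketch under the zero-roll assumption; the intrinsics and angles are placeholders, and the round trip simply checks the two formulas against each other):

```python
import numpy as np

def pitch_yaw_from_vanishing_point(u_vp, v_vp, fx, fy, cx, cy):
    """Recover pitch and yaw from the lane vanishing point (zero roll assumed)."""
    pitch = np.arctan((cy - v_vp) / fy)
    yaw = np.arctan((u_vp - cx) * np.cos(pitch) / fx)
    return pitch, yaw

# Round-trip check with made-up angles and intrinsics:
fx = fy = 1000.0
cx, cy = 960.0, 540.0
theta, psi = np.deg2rad(25), np.deg2rad(3)
u_vp = cx + fx * np.tan(psi) / np.cos(theta)   # forward model
v_vp = cy - fy * np.tan(theta)
pitch_hat, yaw_hat = pitch_yaw_from_vanishing_point(u_vp, v_vp, fx, fy, cx, cy)
print(np.rad2deg(pitch_hat), np.rad2deg(yaw_hat))   # recovers ~25 and ~3 degrees
```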
For this, we consider the points of each lane at the bottom of our image. We fix our origin by declaring that those points have $Y = 0$ (and $Z = 0$, since they lie on the ground).

The values of $X$ are known from the lane interdistance $w$: the point of the $i$-th lane is $P_i = (i\,w,\, 0,\, 0)^\top$.

It leads, by back-projecting the pixel $p_i = (u_i, v_i, 1)^\top$ along its (unknown) depth $s_i$:

$$P_i - C = s_i\,R^\top K^{-1}\,p_i$$

Then, writing $q_i = R^\top K^{-1}\,p_i$:

$$C + s_i\,q_i = P_i$$
And this works for every point. This can be easily solved by using a least-squares regression, where you write all your equations in a single matrix equation:

$$A\,x = b, \qquad x = \begin{pmatrix} C \\ s_1 \\ \vdots \\ s_n \end{pmatrix}$$

where each lane point contributes the three rows of $C + s_i\,q_i = P_i$.

Then, the optimal values are:

$$x^\star = (A^\top A)^{-1} A^\top b$$
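A sketch of the full least-squares solve on synthetic data. The lane spacing, camera height, and angles below are made-up placeholders, and the rotation convention is the one assumed earlier:

```python
import numpy as np

def estimate_camera_position(points_world, pixels, K, R):
    """Solve C + s_i * q_i = P_i for the camera centre C by linear least squares.

    Unknown vector x = (Cx, Cy, Cz, s_1, ..., s_n); each point gives 3 rows.
    """
    n = len(points_world)
    A = np.zeros((3 * n, 3 + n))
    b = np.zeros(3 * n)
    Kinv = np.linalg.inv(K)
    for i, (P, p) in enumerate(zip(points_world, pixels)):
        q = R.T @ Kinv @ np.array([p[0], p[1], 1.0])
        A[3*i:3*i+3, 0:3] = np.eye(3)   # coefficients of C
        A[3*i:3*i+3, 3+i] = q           # coefficient of the depth s_i
        b[3*i:3*i+3] = P
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3]                        # the camera centre

# Synthetic check: camera 12 m high, pitched 35 degrees, three lanes 3.5 m apart.
K = np.array([[1000.0, 0, 960.0], [0, 1000.0, 540.0], [0, 0, 1.0]])
theta = np.deg2rad(35)
R0 = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=float)
Rx = np.array([[1, 0, 0],
               [0, np.cos(theta), -np.sin(theta)],
               [0, np.sin(theta),  np.cos(theta)]])
R = Rx @ R0
C_true = np.array([1.0, -5.0, 12.0])
lanes = [np.array([i * 3.5, 0.0, 0.0]) for i in range(3)]
pixels = []
for P in lanes:
    pc = R @ (P - C_true)               # project each lane point...
    ph = K @ pc
    pixels.append(ph[:2] / ph[2])       # ...to noiseless pixel coordinates
C_hat = estimate_camera_position(lanes, pixels, K, R)
print(C_hat)   # should recover approximately (1, -5, 12)
```

With noiseless synthetic pixels the solve is exact; on real lane annotations the same system is simply solved in the least-squares sense.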
You can check the results visually: