Our documentation can be found here
Scenes captured by surveillance video are usually low resolution. Most existing digital video surveillance systems rely on human observers to detect specific activities in a real-time video scene, but human capability to monitor simultaneous events in surveillance displays is limited. Low-quality images require super resolution techniques to become visually perceptible, after which age and gender estimation can support a wide range of applications such as abnormal event detection, person counting in a dense crowd, person identification, and gender classification, including applications for elderly people.
Build a solution to estimate the gender and age of people from a surveillance video feed (e.g. in a mall, retail store, or hospital). Consider low-resolution cameras as well as cameras mounted at a height for surveillance.
This repository is built with PyTorch 1.8.1 and tested on Ubuntu 20.04 (Python 3.7, CUDA 11.3.1). Follow these instructions:
- Clone the repository

  ```
  git clone https://github.com/Inter-IIT-Bosch-Mid-Prep/Bosch-Age-Gender-Detection.git
  ```

- Run the following commands to create the conda environment with all the required dependencies

  ```
  cd Bosch-Age-Gender-Detection
  conda env create -f env.yml
  ```

- Activate the conda environment and set up git-lfs (the last two commands need to be run to avoid an unpickling error)

  ```
  conda activate bosch
  sudo apt-get install git-lfs
  git-lfs install
  ```
- Go to the weights directory and create 2 folders inside it

  ```
  cd weights
  mkdir age_prediction gender_prediction
  ```

- Download the NDF weights for age prediction from here and extract them to `${root}/weights/age_prediction/`.

- Download the VGGFace gender weights from here and extract them to `${root}/weights/gender_prediction/`.

- If you want to use SwinIR, download the SwinIR weights from here and extract them to `${root}/weights/gan_weights/`.

- If you want to use BSRGAN, download the BSRGAN weights from here and extract them to `${root}/weights/gan_weights/`.
For saving output images, make a folder `output`, which is the default for `--output_folder`:

```
mkdir output
```
To run the entire pipeline on a single video, you can use the command below:

```
python detect.py --run_on_image <BOOL TO RUN ON IMAGE OR VIDEO> \
    --save_csv_location <PATH TO SAVE CSV OUTPUT> \
    --weights_yolo <PATH TO WEIGHTS OF YOLO_V5> \
    --video_image <PATH TO VIDEO> \
    --img-size <INFERENCE SIZE IN PIXELS> \
    --weights_gan <PATH TO WEIGHTS OF GAN> \
    --output_folder <PATH TO SAVE OUTPUT IMAGES> \
    --facelib <BOOL VALUE TO USE FACELIB OR NOT> \
    --sr_type <1-EDSR, 2-SwinIR, 3-BSRGAN> \
    --deblur_weights <PATH TO PRETRAINED WEIGHTS> \
    --gender_pred <IF 0 THEN USE VGG FACE> \
    --gender_weights <PATH OF VGG GENDER MODEL WEIGHTS> \
    --age_pred <IF VALUE 0 USE NDF> \
    --age_weights <PATH TO FINETUNED WEIGHTS> \
    --cuda <TRUE IF WANT TO USE CUDA>
```
Note that all configurations are optional here. To run the entire pipeline with the default configuration on test.mp4, run the following command:

```
python detect.py
```
To run detect.py on an image:

```
python detect.py --run_on_image True --video_image ssd.png
```
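Flags can be combined as needed. For example, an illustrative run on an image with a larger inference size on GPU (the size value here is just an example, not a repository default):

```
python detect.py --run_on_image True --video_image ssd.png --img-size 640 --cuda True
```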
To use our UI built with Streamlit, run the command below and upload the photo:

```
streamlit run app.py
```
We provide support for 3 kinds of SR algorithms: EDSR, SwinIR and BSRGAN. Our default is EDSR, as the other two models are very heavy. The algorithm is selected through the value of `--sr_type`:

- 1 (EDSR)
- 2 (SwinIR)
- 3 (BSRGAN)
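For example, an illustrative way to switch the backend to BSRGAN (the weights filename is a placeholder for whichever file you downloaded above):

```
python detect.py --sr_type 3 --weights_gan weights/gan_weights/<BSRGAN_WEIGHTS_FILE>
```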
By default the output CSV will be created in the `${root}` directory with the name `name_csv_file.csv`; this is configurable via `--save_csv_location`.
For testing our pipeline and its individual blocks, we used some open-source datasets as well as our manually collected datasets. The open-source datasets are listed below and the manually collected datasets can be found here.
Task | Dataset Link
---|---
Face Detection | WiderFace
 | FDDB
Age and Gender Estimation | UTKFace
 | CACD
 | Adience
 | IMDB-WIKI
We first extract the individual frames from the given input video and then apply state-of-the-art denoising. The user can choose between the Restormer and HINet denoising methods.
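As context, a minimal sketch of the frame-extraction step (illustrative, using OpenCV; not the exact code in detect.py):

```python
import cv2

def extract_frames(video_path):
    """Yield frames from a video one at a time as BGR numpy arrays."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream or read error
            break
        yield frame
    cap.release()

# Dump frames to disk before the denoising/deblurring stages.
for i, frame in enumerate(extract_frames("test.mp4")):
    cv2.imwrite(f"output/frame_{i:05d}.png", frame)
```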
We analysed the problem statement from various perspectives and finally decided to go ahead with face detection.

- Firstly, surveillance videos usually record humans from a height, which obfuscates information about height and pose. Applying height and pose estimation algorithms on top of person detection algorithms would have proven computationally expensive.
- Secondly, clothing information would have introduced a stereotypical bias, which would have been harmful for marginalized groups.

Thus we decided to use face detection algorithms. For our pipeline we have integrated YOLOv5-face. We also tried various other detection algorithms, but on our own test datasets they did not give satisfactory results in terms of mean Average Precision (mAP).
Model | mAP |
---|---|
TinaFace | 94.17 |
YoloV5 | 95.99 |
RetinaFace | 91.45 |
We extensively tested super resolution algorithms and realised that the extracted face images were often blurred, which rendered super resolution ineffective and gave sub-par performance on the age and gender tasks. We therefore added a new deblurring preprocessing block.

We provide a comparison of wall time, PSNR, and a custom metric across multiple super resolution methods:
MODEL | Custom Metric | Wall Time | PSNR |
---|---|---|---|
WDSR | 345.5003798 | 15.85164261 | 30.36618739 |
EDSR | 347.6678516 | 2.112349987 | 30.34467902 |
SRGAN | 354.7159776 | 9.196819544 | 29.42405326 |
FSRCNN | 430.6859193 | 0.3795514107 | 23.69700551 |
RDN | 307.7455076 | 0.3795514107 | 24.58058639 |
SRDenseNet | 408.9996247 | 17.12142944 | 24.05288471 |
ESPCN | 362.0292181 | 0.4130887985 | 25.14499845 |
FSRCNN_trained | 575.1895184 | 0.8021452427 | 21.94459659 |
The entire super resolution pipeline is illustrated below.
We have addressed the main pain point of the problem statement.

- To give an overview, existing super resolution algorithms provide a high Peak Signal-to-Noise Ratio (PSNR) value but fail to preserve the high-frequency details of the image.
- Also, existing super resolution algorithms are usually modifications of SRGANs, which are computationally expensive to train and have loss convergence issues.

Thus we introduced a novel technique: we add a new loss term to the existing reconstruction loss without introducing any new network parameters. We follow the same training procedure but optimize the parameters with respect to the combined loss, which helps preserve the high-frequency components.
The loss is formulated as
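A representative formulation (a sketch based on the description above, where $\lambda$ is a weighting hyperparameter and $H(\cdot)$ is a fixed high-pass filter such as a Laplacian kernel; the exact operator used in the repository may differ):

$$
\mathcal{L} = \underbrace{\lVert \hat{y} - y \rVert_1}_{\text{reconstruction}} + \lambda \, \underbrace{\lVert H(\hat{y}) - H(y) \rVert_1}_{\text{high-frequency}}
$$

Since $H$ is fixed and non-learnable, the second term adds no new network parameters. A minimal PyTorch sketch of this term, assuming a 3x3 Laplacian kernel (again illustrative, not the repository's exact implementation):

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Laplacian kernel: a high-pass filter with no learnable parameters.
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def sr_loss(sr, hr, lam=0.1):
    """L1 reconstruction loss plus an L1 loss on high-pass-filtered images."""
    c = sr.shape[1]
    kernel = LAPLACIAN.to(sr.device).repeat(c, 1, 1, 1)
    hf_sr = F.conv2d(sr, kernel, padding=1, groups=c)  # per-channel filtering
    hf_hr = F.conv2d(hr, kernel, padding=1, groups=c)
    return F.l1_loss(sr, hr) + lam * F.l1_loss(hf_sr, hf_hr)
```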
We have also done extensive experimentation on age and gender prediction. First, as a sanity check on whether super resolution helps our task, we ran benchmark tests with and without super resolution, using VGGFace as the classification model. The results (age prediction accuracy) are shown below.
Image size | No SR | BSRGAN | EDSR | SwinIR |
---|---|---|---|---|
7x7 | 0.287 | 0.241 | 0.314 | 0.252 |
14x14 | 0.352 | 0.313 | 0.386 | 0.313 |
28x28 | 0.488 | 0.499 | 0.523 | 0.495 |
56x56 | 0.513 | 0.5342 | 0.551 | 0.533 |
This shows a general increase in age prediction accuracy with EDSR across all image sizes. We considered all these image sizes because YOLOv5-face returns faces with dimensions ranging from 7x7 to 96x96.
- For the gender classification task, since there are only 2 labels, a deeper and more complex model would overfit the data. We therefore fine-tune only some layers of the original VGGFace model, which reported a test accuracy of 94%, and we use it in our final pipeline (see the sketch after the table below).
MODEL | DATASET | ACCURACY | PRECISION | RECALL | F1
---|---|---|---|---|---
Facelib | Adience | 0.73386 | 0.73404 | 0.735218 | 0.73357
Facelib | UTK | 0.78948 | 0.79971 | 0.79349 | 0.78889
VGG Face | Adience | 0.7492 | 0.7632 | 0.7541 | 0.74424
VGG Face | UTK | 0.9139 | 0.9245 | 0.9219 | 0.9028
Resnet | Adience | 0.4821 | 0.4931 | 0.4956 | 0.4732
Resnet | UTK | 0.9369 | 0.9378 | 0.9413 | 0.9217
MLP | Adience | 0.8543 | 0.8612 | 0.8711 | 0.8422
MLP | UTK | 0.9257 | 0.9033 | 0.9341 | 0.9160
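A minimal sketch of this kind of partial fine-tuning in PyTorch (illustrative: the backbone here is torchvision's VGG16 as a stand-in, and the choice of which layers to freeze is an assumption, not the repository's exact code):

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone: torchvision's VGG16 (the pipeline itself uses VGGFace weights).
model = models.vgg16(pretrained=True)

# Freeze all convolutional layers so only a few layers are trained ...
for p in model.features.parameters():
    p.requires_grad = False

# ... and retrain the classifier head for the 2 gender classes.
model.classifier[6] = nn.Linear(4096, 2)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```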
- For the age classification task, we faced a number of challenges, such as:
  - Non-uniformity in dataset labels
  - Subpar cross-dataset performance in existing state-of-the-art models
- To address these problems, we improved upon existing models in the following ways:
  - We first perform gender classification and use the resulting gender embedding as a prior for age classification, which improved our results.
  - To generalize across datasets, we ensemble models trained on multiple datasets, which makes our model much more robust.
  - We use a new ensembling technique in which more weight is given to age predictions clustered together, by weighting inversely with the difference between predicted ages (see the sketch after this list).
- For our age classification tasks, after extensive experimentation we use the following models:
  - VisualizingNDF trained on CACD
  - VisualizingNDF trained on CACD and finetuned on WIKI and UTKface
  - VisualizingNDF trained on CACD and finetuned on WIKI
  - VGGFace trained on IMDB
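A minimal sketch of the inverse-difference weighting idea (our reading of the description above; the exact weighting scheme used in the repository may differ):

```python
import numpy as np

def ensemble_age(predictions, eps=1.0):
    """Combine per-model age predictions, giving more weight to predictions
    that cluster together (small pairwise differences -> large weights)."""
    preds = np.asarray(predictions, dtype=float)
    # Pairwise absolute differences between predicted ages.
    diffs = np.abs(preds[:, None] - preds[None, :])
    # A model's weight is inversely related to its total disagreement with the rest.
    weights = 1.0 / (diffs.sum(axis=1) + eps)
    return float((weights * preds).sum() / weights.sum())

# Three models agree near 31, so the outlier at 55 is downweighted.
print(ensemble_age([30.0, 32.0, 31.0, 55.0]))  # ~33.7, vs. 37.0 for a plain mean
```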
Model | MSE | RMSE | R-square | MAE |
---|---|---|---|---|
Deep Face (Retina Face) | 332.0787 | 18.223 | -7.78913 | 14.3249 |
Deep Face (Opencv) | 332.8555 | 18.24432 | -5.4156 | 14.3404 |
Deep Face (SSD) | 326.2908 | 18.06352 | -6.4814 | 14.17737 |
InsightFace | 424.6219 | 20.60635 | -0.46678 | 15.8082 |
FaceLib | 211.54167 | 14.5444 | 0.286898 | 10.09992 |
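For reference, the regression metrics in the table above can be computed with scikit-learn, as in this brief sketch (the arrays are placeholders, not our data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

y_true = np.array([25, 31, 47, 60])   # ground-truth ages (placeholder)
y_pred = np.array([22, 35, 40, 66])   # model predictions (placeholder)

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)         # negative when worse than predicting the mean
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} MAE={mae:.3f}")
```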
```
├── age_gender_prediction/
│   ├── VGGFace
│   └── VisualizingNDF
├── Deblur/
│   └── MPRNet
├── Denoising/
│   ├── HINet
│   └── Restormer
├── ObjDet/
├── Super_Resolution/
│   ├── bicubic_pytorch
│   └── ESPCN_pytorch
├── imgs/
└── README.md
```
After running the Streamlit command, a new window opens in the default browser. This UI uses our codebase directly as the backend and makes an API call to run the model. Users can choose the type of GAN from the dropdown list and then upload an image from the local system. The processing happens in the backend, and the output image is rendered with the corresponding bounding boxes along with the age and gender information.
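For reference, a minimal sketch of how such a Streamlit front end can be structured (illustrative only; this is not the actual app.py):

```python
import streamlit as st

st.title("Age & Gender Detection")

# Dropdown mirroring the --sr_type choices in detect.py.
sr = st.selectbox("Super resolution backend", ["EDSR", "SwinIR", "BSRGAN"])
uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])

if uploaded is not None:
    with open("input.png", "wb") as f:
        f.write(uploaded.getbuffer())
    # The real app invokes the detection pipeline here; shown as a placeholder.
    st.image("input.png", caption=f"Would run the pipeline with {sr}")
```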
To help the user, a walkthrough is shown below.
The final prediction is shown below on a test image