This project develops an application for recognizing one-dimensional signals (audio) and two-dimensional signals (images). Specifically, we addressed three tasks:
- Processing-1D: Recognize the identity of a group member from a two-second audio clip using ML and DL models. To solve this task, we tried different models and different feature configurations (zero-crossing rate, standard deviation, MFCC, spectrogram, etc.).
- Processing-2D: Recognize the identity of a group member from an image using DL models. Here we tried different pretrained architectures, with weights trained on a general task (ImageNet) and on a face recognition task (VGGFace).
- Retrieval: Find the ten VIP faces most similar to each member of the group.
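As an illustration of the simpler audio features mentioned for Processing-1D, the sketch below computes the zero-crossing rate and the standard deviation per frame with NumPy and averages them over the clip. The frame length, hop length, sample rate, and the function names are assumptions made for this example, not values taken from the notebooks.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into (possibly overlapping) frames, one per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def simple_features(signal, frame_len=1024, hop_len=512):
    """Return [mean zero-crossing rate, mean standard deviation] for a clip."""
    frames = frame_signal(signal, frame_len, hop_len)
    zcr = zero_crossing_rate(frames)
    std = frames.std(axis=1)
    # Summarize each per-frame feature by its mean over the whole clip.
    return np.array([zcr.mean(), std.mean()])

# Synthetic two-second clip: a 440 Hz tone at an assumed 16 kHz sample rate.
sr = 16000
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
features = simple_features(tone)
```

A feature vector like this could be fed to a classical model such as an SVM or a random forest; richer features such as MFCCs would normally come from an audio library.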
All the data used for this project were collected directly, as follows:
- Processing-1D: recording 100 five-second audio clips for each person. These clips were then cut into two-second segments, and data augmentation was applied by modifying pitch and speed to increase the amount of available data.
- Processing-2D: taking 100 photos with variations in lighting and expression.
- Retrieval: three photos taken from the previous task were used.
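The audio preparation described above can be sketched roughly as follows. This is a minimal NumPy-only illustration, assuming a 16 kHz sample rate, non-overlapping two-second segments (the leftover second is dropped), and speed changes of ±10%. Plain resampling as used here changes speed and pitch together; changing them independently would normally require an audio library, and the notebook may use a different method.

```python
import numpy as np

def split_clip(signal, sr, segment_s=2.0):
    """Cut a recording into non-overlapping segments, dropping the remainder."""
    seg_len = int(segment_s * sr)
    n_segments = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def change_speed(signal, rate):
    """Resample by linear interpolation.

    rate > 1 makes the clip shorter (faster, higher pitch);
    rate < 1 makes it longer (slower, lower pitch).
    """
    n_out = int(round(len(signal) / rate))
    old_pos = np.arange(len(signal))
    new_pos = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_pos, old_pos, signal)

# Placeholder five-second recording (random noise) at an assumed 16 kHz.
sr = 16000
recording = np.random.default_rng(0).normal(size=5 * sr)

segments = split_clip(recording, sr)
# Two augmented copies per segment: one slower (0.9) and one faster (1.1).
augmented = [change_speed(s, r) for s in segments for r in (0.9, 1.1)]
```

Because the augmented clips differ in length from the originals, they would need padding or cropping back to two seconds before being passed to a fixed-size model.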
If you have questions about the data or need access to them, please write to me!
- Processing-1D. For this task we developed three notebooks:
  - 1_AudioAcquisition: This notebook must be run locally. It uses the default microphone to automatically record all the audio clips needed for the project.
  - 2_AudioRecognition: This notebook contains all the ML and DL models developed to solve this task. It also contains the code used to split and augment the data, starting from the original five-second recordings.
  - 3_DemoLive: This notebook must be run locally. It uses the microphone and the camera to create a live demo that shows the effectiveness of the models developed for the voice recognition task.
- Processing-2D. Again we developed three notebooks with the same purposes, adapted for image processing:
  - 1_ImageAcquisition: This notebook must be run locally. As before, it automatically captures all the images needed, using the default camera.
  - 2_FaceRecognition: This notebook contains all the ML and DL models developed to solve this second task. In this folder you can also find a link to download the weights used for the VGGFace model.
  - 3_DemoLive: This notebook must be run locally. It uses the camera to create a live demo that shows the effectiveness of the models developed for the face recognition task.
- Retrieval. This folder contains a single notebook that implements all the code needed to solve the retrieval task. The dataset of VIP faces can be downloaded here.
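One common way to approach a face retrieval task like this is to compare feature vectors (embeddings) of the faces and return the closest matches. The sketch below assumes that each face has already been turned into a fixed-length embedding, for example by a pretrained network, and ranks a gallery by cosine similarity. The embedding size, the random data, and the function name `top_k_similar` are hypothetical; the notebook may use a different representation or distance.

```python
import numpy as np

def top_k_similar(query, gallery, k=10):
    """Return indices and cosine similarities of the k gallery rows closest to query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity with every gallery row
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]

# Placeholder data: 500 gallery embeddings of size 128 (both values assumed).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(500, 128))
# A query that is a slightly perturbed copy of gallery item 42.
query = gallery[42] + 0.01 * rng.normal(size=128)

indices, scores = top_k_similar(query, gallery)
```

With three photos per group member, one option would be to average the three embeddings before querying, or to query with each photo and merge the results.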
You can also find the report and the presentation prepared for the exam, both in Italian.
If you need the trained models that we implemented, feel free to write to me: their weights exceed GitHub's maximum allowed file size.
Unless otherwise specified in the notebook section, all code can be run on the Google Colaboratory platform. All notebooks are already set up to import the necessary packages, and this way you can easily use a GPU!
Unfortunately, the notebooks that perform the live demo and the automatic acquisition require a local environment because they need a camera and a microphone. For these notebooks you need to install all the packages listed in the requirements file found in each folder.
If you run into any problems, just contact me for further information!
Comparative results of the models on a test set created by subsampling the original dataset:
- Processing-1D: The first three models use MFCC features, while the last CNN model uses spectrogram images.

  | Architecture | Accuracy | Precision | Recall | F1-score |
  |---|---|---|---|---|
  | SVM | 0.83 | 0.83 | 0.83 | 0.83 |
  | RandomForest | 0.81 | 0.86 | 0.81 | 0.82 |
  | CNN | 0.97 | 0.97 | 0.97 | 0.97 |
  | CNN on spectrogram | 0.88 | 0.88 | 0.88 | 0.88 |

- Processing-2D:

  | Architecture | Accuracy | Precision | Recall | F1-score |
  |---|---|---|---|---|
  | VGG16 | 0.94 | 0.94 | 0.95 | 0.94 |
  | MobileNet-V2 | 0.98 | 0.98 | 0.98 | 0.98 |
  | VGGFace | 1.00 | 1.00 | 1.00 | 1.00 |
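For reference, the four metrics reported above can be computed from predicted and true labels as in the sketch below. The README does not state how the per-class scores were averaged, so this example assumes macro averaging (an unweighted mean over classes). The labels are toy values and the function name is hypothetical.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F1 from integer labels."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)

    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # samples predicted as each class
    actual = cm.sum(axis=1)      # samples that truly belong to each class

    # Per-class scores; a class with an empty denominator gets a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }

# Toy example with three classes (e.g. three group members).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
metrics = classification_metrics(y_true, y_pred, n_classes=3)
```

With weighted averaging instead, recall always equals accuracy, which is one way to check which convention a given table follows.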
[1] S. Bianco, “Dispense e slide del corso digital signal and image management,” 2021.
[2] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” 2015.