A website that uses webcam feeds to answer open-ended questions requiring outside knowledge. For more info, check out the ZenML blog post.
The visual question-answering pipeline is inspired by the Microsoft paper linked in the credit section. In short, we prompt GPT-3 with a generated image caption and object tag list, the question-answer pair, and context examples that demonstrate the task in a few-shot manner, achieving a BERTScore-computed F1 score of around 0.989 on the test set.
Because the image data is not fed directly to GPT-3, the best queries ask descriptive, counting, or similar questions about one or more objects visible in the background. For example, if there are two people in the image, one wearing a hat and the other wearing glasses, questions that would work well include the following:
- "How many people are in the room?"
- "What color is the hat in the picture?"
- "How many people are wearing glasses?"
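The prompting approach described above can be sketched as follows. This is a minimal illustration and not the repo's actual code: the names `Example`, `format_block`, and `build_prompt`, the `Context:`/`Q:`/`A:` layout, and the instruction line are assumptions, and the captions and tags are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One in-context example: a text description of an image plus a question-answer pair."""
    caption: str
    tags: List[str]
    question: str
    answer: str


def format_block(caption: str, tags: List[str], question: str, answer: str = "") -> str:
    # The image is represented purely as text: a caption followed by object tags.
    context = f"{caption} {', '.join(tags)}".strip()
    return f"Context: {context}\nQ: {question}\nA: {answer}".rstrip()


def build_prompt(examples: List[Example], caption: str, tags: List[str], question: str) -> str:
    # Hypothetical instruction line; the real wording may differ.
    header = "Answer each question using its context."
    shots = [format_block(e.caption, e.tags, e.question, e.answer) for e in examples]
    # The final block leaves the answer empty for the model to complete.
    query = format_block(caption, tags, question)
    return "\n\n".join([header, *shots, query])


examples = [
    Example(
        caption="A dog lying on a couch.",
        tags=["dog", "couch", "pillow"],
        question="What animal is on the couch?",
        answer="dog",
    )
]
prompt = build_prompt(
    examples,
    caption="Two people sitting in a room.",
    tags=["person", "person", "hat", "glasses"],
    question="How many people are wearing glasses?",
)
print(prompt)
```

Since the model only ever sees the caption and tag text, a question can be answered well only if the relevant objects show up in that text, which is why counting and descriptive questions tend to work best.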
To set up the production server for the website, we:
Create an AWS Lambda function for the backend:
. deploy/aws_login.sh
python deploy/aws_lambda.py
Implement continuous deployment by updating the AWS Lambda backend whenever a commit is pushed to the repo and the BERTScore-computed F1 score of the pipeline has improved:
. deploy/cont_deploy.sh
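The gating rule described above (update the backend only when the score improves) can be sketched as below. The function name `should_deploy`, the JSON file that remembers the last deployed score, and its location are all hypothetical; the actual logic lives in `deploy/cont_deploy.sh` and may differ.

```python
import json
import tempfile
from pathlib import Path


def should_deploy(new_f1: float, state_file: Path) -> bool:
    """Return True and record the score only if new_f1 beats the last deployed F1."""
    previous = -1.0  # sentinel: deploy if nothing has been recorded yet
    if state_file.exists():
        previous = json.loads(state_file.read_text())["f1"]
    if new_f1 > previous:
        state_file.write_text(json.dumps({"f1": new_f1}))
        return True
    return False


# Hypothetical location for the last deployed score.
state = Path(tempfile.mkdtemp()) / "last_deployed_f1.json"
first = should_deploy(0.985, state)   # no previous score recorded
second = should_deploy(0.980, state)  # worse than the recorded score
third = should_deploy(0.989, state)   # better than the recorded score
print(first, second, third)
```

Persisting the last deployed score, rather than comparing against the previous commit, keeps a run of regressions from gradually lowering the bar.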
To contribute, check out the guide.
Install conda if necessary:
# Install conda: https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation
# If on Windows, install Chocolatey: https://chocolatey.org/install. Then, run:
# choco install make
Create the conda environment locally:
cd admirer
make conda-update
conda activate admirer
make pip-tools
export PYTHONPATH=.
echo "export PYTHONPATH=.:$PYTHONPATH" >> ~/.bashrc
Install pre-commit:
pre-commit install
Sign up for an OpenAI account and get an API key here.
Populate a .env file with your key and the backend URL in the format of .env.template, and reactivate the environment.
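For reference, a `KEY=value` environment file of this kind can be read with a few lines of standard-library Python. This is only a sketch: the helper `load_env_file`, the variable names `OPENAI_API_KEY` and `BACKEND_URL`, and the values are placeholders, so use whatever names `.env.template` actually defines.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder names and values; the real keys come from .env.template.
sample = Path(tempfile.mkdtemp()) / ".env"
sample.write_text('# secrets\nOPENAI_API_KEY="sk-placeholder"\nBACKEND_URL=https://example.com/\n')

config = load_env_file(sample)
os.environ.update(config)  # make the values visible to the current process
print(sorted(config))
```

Keep the populated file out of version control, since it holds your API key.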
Sign up for a Weights and Biases account here and download the CLIP ONNX file locally:
wandb login
python ./training/stage_model.py --fetch --from_project admirer
(Optional) Sign up for an AWS account here and set up your AWS credentials locally, referring to this as needed:
aws configure
If the instructions aren't working for you, head to this Google Colab, make a copy of it, and run the cells there to get an environment set up.
The repo is separated into main folders that each describe a part of the ML-project lifecycle, some of which contain interactive notebooks, and supporting files and folders that store configurations and workflow scripts:
├── api_serverless # the backend handler code using AWS Lambda.
├── app_gradio # the frontend code using Gradio.
├── deploy # the AWS Lambda backend setup and continuous deployment code.
├── data_manage # the data management code using AWS S3 for training data and ZenML log storage, boto3 for data exploration, and ZenML + Great Expectations for data validation.
├── load_test # the load testing code using Locust.
├── monitoring # the model monitoring code using Gradio's flagging feature.
├── question_answer # the inference code.
├── tasks # the pipeline testing code.
├── training # the model development code using PyTorch, PyTorch Lightning, and Weights and Biases.
From the main directory, there are various ways to test the pipeline:
To start a W&B hyperparameter optimization sweep for the caption model (on one GPU):
. ./training/sweep/sweep.sh
CUDA_VISIBLE_DEVICES=0 wandb agent --project ${PROJECT} --entity ${ENTITY} ${SWEEP_ID}
To train the caption model (add --strategy ddp_find_unused_parameters_false for multi-GPU machines; takes ~7.5 hrs on an 8xA100 Lambda Labs instance):
python ./training/run_experiment.py \
  --data_class PICa --model_class ViT2GPT2 --gpus "-1" \
  --wandb --log_every_n_steps 25 --max_epochs 300 \
  --augment_data True --num_workers "$(nproc)" \
  --batch_size 2 --one_cycle_max_lr 0.01 --top_k 780 --top_p 0.65 --max_label_length 50
To test the caption model (best model can be downloaded from here):
python ./training/test_model.py \
  --data_class PICa --model_class ViT2GPT2 \
  --num_workers "$(nproc)" --load_checkpoint training/model.pth
To start the app locally (uncomment the code in PredictorBackend.__init__ and set use_url=False to use the local model instead of the API):
python app_gradio/app.py
To test the Gradio frontend by launching and pinging the frontend locally:
python -c "from app_gradio.tests.test_app import test_local_run; test_local_run()"
To test the caption model's ability to memorize a single batch:
. ./training/tests/test_memorize_caption.sh
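The idea behind this check is that a correctly wired model and training loop should be able to drive the loss on one fixed batch close to zero. The dependency-free sketch below illustrates it with a tiny linear model and plain gradient descent; it stands in for the real caption model and is not what the shell script runs. The function names, learning rate, and step count are chosen for the illustration only.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]


def mse(w: float, b: float, batch: Batch) -> float:
    """Mean squared error of y = w*x + b on the batch."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)


def memorize(batch: Batch, lr: float = 0.05, steps: int = 2000) -> Tuple[float, float, float]:
    """Train y = w*x + b on the same batch repeatedly; return (w, b, final loss)."""
    w, b = 0.0, 0.0
    n = len(batch)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mse(w, b, batch)


batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
initial_loss = mse(0.0, 0.0, batch)
w, b, final_loss = memorize(batch)
print(f"initial={initial_loss:.3f} final={final_loss:.2e}")
```

If the loss on a single batch refuses to fall, the problem is more likely in the data pipeline, loss, or optimizer setup than in model capacity.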
To run integration tests for the model pipeline:
. ./tasks/integration_test.sh
To run unit tests for the model pipeline:
. ./tasks/unit_test.sh
To test the whole model pipeline:
. ./tasks/test.sh
To lint your code:
pre-commit run --all-files
- GI4E for their database and Scale AI for their annotations.
- Facebook for their image segmentation model.
- NLP Connect for their base image caption model and Sachin Abeywardana for his fine-tuning code.
- OpenAI for their CLIP text and image encoder code and GPT-3 API.
- Microsoft for their visual question answering code.