This project implements action recognition in videos using Vision Transformers (ViTs), specifically the TimeSformer model. The model is fine-tuned on the HMDB dataset and achieves high accuracy in classifying human actions from video input.
- Vision Transformer-based video action recognition
- Pre-trained TimeSformer model fine-tuning
- Supports multiple action categories from HMDB dataset
- Achieves up to 98.40% top-1 accuracy
- Includes visualization tools for model analysis
Requirements:
- Python 3.x
- PyTorch
- Transformers library
- ffmpeg
- TensorBoard
- scikit-learn
- pandas
- numpy
Preprocessing Pipeline (sketched after this list):
- Resize frames to 224x224 pixels
- Convert images to video format using ffmpeg
- Organize videos into class-based directories
- Generate class_path.csv with file paths and labels
- Create label_mapping.json for class encoding
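A minimal sketch of this pipeline, assuming extracted frames live in one folder per clip under per-class directories and use zero-padded numeric names; the helper names and frame-naming pattern are illustrative, not the project's actual preprocess_data.py:

```python
# Sketch only: paths, frame naming, and helper names are assumptions.
import csv
import json
import subprocess
from pathlib import Path

def frames_to_video(frame_dir: Path, out_path: Path, fps: int = 30) -> None:
    """Resize frames to 224x224 and encode them into a video with ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-framerate", str(fps),
            "-i", str(frame_dir / "%05d.jpg"),   # assumes zero-padded frame names
            "-vf", "scale=224:224",
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            str(out_path),
        ],
        check=True,
    )

def preprocess(input_dir: Path, output_dir: Path) -> None:
    """Encode every clip, then write class_path.csv and label_mapping.json."""
    classes = sorted(d.name for d in input_dir.iterdir() if d.is_dir())
    label_mapping = {name: idx for idx, name in enumerate(classes)}
    rows = []
    for class_name in classes:
        (output_dir / class_name).mkdir(parents=True, exist_ok=True)
        for clip_dir in sorted((input_dir / class_name).iterdir()):
            out_path = output_dir / class_name / f"{clip_dir.name}.mp4"
            frames_to_video(clip_dir, out_path)
            rows.append({"path": str(out_path), "label": label_mapping[class_name]})
    with open(output_dir / "class_path.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)
    with open(output_dir / "label_mapping.json", "w") as f:
        json.dump(label_mapping, f, indent=2)
```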
Dataset Split (see the split sketch after this list):
- Training: 80%
- Validation: 10%
- Test: 10%
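A sketch of this 80/10/10 split with scikit-learn, assuming the class_path.csv produced by preprocessing has `path` and `label` columns; the output filenames are illustrative:

```python
# Sketch only: column names and output filenames are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("class_path.csv")

# Hold out 20% of the clips, stratified by label to keep class balance.
train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
# Split the holdout evenly into validation (10%) and test (10%).
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["label"], random_state=42
)

for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
    split.to_csv(f"{name}.csv", index=False)
```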
Model Architecture (see the loading sketch after this list):
- Base Model: TimeSformer
- Pre-trained on: Kinetics-400 dataset
- Features:
- Divided space-time attention
- Separate handling of spatial and temporal dimensions
- Transformer-based architecture
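One way to obtain this backbone is the TimeSformer implementation in the Hugging Face Transformers library. The sketch below assumes the public facebook/timesformer-base-finetuned-k400 checkpoint and the label_mapping.json from preprocessing; the exact checkpoint used in this project may differ:

```python
# Sketch only: checkpoint name and label_mapping.json path are assumptions.
import json
from transformers import AutoImageProcessor, TimesformerForVideoClassification

with open("label_mapping.json") as f:
    label_mapping = json.load(f)   # {"class_name": index, ...}

processor = AutoImageProcessor.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400",
    num_labels=len(label_mapping),   # replace the Kinetics-400 head with an HMDB head
    ignore_mismatched_sizes=True,    # the new classification head has a different shape
)
```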
Training Configuration (see the training sketch after this list):
- Batch Size: 8 (optimal)
- Learning Rate: 0.005
- Optimizer: SGD with momentum 0.9
- Weight Decay: 1e-4
- Dropout Rate: 0.5
- Loss Function: Cross-entropy
- Evaluation Frequency: Every 5 epochs
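These settings map onto a standard PyTorch training loop. A minimal sketch, assuming `model` is the TimeSformer loaded above and the data loaders yield (video, label) batches; the 0.5 dropout would be set in the model config rather than in this loop:

```python
# Sketch only: the model and data loaders are assumed to exist (see above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, val_loader: DataLoader,
          epochs: int = 20) -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4
    )

    for epoch in range(epochs):
        model.train()
        for videos, labels in train_loader:              # videos: (B, T, C, H, W)
            videos, labels = videos.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(pixel_values=videos).logits   # Hugging Face TimeSformer output
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()

        if (epoch + 1) % 5 == 0:                         # evaluate every 5 epochs
            model.eval()
            correct = total = 0
            with torch.no_grad():
                for videos, labels in val_loader:
                    videos, labels = videos.to(device), labels.to(device)
                    preds = model(pixel_values=videos).logits.argmax(dim=-1)
                    correct += (preds == labels).sum().item()
                    total += labels.numel()
            print(f"epoch {epoch + 1}: val top-1 = {correct / total:.4f}")
```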
Results:
- Top-1 Accuracy: 97.60% (batch size 4, 20 epochs)
- Top-5 Accuracy: 100.00%
Setup Environment:
pip install -r requirements.txt
Preprocess Dataset:
python preprocess_data.py --input_dir /path/to/hmdb --output_dir /path/to/output
Train Model:
python train.py --data_path /path/to/processed_data --batch_size 8 --epochs 20
Evaluate Model:
python evaluate.py --model_path /path/to/checkpoint --test_data /path/to/test_set
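A minimal sketch of how the top-1 / top-5 metrics could be computed on the test split, assuming a `test_loader` over the held-out clips and the fine-tuned model from above:

```python
# Sketch only: `model` and `test_loader` are assumed to exist.
import torch

@torch.no_grad()
def topk_accuracy(model, test_loader, device="cuda", ks=(1, 5)):
    """Return top-k accuracy for each k in `ks` over the test set."""
    model.eval().to(device)
    correct = {k: 0 for k in ks}
    total = 0
    for videos, labels in test_loader:
        videos, labels = videos.to(device), labels.to(device)
        logits = model(pixel_values=videos).logits
        topk = logits.topk(max(ks), dim=-1).indices        # (B, max_k) class indices
        for k in ks:
            correct[k] += (topk[:, :k] == labels.unsqueeze(-1)).any(dim=-1).sum().item()
        total += labels.numel()
    return {k: correct[k] / total for k in ks}
```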
- TensorBoard integration (see the logging sketch after this list) for:
- Training/validation loss curves
- Accuracy metrics
- Confusion matrices
- Prediction visualizations
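A minimal sketch of the underlying logging calls, assuming the metric values and validation predictions are collected in the training loop; tag names are illustrative:

```python
# Sketch only: metric values and prediction arrays are assumed to be collected elsewhere.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/timesformer_hmdb")

def log_epoch(epoch, train_loss, val_loss, val_top1, y_true, y_pred):
    """Write loss curves, accuracy, and a confusion-matrix image for one epoch."""
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/val", val_loss, epoch)
    writer.add_scalar("accuracy/val_top1", val_top1, epoch)

    fig, ax = plt.subplots(figsize=(10, 10))
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, ax=ax)
    writer.add_figure("confusion_matrix/val", fig, epoch)
    plt.close(fig)
```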
The model shows strong performance across various action categories:
- High accuracy in distinguishing similar actions
- Robust performance on unseen test data
- Consistent performance across different action types
Limitations:
- Training instability in loss curves
- Occasional confusion between similar actions (e.g., fencing and draw sword)
- Compute-intensive training process
Future Improvements (the first and third items are sketched after this list):
- Implementation of learning rate scheduling
- Enhanced regularization techniques
- Weighted loss functions for class imbalance
- Multi-stream approach integration
- Additional data augmentation methods
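As an illustration of the first and third items only (not something implemented in this repository), cosine learning-rate scheduling and a class-weighted cross-entropy could be wired into the existing PyTorch setup roughly as follows:

```python
# Sketch only: proposed improvements, not part of the current training code.
import torch
import torch.nn as nn

def make_scheduler(optimizer: torch.optim.Optimizer, epochs: int = 20):
    """Cosine annealing over the training schedule; call scheduler.step() once per epoch."""
    return torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

def make_weighted_loss(class_counts: torch.Tensor) -> nn.CrossEntropyLoss:
    """Weight each class inversely to its frequency so rare HMDB classes count more."""
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return nn.CrossEntropyLoss(weight=weights)
```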
Citation:
@inproceedings{bertasius2021space,
title={Is Space-Time Attention All You Need for Video Understanding?},
author={Gedas Bertasius and Heng Wang and Lorenzo Torresani},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2021}
}
Acknowledgments
Copyright 2024
- Project TA: Anindya Mondal
- Contributors:
- Mohamed Faheem Thanveer (6823682)
- Mahabaleshwar Poorvita (6801697)
- Shrinit Sanjiv Patil (6816353)
- Priyanka Kamila (6787345)
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.