English | 简体中文
PaddlePaddle training/validation code and pretrained models for Image Classification.
This implementation is part of PaddleViT project.
- Update (2021-11-05): Update ConvMLP models.
- Update (2021-11-04): Update ConvMixer models.
- Update (2021-11-03): Update ViP models.
- Update (2021-10-28): Add MobileViT model.
- Update (2021-10-28): Add FocalTransformer model.
- Update (2021-10-28): Add CycleMLP model.
- Update (2021-10-19): Add BEiT model.
- Update (2021-10-12): Update code for training from scratch in Swin Transformer.
- Update (2021-09-28): Add AMP training.
- Update (2021-09-27): Add more ported model weights.
- Update (2021-09-09): Add FF-Only, RepMLP models.
- Update (2021-08-25): Init readme uploaded.
The following links are provided for the code and detail usage of each model architecture:
- ViT
- DeiT
- Swin
- VOLO
- CSwin
- CaiT
- PVTv2
- Shuffle Transformer
- T2T-ViT
- CrossViT
- Focal Transformer
- BEiT
- MobileViT
- ViP
- MLP-Mixer
- ResMLP
- gMLP
- FF_Only
- RepMLP
- CycleMLP
- ConvMixer
- ConvMLP
This module is tested on Python3.6+, and PaddlePaddle 2.1.0+. Most dependencies are installed by PaddlePaddle installation. You only need to install the following packages:
pip install yacs pyyaml
Then download the github repo:
git clone https://github.com/BR-IDL/PaddleViT.git
cd PaddleViT/image_classification
Note: It is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors for PaddleViT training. For PaddlePaddle, please refer to this link for stable version installation and this link for develop version installation.
ImageNet2012 dataset is used in the following folder structure:
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
To use the model with pretrained weights, go to the specific subfolder, then download the .pdparam
weight file and change related file paths in the following python scripts. The model config files are located in ./configs/
.
Assume the downloaded weight file is stored in ./vit_base_patch16_224.pdparams
, to use the vit_base_patch16_224
model in python:
from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights, .pdparams is NOT needed
model_state_dict = paddle.load('./vit_base_patch16_224')
model.set_dict(model_state_dict)
🤖 See the README file in each model folder for detailed usages.
PaddleViT image classification module is developed in separate folders for each model with similar structure. Each implementation is around 3 type of classes and 2 types of scripts:
-
Model classes such as transformer.py, in which the core transformer model and related methods are defined.
-
Dataset classes such as dataset.py, in which the dataset, dataloader, data transforms are defined. We provided flexible implementations for you to customize the data loading scheme. Both single GPU and multi-GPU loading are supported.
-
Config classes such as config.py, in which the model and training/validation configurations are defined. Usually, you don't need to change the items in the configuration, we provide updating configs by python
arguments
or.yaml
config file. You can see here for details of our configuration design and usage. -
main scripts such as main_single_gpu.py, in which the whole training/validation procedures are defined. The major steps of training or validation are provided, such as logging, loading/saving models, finetuning, etc. Multi-GPU is also supported and implemented in separate python script
main_multi_gpu.py
. -
run scripts such as run_eval_base_224.sh, in which the shell command for running python script with specific configs and arguments are defined.
PaddleViT now provides the following transfomer based models:
- ViT (from Google), released with paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
- DeiT (from Facebook and Sorbonne), released with paper Training data-efficient image transformers & distillation through attention, by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
- Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
- VOLO (from Sea AI Lab and NUS), released with paper VOLO: Vision Outlooker for Visual Recognition, by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan.
- CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.
- CaiT (from Facebook and Sorbonne), released with paper Going deeper with Image Transformers, by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou.
- PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
- Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
- T2T-ViT (from NUS and YITU), released with paper Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan.
- CrossViT (from IBM), released with paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, by Chun-Fu Chen, Quanfu Fan, Rameswar Panda.
- BEiT (from Microsoft Research), released with paper BEiT: BERT Pre-Training of Image Transformers, by Hangbo Bao, Li Dong, Furu Wei.
- Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
- Mobile-ViT (from Apple), released with paper MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, by Sachin Mehta, Mohammad Rastegari.
- ViP (from Oxford/ByteDance), released with Visual Parser: Representing Part-whole Hierarchies with Transformers, by Shuyang Sun, Xiaoyu Yue, Song Bai, Philip Torr.
PaddleViT now provides the following MLP based models:
- MLP-Mixer (from Google), released with paper MLP-Mixer: An all-MLP Architecture for Vision, by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
- ResMLP (from Facebook/Sorbonne/Inria/Valeo), released with paper ResMLP: Feedforward networks for image classification with data-efficient training, by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou.
- gMLP (from Google), released with paper Pay Attention to MLPs, by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le.
- FF Only (from Oxford), released with paper Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Luke Melas-Kyriazi.
- RepMLP (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding.
- CycleMLP (from HKU/SenseTime), released with paper CycleMLP: A MLP-like Architecture for Dense Prediction, by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo.
- ConvMixer (from Anonymous), released with Patches Are All You Need?, by Anonymous.
- ConvMLP (from UO/UIUC/PAIR), released with ConvMLP: Hierarchical Convolutional MLPs for Vision, by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi.
- HaloNet, (from Google), released with paper Scaling Local Self-Attention for Parameter Efficient Visual Backbones, by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.
- XCiT (from Facebook/Inria/Sorbonne), released with paper XCiT: Cross-Covariance Image Transformers, by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou.
- CvT (from McGill/Microsoft), released with paper CvT: Introducing Convolutions to Vision Transformers, by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang
- PiT (from NAVER/Sogan University), released with paper Rethinking Spatial Dimensions of Vision Transformers, by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh.
- HVT (from Monash University), released with paper Scalable Vision Transformers with Hierarchical Pooling, by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai.
- DynamicViT (from Tsinghua/UCLA/UW), released with paper DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh.
If you have any questions, please create an issue on our Github.