
May you release the code to finetune on ADE20K? #1

Open
Wallace-222 opened this issue Oct 14, 2022 · 7 comments

Comments

@Wallace-222

Hello authors, this is really inspiring work, and it is very kind of you to release the code at the same time. Could you please also release the code for fine-tuning on ADE20K? Your paper states that these experiments simply follow MAE, but I am unable to find the corresponding code in the official MAE repository. Thanks a lot for your attention.
Best wishes.

@ronghanghu
Contributor

Hi @Wallace-222, sorry that we don't have a code release for the ADE20K segmentation model yet. For this experiment, we follow the implementation in the MAE paper. I think one can adapt it from the Swin-Transformer repo (https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation) and replace the backbone with ViT for this experiment.
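
For anyone following this route, here is a rough sketch of what that backbone swap might look like in an mmsegmentation-style config (the Swin segmentation repo is built on mmsegmentation). The `VisionTransformer` / `UPerHead` field names below are assumptions based on mmsegmentation's stock components, not code from this repository, and may differ across versions.

```python
# Rough sketch of an mmsegmentation-style ADE20K config with the Swin backbone
# swapped for a plain ViT; field names and values are illustrative assumptions,
# not the authors' actual configuration.
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='VisionTransformer',   # instead of the Swin backbone
        img_size=(512, 512),
        patch_size=16,
        embed_dims=768,             # ViT-B
        num_layers=12,
        num_heads=12,
        out_indices=(3, 5, 7, 11),  # intermediate features fed to the head
        drop_path_rate=0.1,
        # hypothetical checkpoint path for the pre-trained ViT weights
        init_cfg=dict(type='Pretrained', checkpoint='path/to/pretrained_vit.pth'),
    ),
    decode_head=dict(
        type='UPerHead',
        in_channels=[768, 768, 768, 768],
        in_index=[0, 1, 2, 3],
        channels=512,
        num_classes=150,            # ADE20K classes
    ),
)
```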

@youngwanLEE

@ronghanghu Hi,
In the original MAE paper and its official code, the hyper-parameter information for ADE20K (e.g., learning rate, weight decay) is not provided.
It would be helpful to the community if you could share these hyper-parameters :).

@ronghanghu
Contributor

Hi @youngwanLEE, we follow the same settings as BEiT, MAE, and ConvNeXt for the ADE20K experiments and sweep the hyperparameters. Our final hyperparameters are as follows:

  • For ViT-B, we train for 160K iterations on ADE20K with batch size 16, learning rate 8e-5, layer-wise lr decay 0.8, drop path 0.1, and input image size 512 × 512.
  • For ViT-L, we train for 160K iterations on ADE20K with batch size 16, learning rate 8e-5, layer-wise lr decay 0.9, drop path 0.2, and input image size 512 × 512.

Hope these are helpful!
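
As a concrete illustration of the layer-wise lr decay mentioned above, here is a minimal PyTorch sketch of BEiT/MAE-style per-layer learning-rate scaling. The parameter-naming convention (`blocks.<i>`, `patch_embed`, `pos_embed`, `cls_token`) and the helper itself are assumptions for a timm/MAE-style ViT, not the authors' actual fine-tuning code.

```python
import torch

def layerwise_lr_param_groups(model, base_lr=8e-5, layer_decay=0.8, num_layers=12):
    # Hypothetical helper: scale each block's lr by layer_decay ** (depth from the top),
    # in the spirit of the BEiT/MAE-style layer-wise lr decay referenced above.
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith('blocks.'):                        # transformer block i
            layer_id = int(name.split('.')[1]) + 1
        elif name.startswith(('patch_embed', 'pos_embed', 'cls_token')):
            layer_id = 0                                       # embeddings decay the most
        else:
            layer_id = num_layers + 1                          # e.g. the segmentation head
        scale = layer_decay ** (num_layers + 1 - layer_id)
        group = groups.setdefault(layer_id, {'params': [], 'lr': base_lr * scale})
        group['params'].append(param)
    return list(groups.values())

# ViT-B setting from this thread: lr 8e-5, layer decay 0.8 (0.9 and drop path 0.2 for ViT-L).
# Weight decay is not specified in this thread, so it is left at the optimizer default here.
# optimizer = torch.optim.AdamW(layerwise_lr_param_groups(vit_b, 8e-5, 0.8, 12))
```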

@youngwanLEE

@ronghanghu
This will be very helpful to the community :)

Many thanks!

@ggjy

ggjy commented Dec 29, 2022

@ronghanghu Hi Ronghang, thanks for sharing this great work. Table 2(c) of the main paper indicates that models pre-trained with different patch sizes (e.g., 8, 16, 24) are fine-tuned on COCO/ADE20K under the same transfer input size, so do I understand correctly that these three models have different GPU memory usage and different FLOPs but can still achieve similar results?

@ronghanghu
Contributor

Hi @ggjy, during COCO (and similarly ADE20K) fine-tuning, all three pre-trained models in Table 2(c) are fine-tuned with the same ViT patch size of 16 and the same image size of 1024, so they have the same GPU memory usage and FLOPs during fine-tuning. This is mentioned in "Setups" in Sec. 4.1 (page 4, right column):

Notably, the Mask R-CNN detection backbone always uses a ViT with patch size 16 and image size 1024 × 1024, and hence a fixed sequence length of L = 4096 during detection fine-tuning for all pre-trained models. When there is a mismatch in sizes, the ViT position embeddings are bicubic-interpolated to L = 4096 following the practice in [30]. The same is applied to patch embedding layers, where the weights are treated as 2D convolution filters and bicubic-interpolated when needed [29].
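
As a small illustration of the bicubic interpolation described in that quoted passage, below is a hedged PyTorch sketch of resizing ViT position embeddings from the pre-training grid to the fine-tuning grid. The function name and the (1, 1 + H*W, C) layout with a leading class token are assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid_size):
    # pos_embed: (1, 1 + old_h * old_w, C) with a leading class token (assumed layout).
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old = int(patch_pos.shape[1] ** 0.5)                  # e.g. 14 for 224 input / patch 16
    c = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old, old, c).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid_size,
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, c)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. adapting a ViT pre-trained at 224 × 224 (14 × 14 grid) to 1024 × 1024 detection
# inputs (64 × 64 grid, i.e. sequence length L = 4096):
# new_pos = interpolate_pos_embed(old_pos, (64, 64))
```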

@ggjy

ggjy commented Dec 29, 2022

Got it! Thanks very much for your quick reply.
