
Results on YouTube #3

Open
noUmbrella opened this issue Dec 23, 2019 · 16 comments

Comments

@noUmbrella

Hi, I tested your released code and model on YouTube-VOS, but I can't get the accuracy reported in the paper. Did you test this code on YouTube-VOS?

@seoungwugoh
Owner

seoungwugoh commented Dec 24, 2019

The checkpoint in this repo is different from the one used for YouTube-VOS evaluation.
For YouTube-VOS evaluation, we did not use DAVIS videos for training.
(This gives us a minor improvement.)

However, the provided checkpoint should still give numbers similar to those reported in our paper, with minor degradation (an Overall score about 1-2 points lower).
What were your results?

For YouTube-VOS, there are some differences compared to DAVIS:

  1. Some objects start to appear in the middle of the video. In that case, we overwrite the current mask with the new objects.
  2. While the evaluation server takes results computed every 5 frames, we use all the frames for estimation.
    We first estimate masks for all the frames, then sample the frames to submit from there.
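The two points above could be sketched as follows. This is a minimal NumPy illustration, not code from this repo; the function names and mask layout (integer object IDs per pixel) are assumptions:

```python
import numpy as np

def overwrite_with_new_objects(pred_mask, gt_mask, new_ids):
    """Point 1: where the ground truth introduces a new object,
    overwrite the predicted mask with the new object's ID; all
    other pixels keep their predicted labels."""
    out = pred_mask.copy()
    for obj_id in new_ids:
        out[gt_mask == obj_id] = obj_id
    return out

def sample_for_submission(all_masks, step=5):
    """Point 2: estimate masks for every frame, then keep only
    every `step`-th frame for submission to the server."""
    return all_masks[::step]
```
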

@noUmbrella
Author

Great! It surprised me that using DAVIS videos for training degrades the performance on YouTube-VOS. Thank you for sharing. I will retest it following points 1 and 2 you mentioned. Thanks.

@sourabhswain

@seoungwugoh Can you please tell me how to test the pretrained model on YouTube-VOS? I tried to use the YouTube-VOS dataset instead of DAVIS17; however, I seem to get empty masks as output.

@seoungwugoh
Owner

seoungwugoh commented Jan 9, 2020

Getting an empty mask is likely due to a bug in the code.

@siyueyu

siyueyu commented Jan 9, 2020

@seoungwugoh In the case where some objects start to appear in the middle of the video and the current mask is overwritten with the new objects, will the overwritten mask include the old objects?

@seoungwugoh
Owner

@siyueyu Yes, we overwrite the pixels belonging to the new object. Other pixels remain the same.

@sourabhswain

@seoungwugoh I can get the correct masks as predictions now; however, I keep getting an out-of-memory error when I test on YouTube-VOS. I am using all the validation frames instead of every 5 frames. The GPU I am using is a GTX 1080. Do you recommend any particular configuration for YouTube-VOS? I even played with the mem_every parameter, but I still get out-of-memory issues.

@seoungwugoh
Owner

seoungwugoh commented Jan 21, 2020

For YouTube-VOS, some videos are quite long (>150 frames), which often causes OOM. GPU memory is mostly consumed by a large matrix inner product during memory reading. We used a V100 GPU with 16 GB of memory, and setting a larger mem_every parameter for some videos works well. To drastically reduce memory consumption, you can consider using no intermediate memory frames (infinite mem_every). Another, more extreme solution is to move that inner-product computation to the CPU if you can afford the additional computation time.
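A rough sketch of the memory-saving idea: chunking the large inner product of memory reading so the full affinity matrix never materializes at once. The shapes and names here are assumptions for illustration, not the repo's actual implementation:

```python
import torch

def memory_read_lowmem(mem_key, query_key, chunk=1024):
    # mem_key:   (C, T*H*W)  flattened memory keys
    # query_key: (C, H*W)    query-frame keys
    # The full affinity matrix (T*H*W, H*W) can be huge for long
    # videos; computing it in column chunks caps peak GPU memory.
    # Moving both tensors to CPU first is the more extreme option.
    outs = []
    for start in range(0, query_key.shape[1], chunk):
        q = query_key[:, start:start + chunk]
        outs.append(torch.softmax(mem_key.t() @ q, dim=0))
    return torch.cat(outs, dim=1)
```

The softmax-normalized affinity would then be used to weight the memory values, as in the paper's memory-read operation.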

@sourabhswain

@seoungwugoh Thanks for the suggestion. I ran it without any intermediate memory frames and could obtain results. However, I see that it doesn't consider the masks of objects that start to appear after the first frame; I get no predictions for those objects. Looking at your suggestion above in this thread, you mention that "Some objects start to appear in the middle of video. In that case, we overwrite current mask with the new objects." I already modified dataset.py. Is this already implemented in the uploaded code? If not, can you point out where we need to incorporate those changes? Thank you.

@sourabhswain

@seoungwugoh Also, to add to what I mentioned above, I get a score of 69.4 (compared to 78.4 in the paper) on the YouTube-VOS validation set using the pre-trained model. Since I used no intermediate memory frames, I guess by default it uses only the first and the previous frame.

@npmhung

npmhung commented Jan 30, 2020

@seoungwugoh Hi, I'm trying to fine-tune your model.
In the paper, you state that batch norm is turned off for all experiments.
Just to be clear, do you turn off batch norm only during the main training stage with videos, or also during pre-training with images?

@seoungwugoh
Owner

@sourabhswain The code in this repository does not contain functionality for evaluating on YouTube-VOS. You will have to implement it yourself, but it should not be too difficult. To get a number similar to the paper's, you should estimate masks for objects that start to appear in the middle of the video.

@npmhung We turned off BatchNorm for both pre-training and main training. In other words, we use the mean and variance learned from ImageNet. This can be done simply by setting model.eval() during training.
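The model.eval() trick above can also be applied selectively, keeping only the BatchNorm layers in eval mode while the rest of the network trains normally. A small PyTorch sketch; `freeze_bn` is a hypothetical helper, not part of the repo:

```python
import torch.nn as nn

def freeze_bn(model):
    # Put only BatchNorm layers into eval mode so they keep using
    # the running mean/var learned during ImageNet pre-training,
    # while every other layer stays in training mode.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
```

Calling `model.train()` followed by `freeze_bn(model)` at the start of each epoch keeps the running statistics fixed; note that `model.train()` alone would re-enable them.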

@hkchengrex

@seoungwugoh Is it possible for you to also provide the checkpoint used for Youtube-VOS evaluation (I'm ok without the code)? Thanks a lot!

@sourabhswain

@seoungwugoh I made the changes specific to YouTube-VOS and now I get a score of 74.17. It's still a bit off from the score reported in the paper (78.4). Could it be due to the different pretrained model you uploaded here, or do you use different hyperparameters for YouTube-VOS?

@seoungwugoh
Owner

@sourabhswain It would be due to the different weights. The number in the paper (78.4) was measured using the weights trained for YouTube-VOS. Unfortunately, we have no plans to upload the YouTube-VOS testing weights.

@chenz97

chenz97 commented Apr 14, 2020

Hi @seoungwugoh, you mentioned that when objects start to appear in the middle of the video, you overwrite the current mask. So only the previous mask is affected, and the first-frame mask remains unchanged. However, the objects that appear later cannot refer to the first frame for their GT mask (since the "first frame" for them is not the first frame of the video). Can this hurt the performance, or do you have any workaround for this? Thank you!

7 participants