Thank you for releasing the code for this project.
I'm having trouble understanding how the DSG is integrated into the generation process when I compare the code against the paper, so I'd like to ask for a few clarifications.
Step 2 states: "This step uses gold DSG of video for the updating of recurrent graph Transformer in 3D-UNet."
However, the corresponding config file appears to disable the DSG conditioning via use_temporal_transformer: False. Is this step therefore still a T2V-only pretraining step?
It would also help to have more explanation of how to "parse the DSG annotations in advance with the tools in dysen/DSG", since that original code is designed for images, not videos. Alternatively, could you provide pre-parsed representations for one of the video datasets used?
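To make the question concrete, here is a minimal sketch of the preprocessing loop I was expecting to write, assuming the image-oriented parser in dysen/DSG can simply be wrapped per frame caption. The function names and the output shape below are hypothetical placeholders I made up, not the actual API:

```python
from typing import Dict, List

def parse_caption_to_dsg(caption: str) -> Dict:
    """Hypothetical wrapper around the image-level parser in dysen/DSG."""
    # Placeholder output: objects as nodes, relations/attributes as edges.
    return {"caption": caption, "nodes": [], "edges": []}

def build_video_dsg(frame_captions: List[str]) -> List[Dict]:
    """One scene graph per (key)frame, i.e. the sequence I assume the
    recurrent graph Transformer in the 3D-UNet consumes."""
    return [parse_caption_to_dsg(c) for c in frame_captions]

if __name__ == "__main__":
    print(build_video_dsg(["a dog runs on the beach",
                           "the dog jumps into the water"]))
```

Is this roughly the intended procedure, or is the temporal dimension handled differently during parsing?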
I may have misunderstood the process, but even in bash shellscripts/run_sample_vdm_text2video.sh I cannot find the step that turns the textual representation into the graph representation. Is that the script used to generate the data that is then passed to shellscripts/run_eval_dysen_vdm.sh?
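For reference, this is the overall order I assumed the inference pipeline follows. The first entry point is purely my own guess (it is exactly the step I cannot locate), while the last two scripts are the ones shipped in shellscripts/:

```python
import subprocess

# My assumed end-to-end order; the first command is hypothetical (the missing
# text-to-DSG step), the last two are the scripts that exist in the repo.
steps = [
    "python dysen/DSG/parse_dsg.py --input captions.txt --output dsgs.json",  # hypothetical
    "bash shellscripts/run_sample_vdm_text2video.sh",
    "bash shellscripts/run_eval_dysen_vdm.sh",
]
for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)
```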
Thank you in advance!