This is the source code for The MSR-Video to Text dataset with clean annotations. We found that the MSR-VTT dataset contains many noisy annotations. After analyzing the data carefully, we put effort into cleaning the annotations. We retrained several models on the cleaned dataset and found that the experimental results improved compared to the same models trained on the original dataset.
- Python 3.8
- Jupyter Notebook
- Hunspell
clean_process is the folder for cleaning the MSR-VTT dataset.
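One kind of noise the paper reports is duplicate captions. As a rough illustration only, and not the code from the notebook in this repository, duplicate removal could be sketched as follows. The helper names `normalize` and `drop_duplicate_captions` are hypothetical, and the `video_id` and `caption` fields are assumed to match the layout of the annotation entries.

```python
import re
from collections import defaultdict


def normalize(caption):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", caption.lower())
    return " ".join(text.split())


def drop_duplicate_captions(sentences):
    """Keep the first occurrence of each normalized caption per video.

    `sentences` is assumed to be a list of dicts with "video_id" and
    "caption" keys (an assumption about the annotation layout).
    """
    seen = defaultdict(set)  # video_id -> normalized captions already kept
    kept = []
    for entry in sentences:
        key = normalize(entry["caption"])
        if key and key not in seen[entry["video_id"]]:
            seen[entry["video_id"]].add(key)
            kept.append(entry)
    return kept


# Small made-up sample, not taken from the dataset.
sample = [
    {"video_id": "video0", "caption": "A man is singing."},
    {"video_id": "video0", "caption": "a man is singing"},
    {"video_id": "video0", "caption": "a man sings on stage"},
    {"video_id": "video1", "caption": "a man is singing"},
]
cleaned = drop_duplicate_captions(sample)
print(len(cleaned))
```

Captions are compared per video, so the same sentence attached to two different clips is kept for both. Spelling and grammar checks, for which this repository uses Hunspell, are not covered by this sketch.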
- Run the Jupyter Notebook in the clean_process folder.
- Replace the input for the models in Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling Strategy and Delving Deeper into the Decoder for Video Captioning with the cleaned dataset.
- Train the new models.
- Cleaned dataset: GoogleDrive
- The paper on arXiv: The MSR-Video to Text dataset with clean annotations
- The published paper: The MSR-Video to Text dataset with clean annotations
- Dictionary download for Hunspell: dictionaries
Haoran Chen, Jianmin Li, Simone Frintrop, Xiaolin Hu,
The MSR-Video to Text dataset with clean annotations,
Computer Vision and Image Understanding,
Volume 225,
2022,
103581,
ISSN 1077-3142,
https://doi.org/10.1016/j.cviu.2022.103581.
(https://www.sciencedirect.com/science/article/pii/S107731422200159X)
Abstract: Video captioning automatically generates short descriptions of the video content, usually in form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark dataset for testing the performance of the methods. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset are quite noisy, e.g., there are many duplicate captions and many captions contain grammatical problems. These problems may pose difficulties to video captioning models for learning underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performances of the models measured by popular quantitative metrics. We recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
Keywords: MSR-VTT dataset; Data cleaning; Data analysis; Video captioning