The MSR-Video to Text dataset with clean annotations

This repository contains the source code for "The MSR-Video to Text dataset with clean annotations". We found that the MSR-VTT dataset contains many noisy annotations. After analyzing the data carefully, we cleaned the annotations, retrained several models on the cleaned dataset, and found that their results improved over the same models trained on the original annotations.

Requirements

Python 3.8
Jupyter Notebook
Hunspell

Information

clean_process is the folder containing the code for cleaning the MSR-VTT dataset.
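
As a rough illustration of the kind of cleaning described in the abstract (removing duplicate captions and flagging spelling problems), here is a minimal Python sketch. It is not the code from the clean_process notebooks. The record layout (dictionaries with `video_id` and `caption` keys), the helper names, and the toy word set standing in for a Hunspell dictionary are all assumptions made for this example.

```python
import re
from collections import defaultdict


def normalize(caption):
    """Lowercase, trim, and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", caption.strip().lower())


def drop_duplicate_captions(sentences):
    """Keep the first occurrence of each normalized caption per video.

    `sentences` is assumed to be a list of dicts with "video_id" and
    "caption" keys (hypothetical layout chosen for this sketch).
    """
    seen = defaultdict(set)
    kept = []
    for item in sentences:
        key = normalize(item["caption"])
        if key and key not in seen[item["video_id"]]:
            seen[item["video_id"]].add(key)
            kept.append(item)
    return kept


def unknown_words(caption, is_known):
    """Return the words in a caption that the `is_known` callable rejects.

    `is_known` is any function mapping a word to True/False, for example
    the lookup method of a spell checker (assumption: a toy set is used here).
    """
    words = re.findall(r"[a-z']+", normalize(caption))
    return [w for w in words if not is_known(w)]


# Toy data: the second caption duplicates the first after normalization.
sentences = [
    {"video_id": "video0", "caption": "A man is singing"},
    {"video_id": "video0", "caption": "a man is  singing "},
    {"video_id": "video1", "caption": "A man is singing"},
    {"video_id": "video0", "caption": "a man is sining"},
]
cleaned = drop_duplicate_captions(sentences)

# A tiny in-memory word set stands in for a real Hunspell dictionary.
toy_dictionary = {"a", "man", "is", "singing"}
flagged = unknown_words("a man is sining", toy_dictionary.__contains__)
```

Duplicates are detected per video rather than globally, since the same caption can legitimately describe two different clips. In practice the `is_known` callable could wrap a Hunspell binding loaded with the dictionaries linked below, instead of the toy set.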

Reproduction of Results

Run the Jupyter Notebook in the clean_process folder.
Replace the input for the models in Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling Strategy and Delving Deeper into the Decoder for Video Captioning with the cleaned dataset.
Train the new models.

Links

Cleaned dataset: GoogleDrive
The paper on Arxiv: The MSR-Video to Text dataset with clean annotations
The published paper: The MSR-Video to Text dataset with clean annotations
Dictionary download for Hunspell: dictionaries

Citation

Haoran Chen, Jianmin Li, Simone Frintrop, Xiaolin Hu,
The MSR-Video to Text dataset with clean annotations,
Computer Vision and Image Understanding,
Volume 225,
2022,
103581,
ISSN 1077-3142,
https://doi.org/10.1016/j.cviu.2022.103581.
(https://www.sciencedirect.com/science/article/pii/S107731422200159X)
Abstract: Video captioning automatically generates short descriptions of the video content, usually in form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark dataset for testing the performance of the methods. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset are quite noisy, e.g., there are many duplicate captions and many captions contain grammatical problems. These problems may pose difficulties to video captioning models for learning underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performances of the models measured by popular quantitative metrics. We recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
Keywords: MSR-VTT dataset; Data cleaning; Data analysis; Video captioning

Repository: WingsBrokenAngel/MSR-VTT-DataCleaning
