- Clone this repository: `git clone https://github.com/lukewys/dcase_2020_T6.git`
- Install PyTorch >= 1.4.0.
- Use pip to install the remaining dependencies: `pip install -r requirements.txt`
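As an optional sanity check (not part of the repository), you can confirm that a compatible PyTorch build is installed:

```sh
python -c "import torch; print(torch.__version__)"  # should print 1.4.0 or later
```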
- Download the required dataset of the DCASE 2020 Automated Audio Captioning task.
- Enter the `create_dataset` directory and put the Clotho dataset into the `data` folder, structured like this:

```
data
├── clotho_csv_files
│   ├── clotho_captions_development.csv
│   ├── clotho_captions_evaluation.csv
│   ├── clotho_metadata_development.csv
│   └── clotho_metadata_evaluation.csv
├── clotho_audio_files
│   ├── development
│   └── evaluation
└── clotho_test_audio
    ├── test_0001.wav
    ├── test_0002.wav
    ├── ...
    └── test_1043.wav
```
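Before running the preprocessing scripts, you can verify this layout with a short check (an illustrative sketch, not a script shipped with the repo):

```python
# Optional sanity check for the expected Clotho data layout.
# Illustrative only; not part of this repository.
from pathlib import Path

data = Path("create_dataset/data")
expected = [
    "clotho_csv_files/clotho_captions_development.csv",
    "clotho_csv_files/clotho_captions_evaluation.csv",
    "clotho_csv_files/clotho_metadata_development.csv",
    "clotho_csv_files/clotho_metadata_evaluation.csv",
    "clotho_audio_files/development",
    "clotho_audio_files/evaluation",
    "clotho_test_audio",
]
for rel in expected:
    status = "OK" if (data / rel).exists() else "MISSING"
    print(f"{status:8s} {data / rel}")
```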
- Enter the `create_dataset` directory.
- Run `main.py` to create the dataset and extract features under the current directory. The result folders will be `data_splits`, `data_splits_audio` and `pickles`. (The progress bar does not behave correctly; please wait until the program finishes.)
- Then run `create_test_set.py` to create the test set under the current directory. The result folder will be `test_data`. (Both steps are collected in the snippet below.)
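Assuming both scripts run without extra arguments (as the steps above suggest), the whole preprocessing stage from the repository root looks like:

```sh
cd create_dataset
python main.py             # creates data_splits, data_splits_audio and pickles
python create_test_set.py  # creates test_data
```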
Please set up coco caption and download the files it needs according to https://github.com/audio-captioning/caption-evaluation-tools (~300 MB of files will be downloaded).
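One possible way to fetch the tools (the authoritative setup steps are in that repository's README):

```sh
# Clone the evaluation tools, then follow their README to download
# the ~300 MB of supporting files.
git clone https://github.com/audio-captioning/caption-evaluation-tools.git
```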
- Enter the `audio_tag` directory.
- First, run `generate_word_list.py` to generate the word list `word_list_pretrain_rules.p`; then run `generate_tag.py` to generate the tag labels `audioTagNum_development_fin.pickle` and `audioTagNum_evaluation_fin.pickle`.
- Back in the project directory, run `Tag_train.py` to train the model. The model is saved in `./models/tag_models`. (The full sequence is shown below.)
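Assuming the scripts take no arguments, the full tag pre-training sequence from the repository root is:

```sh
cd audio_tag
python generate_word_list.py   # writes word_list_pretrain_rules.p
python generate_tag.py         # writes audioTagNum_development_fin.pickle
                               # and audioTagNum_evaluation_fin.pickle
cd ..
python Tag_train.py            # saves the model under ./models/tag_models
```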
- Enter the `word_embedding_pretrain` directory and run `word2vec.py` to pre-train the word embeddings.
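For example (assuming the script takes no arguments):

```sh
cd word_embedding_pretrain
python word2vec.py
```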
- The file `hparams.py` holds the default arguments for our project. You can modify these parameters according to your own needs.
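For illustration, the kind of settings involved might look like this (a hypothetical excerpt; check `hparams.py` for the actual names and defaults — only `mode` and `name` are referenced elsewhere in this README):

```python
# Hypothetical excerpt of hparams.py; the real file defines the
# authoritative names and defaults.
mode = 'train'   # one of 'train', 'eval' or 'test' (see the steps below)
name = 'base'    # run name; outputs are written to models/{name}
```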
- You can use our pretrained model in `models/best.pt` to reproduce our results.
Set `mode=train` in `hparams.py`. Then you can run the project using the provided script:

```sh
# You can modify the running script as needed
python run.py
```

The trained models and logs will be saved in `models/{hparams.name}`.
Set `mode=eval` in `hparams.py`. Then choose the model you want to evaluate and run `python train.py`. The system will score the model on the evaluation dataset.
Set `mode=test` in `hparams.py`. Then choose the model you want to test and run `python train.py`. The system will generate the corresponding captions for the test dataset.
If you use our code, please kindly cite the following:
```bibtex
@techreport{wuyusong2020_t6,
    author      = "Wu, Yusong and Chen, Kun and Wang, Ziyue and Zhang, Xuan and Nian, Fudong and Li, Shengchen and Shao, Xi",
    title       = "Audio Captioning Based on Transformer and Pre-Training for 2020 {DCASE} Audio Captioning Challenge",
    institution = "DCASE2020 Challenge",
    year        = "2020",
    month       = "June",
    abstract    = "This report proposes an automated audio captioning model for the 2020 DCASE audio captioning challenge. In this challenge, a model is required to be trained from scratch to generate natural language descriptions of a given audio signal. However, as limited data available and restrictions on using pre-trained models trained by external data, training directly from scratch can result in poor performance where acoustic events and language are poorly modeled. For better acoustic event and language modeling, a sequence-to-sequence model is proposed which consists of a CNN encoder and a Transformer decoder. In the proposed model, the encoder and word embedding are firstly pre-trained. Regulations and data augmentations are applied during training, while fine-tuning is applied after training. Experiments show that the proposed model can achieve a SPIDEr score of 0.227 on audio captioning performance."
}
```
- How to get the results and scores
- How to load the trained weights and reproduce the exact results submitted to the challenge
- How to get all the scores in the ablation test