Post your evaluation score #20
Hi @intuinno, what hyperparameters did you pick? I tried to run this on the COCO data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours. Bleu 1: .527
Hi @intuinno, @snakeztc, @yaxingwang, my Flickr8k scores with the parameters in capgen.train() are:
@yaxingwang
Hi @AAmmy, I also get results similar to yours, but the BLEU-1 you get is better than mine (0.30). Did you do normalization on the dataset? After doing it, I get worse results. I used metrics.py to get the results.
@yaxingwang
The token file and the train, valid, test splits are the same as the files in Flickr8k_text.zip.
Yes, we are using the same files. I did it, since the results were poor.
@yaxingwang I think intuinno's evaluate_flickr8k.py parameters are for COCO and Flickr30k; the parameters for Flickr8k and those for Flickr30k/COCO are not the same (Section 5.2 in the paper). I think the parameters in the original capgen.py file are for Flickr8k (I used these, and training stopped around epoch 70). I also ran Flickr8k training with the parameters for COCO, and I changed patience and some other parameters to check for overfitting.
@AAmmy, thank you. I am trying both Flickr30k and COCO, but I guess my computer's memory is too small to process Flickr30k, so I am still working on it. Did you hit this problem with Flickr30k? It reports a MemoryError. When using epoch = 10 or 20, the results are worse than when the script stops early. I think the epoch chosen by early stopping is not necessarily the best, since it has no strong relation with the scores. Maybe testing different epochs is optimal.
@yaxingwang I have the same memory problem on COCO, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image. I changed the code and data format like below. Caption example:
('a dog running' and 'dogs running' are captions for OOO.jpg.) In flickr.py or coco.py, in prepare_data():
In load_data():
Hmm... I will try different epochs.
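For illustration, here is a minimal sketch of this kind of per-image feature layout (the helper names `load_split`/`prepare_batch`, the tab-separated caption file, the `feats/` directory and the `.npy` format are hypothetical, not the code from the original comment):

```python
import os
import numpy

def load_split(caption_file):
    """Return a list of (caption, image_filename) pairs; features stay on disk."""
    pairs = []
    with open(caption_file) as f:
        for line in f:
            caption, image = line.rstrip('\n').split('\t')
            pairs.append((caption, image))
    return pairs

def prepare_batch(pairs, feat_dir='feats'):
    """Load the dense features for one mini-batch only, instead of keeping
    a huge sparse matrix for the whole split in memory."""
    captions = [c for c, _ in pairs]
    feats = numpy.stack([numpy.load(os.path.join(feat_dir, img + '.npy'))
                         for _, img in pairs])
    return captions, feats
```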
@AAmmy, thanks.
I also created my own scripts to prepare the data. I completely skipped the sparse matrix stuff since I think it's not needed at all. I have a single HDF5 file with CONV5_4 features from the VGG19 network for Flickr30k (around 12GB). This file contains all the image features for all splits in the following order: I am pretty sure that I am not making any mistake (but apparently I am, since you at least have some results), but all I got is repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss which doesn't improve at all. I create the dictionary in a frequency-ordered fashion, where 0 is <eos> and 1 is UNK. I don't know where the problem is at all.
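For reference, a frequency-ordered dictionary in the convention described above (indices 0 and 1 reserved for the end-of-sentence token and UNK) can be built roughly like this; `build_dictionary` and the plain list-of-strings input are assumptions for illustration:

```python
from collections import Counter

def build_dictionary(captions):
    """Map each word to an index >= 2, most frequent word first.
    Indices 0 and 1 are reserved (0: end-of-sentence, 1: UNK)."""
    counts = Counter(w for cap in captions for w in cap.split())
    worddict = {}
    for idx, (word, _) in enumerate(counts.most_common()):
        worddict[word] = idx + 2
    return worddict

# Example (order of equally frequent words may vary):
# build_dictionary(['a dog running', 'a man riding a bike'])
# -> {'a': 2, 'dog': 3, 'running': 4, 'man': 5, 'riding': 6, 'bike': 7}
```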
@intuinno, your results are the closest to the reported COCO results; which hyperparameters have you used? @kelvinxu, @kyunghyuncho, the paper does not mention the hyperparameters for the different datasets. Would you mind providing this information (plus maybe even the models themselves, which are not too big for a Dropbox/GDrive file)?
Hi everybody, I'd like to share my observations and experiments with the code on the Flickr30k dataset. Preprocessing:
Feature dimensions:
Early stopping with BLEU: This seems critical and it's mentioned in the paper as well, but unfortunately it is not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model into a temporary file before each validation and call
Validation: I normalized the validation loss w.r.t. sequence lengths as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches (a rough sketch of this per-token normalization appears after this comment).
Hyperparameters: I'm still experimenting, but the best working system so far had the following parameters:
Results: I trained a system yesterday with early stopping on BLEU (but this was using the (EDIT: Fixed the results of my system, which were for the validation split instead of the test split.)
Problems: The main problem is the duplicate captions in the final files:
So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited. For the sampling case, I have 497 unique words out of a vocabulary of ~10K words. For beam search, the number is 561. EDIT
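A rough sketch of the length normalization mentioned above, assuming the per-caption negative log-likelihoods and the 0/1 token mask are available (the function name and array shapes are illustrative, not the actual code):

```python
import numpy

def normalized_valid_loss(per_sample_nll, mask):
    """Average negative log-likelihood per token rather than per caption,
    so the estimate is less sensitive to the caption lengths in each batch.

    per_sample_nll: array of shape (n_samples,), summed NLL of each caption
    mask:           array of shape (maxlen, n_samples), 1 for real tokens, 0 for padding
    """
    lengths = mask.sum(axis=0)                       # number of tokens per caption
    return float((per_sample_nll / lengths).mean())  # per-token NLL, averaged over captions
```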
Hi, @ozancaglayan
Hi @intuinno
So does anyone get a better score on COCO?
I got
@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10.
@AAmmy @xinghedyc Could you please explain how to use http://cs.stanford.edu/people/karpathy/deepimagesent/ for this?
@Lorne0 Hi, what you can download from that website is the COCO dataset, COCO (750MB): http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip,
@xinghedyc Thank you. But I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it as 8x512, and then what? Which 8 of the new features should I use?
@Lorne0 I think 8×512 means 8 annotation vectors, i.e. the a_i vectors that the author's paper defines as the inputs the attention mechanism attends over.
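To make the reshaping discussed above concrete, a minimal sketch (the random vector is just a stand-in for one image's 4096-dim fc7 feature; the point is that all 8 rows are used as annotation vectors, none of them is picked by hand):

```python
import numpy

# Stand-in for one image's 4096-dim fc7 feature from the downloaded COCO features.
feat = numpy.random.randn(4096).astype('float32')

# Reshape into L = 8 annotation vectors of dimension D = 512.
annotations = feat.reshape(8, 512)
assert annotations.shape == (8, 512)

# All 8 vectors are fed to the model together; at every decoding step the
# attention mechanism computes a weight over all of them.
```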
@xinghedyc Thank you~
@Lorne0 Yes, I also got this problem, so I just commented out that line in metrics.py.
@xinghedyc I think METEOR is important too.
@xinghedyc I think I found the solution
@Lorne0 OK, I'll check it.
@Lorne0 @xinghedyc
@AAmmy Thanks, I got BLEU-4 23.9 if I use a beam size of 1, but got 27.6 using a beam size of 10.
@xinghedyc @AAmmy Just to confirm, the results you get for increasing the beam size are correct. At the time of publication, we were using a beam size of 1 (mea culpa!!!)
I got BLEU-1 0.685, BLEU-2 0.507, BLEU-3 0.363, BLEU-4 0.258, METEOR 0.234, ROUGE-L 0.505, CIDEr 0.836.
These days I use Capgen and VQA models on TensorFlow. It's very flexible. I can share the code if needed.
@AAmmy Could you please share your TensorFlow code?
Excuse me, how long does this training process roughly take? I have run it for about 12 hours, and it is still stuck in epoch 1. I really don't know what's wrong with it. My GPU is
I am wondering why pycocoevalcap gives me a different BLEU score than multi-bleu.perl. I took 2 sentences and calculated the BLEU score manually; the result matches multi-bleu.perl but not pycocoevalcap. What algorithm exactly does pycocoevalcap use?
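On the pycocoevalcap question: mismatches with multi-bleu.perl usually come from tokenization and from implementation details of how the corpus-level statistics are accumulated. A minimal sketch of scoring with pycocoevalcap's Bleu scorer (the example sentences are made up):

```python
from pycocoevalcap.bleu.bleu import Bleu

# References and hypotheses are dicts keyed by an id, each value a list of
# whitespace-tokenized strings; every hypothesis list must contain exactly one string.
gts = {
    0: ['a dog is running on the grass', 'a dog runs across a field'],
    1: ['a man rides a bicycle down the street'],
}
res = {
    0: ['a dog running on grass'],
    1: ['a man riding a bike on a street'],
}

score, per_sentence = Bleu(n=4).compute_score(gts, res)
print(score)  # [BLEU-1, BLEU-2, BLEU-3, BLEU-4], computed from corpus-level statistics
```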
@AAmmy
Hello, everyone,
I got the following scores after I ran the COCO training.
{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}
So METEOR is almost the same. However, my BLEU scores are 7~8% lower than the paper's. I wonder if this is acceptable or whether there is something wrong in my process.
Would you please share your results in this post?
Thanks.