Post your evaluation score #20
Hi @intuinno, what hyperparameters did you pick? I tried to run this on the COCO data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours. Bleu 1: .527
Hi @intuinno, @snakeztc, @yaxingwang, my Flickr8k scores with the parameters in capgen.train() are:
@yaxingwang
Hi @AAmmy, I also get results similar to yours, but the BLEU-1 you get is better than mine (0.30). Did you do normalization on the dataset? After doing it, I get worse results. I used metrics.py to get the results.
@yaxingwang
The token file and the train, valid, test splits are the same as the files in Flickr8k_text.zip.
Yes, we are using the same files. I did it, since the results were poor.
@yaxingwang I think intuinno's evaluate_flickr8k.py parameters are for COCO and Flickr30k; the parameters for Flickr8k and those for Flickr30k/COCO are not the same (Section 5.2 in the paper). I think the parameters in the original capgen.py file are for Flickr8k (I used these, and training stopped around epoch 70). I also ran Flickr8k training with the parameters for COCO, and I changed patience and some other parameters to check for overfitting.
@AAmmy, thank you. I am trying both Flickr30k and COCO, but I guess my computer's memory is too small to process Flickr30k, so I am still working on it. Did you hit this problem with Flickr30k? It reports a MemoryError. When using epoch = 10 or 20, the results are worse than when the script stops early. I think the epoch chosen by early stopping is not necessarily the best, since it has no strong relation with the scores. Maybe testing different epochs is optimal.
@yaxingwang I have the same memory problem on COCO, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image. I changed the code and data format like below. Caption example:
('a dog running' and 'dogs running' are captions for OOO.jpg.) In flickr.py or coco.py, in prepare_data():
In load_data():
Hmm... I will try different epochs.
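For illustration, here is a minimal sketch of this kind of per-image feature layout (the helper names `load_split`/`prepare_batch`, the tab-separated caption file, the `feats/` directory and the `.npy` format are hypothetical, not the code from the original comment):

```python
import os
import numpy

def load_split(caption_file):
    """Return a list of (caption, image_filename) pairs; features stay on disk."""
    pairs = []
    with open(caption_file) as f:
        for line in f:
            caption, image = line.rstrip('\n').split('\t')
            pairs.append((caption, image))
    return pairs

def prepare_batch(pairs, feat_dir='feats'):
    """Load the dense features for one mini-batch only, instead of keeping
    a huge sparse matrix for the whole split in memory."""
    captions = [c for c, _ in pairs]
    feats = numpy.stack([numpy.load(os.path.join(feat_dir, img + '.npy'))
                         for _, img in pairs])
    return captions, feats
```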
@AAmmy, thanks.
I also created my own scripts to prepare the data. I completely skipped the sparse matrix stuff since I think it's not needed at all. I have a single HDF5 file with CONV5_4 features from the VGG19 network for Flickr30k (around 12GB). This file contains all the image features for all splits in the following order: I am pretty sure that I am not making any mistake (but apparently I am, since you at least have some results), but all I got is repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss which doesn't improve at all. I create the dictionary in a frequency-ordered fashion, where 0 is <eos> and 1 is UNK. I don't know where the problem is at all.
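For reference, a frequency-ordered dictionary in the convention described above (indices 0 and 1 reserved for the end-of-sentence token and UNK) can be built roughly like this; `build_dictionary` and the plain list-of-strings input are assumptions for illustration:

```python
from collections import Counter

def build_dictionary(captions):
    """Map each word to an index >= 2, most frequent word first.
    Indices 0 and 1 are reserved (0: end-of-sentence, 1: UNK)."""
    counts = Counter(w for cap in captions for w in cap.split())
    worddict = {}
    for idx, (word, _) in enumerate(counts.most_common()):
        worddict[word] = idx + 2
    return worddict

# Example (order of equally frequent words may vary):
# build_dictionary(['a dog running', 'a man riding a bike'])
# -> {'a': 2, 'dog': 3, 'running': 4, 'man': 5, 'riding': 6, 'bike': 7}
```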
@intuinno, your results are the closest to the reported COCO results; which hyperparameters have you used? @kelvinxu, @kyunghyuncho, the paper does not mention the hyperparameters for the different datasets. Would you mind providing this information (plus maybe even the models themselves, which are not too big for a Dropbox/GDrive file)?
Hi everybody, I'd like to share my observations and experiments with the code on the Flickr30k dataset. Preprocessing:
Feature dimensions:
Early stopping with BLEU: This seems critical and it's mentioned in the paper as well, but unfortunately it is not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model into a temporary file before each validation and call
Validation: I normalized the validation loss w.r.t. sequence lengths as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches (a rough sketch of this per-token normalization appears after this comment).
Hyperparameters: I'm still experimenting, but the best working system so far had the following parameters:
Results: I trained a system yesterday with early stopping on BLEU (but this was using the (EDIT: Fixed the results of my system, which were for the validation split instead of the test split.)
Problems: The main problem is the duplicate captions in the final files:
So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited. For the sampling case, I have 497 unique words out of a vocabulary of ~10K words. For beam search, the number is 561. EDIT
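A rough sketch of the length normalization mentioned above, assuming the per-caption negative log-likelihoods and the 0/1 token mask are available (the function name and array shapes are illustrative, not the actual code):

```python
import numpy

def normalized_valid_loss(per_sample_nll, mask):
    """Average negative log-likelihood per token rather than per caption,
    so the estimate is less sensitive to the caption lengths in each batch.

    per_sample_nll: array of shape (n_samples,), summed NLL of each caption
    mask:           array of shape (maxlen, n_samples), 1 for real tokens, 0 for padding
    """
    lengths = mask.sum(axis=0)                       # number of tokens per caption
    return float((per_sample_nll / lengths).mean())  # per-token NLL, averaged over captions
```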
Hi, @ozancaglayan
Hi @intuinno
So does anyone get a better score on COCO?
I got
@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10.
@AAmmy @xinghedyc Could you please explain how to use http://cs.stanford.edu/people/karpathy/deepimagesent/ for this?
@Lorne0 Hi, what you can download from that website is the COCO dataset, COCO (750MB): http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip,
@xinghedyc Thank you. But I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it as 8x512, and then what? Which 8 of the new features should I use?
@Lorne0 I think 8×512 means 8 annotation vectors, i.e. the a_i vectors that the author's paper defines as the inputs the attention mechanism attends over.
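To make the reshaping discussed above concrete, a minimal sketch (the random vector is just a stand-in for one image's 4096-dim fc7 feature; the point is that all 8 rows are used as annotation vectors, none of them is picked by hand):

```python
import numpy

# Stand-in for one image's 4096-dim fc7 feature from the downloaded COCO features.
feat = numpy.random.randn(4096).astype('float32')

# Reshape into L = 8 annotation vectors of dimension D = 512.
annotations = feat.reshape(8, 512)
assert annotations.shape == (8, 512)

# All 8 vectors are fed to the model together; at every decoding step the
# attention mechanism computes a weight over all of them.
```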
@xinghedyc Thank you~
@Lorne0 Yes, I also got this problem, so I just commented out that line in metrics.py.
@xinghedyc I think METEOR is important too.
@xinghedyc I think I found the solution
@Lorne0 OK, I'll check it.
@Lorne0 @xinghedyc
@AAmmy Thanks, I got BLEU-4 23.9 if I use a beam size of 1, but got 27.6 using a beam size of 10.
@xinghedyc @AAmmy Just to confirm, the results you get for increasing the beam size are correct. At the time of publication, we were using a beam size of 1 (mea culpa!!!)
I got BLEU-1 0.685, BLEU-2 0.507, BLEU-3 0.363, BLEU-4 0.258, METEOR 0.234, ROUGE-L 0.505, CIDEr 0.836.
These days I use Capgen and VQA models on TensorFlow. It's very flexible. I can share the code if needed.
@AAmmy Could you please share your TensorFlow code?
Excuse me, how long does this training process roughly take? I have run it for about 12 hours, and it is still stuck in epoch 1. I really don't know what's wrong with it. My GPU is
I am wondering why pycocoevalcap gives me a different BLEU score than multi-bleu.perl. I took 2 sentences and calculated the BLEU score manually; the result matches multi-bleu.perl but not pycocoevalcap. What algorithm exactly does pycocoevalcap use?
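On the pycocoevalcap question: mismatches with multi-bleu.perl usually come from tokenization and from implementation details of how the corpus-level statistics are accumulated. A minimal sketch of scoring with pycocoevalcap's Bleu scorer (the example sentences are made up):

```python
from pycocoevalcap.bleu.bleu import Bleu

# References and hypotheses are dicts keyed by an id, each value a list of
# whitespace-tokenized strings; every hypothesis list must contain exactly one string.
gts = {
    0: ['a dog is running on the grass', 'a dog runs across a field'],
    1: ['a man rides a bicycle down the street'],
}
res = {
    0: ['a dog running on grass'],
    1: ['a man riding a bike on a street'],
}

score, per_sentence = Bleu(n=4).compute_score(gts, res)
print(score)  # [BLEU-1, BLEU-2, BLEU-3, BLEU-4], computed from corpus-level statistics
```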
@AAmmy
Hello, everyone,
I got the following scores after I ran the COCO training.
{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}
So METEOR is almost the same. However, my BLEU scores are 7~8% lower than the paper's. I wonder if this is acceptable or whether there is something wrong in my process.
Would you please share your results in this post?
Thanks.