
Post your evaluation score #20

Open
intuinno opened this issue Mar 9, 2016 · 39 comments

Comments

@intuinno

intuinno commented Mar 9, 2016

Hello, everyone,

I got the following scores after running evaluation on COCO.

{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}

So METEOR is almost the same. However, my BLEU scores are 7-8% lower than the paper's. I wonder whether this is acceptable or whether something is wrong in my process.

Would you please share your results in this post?

Thanks.

@snakeztc

Hi @intuinno, which hyperparameters did you pick? I tried to run on the COCO data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours.

Bleu 1: .527
Bleu 2: .333
Bleu 3: .210
Bleu 4: .138
METEOR: .163
ROUGE_L: .403
CIDEr: .371

@yaxingwang

Hi @intuinno, @snakeztc, I am running this code on Flickr8k just to test it. I have run it and got the visualization, but I don't know how to get the scores (BLEU and METEOR). Could you tell me which script computes them? Please forgive me if I am bothering you.

@AAmmy

AAmmy commented Mar 31, 2016

Hi @intuinno, @snakeztc, @yaxingwang, my Flickr8k scores with the parameters in capgen.train() are
BLEU = 0.504 / 0.270 / 0.145 / 0.082.
My best scores are
BLEU = 0.550 / 0.296 / 0.164 / 0.095,
obtained with the parameters in eval_coco plus optimizer = rmsprop.
My scores are lower than the paper's:
BLEU = 0.670 / 0.457 / 0.314 / 0.213

@yaxingwang
metrics.py, or the neuraltalk evaluation script I am using (see my repository).

@yaxingwang

Hi @AAmmy, I also get results similar to yours, but your BLEU-1 is better than mine (0.30). Do you normalize the dataset? After normalizing, I get worse results. I use metrics.py to get the results.

@AAmmy

AAmmy commented Mar 31, 2016

@yaxingwang
I did not normalize the dataset.
My preprocessing (a feature-extraction sketch follows below):

  1. center-crop the images
  2. resize the images to 224x224
  3. extract features with VGG_ILSVRC_19_layers

The token file and the train/valid/test splits are the same as the files in Flickr8k_text.zip.
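For reference, a minimal sketch of such a pipeline. It assumes torchvision's VGG-19 as a stand-in for the original Caffe VGG_ILSVRC_19_layers model (so exact feature values will differ from what was used in this thread), follows the common resize-then-crop recipe rather than AAmmy's exact step order, and uses an illustrative image path.

import torch
import torchvision.transforms as T
from torchvision.models import vgg19
from PIL import Image

preprocess = T.Compose([
    T.Resize(256),        # resize the shorter side to 256
    T.CenterCrop(224),    # center-crop to 224x224
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# VGG-19 truncated after conv5_4 + ReLU -> 512x14x14 feature maps
# (torchvision >= 0.13; older versions use pretrained=True instead of weights)
model = vgg19(weights='DEFAULT').features[:36].eval()

with torch.no_grad():
    img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
    feat = model(img)                                      # (1, 512, 14, 14)
    annotations = feat.squeeze(0).reshape(512, 196).t()    # (196, 512) annotation vectors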

@yaxingwang

Yes, our preprocessing is the same. I tried the normalization because my results were poor.
For me (a resize/crop sketch follows below):

  1. aspect-preserving resize of the shorter side to 256:
     if width > height:
         width = (width * resize) / height  # resize = 256
         height = resize
     else:
         height = (height * resize) / width
         width = resize
  2. center-crop the images to 224x224
  3. extract features from VGG conv5_4

I am confused about whether the parameters should be the same for all three datasets; I just ran the code offered by @intuinno and found that the parameters for the three datasets are the same. Also, at what epoch did the script stop for you? I got epoch = 79, and I don't know whether that indicates overfitting.
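For reference, a minimal sketch of the resize/crop steps described above, assuming PIL (the exact interpolation and rounding of the original preprocessing may differ, and the file path is illustrative):

from PIL import Image

def resize_and_center_crop(path, resize=256, crop=224):
    img = Image.open(path).convert('RGB')
    width, height = img.size
    if width > height:
        width, height = (width * resize) // height, resize   # shorter side becomes `resize`
    else:
        width, height = resize, (height * resize) // width
    img = img.resize((width, height), Image.BILINEAR)
    left, top = (width - crop) // 2, (height - crop) // 2
    return img.crop((left, top, left + crop, top + crop))

cropped = resize_and_center_crop('example.jpg')   # 224x224 RGB image, ready for the CNN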

@AAmmy

AAmmy commented Mar 31, 2016

@yaxingwang I think the parameters in intuinno's evaluate_flickr8k.py are for COCO and Flickr30k; the parameters for Flickr8k and those for Flickr30k/COCO are not the same (Section 5.2 in the paper).

I think the parameters in the original capgen.py file are for Flickr8k (I used these; training stopped around epoch 70).

I also trained Flickr8k with the parameters for COCO
(the same as intuinno's evaluate_flickr8k.py, so the same as yours?);
the scores were
BLEU = 0.493 / 0.258 / 0.130 / 0.072
with early stopping at epoch 89 (6-12 hours).

I also changed patience and some other parameters to check for overfitting:
after epoch 89, samples from the validation set seemed to be getting better, but the BLEU score (on test) was getting worse.

@yaxingwang

@AAmmy, thank you. I am trying to run both Flickr30k and COCO, but I guess my computer's memory is too small to process Flickr30k, so I am still working on it. Did you hit this issue on Flickr30k? It raises a MemoryError.

When using epoch = 10 or 20, the result is worse than the one from the early-stopped run. Still, I think the epoch chosen by early stopping is not necessarily the best, since the validation loss does not have a strong relation with the scores. Maybe testing several different epochs is optimal.

@AAmmy

AAmmy commented Mar 31, 2016

@yaxingwang I have the same memory problem on COCO, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image.

I changed the code and data format as shown below.

caption example:

train_cap = [['a dog running', 'OOO.jpg'], ['dogs running', 'OOO.jpg'],
                       ..., ['a cat running', '+++.jpg'], ['cats running', '+++.jpg']]

('a dog running' and 'dogs running' are captions for OOO.jpg;
OOO.jpg is the image file name, and OOO.jpg.mat will hold the feature extracted from OOO.jpg)

In flickr.py or coco.py:

In prepare_data():

# requires `from scipy.io import loadmat` at the top of the file
# load the target feature file for each caption
for cc in caps:
    seqs.append([worddict[w] if worddict[w] < n_words else 1 for w in cc[0].split()])
    feat_list.append(loadmat(feat_path + str(cc[1]) + '.mat')['feats']) # my code
    # feat_list.append(features[cc[1]]) # original code
# OOO.jpg.mat stores a dense matrix, so there is no need for todense()

# y = numpy.zeros((len(feat_list), feat_list[0].shape[1])).astype('float32') # original code
# for idx, ff in enumerate(feat_list): # original code
    # y[idx,:] = numpy.array(ff.todense()) # original code
# y = y.reshape([y.shape[0], 14*14, 512]) # original code
y = numpy.array(feat_list).reshape([len(feat_list), 14*14, 512]).astype('float32') # my code

In load_data():

# only caption files are loaded
train_cap = pkl.load(open(path+'flicker_30k_cap.train.pkl', 'rb'))
train_feat = []

Hmm... I will try different epochs.
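For reference, a minimal sketch of the one-feature-file-per-image layout described above, assuming scipy and a (196, 512) conv5_4 feature per image; the directory and file names are illustrative.

from scipy.io import loadmat, savemat
import numpy

feat = numpy.zeros((14 * 14, 512), dtype='float32')   # placeholder for a real VGG conv5_4 feature
savemat('feat_dir/OOO.jpg.mat', {'feats': feat})      # written once, offline, per image

# later, inside prepare_data(), each caption entry loads only the feature it needs:
feats = loadmat('feat_dir/OOO.jpg.mat')['feats']      # (196, 512), already dense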

@yaxingwang

@AAmmy , Thanks.

@ozancaglayan

I also created my own scripts to prepare the data. I completely skipped the sparse-matrix stuff since I think it's not needed at all. I have a single HDF5 file with CONV5_4 features from the VGG19 network for Flickr30k (around 12GB). This file contains the image features for all splits in the following order: train, valid, test. The order of the jpeg files matching the order of the feature matrix is also available.

I am pretty sure that I am not making any mistake (though apparently I am, since you at least get some results), but all I get is repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss that doesn't improve at all.

I create the dictionary in frequency order; 0 is <eos> and 1 is UNK.

I don't know where the problem is at all.
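For reference, a minimal sketch of such a single-HDF5-file layout, assuming h5py and features already extracted as an (n_images, 196, 512) array in train/valid/test order; the file and dataset names are illustrative, not the ones from this thread.

import h5py
import numpy

feats = numpy.zeros((100, 196, 512), dtype='float32')   # placeholder for real conv5_4 features

with h5py.File('flickr30k_conv54_feats.h5', 'w') as f:
    f.create_dataset('feats', data=feats)                # one dataset, rows in train/valid/test order

# at training time, rows can be read lazily instead of keeping ~12GB in memory:
with h5py.File('flickr30k_conv54_feats.h5', 'r') as f:
    batch = f['feats'][0:64]                             # (64, 196, 512) slice for one minibatch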

AAmmy referenced this issue in AAmmy/show-attend-and-tell Apr 11, 2016
@volkancirik

@intuinno, your results are the closest to the reported COCO results; which hyperparameters did you use?

@kelvinxu, @kyunghyuncho, the paper does not mention the hyperparameters for the different datasets. Would you mind providing this information? (Plus maybe even the models themselves, which should not be too big for a Dropbox/Google Drive file.)

@ozancaglayan

Hi everybody,

I'd like to share my observations and experiments with the code on the Flickr30k dataset:

Preprocessing:

  • I have a separate HDF5 file for each of the train/dev/test splits containing the convolutional features extracted from the Flickr30k dataset with the VGG19 network. Since the current way of creating a PKL file with captions and sparse matrices is very inefficient (it doesn't even work with Python 2.7 because of a pickle bug with huge files), I load those HDF5 files directly and keep only the tokenized captions and image idxs in the pkl file. I create a dictionary with words occurring >= 3 times, leading to a dictionary of 9584 words (a dictionary-building sketch follows this list).
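For reference, a minimal sketch of building such a frequency-thresholded dictionary; the reserved indices (0 for <eos>, 1 for UNK) follow the convention mentioned earlier in this thread, and the function name is illustrative.

from collections import Counter

def build_dictionary(tokenized_captions, min_count=3):
    # tokenized_captions: list of caption strings, already tokenized and whitespace-joined
    counts = Counter(w for caption in tokenized_captions for w in caption.split())
    vocab = [w for w, c in counts.most_common() if c >= min_count]
    # indices 0 (<eos>) and 1 (UNK) are reserved, so real words start at 2
    return {w: i + 2 for i, w in enumerate(vocab)}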

Feature dimensions:

  • This is specific to how you create your feature file. What is done in the original code, i.e. y.reshape([y.shape[0], 14*14, 512]) was not correct for my feature file and I was obtaining complete nonsense during training. Ensure that the reshaping is done correctly.

Early stopping with BLEU:

This seems critical; it's mentioned in the paper as well but unfortunately not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model to a temporary file before each validation and call generate_caps.py to write the hypotheses to a file. I then use the pycocoevalcap utilities to obtain BLEU1-BLEU4 and METEOR scores. After that you can select which metric you would like to early-stop on.
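For reference, a minimal sketch of the scoring/early-stopping step, assuming the validation hypotheses have already been generated (e.g. by generate_caps.py) and collected into hyps, with refs holding the reference captions; the variable names and the patience logic here are illustrative, not the repository's own code.

from pycocoevalcap.bleu.bleu import Bleu

def bleu4(refs, hyps):
    # refs/hyps: dict image_id -> list of caption strings (hyps has one string per image)
    scores, _ = Bleu(4).compute_score(refs, hyps)
    return scores[3]   # scores holds Bleu_1..Bleu_4

best_bleu, bad_counter, patience = 0.0, 0, 10

# run this at every validFreq updates, inside the training loop:
current = bleu4(refs, hyps)
if current > best_bleu:
    best_bleu, bad_counter = current, 0
    # save the current parameters here as the best-on-BLEU model (e.g. with numpy.savez)
else:
    bad_counter += 1
    if bad_counter > patience:
        print('Early stopping on BLEU')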

Validation:

I normalized the validation loss w.r.t. sequence length as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches.
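For reference, a minimal sketch of this length normalization, assuming the per-caption negative log-likelihoods and caption lengths have been collected over the validation set; the names are illustrative.

import numpy

def normalized_valid_loss(costs, lengths):
    # costs: per-caption NLL values; lengths: number of tokens in each caption
    costs = numpy.asarray(costs, dtype='float64')
    lengths = numpy.asarray(lengths, dtype='float64')
    return float(numpy.mean(costs / lengths))   # average per-token NLL instead of per-caption NLL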

Hyperparameters:

I'm still experimenting, but the best-working system so far had the following parameters (a sketch of passing them to train() follows after the list):

n_words: 9584
maxlen: 100
decay_c: 1e-05
alpha_c: 0 (This is 1 in the original code)
use_dropout: False (dropout is enabled by default in the original code)
patience: 10
ctx_dim: 512
dim: 1000 (This is 1800 in the original code)
dim_word: 512
batch_size: 128
optimizer: adam (rmsprop is OK too but adadelta is completely failing)
lstm_encoder: False
n_layers_init: 2
n_layers_att: 2
n_layers_lstm: 1
n_layers_out: 1
ctx2out: True
prev2out: True
selector: True
attn_type: deterministic (didn't try the hard one)
validFreq: 500
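For reference, a sketch of passing this configuration to the training entry point. It assumes the keyword names above match the signature of capgen.train() in this repository, and that dataset/saveto arguments exist as in the evaluate_* scripts; check your local copy before relying on it.

from capgen import train

validerr = train(
    n_words=9584, maxlen=100, decay_c=1e-05, alpha_c=0.,
    use_dropout=False, patience=10, ctx_dim=512, dim=1000,
    dim_word=512, batch_size=128, optimizer='adam',
    lstm_encoder=False, n_layers_init=2, n_layers_att=2,
    n_layers_lstm=1, n_layers_out=1, ctx2out=True, prev2out=True,
    selector=True, attn_type='deterministic', validFreq=500,
    dataset='flickr30k', saveto='flickr30k_soft_model.npz')   # dataset/saveto names are assumptions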

Results:

I trained a system yesterday with early-stopping on BLEU (but this was using the multi-bleu.perl script which has different dynamics than the pycocoevalcap utilities). I generated the captions with sampling instead of beam-search during validation periods. At the end I obtained the following results with the best validation model:

(EDIT: Fixed the results of my system which was for the validation split instead of the test split.)

Description           BLEU1  BLEU2  BLEU3  BLEU4  METEOR
Beam (12)             57.9   39.3   26.9   18.5   17.58
Sampling              61.2   41.4   28.12  19.1   16.77
Paper results (soft)  66.7   43.4   28.8   19.1   18.49
Paper results (hard)  66.9   43.9   29.6   19.9   18.46

Problems:

The main problem is the duplicate captions in the final files:

$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.sampling.dev.txt | wc -l
853
$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.beam12.1best.dev.txt | wc -l
790

So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited. For the sampling case, I have 497 unique words out of a vocabulary of ~10K words. For beamsearch, the number is 561.

EDIT
I actually checked the generated captions against the images. Even though there are, for example, 10 instances of "a group of people are standing outside" for 10 different images, it's actually true in terms of scene description: in all of those images there are some people standing outside :) So maybe this is related to the weak diversity of the Flickr30k dataset.

@AAmmy

AAmmy commented Apr 15, 2016

The BLEU results from multi-bleu.perl and pycocoevalcap are very different. I got 65% BLEU1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and ground truths.

@AAmmy

AAmmy commented Apr 22, 2016

Hi, @ozancaglayan
Could you share the code that handles this normalization, please?

Validation:

I normalized the validation loss w.r.t. sequence length as well.
This seems a better estimate of the validation loss, as the default
one is sensitive to the caption lengths in the validation batches.

@frajem

frajem commented Apr 22, 2016

Hi @intuinno
Would you share the model file trained on COCO?
Also, what are your best validation/test costs for Flickr8k and COCO?
Thanks.

@Lorne0

Lorne0 commented May 9, 2016

So, has anyone gotten a better score on COCO?
I used @intuinno's code and, in the end (17 epochs), got a score similar to his (at the top of this issue).
However, when I calculated the score at epoch 10, it turned out to be better than at the 17th epoch:
BLEU: 0.6398 / 0.4518 / 0.3127 / 0.218, METEOR: 0.2384

@AAmmy

AAmmy commented May 12, 2016

I got
BLEU: 0.6887 / 0.5034 / 0.3588 / 0.2547
METEOR: 0.2234
on COCO with the data from
http://cs.stanford.edu/people/karpathy/deepimagesent/
The feature size there is 4096, so I used the features by reshaping them to 8x512 (see the sketch below).
However, Flickr8k training failed with these features.
I didn't try Flickr30k.
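For reference, a minimal sketch of reusing Karpathy's 4096-d VGG features as 8x512 annotation vectors, assuming vgg_feats has already been loaded from vgg_feats.mat and arranged with one row per image (an assumption about the stored orientation).

import numpy

feats = numpy.asarray(vgg_feats, dtype='float32')     # (n_images, 4096) VGG features
annotations = feats.reshape(feats.shape[0], 8, 512)   # (n_images, 8, 512): 8 "regions" of 512 dims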

@xinghedyc

@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10.
Was your result based on a beam size of 1?

@Lorne0

Lorne0 commented May 22, 2016

@AAmmy @xinghedyc Could you please explain how you used http://cs.stanford.edu/people/karpathy/deepimagesent/?
Did you use it for extracting features?

@xinghedyc

@Lorne0 Hi, what you can download from that website is a COCO dataset, COCO (750MB): http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip.
The vgg_feats.mat file contains features extracted with the VGG net, 4096-dimensional for each image, and the JSON file contains all the captions.
For more details you can read their paper.

@Lorne0

Lorne0 commented May 22, 2016

@xinghedyc Thank you, but I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it to 8x512; and then? Which of the 8 new feature vectors should I use?

@xinghedyc

@Lorne0 I think 8x512 means 8 annotation vectors, which the authors' paper defines as
a = {a_1, ..., a_L}, a_i ∈ R^D; see Section 3.1.1 in the paper.
The original code uses 196 x 512 annotation vectors, so @AAmmy used 8 annotation vectors in soft-attention mode with the dataset above, and it actually works.

@Lorne0

Lorne0 commented May 22, 2016

@xinghedyc Thank you~
I just ran 3 epochs, but when I use metrics.py I always get
IOError: [Errno 32] Broken pipe
in pycocoevalcap/meteor/meteor.py.
Did you have this problem?

@xinghedyc

@Lorne0 Yes, I also got this problem, so I just commented out those scorers in metrics.py,
like this:
scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    # (Meteor(), "METEOR"),
    # (Rouge(), "ROUGE_L"),
    # (Cider(), "CIDEr")
]
This is because I care more about BLEU, but you could try to fix the problem :)
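For context, each (scorer, method) pair in such a list is applied the same way under the pycocoevalcap convention; a minimal sketch, where gts and res are dicts mapping an image id to a list of caption strings and the variable names are illustrative:

final_scores = {}
for scorer, method in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(method, list):                       # Bleu returns one value per n-gram order
        final_scores.update(dict(zip(method, score)))
    else:
        final_scores[method] = score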

@Lorne0

Lorne0 commented May 22, 2016

@xinghedyc I think METEOR is important too.
I'll try to fix it, thank you~
@AAmmy, could you help us with this problem?

@Lorne0

Lorne0 commented May 22, 2016

@xinghedyc I think I found the solution:
just delete pycocoevalcap/ and clone the newest version :)

@xinghedyc

@Lorne0 OK, I'll check it.

@AAmmy

AAmmy commented May 23, 2016

@Lorne0 @xinghedyc
My result, BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234,
is based on beam_size 1. I checked only epoch 19.
Maybe some other epoch (from 1 to 18) gives a better score.
The references were built by the code in scripts.py.

@xinghedyc

@AAmmy thanks. I got BLEU-4 23.9 with a beam size of 1, but 27.6 with a beam size of 10.
I only trained for 11 epochs; maybe more epochs should be trained.

@kelvinxu
Owner

kelvinxu commented Jun 4, 2016

@xinghedyc @AAmmy Just to confirm, the results you get from increasing the beam size are correct. At the time of publication, we were using a beam size of 1 (mea culpa!!!)

@DongNaeSwellfish

I got BLEU1 0.685, BLEU2 0.507, BLEU3 0.363, BLEU4 0.258, METEOR 0.234, ROUGE_L 0.505, CIDEr 0.836.

@ammmy

ammmy commented May 24, 2017

These days I use a capgen and VQA model in TensorFlow. It's very flexible. I can share the code if needed.

@porcofly

porcofly commented Jul 4, 2017

@AAmmy Could you please share your TensorFlow code?

@shaoxuan92

shaoxuan92 commented Jan 6, 2018

Excuse me, roughly how long does this training process take? I have run it for about 12 hours, and it is still stuck in epoch 1. I really don't know what's wrong with it. My GPU is a Quadro K4200. Thank you...

@ChiZhangRIT

The BLEU results from multi-bleu.perl and pycocoevalcap are very different. I got 65% BLEU1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and GTs.

I am wondering why pycocoevalcap gives me a different BLEU score compared to multi-bleu.perl. I took 2 sentences and calculated the BLEU score manually; the result matches multi-bleu.perl but not pycocoevalcap. What algorithm exactly does pycocoevalcap use?

@kavithasampath

@ammmy
Thanks for sharing your TensorFlow-based repo. If you could share the script that generates the following files (which you use in your implementation), it would be very useful:
"tokens.npy",
"tokens_flat.npy",
"filename.npy",
"filepath.npy",
"vgg_feats.npy",
"tokens_flat_to_image_lookup.npy"

@xxxyyyzzzz

@ammmy
Can you post accuracy numbers for the TensorFlow-based implementations below?
https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow
https://github.com/yunjey/show-attend-and-tell
