
Positive Bag Negative Bag #19

Open
Monikshah opened this issue Sep 9, 2021 · 24 comments

Comments

@Monikshah

I am trying to reproduce your model for weakly supervised multi-instance learning, and I am a bit confused about the formation of the positive and negative bags. The paper says that, for a predicate r associated with an object region pair, the region pair will be labeled as a positive bag if the predicate r is in the caption S. My question is: the predicates are extracted from the triplets, and the triplets are extracted from the caption, so the predicate will always be present in the caption.

How should the positive and negative bags be labeled? Can you please help me understand this?

Thank you very much.

@Gitsamshi
Owner

Thanks for asking. Given an image with n regions, there are n*(n-1) region pairs. There may be k pairs related to predicate r, which is extracted from the caption. These k pairs make up a positive bag, and the remaining n*(n-1)-k pairs make up a negative bag.

What we care about in training are the positive bags, since pairs in negative bags are all labelled as negative (the same as binary classification).

You may view the MIL approach as a better method than simply assigning every one of the k pairs a positive label for the binary classification of predicate r.
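
A minimal sketch of the bag construction described above (the function and its inputs are assumed for illustration, not taken from the repo): enumerate all n*(n-1) ordered region pairs and split them into a positive and a negative bag for a single predicate r.

```python
from itertools import permutations

def build_bags(region_labels, caption_pairs_for_r):
    """region_labels:       list of detected object labels, one per region
       caption_pairs_for_r: set of (subject, object) labels linked by predicate r in the caption"""
    # All n*(n-1) ordered region pairs.
    all_pairs = list(permutations(range(len(region_labels)), 2))
    # The k pairs whose labels match a caption (subject, object) for r form the positive bag.
    pos_bag = [(i, j) for i, j in all_pairs
               if (region_labels[i], region_labels[j]) in caption_pairs_for_r]
    # The remaining n*(n-1)-k pairs form the negative bag.
    neg_bag = [p for p in all_pairs if p not in pos_bag]
    return pos_bag, neg_bag
```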

@Monikshah
Author

Thank you very much for the response. I am very new to this field of research, and I have been struggling for around a month to implement this model.

So what I understand is that there can be k pairs related to a predicate extracted from the caption. For the caption "a woman in a hat feeds the giraffe from her hand", the triplets extracted are 'woman in hat' and 'woman feeds giraffe'. The predicates are 'in' and 'feed'. So the k pairs are all the region pairs of 'woman' and 'hat' for the predicate 'in', and all the region pairs of 'woman' and 'giraffe' for the predicate 'feed'. Only these k pairs make up a positive bag, and the remaining pairs such as 'woman in giraffe', 'woman feeds hat', etc. make up the negative bag, right? Also, do we need to assign all the remaining pairs of objects to the negative bag?

Also, for the formation of bags, we have object features (att_feat) from the images, labels of these objects (coco_img_sg), and the triplets from the sentences (coco_spice_sg2). For each object pair (att_feat concatenated), we check whether its respective labels (from coco_img_sg) appear with a predicate in the extracted triplets; if so, we assign it to the positive bag, otherwise to the negative bag. Should the object labels detected from the image match the objects in the triplets?

Am I overcomplicating this? Please let me know if my understanding is right, and correct me if I am wrong. Thank you.

@Gitsamshi
Owner

Hi there,
Please refer to scripts/prepro_predicates.py for getting positive bags. It should answer your second question.
For the first question, let's first clarify "pair" and "bag".
(1) A pair is defined across any two regions in the image, and a bag is a combination of pairs.
(2) The reason we use bags is that we are not sure about the label of some pairs. If we are certain about the label of a pair, we just use regular classification, e.g., for pairs in the negative bags.
(3) In our case, you have to train multiple binary predicate classifiers, and the positive bags differ per predicate. Assuming there are k1 pairs for 'in' and k2 pairs for 'feed' (k1+k2=k), there are two positive bags, and the remaining pairs are labeled as negative for either predicate.
(4) For the negative pairs, just use regular cross-entropy loss; for the positive bags, compute the bag probability from the pair probabilities and then apply cross-entropy loss on the bag.

Hope the above helps. Please feel free to leave further comments.
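
A rough PyTorch sketch of point (4). The aggregation of pair probabilities into a bag probability is assumed here to be a noisy-OR; the thread does not state which aggregation the paper actually uses, so treat this as one possible choice.

```python
import torch
import torch.nn.functional as F

def mil_loss(pos_pair_probs, neg_pair_probs, eps=1e-6):
    """pos_pair_probs: (k,) predicate probabilities for pairs in the positive bag
       neg_pair_probs: (m,) predicate probabilities for pairs in the negative bag"""
    # Pairs in the negative bag are individually labelled negative:
    # plain binary cross-entropy against 0.
    neg_loss = F.binary_cross_entropy(neg_pair_probs,
                                      torch.zeros_like(neg_pair_probs))

    # Positive bag: at least one pair should express the predicate.
    # Noisy-OR aggregation (assumed): P(bag) = 1 - prod_i (1 - p_i).
    bag_prob = (1.0 - torch.prod(1.0 - pos_pair_probs)).clamp(eps, 1.0 - eps)
    bag_prob = bag_prob.unsqueeze(0)
    pos_loss = F.binary_cross_entropy(bag_prob, torch.ones_like(bag_prob))

    return pos_loss + neg_loss
```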

@Monikshah
Author

Thank you very much. This answers so many of my questions.

I checked scripts/prepro_predicates.py. It gives the predicates and the triplets from images and sentences. From these generated triplets, can we directly form the positive bag, or is there a process such as matching the object categories of the images with the objects from the triplets in order to form the positive bag? I hope this makes sense.

Thank you.

@Gitsamshi
Owner

Yeah, there is a matching process in the code.

@Monikshah
Author

There are two files used in scripts/prepro_predicates.py which I could not find to download:

  1. '--pred_category', default='data/all_predicates_final.json', help='get all predicates'
  2. '--aligned_triplets', default='data/aligned_triplets_final.json', help='get aligned weak supervision'

Are these the predicates and the triplets extracted from the sentences, respectively? Should I save all the predicates as 'all_predicates_final.json' and all the triplets as 'aligned_triplets_final.json' and put them in the data folder to run the code?

@Monikshah
Author

I got the answers to these.

@Monikshah
Author

Sorry to bother you again, and thank you for patiently responding :)

I do not understand which part of scripts/prepro_predicates.py does the matching and which variable represents the object pairs to put in the positive bag, so I will write my own code for this part. Here are the steps I plan to use to form the bags. Please let me know if the process is right:

  1. I have the predicates and triplets from the sentences, as well as the objects detected in the images and their labels.
  2. Form the object pairs (OP) for the objects detected in the images.
  3. For every predicate in the triplets, check whether the object pairs appear in the triplets. If so, put these pairs of objects in the positive bag.
    Let's consider the triplets provided in the paper:
    triplets: woman in hat, woman feeds giraffe
    predicates: in, feed
    object labels: woman, hat, giraffe
    object pairs: woman-hat, woman-giraffe, hat-giraffe
    For the predicate 'in', check whether these object pairs match the subject and object in the triplets.
    If so, put the features of the object pairs in the positive bag (all the pairs of woman and hat).

We do the same for all the predicates, i.e., 200*2 bags for 200 predicates.

Thank you :)

@Gitsamshi
Owner

Exactly. Remember to use "data/coco_class_names.txt" to map object labels to caption words.
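
A hypothetical sketch of that matching step (not the actual code in scripts/prepro_predicates.py; the function names and input formats are assumed): map each detected region's class id to its class name via data/coco_class_names.txt, then collect, per predicate, the region pairs whose labels match a (subject, object) from the caption triplets.

```python
from collections import defaultdict

def load_class_names(path="data/coco_class_names.txt"):
    # One class name per line, in detector class-id order.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def positive_bags(region_class_ids, triplets, class_names):
    """region_class_ids: detector class id per region
       triplets:         list of (subject, predicate, object) words from the caption"""
    labels = [class_names[c] for c in region_class_ids]
    bags = defaultdict(list)  # predicate -> list of (i, j) region index pairs
    for subj, pred, obj in triplets:
        for i, li in enumerate(labels):
            for j, lj in enumerate(labels):
                if i != j and li == subj and lj == obj:
                    bags[pred].append((i, j))
    return bags
```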

@Monikshah
Author

I have been going through scripts/prepro_predicates.py. I came to understand that all_predicates_final.json contains all the predicates and aligned_triplets_final.json contains the pairs of objects for each predicate. So I understand now that these object pairs can be used directly to form the positive bags. But I am still wondering how to form the negative bag. Can I use the same pairs for the respective predicate to form the negative bag?

These questions might look dumb, but I am really a layman and struggling to put all the pieces together.

Thank you

@Gitsamshi
Owner

Given a predicate, the negative bag is the complement of the positive bag with respect to all pairs.

@Monikshah
Author

Great! Thank you very much for all the answers. Hopefully I will be able to implement it now.

@Monikshah
Author

Hello again,

I have been able to implement the bag model and have trained it. Now, at test time, we should be able to predict the predicates in an image using just the object pairs in that image, right?

Also, we will need the ground-truth predicates in the images to evaluate the model. I don't see any file which contains the ground-truth predicate labels. Can you please provide some information about how/where I can get the test data?

Thank you!

@Gitsamshi
Owner

Exactly. You can use the same split as the Karpathy split for train/dev/test. As we didn't have predicate annotations between each pair, I just used predicate recall over the whole image (#(predicted predicates ∩ gt predicates) / #(gt predicates)) as a metric to roughly evaluate the model.
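
A small sketch of that recall metric, computed per image and averaged over the evaluation set; the function and variable names are illustrative, not from the repo.

```python
def predicate_recall(predicted, reference):
    """predicted, reference: iterables of predicate strings for one image."""
    ref = set(reference)
    if not ref:
        return None  # skip images with no reference predicates
    # #(predicted predicates ∩ gt predicates) / #(gt predicates)
    return len(set(predicted) & ref) / len(ref)

def mean_predicate_recall(image_predictions):
    """image_predictions: list of (predicted, reference) pairs, one per image."""
    scores = [predicate_recall(p, r) for p, r in image_predictions]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```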

@Monikshah
Author

Thank you very much for responding.

I think the Karpathy train/dev/test split does not have predicate annotations over the whole image. How did you get the ground-truth predicates?

@Monikshah
Author

Do you consider the predicates in the captions of each test image as the ground truth predicates?

@Gitsamshi
Owner

Yes, I mean cut the Karpathy train split into train/dev/test, and use the caption predicates as the reference.

@Monikshah
Author

Awesome! Thank you very much. :)

@Monikshah
Author

I have one more question. We build positive and negative bags for each predicate and train them separately. If we train them separately, we will have separate models, for example 10 models for 10 predicates. Do we need to combine these 10 models into one, or do we get just one model from training? (Because eventually we need just one model to predict any predicate, right?)

@Gitsamshi
Owner

I would suggest adding 10 different top-layer classifiers, one per predicate, and sharing the other parameters across all predicates. That makes one model.
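
One way to realize "shared parameters + one top-layer classifier per predicate" as a single model; a PyTorch sketch with an assumed MLP trunk and illustrative layer sizes, not the repo's actual architecture.

```python
import torch.nn as nn

class PredicateMIL(nn.Module):
    def __init__(self, pair_feat_dim, hidden_dim, num_predicates):
        super().__init__()
        # Trunk shared across all predicates, applied to each region-pair feature.
        self.shared = nn.Sequential(
            nn.Linear(pair_feat_dim, hidden_dim),
            nn.ReLU(),
        )
        # One binary logit per predicate
        # (equivalent to num_predicates separate top-layer classifiers).
        self.heads = nn.Linear(hidden_dim, num_predicates)

    def forward(self, pair_feats):
        """pair_feats: (num_pairs, pair_feat_dim) -> (num_pairs, num_predicates) logits"""
        return self.heads(self.shared(pair_feats))
```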

@Monikshah
Author

Sure. Thank you very much!

@ababababababababababab

Can I ask you some details about the training? For different predicates, the division into positive and negative bags is different. How do you implement batch training? In addition, what probability threshold do you use after training the classifier?

@Monikshah
Author

Can I ask you some details about the training? For different predicates, the division into positive and negative bags is different. How do you implement batch training? In addition, what probability threshold do you use after training the classifier?

I have not been able to train the model yet. I am still working on the MIL part.

What library are you using for multi-instance learning?

@Monikshah
Author

Hello author,

Thank you so much for being very helpful. I have some more questions; it would be very kind of you if you could share how you resolved these:

  1. How did you combine the visual features and the bounding box coordinates of the objects?
  2. After getting the many binary predicate models (classifiers), how do you combine these into one?

Thank you!
