
[Bug] DependencyCRF.log_prob returns a positive value when the input arc score is large. #61

Open
wangxinyu0922 opened this issue Mar 22, 2020 · 21 comments

Comments

@wangxinyu0922

wangxinyu0922 commented Mar 22, 2020

Hi, I found that if the log_potential I pass to DependencyCRF contains some large values, the marginal distribution ends up slightly over 1, and log_prob also returns a very large positive value during training.
My code is something like this:

dist = DependencyCRF(arc_scores, lengths=mask.sum(-1))
labels = dist.struct.to_parts(arcs[:,1:], lengths=mask.sum(-1)).type_as(arc_scores)
log_prob = dist.log_prob(labels)

I found that my arc_scores contains large values like 807687.0625:
[screenshot]
This results in the marginals containing values like 1.002:
[screenshot]
The final log_prob is a very large positive value of 750633.4375:
[screenshot]

I think this is a floating-point precision problem. Can you suggest anything to handle it?

@srush
Collaborator

srush commented Mar 22, 2020

Hmm, this is worrisome; we are pretty careful about precision, but it is possible we are losing some bits somewhere. I will try to replicate. This is on CUDA, right? Just for sanity, would you mind printing dist.partition?

@wangxinyu0922
Author

wangxinyu0922 commented Mar 23, 2020

> Hmm, this is worrisome; we are pretty careful about precision, but it is possible we are losing some bits somewhere. I will try to replicate. This is on CUDA, right? Just for sanity, would you mind printing dist.partition?

dist.partition and the gold tree score:
[screenshot]

@srush
Collaborator

srush commented Mar 23, 2020

That is strange. If possible, do you think you could send me a minimal example, perhaps as a Colab? Is it plausible that your gold tree is invalid somehow (either non-projective, or with multiple heads)? I am not sure why it would be so different from the partition.

@wangxinyu0922
Author

wangxinyu0922 commented Mar 23, 2020

My dataset contains a small proportion of non-projective trees. If this is what causes the wrong results, I should try converting those trees into projective ones.

@srush
Collaborator

srush commented Mar 23, 2020

I see, so I think there are two things happening here:

  1. minor: the marginal has a slight underflow,
  2. major: the score is very off because of non-projectivity.

I would recommend just dropping the non-projective trees; they are not correctly scored by the model and I don't check for them. If you want to add a check, I would be happy to take a PR (it's a non-trivial check to write fast, since you need to check all arc pairs).
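
For reference, a rough O(n^2) sketch of that arc-pair check (not part of torch-struct; it assumes heads[i] is the head of token i+1, with 0 denoting the root):

def is_projective(heads):
    # one span per token: (lower endpoint, upper endpoint) of each dependent-head arc
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # two arcs cross iff exactly one endpoint of one lies strictly inside the other
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# e.g. is_projective([2, 0, 2]) -> True, is_projective([3, 4, 0, 3]) -> False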

@wangxinyu0922
Author

> I see, so I think there are two things happening here:
>
> 1. minor: the marginal has a slight underflow,
> 2. major: the score is very off because of non-projectivity.
>
> I would recommend just dropping the non-projective trees; they are not correctly scored by the model and I don't check for them. If you want to add a check, I would be happy to take a PR (it's a non-trivial check to write fast, since you need to check all arc pairs).

Thanks for your suggestions. I may try handling the non-projective trees in the dataset. By the way, can I use NonProjectiveDependencyCRF instead of DependencyCRF? The documentation says that argmax has not been implemented. Is there a simple way to get the predictions?

@srush
Collaborator

srush commented Mar 23, 2020

I wouldn't recommend switching between the two. NonProj will be faster and more general, but less accurate in cases where the language is 99% projective. Unfortunately there is currently no easy way to get an argmax tree! You can try taking the max for each arc, but that will not guarantee it is a tree.

If you are interested in implementing the argmax, the algorithm is the Chu-Liu-Edmonds algorithm. It shouldn't be too hard to implement; it's just a bit different, so I didn't include it.
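
A minimal sketch of that per-arc max fallback (the (batch, dependent, head) layout of arc_scores is an assumption here):

import torch

def greedy_heads(arc_scores):
    # pick the highest-scoring head independently for every dependent;
    # nothing forces the result to be cycle-free, so it may not be a valid tree
    return arc_scores.argmax(dim=-1)

# hypothetical usage on random potentials
scores = torch.randn(2, 10, 10)
print(greedy_heads(scores).shape)  # torch.Size([2, 10])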

@wangxinyu0922
Author

wangxinyu0922 commented Mar 23, 2020

Now I'm trying to reproduce the problem with the following code:

arc_scores[:,:,:]=500000
arc_scores[:,:,:25]=-500000
dist = DependencyCRF(arc_scores, lengths=mask.sum(-1))
labels = dist.struct.to_parts(arcs[:,1:], lengths=mask.sum(-1)).type_as(arc_scores)
log_prob = dist.log_prob(labels)

This makes dist.marginals become larger than 1, regardless of whether the tree is projective:
[screenshot]
But log_prob is still a negative value. I'm still trying to make log_prob come out positive.

@wangxinyu0922
Author

Well, now I'm trying some cases with NonProjectiveDependencyCRF. The code is like this:

arc_scores[:,:,:]=0
dist = NonProjectiveDependencyCRF(arc_scores, lengths=mask.sum(-1))
labels = dist.struct.to_parts(arcs[:,1:], lengths=mask.sum(-1)).type_as(arc_scores)
log_prob = dist.log_prob(labels)

When the sentence is long (I tried 50 here), the output log_prob is -inf and dist.partition is inf. When the sentence is shorter (I tried one with 9 words), the model returns a correct value. Do you have any ideas?

@srush
Collaborator

srush commented Mar 23, 2020

No ideas yet, but I will look into both issues.

@kmkurn
Contributor

kmkurn commented Mar 24, 2020

Hi, sorry for jumping in, but I believe the issue with NonProjectiveDependencyCRF is the line below:

return lap.det().log()

It should use lap.logdet() instead, to avoid numerical error for large matrices.
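
A toy illustration of the difference (just a large diagonal matrix, not the actual Laplacian from the code):

import torch

lap = torch.eye(50) * 1e7     # stand-in for a large Laplacian-like matrix
print(lap.det().log())        # the determinant overflows float32 to inf, so this is inf
print(lap.logdet())           # stays finite: 50 * log(1e7) ~= 805.9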

Slightly off topic, but NonProjectiveDependencyCRF doesn't seem to handle variable-length sequences. I can't see where the lengths argument is used, or am I missing something?

@wangxinyu0922
Author

After setting a breakpoint to find where the problem occurs in my training code, I find that there is probably something wrong with the lengths argument in DependencyCRF, including the underflow problem. The code is:

temp_arcs=torch.load('temp_arcs.pt')
temp_mask=torch.load('temp_mask.pt')
# (batch, sent_len+1, sent_len+1), shape: (dependency, head)
# -1e9 is a masked value in my code
temp_scores=torch.load('temp_scores.pt')
# pdb.set_trace()
# basic
dist=generate_tree(temp_scores,temp_mask,is_mst=False)
labels = dist.struct.to_parts(temp_arcs, lengths=temp_mask.sum(-1)).type_as(temp_scores)
log_prob = dist.log_prob(labels)
# tensor([3.0518e-05], device='cuda:0', grad_fn=<SubBackward0>)
print(log_prob)

# masking the dependency part
temp_scores2=temp_scores*temp_mask.unsqueeze(-1)
dist=generate_tree(temp_scores2,temp_mask,is_mst=False)
labels = dist.struct.to_parts(temp_arcs, lengths=temp_mask.sum(-1)).type_as(temp_scores2)
log_prob = dist.log_prob(labels)
# tensor([3.0518e-05], device='cuda:0', grad_fn=<SubBackward0>)
print(log_prob)

# masking the head part
temp_scores3=temp_scores*temp_mask.unsqueeze(-2)
dist=generate_tree(temp_scores3,temp_mask,is_mst=False)
labels = dist.struct.to_parts(temp_arcs, lengths=temp_mask.sum(-1)).type_as(temp_scores3)
log_prob = dist.log_prob(labels)
# tensor([-0.0044], device='cuda:0', grad_fn=<SubBackward0>)
print(log_prob)

I have uploaded these files to OneDrive; I hope this will help us solve the problem quickly.

@srush
Collaborator

srush commented Mar 24, 2020

@kmkurn ah fantastic, I changed that line and made another issue about variable length.

@srush
Collaborator

srush commented Mar 24, 2020

@wangxinyu0922 thanks for the repro, I will see what I can do to fix.

@srush
Collaborator

srush commented Mar 24, 2020

@wangxinyu0922 I think your code had a bug in it.

[screenshot]

Your mask is zeroing out the potentials for the 0th head word.

@wangxinyu0922
Author

wangxinyu0922 commented Mar 25, 2020

> @wangxinyu0922 I think your code had a bug in it.
>
> [screenshot]
>
> Your mask is zeroing out the potentials for the 0th head word.

In my code there is a redundant '' word at position 0, and temp_scores has shape (dependency, head), so masking the 0th word is correct. To fit the requirements of DependencyCRF, the score of the 0th head word should be on the diagonal of the log_potential matrix without the '' token, and the matrix should have shape (head, dependency); this is handled in the process_potential function. So this should not be the problem.

Why does the third example (which masks the score of '' to 0 for all words, i.e. sets the diagonal of log_potential to 0) get a correct negative log_prob while the others do not? Did I miss something else?

@srush
Collaborator

srush commented Mar 25, 2020

Sorry, I'm still having trouble understanding. I'm really trying to figure it out, but I need a clearer example.

  1. The 0 word is a real word in the model; if you change its scores it will impact the model partition. If you want to change the root you need to change the diagonal.
  2. Making a log potential 0 does not remove it from the model. If you want to mask out a decision completely you need to make it -inf (see the sketch below).
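
For example (a minimal sketch with made-up shapes, not code from the library):

import torch

arc_scores = torch.randn(1, 5, 5)                 # hypothetical (batch, N, N) potentials
invalid = torch.zeros(1, 5, 5, dtype=torch.bool)
invalid[:, :, 3:] = True                          # pretend these head choices are padding
# a potential of 0 still contributes exp(0) = 1 to the partition;
# -inf is what actually removes the arc from the model
masked = arc_scores.masked_fill(invalid, float("-inf"))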

@wangxinyu0922
Author

wangxinyu0922 commented Mar 25, 2020

@srush

When I take your suggestion to mask out the log potentials with -inf, I find that log_prob becomes nan, and v in the dist.log_prob function in distribution.py makes the log_prob calculation give a wrong score.

Then I set the masked value back to -1e9. I checked the score function in helper.py and found what looks like a precision problem with torch.mul:

    def score(self, potentials, parts, batch_dims=[0]):
        score = torch.mul(potentials, parts)
        batch = tuple((score.shape[b] for b in batch_dims))
        return self.semiring.prod(score.view(batch + (-1,)))

Here is an example. My potentials are:
[screenshot]

The target labels are:
[screenshot]

The corresponding partition is:
[screenshot]

I use score1 = torch.mul(dist1.log_potentials, target_labels) and get:
[screenshot]

Things become strange when I try score1.sum(), which is 0.001 larger than the partition:
[screenshot]

I tried summing over parts of the matrix and found something strange, but I cannot explain why this happens. Do you have any suggestions?
[screenshot]
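
For what it's worth, here is a small float32 illustration (not torch-struct code) of how sums at this magnitude can drift by about 1e-3:

import torch

a = torch.tensor(807687.0625)                  # float32 by default
b = torch.tensor(123.4567)
# near 8e5 the float32 grid spacing is 0.0625, so the low bits of b are lost
print((a + b) - a)                             # ~123.4375
print((a.double() + b.double()) - a.double())  # float64 keeps ~123.4567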

@srush
Collaborator

srush commented Mar 25, 2020

I still do not understand exactly where you think the problem is. Do you think the problem is with the score function? We could try using torch.where instead of torch.mul and see if the value is different.

If you think the partition calculation is the issue, can you please send me a working minimal example where the partition is less than the value of a single tree?
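
A hypothetical drop-in variant of that score body using torch.where (just a sketch for the experiment, not library code):

import torch

def score_where(self, potentials, parts, batch_dims=[0]):
    # select the potential where parts == 1 and a hard 0 elsewhere; unlike
    # torch.mul, this cannot produce 0 * (-inf) = nan when a potential is -inf
    score = torch.where(parts.bool(), potentials, torch.zeros_like(potentials))
    batch = tuple(score.shape[b] for b in batch_dims)
    return self.semiring.prod(score.view(batch + (-1,)))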

@wangxinyu0922
Author

debug.zip
I tried using torch.where instead of torch.mul and the values are the same. Here is an example where the partition is less than the value of a single tree.

@srush
Collaborator

srush commented Mar 30, 2020

Thanks, this is a good repro case. I see the issue, but I do not yet have a good fix. I can't tell if it is an inherent issue in the algorithm or a bug somewhere.

[screenshot]
