Hi there! I was recently working on implementing a custom loss function (binary focal loss) and found some of the documentation to be a bit confusing. The documentation of `Loss.get_grad` states that it should:
> Calculate the gradient of the loss with respect to the model outputs.
However, looking at the implementation of some of Thinc's built-in loss functions, `Loss.get_grad` actually calculates the gradient of the loss with respect to the logits used as input to the preceding softmax/sigmoid layer. For example, the `CategoricalCrossentropy` loss class computes the gradient as `guesses - targets`. This differs from the derivative of the loss with respect to the model outputs (the probabilities) by a factor of `1 / (p * (1 - p))`; that factor is cancelled by the derivative of the logistic function when differentiating with respect to the logits instead, leaving exactly `guesses - targets`.
This whole setup works because the softmax activation inside the softmax layer uses the identity function as its backward pass. Viewed in isolation that makes the layer's forward and backward passes inconsistent, but in combination with the loss everything balances out.
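To make the cancellation concrete, here is a small numerical check for the binary (sigmoid) case. This is a standalone NumPy sketch, not Thinc's code; `x`, `y`, and `p` are just illustrative names for the logits, the targets, and the model outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.5, -0.3, 2.0])   # logits fed into the sigmoid
y = np.array([1.0, 0.0, 0.0])    # binary targets
p = sigmoid(x)                   # model outputs (probabilities)

# Derivative of binary cross-entropy with respect to the outputs p:
#   dL/dp = (p - y) / (p * (1 - p))
grad_wrt_outputs = (p - y) / (p * (1 - p))

# Chain rule through the sigmoid (dp/dx = p * (1 - p)) cancels that factor,
# leaving the familiar "guesses - targets":
grad_wrt_logits = grad_wrt_outputs * p * (1 - p)
assert np.allclose(grad_wrt_logits, p - y)

# Finite-difference check that p - y really is dL/dx:
def loss(x):
    p = sigmoid(x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)
assert np.allclose(numeric, p - y, atol=1e-5)
```

So a `get_grad` that returns `guesses - targets` is returning the gradient with respect to the logits, and it only produces the right updates if the sigmoid/softmax layer passes that gradient through unchanged.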
I assume that this setup was selected to help improve numerical stability. The focal loss paper actually mentions this explicitly:
> we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.
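As a quick illustration of the stability point (again a generic NumPy sketch, not Thinc's implementation): computing the loss from the probability `sigmoid(x)` can blow up for extreme logits, while folding the sigmoid into the loss keeps everything finite.

```python
import numpy as np

x = np.array([-800.0])  # an extreme negative logit, with target y = 1

# Naive: compute the probability first, then the loss.
# np.exp(800) overflows to inf (NumPy warns), so p collapses to exactly 0.0
# and -log(p) becomes inf.
p = 1.0 / (1.0 + np.exp(-x))
naive_loss = -np.log(p)
print(naive_loss)       # [inf]

# Combined: -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x),
# which is evaluated stably and stays finite.
stable_loss = np.logaddexp(0.0, -x)
print(stable_loss)      # [800.]
```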
Anyways, the point of this issue is that the current documentation of `Loss.get_grad` is confusing, since the gradient is actually computed with respect to the logits, and not the model outputs, even though the model outputs are what is provided to the method. It would be great to have this clarified in the documentation 🙂
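For illustration, here is a minimal sketch of a binary focal loss gradient that follows this convention, i.e. it takes the model outputs (sigmoid probabilities) but returns the gradient with respect to the logits. It is plain NumPy rather than Thinc's API, the paper's alpha weighting is omitted, and a real version would subclass Thinc's `Loss` and handle normalisation.

```python
import numpy as np

def binary_focal_loss_grad(guesses, targets, gamma=2.0):
    """Gradient of binary focal loss with respect to the *logits*,
    computed from the sigmoid outputs, mirroring the convention used
    by Thinc's built-in losses."""
    p = np.clip(guesses, 1e-7, 1.0 - 1e-7)     # guard the log against 0/1
    p_t = np.where(targets == 1, p, 1.0 - p)   # probability of the true class
    sign = np.where(targets == 1, 1.0, -1.0)   # dp_t/dx = sign * p_t * (1 - p_t)
    # d/dx [ -(1 - p_t)^gamma * log(p_t) ]:
    return sign * (1.0 - p_t) ** gamma * (gamma * p_t * np.log(p_t) - (1.0 - p_t))

# With gamma=0 this reduces to plain cross-entropy's "guesses - targets":
p = np.array([0.9, 0.2, 0.7])
y = np.array([1.0, 0.0, 1.0])
assert np.allclose(binary_focal_loss_grad(p, y, gamma=0.0), p - y)
```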
Thanks for maintaining Thinc! 😄
Your Environment
- Operating System: macOS 13.5
- Python Version Used: 3.9.16
- Thinc Version Used: 8.1.10
- Environment Information: Poetry virtual environment, M1 Mac