Convergence criteria: improve docs, add per-component gradient #1107
Conversation
This allows `options.g_abstol` to be either a scalar (the traditional choice) or a vector. The motivation for this change is that the gradient depends on how each variable is scaled: consider

f₁(x, y) = x² + y² + 1 and f₂(x, y) = x² + 10⁸·y² + 1

Both have the same minimizer and minimum value, but at any point off the x-axis, derivatives with respect to `y` of f₂ are much larger than those of f₁. When roundoff error becomes a factor, the convergence requirement should be adjusted accordingly.
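To make the scaling point concrete, a tiny illustration (mine, not code from the PR) evaluating the analytic gradients at a point just off the x-axis:

```julia
# Illustration only (not code from the PR): gradients of the two objectives
# at a point just off the x-axis.
∇f1(x, y) = (2x, 2y)
∇f2(x, y) = (2x, 2e8 * y)

∇f1(0.5, 1e-9)   # (1.0, 2.0e-9): y-component already below g_abstol = 1e-8
∇f2(0.5, 1e-9)   # (1.0, 0.2): same point, y-component 10^8 times larger
```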
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   85.26%   85.34%   +0.08%
==========================================
  Files          45       45
  Lines        3502     3528      +26
==========================================
+ Hits         2986     3011      +25
- Misses        516      517       +1
```

View full report in Codecov by Sentry.
I suppose this partially addresses @stevengj's concern, insofar as it is now a bit more clearly described. I suppose the point about the relative tolerance is saved for later.
I didn't do the relative tolerance because that would be a breaking change, but we could do it if you're OK with making one. I agree that relative is better, but only if it's applied to each component individually (it's not scale-invariant otherwise). That does have one weakness, though: what if some component of the gradient starts at 0? One could of course keep a running tally of the largest absolute value one has ever encountered.
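Along the lines of that running tally, a hypothetical sketch (illustrative names, not Optim.jl API) of a per-component relative criterion:

```julia
# Hypothetical sketch (not Optim.jl API) of a per-component relative
# criterion that tolerates components whose gradient starts at 0: keep a
# running tally of the largest |gᵢ| seen and compare each component to it.
function g_converged!(gmax::AbstractVector, g::AbstractVector, rtol::Real)
    @. gmax = max(gmax, abs(g))        # running maximum of |gᵢ|
    return all(abs(g[i]) ≤ rtol * gmax[i] for i in eachindex(g))
end
```

With `gmax` initialized to zeros, a component that is identically zero never blocks convergence, while one that merely starts at zero is judged relative to the largest magnitude it later attains.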
There are algorithms that adaptively learn a per-component scaling (e.g. CCSA does this) and hence are more flexible about dimensionful variables. But I agree that many optimization algorithms seem to implicitly assume that the variables are comparably scaled.

However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is. Instead, I would recommend (1) rescaling the components of the unknowns so that they have comparable magnitudes, ideally of order unity, and (2) then setting an appropriate gradient tolerance, as sketched below. (Far too many algorithms have "magic numbers" inside the algorithm, typically tolerances for corner cases, that are implicitly dimensionful.)
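A minimal sketch of that recommendation, under the assumption that characteristic scales `s` for the unknowns are known (the objective here is the badly scaled f₂ from earlier in the thread):

```julia
using Optim

f(x) = x[1]^2 + 1e8 * x[2]^2 + 1     # badly scaled in x[2]
s = [1.0, 1e-4]                      # characteristic scales of the unknowns
g(u) = f(s .* u)                     # well scaled: u[1]^2 + u[2]^2 + 1

res = optimize(g, [1.0, 1.0], BFGS(); autodiff = :forward)
xmin = s .* Optim.minimizer(res)     # map back to the original coordinates
```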
In some sense this was motivated by hoping to rationalize the test suite (#1104) as much as anything else. If you know where the minimum is, you can evaluate the Hessian at the minimum, and thus estimate how much roundoff error should contribute to a nonzero gradient (see the sketch below). But there's a simpler way to achieve something similar: put a threshold on the actual value of …
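For the Hessian-based estimate, here is a sketch of the standard roundoff argument (my illustration; `gradient_floor` is not an Optim.jl function): near the minimum, f(x) ≈ f* + ½(x − x*)ᵀH(x − x*); since f is only resolved to about eps()·|f*|, the minimizer is located to δᵢ ≈ √(2·eps·|f*|/Hᵢᵢ), leaving a residual gradient of Hᵢᵢ·δᵢ ≈ √(2·eps·|f*|·Hᵢᵢ) in each component:

```julia
# Expected per-component gradient magnitude at a known minimum, due purely
# to roundoff in f (a sketch; not an Optim.jl function).
gradient_floor(fmin, Hdiag) = sqrt.(2 * eps(typeof(fmin)) * abs(fmin) .* Hdiag)

# e.g. for f₂ above at its minimum: f* = 1, diag(H) = [2, 2e8]
gradient_floor(1.0, [2.0, 2e8])   # ≈ [3.0e-8, 3.0e-4], a natural vector g_abstol
```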
Do you want to add the threshold in another PR?
This was inspired by #1102, and expands the discussion of convergence criteria in the documentation.
However, it also goes beyond that by optionally allowing the user to specify that the `g_abstol` criterion should change from

‖g(x)‖∞ ≤ g_abstol

to

|gᵢ(x)| ≤ g_abstolᵢ for all i.
That is, it avoids taking the norm. As pointed out in #1102 for the scaling of the objective by a coefficient, scaling the coordinates (e.g., using meters vs. kilometers) likewise changes the scaling of the gradient components. Thus, there is an argument to be made that we need a theory of optimization over non-metric vector spaces. I'm not aware of such a theory existing, and basically all the algorithms in this package assume a metric in one place or another. But at least this allows useful convergence criteria to be specified in a scale-invariant manner.
This is fully opt-in: by default `g_abstol = 1e-8`, which uses the previous criterion. But if you specify, e.g., `g_abstol = [1e-8, 1e-6]` (for an optimization problem with two variables), then you get the new one.

CC @stevengj
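A minimal usage sketch of the opt-in behavior (the scalar form is the existing `Optim.Options` API; the vector form is what this PR would add; `rosenbrock` is just an illustrative objective):

```julia
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Scalar tolerance: the traditional norm-based criterion (the default).
optimize(rosenbrock, zeros(2), BFGS(), Optim.Options(g_abstol = 1e-8))

# Vector tolerance: the new per-component criterion proposed in this PR.
optimize(rosenbrock, zeros(2), BFGS(), Optim.Options(g_abstol = [1e-8, 1e-6]))
```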