
Convergence criteria: improve docs, add per-component gradient #1107

Open — wants to merge 2 commits into base: master
Conversation

timholy (Contributor) commented Oct 27, 2024

This was inspired by #1102, and expands the discussion of convergence criteria in the documentation.

However, it also goes beyond that by optionally allowing the user to specify that the `g_abstol` criterion should change from

`norm(g) <= g_abstol`

to

`all(abs(g[i]) <= g_abstol[i] for i = 1:n)`

That is, it avoids taking the norm. As pointed out in #1102 with the scaling of the objective coefficient, it's also the case that scaling the coordinates (e.g., using meters vs kilometers) changes the scaling of the gradient components. Thus, there is an argument to be made that we need a theory of optimization over non-metric vector spaces. I'm not aware of such a theory existing, and basically all the algorithms in this package assume a metric at some place or another. But at least this allows useful convergence criteria to be specified in a scale-invariant manner.

This is fully opt-in: by default `g_abstol = 1e-8`, and that uses the previous criterion. But if you specify, e.g., `g_abstol = [1e-8, 1e-6]` (for an optimization problem with two variables), then you use the new one.
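The two criteria can be sketched as follows. This is an illustrative sketch, not the PR's actual implementation; `gradient_converged` is a hypothetical helper name, and dispatch on the tolerance type mirrors the scalar-vs-vector behavior described above.

```julia
using LinearAlgebra

# Hypothetical helper, not Optim.jl API: scalar tolerance uses the norm,
# vector tolerance checks each gradient component against its own bound.
gradient_converged(g, g_abstol::Real) = norm(g) <= g_abstol
gradient_converged(g, g_abstol::AbstractVector) =
    all(abs(g[i]) <= g_abstol[i] for i in eachindex(g))

gradient_converged([1e-9, 1e-9], 1e-8)          # true: norm ≈ 1.4e-9
gradient_converged([1e-9, 1e-7], [1e-8, 1e-6])  # true: each component passes
```

Because the vector form compares component-by-component, rescaling one coordinate (and its tolerance entry by the same factor) leaves the verdict unchanged, which is the scale-invariance argued for above.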

CC @stevengj

This allows `options.g_abstol` to be either a scalar (the traditional
choice) or a vector. The motivation for this change is that the gradient
depends on how each variable is scaled:

     f₁(x, y) = x^2 + y^2 + 1

and

     f₂(x, y) = x^2 + 10^8 * y^2 + 1

Both have the same minimizer and minimum value, but at any point other
than the x-axis, derivatives with respect to `y` of `f₂` are much larger
than those of `f₁`. When roundoff error becomes a factor, the
convergence requirement should be adjusted accordingly.
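A quick numerical check of the claim above (a sketch; the gradient functions are written out by hand for illustration):

```julia
# Gradients of the two objectives from the commit message.
∇f₁(x, y) = (2x, 2y)
∇f₂(x, y) = (2x, 2 * 10^8 * y)

∇f₁(1.0, 1.0)  # (2.0, 2.0)
∇f₂(1.0, 1.0)  # (2.0, 2.0e8): the y-component is 10^8 times larger
```

With a scalar `g_abstol`, the huge y-component of `∇f₂` dominates the norm, so the same tolerance behaves very differently for the two problems even though they share a minimizer and minimum value.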

codecov bot commented Oct 27, 2024

Codecov Report

Attention: Patch coverage is 96.96970% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.34%. Comparing base (e890943) to head (47f2277).

Files with missing lines Patch % Lines
src/utilities/assess_convergence.jl 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   85.26%   85.34%   +0.08%     
==========================================
  Files          45       45              
  Lines        3502     3528      +26     
==========================================
+ Hits         2986     3011      +25     
- Misses        516      517       +1     


pkofod (Member) commented Oct 29, 2024

I suppose this partially addresses @stevengj's concern, insofar as it is now a bit more clearly described. I suppose the point about the relative tolerance is saved for later.

timholy (Contributor, Author) commented Oct 29, 2024

I didn't do the relative tolerance because that would be a breaking change. But we could do that if you're OK with making a breaking change.

I agree that relative is better, but only if it's applied to each component individually (it's not scale-invariant otherwise). But that does have one weakness: what if some component of the gradient starts at 0? One could of course keep a running tally of the largest absolute value one has ever encountered.
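The running-tally idea can be sketched like this (purely illustrative; `grad_converged_rel!` and `gmax` are hypothetical names, not anything in this PR):

```julia
# Per-component *relative* criterion with a running maximum, so a
# component that starts at 0 is judged against the largest magnitude
# it has ever reached rather than against 0.
function grad_converged_rel!(gmax, g, g_reltol)
    @. gmax = max(gmax, abs(g))   # update the running tally in place
    all(abs(g[i]) <= g_reltol * gmax[i] for i in eachindex(g))
end

gmax = zeros(2)
grad_converged_rel!(gmax, [1.0, 2.0], 1e-8)    # false: far from converged
grad_converged_rel!(gmax, [1e-9, 1e-8], 1e-8)  # true, relative to gmax = [1.0, 2.0]
```

Like the absolute per-component test, this is scale-invariant: rescaling a coordinate rescales `g[i]` and `gmax[i]` by the same factor.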

stevengj commented Oct 30, 2024

There are algorithms that adaptively learn a per-component scaling (e.g. CCSA does this) and hence are more flexible about dimensionful variables. But I agree that many optimization algorithms seem to implicitly assume that $x^T x$ is a reasonable norm, corresponding in dimensionful cases to an implicit weighted norm $x^T W x$ where the weight $W$ is a diagonal matrix of the inverses of the units of each component.

However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is. Instead, I would recommend (1) rescaling the components of the unknowns so that they have comparable magnitudes, ideally of order unity and (2) then setting an appropriate gradient tolerance.
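Recommendation (1) can be sketched as a simple change of variables (names and the particular `scale` are illustrative, not from this PR):

```julia
# Rescale the unknowns so both components have comparable magnitude.
scale = [1.0, 1e-4]              # y varies on a ~1e4-smaller scale
f(x, y) = x^2 + 10^8 * y^2 + 1
f_scaled(u) = f(u[1] * scale[1], u[2] * scale[2])

# In the scaled coordinates u, both terms contribute comparably:
f_scaled([1.0, 1.0])  # == f(1.0, 1e-4) == 1.0 + 1.0 + 1 == 3.0
```

One then optimizes `f_scaled` over `u` with an ordinary scalar gradient tolerance, and maps the result back via `x = u .* scale`.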

(Far too many algorithms have "magic numbers" inside the algorithm, typically tolerances for corner cases, that are implicitly dimensionful.)

timholy (Contributor, Author) commented Nov 6, 2024

> However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is.

In some sense this was motivated by hoping to rationalize the test suite (#1104) as much as anything else. If you know where the minimum is, you can evaluate the Hessian at the minimum, and thus estimate how much roundoff error should contribute to a nonzero gradient. But there's a simpler way to achieve something similar: put a threshold on the actual value of f, i.e., if f(x) <= thresh then you've converged. That would be easy to set, might work robustly for the test suite, and may be more useful in practice than this PR. For example, if you're doing least-squares, you know that any point that makes f(x) "tiny" (by some user-specified definition of tiny) must be very close to a minimum.
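A minimal sketch of that objective-value stopping rule (the toy residuals and the `f_converged` helper are illustrative, not proposed API):

```julia
# Stop as soon as the objective value itself drops below a user threshold.
f_converged(fx, thresh) = fx <= thresh

# Toy least-squares problem with known minimum value 0 at x = [1, 2].
residuals(x) = [x[1] - 1, x[2] - 2]
ssq(x) = sum(abs2, residuals(x))

f_converged(ssq([1.0, 2.0]), 1e-12)  # true: exact minimizer
f_converged(ssq([1.1, 2.0]), 1e-12)  # false: sum of squares is 0.01
```

Unlike a gradient test, this requires knowing a meaningful threshold for `f` itself, which is exactly what least-squares-style problems provide.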

pkofod (Member) commented Nov 8, 2024

> However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is.
>
> In some sense this was motivated by hoping to rationalize the test suite (#1104) as much as anything else. If you know where the minimum is, you can evaluate the Hessian at the minimum, and thus estimate how much roundoff error should contribute to a nonzero gradient. But there's a simpler way to achieve something similar: put a threshold on the actual value of f, i.e., if f(x) <= thresh then you've converged. That would be easy to set, might work robustly for the test suite, and may be more useful in practice than this PR. For example, if you're doing least-squares, you know that any point that makes f(x) "tiny" (by some user-specified definition of tiny) must be very close to a minimum.

do you want to add the threshold in another PR?

3 participants