
Convergence criteria: improve docs, add per-component gradient #1107

Open — wants to merge 2 commits into base: master
Conversation

timholy (Contributor) commented Oct 27, 2024

This was inspired by #1102, and expands the discussion of convergence criteria in the documentation.

However, it also goes beyond that by optionally allowing the user to specify that the `g_abstol` criterion should change from

`norm(g) <= g_abstol`

to

`all(abs(g[i]) <= g_abstol[i] for i = 1:n)`

That is, it avoids taking the norm. As pointed out in #1102 with the scaling of the objective coefficient, it's also the case that scaling the coordinates (e.g., using meters vs kilometers) changes the scaling of the gradient components. Thus, there is an argument to be made that we need a theory of optimization over non-metric vector spaces. I'm not aware of such a theory existing, and basically all the algorithms in this package assume a metric at some place or another. But at least this allows useful convergence criteria to be specified in a scale-invariant manner.

This is fully opt-in: by default `g_abstol = 1e-8`, and that uses the previous criterion. But if you specify, e.g., `g_abstol = [1e-8, 1e-6]` (for an optimization problem with two variables), then you use the new one.
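The two criteria can be sketched as follows. This is an illustrative sketch, not the PR's actual implementation; `gradient_converged` is a hypothetical helper name, and dispatch on the tolerance type mirrors the scalar-vs-vector behavior described above.

```julia
using LinearAlgebra

# Hypothetical helper, not Optim.jl API: scalar tolerance uses the norm,
# vector tolerance checks each gradient component against its own bound.
gradient_converged(g, g_abstol::Real) = norm(g) <= g_abstol
gradient_converged(g, g_abstol::AbstractVector) =
    all(abs(g[i]) <= g_abstol[i] for i in eachindex(g))

gradient_converged([1e-9, 1e-9], 1e-8)          # true: norm ≈ 1.4e-9
gradient_converged([1e-9, 1e-7], [1e-8, 1e-6])  # true: each component passes
```

Because the vector form compares component-by-component, rescaling one coordinate (and its tolerance entry by the same factor) leaves the verdict unchanged, which is the scale-invariance argued for above.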

CC @stevengj

This allows `options.g_abstol` to be either a scalar (the traditional
choice) or a vector. The motivation for this change is that the gradient
depends on how each variable is scaled:

     f₁(x, y) = x^2 + y^2 + 1

and

     f₂(x, y) = x^2 + 10^8 * y^2 + 1

Both have the same minimizer and minimum value, but at any point other
than the x-axis, derivatives with respect to `y` of `f₂` are much larger
than those of `f₁`. When roundoff error becomes a factor, the
convergence requirement should be adjusted accordingly.
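A quick numerical check of the claim above (a sketch; the gradient functions are written out by hand for illustration):

```julia
# Gradients of the two objectives from the commit message.
∇f₁(x, y) = (2x, 2y)
∇f₂(x, y) = (2x, 2 * 10^8 * y)

∇f₁(1.0, 1.0)  # (2.0, 2.0)
∇f₂(1.0, 1.0)  # (2.0, 2.0e8): the y-component is 10^8 times larger
```

With a scalar `g_abstol`, the huge y-component of `∇f₂` dominates the norm, so the same tolerance behaves very differently for the two problems even though they share a minimizer and minimum value.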

codecov bot commented Oct 27, 2024

Codecov Report

Attention: Patch coverage is 96.96970% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.34%. Comparing base (e890943) to head (47f2277).

Files with missing lines Patch % Lines
src/utilities/assess_convergence.jl 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   85.26%   85.34%   +0.08%     
==========================================
  Files          45       45              
  Lines        3502     3528      +26     
==========================================
+ Hits         2986     3011      +25     
- Misses        516      517       +1     


pkofod (Member) commented Oct 29, 2024

I suppose this partially addresses @stevengj's concern, insofar as it is now a bit more clearly described. I suppose the point about the relative tolerance is saved for later.

timholy (Contributor, Author) commented Oct 29, 2024

I didn't do the relative tolerance because that would be a breaking change. But we could do that if you're OK with making a breaking change.

I agree that relative is better, but only if it's applied to each component individually (it's not scale-invariant otherwise). But that does have one weakness: what if some component of the gradient starts at 0? One could of course keep a running tally of the largest absolute value one has ever encountered.
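The running-tally idea can be sketched like this (purely illustrative; `grad_converged_rel!` and `gmax` are hypothetical names, not anything in this PR):

```julia
# Per-component *relative* criterion with a running maximum, so a
# component that starts at 0 is judged against the largest magnitude
# it has ever reached rather than against 0.
function grad_converged_rel!(gmax, g, g_reltol)
    @. gmax = max(gmax, abs(g))   # update the running tally in place
    all(abs(g[i]) <= g_reltol * gmax[i] for i in eachindex(g))
end

gmax = zeros(2)
grad_converged_rel!(gmax, [1.0, 2.0], 1e-8)    # false: far from converged
grad_converged_rel!(gmax, [1e-9, 1e-8], 1e-8)  # true, relative to gmax = [1.0, 2.0]
```

Like the absolute per-component test, this is scale-invariant: rescaling a coordinate rescales `g[i]` and `gmax[i]` by the same factor.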

stevengj commented Oct 30, 2024

There are algorithms that adaptively learn a per-component scaling (e.g. CCSA does this) and hence are more flexible about dimensionful variables. But I agree that many optimization algorithms seem to implicitly assume that $x^T x$ is a reasonable norm, corresponding in dimensionful cases to an implicit weighted norm $x^T W x$ where the weight $W$ is a diagonal matrix of the inverses of the units of each component.

However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is. Instead, I would recommend (1) rescaling the components of the unknowns so that they have comparable magnitudes, ideally of order unity and (2) then setting an appropriate gradient tolerance.
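Recommendation (1) can be sketched as a simple change of variables (names and the particular `scale` are illustrative, not from this PR):

```julia
# Rescale the unknowns so both components have comparable magnitude.
scale = [1.0, 1e-4]              # y varies on a ~1e4-smaller scale
f(x, y) = x^2 + 10^8 * y^2 + 1
f_scaled(u) = f(u[1] * scale[1], u[2] * scale[2])

# In the scaled coordinates u, both terms contribute comparably:
f_scaled([1.0, 1.0])  # == f(1.0, 1e-4) == 1.0 + 1.0 + 1 == 3.0
```

One then optimizes `f_scaled` over `u` with an ordinary scalar gradient tolerance, and maps the result back via `x = u .* scale`.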

(Far too many algorithms have "magic numbers" inside the algorithm, typically tolerances for corner cases, that are implicitly dimensionful.)

timholy (Contributor, Author) commented Nov 6, 2024

> However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is.

In some sense this was motivated by hoping to rationalize the test suite (#1104) as much as anything else. If you know where the minimum is, you can evaluate the Hessian at the minimum, and thus estimate how much roundoff error should contribute to a nonzero gradient. But there's a simpler way to achieve something similar: put a threshold on the actual value of f, i.e., if f(x) <= thresh then you've converged. That would be easy to set, might work robustly for the test suite, and may be more useful in practice than this PR. For example, if you're doing least-squares, you know that any point that makes f(x) "tiny" (by some user-specified definition of tiny) must be very close to a minimum.
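A minimal sketch of that objective-value stopping rule (the toy residuals and the `f_converged` helper are illustrative, not proposed API):

```julia
# Stop as soon as the objective value itself drops below a user threshold.
f_converged(fx, thresh) = fx <= thresh

# Toy least-squares problem with known minimum value 0 at x = [1, 2].
residuals(x) = [x[1] - 1, x[2] - 2]
ssq(x) = sum(abs2, residuals(x))

f_converged(ssq([1.0, 2.0]), 1e-12)  # true: exact minimizer
f_converged(ssq([1.1, 2.0]), 1e-12)  # false: sum of squares is 0.01
```

Unlike a gradient test, this requires knowing a meaningful threshold for `f` itself, which is exactly what least-squares-style problems provide.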

pkofod (Member) commented Nov 8, 2024

> However, I doubt that asking the user to provide even more information here (a per-component gradient tolerance) is so helpful (though it doesn't hurt to have the option) … they barely know how to set tolerances as it is.
>
> In some sense this was motivated by hoping to rationalize the test suite (#1104) as much as anything else. If you know where the minimum is, you can evaluate the Hessian at the minimum, and thus estimate how much roundoff error should contribute to a nonzero gradient. But there's a simpler way to achieve something similar: put a threshold on the actual value of f, i.e., if f(x) <= thresh then you've converged. That would be easy to set, might work robustly for the test suite, and may be more useful in practice than this PR. For example, if you're doing least-squares, you know that any point that makes f(x) "tiny" (by some user-specified definition of tiny) must be very close to a minimum.

do you want to add the threshold in another PR?

3 participants