👌 PwBaseWorkChain: Improve handling of SCF convergence issues #987

Draft · wants to merge 5 commits into `main`
8 changes: 6 additions & 2 deletions src/aiida_quantumespresso/workflows/protocols/pw/base.yaml
@@ -24,9 +24,9 @@ default_inputs:
nosym: False
occupations: smearing
smearing: cold
degauss: 0.01
degauss: 0.02
ELECTRONS:
electron_maxstep: 80
electron_maxstep: 50
**Collaborator:**

Shall we set it a bit larger, e.g. 60, so as to include 99% of cases instead? Or would you say that one would eventually just restart the calculation anyway if the slope is good?

**Member Author:**

That was my thinking. And we would be a bit more pre-emptive in fixing the ones that do not converge, potentially saving time that would otherwise be wasted. In the end I suppose the question is what we prefer: avoiding restarts or avoiding wasted resources? With a slope threshold of -0.1, it's very likely that we'll catch all the calculations that just need more steps, and those rare edge cases might still benefit from some reduced mixing and hence not be lost.

That said, 60 is probably also fine (it includes 98.8% of cases, so 1.3% more ^^).
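For concreteness, a minimal sketch of the slope criterion under discussion: the "slope" is the linear-fit coefficient of log(scf accuracy) versus step index, mirroring the `numpy.polyfit` call in the handler diff further below. The helper name and the example data are illustrative only:

```python
import numpy as np

def scf_accuracy_slope(scf_accuracies):
    """Slope of a linear fit through log(scf accuracy) vs. step index."""
    steps = np.arange(len(scf_accuracies))
    slope, _intercept = np.polyfit(steps, np.log(scf_accuracies), 1)
    return slope

# Accuracies decaying as 1e-2 * exp(-0.2 * step) give a slope of ~ -0.2,
# well below the -0.1 threshold, so this run would be allowed to continue.
accuracies = 1.0e-2 * np.exp(-0.2 * np.arange(50))
print(scf_accuracy_slope(accuracies))  # ~ -0.2
```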

**Member Author:**

Then again... I was redoing the analysis on the slopes, but this time, instead of looking at the slopes over all steps, I looked only at the first 50 steps we would actually have calculated had we set electron_maxstep to 50:

[Figure: distribution of scf accuracy slopes computed over the first 50 steps]

Now we see quite a few runs with slopes > -0.1 that do converge. Looking at the evolution of some of their scf accuracies:

[Figure: scf accuracy vs. step for runs whose slope over the first 50 steps is > -0.1 but that still converge within 80 steps]

We can see that, indeed, only after around step 50 does the SCF "catch its groove" and the convergence begin to accelerate. Maybe this would also happen if we restarted from the charge density and reduced the mixing, but that isn't certain and it would most likely be slower.

You can play around with the "max nstep" to consider. Of course, as you increase it, you have fewer cases where the slope before "max nstep" is > -0.1 yet the calculation converges, but you also have fewer cases of successful convergence between "max nstep" and 80 steps.

So we could ask two questions: if we had set electron_maxstep to X and used -0.1 as the slope threshold,

  • what's the chance that we did an "incorrect stop", i.e. the slope is > -0.1, but the calculation converges in 80 steps?
  • what's the chance that we did an "incorrect restart", i.e. the slope is < -0.1, but the calculation fails to converge in 80 steps?
| max nstep | # incorrect stops | % incorrect stops | # incorrect restarts | % incorrect restarts |
|----------:|------------------:|------------------:|---------------------:|---------------------:|
| 70 | 13 | 6.3 | 1153 | 46.2 |
| 60 | 27 | 5.4 | 1292 | 51.7 |
| 50 | 54 | 5.3 | 1425 | 57.0 |
| 40 | 61 | 2.9 | 1584 | 63.4 |
| 30 | 72 | 1.6 | 1700 | 68.1 |
| 20 | 80 | 0.7 | 1919 | 76.8 |

But frankly, these % data are also skewed by the fact that we only have data up to step 80: if we had data up to 200 steps, the % of incorrect stops would go down for every "max nstep", but most likely more for 70 than for 40.

So to do this analysis "right", we'd need to gather statistics for a larger electron_maxstep. Moreover, the question is what happens if we "incorrectly stop" a calculation and decide to e.g. reduce the mixing. It's all quite a lot of analysis without too much benefit in the end, I think. 60 seems like a reasonable choice to me, so I'll pick that unless there are objections.
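A sketch of the bookkeeping behind the table above, assuming each run is summarized by its slope at the trial "max nstep" and whether it converged within 80 steps (the `runs` records and their fields are hypothetical):

```python
def count_misclassifications(runs, threshold=-0.1):
    """Count runs the slope criterion would stop or restart incorrectly.

    Each entry of ``runs`` is assumed to be a dict with a ``slope`` measured
    at the trial "max nstep" and a ``converged`` flag for the full 80 steps.
    """
    incorrect_stops = sum(1 for r in runs if r['slope'] > threshold and r['converged'])
    incorrect_restarts = sum(1 for r in runs if r['slope'] < threshold and not r['converged'])
    return incorrect_stops, incorrect_restarts
```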

One note though: it's clear from the "scf accuracy" curves above that extrapolation may not always give a reasonable value. I'd keep it simple: if the slope is < -0.1, just double electron_maxstep to 120 and restart from the last charge density.
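That simple policy, as a standalone sketch (the function name is illustrative; in the work chain this decision is taken inside the error handler shown further below):

```python
def next_electron_maxstep(slope, current_maxstep, threshold=-0.1):
    """Double electron_maxstep when the run is steadily converging, else signal a strategy switch."""
    if slope < threshold:
        return 2 * current_maxstep  # e.g. 60 -> 120, restarting from the last charge density
    return None  # slope too flat: try another strategy (mixing, diagonalization, ...)

print(next_electron_maxstep(-0.25, 60))  # 120
```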

**Collaborator:**

Thanks @mbercx for the analysis. I must say I couldn't completely follow the arguments, but the figure you showed is also interesting. Maybe we should compute the slope only over e.g. the last 20 steps, and based on that decide whether to restart or to change the mixing mode?

**Member Author:**

Yeah, it was a little late, so maybe my explanation wasn't the best. ^^ Let me summarize:

In my previous analysis I used the slope over all 80 steps to see if we can decide after 50 steps to continue with more steps or switch to another strategy. That's of course not correct, since at step 50 we'd only have information about the first 50 steps. If we look at the slope distributions for the first 50 steps (first image in my post above), we can see that there are more cases where selecting -0.1 as the threshold would lead to our approach switching to another strategy (e.g. adapting mixing_beta) or doing an "incorrect stop".

I was curious whether we could "optimize" the choice of electron_maxstep, i.e. minimize the % of incorrect stops. But to judge this properly I think we need data for runs with a larger electron_maxstep, since we don't really know which of the failed runs would have converged had we run with more than 80 steps. Also: we don't know whether switching to e.g. a lower mixing_beta would improve or worsen the convergence after step 50. So perhaps we shouldn't overdo this analysis for now.

> Maybe we should compute the slope only over e.g. the last 20 steps

I was also thinking of this approach. Here is the analysis:

```console
$ ./gather_stats.py slopes --last 20 --max-nstep 50 workchain/PBEsol/scf
Report: Number of completed pw.x with > 50 SCF steps: 3524
Report: Calculating slopes from step 30 to 50.
Report: Incorrect stops: 130 | 12.7% of 1026 successful runs with > 50 steps
Report: Incorrect restarts: 834 | 33.4% of 2498 failed runs
```

[Figure: distribution of slopes computed over steps 30 to 50]

So based on this setup (information about 80 steps, a trial "max nstep" of 50, slope threshold -0.1), checking only the last 20 steps increases the number of incorrect stops but reduces the number of incorrect restarts. Below are some examples of runs we would incorrectly stop:

[Figure: scf accuracy evolution for example runs that would be incorrectly stopped]

So I'm not sure whether this really improves things. I suppose it depends on what we prioritise: convergence or saving resources.
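The windowed variant analysed above amounts to restricting the fit to the last `window` steps, as in this sketch (the same slicing appears as `conv_slope_range` in the handler diff below; the helper name is illustrative):

```python
import numpy as np

def windowed_slope(scf_accuracies, window=20):
    """Slope of log(scf accuracy) fitted over the last `window` steps only."""
    tail = np.asarray(scf_accuracies)[-window:]
    return np.polyfit(np.arange(len(tail)), np.log(tail), 1)[0]
```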

**Collaborator:**

Thanks @mbercx! Ok, I see now. Well, the point is also whether a change of strategy (e.g. mixing beta) will ever produce a significant change within the first 50 (or 60, or whatever) iterations. There is a high chance that we keep switching strategies when just a few more iterations would have achieved convergence.
So I don't know how many resources we would really save if we cannot be reasonably confident that a change fixes the problem. So I would vote for even increasing that maximum number of iterations (which, by the way, I think defaults to 100 in QE).

mixing_beta: 0.4
default_protocol: moderate
protocols:
@@ -43,6 +43,8 @@ protocols:
parameters:
CONTROL:
forc_conv_thr: 0.5e-4
SYSTEM:
degauss: 0.0125
fast:
description: 'Protocol to perform the computation at low precision at minimal computational cost for testing purposes.'
kpoints_distance: 0.50
@@ -53,3 +55,5 @@ protocols:
parameters:
CONTROL:
forc_conv_thr: 1.e-3
SYSTEM:
degauss: 0.0275
92 changes: 84 additions & 8 deletions src/aiida_quantumespresso/workflows/pw/base.py
@@ -29,10 +29,14 @@ class PwBaseWorkChain(ProtocolMixin, BaseRestartWorkChain):
'qe': qe_defaults,
'delta_threshold_degauss': 30,
'delta_factor_degauss': 0.1,
'delta_factor_mixing_beta': 0.8,
'delta_factor_mixing_beta': 0.5,
**Collaborator:**

So if we start from 0.4 then the next values will be 0.2, 0.1, 0.05, ... Nevertheless, 0.05 is rather uncommon. Shall we set something to reproduce e.g. 0.3, 0.2, 0.1, 0.05? Or probably just set a list of them? Ok, here I don't consider starting from a different mixing beta. What do you think?
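The two options being weighed here, sketched side by side (the names are illustrative, not part of the PR): a constant reduction factor versus an explicit schedule of `mixing_beta` values.

```python
def betas_by_factor(start=0.4, factor=0.5, floor=0.05):
    """Yield successively reduced mixing_beta values down to a floor."""
    beta = start
    while beta >= floor:
        yield round(beta, 3)
        beta *= factor

print(list(betas_by_factor()))         # [0.4, 0.2, 0.1, 0.05]
BETA_SCHEDULE = [0.3, 0.2, 0.1, 0.05]  # the explicit-list alternative
```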

**Collaborator:**

Actually, coming back here: it is probably just best to leave 0.5 as the factor, so one quickly explores the possible betas. No need to stay too close, I guess.

**Member Author:**

Frankly, I'm not sure either. My thinking is that, in case reducing the mixing improves the convergence, it's better to reduce a bit more, then check whether the slope is < -0.1 after 50 steps, and if so give the run more max steps.

**Collaborator:**

Yeah, makes sense. Something like in the ph.x code, where you can change the mixing beta at a certain point.

'delta_addition_mixing_ndim': 4,
'delta_factor_max_seconds': 0.95,
'delta_factor_nbnd': 0.05,
'delta_minimum_nbnd': 4,
'conv_slope_threshold': -0.1,
**Collaborator:**

max_electronic_steps too? E.g. ~200-250?

'conv_slope_range': 30,
'low_symmetry_threshold': 6
})

@classmethod
@@ -568,15 +572,87 @@ def handle_electronic_convergence_not_reached(self, calculation):

Decrease the mixing beta and fully restart from the previous calculation.
"""
factor = self.defaults.delta_factor_mixing_beta
mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta)
mixing_beta_new = mixing_beta * factor
import numpy

self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new
action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation'
scf_accuracy = calculation.tools.get_scf_accuracy(-1)[-self.defaults.conv_slope_range:]
scf_accuracy_slope = numpy.polyfit(numpy.arange(0, len(scf_accuracy)), numpy.log(scf_accuracy), 1)[0]
**Collaborator:**

I would also take the intercept and use it to extrapolate the number of extra steps to perform. Or otherwise, we could simply increase it to a fixed maximum, e.g. electron_maxstep = self.defaults.max_electronic_steps?
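A sketch of that suggestion: extrapolate the fitted line through log(scf accuracy) to estimate how many extra steps would reach the convergence threshold. The helper name is illustrative, and `conv_thr` here just reuses the protocol value visible in the test fixtures below; this is not part of the PR:

```python
import numpy as np

def estimate_extra_steps(scf_accuracies, conv_thr=4.0e-10):
    """Extrapolate the log-linear fit to the step where conv_thr is reached."""
    steps = np.arange(len(scf_accuracies))
    slope, intercept = np.polyfit(steps, np.log(scf_accuracies), 1)
    if slope >= 0:
        return None  # not converging; extrapolation is meaningless
    # Solve slope * n + intercept = log(conv_thr) for the total step count n.
    n_total = (np.log(conv_thr) - intercept) / slope
    return max(0, int(np.ceil(n_total)) - len(scf_accuracies))
```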


mixing_beta = self.ctx.inputs.parameters['ELECTRONS'].get('mixing_beta', self.defaults.qe.mixing_beta)
mixing_ndim = self.ctx.inputs.parameters['ELECTRONS'].get('mixing_ndim', self.defaults.qe.mixing_ndim)
mixing_mode = self.ctx.inputs.parameters['ELECTRONS'].get('mixing_mode', self.defaults.qe.mixing_mode)
low_symmetry_structure = (
len(calculation.outputs.output_parameters.get_dict()['symmetries']) < self.defaults.low_symmetry_threshold
)
low_dim_structure = calculation.inputs.structure.pbc != (True, True, True)
nbnd_cur = calculation.outputs.output_parameters.get_dict()['number_of_bands']
diagonalization = self.ctx.inputs.parameters['ELECTRONS'].get('diagonalization', 'david')

self.report(f'number of bands: {nbnd_cur}')
self.report(f'scf accuracy slope: {scf_accuracy_slope:.2f}')
self.report(f'mixing beta: {mixing_beta:.2f}')
self.report(f'mixing ndim: {mixing_ndim}')
self.report(f'mixing mode: {mixing_mode}')
self.report(f"structure symmetries: {len(calculation.outputs.output_parameters.get_dict()['symmetries'])}")
self.report(f'low symmetry structure: {low_symmetry_structure}')
self.report(f'low dimension structure: {low_dim_structure}')

**Collaborator:**

Update `ELECTRONS.electron_maxstep`. Discussion needed.

if 'scf_failed_once' not in self.ctx:
nbnd_new = nbnd_cur + max(int(nbnd_cur * self.defaults.delta_factor_nbnd), self.defaults.delta_minimum_nbnd)
self.ctx.inputs.parameters['SYSTEM']['nbnd'] = nbnd_new
self.report(
f'First SCF failure encountered: increasing number of bands to {nbnd_new}'
)
self.ctx.scf_failed_once = True
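# Illustration (not in the diff): with delta_factor_nbnd = 0.05 and
# delta_minimum_nbnd = 4, a run with nbnd_cur = 40 gets
# nbnd_new = 40 + max(int(40 * 0.05), 4) = 40 + 4 = 44 bands.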

if scf_accuracy_slope < self.defaults.conv_slope_threshold:
action = (
f'electronic convergence not reached but the scf accuracy slope ({scf_accuracy_slope:.2f}) is smaller '
f'than the threshold ({self.defaults.conv_slope_threshold:.2f}): restart from the last calculation.'
)
self.set_restart_type(RestartType.FROM_CHARGE_DENSITY, calculation.outputs.remote_folder)
self.report_error_handled(calculation, action)
return ProcessHandlerReport(True)

if mixing_mode == 'plain' and low_dim_structure:

self.ctx.inputs.parameters['ELECTRONS']['mixing_mode'] = 'local-TF'
action = (
'electronic convergence not reached and structure is low dimensional: switch to local-TF mixing and '
'restart from the last calculation.'
)
self.set_restart_type(RestartType.FROM_CHARGE_DENSITY, calculation.outputs.remote_folder)
self.report_error_handled(calculation, action)
return ProcessHandlerReport(True)

if 'diagonalizations' not in self.ctx:
# Initialize a list of diagonalisations that haven't been tried, stored in reverse order of preference since `pop()` takes from the end
self.ctx.diagonalizations = [value for value in ['cg', 'paro', 'ppcg', 'rmm-paro', 'david'] if value != diagonalization.lower()]

try:
new = self.ctx.diagonalizations.pop()
self.ctx.inputs.parameters['ELECTRONS']['diagonalization'] = new
action = f'electronic convergence not reached: switching to `{new}` diagonalization.'
self.set_restart_type(RestartType.FROM_CHARGE_DENSITY, calculation.outputs.remote_folder)
self.report_error_handled(calculation, action)
return ProcessHandlerReport(True)
except IndexError:
pass

if mixing_beta > 0.1:

mixing_beta_new = mixing_beta * self.defaults.delta_factor_mixing_beta
mixing_ndim_new = mixing_ndim + self.defaults.delta_addition_mixing_ndim

self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new
self.ctx.inputs.parameters['ELECTRONS']['mixing_ndim'] = mixing_ndim_new
action = (
f'reduced beta mixing from {mixing_beta} to {mixing_beta_new}, increased `mixing_ndim` from '
f'{mixing_ndim} to {mixing_ndim_new} and restarting from the last calculation.'
)
self.set_restart_type(RestartType.FROM_CHARGE_DENSITY, calculation.outputs.remote_folder)
self.report_error_handled(calculation, action)
return ProcessHandlerReport(True)

**Collaborator:**

mixing_mode = self.ctx.inputs.parameters['ELECTRONS'].get('mixing_mode', 'plain')
if mixing_mode == 'plain':
  self.ctx.inputs.parameters['ELECTRONS']['mixing_mode'] = 'local-TF'
  [...]
mixing_ndim = self.ctx.inputs.parameters['ELECTRONS'].get('mixing_ndim', 8)
if mixing_ndim <= 8:
  self.ctx.inputs.parameters['ELECTRONS']['mixing_ndim'] = 20
  [...]

I would also try these solutions. Maybe at the same time?

self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
self.report_error_handled(calculation, action)
return ProcessHandlerReport(True)

@process_handler(priority=420, exit_codes=[
3 changes: 2 additions & 1 deletion tests/conftest.py
@@ -246,6 +246,7 @@ def _generate_calc_job_node(
filename = os.path.join(entry_point_name[len('quantumespresso.'):], test_name)
filepath_folder = os.path.join(basepath, 'parsers', 'fixtures', filename)
filepath_input = os.path.join(filepath_folder, 'aiida.in')
print(basepath)

entry_point = format_entry_point_string('aiida.calculations', entry_point_name)

@@ -640,7 +641,7 @@ def _generate_inputs_pw():
structure = generate_structure()
inputs = {
'code': fixture_code('quantumespresso.pw'),
'structure': generate_structure(),
'structure': structure,
'kpoints': generate_kpoints_mesh(2),
'parameters': parameters,
'pseudos': {kind: generate_upf_data(kind) for kind in structure.get_kind_names()},
12 changes: 6 additions & 6 deletions tests/workflows/protocols/pw/test_bands/test_default.yml
@@ -21,11 +21,11 @@ bands:
conv_thr: 4.0e-10
diago_full_acc: true
diagonalization: paro
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
startingpot: file
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
@@ -62,10 +62,10 @@ relax:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
@@ -98,10 +98,10 @@ scf:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
4 changes: 2 additions & 2 deletions tests/workflows/protocols/pw/test_base/test_default.yml
@@ -19,10 +19,10 @@ pw:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
8 changes: 4 additions & 4 deletions tests/workflows/protocols/pw/test_relax/test_default.yml
@@ -23,10 +23,10 @@ base:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
@@ -56,10 +56,10 @@ base_final_scf:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
6 changes: 3 additions & 3 deletions tests/workflows/protocols/test_pdos/test_default.yml
@@ -33,7 +33,7 @@ nscf:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
ecutrho: 240.0
@@ -76,10 +76,10 @@ scf:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
4 changes: 2 additions & 2 deletions tests/workflows/protocols/xspectra/test_core/test_default.yml
@@ -38,10 +38,10 @@ scf:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
@@ -26,10 +26,10 @@ core:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
@@ -102,10 +102,10 @@ relax:
tstress: true
ELECTRONS:
conv_thr: 4.0e-10
electron_maxstep: 80
electron_maxstep: 50
mixing_beta: 0.4
SYSTEM:
degauss: 0.01
degauss: 0.02
ecutrho: 240.0
ecutwfc: 30.0
nosym: false
14 changes: 12 additions & 2 deletions tests/workflows/pw/test_base.py
@@ -67,19 +67,29 @@ def test_handle_out_of_walltime_structure_changed(generate_workchain_pw, generat
assert result.status == 0


def test_handle_electronic_convergence_not_reached(generate_workchain_pw, fixture_localhost, generate_remote_data):
def test_handle_electronic_convergence_not_reached(
generate_workchain_pw, fixture_localhost, generate_remote_data, generate_structure,
generate_calc_job_node
):
"""Test `PwBaseWorkChain.handle_electronic_convergence_not_reached`."""
remote_data = generate_remote_data(computer=fixture_localhost, remote_path='/path/to/remote')

process = generate_workchain_pw(
exit_code=PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED,
pw_outputs={'remote_folder': remote_data}
pw_outputs={'remote_folder': remote_data},
)
process.setup()

cj_node = generate_calc_job_node('quantumespresso.pw', fixture_localhost, 'test')

print(cj_node.outputs.output_parameters.get_dict())

process.ctx.structure = generate_structure('2D-xy-arsenic')

process.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = 0.5

result = process.handle_electronic_convergence_not_reached(process.ctx.children[-1])

assert isinstance(result, ProcessHandlerReport)
assert process.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] == \
process.defaults.delta_factor_mixing_beta * 0.5