RelaxWorkChain should use restart_mode restart when max_wall #968

Open
AndresOrtegaGuerrero opened this issue Sep 3, 2023 · 15 comments · May be fixed by #1012
Comments

@AndresOrtegaGuerrero
Collaborator

The RelaxWorkChain should do a restart_mode = 'restart' when the max walltime is reached; instead it is starting the simulation from scratch.

@mbercx
Member

mbercx commented Sep 3, 2023

Thanks @AndresOrtegaGuerrero! Could you explain why you think the restart_mode should be set to 'restart'? I remember I proposed switching to the charge density in a comment on #722, but we weren't sure restarting from the charge density/wave functions was a good idea (which sort of depends on where QE does the soft kill). So we left the current approach:

def handle_out_of_walltime(self, calculation):
    """Handle `ERROR_OUT_OF_WALLTIME` exit code.

    In this case the calculation shut down neatly and we can simply restart. We consider two cases:

    1. If the structure is unchanged, we do a full restart.
    2. If the structure has changed during the calculation, we restart from scratch.
    """
    try:
        self.ctx.inputs.structure = calculation.outputs.output_structure
    except exceptions.NotExistent:
        self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
        self.report_error_handled(calculation, 'simply restart from the last calculation')
    else:
        self.set_restart_type(RestartType.FROM_SCRATCH)
        self.report_error_handled(calculation, 'out of walltime: structure changed so restarting from scratch')

    return ProcessHandlerReport(True)

Which does update the input structure for the next calculation, but restarts "from scratch", i.e. sets the following inputs:

if restart_type == RestartType.FROM_SCRATCH:
    self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'from_scratch'
    self.ctx.inputs.parameters['ELECTRONS'].pop('startingpot', None)
    self.ctx.inputs.parameters['ELECTRONS'].pop('startingwfc', None)
    self.ctx.inputs.pop('parent_folder', None)
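
For comparison, the RestartType.FULL branch sets roughly the following (quoting from memory here, so treat it as a sketch rather than the exact source):

if restart_type == RestartType.FULL:
    self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'restart'
    self.ctx.inputs.parameters['ELECTRONS']['startingpot'] = 'file'
    self.ctx.inputs.parameters['ELECTRONS']['startingwfc'] = 'file'
    self.ctx.inputs.parent_folder = parent_folder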

I suppose it might be more efficient to restart from the charge density/wave functions, but at the time I wasn't sure why the restart was set up this way, and hence felt it was better to sacrifice some efficiency instead of potentially breaking things or (even worse) producing incorrect results.

@AndresOrtegaGuerrero
Collaborator Author

AndresOrtegaGuerrero commented Sep 3, 2023

@mbercx So in case you are doing a long geometry optimization, and QE did a max_walltime save, then QE can do a proper restart and continue the geometry optimization or even the SCF. However, this only works if QE did the save correctly (so it is important to check in the output that this was indeed done). I have been doing optimizations with DFT+U (using the workchain) and I noticed that without a proper restart the simulations consume considerably more time and resources. Restarting from scratch can also turn a simulation that was close to converging into a whole new one that will take time to reach convergence again. If the system has more than 100 atoms, a proper restart is quite important, especially if the simulation was converging.

@mbercx
Member

mbercx commented Sep 4, 2023

Yeah, I can see how for larger structures it can make a significant difference.

However, this only works if QE did the save correctly (so it is important to check in the output that this was indeed done).

The error handler above only handles cases where Maximum CPU time exceeded was in the output:

'Maximum CPU time exceeded': 'ERROR_OUT_OF_WALLTIME',

So in principle this should indicate a clean shutdown. I'd be inclined to accept switching to a full restart here with restart_mode = 'restart'. Do you see any issues with this @pietrodelugas?

@AndresOrtegaGuerrero
Collaborator Author

@mbercx, it is important to guarantee that QE indeed did the clean shutdown; sometimes this shutdown doesn't take place (maybe because the time is not enough, or it is an expensive calculation like a hybrid).

@unkcpz
Member

unkcpz commented Jan 26, 2024

@AndresOrtegaGuerrero, do you think more discussion is required? I don't see a big problem with setting restart_mode = 'restart'.
The only concern I have is that if the calculation is just hard to converge, hitting the walltime will always make it restart and the workchain will never finish. Can we set a max_restart to avoid that case?

@AndresOrtegaGuerrero
Collaborator Author

I will try to use this restart only for the Relax workchain, and only when there is a MAX_WALLTIME.
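
Roughly, the change I have in mind would look something like this (just an untested sketch of the idea, not the final implementation):

def handle_out_of_walltime(self, calculation):
    """Handle `ERROR_OUT_OF_WALLTIME` by always performing a full restart.

    Sketch only: even if the structure changed we update it, but keep the remote folder and
    restart_mode = 'restart' so QE can reuse the wavefunctions/charge density it saved
    during the soft shutdown.
    """
    try:
        self.ctx.inputs.structure = calculation.outputs.output_structure
    except exceptions.NotExistent:
        pass
    self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
    self.report_error_handled(calculation, 'out of walltime: performing a full restart')
    return ProcessHandlerReport(True)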

@mbercx
Member

mbercx commented Jan 29, 2024

If QE did a clean shutdown, I'm also fine with doing a full restart. @AndresOrtegaGuerrero I can quickly open a PR for this if you're not already doing so?

Can we set a max_restart to avoid that case?

The BaseRestartWorkChain already has a limit on how many times the work chain will restart (max_iterations, defaults to 5):

https://github.com/aiidateam/aiida-core/blob/06ea130df8854f621e25853af6ac723c37397ed0/src/aiida/engine/processes/workchains/restart.py#L143-L148
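
For reference, limiting it further is just a matter of setting that input when building the work chain, e.g. (a sketch, assuming the quantumespresso.pw.base entry point):

from aiida import orm
from aiida.plugins import WorkflowFactory

PwBaseWorkChain = WorkflowFactory('quantumespresso.pw.base')

builder = PwBaseWorkChain.get_builder()
builder.max_iterations = orm.Int(3)  # allow at most 3 pw.x launches before the work chain gives up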

@sphuber
Contributor

sphuber commented Jan 29, 2024

@mbercx, it is important to guarantee that QE indeed did the clean shutdown; sometimes this shutdown doesn't take place (maybe because the time is not enough, or it is an expensive calculation like a hybrid).

I want to highlight this. It is possible, and I have seen it often in practice, that QE initializes a soft shutdown because the walltime was exceeded (and so it prints Maximum CPU time exceeded), but then it doesn't manage to write everything to disk in time and the scheduler still kills the job, leaving the restart files corrupt. So we cannot just check the message written by QE; we also have to verify that the scheduler didn't kill the job. By default I think we take 95% of the allocated walltime of the scheduler job and set that as the soft walltime for QE. If the scenario above happens a lot, we should reduce this percentage to give QE more time to complete the shutdown.
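
As a stopgap on the user side, one could also leave a larger margin manually. This assumes the plugin honours an explicitly provided CONTROL.max_seconds instead of applying its default 95% rule, which would need to be checked; the numbers below are purely illustrative:

from aiida import orm
from aiida.plugins import CalculationFactory

PwCalculation = CalculationFactory('quantumespresso.pw')
builder = PwCalculation.get_builder()

max_wallclock_seconds = 12 * 3600  # walltime requested from the scheduler
builder.metadata.options.max_wallclock_seconds = max_wallclock_seconds
builder.parameters = orm.Dict(dict={
    'CONTROL': {
        'calculation': 'relax',
        # give QE only ~85% of the scheduler walltime so it has time to finish writing restart files
        'max_seconds': int(max_wallclock_seconds * 0.85),
    },
    # ... SYSTEM and ELECTRONS namelists omitted from this sketch ...
})
# structure, pseudos, k-points and code are set as usual and omitted here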

@mbercx
Member

mbercx commented Jan 31, 2024

Thanks @sphuber, I see your point. But we already distinguish between ERROR_OUT_OF_WALLTIME (clean shutdown) and ERROR_OUT_OF_WALLTIME_INTERRUPTED, no?

def validate_premature_exit(self, logs):
    """Analyze problems that will cause a pre-mature termination of the calculation, controlled or not."""
    if 'ERROR_OUT_OF_WALLTIME' in logs['error'] and 'ERROR_OUTPUT_STDOUT_INCOMPLETE' in logs['error']:
        return self.exit_codes.ERROR_OUT_OF_WALLTIME_INTERRUPTED

I suppose it's safe to assume that if Maximum CPU time exceeded is in the stdout and JOB DONE is printed, the shutdown was a clean one?

@sphuber
Contributor

sphuber commented Jan 31, 2024

Thanks @sphuber, I see your point. But we already distinguish between ERROR_OUT_OF_WALLTIME (clean shutdown) and ERROR_OUT_OF_WALLTIME_INTERRUPTED, no?

def validate_premature_exit(self, logs):
    """Analyze problems that will cause a pre-mature termination of the calculation, controlled or not."""
    if 'ERROR_OUT_OF_WALLTIME' in logs['error'] and 'ERROR_OUTPUT_STDOUT_INCOMPLETE' in logs['error']:
        return self.exit_codes.ERROR_OUT_OF_WALLTIME_INTERRUPTED

I suppose it's safe to assume that if Maximum CPU time exceeded is in the stdout and JOB DONE is printed, the shutdown was a clean one?

That depends: if JOB DONE is only printed after QE has finished writing all state to disk, then yes. But I don't remember for sure if that is the case.

@AndresOrtegaGuerrero
Collaborator Author

It would be nice to include a check whether "JOB DONE" is printed; in case you have a big system and a hybrid functional, you can then guarantee you have the wavefunctions to do a restart (even in an SCF calculation).

@mbercx
Member

mbercx commented Jan 31, 2024

It would be nice to include a check whether "JOB DONE" is printed

This is already done, and in case it is missing ERROR_OUTPUT_STDOUT_INCOMPLETE is added to the error logs.

# First check whether the `JOB DONE` message was written, otherwise the job was interrupted
for line in data_lines:
    if 'JOB DONE' in line:
        break
else:
    if crash_file is None:
        logs.error.append('ERROR_OUTPUT_STDOUT_INCOMPLETE')

I now notice that this is not done in case there is a CRASH file. I suppose the idea is that the failure can then be retrieved from this file and a more specific exit code can be returned? @bastonero do you remember why this was added in 7f53c96?

But I don't remember for sure if that is the case.

I'm not 100% sure either, but it seems pretty reasonable, since the following line is printed before this:

Writing all to output data dir ./out/aiida.save/

So if the calculation would be interrupted during writing, I think the final line in the stdout should be this one and hence JOB DONE would not be printed.
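
So the heuristic we are relying on essentially boils down to something like this (a sketch, not the actual parser code):

def is_clean_soft_shutdown(stdout: str) -> bool:
    """Only trust the restart files if QE both reported hitting the soft walltime
    and still managed to print JOB DONE afterwards."""
    return 'Maximum CPU time exceeded' in stdout and 'JOB DONE' in stdout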

@bastonero
Collaborator

I now notice that this is not done in case there is a CRASH file. I suppose the idea is that the failure can then be retrieved from this file and a more specific exit code can be returned? @bastonero do you remember why this was added in 7f53c96?

Indeed, otherwise if there is a recoverable error (e.g. Cholesky), the stdout-incomplete error would be exposed, hence making it unrecoverable. Nevertheless, if the run is interrupted while writing, there will be no CRASH file, as it is written by QE only when it encounters an error.

I'm not 100% sure either, but it seems pretty reasonable, since the following line is printed before this:

Writing all to output data dir ./out/aiida.save/

So if the calculation would be interrupted during writing, I think the final line in the stdout should be this one and hence JOB DONE would not be printed.

Indeed, so it would be unrecoverable. Although, one might try to restart e.g. using the charge density only as a compromise, as it will probably be the first and fastest thing that QE dumps. But nevertheless, wouldn't it be worth it to try restarting either way? Or would bad things happen?

@mbercx
Member

mbercx commented Jan 31, 2024

Indeed, otherwise if there is a recoverable error (e.g. Cholesky), the stdout-incomplete error would be exposed, hence making it unrecoverable.

This of course depends on how the PwParser deals with the errors in the logs and which one it decides to return. This logic is now a bit convoluted and sometimes hard to follow, but I'd say we can perfectly add ERROR_OUTPUT_STDOUT_INCOMPLETE to the logs in all cases and simply choose to return a more specific exit code instead. But that's a matter to deal with when refactoring the parser.
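
Concretely, the refactored check could simply drop the crash_file condition and always record the error, i.e. roughly (a sketch of the alternative, not existing code):

# Always record the incomplete stdout and let the exit-code selection prefer a more
# specific error when one is known from the CRASH file.
for line in data_lines:
    if 'JOB DONE' in line:
        break
else:
    logs.error.append('ERROR_OUTPUT_STDOUT_INCOMPLETE')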

But nevertheless, wouldn't it be worth it to try restarting either way? Or would bad things happen?

I was also wondering about this. If a file is corrupted, will QE crash when trying to read it, or simply ignore the file?

@bastonero
Collaborator

That's true. I think in order to avoid issues we decided to opt for this solution, so that we were certain it wouldn't return ERROR_OUTPUT_STDOUT_INCOMPLETE. But that should be testable anyway.
