[Feature request] Walker restart #328

Open
jjgoings opened this issue Dec 13, 2024 · 2 comments

@jjgoings

Duplicate of/related to #214. I think this feature exists somewhere, but it does not appear to be implemented yet.

Is your feature request related to a problem? Please describe.
I want the ability to restart AFQMC simulations from a previously saved walker state. Right now, if a run is interrupted or needs to be extended, there’s no way to resume from where the walkers left off. This makes things inefficient because I either lose the work already done or have to restart the whole simulation from scratch.

Describe the solution you’d like
A way to save the walker state (positions, weights, overlaps, etc.) at a given point in the simulation and then load that state to resume later. Ideally, this should work seamlessly without resetting things like population control or other runtime parameters.

Describe alternatives you’ve considered
I tried manually saving walker data and restoring it using dicts, but it feels clunky and inefficient, and it's really more of a hack. Also, it’s not clear whether population control and other internal states are properly re-initialized when restarting this way. This is what I’ve been doing:

def restore_walkers(walkers, data):
    walkers.phia = data["phia"].copy()
    walkers.phib = data["phib"].copy()
    walkers.Ga = data["Ga"].copy()
    walkers.Gb = data["Gb"].copy()
    walkers.Ghalfa = data["Ghalfa"].copy()
    walkers.Ghalfb = data["Ghalfb"].copy()
    walkers.weight = data["weight"].copy()
    walkers.unscaled_weight = data["unscaled_weight"].copy()
    walkers.phase = data["phase"].copy()
    walkers.ovlp = data["ovlp"].copy()
    walkers.sgn_ovlp = data["sgn_ovlp"].copy()
    walkers.log_ovlp = data["log_ovlp"].copy()
    walkers.eloc = data["eloc"].copy()
    walkers.hybrid_energy = data["hybrid_energy"].copy()
    walkers.detR = data["detR"].tolist()
    walkers.detR_shift = data["detR_shift"].copy()
    walkers.log_detR = data["log_detR"].tolist()
    walkers.log_shift = data["log_shift"].copy()
    walkers.log_detR_shift = data["log_detR_shift"].tolist()

import numpy as np  # required by the np.array conversions below

def pack_walkers_data(walkers):
    return {
        "phia": walkers.phia,
        "phib": walkers.phib,
        "Ga": walkers.Ga,
        "Gb": walkers.Gb,
        "Ghalfa": walkers.Ghalfa,
        "Ghalfb": walkers.Ghalfb,
        "weight": walkers.weight,
        "unscaled_weight": walkers.unscaled_weight,
        "phase": walkers.phase,
        "ovlp": walkers.ovlp,
        "sgn_ovlp": walkers.sgn_ovlp,
        "log_ovlp": walkers.log_ovlp,
        "eloc": walkers.eloc,
        "hybrid_energy": walkers.hybrid_energy,
        "detR": np.array(walkers.detR, dtype=float),
        "detR_shift": walkers.detR_shift,
        "log_detR": np.array(walkers.log_detR, dtype=float),
        "log_shift": walkers.log_shift,
        "log_detR_shift": np.array(walkers.log_detR_shift, dtype=float),
    }

This works to some extent, in that the energy is mostly correct, but it’s not efficient and doesn’t handle everything cleanly. For example, the walker weights are not correct on restart, because I'm not sure which attributes to restore and which to let the system repopulate. Letting run handle population control after restoring walkers also feels like it’s reinitializing some things unnecessarily.

Additional context
This feature would make simulations much easier to manage, especially when running on systems with time limits or when simulations are interrupted. Being able to checkpoint and restart would save a lot of effort. Something like saving walkers to a binary file and having a dedicated method to reload and resume would be ideal.
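As a rough illustration of the "binary file plus dedicated reload method" idea, here is a minimal sketch using NumPy's `.npz` format. The function names `save_walker_checkpoint` and `load_walker_checkpoint` are hypothetical and not part of the library; the sketch assumes the walker state has already been packed into a dict of arrays, as in `pack_walkers_data` above.

```python
import os

import numpy as np


def save_walker_checkpoint(path, data):
    """Write a dict of NumPy arrays to a single .npz file (hypothetical helper)."""
    # Write to a temporary file first and rename afterwards, so a job that is
    # killed mid-write does not leave a truncated checkpoint at `path`.
    tmp_path = path + ".tmp.npz"
    np.savez(tmp_path, **data)
    os.replace(tmp_path, path)


def load_walker_checkpoint(path):
    """Read a checkpoint back into a plain dict of arrays (hypothetical helper)."""
    with np.load(path) as npz:
        # Indexing inside the `with` block forces each array to be read
        # before the underlying file is closed.
        return {key: npz[key] for key in npz.files}
```

With helpers like these, a restart could in principle be `restore_walkers(walkers, load_walker_checkpoint("walkers.npz"))`, although this alone does not address which internal states the driver re-initializes.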

@jjgoings jjgoings added the enhancement New feature or request label Dec 13, 2024
@fdmalone fdmalone removed their assignment Dec 13, 2024
@fdmalone
Collaborator

I do not actively work on this anymore :)

@jiangtong1000
Collaborator

Hi Joshua @jjgoings,

  1. First, if you saved the weights and the walkers' phi, the data should be enough to connect with the first job. However, recomputing the Green's function and overlap before executing the new AFQMC run step may be crucial.
  2. FYI, here is the previous implementation for restarting. The interfaces apparently need to be updated to make it work again.

def get_write_buffer(self, i):
    w = self.walkers[i]
    buff = numpy.concatenate([[w.weight], [w.phase], [w.ot], w.phi.ravel()])
    return buff

def set_walker_from_buffer(self, i, buff):
    w = self.walkers[i]
    w.weight = buff[0]
    w.phase = buff[1]
    w.ot = buff[2]
    w.phi = buff[3:].reshape(self.walkers[i].phi.shape)

def write_walkers(self, comm):
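On the first point above, recomputing the overlap and Green's function from the restored phi could look roughly like the sketch below. It assumes a single-determinant trial wavefunction and the convention G = phi (psi_T^dagger phi)^{-1} psi_T^dagger; the function name, array layout, and convention are assumptions for illustration and may not match what the library stores internally.

```python
import numpy as np


def recompute_overlap_and_greens(phi, psi_trial):
    """Rebuild overlaps and Green's functions for a batch of walkers.

    phi:       (nwalkers, nbasis, nelec) walker orbital matrices
    psi_trial: (nbasis, nelec) single-determinant trial orbitals
    Assumed convention: G = phi (psi_T^dagger phi)^{-1} psi_T^dagger.
    """
    # Overlap matrix O_w = psi_T^dagger phi_w for every walker w.
    ovlp_mat = np.einsum("pi,wpj->wij", psi_trial.conj(), phi)
    # slogdet is numerically safer than det; recombine sign and log-magnitude.
    sign, logabs = np.linalg.slogdet(ovlp_mat)
    ovlp = sign * np.exp(logabs)
    # G_w = phi_w O_w^{-1} psi_T^dagger
    inv = np.linalg.inv(ovlp_mat)
    green = np.einsum("wpi,wij,qj->wpq", phi, inv, psi_trial.conj())
    return ovlp, green
```

Under this convention each Green's function is idempotent and its trace equals the number of electrons in that spin sector, which gives a cheap sanity check after a restart.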

A long time ago, I tried using h5 with the MPI driver for writing and reading walkers, but I couldn't get it to work. I think it also required a properly installed HDF5 with MPI support, so I ended up creating a separate h5 file per rank.

I currently don't have the bandwidth, but I am happy to be involved in making this feature work.
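The one-file-per-rank approach can be sketched without parallel HDF5 by packing each rank's walkers into a flat buffer, similar in spirit to `get_write_buffer` above, and writing it with plain NumPy I/O. All names here are hypothetical. The sketch assumes real-valued weights and omits the overlap on the assumption that it is recomputed on restart. In an MPI job, `rank` would typically come from `comm.Get_rank()`.

```python
import numpy as np


def walkers_to_buffer(weight, phase, phi):
    """Pack weight, phase and phi of all local walkers into one 2D complex array."""
    nwalkers = phi.shape[0]
    columns = [
        np.asarray(weight).reshape(nwalkers, 1),
        np.asarray(phase).reshape(nwalkers, 1),
        phi.reshape(nwalkers, -1),  # flatten each walker's orbital matrix
    ]
    return np.concatenate(columns, axis=1).astype(np.complex128)


def buffer_to_walkers(buff, phi_shape):
    """Invert walkers_to_buffer; phi_shape is the per-walker shape of phi."""
    weight = buff[:, 0].real.copy()  # weights assumed real
    phase = buff[:, 1].copy()
    phi = buff[:, 2:].reshape((buff.shape[0],) + tuple(phi_shape)).copy()
    return weight, phase, phi


def rank_filename(prefix, rank):
    """Build a distinct checkpoint file name for each MPI rank."""
    return f"{prefix}_rank{rank:04d}.npy"


def write_rank_checkpoint(prefix, rank, weight, phase, phi):
    np.save(rank_filename(prefix, rank), walkers_to_buffer(weight, phase, phi))


def read_rank_checkpoint(prefix, rank, phi_shape):
    return buffer_to_walkers(np.load(rank_filename(prefix, rank)), phi_shape)
```

Because each rank touches only its own file, no MPI-enabled I/O library is needed. The trade-off is that a restart must use the same number of ranks, or redistribute the walkers after loading.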
