[Feature request] Walker restart #328

jjgoings · 2024-12-13T17:01:55Z

Duplicate of/related to #214. I think this feature existed somewhere at one point, but it does not appear to be implemented in the current code.

Is your feature request related to a problem? Please describe.
I want the ability to restart AFQMC simulations from a previously saved walker state. Right now, if a run is interrupted or needs to be extended, there’s no way to resume from where the walkers left off. This makes things inefficient because I either lose the work already done or have to restart the whole simulation from scratch.

Describe the solution you’d like
A way to save the walker state (positions, weights, overlaps, etc.) at a given point in the simulation and then load that state to resume later. Ideally, this should work seamlessly without resetting things like population control or other runtime parameters.

Describe alternatives you’ve considered
I tried manually saving walker data and restoring it using dicts, but it feels clunky and inefficient, and it's really more of a hack. It’s also not clear whether population control and other internal states are properly re-initialized when restarting this way. This is what I’ve been doing:

import numpy as np


def restore_walkers(walkers, data):
    # Copy every saved array back onto the walkers object in place.
    walkers.phia = data["phia"].copy()
    walkers.phib = data["phib"].copy()
    walkers.Ga = data["Ga"].copy()
    walkers.Gb = data["Gb"].copy()
    walkers.Ghalfa = data["Ghalfa"].copy()
    walkers.Ghalfb = data["Ghalfb"].copy()
    walkers.weight = data["weight"].copy()
    walkers.unscaled_weight = data["unscaled_weight"].copy()
    walkers.phase = data["phase"].copy()
    walkers.ovlp = data["ovlp"].copy()
    walkers.sgn_ovlp = data["sgn_ovlp"].copy()
    walkers.log_ovlp = data["log_ovlp"].copy()
    walkers.eloc = data["eloc"].copy()
    walkers.hybrid_energy = data["hybrid_energy"].copy()
    walkers.detR = data["detR"].tolist()
    walkers.detR_shift = data["detR_shift"].copy()
    walkers.log_detR = data["log_detR"].tolist()
    walkers.log_shift = data["log_shift"].copy()
    walkers.log_detR_shift = data["log_detR_shift"].tolist()

def pack_walkers_data(walkers):
    return {
        "phia": walkers.phia,
        "phib": walkers.phib,
        "Ga": walkers.Ga,
        "Gb": walkers.Gb,
        "Ghalfa": walkers.Ghalfa,
        "Ghalfb": walkers.Ghalfb,
        "weight": walkers.weight,
        "unscaled_weight": walkers.unscaled_weight,
        "phase": walkers.phase,
        "ovlp": walkers.ovlp,
        "sgn_ovlp": walkers.sgn_ovlp,
        "log_ovlp": walkers.log_ovlp,
        "eloc": walkers.eloc,
        "hybrid_energy": walkers.hybrid_energy,
        "detR": np.array(walkers.detR, dtype=float),
        "detR_shift": walkers.detR_shift,
        "log_detR": np.array(walkers.log_detR, dtype=float),
        "log_shift": walkers.log_shift,
        "log_detR_shift": np.array(walkers.log_detR_shift, dtype=float),
    }

This works to some extent, in that the energy is mostly correct, but it’s not efficient and doesn’t handle everything cleanly. For example, the walker weights are not correct on restart, because I'm not sure which attributes to pass in and which to let the system repopulate. Letting run handle population control after restoring walkers also feels like it’s reinitializing some things unnecessarily.

Additional context
This feature would make simulations much easier to manage, especially when running on systems with time limits or when simulations are interrupted. Being able to checkpoint and restart would save a lot of effort. Something like saving walkers to a binary file and having a dedicated method to reload and resume would be ideal.
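As a rough illustration of the "binary file plus dedicated reload method" idea, here is a minimal sketch that writes a dict like the one returned by `pack_walkers_data` to a single NumPy `.npz` file. The names `save_checkpoint` and `load_checkpoint` and the extra `step` entry are hypothetical and not part of ipie; the sketch also assumes no walker attribute is itself called `step`.

```python
import numpy as np


def save_checkpoint(path, walker_data, step):
    """Write walker arrays plus the current step index to one .npz file.

    Hypothetical helper: `path` should end in ".npz" so that the same
    string can be passed to load_checkpoint.
    """
    arrays = {key: np.asarray(val) for key, val in walker_data.items()}
    np.savez(path, step=np.asarray(step), **arrays)


def load_checkpoint(path):
    """Return (walker_data, step) read back from a .npz checkpoint."""
    with np.load(path) as f:
        step = int(f["step"])
        walker_data = {key: f[key].copy() for key in f.files if key != "step"}
    return walker_data, step
```

The loaded dict could then be handed to something like `restore_walkers` above. This only covers the file round trip; it does not address which internal states (population control, weights) need to be rebuilt on restart.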


fdmalone · 2024-12-13T17:05:03Z

I do not actively work on this anymore :)

jiangtong1000 · 2024-12-13T18:32:41Z

hi Joshua @jjgoings

Firstly, if you saved the weights and the walkers' phi, the data should be able to connect with the first-round job. But recomputing the Green's function and overlap before executing the new AFQMC run step may be crucial.
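To make "recomputing the Green's function and overlap" concrete, here is a minimal NumPy sketch for one spin sector with a single-determinant trial. It is not ipie's API: the function name, argument names, and the transpose convention G = [phi (psi^dagger phi)^{-1} psi^dagger]^T are assumptions for illustration, and the library's own routines should be preferred where they exist.

```python
import numpy as np


def overlap_and_greens(psi_trial, phi):
    """Recompute overlap and one-body Green's function from a restored walker.

    psi_trial, phi: (n_basis, n_occ) arrays for one spin sector.
    Returns (sign, log_abs_det, greens), where sign * exp(log_abs_det)
    equals det(psi_trial^dagger phi).
    """
    ovlp_mat = psi_trial.conj().T @ phi  # (n_occ, n_occ) overlap matrix
    sign, log_abs_det = np.linalg.slogdet(ovlp_mat)
    # inv(ovlp_mat) @ psi_trial^dagger, computed with a linear solve
    half = np.linalg.solve(ovlp_mat, psi_trial.conj().T)  # (n_occ, n_basis)
    greens = (phi @ half).T  # (n_basis, n_basis)
    return sign, log_abs_det, greens
```

With this convention the trace of the Green's function equals the number of occupied orbitals, which gives a cheap sanity check after a restart.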
FYI, here is the previous implementation for restarting. Apparently the interfaces need to be updated to make it work again.

ipie/ipie/legacy/walkers/handler.py

Lines 474 to 486 in 2d76d46

    
    def get_write_buffer(self, i):
        w = self.walkers[i]
        buff = numpy.concatenate([[w.weight], [w.phase], [w.ot], w.phi.ravel()])
        return buff

    def set_walker_from_buffer(self, i, buff):
        w = self.walkers[i]
        w.weight = buff[0]
        w.phase = buff[1]
        w.ot = buff[2]
        w.phi = buff[3:].reshape(self.walkers[i].phi.shape)

    def write_walkers(self, comm):

A long time ago I tried using h5 with the MPI driver to write and read walkers, but I couldn't get it to work; I think it also requires a properly installed HDF5 with MPI support.
So I ended up creating a separate h5 file for each rank.
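The one-file-per-rank layout can be sketched roughly as follows. To keep the example free of HDF5 and MPI dependencies it swaps in NumPy `.npz` files and takes the rank as a plain integer (in an MPI run this would typically come from the communicator); the helper names and the file naming scheme are hypothetical.

```python
import os

import numpy as np


def rank_checkpoint_path(directory, rank, prefix="walkers"):
    """Build a per-rank file name, e.g. walkers_rank00003.npz."""
    return os.path.join(directory, f"{prefix}_rank{rank:05d}.npz")


def write_rank_checkpoint(directory, rank, walker_data):
    """Each rank writes only its own walkers, so no parallel I/O is needed."""
    os.makedirs(directory, exist_ok=True)
    path = rank_checkpoint_path(directory, rank)
    np.savez(path, **walker_data)
    return path


def read_rank_checkpoint(directory, rank):
    """Each rank reads back the file it wrote earlier."""
    with np.load(rank_checkpoint_path(directory, rank)) as f:
        return {key: f[key] for key in f.files}
```

A restart with this layout presumably has to use the same number of ranks as the original run, or else redistribute walkers after loading.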

I don't currently have the bandwidth, but I am happy to be involved in making this feature work.

jjgoings added the enhancement New feature or request label Dec 13, 2024

jjgoings assigned fdmalone and linusjoonho Dec 13, 2024

fdmalone removed their assignment Dec 13, 2024
