How to handle asynchronous computation #245

Open
ye-luo opened this issue Jun 12, 2019 · 4 comments
@ye-luo (Collaborator) commented Jun 12, 2019

Here are my rough ideas for incorporating asynchronous computation while hiding the details at the lowest possible level. Since we have limited confidence in applying a tasking programming model to the whole code, the following scheme will hopefully achieve sufficient asynchronous behaviour and performance.

When we compute the trial wavefunction, the call sequence is
TrialWF->ratioGrad()
{
  TrialWF->WFC[0]->ratioGrad(iel) // determinant
  {
    SPO->evaluate(iel);
    getInvRow(psi_inv);
    dot(spo_v, psi_inv);
  }
  TrialWF->WFC[1]->ratioGrad(iel); // Jastrow
}

Instead, we separate ratioGrad into two parts: an asynchronous launching part and a waiting part.
TrialWF->ratioGrad()
{
  TrialWF->WFC[0]->ratioGradLaunchAsync(iel) // determinant
  {
    SPO->evaluateLaunchAsync(iel);
    getInvRowLaunchAsync(psi_inv);
  }
  TrialWF->WFC[1]->ratioGradLaunchAsync(iel); // Jastrow
  // finish launching async calls of all the WFCs
  TrialWF->WFC[0]->ratioGrad(iel) // determinant
  {
    SPO->evaluate(iel);  // wait for completion inside
    getInvRow(psi_inv);  // wait for completion inside
    dot(spo_v, psiM[iel]);
  }
  TrialWF->WFC[1]->ratioGrad(iel); // Jastrow
}

This is similar to what we have in the QMCPACK CUDA code, but I'm expanding it to allow working through the levels if necessary. CUDA or OpenMP offload can be hidden beneath. In the case of CUDA, the delayed update engine and the SPO can use different streams to maximize asynchronous concurrent execution, whereas the QMCPACK CUDA code relies on a single stream to enforce synchronization. The SPO can also be OpenMP offload, with the asynchronous control self-contained. If necessary, TrialWF->ratioGrad can also be split into ratioGradLaunchAsync and ratioGrad, which can be called by the driver.
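For the CUDA case, here is a minimal sketch of how the LaunchAsync/wait pair could keep the stream handling self-contained inside the SPO, assuming each SPO owns a private stream. The class, member, and kernel names (SPOSetCUDA, spo_kernel, ...) are hypothetical illustrations, not existing miniQMC code.

#include <cuda_runtime.h>
#include <cstddef>

class SPOSetCUDA
{
  cudaStream_t stream_;   // private stream, independent of the determinant's stream
  double* d_spo_v_;       // device buffer for the SPO row of the moved electron
  double* h_spo_v_;       // pinned host mirror
  std::size_t nspo_;

public:
  explicit SPOSetCUDA(std::size_t nspo) : nspo_(nspo)
  {
    cudaStreamCreate(&stream_);
    cudaMalloc((void**)&d_spo_v_, nspo_ * sizeof(double));
    cudaMallocHost((void**)&h_spo_v_, nspo_ * sizeof(double));
  }

  // ratioGradLaunchAsync path: enqueue work on this SPO's stream and return immediately.
  void evaluateLaunchAsync(int iel)
  {
    (void)iel; // the electron index would be passed to the real kernel
    // spo_kernel<<<grid, block, 0, stream_>>>(iel, d_spo_v_); // real kernel omitted here
    cudaMemcpyAsync(h_spo_v_, d_spo_v_, nspo_ * sizeof(double),
                    cudaMemcpyDeviceToHost, stream_);
  }

  // ratioGrad path: "wait for completion inside", as in the call sequence above.
  const double* evaluate(int /*iel*/)
  {
    cudaStreamSynchronize(stream_); // blocks only on this SPO's stream
    return h_spo_v_;
  }

  ~SPOSetCUDA()
  {
    cudaFree(d_spo_v_);
    cudaFreeHost(h_spo_v_);
    cudaStreamDestroy(stream_);
  }
};

With a per-object stream, the determinant's delayed-update engine can own its own stream and overlap with the SPO transfer, instead of everything serializing on one global stream.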

Any piece not needing async remains unchanged.

Pros: we explicitly control dependencies.
Cons: we explicitly control the waits instead of leaving them to the runtime.

@PDoakORNL (Collaborator) commented Jun 13, 2019

The QMCPACK CUDA code is dead. My CUDA code does not depend on a single stream for synchronization. It does depend on being able to construct transfers (most important) and evaluations (not very important) from more than one trialWF's worth of data.

The synchronization pattern is going to depend on the architecture, on whether you have bothered to provide a device specialization for a particular operation, and on which other operations have been specialized. SPOs, Determinants, and Jastrows should not have to know each other's implementations, nor should the logic for synchronizing between them be spread through the object hierarchy.

Looking to the future, I prefer this sort of thing: reading the input(s) should produce a parameters object for each QMC run.
This should include the wfc-to-spo mapping and the mapping of wfc's and spo's to computation devices.
The memory requirements for the wfc determinant and input buffers, as well as for the SPO storage, would also need to be in this structure.
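Something like the following plain-data structure could carry those mappings and sizes out of the input reader; the type and field names (QMCRunParameters, wfc_to_spo, ...) are illustrative only, not existing QMCPACK/miniQMC classes.

#include <cstddef>
#include <unordered_map>

enum class Device { CPU, CUDA, OMPTARGET };

struct QMCRunParameters
{
  std::unordered_map<int, int> wfc_to_spo;     // wfc id -> spo id
  std::unordered_map<int, Device> wfc_device;  // wfc id -> computation device
  std::unordered_map<int, Device> spo_device;  // spo id -> computation device
  std::size_t det_buffer_bytes = 0;            // wfc determinant + input buffer requirement
  std::size_t spo_storage_bytes = 0;           // SPO storage requirement
};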

// In the driver, anything with the semantics of std::async could be used.
driver<Async>::do_ratioGrad()
{
  // crowd is in scope

  det = crowd_wf[0].wfc[i];
  spo = crowd_wfc_spo_map[det.id];
  // crowd_calc_location(spo) returns a device tag for that spo
  std::future<int> fut_spo_eval =
      likeSTDAsync(crowd_calc_location(spo), launch_type,
                   multi_func<SPOType, DEVICE>.evaluate(spo, positions, iels, crowd_els, ions,
                                                        crowd_v, crowd_g, crowd_h));
  fut_spo_eval.get();
  std::future<std::vector<ValueType>> fut_ratio_grad =
      likeSTDAsync(crowd_calc_location(dets), launch_type,
                   multi_func<DetType, DEVICE>.ratioGrad(dets, crowd_v, crowd_g, iat));
}

driver<Sync>::do_ratioGrad()
{
  // crowd is in scope
  multi_func<SPOType, DEVICE>.evaluate(spo, positions, iels, crowd_els, ions, crowd_v, crowd_g, crowd_h);
  multi_func<DetType, DEVICE>.ratioGrad(crowd_v, crowd_g, iat);
}

template<class SPOType, Device DT = CPU>
class multi_func {
  // default implementation
  void evaluate(SPOType spo, auto positions, auto iels, auto crowd_els, auto ions,
                auto& crowd_v, auto& crowd_g, auto& crowd_h) {
    // or parallel block construct of your choice
    for (int i = 0; i < crowd_v.size(); ++i)
      spo.evaluate(positions[i], crowd_v[i], crowd_g[i], crowd_h[i]);
  }
};
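As a self-contained illustration of the device-specialization idea above (where the primary template keeps the portable per-walker loop and a partial specialization swaps in a batched offload path), something like the following could work; every name here (Device, ToySPO, the toy arithmetic) is hypothetical, not miniQMC code.

#include <cstdio>
#include <vector>

enum class Device { CPU, OMPTARGET };

struct ToySPO {
  void evaluate(double pos, double& v) const { v = 2.0 * pos; }  // stand-in for the real SPO evaluation
};

template <class SPOType, Device DT = Device::CPU>
struct multi_func {
  // default implementation: loop over the crowd on the host
  static void evaluate(const SPOType& spo, const std::vector<double>& positions,
                       std::vector<double>& crowd_v)
  {
    for (std::size_t i = 0; i < crowd_v.size(); ++i)
      spo.evaluate(positions[i], crowd_v[i]);
  }
};

template <class SPOType>
struct multi_func<SPOType, Device::OMPTARGET> {
  // device specialization: one batched offload region for the whole crowd
  static void evaluate(const SPOType& /*spo*/, const std::vector<double>& positions,
                       std::vector<double>& crowd_v)
  {
    double* v = crowd_v.data();
    const double* p = positions.data();
    const std::size_t n = crowd_v.size();
    #pragma omp target teams distribute parallel for map(to: p[0:n]) map(from: v[0:n])
    for (std::size_t i = 0; i < n; ++i)
      v[i] = 2.0 * p[i];  // stand-in for the device-side SPO evaluation
  }
};

int main()
{
  ToySPO spo;
  std::vector<double> pos{0.1, 0.2, 0.3}, v(3);
  multi_func<ToySPO>::evaluate(spo, pos, v);                     // host path
  multi_func<ToySPO, Device::OMPTARGET>::evaluate(spo, pos, v);  // offload path
  std::printf("%g %g %g\n", v[0], v[1], v[2]);
}

The choice of specialization then lives in one place (the multi_func instantiation) rather than being spread through the wavefunction-component hierarchy.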

@PDoakORNL (Collaborator) commented:

You actually don't need the driver specialization if you make likeSTDAsync default to a blocking synchronous evaluation.
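A minimal sketch of that idea, keeping std::async semantics: if no device-specific launcher applies, likeSTDAsync falls back to std::launch::deferred, so the work runs synchronously when the future is read. The names (likeSTDAsync, DeviceTag) follow the sketch above and are hypothetical.

#include <future>
#include <utility>

enum class DeviceTag { CPU, GPU };

// Fallback launcher: when nothing device-specific is registered for `where`,
// defer the work so that fut.get() evaluates it synchronously on the caller.
template <class F, class... Args>
auto likeSTDAsync(DeviceTag where, std::launch policy, F&& f, Args&&... args)
{
  if (where == DeviceTag::CPU)
    return std::async(std::launch::deferred, std::forward<F>(f), std::forward<Args>(args)...);
  // a real device path would enqueue on a stream/queue here instead
  return std::async(policy, std::forward<F>(f), std::forward<Args>(args)...);
}

int main()
{
  // blocking, synchronous default: the lambda runs inside get()
  auto fut = likeSTDAsync(DeviceTag::CPU, std::launch::async, [] { return 42; });
  return fut.get() == 42 ? 0 : 1;
}

With std::launch::deferred, fut.get() evaluates the callable on the calling thread, which reproduces the blocking behaviour of driver<Sync> without a second driver specialization.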

@lshulen commented Jun 17, 2019

If we pursue either of these, we need to be very careful about the trade-offs of moving to this model of programming. New programmers coming to the code will likely have to learn to reason about these constructs for the first time. Also, we will have to be absolutely sure that our unit tests and integration tests are robust enough to catch the sort of race conditions that may not occur for every evaluation order.

Are we totally convinced that the speed-up gained from this programming model is worth the other costs?

@prckent (Contributor) commented Jun 17, 2019

@lshulen Good point. Totally agree. MiniQMC may be a good place to experiment, but for QMCPACK, step 1 should involve the absolute minimum complexity and therefore minimum asynchronicity, possibly none. Only when that is working, and we see a clear and significant benefit to a more complex and capable implementation, should we move forward.
