
Optimizing for GPU parallelized environments that return batched torch cuda tensors #13

StoneT2000 opened this issue Oct 31, 2024 · 0 comments

StoneT2000 commented Oct 31, 2024

I'm currently modifying my PPO code for ManiSkill GPU sim (mostly based on CleanRL as well), which I've adapted to do everything on the GPU. I'm trying to squeeze out as much performance as possible and am reviewing the torch compile PPO continuous code right now.

A few questions:

  1. I can probably remove all of the .to(device) calls on the tensordicts, right (see the sketch after this list)? e.g.
    obs = next_obs = next_obs.to(device, non_blocking=True)
    done = next_done.to(device, non_blocking=True)
    container = torch.stack(ts, 0).to(device)

    Also, is the original non_blocking meant to ensure we don't eagerly move the data until we actually need it (e.g. at the next inference step in the rollout)?

  2. How expensive are .eval() and .train() calls, and why should they be avoided? I thought they were simple switches.
  3. Are there any environment-side optimizations that could make RL faster? I'm aware that some operations can be made non-blocking; I wonder whether the same can be done for environment observations and rewards. Are there other tricks?
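For reference, here is roughly what I have in mind for (1): a minimal sketch of a rollout step assuming the vectorized env already returns batched CUDA tensors, so nothing needs a host-to-device copy. `envs`, `policy`, and `num_steps` are just placeholders for my own code, not the actual API.

    import torch
    from tensordict import TensorDict

    device = torch.device("cuda")

    # `envs` and `policy` stand in for my ManiSkill vectorized env and agent;
    # the GPU sim already hands back batched CUDA tensors, so nothing below
    # should need a host -> device copy.
    next_obs, _ = envs.reset()
    next_done = torch.zeros(envs.num_envs, device=device)

    ts = []
    for step in range(num_steps):
        # no .to(device, non_blocking=True) here, the data never leaves the GPU
        obs = next_obs
        done = next_done
        with torch.no_grad():
            action, logprob, value = policy(obs)
        next_obs, reward, terminations, truncations, info = envs.step(action)
        next_done = torch.logical_or(terminations, truncations).float()
        ts.append(TensorDict(
            dict(obs=obs, done=done, action=action, logprob=logprob,
                 value=value, reward=reward),
            batch_size=[envs.num_envs],
        ))

    # the trailing .to(device) should then be a no-op and could be dropped as well
    container = torch.stack(ts, 0)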

Thanks! Looking forward to trying to set some training speed records with these improvements!
