Check if the simulator can run headless on a large GPU with many entities #43

Open
clement-moulin-frier opened this issue Mar 13, 2024 · 1 comment
clement-moulin-frier commented Mar 13, 2024

Description

(Better to do this after #42 and #39 are merged, which are higher priority and a bit linked.)

A while ago I tried to launch the simulator at large scale on Google Cloud (GCP). I had to fix a few things in Simulator.run to make it work, in particular the use of jax.lax.fori_loop and the neighbor rebuilding. After that it worked well (I managed to run a simulation with 30K entities, which was cool).
Before we lost access to GCP, I saved the code I had there locally on my laptop, with the aim of committing it here. I have just looked at it, but I can't see any relevant change in the diff (no major difference from what is currently in the repo). So it is quite possible that this already works with the current code.

@corentinlger Can you please try it on JZ? (see steps to reproduce below).

Steps to Reproduce

On JZ, make a script to launch a simulation with:

  • A large number of agents and objects (I succeeded with 30K entities on GCP, but you can try with fewer, e.g. 10K, if the GPU you get does not have enough memory)
  • Use these parameters: box_size=1000., neighbor_radius=10., use_fori_loop=True, num_steps_lax=1000, freq=-1
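
To get a feel for how heavy the neighbor lists will be with these parameters, here is a rough estimate of the mean number of neighbors per entity. It is only a sketch: it assumes a 2D square box with entities spread uniformly, and `expected_neighbors` is a hypothetical helper, not something from the repo.

```python
import jax.numpy as jnp


def expected_neighbors(n_entities, box_size, neighbor_radius):
    """Mean neighbor count per entity, assuming a uniform 2D distribution."""
    # Entities per unit area in a box_size x box_size box (assumption: 2D)
    density = n_entities / box_size**2
    # Area of the disc scanned around each entity
    disc_area = jnp.pi * neighbor_radius**2
    return density * disc_area


# Entity counts mentioned in this issue: 10K and 30K
n_entities = jnp.array([10_000.0, 30_000.0])
neighbors = expected_neighbors(n_entities, box_size=1000.0, neighbor_radius=10.0)
print(neighbors)
```

Under these assumptions the average stays around ten neighbors or fewer per entity, so the neighbor list should grow roughly linearly with the number of entities rather than quadratically.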

Execute Simulator.run (no server) in non-threaded mode with a single timestep, so that everything runs inside the fori_loop call and the number of timesteps is set by num_steps_lax. Use a large GPU (e.g. a V100 or A100) and make sure the simulation actually runs on it (e.g. by checking jax.devices()).
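
For reference, the pattern described above (check the device, then run all the steps inside one jax.lax.fori_loop call) can be sketched as follows. This does not use the simulator API: `pick_device`, `toy_step` and `run_steps` are hypothetical stand-ins, and the step function only drifts positions and wraps them in the box.

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax import lax


def pick_device():
    """Return the first GPU JAX can see, or the default device otherwise."""
    devices = jax.devices()
    gpus = [d for d in devices if d.platform in ("gpu", "cuda", "rocm")]
    return gpus[0] if gpus else devices[0]


def toy_step(i, positions, box_size=1000.0, dt=0.5):
    # Hypothetical stand-in for one simulation step:
    # constant drift, wrapped around a periodic box.
    velocity = jnp.ones_like(positions)
    return jnp.mod(positions + dt * velocity, box_size)


@partial(jax.jit, static_argnames=("num_steps_lax",))
def run_steps(positions, num_steps_lax):
    # All timesteps run inside a single compiled fori_loop call
    return lax.fori_loop(0, num_steps_lax, toy_step, positions)


device = pick_device()
print("Running on:", device.platform, device.device_kind)

# 10K entities in 2D, placed on the selected device
positions = jax.device_put(jnp.zeros((10_000, 2), dtype=jnp.float32), device)
final = run_steps(positions, num_steps_lax=1000)
final.block_until_ready()  # wait for the device before timing or reporting
```

If the printed platform is "cpu", JAX did not find the GPU and the run is not testing what this issue asks for.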

Please report here if there are issues. If it works, it would be interesting to know how many entities you managed to simulate, for how many timesteps, and with how much GPU memory.
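
One possible way to collect the memory numbers, as a sketch only: `memory_report` and `state_bytes` are hypothetical helpers, and I am assuming the device exposes `memory_stats()`, which returns nothing useful on some backends (e.g. CPU).

```python
import jax
import jax.numpy as jnp


def memory_report(device):
    """Return (bytes_in_use, peak_bytes_in_use), or None if unavailable."""
    stats_fn = getattr(device, "memory_stats", None)
    if stats_fn is None:
        return None
    try:
        stats = stats_fn()  # may be None, e.g. on CPU backends
    except Exception:
        return None
    if not stats:
        return None
    return stats.get("bytes_in_use"), stats.get("peak_bytes_in_use")


def state_bytes(array):
    """Size in bytes of one array of the simulation state."""
    return array.size * array.dtype.itemsize


device = jax.devices()[0]
# Example: 2D float32 positions for 30K entities
state = jnp.zeros((30_000, 2), dtype=jnp.float32)

print("Device:", device.device_kind)
print("Positions array:", state_bytes(state), "bytes")
print("Memory stats (in use, peak):", memory_report(device))
```

Note that by default JAX preallocates a large share of GPU memory, so nvidia-smi alone will overestimate what the simulation really needs.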

Thanks!

@corentinlger

Good idea indeed! I'll try that once the PRs you mentioned are merged!
