Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use with SLURM #41

Open
dillonmsandhu opened this issue Dec 22, 2022 · 0 comments
Open

How to use with SLURM #41

dillonmsandhu opened this issue Dec 22, 2022 · 0 comments

Comments

@dillonmsandhu
Copy link

dillonmsandhu commented Dec 22, 2022

Any guidance for using with SLURM? Certain actors are failing

When I run

srun -p compsci-gpu --gres=gpu:4 --cpus-per-gpu=5 --mem=24G --pty bash

Followed by:

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force

I get the following warning:

WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 135095644160 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

Followed by the task failing:

2022-12-22 10:38:02,577 WARNING worker.py:1072 -- The node with node id 67f743d808b7bd16d45063d18dadf1b5cbb39e7d has been marked dead because the detector has missed too many heartbeats from it.

E1222 10:38:02.612172 8087 8433 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=core.reanalyze_worker, class_name=BatchWorker_CPU, function_name=run, function_hash=}, task_id=d251967856448ceb88866c7d01000000, task_name=BatchWorker_CPU.run(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=88866c7d01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}

I am not sure how to parse the error, any advice? What #SBATCH headings do you recommend using in the providedtrain.sh? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant