
When I run `python run_sim.py`, the worker died or was killed by an unexpected system error #2

Open
robint-XNF opened this issue Nov 22, 2021 · 6 comments

Comments

@robint-XNF

When I run

    python run_sim.py --eval --tasks flingbot-normal-rect-eval.hdf5 --load flingbot.pth --num_processes 1 --gui

the error shows:

    ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    2021-11-22 15:10:23,194 WARNING worker.py:1228 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
    RayTask ID: ffffffffffffffff341cd030556402df7c59625701000000
    Worker ID: 4f72e151e496fac468e1c730556e291e00ec1cfb29882f51097186fd
    Node ID: d4a9eb590967aeb63fe838e2eca52cf666565bf009207c0ec4a730e6
    Worker IP address: 192.168.1.106
    Worker port: 41747
    Worker PID: 18687

I don't know why this issue occurs, could you please help me?

@robint-XNF

Also, there is no 'replay_buffer.hdf5' in the 'flingbot_eval_X' directory.

@Jeffery-Zhou

> When I run python run_sim.py --eval --tasks flingbot-normal-rect-eval.hdf5 --load flingbot.pth --num_processes 1 --gui the error shows: ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. […] I don't know why this issue occurs, could you please help me?

I met the same issue. I thought it was a problem with the Ray version, but after testing it turned out to be something else. Have you solved it yet?

@gtegner

gtegner commented Dec 13, 2021

Hey,
I got the same error, and looking through the Ray logs, it's because Ray can't find the GPU.
To fix this, there's a line in utils.setup_envs:

    envs = [ray.remote(SimEnv).options(
        num_gpus=torch.cuda.device_count()/num_processes,
        num_cpus=0.1).remote(
        replay_buffer_path=dataset,
        get_task_fn=lambda: ray.get(task_loader.get_next_task.remote()),
        **kwargs)
        for _ in range(num_processes)]

The problem is that torch is installed as a CPU-only build, which gives torch.cuda.device_count() == 0 and consequently num_gpus=0. Hardcoding this to 1 (or however many GPUs you're using) fixes the problem! See the sketch below.
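Roughly, the edit could look like this (a sketch only, not the repo's exact code; NUM_GPUS is a placeholder you'd set to your actual GPU count, everything else mirrors the snippet above):

    # Hypothetical edit to utils.setup_envs: pin the GPU count instead of
    # deriving it from torch.cuda.device_count(), which returns 0 on a
    # CPU-only torch install.
    NUM_GPUS = 1  # placeholder: set to the number of GPUs on your machine
    envs = [ray.remote(SimEnv).options(
        num_gpus=NUM_GPUS / num_processes,
        num_cpus=0.1).remote(
        replay_buffer_path=dataset,
        get_task_fn=lambda: ray.get(task_loader.get_next_task.remote()),
        **kwargs)
        for _ in range(num_processes)]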

@Barbany

Barbany commented May 12, 2022

Instead of hardcoding the number of GPUs:

I found out that the PyTorch installation and the CUDA libraries installed by the flingbot.yml file are not set up properly; notice that torch.cuda.is_available() is False. You can fix this by re-installing PyTorch with pip, whose wheels already bundle a compatible CUDA runtime. Then verify that torch.cuda.is_available() is True and that the device count is correct.
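For anyone following along, a quick sanity check after re-installing (plain torch calls, nothing project-specific):

    # Verify that torch can see the GPU; if device_count() is still 0,
    # the Ray actors in utils.setup_envs will again be created with num_gpus=0.
    import torch
    print(torch.cuda.is_available())   # expect: True
    print(torch.cuda.device_count())   # expect: the number of GPUs on this machine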

@scarlett-sun

My problem is that when running the evaluation command, the terminal stops at "Evaluating flingbot.pth: saving to flingbot_eval_X/replay_buffer.hdf5" once the animation finishes, nothing else happens, and replay_buffer.hdf5 never appears in the directory.

@zcswdt

zcswdt commented Aug 22, 2023

> Instead of hardcoding the number of GPUs: I found out that the PyTorch installation and the CUDA libraries installed by the flingbot.yml file are not set up properly […] re-install PyTorch using pip and verify that torch.cuda.is_available() is True and the device count is correct.

Have you successfully run the code from this repository?
