Errors with multiple async send and receive #308

Open
Tiiiger opened this issue May 17, 2021 · 0 comments

Tiiiger commented May 17, 2021

Hi Friends,

I am experimenting with Gloo's asynchronous isend and irecv for my work on pipeline parallelism. With torch==1.8.1 on macOS, when I issue multiple send requests and multiple receive requests, the process aborts with this error: libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer. With torch==1.7.1 on Linux, the same example hangs forever.

Here is a minimal example:

import socket  # needed for socket.gethostname() below

import torch
import torch.distributed as dist

def recv_prev(rank, tag):
    input_tensor = torch.empty(1)
    recv_handle = dist.irecv(tensor=input_tensor, src=rank-1, tag=tag)
    return input_tensor, recv_handle

def send_next(rank, output_tensor, tag):
    send_handle = dist.isend(tensor=output_tensor, dst=rank+1, tag=tag)
    return send_handle

def run(rank, size, hostname):
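    """
    Simulation of simple async communication: rank 0 sends, rank 1 receives.
    :param rank: rank of this process
    :param size: world size (unused here)
    :param hostname: host name of this machine (unused here)
    :return: None
    """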
    num_ops = 3

    for i in range(num_ops):
        if rank == 0:
            tensor = torch.ones(1) * i
            send_handle = send_next(rank, tensor, tag=i)
            print(f"RANK {rank} send {i}")

    for i in range(num_ops):
        if rank == 1:
            recv, recv_handle = recv_prev(rank, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {recv}")

    dist.barrier()


print("Start init...")
dist.init_process_group('gloo')
print("Init done!")
hostname = socket.gethostname()
run(dist.get_rank(), dist.get_world_size(), hostname)

I run this example with this command:
python -m torch.distributed.launch --nproc_per_node=2 minimal_example.py

With this minimal example, the errors I get are:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Start init...
Start init...
Init done!
Init done!
RANK 0 send 0
RANK 0 send 1
RANK 0 send 2
RANK 1 receive tensor([0.])
libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer
Killing subprocess 71105
Killing subprocess 71106
Traceback (most recent call last):
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/Users/tianyizhang/anaconda3/bin/python', '-u', 'simple_async_exp.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.

Could you help me understand this error message, and let me know if I am using the API incorrectly?

Thank you in advance!
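
For comparison, here is a sketch of a variant that keeps every tensor and request handle referenced until wait() returns. It rests on an unconfirmed hypothesis: that the failure is related to the send handles (and their tensors) in the example above being dropped before the transfers complete. This is not a confirmed diagnosis. The helper names (find_free_port, send_all, recv_all, worker, run_demo) are made up for illustration, and the sketch assumes a POSIX system where the "fork" start method is available and starts its own two processes on localhost instead of using torch.distributed.launch.

```python
import multiprocessing as mp
import os
import socket

import torch
import torch.distributed as dist


def find_free_port():
    # Ask the OS for an unused local port (hypothetical helper).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def send_all(dst, num_ops):
    # Post all sends first, holding a reference to each tensor and handle.
    pending = []
    for i in range(num_ops):
        tensor = torch.ones(1) * i
        handle = dist.isend(tensor=tensor, dst=dst, tag=i)
        pending.append((tensor, handle))
    # Only release the references after every send has completed.
    for _, handle in pending:
        handle.wait()


def recv_all(src, num_ops):
    # Post all receives, then wait on each one before reading its tensor.
    pending = []
    for i in range(num_ops):
        tensor = torch.empty(1)
        handle = dist.irecv(tensor=tensor, src=src, tag=i)
        pending.append((tensor, handle))
    values = []
    for tensor, handle in pending:
        handle.wait()
        values.append(tensor.item())
    return values


def worker(rank, world_size, port, num_ops, queue):
    # Rendezvous over localhost via the default env:// init method.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        send_all(dst=1, num_ops=num_ops)
    else:
        queue.put(recv_all(src=0, num_ops=num_ops))
    dist.barrier()
    dist.destroy_process_group()


def run_demo(num_ops=3):
    # Assumption: "fork" is available (Linux/macOS); not valid on Windows.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    port = find_free_port()
    procs = [
        ctx.Process(target=worker, args=(rank, 2, port, num_ops, queue))
        for rank in range(2)
    ]
    for p in procs:
        p.start()
    received = queue.get(timeout=60)  # values collected by rank 1
    for p in procs:
        p.join()
    return received
```

Calling run_demo() returns the list of values that rank 1 received. If this variant completes where the original example aborts or hangs, that would point toward handle and tensor lifetime rather than the use of tags.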

Tiiiger changed the title from "Understanding errors with async send and receive" to "Errors with multiple async send and receive" on May 17, 2021