Hi Friends,
I am experimenting with the Gloo async isend and irecv in my work on pipeline parallelism. With torch==1.8.1 on macOS, I get the error libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer whenever there are multiple outstanding send and receive requests. With torch==1.7.1 and on Linux, the same example hangs forever.
Here is a minimal example:
import socket

import torch
import torch.distributed as dist


def recv_prev(rank, tag):
    # Post an asynchronous receive from the previous rank.
    input_tensor = torch.empty(1)
    recv_handle = dist.irecv(tensor=input_tensor, src=rank - 1, tag=tag)
    return input_tensor, recv_handle


def send_next(rank, output_tensor, tag):
    # Post an asynchronous send to the next rank.
    send_handle = dist.isend(tensor=output_tensor, dst=rank + 1, tag=tag)
    return send_handle


def run(rank, size, hostname):
    """
    Simulation of simple async communication
    :param rank: global rank of this process
    :param size: world size
    :param hostname: hostname of this machine (unused here)
    :return:
    """
    num_ops = 3
    # Rank 0 posts all sends asynchronously and never waits on the handles.
    for i in range(num_ops):
        if rank == 0:
            tensor = torch.ones(1) * i
            send_handle = send_next(rank, tensor, tag=i)
            print(f"RANK {rank} send {i}")
    # Rank 1 posts each receive and waits for it before posting the next one.
    for i in range(num_ops):
        if rank == 1:
            recv, recv_handle = recv_prev(rank, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {recv}")
    dist.barrier()


print("Start init...")
dist.init_process_group('gloo')
print("Init done!")
hostname = socket.gethostname()
run(dist.get_rank(), dist.get_world_size(), hostname)
I run this example with this command: python -m torch.distributed.launch --nproc_per_node=2 minimal_example.py.
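For completeness, the same thing can be reproduced without the launcher. This is just a sketch, assuming the default env:// init method; the address, port, and the init_and_run name are placeholders I picked, and it reuses run from the example above:

# Sketch: spawn two local processes and initialize the gloo process group
# via the default env:// rendezvous, reusing `run` from the example above.
import os
import socket
import torch.distributed as dist
import torch.multiprocessing as mp

def init_and_run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    run(rank, world_size, socket.gethostname())

if __name__ == "__main__":
    mp.spawn(init_and_run, args=(2,), nprocs=2)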
With this minimal example, the errors I get are:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Start init...
Start init...
Init done!
Init done!
RANK 0 send 0
RANK 0 send 1
RANK 0 send 2
RANK 1 receive tensor([0.])
libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer
Killing subprocess 71105
Killing subprocess 71106
Traceback (most recent call last):
File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/Users/tianyizhang/anaconda3/bin/python', '-u', 'simple_async_exp.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
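For context, this is the pattern I assumed would be the intended usage: rank 0 also keeps the tensors and send handles alive and waits on the sends before the barrier. It is only a sketch of what I was trying to express, not a verified fix, and run_with_waits is my own name:

def run_with_waits(rank, size, hostname):
    num_ops = 3
    if rank == 0:
        tensors, handles = [], []
        for i in range(num_ops):
            # Keep each tensor alive until its send has completed.
            tensors.append(torch.ones(1) * i)
            handles.append(send_next(rank, tensors[-1], tag=i))
            print(f"RANK {rank} send {i}")
        for handle in handles:
            handle.wait()
    if rank == 1:
        for i in range(num_ops):
            recv, recv_handle = recv_prev(rank, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {recv}")
    dist.barrier()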
Can you help me understand the error message, and please let me know if I am using these APIs incorrectly?
Thank you in advance!