"ChildProcessError: [Errno 10] No child processes" when doing optimize_parallel_gpu #59

Open
kyoungrok0517 opened this issue Oct 22, 2019 · 2 comments

@kyoungrok0517

I'm using pytorch-lightning and test_tube together. When I try to run a hyperparameter search with optimize_parallel_gpu, I get the error in the title: ChildProcessError: [Errno 10] No child processes

Code

def main_local(hparams, gpu_ids=None):
    # init module
    model = SparseNet(hparams)

    # most basic trainer, uses good defaults
    trainer = Trainer(
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=gpu_ids,
        distributed_backend=hparams.distributed_backend,
        nb_gpu_nodes=hparams.nodes,
        # optional
        fast_dev_run=hparams.fast_dev_run,
        use_amp=hparams.use_amp,
        amp_level=("O1" if hparams.use_amp else "O0"),
    )
    trainer.fit(model)

...
if __name__ == "__main__":
    ...
    parser = SparseNet.add_model_specific_args(parser)

    # HyperParameter search
    parser.opt_list(
        "--n", default=2000, type=int, tunable=True, options=[2000, 3000, 4000]
    )
    parser.opt_list(
        "--k", default=50, type=int, tunable=True, options=[100, 200, 300, 400]
    )
    parser.opt_list(
        "--batch_size",
        default=32,
        type=int,
        tunable=True,
        options=[32, 64, 128, 256, 512],
    )

    # parse params
    hparams = parser.parse_args()

    # LR for different batch_size
    if hparams.batch_size <= 128:
        hparams.learning_rate = 0.001
    else:
        hparams.learning_rate = 0.002

    # run trials of random search over the hyperparams
    if torch.cuda.is_available():
        hparams.optimize_parallel_gpu(
            main_local, max_nb_trials=20, gpu_ids=["0, 1"]
        )
    else:
        hparams.gpus = None
        hparams.distributed_backend = None
        hparams.optimize_parallel_cpu(main_local, nb_trials=20)

    # main_local(hparams) # this works

Console log

gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "sparse_trainer.py", line 29, in main_local
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
    error_queue = mp.SimpleQueue()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
    self._rlock = ctx.Lock()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "sparse_trainer.py", line 29, in main_local
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
    error_queue = mp.SimpleQueue()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
    self._rlock = ctx.Lock()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
^CTraceback (most recent call last):
  File "sparse_trainer.py", line 73, in <module>
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
    main_local, nb_trials=20, trials=hparams.trials(20), gpu_ids=["0, 1"]
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 361, in optimize_trials_parallel_gpu
Traceback (most recent call last):
    results = self.pool.map(optimize_parallel_gpu_private, self.trials)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    self.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt
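
For context (not part of the original report): optimize_parallel_gpu runs each trial inside a multiprocessing.Pool worker, and Pool workers are daemonic, so they may not start child processes of their own. With distributed_backend="ddp", trainer.fit() then calls mp.spawn from inside that worker, and that nesting is what the traceback above runs into. A minimal sketch of the restriction, independent of Lightning and test_tube:

import multiprocessing as mp

def child():
    pass

def trial(_):
    # Stand-in for what trainer.fit() does under ddp: start a new
    # process from inside the pool worker.
    p = mp.Process(target=child)
    p.start()  # AssertionError: daemonic processes are not allowed to have children
    p.join()

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        pool.map(trial, range(2))

Depending on the Python version and start method, this surfaces as the AssertionError above or, as in this report, as a ChildProcessError from the semaphore tracker, but both appear to come from creating processes inside a pool worker.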

@s-rog commented Nov 27, 2019

I think we're having the same underlying issue

Caught exception in worker thread daemonic processes are not allowed to have children
Traceback (most recent call last):
  File "/home/roger/libs/torch/test-tube/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "./training.py", line 40, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 103, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children

@s-rog commented Nov 27, 2019

@williamFalcon I got a fix in the works that makes pool spawn non-daemonic processes, yay or nay?
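
For reference, a minimal sketch of that kind of fix (illustrative only, not test_tube's actual code): a Pool whose workers report daemon=False, so they are allowed to start children such as Lightning's ddp processes. The simple class-attribute override below works on Python 3.6/3.7, which is what the tracebacks here show; newer Python versions changed the Pool internals, so the same idea has to go through a custom multiprocessing context instead.

import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # Report daemon=False and ignore attempts to set it, so Process.start()
    # inside a worker no longer rejects nested children.
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        pass

class NoDaemonPool(multiprocessing.pool.Pool):
    # Pool creates its workers via self.Process(...); substituting the
    # non-daemonic class makes every worker able to spawn children.
    Process = NoDaemonProcess

test_tube would then need to build its trial pool from something like NoDaemonPool instead of multiprocessing.Pool, presumably where it constructs self.pool in argparse_hopt.py.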

Edit:
welp, after fixing this, I still run into the same issue as Lightning-AI/pytorch-lightning#485

Edit:
got around it by setting the SLURM id in os.environ, the debugging continues...
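
The comment above doesn't say which SLURM variable was set, so the snippet below is purely illustrative and the variable name is a guess rather than something stated in this thread. The idea would be to export the value via os.environ before the trials start, e.g.:

import os

# Hypothetical workaround sketch: the exact SLURM variable used above is not
# stated; SLURM_JOB_ID is only an example of the kind of value one might set.
os.environ["SLURM_JOB_ID"] = "0"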
