
[BUG] AttributeError: deepspeed.comm has no attribute ProcessGroup #5421

Closed
ajindal1 opened this issue Apr 16, 2024 · 3 comments
Assignees: loadams
Labels: bug (Something isn't working), training

Comments

@ajindal1
Contributor

Describe the bug
Installing the latest DeepSpeed throws an error. We previously had 0.13.1 and it worked fine, but installing any version from 0.13.5 onward fails with the same error.

To Reproduce
Steps to reproduce the behavior:

  1. Use this docker image: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest
  2. Start the container: docker run -it --gpus all --ipc host mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest bash
  3. Uninstall existing deepspeed: pip uninstall deepspeed -y
  4. Install latest deepspeed: pip install deepspeed

Expected behavior
Successful installation of deepspeed

ds_report output
From version 0.13.4:

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 885.87 GB

Screenshots
Error details:

Collecting deepspeed==0.13.5
  Downloading deepspeed-0.13.5.tar.gz (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 14.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [95 lines of output]
      [2024-04-16 16:57:19,574] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/setup.py", line 37, in <module>
          from op_builder import get_default_compute_capabilities, OpBuilder
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/op_builder/__init__.py", line 18, in <module>
          import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/__init__.py", line 25, in <module>
          from . import ops
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/__init__.py", line 6, in <module>
          from . import adam
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/__init__.py", line 6, in <module>
          from .cpu_adam import DeepSpeedCPUAdam
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
          from deepspeed.utils import logger
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/__init__.py", line 10, in <module>
          from .groups import *
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/groups.py", line 28, in <module>
          from deepspeed import comm as dist
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/__init__.py", line 7, in <module>
          from .comm import *
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/comm.py", line 31, in <module>
          from deepspeed.comm.ccl import CCLBackend
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/ccl.py", line 12, in <module>
          from .torch import TorchBackend
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 100, in <module>
          class TorchBackend(Backend):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 125, in TorchBackend
          def get_all_gather_function(self):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/compiler.py", line 21, in disable
          return torch.compiler.disable(func)
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/compiler/__init__.py", line 93, in disable
          import torch._dynamo
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
          from . import allowed_functions, convert_frame, eval_frame, resume_execution
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 45, in <module>
          from .eval_frame import always_optimize_code_objects, skip_code, TorchPatcher
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 69, in <module>
          from . import config, convert_frame, external_utils, skipfiles, utils
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/skipfiles.py", line 39, in <module>
          from .variables.functions import (
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py", line 26, in <module>
          from .higher_order_ops import TorchHigherOrderOperatorVariable
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py", line 11, in <module>
          import torch.onnx.operators
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/__init__.py", line 59, in <module>
          from ._internal.onnxruntime import (
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py", line 35, in <module>
          import onnxruntime  # type: ignore[import]
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/__init__.py", line 54, in <module>
          from onnxruntime.capi import onnxruntime_validation
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 145, in <module>
          has_ortmodule, package_name, version, cuda_version = validate_build_package_info()
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 140, in validate_build_package_info
          raise import_ortmodule_exception
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 70, in validate_build_package_info
          from onnxruntime.training.ortmodule import ORTModule  # noqa: F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/__init__.py", line 26, in <module>
          from .ortmodule import ORTModule  # noqa: F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/__init__.py", line 132, in <module>
          from .ortmodule import ORTModule  # noqa: E402, F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 8, in <module>
          from ._torch_module_factory import TorchModuleFactory
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_factory.py", line 8, in <module>
          from ._torch_module_ort import TorchModuleORT
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_ort.py", line 13, in <module>
          from ._graph_execution_manager_factory import GraphExecutionManagerFactory
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager_factory.py", line 10, in <module>
          from ._inference_manager import InferenceManager
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_inference_manager.py", line 17, in <module>
          from ._graph_execution_manager import GraphExecutionManager, _RunStateInfo
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 23, in <module>
          from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/__init__.py", line 19, in <module>
          from ._zero_offload_subscriber import ZeROOffloadSubscriber, configure_ort_compatible_zero_stage3
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/_zero_offload_subscriber.py", line 141, in <module>
          from deepspeed.runtime.zero.parameter_offload import *  # noqa: F403
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/__init__.py", line 6, in <module>
          from .partition_parameters import ZeroParamType
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/partition_parameters.py", line 22, in <module>
          from .linear import zero3_linear_wrap
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/linear.py", line 25, in <module>
          from deepspeed.runtime.utils import noop_decorator
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/utils.py", line 12, in <module>
          from deepspeed.moe.utils import is_moe_param
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/utils.py", line 12, in <module>
          from .layer import MoE
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/layer.py", line 14, in <module>
          from .sharded_moe import MOELayer, TopKGate
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 95, in <module>
          class _AllToAll(torch.autograd.Function):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 98, in _AllToAll
          def forward(ctx: Any, group: dist.ProcessGroup, input: Tensor) -> Tensor:  # type: ignore
      AttributeError: partially initialized module 'deepspeed.comm' has no attribute 'ProcessGroup' (most likely due to a circular import)
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
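
For context, the final `AttributeError: partially initialized module ... (most likely due to a circular import)` is Python's standard symptom when a package's `__init__.py` imports a submodule that reaches back into the package before `__init__.py` has finished executing. A minimal, self-contained sketch (the package name `pkg` and attribute `ProcessGroup` are illustrative, not DeepSpeed's actual layout) reproduces the same message:

```python
# Reproduce a "partially initialized module" AttributeError from a circular
# import, analogous to deepspeed.comm's submodules importing deepspeed.comm
# back before it has finished defining its attributes.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "pkg"))
    # pkg/__init__.py imports pkg.comm first, and only afterwards defines
    # ProcessGroup -- so pkg.comm sees a half-initialized pkg.
    with open(os.path.join(d, "pkg", "__init__.py"), "w") as f:
        f.write("from . import comm\nProcessGroup = object\n")
    # pkg/comm.py imports the parent package and reads the not-yet-defined
    # attribute, closing the circular-import loop.
    with open(os.path.join(d, "pkg", "comm.py"), "w") as f:
        f.write("import pkg\ngroup = pkg.ProcessGroup\n")
    result = subprocess.run(
        [sys.executable, "-c", "import pkg"],
        cwd=d, capture_output=True, text=True,
    )
    # Last line of the traceback mirrors the error in this issue.
    print(result.stderr.strip().splitlines()[-1])
```

Running this prints an AttributeError naming the partially initialized module, matching the shape of the error above; the DeepSpeed case is the same pattern, triggered indirectly through the onnxruntime-training import chain visible in the traceback.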

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 8xA100
  • Interconnects (if applicable): N/A
  • Python version: 3.10 (reproducible with 3.8)
  • Any other relevant info about your setup

Docker context
Yes: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest

@ajindal1 added the bug (Something isn't working) and training labels Apr 16, 2024
@ajindal1
Contributor Author

It seems to be a conflict with onnxruntime_training: when I uninstalled onnxruntime_training and then installed deepspeed, it worked fine.
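
Before applying that workaround, it can help to confirm whether the conflicting package is actually installed in the environment. A small sketch, assuming the distribution name `onnxruntime-training` as discussed above:

```python
# Check whether a given distribution is installed, so the conflict described
# above (onnxruntime-training vs. deepspeed's setup.py) can be confirmed
# before uninstalling anything. Requires Python 3.8+.
import importlib.metadata


def has_dist(name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        importlib.metadata.version(name)
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


# Prints True or False depending on the environment.
print(has_dist("onnxruntime-training"))
```

If it prints `True`, `pip uninstall onnxruntime-training` before `pip install deepspeed` should avoid the circular import during metadata generation, per the comment above.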

@loadams loadams self-assigned this Apr 17, 2024
@ajindal1
Contributor Author

This is fixed in ORT-Training in this PR, so closing this issue now.

@loadams
Contributor

loadams commented Apr 19, 2024

Thanks @ajindal1!
