
[BUG] AttributeError: deepspeed.comm has no attribute ProcessGroup #5421

Closed
ajindal1 opened this issue Apr 16, 2024 · 3 comments
Assignees: loadams
Labels: bug (Something isn't working), training

Comments

@ajindal1
Contributor

Describe the bug
Installing the latest DeepSpeed throws an error. We previously had 0.13.1 and it worked fine, but installing any version from 0.13.5 onward fails with the same error.

To Reproduce
Steps to reproduce the behavior:

  1. Use this docker image: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest
  2. Start the container: docker run -it --gpus all --ipc host mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest bash
  3. Uninstall existing deepspeed: pip uninstall deepspeed -y
  4. Install latest deepspeed: pip install deepspeed

Expected behavior
Successful installation of deepspeed

ds_report output
From version 0.13.4:

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 885.87 GB

Screenshots
Error details:

Collecting deepspeed==0.13.5
  Downloading deepspeed-0.13.5.tar.gz (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 14.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [95 lines of output]
      [2024-04-16 16:57:19,574] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/setup.py", line 37, in <module>
          from op_builder import get_default_compute_capabilities, OpBuilder
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/op_builder/__init__.py", line 18, in <module>
          import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/__init__.py", line 25, in <module>
          from . import ops
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/__init__.py", line 6, in <module>
          from . import adam
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/__init__.py", line 6, in <module>
          from .cpu_adam import DeepSpeedCPUAdam
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
          from deepspeed.utils import logger
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/__init__.py", line 10, in <module>
          from .groups import *
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/groups.py", line 28, in <module>
          from deepspeed import comm as dist
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/__init__.py", line 7, in <module>
          from .comm import *
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/comm.py", line 31, in <module>
          from deepspeed.comm.ccl import CCLBackend
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/ccl.py", line 12, in <module>
          from .torch import TorchBackend
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 100, in <module>
          class TorchBackend(Backend):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 125, in TorchBackend
          def get_all_gather_function(self):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/compiler.py", line 21, in disable
          return torch.compiler.disable(func)
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/compiler/__init__.py", line 93, in disable
          import torch._dynamo
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
          from . import allowed_functions, convert_frame, eval_frame, resume_execution
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 45, in <module>
          from .eval_frame import always_optimize_code_objects, skip_code, TorchPatcher
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 69, in <module>
          from . import config, convert_frame, external_utils, skipfiles, utils
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/skipfiles.py", line 39, in <module>
          from .variables.functions import (
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py", line 26, in <module>
          from .higher_order_ops import TorchHigherOrderOperatorVariable
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py", line 11, in <module>
          import torch.onnx.operators
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/__init__.py", line 59, in <module>
          from ._internal.onnxruntime import (
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py", line 35, in <module>
          import onnxruntime  # type: ignore[import]
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/__init__.py", line 54, in <module>
          from onnxruntime.capi import onnxruntime_validation
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 145, in <module>
          has_ortmodule, package_name, version, cuda_version = validate_build_package_info()
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 140, in validate_build_package_info
          raise import_ortmodule_exception
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 70, in validate_build_package_info
          from onnxruntime.training.ortmodule import ORTModule  # noqa: F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/__init__.py", line 26, in <module>
          from .ortmodule import ORTModule  # noqa: F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/__init__.py", line 132, in <module>
          from .ortmodule import ORTModule  # noqa: E402, F401
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 8, in <module>
          from ._torch_module_factory import TorchModuleFactory
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_factory.py", line 8, in <module>
          from ._torch_module_ort import TorchModuleORT
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_ort.py", line 13, in <module>
          from ._graph_execution_manager_factory import GraphExecutionManagerFactory
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager_factory.py", line 10, in <module>
          from ._inference_manager import InferenceManager
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_inference_manager.py", line 17, in <module>
          from ._graph_execution_manager import GraphExecutionManager, _RunStateInfo
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 23, in <module>
          from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/__init__.py", line 19, in <module>
          from ._zero_offload_subscriber import ZeROOffloadSubscriber, configure_ort_compatible_zero_stage3
        File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/_zero_offload_subscriber.py", line 141, in <module>
          from deepspeed.runtime.zero.parameter_offload import *  # noqa: F403
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/__init__.py", line 6, in <module>
          from .partition_parameters import ZeroParamType
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/partition_parameters.py", line 22, in <module>
          from .linear import zero3_linear_wrap
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/linear.py", line 25, in <module>
          from deepspeed.runtime.utils import noop_decorator
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/utils.py", line 12, in <module>
          from deepspeed.moe.utils import is_moe_param
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/utils.py", line 12, in <module>
          from .layer import MoE
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/layer.py", line 14, in <module>
          from .sharded_moe import MOELayer, TopKGate
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 95, in <module>
          class _AllToAll(torch.autograd.Function):
        File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 98, in _AllToAll
          def forward(ctx: Any, group: dist.ProcessGroup, input: Tensor) -> Tensor:  # type: ignore
      AttributeError: partially initialized module 'deepspeed.comm' has no attribute 'ProcessGroup' (most likely due to a circular import)
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
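
For context, the final `AttributeError: partially initialized module ... (most likely due to a circular import)` is Python's standard symptom when a package's `__init__.py` imports a submodule that reaches back into the package before `__init__.py` has finished executing. A minimal, self-contained sketch (the package name `pkg` and attribute `ProcessGroup` are illustrative, not DeepSpeed's actual layout) reproduces the same message:

```python
# Reproduce a "partially initialized module" AttributeError from a circular
# import, analogous to deepspeed.comm's submodules importing deepspeed.comm
# back before it has finished defining its attributes.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "pkg"))
    # pkg/__init__.py imports pkg.comm first, and only afterwards defines
    # ProcessGroup -- so pkg.comm sees a half-initialized pkg.
    with open(os.path.join(d, "pkg", "__init__.py"), "w") as f:
        f.write("from . import comm\nProcessGroup = object\n")
    # pkg/comm.py imports the parent package and reads the not-yet-defined
    # attribute, closing the circular-import loop.
    with open(os.path.join(d, "pkg", "comm.py"), "w") as f:
        f.write("import pkg\ngroup = pkg.ProcessGroup\n")
    result = subprocess.run(
        [sys.executable, "-c", "import pkg"],
        cwd=d, capture_output=True, text=True,
    )
    # Last line of the traceback mirrors the error in this issue.
    print(result.stderr.strip().splitlines()[-1])
```

Running this prints an AttributeError naming the partially initialized module, matching the shape of the error above; the DeepSpeed case is the same pattern, triggered indirectly through the onnxruntime-training import chain visible in the traceback.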

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 8xA100
  • Interconnects (if applicable): N/A
  • Python version: 3.10 (reproducible with 3.8)
  • Any other relevant info about your setup

Docker context
Yes: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest

@ajindal1 added the bug (Something isn't working) and training labels Apr 16, 2024
@ajindal1
Contributor Author

It seems to be a conflict with onnxruntime_training: when I uninstalled onnxruntime_training and then installed deepspeed, it worked fine.
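
Before applying that workaround, it can help to confirm whether the conflicting package is actually installed in the environment. A small sketch, assuming the distribution name `onnxruntime-training` as discussed above:

```python
# Check whether a given distribution is installed, so the conflict described
# above (onnxruntime-training vs. deepspeed's setup.py) can be confirmed
# before uninstalling anything. Requires Python 3.8+.
import importlib.metadata


def has_dist(name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        importlib.metadata.version(name)
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


# Prints True or False depending on the environment.
print(has_dist("onnxruntime-training"))
```

If it prints `True`, `pip uninstall onnxruntime-training` before `pip install deepspeed` should avoid the circular import during metadata generation, per the comment above.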

@loadams loadams self-assigned this Apr 17, 2024
@ajindal1
Contributor Author

This is fixed in ORT-Training in this PR, so closing this issue now.

@loadams
Contributor

loadams commented Apr 19, 2024

Thanks @ajindal1!
