Describe the bug
Installing the latest DeepSpeed fails with the error shown below. We previously used 0.13.1, which worked fine, but installing any version from 0.13.5 onward fails with the same error.
To Reproduce
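Steps to reproduce the behavior:
1. Use this docker image: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest
2. Start the container: docker run -it --gpus all --ipc host mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest bash
3. Inside the container, install DeepSpeed with pip, for example pip install deepspeed==0.13.5 (the version that appears in the error details below).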
Expected behavior
Successful installation of deepspeed
ds_report output
Output with version 0.13.4 installed:
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 885.87 GB
Screenshots
Error details:
Collecting deepspeed==0.13.5
Downloading deepspeed-0.13.5.tar.gz (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 14.3 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [95 lines of output]
[2024-04-16 16:57:19,574] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/setup.py", line 37, in <module>
from op_builder import get_default_compute_capabilities, OpBuilder
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/op_builder/__init__.py", line 18, in <module>
import deepspeed.ops.op_builder # noqa: F401 # type: ignore
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/__init__.py", line 6, in <module>
from . import adam
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/__init__.py", line 6, in <module>
from .cpu_adam import DeepSpeedCPUAdam
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
from deepspeed.utils import logger
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/__init__.py", line 10, in <module>
from .groups import *
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/utils/groups.py", line 28, in <module>
from deepspeed import comm as dist
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/__init__.py", line 7, in <module>
from .comm import *
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/comm.py", line 31, in <module>
from deepspeed.comm.ccl import CCLBackend
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/ccl.py", line 12, in <module>
from .torch import TorchBackend
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 100, in <module>
class TorchBackend(Backend):
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/comm/torch.py", line 125, in TorchBackend
def get_all_gather_function(self):
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/compiler.py", line 21, in disable
return torch.compiler.disable(func)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/compiler/__init__.py", line 93, in disable
import torch._dynamo
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
from . import allowed_functions, convert_frame, eval_frame, resume_execution
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 45, in <module>
from .eval_frame import always_optimize_code_objects, skip_code, TorchPatcher
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 69, in <module>
from . import config, convert_frame, external_utils, skipfiles, utils
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/skipfiles.py", line 39, in <module>
from .variables.functions import (
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py", line 26, in <module>
from .higher_order_ops import TorchHigherOrderOperatorVariable
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py", line 11, in <module>
import torch.onnx.operators
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/__init__.py", line 59, in <module>
from ._internal.onnxruntime import (
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py", line 35, in <module>
import onnxruntime # type: ignore[import]
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/__init__.py", line 54, in <module>
from onnxruntime.capi import onnxruntime_validation
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 145, in <module>
has_ortmodule, package_name, version, cuda_version = validate_build_package_info()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 140, in validate_build_package_info
raise import_ortmodule_exception
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 70, in validate_build_package_info
from onnxruntime.training.ortmodule import ORTModule # noqa: F401
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/__init__.py", line 26, in <module>
from .ortmodule import ORTModule # noqa: F401
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/__init__.py", line 132, in <module>
from .ortmodule import ORTModule # noqa: E402, F401
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 8, in <module>
from ._torch_module_factory import TorchModuleFactory
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_factory.py", line 8, in <module>
from ._torch_module_ort import TorchModuleORT
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_torch_module_ort.py", line 13, in <module>
from ._graph_execution_manager_factory import GraphExecutionManagerFactory
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager_factory.py", line 10, in <module>
from ._inference_manager import InferenceManager
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_inference_manager.py", line 17, in <module>
from ._graph_execution_manager import GraphExecutionManager, _RunStateInfo
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 23, in <module>
from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/__init__.py", line 19, in <module>
from ._zero_offload_subscriber import ZeROOffloadSubscriber, configure_ort_compatible_zero_stage3
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/training/utils/hooks/_zero_offload_subscriber.py", line 141, in <module>
from deepspeed.runtime.zero.parameter_offload import * # noqa: F403
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/__init__.py", line 6, in <module>
from .partition_parameters import ZeroParamType
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/partition_parameters.py", line 22, in <module>
from .linear import zero3_linear_wrap
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/zero/linear.py", line 25, in <module>
from deepspeed.runtime.utils import noop_decorator
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/runtime/utils.py", line 12, in <module>
from deepspeed.moe.utils import is_moe_param
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/utils.py", line 12, in <module>
from .layer import MoE
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/layer.py", line 14, in <module>
from .sharded_moe import MOELayer, TopKGate
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 95, in <module>
class _AllToAll(torch.autograd.Function):
File "/tmp/pip-install-4_g7vose/deepspeed_c90334a39c1c4f0294e136d1e5fb80fb/deepspeed/moe/sharded_moe.py", line 98, in _AllToAll
def forward(ctx: Any, group: dist.ProcessGroup, input: Tensor) -> Tensor: # type: ignore
AttributeError: partially initialized module 'deepspeed.comm' has no attribute 'ProcessGroup' (most likely due to a circular import)
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
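Reading the traceback above from top to bottom, the failure looks like an import cycle that passes through a third-party package: setup.py imports deepspeed, deepspeed.comm starts loading, torch.compiler.disable pulls in torch._dynamo and torch.onnx, which import onnxruntime, and onnxruntime.training imports deepspeed.runtime.zero while deepspeed.comm is still only partially initialized. The sketch below reproduces the same shape of failure with three throwaway modules. The names toy_comm, toy_ort and toy_moe are hypothetical stand-ins and are not part of DeepSpeed, torch or onnxruntime; the real traceback fails on a function annotation, while this sketch uses a plain attribute access to keep the behaviour simple.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical stand-ins for the modules in the traceback (assumed names):
#   toy_comm ~ deepspeed.comm, toy_ort ~ onnxruntime, toy_moe ~ deepspeed.moe.sharded_moe
TOY_MODULES = {
    # Imports the "third-party" module before ProcessGroup is defined.
    "toy_comm": "import toy_ort\nclass ProcessGroup:\n    pass\n",
    # The third-party module imports back into the package being loaded.
    "toy_ort": "import toy_moe\n",
    # Receives the half-loaded toy_comm and touches an attribute it lacks so far.
    "toy_moe": "import toy_comm as dist\nGROUP_TYPE = dist.ProcessGroup\n",
}


def reproduce_cycle() -> Optional[str]:
    """Import toy_comm from a temp dir and return the AttributeError text, if any."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in TOY_MODULES.items():
            (Path(tmp) / f"{name}.py").write_text(body)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("toy_comm")
        except AttributeError as exc:
            return str(exc)
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for name in TOY_MODULES:
                sys.modules.pop(name, None)
    return None


if __name__ == "__main__":
    print(reproduce_cycle())
```

If this reading is right, the error depends on onnxruntime's training package being present in the image, which may explain why the same DeepSpeed version can install cleanly elsewhere.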
System info (please complete the following information):
Docker context
Yes: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch222:latest
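The System info section above is left empty. A small sketch like the following could be run inside the container to collect the versions of the packages that appear in the traceback. The distribution names passed in the example (in particular onnxruntime-training) are assumptions about how the packages are named in this image, and installed_versions is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to whatever pip list shows in the image.
    for name, version in installed_versions(
        ["torch", "deepspeed", "onnxruntime-training"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```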