
[BUG]: ModuleNotFoundError: No module named 'colossalai.context.parallel_mode' #4980

Closed
vetmax7 opened this issue Oct 26, 2023 · 1 comment
Labels
bug Something isn't working

@vetmax7
vetmax7 commented Oct 26, 2023

🐛 Describe the bug

Hello!

I tried to run train.sh for FastFold (https://github.com/hpcaitech/FastFold), but I got the errors below. Could you please help me?

colossalai 0.3.3 pypi_0 pypi

/opt/conda/envs/pytorch/lib/python3.8/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:21: UserWarning: FlashAttention only supports Ampere GPUs or newer.
  warnings.warn("FlashAttention only supports Ampere GPUs or newer.")
/opt/conda/envs/pytorch/lib/python3.8/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    from fastfold.utils.inject_fastnn import inject_fastnn
  File "/FastFold/fastfold/utils/inject_fastnn.py", line 17, in <module>
    from fastfold.model.fastnn import EvoformerStack, ExtraMSAStack
  File "/FastFold/fastfold/model/fastnn/__init__.py", line 1, in <module>
    from .msa import MSACore, ExtraMSACore, ExtraMSABlock, ExtraMSAStack
  File "/FastFold/fastfold/model/fastnn/msa.py", line 21, in <module>
    from colossalai.context.parallel_mode import ParallelMode
ModuleNotFoundError: No module named 'colossalai.context.parallel_mode'


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 144182) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-10-26_16:44:54
host : volta01.hpc.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 144182)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

@vetmax7 vetmax7 added the bug Something isn't working label Oct 26, 2023
@Fridge003
Contributor

Fridge003 commented Oct 27, 2023

Hi, colossalai.context.parallel_mode has been deprecated and moved to legacy in the latest versions of ColossalAI (it can now be imported as colossalai.legacy.context.parallel_mode). Downgrading ColossalAI to an older version (below 0.3.0) should solve this issue.
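
If downgrading is not an option, a try/except import shim in fastfold/model/fastnn/msa.py is another possible workaround. This is only a minimal sketch, assuming the legacy module in newer ColossalAI releases still exports the same ParallelMode symbol that FastFold expects:

# Minimal compatibility shim (sketch, not an official FastFold patch),
# assuming colossalai.legacy.context.parallel_mode provides the same
# ParallelMode enum used by fastfold/model/fastnn/msa.py.
try:
    # Older ColossalAI releases (< 0.3.0) keep the module here.
    from colossalai.context.parallel_mode import ParallelMode
except ModuleNotFoundError:
    # Newer releases moved the deprecated module under `legacy`.
    from colossalai.legacy.context.parallel_mode import ParallelMode

Alternatively, pinning an older release (for example, pip install "colossalai<0.3.0") follows the downgrade suggestion above.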
