Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG: validate_data.py ModuleNotFoundError (finetune & tensorflow) #98

Open
CorentinWicht opened this issue Sep 6, 2024 · 12 comments
Open
Labels
bug Something isn't working

Comments

@CorentinWicht
Copy link

CorentinWicht commented Sep 6, 2024

Python Version

Python 3.10.12

Pip Freeze

absl-py==2.1.0
annotated-types==0.7.0
astunparse==1.6.3
attrs==24.2.0
beautifulsoup4==4.12.3
blis==0.7.11
bs4==0.0.2
catalogue==2.0.10
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.19.0
confection==0.1.5
cramjam==2.8.3
cymem==2.0.8
docstring_parser==0.16
fastparquet==2024.5.0
filelock==3.15.4
finetune==0.10.0
fire==0.6.0
flatbuffers==24.3.25
fsspec==2024.9.0
ftfy==6.2.3
gast==0.6.0
google-pasta==0.2.0
grpcio==1.66.1
h5py==3.11.0
huggingface-hub==0.24.6
idna==3.8
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
keras==3.5.0
langcodes==3.4.0
language_data==1.2.0
libclang==18.1.1
lxml==5.3.0
marisa-trie==1.2.0
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mistral_common==1.3.4
ml-dtypes==0.4.0
mpmath==1.3.0
murmurhash==1.0.10
namex==0.0.8
networkx==3.3
nltk==3.9.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
opt-einsum==3.3.0
optree==0.12.1
packaging==24.1
pandas==2.2.2
preshed==3.0.9
protobuf==4.25.4
psutil==5.7.0
pyarrow==17.0.0
pydantic==2.9.0
pydantic_core==2.23.2
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rich==13.8.0
rpds-py==0.20.0
safetensors==0.4.5
scikit-learn==1.5.1
scipy==1.14.1
sentencepiece==0.2.0
shellingham==1.5.4
simple-parsing==0.1.6
six==1.16.0
smart-open==7.0.4
soupsieve==2.6
spacy==3.7.6
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sympy==1.13.2
tabulate==0.8.10
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorflow==2.17.0
tensorflow-addons==0.16.1
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.37.1
termcolor==2.4.0
thinc==8.2.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.13.3
torch==2.2.0
tqdl==0.0.4
tqdm==4.66.5
transformers==4.25.1
triton==2.2.0
typeguard==4.3.0
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
Werkzeug==3.0.4
wrapt==1.16.0
xformers==0.0.24

Reproduction Steps

  1. Follow instructions from the README
  2. Run the validation script in a python virtual environment as python ./mistral-finetune/utils/validate_data.py --train_yaml ./mistral-finetune/example/7B.yaml

Expected Behavior

According to the README, it should return a "a summary of the data input and training parameters" such as:

Train States
 --------------------
{
   "expected": {
       "eta": "00:52:44",
       "data_tokens": 25169147,
       "train_tokens": 131072000,
       "epochs": "5.21",
       "max_steps": 500,
       "data_tokens_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "25169147.0"
       },
       "train_tokens_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "131072000.0"
       },
       "epochs_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "5.2"
       }
   },
}

Additional Context

The script returns the following error:

Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
    from finetune.args import TrainArgs
ModuleNotFoundError: No module named 'finetune'

When installing the latest 'finetune-0.10.0' release, it returns a second error also related to a missing package:

Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
    from finetune.args import TrainArgs
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/finetune/__init__.py", line 12, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

Suggested Solutions

When installing the second missing package 'tensorflow-2.17.0' the problem should be fixed though it returns a pip's depencendy conflict:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 1.26.4 which is incompatible.

Since finetune 0.10.0 requires numpy <1.24.0 while tensorflow-2.17.0 requires version numpy 1.26.4, I really don't see how I could make your script work.

Any idea?

Best,

C.

Follow up:

Command torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yamltorchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml seems to fail as well due to a missing package:

[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING]
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train

/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
[2024-09-06 16:02:22,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 422322) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------

And when trying to install 'train-0.0.5', I got another pip's dependency conflict with the same packages as above:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.2.5 requires numpy<2.0.0,>=1.19.0; python_version >= "3.9", but you have numpy 2.1.1 which is incompatible.
tensorflow 2.17.0 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.1.1 which is incompatible.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 2.1.1 which is incompatible.
@CorentinWicht CorentinWicht added the bug Something isn't working label Sep 6, 2024
@CorentinWicht
Copy link
Author

CorentinWicht commented Sep 11, 2024

I fixed part of my issue by running the command directly from within the mistral-finetune folder:

cd mistral-finetune
python -m utils.validate_data --train_yaml example/7B.yaml

Still, I am now getting another error:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 17, in <module>
    from finetune.data.dataset import parse_data_sources
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/dataset.py", line 10, in <module>
    import torch.distributed as dist
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 372, in <module>
    main(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 179, in main
    datasets, weights = parse_data_sources(pretrain_file, instruct_file)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/dataset.py", line 159, in parse_data_sources
    assert min(n_weights) > 0
ValueError: min() arg is an empty sequence

Downgrading to numpy-1.26.4 running pip install "numpy<2.0" fixed only some of the issues:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 372, in <module>
    main(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 179, in main
    datasets, weights = parse_data_sources(pretrain_file, instruct_file)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/dataset.py", line 159, in parse_data_sources
    assert min(n_weights) > 0
ValueError: min() arg is an empty sequence

Any idea?

Best,

C.

@CorentinWicht
Copy link
Author

Any help?

I have tried also with Python 13.11 and it fails similarly...

@NazimHAli
Copy link

NazimHAli commented Sep 23, 2024

I tried it in a new environment for python 3.10 and it worked. You have to run it as a module (python -m) like the example in the README instead of as a script:

cd $HOME/mistral-finetune
python -m utils.reformat_data $HOME/data/ultrachat_chunk_train.jsonl
python -m utils.reformat_data $HOME/data/ultrachat_chunk_eval.jsonl

FYI that package finetune you installed from https://pypi.org/project/finetune/ has nothing to do with this project. You should uninstall it (might be easier to create a new environment just for this project). When you run the commands as a module, python will execute mistral-finetune/finetune correctly.

@CorentinWicht
Copy link
Author

CorentinWicht commented Sep 24, 2024

I tried it in a new environment for python 3.10 and it worked. You have to run it as a module (python -m) like the example in the README instead of as a script:

cd $HOME/mistral-finetune
python -m utils.reformat_data $HOME/data/ultrachat_chunk_train.jsonl
python -m utils.reformat_data $HOME/data/ultrachat_chunk_eval.jsonl

FYI that package finetune you installed from https://pypi.org/project/finetune/ has nothing to do with this project. You should uninstall it (might be easier to create a new environment just for this project). When you run the commands as a module, python will execute mistral-finetune/finetune correctly.

Dear @NazimHAli, many thanks for your support.

As written above, I have fixed some of the issues by running it as a module instead of as a script (e.g., the missing finetune package).

I went through the whole process once more and realized that I forgot to modify the example\/7B.yaml file to contain the absolute paths to both ultrachat_chunk_eval.jsonl and ultrachat_chunk_train.jsonl files:
image

I could thus successfully complete the dataset verification section:
python -m utils.validate_data --train_yaml example/7B.yaml

Nevertheless, it fails at training when running:

torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml

I get:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 5, in <module>
    from torch.distributed.run import main
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
[2024-09-24 15:10:19,432] torch.distributed.run: [WARNING]
[2024-09-24 15:10:19,432] torch.distributed.run: [WARNING] *****************************************
[2024-09-24 15:10:19,432] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-24 15:10:19,432] torch.distributed.run: [WARNING] *****************************************

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 10, in <module>
    import torch.cuda
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1471, in <module>
    from .functional import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-09-24 15:10:21 (CET) - 0:00:01 - distributed - INFO - torch.cuda.device_count: 0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
[2024-09-24 15:10:24,437] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3654500) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3654501)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3654502)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3654503)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 3654504)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 3654505)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 3654506)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 3654507)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-24_15:10:24
  host      : master.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3654500)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Any idea?

Best,

C.

@NazimHAli
Copy link

Try uninstalling numpy and reinstall it with version 1. An easy way would be to install the requirements like this pip install -r requirements.txt "numpy<2".

It's possible a combination of the packages + local environment is causing it to install version 2, but not have the dependencies correctly defined.

@CorentinWicht
Copy link
Author

Try uninstalling numpy and reinstall it with version 1. An easy way would be to install the requirements like this pip install -r requirements.txt "numpy<2".

It's possible a combination of the packages + local environment is causing it to install version 2, but not have the dependencies correctly defined.

Many thanks for your support, that indeed fixed the "A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash
" message.

Unfortunately, when running torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml, I still get the following error:

[2024-09-25 06:53:30,824] torch.distributed.run: [WARNING]
[2024-09-25 06:53:30,824] torch.distributed.run: [WARNING] *****************************************
[2024-09-25 06:53:30,824] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-25 06:53:30,824] torch.distributed.run: [WARNING] *****************************************
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=8, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))


2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
fire.Fire(train)
  File "<frozen runpy>", line 88, in _run_code
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
  File "<frozen runpy>", line 88, in _run_code
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
      File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
fire.Fire(train)
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
2024-09-25 06:53:33 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    component_trace = _Fire(component, args, parsed_flag_args, context, name)Traceback (most recent call last):

Traceback (most recent call last):
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
    File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
  File "<frozen runpy>", line 88, in _run_code
      File "<frozen runpy>", line 88, in _run_code
 fire.Fire(train)   File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>

      File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
 component_trace = _Fire(component, args, parsed_flag_args, context, name)
  fire.Fire(train)   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire

        fire.Fire(train)
    File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
           File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
           fire.Fire(train)
      ^fire.Fire(train)^
^ component_trace = _Fire(component, args, parsed_flag_args, context, name)
^   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
fire.Fire(train)^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    ^ component_trace = _Fire(component, args, parsed_flag_args, context, name)^
^ component_trace = _Fire(component, args, parsed_flag_args, context, name)   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
^
 ^       ^  component_trace = _Fire(component, args, parsed_flag_args, context, name)    ^^
component_trace = _Fire(component, args, parsed_flag_args, context, name)^
 ^  ^ ^^    ^^        ^^   component_trace = _Fire(component, args, parsed_flag_args, context, name)  ^^
  ^^^     ^^      ^^      ^^ ^      ^ ^     ^ ^     ^ ^      ^^      ^^      ^ ^ ^    ^  ^ ^   ^   ^    ^  ^^    ^^ ^^ ^  ^^ ^ ^ ^ ^^ ^  ^ ^^^ ^ ^^^^^^ ^ ^^^^^ ^ ^^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire

^^^^^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^component, remaining_args = _CallAndUpdateTrace(^^^^^^^^
^^^^^^^component, remaining_args = _CallAndUpdateTrace(^^^^^^
^ ^^^^^^ ^^^^^^  ^^^^^^  ^^^^^^  ^^^^^^  ^^^^^^  ^^^^^^^   ^^^^^^  ^^^^^^  ^^^^^^^  ^^^^^^  ^ ^^^^
 ^  ^ ^^
^^ ^  ^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
^^  ^   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
^^^
^ ^^ ^ ^^ ^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
^ ^
^ ^ ^ ^^   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire

^  ^        File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
^   component, remaining_args = _CallAndUpdateTrace(^
^      ^ component, remaining_args = _CallAndUpdateTrace(

        File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
  component, remaining_args = _CallAndUpdateTrace(
^ component, remaining_args = _CallAndUpdateTrace(  ^^
  ^ ^component, remaining_args = _CallAndUpdateTrace(  ^ ^
    ^ ^    ^ ^^        ^ ^component, remaining_args = _CallAndUpdateTrace(    ^ ^^
     ^^      ^^      ^^      ^^      ^^      ^^ ^     ^ ^      ^^      ^ ^     ^ ^     ^
^
             File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
   File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
                                               ^     ^              ^component = fn(*varargs, **kwargs)   component = fn(*varargs, **kwargs)^ ^

^ ^^^    ^^^     ^ ^^ ^^  ^ ^ ^^ ^^^ ^ ^ ^^^ ^ ^ ^^^^ ^  ^^^^ ^ ^^^^^ ^  ^^^^^^^  ^^^^^^  ^^^^^^ ^ ^^^^^^^ ^ ^^^^^ ^^ ^^^^^ ^^ ^^^
^^^ ^^^^^^^^^ ^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
^^^^ ^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
^^
^^^^^^^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
^^^

^^^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
^^^^^^^^^    ^component = fn(*varargs, **kwargs)^
^
^^^^      File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
^^component = fn(*varargs, **kwargs)^^
 ^    ^ ^component = fn(*varargs, **kwargs)^^ ^
 ^          ^ ^component = fn(*varargs, **kwargs) component = fn(*varargs, **kwargs)

 ^
   ^    File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
 ^
     component = fn(*varargs, **kwargs)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
                 _train(args, exit_stack)
                 _train(args, exit_stack)^ ^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train

  ^ ^       File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
^         ^ ^   ^set_device() ^  ^   ^
 ^    ^  ^ ^set_device()  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
^ ^^ ^
^^^^ ^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
^ ^^^^^ ^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^
^^_train(args, exit_stack)^^^

^^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    ^^^logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train

^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
^
^  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
 ^      File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
     ^_train(args, exit_stack) logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")^


       File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
     _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
set_device() _train(args, exit_stack)

  _train(args, exit_stack)  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train

          File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
set_device()_train(args, exit_stack)

  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
                  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
set_device()  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
 logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") set_device()

     set_device()
    File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device

          File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
      set_device()   File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
 logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")

         File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
 logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
   logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
  logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
            logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                                                                         ~          ~        ~        ~ ~        ~       ~       ~       ~ ~       ^       ^  ~    ^  ~ ~        ^ ^ ~    ~ ~^ ~     ~^ ~     ~^ ~    ~^~ ~     ^~ ~    ^~ ~ ^   ^~  ^   ^~  ^    ^^ ^    ^^ ^     ^^ ^    ^^ ^^^ ~ ~ ^ ^^~~~ ^ ^^~~~ ^ ^^~~~ ^ ^^~~~ ^~^^~~~~^~^^~^~^~~^~^~~^~~
~^~~^~~~^~~^^^~  File "<frozen os>", line 678, in __getitem__
~^~~^^~^~~KeyError~^: ^^^^^~~'CUDA_VISIBLE_DEVICES'^^^^^~^~
^^^^^~^^^^^^^
^^^^^
^^^^  File "<frozen os>", line 678, in __getitem__
^^^^^^  File "<frozen os>", line 678, in __getitem__
^^^^^KeyError^^^^^KeyError: ^^^^^: ^'CUDA_VISIBLE_DEVICES'^
^^^'CUDA_VISIBLE_DEVICES'^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^
  File "<frozen os>", line 678, in __getitem__
^^
^^^^  File "<frozen os>", line 678, in __getitem__
KeyError
^  File "<frozen os>", line 678, in __getitem__
: ^KeyError'CUDA_VISIBLE_DEVICES'  File "<frozen os>", line 678, in __getitem__

: KeyError
'CUDA_VISIBLE_DEVICES':
'CUDA_VISIBLE_DEVICES'KeyError  File "<frozen os>", line 678, in __getitem__

: 'CUDA_VISIBLE_DEVICES'KeyError
: 'CUDA_VISIBLE_DEVICES'
[2024-09-25 06:53:35,830] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1755254) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1755255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1755256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1755257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1755258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1755259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1755260)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1755261)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-25_06:53:35
  host      : node03.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1755254)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

FYI I am trying to run the scripts on our University GPU Cluster.

@NazimHAli
Copy link

I don't have experience with this, so not sure how to debug because it could be specific to your cluster - you can try first getting it to run with a single GPU and go from there. This might be a better question in the torch repo as they would have more experience.

@CorentinWicht
Copy link
Author

I don't have experience with this, so not sure how to debug because it could be specific to your cluster - you can try first getting it to run with a single GPU and go from there. This might be a better question in the torch repo as they would have more experience.

Dear @NazimHAli,

Thanks for the suggestion, unfortunately it fails similarly:

args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=1, wandb=WandbArgs(project='Mistral-finetune', offline=False, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-10-01 13:33:54 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 8
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 78, in _train
    set_device()
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/distributed.py", line 30, in set_device
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 678, in __getitem__
KeyError: 'CUDA_VISIBLE_DEVICES'
[2024-10-01 13:33:57,332] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1779407) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-01_13:33:57
  host      : node03.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1779407)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I will thus open a thread on the torch repo and reference it here if anyone might encounter the same issue.

Best,

C.

@CorentinWicht
Copy link
Author

CorentinWicht commented Oct 2, 2024

@NazimHAli according to PyTorch developpers the issue is coming from your code and not from their package: pytorch/pytorch#137082.

So I could go a step further by setting up CUDA_VISIBLE_DEVICES manually (see this thread) :
CUDA_VISIBLE_DEVICES=1 torchrun --nproc-per-node 1 --master_port $RANDOM -m train example/7B.yaml

Though it still fails later on...

args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl', eval_instruct_data='/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='/cluster/flash/wichtco/ai-fine-tuning/mistral_models', run_dir='/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=False, checkpoint=True, world_size=1, wandb=WandbArgs(project='Mistral-finetune', offline=True, key='81eab917d15e15c70c653f96b000838fcbb6bad5', run_name=''), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-10-02 08:24:00 (CET) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 1
2024-10-02 08:24:00 (CET) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 1
2024-10-02 08:24:00 (CET) - 0:00:02 - distributed - INFO - local rank: 0
2024-10-02 08:24:00 (CET) - 0:00:02 - train - INFO - Going to init comms...
2024-10-02 08:24:00 (CET) - 0:00:02 - train - INFO - Run dir: /cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test
2024-10-02 08:24:01 (CET) - 0:00:02 - train - INFO - TrainArgs: {'batch_size': 1,
 'checkpoint': True,
 'ckpt_freq': 100,
 'data': {'data': '',
          'eval_instruct_data': '/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl',
          'instruct': {'dynamic_chunk_fn_call': True, 'shuffle': True},
          'instruct_data': '/cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl',
          'shuffle': False},
 'eval_freq': 100,
 'log_freq': 1,
 'lora': {'dropout': 0.0, 'enable': True, 'rank': 64, 'scaling': 2.0},
 'max_norm': 1.0,
 'max_steps': 300,
 'mlflow': {'experiment_name': None, 'tracking_uri': None},
 'model_id_or_path': '/cluster/flash/wichtco/ai-fine-tuning/mistral_models',
 'no_ckpt': False,
 'no_eval': False,
 'num_ckpt_keep': 3,
 'num_microbatches': 1,
 'optim': {'lr': 6e-05, 'pct_start': 0.05, 'weight_decay': 0.1},
 'run_dir': '/cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test',
 'save_adapters': True,
 'seed': 0,
 'seq_len': 32768,
 'wandb': {'key': '81eab917d15e15c70c653f96b000838fcbb6bad5',
           'offline': True,
           'project': 'Mistral-finetune',
           'run_name': ''},
 'world_size': 1}
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /cluster/raid/home/wichtco/.netrc
2024-10-02 08:24:11 (CET) - 0:00:12 - metrics_logger - INFO - initializing wandb
wandb: WARNING Changes to your `wandb` environment variables will be ignored because your `wandb` session has already started. For more information on how to modify your settings with `wandb.init()` arguments, please refer to https://wandb.me/wandb-init.
wandb: Tracking run with wandb version 0.18.1
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING Calling wandb.login() after wandb.init() has no effect.
2024-10-02 08:24:11 (CET) - 0:00:12 - utils - INFO - Closing: eval_logger
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test/wandb/offline-run-20241002_082411-8idg077y
wandb: Find logs at: /cluster/flash/wichtco/ai-fine-tuning/ultra_chat_test/wandb/offline-run-20241002_082411-8idg077y/logs
2024-10-02 08:24:14 (CET) - 0:00:15 - utils - INFO - Closed: eval_logger
2024-10-02 08:24:14 (CET) - 0:00:15 - utils - INFO - Closing: metrics_logger
2024-10-02 08:24:14 (CET) - 0:00:15 - utils - INFO - Closed: metrics_logger
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 328, in <module>
    fire.Fire(train)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 65, in train
    _train(args, exit_stack)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/train.py", line 171, in _train
    eval_batches = list(eval_data_loader)
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/data_loader.py", line 122, in build_data_loader
    batch: Batch = batch_list.create_batch()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/data_loader.py", line 77, in create_batch
    x_np: np.ndarray = self.flatten_to_numpy(self.x, dtype=np.int64)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/data_loader.py", line 74, in flatten_to_numpy
    return np.array([el for sublist in list_of_lists for el in sublist], dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (32768,) + inhomogeneous part.
[2024-10-02 08:24:18,791] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1785821) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-02_08:24:18
  host      : node03.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1785821)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Any idea what went wrong this time?

@NazimHAli
Copy link

Hey,

Sorry for the late reply, lost track of things. From this error, it's complaining about your dataset:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (32768,) + inhomogeneous part.

Can you create a public repo with reproducible code and sample data?

@CorentinWicht
Copy link
Author

CorentinWicht commented Oct 16, 2024

Hey,

Sorry for the late reply, lost track of things. From this error, it's complaining about your dataset:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (32768,) + inhomogeneous part.

Can you create a public repo with reproducible code and sample data?

@NazimHAli, no worries and thanks for the reply.

In fact, I have strictly followed your README and downloaded the Ultrachat_200k dataset from HuggingFace using Python:

# Packages
import pandas as pd

# Load the data into a Pandas Dataframe
df = pd.read_parquet('https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k/resolve/main/data/test_gen-00000-of-00001-3d4cd8309148a71f.parquet')

# Split into train and eval
df_train=df.sample(frac=0.95,random_state=200)
df_eval=df.drop(df_train.index)

# Save data to jsonl
df_train.to_json("./data/ultrachat_chunk_train.jsonl", orient="records", lines=True)
df_eval.to_json("./data/ultrachat_chunk_eval.jsonl", orient="records", lines=True)

Then, again as suggested in your README, I made use of the ./utils/reformat_data.py to correct the data:

python -m utils.reformat_data /cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl
python -m utils.reformat_data /cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_eval.jsonl

Maybe this last step corrupted the dataset ?

EDIT:

There seems to be something wrong in the Ultrachat_200k dataset from HuggingFace dataset because when I verify the training yaml to make sure the data is correctly formatted running python -m utils.validate_data --train_yaml example/7B.yaml I get the following error:

0it [00:00, ?it/s]Validating /cluster/flash/wichtco/ai-fine-tuning/data/ultrachat_chunk_train.jsonl ...
  0%|                                                                                         | 0/26889 [00:00<?, ?it/s]
0it [00:00, ?it/s]                                                                            | 0/26889 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 372, in <module>
    main(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/utils/validate_data.py", line 214, in main
    sample = build_instruct_sample(data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/flash/wichtco/ai-fine-tuning/mistral-finetune/finetune/data/tokenize.py", line 180, in build_instruct_sample
    validator.validate_messages(messages)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/mistral_common/protocol/instruct/validator.py", line 50, in validate_messages
    self._validate_message_list_structure(messages)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/mistral_common/protocol/instruct/validator.py", line 259, in _validate_message_list_structure
    self._validate_last_message(messages[-1])
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/mistral_common/protocol/instruct/validator.py", line 323, in _validate_last_message
    super()._validate_last_message(message)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.11/site-packages/mistral_common/protocol/instruct/validator.py", line 231, in _validate_last_message
    f"Expected last role Assistant for finetuning but got {last_message_role.value}"
                                                           ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'value'

Which is not what's to be expected as described in your README, namely:

The data in line 1412 of dataset /Users/johndoe/data/ultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user
The data in line 1413 of dataset /Users/johndoe/data/ultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user
The data in line 1414 of dataset /Users/johndoe/data/ultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user
The data in line 1415 of dataset /Users/johndoe/data/ultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user

@CorentinWicht
Copy link
Author

@NazimHAli any idea ? I have actually strictly followed your README as written above and cannot replicate your results, can you?

Best,

C.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants