Describe the bug
Running the nanotron backend with batch_size = 0 causes lighteval to crash during batch size auto-detection.
(lighteval-main) hynek_kydlicek@ip-26-0-162-233:/fsx/hynek_kydlicek/projects/lighteval-main-branch$ torchrun --standalone --nnodes=1 --nproc-per-node=1 src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
WARNING:lighteval.logging.hierarchical_logger:main: (0, './nanotron/checkpoints/0/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/fsx/hynek_kydlicek/.cache/huggingface'), {
WARNING:lighteval.logging.hierarchical_logger: Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.013603]
WARNING:lighteval.logging.hierarchical_logger: WARNING: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING.
WARNING:lighteval.logging.hierarchical_logger: Test all gather {
WARNING:lighteval.logging.hierarchical_logger: Test gather tensor
WARNING:lighteval.logging.hierarchical_logger:[TEST] Running NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger:[TEST] NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.661526]
WARNING:lighteval.logging.hierarchical_logger: Model loading {
/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING:lighteval.models.nanotron_model:Building model
WARNING:lighteval.models.nanotron_model:Sanity checks on model
WARNING:lighteval.models.nanotron_model:Loading checkpoint from ./nanotron/checkpoints/0:
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 1288.92it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.361026]
WARNING:lighteval.logging.hierarchical_logger: Tasks loading {
WARNING:lighteval.logging.hierarchical_logger: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger: gsm8k main
WARNING:lighteval.logging.hierarchical_logger: Loading documents, and requests
Token indices sequence length is longer than the specified maximum sequence length for this model (985 > 256). Running this sequence through the model will result in indexing errors
WARNING:lighteval.logging.hierarchical_logger: } [0:00:01.286350]
WARNING:lighteval.logging.hierarchical_logger: Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger: setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.000133]
WARNING:lighteval.logging.hierarchical_logger: Evaluation {
WARNING:lighteval.logging.hierarchical_logger: Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger: Running RequestType.GREEDY_UNTIL requests
WARNING:lighteval.logging.hierarchical_logger: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring.
greedy -- Node 0: 0%| | 0/1 [00:00<?, ?it/s]WARNING:lighteval.models.nanotron_model:Detecting largest batch size
WARNING:lighteval.models.nanotron_model:Testing batch size 512
greedy -- Node 0: 0%| | 0/1 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.164193]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:02.496358]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 93, in <module>
[rank0]: cli_evaluate()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 63, in cli_evaluate
[rank0]: main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/main_nanotron.py", line 97, in main
[rank0]: pipeline.evaluate()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 235, in evaluate
[rank0]: sample_id_to_responses = self._run_model()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 264, in _run_model
[rank0]: responses = run_model(requests, override_bs=self.pipeline_parameters.override_batch_size)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 1149, in greedy_until
[rank0]: batch_size = self._get_batch_size(
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 320, in _get_batch_size
[rank0]: batch_size = forward_batch()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/utils/parallelism.py", line 104, in decorator
[rank0]: return function(batch_size, *args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 317, in forward_batch
[rank0]: F.log_softmax(self._model_call(test_batch).float(), dim=-1).cpu()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 342, in _model_call
[rank0]: return self.model(inputs)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: TypeError: LlamaModel.forward() missing 1 required positional argument: 'input_mask'
E0903 13:22:41.743000 140200056006464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1010958) of binary: /fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/python
Traceback (most recent call last):
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/lighteval/__main__.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-03_13:22:41
host : ip-26-0-162-233.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1010958)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
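Root cause, as read from the traceback above: with batch_size = 0 the run enters the batch size probe (_get_batch_size), whose forward pass goes through _model_call (nanotron_model.py:342). That method calls the model with only the input ids, while nanotron's LlamaModel.forward also requires input_mask as a positional argument. Below is a minimal sketch of a call that would satisfy that signature; the all-ones mask is my assumption, not a confirmed fix.

import torch
from torch import nn

def call_nanotron_model(model: nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    # lighteval's _model_call currently does `return self.model(inputs)`, which leaves
    # the second required positional argument unset and raises the TypeError above.
    # Supplying a mask that attends to every position is one plausible workaround.
    input_mask = torch.ones_like(input_ids, dtype=torch.bool)
    return model(input_ids, input_mask)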
To Reproduce
torchrun --standalone --nnodes=1 --nproc-per-node=1 src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
where batch_size is set to 0 in the lighteval config override (examples/nanotron/lighteval_config_override_template.yaml).
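For context on why 0 triggers the probe at all: an override of 0 behaves like "no override", so lighteval falls back to auto-detecting the largest batch size ("Detecting largest batch size" / "Testing batch size 512" in the log), and that probe is where the missing input_mask crash happens. A hedged sketch of the suspected control flow follows; whether the real check is truthiness or a > 0 comparison is an assumption, but the traceback confirms that 0 routes into auto-detection.

def resolve_batch_size(override_bs: int, auto_detect) -> int:
    # override_bs comes from pipeline_parameters.override_batch_size (see the traceback).
    # A value of 0 is falsy, so it never gets used as the actual batch size.
    if override_bs:
        return override_bs
    return auto_detect()

# With batch_size: 0 in the config, the auto-detection path is always taken.
batch_size = resolve_batch_size(0, auto_detect=lambda: 512)
print(batch_size)  # 512 -- the probe's starting size shown in the log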
Expected behavior
The batch size is auto-detected correctly and the run finishes.
Version info
git+ssh://[email protected]/huggingface/lighteval.git@80b460f496e729077850f379d40da88298489a8f#egg=lighteval