
fixed: OverflowError: out of range integral type conversion attempted #2206

Open · wants to merge 6 commits into main

Conversation

@himanshushukla12 commented Oct 9, 2024

This is fixed using the Accelerate library.

What does this PR do?

This PR fixes two issues by using the Hugging Face Accelerate library. The issues mainly occur when the system has two or more GPUs, a case that was not handled properly, so I used Accelerate to handle device placement correctly.
They appear when executing the commands below:

python examples/scripts/chat.py --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157

and

trl chat  --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157

Fixes #2205.
The two errors it fixes are:

OverflowError: out of range integral type conversion attempted

and

RuntimeError: probability tensor contains either inf, nan or element < 0
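For context on the first error: "out of range integral type conversion attempted" is the kind of `OverflowError` raised when a token id is converted to a fixed-width unsigned integer but does not fit, e.g. a negative or oversized id produced by corrupted sampling. A minimal pure-Python sketch of that failure mode (the `to_u32` helper is hypothetical, only illustrating the conversion, not TRL code):

```python
def to_u32(token_id: int) -> bytes:
    # A token id must fit in an unsigned 32-bit integer; a negative or
    # oversized id (e.g. sampled from a nan-filled probability tensor)
    # raises OverflowError, the error class reported above.
    return token_id.to_bytes(4, "little", signed=False)

to_u32(42)    # a valid id converts fine
# to_u32(-1)  # raises OverflowError (negative int to unsigned)
```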

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section? Yes.
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@himanshushukla12 (Author)

@qgallouedec Please review my PR, I'm too excited...

@qgallouedec (Member)

Hey, not having any issue with trl chat and 2 GPUs. Can you double check?

@himanshushukla12 (Author)

> Hey, not having any issue with trl chat and 2 GPUs. Can you double check?

I tried with all the latest dependencies but still faced this issue.

@qgallouedec (Member)

Can you share your system info? (trl env)

@himanshushukla12 (Author)

> Can you share your system info? (trl env)

  • Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.35
  • Python version: 3.10.11
  • PyTorch version: 2.4.1
  • CUDA device: NVIDIA RTX 6000 Ada Generation
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 1.0.0
  • Accelerate config: not found
  • Datasets version: 3.0.1
  • HF Hub version: 0.25.2
  • TRL version: 0.12.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: 0.13.1

@himanshushukla12 (Author)

This is the error I got:


trl chat  --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157
Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/trl/models/minimal/ppo/checkpoint-157'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 368, in <module>
    chat_cli()
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 275, in chat_cli
    model, tokenizer = load_model_and_tokenizer(args)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 213, in load_model_and_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 854, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 686, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/trl/models/minimal/ppo/checkpoint-157'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
[17:14:41] TRL - CHAT failed! See the logs above for further details.                                                                                                  cli.py:127
Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py', '--model_name_or_path', '/home/trl/models/minimal/ppo/checkpoint-157']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

<z004x2xz>:
Hello

</home/trl/models/minimal/ppo/checkpoint-157>:
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3169, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Knowided typ fileouth carryingscope bounds Small Soviet //�Dourunionauses                                                                     
Traceback (most recent call last):
  File "/home/trl/venvTRL/bin/trl", line 8, in <module>
    sys.exit(main())
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 137, in main
    chat()
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 1146, in communicate
    self.wait()
<z004x2xz>:
describe something about new technologies...?

</home/trl/models/minimal/ppo/checkpoint-157>:
dealualy----------------�                                                                                                                     
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3169, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
dealualy----------------� Committeeoll animalsironflags                                                                                       
southwest southwest384

@qgallouedec (Member)

I can't really reproduce it since you're using a local model. Do you get the same error with a remote model?

@himanshushukla12 (Author) commented Oct 10, 2024

> I can't really reproduce it since you're using a local model. Do you get the same error with a remote model?

I tried with local models only, not with remote models.
Have you tested my code as well?

@qgallouedec (Member)

I did, and everything works as expected

@himanshushukla12 (Author)

> I did, and everything works as expected

Please share your trl env; this might help.

@qgallouedec (Member)

  • Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • PyTorch version: 2.4.1
  • CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • Datasets version: 3.0.0
  • HF Hub version: 0.24.7
  • TRL version: 0.12.0.dev0+b3f93f0
  • bitsandbytes version: 0.41.1
  • DeepSpeed version: 0.15.1
  • Diffusers version: 0.30.3
  • Liger-Kernel version: 0.3.0
  • LLM-Blender version: 0.0.2
  • OpenAI version: 1.46.0
  • PEFT version: 0.13.0

@qgallouedec (Member)

$ trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct
<quentin_gallouedec>:
Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
The closest planet to Earth is Venus. On average, Venus is about 25 million miles (40 million kilometers) away from our planet. Due to a massive tilt in Venus's axis, it permanently rotates in the      
opposite direction of its orbit around the Sun, resulting in very high levels of solar radiation and extreme greenhouse gases in its atmosphere.                                                          

<quentin_gallouedec>:

@himanshushukla12 (Author)

I tried:

$ trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct

This is what I got:

  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/bin/trl", line 8, in <module>
    sys.exit(main())
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 137, in main
    chat()
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
<z004x2xz>:
Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
scar=scCUSRound himself,…ирpackageerceerseREET Soldiersendersittiittoatto_signatureLaugh//                   
/;                                                                                                           
!;                                                                                                           
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3133, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1187, in forward
    outputs = self.model(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 914, in forward
    causal_mask = self._update_causal_mask(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1003, in _update_causal_mask
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 284, in _ignore_causal_mask_sdpa
    elif not is_tracing and torch.all(attention_mask == 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

getResponse/response dad momasLEYesor health outоP acidity as capital Bent Ent ch Cancer immaturelublue      
yielding of 

By running like this:

CUDA_VISIBLE_DEVICES=0 trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct

Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
The closest planet to the Sun is Mercury. It's a small, rocky planet with a highly elliptical orbit that     
takes about 88 Earth days to complete.                                                                       

However, if you're asking about other planets, it would be Venus or Mars. Venus is the second planet from the
Sun, and Mars is the third.                                                                                  

If you're looking for a specific planet, I can try and help you with that. Can you please provide more       
context or clarify what you're asking about? 

With the device specified, inference was also fast.
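The CUDA_VISIBLE_DEVICES workaround above can also be applied from Python, as long as the variable is set before torch initializes CUDA. A minimal sketch (the helper name is mine, not part of TRL):

```python
import os

def restrict_to_gpu(index: int = 0) -> None:
    # Equivalent of prefixing the command with CUDA_VISIBLE_DEVICES=0:
    # it must run before torch/CUDA is initialized, so only the chosen
    # GPU is visible and the model cannot be split across devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

restrict_to_gpu(0)
```

This sidesteps the multi-GPU placement problem by construction, rather than fixing the placement itself.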
