
fixed: OverflowError: out of range integral type conversion attempted #2206

Open · wants to merge 6 commits into main

Conversation

@himanshushukla12 commented Oct 9, 2024

This is fixed using the Accelerate library.

What does this PR do?

This PR fixes two issues by using the Hugging Face Accelerate library. The issues mainly occur when the system has two or more GPUs, a case that was not handled properly, so I used Accelerate to handle device placement correctly.
They appear when executing the commands below:

python examples/scripts/chat.py --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157

and

trl chat  --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157

Fixes #2205.
The two errors it fixes are:

OverflowError: out of range integral type conversion attempted

and

RuntimeError: probability tensor contains either inf, nan or element < 0
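For context on the first error: "out of range integral type conversion attempted" is the kind of `OverflowError` raised when a token id is converted to a fixed-width unsigned integer but does not fit, e.g. a negative or oversized id produced by corrupted sampling. A minimal pure-Python sketch of that failure mode (the `to_u32` helper is hypothetical, only illustrating the conversion, not TRL code):

```python
def to_u32(token_id: int) -> bytes:
    # A token id must fit in an unsigned 32-bit integer; a negative or
    # oversized id (e.g. sampled from a nan-filled probability tensor)
    # raises OverflowError, the error class reported above.
    return token_id.to_bytes(4, "little", signed=False)

to_u32(42)    # a valid id converts fine
# to_u32(-1)  # raises OverflowError (negative int to unsigned)
```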

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section? Yes.
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@himanshushukla12 (Author)

@qgallouedec Please review my PR, I'm too excited...

@qgallouedec (Member)

Hey, not having any issue with trl chat and 2 GPUs. Can you double check?

@himanshushukla12 (Author)

> Hey, not having any issue with trl chat and 2 GPUs. Can you double check?

I tried with all the latest dependencies but still faced this issue.

@qgallouedec (Member)

Can you share your system info? (trl env)

@himanshushukla12 (Author)

> Can you share your system info? (trl env)

  • Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.35
  • Python version: 3.10.11
  • PyTorch version: 2.4.1
  • CUDA device: NVIDIA RTX 6000 Ada Generation
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 1.0.0
  • Accelerate config: not found
  • Datasets version: 3.0.1
  • HF Hub version: 0.25.2
  • TRL version: 0.12.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: 0.13.1

@himanshushukla12 (Author)

This is the error I got:


trl chat  --model_name_or_path /home/trl/models/minimal/ppo/checkpoint-157
Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/trl/models/minimal/ppo/checkpoint-157'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 368, in <module>
    chat_cli()
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 275, in chat_cli
    model, tokenizer = load_model_and_tokenizer(args)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py", line 213, in load_model_and_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 854, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 686, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/trl/models/minimal/ppo/checkpoint-157'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
[17:14:41] TRL - CHAT failed! See the logs above for further details.                                                                                                  cli.py:127
Traceback (most recent call last):
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/scripts/chat.py', '--model_name_or_path', '/home/trl/models/minimal/ppo/checkpoint-157']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

<z004x2xz>:
Hello

</home/trl/models/minimal/ppo/checkpoint-157>:
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3169, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Knowided typ fileouth carryingscope bounds Small Soviet //�Dourunionauses                                                                     
Traceback (most recent call last):
  File "/home/trl/venvTRL/bin/trl", line 8, in <module>
    sys.exit(main())
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 137, in main
    chat()
  File "/home/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 1146, in communicate
    self.wait()
<z004x2xz>:
describe something about new technologies...?

</home/trl/models/minimal/ppo/checkpoint-157>:
dealualy----------------�                                                                                                                     
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3169, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
dealualy----------------� Committeeoll animalsironflags                                                                                       
southwest southwest384

@qgallouedec (Member)

I can't really reproduce it since you're using a local model. Do you get the same error with a remote model?

@himanshushukla12 (Author) commented Oct 10, 2024

> I can't really reproduce it since you're using a local model. Do you get the same error with a remote model?

I tried with local models only, not with remote models.
Have you tested my code as well?

@qgallouedec (Member)

I did, and everything works as expected

@himanshushukla12 (Author)

> I did, and everything works as expected

Please share your trl env; this might help.

@qgallouedec (Member)

  • Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • PyTorch version: 2.4.1
  • CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • Datasets version: 3.0.0
  • HF Hub version: 0.24.7
  • TRL version: 0.12.0.dev0+b3f93f0
  • bitsandbytes version: 0.41.1
  • DeepSpeed version: 0.15.1
  • Diffusers version: 0.30.3
  • Liger-Kernel version: 0.3.0
  • LLM-Blender version: 0.0.2
  • OpenAI version: 1.46.0
  • PEFT version: 0.13.0

@qgallouedec (Member)

$ trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct
<quentin_gallouedec>:
Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
The closest planet to Earth is Venus. On average, Venus is about 25 million miles (40 million kilometers) away from our planet. Due to a massive tilt in Venus's axis, it permanently rotates in the      
opposite direction of its orbit around the Sun, resulting in very high levels of solar radiation and extreme greenhouse gases in its atmosphere.                                                          

<quentin_gallouedec>:

@himanshushukla12 (Author)

I tried:

$ trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct

This is what I got:

  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/bin/trl", line 8, in <module>
    sys.exit(main())
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 137, in main
    chat()
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/trl/commands/cli.py", line 118, in chat
    subprocess.run(
  File "/home/z004x2xz/local/python3.10/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
<z004x2xz>:
Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
scar=scCUSRound himself,…ирpackageerceerseREET Soldiersendersittiittoatto_signatureLaugh//                   
/;                                                                                                           
!;                                                                                                           
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/z004x2xz/local/python3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 2173, in generate
    result = self._sample(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/generation/utils.py", line 3133, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1187, in forward
    outputs = self.model(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 914, in forward
    causal_mask = self._update_causal_mask(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1003, in _update_causal_mask
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
  File "/home/z004x2xz/WorkAssignedByMatt/trl/venvTRL/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 284, in _ignore_causal_mask_sdpa
    elif not is_tracing and torch.all(attention_mask == 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

getResponse/response dad momasLEYesor health outоP acidity as capital Bent Ent ch Cancer immaturelublue      
yielding of 

By running like this:

CUDA_VISIBLE_DEVICES=0 trl chat --model_name_or_path meta-llama/Llama-3.2-1B-Instruct

Hello, what's the closest planet?

<meta-llama/Llama-3.2-1B-Instruct>:
The closest planet to the Sun is Mercury. It's a small, rocky planet with a highly elliptical orbit that     
takes about 88 Earth days to complete.                                                                       

However, if you're asking about other planets, it would be Venus or Mars. Venus is the second planet from the
Sun, and Mars is the third.                                                                                  

If you're looking for a specific planet, I can try and help you with that. Can you please provide more       
context or clarify what you're asking about? 

With the device specified, inference was also fast.
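The CUDA_VISIBLE_DEVICES workaround above can also be applied from Python, as long as the variable is set before torch initializes CUDA. A minimal sketch (the helper name is mine, not part of TRL):

```python
import os

def restrict_to_gpu(index: int = 0) -> None:
    # Equivalent of prefixing the command with CUDA_VISIBLE_DEVICES=0:
    # it must run before torch/CUDA is initialized, so only the chosen
    # GPU is visible and the model cannot be split across devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

restrict_to_gpu(0)
```

This sidesteps the multi-GPU placement problem by construction, rather than fixing the placement itself.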
