
ValueError: Input image size (112*1036) doesn't match model ([112, 1036]*[112, 1036]). #67

Open
JeffRody opened this issue Oct 28, 2024 · 1 comment

@JeffRody

Using the latest transformers (4.47.0.dev0), I removed the _expand_mask import and replaced it with a custom definition:
from typing import Optional

import torch

def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from [bsz, seq_len] to [bsz, 1, tgt_seq_len, src_seq_len].
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    # Broadcast the 2-D padding mask up to 4-D and cast to the model dtype.
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    # Invert so attended positions become 0.0 and padded positions become dtype min.
    inverted_mask = 1.0 - expanded_mask
    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
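
For reference, a quick hypothetical check of what this helper returns (the example tensor is illustrative, not from the issue):

# Batch of 2 sequences of length 3; the second sequence has one padded position.
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
out = _expand_mask(mask, torch.float32)
print(out.shape)  # torch.Size([2, 1, 3, 3])
# Attended positions are 0.0; the padded column holds torch.finfo(torch.float32).min.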

The other modalities run fine, but modeling_audio throws the error below. How should I fix it?
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "/home/wanglch/projects/LanguageBind/inference.py", line 43, in
embeddings = model(inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wanglch/projects/LanguageBind/languagebind/init.py", line 78, in forward
value = self.modality_encoderkey[1]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wanglch/projects/LanguageBind/languagebind/audio/modeling_audio.py", line 656, in forward
hidden_states = self.embeddings(pixel_values)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 244, in forward
raise ValueError(
ValueError: Input image size (112*1036) doesn't match model ([112, 1036]*[112, 1036]).
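
The model size being printed as the list [112, 1036] on both sides suggests a type mismatch rather than a genuine size mismatch: LanguageBind's audio config stores image_size as a list, while newer transformers versions compare the input height and width against image_size as a plain int. A rough, paraphrased sketch of the check in transformers' CLIPVisionEmbeddings (not a verbatim copy of modeling_clip.py) shows why it would always raise here:

# pixel_values: [bsz, channels, 112, 1036] for the audio spectrogram "image"
_, _, height, width = pixel_values.shape
if height != self.image_size or width != self.image_size:
    # With self.image_size == [112, 1036], the comparison `112 != [112, 1036]`
    # is always True, so this raises even though the input shape is correct.
    raise ValueError(
        f"Input image size ({height}*{width}) doesn't match model "
        f"({self.image_size}*{self.image_size})."
    )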

@SoyeonHH

I ran into the same issue, but it was probably due to a version difference.

I made the following changes to the modeling_audio file and it fixed the problem.

from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask

attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
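
In context, and assuming the original modeling_audio.py built the 4-D mask with the removed _expand_mask helper (an assumption; exact lines vary across versions), the change looks roughly like:

# Before: helper that newer transformers releases no longer provide
# attention_mask = _expand_mask(attention_mask, hidden_states.dtype)

# After: the replacement utility shipped with newer transformers
from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask
attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)

Both build the same [bsz, 1, tgt_len, src_len] additive mask (0.0 where attended, dtype min where masked), so no other call sites should need to change.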
