Update dependency transformers to v4.47.1 #647
This PR contains the following updates:
transformers: 4.46.3 -> 4.47.1
Release Notes
huggingface/transformers (transformers)
v4.47.1
Patch release v4.47.1
We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!
Fix GA loss bugs and add unit test (#35121)
Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115)
Contributed by @xspirus.
Fix FSDP no longer working (#35212)
Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212)
Contributed by @winglian.
Only import torch.distributed if it is available (#35133)
Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295)
Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!
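The first two fixes above concern loss normalization under gradient accumulation (GA). As a rough illustration of why the item count matters, the toy sketch below compares averaging per-micro-batch mean losses against dividing the summed loss by the total number of items. The helper names and the numbers are hypothetical, and this is not the code changed in the linked PRs.

```python
# Toy illustration of loss normalization under gradient accumulation.
# Hypothetical helper names; this is not the transformers implementation.

def mean_of_means(micro_batches):
    """Average each micro-batch's mean loss (biased when sizes differ)."""
    per_batch = [sum(losses) / len(losses) for losses in micro_batches]
    return sum(per_batch) / len(per_batch)


def token_weighted_mean(micro_batches):
    """Sum all per-item losses and divide by the total item count."""
    num_items_in_batch = sum(len(losses) for losses in micro_batches)  # an int
    total = sum(sum(losses) for losses in micro_batches)
    return total / num_items_in_batch


# Two accumulation steps holding different numbers of items.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [6.0]]

naive = mean_of_means(micro_batches)          # (2.0 + 6.0) / 2
correct = token_weighted_mean(micro_batches)  # (8.0 + 6.0) / 5
```

The two results differ whenever the micro-batches hold different numbers of items, which is why the total count has to be tracked across the accumulated steps.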
v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel
New models
PaliGemma-2
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
I-JEPA
The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.
OLMo 2
The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
Commits:
Layer-Skip Llama
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe (early exit loss and layer dropout), as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding. It is capable of self-speculative decoding: decode with the earlier layers and verify with the remaining layers.
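The draft-then-verify loop behind self-speculative decoding can be sketched with toy stand-ins. In the sketch below, draft_next and full_next are hypothetical functions playing the role of the early-exit layers and the full layer stack; a real implementation would verify all drafted positions in a single forward pass rather than one call per token.

```python
# Toy sketch of greedy self-speculative decoding. The two "models" are
# hypothetical stand-ins: draft_next plays the early-exit layers and
# full_next plays the full stack of layers.

def full_next(tokens):
    """Full model: next token is the sum of the last two tokens mod 7."""
    return (tokens[-1] + tokens[-2]) % 7


def draft_next(tokens):
    """Early-exit draft: agrees with the full model except after a 3."""
    guess = (tokens[-1] + tokens[-2]) % 7
    return guess if tokens[-1] != 3 else (guess + 1) % 7


def self_speculative_decode(prompt, max_new_tokens, num_draft=4):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft a few tokens cheaply with the early layers.
        draft = list(tokens)
        for _ in range(num_draft):
            draft.append(draft_next(draft))
        # 2) Verify: keep drafted tokens while the full model agrees.
        for i in range(len(tokens), len(draft)):
            expected = full_next(draft[:i])
            if draft[i] != expected:
                draft = draft[:i] + [expected]  # replace the first mismatch
                break
        tokens = draft[:target_len]
    return tokens


def greedy_decode(prompt, max_new_tokens):
    """Reference: plain greedy decoding with the full model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_next(tokens))
    return tokens
```

Because every accepted token is checked against the full model, the output matches plain greedy decoding; the saving comes from how many drafted tokens are accepted per verification step.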
Tensor Parallel implementation
This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example). The motivation is multi-fold:
to make the modeling code as simple as the single-worker case: all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible to users: added a model.tensor_parallel(device_mesh) method that allows users to turn a single-proc model into a parallel model.
This is the first PR of many to simplify and enable Tensor Parallel across models.
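The idea behind tensor parallelism can be illustrated without any distributed runtime. The sketch below is a pure-Python stand-in, assuming a column-wise split of one weight matrix across two hypothetical workers; it does not use torch.distributed.tensor.parallel or the model.tensor_parallel(device_mesh) method described above.

```python
# Toy illustration of column-wise tensor parallelism for y = x @ W.
# Pure Python stand-in; real code would shard torch tensors across devices.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def shard_columns(w, num_workers):
    """Split the columns of w into num_workers contiguous shards."""
    cols = len(w[0])
    step = cols // num_workers  # assumes cols is divisible by num_workers
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(num_workers)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "worker" holds one shard of W and computes its slice of the output.
partials = [matmul(x, shard) for shard in shard_columns(w, num_workers=2)]

# Concatenating the slices (an all-gather in a real setup) recovers x @ W.
y_parallel = [v for part in partials for v in part]
y_single = matmul(x, w)
```

Each worker only stores and multiplies its own columns, so the memory and compute for the layer are divided across workers while the result stays identical to the single-worker case.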
Farewell, Python 3.8
Python 3.8 has reached end of life, so we are dropping it from our CI.
GGUF improvements
Several improvements have been made to GGUF support in transformers, notably new architectures added to the list of supported architectures.
use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450
Fast processors
We continue the work to improve the speed of fast processors as detailed in this roadmap.
We contribute a fast processor to RT-DETR.
New pipelines
A new pipeline has been added to transformers: image-text-to-text!
The pipeline supports the following inputs:
Notable refactors
Separate chat templates into a single file
We have had several issues with chat templates because they're stored as single lines in the JSON config files:
processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
The solution:
chat_template.jinja file in the repo
Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.
.For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.
Large modular logic refactor
This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:
Community bugfixes and improvements
convert_tokens_to_ids by @winstxnhdw in #34030
torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
test_eager_matches_sdpa_generate by @gante in #34386
tensorflow_probability<0.22 in docker files by @ydshieh in #34381
"best" for args.save_strategy. by @seanswyi in #31817
model_doc/barthez.md to Korean by @Jwaminju in #33980
docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
post_process_depth_estimation for GLPN by @alex-bene in #34413
head_dim for mixtral model by @wavy-jung in #34281
optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
generate tests to the right mixin and delete redundant tests by @gante in #34464
gc.collect and cuda.empty_cache by @ydshieh in #34514
input_ids - inputs_embeds equivalence check by @gante in #34535
test_eager_matches_sdpa_inference less flaky by @ydshieh in #34512
docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
isin_mps_friendly can support 0D tensors by @gante in #34538
@slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
convbert.md to Korean by @ahnjj in #34599
timesformer.md to Korean by @mreraser in #33972
docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
Tool.from_space() by @aymeric-roucher in #34561
docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
continue_final_message=True by @lewtun in #34253
patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
empty_cache device-agnostic by @faaany in #34774
test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
check_training_gradient_checkpointing by @ydshieh in #34806
torch.export by @philkuz in #34103
max_steps overriding num_train_epochs by @qgallouedec in #34810
use_cache by @zucchini-nlp in #34274
[Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
[peft] Given that self.active_adapter is deprecated, avoid using it by @tomaarsen in #34804
test_auto_backbone_timm_model_from_pretrained by @ydshieh in #34877
docs/source/ar/benchmarks.md into Arabic by @AhmedAlmaghz in #33023
[FlexAttention] Update gemma2 by @ArthurZucker in #34942
get_max_length by @ydshieh in #34971
Thread by @ydshieh in #34966
release_memory() by @faaany in #34911
VisitWebpageTool by @sergiopaniego in #34978
save_pretrained for partially offloaded models by @kylesayrs in #34890
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.