Hi!
In this branch I reintroduce the Llama model & conversion scripts, update them to the current main branch, and add support for the Llama 3.1 and Llama 3.2 1B & 3B models.
The main changes are the following:
- Updated the RoPE implementation to the `transformers` `LlamaRotaryEmbedding` layer. This will now be the only rotary-embedding class in `llama.py`. I think it shouldn't break generations for the inference case WITHOUT `LlamaConfig.rope_interleaved = True` in `CausalSelfAttention.forward`, but are there any tests for this? (See the RoPE sketch after this list.)
- Added a `config.optimizer.finetuning` flag in order to either just load the weights (`True`) or load the weights, optimizer & LR scheduler (`False`), instead of `config.checkpoints.load_optimizer` & `config.checkpoints.load_lr_scheduler`.
- Switched from `flash_attn_varlen_func` to `flash_attn_func`, as the latter achieves greater performance. Keep in mind that we aren't using any feature of the varlen function, so it's recommended to stick with `flash_attn_func` (see the FlashAttention sketch after this list).
- Do we still need `LlamaConfig.rope_interleaved`? It was useful for training when using the FlashAttention RoPE, and now it also seems to be used in the inference code. IMO we should unify all 3 cases (training, inference with `rope_interleaved` & inference without `rope_interleaved`) within a single RoPE implementation.
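For reference, here is a minimal sketch (my own illustration, not Nanotron's actual code) of the two RoPE layouts that `rope_interleaved` switches between; `cos`/`sin` are assumed to already be broadcast to the head dimension in the matching layout:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Non-interleaved layout (transformers-style): rotate pairs formed by the
    # two halves of the head dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Interleaved layout (FlashAttention-style): rotate pairs formed by
    # adjacent even/odd channels.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, interleaved: bool) -> torch.Tensor:
    # cos/sin must be laid out to match the chosen convention (duplicated halves
    # vs. values repeated per channel pair); mixing the two conventions is what
    # silently breaks generations.
    rotate = rotate_interleaved if interleaved else rotate_half
    return x * cos + rotate(x) * sin
```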
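And a minimal sketch of the two FlashAttention entry points, to illustrate why `flash_attn_func` is enough here (requires a CUDA GPU, fp16/bf16 tensors and the `flash-attn` package; shapes and arguments follow flash-attn 2.x and may differ across versions):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 2, 128, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Fixed-length batch: one (batch, seqlen, nheads, headdim) tensor per projection.
out = flash_attn_func(q, k, v, causal=True)

# Variable-length batch: sequences are flattened to (total_tokens, nheads, headdim)
# and described by cumulative sequence lengths. Since all our sequences share the
# same length, this extra bookkeeping buys us nothing, hence the switch above.
cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen, dtype=torch.int32, device="cuda")
out_varlen = flash_attn_varlen_func(
    q.reshape(-1, nheads, headdim),
    k.reshape(-1, nheads, headdim),
    v.reshape(-1, nheads, headdim),
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=seqlen,
    max_seqlen_k=seqlen,
    causal=True,
)
```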
## Results

You can run the conversion & generation tests using the scripts in `tools/converters`. As I already mentioned in the previous PR (#174), even though we need at least 1 GPU (to init the `ParallelContext`), we run the conversion on the CPU.

As can be seen in the following table, we observe slight differences between the 2 backends. Those differences are produced by the QKV projections in the `CausalSelfAttention` layer (Nanotron computes them in a single GEMM vs. 3 separate GEMMs in HF) and by the LayerNorm layer (Nanotron uses an optimized kernel from FlashAttention vs. the basic PyTorch LayerNorm in HF). Also note that the differences increase if we use TP, which is totally expected, as the sizes of the GEMMs are different, triggering different GEMM algorithms.
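As a toy illustration of the fused vs. split GEMM point (unrelated to the actual model code; the magnitude of the difference, possibly even zero, depends on dtype, shapes and backend):

```python
import torch

torch.manual_seed(0)
hidden, seq = 4096, 16
x = torch.randn(seq, hidden, dtype=torch.bfloat16)
w_q, w_k, w_v = (torch.randn(hidden, hidden, dtype=torch.bfloat16) for _ in range(3))

# One fused QKV GEMM (Nanotron-style) vs. three separate GEMMs (HF-style).
fused = x @ torch.cat([w_q, w_k, w_v], dim=1)
split = torch.cat([x @ w_q, x @ w_k, x @ w_v], dim=1)

# The outputs can differ by a few ULPs because different GEMM shapes may pick
# different algorithms / accumulation orders; it is not a correctness bug.
print((fused - split).abs().max())
```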
To run the Nanotron generations with different TP sizes:
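For example, something along these lines (the script name and arguments are assumptions; adjust them to the actual scripts under `tools/converters`), with `--nproc_per_node` matching the desired TP size:

```bash
# TP=2 run of the (assumed) Nanotron generation script; use a single process for TP=1.
torchrun --nproc_per_node=2 tools/converters/delete/generate_nanotron_predictions.py \
    --checkpoint-path /path/to/nanotron/checkpoint
```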
TODO (Preferably in other PRs):
- Remove the `nanotron/tools/converters/delete/generate_hf_predictions.py` & `nanotron/tools/converters/delete/generate_nanotron_predictions.py` scripts.
- Unify the `apply_rotary_pos_emb` logic in `CausalSelfAttention.forward` (see the `rope_interleaved` point above).