Llama3 conversion scripts 🦙 #174

TJ-Solergibert · 2024-05-20T00:57:48Z

Hello,

In this PR, I include the scripts to convert the checkpoints of Llama3 8B & 70B to Nanotron. Although there are still some details to be polished, the current status is as follows:

Conversion from HF to Nanotron of the 8B model
Conversion from Nanotron to HF of the 8B model
Conversion from HF to Nanotron of the 70B model
Conversion from Nanotron to HF of the 70B model
Conversion on CPU
TP Topology agnostic

All conversions are carried out in BFLOAT16 and on the CPU, but we will need at least one GPU because the ParallelContext requires it. The 8B model fits on a GPU with 80GB, but the 70B model does not. Even so, in ALL conversions, we will set DP=PP=TP=1. I have confirmed that Nanotron supports changing the TP topology, although while waiting for GPUs in my cluster, I developed a fancy script with broadcast, scatter, and gathers to perform the conversion with TP>1. I have also tried a dummy Finetune with TP=2 from the TP=1 8B converted checkpoint to store it back with TP=2, checked the results in Nanotron (correct, results below), and then converted it back to HF with the result still being correct. I have attempted to experiment with all possible cases I think.

Included

convert_hf_to_nanotron.py to convert the weights from the HF checkpoint to Nanotron
convert_nanotron_to_hf.py to convert the weights from the Nanotron checkpoint to HF
generate_hf_predictions.py to test the logits of the HF model with a prompt
generate_nanotron_predictions.py to test the logits of the Nanotron model with a prompt
- The generation scripts are just for debugging reasons, we should delete them before merging

Results & Precision

It is impossible for the two models (HF & Nanotron) to produce exactly the same logits with a level of precision capable of passing the assert_close test. This is true both at the model level and at the layer level because, despite having the same parameters, the two models perform different operations. Different in the sense of...

Shapes: For TP in the Attention Layer, the QKV matrices are fused in qkv_proj and the projections are computed with a single GEMM, whereas in the HF implementation, it is done in three (Even in the Meta model, it is done the same way, although they also have TensorParallelLayers). By changing the shape of the matrices, the result is different because, in the GEMM, the order of operations is non-deterministic, and in reduced 16-bit types, the difference becomes more noticeable when accumulating the result. The same happens in the MLP layer with gate_up_proj.
Operators: RoPE, LayerNorm. Nanotron uses different implementations for both RoPE and LayerNorm (TritonRMSNorm), which produce results that are not exactly the same as those of the HF implementation.

I have a (somewhat catastrophic) notebook where the differences at each operation level are evident. But what is really important is not so much the logits but the predictions and their order. To verify this, I developed the generate_XXXX.py scripts that, from the same prompt for the desired tokens, print the 10 most probable predictions and print an accuracy value of all the sequence. I chose a fixed prompt to 1. Ensure manually that the predictions makes sense 2. Compare through the different converted models. The following table shows the accuracy results for different configurations.

Experiment	Backend	Size	TP	Accuracy
OG HF	HF	8	1	0.83
OG HF --> Nanotron	Nanotron	8	1	0.83
OG HF --> Nanotron --> HF	HF	8	1	0.83
OG HF	HF	70	2	0.89
OG HF --> Nanotron	Nanotron	70	2	0.83
OG HF --> Nanotron --> HF	HF	70	2	0.89
HF -> Nanotron -> Dummy Finetune to change TP=2 -> HF	HF	8	1 --> 2	0.83

It is worth noting that:

For the 70B model, when using the HF backend with AutoModelForCausalLM.from_pretrained() there is NO tensor parallelism, while in Nanotron there is.
The accuracy values are from the prediction of 512 tokens.

Details

This PR is build with #168 FA2 kernel, which is the same as in the HF implementation.

After extensive reverse engineering, I found a critical point that was significantly different from the HuggingFace implementation: RoPE. After numerous tests, even transferring the RoPE from the HF implementation, it turns out that there are 2 fundamental parameters of the FlashRotaryEmbedding layer:

interleaved: The default value in Nanotron is True, but it must be False.
rope_theta: The default value is 10000.0, but for Llama3, it is 500000.0.

I have included both values in LlamaConfig, with the OLD values as defaults, although I propose at least changing the interleaved default to False.

In essence, to perform the conversions, we initialize the two implementations (HuggingFace & Nanotron) and copy the parameters layer by layer. After trying several methods to copy the weights, I opted for the copy_ method, because this way we preserve the ShardedInfo & TiedInfo of all the NanotronParameters.

The conversion from HF to Nanotron is fast, taking 2 and 16 minutes for the 8B and 70B models respectively. However, the conversion from Nanotron to HF extends to 5 and 51 minutes respectively. This is due to the initialization of the HF model (AutoModelForCausalLM.from_config()).

When converting to Nanotron, we also store the tokenizer (as in the HF models) and generate a config.yaml with the basic configurations and parameters to start training from the checkpoint. Additionally, the conversions include assertions on all parameters to ensure that we are copying the parameters correctly and making the process as explicit as possible for the conversion of future models.

TODO

Check torch.no_grad() in conversions
Improve logging, log_rank of Nanotron was not working correctly
Add README
~~Add push_to_hub flag in the Nanotron to HF conversion script~~

Instructions

In the header of all the files there are instructions, I recommend the following commands to launch the evaluations and conversions.

torchrun --nproc-per-node 1 tools/llama3/generate_hf_predictions.py --pretrained-model-name-or-path  meta-llama/Meta-Llama-3-8B-Instruct
torchrun --nproc-per-node 1 tools/llama3/convert_hf_to_nanotron.py --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama3-8B --pretrained-model-name-or-path meta-llama/Meta-Llama-3-8B-Instruct
torchrun --nproc-per-node 2 tools/llama3/generate_nanotron_predictions.py --tp 2 --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama3-8B
torchrun --nproc-per-node 1 tools/llama3/convert_nanotron_to_hf.py --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama3-8B --hugging-face-checkpoint-path hf_checkpoints/ConvertedNanotronLlama3-8B
torchrun --nproc-per-node 1 tools/llama3/generate_hf_predictions.py --pretrained-model-name-or-path hf_checkpoints/ConvertedNanotronLlama3-8B

xrsrke · 2024-05-20T12:24:38Z

@TJ-Solergibert. Thanks for the PR, have you tried continually pretraining or finetuning a llama3-converted checkpoint to Nanotron? I encountered some exploding gradient issues in my experience (not in your PR)

TJ-Solergibert · 2024-05-20T18:36:25Z

Hi @xrsrke ,

After your comments about exploding gradient issues I've run the following:

Preprocessed the DKYoon/SlimPajama-6B dataset to use Nanoset
Changed the TXT prompt of the generate_XXX.py scripts to a prompt generated by meta-llama/Meta-Llama-3-8B. I do this in order to get a good accuracy in the tests in order to detect flaws more easily (If we perform bad and after we perform worse it's difficult from where does this decrease comes from).
Run generate_hf_predictions.py for the base Llama-3-8B model and we get 0.888671875 of accuracy:
torchrun --nproc-per-node 1 tools/llama3/generate_hf_predictions.py --pretrained-model-name-or-path models/Meta-Llama-3-8B
Convert the checkpoint to Nanotron, 2 minutes:
torchrun --nproc-per-node 1 tools/llama3/convert_hf_to_nanotron.py --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama-3-8B --pretrained-model-name-or-path models/Meta-Llama-3-8B
Generate Nanotron predictions with generate_nanotron_predictions.py with TP = 1 & TP = 2:
torchrun --nproc-per-node 1 tools/llama3/generate_nanotron_predictions.py --tp 1 --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama-3-8B
We get 0.888671875 & 0.869140625 with TP = 1 & TP = 2 respectively. This difference is due to TP and what I explained about shapes and GEMMs.
Run a fine-tune for 500 steps with TP = 2 and 256000 tokens. The logs of the run are here. I don't see any issues.
Then I run generate_nanotron_predictions.py with the new checkpoint with TP = 2. The accuracy is very very low. Something is happening.
- First I rerun the experiment for just 5 steps. Accuracy is still very very low.
- I try with PP = 2 & TP = 1 to try if it's a problem with TP. This doesn't makes much sense because as I've said, we can run the Nanotron generations with different TP sizes and also the 70B model is converted to a TP = DP = PP = 1 checkpoint and it works converting the model in both directions + the generations. The accuracy still sucks.
- Finally, I reduce the learning rate. This was the actual problem, as I was using the default one. I set a very low value and train for 100 iterations. The logs are also in W&B.
Run predictions with the fine-tuned model. We get 0.876953125 & 0.86328125 with TP = 1 and TP = 2 respectively.
torchrun --nproc-per-node 2 tools/llama3/generate_nanotron_predictions.py --tp 2 --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama-3-8B-finetuned/100
Convert back to HuggingFace:
torchrun --nproc-per-node 1 tools/llama3/convert_nanotron_to_hf.py --nanotron-checkpoint-path nanotron_checkpoints/NanotronLlama-3-8B-finetuned/100 --hugging-face-checkpoint-path models/Meta-Llama-3-8B-finetuned
Run HuggingFace generations and got 0.880859375 accuracy:
torchrun --nproc-per-node 1 tools/llama3/generate_hf_predictions.py --pretrained-model-name-or-path models/Meta-Llama-3-8B-finetuned

So I haven't experienced any problem, let me know if I should look into anything more!

Toni

PD: We could upload Nanotron Llama3 checkpoints to the Hub, right?
PPD: In W&B I've included the results of a dummy run with 5000 steps.

3outeille · 2024-06-17T19:16:40Z

Nice PR, when loading llama3 from HF to nanotron, I had to change the rotary embedding (31c12e8) otherwise the generation was not good

TJ-Solergibert · 2024-06-17T19:58:51Z

Hi,

I just took care of the "training case". As you can see, there are 2 RotaryEmbedding layers: self.rotary_embedding & self.flash_rotary_embedding. The first one is just used in the "inference case", while the last is just for the training case. The interleaved thing is just for the RotaryEmbedding of flash-attention, I don't see the point of this and this for example, as they belong to the inference case.

For training, the RotaryEmbedding of flash-attention needs interleaved=True to match the HuggingFace implementation. Now, for the inference case, you can include other RotaryEmbedding layers for sure, but I don't understand why would we have 3 different RotaryEmbedding layers and take care of the interleaved thing for the inference case. We should have at maximum 2, right?

xrsrke · 2024-07-10T11:58:51Z

src/nanotron/config/config.py

        run: Name of the run
        step: Global step (updated when we save the checkpoint)
        consumed_train_samples: Number of samples consumed during training (should be actually just step*batch_size)
        ignore_sanity_checks: Whether to ignore sanity checks
    """

    project: str
+    entity: Optional[str] = None


Suggested change

entity: Optional[str] = None

wandb_entity: Optional[str] = None

xrsrke · 2024-07-10T12:00:45Z

run_train.py

@@ -143,17 +143,17 @@ def get_dataloader_from_data_stage(
    elif isinstance(data.dataset, NanosetDatasetsArgs):
        # Get tokenizer cardinality
        tokenizer = AutoTokenizer.from_pretrained(trainer.config.tokenizer.tokenizer_name_or_path)
-        token_dtype = np.int32 if len(tokenizer) > np.iinfo(np.uint16).max + 1 else np.uint16
+        token_size = 4 if len(tokenizer) > np.iinfo(np.uint16).max + 1 else 2


Could we add a config option to specify the byte size of a token instead of this?

3outeille · 2024-07-10T12:52:48Z

src/nanotron/data/collator.py

+
+
+@dataclasses.dataclass
+class NanosetDataCollatorForCLM:


Why not reuse DataCollatorForCLM ?

3outeille · 2024-07-10T12:56:49Z

src/nanotron/trainer.py

@@ -276,7 +276,8 @@ def pre_training(self, *args, **kwargs):
        if dist.get_rank(self.parallel_context.world_pg) == self.logger_ranks[0] and wandb is not None:
            wandb.init(
                project=self.config.general.project,
-                name=f"{current_time}_{self.config.general.run}",
+                name=f"{current_time}_{self.config.general.project}_{self.config.general.run}",
+                entity=self.config.general.entity,


if we change to wandb_entity to forget to change here as well

TJ-Solergibert · 2024-07-10T16:14:22Z

Sorry, there were 68 commits that IDK how they ended up here 😅. All your comments are respect those commits.
Now the branch only contains the files relative to the Llama3 converter scripts. Before merging I should delete the generate_XX scripts that are just for debug purposes.

The conflicts are related to the interleaved parameter in the RoPE embeddings. As we discussed, it's necessary to set this parameter to False to load llama3 pretrained checkpoints but the default value of nanotron has been always True, so I set the default value to True.

Let me know if there is still any issue!

zzhhjjj · 2024-07-24T12:45:03Z

add some instructions for downloading the weights?

hjc3613 · 2024-10-29T11:32:25Z

how to convert 70B llama hf into nanotron with tp > 1 or pp > 1， when I set tp=8 or pp=8, it gives me errors in the line of:

which show that some field in the chain of nanotron_model.model.token_position_embeddings.pp_block.token_embedding.weight.shape is none

TJ-Solergibert · 2024-10-29T11:50:01Z

Hi @hjc3613,

from my first comment:

All conversions are carried out in BFLOAT16 and on the CPU, but we will need at least one GPU because the ParallelContext requires it. The 8B model fits on a GPU with 80GB, but the 70B model does not. Even so, in ALL conversions, we will set DP=PP=TP=1. I have confirmed that Nanotron supports changing the TP topology

Toni

hjc3613 · 2024-10-29T12:49:49Z

Hi @hjc3613,

from my first comment:

All conversions are carried out in BFLOAT16 and on the CPU, but we will need at least one GPU because the ParallelContext requires it. The 8B model fits on a GPU with 80GB, but the 70B model does not. Even so, in ALL conversions, we will set DP=PP=TP=1. I have confirmed that Nanotron supports changing the TP topology

Toni

thanks for your explanation. actually, I want to train llama 70B into 1.58bit model using this method: https://github.com/huggingface/nanotron/pull/180/files
so I need to convert llama 70B into nanotron format first, after this, I need to set dp, pp, tp in the pr:180 as follows:

when set tp=8, I get the error:

when set pp=8, I got another error which also indicate that the mismatch between the two pr180 and this pr174
when set pp=1 and tp=1 and dp=1, it run OOM
I do not know how to finetune the 70B model into 1.58bit, this is my problem.

TJ-Solergibert added 8 commits May 19, 2024 09:16

Add HF Generation script

9578b69

Add Nanotron Generation Script

a7f918c

Add HF to Nanotron conversion script

7728482

Add Nanorton to HF conversion script

4107ed4

Moved scripts to tools llama3 folder

2411d43

Pushed FA2 mod and rope configs fix

372fa02

Added logging

b348460

Cleaned scripts

de81b53

TJ-Solergibert added 2 commits May 22, 2024 13:01

Added Nanotron logging

a28c532

Added README

3e169c5

TJ-Solergibert mentioned this pull request Jul 2, 2024

Llama3 conversion scripts 🦙 swiss-ai/nanotron#7

Merged

xrsrke reviewed Jul 10, 2024

View reviewed changes

3outeille reviewed Jul 10, 2024

View reviewed changes

src/nanotron/data/collator.py Outdated

@dataclasses.dataclass

class NanosetDataCollatorForCLM:

Copy link

Member

3outeille Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse DataCollatorForCLM ?

3outeille reviewed Jul 10, 2024

View reviewed changes

TJ-Solergibert force-pushed the llama3_converter branch from eb68e41 to 3e169c5 Compare July 10, 2024 16:04

Deleted generation tests & added donwload info to README

0afd7b7

TJ-Solergibert force-pushed the llama3_converter branch from da6281d to 0afd7b7 Compare July 25, 2024 11:12

Merge branch 'main' into llama3_converter

43fe9b9

zzhhjjj mentioned this pull request Sep 2, 2024

Match Transformers RoPE implementation #214

Merged

hjc3613 mentioned this pull request Oct 29, 2024

FEAT: Adding 1.58bit LLMs training architecture in nanotron #180

Draft

This was referenced Nov 23, 2024

convert_nt_to_hf is broken for non-interleaved RoPE #241

Open

[NEW] Llama3.2 weight converters 🦙 #255

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Llama3 conversion scripts 🦙 #174

Llama3 conversion scripts 🦙 #174

TJ-Solergibert commented May 20, 2024 •

edited

Loading

xrsrke commented May 20, 2024

TJ-Solergibert commented May 20, 2024 •

edited

Loading

3outeille commented Jun 17, 2024 •

edited

Loading

TJ-Solergibert commented Jun 17, 2024

xrsrke Jul 10, 2024 •

edited

Loading

xrsrke Jul 10, 2024

3outeille Jul 10, 2024

3outeille Jul 10, 2024

TJ-Solergibert commented Jul 10, 2024

zzhhjjj commented Jul 24, 2024

hjc3613 commented Oct 29, 2024

TJ-Solergibert commented Oct 29, 2024

hjc3613 commented Oct 29, 2024

	entity: Optional[str] = None
	wandb_entity: Optional[str] = None

Llama3 conversion scripts 🦙 #174

Are you sure you want to change the base?

Llama3 conversion scripts 🦙 #174

Conversation

TJ-Solergibert commented May 20, 2024 • edited Loading

Included

Results & Precision

Details

TODO

Instructions

xrsrke commented May 20, 2024

TJ-Solergibert commented May 20, 2024 • edited Loading

3outeille commented Jun 17, 2024 • edited Loading

TJ-Solergibert commented Jun 17, 2024

xrsrke Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

xrsrke Jul 10, 2024

Choose a reason for hiding this comment

3outeille Jul 10, 2024

Choose a reason for hiding this comment

3outeille Jul 10, 2024

Choose a reason for hiding this comment

TJ-Solergibert commented Jul 10, 2024

zzhhjjj commented Jul 24, 2024

hjc3613 commented Oct 29, 2024

TJ-Solergibert commented Oct 29, 2024

hjc3613 commented Oct 29, 2024

TJ-Solergibert commented May 20, 2024 •

edited

Loading

TJ-Solergibert commented May 20, 2024 •

edited

Loading

3outeille commented Jun 17, 2024 •

edited

Loading

xrsrke Jul 10, 2024 •

edited

Loading