[P1] Eval time model is not loaded: Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa #114

Open
m-dev12 opened this issue Jun 24, 2024 · 50 comments
Assignees
Labels
bug (Something isn't working), question (Further information is requested)

Comments

@m-dev12

m-dev12 commented Jun 24, 2024

I am using the configuration below but am unable to replicate the paper results. Is there anything different that the authors did in the paper? I finally got {'validation_matthews_correlation': 0.40104291315665774} instead of ~61%. Should the seed, or any other configuration, be updated? It would also be great if the authors could share wandb logs for this. Thanks!

python train.py \
-task glue \
-data_dir ./data \
-train_dataset cola \
-eval_dataset cola \
-model FacebookAI/roberta-base \
-seed 42 \
-l all \
-r 1 \
-p f3 \
-e 60 \
-lr 4e-4 \
-type LoreftIntervention \
-batch_size 32 \
-output_dir ./output \
-schedule linear \
-wu 5e-3 \
-logging_steps 20 \
-allow_cls_grad \
-metric_for_best_model matthews_correlation \
-dropout 0.2 \
-test_split validation
@frankaging

@frankaging frankaging changed the title from “Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” to “[P1] Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” Jun 24, 2024
@frankaging frankaging self-assigned this Jun 24, 2024
@frankaging frankaging added the question Further information is requested label Jun 24, 2024
@frankaging
Collaborator

frankaging commented Jun 24, 2024

@m-dev12 Hey! Thanks for the input! Here are some pointers for reproducing our results.

  • Firstly, could you run the following command and see if you can get up to ~64%? (the raw log for this run is attached in output.log)
python train.py \
-task glue -train_dataset cola -model FacebookAI/roberta-base \
-seed 45 \
-l all -r 1 -p f3 -e 60 -lr 4e-4 \
-type LoreftIntervention \
-gradient_accumulation_steps 1 \
-batch_size 32 -eval_batch_size 32 \
-test_split validation -max_length 256 \
--metric_for_best_model matthews_correlation \
--dropout 0.2 --weight_decay 0.00000 \
--warmup_ratio 0.005 --logging_steps 20 \
--allow_cls_grad

Note that here, we (1) use seed 45, which is one of the seeds we used; and (2) set -max_length 256. We will clarify the max length setting in the next revision of the paper. We follow this paper for the max length setting to ensure a fair comparison.

  • As noted in the paper (attached below, from pg. 23), we use seeds {42,43,44,45,46} for all GLUE datasets, but we replace a couple of seeds for RTE and CoLA due to instability. Because of this, you could report the mean or median (we report these in Table 16 on pg. 31) when comparing to our method or others.
[screenshot of the seed description on pg. 23 of the paper]
  • Sorry, we have not organized our GLUE wandb logs. We might consider rerunning those and releasing a set of wandb logs later if we have time.

Let me know if these help, and let me know if you have more questions! Thanks!

@m-dev12
Author

m-dev12 commented Jun 24, 2024

Thank you for the detailed response @frankaging.
I did try your command, and it gives me ~61.6% (logs attached), but that is still different from the ~64% in your logs.
Command used:
train.py -task glue -train_dataset cola -model FacebookAI/roberta-base -seed 45 -l all -r 1 -p f3 -e 60 -lr 4e-4 -type LoreftIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation -max_length 256 --metric_for_best_model matthews_correlation --dropout 0.2 --weight_decay 0.00000 --warmup_ratio 0.005 --logging_steps 20 --allow_cls_grad

As for my environment, I am using pyvene==0.1.1 and pyreft==0.0.6. Note that pyvene==0.1.2 did not work with RoBERTa (it raised an error about an unexpected use_cache argument being passed). I believe it is related to this: stanfordnlp/pyvene#152

Is there anything you recommend I check? Ideally, I should be able to replicate your logs.

output (2).log

@frankaging
Collaborator

@m-dev12 Thanks for the follow up!

I think it might boil down to a different random state of the machine, which is hard to control, especially given how unstable datasets such as CoLA and RTE can be (e.g., even with the same random seed, there can be discrepancies across machines).

You can try following this ticket to create the exact same environment we have locally and test whether it helps: #102. Minor: note that ~61.6% is close to the number we reported in the paper (60.4), given the instability of the setup. I would also recommend trying different seeds, e.g., 43/44/46, etc.
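If it helps, here is a minimal sketch for sweeping those seeds with the command above (an assumption on my part: it is run from the examples/loreft directory, and it simply shells out to train.py once per seed):

import subprocess

base_cmd = [
    "python", "train.py",
    "-task", "glue", "-train_dataset", "cola", "-model", "FacebookAI/roberta-base",
    "-l", "all", "-r", "1", "-p", "f3", "-e", "60", "-lr", "4e-4",
    "-type", "LoreftIntervention",
    "-gradient_accumulation_steps", "1",
    "-batch_size", "32", "-eval_batch_size", "32",
    "-test_split", "validation", "-max_length", "256",
    "--metric_for_best_model", "matthews_correlation",
    "--dropout", "0.2", "--weight_decay", "0.00000",
    "--warmup_ratio", "0.005", "--logging_steps", "20",
    "--allow_cls_grad",
]

# run the same CoLA command once per seed, then report the mean/median afterwards
for seed in [42, 43, 44, 45, 46]:
    subprocess.run(base_cmd + ["-seed", str(seed)], check=True)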

@m-dev12
Author

m-dev12 commented Jun 24, 2024

Thanks again @frankaging! Yes, I understand, and the results are indeed close to those reported in the paper.
As for the environment, as I mentioned, pyvene==0.1.2 gives an error with RoBERTa: "forward() got an unexpected keyword argument 'use_cache'".

This is related to this fix: stanfordnlp/pyvene#152
If I comment out those changes, then RoBERTa works with 0.1.2. So, for now I have reverted to pyvene 0.1.1 to run these experiments. I should probably open an issue in the pyvene GitHub repo for this.

@frankaging
Collaborator

@m-dev12 Thanks! Opening an issue would help a lot, as I am slowly ramping up my workload on these two repos again for the summer!

@m-dev12
Author

m-dev12 commented Jun 25, 2024

Thanks @frankaging! I have opened an issue on Pyreft.

On a side note, could you please share the commands with hyperparameter configs for the other GLUE tasks, like MNLI and QNLI, for RoBERTa, just in case there are any nuances not mentioned in the paper, and since CoLA is a little unstable.
For instance, I am using the config below:
python train.py \
-task glue -train_dataset mnli -model FacebookAI/roberta-base \
-seed 45 \
-l all -r 1 -p f1 -e 40 -lr 6e-4 \
-type LoreftIntervention \
-gradient_accumulation_steps 1 \
-batch_size 32 -eval_batch_size 32 \
-test_split validation_matched -max_length 256 \
--warmup_ratio 0.06 --logging_steps 20 \
--dropout 0.05 --weight_decay 0.00000 \
--allow_cls_grad \
--metric_for_best_model accuracy

Thanks again!

@frankaging
Collaborator

frankaging commented Jun 25, 2024

@m-dev12 Hey, thanks. The whole hyperparameter search space is outlined in Table 8 on pg. 25; the individual task hyperparameter configurations are outlined on the pages that follow, in Tables 9 to 12. And we use 256 as the maximum sequence length for all tasks.

I double-checked, and I think most of the nuances are already mentioned, but the maximum sequence length is indeed missing (and is probably the only one, I think). We wrote the following in the last paragraph of pg. 23, which causes confusion:

We follow Wu et al. [2024a]’s setting for evaluation.

Thus, we will add another sentence after this one to clarify that we also use the same maximum sequence length, which is 256. Our loreft folder also gives one example command with the maximum sequence length set to 256.

(If you want to do a hyperparameter search as we did) You can also follow our hyperparameter search procedure:

  • Use one single seed to do the tuning (e.g., 42 is what we used).
  • I don't think we exhaustively searched all the combinations outlined in Table 8. We mainly conducted the hyperparameter search by changing one parameter at a time (see the sketch after this list). Exhausting all the combinations might give an extra performance boost, though.
  • If you just want to replicate, you can skip Table 8 and use the per-task hyperparameters.
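To make the second bullet concrete, here is a schematic of that one-parameter-at-a-time sweep (the values below are placeholders, not the actual Table 8 grid):

# coordinate-style sweep: vary one hyperparameter at a time around a default config
defaults = {"lr": 4e-4, "dropout": 0.2, "warmup_ratio": 0.005}
grid = {
    "lr": [1e-4, 4e-4, 6e-4],
    "dropout": [0.05, 0.1, 0.2],
    "warmup_ratio": [0.00, 0.005, 0.06],
}

configs = [dict(defaults)]  # the all-defaults run
for name, values in grid.items():
    for value in values:
        if value != defaults[name]:
            configs.append({**defaults, name: value})  # change one knob, keep the rest fixed

print(f"{len(configs)} runs instead of {3 * 3 * 3} for the exhaustive grid")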

@m-dev12
Author

m-dev12 commented Jun 25, 2024

@frankaging Sure, yes, I have taken everything from the appendix of the paper and will double-check any additional details from Wu et al. (2024). Thanks a lot!

@m-dev12 m-dev12 closed this as completed Jun 25, 2024
@m-dev12
Author

m-dev12 commented Jul 17, 2024

Hi @frankaging,

I was trying to replicate the results for RoBERTa Base with LoReFT on the MNLI dataset with the configuration given in Table 9, but I got a very different accuracy. For instance, for the command below, I get a validation_matched accuracy of ~56% (although accuracy during training was ~82%).

train.py -task glue -train_dataset mnli -model FacebookAI/roberta-base -seed 45 -e 40 -lr 6e-4 -l all -r 1 -p f1 -type LoreftIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation_matched -max_length 256 --warmup_ratio 0.06 --logging_steps 20 --dropout 0.05 --weight_decay 0.00000 --allow_cls_grad --metric_for_best_model accuracy

Can you please share anything that might be missing or different in your runs?

@m-dev12 m-dev12 reopened this Jul 17, 2024
@frankaging
Collaborator

frankaging commented Jul 17, 2024

@m-dev12 Thanks for raising the issue.

This is our previous run with seed 42 (I think 45 should be quite similar given MNLI is stable, unlike CoLA/RTE):

llms-switch/task_steer.py -task glue -train_dataset mnli -model FacebookAI/roberta-base -seed 42 -l all -r 1 -p f1 -e 40 -lr 6e-4 -type ConditionedSourceLowRankRotatedSpaceIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation_matched -max_length 256 --metric_for_best_model accuracy --dropout 0.05 --weight_decay 0.0000 --warmup_ratio 0.00 --logging_steps 20 --allow_cls_grad

Note that we renamed our script and the intervention -type, but they should be the same as the current train.py and LoreftIntervention, respectively.

This is what I got from our old log:

24666 {'loss': 0.4528, 'grad_norm': 4.364076137542725, 'learning_rate': 0.0, 'epoch': 40.0}
24667 100%|██████████| 490880/490880 [5:57:46<00:00, 22.87it/s]
24668 {'eval_accuracy': 0.824, 'epoch': 40.0}
24669 Directory './official_results/roberta-base.glue.mnli.validation_matched.20240318073542833101/checkpoint-490880/intervenable_model' created successfully.
24670 {'train_runtime': 21466.4494, 'train_samples_per_second': 731.75, 'train_steps_per_second': 22.867, 'train_loss': 0.5202320372421085, 'epoch': 40.0}
24671 {'n_params': 18444}
24672 100%|██████████| 276/276 [00:05<00:00, 48.91it/s]
24673 {'validation_matched_accuracy': 0.8343732274532047}

See my screenshot for the run name roberta-base.glue.mnli.validation_matched.20240318073542833101:

[screenshot]

Based on this, it could be a regression due to our recent code changes. Could you share the top commit hash of your branch for both pyvene and pyreft? I will check.

@m-dev12
Author

m-dev12 commented Jul 17, 2024

I agree, results should ideally not vary so much between seeds. Also, the gap between train and validation loss seems to indicate some kind of overfitting?

Here's the git top commit hash:

ReFT]$ git log -1
commit 97768c053c3bbb05194ffef00f523ca9952c9a2c (HEAD -> main, origin/main, origin/HEAD)
Merge: 64278ff 5c3cfd8
Author: Zen <[email protected]>
Date:   Wed Jun 12 13:34:58 2024 -0700

    Merge pull request #108 from savadikarc/alpaca-data-loreft-fix
    
    Fix: datasets.exceptions.DatasetNotFoundError when training with alpaca_data_cleaned

Attached is the environment file as well; I am using pyreft 0.0.5 and pyvene 0.1.1.
conda-environment (1).txt

Another question: is there any specific reason you train for as long as 40 epochs? It takes quite a while for MNLI, and the loss curve does not seem to decrease much after a point.

[screenshot of the training loss curve]

@frankaging
Collaborator

@m-dev12 I just did a quick check and could not find recent changes that touch the saving and loading code path.

But my guess is that the classifier head (for GLUE, we freeze everything else and train the head along with the interventions) is not properly loaded at evaluation time.

Could you downgrade your transformers to transformers==4.39.3 and try again? Thanks!
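If it helps, here is a rough probe you could drop in right after training (a sketch; it assumes the reft_model and trainer objects from examples/loreft/train.py, and that the base model is RobertaForSequenceClassification so the head lives under classifier.dense):

import torch

# snapshot the classifier head, reload the best checkpoint, and see whether it actually changes
before = reft_model.model.classifier.dense.weight.detach().clone()
reft_model.load_intervention(
    f"{trainer.state.best_model_checkpoint}/intervenable_model",
    include_model=True,
)
after = reft_model.model.classifier.dense.weight.detach()
print("classifier head changed after load:", not torch.equal(before, after))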

@frankaging
Collaborator

I agree, results should ideally not vary so much between seeds. Also, the gap between train and validation loss seems to indicate some kind of overfitting?

...

Thanks! We were following previous work for the epoch selection. I agree, it is on the high side. Given the large gap, we can also start debugging with a much lower epoch count, I think. Maybe 5 or 10 epochs? Could you also upgrade your pyvene to 0.1.2? Thanks!

@m-dev12
Author

m-dev12 commented Jul 17, 2024

I agree, I will try with fewer epochs. I had initially switched from pyvene==0.1.2 to 0.1.1 because 0.1.2 raised an error with RoBERTa models: stanfordnlp/pyvene#163

@frankaging
Collaborator

@m-dev12 Another thing to check: we wrote a callback function in pyreft to load the best model at the end:

    def _load_best_model(self):
        logger.warning(f"Loading best model from {self.state.best_model_checkpoint} (score: {self.state.best_metric}).")
        self.model.load_intervention(
            f"{self.state.best_model_checkpoint}/intervenable_model", 
            include_model=True
        )

It should log something like "Loading best model from ...". Could you find this line in your log? Thanks! And do these files exist in your directory as well, i.e., f"{self.state.best_model_checkpoint}/intervenable_model"? You can also try logging self.state.best_model_checkpoint to see if it takes a proper value.
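Something like this (a rough sketch, assuming the trainer object in examples/loreft/train.py) would surface both values after .train() finishes:

import os

best_ckpt = trainer.state.best_model_checkpoint
print("best_model_checkpoint:", best_ckpt, "| best_metric:", trainer.state.best_metric)
print("intervenable_model exists:",
      os.path.isdir(os.path.join(best_ckpt, "intervenable_model")))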

@frankaging
Collaborator

@PinetreePantry Adding Peter here who will provide more detail since he owns the MNLI results, and he is reproducing the error on our cluster.

Two minor things:

  • The run I shared above uses a slightly different weight decay (=0; sorry, I shared the wrong one), but I also checked the one with wd=0.06 (and other settings); eval is all above 80%.
  • It seems like you are running on top of transformers==4.39.3, so no need to worry about that one.

@frankaging
Collaborator

@m-dev12 I just checked in the fix for use_cache; could you install pyvene from top of the tree? e.g., pip install git+https://github.com/stanfordnlp/pyvene.git

@m-dev12
Author

m-dev12 commented Jul 18, 2024

_load_best_model

Yup, I can see this in the logs:
[screenshot of the log line]

The run I shared above uses a slightly different weight decay (=0; sorry, I shared the wrong one), but I also checked the one with wd=0.06 (and other settings); eval is all above 80%.

You mean warmup ratio, right?

Sure, let me update pyvene and run fresh experiments.

@frankaging
Collaborator

frankaging commented Jul 18, 2024

@m-dev12 Thanks. This might require some changes on your end. Would it be possible for you to modify https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py to
(1) skip the training loop (e.g., comment out this line: https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L411), and
(2) add a call after the commented-out line to load the saved model from your directory ./official_results/...., so we can just do the evaluation from the weights saved in your directory above?

To load back the classifier head and interventions, you just need to call this:

trainer.model.load_intervention(
    "<your_best_checkpoint_dir>",
    include_model=True
)

Ideally, if you run the same command, it should directly do the evaluation and print out the number. And could you also list the files under the best checkpoint, checkpoint-245440? Thanks!
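Concretely, the change could look something like this (just a sketch; the checkpoint path is a placeholder you would replace with the run directory from your logs):

# trainer.train()  # (1) skip the training loop

# (2) load the interventions plus classifier head saved by the earlier run
best_checkpoint_dir = "./official_results/<your_run_name>/checkpoint-245440"
trainer.model.load_intervention(
    f"{best_checkpoint_dir}/intervenable_model",
    include_model=True,
)
# the rest of train.py then runs the usual evaluation on the loaded weights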

@m-dev12
Author

m-dev12 commented Jul 18, 2024

Thanks @frankaging will try this.

Meanwhile, I updated pyvene using: pip install git+https://github.com/stanfordnlp/pyvene.git

But I am still getting this error:

Traceback (most recent call last):
  File "/scratch/dm5927/ReFT/loreft_dm/train_1.py", line 611, in <module>
    main()
  File "/scratch/dm5927/ReFT/loreft_dm/train_1.py", line 607, in main
    finetune(**vars(args), args=args)
  File "/scratch/dm5927/ReFT/loreft_dm/train_1.py", line 441, in finetune
    trainer.train()
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/pyreft/reft_trainer.py", line 82, in compute_loss
    _, cf_outputs = intervenable(
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/pyvene/models/intervenable_base.py", line 1926, in forward
    raise e
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/pyvene/models/intervenable_base.py", line 1910, in forward
    counterfactual_outputs = self.model(**base, **model_kwargs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ext3/miniconda3/envs/awesome-reft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: RobertaForSequenceClassification.forward() got an unexpected keyword argument 'use_cache'


@frankaging
Collaborator

Meanwhile, I updated pyvene using: pip install git+https://github.com/stanfordnlp/pyvene.git

But I am still getting this error: TypeError: RobertaForSequenceClassification.forward() got an unexpected keyword argument 'use_cache'

Hey, could you try again by reinstalling?

@m-dev12
Author

m-dev12 commented Jul 18, 2024

(1) skip the training loop, (2) add a call to load the saved model from your directory ./official_results/.... And could you also list out the files under the best checkpoint? checkpoint-245440.

Hey @frankaging, I tried this, and it gives {'validation_matched_accuracy': 0.8193987521270562}. Do you happen to have an idea why the model did not load correctly? I have a few runs that look great during training but have similarly poor eval accuracy.

Here are the files listed in this checkpoint:
[screenshot of the checkpoint directory listing]

@frankaging frankaging added the bug Something isn't working label Jul 19, 2024
@frankaging frankaging changed the title from “[P1] Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” to “[P1] Eval time model is not loaded: Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” Jul 19, 2024
@frankaging
Collaborator

To fix this, I think it might be best just to ensure manual loading after .train(), e.g.:

if args.task == "glue":
   # do the manual loading from the best checkpoint directory
   trainer.model.load_intervention(
            f"<your_best_checkpoint_dir>", 
            include_model=True
        )

This would only be a small change to our train.py, without huge effort.

@m-dev12
Author

m-dev12 commented Jul 19, 2024

Thanks @frankaging. I also believe there might be some issue with the evaluation loop. I printed out the dataset object, and I believe the complete validation dataset (9.8k examples) is going in for the test-set eval, but this should be 8.8k for test and 1k for validation.

[screenshot of the printed dataset object]

@frankaging
Collaborator

frankaging commented Jul 19, 2024

@m-dev12 Could you double check if you are printing the right object?

We also print this in the log to make sure we do the split correctly in https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L244

        print("GLUE validation split (in training): ", len(in_train_eval_datasets))
        print("GLUE validation split (testing): ", len(eval_datasets[train_dataset_str][test_split][0]))

Could you search for the log above?

Based on your eval batch size of 32 and the 276 eval steps printed in your image, 32*276 = 8832, so the actual number of test-time eval examples should be at most 8832.

@PinetreePantry
Collaborator

This is pretty weird. When I tried commenting out trainer.train() and directly loading the eval model from the checkpoint, the MNLI validation_matched accuracy was 0.82-ish. With trainer.train() uncommented and the eval model loaded from the trainer's best_model_checkpoint at the same position in the code, the MNLI validation_matched accuracy was 0.50.

@m-dev12
Author

m-dev12 commented Jul 19, 2024

Could you double check if you are printing the right object? We also print this in the log to make sure we do the split correctly in https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L244

Got it. Yes, the print at train.py L244 did indicate 8.8k examples:
[screenshot]

But I was confused because when I print the dataset object ([screenshot]), I get the result shown in my previous comment, and seeing num_rows=9815 I was concerned.

@m-dev12
Author

m-dev12 commented Jul 19, 2024

This is pretty weird. I was trying commenting out trainer.train() and directly loading the eval model from the checkpoint, the MNLI validation_matched accuracy was 0.82-ish. Uncommenting the trainer.train(), and load the eval model from the trainer's best_model_checkpoint at the same position in the code, then the MNLI validation_matched accuracy was 0.50.

Yup, I observed the same thing and tried a couple of different previous runs, which gave different results as well :/

@frankaging
Collaborator

frankaging commented Jul 19, 2024

But I was confused because when I print the dataset object I get the result shown in my previous comment, and seeing num_rows=9815 I was concerned.

Hey, thanks for the info. I think you are printing eval_dataset together with data_items. data_items contains all the examples, hence num_rows=9815, but the evaluation runs on eval_dataset. If you actually iterate over eval_dataset, or do a length check, it should be no more than 32*276 = 8832. data_items is only there so we can trace back to the original examples for fields such as input strings.

On the eval loading: I think the easiest way is simply to load manually after .train(); this lets us decouple from the HF trainer behavior. I also suspect trainer.state.best_model_checkpoint might not be pointing to the right directory?
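For a quick check (a sketch, reusing the variable names from examples/loreft/train.py):

# data_items holds every raw validation example; eval_dataset is what the
# test-time evaluation actually iterates over
for split, (eval_dataset, data_items) in eval_datasets[train_dataset_str].items():
    print(split, "| eval_dataset:", len(eval_dataset), "| data_items:", len(data_items))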

@m-dev12
Author

m-dev12 commented Jul 21, 2024

Hi @frankaging,

So, I added the line below in my train script so that we always manually load the best model, but strangely, I observe the same phenomenon again.

[screenshots]

But then when I rerun the script with the trainer.train() line commented out and load the same path manually, I get an accuracy of 82.4%.

[screenshots]

@m-dev12
Author

m-dev12 commented Jul 22, 2024

Hi @frankaging, also, pyvene==0.1.2 still gives the use_cache error with RoBERTa.

@frankaging
Collaborator

Hi @frankaging, also, pyvene==0.1.2 still gives the use_cache error with RoBERTa.

@m-dev12 Thanks for the update. Did you install from source, i.e., pip install git+https://github.com/stanfordnlp/pyvene.git? And what is the error message? Thanks.

@m-dev12
Author

m-dev12 commented Jul 22, 2024

@frankaging Yes, I installed from source;
the error was: TypeError: RobertaForSequenceClassification.forward() got an unexpected keyword argument 'use_cache'

@frankaging
Collaborator

@m-dev12 Thanks! Please try again. And let me know if you are still hitting the issue.

On the model loading issue, could you add this after .train():

reft_model.load_intervention(...)

instead of trainer.model.load_intervention(...)

The trainer might create another reference model for the best model, which is not used.

@m-dev12
Author

m-dev12 commented Jul 22, 2024

On the model loading issue, could you adding this after .train():

reft_model.load_intervention(...)

instead of trainer.model.load_intervention(...)

The trainer might create another reference model for the best model, which is not used.

Hi @frankaging, I tried this but still get different accuracies with and without trainer.train().

@frankaging
Collaborator

@m-dev12 To load manually, you might need to turn this off as well: https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L383C12-L383C66
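(I am assuming that linked line is where the trainer gets told to reload the best model on its own; for the manual-loading experiment, the idea is roughly to disable that and load the checkpoint explicitly, e.g.:)

import transformers

training_args = transformers.TrainingArguments(
    output_dir="./official_results/debug_run",  # placeholder
    load_best_model_at_end=False,  # stop the trainer from reloading the best model itself
    metric_for_best_model="matthews_correlation",
)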

@m-dev12
Author

m-dev12 commented Jul 22, 2024

Hi @frankaging, I tried this, but it did not change the results. Ideally, I don't believe it should make a difference, since the manual loading happens after trainer.train(). I also added print statements inside pyvene's load_intervention function to check whether the path used by the callback when the best model is loaded differs from the one I use when loading manually, but they were exactly the same.

@frankaging
Collaborator

@m-dev12 Thanks! Could you add print statements in
https://github.com/stanfordnlp/pyvene/blob/main/pyvene/models/intervenable_base.py#L1331C9-L1331C26

to simply print out the saved state dict and check the parameter names, etc.? And maybe also eyeball the weight tensors themselves? This issue is probably shared across all GLUE tasks, so it might be quicker to debug with a smaller dataset.

@m-dev12
Author

m-dev12 commented Jul 23, 2024

Yes, I have been trying to debug with the CoLA dataset and just 5 epochs. I did print the parameter names and eyeballed one of the layer intervention tensors and the classifier weight tensor after loading the model in the train script itself, using:

reft_model.model.state_dict()["classifier.dense.weight"]
reft_model.interventions['comp.roberta.encoder.layer[11].output.unit.pos.nunit.1#0'][0].learned_source.weight

They seemed fine.
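A stricter check than eyeballing would be to diff the two state dicts programmatically (a sketch; reft_model_train and reft_model_eval are stand-ins for the model objects from the two runs):

import torch

sd_train = reft_model_train.model.state_dict()
sd_eval = reft_model_eval.model.state_dict()
for name in sd_train:
    if not torch.equal(sd_train[name], sd_eval[name]):
        print("mismatch:", name)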

For the path, I added the print statements at the line you mentioned above:

    def load_intervention(self, load_directory, include_model=True):
        """
        Instead of creating an new object, this function loads existing weights onto
        the current object. This is not a static method, and returns nothing.
        """
        # load binary files
        for i, (k, v) in enumerate(self.interventions.items()):
            intervention = v[0]
            binary_filename = f"intkey_{k}.bin"
            if isinstance(intervention, TrainableIntervention):
                print(f"### Loading trainable intervention from {binary_filename}. PATH = {os.path.join(load_directory, binary_filename)}")
                saved_state_dict = torch.load(os.path.join(load_directory, binary_filename))
                intervention.load_state_dict(saved_state_dict)

        # load model's trainable parameters as well
        if include_model:
            model_binary_filename = "pytorch_model.bin"
            print(f"### Loading model from {model_binary_filename}. PATH = {os.path.join(load_directory, model_binary_filename)}")
            saved_model_state_dict = torch.load(os.path.join(load_directory, model_binary_filename))
            self.model.load_state_dict(saved_model_state_dict, strict=False)

@m-dev12
Author

m-dev12 commented Jul 23, 2024

Hi @frankaging, pyvene==0.1.2 now works with RoBERTa, thanks!

On the model loading issue, in a bid to debug this, I tried the following:

  1. I created a new reft model object after training, loaded the best model's interventions/classifier head into it, and then evaluated it. But I still get different results than when I run eval without trainer.train():
    reft_model_new = get_reft_model(model, reft_config, set_device=not isinstance(dtype, str))

    print(f"Best Model Path: {best_model_path}")
    reft_model_new.load_intervention(
                f"{best_model_path}/intervenable_model", 
                include_model=True
                )
    # ensure everything is in eval mode
    reft_model_new.model.eval()
    for k,v in reft_model_new.interventions.items():
        _ = v[0].eval()

    print({"n_params": n_params})
    # do eval
    eval_results = {}
    
    print(f"\nEval Datasets: {eval_datasets}")
    for dataset_name in eval_datasets:
        # split evalset into chunks
        print(f"Dataset Name: {dataset_name}")
        for split, (eval_dataset, data_items) in eval_datasets[dataset_name].items():
            print(f"Split: {split}")
            print(f"Eval Dataset: {eval_dataset}")
            print(f"Data Items: {data_items}")
            generations, stats = compute_metrics(
                task, dataset_name, reft_model_new, tokenizer, eval_dataset, data_items,
                trigger_tokens, run_name, eval_batch_size, 
                data_collator if task in classification_tasks else None,
                split, greedy_decoding, temperature, top_p, top_k
            )
            
            # log
            eval_results.update(stats)
            if is_wandb:
                wandb.log(stats)
            generations = stats if generations is None else generations
            result_json_file_name = f"{output_dir}/{run_name}/{dataset_name}_{split}_outputs.json"
            with open(result_json_file_name, 'w') as json_file:
                json.dump(generations, json_file, indent=4)

    # log final eval stats
    result_json_file_name = f"{output_dir}/{run_name}/eval_results.json"
    eval_results["n_params"] = n_params
  2. I also set save_model=True and checked the performance when loading a model from that path vs. from the best checkpoint (both with trainer.train() commented out), and indeed the results are different. When loading the model from the path saved by the code below, I get the same accuracy as when running the entire training script with trainer.train():
if save_model:
        reft_model.save(f"{output_dir}/{run_name}", include_model=True)

While loading from the best checkpoint path gives me a different (mostly much better) accuracy.

Any ideas on how we can remedy this?

@m-dev12
Author

m-dev12 commented Jul 24, 2024

Hi @frankaging,
I figured out that the rotate_layer weights are different in the two cases (the script with trainer.train() vs. trainer.train() commented out and the best model loaded from the checkpoint).

Normal train.py with trainer.train() (note that here I have tried the different methods shared before to load the best checkpoint, but it did not work out):
[screenshot]

train.py with trainer.train() commented out and the best model loaded from the checkpoint:
[screenshot]

Attached is a debug notebook:
cola_training_playnotebook.ipynb.zip

frankaging added a commit that referenced this issue Jul 25, 2024
[P0] Fixing LoReFT rotation layer hot loading problem (#114)
@frankaging
Collaborator

@m-dev12 Thanks for your inputs! I looked into this issue a bit more and summarized my findings, along with the changes to fix it, here: #123.

In short, it seems like when loading back the low-rank weight matrix of rotate_layer, it is not correctly overwriting the in-memory weight matrix for the rotate_layer. As a result, the other finetuned weights (e.g., the classifier head as well as the learned source) are no longer compatible, which leads to low eval performance.

I am not sure why the overwrite does not happen; but to fix this, we essentially want to reinitialize a new instance of the rotate layer and inject the learned weights (i.e., the selected column vectors) into it.
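To illustrate the idea only (this is not the actual patch in #123; LowRankRotateLayer below is a stand-in for the real pyreft class):

import torch

class LowRankRotateLayer(torch.nn.Module):
    # stand-in for pyreft's low-rank rotation: an (embed_dim x rank) matrix with orthonormal columns
    def __init__(self, embed_dim, rank):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(embed_dim, rank))
        torch.nn.init.orthogonal_(self.weight)

def hot_load_rotate_layer(saved_weight):
    # re-init a fresh rotate layer and inject the saved column vectors into it,
    # instead of relying on load_state_dict to overwrite the old instance in place
    embed_dim, rank = saved_weight.shape
    new_layer = LowRankRotateLayer(embed_dim, rank)
    with torch.no_grad():
        new_layer.weight.copy_(saved_weight)
    return new_layer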

@m-dev12
Author

m-dev12 commented Jul 25, 2024

Hi @frankaging, thank you so much for the quick fix! I installed and tried out the new version (from your branch). The rotate layer weights now match, and the metrics come out much higher! But I now observe a mismatch in the learned_source weight and bias between the two versions of the script.

From the train script, accuracy is higher now, ~52%:
[screenshot]

From the eval script, accuracy is lower now, ~50%:
[screenshot]

Attaching debug notebooks for your reference.
notebooks.zip

@frankaging
Collaborator

@m-dev12 Hey, quick question: it seems like these two screenshots are logging to different directories, .*30579 for the first one and .*44666 for the second. Is this expected? Thanks.

@frankaging frankaging changed the title from “[P0] Eval time model is not loaded: Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” to “[P1] Eval time model is not loaded: Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa” Jul 25, 2024
@m-dev12
Author

m-dev12 commented Jul 25, 2024

Hi @frankaging, yes, that is expected, because the second screenshot is from the eval notebook with trainer.train() commented out, so it just creates a new log.
But please ignore the previous message; it was from your branch, but at your second-to-last commit (before the "remove logging and cleanup" one). I tried the latest commit from main and it works perfectly. Sorry for the bother!

@m-dev12
Author

m-dev12 commented Aug 8, 2024

Firstly, could you run the following command and see if you can get up to ~64%? (the raw log for this run is attached in output.log) [the CoLA command with seed 45 and -max_length 256 quoted above]

Hey @frankaging, do you still get the same result for this CoLA run after all the changes? I get around 56.9% for the same command now.

@frankaging
Collaborator

frankaging commented Aug 8, 2024

@m-dev12 Hey, with the current ToT, this is what I got with the same command (the postfix 20240807235850407960 shows that I just ran this):

Directory './official_results/roberta-base.glue.cola.validation.20240807235850407960/checkpoint-16080/intervenable_model' already exists.
Loading best model from ./official_results/roberta-base.glue.cola.validation.20240807235850407960/checkpoint-15812 (score: 0.587609805120735).
{'train_runtime': 814.5028, 'train_samples_per_second': 629.906, 'train_steps_per_second': 19.742, 'train_loss': 0.34663497393404075, 'epoch': 60.0}                         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16080/16080 [13:34<00:00, 19.74it/s]
{'n_params': 18444}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 38.03it/s]
{'validation_matthews_correlation': 0.600239855782026}

0.600239855782026 is (1) lower than our original run shown in the log above, which could be due to instability given the dataset size; and (2) within the std of what we reported. Did you try other seeds?

@m-dev12
Author

m-dev12 commented Aug 8, 2024

Thanks @frankaging, I will try from the current ToT. I haven't tried other seeds; I will try those and check as well.

@frankaging
Collaborator

@m-dev12 Thanks. And I just open-sourced our old source code: https://github.com/stanfordnlp/pyreft/tree/main/examples/loreft/original_code

We built pyreft based on our old source code, and all of our paper results were obtained using it (as you can tell from the shared wandb-logged commands; they all start with task_steer.py).

You can also optionally use the code in that folder to replicate the results, which is hopefully easier. For instance, it does not have that strange model-loading issue.
