Problems related to model training #102

Open
zdy1013 opened this issue Oct 28, 2024 · 1 comment
zdy1013 commented Oct 28, 2024

Hello, I am a master's student with a strong interest in autonomous driving, and I would like to reproduce your model. As a first step I want to run through your training code using part of the LMDrive data: Town01 and Town02 for training and Town03 for validation. After setting the number of GPUs and the dataset path in train.sh (my edits are sketched below), running train.sh does not train normally.
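Roughly, these are the only things I changed in train.sh; the variable names here are only illustrative, not necessarily the ones the shipped script uses, and the path is my own:

# Sketch of my edits to train.sh (names and path are illustrative)
GPU_NUM=2                                    # number of GPUs to launch training on
DATASET_ROOT=/home/zdy/data/lmdrive_subset   # root folder holding my Town01/Town02/Town03 routes

Running bash train.sh then produces the following output: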


(interfuser) zdy@zhaojh:~/InterFuser-main/interfuser/scripts$ bash train.sh 
/home/zdy/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Model interfuser_baseline created, param count:52935567
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.485, 0.456, 0.406)
        std: (0.229, 0.224, 0.225)
        crop_pct: 0.875
CNN backbone and transformer blocks using different learning rates!
165 weights in the cnn backbone, 274 weights in other modules
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Sub route dir nums: 0
Scheduled epochs: 35
Sub route dir nums: 0
Sub route dir nums: 0
Sub route dir nums: 0
Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-4.pth.tar', 0)

*** Best metric: 0 (epoch 0)

The log prints "Sub route dir nums: 0" for every process and the best metric stays at 0, so it looks as though no training data is being picked up. How can I solve this problem? I hope to hear back from you.


zdy1013 commented Oct 28, 2024

Could it have something to do with the structure and naming of the dataset? Here is my dataset and its naming:
dataset
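
In case it helps, this is the rough sanity check I can run from the shell to count the first-level route directories under my dataset root. The path is my own, and this does not reproduce the training code's exact filtering; it only shows what directories exist at the top level:

# Rough sanity check of the dataset layout (path is mine; not the loader's exact logic)
DATASET_ROOT=/home/zdy/data/lmdrive_subset
find "$DATASET_ROOT" -mindepth 1 -maxdepth 1 -type d | wc -l    # number of top-level route dirs
find "$DATASET_ROOT" -mindepth 1 -maxdepth 1 -type d | head     # show a few directory names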
