Problems related to model training #102

Open
zdy1013 opened this issue Oct 28, 2024 · 1 comment
zdy1013 commented Oct 28, 2024

Hello, I am a master's student with a strong interest in autonomous driving, and I would like to reproduce your model. As a first step I want to run through your training code using part of the LMDrive data: Town01 and Town02 for training and Town03 for validation. After setting the number of GPUs and the dataset path in train.sh (my edits are sketched below), running train.sh does not train normally.
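Roughly, these are the only things I changed in train.sh; the variable names here are only illustrative, not necessarily the ones the shipped script uses, and the path is my own:

# Sketch of my edits to train.sh (names and path are illustrative)
GPU_NUM=2                                    # number of GPUs to launch training on
DATASET_ROOT=/home/zdy/data/lmdrive_subset   # root folder holding my Town01/Town02/Town03 routes

Running bash train.sh then produces the following output: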


(interfuser) zdy@zhaojh:~/InterFuser-main/interfuser/scripts$ bash train.sh 
/home/zdy/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Model interfuser_baseline created, param count:52935567
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.485, 0.456, 0.406)
        std: (0.229, 0.224, 0.225)
        crop_pct: 0.875
CNN backbone and transformer blocks using different learning rates!
165 weights in the cnn backbone, 274 weights in other modules
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Sub route dir nums: 0
Scheduled epochs: 35
Sub route dir nums: 0
Sub route dir nums: 0
Sub route dir nums: 0
Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-4.pth.tar', 0)

*** Best metric: 0 (epoch 0)

The log prints "Sub route dir nums: 0" for every process and the best metric stays at 0, so it looks as though no training data is being picked up. How can I solve this problem? I hope to hear back from you.


zdy1013 commented Oct 28, 2024

Could it have something to do with the structure and naming of the dataset? Here is my dataset and its naming:
dataset
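
In case it helps, this is the rough sanity check I can run from the shell to count the first-level route directories under my dataset root. The path is my own, and this does not reproduce the training code's exact filtering; it only shows what directories exist at the top level:

# Rough sanity check of the dataset layout (path is mine; not the loader's exact logic)
DATASET_ROOT=/home/zdy/data/lmdrive_subset
find "$DATASET_ROOT" -mindepth 1 -maxdepth 1 -type d | wc -l    # number of top-level route dirs
find "$DATASET_ROOT" -mindepth 1 -maxdepth 1 -type d | head     # show a few directory names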
