In the training process, it will stop after completing one epoch. #39

Open
zhibeiyou135 opened this issue Jan 28, 2024 · 3 comments

@zhibeiyou135

Below are the commands I executed and the console output for my entire training run.

(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ DATA_DIR=/home/pe/projects/yxl/gen1
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ MDL_CFG=tiny
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ GPU_IDS=0
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ BATCH_SIZE_PER_GPU=8
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ TRAIN_WORKERS_PER_GPU=6
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ EVAL_WORKERS_PER_GPU=2
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ python train.py model=rnndet dataset=gen1 dataset.path=${DATA_DIR} wandb.project_name=RVT \
  wandb.group_name=gen1 +experiment/gen1="${MDL_CFG}.yaml" hardware.gpus=${GPU_IDS} \
  batch_size.train=${BATCH_SIZE_PER_GPU} batch_size.eval=${BATCH_SIZE_PER_GPU} \
  hardware.num_workers.train=${TRAIN_WORKERS_PER_GPU} hardware.num_workers.eval=${EVAL_WORKERS_PER_GPU}
Using python-based detection evaluation
Set MaxViTRNN backbone (height, width) to (256, 320)
Set partition sizes: (8, 10)
Set num_classes=2 for detection head
------ Configuration ------
reproduce:
  seed_everything: null
  deterministic_flag: false
  benchmark: false
training:
  precision: 16
  max_epochs: 10000
  max_steps: 400000
  learning_rate: 0.0002
  weight_decay: 0
  gradient_clip_val: 1.0
  limit_train_batches: 1.0
  lr_scheduler:
    use: true
    total_steps: ${..max_steps}
    pct_start: 0.005
    div_factor: 20
    final_div_factor: 10000
validation:
  limit_val_batches: 1.0
  val_check_interval: null
  check_val_every_n_epoch: 1
batch_size:
  train: 8
  eval: 8
hardware:
  num_workers:
    train: 6
    eval: 2
  gpus: 0
  dist_backend: nccl
logging:
  ckpt_every_n_epochs: 1
  train:
    metrics:
      compute: false
      detection_metrics_every_n_steps: null
    log_model_every_n_steps: 5000
    log_every_n_steps: 500
    high_dim:
      enable: true
      every_n_steps: 5000
      n_samples: 4
  validation:
    high_dim:
      enable: true
      every_n_epochs: 1
      n_samples: 8
wandb:
  wandb_runpath: null
  artifact_name: null
  artifact_local_file: null
  resume_only_weights: false
  group_name: gen1
  project_name: RVT
dataset:
  name: gen1
  path: /home/pe/projects/yxl/gen1
  train:
    sampling: mixed
    random:
      weighted_sampling: false
    mixed:
      w_stream: 1
      w_random: 1
  eval:
    sampling: stream
  data_augmentation:
    random:
      prob_hflip: 0.5
      rotate:
        prob: 0
        min_angle_deg: 2
        max_angle_deg: 6
      zoom:
        prob: 0.8
        zoom_in:
          weight: 8
          factor:
            min: 1
            max: 1.5
        zoom_out:
          weight: 2
          factor:
            min: 1
            max: 1.2
    stream:
      prob_hflip: 0.5
      rotate:
        prob: 0
        min_angle_deg: 2
        max_angle_deg: 6
      zoom:
        prob: 0.5
        zoom_out:
          factor:
            min: 1
            max: 1.2
  ev_repr_name: stacked_histogram_dt=50_nbins=10
  sequence_length: 21
  resolution_hw:
  - 240
  - 304
  downsample_by_factor_2: false
  only_load_end_labels: false
model:
  name: rnndet
  backbone:
    name: MaxViTRNN
    compile:
      enable: false
      args:
        mode: reduce-overhead
    input_channels: 20
    enable_masking: false
    partition_split_32: 1
    embed_dim: 32
    dim_multiplier:
    - 1
    - 2
    - 4
    - 8
    num_blocks:
    - 1
    - 1
    - 1
    - 1
    T_max_chrono_init:
    - 4
    - 8
    - 16
    - 32
    stem:
      patch_size: 4
    stage:
      downsample:
        type: patch
        overlap: true
        norm_affine: true
      attention:
        use_torch_mha: false
        partition_size:
        - 8
        - 10
        dim_head: 32
        attention_bias: true
        mlp_activation: gelu
        mlp_gated: false
        mlp_bias: true
        mlp_ratio: 4
        drop_mlp: 0
        drop_path: 0
        ls_init_value: 1.0e-05
      lstm:
        dws_conv: false
        dws_conv_only_hidden: true
        dws_conv_kernel_size: 3
        drop_cell_update: 0
    in_res_hw:
    - 256
    - 320
  fpn:
    name: PAFPN
    compile:
      enable: false
      args:
        mode: reduce-overhead
    depth: 0.33
    in_stages:
    - 2
    - 3
    - 4
    depthwise: false
    act: silu
  head:
    name: YoloX
    compile:
      enable: false
      args:
        mode: reduce-overhead
    depthwise: false
    act: silu
    num_classes: 2
  postprocess:
    confidence_threshold: 0.1
    nms_threshold: 0.45

Disabling PL seed everything because of unresolved issues with shuffling during training on streaming datasets
new run: generating id zee59lta
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id zee59lta.
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
[Train] Local batch size for:
stream sampling: 4
random sampling: 4
[Train] Local num workers for:
stream sampling: 3
random sampling: 3
creating rnd access train datasets: 1458it [00:01, 1117.94it/s]
creating streaming train datasets: 1458it [00:03, 410.78it/s]
num_full_sequences=317
num_splits=1141
num_split_sequences=5492
creating streaming val datasets: 429it [00:00, 1079.97it/s]
num_full_sequences=429
num_splits=0
num_split_sequences=0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

| Name | Type | Params

0 | mdl | YoloXDetector | 4.4 M
1 | mdl.backbone | RNNDetector | 3.2 M
2 | mdl.fpn | YOLOPAFPN | 710 K
3 | mdl.yolox_head | YOLOXHead | 474 K

4.4 M Trainable params
0 Non-trainable params
4.4 M Total params
8.810 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch 0: : 0it [00:00, ?it/s]/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:139: UserWarning:

Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

Epoch 0: : 248it [01:01, 4.02it/s, loss=16.3, v_num=9lta][2024-01-28 00:22:02,035][urllib3.connectionpool][WARNING] - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 297it [01:12, 4.11it/s, loss=14.1, v_num=9lta][2024-01-28 00:22:12,484][urllib3.connectionpool][WARNING] - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 344it [01:22, 4.18it/s, loss=14, v_num=9lta][2024-01-28 00:22:22,654][urllib3.connectionpool][WARNING] - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 144434it [8:35:11, 4.67it/s, loss=2.73, v_num=9lta]
Validation DataLoader 0: : 2342it [03:33, 10.97it/s]
creating index...
index created!
Loading and preparing results...
DONE (t=0.16s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=3.08s).
Accumulating evaluation results...
DONE (t=1.00s).
Epoch 0: : 144434it [8:35:17, 4.67it/s, loss=2.73, v_num=9lta]Epoch 0, global step 142092: 'val/AP' reached 0.42821 (best 0.42821), saving model to '/home/pe/projects/yxl/RVT/RVT-master/RVT/zee59lta/checkpoints/epoch000step142092val_AP0.43.ckpt' as top 1
Error executing job with overrides: ['model=rnndet', 'dataset=gen1', 'dataset.path=/home/pe/projects/yxl/gen1', 'wandb.project_name=RVT', 'wandb.group_name=gen1', '+experiment/gen1=tiny.yaml', 'hardware.gpus=0', 'batch_size.train=8', 'batch_size.eval=8', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 369, in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 650, in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 701, in _update_best_and_save
self._save_checkpoint(trainer, filepath)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 381, in _save_checkpoint
logger.after_save_checkpoint(proxy(self))
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
return fn(*args, **kwargs)
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 218, in after_save_checkpoint
self._scan_and_log_checkpoints(checkpoint_callback, self._save_last and not self._save_last_only_final)
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
num_ckpt_logged_before = self._num_logged_artifact()
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
public_run = self._get_public_run()
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/pe/projects/yxl/RVT/RVT-master/train.py", line 144, in main
trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 62, in _call_and_handle_interrupt
logger.finalize("failed")
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
return fn(*args, **kwargs)
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 224, in finalize
self._scan_and_log_checkpoints(self._checkpoint_callback, self._save_last)
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
num_ckpt_logged_before = self._num_logged_artifact()
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
public_run = self._get_public_run()
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: learning_rate ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁
wandb: train/cls_loss_step █▅▄▅▃▃▃▄▃▃▁▂▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▁▂▁
wandb: train/conf_loss_step █▄▃▄▂▂▂▄▂▂▂▂▂▂▁▁▂▂▂▁▂▂▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb: train/iou_loss_step █▆▄▅▄▃▃▅▃▃▁▃▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▂▂▁
wandb: train/l1_loss_step ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/loss_step █▅▃▄▃▃▂▄▃▃▁▂▂▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb: train/num_fg_step ▁▄▅▄▅▆▅▄▆▅█▅▆▇▇█▇▆▅▇▅▇█▆▇▇▇▇██▇▆▇██▇█▇▇▇
wandb: trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: val/AP ▁
wandb: val/AP_50 ▁
wandb: val/AP_75 ▁
wandb: val/AP_L ▁
wandb: val/AP_M ▁
wandb: val/AP_S ▁
wandb:
wandb: Run summary:
wandb: epoch 0
wandb: learning_rate 0.00013
wandb: train/cls_loss_step 0.41309
wandb: train/conf_loss_step 0.94029
wandb: train/iou_loss_step 1.3903
wandb: train/l1_loss_step 0.0
wandb: train/loss_step 2.74368
wandb: train/num_fg_step 7.48387
wandb: trainer/global_step 142094
wandb: val/AP 0.42821
wandb: val/AP_50 0.68434
wandb: val/AP_75 0.44886
wandb: val/AP_L 0.44424
wandb: val/AP_M 0.49079
wandb: val/AP_S 0.35797
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/pe/projects/yxl/RVT/RVT-master/wandb/offline-run-20240128_002051-zee59lta
wandb: Find logs at: ./wandb/offline-run-20240128_002051-zee59lta/logs
== Timing statistics ==

@magehrig
Contributor

You can see what the issue is from the error:

File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I.e., one of the three attributes in this line is None for some reason.

  1. Which wandb version are you using (the one from the installation instructions is 0.14.0)?
  2. Change the line to runpath = experiment.entity + '/' + experiment.project + '/' + experiment.id (see the sketch after this list). Do you still encounter the same issue after this change?
  3. Can you figure out which of the three attributes is None?
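
For point 2, a minimal sketch of what the patched method in loggers/wandb_logger.py could look like, with an extra guard for runs where these attributes are unset. The self.experiment property and the early return are assumptions for illustration, not the repository's actual code:

import wandb

def _get_public_run(self):
    # Sketch of the suggested change: read the public attributes instead of the
    # private ones, and bail out if any of them is missing (e.g. offline runs).
    experiment = self.experiment  # assumed: the wandb Run object held by the logger
    entity = getattr(experiment, 'entity', None)
    project = getattr(experiment, 'project', None)
    run_id = getattr(experiment, 'id', None)
    if entity is None or project is None or run_id is None:
        return None  # no public run can be resolved for this local run
    runpath = entity + '/' + project + '/' + run_id
    return wandb.Api().run(runpath)

If _get_public_run can return None, the caller _num_logged_artifact would also need a corresponding None check (e.g. treat it as zero logged artifacts), otherwise the failure just moves one frame up the traceback.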

@leafyseay

leafyseay commented Mar 29, 2024

I encountered the same problem. How did you solve it? Is it a wandb version error or something else?

@HongxiL

HongxiL commented Nov 24, 2024

I also encountered the same problem, and it is not a problem with the wandb version. I don't know how to solve it.
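
One detail from the log above that may explain why the wandb version does not matter: the run is offline ("wandb: W&B syncing is set to offline in this directory"), and an offline run is never resolved against the W&B server, so its entity can stay unset. A small standalone check, assuming only the public wandb API (the project name and mode here are just illustrative):

import wandb

# Start a throwaway offline run, the same mode the training log shows,
# and print the three attributes that _get_public_run concatenates.
run = wandb.init(mode='offline', project='RVT')
print('entity :', run.entity)   # frequently None for offline runs
print('project:', run.project)
print('id     :', run.id)
run.finish()

If entity prints as None, that is the NoneType in the traceback; syncing online (WANDB_MODE=online) or guarding _get_public_run as sketched in the comment above would likely avoid the crash at the end of the first epoch. Note that in the log the epoch itself finished and the checkpoint epoch000step142092val_AP0.43.ckpt was saved before the logger failed.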
