Error wandb #209

Open
P-UnKnow08 opened this issue Aug 23, 2024 · 1 comment

Comments

@P-UnKnow08

I'm trying to run Neuralangelo with the test set "lego," but I haven't been able to get past the point where I invoke the command:

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

This command throws an error that I haven't been able to resolve. I've tried changing many of the parameters in the project's files, but nothing fixes it. Below is the full error output, in case anyone has a solution.

Thank you.
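For reference, the shell variables I pass are the ones echoed in the log below; a minimal setup consistent with that log looks like this (adjust the values for your own machine):

GPUS=1                                                  # "Training with 1 GPUs." in the log
GROUP=example_group                                     # logdir is logs/example_group/example_name
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/lego.yaml   # source_filename in the config dump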

Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name

  • checkpoint:
    • save_epoch: 9999999999
    • save_iter: 5000
    • save_latest_iter: 9999999999
    • save_period: 9999999999
    • strict_resume: True
  • cudnn:
    • benchmark: True
    • deterministic: False
  • data:
    • name: dummy
    • num_images: None
    • num_workers: 4
    • preload: True
    • readjust:
      • center: [0.0, 0.0, 0.0]
      • scale: 1.0
    • root: datasets/lego_ds2
    • train:
      • batch_size: 2
      • image_size: [801, 801]
      • subset: None
    • type: projects.neuralangelo.data
    • use_multi_epoch_loader: True
    • val:
      • batch_size: 2
      • image_size: [300, 300]
      • max_viz_samples: 16
      • subset: 4
  • image_save_iter: 9999999999
  • inference_args:
  • local_rank: 0
  • logdir: logs/example_group/example_name
  • logging_iter: 9999999999999
  • max_epoch: 9999999999
  • max_iter: 500000
  • metrics_epoch: None
  • metrics_iter: None
  • model:
    • appear_embed:
      • dim: 4
      • enabled: False
    • background:
      • enabled: True
      • encoding:
        • levels: 10
        • type: fourier
      • encoding_view:
        • levels: 3
        • type: spherical
      • mlp:
        • activ: relu
        • activ_density: softplus
        • activ_density_params:
        • activ_params:
        • hidden_dim: 256
        • hidden_dim_rgb: 128
        • num_layers: 8
        • num_layers_rgb: 2
        • skip: [4]
        • skip_rgb: []
      • view_dep: True
      • white: False
    • object:
      • rgb:
        • encoding_view:
          • levels: 3
          • type: spherical
        • mlp:
          • activ: relu_
          • activ_params:
          • hidden_dim: 256
          • num_layers: 4
          • skip: []
          • weight_norm: True
        • mode: idr
      • s_var:
        • anneal_end: 0.1
        • init_val: 3.0
      • sdf:
        • encoding:
          • coarse2fine:
            • enabled: True
            • init_active_level: 4
            • step: 5000
          • hashgrid:
            • dict_size: 21
            • dim: 4
            • max_logres: 11
            • min_logres: 5
            • range: [-2, 2]
          • levels: 16
          • type: hashgrid
        • gradient:
          • mode: numerical
          • taps: 4
        • mlp:
          • activ: softplus
          • activ_params:
            • beta: 100
          • geometric_init: True
          • hidden_dim: 256
          • inside_out: False
          • num_layers: 1
          • out_bias: 0.5
          • skip: []
          • weight_norm: True
    • render:
      • num_sample_hierarchy: 4
      • num_samples:
        • background: 32
        • coarse: 64
        • fine: 16
      • rand_rays: 512
      • stratified: True
    • type: projects.neuralangelo.model
  • nvtx_profile: False
  • optim:
    • fused_opt: False
    • params:
      • lr: 0.001
      • weight_decay: 0.01
    • sched:
      • gamma: 10.0
      • iteration_mode: True
      • step_size: 9999999999
      • two_steps: [300000, 400000]
      • type: two_steps_with_warmup
      • warm_up_end: 5000
    • type: AdamW
  • pretrained_weight: None
  • source_filename: projects/neuralangelo/configs/custom/lego.yaml
  • speed_benchmark: False
  • test_data:
    • name: dummy
    • num_workers: 0
    • test:
      • batch_size: 1
      • is_lmdb: False
      • roots: None
    • type: imaginaire.datasets.images
  • timeout_period: 9999999
  • trainer:
    • amp_config:
      • backoff_factor: 0.5
      • enabled: False
      • growth_factor: 2.0
      • growth_interval: 2000
      • init_scale: 65536.0
    • ddp_config:
      • find_unused_parameters: False
      • static_graph: True
    • depth_vis_scale: 0.5
    • ema_config:
      • beta: 0.9999
      • enabled: False
      • load_ema_checkpoint: False
      • start_iteration: 0
    • grad_accum_iter: 1
    • image_to_tensorboard: False
    • init:
      • gain: None
      • type: none
    • loss_weight:
      • curvature: 0.0005
      • eikonal: 0.1
      • render: 1.0
    • type: projects.neuralangelo.trainer
  • validation_iter: 5000
  • wandb_image_iter: 10000
  • wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]:     trainer.init_wandb(cfg,
[rank0]:   File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]:     wandb.watch(self.model_module)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]:     tel.feature.watch = True
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]:     self._run._telemetry_callback(self._obj)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
[rank0]:     self._telemetry_obj.MergeFrom(telem_obj)
[rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-22_21:49:57
host : DESKTOP-Q0DS9I2.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 25214)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Longhao-Chen

pip3 uninstall wandb
pip3 install wandb==0.17.5
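
If the pinned version fixes it, it may be worth confirming which wandb version is actually active in the environment (a quick check, assuming the same conda env as in the traceback):

pip3 show wandb | grep Version
python -c "import wandb; print(wandb.__version__)"   # both should report 0.17.5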
