Error wandb #209

Open
P-UnKnow08 opened this issue Aug 23, 2024 · 1 comment

Comments

@P-UnKnow08

I'm trying to run Neuralangelo with the test set "lego," but I haven't been able to get past the point where I invoke the command:

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

This command throws an error that I haven't been able to resolve. I've tried changing many of the parameters in the project's files, but nothing fixes it. Below is the full error output, in case anyone has a solution.

Thank you.
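For reference, the shell variables I pass are the ones echoed in the log below; a minimal setup consistent with that log looks like this (adjust the values for your own machine):

GPUS=1                                                  # "Training with 1 GPUs." in the log
GROUP=example_group                                     # logdir is logs/example_group/example_name
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/lego.yaml   # source_filename in the config dump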

Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name

  • checkpoint:
    • save_epoch: 9999999999
    • save_iter: 5000
    • save_latest_iter: 9999999999
    • save_period: 9999999999
    • strict_resume: True
  • cudnn:
    • benchmark: True
    • deterministic: False
  • data:
    • name: dummy
    • num_images: None
    • num_workers: 4
    • preload: True
    • readjust:
      • center: [0.0, 0.0, 0.0]
      • scale: 1.0
    • root: datasets/lego_ds2
    • train:
      • batch_size: 2
      • image_size: [801, 801]
      • subset: None
    • type: projects.neuralangelo.data
    • use_multi_epoch_loader: True
    • val:
      • batch_size: 2
      • image_size: [300, 300]
      • max_viz_samples: 16
      • subset: 4
  • image_save_iter: 9999999999
  • inference_args:
  • local_rank: 0
  • logdir: logs/example_group/example_name
  • logging_iter: 9999999999999
  • max_epoch: 9999999999
  • max_iter: 500000
  • metrics_epoch: None
  • metrics_iter: None
  • model:
    • appear_embed:
      • dim: 4
      • enabled: False
    • background:
      • enabled: True
      • encoding:
        • levels: 10
        • type: fourier
      • encoding_view:
        • levels: 3
        • type: spherical
      • mlp:
        • activ: relu
        • activ_density: softplus
        • activ_density_params:
        • activ_params:
        • hidden_dim: 256
        • hidden_dim_rgb: 128
        • num_layers: 8
        • num_layers_rgb: 2
        • skip: [4]
        • skip_rgb: []
      • view_dep: True
      • white: False
    • object:
      • rgb:
        • encoding_view:
          • levels: 3
          • type: spherical
        • mlp:
          • activ: relu_
          • activ_params:
          • hidden_dim: 256
          • num_layers: 4
          • skip: []
          • weight_norm: True
        • mode: idr
      • s_var:
        • anneal_end: 0.1
        • init_val: 3.0
      • sdf:
        • encoding:
          • coarse2fine:
            • enabled: True
            • init_active_level: 4
            • step: 5000
          • hashgrid:
            • dict_size: 21
            • dim: 4
            • max_logres: 11
            • min_logres: 5
            • range: [-2, 2]
          • levels: 16
          • type: hashgrid
        • gradient:
          • mode: numerical
          • taps: 4
        • mlp:
          • activ: softplus
          • activ_params:
            • beta: 100
          • geometric_init: True
          • hidden_dim: 256
          • inside_out: False
          • num_layers: 1
          • out_bias: 0.5
          • skip: []
          • weight_norm: True
    • render:
      • num_sample_hierarchy: 4
      • num_samples:
        • background: 32
        • coarse: 64
        • fine: 16
      • rand_rays: 512
      • stratified: True
    • type: projects.neuralangelo.model
  • nvtx_profile: False
  • optim:
    • fused_opt: False
    • params:
      • lr: 0.001
      • weight_decay: 0.01
    • sched:
      • gamma: 10.0
      • iteration_mode: True
      • step_size: 9999999999
      • two_steps: [300000, 400000]
      • type: two_steps_with_warmup
      • warm_up_end: 5000
    • type: AdamW
  • pretrained_weight: None
  • source_filename: projects/neuralangelo/configs/custom/lego.yaml
  • speed_benchmark: False
  • test_data:
    • name: dummy
    • num_workers: 0
    • test:
      • batch_size: 1
      • is_lmdb: False
      • roots: None
    • type: imaginaire.datasets.images
  • timeout_period: 9999999
  • trainer:
    • amp_config:
      • backoff_factor: 0.5
      • enabled: False
      • growth_factor: 2.0
      • growth_interval: 2000
      • init_scale: 65536.0
    • ddp_config:
      • find_unused_parameters: False
      • static_graph: True
    • depth_vis_scale: 0.5
    • ema_config:
      • beta: 0.9999
      • enabled: False
      • load_ema_checkpoint: False
      • start_iteration: 0
    • grad_accum_iter: 1
    • image_to_tensorboard: False
    • init:
      • gain: None
      • type: none
    • loss_weight:
      • curvature: 0.0005
      • eikonal: 0.1
      • render: 1.0
    • type: projects.neuralangelo.trainer
  • validation_iter: 5000
  • wandb_image_iter: 10000
  • wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]:     trainer.init_wandb(cfg,
[rank0]:   File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]:     wandb.watch(self.model_module)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]:     tel.feature.watch = True
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]:     self._run._telemetry_callback(self._obj)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
[rank0]:     self._telemetry_obj.MergeFrom(telem_obj)
[rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-22_21:49:57
host : DESKTOP-Q0DS9I2.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 25214)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Longhao-Chen

pip3 uninstall wandb
pip3 install wandb==0.17.5
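
If the pinned version fixes it, it may be worth confirming which wandb version is actually active in the environment (a quick check, assuming the same conda env as in the traceback):

pip3 show wandb | grep Version
python -c "import wandb; print(wandb.__version__)"   # both should report 0.17.5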
