[WIP] Upstream changes #7

Open
wants to merge 11 commits into base: main

Conversation

@Erotemic commented Nov 4, 2023

I'm looking into integrating ScaleMAE into geowatch.
I've made this branch to track modifications to make it work. Currently this involves:

  • Setting up proper package namespaces: everything should be referenced under the "scalemae" namespace to allow for integration with other libraries. Having a module named "lib" is a common anti-pattern in repos because it leads to conflicts; putting everything into a top-level package namespace fixes the issue. It also means all imports are now referenced explicitly in the code itself (see the import example after this list).

  • Finding minimum versions of required and optional dependencies. Still working on this; there doesn't seem to be a comprehensive list of the requirements needed to make the repo work. I'm gathering those while also deconflicting with geowatch's requirements.

  • Linting to remove unused code.
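
For example, the explicit namespace means imports look like the following (this particular import path appears in the tracebacks later in this thread; it is shown here only as an illustration):

# Modules are referenced under the "scalemae" package rather than a bare top-level
# module, so they cannot collide with another project's modules.
from scalemae.dataloaders.utils import get_dataset_and_sampler, get_eval_dataset_and_transform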

This should not be merged yet. I'm just ensuring the work is pushed as it is developed for comments and visibility.

@Erotemic (Author) commented Nov 5, 2023

@cjrd @RitwikGupta I'm trying to get an MWE of this running. With the latest changes you can do something like this:

# Create demo train / vali data
DATA_PATH=$(python -m scalemae.demo)

echo "
data:
  type: ImageList
  length: 10
  img_dir: '$DATA_PATH'
  mean: [0.46921533, 0.46026663, 0.41329921]
  std: [0.1927, 0.1373, 0.1203]
  vis_factor: 1.0
" > $DATA_PATH/demo.yaml

cat  $DATA_PATH/demo.yaml


DEFAULT_ROOT_DIR=$HOME/exps/scalemae_demo

echo "
DEFAULT_ROOT_DIR      = $DEFAULT_ROOT_DIR
DATA_PATH             = $DATA_PATH
"


mkdir -p $DEFAULT_ROOT_DIR
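# Launch a single-process distributed pretraining run (nproc_per_node=1) on the demo data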
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 --master_port=11085 -m scalemae.main_pretrain \
    --output_dir $DEFAULT_ROOT_DIR \
    --log_dir  $DEFAULT_ROOT_DIR \
    --config $DATA_PATH/demo.yaml \
    --eval_path "$DATA_PATH" \
    --batch_size 4 \
    --model mae_vit_base_patch16  \
    --mask_ratio 0.75 \
    --num_workers 0 \
    --epochs 300 \
    --target_size 224\
    --input_size 224\
    --self_attention\
    --scale_min 0.2 \
    --scale_max 1.0 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --decoder_aux_loss_layers 1\
    --target_size_scheduler constant\
    --decoder_depth 8 \
    --no_autoresume \
    --use_mask_token \
    --skip_knn_eval \
    --fixed_output_size_min 224\
    --fixed_output_size_max 336\
    --absolute_scale 

This generates a small dataset with kwcoco, so it can grow larger if needed. I was able to write an ImageFolder that should correspond to one of the dataloaders. I thought the above would run, but I got:

RuntimeError: Unexpected error from cudaGetDeviceCount(). 
data:
  type: ImageList
  length: 10
  img_dir: '/home/joncrall/.cache/scalemae/tests/demo/imagefolder'
  mean: [0.46921533, 0.46026663, 0.41329921]
  std: [0.1927, 0.1373, 0.1203]
  vis_factor: 1.0


DEFAULT_ROOT_DIR      = /home/joncrall/exps/scalemae_demo
DATA_PATH             = /home/joncrall/.cache/scalemae/tests/demo/imagefolder

/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
2023-11-04 21:27:40.159162: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-04 21:27:40.201501: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-04 21:27:41.136947: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Starting pretrain
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/joncrall/code/watch/geowatch_tpl/submodules/scale-mae/scalemae/main_pretrain.py", line 771, in <module>
    main(args)
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/joncrall/code/watch/geowatch_tpl/submodules/scale-mae/scalemae/main_pretrain.py", line 409, in main
    misc.init_distributed_mode(args)
  File "/home/joncrall/code/watch/geowatch_tpl/submodules/scale-mae/scalemae/util/misc.py", line 264, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 518179) of binary: /home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scalemae.main_pretrain FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-04_21:27:41
  host      : toothbrush
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 518179)
  error_file: /tmp/torchelastic_ktnaatzu/none_6fhs8u23/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/home/joncrall/code/watch/geowatch_tpl/submodules/scale-mae/scalemae/main_pretrain.py", line 409, in main
      misc.init_distributed_mode(args)
    File "/home/joncrall/code/watch/geowatch_tpl/submodules/scale-mae/scalemae/util/misc.py", line 264, in init_distributed_mode
      torch.cuda.set_device(args.gpu)
    File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 350, in set_device
      torch._C._cuda_setDevice(device)
    File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
      torch._C._cuda_init()
  RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
  
============================================================

This could just be a hardware problem (can this not run on 2x 3090s?). Is there anything obviously wrong with my config?

Are there recommended settings for attempting to reproduce the pipeline on a small dataset (for testing)?

@RitwikGupta (Member) commented:

Jon, your config looks OK, but the issue seems to be with your environment: PyTorch is unable to see your GPUs. Can you verify everything is set up correctly?
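
As a minimal sanity check (illustrative, not a command from this thread), something like the following should report both GPUs:

import torch
print(torch.__version__)           # e.g. 2.0.0+cu117
print(torch.cuda.is_available())   # expect True on a working setup
print(torch.cuda.device_count())   # expect 2 for the 2x 3090 machine
print(torch.version.cuda)          # CUDA version the installed wheel was built against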

@Erotemic (Author) commented Nov 5, 2023

Yes, I'm currently training a geowatch network with 2 GPUs using LightningCLI.

An extended version of python -m torch.utils.collect_env with more relevant package output is:

PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.26.1
Libc version: glibc-2.35

Python version: 3.11.2 (main, Apr  1 2023, 18:27:37) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
CPU family:                         6
Model:                              167
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           1
CPU max MHz:                        5300.0000
CPU min MHz:                        800.0000
BogoMIPS:                           7008.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           16 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] classy-vision==0.7.0
[pip3] cmd-queue==0.1.19
[pip3] dvc==3.22.0
[pip3] dvc-azure==2.21.2
[pip3] dvc-data==2.16.1.dev0+g5326364.d20230907
[pip3] dvc-gdrive==2.19.2
[pip3] dvc-gs==2.22.1
[pip3] dvc-hdfs==2.19.0
[pip3] dvc-http==2.30.2
[pip3] dvc-objects==1.0.1
[pip3] dvc-oss==2.19.0
[pip3] dvc-render==0.5.3
[pip3] dvc-s3==2.23.0
[pip3] dvc-ssh==2.22.3.dev6+g773f905
[pip3] dvc-studio-client==0.10.0
[pip3] dvc-task==0.3.0
[pip3] dvc-webdav==2.19.1
[pip3] dvc-webhdfs==2.19.0
[pip3] efficientnet-pytorch==0.7.1
[pip3] einops==0.6.0
[pip3] GDAL==3.8.0
[pip3] geopandas==0.12.2
[pip3] geowatch==0.11.0
[pip3] kwarray==0.6.14
[pip3] kwcoco==0.7.2
[pip3] kwimage==0.9.21
[pip3] kwimage-ext==0.2.1
[pip3] kwutil==0.2.4
[pip3] lightning==2.1.0
[pip3] lightning-utilities==0.8.0
[pip3] matplotlib==3.7.1
[pip3] matplotlib-inline==0.1.6
[pip3] mmcv==2.0.0
[pip3] mypy==1.6.1
[pip3] mypy-boto3-s3==1.26.153
[pip3] mypy-extensions==1.0.0
[pip3] ndsampler==0.7.5
[pip3] numpy==1.26.1
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cuda-cupti-cu11==11.7.101
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-curand-cu11==10.2.10.91
[pip3] nvidia-cusolver-cu11==11.4.0.1
[pip3] nvidia-cusparse-cu11==11.7.4.91
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nvtx-cu11==11.7.91
[pip3] opencv-python-headless==4.8.1.78
[pip3] pandas==1.5.3
[pip3] perceiver-pytorch==0.8.7
[pip3] performer-pytorch==1.1.4
[pip3] pytorch-lightning==2.0.8
[pip3] pytorch-msssim==0.1.5
[pip3] pytorch-ranger==0.1.1
[pip3] rasterio==1.3.5
[pip3] reformer-pytorch==1.4.4
[pip3] scikit-learn==1.2.2
[pip3] scipy==1.11.1
[pip3] scriptconfig==0.7.11
[pip3] seaborn==0.12.2
[pip3] segmentation-models-pytorch==0.3.3
[pip3] shapely==2.0.1
[pip3] simple-dvc==0.2.0
[pip3] tensorboard==2.14.0
[pip3] tensorboard-data-server==0.7.0
[pip3] tensorboard-plugin-wit==1.8.1
[pip3] tensorboardX==2.6
[pip3] tensorflow==2.12.0
[pip3] tensorflow-estimator==2.12.0
[pip3] tensorflow-io-gcs-filesystem==0.32.0
[pip3] timm==0.9.2
[pip3] torch==2.0.0+cu117
[pip3] torch-liberator==0.2.2
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.0.1+cu117
[pip3] torchgeo==0.5.0
[pip3] torchmetrics==0.11.4
[pip3] torchvision==0.15.1+cu117
[pip3] ubelt==1.3.4
[pip3] vit-pytorch==1.2.0
[conda] Could not collect

@RitwikGupta (Member) commented:

@Erotemic I was able to take a look at this again. The environment set up properly for me. Can you install packages in your environment step-by-step and see where your env breaks?
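
For illustration only, an incremental rebuild might look something like this (the package list is a guess, not a prescription from this thread):

python -m venv scalemae-env && . scalemae-env/bin/activate
pip install torch torchvision
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"  # check CUDA first
pip install timm torchgeo rasterio kwcoco
python -c "import torchgeo, rasterio, kwcoco"  # then the geo / data stack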

@Erotemic (Author) commented:

@RitwikGupta I've made an MWE in a docker image, and I was able to get farther. It's likely something on my host system is weird.

To that end, I've added a dockerfile and instructions that walk through my MWE. It's still giving me an error, but it has to do with the dataset not having a CRS. This makes sense because kwcoco demo data doesn't contain geo-metadata. However, geowatch demodata does have CRS information, so I'll see if I can get farther by using that.

@Erotemic (Author) commented Nov 28, 2023

Hmm, it looks like I still get an error:

  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/geo.py", line 83, in GeoDataset
    _crs = CRS.from_epsg(4326)
  File "rasterio/crs.pyx", line 590, in rasterio.crs.CRS.from_epsg

rasterio.errors.CRSError: The EPSG code is unknown. 
PROJ: internal_proj_create_from_database: 
/opt/conda/envs/scalemae/share/proj/proj.db lacks 
DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. 
It comes from another PROJ installation.
/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
ERROR 1: PROJ: internal_proj_create_from_database: /opt/conda/envs/scalemae/share/proj/proj.db lacks DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. It comes from another PROJ installation.
Traceback (most recent call last):
  File "rasterio/crs.pyx", line 586, in rasterio.crs.CRS.from_epsg
  File "rasterio/_err.pyx", line 195, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: PROJ: internal_proj_create_from_database: /opt/conda/envs/scalemae/share/proj/proj.db lacks DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. It comes from another PROJ installation.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/scalemae/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/scalemae/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/code/scalemae/scalemae/main_pretrain.py", line 48, in <module>
    from scalemae.dataloaders.utils import get_dataset_and_sampler, get_eval_dataset_and_transform
  File "/root/code/scalemae/scalemae/dataloaders/utils.py", line 15, in <module>
    from scalemae.dataloaders.naip import build_naip_sampler
  File "/root/code/scalemae/scalemae/dataloaders/naip.py", line 4, in <module>
    from torchgeo.datasets import stack_samples
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/__init__.py", line 6, in <module>
    from .advance import ADVANCE
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/advance.py", line 17, in <module>
    from .geo import NonGeoDataset
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/geo.py", line 42, in <module>
    class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/geo.py", line 83, in GeoDataset
    _crs = CRS.from_epsg(4326)
  File "rasterio/crs.pyx", line 590, in rasterio.crs.CRS.from_epsg
rasterio.errors.CRSError: The EPSG code is unknown. PROJ: internal_proj_create_from_database: /opt/conda/envs/scalemae/share/proj/proj.db lacks DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. It comes from another PROJ installation.
[2023-11-28 01:40:29,341] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 893) of binary: /opt/conda/envs/scalemae/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/scalemae/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/scalemae/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scalemae.main_pretrain FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-28_01:40:29
  host      : 168b53aa1722
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 893)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This docker env is:

(scalemae) root@168b53aa1722:~/code/scalemae# python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 525.147.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
CPU family: 6
Model: 167
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
CPU max MHz: 5300.0000
CPU min MHz: 800.0000
BogoMIPS: 7008.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
L1d cache: 384 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.1
[pip3] numpy==1.26.0
[pip3] pytorch-lightning==2.1.2
[pip3] pytorch-msssim==0.1.5
[pip3] pytorch-ranger==0.1.1
[pip3] segmentation-models-pytorch==0.3.2
[pip3] torch==2.1.1
[pip3] torch-liberator==0.2.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.1.1
[pip3] torchgeo==0.5.1
[pip3] torchmetrics==1.2.0
[pip3] torchvision==0.16.1
[conda] blas 1.0 mkl
[conda] efficientnet-pytorch 0.7.1 pypi_0 pypi
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py39h5eee18b_1
[conda] mkl_fft 1.3.8 py39h5eee18b_0
[conda] mkl_random 1.2.4 py39hdb19cb5_0
[conda] numpy 1.26.0 py39h5f9d8c6_0
[conda] numpy-base 1.26.0 py39hb5e798b_0
[conda] pytorch 2.1.1 py3.9_cpu_0 pytorch
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 2.1.2 pypi_0 pypi
[conda] pytorch-msssim 0.1.5 pypi_0 pypi
[conda] pytorch-mutex 1.0 cpu pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] segmentation-models-pytorch 0.3.2 pypi_0 pypi
[conda] torch-liberator 0.2.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 2.1.1 py39_cpu pytorch
[conda] torchgeo 0.5.1 pypi_0 pypi
[conda] torchmetrics 1.2.0 pypi_0 pypi
[conda] torchvision 0.16.1 py39_cpu pytorch

@RitwikGupta (Member) commented Nov 28, 2023

This is a common env issue with rasterio. You should conda install rasterio instead of pip installing it.
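
For example (assuming a conda-forge channel is acceptable; this exact command is not from the thread):

pip uninstall -y rasterio
conda install -y -c conda-forge rasterio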

@Erotemic (Author) commented Nov 28, 2023

The conda variant of rasterio works (I do hope to eventually get this working without needing conda, but that's for after I get the basic case working).

Unfortunately, I'm still getting errors:

Root Cause (first observed failure):
[0]:
  time      : 2023-11-28_14:45:34
  host      : 168b53aa1722
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 998)
  error_file: /tmp/torchelastic_m66l_uvu/none_i0_vmh1r/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/code/scalemae/scalemae/main_pretrain.py", line 409, in main
      misc.init_distributed_mode(args)
    File "/root/code/scalemae/scalemae/util/misc.py", line 264, in init_distributed_mode
      torch.cuda.set_device(args.gpu)
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
      torch._C._cuda_setDevice(device)
  AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

Do you have details on the environment where you've gotten it to work? Torch versions, etc.?

EDIT: I'm getting farther (I've got the versions sorted out, although it would still be nice to know exactly which versions were in your working env). I'm currently running into an issue that I think is due to the hard-coded datasets:

  traceback : Traceback (most recent call last):
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/code/scale-mae/scalemae/main_pretrain.py", line 717, in main
      train_stats = train_one_epoch(
    File "/root/code/scale-mae/scalemae/engine_pretrain.py", line 57, in train_one_epoch
      for data_iter_step, ((samples, res, targets, target_res), metadata) in enumerate(
    File "/root/code/scale-mae/scalemae/util/misc.py", line 144, in log_every
      for obj in iterable:
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
      data = self._next_data()
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
      data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
      return self.collate_fn(data)
    File "/root/code/scale-mae/scalemae/dataloaders/utils.py", line 148, in __call__
      imgs = torch.stack(list(zip(*samples))[0])
  TypeError: expected Tensor as element 0 in argument 0, but got Image

I may be able to work through this one. But if you'll allow me to rant for a moment: this is the reason I've built kwcoco and the dataloader in geowatch. The fact that you can't just swap datasets in and out as modules in research repos makes them far harder to use, reproduce, and extend than they should be. Torchgeo doesn't solve this problem: it makes it worse by having a specific dataset class for each specific dataset. There should be a generic dataset that points to a metadata manifest file. The process of dataloading should be entirely abstracted away from the ML research. The current practice of hard-coding everything leads to too many frustrations. There needs to be a standardized vision dataset interchange format that's expressive enough to capture the nuances of different vision problems. I'm attempting to make kwcoco that format, but really I'd be happy if anything standard and easy-to-use existed. In any case, if I do get this working, you should expect that the updated code will be able to point to a kwcoco dataset and just run on it. </rant>
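
As a hypothetical workaround sketch for the collate error above (not the fix used in this branch), the dataset transform or collate function could convert PIL Images to tensors before stacking:

import torch
from torchvision import transforms

_to_tensor = transforms.ToTensor()

def to_tensor_if_needed(img):
    # Return a CHW float tensor whether the loader yields a PIL Image or a tensor.
    return img if torch.is_tensor(img) else _to_tensor(img)

# The collate line from the traceback, torch.stack(list(zip(*samples))[0]), could
# then stack [to_tensor_if_needed(s) for s in list(zip(*samples))[0]] instead.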

@RitwikGupta (Member) commented:

PyTorch 1.13.1 should work; try that out.
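
A minimal way to try that pin (the CUDA 11.7 wheel index is an assumption, chosen to match the cu117 builds mentioned earlier in the thread):

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117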
