[WIP] Upstream changes #7
base: main
Conversation
@cjrd @RitwikGupta I'm trying to get an MWE of this running. With the latest changes you can do something like this:

```bash
# Create demo train / validation data
DATA_PATH=$(python -m scalemae.demo)

echo "
data:
    type: ImageList
    length: 10
    img_dir: '$DATA_PATH'
    mean: [0.46921533, 0.46026663, 0.41329921]
    std: [0.1927, 0.1373, 0.1203]
    vis_factor: 1.0
" > $DATA_PATH/demo.yaml
cat $DATA_PATH/demo.yaml

DEFAULT_ROOT_DIR=$HOME/exps/scalemae_demo
echo "
DEFAULT_ROOT_DIR = $DEFAULT_ROOT_DIR
DATA_PATH = $DATA_PATH
"
mkdir -p $DEFAULT_ROOT_DIR

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 --master_port=11085 -m scalemae.main_pretrain \
    --output_dir $DEFAULT_ROOT_DIR \
    --log_dir $DEFAULT_ROOT_DIR \
    --config $DATA_PATH/demo.yaml \
    --eval_path "$DATA_PATH" \
    --batch_size 4 \
    --model mae_vit_base_patch16 \
    --mask_ratio 0.75 \
    --num_workers 0 \
    --epochs 300 \
    --target_size 224 \
    --input_size 224 \
    --self_attention \
    --scale_min 0.2 \
    --scale_max 1.0 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --decoder_aux_loss_layers 1 \
    --target_size_scheduler constant \
    --decoder_depth 8 \
    --no_autoresume \
    --use_mask_token \
    --skip_knn_eval \
    --fixed_output_size_min 224 \
    --fixed_output_size_max 336 \
    --absolute_scale
```

This generates a small dataset with kwcoco, so it can grow larger if needed. I'm able to write an ImageFolder that should correspond to one of the dataloaders. I thought the above would run, but I got:
This could just be a hardware problem (can this not run on 2x 3090s?). Is there anything obviously wrong with my config? Are there recommended settings for attempting to reproduce the pipeline on a small dataset (for testing)?
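As a side note on the "ImageFolder that should correspond to one of the dataloaders" mentioned above, here is a minimal sketch of that idea using plain torchvision. This is my own illustration rather than the scalemae dataloader; the data path and the class-subfolder layout that `ImageFolder` expects are assumptions.

```python
# Hypothetical sketch: wrap the generated demo images in a torchvision
# ImageFolder, reusing the mean/std from demo.yaml above.
from torchvision import transforms
from torchvision.datasets import ImageFolder

DATA_PATH = "/path/to/demo_data"  # assumed: the directory printed by `python -m scalemae.demo`

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.46921533, 0.46026663, 0.41329921],
                         std=[0.1927, 0.1373, 0.1203]),
])

# Assumes DATA_PATH contains one subdirectory per class, as ImageFolder requires.
dataset = ImageFolder(root=DATA_PATH, transform=transform)
print(len(dataset), dataset[0][0].shape)  # e.g. torch.Size([3, 224, 224])
```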
Jon, your config looks OK, but the issue appears to be your environment: it seems PyTorch is unable to see your GPUs. Can you verify everything is set up correctly?
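A quick way to verify this is a generic PyTorch diagnostic (my addition, not from the thread): if it prints `False` or `0`, the torch build or driver is the problem rather than the training config.

```python
# Sanity-check GPU visibility from inside the training environment.
import torch

print(torch.__version__, torch.version.cuda)  # torch build and the CUDA it was compiled against
print(torch.cuda.is_available())              # False => PyTorch cannot see any GPU
print(torch.cuda.device_count())              # should report 2 on a 2x 3090 machine
```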
Yes, I'm currently training a geowatch network with 2 GPUs using LightningCLI. An extended version of
@Erotemic I was able to take a look at this again. The environment set up properly for me. Can you install packages in your environment step by step and see where your env breaks?
@RitwikGupta I've made an MWE in a docker image, and I was able to get farther, so it's likely that something on my host system is weird. To that end, I've added a dockerfile and instructions that walk through my MWE. It's still giving me an error, but it has to do with not having a CRS for the dataset. This makes sense because the kwcoco demo data doesn't contain geo-metadata. However, geowatch demodata does have CRS information, so I'll see if I can get farther by using that.
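A small diagnostic sketch for the CRS issue described above (my illustration, assuming rasterio is installed and the demo images are GeoTIFFs at a hypothetical path):

```python
# Check whether the demo images actually carry a CRS; kwcoco demo data has
# no geo-metadata, while geowatch demodata should report a real CRS.
import glob
import rasterio

for path in glob.glob("/path/to/demo_data/**/*.tif", recursive=True):  # assumed path
    with rasterio.open(path) as src:
        print(path, src.crs)  # None => no CRS, which reproduces the error above
```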
Hmm, it looks like I still get an error:
This docker env is:
```
(scalemae) root@168b53aa1722:~/code/scalemae# python -m torch.utils.collect_env
Collecting environment information...
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] (64-bit runtime)
Nvidia driver version: 525.147.05
CPU:
Versions of relevant libraries:
```
This is a common env issue with rasterio. You should use the conda build of rasterio instead.
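For reference, one generic way to see which GDAL the installed rasterio is linked against (my addition, not from the thread; a mismatched GDAL build is the usual culprit in these env issues):

```python
# Print the rasterio wheel/conda build and the GDAL it is compiled against.
import rasterio

print("rasterio:", rasterio.__version__)
print("GDAL:", rasterio.__gdal_version__)
```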
The conda variant of rasterio works (I do hope to get this working where conda is no longer necessary, but that's for after I get the basic case working). Unfortunately, I'm still getting errors:
Do you have the details for the environment where you've gotten it to work? Torch versions / etc.?
EDIT: I'm getting farther (I've got the versions sorted out, although it would still be nice to know exactly which versions you had in your env to make it work). Currently I'm running into an issue that I think is due to the hard-coded datasets:
I may be able to work through this one. But if you'll allow me to rant for a moment: this is the reason why I've built kwcoco and the dataloader in geowatch. The fact that you can't just swap datasets in and out as modules in research repos makes them far harder to use, reproduce, and extend than they should be. Torchgeo doesn't solve this problem: it makes it worse by having a specific dataset class for each specific dataset. There should be a generic dataset that points to a metadata manifest file, and the process of dataloading should be entirely abstracted away from the ML research. The current practice of hard coding everything leads to too many frustrations. There needs to be a standardized vision dataset interchange that's expressive enough to capture the nuances of different vision problems. I'm attempting to make kwcoco that format, but really I'd be happy if anything standard and easy to use existed. In any case, if I do get this working you should expect that the updated code will be able to point to a kwcoco dataset and just run on it. </rant>
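To make the manifest idea concrete, here is a minimal sketch of kwcoco usage (my own illustration, not code from this branch): a single manifest file describes every image plus its metadata, so a generic dataloader can be pointed at any dataset without hard-coding dataset-specific classes.

```python
# Load a dataset from a kwcoco manifest and enumerate its images.
import kwcoco

# Generate toy data, or point at an existing manifest:
# dset = kwcoco.CocoDataset("/path/to/data.kwcoco.json")  # hypothetical path
dset = kwcoco.CocoDataset.demo("vidshapes8")

for gid, img in dset.imgs.items():
    print(gid, img["file_name"], img.get("width"), img.get("height"))
```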
PyTorch 1.13.1 should work; try that out.
I'm looking into integrating ScaleMAE into geowatch.
I've made this branch to track modifications to make it work. Currently this involves:
Setting up proper package namespaces: everything should be referenced under the "scalemae" namespace to allow for integration with other libraries. Having a module named "lib" is a common anti-pattern in repos because it leads to conflicts, and simply putting everything into a top-level package namespace fixes the issue. It also means all imports are now referenced explicitly in the code itself (see the sketch after this list).
Finding minimum versions of required and optional dependencies. This is still in progress; there doesn't seem to be a comprehensive list of requirements needed to make the repo work, so I'm gathering those while also deconflicting with geowatch's requirements.
Linting to remove unused code.
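A minimal before/after sketch of the namespacing point (the submodule and symbol names are assumptions for illustration, not taken verbatim from the repo):

```python
# Before: a bare top-level "lib" module, which collides with any other
# project that also ships a "lib" package (assumed original layout):
# from lib.models_mae import mae_vit_base_patch16

# After: explicit imports under the installable "scalemae" package,
# safe to use alongside other libraries (assumed submodule name):
# from scalemae.models_mae import mae_vit_base_patch16

import scalemae  # the same namespace used by `python -m scalemae.main_pretrain` above
print(scalemae.__name__)
```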
This should not be merged yet. I'm just ensuring the work is pushed as it is developed for comments and visibility.