
Tips for running in google colab & resuming from checkpoint #8

Open
Luke2642 opened this issue Jul 14, 2021 · 7 comments
Luke2642 commented Jul 14, 2021

Thanks for this repo, it's great!

To get it working in Colab, I copied the bare minimum out of the Dockerfile:

!pip install jsonnet
!apt install -y -q ninja-build
!pip install tensorfn rich

!pip install setuptools
!pip install numpy scipy nltk lmdb cython pydantic pyhocon

!apt install libsm6 libxext6 libxrender1
!pip install opencv-python-headless

It then works, despite pip reporting two dependency conflicts:

ERROR: requests 2.23.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.6 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

I then made some manual edits to config/config-t.jsonnet so it runs on Colab:

Under training:{}, set the image size to 128
Under training:{}, set the batch size to 12 (roughly 650 MB per sample, so under 8 GB, I guess)
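For what it's worth, that batch-size choice is just arithmetic on an eyeballed per-sample cost; a quick sketch of the estimate (both numbers are my guesses, not measured values):

```python
PER_SAMPLE_MB = 650   # eyeballed per-image training cost at size 128, not measured
VRAM_MB = 8 * 1024    # the ~8 GB GPU Colab sometimes assigns

def max_batch(vram_mb: int, per_sample_mb: int) -> int:
    """Largest batch size whose per-sample cost fits in the VRAM budget."""
    return vram_mb // per_sample_mb

print(max_batch(VRAM_MB, PER_SAMPLE_MB))  # 12
```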

In prepare_data.py I commented out line 14 so images are only cropped, not resized. That could be a useful config option for some datasets.
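The crop-only behaviour I mean can be sketched like this; the function name is mine, not something from prepare_data.py:

```python
from PIL import Image

def center_crop(img: Image.Image, size: int) -> Image.Image:
    """Center-crop to size x size without any resizing."""
    w, h = img.size
    left = (w - size) // 2
    top = (h - size) // 2
    return img.crop((left, top, left + size, top + size))

# example: crop a 300x200 image straight down to 128x128
img = Image.new("RGB", (300, 200))
print(center_crop(img, 128).size)  # (128, 128)
```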

In train.py, around line 322 in the main function, I commented out five "logger" lines. Out of the box in Colab the logger didn't work: it just hangs and then falls over without an error. I didn't investigate further.

I also couldn't get --ckpt=checkpoint/010000.pt to resume properly. I tried editing the start iteration in the config too, but no luck; it just seemed to start from zero again.

Also, it may be worth editing train.py to use autocast() for half-precision float16 instead of float32, to improve speed and ease the memory limits? Or even porting to TPU? https://github.com/pytorch/xla
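A minimal sketch of what such an autocast() change could look like; the model and optimizer here are stand-ins, not the actual variables in train.py:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(16, 1)                          # stand-in for the real generator
optim = torch.optim.Adam(model.parameters(), lr=2e-3)
scaler = GradScaler(enabled=torch.cuda.is_available())  # no-op on CPU

x = torch.randn(4, 16)
with autocast(enabled=torch.cuda.is_available()):       # fp16 ops on GPU, fp32 on CPU
    loss = model(x).mean()

scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optim)
scaler.update()
print(float(loss.item()))
```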

So the full sequence is: clone the repo first,

!git clone https://github.com/rosinality/alias-free-gan-pytorch.git

then, after making the edits above:

#upload your zip file or use google drive import
!unzip /content/dataraw.zip -d /content/dataraw

%cd /content/alias-free-gan-pytorch
!python prepare_data.py --out /content/dataset --n_worker 8 --size=128 /content/dataraw

%cd /content/alias-free-gan-pytorch
!python train.py --n_gpu 1 --conf config/config-t.jsonnet path=/content/dataset/

Thanks again!

pabloppp (Contributor) commented Jul 14, 2021

You can avoid a lot of those extra steps, and you can also run at 256 just by reducing the batch size, as long as Colab gives you a GPU with 16 GB. You can make this work with just a few cells:

# Install dependencies and clone repo
!pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3 tensorfn jsonnet
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git
# Move into the cloned repo
%cd alias-free-gan-pytorch
# Here download your dataset, or copy/unzip it from drive, or whatever
# prepare the dataset and crop the images to 256x256
%run prepare_data.py --out my_dataset --n_worker 4 --size 256 "your/dataset/folder"
# now train
%run train.py --n_gpu 1 --conf config/config-t.jsonnet training.batch=4 path=my_dataset

I believe a batch size of 8 should also work without problems, but I set 4 just in case.

Luke2642 (Author) commented

Thanks, that's fantastic help, and much easier!

Are you able to offer any advice on resuming from checkpoints, or did I (probably) make a mistake?

pabloppp (Contributor) commented

I hadn't tried resuming until now, and you're correct, it seems to be broken 🤔 No matter what you pass as an argument, it is displayed as None when the config is printed.
I managed to get it working by adding an extra ckpt parameter inside training, but it's just a bad hack to make it work 🤔

MHRosenberg commented

> I hadn't tried resuming until now, and you're correct, it seems to be broken 🤔 no matter what you pass as an argument, it will be displayed as None when the config is printed.
> I managed to get it working by adding an extra ckpt parameter inside training but it's just a bad hack to make it work 🤔

Would it be possible for you to post your hack until the problem is addressed at a deeper level?

pabloppp (Contributor) commented

Sure, just edit config.py and, under class Training(Config):, add ckpt: str = None, like this:

class Training(Config):
    size: StrictInt
    iter: StrictInt = 800000
    batch: StrictInt = 16
    n_sample: StrictInt = 32
    r1: float = 10
    d_reg_every: StrictInt = 16
    lr_g: float = 2e-3
    lr_d: float = 2e-3
    augment: StrictBool = False
    augment_p: float = 0
    ada_target: float = 0.6
    ada_length: StrictInt = 500 * 1000
    ada_every: StrictInt = 256
    start_iter: StrictInt = 0
    ckpt: str = None

Then, in train.py, replace the 4 occurrences of conf.ckpt with conf.training.ckpt:

if conf.training.ckpt is not None:
    logger.info(f"load model: {conf.training.ckpt}")

    ckpt = torch.load(conf.training.ckpt, map_location=lambda storage, loc: storage)

    try:
        ckpt_name = os.path.basename(conf.training.ckpt)
        conf.training.start_iter = int(os.path.splitext(ckpt_name)[0])

    except ValueError:
        pass

And that's it. When you run the training script, you can pass the argument training.ckpt="checkpoint/060000.pt" and it should load the checkpoint and resume training instead of restarting from scratch.
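Note the start iteration is recovered purely by parsing the checkpoint filename, as in the snippet above. A quick stdlib-only sanity check of that parsing (the helper name is mine):

```python
import os

def start_iter_from_ckpt(path: str) -> int:
    """Mirror the filename parsing in train.py: 'checkpoint/060000.pt' -> 60000."""
    name = os.path.basename(path)
    try:
        return int(os.path.splitext(name)[0])
    except ValueError:
        return 0  # fall back to 0 if the filename isn't an iteration count

print(start_iter_from_ckpt("checkpoint/060000.pt"))  # 60000
```

So a checkpoint renamed to something like best.pt would silently restart counting from 0.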


ucalyptus2 commented Jul 22, 2021

duskvirkus made a notebook in Pytorch Lightning -> https://colab.research.google.com/github/duskvirkus/alias-free-gan-pytorch/blob/main/notebooks/AliasFreeGAN_lightning_basic_training.ipynb

@Luke2642 Luke2642 changed the title Tips for running in google colab Tips for running in google colab & resuming from checkpoint Jul 30, 2021

MHRosenberg commented Aug 19, 2021

Has anyone succeeded at training at 512 x 512 or 1024 x 1024? I succeeded at 256 x 256 but have been struggling with higher resolutions. I'm using the same input dataset in both cases, but I hit PIL errors, which I hacked around via sidphbot's approach; I still seem to have some issue loading real images and get: "AttributeError: 'bytes' object has no attribute 'seek'". Any ideas?
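Not sure where exactly your error triggers, but that AttributeError usually means raw bytes (e.g. an image record read from the LMDB dataset) are being passed to something that expects a seekable file-like object, such as PIL's Image.open; wrapping the bytes in io.BytesIO typically fixes it. A stdlib-only illustration:

```python
import io

raw = b"\x89PNG..."  # stand-in for image bytes read from an lmdb record

# bytes objects are not file-like: they have no seek()
assert not hasattr(raw, "seek")

# io.BytesIO wraps them into a seekable stream that Image.open() accepts
buf = io.BytesIO(raw)
assert buf.seek(0) == 0
print(buf.read(4))  # b'\x89PNG'
```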
