Tips for running in Google Colab & resuming from checkpoint #8
You can avoid a lot of those extra steps, and you can run it at 256 by just reducing the batch size, as long as Colab gives you a GPU with 16 GB. You can make this work with just a few cells:

```
# Install dependencies and clone the repo
!pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3 tensorfn jsonnet
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git

# Move into the cloned repo
%cd alias-free-gan-pytorch

# Here, download your dataset, or copy/unzip it from Drive, or whatever

# Prepare the dataset and crop the images to 256x256
%run prepare_data.py --out my_dataset --n_worker 4 --size 256 "your/dataset/folder"

# Now train
%run train.py --n_gpu 1 --conf config/config-t.jsonnet training.batch=4 path=my_dataset
```

I believe a batch size of 8 should also work without problems, but I set 4 just in case.
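If you want to check which GPU Colab assigned you before picking a batch size, a quick cell like this helps (standard Colab/NVIDIA tooling, nothing repo-specific):

```
# Shows the assigned GPU and its memory; a 16 GB card handles batch 4 at 256px
!nvidia-smi
```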
Thanks, that's fantastic help, and much easier! Are you able to offer any advice on resuming from checkpoints, or did I (probably) make a mistake?
I hadn't tried resuming until now, and you're correct, it seems to be broken 🤔 No matter what you pass as an argument, it will be displayed as `None`.
Would it be possible for you to post your hack until the problem is addressed at a deeper level?
Sure, just edit the `Training` config class:

```python
class Training(Config):
    size: StrictInt
    iter: StrictInt = 800000
    batch: StrictInt = 16
    n_sample: StrictInt = 32
    r1: float = 10
    d_reg_every: StrictInt = 16
    lr_g: float = 2e-3
    lr_d: float = 2e-3
    augment: StrictBool = False
    augment_p: float = 0
    ada_target: float = 0.6
    ada_length: StrictInt = 500 * 1000
    ada_every: StrictInt = 256
    start_iter: StrictInt = 0
    ckpt: str = None
```

Then, where the checkpoint is loaded:

```python
if conf.training.ckpt is not None:
    logger.info(f"load model: {conf.training.ckpt}")
    ckpt = torch.load(conf.training.ckpt, map_location=lambda storage, loc: storage)

    try:
        # checkpoint files are named after the iteration, e.g. 010000.pt,
        # so the start iteration can be recovered from the filename
        ckpt_name = os.path.basename(conf.training.ckpt)
        conf.training.start_iter = int(os.path.splitext(ckpt_name)[0])
    except ValueError:
        pass
```

And that's it. Then, when you run the training script, you can pass the argument `training.ckpt=checkpoint/010000.pt` (or wherever your checkpoint lives).
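Note that `start_iter` only does something if the training loop actually offsets by it; if your copy restarts the counter at zero, the idea is roughly this (a sketch with a hypothetical loop shape, adapt to the real training step):

```python
# Offset the counter so logging and checkpoint filenames continue from the
# loaded iteration instead of restarting at 0.
for idx in range(conf.training.iter - conf.training.start_iter):
    i = idx + conf.training.start_iter  # global iteration number
    # ... training step goes here ...
    if i % 10000 == 0:
        print(f"would save checkpoint/{str(i).zfill(6)}.pt")
```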
duskvirkus made a PyTorch Lightning notebook: https://colab.research.google.com/github/duskvirkus/alias-free-gan-pytorch/blob/main/notebooks/AliasFreeGAN_lightning_basic_training.ipynb
Has anyone succeeded at training with 512 x 512 or 1024 x 1024? I succeeded with 256 x 256 but have been struggling with higher resolutions. I'm using the same input dataset in both cases, but I hit PIL errors, which I hacked around via sidphbot's approach. I still appear to have some issue with loading real images and get: `AttributeError: 'bytes' object has no attribute 'seek'`. Any ideas?
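For anyone hitting the same thing: that error usually means raw bytes are reaching `Image.open` directly, and PIL needs a file-like object with `seek()`. A minimal sketch of the usual fix, assuming the loader hands back raw bytes (e.g. a record read from LMDB) — not the repo's exact code:

```python
import io
from PIL import Image

# `example.jpg` / `img_bytes` are placeholders for whatever the loader reads
with open("example.jpg", "rb") as f:
    img_bytes = f.read()

# img = Image.open(img_bytes)             # AttributeError: 'bytes' object has no attribute 'seek'
img = Image.open(io.BytesIO(img_bytes))   # wrapping in BytesIO works
img = img.convert("RGB")
```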
Thanks for this repo, it's great!
To get it working in Colab, I copied the bare minimum out of the Dockerfile:
```
!pip install jsonnet
!apt install -y -q ninja-build
!pip install tensorfn rich
!pip install setuptools
!pip install numpy scipy nltk lmdb cython pydantic pyhocon
!apt install libsm6 libxext6 libxrender1
!pip install opencv-python-headless
```
It then works despite throwing two compatibility errors:
```
ERROR: requests 2.23.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.6 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
```
I then made some manual edits so it runs on Colab:

- In `config/config-t.jsonnet`, under `training: {}`, set the image size to 128.
- Under `training: {}`, set the batch size to 12 (about 650 MB each, so under 8 GB, I guess).
- In `prepare_data.py`, I commented out line 14 so images are only cropped, not resized. Could be a useful config for some datasets.
- In `train.py`, around line 322 in the main function, I commented out the five `logger` lines. The logger info didn't work out of the box in Colab; it just hangs and then falls over without an error, but I didn't investigate further.
I also couldn't get `--ckpt=checkpoint/010000.pt` to resume properly. I tried editing the start iteration in the config too, but no luck; it just seemed to start from zero again.
Also, it may be worth editing train.py to use autocast() for half-precision float16 instead of float32, to improve speed and ease the memory limitations (rough sketch below)? Or even porting to TPU? https://github.com/pytorch/xla
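Something like the generic PyTorch AMP pattern, say; this is a self-contained sketch, not wired into this repo's train.py, and `model`/`x` are placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
scaler = GradScaler()

for _ in range(10):
    x = torch.randn(4, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                  # forward pass runs in float16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```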
So then run:

```
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git
```

After making these edits:

```
# Upload your zip file or use Google Drive import
!unzip /content/dataraw.zip -d /content/dataraw

%cd /content/alias-free-gan-pytorch
!python prepare_data.py --out /content/dataset --n_worker 8 --size=128 /content/dataraw
!python train.py --n_gpu 1 --conf config/config-t.jsonnet path=/content/dataset/
```
Thanks again!