Tips for running in Google Colab & resuming from checkpoint #8
You can avoid a lot of those extra steps, and you can run it at 256 by just reducing the batch size, as long as Colab gives you a GPU with 16 GB. You can make this work with just a few cells:

```
# Install dependencies and clone the repo
!pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3 tensorfn jsonnet
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git

# Move into the cloned repo
%cd alias-free-gan-pytorch

# Here, download your dataset, or copy/unzip it from Drive, or whatever

# Prepare the dataset and crop the images to 256x256
%run prepare_data.py --out my_dataset --n_worker 4 --size 256 "your/dataset/folder"

# Now train
%run train.py --n_gpu 1 --conf config/config-t.jsonnet training.batch=4 path=my_dataset
```

I believe a batch size of 8 should also work without problems, but I set 4 just in case.
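If you want to check which GPU Colab assigned you before picking a batch size, a quick cell like this helps (standard Colab/NVIDIA tooling, nothing repo-specific):

```
# Shows the assigned GPU and its memory; a 16 GB card handles batch 4 at 256px
!nvidia-smi
```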
Thanks, that's fantastic help, and much easier! Are you able to offer any advice on resuming from checkpoints, or did I (probably) make a mistake?
I hadn't tried resuming until now, and you're correct, it seems to be broken 🤔 No matter what you pass as an argument, it will be displayed as `None`.
Would it be possible for you to post your hack until the problem is addressed at a deeper level?
Sure, just edit the `Training` config class:

```python
class Training(Config):
    size: StrictInt
    iter: StrictInt = 800000
    batch: StrictInt = 16
    n_sample: StrictInt = 32
    r1: float = 10
    d_reg_every: StrictInt = 16
    lr_g: float = 2e-3
    lr_d: float = 2e-3
    augment: StrictBool = False
    augment_p: float = 0
    ada_target: float = 0.6
    ada_length: StrictInt = 500 * 1000
    ada_every: StrictInt = 256
    start_iter: StrictInt = 0
    ckpt: str = None
```

Then, where the checkpoint is loaded:

```python
if conf.training.ckpt is not None:
    logger.info(f"load model: {conf.training.ckpt}")
    ckpt = torch.load(conf.training.ckpt, map_location=lambda storage, loc: storage)

    try:
        # checkpoint files are named after the iteration, e.g. 010000.pt,
        # so the start iteration can be recovered from the filename
        ckpt_name = os.path.basename(conf.training.ckpt)
        conf.training.start_iter = int(os.path.splitext(ckpt_name)[0])
    except ValueError:
        pass
```

And that's it. Then, when you run the training script, you can pass the argument `training.ckpt=checkpoint/010000.pt` (or wherever your checkpoint lives).
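Note that `start_iter` only does something if the training loop actually offsets by it; if your copy restarts the counter at zero, the idea is roughly this (a sketch with a hypothetical loop shape, adapt to the real training step):

```python
# Offset the counter so logging and checkpoint filenames continue from the
# loaded iteration instead of restarting at 0.
for idx in range(conf.training.iter - conf.training.start_iter):
    i = idx + conf.training.start_iter  # global iteration number
    # ... training step goes here ...
    if i % 10000 == 0:
        print(f"would save checkpoint/{str(i).zfill(6)}.pt")
```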
duskvirkus made a PyTorch Lightning notebook: https://colab.research.google.com/github/duskvirkus/alias-free-gan-pytorch/blob/main/notebooks/AliasFreeGAN_lightning_basic_training.ipynb
Has anyone succeeded at training with 512 x 512 or 1024 x 1024? I succeeded with 256 x 256 but have been struggling with higher resolutions. I'm using the same input dataset in both cases, but I hit PIL errors, which I hacked around via sidphbot's approach. I still appear to have some issue with loading real images and get: `AttributeError: 'bytes' object has no attribute 'seek'`. Any ideas?
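For anyone hitting the same thing: that error usually means raw bytes are reaching `Image.open` directly, and PIL needs a file-like object with `seek()`. A minimal sketch of the usual fix, assuming the loader hands back raw bytes (e.g. a record read from LMDB) — not the repo's exact code:

```python
import io
from PIL import Image

# `example.jpg` / `img_bytes` are placeholders for whatever the loader reads
with open("example.jpg", "rb") as f:
    img_bytes = f.read()

# img = Image.open(img_bytes)             # AttributeError: 'bytes' object has no attribute 'seek'
img = Image.open(io.BytesIO(img_bytes))   # wrapping in BytesIO works
img = img.convert("RGB")
```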
Thanks for this repo, it's great!
To get it working in Colab, I copied the bare minimum out of the Dockerfile:
```
!pip install jsonnet
!apt install -y -q ninja-build
!pip install tensorfn rich
!pip install setuptools
!pip install numpy scipy nltk lmdb cython pydantic pyhocon
!apt install libsm6 libxext6 libxrender1
!pip install opencv-python-headless
```
It then works despite throwing two compatibility errors:
```
ERROR: requests 2.23.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.6 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
```
I then made some manual edits so it runs on Colab:

- In `config/config-t.jsonnet`, under `training: {}`, set the image size to 128.
- Under `training: {}`, set the batch size to 12 (about 650 MB each, so under 8 GB, I guess).
- In `prepare_data.py`, I commented out line 14 so images are only cropped, not resized. Could be a useful config for some datasets.
- In `train.py`, around line 322 in the main function, I commented out the five `logger` lines. The logger info didn't work out of the box in Colab; it just hangs and then falls over without an error, but I didn't investigate further.
I also couldn't get `--ckpt=checkpoint/010000.pt` to resume properly. I tried editing the start iteration in the config too, but no luck; it just seemed to start from zero again.
Also, it may be worth editing train.py to use autocast() for half-precision float16 instead of float32, to improve speed and ease the memory limitations (rough sketch below)? Or even porting to TPU? https://github.com/pytorch/xla
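Something like the generic PyTorch AMP pattern, say; this is a self-contained sketch, not wired into this repo's train.py, and `model`/`x` are placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
scaler = GradScaler()

for _ in range(10):
    x = torch.randn(4, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                  # forward pass runs in float16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```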
So then run:

```
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git
```

After making these edits:

```
# Upload your zip file or use Google Drive import
!unzip /content/dataraw.zip -d /content/dataraw

%cd /content/alias-free-gan-pytorch
!python prepare_data.py --out /content/dataset --n_worker 8 --size=128 /content/dataraw
!python train.py --n_gpu 1 --conf config/config-t.jsonnet path=/content/dataset/
```
Thanks again!