new 32 layer model #396
Jack000 started this conversation in Show and tell
hey guys
Following up on my previous post, I trained a 32-layer dalle-pytorch model. Here's the link: https://github.com/Jack000/DALLE-pytorch/
I also put up the DeepSpeed checkpoint here: https://dall-3.com/models/dalle/
model settings:
I chose a size that would converge in a reasonable amount of time on 4x 3090s (i.e. months, not years), and this seemed to be about right.
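For reference, setting up a model this deep in dalle-pytorch looks roughly like the sketch below; the dim/heads/text settings are illustrative defaults from the upstream README, not necessarily this checkpoint's exact hyperparameters.

```python
import torch
from dalle_pytorch import DALLE, VQGanVAE

# pretrained VQGAN as the image tokenizer (dalle-pytorch also supports DiscreteVAE)
vae = VQGanVAE()

dalle = DALLE(
    dim = 1024,            # illustrative, not this checkpoint's exact value
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 32,            # the "32 layer" in the title
    heads = 16,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

# one dummy training step, following the upstream README
text = torch.randint(0, 10000, (1, 256))
images = torch.randn(1, 3, 256, 256)

loss = dalle(text, images, return_loss = True)
loss.backward()
```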
For the learning rate, I started with what I thought was a large value (1e-3) and halved it manually whenever the loss appeared to plateau, ending at 2e-5 after about a month. At this point I can't see any change in the loss even with 0.999 EMA smoothing on the loss curve, so I'm calling it done, but I suspect it's still under-trained.
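To be concrete about the smoothing: an EMA with decay 0.999 just low-passes the noisy per-step loss so a plateau becomes visible. A minimal sketch (the loss stream and the halving rule here are illustrative):

```python
# synthetic stand-in for the raw per-step training loss
step_losses = [3.0 * 0.99995 ** s for s in range(200000)]

def ema_update(ema, value, decay=0.999):
    """Exponential moving average: low-passes the noisy per-step loss."""
    return value if ema is None else decay * ema + (1 - decay) * value

ema_loss = None
for step, loss in enumerate(step_losses):
    ema_loss = ema_update(ema_loss, loss)
    # when ema_loss stops decreasing, halve the learning rate by hand, e.g.
    # for group in optimizer.param_groups:
    #     group["lr"] *= 0.5
```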
For the dataset, I used a filtered version of LAION-400M plus some images I scraped myself, for a total of 60 million images. I filtered the data to keep mostly photographic images (CLIP filtering is pretty noisy) and to remove images with watermarks.
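That kind of filtering is usually done as zero-shot CLIP classification, something like the sketch below (the prompt set and threshold are illustrative, not the ones actually used for this dataset):

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# illustrative label prompts; real filtering needs more careful prompt sets
labels = ["a photograph", "a drawing or illustration", "an image with a watermark"]
text = clip.tokenize(labels).to(device)

def keep_image(path, threshold=0.6):
    """Keep an image only if CLIP scores it as photographic."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    return probs[0].item() > threshold  # index 0 = "a photograph"
```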
There are sample images in the GitHub repo, but here are some failure cases I noticed.
"a green pentagonal clock" / "a green clock shaped like a pentagon"
It seems to have trouble with shapes, but to be fair even OpenAI's DALL-E has trouble with this.
"barak obama appears at a press conference"
I think a larger transformer would help it memorize specific people/places.
"the birth of sentient ai"
If the prompt can be interpreted as an article title or news release, the model tends to render text inside the image. The data might need more aggressive filtering.
Overall I think you really need a larger transformer to get the generalization capabilities of OpenAI's DALL-E, but this model works pretty well for common object classes.
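If you want to try prompts like these yourself, sampling follows the upstream dalle-pytorch README, roughly as below; the tokenizer call and filter_thres value are a sketch, so check the repo for the exact invocation:

```python
from dalle_pytorch.tokenizer import tokenizer

# `dalle` is the model from the config sketch above, with trained weights loaded
texts = ["a green clock shaped like a pentagon"]
text_tokens = tokenizer.tokenize(texts, context_length=256).cuda()

# filter_thres keeps only the top fraction of logits at each sampling step
images = dalle.generate_images(text_tokens, filter_thres=0.9)  # (1, 3, 256, 256)
```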
some notes on training:
Anyway, I'll probably keep training this whenever my training rig is idle.
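If you want to pick up training from the DeepSpeed checkpoint linked above, resuming uses the standard DeepSpeed engine API; a sketch with placeholder paths and config:

```python
import deepspeed

# `dalle` as in the config sketch above; ds_config.json holds the usual
# DeepSpeed settings (batch size, fp16, ZeRO stage, ...)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=dalle,
    model_parameters=dalle.parameters(),
    config="ds_config.json",  # placeholder path
)

# checkpoint directory downloaded from https://dall-3.com/models/dalle/
model_engine.load_checkpoint("./dalle_checkpoint")  # placeholder path

# ... continue the training loop with model_engine.backward(loss) / model_engine.step()
```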
Replies:
- very cool stuff!! I'm curious to see more examples, for example the ones from https://openai.com/blog/dall-e/