Training on Google's Image Captioning dataset (~3.3 million images) #125
TheoCoombes started this conversation in Show and tell
Replies: 2 comments 5 replies
-
Hello TheoCoombes! I was about to do exactly the same thing. If you could share your parameters and how you managed to integrate this, I would be very glad.
-
Any update on this? Excited to see how things went.
-
Hi guys,
I've just finished playing around with train_dalle.py and have integrated Google's Conceptual Captions dataset into DALLE-pytorch. It uses the requests module to stream the images from their URLs instead of downloading all ~3.3 million of them (I don't have the disk space for that lmao). While the dataset was designed for automatic image captioning rather than text-to-image generation, I believe it still has a major use case here. I will share the results as well as my code once the initial training session has finished.
(p.s. I am very new to ML so if my code is a mess then that's probably why haha)