Low success rate on downloading laion400m #400
Comments
Did you set up knot resolver?
Please share a wandb link so we can see the error cause
…On Sat, Feb 3, 2024, 12:04 PM thomas chaton ***@***.***> wrote:
Hey there,
I have been trying to download laion400m using the scripts from an EC2
instance m5n.8xlarge and the success rate is quite poor.
I am getting a success rate of 10 images for 10k requests with the default
command in the README.
Any idea what I am doing wrong?
Best,
T.C
|
Oh interesting. I haven't. Let me try again. What's knot resolver ? |
I am getting errors when trying to install knot resolver too.
⚡ ~ wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
--2024-02-03 11:16:00-- https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
Resolving secure.nic.cz (secure.nic.cz)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘secure.nic.cz’
⚡ ~ sudo dpkg -i knot-resolver-release.deb
sudo: unable to resolve host ip-10-192-12-27: Temporary failure in name resolution
dpkg: error: cannot access archive 'knot-resolver-release.deb': No such file or directory |
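The `wget` and `sudo` failures above both report "Temporary failure in name resolution", so a quick check of whether hostnames resolve at all from inside the container can help separate DNS problems from other network problems. A minimal sketch using only the standard library; the helper name `resolution_failures` and its `resolver` parameter are made up for illustration:

```python
import socket


def resolution_failures(hosts, resolver=socket.getaddrinfo):
    """Return {host: error message} for every host that fails to resolve."""
    failures = {}
    for host in hosts:
        try:
            resolver(host, 443)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[host] = str(exc)
    return failures
```

Calling `resolution_failures(["secure.nic.cz", "the-eye.eu"])` in the same environment as the download should return an empty dict when DNS works; any entry it returns carries the resolver's error message.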
Here are the normal logs. It looks like wandb hit a network error (TransientError) and entered a retry loop.
⚡ ~ img2dataset --url_list the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ --input_format "parquet"\
> --url_col "URL" --caption_col "TEXT" --output_format webdataset\
> --output_folder laion400m-data --processes_count 32 --thread_count 128 --image_size 256\
> --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True
Starting the downloading of this file
Sharding file number 1 of 32 called /teamspace/studios/this_studio/the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
0it [00:00, ?it/s]File sharded in 1294 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
wandb: Currently logged in as: thomas-chaton. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.2
wandb: Run data is saved locally in /teamspace/studios/this_studio/wandb/run-20240203_111216-t4t3ohoz
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run woven-microwave-1
wandb: ⭐️ View project at https://wandb.ai/thomas-chaton/img2dataset
wandb: 🚀 View run at https://wandb.ai/thomas-chaton/img2dataset/runs/t4t3ohoz
wandb: Network error (TransientError), entering retry loop.
1it [04:07, 247.25s/it]worker - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
17it [04:11, 1.61s/it]wandb: Network error (TransientError), entering retry loop.
22it [04:14, 1.16it/s]wandb: Network error (TransientError), entering retry loop.
24it [04:15, 1.61it/s]worker - success: 0.008 - failed to download: 0.992 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 83 - count: 20000
worker - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 124 - count: 30000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 165 - count: 40000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 207 - count: 50000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 245 - count: 60000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 285 - count: 70000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 326 - count: 80000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 366 - count: 90000
worker - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 406 - count: 100000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 447 - count: 110000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 487 - count: 120000
worker - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 528 - count: 130000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 569 - count: 140000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 609 - count: 150000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 650 - count: 160000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 690 - count: 170000
worker - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 731 - count: 180000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 772 - count: 190000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 812 - count: 200000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 853 - count: 210000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 894 - count: 220000
worker - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 934 - count: 230000
worker - success: 0.002 - failed to download: 0.999 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 975 - count: 240000
28it [04:21, 1.13s/it]worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1004 - count: 250000
worker - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1038 - count: 260000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1071 - count: 270000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1111 - count: 280000
31it [04:26, 1.32s/it]worker - success: 0.008 - failed to download: 0.992 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1134 - count: 290000
worker - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1173 - count: 300000
worker - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1207 - count: 310000
32it [04:51, 8.13s/it]worker - success: 0.037 - failed to download: 0.963 - failed to resize: 0.000 - images per sec: 35 - count: 10000
total - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 1133 - count: 320000 |
I see you're using 32 processes and 128 threads per process.
That might be too much for the machine you're using; try decreasing them.
As for knot resolver, please follow their docs for instructions on how to
install it for your distribution.
|
Thanks @rom1504 The machine has 32 CPUs, so I thought it should be fine. I am running inside a docker container, so I am having some trouble installing knot resolver. I will keep you updated. |
I think you probably have more network issues than only the DNS if you have
99% failure
Maybe some misconfiguration of docker or the cloud provider?
|
Hey @rom1504 Any idea what I should be looking for on the docker or cloud provider side as a possible source of issues? Also, should I use knot or bind9? |
I advise you use knot, it's better for this use case.
For network issues, could be a lot of things but maybe a limit on the
number of open handles/files ?
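One way to check the open-handles hypothesis from inside the container is to read the process's file descriptor limit. A minimal sketch, assuming a Unix-like system where Python's `resource` module is available; the helper names are made up for illustration, and how many descriptors a download worker really needs is not something this thread establishes:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(needed, soft_limit):
    """True if the soft limit allows at least `needed` open descriptors."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= needed


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
# 128 is the --thread_count from the command above; treating one descriptor
# per thread as the requirement is only a rough assumption.
print("enough for 128 sockets per process:", has_headroom(128, soft))
```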
|
Thanks, @rom1504 I will check this out. I managed to install knot on the host but it isn't visible inside the container and networking seems broken. Have you ever tried? |
I am also curious what kind of numbers do you get without using knot resolver ? |
I haven't tried using docker for img2dataset, no. Maybe just use the host?
Usually I get 40-50 images/s/core. With 32 cores that would be 1300-1600
images/s.
Knot resolver does not change the speed; it increases the success rate.
It should be about 80% for laion400m.
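To make those numbers concrete, here is a small worked calculation. It assumes the "images per sec" figure in the logs counts attempted downloads, so the rate of images actually saved is that figure multiplied by the success rate; the function names are made up for illustration.

```python
def expected_throughput(cores, per_core_low=40, per_core_high=50):
    """Range of attempted downloads per second for a given core count."""
    return cores * per_core_low, cores * per_core_high


def successful_per_second(attempts_per_second, success_rate):
    """Images actually saved per second at a given success rate."""
    return attempts_per_second * success_rate


low, high = expected_throughput(32)
print(low, high)  # 1280 1600, quoted in the thread as roughly 1300-1600

# Last "total" line of the first run: 1133 images/s at a 0.006 success rate.
print(round(successful_per_second(1133, 0.006), 1))  # 6.8

# Same machine at the expected 80% success rate.
print(round(successful_per_second(low, 0.80)))  # 1024
```

So the raw request rate in the first run is already close to the expected range; it is the success rate that is off by two orders of magnitude.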
|
Hey @rom1504 I am trying to get it working on https://lightning.ai/, so it runs in docker. Yes, my success rate is far from this. So something is wrong. |
@rom1504 Here is the PR I am working on: Lightning-AI/pytorch-lightning#19400 and the API:
I am trying to make data processing efficient while keeping it easy to hack around.
Here is the example to download laion400m. It still needs some extra optimizations.

import os
from multiprocessing.pool import ThreadPool

from lightning.data import optimize
from lightning.data.processing.readers import ParquetReader
from lightning.data.processing.image import download_image
from PIL import Image

input_dir = "the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta"
parquet_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".parquet")]


def process(row):
    # Download and resize one image; return (row, None) on success or (None, error) on failure.
    image_id, url, text, height, width, image_license, nsfw, similarity = row
    img, err = download_image(url, 1, timeout=5)
    if err:
        return None, err
    try:
        # Open the downloaded image (img), not the URL string stored in row[1].
        return [image_id, Image.open(img).resize((224, 224)), text, image_license, nsfw, similarity], None
    except Exception as exc:
        # err is falsy here, so return the exception itself to mark the failure.
        return None, exc


class Fetcher:
    def __init__(self, max_threads=32):
        self.max_threads = max_threads

    def __call__(self, df):
        # Skip rows without an image id, download in parallel, and yield only the successes.
        rows = [list(row) for row in df.iter_rows() if row[0] is not None]
        with ThreadPool(self.max_threads) as thread_pool:
            for row, err in thread_pool.imap_unordered(process, rows):
                if err is not None:
                    continue
                yield row


optimize(
    fn=Fetcher(max_threads=16),
    inputs=parquet_files,
    output_dir="/teamspace/datasets/laion400m",
    num_workers=os.cpu_count(),
    reader=ParquetReader(num_rows=2048, to_pandas=False),
    chunk_bytes="64MB",
)

And the associated Streaming library I have been working on: https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries If this is ok, I will make a PR to add a Lightning Data writer to img2dataset. |
Ok. Curious if you reach similar speed as img2dataset and how you'd like to
handle distribution
…On Sat, Feb 3, 2024, 7:28 PM thomas chaton ***@***.***> wrote:
@rom1504 <https://github.com/rom1504> Here is the PR I am working on:
Lightning-AI/pytorch-lightning#19400
<Lightning-AI/pytorch-lightning#19400> and the
API:
I am trying to make data processing efficient while easy to hack around.
Here is the example to download laion400m. Still need some extra
optimizations.
import os
from multiprocessing.pool import ThreadPool
from lightning.data import optimize
from lightning.data.processing.readers import ParquetReader
from lightning.data.processing.image import download_image_with_retry
from lightning.data.processing.utilities import SuppressStdoutStderr
from PIL import Image
from time import sleep

input_dir = "the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta"
parquet_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".parquet")]


def process(row):
    image_id, url, text, height, width, image_license, nsfw, similarity = row
    img, err = download_image_with_retry(0, url, timeout=5)
    if err:
        return None, err
    try:
        return [image_id, img, text, image_license, nsfw, similarity], None
    except Exception:
        return None, err


class Fetcher:
    def __init__(self, max_threads=32):
        self.max_threads = max_threads
        self.stored = 0
        self.skipped = 0

    def __call__(self, df):
        print(self.skipped, self.stored)
        rows = [list(row) for row in df.iter_rows() if row[0] is not None]
        with ThreadPool(self.max_threads) as thread_pool:
            for row, err in thread_pool.imap_unordered(process, rows):
                if err is not None:
                    self.skipped += 1
                    continue
                if row[1] is not None:
                    try:
                        row[1] = Image.open(row[1]).resize((224, 224))
                    except:
                        self.skipped += 1
                        continue
                yield row
                self.stored += 1


optimize(
    fn=Fetcher(max_threads=16),
    inputs=parquet_files,
    output_dir="/teamspace/datasets/laion400m",
    num_workers=os.cpu_count(),
    reader=ParquetReader(num_rows=2048, to_pandas=False),
    chunk_bytes="64MB",
)
|
Image downloading speeds seemed quite similar between optimize and img2dataset, but I need to be more principled and collect the same metrics to make a proper comparison. First, though, I need to find out why the download speed and success rate are so low. The StreamingDataset is faster than Webdataset, though. You can try it yourself by duplicating my Studio: lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries. It contains everything: Python deps, code, data, etc. I am happy to get on a call to chat more about design and optimizations if you are interested. |
The distribution is already fully handled. Here is an example that tokenizes SlimPajama:

import json
from functools import partial
from pathlib import Path

import zstandard as zstd
from lightning.data import optimize
from lightning_sdk import Machine
from tokenizer import Tokenizer


# 1. Function to tokenize the text contained within the SlimPajama files
def tokenize_fn(filepath, tokenizer=None):
    with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
        for row in f:
            record = json.loads(row)  # parse each JSON line once
            if record["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                continue  # exclude the GitHub data since it overlaps with starcoder
            text_ids = tokenizer.encode(record["text"], bos=False, eos=True)
            yield text_ids


# 2. Generate the inputs (we are going to optimize all the compressed json files from the SlimPajama dataset)
input_dir = "/teamspace/studios/SlimPajama_Dataset/data/train"
inputs = [str(file) for file in Path(input_dir).rglob("*.jsonl.zst")]

# 3. Store the optimized data wherever you want under "/teamspace/datasets" or "/teamspace/s3_connections"
outputs = optimize(
    fn=partial(tokenize_fn, tokenizer=Tokenizer("./checkpoints/Llama-2-7b-hf")),  # Note: You can use HF tokenizer or any others
    inputs=inputs,
    output_dir="/teamspace/datasets/slimpajama/train/",
    chunk_size=(2049 * 8012),
    num_nodes=16,
    machine=Machine.DATA_PREP,  # use 32 CPU machine
)

This remotely processes the full dataset over 16 nodes and makes it processable by the StreamingDataset. Or this one to embed Wikipedia in 15 min: https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars |
Hey @rom1504 I am able to get 1.1k images/sec. I think I have a version of knot resolver that works. I am also using HTTP/2 through the httpx client, and I sorted the parquet files by URL to hopefully help DNS resolution slightly. But the success ratio is around 60%, so still quite far from yours. I will try img2dataset again. There is possibly something not well configured with docker. Best, |
Be careful with sorting the URLs, as you risk DoSing the hosts. I had randomly shuffled them in the laion datasets to mitigate this. Some people have recently had some success by calling knot with all unique domains to get its cache ready. Usually I didn't hit issues with DNS when using knot, though. Issues only happen in some environments with a restricted DNS setup.
You can log the errors to try and understand what the cause is. In img2dataset there is a wandb table for it.
Nice! How many cores are you using? |
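The two suggestions above (shuffle the URLs rather than sort them, and collect the unique domains so a resolver cache can be warmed) can be sketched with the standard library. This is only an illustration under stated assumptions: the helper names are made up, and how the domain list is then fed to knot is not covered here.

```python
import random
from urllib.parse import urlsplit


def unique_domains(urls):
    """Return the sorted set of hostnames that appear in the URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # strings without a scheme/host yield None and are skipped
            hosts.add(host)
    return sorted(hosts)


def shuffled(urls, seed=0):
    """Return a shuffled copy so consecutive requests rarely hit the same host."""
    urls = list(urls)
    random.Random(seed).shuffle(urls)
    return urls
```

Shuffling spreads requests to any one host across the whole run, which is the opposite of what sorting by URL does.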
Hey @rom1504,
Interesting. Yes, I didn't think of that. Good call!
This is a good idea. I will see if there is a simple way to add support for this.
I am capturing the errors and printing them. I will share what I am getting in a couple of hours.
I am using a 32 CPU machine, so slightly lower than what you told me to expect. I will try img2dataset again to get numbers. |
# main ones
- [Errno 101] Network is unreachable,
- [Errno 99] Cannot assign requested address
- [Errno -2] Name or service not known
# the rest
- [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'site.aimbulance.com'. (_ssl.c:997) |
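To see which of these errors dominates, the captured messages can be tallied by errno. A minimal sketch, assuming the errors are available as plain strings in the form printed above; `tally_errors` is a made-up helper name. Errno -2 comes from name resolution, while 101 and 99 are raised when opening the connection, so a large share of the latter two would point away from DNS alone.

```python
import re
from collections import Counter

ERRNO_PATTERN = re.compile(r"\[Errno (-?\d+)\]")


def tally_errors(messages):
    """Count error messages by errno; anything without an errno goes to 'other'."""
    counts = Counter()
    for message in messages:
        match = ERRNO_PATTERN.search(message)
        key = f"errno {match.group(1)}" if match else "other"
        counts[key] += 1
    return counts


sample = [
    "[Errno 101] Network is unreachable",
    "[Errno 99] Cannot assign requested address",
    "[Errno -2] Name or service not known",
    "[Errno 101] Network is unreachable",
    "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed",
]
for key, count in tally_errors(sample).most_common():
    print(key, count)
```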
Interestingly, the success ratio with img2dataset is much lower:
worker - success: 0.105 - failed to download: 0.894 - failed to resize: 0.001 - images per sec: 308 - count: 10000
total - success: 0.082 - failed to download: 0.918 - failed to resize: 0.000 - images per sec: 2979 - count: 96608
worker - success: 0.146 - failed to download: 0.854 - failed to resize: 0.000 - images per sec: 313 - count: 10000
total - success: 0.088 - failed to download: 0.912 - failed to resize: 0.000 - images per sec: 3288 - count: 106608
worker - success: 0.128 - failed to download: 0.872 - failed to resize: 0.000 - images per sec: 300 - count: 10000
total - success: 0.091 - failed to download: 0.909 - failed to resize: 0.000 - images per sec: 3497 - count: 116608
worker - success: 0.124 - failed to download: 0.876 - failed to resize: 0.000 - images per sec: 311 - count: 10000
total - success: 0.094 - failed to download: 0.906 - failed to resize: 0.000 - images per sec: 3797 - count: 126608
worker - success: 0.174 - failed to download: 0.825 - failed to resize: 0.001 - images per sec: 343 - count: 10000
total - success: 0.100 - failed to download: 0.900 - failed to resize: 0.000 - images per sec: 4097 - count: 136608
worker - success: 0.159 - failed to download: 0.840 - failed to resize: 0.001 - images per sec: 178 - count: 5536
total - success: 0.102 - failed to download: 0.898 - failed to resize: 0.000 - images per sec: 4263 - count: 142144
worker - success: 0.090 - failed to download: 0.909 - failed to resize: 0.001 - images per sec: 317 - count: 10000
total - success: 0.101 - failed to download: 0.898 - failed to resize: 0.000 - images per sec: 4563 - count: 152144
worker - success: 0.149 - failed to download: 0.851 - failed to resize: 0.000 - images per sec: 313 - count: 10000
total - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 4863 - count: 162144
worker - success: 0.082 - failed to download: 0.918 - failed to resize: 0.000 - images per sec: 305 - count: 10000
total - success: 0.103 - failed to download: 0.897 - failed to resize: 0.000 - images per sec: 5163 - count: 172144
worker - success: 0.120 - failed to download: 0.880 - failed to resize: 0.000 - images per sec: 304 - count: 10000
total - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 5463 - count: 182144
worker - success: 0.102 - failed to download: 0.897 - failed to resize: 0.001 - images per sec: 316 - count: 10000
total - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 5763 - count: 192144
worker - success: 0.099 - failed to download: 0.901 - failed to resize: 0.000 - images per sec: 305 - count: 10000
total - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 6063 - count: 202144
worker - success: 0.194 - failed to download: 0.806 - failed to resize: 0.000 - images per sec: 318 - count: 10000
total - success: 0.108 - failed to download: 0.892 - failed to resize: 0.000 - images per sec: 6363 - count: 212144
worker - success: 0.152 - failed to download: 0.848 - failed to resize: 0.000 - images per sec: 308 - count: 10000
total - success: 0.110 - failed to download: 0.890 - failed to resize: 0.000 - images per sec: 6644 - count: 222144
{
"count": 10000,
"successes": 900,
"failed_to_download": 9093,
"failed_to_resize": 7,
"duration": 31.51988196372986,
"start_time": 1707166867.7824914,
"end_time": 1707166899.3023734,
"status_dict": {
"<urlopen error [Errno -2] Name or service not known>": 996,
"<urlopen error [Errno -3] Temporary failure in name resolution>": 31,
"success": 900,
"Image decoding error": 7,
"HTTP Error 404: Not Found": 105,
"timed out": 1,
"HTTP Error 403: Forbidden": 23,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for '0.realtorpage.io'. (_ssl.c:997)>": 1,
"<urlopen error [Errno 99] Cannot assign requested address>": 7936
}
}``` |
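To see which error dominates, the `status_dict` from the stats file above can be grouped into a few buckets. This is only a sketch: the bucket names and the `bucket` helper are my own, not something img2dataset produces, and the counts are copied from the JSON above.

```python
from collections import Counter

# Counts copied from the status_dict of the stats file above.
status_dict = {
    "<urlopen error [Errno -2] Name or service not known>": 996,
    "<urlopen error [Errno -3] Temporary failure in name resolution>": 31,
    "success": 900,
    "Image decoding error": 7,
    "HTTP Error 404: Not Found": 105,
    "timed out": 1,
    "HTTP Error 403: Forbidden": 23,
    "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: "
    "Hostname mismatch, certificate is not valid for '0.realtorpage.io'. (_ssl.c:997)>": 1,
    "<urlopen error [Errno 99] Cannot assign requested address>": 7936,
}


def bucket(status):
    """Map a raw status string to a coarse category (hypothetical grouping)."""
    if status == "success":
        return "success"
    if "Errno 99" in status:
        return "cannot_assign_address"
    if "Errno -2" in status or "Errno -3" in status:
        return "dns"
    if status.startswith("HTTP Error"):
        return "http"
    return "other"


totals = Counter()
for status, count in status_dict.items():
    totals[bucket(status)] += count

count = sum(totals.values())
shares = {name: n / count for name, n in totals.items()}

# Print the buckets from most to least frequent.
for name, n in totals.most_common():
    print(f"{name}: {n} ({shares[name]:.1%})")
```

Grouped this way, `[Errno 99] Cannot assign requested address` accounts for most of the failures in this shard, well ahead of the DNS errors.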
Hey @rom1504 I found this interesting issue: pola-rs/polars#14358. I need to add profiling. But it seems you got around this by creating shards from the parquet files to optimize the distribution: https://github.com/rom1504/img2dataset/blob/main/img2dataset/reader.py#L189. This is a great idea. I am going to try this out. |
Hey @rom1504 I started a distributed Job on 32 nodes to download the dataset. This is my first test run. I will keep you updated. |
Sorry to bother you. Could you tell me how to download the laion400M dataset? I tried to download it with this command: img2dataset --url_list laion400m-meta --input_format "parquet" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder laion400m-data --processes_count 16 --thread_count 128 --image_size 256 --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True, but something went wrong. |
Hey @SomnusQue, here is the full blog post explaining how to download the dataset: lightning.ai/lightning-ai/studios/download-stream-400m-images-text~01hg0zg8fyybp7p1sma6g9dkzm. @rom1504 I would appreciate it if you could have a read and give me your thoughts. |
Hey there, @rom1504,
I have been trying to download laion400m using the scripts from an EC2 instance m5n.8xlarge and the success rate is quite poor.
I am getting a success rate of 10 images for 10k requests with the default command in the README.
Any idea what I am doing wrong?
Best,
T.C