Hello,
I didn't see anything in the docs on deduplication, so I thought I'd ask.
I do see that it has a resume-type feature, so that is in one sense deduplication.
But what if I start running it on a large input file, kill it halfway through the run, edit the file to weed out a bunch of URLs, and then restart it? Will it get confused, or just start from the top and still know which ones it has already done?
Related to that: if I download one large internet data set, then start downloading another one from a different input file but specify the same destination directory, will it dedup based on URL?
Img2dataset doesn't perform any dedup and assumes you give it exactly the same dataset on resume.
If you want something smarter than that, I advise using the generated parquet files in the output folder to perform joins with your new data (I'd advise using pyspark) and then giving the output to img2dataset.
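The join the answer describes amounts to an anti-join: drop from the new URL list anything already recorded in the parquet metadata of a previous run. Here is a minimal sketch using pandas (pyspark's `DataFrame.join` with `how="left_anti"` works the same way at larger scale). The in-memory frames and the `url` column name stand in for the real parquet files img2dataset writes; treat them as assumptions, not the tool's exact schema.

```python
import pandas as pd

# Hypothetical stand-ins: "done" represents URLs read from the parquet
# files of a previous img2dataset run, "new" is the next input list.
done = pd.DataFrame({"url": ["http://a/1.jpg", "http://b/2.jpg"]})
new = pd.DataFrame({"url": ["http://b/2.jpg", "http://c/3.jpg"]})

# Anti-join: keep only rows of `new` whose URL is absent from `done`.
merged = new.merge(done, on="url", how="left", indicator=True)
todo = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(todo["url"].tolist())  # → ['http://c/3.jpg']
```

The surviving URLs (`todo`) are what you would write back out and feed to img2dataset as the deduplicated input.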