Hello,
I didn't see anything in the docs on deduplication, so I thought I'd ask.
I do see that it has a resume-type feature, so that is in one sense deduplication.
But what if I start running it on a large input file, kill it halfway through the run, edit the file to weed out a bunch of URLs, and then restart it? Will it get confused, or just start from the top and still know which ones it has already done?
Related to that: if I download one large internet data set, then start downloading another one from a different input file but specify the same destination directory, will it dedup based on URL?
Img2dataset doesn't perform any dedup and assumes you give it exactly the same dataset on resume.
If you want something smarter than that, I advise using the generated parquet files in the output folder to perform joins with your new data (I'd advise using pyspark) and then giving the output to img2dataset.
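The join the answer describes amounts to an anti-join: drop from the new URL list anything already recorded in the parquet metadata of a previous run. Here is a minimal sketch using pandas (pyspark's `DataFrame.join` with `how="left_anti"` works the same way at larger scale). The in-memory frames and the `url` column name stand in for the real parquet files img2dataset writes; treat them as assumptions, not the tool's exact schema.

```python
import pandas as pd

# Hypothetical stand-ins: "done" represents URLs read from the parquet
# files of a previous img2dataset run, "new" is the next input list.
done = pd.DataFrame({"url": ["http://a/1.jpg", "http://b/2.jpg"]})
new = pd.DataFrame({"url": ["http://b/2.jpg", "http://c/3.jpg"]})

# Anti-join: keep only rows of `new` whose URL is absent from `done`.
merged = new.merge(done, on="url", how="left", indicator=True)
todo = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(todo["url"].tolist())  # → ['http://c/3.jpg']
```

The surviving URLs (`todo`) are what you would write back out and feed to img2dataset as the deduplicated input.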