
allow for complete wat instead of shards #10

Open
rvencu opened this issue Jul 12, 2021 · 3 comments

Comments

@rvencu
Contributor

rvencu commented Jul 12, 2021

Speaking with @rom1504 I learned that the dataset is ideally sized for model training if we can provide files containing an entire batch of 5,000-10,000 samples.

At the current deduplication level this is achieved by grouping 16 shards together, as I have already started to do. At the same time, the shard download takes 90-130 seconds and has become a significant part of the total job duration on the scraper. Downloading once and processing 2 shards in one go would cut this time in half.

I would like the server to send, whenever possible, 2 shards at once from the same WAT, like this:

client                                                                 server
client.newJob()
                                      get shard 0 if shard 1 is unavailable
                                      or
                                      get shard 1 if shard 0 is unavailable
                                      or
                                      get both shards if both are available
client.jobComplete(linktocomplete)
                                      server registers one or two jobs completed

That kind of information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single or a double shard.
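
Roughly, in a Python sketch, the flow I have in mind would look something like this (all names here are hypothetical, not the current client/server API):

from dataclasses import dataclass
from typing import List

@dataclass
class Shard:
    number: int   # 0 or 1 within the WAT
    wat_url: str  # URL of the source WAT file

def new_job(available: List[Shard]) -> List[Shard]:
    """Server side: hand out whichever shards of a WAT are still free - one or both."""
    # If both shards are free, send them together so the worker downloads
    # the WAT only once; otherwise send the single free shard.
    return available

def job_complete(shards: List[Shard]) -> None:
    """Worker side: report every processed shard so the server registers one or two completed jobs."""
    for shard in shards:
        print(f"marking shard {shard.number} of {shard.wat_url} as done")

# Example: both shards of the same WAT are still available.
wat = "https://commoncrawl.s3.amazonaws.com/crawl-data/.../example.warc.wat.gz"
job = new_job([Shard(0, wat), Shard(1, wat)])
# ...download the WAT once, process every shard in `job`...
job_complete(job)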

@TheoCoombes
Owner

Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). He reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed.

In summary, this is what the endpoints I made do:

  • The URL of the WAT he's passing is sent to the server via /custom/lookup-wat.
  • The endpoint returns the shard numbers, as well as the shard data for both shards in that WAT.
  • Cruncher compiles the images and creates the same files other workers generate.
  • Cruncher then marks the job as done via /custom/markasdone, which marks both shards as done.

I could definitely do something similar if this is what you'd like to achieve.
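
For reference, here is a rough sketch of how a worker might call those two endpoints. The paths are the ones described above; the request and response shapes are only illustrative, not the actual server contract:

import requests

SERVER = "http://localhost:8080"  # placeholder server address

def lookup_wat(wat_url: str) -> dict:
    # Ask the server which shard numbers and shard data belong to this WAT.
    r = requests.post(f"{SERVER}/custom/lookup-wat", json={"url": wat_url})
    r.raise_for_status()
    return r.json()  # assumed to include both shard numbers and their data

def mark_as_done(wat_url: str) -> None:
    # Mark both shards of this WAT as completed in a single call.
    r = requests.post(f"{SERVER}/custom/markasdone", json={"url": wat_url})
    r.raise_for_status()

wat_url = "https://commoncrawl.s3.amazonaws.com/crawl-data/.../example.warc.wat.gz"
info = lookup_wat(wat_url)
print(info)  # assumed to list the shard numbers contained in this WAT
# ...parse the WARC/WAT, build the same output files other workers generate...
mark_as_done(wat_url)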

@rvencu
Contributor Author

rvencu commented Jul 12, 2021

Yes, if I can get both shards into the worker and upload either combined or separate results, I do not mind reusing the same endpoint as Cruncher.

Though... the GPU should know which is which in order to mark them as done properly...

@TheoCoombes
Owner

Ah, it won't be using the same endpoints; however, I'll design custom ones that work directly with client instances.
