
allow for complete wat instead of shards #10

Open
rvencu opened this issue Jul 12, 2021 · 3 comments

Comments

@rvencu
Contributor

rvencu commented Jul 12, 2021

Speaking with @rom1504 I learned that the dataset is ideally sized for model training if we can provide files containing an entire batch of 5,000-10,000 samples.

At the current deduplication level this is achieved by grouping 16 shards together, as I have already started to do. At the same time, the shard download takes 90-130 seconds and has become a significant part of the total job duration on the scraper. Downloading once and processing 2 shards in one go would cut this time in half.

I would like the server to send, whenever possible, 2 shards at once from the same WAT, like this:

client                                                                 server
client.newJob()
                                      get shard 0 if shard 1 is unavailable
                                      or
                                      get shard 1 if shard 0 is unavailable
                                      or
                                      get both shards if both are available
client.jobComplete(linktocomplete)
                                      server registers one or two jobs completed

That kind of information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single or a double shard.
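
Roughly, in a Python sketch, the flow I have in mind would look something like this (all names here are hypothetical, not the current client/server API):

from dataclasses import dataclass
from typing import List

@dataclass
class Shard:
    number: int   # 0 or 1 within the WAT
    wat_url: str  # URL of the source WAT file

def new_job(available: List[Shard]) -> List[Shard]:
    """Server side: hand out whichever shards of a WAT are still free - one or both."""
    # If both shards are free, send them together so the worker downloads
    # the WAT only once; otherwise send the single free shard.
    return available

def job_complete(shards: List[Shard]) -> None:
    """Worker side: report every processed shard so the server registers one or two completed jobs."""
    for shard in shards:
        print(f"marking shard {shard.number} of {shard.wat_url} as done")

# Example: both shards of the same WAT are still available.
wat = "https://commoncrawl.s3.amazonaws.com/crawl-data/.../example.warc.wat.gz"
job = new_job([Shard(0, wat), Shard(1, wat)])
# ...download the WAT once, process every shard in `job`...
job_complete(job)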

@TheoCoombes
Owner

Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). He reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed.

In summary, this is what the endpoints I made do:

  • The URL of the WAT he's passing is sent to the server via /custom/lookup-wat.
  • The endpoint returns the shard numbers, as well as the shard data for both shards in that WAT.
  • Cruncher compiles the images and creates the same files other workers generate.
  • Cruncher then marks the job as done via /custom/markasdone, which marks both shards as done.

I could definitely do something similar if this is what you'd like to achieve.
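
For reference, here is a rough sketch of how a worker might call those two endpoints. The paths are the ones described above; the request and response shapes are only illustrative, not the actual server contract:

import requests

SERVER = "http://localhost:8080"  # placeholder server address

def lookup_wat(wat_url: str) -> dict:
    # Ask the server which shard numbers and shard data belong to this WAT.
    r = requests.post(f"{SERVER}/custom/lookup-wat", json={"url": wat_url})
    r.raise_for_status()
    return r.json()  # assumed to include both shard numbers and their data

def mark_as_done(wat_url: str) -> None:
    # Mark both shards of this WAT as completed in a single call.
    r = requests.post(f"{SERVER}/custom/markasdone", json={"url": wat_url})
    r.raise_for_status()

wat_url = "https://commoncrawl.s3.amazonaws.com/crawl-data/.../example.warc.wat.gz"
info = lookup_wat(wat_url)
print(info)  # assumed to list the shard numbers contained in this WAT
# ...parse the WARC/WAT, build the same output files other workers generate...
mark_as_done(wat_url)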

@rvencu
Contributor Author

rvencu commented Jul 12, 2021

Yes, if I can get both shards into the worker and upload either combined or separate results, I do not mind reusing the same endpoint as Cruncher.

Though... the GPU should know which is which in order to mark them as done properly...

@TheoCoombes
Owner

Ah, it won't be using the same endpoints; however, I'll design custom ones that work directly with client instances.
