allow for complete wat instead of shards #10
Comments
Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). He reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed. In summary, this is what the endpoints I made do:
I could definitely do something similar if this is what you'd like to achieve.
Yes. If I can get both shards into the worker and upload either combined or separate results, I do not mind reusing the same endpoint as Cruncher. The GPU should know which is which, though, in order to mark them as done properly.
Ah, it won't be using the same endpoints. Instead, I'll design custom ones that work directly with client instances.
Speaking with @rom1504, I learned that the dataset is ideally sized for model training if we can provide files that each contain an entire batch of 5000-10000 samples.
At this deduplication level, this is achieved by grouping 16 shards together, as I have already started doing. At the same time, the shard download takes 90-130 seconds and has become a significant part of the entire job duration at the scraper. Doing a single download and processing 2 shards in one go would cut this time in half.
I would like the server to send, whenever possible, 2 shards at once from the same wat like this:
That information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single or a double shard.
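To make the request more concrete, here is a rough sketch of the kind of job record I have in mind. It is only an illustration: the `Job` class, its field names, and the helper functions are hypothetical and do not correspond to any existing crawling@home endpoint or client API. The idea is that one record names the WAT once and lists the shards it covers, so both the scraper and the GPU node can tell a single job from a double one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    # Hypothetical fields, not part of any existing API.
    wat_url: str              # WAT file that is downloaded once per job
    shards: Tuple[int, ...]   # shard indices covered, e.g. (0,) or (0, 1)

    @property
    def is_double(self) -> bool:
        """True when the job carries two shards from the same WAT."""
        return len(self.shards) == 2


def shards_to_mark_done(job: Job) -> List[Tuple[str, int]]:
    """List every (wat_url, shard) pair the GPU node should mark as done."""
    return [(job.wat_url, shard) for shard in job.shards]


def download_seconds_per_shard(download_seconds: float, job: Job) -> float:
    """Spread the cost of one WAT download over the shards it serves."""
    return download_seconds / len(job.shards)


if __name__ == "__main__":
    job = Job("https://example.invalid/segment.wat.gz", (0, 1))
    print(job.is_double)
    print(shards_to_mark_done(job))
    print(download_seconds_per_shard(130.0, job))
```

With a record like this, a double job pays for the 90-130 second download once and spreads it over two shards, and the GPU node marks each listed shard as done instead of guessing from the data.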