Scale scraper to multiple threads #133

johndpjr · 2023-04-24T18:17:41Z

Context

Right now, our scraper's speed is heavily limited because it only uses one thread! Allowing the scraper to scale to N threads should dramatically increase performance.

TODO

Add a CLI arg -n: an integer giving the number of threads that the scraper scales to.
Since a scraping job has a known set of companies to scrape, divide the companies amongst all threads (e.g. a scraper that needs to scrape c companies spawns n threads and assigns each of them roughly c / n companies, so the work is split evenly).
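A minimal sketch of the two TODO items, assuming the CLI uses argparse and the work is a plain list of company names. `parse_args`, `split_evenly`, `scrape_companies`, and `run` are hypothetical names for illustration, not functions from the AgTern codebase, and the scraping itself is a placeholder.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse the hypothetical -n flag (number of scraper threads)."""
    parser = argparse.ArgumentParser(description="Scraper CLI (sketch)")
    parser.add_argument("-n", type=int, default=1, dest="num_threads",
                        help="number of threads the scraper scales to")
    return parser.parse_args(argv)


def split_evenly(companies, n):
    """Split companies into n chunks whose sizes differ by at most one."""
    n = max(1, n)
    base, extra = divmod(len(companies), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional company each.
        size = base + (1 if i < extra else 0)
        chunks.append(companies[start:start + size])
        start += size
    return chunks


def scrape_companies(chunk):
    """Placeholder for the real per-company scraping logic."""
    return [f"scraped {name}" for name in chunk]


def run(companies, num_threads):
    """Give each thread one chunk and collect the results in order."""
    num_threads = max(1, num_threads)
    chunks = split_evenly(companies, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(scrape_companies, chunks)
    return [item for chunk_result in results for item in chunk_result]
```

Since c is rarely an exact multiple of n, the chunks here differ in size by at most one company rather than being exactly c / n each.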

Notes

Be careful of race conditions here! Multi-threading adds a lot of complexity and bugs that are easy to overlook at first. Another idea for improving speed: have a scraper scrape other sites while it waits for the crawl delay to expire on a given company site (i.e. asynchronous requests).
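A rough sketch of the asynchronous idea, assuming the requests could be made awaitable. The sketch uses asyncio with `asyncio.sleep` standing in for the crawl delay and no real HTTP client, so `scrape_site` and `scrape_all` are hypothetical. A browser-driven scraper would need a different mechanism to yield while it waits.

```python
import asyncio


async def scrape_site(site, pages, crawl_delay):
    """Scrape one site's pages, pausing crawl_delay seconds between requests."""
    results = []
    for page in pages:
        # Placeholder for a real asynchronous request to the site.
        results.append(f"{site}/{page}")
        # While this coroutine sleeps, the event loop works on other sites.
        await asyncio.sleep(crawl_delay)
    return results


async def scrape_all(sites, crawl_delay):
    """Run one coroutine per site so the crawl delays overlap."""
    tasks = [scrape_site(site, pages, crawl_delay)
             for site, pages in sites.items()]
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(scrape_all(sites, delay))` with a dict that maps each site to its pages returns one result list per site, in the same order as the dict. Each site still honors its own delay, but the delays overlap across sites instead of adding up.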


JeremyEastham · 2023-04-24T19:00:42Z

Another note: I think the current logging system still isn't thread-safe. When I tried to run the GUI and the scraper at the same time, I believe the logs from one process were either lost or interleaved with the logs from the other process into a garbled mess. Child processes should probably send their logs to the parent process through an inter-thread/inter-process logging queue. Logs should include which process sent them (scraper, web server, db, etc.). The database should work well with multithreading/multiprocessing by default. If any files are modified concurrently, that could be an issue.
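One way the logging queue could look, sketched with the standard library's `logging.handlers.QueueHandler` and `QueueListener`. The helper names (`make_worker_logger`, `start_listener`, `run_workers`) and the format string are assumptions for illustration, not the project's current logging setup. Workers only enqueue records, and a single listener writes them out, so lines from different workers cannot interleave mid-line.

```python
import logging
import logging.handlers
import queue
import threading


def make_worker_logger(name, log_queue):
    """Return a logger that only pushes records onto the shared queue."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:
        logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


def start_listener(log_queue, handler):
    """Start the single consumer that writes every queued record."""
    # %(name)s records which component (scraper, web server, db) logged.
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, handler)
    listener.start()
    return listener


def run_workers(names, log_queue):
    """Spawn one thread per name; each logs a single line via the queue."""
    def work(name):
        make_worker_logger(name, log_queue).info("started")

    threads = [threading.Thread(target=work, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

For separate processes, a `multiprocessing.Queue` can be passed in place of `queue.Queue`. The listener then runs in the parent process.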

Also, the Python APIs for multiprocessing and multithreading are nearly identical, but multiprocessing can be more performant for CPU-bound work because each process has its own interpreter and is not limited by the GIL. The trade-off is that multiprocessing uses more memory, since each process is a separate Python instance. Each thread/process will also need a separate Chrome, a notorious memory hog. There may be a way for each process to use a different Chrome tab or window in the same instance, but that may still use a similar amount of memory while requiring much more coordination between processes. Our current server will probably need more RAM to run multiple scrapers. Again, processes can easily be swapped for threads later if we decide to.
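To illustrate how small the swap can be, here is a sketch using `concurrent.futures`, where `ThreadPoolExecutor` and `ProcessPoolExecutor` share the same interface. `scrape_company` and `scrape_all` are hypothetical placeholders rather than the project's real functions.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def scrape_company(name):
    """Placeholder for scraping a single company."""
    return f"scraped {name}"


def scrape_all(companies, workers, executor_cls=ThreadPoolExecutor):
    """Run scrape_company over all companies with the chosen executor."""
    # Only executor_cls changes between threads and processes.
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(scrape_company, companies))
```

Passing `executor_cls=ProcessPoolExecutor` switches to processes. In that case the worker function and its arguments must be picklable, and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn new interpreters.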

johndpjr added the enhancement (Adds value to a previous feature), scraping (Involves web scraping), and backend (Deals with the FastAPI and web-scraping backend) labels Apr 24, 2023

johndpjr added this to AgTern Apr 24, 2023

github-project-automation bot moved this to Backlog in AgTern Apr 24, 2023

johndpjr added this to the 2024 Spring Semester milestone Mar 13, 2024

johndpjr modified the milestones: 2024 Spring Semester, 2024 Fall Semester Sep 9, 2024
