
Scale scraper to multiple threads #133

Open · 2 tasks
johndpjr opened this issue Apr 24, 2023 · 1 comment
Labels
backend (Deals with the FastAPI and web-scraping backend) · enhancement (Adds value to a previous feature) · scraping (Involves web scraping)

Comments

@johndpjr
Owner

Context

Right now, our scraper's speed is heavily limited since we are only using one thread! Allowing the scraper to scale to N threads will dramatically increase performance.

TODO

  • Add a CLI arg -n: an integer specifying the number of threads that the scraper scales to
  • Since a scraping job has a number of available companies to scrape, divide the companies among all threads (e.g. a scraper that needs to scrape c companies will spawn n threads and assign each of them c / n companies to divide the work evenly); see the sketch after this list
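
A minimal sketch of how this could fit together, assuming argparse for the flag and concurrent.futures for the pool; `scrape_company` and the `companies` list are hypothetical stand-ins for our real entry point and job data:

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def scrape_company(company: str) -> None:
    """Hypothetical per-company scrape entry point."""
    ...


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-n", type=int, default=1,
        help="number of threads the scraper scales to",
    )
    args = parser.parse_args()

    companies = ["company-a", "company-b"]  # placeholder: pulled from the scraping job

    # The executor hands each idle thread the next company, which spreads
    # the c companies across n threads (~c / n each) without manual chunking.
    with ThreadPoolExecutor(max_workers=args.n) as pool:
        pool.map(scrape_company, companies)


if __name__ == "__main__":
    main()
```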

Notes

Be careful of race conditions and the like here! Multi-threading adds a lot of additional complexity and bugs that one might initially overlook. Another idea for improving speed: have the scraper work on other sites while it waits for the crawl delay to expire on a different company's site (i.e. asynchronous requests); a rough sketch follows below.
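
One way the asynchronous idea could look, assuming asyncio with aiohttp as the HTTP client (not currently a project dependency); the site URLs, crawl delays, and page count are placeholders:

```python
import asyncio

import aiohttp

# Placeholder per-site crawl delays in seconds, e.g. parsed from robots.txt.
SITES = {
    "https://example-a.com/jobs": 5.0,
    "https://example-b.com/jobs": 10.0,
}


async def scrape_site(session: aiohttp.ClientSession, url: str, delay: float) -> None:
    for page in range(3):  # placeholder: paginate until results are exhausted
        async with session.get(url, params={"page": str(page)}) as resp:
            html = await resp.text()
        # ...parse html...
        # While this coroutine sleeps out its crawl delay, the event loop
        # is free to make progress on every other site.
        await asyncio.sleep(delay)


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(scrape_site(session, url, delay) for url, delay in SITES.items())
        )


asyncio.run(main())
```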

@johndpjr johndpjr added enhancement Adds value to a previous feature scraping Involves web scraping backend Deals with the FastAPI and web-scraping backend labels Apr 24, 2023
@johndpjr johndpjr added this to AgTern Apr 24, 2023
@github-project-automation github-project-automation bot moved this to Backlog in AgTern Apr 24, 2023
@JeremyEastham
Collaborator

Another note: I think the current logging system still isn't thread-safe? I believe that when I tried to run the GUI and the scraper at the same time, the logs from one process were either lost or combined with the logs from the other process into a garbled mess. Child processes should probably communicate with the parent process via an inter-thread/process logging queue (see the sketch below). Logs should include which process sent them (scraper, web server, db, etc.). The database should work well with multithreading/multiprocessing by default, but if any files are modified, that could be an issue.
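
A sketch of the queue-based approach using only the standard library's logging.handlers, assuming we go the multiprocessing route; workers push records onto a shared queue and only the parent writes them out, tagged with the sending process:

```python
import logging
import logging.handlers
import multiprocessing


def worker(log_queue):
    # Children send records to the queue instead of writing directly,
    # so output from concurrent workers can't interleave mid-line.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)
    logging.info("scraping...")


def main() -> None:
    log_queue = multiprocessing.Queue()
    # %(processName)s records which process (scraper, web server, ...) sent the log.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(processName)s %(levelname)s %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, handler)
    listener.start()

    procs = [
        multiprocessing.Process(target=worker, args=(log_queue,), name=f"scraper-{i}")
        for i in range(2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()


if __name__ == "__main__":
    main()
```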

Also, the Python APIs for multiprocessing and multithreading are nearly identical, but multiprocessing is more performant (no GIL contention). However, multiprocessing uses more memory because each process has a separate Python instance. Each thread/process will also need a separate Chrome, a notorious memory hog. There may be a way to give each process a different Chrome tab or window in the same instance, but coordinating these may still use a similar amount of memory while requiring much more process coordination. Our current server will probably need more RAM to facilitate multiple scrapers. Again, processes can easily be swapped for threads later if we decide to; see the sketch below.
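
One way to keep that swap trivial is concurrent.futures, where both pool types share the same Executor interface, so switching is roughly a one-line change; `scrape_company` is again a stand-in for the real entry point:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def scrape_company(company: str) -> str:
    return f"scraped {company}"  # stand-in for the real scrape logic


def run_all(executor_cls, companies):
    # Both executors implement the same interface, so callers don't change.
    with executor_cls(max_workers=2) as pool:
        return list(pool.map(scrape_company, companies))


if __name__ == "__main__":  # required for ProcessPoolExecutor on spawn platforms
    companies = ["company-a", "company-b"]
    print(run_all(ThreadPoolExecutor, companies))   # one interpreter, lighter on RAM
    print(run_all(ProcessPoolExecutor, companies))  # separate Python instances
```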

@johndpjr johndpjr added this to the 2024 Spring Semester milestone Mar 13, 2024
@johndpjr johndpjr modified the milestones: 2024 Spring Semester, 2024 Fall Semester Sep 9, 2024