
Last few examples are taking long time to scrape #3

Open
mipo57 opened this issue Nov 24, 2021 · 0 comments


mipo57 commented Nov 24, 2021

Currently, every thread has its own tor instance. Some instances do not work well, so they are replaced until a working one is found. Usually, if the request pool is not large (<10k URLs), a few instances fail to find a good tor route in time but still take tasks from the queue.

Currently, at the end of processing, we kill the processes with good tor routes (because there is nothing left for them to do) and leave the bad instances running (because they are still struggling to find a route after having taken a task from the queue). We should probably redesign the code to separate the tor pool from the task queue, so that we always prioritize processing with good tor instances over bad ones.
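
A rough sketch of one way the separation could look, not the project's actual code. The idea is that a worker claims a task only after its route is ready, and a worker still searching for a route gives up as soon as the queue is drained. `run_pool`, `connect_delays`, and `fetch` are hypothetical names; `connect_delays` just simulates how long each tor instance needs to find a good route. Re-queueing a task when a route goes bad mid-run is left out.

```python
import queue
import threading


def run_pool(urls, connect_delays, fetch):
    """Process urls with workers that take a task only once their route is ready.

    connect_delays: hypothetical per-worker time (seconds) needed to find a
        good route; a stand-in for starting and validating a real tor instance.
    fetch: callable(worker_id, url) -> result.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = []
    results_lock = threading.Lock()
    drained = threading.Event()  # set once a worker sees the queue empty

    def worker(worker_id, delay):
        # Simulated route search. No task is held during this phase, and the
        # wait ends early if the other workers have already emptied the queue.
        if drained.wait(timeout=delay):
            return
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                drained.set()  # tell workers still searching to stop
                return
            result = fetch(worker_id, url)
            with results_lock:
                results.append((worker_id, url, result))

    threads = [
        threading.Thread(target=worker, args=(i, delay))
        for i, delay in enumerate(connect_delays)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_pool(urls, [0.0, 0.0, 5.0], fetch)` simulates two good instances and one slow one: the two good workers drain the queue, and the slow one exits without ever holding a task, so the run does not wait for it.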
