Crawler getting stuck (lots of "Still waiting to process downloaded pages..." msgs) #199
Comments
This is very similar to the problem reported in issue #186. Is 4g the maximum you can use, or does the same problem also happen with more memory?
The same problem happens with 32g of memory, so I guess this is not a memory issue.
I checked issue #186 but found no solution, since I am not using ElasticSearch. I suspect this is caused by a limit on threads, but I found no config item to control the number of threads used for page dispatch.
The number of threads is currently hard-coded to be the same as the number of CPU cores here:
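The linked source is not reproduced in this thread. As a hedged illustration of the general pattern only (this is not ACHE's actual code; the class and method names are hypothetical), a pool sized to the number of CPU cores typically looks like this in Java:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizeSketch {

    // Hypothetical helper: the size a "one thread per core" policy would pick.
    // availableProcessors() reports the cores visible to the JVM, not a configured value.
    static int defaultPoolSize() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = defaultPoolSize();
        System.out.println("availableProcessors = " + cores);

        // Fixed pool with one worker per core (assumed pattern, for illustration).
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            final int id = i;
            pool.submit(() ->
                System.out.println("task " + id + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With a pattern like this, the thread count follows the machine rather than the config file, so changing it means editing the source and rebuilding.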
I tried changing the number of cores to 4, but no matter how I set this value, the number of dispatchers remains 12.
Also, when I run the original app in a local environment, the crawler still gets stuck with a pile of "Still waiting to process downloaded pages..." messages, and there is no way to send a signal with Ctrl+C to interrupt the process. So I guess there is some other issue that is not related to the constraint on the number of threads.
I wouldn't expect this to happen. Maybe something happened that caused you to use the old binaries before changing and recompiling the code? How are you finding the number of threads? I typically use VisualVM for debugging memory/threading issues like this.
I haven't seen any OOM errors happening in a while, but I think this is possible. How long does it take until this problem happens? Is this a large crawl? Can you create a JVM thread dump and a heap dump when this problem happens? Seeing what is using the most memory might help to figure out the problem.
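For a running crawler, the JDK's `jstack <pid>` and `jmap -dump:format=b,file=heap.hprof <pid>` tools are the usual way to capture these dumps from outside the process. As a rough in-process complement, the sketch below counts live threads by state using only the standard library. The class name is hypothetical, and it only sees threads of the JVM it runs in, so it would have to be called from inside the application to be useful.

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {

    // Counts the live threads of the current JVM, grouped by Thread.State.
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = countByState();
        int total = 0;
        for (Map.Entry<Thread.State, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
            total += e.getValue();
        }
        System.out.println("total live threads: " + total);
    }
}
```

Many threads sitting in WAITING or BLOCKED while none are RUNNABLE would be consistent with idle CPUs, though a full thread dump is still needed to see what each thread is waiting on.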
After a while (maybe half an hour), Ache stops crawling and prints lots of "Still waiting to process downloaded pages..." messages. I checked the load of all CPUs with htop and found that no worker is busy.
I'm experimenting with Ache. I wrote the configuration as described in the guide, using the config file in ./config/config__website_crawl/ache.yml. I changed only two properties:
target_storage.visited_page_limit: 50
crawler_manager.downloader.download_thread_pool_size: 4
I've played around with -XX:+UseG1GC and -Xmx4g to get enough capacity for my project. Also, the runtime environment is a Unix server that limits each user to a maximum of 20 processes.
My jdk version:
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
JVM:
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
So after about half an hour (I think; I only see it whenever I switch back into the screen session), I see lots of messages (pasted below), and the crawler seems to be trapped in an infinite loop.
[2021-04-11 04:03:02,565] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:07,613] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:12,661] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:17,708] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:22,757] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:27,805] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:32,853] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:37,901] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:42,948] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:47,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:53,045] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:58,092] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:03,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:08,187] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:13,236] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:18,284] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:23,332] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:28,379] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:33,426] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:38,474] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:43,522] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:48,568] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:53,616] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:58,664] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:03,711] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:08,759] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:13,806] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:18,854] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:23,902] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:28,949] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:33,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:39,043] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:44,091] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:49,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:54,186] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
I tried to use Ctrl+C to send SIGINT but got an OOM error:
^C^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:04,281] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:09,329] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:06:14,376] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
Has anyone seen this before?
Thanks,
Stan