Crawler getting stuck (lots of "Still waiting to process downloaded pages..." msgs) #199

Open
Stanxy opened this issue Apr 11, 2021 · 7 comments

Stanxy commented Apr 11, 2021

After a while (maybe half an hour), Ache stops crawling and prints lots of "Still waiting to process downloaded pages..." messages. I checked the load on all CPUs with htop and found no busy worker.

I'm experimenting with Ache. I wrote the Ache config as described in the guide and use the config file at ./config/config__website_crawl/ache.yml. I changed only two properties:

target_storage.visited_page_limit: 50

crawler_manager.downloader.download_thread_pool_size: 4

I've played around with -XX:+UseG1GC and -Xmx4g to give my project enough memory. The crawler runs on a Unix server that limits each user to a maximum of 20 processes.
My JDK version:
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
JVM:
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

So after about half an hour (I think; it is whenever I reattach to the screen session) I see lots of messages (pasted below), and the crawler seems to be trapped in an infinite loop.

[2021-04-11 04:03:02,565] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:07,613] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:12,661] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:17,708] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:22,757] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:27,805] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:32,853] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:37,901] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:42,948] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:47,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:53,045] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:58,092] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:03,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:08,187] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:13,236] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:18,284] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:23,332] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:28,379] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:33,426] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:38,474] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:43,522] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:48,568] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:53,616] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:58,664] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:03,711] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:08,759] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:13,806] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:18,854] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:23,902] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:28,949] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:33,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:39,043] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:44,091] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:49,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:54,186] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...

I tried to send SIGINT with Ctrl+C but got an OOM error:

^C^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:04,281] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:09,329] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:06:14,376] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...

Has anyone seen this before?

Thanks,

Stan

Member

aecio commented Apr 12, 2021

This is very similar to the problem reported in issue #186. Is 4g the maximum you can use, or does the same problem also happen when using more memory?

Author

Stanxy commented Apr 14, 2021

> This is very similar to the problem reported in issue #186. Is 4g the maximum you can use, or does the same problem also happen when using more memory?

The same problem happens when using 32g of memory, so I guess this is not a memory issue.

Author

Stanxy commented Apr 14, 2021

I checked issue #186 but found no solution there, since I am not using ElasticSearch. I suspect this is caused by the limit on threads, but I can't find a config option that controls the number of threads used for page dispatch.

Member

aecio commented Apr 14, 2021

The number of threads is currently hard-coded to be the same as the number of CPU cores here:

this.distpatchThreadPool = new ThreadPoolExecutor(CPU_CORES, CPU_CORES, 0L,
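
As a rough illustration of what a configurable pool size could look like, here is a minimal sketch. It is not ACHE code: the class name, the `dispatch.pool.size` system property, and the queue type are all assumptions made for the example, and the fallback simply mirrors the "same as CPU cores" behavior described above.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DispatchPoolSketch {

    // Hypothetical property name, invented for this sketch; ACHE does not define it.
    static final String POOL_SIZE_PROPERTY = "dispatch.pool.size";

    /** Returns the configured size if it parses to a positive integer, else the fallback. */
    static int resolvePoolSize(String configured, int fallback) {
        if (configured == null) {
            return fallback;
        }
        try {
            int n = Integer.parseInt(configured.trim());
            return n > 0 ? n : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Builds a fixed-size pool; the unbounded queue is an assumption of this sketch. */
    static ThreadPoolExecutor newDispatchPool(int size) {
        return new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    public static void main(String[] args) {
        // Fall back to the number of cores the JVM reports when nothing is configured.
        int cores = Runtime.getRuntime().availableProcessors();
        int size = resolvePoolSize(System.getProperty(POOL_SIZE_PROPERTY), cores);
        ThreadPoolExecutor pool = newDispatchPool(size);
        System.out.println("dispatch pool size = " + pool.getCorePoolSize());
        pool.shutdown();
    }
}
```

Running it with `-Ddispatch.pool.size=4` would cap the pool at four threads, which may matter on a host with a per-user process limit.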

Author

Stanxy commented Apr 16, 2021

I tried changing the number of cores to 4, but no matter how I set this value, the number of dispatcher threads remains 12.

Author

Stanxy commented Apr 16, 2021

Also, I ran the original app in a local environment, and the crawler still gets stuck with a pile of "Still waiting to process downloaded pages..." messages, and there's no way to interrupt the process with Ctrl+C. So I guess there is some other issue that is not related to the constraint on the number of threads.

@Stanxy Stanxy changed the title Crawler getting stuck (lots of "Waiting for links from pages being downloaded" msgs) Crawler getting stuck (lots of "Still waiting to process downloaded pages..." msgs) Apr 16, 2021
Member

aecio commented Apr 18, 2021

> I tried changing the number of cores to 4, but no matter how I set this value, the number of dispatcher threads remains 12.

I wouldn't expect this to happen. Maybe something happened that caused you to use the old binaries before changing and recompiling the code? How are you finding the number of threads? I typically use VisualVM for debugging memory/threading issues like this.
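
If attaching VisualVM is not convenient, one way to count threads from inside the JVM is sketched below. This is only an illustration: the class and method names are made up, it has to run inside the crawler process to see the crawler's threads, and the grouping assumes pool threads are named with a trailing `-<number>` suffix.

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadCountSketch {

    /** Strips a trailing "-<digits>" so that threads from one pool share a group name. */
    static String groupName(String threadName) {
        return threadName.replaceAll("-\\d+$", "");
    }

    /** Counts the live threads of this JVM, grouped by name prefix. */
    static Map<String, Integer> countByGroup() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupName(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The core count the JVM reports, for comparison with the pool sizes.
        System.out.println("available processors = "
                + Runtime.getRuntime().availableProcessors());
        countByGroup().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Comparing the reported processor count with the per-group counts may show whether a recompiled pool size actually took effect.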

> Also, I ran the original app in a local environment, and the crawler still gets stuck with a pile of "Still waiting to process downloaded pages..." messages, and there's no way to interrupt the process with Ctrl+C. So I guess there is some other issue that is not related to the constraint on the number of threads.

I haven't seen any OOM errors happening in a while, but I think this is possible. How long does it take until this problem happens? Is this a large crawl? Can you create a JVM thread dump and a heap dump when this problem happens? Seeing what is using the most memory might help to figure out the problem.
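
For reference, the JDK tools `jstack <pid>` and `jmap -dump:live,format=b,file=heap.hprof <pid>` can usually produce these dumps from outside the process. The sketch below shows one way to produce them from inside a HotSpot JVM using the standard management beans. The class and method names are hypothetical, and the code would have to run inside the crawler process to be useful.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpSketch {

    /** Returns a text dump of all live threads (ThreadInfo.toString() shortens deep stacks). */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Ask for lock details only where the JVM supports them.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    /** Writes a heap dump of live objects; the target file must not exist yet. */
    static void heapDump(String path) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(path, true); // true = dump only reachable objects
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(threadDump());
        if (args.length > 0) {
            heapDump(args[0]); // e.g. a path ending in .hprof
        }
    }
}
```

The resulting `.hprof` file can then be opened in VisualVM to see which objects hold the most memory.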

@aecio aecio added the bug label May 28, 2022