
Crawler getting stuck (lots of "Waiting for links from pages being downloaded" msgs) #186

Open
dconnx opened this issue Aug 6, 2019 · 2 comments


dconnx commented Aug 6, 2019

After a while (maybe 5 hours) ACHE stops crawling and prints lots of "Waiting for links from pages being downloaded..." messages before eventually dying with a stack trace (which I didn't wait around to capture this time - aargh - next time...).

I'm experimenting with Ache. I've written my own Geo link classifier (implementing LinkClassifier) to try to keep crawling to within a single country and I'm writing records to ElasticSearch.

I've played around with -Xms and -Xmx (both 3g now) and -XX:MaxDirectMemorySize (0.5g now) to give it enough memory to run stably.

So after about 5 hours (I think - whenever I screen back into the running window) I see lots of messages (pasted below) and then eventually it dies with a stack trace. Up until that point it is crawling away, fetching links and determining link relevance for the next searches.

Has anyone seen this before ?

I'll wait until next time and check out the stack trace.

thanks,

Derek

After a while it stops crawling dead and then prints lots (hundreds) of messages like this:

[2019-08-06 16:51:01,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:02,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:03,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:04,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:05,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:06,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:07,718] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:08,718] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:09,719] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:10,719] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...


dconnx commented Aug 7, 2019

Ok - waited for it to hang again.

After a lot of the "Waiting for links from pages being downloaded..." messages (I'm not sure exactly how many because my screen history fills up) I get multiple messages saying "ERROR [dispatcher-0] (FetchedResultHandler.java:60) - Problem while processing data." followed by a stack trace (pasted below). Finally everything stops with java.lang.OutOfMemoryError exceptions on multiple threads (also pasted below).

  1. Lots and lots of the "INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded..." messages

  2. Eventually lots of these errors:

[2019-08-07 01:35:53,357]ERROR [dispatcher-0] (FetchedResultHandler.java:60) - Problem while processing data.
java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
at org.apache.http.util.Asserts.check(Asserts.java:46)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
at org.elasticsearch.client.RestClient.performRequestAsync(RestClient.java:343)
at org.elasticsearch.client.RestClient.performRequestAsync(RestClient.java:325)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:191)
at focusedCrawler.target.repository.ElasticSearchRestTargetRepository.insert(ElasticSearchRestTargetRepository.java:164)
at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:86)
at focusedCrawler.crawler.async.FetchedResultHandler.processData(FetchedResultHandler.java:57)
at focusedCrawler.crawler.async.FetchedResultHandler.completed(FetchedResultHandler.java:31)
at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.doHandle(HttpDownloader.java:389)
at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.run(HttpDownloader.java:373)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

  3. Finally these errors:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AsyncCrawler"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-11"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-23"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "FrontierLinkLoader"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.eclipse.jetty.server.session.HashSessionManager@31b5ccceTimer"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-24"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-21"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-20"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-22"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-19"

  4. No more output after this, just a hung terminal (I have to kill -9 the ACHE Java PID, as Ctrl-C doesn't do it).


aecio commented Aug 7, 2019

Thanks for your detailed log messages. I'll try to explain a bit about how ACHE works, which will hopefully help you debug the problem.

ACHE runs a few different threads (or thread pools) to execute different types of operations. There is one main crawler thread (AsyncCrawler) which continuously polls the frontier (the database of links) for links that need to be crawled and feeds them to a pool of downloader threads (downloader-%d), which download the links concurrently. The downloaded pages are then passed along to another thread pool (dispatcher-%d) for parsing, link extraction, and data storing/indexing.

The message "Waiting for links from pages being downloaded..." is printed when the main crawler thread requests links to download, but there are no more uncrawled links in the frontier AND there are still links currently being downloaded (in the downloader threads) or processed (in the dispatcher threads). If there are links being downloaded or processed, the main crawler needs to wait because new links can still be found in those pages. The main thread prints this message right before sleeping for one second, after which it requests links to crawl again. When there are no links being downloaded/processed and there are no more links in the frontier, the crawler can safely shut down because there is nothing left to crawl.
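In pseudocode, the main loop behaves roughly like this (a simplified, self-contained sketch; the class and field names below are made up for illustration and are not ACHE's actual code):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the main crawler loop described above.
// The class and field names here are hypothetical, NOT ACHE's actual code.
public class CrawlerLoopSketch {

    private final Queue<String> frontier = new ArrayDeque<>();   // uncrawled links
    private final AtomicInteger inFlight = new AtomicInteger();  // links being downloaded/processed

    public void run() throws InterruptedException {
        while (true) {
            String link = frontier.poll();
            if (link != null) {
                inFlight.incrementAndGet();
                // hand the link off to the downloader thread pool here; when the
                // download and dispatch finish, inFlight is decremented and any
                // newly extracted links are added back to the frontier
            } else if (inFlight.get() > 0) {
                // No uncrawled links right now, but pages still in flight may
                // yield new links, so wait and poll the frontier again.
                System.out.println("Waiting for links from pages being downloaded...");
                Thread.sleep(1000); // sleep one second before retrying
            } else {
                break; // frontier empty and nothing in flight: safe to shut down
            }
        }
    }
}
```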

Coming back to your original problem, the first exception seems to be occurring while indexing pages into Elasticsearch, more specifically, within Elasticsearch's client library in the method ElasticSearchRestTargetRepository.insert(). Here is my guess at what may be happening: the crawler might be downloading pages faster than Elasticsearch can index them, which causes the Elasticsearch client library to throw this exception. (Are you observing high load on the Elasticsearch cluster?) The current implementation of ElasticSearchRestTargetRepository does not throttle indexing requests, so pages will continue to be sent to Elasticsearch even when it is failing to index them. Because indexing is slow, downloaded links may be accumulating in an in-memory queue. While there is a maximum size for this queue, you also seem to be giving the crawler a fairly small amount of memory (based on your JVM parameters). Something might be causing thrashing in the JVM, which eventually throws the out-of-memory errors.

I would try the following things:

  • If you are observing high load on Elasticsearch, try to increase the capacity of the Elasticsearch cluster by adding more nodes.
  • If possible, increase the max memory available to the crawler.
  • There are some configuration settings to control parallelism in ACHE. You could, for example, try reducing the number of downloader threads, which might slow down the crawler and reduce the load on Elasticsearch. Try setting the configuration key crawler_manager.downloader.download_thread_pool_size in the ache.yml file to a smaller value (it defaults to 100), as in the snippet below.
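Something like this in your ache.yml should do it (the value 10 here is just an example; tune it to whatever your Elasticsearch cluster can keep up with):

```yaml
# ache.yml
# Reduce the downloader thread pool from its default of 100 so the crawler
# doesn't produce pages faster than Elasticsearch can index them.
crawler_manager.downloader.download_thread_pool_size: 10
```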
