Crawler getting stuck (lots of "Waiting for links from pages being downloaded" msgs) #186
Comments
Ok - waited for it to hang again. After a lot of the "Waiting for links from pages being downloaded..." messages (I'm not sure how many exactly because my screen history fills up) I then get multiple messages saying "ERROR [dispatcher-0] (FetchedResultHandler.java:60) - Problem while processing data." and a stack trace (I'll paste below). Finally everything stops with "java.lang.OutOfMemoryError" exceptions on multiple threads (I'll paste below too).
[2019-08-07 01:35:53,357]ERROR [dispatcher-0] (FetchedResultHandler.java:60) - Problem while processing data.
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AsyncCrawler"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-11"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-23"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "FrontierLinkLoader"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.eclipse.jetty.server.session.HashSessionManager@31b5ccceTimer"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-24"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-21"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-20"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-22"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp1862115510-19"
Thanks for your detailed log messages. I'll try to explain a bit how ACHE works, which hopefully will help you to debug the problem. ACHE runs a few different threads (or thread pools) to execute different types of operations. There is one main crawler thread ( The message "Waiting for links from pages being downloaded..." happens when the main crawler thread requests links to download, but there are no more uncrawled links in the frontier AND there are still links currently being downloaded (in the Coming back to your original problem, the first exception seems to be occurring while indexing pages in Elasticsearch, more specifically, within Elasticsearch's client library in the method I would try the following things:
After a while (maybe 5 hours) ACHE stops crawling and gives lots of "Waiting for links from pages being downloaded..." messages before eventually dying with a stack trace error (which I didn't wait for a few minutes ago - aargh - next time....).
I'm experimenting with ACHE. I've written my own Geo link classifier (implementing LinkClassifier) to try to keep crawling to within a single country and I'm writing records to Elasticsearch.
I've played around with -Xms and -Xmx (both 3g now) and -XX:MaxDirectMemorySize (0.5g now) to get enough memory to be stable.
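One way to confirm that those JVM settings actually reach the running process is to print the limits the JVM reports for itself. The following is a minimal sketch using only standard `java.lang.management` calls; the class name `MemoryReport` and its helper methods are made up for illustration and are not part of ACHE.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of ACHE): prints the memory limits the JVM sees.
// Example launch on Java 11+, using the values mentioned in this post:
//   java -Xms3g -Xmx3g -XX:MaxDirectMemorySize=512m MemoryReport.java
public class MemoryReport {

    // Converts a byte count to whole mebibytes (integer division).
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Summarizes current heap usage; "max" reflects -Xmx when it is set.
    // getMax() can return -1 when no limit is defined, which prints as 0 here.
    public static String heapSummary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("heap used=%d MiB, committed=%d MiB, max=%d MiB",
                toMiB(heap.getUsed()), toMiB(heap.getCommitted()), toMiB(heap.getMax()));
    }

    // Lists the NIO buffer pools the JVM exposes, with their current usage.
    public static String bufferPoolSummary() {
        StringBuilder sb = new StringBuilder();
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            sb.append(String.format("buffer pool %s: used=%d MiB, capacity=%d MiB%n",
                    pool.getName(), toMiB(pool.getMemoryUsed()), toMiB(pool.getTotalCapacity())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(heapSummary());
        System.out.print(bufferPoolSummary());
    }
}
```

Logging the same values periodically from inside a long-running process would show whether heap or direct buffer usage creeps upward before the OutOfMemoryError appears.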
So after about 5 hours (I think - whenever I screen back into the running window) I see lots of msgs (pasted below) and then eventually it will die with a stack trace error. Up until that time it will be crawling away, getting Links and determining the link relevance for the next searches.
Has anyone seen this before?
I'll wait until next time and check out the stack trace.
thanks,
Derek
After a while it stops crawling dead and then prints lots (hundreds) of messages like this:
[2019-08-06 16:51:01,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:02,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:03,716] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:04,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:05,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:06,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:07,718] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:08,718] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:09,719] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2019-08-06 16:51:10,719] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
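The repeated message is consistent with the condition described in the reply: the main crawler thread asks for a link, the frontier has no uncrawled links, and some downloads are still in flight. Below is a hypothetical sketch of that decision only. It is not ACHE's actual code, and the class, enum, and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the wait condition described in the reply (not ACHE code).
public class WaitLoopSketch {

    public enum State { HAS_LINK, WAITING, FINISHED }

    // Decides what a crawler loop would do given the uncrawled links in the
    // frontier and the number of downloads that have not completed yet.
    public static State nextState(Queue<String> frontier, int downloadsInFlight) {
        if (!frontier.isEmpty()) {
            return State.HAS_LINK;   // there is something to fetch right now
        }
        if (downloadsInFlight > 0) {
            return State.WAITING;    // empty frontier, but pending pages may yield links
        }
        return State.FINISHED;       // nothing queued and nothing pending
    }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();
        System.out.println(nextState(frontier, 3)); // WAITING
        frontier.add("http://example.com/");
        System.out.println(nextState(frontier, 3)); // HAS_LINK
    }
}
```

Under this assumed model, the message would keep repeating for as long as the in-flight count stays above zero, for example if downloaded pages are never handed back because their processing keeps failing.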