get source url #95

Open
HostingBE opened this issue Apr 15, 2023 · 1 comment

Comments

@HostingBE

First of all, thanks for creating php-spider. It has (almost) everything I need for my project.

Is it possible to get the source page on which the spider found each URL?

For example, if I index 500 URLs and some of them return an error status code (404, 403, or 500), it is currently difficult to find out which page linked to the failing URL.

Thank you
Constan

@mvdbos
Owner

mvdbos commented Aug 14, 2023

See example_complex.php. It adds the StatsHandler to the downloader and, at the end, iterates over all failures. This should show you all failed URLs with their reason.

First, register the StatsHandler to collect download stats. Note that you could easily create a similar handler that listens to these events and acts on them however you want. You could store errors in a db, for instance.

$spider->getDownloader()->getDispatcher()->addSubscriber($statsHandler);

Then, at the end of the crawl, show the failures that the StatsHandler collected:

echo "\nFAILED RESOURCES: ";
foreach ($statsHandler->getFailed() as $uri => $message) {
    echo "\n - " . $uri . " failed because: " . $message;
}
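
As a rough sketch of the "similar handler" idea mentioned above, here is a minimal stand-alone collector that stores failures and renders them in the same shape as the loop above. `FailureLog`, `record()` and `report()` are hypothetical names for illustration, not part of php-spider, and the example URLs and messages are made up. Wiring `record()` to the downloader's events is deliberately left out, since the exact event names and arguments are not shown in this thread; example_complex.php is the place to check for those.

```php
<?php
declare(strict_types=1);

// Hypothetical collector, NOT part of php-spider: it only stores what it is given.
final class FailureLog
{
    /** @var array<string, string> failed URI => failure message */
    private array $failed = [];

    // Record one failed URI; a later failure for the same URI overwrites the earlier one.
    public function record(string $uri, string $message): void
    {
        $this->failed[$uri] = $message;
    }

    /** @return array<string, string> */
    public function getFailed(): array
    {
        return $this->failed;
    }

    // Build the report as a string so it can be echoed, logged, or stored elsewhere.
    public function report(): string
    {
        $out = "\nFAILED RESOURCES: ";
        foreach ($this->failed as $uri => $message) {
            $out .= "\n - " . $uri . " failed because: " . $message;
        }
        return $out;
    }
}

// Usage with made-up data; in a real crawl, record() would be called from an event listener.
$log = new FailureLog();
$log->record('https://example.com/missing', '404 Not Found');
$log->record('https://example.com/private', '403 Forbidden');
echo $log->report();
```

Returning the report as a string rather than echoing inside the class keeps the collector reusable: the same data could be written to a db or a log file instead.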
