Is it possible to skip creation of the results files and just report if the links are valid? #103
I created a custom persistence handler:

```php
<?php

use VDB\Spider\PersistenceHandler\FilePersistenceHandler;
use VDB\Spider\PersistenceHandler\PersistenceHandlerInterface;
use VDB\Spider\Resource;

class JsonPersistenceHandler extends FilePersistenceHandler implements PersistenceHandlerInterface
{
    protected string $defaultFilename = 'data.json';

    #[\Override]
    public function persist(Resource $resource)
    {
        $file = $this->getResultPath() . 'data.json';

        if (!file_exists($file)) {
            // Create the file if it doesn't exist yet.
            $fileHandler = fopen($file, 'w');
            $results = [];
        } else {
            // Open the existing file for reading and writing.
            $fileHandler = fopen($file, 'c+');

            // Only read and decode the JSON if the file is not empty.
            if (filesize($file) > 0) {
                $results = json_decode(fread($fileHandler, filesize($file)), true);
            } else {
                $results = [];
            }
        }

        $url = $resource->getUri()->toString();
        $statusCode = $resource->getResponse()->getStatusCode();
        $results[$url] = $statusCode;

        // Move the pointer to the beginning of the file before writing.
        rewind($fileHandler);

        // Write the updated results and close the file handler.
        fwrite($fileHandler, json_encode($results));
        fclose($fileHandler);
    }

    #[\Override]
    public function current(): Resource
    {
        return unserialize($this->getIterator()->current()->getContents());
    }
}
```

And this works, kinda. The only thing is that I don't get all the links from the webpage, only 125. Can the crawler fetch the sitemap.xml and parse that to get all the links?
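(Note: a handler like this would typically be attached to the spider's downloader. The sketch below follows the pattern from the php-spider examples; the results path, seed URL, and include are placeholders and not something from this thread.)

```php
<?php

// Hypothetical wiring, modeled on the php-spider examples: the custom handler
// is given a results directory and attached to the spider's downloader. Method
// and constructor signatures may differ between library versions.
use VDB\Spider\Spider;

require 'JsonPersistenceHandler.php'; // assumed location of the class above

$spider = new Spider('https://example.com/'); // placeholder seed URL
$spider->getDownloader()->setPersistenceHandler(
    new JsonPersistenceHandler(__DIR__ . '/results')
);

$spider->crawl();
```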
@dingo-d Currently the spider does not support parsing sitemap.xml. Your approach of combining a custom persistence handler with the link checker seems correct. Are you sure there are more than 125 links on the page/website? If so, check whether some of those links are only added via JavaScript (the crawler won't see those), and which filters you have configured on the discoverer set.
Interested to hear what you find.
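(Note: since sitemap parsing isn't built in, one workaround is a small pre-pass over sitemap.xml with plain PHP, separate from the spider. The sketch below uses DOMDocument and assumes a single standard sitemap file at a known URL; the URL is a placeholder.)

```php
<?php

// Minimal sketch of a sitemap pre-pass, independent of the spider itself.
// Assumes a plain sitemap.xml (not a sitemap index); error handling omitted.
$sitemapUrl = 'https://example.com/sitemap.xml'; // placeholder URL

$doc = new DOMDocument();
$doc->loadXML(file_get_contents($sitemapUrl));

$urls = [];
foreach ($doc->getElementsByTagName('loc') as $loc) {
    $urls[] = trim($loc->textContent);
}

// $urls could then be checked directly, or compared against the crawl output
// to see which pages the spider never reached.
print_r($urls);
```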
The site is a WordPress site, so all links should be present. But it's good to know about JS added ones 👍🏼
Yup, added
These are the filters I added:

```php
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('https')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());
```

All in all, I did get the JSON file with some 503 statuses. The idea was to use it as a site-health checker.
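(Note: for the site-health-checker idea, a short script can read the generated data.json and list every URL whose status isn't 200. The file path and the 200-only rule below are assumptions; the actual result path may include a spider-id subdirectory.)

```php
<?php

// Rough sketch of a site-health report over the data.json produced by the
// handler above. Assumes the file maps URLs to HTTP status codes,
// e.g. {"https://example.com/about": 200}.
$results = json_decode(file_get_contents(__DIR__ . '/results/data.json'), true);

$unhealthy = array_filter($results, fn ($status) => $status !== 200);

foreach ($unhealthy as $url => $status) {
    printf("%d  %s\n", $status, $url);
}

printf("%d problem page(s) out of %d checked.\n", count($unhealthy), count($results));
```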
Hi.
I'm wondering if it's possible to use the link checker example to just check for valid links, and maybe store them in a JSON or CSV file instead of creating binary files and index.html files inside the results folder. Should I try to create my own persistence handler for this?
Basically, I'd just like to crawl my website to check whether any of its pages return 404. I'm not necessarily interested in whether individual links on a page return 404; I just need to check that all my pages are healthy.