Consider using DOMDocument recovery mode #73

Isinlor · 2018-05-07T21:43:21Z

See Stack Overflow for details: https://stackoverflow.com/a/9281963/893222

The idea is to handle malformed XML by using the libxml recovery option, which DOMDocument exposes in userland:

$dom = new DOMDocument();
$dom->recover = TRUE;

froschdesign · 2018-10-05T06:26:45Z

The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.

One option might be to inject your own DOMDocument, or to use a decorator that cleans up / recovers the feeds before they are parsed by the reader.
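A minimal sketch of such a pre-parsing step, assuming plain PHP with the DOM extension. The function name `loadFeedDocument` and its return shape are hypothetical and not part of zend-feed. It parses strictly first and only falls back to recovery mode when strict parsing fails, returning the strict-mode errors so the caller can decide whether to trust the recovered document:

```php
<?php

/**
 * Hypothetical helper (not part of zend-feed): parse strictly first and
 * fall back to libxml's recovery mode only when strict parsing fails.
 *
 * @return array{0: DOMDocument, 1: LibXMLError[]} the document and the
 *         errors reported by the strict attempt (empty if it succeeded)
 */
function loadFeedDocument(string $xml): array
{
    $xml = trim($xml);
    if ($xml === '') {
        throw new InvalidArgumentException('Empty XML string');
    }

    // Collect libxml errors instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    try {
        // First attempt: strict parsing
        $dom = new DOMDocument();
        if ($dom->loadXML($xml)) {
            return [$dom, []];
        }

        // Remember why the strict attempt failed
        $errors = libxml_get_errors();
        libxml_clear_errors();

        // Second attempt: recovery mode
        $dom          = new DOMDocument();
        $dom->recover = true;
        if (! $dom->loadXML($xml)) {
            throw new RuntimeException('XML could not be recovered');
        }

        return [$dom, $errors];
    } finally {
        // Restore the previous libxml error handling
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}
```

A non-empty error list means the document was repaired by libxml and may be missing content, so the caller can log it or reject the feed instead of passing it on silently.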

froschdesign · 2019-03-04T21:18:44Z

You can use the recovery mode yourself:

// Import by URI
$httpClient = Zend\Feed\Reader\Reader::getHttpClient();
$response   = $httpClient->get(
    'https://github.com/zendframework/zend-feed/releases.atom'
);
$xmlString  = $response->getBody();

// Create DOMDocument
$dom          = new DOMDocument;
$dom->recover = true;
$dom->loadXML(trim($xmlString));

// Detect type
$type = Zend\Feed\Reader\Reader::detectType($dom);

// Create reader
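if (0 === strpos($type, 'rss')) {
    $reader = new Zend\Feed\Reader\Feed\Rss($dom, $type);
} elseif (0 === strpos($type, 'atom')) {
    $reader = new Zend\Feed\Reader\Feed\Atom($dom, $type);
} else {
    // Neither RSS nor Atom: $reader would otherwise be undefined below
    throw new RuntimeException('Unsupported feed type: ' . $type);
}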

var_dump($reader->getTitle()); // "Release notes from zend-feed"

Isinlor · 2019-03-04T23:18:50Z

Thanks for the help! This is indeed what I ended up doing:
https://gitlab.com/DeepRSS/Reader/blob/3667b1b10c11b9c067de1e3242f15eaf2a1de261/src/Core/Service/ZendReader/FeedParser.php#L35

froschdesign · 2019-03-05T06:43:26Z

@Isinlor
Thanks for the fast response! 👍

Can you provide a link to a feed which is malformed and needs the recovery mode?

Isinlor · 2019-03-05T11:49:25Z

Here is one example: http://itbrokeand.ifixit.com/atom.xml

Code I used for testing:

<?php

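// Collect libxml errors internally instead of emitting PHP warnings
$libxmlErrflag = libxml_use_internal_errors(true);
// Note: libxml_disable_entity_loader() is deprecated as of PHP 8.0
$oldValue = libxml_disable_entity_loader(true);

$dom = new \DOMDocument;
//$dom->recover = true; // Allows parsing of slightly malformed feeds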

$status = $dom->loadXML(file_get_contents("http://itbrokeand.ifixit.com/atom.xml"));

if (!$status) {

    // Build error message
    $error = libxml_get_last_error();
    if ($error instanceof \LibXMLError && $error->message != '') {
        $error->message = trim($error->message);
        $errormsg = "DOMDocument cannot parse XML: {$error->message}";
    } else {
        $errormsg = "DOMDocument cannot parse XML: Please check the XML document's validity";
    }

    throw new Exception($errormsg);
}

froschdesign · 2019-03-05T12:19:16Z

@Isinlor
Perfect, this helps a lot. I am collecting various problems to create some test scenarios.

Isinlor · 2019-03-05T12:34:55Z

I think your initial reaction was correct.

The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.

I missed it when I was working on it myself. But indeed, even though $dom->recover = true; seems to work, Zend Feed is not able to handle the result correctly.

I'm really curious how Firefox handles it, because I have no issues if I open:

froschdesign · 2019-03-05T13:00:17Z

@Isinlor
I will check all the links this evening and give feedback.

froschdesign · 2019-03-12T22:37:55Z

@Isinlor

https://blog.noredink.com/rss

There were some problems earlier, but I cannot find any now.

http://itbrokeand.ifixit.com/atom.xml

The problem is <title>Web Operations D&D</title>: the unescaped & means the document is not well-formed. This should be reported to ifixit.com. Any other fix means ugly replacements.

(Also fails in a browser.)
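For illustration, this is roughly what such an "ugly replacement" could look like. It is a sketch only: the function name `escapeBareAmpersands` is hypothetical, and the regular expression is an assumption about what counts as a bare ampersand (one that does not start an entity or character reference).

```php
<?php

/**
 * Hypothetical helper: escape ampersands that do not start an entity or
 * character reference. Caveat: this also rewrites ampersands inside CDATA
 * sections and comments, where a literal & is legal.
 */
function escapeBareAmpersands(string $xml): string
{
    $result = preg_replace(
        '/&(?!(?:[A-Za-z_:][\w:.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );

    // preg_replace() returns null on a PCRE error; keep the input then
    return $result ?? $xml;
}
```

It shows why this route is fragile: an ampersand inside a CDATA section would be rewritten too, and a reference to an undefined entity would still fail to parse.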

http://aasnova.org/feed/

Two problems: a 403 response and a wrong header.

(Also fails in a browser. [Download])

https://blog.floydhub.com/rss/

Many feeds contain characters out of the legal range.

Try the following preg_replace:

// Replace every run of characters outside the legal XML range with a space
$string = preg_replace(
    '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
    ' ',
    $string
);

This should eliminate problems like "CData section not finished".

(Also fails in a browser.)
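A small sketch of how that replacement could be wrapped and applied before parsing. The function name `stripIllegalXmlChars` is hypothetical. The `/u` modifier makes `preg_replace()` return `null` when the input is not valid UTF-8, so the sketch checks for that explicitly.

```php
<?php

/**
 * Hypothetical wrapper around the replacement above: replace each run of
 * characters outside the legal XML 1.0 character range with one space.
 */
function stripIllegalXmlChars(string $string): string
{
    $clean = preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        ' ',
        $string
    );

    if ($clean === null) {
        // Typically means the input is not valid UTF-8
        throw new InvalidArgumentException('preg_replace failed, input may not be valid UTF-8');
    }

    return $clean;
}

// Usage: a backspace character (0x08) is not allowed in XML 1.0
$xml = "<feed><title>bad\x08char</title></feed>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadXML(stripIllegalXmlChars($xml));
```

Feeds in other encodings would have to be converted to UTF-8 first, which this sketch does not attempt.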

Thanks for the examples. At the moment I do not know whether we should do something in zend-feed, because it opens the door to many pitfalls and ugly workarounds. I see the benefit for the user, but also the maintenance burden.

I remain open to suggestions and improvements.

weierophinney · 2019-12-31T21:27:01Z

This repository has been closed and moved to laminas/laminas-feed; a new issue has been opened at laminas/laminas-feed#8.

froschdesign added the feature request label Oct 5, 2018

froschdesign changed the title ~~Consider using DOMDocument recorvery option~~ Consider using DOMDocument recovery mode Mar 4, 2019

weierophinney mentioned this issue Dec 31, 2019

Consider using DOMDocument recovery mode laminas/laminas-feed#8

Open
