This repository has been archived by the owner on Aug 14, 2021. It is now read-only.

Merge branch 'development'
andreskrey committed Sep 14, 2017
2 parents d5d0b2e + 816a45f commit c7f4193
Showing 8 changed files with 2,480 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
test/* linguist-language=PHP
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,12 @@ All notable changes to this project will be documented in this file.

## Unreleased


## [v0.2.2](https://github.com/andreskrey/readability.php/releases/tag/v0.2.2)

- Added a safecheck for really nasty HTML
- Added summonCthulhu option, to remove all script tags via regex

## [v0.2.1](https://github.com/andreskrey/readability.php/releases/tag/v0.2.1)

- Added `normalizeEntities` flag to convert UTF-8 characters to their HTML entity equivalents. Fixes bugs in HTML with mixed encoding.
3 changes: 3 additions & 0 deletions README.md
@@ -54,6 +54,7 @@ If the parsing process was unsuccessful the HTMLParser will return `false`
- **substituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities. Like `&aacute;` to á.
- **normalizeEntities**: default value `false`, converts UTF-8 characters to their HTML entity equivalents. Useful to parse HTML with mixed encoding.
- **originalURL**: default value `http://fakehost`, original URL from the article used to fix relative URLs.
- **summonCthulhu**: default value `false`, remove all `<script>` nodes via regex. This is not ideal as it might break things, but might be the only solution to [libxml problems with unescaped javascript](https://github.com/andreskrey/readability.php#known-issues).

## Limitations

@@ -82,6 +83,8 @@ If you would like to remove the scripts of the HTML (like readability does), you

This is a libxml issue and not a Readability.php bug.

There's a workaround for this: using the summonCthulhu option. This will remove all script tags via regex, which is not ideal because you may end up summoning [the lord of darkness](https://stackoverflow.com/a/1732454).

## Dependencies

Readability uses the Element interface and class from *The PHP League's* **[html-to-markdown](https://github.com/thephpleague/html-to-markdown/)**. The Readability object is an extension of the Element class. It overrides some methods but relies on it for basic DOMElement parsing.
7 changes: 6 additions & 1 deletion src/HTMLParser.php
@@ -96,6 +96,7 @@ public function __construct(array $options = [])
'fixRelativeURLs' => false,
'substituteEntities' => true,
'normalizeEntities' => false,
'summonCthulhu' => false,
'originalURL' => 'http://fakehost',
];

@@ -125,7 +126,7 @@ public function parse($html)
$this->metadata['title'] = $this->getTitle();

// Checking for minimum HTML to work with.
- if (!($root = $this->dom->getElementsByTagName('body')->item(0))) {
+ if (!($root = $this->dom->getElementsByTagName('body')->item(0)) || !$root->firstChild) {
return false;
}
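
The added `!$root->firstChild` condition makes `parse()` return `false` when the document has a `<body>` element with nothing inside it, in addition to the existing case where no `<body>` exists at all. A minimal, self-contained sketch of that check follows, using only PHP's bundled DOM extension. The helper name `hasUsableBody` and the sample markup are assumptions for illustration and are not part of Readability.php.

```php
<?php
// Hypothetical helper (not part of Readability.php): mirrors the guard
// in parse(), which rejects documents whose <body> is missing or empty.
function hasUsableBody(string $html): bool
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->getElementsByTagName('body')->item(0);

    // Inverse of the condition in the diff: a body node must exist
    // and it must contain at least one child node.
    return $root !== null && $root->firstChild !== null;
}

var_dump(hasUsableBody('<html><body></body></html>'));             // empty body
var_dump(hasUsableBody('<html><body><p>Hello</p></body></html>')); // body with content
```

With the old condition, the first document would have passed the check and parsing would have continued on an empty body.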

@@ -212,6 +213,10 @@ private function loadHTML($html)
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
}

if($this->getConfig()->getOption('summonCthulhu')){
$html = preg_replace('/<script\b[^>]*>([\s\S]*?)<\/script>/', '', $html);
}

// Prepend the XML tag to avoid having issues with special characters. Should be harmless.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
$dom->encoding = 'UTF-8';
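
The `summonCthulhu` branch is a plain `preg_replace` over the raw markup, run before libxml sees it. The sketch below applies the same pattern to a made-up snippet to show what it removes and what it leaves behind. The function name `stripScriptTags` and the sample strings are illustrative assumptions, not library code.

```php
<?php
// Hypothetical wrapper around the pattern used in loadHTML().
function stripScriptTags(string $html): string
{
    // Non-greedy match from an opening <script ...> tag to the nearest </script>.
    // preg_replace returns null on a regex error, so fall back to the input.
    return preg_replace('/<script\b[^>]*>([\s\S]*?)<\/script>/', '', $html) ?? $html;
}

// Assumed example of inline JavaScript that contains unescaped markup.
$html = '<p>Before</p>'
      . '<script type="text/javascript">document.write("</p><div>");</script>'
      . '<p>After</p>';

// The whole script element, including its contents, is dropped.
echo stripScriptTags($html), PHP_EOL;

// The pattern has no "i" modifier, so upper-case tags are left in place.
echo stripScriptTags('<SCRIPT>alert(1)</SCRIPT>'), PHP_EOL;
```

Because the match is case-sensitive and expects a literal `</script>`, variants such as `<SCRIPT>` pass through untouched, which is one way a regex-based cleanup can fall short of a real parser.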
3 changes: 3 additions & 0 deletions test/test-pages/webmd-1/config.json
@@ -0,0 +1,3 @@
{
"summonCthulhu": true
}
Empty file.
50 changes: 50 additions & 0 deletions test/test-pages/webmd-1/expected.html
@@ -0,0 +1,50 @@
<div class="copyNormal" id="textArea">





<p></p>
<h3></h3>


<p>Feb. 23, 2015 -- Life-threatening peanut allergies have mysteriously been
on the rise in the past decade, with little hope for a cure.</p>
<p xmlns:xalan="http://xml.apache.org/xalan">But a groundbreaking new study may offer a way to stem that rise, while
another may offer some hope for those who are already allergic.</p>
<p>Parents have been told for years to avoid giving foods containing peanuts
to babies for fear of triggering an allergy. Now research shows the opposite
is true: Feeding babies snacks made with peanuts before their first birthday
appears to prevent that from happening.</p>
<p>The study is published in the <i>New England Journal of Medicine,</i> and
it was presented at the annual meeting of the American Academy of Allergy,
Asthma and Immunology in Houston. It found that among children at high
risk for getting peanut allergies, eating peanut snacks by 11 months of
age and continuing to eat them at least three times a week until age 5
cut their chances of becoming allergic by more than 80% compared to kids
who avoided peanuts. Those at high risk were already allergic to egg, they
had the skin condition <a class="Article" href="http://www.webmd.com/skin-problems-and-treatments/eczema/default.htm" onclick="return sl(this,'','embd-lnk');">eczema</a>, or
both.</p>
<p>Overall, about 3% of kids who ate peanut butter or peanut snacks before
their first birthday got an allergy, compared to about 17% of kids who
didn’t eat them.</p>
<p>“I think this study is an astounding and groundbreaking study, really,”
says Katie Allen, MD, PhD. She's the director of the Center for Food and
Allergy Research at the Murdoch Children’s Research Institute in Melbourne,
Australia. Allen was not involved in the research.</p>
<p>Experts say the research should shift thinking about how kids develop
<a class="Article" href="http://www.webmd.com/allergies/guide/food-allergy-intolerances" onclick="return sl(this,'','embd-lnk');">food allergies</a>, and it should change the guidance doctors give to
parents.</p>
<p>Meanwhile, for children and adults who are already <a class="Article" href="http://www.webmd.com/allergies/guide/nut-allergy" onclick="return sl(this,'','embd-lnk');">allergic to peanuts</a>,
another study presented at the same meeting held out hope of a treatment.</p>
<p>A new skin patch called Viaskin allowed people with peanut allergies to
eat tiny amounts of peanuts after they wore it for a year.</p>
<a name="1"> </a>

<h3>A Change in Guidelines?</h3>

<p>Allergies to peanuts and other foods are on the rise. In the U.S., more
than 2% of people react to peanuts, a 400% increase since 1997. And reactions
to peanuts and other tree nuts can be especially severe. Nuts are the main
reason people get a life-threatening problem called <a class="Article" href="http://www.webmd.com/allergies/guide/anaphylaxis" onclick="return sl(this,'','embd-lnk');">anaphylaxis</a>.</p>
</div>
