Releases · andreskrey/readability.php

22 Jul 21:55

v2.1.0

7617a91

v2.1.0 > The one where I realized that libxml didn't die on version v2.9.4

Latest

Thanks to issue #86 I realized that there are modern versions of libxml2. I always wondered why the bundled version of libxml was so old (2.9.4 was released in 2016). Turns out I was checking the wrong website. What seems to be the official website lists a really old version as the latest one, while on GitLab the latest version was released just months ago!

So I realized there are newer versions, and that from 2.9.5 on the default behavior changed, breaking all our tests. Luckily the change is "cosmetic" (whitespace differences with 2.9.4), so the tests are still "valid", but PHPUnit will complain anyway. If you know a way to compare HTML while ignoring whitespace, let me know.
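One possible way to compare HTML while ignoring cosmetic whitespace is to reduce both documents to a normalized token stream and compare those. The sketch below is in Python rather than PHP, purely as an illustration of the idea; it is not what the project's test suite does, and the names `html_tokens` and `same_html_ignoring_whitespace` are made up for this example. It assumes whitespace is never significant, which is not true inside elements such as `<pre>`.

```python
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Collects a whitespace-normalized token stream from an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order does not matter.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.tokens.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def html_tokens(html):
    """Return the normalized token list for one HTML string."""
    collector = _TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def same_html_ignoring_whitespace(a, b):
    """True when both strings yield the same tags, attributes and text."""
    return html_tokens(a) == html_tokens(b)
```

In a test suite, the expected and actual output would both go through the same normalization before the equality assertion, so indentation and line-break differences between libxml versions would no longer cause failures.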

Anyway, the following changes are included in this release:

Avoid overwriting extracted metadata with similarly named keys (like og:image and og:image:width)
Imported new getSiteName() feature from JS version as of 21 Dec 2018
Added getFirstElementChild function to NodeTrait + test case (Issue #83)
Reworked the test suite to use TestPage objects and give more hints about what failed
Removed getWordThreshold and setWordThreshold configuration functions
Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows adding new nodes to the list
Added new Dockerfiles that pull different versions of PHP and libxml. Now we are supporting 4 versions of PHP and 6 versions of libxml!
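To illustrate the metadata item at the top of this list, here is a small Python sketch of how a loosely matched property name such as og:image:width can end up overwriting og:image, and how a negative lookahead avoids it. This is an assumption about the general shape of the problem, not the library's actual PHP code; the patterns, the `collect` helper and the sample tags are hypothetical.

```python
import re

# Loose pattern: also matches the "og:image" prefix of "og:image:width".
LOOSE = re.compile(r"\s*(og|twitter)\s*:\s*(title|description|image)\s*", re.I)

# Strict pattern: the negative lookahead rejects names that continue with ":".
STRICT = re.compile(r"\s*(og|twitter)\s*:\s*(title|description|image)(?!:)\s*", re.I)


def collect(meta_tags, pattern):
    """Build a metadata dict from (property, content) pairs."""
    values = {}
    for prop, content in meta_tags:
        match = pattern.match(prop)
        if match:
            key = f"{match.group(1).lower()}:{match.group(2).lower()}"
            values[key] = content  # later tags overwrite earlier ones
    return values


# Hypothetical sample input, in document order.
TAGS = [
    ("og:image", "https://example.com/cover.jpg"),
    ("og:image:width", "1200"),
]
```

With the loose pattern, `collect(TAGS, LOOSE)` stores the width under the og:image key; with the strict pattern the image URL survives because og:image:width is not matched at all.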

I reworked the 4 Dockerfiles we had before and created a dedicated repo for PHP with custom libxml versions. Here it is: https://github.com/andreskrey/php-libxml-docker-images

Each PR will be tested against 4 versions of PHP and 6 versions of libxml, which means that Travis will run 24 virtual machines every time there are changes in the repo. Let's see for how long we can abuse their free resources.

And that's it. Let me know if something is broken for you. Tell your mom you love her. Don't forget to call your father.

27 Nov 19:41

andreskrey

v2.0.1

23f2175

v2.0.1 > Oopsie

Oopsie. Noticed that the main image was always missing from the results? That's because I screwed it up. But fear not, it's fixed.

I also updated the tests to be a little more strict so this, IN THEORY, should not happen again.

25 Nov 13:02

andreskrey

v2.0.0

e1b31f9

v2.0.0 > Up to date with Readability.js again + docker containers

Hello everyone,

Guess you weren't expecting a new release of your favorite dependency written in the wrong language, huh!?

We are up-to-date with the JS version as of 19 Nov 2018 which includes the following changes:

Move phrasing contents into paragraphs
Improved the title detection
Remove single cell tables
Improved the detection of video related elements
New test cases
Various minor fixes

The following changes were also added:

Clean tags during prepArticle().
Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
Fixed issue #63: Division by zero
Housekeeping:
- Removed $parseSuccessful flag that wasn't needed anymore
- Renamed wordThreshold to charThreshold and added deprecation notices. wordThreshold will be removed in version 3.0.
- Added "-ad-" as unlikely candidate

And finally a docker container was added so you can easily test in all the supported PHP versions by simply typing make test-all in your console. You'll need docker and docker-compose if you want to see some really flashy stuff in your console and not just a silly error message.

If for some reason you're still reading this, you might be wondering why this version is 2.0 and not 1.3.something. I know you are. I know you're making a confused face right now.

The reason is that PHP 5.6 support is GONE.

YES

IT'S GONE

So make sure you run this code in a somewhat modern version of PHP. A version that starts with 7.

That's it. Take care. Call your mother.

19 Mar 13:39

andreskrey

v1.2.0

992a112

v1.2.0 > Up to date with Readability.js

Hi all,

Version 1.2.0 is here. We are up to date with our JS big brother. Here's the full changelog:

Merged PR #49 (Missing object when calling ->getContent())
Imported all changes from Readability.js as of 2 March 2018 (8525c6a):
- Check for <base> elements before converting URLs to absolute.
- Clean <link> tags on prepArticle()
- Attempt to return at least some text if all the algorithm runs fail (Check PR #423 on JS version)
- Add new test cases for the previous changes
- And all other changes reflected in this diff

12 Mar 23:20

andreskrey

v1.1.1

f0f6906

v1.1.1 > The one with small changes

Hello pretty people of the PHP world.

It's Monday night on this side of the world, I've just had a lovely dinner with the girlfriend and everything in my world is at peace. This is a great opportunity to release a new version of Readability and maybe get ~~some of those github stars~~ feedback from the users.

This version includes the following changes:

Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
Added a safe check to avoid sending the DOMDocument as a node when scanning for node ancestors.
Fix issue #45: Small mistake in documentation
Fix issue #46: Added data-src as an image source path
Fixed bug when extracting all the images of the article (was extracting images from the original DOM instead of the parsed one)
Added the ->getDOMDocument() getter to retrieve the fully parsed DOMDocument
Merged PR #48 that allows passing an array as configuration (@topotru)

Don't forget to update your dependencies, wash your hands after going to the bathroom, and do I have to tell you again to call your mom? That woman loves you. Call her right now. That composer update andreskrey/readability.php can wait a couple of minutes.

11 Jan 20:21

andreskrey

v1.1.0

2b136b3

v1.1.0 > Say hello to optional logging

Hi all!

Happy 2018! Hope you had an excellent 2017!

With this new release you'll be able to log everything that happens inside readability. Of course this is optional and nothing else is required from you if you don't care about logs.

Here's the full changelog:

Added 'data-orig' as a URL source for images
Removed 'modal' as a negative property from classes
Added option to inject a logger
Removed all references to the data-readability tags that don't apply anymore to the new structure
Merged PR #38 (Missing DOMEntityReference)

'til next time!

03 Dec 12:35

andreskrey

v1.0.0

0d02e29

v1 🎉🎉🎉

Hi all!

Finally v1 is here. 🎉🎉🎉 The project changed drastically from v0, mainly because the HTMLParser is gone and the Readability class replaces it. I know, confusing, but this change aligns us with Readability.js and makes everything easier to port.

Another huge change that I had wanted to make since version 0.0.1 was getting rid of the node encapsulation. v0 used the league\html-to-markdown NodeElement class to encapsulate the nodes and act as a middleman between your code and the DOMDocument. This caused lots of trouble because when you encapsulate nodes, you are actually severing the relation between the original DOM and the encapsulated node, forcing you to keep track of the changes between them instead of letting the system do it. Instead of encapsulating nodes, this version extends the original class, solving all these issues.

Check the readme file to understand how to port your v0 code to v1 and the changelog to read about all the other changes.

Enjoy!

01 Dec 23:17

andreskrey

v0.3.1

bbf6068

v0.3.1

Hi all, happy Friday.

I'm releasing this version just to clean the Unreleased section of the changelog and prepare everything for the v1 version. Changes for this release are:

Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
Fix issue when converting divs to p nodes and never rating them (issue #29)
Fix "Unsupported operand types" (PR #31)
Fix division by zero when no title was found (issue #32)
New function to retrieve all images at once (PR #30)
Get the title from the <title> tag before searching on the <meta> tags

Next release will be v1. For real this time.

12 Nov 20:57

andreskrey

v0.3.0

caefabc

v0.3.0

Happy November everyone. It took me longer than I expected, but finally we are up to date with Readability.js, at least at the moment of writing these lines.

Here is the changelist for this release.

Merged PR #24. Fixes notice when trying to extract og:image
Up to date to commit eb221c5 (2017-10-16), which includes the following changes:
- New tags added to the unlikelyCandidates regex
- Detection and removal of hierarchical separators in titles
- Added more tags to clean after parsing the article (button, textarea, select, etc.)
- New way to detect empty nodes (including an edge case where a node with a &nbsp; was detected as a node with content)
- Better approach to find a top candidate (especially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
- Detect text direction (ltr or rtl)
- Detect and mark data tables to avoid removing them during final clean up
- Major fixes when scanning and deleting nodes (no need to traverse backwards anymore)
- Node cleaning via regex matches
- Clean table attributes during final clean up.
Added license

Hopefully you'll find this release useful. Next release will be 1.0.

Don't forget to like, comment, subscribe, follow my Patreon, hit the gym, and call your mom. Make sure you tell your significant other you love him/her/it and if you are alone right now, install Tinder, because that's something I would really like to do but I was already in a relationship when that app was released.

Enjoy!

14 Sep 19:35

andreskrey

v0.2.2

c7f4193

v0.2.2

Happy September everyone, here's another release of readability.php:

Added a safecheck for really nasty HTML
Added summonCthulhu option, to remove all script tags via regex
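As a rough idea of what removing script tags via regex can look like, here is a minimal Python sketch. It only illustrates the technique named in the changelog entry; the pattern and the `strip_scripts` name are assumptions for this example and are not taken from the library's PHP source. Regex-based removal is a blunt tool that can be fooled by unusual markup, which is presumably why it is offered as an opt-in option.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing tag.
# DOTALL lets "." span newlines; the lazy ".*?" stops at the first closer.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html):
    """Return the HTML string with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)
```

Running this before handing the markup to a DOM parser keeps script bodies that contain stray HTML-like text from confusing the parser.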

Enjoy!
