strange diffs getting tweeted #7

Open
edsu opened this issue Jan 12, 2017 · 8 comments

Comments

@edsu
Member

edsu commented Jan 12, 2017

@ruebot noticed a series of odd updates like this which led to the discovery that readability returns very little content sometimes. For example:

import requests
import readability

html = requests.get("https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html").content
doc = readability.Document(html)

print(doc.summary())

returns (at the moment):

<html><body><div><div class="article__subheadline" data-reactid="93"><p data-reactid="94">The 15-year-old was remanded into secure accommodation on Wednesday and was also charged with possession of an offensive weapon. </p></div></div></body></html>

Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
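A minimal sketch of what such a threshold could look like, assuming the check runs on the HTML string returned by doc.summary(). The names visible_text, is_substantial and MIN_CHARS are hypothetical, and the default of 500 characters is an arbitrary placeholder that would need tuning per site:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(summary_html):
    """Return the whitespace-normalized text of an HTML string."""
    collector = _TextCollector()
    collector.feed(summary_html)
    collector.close()
    return " ".join("".join(collector.chunks).split())


# Hypothetical default, not a value taken from the project.
MIN_CHARS = 500


def is_substantial(summary_html, min_chars=MIN_CHARS):
    """True when the extracted text is long enough to be worth diffing."""
    return len(visible_text(summary_html)) >= min_chars
```

With something like this, a one-sentence summary such as the one above would fall under the threshold and could be skipped instead of being diffed and tweeted.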

@edsu
Member Author

edsu commented Jan 12, 2017

Should be interesting to compare Python readability with the JavaScript version that it is based on.

@ruebot
Member

ruebot commented Jan 12, 2017

Weird. I just ran the above, and I get:

<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>The Good Wife spinoff The Good Fight takes on Donald Trump</h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">The Good Fight will feature themes critical about the new presidency as well as satirize the Liberal reaction. Trump seems to take opposition by entertainers more seriously than traditional press coverage, creator Robert King says</p></div></div></body></html>

If I view the source of https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html, I see that text in a large chunk of JavaScript (or just Ctrl+F "Good Wife").

@ruebot
Member

ruebot commented Jan 12, 2017

Run the same thing a few minutes later and I get:

<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>Late-night hosts mock Trump for calling Meryl Streep ‘overrated’ on Twitter </h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">On Jimmy Kimmel Live, there was an exchange between Ben Affleck and Kimmel about Trump's tweet with Affleck saying 'if you look up in the encyclopedia ‘great actress,’ it’s a picture of Meryl Streep.’ </p></div></div></body></html>

@edsu
Member Author

edsu commented Jan 12, 2017

Wild. I guess now we know why we're getting the diffs. What the heck is going on? Could they be serving advertisements randomly to people?
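One way to keep this kind of churn out of the diffs might be to extract the page a few times and only accept the result when every attempt agrees. This is only a sketch: stable_extraction and fetch_and_extract are hypothetical names, and the caller is assumed to supply a function that fetches the URL and returns the readability summary.

```python
def stable_extraction(fetch_and_extract, attempts=3):
    """Run fetch_and_extract() several times.

    Returns the extracted value if every attempt produced the same
    result, otherwise None (meaning the extraction is unstable).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    results = [fetch_and_extract() for _ in range(attempts)]
    first = results[0]
    if all(result == first for result in results[1:]):
        return first
    return None


# Possible wiring with the snippet from the first comment (assumption,
# not run here because it needs network access):
#
#   def fetch_and_extract():
#       html = requests.get(url).content
#       return readability.Document(html).summary()
#
#   summary = stable_extraction(fetch_and_extract)
#   if summary is None:
#       pass  # skip this check instead of recording a new version
```

The cost is extra requests per check, and it would not catch a page that is consistently wrong, so it would complement a length threshold rather than replace it.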

@ruebot
Member

ruebot commented Jan 12, 2017

Looks like it isn't actually grabbing the body consistently. I'm not seeing a way to really tweak Readability either.

.content() is going to return too much, and lead to more false positives, right?

@ruebot
Member

ruebot commented Jan 14, 2017

I've put the torstar account on pause until we can figure this one out since it's putting out so many false positives.

@edsu
Member Author

edsu commented Jan 14, 2017

That's a wise move. Definitely leaving it open because I bet we run up against this type of issue with other sites.

@ruebot
Member

ruebot commented Mar 15, 2017

@edsu I think things have resolved themselves for the most part with the recent commits. What we were seeing before is now coming through like this tweet: https://twitter.com/torstar_diff/status/842119916958453762 -- So, maybe resolving #28 might fully resolve this issue?

3 participants
@edsu @ruebot and others