Tweet URL Extraction: All Twitter Shortlinks #216

ianmilligan1 · 2016-04-06T16:24:41Z

Right now, our script for URL extraction is as follows:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.matchbox.TweetUtils._
import org.warcbase.spark.rdd.RecordRDD._

val tweets = RecordLoader.loadTweets(
  "/mnt/vol1/data_sets/elxn42/ruest-white/elxn42-tweets-combined-deduplicated-unshortened-fixed.json",
  sc)
val r = tweets.flatMap(tweet => {"""http://[^ ]+""".r.findAllIn(tweet.text).toList})
  .countItems()
  .saveAsTextFile("/home/i2millig/tweet-test/tweet-urls-test.txt")

Because this pulls URLs out of each tweet's text field, we just get truncated and shortened links like:

(http://…,49033)
(http://t…,48066)
(http://t.…,45610)
(http://t.c…,42470)
(http://t.co…,38145)
(http://t.co/…,32723)
(http://t.co/pbFMYFZpQC,2902)
(http://t.co/lTTkYPlGX0,2823)
(http://t.co/mn2pyBGZmj,1964)
(http://t.co/rriRvt6DyI,1964)
(ad nauseam)

This is not very useful – so what's the best path? In the past, @ruebot and I have used unshorten.py in twarc.


ruebot · 2016-04-06T16:26:54Z

...which in turn uses https://github.com/edsu/unshrtn

We could incorporate that. Or we could create a method in warcbase that does the same thing, or maybe there is already a Java library that does unshortening that we could just pull in.

lintool · 2016-04-07T20:08:52Z

Do we have a file which has the mapping from short urls to the full URLs? If so, I can show you how to join in the data...

ruebot · 2016-04-07T20:11:43Z

@lintool can you clarify what you mean by "a file that has the mapping from short urls to the full URLs"?

ruebot · 2016-04-07T20:12:28Z

...or, is this what you're looking for? https://github.com/edsu/unshrtn/blob/master/unshrtn.coffee

lintool · 2016-04-07T20:14:29Z

File that has:

http://t.co/pbFMYFZpQC http://foo.bar.com/
http://t.co/pg3SFzLc http://foo.bar.com/
...
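For illustration, a file in that shape could be read into a lookup table roughly like this. It is only a sketch in plain Scala: the object name `ShortlinkMapping` and its methods are hypothetical, and it assumes one whitespace-separated short/long pair per line.

```scala
object ShortlinkMapping {
  // Parse lines of the form "<short-url> <full-url>" into a lookup table.
  // Blank or malformed lines (anything other than exactly two fields) are skipped.
  def parse(lines: Iterator[String]): Map[String, String] =
    lines
      .map(_.trim.split("\\s+"))
      .collect { case Array(short, long) => short -> long }
      .toMap

  // Convenience wrapper for a mapping file on local disk.
  // The path is whatever the caller supplies; nothing here is specific to warcbase.
  def load(path: String): Map[String, String] = {
    val source = scala.io.Source.fromFile(path, "UTF-8")
    try parse(source.getLines()) finally source.close()
  }
}
```

If the same short URL appears on more than one line, the last line wins, because that is how `toMap` treats duplicate keys.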

ruebot · 2016-04-07T20:20:03Z

Oh, https://github.com/edsu/twarc/blob/master/utils/unshorten.py#L37-L53 puts it back in the dataset with a new entry.

lintool · 2016-04-07T23:00:39Z

If I understand correctly what it's doing, that's absolutely terrible. That's the digital equivalent of going through a paper archive with a black magic marker, crossing out historical place names and replacing them with their modern names. Would you do that to a paper archive? No! So don't do it to a digital archive.

The correct way to do this is to have a separate file that has the mapping (per above), and join in the unshortened form during processing.

EDIT: okay, it adds in a new field in the JSON, which isn't as bad as I thought. Analogy would be to go through a paper archive and put a post-it note next to every instance of a historical place name and on the post-it note write its modern name.

ruebot · 2016-04-08T14:22:04Z

You don't do it on the preservation/master version of the dataset, you always cat it out to a new file. By default it is stdout. It only reads the preservation/master version of the dataset.

lintool · 2016-04-09T00:51:39Z

If that's the case, it's a waste of space. You still just want

short long
short long
...

ruebot · 2016-04-09T01:00:51Z

Would the output be:

short, count, long, count
http://t.co/pbFMYFZpQC, 12, http://foo.bar.com/, 123

lintool · 2016-04-09T01:29:35Z

You wouldn't even need the count. If you just had short, long, you can process the original archival JSON and just join in the long form as needed.
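A rough sketch of what "join in the long form" could look like, written against plain Scala collections so it runs without a cluster. `ExpandShortlinks` and its method names are made up for illustration; the regex is the one from the script at the top of the thread. In Spark the same lookup would presumably go through a broadcast variable or a pair-RDD join, but that part is an assumption and is not shown here.

```scala
object ExpandShortlinks {
  // Same pattern as the extraction script at the top of the thread.
  private val UrlPattern = """http://[^ ]+""".r

  // Swap each extracted URL for its long form when the mapping has one.
  // Unknown or truncated links pass through unchanged; the tweet text itself is never modified.
  def expandedUrls(text: String, mapping: Map[String, String]): List[String] =
    UrlPattern.findAllIn(text).toList.map(url => mapping.getOrElse(url, url))

  // Count expanded URLs across tweet texts, most frequent first (ties broken by URL).
  def countExpanded(texts: Seq[String], mapping: Map[String, String]): Seq[(String, Int)] =
    texts
      .flatMap(expandedUrls(_, mapping))
      .groupBy(identity)
      .map { case (url, hits) => url -> hits.size }
      .toSeq
      .sortBy { case (url, n) => (-n, url) }
}
```

Two different short links that point at the same page end up counted together under the long URL, which is the point of doing the join at processing time.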

ianmilligan1 · 2016-04-28T23:27:32Z

Just re-opening this. Did we reach any agreement here?

lintool · 2016-04-29T19:15:21Z

Do we have a way to generate a file that has the following?

short-url full-url
short-url full-url
...
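One possible way to produce such a file, sketched with only the JDK's `HttpURLConnection`. This is not how unshrtn or unshorten.py work internally; `BuildShortlinkMapping`, the five-second timeouts, and the five-hop limit are all assumptions made for the example. The redirect lookup is passed in as a function so it can be swapped out or stubbed.

```scala
import java.net.{HttpURLConnection, URI}
import scala.util.Try

object BuildShortlinkMapping {
  // Ask the server where a URL redirects to, without following the redirect.
  // Returns None on any error or when the response is not a 3xx with a Location header.
  def firstRedirect(url: String): Option[String] = Try {
    val conn = URI.create(url).toURL.openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setInstanceFollowRedirects(false)
      conn.setRequestMethod("HEAD")
      conn.setConnectTimeout(5000) // assumed timeout, in milliseconds
      conn.setReadTimeout(5000)
      val code = conn.getResponseCode
      if (code >= 300 && code < 400)
        // Resolve relative Location values against the URL that was requested.
        Option(conn.getHeaderField("Location")).map(loc => URI.create(url).resolve(loc).toString)
      else None
    } finally conn.disconnect()
  }.toOption.flatten

  // Follow redirects for at most maxHops steps and return the last URL reached,
  // or None if the starting URL did not redirect at all.
  def resolve(url: String, hop: String => Option[String], maxHops: Int = 5): Option[String] = {
    @annotation.tailrec
    def loop(current: String, hopsLeft: Int, moved: Boolean): Option[String] =
      if (hopsLeft == 0) { if (moved) Some(current) else None }
      else hop(current) match {
        case Some(next) => loop(next, hopsLeft - 1, moved = true)
        case None       => if (moved) Some(current) else None
      }
    loop(url, maxHops, moved = false)
  }

  // One "short-url full-url" line per distinct short URL that resolves.
  def mappingLines(shortUrls: Iterable[String],
                   hop: String => Option[String] = firstRedirect): Seq[String] =
    shortUrls.toSeq.distinct.sorted.flatMap(s => resolve(s, hop).map(long => s"$s $long"))
}
```

Short links that no longer resolve are simply left out of the output, so the original JSON remains the only record of them.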

ianmilligan1 added the feature label Apr 6, 2016
