Tweet URL Extraction: All Twitter Shortlinks #216
Comments
...which in turn uses https://github.com/edsu/unshrtn. We could incorporate that, or create a method in warcbase that does the same thing. Or maybe there is already a Java library that does unshortening that we could just pull in. |
Do we have a file which has the mapping from short urls to the full URLs? If so, I can show you how to join in the data... |
@lintool can you clarify what you mean by "a file that has the mapping from short urls to the full URLs"? |
...or, is this what you're looking for? https://github.com/edsu/unshrtn/blob/master/unshrtn.coffee |
File that has:
|
Oh, https://github.com/edsu/twarc/blob/master/utils/unshorten.py#L37-L53 puts it back in the dataset with a new entry. |
If I understand correctly what it's doing, that's absolutely terrible. That's the digital equivalent of going through a paper archive with a black magic marker, crossing out historical place names and replacing them with their modern names. Would you do that to a paper archive? No! So don't do it to a digital archive. The correct way to do this is to have a separate file that has the mapping (per above), and join in the unshortened form during processing. EDIT: okay, it adds a new field to the JSON, which isn't as bad as I thought. The analogy would be going through a paper archive, putting a post-it note next to every instance of a historical place name, and writing its modern name on the post-it note. |
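A rough sketch of the "separate mapping file, join during processing" idea, in Python. Everything concrete here is an assumption for illustration: the mapping file is taken to be tab-separated (short URL, then full URL), the tweets are line-oriented JSON with `text` and `id_str` fields, and the names `load_mapping` and `join_urls` are hypothetical. The archived tweets are only read, never rewritten.

```python
import csv
import json
import re

# Assumption: short links in the tweet text look like http(s)://t.co/<token>.
SHORT_URL = re.compile(r"https?://t\.co/\w+")


def load_mapping(lines):
    """Build a dict from lines of the form short_url<TAB>full_url."""
    reader = csv.reader(lines, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def join_urls(tweet_lines, mapping):
    """Yield (tweet id, short URL, full URL) for every short link found.

    The tweet JSON itself is left untouched; the full URL is looked up
    in the separate mapping at processing time.
    """
    for line in tweet_lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        for short in SHORT_URL.findall(tweet.get("text", "")):
            # Fall back to the short form when the mapping has no entry.
            yield tweet.get("id_str"), short, mapping.get(short, short)
```

Both functions accept any iterable of lines, so they can be fed open file handles directly, e.g. `join_urls(open("tweets.json"), load_mapping(open("mapping.tsv")))` (both file names are placeholders).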
You don't do it on the preservation/master version of the dataset, you always |
If that's the case, it's a waste of space. You still just want
|
Would the output be:
|
You wouldn't even need the count. If you just had |
Just re-opening this. Did we reach any agreement here? |
Do we have a way to generate a file that has the following?
|
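One possible way to generate such a file, sketched in Python under stated assumptions: the layout is guessed to be one short URL and its full URL per line, separated by a tab, and the function names (`resolve`, `build_mapping`, `to_tsv`) are hypothetical. It simply follows HTTP redirects with the standard library; it is not how unshrtn or unshorten.py work internally.

```python
import urllib.request


def resolve(short_url, timeout=10):
    """Follow HTTP redirects and return the final URL (hits the network)."""
    with urllib.request.urlopen(short_url, timeout=timeout) as response:
        return response.geturl()


def build_mapping(short_urls, resolver=resolve):
    """Resolve each distinct short URL once and return {short: full}."""
    mapping = {}
    for short in short_urls:
        if short in mapping:
            continue
        try:
            mapping[short] = resolver(short)
        except OSError:
            # Dead or unreachable links are skipped rather than guessed.
            continue
    return mapping


def to_tsv(mapping):
    """Render the mapping as short_url<TAB>full_url lines (assumed layout)."""
    return "".join(f"{short}\t{full}\n" for short, full in mapping.items())
```

The `resolver` parameter is there so the network call can be swapped out, for example with a cached lookup or a stub in tests. The resulting text can be written to a file and kept alongside the dataset instead of inside it.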
Right now, our script for URL extraction is as follows:
By grabbing tweets from the text field we just get results like:

This is not very useful, so what's the best path? In the past, @ruebot and I have used unshorten.py in twarc.