-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scrape images, video, and post forwarding information for Telegram #413
base: master
Are you sure you want to change the base?
Scrape images, video, and post forwarding information for Telegram #413
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for this – and happy to see snscrape used by Bellingcat!
I don't have time at the moment to review and test it in detail, but a few general thoughts:
- For forwarded posts, I'd like to see a URL to the original post as well as a reference to the channel behind it. I guess there's zero info in the HTML besides the username, so that might require some changes on the
Channel
class (makingtitle
,verified
, andphoto
optional). - Posts can have more than one video. I believe the current code only catches the last video.
- For videos and audio, I'd like to extract everything Telegram provides. Definitely the duration and thumbnail, perhaps even the audio amplitude bars although that's probably overkill and of little value. This would require separate dataclasses to carry this extra data, similar to how the Twitter module handles media.
Makes sense to me. I don't have a timeline for when we'd be able to make those changes -- there's a few high priority things happening right now -- but we've been using our fork for a while and I wanted to open a PR to remember to merge it upstream at some point. |
…e string in membersDiv has the word 'subscribers' rather than 'members'.
…tracting a post's view count
…ribute type Channel.
… attribute; fixed video edge cases.
…s didn't have a next page link (added reasonable default)
…se they weren't in a post containing a 'tgme_widget_message_text' div
…edundant outlinks
…t wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text
…TTERN as variable
Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR
I implemented the requested changes:
Additional changes:
|
Hm, should this be rebased? 25 commits is a lot, but I'm not sure on @JustAnotherArchivist's policy on that. |
Pasting something from the PR to the fork that I think is relevant:
|
The changes sound good so far, though I haven't reviewed the code thoroughly yet. Some quick comments on things I noticed at a glance:
|
This is an example of a channel page with no |
Incorporated your changes, let me know if there are other issues you'd like me to address |
@JustAnotherArchivist Any additional changes you want us to make? We've been using this quite a bit and would love to see it get merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, and thanks for the fixes! I'll have some style nits, but let's get the functionality sorted out first.
# Generic filter of links to the post itself, catches videos, photos, and the date link | ||
if style != '': | ||
imageUrls = _STYLE_MEDIA_URL_PATTERN.findall(style) | ||
if len(imageUrls) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any examples with more than one match (here or a few lines below)?
|
||
media.append(VoiceMessage(url = audioUrl, duration = duration, bars = barHeights)) | ||
|
||
for videoPlayer in post.find_all('a', {'class': 'tgme_widget_message_video_player'}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the extraction of images and videos is done separately, the order is not preserved. For example, https://t.me/s/nexta_live/43102 has video 1 (without URL), image, video 2 (with URL), but the image gets listed first. I think that can be fixed by simply merging this loop (and also the one for the voice player extraction) into the general link loop above, since they're all a
tags in the post div
.
} | ||
timeTag = videoPlayer.find('time') | ||
if timeTag is None: | ||
cls = Gif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have some examples? I don't remember seeing fake-GIFs on Telegram before. (Also for the future test suite.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Telegram doesn't have a policy on whether or not they're allowed, right? I don't think a real-GIF would ever inaccurately go down this path, so isn't it just making the logic more robust against change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer erroring out on things the code doesn't actually understand and implement. It might be 'more robust' in some sense, but it can easily result in misparsing the data as well.
But if 'videos' without a time tag already exist similar to how it is on Twitter, this is totally fine. Hence why I'm asking for examples. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an active Telegram user, so I don't think I'll be able to quickly come up with an example myself. @loganwilliams , do you remember running into a problem which required adding this line back when you implemented this?
On the other hand, what data misparsing are you imagining from this, @JustAnotherArchivist ? Especially if Twitter already has examples which require this behavior, what's the error mode that we'd want to call out by throwing here?
I'm hoping that merging this will get everyone off the fork, but am concerned that if we introduce new exceptions, it'll require more significant updates to existing workflows.
Edit: As a compromise, I'm adding a warning log to this in my PR. It won't stop execution, but will let the user know in case there's something actually wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @john-osullivan. Thanks for your work pushing this forward. You can see an example of a GIF here: https://t.me/thisisatestchannel19451923/3
It sits in the same .tgme_widget_message_video_player
element and lacks a duration.
if kwargs['href'] in outlinks: | ||
outlinks.remove(kwargs['href']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer leaving the link preview href in outlinks
as well, similar to how the Twitter scraper will have outlinks from link cards in outlinks
.
try: | ||
if soup.find('a', attrs = {'class': 'tgme_widget_message_date'}, href = True)['href'].split('/')[-1] == '1': | ||
# if message 1 is the first message in the page, terminate scraping | ||
break | ||
except: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bare except
is awful and hides various exceptions that shouldn't be caught, such as ^C interrupts. This test should really be done without a try-except
.
# some pages are missing a "tme_messages_more" tag, causing early termination | ||
if '=' not in nextPageUrl: | ||
nextPageUrl = soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href'] | ||
nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20 | ||
if nextPostIndex > 20: | ||
pageLink = {'href': nextPageUrl.split('=')[0] + f'={nextPostIndex}'} | ||
else: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this approach lead to duplicates in some cases? When a post includes multiple media, those get their own ID, so there'd be a gap in post IDs.
I feel like it'd be better to get the ID of the first post on the page and then use that as the before
parameter.
Also, while testing this and looking around for odd cases, I discovered that Telegram supports 'round videos'. Example: https://t.me/s/memes/9641 |
Hello. I'd love to get this PR merged. Is there anything I can do to help? |
Please finish and merge this; it's quite a useful feature. If there's anything I can do to help, I'd be more than happy to. |
Just so you know: I've created a Python library called tchan that scrapes Telegram public channels and does not have the current problems snscrape has regarding this PR (still missing some features like scrape polls). |
What I need (and what I'm pretty sure this PR provides) is an easy way to check if a post contains one or more videos. |
Hey, I recently did a Bellingcat workshop which used this fork -- I'd love to close the gap and get it merged. I'll try to cut something soon, @JustAnotherArchivist, and will let you know if I have any questions on how you'd like it! |
I wrapped up my changes to all your comments, @JustAnotherArchivist ! Asked @.loganwilliams for a review over there, but if you'd like to preemptively call out any issues with my changes, I'd love to get em fixed ahead of time and only do one merge process 👍 |
A small enhancement that adds some additional information from Telegram channel posts.