Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR #2
…tracting a post's view count
…ribute type Channel.
… attribute; fixed video edge cases.
…s didn't have a next page link (added reasonable default)
…se they weren't in a post containing a 'tgme_widget_message_text' div
Scraping was frustratingly slow, so I changed how forwarded channels are handled: the Channel definition now requires only the username, rather than retrieving the full channel information for every forwarded message. Additional changes:
…edundant outlinks
…t wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text
One thing we need to decide is whether to include pinned messages, e.g. https://t.me/s/SouthwestOhioPB/17, where the content is just "[CHANNEL NAME] pinned a [ATTACHMENT TYPE]". Unfortunately, unlike the desktop app, the browser interface doesn't link to the message that was pinned, so the scraped post carries very little information.
snscrape/modules/telegram.py
Outdated
if link['href'] == rawUrl or link['href'] == url:
    style = link.attrs.get('style', '')
    # Generic filter of links to the post itself, catches videos, photos, and the date link
    if style != '':
        imageUrls = re.findall('url\(\'(.*?)\'\)', style)
        if len(imageUrls) == 1:
            media.append(Photo(url = imageUrls[0]))
This code is partially duplicated below (lines 152-155). Maybe it could be extracted into a method, or at least the regex moved into a variable, so it stays consistent.
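One way to act on this, as a minimal sketch: compile the pattern once at module level and route both call sites through a small helper. The names `_STYLE_URL_PATTERN` and `extract_style_urls` are hypothetical and not taken from the PR.

```python
import re

# Hypothetical name: matches url('...') inside an inline CSS style attribute.
_STYLE_URL_PATTERN = re.compile(r"url\('(.*?)'\)")


def extract_style_urls(style: str) -> list:
    """Return every URL referenced via url('...') in a style string."""
    return _STYLE_URL_PATTERN.findall(style)


style = "width:100px;background-image:url('https://example.org/photo.jpg')"
print(extract_style_urls(style))  # ['https://example.org/photo.jpg']
```

Writing the pattern as a raw string also avoids the invalid-escape warning that `\(` triggers in a regular string literal.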
snscrape/modules/telegram.py
Outdated
forwarded = forward_tag['href'].split('t.me/')[1].split('/')[0]
for voice_player in post.find_all('a', {'class': 'tgme_widget_message_voice_player'}):
    audioUrl = voice_player.find('audio')['src']
    durationStr = voice_player.find('time').text.split(':')
durationStr comes from split, so it will be a list rather than a string. Both calls pass lists, so maybe rename the variables and the durationStrToSeconds method to reflect that.
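A possible shape for the rename, as a sketch only: the body of the PR's durationStrToSeconds is not shown here, so this assumes the parts are colon-separated with the most significant unit first. The names `duration_parts_to_seconds` and `duration_parts` are hypothetical.

```python
from typing import Sequence


def duration_parts_to_seconds(parts: Sequence[str]) -> int:
    """Convert ['MM', 'SS'] or ['HH', 'MM', 'SS'] to total seconds."""
    seconds = 0
    for part in parts:
        # Each step shifts the running total up by one base-60 unit.
        seconds = seconds * 60 + int(part)
    return seconds


duration_parts = '1:02:03'.split(':')  # a list, named accordingly
print(duration_parts_to_seconds(duration_parts))  # 3723
```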
snscrape/modules/telegram.py
Outdated
    videoThumbnailUrl = None
else:
    style = iTag['style']
    videoThumbnailUrl = re.findall('url\(\'(.*?)\'\)', style)[0]
The regex can be extracted to a variable, since it's also used above.
snscrape/modules/telegram.py
Outdated
if videoTag is None:
    videoUrl = None
else:
    videoUrl = videoTag['src']
- if videoTag is None:
-     videoUrl = None
- else:
-     videoUrl = videoTag['src']
+ videoUrl = None if videoTag is None else videoTag['src']
else:
    cls = Video
    durationStr = video_player.find('time').text.split(':')
    mKwargs['duration'] = durationStrToSeconds(durationStr)
Same comment as above on list vs. str for durationStrToSeconds.
snscrape/modules/telegram.py
Outdated
if viewsSpan is None:
    views = None
else:
    views = parse_num(viewsSpan.text)
- if viewsSpan is None:
-     views = None
- else:
-     views = parse_num(viewsSpan.text)
+ views = None if viewsSpan is None else parse_num(viewsSpan.text)
snscrape/modules/telegram.py
Outdated
s = s.replace(' ', '')
if s.endswith('M'):
    return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
elif s.endswith('K'):
    return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
else:
    return int(s), 1
I did not check this logic. Maybe add a docstring with example inputs and expected outputs.
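A documented version might look like the sketch below. It follows the logic of the diff above, with two labeled departures: the suffix handling is folded into a loop, and `round` replaces `int` so float error cannot truncate the value. Reading the second tuple element as the granularity of the displayed number is an assumption, as is the exact signature.

```python
from typing import Tuple


def parse_num(s: str) -> Tuple[int, int]:
    """Parse an abbreviated count into (value, granularity).

    Granularity (assumed meaning) is the size of the last digit shown,
    i.e. how coarse the displayed value is.

    >>> parse_num('123')
    (123, 1)
    >>> parse_num('1.2K')
    (1200, 100)
    >>> parse_num('3M')
    (3000000, 1000000)
    """
    s = s.replace(' ', '')
    for suffix, exponent in (('M', 6), ('K', 3)):
        if s.endswith(suffix):
            number = s[:-1]
            decimals = len(number.split('.')[1]) if '.' in number else 0
            # round() rather than int(), so float error cannot truncate the value.
            return round(float(number) * 10 ** exponent), 10 ** (exponent - decimals)
    return int(s), 1
```

The docstring examples use doctest syntax, so they can be checked with `python -m doctest` on the file.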
snscrape/modules/telegram.py
Outdated
if r.status_code == 200:
    return (True, None)
elif r.status_code // 100 == 5:
    return (False, f'status code: {r.status_code}')
- return (False, f'status code: {r.status_code}')
+ return (False, f'{r.status_code=}')
I discovered this recently; it works in Python 3.8+ (see here). Just a suggestion.
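For comparison, here is what the two spellings produce. The `SimpleNamespace` object is only a stand-in for the real response, which is assumed to expose an integer `status_code` attribute.

```python
from types import SimpleNamespace

# Stand-in for a response object; only the status_code attribute matters here.
r = SimpleNamespace(status_code=503)

print(f'status code: {r.status_code}')  # status code: 503
print(f'{r.status_code=}')              # r.status_code=503
```

The `=` specifier prints the expression text followed by the `repr` of its value, so the message wording changes from "status code: 503" to "r.status_code=503".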
snscrape/modules/telegram.py
Outdated
else:
    return (False, None)
- else:
-     return (False, None)
+ return (False, None)
No need for the else; having a base-level return with the default values is also a good pattern.
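Applied to the whole check, the pattern could look like this sketch. The function name `check_response` is hypothetical, since the enclosing function is not shown in the diff. The branches mirror the snippets above.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def check_response(r) -> Tuple[bool, Optional[str]]:
    """Hypothetical checker returning (success, message)."""
    if r.status_code == 200:
        return (True, None)
    if r.status_code // 100 == 5:
        return (False, f'status code: {r.status_code}')
    # Base-level default for every other status code.
    return (False, None)


print(check_response(SimpleNamespace(status_code=404)))  # (False, None)
```

Because every branch returns, the `elif` and `else` keywords carry no extra meaning, and the final return reads as the default outcome.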
…TTERN as variable
Implemented requested changes from JustAnotherArchivist#413
Channel
Additional steps that should be done:
Document dataclass for arbitrary attached documents