Change counting hashtags in retweets #2

Open
wants to merge 7 commits into
base: main


@igorbrigadir igorbrigadir commented Feb 5, 2022

Fixes #1, but this will also alter hashtag counts, sometimes significantly, so existing results may not reproduce.

I'm wondering if the retweet merging code,

                # Process Retweets:
                if "referenced_tweets" in tweet:
                    rts = [t for t in tweet["referenced_tweets"] if t["type"] == "retweeted"]
                    retweeted_tweet = rts[-1] if rts else None
                    # If it's a native retweet, replace the "RT @user Text" with the original text, metrics, and entities, but keep the Author.
                    if retweeted_tweet:
                        # A retweet inherits everything from retweeted tweet.
                        tweet["text"] = retweeted_tweet.pop("text", tweet.pop("text", None))
                        tweet["entities"] = retweeted_tweet.pop("entities", tweet.pop("entities", None))
                        tweet["attachments"] = retweeted_tweet.pop("attachments", tweet.pop("attachments", None))
                        tweet["context_annotations"] = retweeted_tweet.pop(
                            "context_annotations", tweet.pop("context_annotations", None)
                        )
                        tweet["public_metrics"] = retweeted_tweet.pop("public_metrics", tweet.pop("public_metrics", None))

The way it works with pop is,

tweet["entities"] = retweeted_tweet.pop("entities", tweet.pop("entities", None))

If entities exists in retweeted_tweet, tweet["entities"] is replaced with it. If it doesn't exist in retweeted_tweet, the fallback tweet.pop("entities", None) is used instead: tweet["entities"] keeps whatever value it had before or, if it didn't exist, is set to None. I think this covers every situation, whether the elements are there or not.

should be in expansions.py in twarc proper? since this will most likely be shared in other things too. We can refactor this later though.
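The pop fallback described above can be checked with a small sketch. `merge_field` is a hypothetical helper written only for this illustration (it is not an existing twarc function), and the tweets are made-up minimal dicts:

```python
def merge_field(tweet, retweeted_tweet, key):
    # Hypothetical helper, not part of twarc: same expression as in the PR.
    # Python evaluates the default argument first, so `key` is always popped
    # from `tweet`; it is also removed from `retweeted_tweet` when present.
    tweet[key] = retweeted_tweet.pop(key, tweet.pop(key, None))
    return tweet


# Case 1: both have entities, so the retweeted tweet's entities win.
retweet = {"entities": {"hashtags": [{"tag": "a"}]}}
original = {"entities": {"hashtags": [{"tag": "a"}, {"tag": "b"}]}}
both = merge_field(retweet, original, "entities")

# Case 2: only the retweet has entities, so it keeps its own.
only_retweet = merge_field({"entities": {"hashtags": [{"tag": "a"}]}}, {}, "entities")

# Case 3: neither has entities, so the key is set to None.
neither = merge_field({}, {}, "entities")

print(both, only_retweet, neither, original)
```

Two side effects are worth keeping in mind: the key is removed from the retweeted tweet once it has been merged, and downstream code has to tolerate a `None` value, for example with `tweet["entities"] or {}`.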

Member

@edsu edsu left a comment

Thanks for tracking this down! I didn't realize hashtag entities for an underlying tweet weren't available on the retweet.

I wonder if we should count hashtags in any kind of referenced tweet--so quotes and replies as well?

tweet["text"] = retweeted_tweet.pop(
"text", tweet.pop("text", None)
)
tweet["entities"] = retweeted_tweet.pop(
Member

Since we are only interested in hashtags, isn't entities all that is needed?

Author

Yes, but I'm inclined to leave them in because it's more explicit about what's happening, and so that any future additions or modifications won't throw up any surprises. I don't think it hurts to have the extra bits there, but I'm also OK with just commenting them out.

@@ -62,20 +59,43 @@ def load(infile, outfile, db):

data = json.loads(line)
for tweet in ensure_flattened(data):
# Process Retweets:
if "referenced_tweets" in tweet:
rts = [
Member

@edsu edsu Feb 8, 2022

Maybe we should consider hashtags in any referenced tweet?

Author

I thought about that, but I was unsure how to handle it. Maybe it should have one switch for all, `--include-referenced-tweets` or `--count-referenced-tweets`, or add one for each type: `--count-replies`, `--count-quotes`, etc.?
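As a rough sketch of the per-type option, assuming flattened tweets where each referenced tweet carries its own `entities`: the flags below are only the names floated in this thread, wired up with `argparse` for illustration, and `count_hashtags` and `hashtags_of` are hypothetical helpers rather than anything in the plugin today. Retweets are left out because the merge in this PR already handles them.

```python
import argparse
from collections import Counter


def hashtags_of(tweet):
    # "entities" may be missing or None, so fall back to an empty dict.
    entities = tweet.get("entities") or {}
    return [h["tag"] for h in entities.get("hashtags", [])]


def count_hashtags(tweets, referenced_types=()):
    # Hypothetical helper: counts each tweet's own hashtags, plus those of
    # referenced tweets whose type is listed in `referenced_types`.
    counts = Counter()
    for tweet in tweets:
        counts.update(hashtags_of(tweet))
        for ref in tweet.get("referenced_tweets", []):
            if ref.get("type") in referenced_types:
                counts.update(hashtags_of(ref))
    return counts


# Proposed flags only; none of these exist yet.
parser = argparse.ArgumentParser()
parser.add_argument("--count-quotes", action="store_true")
parser.add_argument("--count-replies", action="store_true")
args = parser.parse_args(["--count-quotes"])

types = set()
if args.count_quotes:
    types.add("quoted")
if args.count_replies:
    types.add("replied_to")

# Made-up sample data: one quote tweet and one reply.
tweets = [
    {
        "entities": {"hashtags": [{"tag": "a"}]},
        "referenced_tweets": [
            {"type": "quoted", "entities": {"hashtags": [{"tag": "b"}]}}
        ],
    },
    {
        "entities": None,
        "referenced_tweets": [
            {"type": "replied_to", "entities": {"hashtags": [{"tag": "c"}]}}
        ],
    },
]
counts = count_hashtags(tweets, types)
print(counts)
```

One open question with this approach is that a tweet that is quoted or replied to many times would have its hashtags counted once per referencing tweet, which may or may not be what people expect.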


Successfully merging this pull request may close these issues.

Miss-matching counts
2 participants