Question 2). Let’s say you’re a Product Data Scientist at Instagram. How would you measure the success of the Instagram TV product?
In my role as a Product Data Scientist at Instagram, I'd have a bunch of ways to gauge how well IGTV is doing. Think of it as checking the pulse of this feature. We're talking about metrics, which are basically numbers that tell us what's going on. Here's what I'd look at:
(1)Views: This one's simple. It's about how many folks watched at least a bit of an IGTV video. We'd want to know if more people are tuning in over time.
(2)Average Watch Time: This is the average amount of time people spend watching IGTV videos. We'd want to see if folks are sticking around longer.
(3)Audience Retention: This tells us how many people watched an entire IGTV video from start to finish. Ideally, we'd like to see this number go up.
(4)Engagement Rate: We're talking about likes, comments, and shares here. It's important to know how often people are interacting with IGTV videos.
(5)Follower Growth: This is about how many new followers IGTV creators are getting. It's a good sign if this number keeps going up.
(Revenue: Money talks, right? We'd keep tabs on how much cash Instagram is making from IGTV ads.
Now, beyond these core metrics, there are some other things I'd keep an eye on:
(1)Top IGTV Creators: Who's getting the most views, followers, and engagement? That can tell us what kind of content is really hitting the mark.
(2)Popular IGTV Categories: What types of IGTV videos are people loving the most? We'd want to know what categories are trending.
(3)IGTV Discovery Features: How effective are the features that help folks find IGTV content? We'd study stuff like the Explore feed and recommendations.
(4)IGTV User Satisfaction: Are people happy with IGTV? This matters a lot, and we'd want to know if folks are enjoying the product.
So, what's the point of all this data tracking? Well, it helps us figure out what's working and what's not with IGTV. For instance:
(1)If we see views going up, we know people are tuning in more. (2)If the average watch time is increasing, it means people are finding content they like. (3)If engagement rates are high, it means users are getting involved.
Jupyter Notebook user-behavior-on-instagram
Code import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os for dirname, _, filenames in os.walk('/kaggle/input'): for filename in filenames: print(os.path.join(dirname, filename))
You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
ModuleNotFoundError Traceback (most recent call last) Cell In[1], line 5 1 # This Python 3 environment comes with many helpful analytics libraries installed 2 # It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python 3 # For example, here's several helpful packages to load ----> 5 import numpy as np # linear algebra 6 import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) 8 # Input data files are available in the read-only "../input/" directory 9 # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
ModuleNotFoundError: No module named 'numpy'
num_comments = df_comments.shape[0]
comments_per_user = df_comments['User id'].value_counts() comments_per_photo = df_comments['Photo id'].value_counts()
hashtags_dist = df_comments['Hashtags used count'].value_counts()
emoji_usage = df_comments['emoji used'].value_counts()
df_comments['created Timestamp'] = pd.to_datetime(df_comments['created Timestamp']) date_range = df_comments['created Timestamp'].min(), df_comments['created Timestamp'].max() num_comments, comments_per_user.describe(), comments_per_photo.describe(), hashtags_dist, emoji_usage, date_range import pandas as pd
file_path = os.path.join('/kaggle/input/user-behavior-on-instagram/comments_cleaned.csv')
df_comments = pd.read_csv(file_path)
ModuleNotFoundError Traceback (most recent call last) Cell In[2], line 1 ----> 1 import pandas as pd 3 # Define file path 4 file_path = os.path.join('/kaggle/input/user-behavior-on-instagram/comments_cleaned.csv')
ModuleNotFoundError: No module named 'pandas'
from wordcloud import WordCloud import matplotlib.pyplot as plt from PIL import Image import numpy as np
all_comments = " ".join(comment for comment in df_comments.comment)
wordcloud = WordCloud(background_color="white").generate(all_comments)
plt.figure(figsize=(8,6)) plt.imshow(wordcloud, interpolation='bilinear') plt.axis("off") plt.show()
ModuleNotFoundError Traceback (most recent call last) Cell In[3], line 1 ----> 1 from wordcloud import WordCloud 2 import matplotlib.pyplot as plt 3 from PIL import Image
ModuleNotFoundError: No module named 'wordcloud'
from textblob import TextBlob
df_comments['sentiment_polarity'] = df_comments['comment'].apply(lambda text: TextBlob(text).sentiment.polarity)
df_comments['sentiment'] = df_comments['sentiment_polarity'].apply(lambda p: 'positive' if p > 0 else ('negative' if p < 0 else 'neutral'))
df_comments.head() from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
df_comments['sentiment_scores'] = df_comments['comment'].apply(sia.polarity_scores)
df_comments['compound_score'] = df_comments['sentiment_scores'].apply(lambda score_dict: score_dict['compound'])
df_comments['sentiment'] = df_comments['compound_score'].apply(lambda c: 'positive' if c > 0.05 else ('negative' if c < -0.05 else 'neutral'))
df_comments.head()
sentiment_distribution = df_comments['sentiment'].value_counts() sentiment_distribution
negative_comments = df_comments[df_comments['sentiment'] == 'negative']
negative_comments_sample = negative_comments.sample(10) negative_comments_sample
all_negative_comments = " ".join(comment for comment in negative_comments.comment)
wordcloud_negative = WordCloud(background_color="white").generate(all_negative_comments)
plt.figure(figsize=(8,6)) plt.imshow(wordcloud_negative, interpolation='bilinear') plt.axis("off") plt.show()
emoji_sentiment_proportions = df_comments.groupby('sentiment')['emoji used'].value_counts(normalize=True).unstack()
emoji_sentiment_proportions = emoji_sentiment_proportions * 100 emoji_sentiment_proportions
emoji_sentiment_proportions.plot(kind='bar', stacked=True, figsize=(10, 6)) plt.title('Proportion of Comments With and Without Emojis by Sentiment') plt.ylabel('Percentage') plt.xlabel('Sentiment') plt.xticks(rotation=0) plt.show()
user_comment_counts = df_comments['User id'].value_counts()
top_users = user_comment_counts.nlargest(10)
top_user_sentiments = df_comments[df_comments['User id'].isin(top_users.index)].groupby('User id')['sentiment'].value_counts(normalize=True).unstack().fillna(0)
top_user_sentiments = top_user_sentiments * 100 top_users, top_user_sentiments
photo_comment_counts = df_comments['Photo id'].value_counts()
top_photos = photo_comment_counts.nlargest(10)
top_photo_sentiments = df_comments[df_comments['Photo id'].isin(top_photos.index)].groupby('Photo id')['sentiment'].value_counts(normalize=True).unstack().fillna(0)
top_photo_sentiments = top_photo_sentiments * 100 top_photos, top_photo_sentiments
hashtag_sentiment = df_comments.groupby('Hashtags used count')['sentiment_polarity'].mean()
hashtag_sentiment.plot(kind='bar', figsize=(10, 6)) plt.title('Average Sentiment Polarity by Number of Hashtags Used') plt.ylabel('Average Sentiment Polarity') plt.xlabel('Number of Hashtags Used') plt.xticks(rotation=0) plt.show()
df_comments['word_count'] = df_comments['comment'].apply(lambda text: len(text.split())) df_comments['char_count'] = df_comments['comment'].apply(len)
average_lengths = df_comments.groupby('sentiment')[['word_count', 'char_count']].mean() average_lengths
average_lengths['word_count'].plot(kind='bar', figsize=(8, 6)) plt.title('Average Number of Words in Comments by Sentiment') plt.ylabel('Average Number of Words') plt.xlabel('Sentiment') plt.xticks(rotation=0) plt.show()
average_lengths['char_count'].plot(kind='bar', figsize=(8, 6)) plt.title('Average Number of Characters in Comments by Sentiment') plt.ylabel('Average Number of Characters') plt.xlabel('Sentiment') plt.xticks(rotation=0) plt.show()
average_lengths_emoji = df_comments.groupby('emoji used')[['word_count', 'char_count']].mean() average_lengths_emoji
average_lengths_emoji['word_count'].plot(kind='bar', figsize=(8, 6)) plt.title('Average Number of Words in Comments by Emoji Use') plt.ylabel('Average Number of Words') plt.xlabel('Emoji Used') plt.xticks(rotation=0) plt.show()
average_lengths_emoji['char_count'].plot(kind='bar', figsize=(8, 6)) plt.title('Average Number of Characters in Comments by Emoji Use') plt.ylabel('Average Number of Characters') plt.xlabel('Emoji Used') plt.xticks(rotation=0) plt.show()
hashtag_usage_counts = df_comments['Hashtags used count'].value_counts()
hashtag_usage_counts = hashtag_usage_counts.sort_index() hashtag_usage_counts
hashtag_usage_counts.plot(kind='bar', figsize=(8, 6)) plt.title('Distribution of the Number of Hashtags Used in Comments') plt.ylabel('Number of Comments') plt.xlabel('Number of Hashtags Used') plt.xticks(rotation=0) plt.show()
user_sentiment_distribution = df_comments.groupby('User id')['sentiment'].value_counts(normalize=True).unstack().fillna(0)
user_sentiment_distribution = user_sentiment_distribution * 100
user_with_most_negative_comments = user_sentiment_distribution['negative'].idxmax() user_with_most_neutral_comments = user_sentiment_distribution['neutral'].idxmax() user_with_most_negative_comments, user_with_most_neutral_comments, user_sentiment_distribution.loc[[user_with_most_negative_comments, user_with_most_neutral_comments]]
user_sentiment_distribution = df_comments.groupby('User id')['sentiment'].value_counts(normalize=True).unstack().fillna(0)
user_sentiment_distribution = user_sentiment_distribution * 100
user_with_most_negative_comments = user_sentiment_distribution['negative'].idxmax() user_with_most_neutral_comments = user_sentiment_distribution['neutral'].idxmax() user_with_most_negative_comments, user_with_most_neutral_comments, user_sentiment_distribution.loc[[user_with_most_negative_comments, user_with_most_neutral_comments]]
sentiment_counts = df_comments['sentiment'].value_counts() emoji_sentiment_proportions = df_comments.groupby('sentiment')['emoji used'].value_counts(normalize=True).unstack().fillna(0) * 100
fig, axs = plt.subplots(2, 2, figsize=(20, 15))
sentiment_counts.plot(kind='bar', ax=axs[0, 0], color='skyblue') axs[0, 0].set_title('Distribution of Comments by Sentiment') axs[0, 0].set_xlabel('Sentiment') axs[0, 0].set_ylabel('Number of Comments') axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=0)
emoji_sentiment_proportions.plot(kind='bar', stacked=True, ax=axs[0, 1]) axs[0, 1].set_title('Proportion of Comments With Emojis by Sentiment') axs[0, 1].set_xlabel('Sentiment') axs[0, 1].set_ylabel('Percentage') axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=0)
hashtag_usage_counts.plot(kind='bar', ax=axs[1, 0], color='skyblue') axs[1, 0].set_title('Distribution of Comments by Number of Hashtags Used') axs[1, 0].set_xlabel('Number of Hashtags Used') axs[1, 0].set_ylabel('Number of Comments') axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=0)
average_lengths['char_count'].plot(kind='bar', ax=axs[1, 1], color='skyblue') axs[1, 1].set_title('Average Number of Characters in Comments by Sentiment') axs[1, 1].set_xlabel('Sentiment') axs[1, 1].set_ylabel('Average Number of Characters') axs[1, 1].set_xticklabels(axs[1, 1].get_xticklabels(), rotation=0) plt.tight_layout() plt.show()
For each user, calculate the total number of comments and the distribution of their comments by sentiment
user_comments_sentiment = df_comments.groupby('User id')['sentiment'].value_counts().unstack().fillna(0)
user_comments_sentiment['total_comments'] = user_comments_sentiment.sum(axis=1)
user_comments_sentiment = user_comments_sentiment.sort_values('total_comments', ascending=False)
top_commenting_users = user_comments_sentiment.head(10)
least_commenting_users = user_comments_sentiment.tail(10) top_commenting_users, least_commenting_users
Create pie charts showing the sentiment distribution for the most frequent and least frequent commenters
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
top_commenting_users[['negative', 'neutral']].mean().plot(kind='pie', ax=axs[0], autopct='%1.1f%%') axs[0].set_ylabel('') axs[0].set_title('Sentiment Distribution for the Most Frequent Commenters')
least_commenting_users[['negative', 'neutral']].mean().plot(kind='pie', ax=axs[1], autopct='%1.1f%%') axs[1].set_ylabel('') axs[1].set_title('Sentiment Distribution for the Least Frequent Commenters') plt.show()
average_hashtags_by_user = df_comments.groupby('User id')['Hashtags used count'].mean()
user_with_most_hashtags = average_hashtags_by_user.idxmax() user_with_least_hashtags = average_hashtags_by_user.idxmin() user_with_most_hashtags, user_with_least_hashtags, average_hashtags_by_user.loc[[user_with_most_hashtags, user_with_least_hashtags]]
For each user, for each number of hashtags used, calculate the distribution of their comments by sentiment
user_hashtag_sentiment_distribution = df_comments.groupby(['User id', 'Hashtags used count'])['sentiment'].value_counts(normalize=True).unstack().fillna(0)
user_hashtag_sentiment_distribution = user_hashtag_sentiment_distribution * 100
Identify the users and the number of hashtags used that have the highest proportions of negative and neutral comments
user_hashtag_with_most_negative_comments = user_hashtag_sentiment_distribution['negative'].idxmax() user_hashtag_with_most_neutral_comments = user_hashtag_sentiment_distribution['neutral'].idxmax() user_hashtag_with_most_negative_comments, user_hashtag_with_most_neutral_comments, user_hashtag_sentiment_distribution.loc[[user_hashtag_with_most_negative_comments, user_hashtag_with_most_neutral_comments]]
Visualize the sentiment distribution for the user and number of hashtags with the highest proportion of negative comments
user_hashtag_sentiment_distribution.loc[[user_hashtag_with_most_negative_comments]].plot(kind='bar', stacked=True, figsize=(10, 6)) plt.title(f'Sentiment Distribution for User {user_hashtag_with_most_negative_comments[0]} Using {user_hashtag_with_most_negative_comments[1]} Hashtags') plt.ylabel('Percentage') plt.xlabel('Sentiment') plt.xticks(rotation=0) plt.show()
Visualize the sentiment distribution for the user and number of hashtags with the highest proportion of neutral comments
user_hashtag_sentiment_distribution.loc[[user_hashtag_with_most_neutral_comments]].plot(kind='bar', stacked=True, figsize=(10, 6)) plt.title(f'Sentiment Distribution for User {user_hashtag_with_most_neutral_comments[0]} Using {user_hashtag_with_most_neutral_comments[1]} Hashtags') plt.ylabel('Percentage') plt.xlabel('Sentiment') plt.xticks(rotation=0) plt.show()
user_emoji_hashtag_usage = df_comments.groupby(['User id', 'emoji used'])['Hashtags used count'].mean()
Identify the users and the emoji usage (yes or no) that have the highest and lowest average number of hashtags used
user_emoji_with_most_hashtags = user_emoji_hashtag_usage.idxmax() user_emoji_with_least_hashtags = user_emoji_hashtag_usage.idxmin() user_emoji_with_most_hashtags, user_emoji_with_least_hashtags, user_emoji_hashtag_usage.loc[[user_emoji_with_most_hashtags, user_emoji_with_least_hashtags]]
average_length_by_hashtag_count = df_comments.groupby('Hashtags used count')['char_count'].mean()
average_length_by_hashtag_count.plot(kind='bar', figsize=(8, 6), color='skyblue') plt.title('Average Number of Characters in Comments for Each Number of Hashtags Used') plt.ylabel('Average Number of Characters') plt.xlabel('Number of Hashtags Used') plt.xticks(rotation=0) plt.show() fig, axs = plt.subplots(2, 2, figsize=(20, 15))
average_length_by_hashtag_count.plot(kind='bar', ax=axs[0, 0], color='skyblue') axs[0, 0].set_title('Average Number of Characters in Comments for Each Number of Hashtags Used') axs[0, 0].set_xlabel('Number of Hashtags Used') axs[0, 0].set_ylabel('Average Number of Characters') axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=0)
Sentiment Distribution for the User and Number of Hashtags with the Highest Proportion of Negative Comments
user_hashtag_sentiment_distribution.loc[[user_hashtag_with_most_negative_comments]].plot(kind='bar', stacked=True, ax=axs[0, 1]) axs[0, 1].set_title(f'Sentiment Distribution for User {user_hashtag_with_most_negative_comments[0]} Using {user_hashtag_with_most_negative_comments[1]} Hashtags') axs[0, 1].set_xlabel('Sentiment') axs[0, 1].set_ylabel('Percentage') axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=0)
Sentiment Distribution for the User and Number of Hashtags with the Highest Proportion of Neutral Comments
user_hashtag_sentiment_distribution.loc[[user_hashtag_with_most_neutral_comments]].plot(kind='bar', stacked=True, ax=axs[1, 0]) axs[1, 0].set_title(f'Sentiment Distribution for User {user_hashtag_with_most_neutral_comments[0]} Using {user_hashtag_with_most_neutral_comments[1]} Hashtags') axs[1, 0].set_xlabel('Sentiment') axs[1, 0].set_ylabel('Percentage') axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=0)
top_commenting_users[['negative', 'neutral']].mean().plot(kind='pie', autopct='%1.1f%%', ax=axs[1, 1]) axs[1, 1].set_title('Sentiment Distribution for the Most Frequent Commenters') axs[1, 1].set_ylabel('') plt.tight_layout() plt.show()