data_cleaning.ipynb.orig

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing and Cleaning PHEME RNR Dataset Events\n",
    "\n",
    "This notebook cleans the tweet-level data generated by `lib/pheme_parsing.py` and aggregates it into thread-level tabular data. It also serves as a sanity check after modifying `lib/pheme_parsing.py`.\n",
    "\n",
    "## Instructions\n",
    "1. Update the variable `event` in the cell below with one of the following events:\n",
    "    1. germanwings-crash\n",
    "    1. ferguson\n",
    "    1. ottawashooting\n",
    "    1. sydneysiege\n",
    "    1. charliehebdo\n",
    "1. Run all the cells in this notebook to generate thread-level CSV files in the `data/threads` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dependencies for this Jupyter Notebook\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import networkx as nx\n",
    "from functools import reduce\n",
    "from lib.util import fetch_tweets\n",
    "\n",
    "event = \"charliehebdo\"  # Change this value to clean a different PHEME event"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing and Cleaning Data\n",
    "This step takes the raw PHEME rumor dataset and saves it in tabular format as a CSV file. The original PHEME dataset consists of JSON files organized into directories by event and category (rumor or non-rumor). The three functions below parse the data, save it as a CSV file (if necessary), and load it into this notebook as a Pandas DataFrame from the \"cached\" CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "gw = fetch_tweets(event)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tweet-Level Features\n",
    "\n",
    "| Name/Column       | Description                   | Type   | Notes  |\n",
    "|-------------------|-------------------------------|--------| ------ |\n",
    "| is_rumor          | Was this classified as rumor  | \"bool\" (`int`) | *Classification done by journalists* |\n",
    "| thread            | Source tweet id               | `str`  |                                                   |\n",
    "| in_reply_tweet    | Tweet ID in reply to          | `str`  |                                                   |\n",
    "| event             | Name of the PHEME event       | `str`  | Corresponds to event in the PHEME dataset         |\n",
    "| tweet_id          | Unique ID for tweet           | `str`  | This field is the ID referenced in `in_reply_tweet`     |\n",
    "| is_source_tweet   | Is this the source tweet?     | \"bool\" (`int`) |                                                   |\n",
    "| in_reply_user     | User ID in reply to           | `str`  |                                                   |\n",
    "| user_id           | Twitter User's ID             | `str`  | This field is the ID referenced in `in_reply_user` |\n",
    "| tweet_length      | Number of characters in tweet | `int`  |                                                   |\n",
    "| urls_count        | Number of URLs in tweet       | `int`  |                                                   |\n",
    "| hashtags_count    | Number of hashtags in tweet   | `int`  |                                                   |\n",
    "| retweet_count     | Times the tweet was retweeted | `int`  |                                                   |\n",
    "| favorite_count    | Number of times favorited     | `int`  |                                                   |\n",
    "| mentions_count    | Number of users mentioned     | `int`  |                                                   |\n",
    "| is_truncated      | Is this tweet truncated       | \"bool\" (`int`) | Did the user type more than 140 characters? [See Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates) |\n",
    "| created              | Datetime Tweet was created    | `datetime` | |\n",
    "| has_smile_emoji      | Does Tweet contain \"😊\"?       | \"bool\" (`int`) | 😊 is the smile emoji |\n",
    "| user.tweets_count    | User's tweet total, currently | `int`  | |\n",
    "| user.verified        | Is Twitter user verified?     | \"bool\" (`int`) |                                                   |\n",
    "| user.followers_count | Total number of followers  | `int` | |\n",
    "| user.listed_count    | Number of public lists that include the user | `int` | |\n",
    "| user.friends_count   | Number of accounts the user follows | `int` | |\n",
    "| user.time_zone       | Timezone of the user's Twitter account | `str` | |\n",
    "| user.desc_length     | Length of the user's biographic description | `int` | |\n",
    "| user.has_bg_img      | Does user have a profile background image?  | \"bool\" (`int`) | |\n",
    "| user.default_pic     | Does the user have the default profile picture? | \"bool\" (`int`) | |\n",
    "| user.created_at      | Date and time Twitter account was activated | `datetime` | |\n",
    "| user.utc_dist        | TK | `int` | See [this blog post on time and the Twitter API](https://zacharyst.com/2017/04/05/assigning-the-correct-time-to-a-twee) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Event Data Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 38268 entries, 0 to 38267\n",
      "Data columns (total 59 columns):\n",
      "is_rumor                38268 non-null int64\n",
      "thread                  38268 non-null object\n",
      "in_reply_tweet          38268 non-null object\n",
      "event                   38268 non-null object\n",
      "tweet_id                38268 non-null object\n",
      "is_source_tweet         38268 non-null int64\n",
      "in_reply_user           38268 non-null object\n",
      "user_id                 38268 non-null object\n",
      "tweet_length            38268 non-null int64\n",
      "symbol_count            38268 non-null int64\n",
      "user_mentions           38268 non-null int64\n",
      "urls_count              38268 non-null int64\n",
      "media_count             38268 non-null int64\n",
      "hashtags_count          38268 non-null int64\n",
      "retweet_count           38268 non-null int64\n",
      "favorite_count          38268 non-null int64\n",
      "mentions_count          38268 non-null int64\n",
      "is_truncated            38268 non-null int64\n",
      "created                 38268 non-null float64\n",
      "has_smile_emoji         38268 non-null int64\n",
      "sensitive               38268 non-null int64\n",
      "has_place               38268 non-null int64\n",
      "has_coords              38268 non-null int64\n",
      "has_quest               38268 non-null int64\n",
      "has_exclaim             38268 non-null int64\n",
      "has_quest_or_exclaim    38268 non-null int64\n",
      "user.tweets_count       38268 non-null int64\n",
      "user.verified           38268 non-null int64\n",
      "user.followers_count    38268 non-null int64\n",
      "user.listed_count       38268 non-null int64\n",
      "user.desc_length        38268 non-null int64\n",
      "user.handle_length      38268 non-null int64\n",
      "user.name_length        38268 non-null int64\n",
      "user.notifications      38268 non-null int64\n",
      "user.friends_count      38268 non-null int64\n",
      "user.time_zone          24681 non-null object\n",
      "user.has_bg_img         38268 non-null int64\n",
      "user.default_pic        38268 non-null int64\n",
      "user.created_at         38268 non-null float64\n",
      "user.location           38268 non-null int64\n",
      "user.profile_sbcolor    38268 non-null int64\n",
      "user.profile_bgcolor    38268 non-null int64\n",
      "user.utc_dist           18557 non-null float64\n",
      "hasperiod               38268 non-null int64\n",
      "number_punct            38268 non-null int64\n",
      "negativewordcount       38268 non-null int64\n",
      "positivewordcount       38268 non-null int64\n",
      "capitalratio            38268 non-null float64\n",
      "contentlength           38268 non-null int64\n",
      "sentimentscore          38268 non-null float64\n",
      "Noun                    38268 non-null int64\n",
      "Verb                    38268 non-null int64\n",
      "Adjective               38268 non-null int64\n",
      "Pronoun                 38268 non-null int64\n",
      "FirstPersonPronoun      38268 non-null int64\n",
      "SecondPersonPronoun     38268 non-null int64\n",
      "ThirdPersonPronoun      38268 non-null int64\n",
      "Adverb                  38268 non-null int64\n",
      "has_url_in_text         38268 non-null int64\n",
      "dtypes: float64(5), int64(47), object(7)\n",
      "memory usage: 17.2+ MB\n"
     ]
     }
   ],
   "source": [
    "gw.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `.head()` method displays the first 5 rows of the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>is_rumor</th>\n",
       "      <th>thread</th>\n",
       "      <th>in_reply_tweet</th>\n",
       "      <th>event</th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>is_source_tweet</th>\n",
       "      <th>in_reply_user</th>\n",
       "      <th>user_id</th>\n",
       "      <th>tweet_length</th>\n",
       "      <th>symbol_count</th>\n",
       "      <th>...</th>\n",
       "      <th>sentimentscore</th>\n",
       "      <th>Noun</th>\n",
       "      <th>Verb</th>\n",
       "      <th>Adjective</th>\n",
       "      <th>Pronoun</th>\n",
       "      <th>FirstPersonPronoun</th>\n",
       "      <th>SecondPersonPronoun</th>\n",
       "      <th>ThirdPersonPronoun</th>\n",
       "      <th>Adverb</th>\n",
       "      <th>has_url_in_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>nan</td>\n",
       "      <td>charliehebdo</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>1</td>\n",
       "      <td>nan</td>\n",
       "      <td>1379288282</td>\n",
       "      <td>144</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>5.535357659979694e+17</td>\n",
       "      <td>charliehebdo</td>\n",
       "      <td>553536824673861633</td>\n",
       "      <td>0</td>\n",
       "      <td>1379288282.0</td>\n",
       "      <td>628636580</td>\n",
       "      <td>135</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.050</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>5.535357659979694e+17</td>\n",
       "      <td>charliehebdo</td>\n",
       "      <td>553545896739086336</td>\n",
       "      <td>0</td>\n",
       "      <td>1379288282.0</td>\n",
       "      <td>514523937</td>\n",
       "      <td>128</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>5.535357659979694e+17</td>\n",
       "      <td>charliehebdo</td>\n",
       "      <td>553536468782571520</td>\n",
       "      <td>0</td>\n",
       "      <td>1379288282.0</td>\n",
       "      <td>1623557642</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.125</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>553535765997969408</td>\n",
       "      <td>5.535357659979694e+17</td>\n",
       "      <td>charliehebdo</td>\n",
       "      <td>553540960718962690</td>\n",
       "      <td>0</td>\n",
       "      <td>1379288282.0</td>\n",
       "      <td>61921063</td>\n",
       "      <td>105</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.500</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 59 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   is_rumor              thread         in_reply_tweet         event  \\\n",
       "0         0  553535765997969408                    nan  charliehebdo   \n",
       "1         0  553535765997969408  5.535357659979694e+17  charliehebdo   \n",
       "2         0  553535765997969408  5.535357659979694e+17  charliehebdo   \n",
       "3         0  553535765997969408  5.535357659979694e+17  charliehebdo   \n",
       "4         0  553535765997969408  5.535357659979694e+17  charliehebdo   \n",
       "\n",
       "             tweet_id  is_source_tweet in_reply_user     user_id  \\\n",
       "0  553535765997969408                1           nan  1379288282   \n",
       "1  553536824673861633                0  1379288282.0   628636580   \n",
       "2  553545896739086336                0  1379288282.0   514523937   \n",
       "3  553536468782571520                0  1379288282.0  1623557642   \n",
       "4  553540960718962690                0  1379288282.0    61921063   \n",
       "\n",
       "   tweet_length  symbol_count       ...         sentimentscore  Noun  Verb  \\\n",
       "0           144             0       ...                  0.000     8     2   \n",
       "1           135             0       ...                 -0.050     8     4   \n",
       "2           128             0       ...                  0.000    10     2   \n",
       "3           122             0       ...                 -0.125     6     4   \n",
       "4           105             0       ...                  0.500     4     2   \n",
       "\n",
       "   Adjective  Pronoun  FirstPersonPronoun  SecondPersonPronoun  \\\n",
       "0          1        1                   0                    0   \n",
       "1          1        1                   0                    0   \n",
       "2          0        1                   0                    0   \n",
       "3          1        3                   1                    0   \n",
       "4          1        2                   1                    1   \n",
       "\n",
       "   ThirdPersonPronoun  Adverb  has_url_in_text  \n",
       "0                   2       0                1  \n",
       "1                   3       0                0  \n",
       "2                   3       0                0  \n",
       "3                   2       0                0  \n",
       "4                   2       0                1  \n",
       "\n",
       "[5 rows x 59 columns]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
     }
   ],
   "source": [
    "gw.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean Columns\n",
    "\n",
    "The `describe` method gives summary information about each column in the DataFrame. Each of these columns, except `is_truncated`, should have two unique values.\n",
    "\n",
    "As a sanity check, the cell below converts these boolean columns to type `bool` and describes them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<<<<<<< local <modified: text/html, text/plain>\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>is_rumor</th>\n",
       "      <th>is_source_tweet</th>\n",
       "      <th>is_truncated</th>\n",
       "      <th>has_smile_emoji</th>\n",
       "      <th>user.verified</th>\n",
       "      <th>user.has_bg_img</th>\n",
       "      <th>user.default_pic</th>\n",
       "      <th>sensitive</th>\n",
       "      <th>has_place</th>\n",
       "      <th>has_coords</th>\n",
       "      <th>user.notifications</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "      <td>23996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>15320</td>\n",
       "      <td>22775</td>\n",
       "      <td>23996</td>\n",
       "      <td>23973</td>\n",
       "      <td>22943</td>\n",
       "      <td>21776</td>\n",
       "      <td>14425</td>\n",
       "      <td>23933</td>\n",
       "      <td>23039</td>\n",
       "      <td>23481</td>\n",
       "      <td>23996</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       is_rumor is_source_tweet is_truncated has_smile_emoji user.verified  \\\n",
       "count     23996           23996        23996           23996         23996   \n",
       "unique        2               2            1               2             2   \n",
       "top       False           False        False           False         False   \n",
       "freq      15320           22775        23996           23973         22943   \n",
       "\n",
       "       user.has_bg_img user.default_pic sensitive has_place has_coords  \\\n",
       "count            23996            23996     23996     23996      23996   \n",
       "unique               2                2         2         2          2   \n",
       "top               True            False     False     False      False   \n",
       "freq             21776            14425     23933     23039      23481   \n",
       "\n",
       "       user.notifications  \n",
       "count               23996  \n",
       "unique                  1  \n",
       "top                 False  \n",
       "freq                23996  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "=======\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>is_rumor</th>\n",
       "      <th>is_source_tweet</th>\n",
       "      <th>is_truncated</th>\n",
       "      <th>has_smile_emoji</th>\n",
       "      <th>user.verified</th>\n",
       "      <th>user.has_bg_img</th>\n",
       "      <th>user.default_pic</th>\n",
       "      <th>sensitive</th>\n",
       "      <th>has_place</th>\n",
       "      <th>has_coords</th>\n",
       "      <th>user.notifications</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "      <td>38268</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>30923</td>\n",
       "      <td>36189</td>\n",
       "      <td>38268</td>\n",
       "      <td>38248</td>\n",
       "      <td>36659</td>\n",
       "      <td>34638</td>\n",
       "      <td>22117</td>\n",
       "      <td>38111</td>\n",
       "      <td>36415</td>\n",
       "      <td>37353</td>\n",
       "      <td>38268</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       is_rumor is_source_tweet is_truncated has_smile_emoji user.verified  \\\n",
       "count     38268           38268        38268           38268         38268   \n",
       "unique        2               2            1               2             2   \n",
       "top       False           False        False           False         False   \n",
       "freq      30923           36189        38268           38248         36659   \n",
       "\n",
       "       user.has_bg_img user.default_pic sensitive has_place has_coords  \\\n",
       "count            38268            38268     38268     38268      38268   \n",
       "unique               2                2         2         2          2   \n",
       "top               True            False     False     False      False   \n",
       "freq             34638            22117     38111     36415      37353   \n",
       "\n",
       "       user.notifications  \n",
       "count               38268  \n",
       "unique                  1  \n",
       "top                 False  \n",
       "freq                38268  "
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      ">>>>>>> remote <modified: text/html, text/plain>\n"
     ]
    }
   ],
   "source": [
    "bool_columns = [\"is_rumor\", \"is_source_tweet\", \"is_truncated\", \n",
    "                \"has_smile_emoji\", \"user.verified\", \"user.has_bg_img\", \n",
    "                \"user.default_pic\", \"sensitive\", \"has_place\", \"has_coords\", \"user.notifications\"]\n",
    "\n",
    "gw[bool_columns].astype(bool).describe(include=\"bool\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some columns have only one unique value across all tweets in a given PHEME event. Rather than dropping them, we only flag them here, because which columns are constant may vary across PHEME events."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<<<<<<< local\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "=======\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning, column `event` only has one unique value \"charliehebdo\"\n",
      "Warning, column `is_truncated` only has one unique value \"0\"\n",
      "Warning, column `user.notifications` only has one unique value \"0\"\n",
      "Warning, column `Adverb` only has one unique value \"0\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      ">>>>>>> remote\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<<<<<<< local <modified: >\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning, column `event` only has one unique value \"sydneysiege\"\n",
      "Warning, column `symbol_count` only has one unique value \"0\"\n",
      "Warning, column `is_truncated` only has one unique value \"0\"\n",
      "Warning, column `user.notifications` only has one unique value \"0\"\n",
      "Warning, column `Adverb` only has one unique value \"0\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "=======\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      ">>>>>>> remote <removed>\n"
     ]
    }
   ],
   "source": [
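    "# Flag columns that hold a single value for every tweet in this event\n",
    "for col in gw.columns:\n",
    "    if gw[col].nunique(dropna=False) == 1:\n",
    "        # .iloc[0] reads the first row by position, whatever the index labels are\n",
    "        print(\"Warning, column `%s` only has one unique value \\\"%s\\\"\" % (col, gw[col].iloc[0]))"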
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tweet-level data we'll use in our data analysis will be in this form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<<<<<<< local\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>is_rumor</th>\n",
       "      <th>is_source_tweet</th>\n",
       "      <th>tweet_length</th>\n",
       "      <th>symbol_count</th>\n",
       "      <th>user_mentions</th>\n",
       "      <th>urls_count</th>\n",
       "      <th>media_count</th>\n",
       "      <th>hashtags_count</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>...</th>\n",
       "      <th>sentimentscore</th>\n",
       "      <th>Noun</th>\n",
       "      <th>Verb</th>\n",
       "      <th>Adjective</th>\n",
       "      <th>Pronoun</th>\n",
       "      <th>FirstPersonPronoun</th>\n",
       "      <th>SecondPersonPronoun</th>\n",
       "      <th>ThirdPersonPronoun</th>\n",
       "      <th>Adverb</th>\n",
       "      <th>has_url_in_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.0</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.000000</td>\n",
       "      <td>23996.0</td>\n",
       "      <td>23996.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.361560</td>\n",
       "      <td>0.050883</td>\n",
       "      <td>92.492207</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.662319</td>\n",
       "      <td>0.100558</td>\n",
       "      <td>0.073429</td>\n",
       "      <td>0.235873</td>\n",
       "      <td>22.363602</td>\n",
       "      <td>20.978038</td>\n",
       "      <td>...</td>\n",
       "      <td>0.027869</td>\n",
       "      <td>5.512002</td>\n",
       "      <td>2.343266</td>\n",
       "      <td>0.902942</td>\n",
       "      <td>0.645441</td>\n",
       "      <td>0.150567</td>\n",
       "      <td>0.210660</td>\n",
       "      <td>0.388648</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.136398</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.480462</td>\n",
       "      <td>0.219764</td>\n",
       "      <td>38.931510</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.970727</td>\n",
       "      <td>0.312172</td>\n",
       "      <td>0.260845</td>\n",
       "      <td>0.622993</td>\n",
       "      <td>697.656358</td>\n",
       "      <td>1080.436569</td>\n",
       "      <td>...</td>\n",
       "      <td>0.311927</td>\n",
       "      <td>2.882990</td>\n",
       "      <td>1.782173</td>\n",
       "      <td>1.004158</td>\n",
       "      <td>0.932062</td>\n",
       "      <td>0.427781</td>\n",
       "      <td>0.526763</td>\n",
       "      <td>0.703738</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.343218</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>59.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>98.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>130.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.136364</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>152.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>99524.000000</td>\n",
       "      <td>149783.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>25.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 52 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           is_rumor  is_source_tweet  tweet_length  symbol_count  \\\n",
       "count  23996.000000     23996.000000  23996.000000       23996.0   \n",
       "mean       0.361560         0.050883     92.492207           0.0   \n",
       "std        0.480462         0.219764     38.931510           0.0   \n",
       "min        0.000000         0.000000      3.000000           0.0   \n",
       "25%        0.000000         0.000000     59.000000           0.0   \n",
       "50%        0.000000         0.000000     98.000000           0.0   \n",
       "75%        1.000000         0.000000    130.000000           0.0   \n",
       "max        1.000000         1.000000    152.000000           0.0   \n",
       "\n",
       "       user_mentions    urls_count   media_count  hashtags_count  \\\n",
       "count   23996.000000  23996.000000  23996.000000    23996.000000   \n",
       "mean        1.662319      0.100558      0.073429        0.235873   \n",
       "std         0.970727      0.312172      0.260845        0.622993   \n",
       "min         0.000000      0.000000      0.000000        0.000000   \n",
       "25%         1.000000      0.000000      0.000000        0.000000   \n",
       "50%         1.000000      0.000000      0.000000        0.000000   \n",
       "75%         2.000000      0.000000      0.000000        0.000000   \n",
       "max        11.000000      3.000000      1.000000       12.000000   \n",
       "\n",
       "       retweet_count  favorite_count       ...         sentimentscore  \\\n",
       "count   23996.000000    23996.000000       ...           23996.000000   \n",
       "mean       22.363602       20.978038       ...               0.027869   \n",
       "std       697.656358     1080.436569       ...               0.311927   \n",
       "min         0.000000        0.000000       ...              -1.000000   \n",
       "25%         0.000000        0.000000       ...               0.000000   \n",
       "50%         0.000000        0.000000       ...               0.000000   \n",
       "75%         0.000000        1.000000       ...               0.136364   \n",
       "max     99524.000000   149783.000000       ...               1.000000   \n",
       "\n",
       "               Noun          Verb     Adjective       Pronoun  \\\n",
       "count  23996.000000  23996.000000  23996.000000  23996.000000   \n",
       "mean       5.512002      2.343266      0.902942      0.645441   \n",
       "std        2.882990      1.782173      1.004158      0.932062   \n",
       "min        0.000000      0.000000      0.000000      0.000000   \n",
       "25%        3.000000      1.000000      0.000000      0.000000   \n",
       "50%        5.000000      2.000000      1.000000      0.000000   \n",
       "75%        7.000000      4.000000      1.000000      1.000000   \n",
       "max       25.000000     11.000000      7.000000      9.000000   \n",
       "\n",
       "       FirstPersonPronoun  SecondPersonPronoun  ThirdPersonPronoun   Adverb  \\\n",
       "count        23996.000000         23996.000000        23996.000000  23996.0   \n",
       "mean             0.150567             0.210660            0.388648      0.0   \n",
       "std              0.427781             0.526763            0.703738      0.0   \n",
       "min              0.000000             0.000000            0.000000      0.0   \n",
       "25%              0.000000             0.000000            0.000000      0.0   \n",
       "50%              0.000000             0.000000            0.000000      0.0   \n",
       "75%              0.000000             0.000000            1.000000      0.0   \n",
       "max              5.000000             7.000000            6.000000      0.0   \n",
       "\n",
       "       has_url_in_text  \n",
       "count     23996.000000  \n",
       "mean          0.136398  \n",
       "std           0.343218  \n",
       "min           0.000000  \n",
       "25%           0.000000  \n",
       "50%           0.000000  \n",
       "75%           0.000000  \n",
       "max           1.000000  \n",
       "\n",
       "[8 rows x 52 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "=======\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>is_rumor</th>\n",
       "      <th>is_source_tweet</th>\n",
       "      <th>tweet_length</th>\n",
       "      <th>symbol_count</th>\n",
       "      <th>user_mentions</th>\n",
       "      <th>urls_count</th>\n",
       "      <th>media_count</th>\n",
       "      <th>hashtags_count</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>...</th>\n",
       "      <th>sentimentscore</th>\n",
       "      <th>Noun</th>\n",
       "      <th>Verb</th>\n",
       "      <th>Adjective</th>\n",
       "      <th>Pronoun</th>\n",
       "      <th>FirstPersonPronoun</th>\n",
       "      <th>SecondPersonPronoun</th>\n",
       "      <th>ThirdPersonPronoun</th>\n",
       "      <th>Adverb</th>\n",
       "      <th>has_url_in_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.000000</td>\n",
       "      <td>38268.0</td>\n",
       "      <td>38268.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.191936</td>\n",
       "      <td>0.054327</td>\n",
       "      <td>97.651249</td>\n",
       "      <td>0.000026</td>\n",
       "      <td>1.767090</td>\n",
       "      <td>0.115710</td>\n",
       "      <td>0.100711</td>\n",
       "      <td>0.275792</td>\n",
       "      <td>24.313578</td>\n",
       "      <td>11.876450</td>\n",
       "      <td>...</td>\n",
       "      <td>0.025037</td>\n",
       "      <td>5.837044</td>\n",
       "      <td>2.326278</td>\n",
       "      <td>0.927851</td>\n",
       "      <td>0.672468</td>\n",
       "      <td>0.148662</td>\n",
       "      <td>0.223738</td>\n",
       "      <td>0.388027</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.178713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.393828</td>\n",
       "      <td>0.226666</td>\n",
       "      <td>37.748508</td>\n",
       "      <td>0.005112</td>\n",
       "      <td>1.044231</td>\n",
       "      <td>0.337685</td>\n",
       "      <td>0.301210</td>\n",
       "      <td>0.700195</td>\n",
       "      <td>478.360416</td>\n",
       "      <td>260.287571</td>\n",
       "      <td>...</td>\n",
       "      <td>0.297044</td>\n",
       "      <td>2.877994</td>\n",
       "      <td>1.787683</td>\n",
       "      <td>1.039827</td>\n",
       "      <td>0.967401</td>\n",
       "      <td>0.430173</td>\n",
       "      <td>0.557142</td>\n",
       "      <td>0.716427</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.383117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>67.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>107.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>133.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.100000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>152.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>74130.000000</td>\n",
       "      <td>37983.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>26.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 52 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           is_rumor  is_source_tweet  tweet_length  symbol_count  \\\n",
       "count  38268.000000     38268.000000  38268.000000  38268.000000   \n",
       "mean       0.191936         0.054327     97.651249      0.000026   \n",
       "std        0.393828         0.226666     37.748508      0.005112   \n",
       "min        0.000000         0.000000      3.000000      0.000000   \n",
       "25%        0.000000         0.000000     67.000000      0.000000   \n",
       "50%        0.000000         0.000000    107.000000      0.000000   \n",
       "75%        0.000000         0.000000    133.000000      0.000000   \n",
       "max        1.000000         1.000000    152.000000      1.000000   \n",
       "\n",
       "       user_mentions    urls_count   media_count  hashtags_count  \\\n",
       "count   38268.000000  38268.000000  38268.000000    38268.000000   \n",
       "mean        1.767090      0.115710      0.100711        0.275792   \n",
       "std         1.044231      0.337685      0.301210        0.700195   \n",
       "min         0.000000      0.000000      0.000000        0.000000   \n",
       "25%         1.000000      0.000000      0.000000        0.000000   \n",
       "50%         2.000000      0.000000      0.000000        0.000000   \n",
       "75%         2.000000      0.000000      0.000000        0.000000   \n",
       "max         9.000000      5.000000      2.000000       10.000000   \n",
       "\n",
       "       retweet_count  favorite_count       ...         sentimentscore  \\\n",
       "count   38268.000000    38268.000000       ...           38268.000000   \n",
       "mean       24.313578       11.876450       ...               0.025037   \n",
       "std       478.360416      260.287571       ...               0.297044   \n",
       "min         0.000000        0.000000       ...              -1.000000   \n",
       "25%         0.000000        0.000000       ...               0.000000   \n",
       "50%         0.000000        0.000000       ...               0.000000   \n",
       "75%         0.000000        1.000000       ...               0.100000   \n",
       "max     74130.000000    37983.000000       ...               1.000000   \n",
       "\n",
       "               Noun          Verb     Adjective       Pronoun  \\\n",
       "count  38268.000000  38268.000000  38268.000000  38268.000000   \n",
       "mean       5.837044      2.326278      0.927851      0.672468   \n",
       "std        2.877994      1.787683      1.039827      0.967401   \n",
       "min        0.000000      0.000000      0.000000      0.000000   \n",
       "25%        4.000000      1.000000      0.000000      0.000000   \n",
       "50%        6.000000      2.000000      1.000000      0.000000   \n",
       "75%        8.000000      3.000000      1.000000      1.000000   \n",
       "max       26.000000     11.000000      8.000000     11.000000   \n",
       "\n",
       "       FirstPersonPronoun  SecondPersonPronoun  ThirdPersonPronoun   Adverb  \\\n",
       "count        38268.000000         38268.000000        38268.000000  38268.0   \n",
       "mean             0.148662             0.223738            0.388027      0.0   \n",
       "std              0.430173             0.557142            0.716427      0.0   \n",
       "min              0.000000             0.000000            0.000000      0.0   \n",
       "25%              0.000000             0.000000            0.000000      0.0   \n",
       "50%              0.000000             0.000000            0.000000      0.0   \n",
       "75%              0.000000             0.000000            1.000000      0.0   \n",
       "max              5.000000            11.000000            7.000000      0.0   \n",
       "\n",
       "       has_url_in_text  \n",
       "count     38268.000000  \n",
       "mean          0.178713  \n",
       "std           0.383117  \n",
       "min           0.000000  \n",
       "25%           0.000000  \n",
       "50%           0.000000  \n",
       "75%           0.000000  \n",
       "max           1.000000  \n",
       "\n",
       "[8 rows x 52 columns]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      ">>>>>>> remote\n"
     ]
    }
   ],
   "source": [
    "gw.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thread Level Features\n",
    "\n",
    "* **Bold features** represent high performing features identified in C. Buntain and J. Golbeck, [\"Automatically Identifying Fake News in Popular Twitter Threads\"](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8118443&isnumber=8118402)\n",
    "* Features that are normalized are normalized by thread length\n",
    "\n",
    "\n",
    "| Name                | Description                               | Type    | Notes |\n",
    "| ---                 | ---                                       | ---     | ----- |\n",
    "| thread              | Tweet ID of the source tweet              | `str`   | |\n",
    "| favorite_count      | Normalized favorite total                 | `float` | |\n",
    "| retweet_count       | Normlaized retweet total                  | `float` | |\n",
    "| **hashtags_count**  | Normlaized hashtag total                  | `float` | |\n",
    "| urls_count          | URL total normalized by thread length     | `float`  | |\n",
    "| user.tweets_count   | Total tweets by thread users              | `float` | |\n",
    "| event               | Name of PHEME event                       | `str`  | |\n",
    "| is_rumor            | Either rumor or nonrumor                  | `bool` | |\n",
    "| thread_length       | Number of tweets in the thread            | `int`  | |\n",
    "| user.has_bg_img     | Ratio of users who have bg image          | `float`| |\n",
    "| user.default_pic    | Ratio of users with default profile pic   | `float`| |\n",
    "| **has_smile_emoji** | Number of smile emojis in the thread      | `int`  | 😊 is the smile emoji |\n",
    "| user.verified       | Count of verified users in the thread normalized by thread length     | `float`  | |\n",
    "| **src.followers_count** | The number of followers of the original poster of the thread. | `int` | |\n",
    "| src.listed_count    | TODO | `int` | |\n",
    "| src.user_verified   | TODO | `int` | |\n",
    "| src.tweets_total    | TODO | `int` | |\n",
    "| reply_var           | The variance in the timestamps of responses to the source tweet | `float` |\n",
    "| src_age             | Difference in src user's creation and tweet creation            | `int`   | Measured in seconds |\n",
    "| time_to_first_resp  | The difference between tweet creation datetime and 1st reply    | `int`   | Measured in seconds |\n",
    "| time_to_last_resp   | The difference between tweet creation datetime and last reply   | `int`   | Measured in seconds |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "def agg_tweets_by_thread(df):\n",
    "    \n",
    "    shared = lambda x: 1 - len(set(x)) / len(x)\n",
    "    shared.__name__ = \"shared\"\n",
    "\n",
    "    funcs = [np.mean, sum, np.var]\n",
    "    agg_props = {\n",
    "        \"favorite_count\": funcs,\n",
    "        \"user_mentions\": funcs,\n",
    "        \"media_count\": funcs,\n",
    "        \"sensitive\": funcs,\n",
    "        \"has_place\": funcs,\n",
    "        \"has_coords\": funcs,\n",
    "        \"retweet_count\": funcs,\n",
    "        \"hashtags_count\": funcs + [shared],\n",
    "        \"urls_count\": funcs,\n",
    "        \"user.tweets_count\": funcs,\n",
    "        \"is_rumor\": max,\n",
    "        \"tweet_id\": len,\n",
    "        \"user.has_bg_img\": funcs,\n",
    "        \"has_quest\": funcs,\n",
    "        \"has_exclaim\": funcs,\n",
    "        \"has_quest_or_exclaim\": funcs,\n",
    "        \"user.default_pic\": funcs,\n",
    "        \"has_smile_emoji\": funcs,\n",
    "        \"user.verified\": funcs,\n",
    "        \"user.name_length\": funcs,\n",
    "        \"user.handle_length\": funcs,\n",
    "        \"user.profile_sbcolor\": funcs,\n",
    "        \"user.profile_bgcolor\": funcs,\n",
    "        \n",
    "        \"hasperiod\": funcs,\n",
    "        \"number_punct\": funcs,\n",
    "        \"negativewordcount\" : funcs,\n",
    "        \"positivewordcount\" : funcs,\n",
    "        \"capitalratio\" : funcs,\n",
    "        \"contentlength\" : funcs,\n",
    "        \"sentimentscore\" : funcs,\n",
    "        \"Noun\" : funcs,\n",
    "        \"Verb\" : funcs,\n",
    "        \"Adjective\" : funcs,\n",
    "        \"Pronoun\" : funcs,\n",
    "        \"Adverb\": funcs,\n",
    "    }\n",
    "    rename = {\n",
    "        \"tweet_id\": \"thread_length\"\n",
    "    }\n",
    "\n",
    "    def g(x):\n",
    "        # Add size of largest user-to-user conversation component in each thread        \n",
    "        d = []\n",
    "        thread_tweets = list(x[\"tweet_id\"])\n",
    "        G = nx.from_pandas_edgelist(df[df.tweet_id.isin(thread_tweets)], \"user_id\", \"in_reply_user\")\n",
    "        Gc = max(nx.connected_component_subgraphs(G), key=len)\n",
    "        d.append(nx.number_connected_components(G))\n",
    "        d.append(nx.diameter(Gc))\n",
    "        return pd.Series(d, index=[\"component_count\", \"largest_cc_diameter\"])\n",
    "    \n",
    "    # Step 0: Build graph-based features\n",
    "    graph = df.groupby(\"thread\").apply(g)\n",
    "    \n",
    "    # Step 1: Build simple aggregate features\n",
    "    agg = df.groupby(\"thread\")\\\n",
    "        .agg(agg_props)\\\n",
    "        .rename(columns=rename)\n",
    "    \n",
    "    agg.columns = [ \"_\".join(x) for x in agg.columns.ravel() ]\n",
    "    agg = agg.rename(columns={\"is_rumor_max\": \"is_rumor\", \"thread_length_len\": \"thread_length\"})\n",
    "    \n",
    "    # Step 2: Builds some features off the source tweet, which has tweet_id == thread            \n",
    "    src = df[df[\"is_source_tweet\"] == 1][[\"thread\",\n",
    "                                          \"user.followers_count\", \n",
    "                                          \"user.listed_count\",\n",
    "                                          \"user.verified\",\n",
    "                                          \"created\",\n",
    "                                          \"user.created_at\",\n",
    "                                          \"user.tweets_count\"]] \\\n",
    "                         .rename(columns={\"user.followers_count\": \"src.followers_count\",\n",
    "                                          \"user.listed_count\": \"src.listed_count\",\n",
    "                                          \"user.verified\": \"src.user_verified\",\n",
    "                                          \"user.created_at\": \"src.created_at\",\n",
    "                                          \"user.tweets_count\": \"src.tweets_total\"})\n",
    "    \n",
    "    # Step 3: Build features off of the reply tweets\n",
    "    def f(x):\n",
    "        d = []\n",
    "        \n",
    "        # Get various features from the distribution of times of reply tweet\n",
    "        d.append(min(x[\"created\"]))\n",
    "        d.append(max(x[\"created\"]))\n",
    "        d.append(np.var(x[\"created\"]))\n",
    "                \n",
    "        return pd.Series(d, index=[\"first_resp\", \"last_resp\",\"resp_var\"])\n",
    "        \n",
    "    replies = df[df[\"is_source_tweet\"] == False] \\\n",
    "        .groupby(\"thread\") \\\n",
    "        .apply(f)\n",
    "\n",
    "    graph_features = df.groupby(\"thread\").apply(g)\n",
    "    \n",
    "    dfs = [agg, src, replies, graph]\n",
    "    thrd_data = reduce(lambda left, right: pd.merge(left,right, on=\"thread\"), dfs)\n",
    "    \n",
    "    # Step 3: Add miscelaneous features\n",
    "    # Remember timestamps increase as time progresses\n",
    "    # src.created_at < created < first_resp < last_resp\n",
    "    thrd_data[\"time_to_first_resp\"] = thrd_data[\"first_resp\"] - thrd_data[\"created\"]\n",
    "    thrd_data[\"time_to_last_resp\"] = thrd_data[\"last_resp\"] - thrd_data[\"created\"]\n",
    "    \n",
    "    return thrd_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['thread', 'user.verified_mean', 'user.verified_sum',\n",
       "       'user.verified_var', 'number_punct_mean', 'number_punct_sum',\n",
       "       'number_punct_var', 'thread_length', 'user.has_bg_img_mean',\n",
       "       'user.has_bg_img_sum', 'user.has_bg_img_var', 'urls_count_mean',\n",
       "       'urls_count_sum', 'urls_count_var', 'user_mentions_mean',\n",
       "       'user_mentions_sum', 'user_mentions_var', 'positivewordcount_mean',\n",
       "       'positivewordcount_sum', 'positivewordcount_var',\n",
       "       'user.profile_bgcolor_mean', 'user.profile_bgcolor_sum',\n",
       "       'user.profile_bgcolor_var', 'has_exclaim_mean', 'has_exclaim_sum',\n",
       "       'has_exclaim_var', 'Noun_mean', 'Noun_sum', 'Noun_var',\n",
       "       'user.name_length_mean', 'user.name_length_sum',\n",
       "       'user.name_length_var', 'media_count_mean', 'media_count_sum',\n",
       "       'media_count_var', 'user.profile_sbcolor_mean',\n",
       "       'user.profile_sbcolor_sum', 'user.profile_sbcolor_var',\n",
       "       'hashtags_count_mean', 'hashtags_count_sum', 'hashtags_count_var',\n",
       "       'hashtags_count_shared', 'has_smile_emoji_mean',\n",
       "       'has_smile_emoji_sum', 'has_smile_emoji_var',\n",
       "       'negativewordcount_mean', 'negativewordcount_sum',\n",
       "       'negativewordcount_var', 'Adverb_mean', 'Adverb_sum', 'Adverb_var',\n",
       "       'user.handle_length_mean', 'user.handle_length_sum',\n",
       "       'user.handle_length_var', 'is_rumor', 'user.default_pic_mean',\n",
       "       'user.default_pic_sum', 'user.default_pic_var',\n",
       "       'capitalratio_mean', 'capitalratio_sum', 'capitalratio_var',\n",
       "       'has_quest_mean', 'has_quest_sum', 'has_quest_var',\n",
       "       'favorite_count_mean', 'favorite_count_sum', 'favorite_count_var',\n",
       "       'has_coords_mean', 'has_coords_sum', 'has_coords_var', 'Verb_mean',\n",
       "       'Verb_sum', 'Verb_var', 'Pronoun_mean', 'Pronoun_sum',\n",
       "       'Pronoun_var', 'Adjective_mean', 'Adjective_sum', 'Adjective_var',\n",
       "       'user.tweets_count_mean', 'user.tweets_count_sum',\n",
       "       'user.tweets_count_var', 'has_quest_or_exclaim_mean',\n",
       "       'has_quest_or_exclaim_sum', 'has_quest_or_exclaim_var',\n",
       "       'retweet_count_mean', 'retweet_count_sum', 'retweet_count_var',\n",
       "       'sentimentscore_mean', 'sentimentscore_sum', 'sentimentscore_var',\n",
       "       'sensitive_mean', 'sensitive_sum', 'sensitive_var',\n",
       "       'hasperiod_mean', 'hasperiod_sum', 'hasperiod_var',\n",
       "       'has_place_mean', 'has_place_sum', 'has_place_var',\n",
       "       'contentlength_mean', 'contentlength_sum', 'contentlength_var',\n",
       "       'src.followers_count', 'src.listed_count', 'src.user_verified',\n",
       "       'created', 'src.created_at', 'src.tweets_total', 'first_resp',\n",
       "       'last_resp', 'resp_var', 'component_count', 'largest_cc_diameter',\n",
       "       'time_to_first_resp', 'time_to_last_resp'], dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gw_thrds = agg_tweets_by_thread(gw)\n",
    "gw_thrds.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user.verified_mean</th>\n",
       "      <th>user.verified_sum</th>\n",
       "      <th>user.verified_var</th>\n",
       "      <th>number_punct_mean</th>\n",
       "      <th>number_punct_sum</th>\n",
       "      <th>number_punct_var</th>\n",
       "      <th>thread_length</th>\n",
       "      <th>user.has_bg_img_mean</th>\n",
       "      <th>user.has_bg_img_sum</th>\n",
       "      <th>user.has_bg_img_var</th>\n",
       "      <th>...</th>\n",
       "      <th>created</th>\n",
       "      <th>src.created_at</th>\n",
       "      <th>src.tweets_total</th>\n",
       "      <th>first_resp</th>\n",
       "      <th>last_resp</th>\n",
       "      <th>resp_var</th>\n",
       "      <th>component_count</th>\n",
       "      <th>largest_cc_diameter</th>\n",
       "      <th>time_to_first_resp</th>\n",
       "      <th>time_to_last_resp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1173.000000</td>\n",
       "      <td>1.173000e+03</td>\n",
       "      <td>1.173000e+03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.069777</td>\n",
       "      <td>0.885763</td>\n",
       "      <td>0.067245</td>\n",
       "      <td>5.142717</td>\n",
       "      <td>99.625746</td>\n",
       "      <td>19.255169</td>\n",
       "      <td>20.416027</td>\n",
       "      <td>0.904372</td>\n",
       "      <td>18.538789</td>\n",
       "      <td>0.082491</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418628e+12</td>\n",
       "      <td>1.258588e+12</td>\n",
       "      <td>54261.992327</td>\n",
       "      <td>1.418628e+12</td>\n",
       "      <td>1.418656e+12</td>\n",
       "      <td>2.571665e+14</td>\n",
       "      <td>3.086957</td>\n",
       "      <td>3.253197</td>\n",
       "      <td>5.920571e+05</td>\n",
       "      <td>2.815097e+07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.087313</td>\n",
       "      <td>1.178118</td>\n",
       "      <td>0.081256</td>\n",
       "      <td>2.100673</td>\n",
       "      <td>102.446226</td>\n",
       "      <td>19.313091</td>\n",
       "      <td>20.183395</td>\n",
       "      <td>0.111175</td>\n",
       "      <td>18.866639</td>\n",
       "      <td>0.087406</td>\n",
       "      <td>...</td>\n",
       "      <td>1.947740e+07</td>\n",
       "      <td>6.175650e+10</td>\n",
       "      <td>53508.094305</td>\n",
       "      <td>2.008682e+07</td>\n",
       "      <td>4.820095e+07</td>\n",
       "      <td>9.975376e+14</td>\n",
       "      <td>1.338945</td>\n",
       "      <td>1.731509</td>\n",
       "      <td>5.514800e+06</td>\n",
       "      <td>4.450653e+07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.666667</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.333333</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418598e+12</td>\n",
       "      <td>1.167702e+12</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>1.418599e+12</td>\n",
       "      <td>1.418601e+12</td>\n",
       "      <td>0.000000e+00</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000e+00</td>\n",
       "      <td>4.800000e+04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.769231</td>\n",
       "      <td>44.000000</td>\n",
       "      <td>7.383399</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>0.854167</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418609e+12</td>\n",
       "      <td>1.223102e+12</td>\n",
       "      <td>11383.000000</td>\n",
       "      <td>1.418609e+12</td>\n",
       "      <td>1.418625e+12</td>\n",
       "      <td>8.527076e+11</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>4.800000e+04</td>\n",
       "      <td>3.426000e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.050000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.050000</td>\n",
       "      <td>4.756757</td>\n",
       "      <td>77.000000</td>\n",
       "      <td>13.392077</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>0.939394</td>\n",
       "      <td>16.000000</td>\n",
       "      <td>0.058824</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418626e+12</td>\n",
       "      <td>1.244628e+12</td>\n",
       "      <td>34716.000000</td>\n",
       "      <td>1.418627e+12</td>\n",
       "      <td>1.418647e+12</td>\n",
       "      <td>7.666538e+12</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>9.600000e+04</td>\n",
       "      <td>9.717000e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.100000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.099206</td>\n",
       "      <td>6.107143</td>\n",
       "      <td>120.000000</td>\n",
       "      <td>23.194444</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>0.135338</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418645e+12</td>\n",
       "      <td>1.300516e+12</td>\n",
       "      <td>96016.000000</td>\n",
       "      <td>1.418646e+12</td>\n",
       "      <td>1.418666e+12</td>\n",
       "      <td>8.105548e+13</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.980000e+05</td>\n",
       "      <td>3.398500e+07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.750000</td>\n",
       "      <td>27.000000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>15.900000</td>\n",
       "      <td>1646.000000</td>\n",
       "      <td>246.763636</td>\n",
       "      <td>342.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>325.000000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.418659e+12</td>\n",
       "      <td>1.417251e+12</td>\n",
       "      <td>512276.000000</td>\n",
       "      <td>1.418744e+12</td>\n",
       "      <td>1.418940e+12</td>\n",
       "      <td>1.259862e+16</td>\n",
       "      <td>13.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>1.313100e+08</td>\n",
       "      <td>3.178670e+08</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 115 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       user.verified_mean  user.verified_sum  user.verified_var  \\\n",
       "count         1173.000000        1173.000000        1173.000000   \n",
       "mean             0.069777           0.885763           0.067245   \n",
       "std              0.087313           1.178118           0.081256   \n",
       "min              0.000000           0.000000           0.000000   \n",
       "25%              0.000000           0.000000           0.000000   \n",
       "50%              0.050000           1.000000           0.050000   \n",
       "75%              0.100000           1.000000           0.099206   \n",
       "max              0.750000          27.000000           0.500000   \n",
       "\n",
       "       number_punct_mean  number_punct_sum  number_punct_var  thread_length  \\\n",
       "count        1173.000000       1173.000000       1173.000000    1173.000000   \n",
       "mean            5.142717         99.625746         19.255169      20.416027   \n",
       "std             2.100673        102.446226         19.313091      20.183395   \n",
       "min             0.666667          2.000000          0.000000       2.000000   \n",
       "25%             3.769231         44.000000          7.383399      10.000000   \n",
       "50%             4.756757         77.000000         13.392077      18.000000   \n",
       "75%             6.107143        120.000000         23.194444      23.000000   \n",
       "max            15.900000       1646.000000        246.763636     342.000000   \n",
       "\n",
       "       user.has_bg_img_mean  user.has_bg_img_sum  user.has_bg_img_var  \\\n",
       "count           1173.000000          1173.000000          1173.000000   \n",
       "mean               0.904372            18.538789             0.082491   \n",
       "std                0.111175            18.866639             0.087406   \n",
       "min                0.333333             1.000000             0.000000   \n",
       "25%                0.854167             9.000000             0.000000   \n",
       "50%                0.939394            16.000000             0.058824   \n",
       "75%                1.000000            21.000000             0.135338   \n",
       "max                1.000000           325.000000             0.500000   \n",
       "\n",
       "             ...               created  src.created_at  src.tweets_total  \\\n",
       "count        ...          1.173000e+03    1.173000e+03       1173.000000   \n",
       "mean         ...          1.418628e+12    1.258588e+12      54261.992327   \n",
       "std          ...          1.947740e+07    6.175650e+10      53508.094305   \n",
       "min          ...          1.418598e+12    1.167702e+12         32.000000   \n",
       "25%          ...          1.418609e+12    1.223102e+12      11383.000000   \n",
       "50%          ...          1.418626e+12    1.244628e+12      34716.000000   \n",
       "75%          ...          1.418645e+12    1.300516e+12      96016.000000   \n",
       "max          ...          1.418659e+12    1.417251e+12     512276.000000   \n",
       "\n",
       "         first_resp     last_resp      resp_var  component_count  \\\n",
       "count  1.173000e+03  1.173000e+03  1.173000e+03      1173.000000   \n",
       "mean   1.418628e+12  1.418656e+12  2.571665e+14         3.086957   \n",
       "std    2.008682e+07  4.820095e+07  9.975376e+14         1.338945   \n",
       "min    1.418599e+12  1.418601e+12  0.000000e+00         1.000000   \n",
       "25%    1.418609e+12  1.418625e+12  8.527076e+11         2.000000   \n",
       "50%    1.418627e+12  1.418647e+12  7.666538e+12         3.000000   \n",
       "75%    1.418646e+12  1.418666e+12  8.105548e+13         4.000000   \n",
       "max    1.418744e+12  1.418940e+12  1.259862e+16        13.000000   \n",
       "\n",
       "       largest_cc_diameter  time_to_first_resp  time_to_last_resp  \n",
       "count          1173.000000        1.173000e+03       1.173000e+03  \n",
       "mean              3.253197        5.920571e+05       2.815097e+07  \n",
       "std               1.731509        5.514800e+06       4.450653e+07  \n",
       "min               1.000000        0.000000e+00       4.800000e+04  \n",
       "25%               2.000000        4.800000e+04       3.426000e+06  \n",
       "50%               3.000000        9.600000e+04       9.717000e+06  \n",
       "75%               4.000000        1.980000e+05       3.398500e+07  \n",
       "max              12.000000        1.313100e+08       3.178670e+08  \n",
       "\n",
       "[8 rows x 115 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gw_thrds.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gw_thrds.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "'Wrote data to data/threads/sydneysiege.csv'"
       ]
      },
      "execution_count": 11,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "<<<<<<< local\n",
    "fn = \"data/threads/%s.csv\" % dataset_name\n",
    "=======\n",
    "fn = \"data/threads/%s.csv\" % event\n",
    ">>>>>>> remote\n",
    "gw_thrds.to_csv(fn, index=False)\n",
    "\"Wrote data to %s\" % fn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}