exampleSearchReport.Rnw

\documentclass[a4paper]{article}
\SweaveOpts{echo=FALSE}
\usepackage{a4wide}
\usepackage{color}
\usepackage{lscape}


<< echo = F >>=
source("TAGS-Based-Scripts/core-TAGSExplorer-API.R")
require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
searchTerm='#lak12'
searchTerm="#YesToBrumMayor OR #NoToBrumMayor OR #BrumMayor OR ((#Birmingham OR #Brum) AND (#Mayor OR #ElectedMayor))"
fstub='opened12'
searchTerm=paste('#',fstub,sep='')

rdmTweets <- searchTwitter(searchTerm, n=1500)
tw.df=twListToDF(rdmTweets)

#library(iconv)
tw.df$text=iconv(tw.df$text)

tw.df$from_user=tw.df$screenName
# Note that the Twitter search API only goes back 1500 tweets (I think?)
df.data=twParse(tw.df)

df.counts=twCounts(df.data)

require(ggplot2)

@

\title{Example Document Summarising Contents of a Twitter Search}
\author{
Tony Hirst\thanks{@psychemedia}\\License: CC-BY
}

\date{\today}


\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\renewcommand{\topfraction}{0.85}
\renewcommand{\textfraction}{0.1}
\renewcommand{\floatpagefraction}{0.75}

\newpage

\section{Twitter Search Summary}
This report has been automatically generated from Twitter data grabbed by a recent results search via the Twitter API.

(I've got myself in a muddle around new style and RTs... I'm not sure what RT data is being parsed out correctly atm... Maybe it all is...! )

\subsection{Tweet and RT Stats}
A series of summary reports around retweet behaviour observed within the search results.

\begin{figure}[htbp]
\begin{center}
<<exampleRTbarchart, fig = T, echo = F>>=
#plot a bar chart of RT of counts
p=ggplot() + geom_bar(aes(x=na.omit(df.data$rtof))) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
print(p)
@
\caption{RT bar chart}
\end{center}
\end{figure}

The RT bar chart shows a count of who has been retweeted. This view is ordered case sensitive alphabetically.


%comment
\newpage

This take on the RT bar chart  orders the tweeps according to how often they were RTd.

\begin{figure}[htbp]
\begin{center}
<<exampleSortedRTodbarchart, fig = T, echo = F>>=
#sorted plot based on computed counts - "RT of"
df.data$hrt=barsorter(df.data$rtof)
p=ggplot() + geom_bar(aes(x=na.omit(df.data$hrt))) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
print(p)
@
\caption{RT of bar chart}
\end{center}
\end{figure}

\newpage

Sometimes, it makes more sense to present a tabular view of the data itself. However, when rows are ordered according to column, should we also order the presentation of the columns so that the ordering column is the first numerical column. I don't do that for these tables, but maybe I should?

<<label=table1,echo=FALSE,results=tex>>=
require(xtable)
require(plyr)
print(xtable(head(arrange(df.counts,desc(rtofCount)),10), caption = "Top ten users by 'RT of'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@

<<label=table2,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(toCount)),10), caption = "Top ten users by 'to'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@

<<label=table3,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(rtbyCount)),10), caption = "Top ten users by 'RT by'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@

<<label=table4,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(fromCount)),10), caption = "Top ten users by 'from'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@

\newpage
\SweaveOpts{width=12,height=7}
\setkeys{Gin}{width=1.5\textwidth}

This chart captures who (green) RTd whom (red) over time.

\begin{landscape}

\begin{figure}[htbp]
\begin{center}
<<exampleRTconversationChart, fig = T, echo = F>>=
p=ggplot(subset(df.data,subset=(!is.na(rtof))))+geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')+geom_point(aes(x=created,y=screenName),colour='green')+geom_point(aes(x=created,y=rtof),colour='red')
p=p+opts(axis.text.y=theme_text(size=5))

#ggplot(subset(df.data,subset=(!is.na(rtof) & (screenName=='mdpistilli'  |screenName=='kimberlyarnold' ))))+geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')+geom_point(aes(x=created,y=screenName),colour='grey')+geom_point(aes(x=created,y=rtof,colour=screenName))


print(p)
@
\caption{RT graph over time}
\end{center}
\end{figure}

\end{landscape}
\SweaveOpts{width=6,height=6}
\setkeys{Gin}{width=0.8\textwidth}

\newpage

A possibly pointless comparison between the number of tweets from a user and the number of times they are retweeted (how many of the tweets are retweeted, for example? Lots of them a few times, or one of them lots of times?) This really needs to also convey the number of unique tweets that are retweeted, perhaps using label size?

\begin{figure}[htbp]
\begin{center}
<<exampleRTofbyScatter, fig = T, echo = F>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtofCount,label=Name),size=2,angle=45)
print(p)

@
\caption{RT of vs. tweet from counts}
\end{center}
\end{figure}

\newpage

This is maybe a little more meaningful - what proportion of a user's tweets are retweets? I guess I should put an x=y line in here to help with the comparison?

\begin{figure}[htbp]
\begin{center}
<<exampleRTbyScatter, fig = T, echo = F,width=14,height=14>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtbyCount,label=Name),size=5,angle=45)
print(p)

@
\caption{RT by vs. tweet from counts}
\end{center}
\end{figure}

\newpage

Are things any more informative if we also size labels according to the number of times each person has been retweeted? (This is a mixed signal though - RTs as counted may come from different people relating to one tweet, or from the same person over several tweets. )

\begin{figure}[htbp]
\begin{center}
<<exampleRTbyScatterSized, fig = T, echo = F,width=14,height=14>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtbyCount,label=Name,size=rtofCount),angle=45)
print(p)

@
\caption{RT by vs. tweet from counts sized by RT of count}
\end{center}
\end{figure}

\newpage

If we want to get a feeling for what proportion of a person's tweets are RT's, this may help? The size of the point is proportional to the number of tweets the person has sent.

\begin{figure}[htbp]
\begin{center}
<<exampleRTQuotientScatter, fig = T, echo = F,width=14,height=14>>=
p=ggplot(subset(df.counts,rtbyCount>0))+geom_point(aes(x=Name,y=rtbyCount/fromCount,size=rtbyCount))
p=p+ opts(axis.text.x=theme_text(angle=-90,size=5)) + xlab(NULL)
print(p)

@
\caption{RT by/tweet from ratio}
\end{center}
\end{figure}

\newpage

This chart shows folk who tweeted twice or more in the sample, ordered case sensitive alphabetically.

Are tweet volumes by user useful? Is the ordering of names on the axis useful or confusing? Would ordering according to tweet count be more useful? If nothing else, it would help us see the distribution of tweet volumes by user?

\begin{figure}[htbp]
\begin{center}
<<exampleSorted2LimitTweetbarchart, fig = T, echo = F>>=
# Limit the data set to show only folk who tweeted twice or more in the sample
counts=table(df.data$screenName)
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.5)

@
\caption{Folk who tweeted twice or more in the sample}
\end{center}
\end{figure}

\newpage

This chart, inspired by an idea from @mediaczar, shows the accession of Twitters users into usage of the hashtag. Time goes along the x-axis, and the order in which folk first use the tag within the current sample is displayed incrementally along the y-axis. What do you think the long vertical runs, showing the accession of several people into hashtag usage over a short period of time, might represent?

\begin{figure}[htbp]
\begin{center}
<<exampleAccession, fig = T, echo = F,width=12,height=16>>=
tw.dfx=ddply(df.data, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
df.data$screenName=factor(df.data$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
p=ggplot(df.data)+geom_point(aes(x=created,y=screenName))
p=p+opts(axis.text.y=theme_text(size=5))
print(p)
@
\caption{Accession order of tweeps}
\end{center}
\end{figure}

\newpage
A simple hypothesis regarding the rapid accession of large numbers of people into usage of a hashtag might be that they are watching a particular event live and responding to a notable incident occurring within it. Or maybe they all suddenly retweet one or more tweets that contain the hashtag? Colouring accession chart samples according to whether or not a tweet was an RT or not may help us get a feel for whether the latter is the case... 

\begin{figure}[htbp]
\begin{center}
<<exampleAccessionRT, fig = T, echo = F,width=12,height=16>>=

df.data$rtt=sapply(df.data$rtof,function(rt) if (is.na(rt)) 'T' else 'RT')
p=ggplot(df.data)+geom_point(aes(x=created,y=screenName,col=rtt))
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (highlighting RTs)}
\end{center}
\end{figure}

\newpage
If we're interested in folk creating new tweets, we can easily exclude the RT'd tweets from the chart.
\begin{figure}[htbp]
\begin{center}
<<exampleAccessionNoRT, fig = T, echo = F,width=12,height=16>>=

p=ggplot(subset(df.data,rtt=='T'))+geom_point(aes(x=created,y=screenName,col=rtt),colour='aquamarine3')
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (ignoring RTs)}
\end{center}
\end{figure}

\newpage
The complementary approach, only showing RTs, is a little less useful though, I think? Note that I really should force the colour in this chart to maintain consistency of colour across all the charts in the doc...
\begin{figure}[htbp]
\begin{center}
<<exampleAccessionOnlyRT, fig = T, echo = F,width=12,height=16>>=

p=ggplot(subset(df.data,rtt=='RT'))+geom_point(aes(x=created,y=screenName),colour='red')
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (only showing RTs)}
\end{center}
\end{figure}

\newpage
We do this because we can, right?;-) A wordcloud (with usernames removed) showing popular terms mentioned in tag bearing tweets.

\begin{figure}[htbp]
\begin{center}
<<wordcloud, fig = T, echo = F>>=
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}

tweets <- as.vector(sapply(df.data$text, RemoveAtPeople))

require(tm)
generateCorpus= function(df,my.stopwords=c()){
  #Install the textmining library
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  ## I wonder if it would make sense to remove @d names first?
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)

  tw.corpus
}

wordcloud.generate=function(corpus,min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

try(print(wordcloud.generate(generateCorpus(tweets),7)))
@
\caption{Wordcloud of tweets with @usernames excluded)}
\end{center}
\end{figure}

\newpage
The wordcloud with the search tag excluded.
\begin{figure}[htbp]
\begin{center}
<<wordcloudNoRootTag, fig = T, echo = F>>=
try(print(wordcloud.generate(generateCorpus(tweets,fstub),7)))
@
\caption{Wordcloud of tweets with @usernames excluded)}
\end{center}
\end{figure}

\newpage
A chart displaying the usage of hashtags across tweets in the sample, ordered alphabetically. Would an ordering according to the number of times each particular tag was used make more sense? Sould case be ignored? If case isn't ignored, should we maybe use a stacked barchart to count each case-normalised tweet, whilst also denoting the range if use of different capitalisations through stack segments? (I'm not sure I know how to do that in this case, but hey, that's what this is about, right?!;-)

\begin{figure}[htbp]
\begin{center}
<<exampleHashtagCount, fig = T, echo = F>>=
#hashtag processing via http://stackoverflow.com/a/9360445/454773
hashtagAugment=function(tmp){
  #I think we need to defend against cases with zero tagged or untagged tweets?
  tags <- str_extract_all(tmp$text, '#[a-zA-Z0-9]+')
  index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))
  if (length(index)!=0 || index ){
    tagged <- tmp[index, ]
    tagged$tag <- unlist(tags)
  } else {tagged=data.frame()}
  has_no_tag <- sapply(tags, function(x) length(x) == 0L)
  not_tagged <- tmp[has_no_tag, ]
  rbind(tagged, not_tagged)
}
df.data.t=hashtagAugment(df.data)

p=ggplot(df.data.t,aes(x=na.omit(tag)))+geom_bar(aes(y=(..count..))) + xlab(NULL) + opts(axis.text.x=theme_text(angle=-90,size=6))

print(p)
@
\caption{Example hashtag count}
\end{center}
\end{figure}

\newpage
Another accession ordered usage chart, this time for hashtags appearing within the sample. I guess I really should do a matrix style chart showing what tags were retwteeted with each other?

\begin{figure}[htbp]
\begin{center}
<<exampleHashtagOverTime, fig = T, echo = F,width=12,height=16>>=
#Order the tag factor levels in the order in which they were first created
tw.dfx=ddply(df.data.t, .var = "tag", .fun = function(x) {return(subset(x, created %in% min(created),select=c(tag,created)))})
tw.dfxa=arrange(tw.dfx,-desc(created))
df.data.t$tag=factor(df.data.t$tag, levels = tw.dfxa$tag)

#and plot the result
p=ggplot(df.data.t)+geom_point(aes(x=created,y=tag))
print(p)
@
\caption{Hashtag Usage Over Time}
\end{center}
\end{figure}

\newpage
Are things any more informative if we distinguish betwen tweets that were RTs or not?

\begin{figure}[htbp]
\begin{center}
<<exampleHashtagOverTimeRTcol, fig = T, echo = F,width=12,height=16>>=
# plot the result
p=ggplot(df.data.t)+geom_point(aes(x=created,y=tag,colour=rtt))
print(p)
@
\caption{Hashtag Usage Over Time}
\end{center}
\end{figure}

\newpage

One of the line charts everyone expects to see - a daily count of tweets over time. This is coarse grained, and uses an R time series analysis function (xts/apply.daily) to generate a plot that counts tweets within day bins. (I don't think it goes down to the level of hour?) I gues I should also do a cumulative volume chart? And if we were looking at an archive over a long period of time, maybe even a rolling average? Or would that be too lossy of signal across the time dimension.

\begin{figure}[htbp]
\begin{center}
<<exampleTweetsTimeseries, fig = T, echo = F>>=
require(xts)
#The xts function creates a timeline from a vector of values and a vector of timestamps.
#If we know how many tweets we have, we can just create a simple list or vector containing that number of 1s
ts=xts(rep(1,times=nrow(df.data)),df.data$created)

#We can now do some handy number crunching on the timeseries, such as applying a formula to values contained with day, week, month, quarter or year time bins.
#So for example, if we sum the unit values in daily bin, we can get a count of the number of tweets per day
ts.sum=apply.daily(ts,sum) 
#also apply. weekly, monthly, quarterly, yearly

#If for any reason we need to turn the timeseries into a dataframe, we can:
#http://stackoverflow.com/a/3387259/454773
ts.sum.df=data.frame(date=index(ts.sum), coredata(ts.sum))

colnames(ts.sum.df)=c('date','sum')

#We can then use ggplot to plot the timeseries...
p=ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))

print(p)
@
\caption{Tweeting activity}
\end{center}
\end{figure}


\newpage

This chart shows how folk are RTd over time, though not by whom... The ordering of tweeps is alphabetical; would it make more sense to order by number of times folk are RTd, or according to the date at which they were first RTd in the current sample?
\begin{figure}[htbp]
\begin{center}
<<exampleRTofUserActivity, fig = T, echo = F,width=12,height=16>>=
#We can start to get a feel for who RTs whom...
#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...
p=ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
print(p)
@
\caption{RT activity around a person (how they were RTd)}
\end{center}
\end{figure}


\newpage
It is easy enough to chart who was retweeted by whom, though. (I guess what this chart needs are points sized according to the number of RTs of one person by another?) A large number of points going across a row show that person has been retweeted by a lot of separate individuals. I'm not sure about the ordering of tweeps in this chart though?

\begin{figure}[htbp]
\begin{center}
<<exampleWHoRTWhom, fig = T, echo = F,width=12,height=16>>=
#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
p=ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6))
p=p+ylab('Person retweeted')+xlab('Person Retweeting')
print(p)
@
\caption{Who Retweeted whom?}
\end{center}
\end{figure}


\newpage
It is arguably easier to look across a row to spot who's been retweeting a lot of different people. As with the previous chart, the ordering of tweeps on x and y axes doesn't really work for me in this case either. But what would be better?

\begin{figure}[htbp]
\begin{center}
<<exampleWHoRTWhom2, fig = T, echo = F,width=12,height=16>>=
# Plot who RTd whom
p=ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(y=screenName,x=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6))
p=p+xlab('Person retweeted')+ylab('Person Retweeting')
print(p)
@
\caption{Who Retweeted whom?}
\end{center}
\end{figure}


%comment


\SweaveInput{doodles/exampleSearchReachReport.Rnw}

%comment

\end{document}