-
Notifications
You must be signed in to change notification settings - Fork 3
/
exampleSearchReport.Rnw
507 lines (397 loc) · 18.5 KB
/
exampleSearchReport.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
\documentclass[a4paper]{article}
\SweaveOpts{echo=FALSE}
\usepackage{a4wide}
\usepackage{color}
\usepackage{lscape}
<< echo = F >>=
source("TAGS-Based-Scripts/core-TAGSExplorer-API.R")
require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
searchTerm='#lak12'
searchTerm="#YesToBrumMayor OR #NoToBrumMayor OR #BrumMayor OR ((#Birmingham OR #Brum) AND (#Mayor OR #ElectedMayor))"
fstub='opened12'
searchTerm=paste('#',fstub,sep='')
rdmTweets <- searchTwitter(searchTerm, n=1500)
tw.df=twListToDF(rdmTweets)
#library(iconv)
tw.df$text=iconv(tw.df$text)
tw.df$from_user=tw.df$screenName
# Note that the Twitter search API only goes back 1500 tweets (I think?)
df.data=twParse(tw.df)
df.counts=twCounts(df.data)
require(ggplot2)
@
\title{Example Document Summarising Contents of a Twitter Search}
\author{
Tony Hirst\thanks{@psychemedia}\\License: CC-BY
}
\date{\today}
\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\renewcommand{\topfraction}{0.85}
\renewcommand{\textfraction}{0.1}
\renewcommand{\floatpagefraction}{0.75}
\newpage
\section{Twitter Search Summary}
This report has been automatically generated from Twitter data grabbed by a recent results search via the Twitter API.
(I've got myself in a muddle around new style and RTs... I'm not sure what RT data is being parsed out correctly atm... Maybe it all is...! )
\subsection{Tweet and RT Stats}
A series of summary reports around retweet behaviour observed within the search results.
\begin{figure}[htbp]
\begin{center}
<<exampleRTbarchart, fig = T, echo = F>>=
#plot a bar chart of RT of counts
p=ggplot() + geom_bar(aes(x=na.omit(df.data$rtof))) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
print(p)
@
\caption{RT bar chart}
\end{center}
\end{figure}
The RT bar chart shows a count of who has been retweeted. This view is ordered case sensitive alphabetically.
%comment
\newpage
This take on the RT bar chart orders the tweeps according to how often they were RTd.
\begin{figure}[htbp]
\begin{center}
<<exampleSortedRTodbarchart, fig = T, echo = F>>=
#sorted plot based on computed counts - "RT of"
df.data$hrt=barsorter(df.data$rtof)
p=ggplot() + geom_bar(aes(x=na.omit(df.data$hrt))) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
print(p)
@
\caption{RT of bar chart}
\end{center}
\end{figure}
\newpage
Sometimes, it makes more sense to present a tabular view of the data itself. However, when rows are ordered according to column, should we also order the presentation of the columns so that the ordering column is the first numerical column. I don't do that for these tables, but maybe I should?
<<label=table1,echo=FALSE,results=tex>>=
require(xtable)
require(plyr)
print(xtable(head(arrange(df.counts,desc(rtofCount)),10), caption = "Top ten users by 'RT of'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@
<<label=table2,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(toCount)),10), caption = "Top ten users by 'to'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@
<<label=table3,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(rtbyCount)),10), caption = "Top ten users by 'RT by'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@
<<label=table4,echo=FALSE,results=tex>>=
print(xtable(head(arrange(df.counts,desc(fromCount)),10), caption = "Top ten users by 'from'' count",caption.placement = "top"))
##But how do we also order the columns, so eg the sortedBy column is first?
@
\newpage
\SweaveOpts{width=12,height=7}
\setkeys{Gin}{width=1.5\textwidth}
This chart captures who (green) RTd whom (red) over time.
\begin{landscape}
\begin{figure}[htbp]
\begin{center}
<<exampleRTconversationChart, fig = T, echo = F>>=
p=ggplot(subset(df.data,subset=(!is.na(rtof))))+geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')+geom_point(aes(x=created,y=screenName),colour='green')+geom_point(aes(x=created,y=rtof),colour='red')
p=p+opts(axis.text.y=theme_text(size=5))
#ggplot(subset(df.data,subset=(!is.na(rtof) & (screenName=='mdpistilli' |screenName=='kimberlyarnold' ))))+geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')+geom_point(aes(x=created,y=screenName),colour='grey')+geom_point(aes(x=created,y=rtof,colour=screenName))
print(p)
@
\caption{RT graph over time}
\end{center}
\end{figure}
\end{landscape}
\SweaveOpts{width=6,height=6}
\setkeys{Gin}{width=0.8\textwidth}
\newpage
A possibly pointless comparison between the number of tweets from a user and the number of times they are retweeted (how many of the tweets are retweeted, for example? Lots of them a few times, or one of them lots of times?) This really needs to also convey the number of unique tweets that are retweeted, perhaps using label size?
\begin{figure}[htbp]
\begin{center}
<<exampleRTofbyScatter, fig = T, echo = F>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtofCount,label=Name),size=2,angle=45)
print(p)
@
\caption{RT of vs. tweet from counts}
\end{center}
\end{figure}
\newpage
This is maybe a little more meaningful - what proportion of a user's tweets are retweets? I guess I should put an x=y line in here to help with the comparison?
\begin{figure}[htbp]
\begin{center}
<<exampleRTbyScatter, fig = T, echo = F,width=14,height=14>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtbyCount,label=Name),size=5,angle=45)
print(p)
@
\caption{RT by vs. tweet from counts}
\end{center}
\end{figure}
\newpage
Are things any more informative if we also size labels according to the number of times each person has been retweeted? (This is a mixed signal though - RTs as counted may come from different people relating to one tweet, or from the same person over several tweets. )
\begin{figure}[htbp]
\begin{center}
<<exampleRTbyScatterSized, fig = T, echo = F,width=14,height=14>>=
p=ggplot(df.counts)+geom_text(aes(x=fromCount,y=rtbyCount,label=Name,size=rtofCount),angle=45)
print(p)
@
\caption{RT by vs. tweet from counts sized by RT of count}
\end{center}
\end{figure}
\newpage
If we want to get a feeling for what proportion of a person's tweets are RT's, this may help? The size of the point is proportional to the number of tweets the person has sent.
\begin{figure}[htbp]
\begin{center}
<<exampleRTQuotientScatter, fig = T, echo = F,width=14,height=14>>=
p=ggplot(subset(df.counts,rtbyCount>0))+geom_point(aes(x=Name,y=rtbyCount/fromCount,size=rtbyCount))
p=p+ opts(axis.text.x=theme_text(angle=-90,size=5)) + xlab(NULL)
print(p)
@
\caption{RT by/tweet from ratio}
\end{center}
\end{figure}
\newpage
This chart shows folk who tweeted twice or more in the sample, ordered case sensitive alphabetically.
Are tweet volumes by user useful? Is the ordering of names on the axis useful or confusing? Would ordering according to tweet count be more useful? If nothing else, it would help us see the distribution of tweet volumes by user?
\begin{figure}[htbp]
\begin{center}
<<exampleSorted2LimitTweetbarchart, fig = T, echo = F>>=
# Limit the data set to show only folk who tweeted twice or more in the sample
counts=table(df.data$screenName)
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.5)
@
\caption{Folk who tweeted twice or more in the sample}
\end{center}
\end{figure}
\newpage
This chart, inspired by an idea from @mediaczar, shows the accession of Twitters users into usage of the hashtag. Time goes along the x-axis, and the order in which folk first use the tag within the current sample is displayed incrementally along the y-axis. What do you think the long vertical runs, showing the accession of several people into hashtag usage over a short period of time, might represent?
\begin{figure}[htbp]
\begin{center}
<<exampleAccession, fig = T, echo = F,width=12,height=16>>=
tw.dfx=ddply(df.data, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
df.data$screenName=factor(df.data$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
p=ggplot(df.data)+geom_point(aes(x=created,y=screenName))
p=p+opts(axis.text.y=theme_text(size=5))
print(p)
@
\caption{Accession order of tweeps}
\end{center}
\end{figure}
\newpage
A simple hypothesis regarding the rapid accession of large numbers of people into usage of a hashtag might be that they are watching a particular event live and responding to a notable incident occurring within it. Or maybe they all suddenly retweet one or more tweets that contain the hashtag? Colouring accession chart samples according to whether or not a tweet was an RT or not may help us get a feel for whether the latter is the case...
\begin{figure}[htbp]
\begin{center}
<<exampleAccessionRT, fig = T, echo = F,width=12,height=16>>=
df.data$rtt=sapply(df.data$rtof,function(rt) if (is.na(rt)) 'T' else 'RT')
p=ggplot(df.data)+geom_point(aes(x=created,y=screenName,col=rtt))
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (highlighting RTs)}
\end{center}
\end{figure}
\newpage
If we're interested in folk creating new tweets, we can easily exclude the RT'd tweets from the chart.
\begin{figure}[htbp]
\begin{center}
<<exampleAccessionNoRT, fig = T, echo = F,width=12,height=16>>=
p=ggplot(subset(df.data,rtt=='T'))+geom_point(aes(x=created,y=screenName,col=rtt),colour='aquamarine3')
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (ignoring RTs)}
\end{center}
\end{figure}
\newpage
The complementary approach, only showing RTs, is a little less useful though, I think? Note that I really should force the colour in this chart to maintain consistency of colour across all the charts in the doc...
\begin{figure}[htbp]
\begin{center}
<<exampleAccessionOnlyRT, fig = T, echo = F,width=12,height=16>>=
p=ggplot(subset(df.data,rtt=='RT'))+geom_point(aes(x=created,y=screenName),colour='red')
p=p+opts(axis.text.y=theme_text(size=6))
print(p)
@
\caption{Accession of tweeps (only showing RTs)}
\end{center}
\end{figure}
\newpage
We do this because we can, right?;-) A wordcloud (with usernames removed) showing popular terms mentioned in tag bearing tweets.
\begin{figure}[htbp]
\begin{center}
<<wordcloud, fig = T, echo = F>>=
RemoveAtPeople <- function(tweet) {
gsub("@\\w+", "", tweet)
}
tweets <- as.vector(sapply(df.data$text, RemoveAtPeople))
require(tm)
generateCorpus= function(df,my.stopwords=c()){
#Install the textmining library
tw.corpus= Corpus(VectorSource(df))
# remove punctuation
## I wonder if it would make sense to remove @d names first?
tw.corpus = tm_map(tw.corpus, removePunctuation)
#normalise case
tw.corpus = tm_map(tw.corpus, tolower)
# remove stopwords
tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
tw.corpus
}
wordcloud.generate=function(corpus,min.freq=3){
require(wordcloud)
doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
dm = as.matrix(doc.m)
# calculate the frequency of words
v = sort(rowSums(dm), decreasing=TRUE)
d = data.frame(word=names(v), freq=v)
wc=wordcloud(d$word, d$freq, min.freq=min.freq)
wc
}
try(print(wordcloud.generate(generateCorpus(tweets),7)))
@
\caption{Wordcloud of tweets with @usernames excluded)}
\end{center}
\end{figure}
\newpage
The wordcloud with the search tag excluded.
\begin{figure}[htbp]
\begin{center}
<<wordcloudNoRootTag, fig = T, echo = F>>=
try(print(wordcloud.generate(generateCorpus(tweets,fstub),7)))
@
\caption{Wordcloud of tweets with @usernames excluded)}
\end{center}
\end{figure}
\newpage
A chart displaying the usage of hashtags across tweets in the sample, ordered alphabetically. Would an ordering according to the number of times each particular tag was used make more sense? Sould case be ignored? If case isn't ignored, should we maybe use a stacked barchart to count each case-normalised tweet, whilst also denoting the range if use of different capitalisations through stack segments? (I'm not sure I know how to do that in this case, but hey, that's what this is about, right?!;-)
\begin{figure}[htbp]
\begin{center}
<<exampleHashtagCount, fig = T, echo = F>>=
#hashtag processing via http://stackoverflow.com/a/9360445/454773
hashtagAugment=function(tmp){
#I think we need to defend against cases with zero tagged or untagged tweets?
tags <- str_extract_all(tmp$text, '#[a-zA-Z0-9]+')
index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))
if (length(index)!=0 || index ){
tagged <- tmp[index, ]
tagged$tag <- unlist(tags)
} else {tagged=data.frame()}
has_no_tag <- sapply(tags, function(x) length(x) == 0L)
not_tagged <- tmp[has_no_tag, ]
rbind(tagged, not_tagged)
}
df.data.t=hashtagAugment(df.data)
p=ggplot(df.data.t,aes(x=na.omit(tag)))+geom_bar(aes(y=(..count..))) + xlab(NULL) + opts(axis.text.x=theme_text(angle=-90,size=6))
print(p)
@
\caption{Example hashtag count}
\end{center}
\end{figure}
\newpage
Another accession ordered usage chart, this time for hashtags appearing within the sample. I guess I really should do a matrix style chart showing what tags were retwteeted with each other?
\begin{figure}[htbp]
\begin{center}
<<exampleHashtagOverTime, fig = T, echo = F,width=12,height=16>>=
#Order the tag factor levels in the order in which they were first created
tw.dfx=ddply(df.data.t, .var = "tag", .fun = function(x) {return(subset(x, created %in% min(created),select=c(tag,created)))})
tw.dfxa=arrange(tw.dfx,-desc(created))
df.data.t$tag=factor(df.data.t$tag, levels = tw.dfxa$tag)
#and plot the result
p=ggplot(df.data.t)+geom_point(aes(x=created,y=tag))
print(p)
@
\caption{Hashtag Usage Over Time}
\end{center}
\end{figure}
\newpage
Are things any more informative if we distinguish betwen tweets that were RTs or not?
\begin{figure}[htbp]
\begin{center}
<<exampleHashtagOverTimeRTcol, fig = T, echo = F,width=12,height=16>>=
# plot the result
p=ggplot(df.data.t)+geom_point(aes(x=created,y=tag,colour=rtt))
print(p)
@
\caption{Hashtag Usage Over Time}
\end{center}
\end{figure}
\newpage
One of the line charts everyone expects to see - a daily count of tweets over time. This is coarse grained, and uses an R time series analysis function (xts/apply.daily) to generate a plot that counts tweets within day bins. (I don't think it goes down to the level of hour?) I gues I should also do a cumulative volume chart? And if we were looking at an archive over a long period of time, maybe even a rolling average? Or would that be too lossy of signal across the time dimension.
\begin{figure}[htbp]
\begin{center}
<<exampleTweetsTimeseries, fig = T, echo = F>>=
require(xts)
#The xts function creates a timeline from a vector of values and a vector of timestamps.
#If we know how many tweets we have, we can just create a simple list or vector containing that number of 1s
ts=xts(rep(1,times=nrow(df.data)),df.data$created)
#We can now do some handy number crunching on the timeseries, such as applying a formula to values contained with day, week, month, quarter or year time bins.
#So for example, if we sum the unit values in daily bin, we can get a count of the number of tweets per day
ts.sum=apply.daily(ts,sum)
#also apply. weekly, monthly, quarterly, yearly
#If for any reason we need to turn the timeseries into a dataframe, we can:
#http://stackoverflow.com/a/3387259/454773
ts.sum.df=data.frame(date=index(ts.sum), coredata(ts.sum))
colnames(ts.sum.df)=c('date','sum')
#We can then use ggplot to plot the timeseries...
p=ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))
print(p)
@
\caption{Tweeting activity}
\end{center}
\end{figure}
\newpage
This chart shows how folk are RTd over time, though not by whom... The ordering of tweeps is alphabetical; would it make more sense to order by number of times folk are RTd, or according to the date at which they were first RTd in the current sample?
\begin{figure}[htbp]
\begin{center}
<<exampleRTofUserActivity, fig = T, echo = F,width=12,height=16>>=
#We can start to get a feel for who RTs whom...
#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...
p=ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
print(p)
@
\caption{RT activity around a person (how they were RTd)}
\end{center}
\end{figure}
\newpage
It is easy enough to chart who was retweeted by whom, though. (I guess what this chart needs are points sized according to the number of RTs of one person by another?) A large number of points going across a row show that person has been retweeted by a lot of separate individuals. I'm not sure about the ordering of tweeps in this chart though?
\begin{figure}[htbp]
\begin{center}
<<exampleWHoRTWhom, fig = T, echo = F,width=12,height=16>>=
#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
p=ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6))
p=p+ylab('Person retweeted')+xlab('Person Retweeting')
print(p)
@
\caption{Who Retweeted whom?}
\end{center}
\end{figure}
\newpage
It is arguably easier to look across a row to spot who's been retweeting a lot of different people. As with the previous chart, the ordering of tweeps on x and y axes doesn't really work for me in this case either. But what would be better?
\begin{figure}[htbp]
\begin{center}
<<exampleWHoRTWhom2, fig = T, echo = F,width=12,height=16>>=
# Plot who RTd whom
p=ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(y=screenName,x=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6))
p=p+xlab('Person retweeted')+ylab('Person Retweeting')
print(p)
@
\caption{Who Retweeted whom?}
\end{center}
\end{figure}
%comment
\SweaveInput{doodles/exampleSearchReachReport.Rnw}
%comment
\end{document}