-
Notifications
You must be signed in to change notification settings - Fork 370
/
Copy pathMining-Articles.Rmd
174 lines (135 loc) · 6.76 KB
/
Mining-Articles.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Text Mining Finance News of Four Big American Companies"
output: html_document
---
There’s rightly been a lot of attention paid to text mining. Text mining is the data analysis of natural language works (articles, books, etc.), using text as a form of data, joined with the numeric analysis.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
Two years ago, TheStreet claimed that [Traders Are Using Text and Data Mining to Beat the Market](https://www.thestreet.com/story/13044694/1/how-traders-are-using-text-and-data-mining-to-beat-the-market.html). All this, and the sense of major things going on in the world, prompted me to see what I could find myself in the world of text mining.
```{r}
library(tm.plugin.webmining)
library(purrr)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(stringr)
library(tidyr)
library(tidytext)
```
I decided to analyze Google finance articles for the following American companies: Starbucks, Kraft, Wal-Mart and Mondelez.
```{r}
company <- c("Starbucks", "Kraft", "Walmart", "Mondelez")
symbol <- c("SBUX", "KHC", "WMT", "MDLZ")
download_articles <- function(symbol) {
WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}
stock_articles <- data_frame(company = company,
symbol = symbol) %>%
mutate(corpus = map(symbol, download_articles))
```
### Google Finance Articles
This allows me to retrieve the 20 most recent articles related to each stock.
```{r}
stock_articles
```
### Tokens
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens. I need to use "unnest_tokens" to break text into individual tokens and transform it to a tidy data, that is one-row-per-term-per-document:
```{r}
tokens <- stock_articles %>%
unnest(map(corpus, tidy)) %>%
unnest_tokens(word, text) %>%
select(company, datetimestamp, word, id, heading)
tokens
```
### tf_idf
```{r}
article_tf_idf <- tokens %>%
count(company, word) %>%
filter(!str_detect(word, "\\d+")) %>%
bind_tf_idf(word, company, n) %>%
arrange(-tf_idf)
article_tf_idf
```
Here we see all nouns, names that are important in these companies(articles). None of them occur in all of the articles.
tf_idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Visualize these high tf-idf words.
```{r}
plot_article <- article_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
plot_article %>%
top_n(10) %>%
ggplot(aes(word, tf_idf, fill = company)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
coord_flip() + theme_minimal() + ggtitle('Highest tf_idf Words for Each Company')
```
Visualize the top terms for each company individually.
```{r}
plot_article %>%
group_by(company) %>%
top_n(10) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = company)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~company, ncol = 2, scales = "free") +
coord_flip() + theme_minimal()
```
As we have expected, the company names, stock symbols, some of companies' products and executives are usually included, as well as companies' latest movements such as Wal-Mart's climate pledges.
### Sentiment
To see whether the finance news coverage is positive or negative for these four companies, I opted to use [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010) lexicons which provides a positivity score for each word, from -5 (most negative) to 5 (most positive) to do a simple sentiment analysis.
```{r}
tokens %>%
anti_join(stop_words, by = "word") %>%
count(word, id, sort = TRUE) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(word) %>%
summarize(contribution = sum(n * score)) %>%
top_n(15, abs(contribution)) %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(word, contribution)) +
geom_col() +
coord_flip() +
ggtitle('Frequency of Words AFINN Score') + theme_minimal()
```
If I am right then I can use the sentiment analysis to help make decision on my investment. But am I right?
The word "gross" is considered negative by AFINN lexicons, but it means "gross margin" in the context of finance articles. The word "share" and "shares" are neither positive nor negative in finance articles.
"tidytext" includes another sentiment lexicon - "loughran", which was developed based on analyses of financial reports, and intentionally avoids words like “share” and "gross" that may not have a positive or negative meaning in a financial context.
The Loughran dictionary divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous”.
```{r}
library(tidytext)
tokens %>%
count(word) %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
group_by(sentiment) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
coord_flip() +
facet_wrap(~ sentiment, scales = "free") +
ggtitle("Frequency of This Word in Google Finance Articles") + theme_minimal()
```
This gives the most common words in the financial news articles associated with each of the six sentiments in the Loughran lexicon. Here I only get five sentiments, this indicates that there is no word can be associated with "superfluous" in recent Google finance news articles related to these four companies.
Now it makes much better sense and I can trust the results to count how frequently each sentiment was associated with each company in these articles.
```{r}
sentiment_fre <- tokens %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
count(sentiment, company) %>%
spread(sentiment, n, fill = 0)
sentiment_fre
```
```{r}
sentiment_fre %>%
mutate(score = (positive - negative) / (positive + negative)) %>%
mutate(company = reorder(company, score)) %>%
ggplot(aes(company, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() + theme_minimal() + ggtitle('Positive or Negative Scores Among Recent Google Finance Articles')
```
Based the results, I'd say that in May 2017 most of the recent coverage on Walmart was strong negative and most of the recent coverage on Mondelez was positive. A quick search on the recent finance headlines suggests that I am on the right track.
### The End
The code to produce all this in R depends heavily on Julia Silge and David Robinson’s [Text Mining with R](http://tidytextmining.com/) book.