# Summarizing Link Targets
## Problem
You want to summarize the text of a web page that a tweet points to via a shortened URL.
## Solution
Extract the text from the web page, then use a natural language processing (NLP) toolkit to pull out the most important sentences and create a machine-generated abstract.
## Discussion
R has more than a few NLP tools to choose from. We'll work with the `LSAfun` package for this exercise. As the acronym-laden package name implies, it uses Latent Semantic Analysis (LSA) to determine the most important sentences in a body of text.
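To get a feel for what `genericSummary()` does before pointing it at live tweets, here's a minimal sketch on a made-up passage (the text below is purely illustrative). The function splits the input into sentences, builds an LSA space from them, and returns the `k` most representative sentences:

```{r 12_lsa_demo, eval=FALSE}
# a made-up passage, purely for illustration
doc <- paste(
  "The city council voted to expand the transit budget this week.",
  "Several members cited rising ridership as the main justification.",
  "Opponents argued the funds should go to road repair instead.",
  "The measure passed by a narrow margin after a long debate.",
  "A final confirmation vote is scheduled for next month."
)
# ask LSA for the 2 most representative sentences
LSAfun::genericSummary(doc, k=2)
```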
We'll use tweets by data journalist extraordinaire [Matt Stiles](https://twitter.com/stiles). Matt works for the Los Angeles Times, and I learn a _ton_ from him on a daily basis. He's on top of _everything_. Let's summarise some news he shared recently from the New York Times, Reuters, the Washington Post, FiveThirtyEight and his employer.
We'll limit our exploration to the first three news links we find.
```{r 12_lib, message=FALSE, warning=FALSE}
library(rtweet)
library(LSAfun)
library(jerichojars) # hrbrmstr/jerichojars
library(jericho) # hrbrmstr/jericho
library(tidyverse)
```
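Note that `jerichojars` and `jericho` are not on CRAN (hence the `hrbrmstr/...` comments above). Assuming they are still published under those GitHub repositories, you can install them with `remotes`:

```{r 12_install, eval=FALSE}
# GitHub-only packages; install once before running this recipe
remotes::install_github("hrbrmstr/jerichojars")
remotes::install_github("hrbrmstr/jericho")
```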
```{r 12_summarise, message=FALSE, warning=FALSE, cache=TRUE}
stiles <- get_timeline("stiles")
filter(stiles, str_detect(urls_expanded_url, "nyti|reut|wapo|lat\\.ms|53ei")) %>% # only keep tweets with news links
  pull(urls_expanded_url) %>% # extract the links
  flatten_chr() %>% # mush them into a nice character vector
  head(3) %>% # get the first 3
  map_chr(~{
    httr::GET(.x) %>% # fetch the page (I'm lazily invoking "fair use" here rather than checking robots.txt, since I'm suggesting you do this for your own benefit, not for profit)
      httr::content(as="text", encoding="UTF-8") %>% # extract the HTML
      jericho::html_to_text() %>% # strip away extraneous HTML tags
      LSAfun::genericSummary(k=3) %>% # pull out the 3 most important sentences
      paste0(collapse="\n\n") # separate them so they're easier to read
  }) %>%
  walk(cat)
```
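The comment in the chunk above hand-waves past `robots.txt`. If you'd rather check than assume, the CRAN `robotstxt` package can tell you whether fetching a given path is permitted. A minimal sketch (the URL below is purely illustrative):

```{r 12_robots, eval=FALSE}
library(robotstxt)

# returns TRUE if the site's robots.txt permits fetching this path
# (the URL is purely illustrative)
paths_allowed("https://www.nytimes.com/section/politics")
```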
## See Also
As noted, there are other NLP packages. Check out the [CRAN Task View](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) on NLP for more resources.